In this post, we will provide some examples of what you can do with OpenCV.
We will give a walk-through example of how we can blend images using Python OpenCV. Below are the target and filter images.
Target Image
Filter Image
import cv2

# Two images
img1 = cv2.imread('target.jpg')
img2 = cv2.imread('filter.png')

# OpenCV reads images in BGR order, so we convert from BGR to RGB
img1 = cv2.cvtColor(img1, cv2.COLOR_BGR2RGB)
img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)

# Resize the images: in order to blend them, the two images
# must have the same shape
img1 = cv2.resize(img1, (620, 350))
img2 = cv2.resize(img2, (620, 350))

# Now we can blend them. We need to define the weight (alpha) of the target image
# as well as the weight (beta) of the filter image.
# In our case we choose 80% target and 20% filter
blended = cv2.addWeighted(src1=img1, alpha=0.8, src2=img2, beta=0.2, gamma=0)

# Finally we can save the image. We first convert it back from RGB to BGR
cv2.imwrite('Blending.png', cv2.cvtColor(blended, cv2.COLOR_RGB2BGR))
And voilà!
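Under the hood, cv2.addWeighted computes a per-pixel weighted sum, dst = src1*alpha + src2*beta + gamma, saturated to the valid pixel range. A numpy-only sketch of the same idea (not OpenCV's exact implementation, which rounds where the cast below truncates):

```python
import numpy as np

def blend(img1, img2, alpha=0.8, beta=0.2, gamma=0):
    """Weighted per-pixel sum of two equally shaped images, clipped to uint8 range."""
    out = img1.astype(np.float64) * alpha + img2.astype(np.float64) * beta + gamma
    return np.clip(out, 0, 255).astype(np.uint8)

# two tiny 1x2 "images"
a = np.array([[100, 200]], dtype=np.uint8)
b = np.array([[50, 250]], dtype=np.uint8)
print(blend(a, b))  # 0.8*a + 0.2*b -> [[ 90 210]]
```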
We will show how you can apply image processing with OpenCV in Python. We will work with this image obtained from Unsplash.
import cv2
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

img = cv2.imread('panda.jpeg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

blurred_img = cv2.blur(img, ksize=(20, 20))
cv2.imwrite("blurredpanda.jpg", blurred_img)
You can have a look at the Sobel operator on Wikipedia, and you can also start experimenting with some filters. Let’s apply the horizontal and vertical Sobel operators.
img = cv2.imread('panda.jpeg', 0)  # 0 loads the image in grayscale
sobelx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=5)
sobely = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=5)
cv2.imwrite("sobelx_panda.jpg", sobelx)
cv2.imwrite("sobely_panda.jpg", sobely)
We can also binarize the images.
img = cv2.imread('panda.jpeg', 0)
ret, th1 = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)
fig = plt.figure(figsize=(12, 10))
plt.imshow(th1, cmap='gray')
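cv2.threshold with THRESH_BINARY simply maps every pixel strictly above the threshold to maxval and everything else to 0. An equivalent numpy one-liner (a sketch that does not need OpenCV):

```python
import numpy as np

img = np.array([[30, 100, 101, 220]], dtype=np.uint8)

# THRESH_BINARY: pixel > 100 -> 255, else 0 (cv2 uses a strict "greater than")
th1 = np.where(img > 100, 255, 0).astype(np.uint8)
print(th1)  # [[  0   0 255 255]]
```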
We will discuss how to apply face detection using OpenCV, going straight to a practical, reproducible example.
The logic is the following: We get the image from the URL (or from the hard disk). We convert it to a numpy array
and then to grayscale. Then, by applying the proper CascadeClassifier, we get the bounding boxes of the faces. Finally, using Pillow (or even OpenCV), we can draw the boxes on the initial image.
import cv2 as cv
import numpy as np
import requests
from io import BytesIO
from PIL import Image, ImageDraw

# I have commented out the cat and eye cascades. Notice that the xml files are in
# the opencv folder that you have downloaded and installed,
# so it is a good idea to write the whole path
face_cascade = cv.CascadeClassifier('C:\\opencv\\build\\etc\\haarcascades\\haarcascade_frontalface_default.xml')
#cat_cascade = cv.CascadeClassifier('C:\\opencv\\build\\etc\\haarcascades\\haarcascade_frontalcatface.xml')
#eye_cascade = cv.CascadeClassifier('C:\\opencv\\build\\etc\\haarcascades\\haarcascade_eye.xml')

URL = "https://images.unsplash.com/photo-1525267219888-bb077b8792cc?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80"
response = requests.get(URL)
img = Image.open(BytesIO(response.content))
img_initial = img.copy()

# convert it to a numpy array
img = np.asarray(img)

gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray)
# And let's just print those faces out to the screen
#print(faces)

drawing = ImageDraw.Draw(img_initial)

# For each item in faces, let's surround it with a red box.
# Recall that faces is a list of rectangles in (x, y, w, h) format, that is,
# a list of lists. Instead of iterating and manually pulling out each item,
# we can use tuple unpacking to assign the individual items of each sublist
# directly to variables. A really nice Python feature.
for x, y, w, h in faces:
    drawing.rectangle((x, y, x + w, y + h), outline="red")

display(img_initial)
The initial Image was this one:
And then after drawing the Bounding Boxes we got:
As we can see, we correctly detected all four faces, BUT we also discovered a “ghost” behind the window…
We can also crop the faces into separate images.
for x, y, w, h in faces:
    display(img_initial.crop((x, y, x + w, y + h)))
For example, the first face that we get is:
Notice: In case you wanted to read the image from the hard disk you could simply type the following three lines:
# read the image from the PC
initial_img = Image.open('my_image.jpg')
img = cv.imread('my_image.jpg')
gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
This post is a practical example of how we can use OpenCV with Python for detecting faces in a video. In a previous post, we explained how to apply object detection in Tensorflow and face detection in OpenCV. Generally, computer vision and object detection are hot topics in artificial intelligence. Think, for instance, of autonomous vehicles, which must continuously detect many different objects around them (pedestrians, other vehicles, signs, etc.).
In the following example, we apply face detection with our USB camera and write the video to an .mp4 file. As you can see, OpenCV is able to detect the face, but loses it when it hides behind the hands.
import cv2

# change your path to the one where haarcascades/haarcascade_frontalface_default.xml is
face_cascade = cv2.CascadeClassifier('../DATA/haarcascades/haarcascade_frontalface_default.xml')

# helper that draws a white rectangle around every detected face in a frame
def detect_face(img):
    face_img = img.copy()
    face_rects = face_cascade.detectMultiScale(face_img)
    for (x, y, w, h) in face_rects:
        cv2.rectangle(face_img, (x, y), (x + w, y + h), (255, 255, 255), 3)
    return face_img

cap = cv2.VideoCapture(0)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# MACOS AND LINUX: *'XVID' (MacOS users may want to try VIDX as well just in case)
# WINDOWS: *'VIDX'
writer = cv2.VideoWriter('myface.mp4', cv2.VideoWriter_fourcc(*'XVID'), 25, (width, height))

while True:
    ret, frame = cap.read()
    frame = detect_face(frame)
    writer.write(frame)
    cv2.imshow('Video Face Detection', frame)
    # escape button to close it
    c = cv2.waitKey(1)
    if c == 27:
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
Only a few lines of code are needed to record this video with dynamic face detection. If you run the above block of code you will get a similar video (of course, with a different face).
With a little bit of time off, I decided to explore the discord.py module again by trying to make some Discord bots of my own. The bot I decided to make is a simple stock/crypto ticker bot that tracks the live price of a stock or cryptocurrency that I want.
This was inspired by the r/Cryptocurrency discord server which had such bots for various cryptocurrencies/DEFI but for some reason did not have them publicly available.
Because I thought it was really cool to see live prices of stocks and crypto in real time on Discord, I wanted to make one myself. Let’s run through the steps for making a ticker bot like I did today!
The script requires the discord and yahoo_fin modules. So before you start, be sure to have them installed in your environment:

pip install discord
pip install yahoo_fin
First things first, you’re going to want to log into the Discord developer portal, navigate to “Applications”, and click on “New Application”, after which you will be prompted to enter the name of the application.
After the application is created, click on your application and create a bot under the “Bot” section of the settings.
After that, go to the OAuth2 section, tick “bot” under “scopes”, and invite the bot to your server with the generated link.
Once the bot is invited, you can proceed with developing it.
It is very important to have your bot’s unique token, which can be found in your application settings under the “Bot” tab. Once the token is copied, put it in a .txt file that your script can access.
The script has the following structure:
from discord.ext import commands
from yahoo_fin import stock_info as si

def read_token():
    with open('botTokenPath.txt', 'r') as f:
        lines = f.readlines()
        return lines[0].strip()

token = read_token()
bot = commands.Bot(command_prefix="!!")

@bot.command()
async def run(ctx):
    while True:
        change = si.get_quote_data("StockSymbol")
        name = ("S&P500: \n$"
                + str(round(si.get_live_price("StockSymbol"), 2))
                + "(" + str(round(change["regularMarketChangePercent"], 2)) + ")" + "%")
        await ctx.guild.me.edit(nick=name)

bot.run(token)  # start the bot with our token
Run the script, and there you have it! Once you have your token and stock symbols picked, type !!run in one of your server’s channels and you should have your stock ticker up and running!
As for hosting, there are lots of great resources out there for figuring out how to do that. If you’re curious, I recommend either the solution offered by FreeCodeCamp (see the end of the article) or the one by Tech With Tim.
Thank you for reading! If you have any questions be sure to reach out!
Least squares is a cornerstone of linear algebra and optimization, and therefore also of statistical and machine learning models. Given a matrix A ∈ R^{n×p} and a vector b ∈ R^{n}, we search for
x^\star = \argmin_{x \in R^p} ||Ax - b||^2
Translation for regression problems: Search for coefficients β→x given the design or features matrix X→A and target y→b.
In the case of a singular matrix A or an underdetermined setting n<p, the above definition is not precise and permits many solutions x^{⋆}. In those cases, a more precise definition is the minimum norm solution of least squares:
x^\star = \argmin_{x} ||x||^2 \quad \text{subject to} \quad x \in \argmin_{z \in R^p} ||Az - b||^2
In words: We solve the least squares problem and choose the solution with minimal Euclidean norm.
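The minimum norm solution can be computed directly with the Moore–Penrose pseudoinverse, x* = A⁺b, e.g. via numpy.linalg.pinv. A small sketch on a deliberately rank-deficient matrix:

```python
import numpy as np

# rank-deficient A: the second column is twice the first
A = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

x_min_norm = np.linalg.pinv(A) @ b               # minimum norm least squares solution
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]   # LAPACK gelsd, also minimum norm

# every x with x1 + 2*x2 = 1 solves the least squares problem exactly;
# the one with minimal Euclidean norm is (0.2, 0.4)
print(x_min_norm)  # [0.2 0.4]
```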
We will show step by step what this means on a series of overdetermined systems. In linear regression with more observations than features, n>p, one says the system is overdetermined, leading to ||Ax^\star-b||^2 > 0 in most cases.
Let’s start with a well-behaved example. To have good control over the matrix, we construct it by its singular value decomposition (SVD) A = USV’ with orthogonal matrices U and V and diagonal matrix S. Recall that S has only non-negative entries and — for regular matrices — even strictly positive values. Therefore, we start by setting \operatorname{diag}(S) = (1, 2, 3, 4, 5).
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
from numpy.linalg import norm
from scipy.stats import ortho_group
from scipy import linalg
from scipy.sparse import linalg as spla


def generate_U_S_Vt(n=10, p=5, random_state=532):
    """Generate SVD to construct a regular matrix A.

    A has n rows, p columns.

    Returns
    -------
    U: orthogonal matrix
    S: diagonal matrix
    Vt: orthogonal matrix
    """
    r = min(n, p)
    S = np.diag(1.0 * np.arange(1, 1 + r))
    if n > p:
        # add rows with value 0
        S = np.concatenate((S, np.zeros((n - p, p))), axis=0)
    elif p > n:
        # add columns with value 0
        S = np.concatenate((S, np.zeros((n, p - n))), axis=1)
    U = ortho_group.rvs(n, random_state=random_state)
    Vt = ortho_group.rvs(p, random_state=random_state + 1)
    return U, S, Vt


def solve_least_squares(A, b):
    """Solve least squares with several methods.

    Returns
    -------
    x : dictionary with solver and solution
    """
    x = {}
    x["gelsd"] = linalg.lstsq(A, b, lapack_driver="gelsd")[0]
    x["gelsy"] = linalg.lstsq(A, b, lapack_driver="gelsy")[0]
    x["lsqr"] = spla.lsqr(A, b)[0]
    x["lsmr"] = spla.lsmr(A, b)[0]
    x["normal_eq"] = linalg.solve(A.T @ A, A.T @ b, assume_a="sym")
    return x


def print_dict(d):
    np.set_string_function(np.array2string)
    pprint(d)
    np.set_string_function(None)


np.set_printoptions(precision=5)

n = 10
p = 5
U, S, Vt = generate_U_S_Vt(n=n, p=p)
A = U @ S @ Vt
x_true = np.round(6 * Vt.T[:p, 0])  # interesting choice
rng = np.random.default_rng(157)
noise = rng.standard_normal(n)
b = A @ x_true + noise

S_inv = np.copy(S.T)
S_inv[S_inv > 0] = 1 / S_inv[S_inv > 0]
x_exact = Vt.T @ S_inv @ U.T @ b

print(f"x_exact = {x_exact}")
print_dict(solve_least_squares(A, b))
x_exact = [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ]
{'gelsd': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
 'gelsy': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
 'lsmr': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
 'lsqr': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ],
 'normal_eq': [ 0.78087 -4.74942 -0.99938 -2.38327 -3.7431 ]}
We see that all least squares solvers do well on regular systems, be it the LAPACK routines for direct least squares (GELSD and GELSY), the iterative solvers LSMR and LSQR, or solving the normal equations with the LAPACK routine SYSV for symmetric matrices.
We convert our regular matrix into a singular one by setting its first diagonal element to zero.
S[0, 0] = 0
A = U @ S @ Vt

S_inv = np.copy(S.T)
S_inv[S_inv > 0] = 1 / S_inv[S_inv > 0]

# minimum norm solution
x_exact = Vt.T @ S_inv @ U.T @ b

print(f"x_exact = {x_exact}")
x_solution = solve_least_squares(A, b)
print_dict(x_solution)
x_exact = [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ]
{'gelsd': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'gelsy': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'normal_eq': [-0.08393 -0.60784  0.17531 -0.57127 -0.50437]}
As promised by their descriptions, the first four solvers find the minimum norm solution. If you run the code yourself, you will get a LinAlgWarning from the normal equation solver. The output confirms it: solving the normal equations for singular systems is a bad idea. However, a closer look reveals the following.
print(
    f"norm of x:\n"
    f"x_exact:   {norm(x_exact)}\n"
    f"normal_eq: {norm(x_solution['normal_eq'])}\n"
)
print(
    f"norm of Ax-b:\n"
    f"x_exact:   {norm(A @ x_exact - b)}\n"
    f"normal_eq: {norm(A @ x_solution['normal_eq'] - b)}"
)
norm of x:
x_exact:   0.5092520023062155
normal_eq: 0.993975690303498

norm of Ax-b:
x_exact:   6.9594032092014935
normal_eq: 6.9594032092014935
Although the solution of the normal equations seems far off, it reaches the same minimum of the least squares problem and is thus a viable solution. Still, a solution vector with a large norm, at least larger than that of the minimum norm solution, is an undesirable property, especially in iterative algorithms.
As the blog post title advertises the minimum norm solution and a figure is still missing, we will visualize the many solutions of the least squares problem. But how do we systematically construct different solutions? Luckily, we already have the SVD of A. The columns of V that multiply with the zero values of S, let’s call them V_{1}, span the null space of A, i.e. AV_{1} t = 0 for any vector t. In our example, there is only one zero diagonal element, so V_{1} is just one column and t reduces to a number.
t = np.linspace(-3, 3, 100)  # free parameter
# column vectors of x_lsq are least squares solutions
x_lsq = (x_exact + Vt.T[:, 0] * t.reshape(-1, 1)).T
x_norm = np.linalg.norm(x_lsq, axis=0)
lsq_norm = np.linalg.norm(A @ x_lsq - b.reshape(-1, 1), axis=0)

plt.plot(t, 2 * x_norm, label="2 * norm of x")
plt.plot(t, lsq_norm, label="norm of Ax-b")
plt.legend()
plt.xlabel("t")
plt.ylabel("Norm")
plt.title("Euclidean norm of solution and residual")
In order to have both lines in one figure, we scaled the norm of the solution vector by a factor of two. We see that all these vectors achieve the same objective value, i.e. the Euclidean norm of the residual Ax – b, while t=0 gives the minimum norm among the solution vectors.
Let us also have a look at what happens if we add a tiny perturbation to the vector b.
eps = 1e-10
print(f"x_exact = {x_exact}")
print_dict(solve_least_squares(A, b + eps))
x_exact = [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ]
{'gelsd': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'gelsy': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'normal_eq': [-0.12487 -0.41176  0.23093 -0.48548 -0.35104]}
We see that the first four solvers are stable, while solving the normal equations shows large deviations compared to the unperturbed system above. For example, compare the first vector element, -0.08 vs -0.12, even for a perturbation as tiny as 1.0e-10.
The last section showed an example of an ill-conditioned system. By ill-conditioned we mean a huge difference between the largest and smallest singular value of A, whose ratio is called the condition number. We can achieve this by setting the first diagonal element of S to a tiny positive number instead of exactly zero.
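For a rectangular matrix, the condition number is the ratio of the largest to the smallest singular value, available directly as numpy.linalg.cond. A quick sketch (the SVD-based construction here mimics the one in this post, but with QR factorizations of random matrices for the orthogonal factors) showing how shrinking one singular value blows the condition number up:

```python
import numpy as np

rng = np.random.default_rng(0)

def matrix_with_singular_values(sv, n=10, p=5):
    """Build A = U S Vt with the prescribed singular values sv."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((p, p)))
    S = np.zeros((n, p))
    S[:p, :p] = np.diag(sv)
    return U @ S @ V.T

A_good = matrix_with_singular_values([1, 2, 3, 4, 5])
A_bad = matrix_with_singular_values([1e-10, 2, 3, 4, 5])

print(np.linalg.cond(A_good))  # 5.0  (= 5 / 1)
print(np.linalg.cond(A_bad))   # ~5e10 (= 5 / 1e-10)
```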
S[0, 0] = 1e-10
A = U @ S @ Vt

S_inv = np.copy(S.T)
S_inv[S_inv > 0] = 1 / S_inv[S_inv > 0]

# minimum norm solution
x_exact = Vt.T @ S_inv @ U.T @ b

print(f"x_exact = {x_exact}")
x_solution = solve_least_squares(A, b)
print_dict(x_solution)
x_exact = [ 9.93195e+09 -4.75650e+10 -1.34911e+10 -2.08104e+10 -3.71960e+10]
{'gelsd': [ 9.93194e+09 -4.75650e+10 -1.34910e+10 -2.08104e+10 -3.71960e+10],
 'gelsy': [ 9.93196e+09 -4.75650e+10 -1.34911e+10 -2.08104e+10 -3.71961e+10],
 'lsmr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'lsqr': [-0.21233  0.00708  0.34973 -0.30223 -0.0235 ],
 'normal_eq': [ 48559.67679 -232557.57746 -65960.92822 -101747.66128 -181861.06429]}
print(
    f"norm of x:\n"
    f"x_exact:   {norm(x_exact)}\n"
    f"lsqr:      {norm(x_solution['lsqr'])}\n"
    f"normal_eq: {norm(x_solution['normal_eq'])}\n"
)
print(
    f"norm of Ax-b:\n"
    f"x_exact:   {norm(A @ x_exact - b)}\n"
    f"lsqr:      {norm(A @ x_solution['lsqr'] - b)}\n"
    f"normal_eq: {norm(A @ x_solution['normal_eq'] - b)}"
)
norm of x:
x_exact:   66028022639.34349
lsqr:      0.5092520023062157
normal_eq: 0.993975690303498

norm of Ax-b:
x_exact:   2.1991587442017146
lsqr:      6.959403209201494
normal_eq: 6.959403209120507
Several points are interesting to observe: the direct solvers GELSD and GELSY follow the tiny singular value and return solutions of huge norm, close to x_exact; the iterative solvers LSQR and LSMR instead stop early and return the minimum norm solution of the exactly singular system, at the price of a larger residual; and the normal equations again produce something else entirely. Note that LSQR and LSMR can be fixed by requiring a higher accuracy via the parameters atol and btol.
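As an illustration (a sketch that rebuilds an ill-conditioned matrix the same way as in this post), tightening atol and btol makes LSQR keep iterating instead of stopping early; note that the conlim parameter must also be relaxed, since its default of 1e8 is far below the condition number of this system:

```python
import numpy as np
from scipy.stats import ortho_group
from scipy.sparse import linalg as spla

# rebuild an ill-conditioned A = U S Vt, smallest singular value 1e-10
n, p = 10, 5
U = ortho_group.rvs(n, random_state=532)
Vt = ortho_group.rvs(p, random_state=533)
S = np.zeros((n, p))
S[:p, :p] = np.diag([1e-10, 2.0, 3.0, 4.0, 5.0])
A = U @ S @ Vt
b = np.random.default_rng(157).standard_normal(n)

default = spla.lsqr(A, b)                                   # default tolerances
tight = spla.lsqr(A, b, atol=1e-14, btol=1e-14, conlim=1e14)  # forced to keep going

x_default, itn_default = default[0], default[2]
x_tight, itn_tight = tight[0], tight[2]
# the tight run iterates at least as long and its residual is never worse
print(itn_default, itn_tight)
```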
Solving least squares problems is fundamental for many applications. While regular systems are more or less easy to solve, singular as well as ill-conditioned systems have intricacies: multiple solutions and sensitivity to small perturbations. At least the minimum norm solution always gives a well-defined, unique answer, and direct solvers find it reliably.
The Python notebook can be found in the usual github repository: 2021-05-02 least_squares_minimum_norm_solution.ipynb
In this article we will discuss how to convert Pandas DataFrame to NumPy array in Python.
We are used to working with DataFrames in Python for a lot of functionality provided by the pandas library. It is very efficient for any of the data manipulation tasks we do when preparing the features for machine learning models. However, sometimes we want to extend the functionality and make use of some NumPy mathematical functions. In this case, we can’t work with Pandas DataFrames and need to convert them to NumPy arrays.
In this tutorial we will explore how we can easily convert our data to an array in order to work with more mathematical functions. The pandas library has a very useful method for this, which we will discuss. It makes the conversion a simple one-liner, while allowing us to perform more extensive analysis on the data.
To continue following this tutorial we will need the following Python library: pandas.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:
pip install pandas
As the first step we will create a sample Pandas DataFrame that we will later convert to a NumPy array.
import pandas as pd

df = pd.DataFrame(
    {
        "Education": [5, 5, 7],
        "Experience": [1, 3, 8],
        "Salary": [40000, 50000, 80000],
    }
)
This DataFrame has three observations with three columns: Education, Experience, and Salary.
Let’s take a look at the data:
print(df)
And we should get:
   Education  Experience  Salary
0          5           1   40000
1          5           3   50000
2          7           8   80000
Converting a Pandas DataFrame to a NumPy array is very simple using the .to_numpy() Pandas method, which converts the DataFrame into an array:
np_array = df.to_numpy()
Let’s take a look:
print(np_array)
And we should get:
[[    5     1 40000]
 [    5     3 50000]
 [    7     8 80000]]
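One thing to keep in mind: .to_numpy() returns a single array with one common dtype, so mixing numeric and string columns silently upcasts everything to object. A quick sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Education": [5, 5, 7], "Salary": [40000, 50000, 80000]})
print(df.to_numpy().dtype)  # all-numeric columns keep an integer dtype

df["Name"] = ["Ann", "Bob", "Cy"]
print(df.to_numpy().dtype)  # object: mixed types fall back to the object dtype
```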
In this article we discussed how to convert a Pandas DataFrame to a NumPy array in Python using the pandas library.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.
Two weeks ago, I presented GPopt, a Python package for Bayesian optimization. In particular, I presented a way to stop the optimizer and resume it later by adding more iterations.
This week, I present a way to save and resume, that makes the optimizer’s data persistent. Behind this saving feature, are hidden Python shelves which are – sort of – hash tables on disk.
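For the curious, the standard library's shelve module is what makes this kind of persistence easy: it gives you a dict-like object whose values are pickled to a file on disk. A minimal sketch, unrelated to GPopt's internals:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "save")

# write: anything picklable can be stored under a string key
with shelve.open(path) as db:
    db["n_iter"] = 25
    db["x_min"] = [3.17, 2.08]

# read it back later, possibly in another session
with shelve.open(path) as db:
    print(db["n_iter"], db["x_min"])  # 25 [3.17, 2.08]
```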
We start by installing packages necessary for the demo.
!pip install GPopt
!pip install matplotlib==3.1.3
Import packages.
import GPopt as gp
import numpy as np
import matplotlib.pyplot as plt
Objective function to be minimized.
# branin
def branin(x):
    x1 = x[0]
    x2 = x[1]
    term1 = (x2 - (5.1 * x1**2) / (4 * np.pi**2) + (5 * x1) / np.pi - 6)**2
    term2 = 10 * (1 - 1 / (8 * np.pi)) * np.cos(x1)
    return (term1 + term2 + 10)
Start the optimizer, and save on disk after 25 iterations.
print("Saving after 25 iterations")
gp_opt3 = gp.GPOpt(objective_func=branin,
                   lower_bound=np.array([-5, 0]),
                   upper_bound=np.array([10, 15]),
                   n_init=10, n_iter=25,
                   save="./save")  # will create save.db in the current directory
gp_opt3.optimize(verbose=1)
print("current number of iterations:")
print(gp_opt3.n_iter)
gp_opt3.close_shelve()
print("\n")
print("current minimum:")
print(gp_opt3.x_min)
print(gp_opt3.y_min)
plt.plot(gp_opt3.max_ei)
current number of iterations:
25

current minimum:
[3.17337036 2.07962036]
0.4318831996378023
On this figure, we observe that there’s still room for improvement in the convergence of the expected improvement (EI). We can add more iterations to the procedure by loading the saved object.
print("---------- \n")
print("loading previously saved object")
gp_optload = gp.GPOpt(objective_func=branin,
                      lower_bound=np.array([-5, 0]),
                      upper_bound=np.array([10, 15]))
gp_optload.load(path="./save")  # loading the saved object
print("current number of iterations:")
print(gp_optload.n_iter)

# early stopping based on expected improvement
gp_optload.optimize(verbose=2, n_more_iter=190, abs_tol=1e-4)
print("current number of iterations:")
print(gp_optload.n_iter)
print("\n")
print("current minimum:")
print(gp_optload.x_min)
print(gp_optload.y_min)
plt.plot(gp_optload.max_ei)
current number of iterations:
51

current minimum:
[9.44061279 2.48199463]
0.3991320518189241
Now that the EI has effectively converged to 0, we can stop the optimization procedure.
ARIMA is one of the most popular statistical models. It stands for AutoRegressive Integrated Moving Average, and it’s fitted to time series data either for forecasting or to better understand the data. We will not cover the whole theory behind the ARIMA model, but we will show you the steps you need to follow to apply it correctly.
The key aspects of ARIMA model are the following:
The ARIMA model can be applied when we have seasonal or non-seasonal data. The difference is that when we have seasonal data we need to add some more parameters to the model.
For non-seasonal data the parameters are: p (the order of the autoregressive (AR) term), d (the degree of differencing), and q (the order of the moving average (MA) term).
For seasonal data we also need to add: P, D, and Q, the seasonal counterparts of p, d, and q, as well as m, the number of periods per season.
This is very easy to understand. Data is seasonal when it comes in intervals such as weeks, months, or quarters. For example, in this tutorial we will use data aggregated by month, and our “season” is the year. So we have seasonal data, and for the m parameter in the ARIMA model we will use 12, which is the number of months per year.
ARIMA models can be applied only in stationary data. That means that we don’t want to have a trend in time. If the time series has a trend, then it’s non-stationary and we need to apply differencing to transform it into stationary. Below is an example of a stationary and a non-stationary series.
We can also use the Augmented Dickey-Fuller test to help us decide whether the series is stationary. The null hypothesis of the test is that there is a unit root, with the alternative that there is none. In other words, if the p-value is below 0.05 (or whatever critical size you use), the series is stationary.
Let’s start coding.
For this tutorial we will use the Air Passengers data from Kaggle.
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

# https://www.kaggle.com/rakannimer/air-passengers
df = pd.read_csv('AirPassengers.csv')

# We need to set the Month column as index and convert it into datetime
df.set_index('Month', inplace=True)
df.index = pd.to_datetime(df.index)

df.head()
First things first, let’s plot our data.
df.plot()
As you can clearly see, there is a trend in time and that suggests that the data are not stationary. However just to be sure we will use an Augmented Dickey-Fuller test.
from statsmodels.tsa.stattools import adfuller

result = adfuller(df['#Passengers'])

# to help you, we added the names of every value
dict(zip(['adf', 'pvalue', 'usedlag', 'nobs', 'critical' 'values', 'icbest'], result))
{'adf': 0.8153688792060468,
 'pvalue': 0.991880243437641,
 'usedlag': 13,
 'nobs': 130,
 'criticalvalues': {'1%': -3.4816817173418295,
                    '5%': -2.8840418343195267,
                    '10%': -2.578770059171598},
 'icbest': 996.692930839019}
As expected, we fail to reject the null hypothesis: the series has a unit root and thus is not stationary.
The next step is to transform our data to stationary, so we will have an estimate for the d and D parameters we will use in the model. This can be done using differencing, which is performed by subtracting the previous observation from the current observation.
difference(T) = observation(T) – observation(T-1)
Then we will test again for stationarity using the Augmented Dickey-Fuller test; if the series is stationary, we proceed to the next step, and if not, we apply differencing again until it is. Differencing can be done very easily in pandas using the shift function.
df['1difference'] = df['#Passengers'] - df['#Passengers'].shift(1)
df['1difference'].plot()
It seems that we removed the trend and the series is stationary. However, we will use the Augmented Dickey-Fuller test to prove it.
# note: we are dropping NA values because the first value of the first difference is NA
result = adfuller(df['1difference'].dropna())
dict(zip(['adf', 'pvalue', 'usedlag', 'nobs', 'critical' 'values', 'icbest'], result))
{'adf': -2.8292668241699857,
 'pvalue': 0.05421329028382734,
 'usedlag': 12,
 'nobs': 130,
 'criticalvalues': {'1%': -3.4816817173418295,
                    '5%': -2.8840418343195267,
                    '10%': -2.578770059171598},
 'icbest': 988.5069317854084}
As you can see, we fail to reject the null hypothesis because the p-value is above 0.05. That suggests the series is still not stationary and we need to apply differencing again, taking the second difference. The second difference is computed like the first, but this time from the first difference instead of the original observations.
df['2difference'] = df['1difference'] - df['1difference'].shift(1)
df['2difference'].plot()
Let’s get the results from Augmented Dickey-Fuller test for the second difference.
result = adfuller(df['2difference'].dropna())
dict(zip(['adf', 'pvalue', 'usedlag', 'nobs', 'critical' 'values', 'icbest'], result))
{'adf': -16.384231542468452,
 'pvalue': 2.732891850014516e-29,
 'usedlag': 11,
 'nobs': 130,
 'criticalvalues': {'1%': -3.4816817173418295,
                    '5%': -2.8840418343195267,
                    '10%': -2.578770059171598},
 'icbest': 988.602041727561}
The p-value is less than 0.05 so we can reject the null hypothesis. That means the second difference is stationary and that suggests that a good estimate for the value d is 2.
Our data is seasonal, so we also need to estimate the D value, which is the analogue of d for the seasonal difference. The seasonal difference is computed by shifting the data by the number of rows per season (in our example, 12 months per year) and subtracting it from the current season. Note that this is the seasonal difference, not the seasonal first difference. If the seasonal difference turns out to be stationary, then D will be 0; if not, we compute the seasonal first difference.
seasonal difference(T) = observation(T) – observation(T-12)
seasonal first difference(T) = seasonal difference(T) – seasonal difference(T-1)
df['Seasonal_Difference'] = df['#Passengers'] - df['#Passengers'].shift(12)
ax = df['Seasonal_Difference'].plot()
result = adfuller(df['Seasonal_Difference'].dropna())
dict(zip(['adf', 'pvalue', 'usedlag', 'nobs', 'critical' 'values', 'icbest'], result))
{'adf': -3.383020726492481,
 'pvalue': 0.011551493085514952,
 'usedlag': 1,
 'nobs': 130,
 'criticalvalues': {'1%': -3.4816817173418295,
                    '5%': -2.8840418343195267,
                    '10%': -2.578770059171598},
 'icbest': 919.527129208137}
The p-value is less than 0.05, so the seasonal difference is stationary and we don’t have to difference again. That suggests using 0 for the D value.
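Had the seasonal difference not been stationary, the seasonal first difference from the formulas above would be computed with one more shift(1), following the same pandas pattern. A sketch on a toy series (in the post this would be df['#Passengers']):

```python
import pandas as pd

# toy monthly series with a perfectly linear trend
s = pd.Series(range(1, 31))

seasonal_diff = s - s.shift(12)  # observation(T) - observation(T-12)
seasonal_first_diff = seasonal_diff - seasonal_diff.shift(1)  # then an ordinary first difference

# for a linear trend the seasonal difference is constant,
# so its first difference is identically zero
print(seasonal_first_diff.dropna().head())
```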
The last step before the ARIMA model is to create the autocorrelation and partial autocorrelation plots to help us estimate the p, q, P, and Q parameters.
There are some very useful rules for ARIMA and Seasonal ARIMA models that we are using to help us estimate the parameters by looking at the Autocorrelation and Partial Autocorrelation Plots. We will create the plots for the second difference and the seasonal difference of our time series because these are the stationary series we end up using in ARIMA (d=2, D=0).
First, let’s plot ACF and PACF for the second difference.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig1 = plot_acf(df['2difference'].dropna())
fig2 = plot_pacf(df['2difference'].dropna())
We can see a sharp cut-off at lag 1 in both of our plots. According to the rules we mentioned above, this suggests using an AR and an MA term; in other words, p=1 and q=1.
Now, we need the same plots of Seasonal Difference.
fig1 = plot_acf(df['Seasonal_Difference'].dropna())
fig2 = plot_pacf(df['Seasonal_Difference'].dropna())
We have a gradual decrease in the autocorrelation plot and a sharp cut-off in the partial autocorrelation plot. This suggests using an AR term for the seasonal part of the ARIMA and not going over the value of 1, i.e. P=1 and Q=0.
The values we chose may not be optimum. You can play around with these parameters to fine-tune the model having as a guide the rules we mentioned above.
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(df['#Passengers'], order=(1, 2, 1), seasonal_order=(1, 0, 0, 12))
result = model.fit()
We can plot the residuals of the model to have an idea of how well the model is fitted. Basically, the residuals are the difference between the original values and the predicted values from the model.
result.resid.plot(kind='kde')
It’s time for a forecast. We will create some future dates to add them to our data so we can predict the future values.
from pandas.tseries.offsets import DateOffset

new_dates = [df.index[-1] + DateOffset(months=x) for x in range(1, 48)]
df_pred = pd.DataFrame(index=new_dates, columns=df.columns)
df_pred.head()
The ARIMA model predicts by taking as arguments the start and the end of the enumerated index, not a date range. We created an empty data frame indexed by future dates and concatenated it with our original data. Our data had 144 rows and the new data we added has 47 rows. Therefore, to get the predictions only for the future dates, we will predict from row 143 to 191.
df2 = pd.concat([df, df_pred])
df2['predictions'] = result.predict(start=143, end=191)
df2[['#Passengers', 'predictions']].plot()
ARIMA is a great model for forecasting, and it can be used for both seasonal and non-seasonal time series data. For non-seasonal ARIMA you have to estimate the p, d, and q parameters; Seasonal ARIMA has three more, P, D, and Q, that apply to the seasonal part.
The pipeline that we are using to run an ARIMA model is the following:
- Test the series for stationarity with the Augmented Dickey-Fuller test.
- Apply differencing (and seasonal differencing) until the series becomes stationary, giving the d and D values.
- Estimate p, q, P, and Q from the autocorrelation and partial autocorrelation plots.
- Fit the (Seasonal) ARIMA model and inspect the residuals.
- Forecast future values.
Useful links:
Rules for ARIMA
Arima In R
How To Backtest Your Crypto Trading Strategies In R
ARIMA in Wikipedia
That’s the bird on the cover of Advancing into Analytics, copies of which I received the other day. There’s that new author enthusiasm!
You can now get your digital or paperback copy too at your bookseller of choice.
This hasn’t been an easy past few months, so I need to thank everyone who contributed to and expressed interest in this project. To do that, I share the book’s Acknowledgments below:
First, I want to thank God for giving me this opportunity to cultivate and share my talents. At O’Reilly, Michelle Smith and Jon Hassell have been so enjoyable to work with, and I will be forever grateful for their offer to have me write a book. Corbin Collins kept me rolling during the book’s development. Danny Elfanbaum and the production team turned the raw manuscript into an actual book. Aiden Johnson, Felix Zumstein, and Jordan Goldmeier provided invaluable technical reviews.
Getting people to review a book isn’t easy, so I have to thank John Dennis, Tobias Zwingmann, Joe Balog, Barry Lilly, Nicole LaGuerre, and Alex Bodle for their comments. I also want to thank the communities who have made this technology and knowledge available, often without direct compensation. I’ve made some fantastic friends through my analytics pursuits, who have been so giving of their time and wisdom. My educators at Padua Franciscan High School and Hillsdale College made me fall in love with learning and with writing. I doubt I’d have written a book without their influence.
I also thank my mother and father for providing me the love and support that I’m so privileged to have. Finally, to my late Papou: thank you for sharing with me the value of hard work and decency.
If you are a spreadsheet user looking to level up your data game, this book is for you.
So, what’s next for me? I’ve got some exciting projects in the pipeline. I’d also love to continue networking with leaders in data literacy and data education — if you’d like to connect or work together, please find me on LinkedIn or set up a call.
Finally, I am looking forward to contributing new content to the blog. Expect to see great new posts and newsletters regularly. If you’ve not done so already, please subscribe below to get all of it for free. You’ll also get exclusive free access to my analytics education resource library.
Also, a little R&R may be in order. So, please check out the book as I take the backseat for a while. I look forward to seeing you back at the blog soon.
In this article we will discuss how to calculate a factorial in Python.
Basically, for any integer \(n\) greater than or equal to 1 (\(n \geq 1\)), the factorial is the product of all integers from 1 to \(n\) and is denoted by \(n!\).
The factorial formula:
$$n! = n(n-1)(n-2)…(n-m)$$
where \(m=n-1\).
For example, the factorial of 5 would be:
$$5! = 5(5-1)(5-2)(5-3)(5-4) = 120$$
With the following steps we will create a program to calculate the factorial in Python:
Similarly to solving quadratic equation using Python, the first step is to obtain some user input. In this case, we would need a user to enter an integer for which the factorial should be calculated:
n = eval(input("Please enter the integer for the factorial calculation: "))
Of course, in this step a user can input any data type they want. There are some rules for the factorial calculation that we need to check for: the input must be an integer, and it must be greater than or equal to 1.
Let’s add these checks to the user input code:
check_input = True
while check_input:
    n = eval(input("Please enter the integer for the factorial calculation: "))
    if isinstance(n, int) and n >= 1:
        check_input = False
    else:
        print("Please make sure the input is an integer greater or equal to 1")
        check_input = True
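As an aside, eval() executes arbitrary expressions and raises an error on plain text such as abc, so a stricter approach is to parse the input with int() directly. Below is a minimal sketch (the helper name read_positive_int is our own, not part of the original post):

```python
def read_positive_int(text):
    """Parse text as an integer >= 1; return None when the input is invalid."""
    try:
        n = int(text)
    except ValueError:
        # Not an integer at all (e.g. "abc" or "2.5").
        return None
    return n if n >= 1 else None

# The validation loop could then be written as, for example:
# n = None
# while n is None:
#     n = read_positive_int(input("Please enter the integer for the factorial calculation: "))
```

Unlike the eval() version, this never executes user-supplied code and rejects bad input instead of crashing.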
Now we know the \(n\) value and can continue with the factorial calculation.
Recall that the formula calculates the product of every integer from \(n\) down to 1, meaning each factor is one less than the previous one: \(i_{t-1} = i_t - 1\). To do this in Python we will simply run a for loop through every \(i_t\) value:
f = 1
for i in range(n, 1, -1):
    f = f * i
print(f)
Here we start with an initial factorial value f = 1. We need to define a variable to do the calculations, and since the operation is multiplication, f=1 doesn’t change the final result of the calculation.
Then we basically multiply the starting value by each integer in the range. As an example, if we set \(n = 5\), then range(5, 1, -1) would be [5, 4, 3, 2] (the stop value 1 is excluded). And the calculation would look like:

$$5! = 1*(5*4*3*2) = 120$$
Running the above program for \(n = 5\) should have the result equal to 120.
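The same calculation can also be written recursively, mirroring the definition \(n! = n \cdot (n-1)!\). A small sketch (not part of the original post):

```python
def factorial(n):
    # Base case: 0! and 1! are both 1.
    if n <= 1:
        return 1
    # Recursive case: n! = n * (n - 1)!
    return n * factorial(n - 1)

print(factorial(5))  # 120
```

The loop version is generally preferred in Python, since very large n would hit the recursion limit, but the recursive form follows the mathematical definition directly.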
The math library (prebuilt in Python) has a function to calculate factorial. The result would be the same as in the code in the above section but will only take one line to calculate it.
from math import factorial

f = factorial(n)
print(f)
You can use the user input code from the previous section to get the \(n\) value from the user and use the factorial() function to calculate the factorial.
In this article we covered how you can calculate a factorial in Python, both with a loop and with the math library.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Optimization articles.
The post Calculate Factorial in Python appeared first on PyShark.
In this article we will discuss how to solve a quadratic equation using Python.
In algebra, quadratic equations are widely used in a lot of tasks. A quadratic equation (second-degree polynomial) always has a squared term which differentiates it from our usual linear equations.
In this tutorial we will use the math library (prebuilt in Python) for solving quadratic equations using Python.
We begin with understanding the standard form of quadratic equation:
$$ax^2 + bx + c = 0$$
where a, b, c are real numbers and \(a \neq 0\).
So how do we know if the equation has a solution? And if it does, how many solutions?
The first step to solve a quadratic equation is to calculate the discriminant, using a simple formula:
$$D = b^2 - 4ac$$
we can solve for the discriminant and get some value. Next, if the value is:

- positive (\(D > 0\)): the equation has two distinct real solutions
- zero (\(D = 0\)): the equation has exactly one real solution
- negative (\(D < 0\)): the equation has no real solutions
To solve for each x value, we use the following quadratic formula:
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} = \frac{-b \pm \sqrt{D}}{2a}$$
which means that:
$$x_1 = \frac{-b + \sqrt{D}}{2a}$$
$$x_2 = \frac{-b - \sqrt{D}}{2a}$$
and these are all the steps we need to take to find the solution for the quadratic equation.
As an example, let’s consider the following quadratic equation:
$$x^2 - 5x - 14 = 0$$
where \(a = 1\), \(b = -5\), and \(c = -14\).
First, we need to get these coefficients entered by the user:
a, b, c = eval(input("Please enter the a, b, c coefficients of your quadratic equation: "))
Here we will need to pass the three comma-separated values as: 1,-5,-14.
Clearly, if you pass anything other than a real number (string or boolean), it won’t break the input function but further calculations won’t work. To prevent this, you should consider adding a set of checks to validate the user input:
valid_input = True
while valid_input:
    a, b, c = eval(input("Please enter the a, b, c coefficients of your quadratic equation: "))
    try:
        float(a), float(b), float(c)
        valid_input = False
    except ValueError:
        print("Please make sure the coefficients are real numbers and try again")
        valid_input = True
So far we created the a, b, and c variables in Python.
Next we will calculate the discriminant. We will need the math library (prebuilt in Python) for the square root function. Note that sqrt() raises a ValueError for negative arguments, so we only take the square root when the discriminant is non-negative:

from math import sqrt

d = b*b - 4*a*c                  # the discriminant
disc = sqrt(d) if d >= 0 else d  # take the square root only when it is non-negative
For the values we entered above, disc should be equal to 9 (the square root of the discriminant, which is 81).
And finally we solve for the roots of the equation. Recall that we also need to check whether the discriminant is less than zero, in which case the quadratic equation has no real solutions:
if disc >= 0:
    x1 = (-b + disc) / (2*a)
    x2 = (-b - disc) / (2*a)
    print("The roots of the equation are:", x1, x2)
else:
    print("The equation has no solutions")
For our example we should get \(x_1 = 7\) and \(x_2 = -2\).
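As an aside, if you want complex roots instead of reporting that there are no solutions, Python's built-in cmath module can take the square root of a negative number. A sketch (the quadratic_roots helper is our own, not part of the original program):

```python
from cmath import sqrt

def quadratic_roots(a, b, c):
    # cmath.sqrt returns a complex number for negative arguments,
    # so this handles any real coefficients with a != 0.
    d = sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

print(quadratic_roots(1, -5, -14))  # roots of x^2 - 5x - 14 = 0: (7+0j) and (-2+0j)
print(quadratic_roots(1, 0, 1))     # roots of x^2 + 1 = 0: 1j and -1j
```

The trade-off is that the roots always come back as complex numbers, even when they are real, so you may want to convert them with .real when the imaginary part is zero.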
from math import sqrt

valid_input = True
while valid_input:
    a, b, c = eval(input("Please enter the a, b, c coefficients of your quadratic equation: "))
    try:
        float(a), float(b), float(c)
        valid_input = False
    except ValueError:
        print("Please make sure the coefficients are real numbers and try again")
        valid_input = True

d = b*b - 4*a*c                  # the discriminant
disc = sqrt(d) if d >= 0 else d  # take the square root only when it is non-negative

if disc >= 0:
    x1 = (-b + disc) / (2*a)
    x2 = (-b - disc) / (2*a)
    print("The roots of the equation are:", x1, x2)
else:
    print("The equation has no solutions")
In this article we covered how you can solve a quadratic equation using Python and the math library.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Optimization articles.
The post Solve Quadratic Equation using Python appeared first on PyShark.
This is part of a series of posts:

Part 1: Selenium Template — Docker Host Network
Part 2: Selenium Template — Docker Bridge Network
Part 3: Selenium Template — Docker Compose
Part 4: Selenium Template — Deploying to ECS
In the last few posts we’ve looked at a few ways to set up the infrastructure for a Selenium crawler using Docker to run both the crawler and Selenium. In this post we’ll launch this setup in the cloud using AWS Elastic Container Service (ECS).
The earlier posts form an important prelude to deploying on ECS. My strategy is:
There are certainly better and more efficient ways to do this, but that’s not the objective here. We’re just aiming to build something minimal: it’s simple and it works.
We’ll need to enhance the crawler script slightly.
import logging
from time import sleep
from subprocess import Popen, DEVNULL, STDOUT

from selenium import webdriver

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)7s] %(message)s',
)

HOST = "localhost"
PORT = 4444

# Check connection to host and port.
def check_connection():
    process = Popen(['nc', '-zv', HOST, str(PORT)], stdout=DEVNULL, stderr=STDOUT)

    if process.wait() != 0:
        logging.warning(f"Unable to communicate with {HOST} on port {PORT}.")
        return False
    else:
        logging.info(f"Can communicate with {HOST} on port {PORT}!")
        return True

RETRY = 10

for i in range(RETRY):
    if check_connection():
        break
    logging.info("Sleeping.")
    sleep(1)

SELENIUM_URL = f"{HOST}:{PORT}"

browser = webdriver.Remote(f"http://{SELENIUM_URL}/wd/hub", {'browserName': 'chrome'})

browser.get("https://www.google.com")
logging.info(f"Retrieved URL: {browser.current_url}.")

browser.close()
The major changes are:
- Using the logging package for enhanced logging.
- Checking whether we can connect to Selenium using the nc command line utility, and it's fun to run shell commands from Python. We all have different ideas about what constitutes "fun".

The Dockerfile now also installs the netcat package.
FROM python:3.8.5-slim AS base

RUN apt-get update && apt-get install -y netcat

RUN pip3 install selenium==3.141.0

COPY google-selenium.py /

CMD python3 google-selenium.py
Build the Docker image. I’m also tagging it with my username because I’ll be pushing it to Docker Hub.
docker build -t google-selenium -t datawookie/google-selenium .
Check that it works locally. If it doesn't work on localhost then it's not going to work on ECS!
docker run --net=host google-selenium
2021-04-24 17:08:49,114 [ INFO] Can communicate with localhost on port 4444! Retrieved URL: https://www.google.com/.
Success. Now we push the image to Docker Hub so that it’s available to ECS.
docker login docker push datawookie/google-selenium
The first step towards deploying on ECS is to create a cluster. If you have an existing cluster then you can skip this step.
Once we have a cluster we can create a task which specifies one or more containers which will run together.
Container 1: Selenium

- Name: selenium
- Image: selenium/standalone-chrome:3.141
- Port: 4444

Container 2: Crawler

- Name: google-selenium
- Port: 4444
The details of the task are captured as JSON.
The configuration file below has been abridged and edited for clarity.
{
  "family": "google-selenium",
  "revision": 1,
  "status": "ACTIVE",
  "cpu": "256",
  "memory": "1024",
  "networkMode": "awsvpc",
  "volumes": [],
  "containerDefinitions": [
    {
      "image": "selenium/standalone-chrome:3.141",
      "name": "selenium",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/google-selenium"
        }
      },
      "portMappings": [
        {
          "hostPort": 4444,
          "protocol": "tcp",
          "containerPort": 4444
        }
      ]
    },
    {
      "image": "datawookie/google-selenium",
      "name": "google-selenium",
      "logConfiguration": {
        "options": {
          "awslogs-group": "/ecs/google-selenium"
        }
      }
    }
  ],
  "compatibilities": ["EC2", "FARGATE"],
  "requiresCompatibilities": ["FARGATE"]
}
Details to note:

- The task name is given by the family key.
- The cpu and memory keys specify the resources allocated to the task.
- The containers are specified in the containerDefinitions array.
- Both containers log to the /ecs/google-selenium group, which can be used to find logs in CloudWatch.

Networking: The task will run using the AWS Fargate serverless compute engine. The network type associated with the task is thus awsvpc, which means that containers within the task can communicate via localhost. This makes communication between containers quite simple and is not dissimilar to using the host network with Docker. The crawler will simply look for the Selenium instance at 127.0.0.1:4444 (or, equivalently, localhost:4444).
The moment of truth: we’re going to run the task.
The running task will be assigned a unique task ID (like eb86fa0e2bff4aeb8e0d69bdc79eea5a). Click on the task ID link. You'll see a table with the two containers listed, both of which will initially be in the PENDING state. Wait a moment and refresh the page. You should find that both of the containers are RUNNING. Refresh again and they should both be STOPPED. This means that the task has run and we can now inspect the logs. Click the dropdown next to each container name and follow the View logs in CloudWatch link.
These are the logs for the Selenium container:
2021-04-25 04:29:22,559 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf"
2021-04-25 04:29:22,561 INFO supervisord started with pid 8
2021-04-25 04:29:23,563 INFO spawned: 'xvfb' with pid 10
2021-04-25 04:29:23,565 INFO spawned: 'selenium-standalone' with pid 11
2021-04-25 04:29:24,566 INFO success: xvfb entered RUNNING state
2021-04-25 04:29:24,566 INFO success: selenium-standalone entered RUNNING state
04:29:25.167 INFO Selenium server version: 3.141.59, revision: e82be7d358
04:29:25.765 INFO Launching a standalone Selenium Server on port 4444
04:29:27.661 INFO Initialising WebDriverServlet
04:29:28.562 INFO Selenium Server is up and running on port 4444
04:29:34.765 INFO Detected dialect: W3C
04:29:35.061 INFO Started new session 04468dff0f93c1e29008b35b019799b1
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2021-04-25 04:29:39,156 WARN received SIGTERM indicating exit request
2021-04-25 04:29:39,156 INFO waiting for xvfb, selenium-standalone to die
2021-04-25 04:29:40,158 INFO stopped: selenium-standalone (terminated by SIGTERM)
2021-04-25 04:29:41,160 INFO stopped: xvfb (terminated by SIGTERM)
The logs above have been abridged and edited for clarity.
The Selenium container was initialised at around 04:29:22 and terminated at 04:29:41. It was ready to receive requests at 04:29:28 and created a single new session at 04:29:35.
And these are the logs for the crawler.
2021-04-25 04:29:22,557 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:22,557 [   INFO] Sleeping.
2021-04-25 04:29:23,560 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:23,561 [   INFO] Sleeping.
2021-04-25 04:29:24,656 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:24,656 [   INFO] Sleeping.
2021-04-25 04:29:25,756 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:25,756 [   INFO] Sleeping.
2021-04-25 04:29:26,761 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:26,761 [   INFO] Sleeping.
2021-04-25 04:29:27,766 [WARNING] Unable to communicate with localhost on port 4444.
2021-04-25 04:29:27,767 [   INFO] Sleeping.
2021-04-25 04:29:28,860 [   INFO] Can communicate with localhost on port 4444!
2021-04-25 04:29:38,258 [   INFO] Retrieved URL: https://www.google.com/.
The crawler container was initialised at 04:29:22. This was before the Selenium container was ready to receive requests, so the first few attempts to communicate with Selenium were unsuccessful. However, by 04:29:28 the Selenium container was ready to receive requests (see Selenium logs above) and the crawler was able to establish communication. It was at this point that the crawler triggered the creation of a new Selenium session and retrieved the content of https://www.google.com/. The crawler then exited, which in turn caused the Selenium container to terminate.
Waiting for the Selenium service is paramount. If we didn’t wait then the crawler would fail and terminate before Selenium was ready to accept requests. In later posts we’ll see how to create explicit dependencies between containers.
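The same wait-and-retry logic can also be implemented without nc, using only Python's standard socket module. A sketch (the wait_for_port helper is our own, equivalent to the nc -z check in the crawler script):

```python
import socket
from time import sleep

def wait_for_port(host, port, retries=10, delay=1.0):
    """Return True once a TCP connection to host:port succeeds, or False after all retries."""
    for _ in range(retries):
        try:
            # create_connection() performs the TCP handshake, just like `nc -z`.
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            sleep(delay)
    return False
```

This avoids the dependency on the netcat package in the Docker image, at the cost of losing the fun of running shell commands from Python.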
So there you have it, a minimal setup to run your containers in a serverless environment on AWS Elastic Container Service. Although I’ve illustrated the principles with a simple Selenium crawler, the same approach can be used for many other container configurations.