Related Post
]]>It is method which enables us to highlight the most frequently used keywords in a paragraph of texts or compilation of several text documents.
It is the visual representation of text data, especially the keywords in the text documents.
R has very simple and straightforward approaches for text mining and creating word clouds.
The text mining package “(tm)” will be used for mining the text and the word cloud generator package (wordcloud) will be used for visualizing the keywords as a word cloud.
As the starting point of qualitative research, you need to create the text file. Here I’ve used the lecture delivered by great Indian Hindu monk Swami Vivekananda at the first World’s Parliament of Religions held from 11 to 27 September 1893. Only two lecture notes – opening and closing address, will be used.
Both the lectures are saved in text file (chicago).
library("tm") library("SnowballC") library("wordcloud") library("RColorBrewer")
The text file (chicago) is imported using the following code in R.
The R code for leading the text is given below:
text <- readLines(file.choose())
The ‘text’ object will now be loaded as ‘Corpora’ which are collections of documents containing (natural language) text. The Corpus() function from text mining(tm) package will be used for this purpose.
The R code for building the corpus is given below:
docs <- Corpus(VectorSource(text))
Next use the function inspect() under the tm package to display detailed information of the text document.
The R code for inspecting the text is given below:
inspect(docs) The output is not, however, produced here due to space constraint
After inspecting the text document (corpora), it is required to perform some text transformation for replacing special characters from the text. To do this, use the ‘tm_map()’ function.
The R code for transformation of the text is given below:
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x)) docs <- tm_map(docs, toSpace, "/") docs <- tm_map(docs, toSpace, "@") docs <- tm_map(docs, toSpace, "\\|")
After removing the special characters from the text, it is now the time to remove the to remove unnecessary white space, to convert the text to lower case, to remove common stopwords like ‘the’, “we”. This is required as the The information value of ‘stopwords’ is near zero due to the fact that they are so common in a language. For doing this exercise, the same ‘tm_map()’ function will be used.
The R code for cleaning the text along with the short self-explanation is given below:
# Convert the text to lower case docs <- tm_map(docs, content_transformer(tolower)) # Remove numbers docs <- tm_map(docs, removeNumbers) # Remove english common stopwords docs <- tm_map(docs, removeWords, stopwords("english")) # Remove your own stop word # specify your stopwords as a character vector docs <- tm_map(docs, removeWords, c("I", "my")) # Remove punctuations docs <- tm_map(docs, removePunctuation) # Eliminate extra white spaces docs <- tm_map(docs, stripWhitespace)
Document matrix is the frequency distribution of the words used in the given text. I hope that readers will easily understand this frequency distribution of words.
The R function TermDocumentMatrix() from the text mining package ‘tm’ will be used for building this frequency table for words in the given text.
The R code is given below:
docs_matrix <- TermDocumentMatrix(docs) m <- as.matrix(docs_matrix) v <- sort(rowSums(m),decreasing=TRUE) d <- data.frame(word = names(v),freq=v)
head(d, 10) word freq religions religions 7 world world 6 earth earth 6 become become 6 hindu hindu 5 religion religion 5 thanks thanks 5 different different 4 men men 4 proud proud 4
Finally, the frequency table of the words (document matrix) will be visualized graphically by plotting in a word cloud with the help of the following R code.
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
You can also use barplot to plot the frequencies of the keywords using the following R code:
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word, col ="lightblue", main ="Most commonly used words", ylab = "Word frequencies", xlab="Keywords")
The above word cloud clearly shows that “religions”, “earth”, “world”, “hindu”, “one” etc. are the most important words in the lecture delivered by Swamiji in Chicago World’s Parliament of Religions.
Related Post
]]>Related Post
]]>RECENCY (R): Days since last purchase
FREQUENCY (F): Total number of purchases
MONETARY VALUE (M): Total money this customer spent
Step 1: Calculate the RFM metrics for each customer.
Step 2: Add segment numbers to RFM table.
Step 3: Sort according to the RFM scores from the best customers (score 111).
Since RFM is based on user activity data, the first thing we need is data.
The dataset we are using today is the same as when we did Market Basket Analysis — Online retail data set that can be downloaded from UCI Machine Learning Repository.
import pandas as pd import warnings warnings.filterwarnings('ignore') df = pd.read_excel("Online_Retail.xlsx") df.head() df1 = df
The dataset contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered online retailer.
It took a few minutes to load the data, so I keep a copy as a backup.
1. Missing values in important columns;
2. Customers distribution in each country;
3. Unit price and Quantity should > 0;
4. Invoice date should < today.
There were 38 unique countries as follows:
df1.Country.nunique() df1.Country.unique() 38 array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany', 'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal', 'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland', 'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria', 'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore', 'Lebanon', 'United Arab Emirates', 'Saudi Arabia', 'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA', 'European Community', 'Malta', 'RSA'], dtype=object)
customer_country=df1[['Country','CustomerID']].drop_duplicates() customer_country.groupby(['Country'])['CustomerID'].aggregate('count').reset_index().sort_values('CustomerID', ascending=False) Country CustomerID 36 United Kingdom 3950 14 Germany 95 13 France 87 31 ....
However, more than 90% of the customers in the data are from the United Kingdom. There’s some research indicating that customer clusters vary by geography, so I’ll restrict the data to the United Kingdom only.
df1 = df1.loc[df1['Country'] == 'United Kingdom']
Check whether there are missing values in each column.
df1.isnull().sum(axis=0) InvoiceNo 0 StockCode 0 Description 1454 Quantity 0 InvoiceDate 0 UnitPrice 0 CustomerID 133600 Country 0 dtype: int64
There are 133,600 missing values in CustomerID column, since our analysis is based on customers, we will remove these missing values.
df1 = df1[pd.notnull(df1['CustomerID'])]
Check the minimum values in UnitPrice and Quantity columns.
df1 = df1[pd.notnull(df1['CustomerID'])] 0.0
df1.Quantity.min() -80995
Remove the negative values in Quantity column.
df1 = df1[(df1['Quantity']>0)] df1.shape df1.info() (354345, 8) Int64Index: 354345 entries, 0 to 541893 Data columns (total 8 columns): InvoiceNo 354345 non-null object StockCode 354345 non-null object Description 354345 non-null object Quantity 354345 non-null int64 InvoiceDate 354345 non-null datetime64[ns] UnitPrice 354345 non-null float64 CustomerID 354345 non-null float64 Country 354345 non-null object dtypes: datetime64[ns](1), float64(2), int64(1), object(4) memory usage: 24.3+ MB
After cleaning up, we are now dealing with 354,345 rows and 8 columns.
Check unique value for each column.
def unique_counts(df1): for i in df1.columns: count = df1[i].nunique() print(i, ": ", count) unique_counts(df1) InvoiceNo : 16649 StockCode : 3645 Description : 3844 Quantity : 294 InvoiceDate : 15615 UnitPrice : 403 CustomerID : 3921 Country : 1
Add a column for total price.
df1['TotalPrice'] = df1['Quantity'] * df1['UnitPrice']
Find out first and last order date in the data.
df1['InvoiceDate'].min() df1['InvoiceDate'].max() Timestamp('2010-12-01 08:26:00') Timestamp('2011-12-09 12:49:00')
Since recency is calculated for a point in time. The last invoice date is 2011–12–09, we will use 2011–12–10 to calculate recency.
import datetime as dt NOW = dt.datetime(2011,12,10) df1['InvoiceDate'] = pd.to_datetime(df1['InvoiceDate'])
RFM segmentation starts from here.
Create a RFM table and Calculate RFM metrics for each customer
rfmTable = df1.groupby('CustomerID').agg({'InvoiceDate': lambda x: (NOW - x.max()).days, 'InvoiceNo': lambda x: len(x), 'TotalPrice': lambda x: x.sum()}) rfmTable['InvoiceDate'] = rfmTable['InvoiceDate'].astype(int) rfmTable.rename(columns={'InvoiceDate': 'recency', 'InvoiceNo': 'frequency', 'TotalPrice': 'monetary_value'}, inplace=True) rfmTable.head()
Gives this table:
Interpretation:
Let’s check the details of the first customer.
first_customer = df1[df1['CustomerID']== 12346.0] first_customer
Gives this table:
The first customer has shopped only once, bought one product at a huge quantity(74,215). The unit price is very low, seems a clearance sale.
The easiest way to split metrics into segments is by using quartile.
quantiles = rfmTable.quantile(q=[0.25,0.5,0.75]) quantiles = quantiles.to_dict()
segmented_rfm = rfmTable
Lowest recency, highest frequency and monetary are our best customers.
def RScore(x,p,d): if x <= d[p][0.25]: return 1 elif x <= d[p][0.50]: return 2 elif x <= d[p][0.75]: return 3 else: return 4 def FMScore(x,p,d): if x <= d[p][0.25]: return 4 elif x <= d[p][0.50]: return 3 elif x <= d[p][0.75]: return 2 else: return 1
segmented_rfm['r_quartile'] = segmented_rfm['recency'].apply(RScore, args=('recency',quantiles,)) segmented_rfm['f_quartile'] = segmented_rfm['frequency'].apply(FMScore, args=('frequency',quantiles,)) segmented_rfm['m_quartile'] = segmented_rfm['monetary_value'].apply(FMScore, args=('monetary_value',quantiles,)) segmented_rfm.head()
Gives this table:
RFM segments split your customer base into an imaginary 3D cube. It is hard to visualize. However, we can sort it out.
Add a new column to combine RFM score, 111 is the highest score as we determined earlier
segmented_rfm['RFMScore'] = segmented_rfm.r_quartile.map(str) + segmented_rfm.f_quartile.map(str) + segmented_rfm.m_quartile.map(str) segmented_rfm.head()
Gives this table:
It is obvious that the first customer is not our best customer at all.
segmented_rfm[segmented_rfm['RFMScore']=='111'].sort_values('monetary_value', ascending=False).head(10)
Interest to learn more?
Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.
Reference: Blast Analytics and Marketing
Related Post
]]>Related Post
]]>The best example of unsupervised learning is when a small child is given some unlabelled pictures of cats and dogs , so only by looking at the structural and visual similarities and dissimilarities between the images , the child classifies one as a dog and other as cat.
Unsupervised learning is inclined towards finding groups and subgroups from data by finding the associations, similarities and relationships between the inputs \(X_i\). It is important for understanding the variations and grouping structure of a dataset and is also used as a pre-processing tool for finding the best and most important features \(X_i\) which explain the most variance and summarize the most information in the data using techniques such as principal component analysis(PCA) for supervised learning techniques.
example-If we have a dataset with 100 predictors and we wanted to generate a model,it would be highly inefficient to use all those 100 predictors because that would increase the variance and complexity of the model and which in turn would lead to overfitting. Instead, what PCA does is find 10 most correlated variables and linearly combine them to generate principal components -\(Z_m\) which could be further used as features for our model.
PCA introduces a lower-dimensional representation of the dataset.It finds a sequence of linear combination of the variables called the principal components-\(Z_1,Z_2…Z_m\) that explain the maximum variance and summarize the most information in the data and are mutually uncorrelated.
What we try to do is find most relevant set of variables and simply linearly combine the set of variables into a single variable-\(Z_m\) called a principal component.
We have tons of correlated variables in a high dimensional dataset and what PCA tries to do is pair and combine them to a set of some important variables that summarize all information in the data.
PCA will give us new set of variables called principal components which could further be used as inputs in a supervised learning model. So now we have lesser and most important set of variables paired together to form a new single variable which explains most variance in data. This technique is often termed as dimensionality reduction which is famous technique to do feature selection/reduction and use only relevant features \(X_i\) in the model.
We have a set of input vectors \( x_1,x_2,x_3…..x_p\) with \(n\) observations in dataset.
The 1st principal component \(Z_1\) of a set of features is the normalized linear combination of the features \(x_1,x_2….x_p\).
$$Z_1 = \sum_{i=1}^p \phi_{1}x_1 + \phi_{{2}}x_2 + \phi_{3}x_3 + ………\phi_{i}x_i $$
where n=no of observations, p = number of variables. It is a linear combination to find out the highest variance across data. By normalized it means \( \sum_{j=1}^{p} \phi_{j}^2 = 1\).
We refer to the weights \( \phi_{n X p}\) as Loading matrix.The loadings make up the principal components loading vector. \( \phi_1 = ( \phi_{11},\phi_{21},\phi_{31} , \phi_{41}……,\phi_{n1} )^T \) is the loading vector for \(PC_1\).
We constrain the loadings so that their sum of squares could be 1, as otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance
The first Principal component solves the below optimization problem of maximizing variance across the components-
$$maximize: \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^p (\phi_{ji}.X_{ij})^2 subject \ to \sum_{j=1}^p \phi_{ji}^2=1 $$
Here each principal component has mean 0.
The above problem can be solved via Single value decomposition of matrix \(X\) ,which is a standard technique in linear algebra.
Enough maths now let’s start implementing PCA in R.
We will use USAarrests data.
?USArrests #dataset which contains Violent Crime Rates by US State dim(USArrests) dimnames(USArrests) ## [1] 50 4 ## [[1]] ## [1] "Alabama" "Alaska" "Arizona" "Arkansas" ## [5] "California" "Colorado" "Connecticut" "Delaware" ## [9] "Florida" "Georgia" "Hawaii" "Idaho" ## [13] "Illinois" "Indiana" "Iowa" "Kansas" ## [17] "Kentucky" "Louisiana" "Maine" "Maryland" ## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi" ## [25] "Missouri" "Montana" "Nebraska" "Nevada" ## [29] "New Hampshire" "New Jersey" "New Mexico" "New York" ## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma" ## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina" ## [41] "South Dakota" "Tennessee" "Texas" "Utah" ## [45] "Vermont" "Virginia" "Washington" "West Virginia" ## [49] "Wisconsin" "Wyoming" ## ## [[2]] ## [1] "Murder" "Assault" "UrbanPop" "Rape"
Finding means for all variables.
#finding mean of all apply(USArrests,2,mean) ## Murder Assault UrbanPop Rape ## 7.788 170.760 65.540 21.232
Finding variance of all variables.
apply(USArrests,2,var) ## Murder Assault UrbanPop Rape ## 18.97047 6945.16571 209.51878 87.72916
There is a lot of difference in variances of each variables. In PCA mean does not play a major role, but variance plays a major role in defining principal components so very large differences in variance value of a variable will definately dominate the principal components. We need to standardize the variables so as to get mean \(\mu=0\) and variance \(\sigma^2=1\). To standardize we use formula \(x’ = \frac{x – mean(x)}{sd(x)}\).
The function prcomp()
will do the needful of standardizing the variables.
pca.out<-prcomp(USArrests,scale=TRUE) pca.out ## Standard deviations (1, .., p=4): ## [1] 1.5748783 0.9948694 0.5971291 0.4164494 ## ## Rotation (n x k) = (4 x 4): ## PC1 PC2 PC3 PC4 ## Murder -0.5358995 0.4181809 -0.3412327 0.64922780 ## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748 ## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773 ## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
#summary of the PCA summary(pca.out) ## Importance of components%s: ## PC1 PC2 PC3 PC4 ## Standard deviation 1.5749 0.9949 0.59713 0.41645 ## Proportion of Variance 0.6201 0.2474 0.08914 0.04336 ## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
names(pca.out) ## [1] "sdev" "rotation" "center" "scale" "x"
Now as we can see maximum % of variance is explained by \(PC_1\), and all PCs are mutually uncorrelated. Around 62 % of variance is explained by \(PC_1\).
Let’s build a biplot to understand better.
biplot(pca.out,scale = 0, cex=0.65)
Now in the above plot red colored arrows represent the variables and each direction represent the direction which explains the most variation. Example- for all the countries in the direction of ‘UrbanPop’ are countries with most urban-population and opposite to that direction are the countries with least. So this is how we interpret our Bi-plot.
PCA is a great pre-processing tool for picking out the most relevant linear combination of variables and use them in our predictive model.It helps us find out the variables which explain the most variation in the data and only use them. PCA plays a major role in the data analysis process before going for advanced analytics and model building.
The only drawback PCA has is that it generates the principal components in a unsupervised manner i.e without looking the target vector \( (y_1,y_2,y_3…..y_n) \) ,hence the principal components which explains the most variation in dataset without looking at the target-\(Y\) variable,may or may not explain good percentage of variance for the response variable \(Y\) which could affect and degrade the performance of the predictive model.
The R code for implementing PCA in R is adapted from the amazing online course “Statistical learning” offered by Stanford University Online. I urge readers to definately go and try out this course to get clear with the core statistics and maths behind various statistical models. The details about the course can be found here-Statistical Learning
Also, this book Elements Of statistical learning has helped me learn lots of amazing stuff
Hope you guys liked the article, make sure to like and share it. Happy machine learning!!
Related Post
]]>Related Post
The dataset considered for the analysis is the Arbuthnot dataset containing information of male and female births in London from year 1639 to 1710. Based on that, a metric representing the fractional excess of boys births versus girls is defined as:
$$
\begin{equation}
\begin{aligned}
\dfrac{(Boys – Girls)}{Girls}
\end{aligned}
\end{equation}
$$
The time series so defined is analyzed to determine candidate ARIMA models. The present tutorial is so organized. First, a brief exploratory analysis is carried on. Then, six ARIMA models are defined, analyzed and compared. Forecast of the time series under analysis is computed. Finally, a short historical background digression is introduced describing what was happening in England on 17-th century and citing studies on the matter of sex ratio at birth.
suppressPackageStartupMessages(library(ggplot2)) suppressPackageStartupMessages(library(forecast)) suppressPackageStartupMessages(library(astsa)) suppressPackageStartupMessages(library(lmtest)) suppressPackageStartupMessages(library(fUnitRoots)) suppressPackageStartupMessages(library(FitARMA)) suppressPackageStartupMessages(library(strucchange)) suppressPackageStartupMessages(library(reshape)) suppressPackageStartupMessages(library(Rmisc)) suppressPackageStartupMessages(library(fBasics))
Note: The fUnitRoots package is not any longer maintained by CRAN, however it can be installed from source available at the following link:
fUnitRoots version 3010.78 sources
Loading the Arbuthnot dataset and showing some basic metrics and plots.
url <- "https://www.openintro.org/stat/data/arbuthnot.csv" abhutondot <- read.csv(url, header=TRUE) nrow(abhutondot) [1] 82
head(abhutondot) year boys girls 1 1629 5218 4683 2 1630 4858 4457 3 1631 4422 4102 4 1632 4994 4590 5 1633 5158 4839 6 1634 5035 4820
abhutondot_rs <- melt(abhutondot, id = c("year")) head(abhutondot_rs) year variable value 1 1629 boys 5218 2 1630 boys 4858 3 1631 boys 4422 4 1632 boys 4994 5 1633 boys 5158 6 1634 boys 5035
tail(abhutondot_rs) year variable value 159 1705 girls 7779 160 1706 girls 7417 161 1707 girls 7687 162 1708 girls 7623 163 1709 girls 7380 164 1710 girls 7288
ggplot(data = abhutondot_rs, aes(x = year)) + geom_line(aes(y = value, colour = variable)) + scale_colour_manual(values = c("blue", "red"))
Boys births appear to be consistently greater than girls ones. Let us run a t.test to further verify if there is a true difference in the mean of the two groups as represented by boys and girls births.
t.test(value ~ variable, data = abhutondot_rs) Welch Two Sample t-test data: value by variable t = 1.4697, df = 161.77, p-value = 0.1436 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -128.0016 872.9040 sample estimates: mean in group boys mean in group girls 5907.098 5534.646
Based on the p-value, we cannot reject the null hypothesis that the true difference in means is equal to zero.
basicStats(abhutondot[-1]) boys girls nobs 8.200000e+01 8.200000e+01 NAs 0.000000e+00 0.000000e+00 Minimum 2.890000e+03 2.722000e+03 Maximum 8.426000e+03 7.779000e+03 1. Quartile 4.759250e+03 4.457000e+03 3. Quartile 7.576500e+03 7.150250e+03 Mean 5.907098e+03 5.534646e+03 Median 6.073000e+03 5.718000e+03 Sum 4.843820e+05 4.538410e+05 SE Mean 1.825161e+02 1.758222e+02 LCL Mean 5.543948e+03 5.184815e+03 UCL Mean 6.270247e+03 5.884477e+03 Variance 2.731595e+06 2.534902e+06 Stdev 1.652754e+03 1.592137e+03 Skewness -2.139250e-01 -2.204720e-01 Kurtosis -1.221721e+00 -1.250893e+00
p1 <- ggplot(data = abhutondot_rs, aes(x = variable, y = value)) + geom_boxplot() p2 <- ggplot(data = abhutondot, aes(boys)) + geom_density() p3 <- ggplot(data = abhutondot, aes(girls)) + geom_density() multiplot(p1, p2, p3, cols = 3)
Gives this plot.
Let us define the time series to be analysed with frequency = 1 as data is collected yearly, see ref. [5] for details.
excess_frac <- (abhutondot$boys - abhutondot$girls)/abhutondot$girls excess_ts <- ts(excess_frac, frequency = 1, start = abhutondot$year[1]) autoplot(excess_ts)
Basic statistics metrics are reported.
basicStats(excess_frac) excess_frac nobs 82.000000 NAs 0.000000 Minimum 0.010673 Maximum 0.156075 1. Quartile 0.048469 3. Quartile 0.087510 Mean 0.070748 Median 0.064704 Sum 5.801343 SE Mean 0.003451 LCL Mean 0.063881 UCL Mean 0.077615 Variance 0.000977 Stdev 0.031254 Skewness 0.680042 Kurtosis 0.073620
Boys births were at least 1% higher than girls ones, reaching a top percentage excess equal to 15%.
Further, unit roots tests (run by urdfTest() within fUnitRoots package) show that we cannot reject the null hypothesis of unit root presence. See their test statistics compared with critical values below (see ref. [2] for details about the urdfTest() report).
urdftest_lag = floor(12*(length(excess_ts)/100)^0.25) urdfTest(excess_ts, type = "nc", lags = urdftest_lag, doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression none Call: lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.052739 -0.018246 -0.002899 0.019396 0.069349 Coefficients: Estimate Std. Error t value Pr(>|t|) z.lag.1 -0.01465 0.05027 -0.291 0.771802 z.diff.lag1 -0.71838 0.13552 -5.301 1.87e-06 *** z.diff.lag2 -0.66917 0.16431 -4.073 0.000143 *** z.diff.lag3 -0.58640 0.18567 -3.158 0.002519 ** z.diff.lag4 -0.56529 0.19688 -2.871 0.005700 ** z.diff.lag5 -0.58489 0.20248 -2.889 0.005432 ** z.diff.lag6 -0.60278 0.20075 -3.003 0.003944 ** z.diff.lag7 -0.43509 0.20012 -2.174 0.033786 * z.diff.lag8 -0.28608 0.19283 -1.484 0.143335 z.diff.lag9 -0.14212 0.18150 -0.783 0.436769 z.diff.lag10 0.13232 0.15903 0.832 0.408801 z.diff.lag11 -0.07234 0.12774 -0.566 0.573346 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03026 on 58 degrees of freedom Multiple R-squared: 0.4638, Adjusted R-squared: 0.3529 F-statistic: 4.181 on 12 and 58 DF, p-value: 0.0001034 Value of test-statistic is: -0.2914 Critical values for test statistics: 1pct 5pct 10pct tau1 -2.6 -1.95 -1.61
urdfTest(excess_ts, type = "c", lags = urdftest_lag, doplot = FALSE) Title: Augmented Dickey-Fuller Unit Root Test Test Results: Test regression drift Call: lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag) Residuals: Min 1Q Median 3Q Max -0.051868 -0.018870 -0.005227 0.019503 0.067936 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.02239 0.01789 1.251 0.2159 z.lag.1 -0.31889 0.24824 -1.285 0.2041 z.diff.lag1 -0.44287 0.25820 -1.715 0.0917 . z.diff.lag2 -0.40952 0.26418 -1.550 0.1266 z.diff.lag3 -0.34933 0.26464 -1.320 0.1921 z.diff.lag4 -0.35207 0.25966 -1.356 0.1805 z.diff.lag5 -0.39863 0.25053 -1.591 0.1171 z.diff.lag6 -0.44797 0.23498 -1.906 0.0616 . z.diff.lag7 -0.31103 0.22246 -1.398 0.1675 z.diff.lag8 -0.19044 0.20656 -0.922 0.3604 z.diff.lag9 -0.07128 0.18928 -0.377 0.7079 z.diff.lag10 0.18023 0.16283 1.107 0.2730 z.diff.lag11 -0.04154 0.12948 -0.321 0.7495 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03012 on 57 degrees of freedom Multiple R-squared: 0.4781, Adjusted R-squared: 0.3683 F-statistic: 4.352 on 12 and 57 DF, p-value: 6.962e-05 Value of test-statistic is: -1.2846 0.8257 Critical values for test statistics: 1pct 5pct 10pct tau2 -3.51 -2.89 -2.58 phi1 6.70 4.71 3.86
ACF and PACF plots are given.
par(mfrow=c(1,2)) acf(excess_ts) pacf(excess_ts)
We observe the auto-correlation spike at lag = 10 beyond confidence region. That suggests the presence of a seasonal component with period = 10. Structural changes are now investigated. First let us verify if regression against a constant is significative for our time series.
summary(lm(excess_ts ~ 1)) Call: lm(formula = excess_ts ~ 1) Residuals: Min 1Q Median 3Q Max -0.060075 -0.022279 -0.006044 0.016762 0.085327 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.070748 0.003451 20.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.03125 on 81 degrees of freedom
The intercept is reported as significative. Let us see if there are any structural breaks.
(break_point <- breakpoints(excess_ts ~ 1)) Optimal 2-segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: 42 Corresponding to breakdates: 1670
plot(break_point)
summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = excess_ts ~ 1) Breakpoints at observation number: m = 1 42 m = 2 20 42 m = 3 20 35 48 m = 4 20 35 50 66 m = 5 17 30 42 56 69 Corresponding to breakdates: m = 1 1670 m = 2 1648 1670 m = 3 1648 1663 1676 m = 4 1648 1663 1678 1694 m = 5 1645 1658 1670 1684 1697 Fit: m 0 1 2 3 4 5 RSS 0.07912 0.06840 0.06210 0.06022 0.05826 0.05894 BIC -327.84807 -330.97945 -330.08081 -323.78985 -317.68933 -307.92410
The BIC minimum value is reached when m = 1, hence just one break point is determined corresponding to year 1670. Let us plot the original time series against its structural break and its confidence interval.
plot(excess_ts) lines(fitted(break_point, breaks = 1), col = 4) lines(confint(break_point, breaks = 1))
Boys vs girls sex ratio at birth changed from:
fitted(break_point)[1] 0.08190905
to:
fitted(breakpoint)[length(excess_ts)] 0.05902908
Running a t.test() to verify further the difference in mean is significative across the two time windows identified by the breakpoint date, year 1670.
break_date <- breakdates(break_point) win_1 <- window(excess_ts, end = break_date) win_2 <- window(excess_ts, start = break_date + 1) t.test(win_1, win_2) Welch Two Sample t-test data: win_1 and win_2 t = 3.5773, df = 70.961, p-value = 0.0006305 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.01012671 0.03563322 sample estimates: mean of x mean of y 0.08190905 0.05902908
Based on reported p-value, we reject the null hypothesis that the true difference is equal to zero.
I am going to compare the following six ARIMA models (represented with the usual (p,d,q)(P,D,Q)[S] notation):
1. non seasonal (1,1,1), as determined by auto.arima() within forecast package 2. seasonal (1,0,0)(0,0,1)[10] 3. seasonal (1,0,0)(1,0,0)[10] 4. seasonal (0,0,0)(0,0,1)[10] with level shift regressor as intervention variable 5. seasonal (1,0,0)(0,0,1)[10] with level shift regressor as intervention variable 6. non seasonal (1,0,0) with level shift regressor as intervention variable
Here we go.
The first model is determined by the auto.arima() function within the forecast package, using the options:
a. stepwise = FALSE, which allows for a more in-depth search of potential models
b. trace = TRUE, which allows to get a list of all the investigated models
Further, as default input to auto.arima() :
c. stationary = FALSE, so that models search is not restricted to stationary models
d. seasonal = TRUE, so that models search is not restricted to non seasonal models
(model_1 <- auto.arima(excess_ts, stepwise = FALSE, trace = TRUE)) ARIMA(0,1,0) : -301.4365 ARIMA(0,1,0) with drift : -299.3722 ARIMA(0,1,1) : -328.9381 ARIMA(0,1,1) with drift : -326.9276 ARIMA(0,1,2) : -329.4061 ARIMA(0,1,2) with drift : Inf ARIMA(0,1,3) : -327.2841 ARIMA(0,1,3) with drift : Inf ARIMA(0,1,4) : -325.7604 ARIMA(0,1,4) with drift : Inf ARIMA(0,1,5) : -323.4805 ARIMA(0,1,5) with drift : Inf ARIMA(1,1,0) : -312.8106 ARIMA(1,1,0) with drift : -310.7155 ARIMA(1,1,1) : -329.5727 ARIMA(1,1,1) with drift : Inf ARIMA(1,1,2) : -327.3821 ARIMA(1,1,2) with drift : Inf ARIMA(1,1,3) : -325.1085 ARIMA(1,1,3) with drift : Inf ARIMA(1,1,4) : -323.446 ARIMA(1,1,4) with drift : Inf ARIMA(2,1,0) : -317.1234 ARIMA(2,1,0) with drift : -314.9816 ARIMA(2,1,1) : -327.3795 ARIMA(2,1,1) with drift : Inf ARIMA(2,1,2) : -325.0859 ARIMA(2,1,2) with drift : Inf ARIMA(2,1,3) : -322.9714 ARIMA(2,1,3) with drift : Inf ARIMA(3,1,0) : -315.9114 ARIMA(3,1,0) with drift : -313.7128 ARIMA(3,1,1) : -325.1497 ARIMA(3,1,1) with drift : Inf ARIMA(3,1,2) : -323.1363 ARIMA(3,1,2) with drift : Inf ARIMA(4,1,0) : -315.3018 ARIMA(4,1,0) with drift : -313.0426 ARIMA(4,1,1) : -324.2463 ARIMA(4,1,1) with drift : -322.0252 ARIMA(5,1,0) : -315.1075 ARIMA(5,1,0) with drift : -312.7776 Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7
As we can see, all investigated models have d=1, which is congruent with the augmented Dickey-Fuller test results.
summary(model_1) Series: excess_ts ARIMA(1,1,1) Coefficients: ar1 ma1 0.2224 -0.9258 s.e. 0.1318 0.0683 sigma^2 estimated as 0.0009316: log likelihood=167.94 AIC=-329.88 AICc=-329.57 BIC=-322.7 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.002931698 0.02995934 0.02405062 -27.05674 46.53832 0.8292635 -0.01005689
The significance of our (1,1,1) model coefficients is further investigated.
coeftest(model_1) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.222363 0.131778 1.6874 0.09153 . ma1 -0.925836 0.068276 -13.5603 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A spike at lag = 1 in the ACF plot suggests the presence of an auto-regressive component. Our second model candidate takes into account the seasonality observed at lag = 10. As a result, the candidate model (1,0,0)(0,0,1)[10] is investigated.
model_2 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period = 10), include.mean = TRUE) summary(model_2) Series: excess_ts ARIMA(1,0,0)(0,0,1)[10] with non-zero mean Coefficients: ar1 sma1 mean 0.2373 0.3441 0.0708 s.e. 0.1104 0.1111 0.0053 sigma^2 estimated as 0.0008129: log likelihood=176.23 AIC=-344.46 AICc=-343.94 BIC=-334.83 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.0002212383 0.02798445 0.02274858 -21.47547 42.96717 0.7843692 -0.004420048
coeftest(model_2) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.2372975 0.1104323 2.1488 0.031650 * sma1 0.3440811 0.1110791 3.0976 0.001951 ** intercept 0.0707836 0.0052663 13.4409 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the third model, I introduce a seasonal auto-regressive component in place of the moving average one as present into model #2.
model_3 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(1,0,0), period = 10), include.mean = TRUE) summary(model_3) Series: excess_ts ARIMA(1,0,0)(1,0,0)[10] with non-zero mean Coefficients: ar1 sar1 mean 0.2637 0.3185 0.0705 s.e. 0.1069 0.1028 0.0058 sigma^2 estimated as 0.0008173: log likelihood=176.1 AIC=-344.19 AICc=-343.67 BIC=-334.56 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -0.0002070952 0.02806013 0.02273145 -21.42509 42.85735 0.7837788 -0.005665764
coeftest(model_3) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.2636602 0.1069472 2.4653 0.013689 * sar1 0.3185397 0.1027903 3.0989 0.001942 ** intercept 0.0704636 0.0058313 12.0836 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This model takes into account the level shift highlighted by the structural change analysis and the seasonal component at lag = 10 observed in the ACF plot. To represent the structural change as level shift, a regressor variable named as level is defined as equal to zero for the timeline preceeding the breakpoint date and as equal to one afterwards such date. In other words, it is a step function which represents a permanent level shift. Such variable is input to the Arima() function as xreg argument. That is one of the most simple representation of an intervention variable, for a more complete overview see ref. [6].
level <- c(rep(0, break_point$breakpoints), rep(1, length(excess_ts) - break_point$breakpoints)) model_4 <- Arima(excess_ts, order = c(0,0,0), seasonal = list(order = c(0,0,1), period = 10), xreg = level, include.mean = TRUE) summary(model_4) Series: excess_ts Regression with ARIMA(0,0,0)(0,0,1)[10] errors Coefficients: sma1 intercept level 0.3437 0.0817 -0.0225 s.e. 0.1135 0.0053 0.0072 sigma^2 estimated as 0.0007706: log likelihood=178.45 AIC=-348.9 AICc=-348.38 BIC=-339.27 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 0.0001703111 0.02724729 0.02184016 -19.82639 41.28977 0.7530469 0.1608774
coeftest(model_4) z test of coefficients: Estimate Std. Error z value Pr(>|z|) sma1 0.3437109 0.1135387 3.0273 0.002468 ** intercept 0.0817445 0.0052825 15.4745 < 2.2e-16 *** level -0.0225294 0.0072468 -3.1089 0.001878 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The ‘level’ regressor coefficient indicates an average 2.2% decrease in the boys vs. girls excess birth ratio, afterwards year 1670.
The present model adds to model #4 an auto-regressive component.
model_5 <- Arima(excess_ts, order = c(1,0,0), seasonal = list(order = c(0,0,1), period=10), xreg = level, include.mean = TRUE) summary(model_5) Series: excess_ts Regression with ARIMA(1,0,0)(0,0,1)[10] errors Coefficients: ar1 sma1 intercept level 0.1649 0.3188 0.0820 -0.0230 s.e. 0.1099 0.1112 0.0061 0.0084 sigma^2 estimated as 0.0007612: log likelihood=179.56 AIC=-349.11 AICc=-348.32 BIC=-337.08 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 8.225255e-05 0.02690781 0.02174375 -19.42485 40.90147 0.7497229 0.007234682
coeftest(model_5) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.1648878 0.1099118 1.5002 0.133567 sma1 0.3187896 0.1112301 2.8660 0.004156 ** intercept 0.0819981 0.0061227 13.3926 < 2.2e-16 *** level -0.0230176 0.0083874 -2.7443 0.006064 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The auto-regressive coefficient ar1 is reported as not significative, hence this model is not taken into account further, even though it would provide a (very) slight improvement in the AIC metric.
This model consideres the structural change as in model #4 without the seasonal component. That because I want to evaluate if any benefits comes from including a seasonal component.
model_6 <- Arima(excess_ts, order = c(1,0,0), xreg = level, include.mean = TRUE) summary(model_6) Series: excess_ts Regression with ARIMA(1,0,0) errors Coefficients: ar1 intercept level 0.1820 0.0821 -0.0232 s.e. 0.1088 0.0053 0.0076 sigma^2 estimated as 0.0008369: log likelihood=175.68 AIC=-343.35 AICc=-342.83 BIC=-333.73 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set -7.777648e-05 0.02839508 0.02258574 -21.93086 43.2444 0.7787544 0.0003897161
coeftest(model_6) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ar1 0.1820054 0.1088357 1.6723 0.094466 . intercept 0.0821470 0.0053294 15.4139 < 2.2e-16 *** level -0.0232481 0.0076044 -3.0572 0.002234 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
It is relevant to verify if our models residuals are white noise or not. If not, the model under analysis has not captured all the structure of our original time series. For this purpose, I will take advantage of the checkresiduals() function within forecast package and the sarima() within the astsa package. What I like of the sarima() function is the residuals qqplot() with confidence region and the Ljung-Box plot to check significance of residuals correlations. If you like to further verify the Ljung-Box test results, I would suggest to take advantage of the LjungBoxTest() function within FitARMA package.
Notes:
1. Model #5 was taken out of the prosecution of the analysis, hence its residuals will not be checked.
2. checkresiduals() and sarima() gives textual outputs which are omitted as equivalent information is already included elsewhere. The checkresiduals() reports a short LjungBox test result. The sarima() function reports a textual model summary showing coefficients and metrics similar to already shown summaries.
Checking ARIMA model (1,1,1) residuals.
checkresiduals(model_1)
It is important to verify the significance of residuals auto-correlation, in order to see if the spike at lag = 16 is as such. In fact, the purpose of running the LjungBox test is to verify if any auto-correlation beyond the confidence region (as seen in the ACF plot) is a true positive (p-value < 0.05) or is false positive (p-value >= 0.05).
LjungBoxTest(residuals(model_1), k = 2, lag.max = 20) m Qm pvalue 1 0.01 0.92611002 2 0.01 0.91139925 3 0.17 0.68276539 4 0.82 0.66216496 5 1.35 0.71745835 6 1.99 0.73828536 7 3.32 0.65017878 8 3.98 0.67886254 9 5.16 0.64076272 10 13.34 0.10075806 11 15.32 0.08260244 12 15.32 0.12066369 13 15.35 0.16692082 14 15.39 0.22084407 15 15.40 0.28310047 16 23.69 0.04989871 17 25.63 0.04204503 18 27.65 0.03480954 19 30.06 0.02590249 20 30.07 0.03680262
The p-value at lag = 16 is < 0.05 hence the auto-correlation spike at lag = 16 out of the confidence interval is significative. Our model #1 has not captured all the structure of the original time series. Also auto-correlations at lags = {17, 18, 19, 20} have p-value < 0.05, however they stand within the confidence inteval.
Now doing the same with sarima() function.
sarima(excess_ts, p = 1, d = 1, q = 1)
The sarima() output plot shows results congruent with previous comments.
Checking ARIMA (1,0,0)(0,0,1)[10] model residuals.
checkresiduals(model_2)
We observe how the model #2 does not show auto-correlation spikes above the confidence interval, and that is a confirmation of the presence of seasonality with period = 10.
LjungBoxTest(residuals(model_2), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9674875 2 0.00 0.9545816 3 0.30 0.5815774 4 0.69 0.7096699 5 0.82 0.8442508 6 0.98 0.9121811 7 2.01 0.8481327 8 4.24 0.6442302 9 8.56 0.2861290 10 8.63 0.3742209 11 10.06 0.3459882 12 10.13 0.4290298 13 10.15 0.5172343 14 10.20 0.5985932 15 10.44 0.6577499 16 14.30 0.4275627 17 17.14 0.3104476 18 18.86 0.2759461 19 22.35 0.1715052 20 22.41 0.2142307
No p-value > 0.05 are shown by the LjungBox test report. Now showing sarima() plots.
sarima(excess_ts, p = 1, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10)
The sarima() output plot shows results congruent with previous comments. As a conclusion, model #2 has white noise residuals.
Checking ARIMA (1,0,0)(1,0,0)[10] model residuals.
checkresiduals(model_3)
LjungBoxTest(residuals(model_3), k = 2, lag.max = 20) m Qm pvalue 1 0.00 0.9583318 2 0.00 0.9553365 3 0.18 0.6719971 4 0.37 0.8313599 5 0.46 0.9285781 6 0.63 0.9600113 7 1.63 0.8975737 8 3.90 0.6901431 9 8.23 0.3126132 10 8.34 0.4005182 11 10.36 0.3222430 12 10.36 0.4091634 13 10.39 0.4960029 14 10.52 0.5708185 15 10.78 0.6290166 16 14.81 0.3914043 17 17.91 0.2675070 18 19.69 0.2343648 19 23.37 0.1374412 20 23.70 0.1651785
sarima(excess_ts, p = 1, d = 0, q = 0, P = 1, D = 0, Q = 0, S = 10)
Model#3 has white noise residuals.
Checking residuals of the ARIMA (0,0,0)(0,0,1)[10] model with level shift.
checkresiduals(model_4)
LjungBoxTest(residuals(model_4), k = 1, lag.max = 20) m Qm pvalue 1 2.20 0.1379312 2 2.23 0.1349922 3 2.24 0.3262675 4 3.68 0.2977682 5 4.99 0.2884494 6 5.23 0.3884858 7 5.52 0.4787801 8 7.45 0.3837810 9 10.78 0.2142605 10 10.79 0.2905934 11 12.61 0.2465522 12 12.61 0.3198632 13 12.61 0.3979718 14 12.63 0.4769589 15 12.65 0.5538915 16 16.37 0.3578806 17 16.77 0.4005631 18 19.73 0.2882971 19 25.25 0.1182396 20 25.31 0.1504763
sarima(excess_ts, p = 0, d = 0, q = 0, P = 0, D = 0, Q = 1, S = 10, xreg = level)
As a conclusion, model #4 has white noise residuals.
Checking residuals of the ARIMA (1,0,0) model with level shift.
checkresiduals(model_6)
We can observe ACF spikes above the confidence region for lags = {10, 16}.
LjungBoxTest(residuals(model_6), k = 1, lag.max = 20) m Qm pvalue 1 0.00 0.99713258 2 0.00 0.96036600 3 0.07 0.96612792 4 1.51 0.67947886 5 2.70 0.60998895 6 4.47 0.48361437 7 5.20 0.51895133 8 5.49 0.59997164 9 6.51 0.58979614 10 14.72 0.09890580 11 17.09 0.07239050 12 17.21 0.10175761 13 17.21 0.14179063 14 17.24 0.18844158 15 17.34 0.23872998 16 25.31 0.04596230 17 27.53 0.03591335 18 29.50 0.03021450 19 31.47 0.02537517 20 31.71 0.03370585
The LjungBox test reports the residuals auto-correlation at lag = 10 as not significative (p-value > 0.05). The lag = 16 residuals auto-correlation is significative (p-value < 0.05). Let us do same verifications with sarima().
sarima(excess_ts, p = 1, d = 0, q = 0, xreg = level)
The sarima() plots confirm the presence of significative auto-correlations in the residuals at lag = 16. As a conclusion, this model does not capture all the structure of our original time series.
As overall conclusion, only the seasonal models have white noise residuals.
Finally, it is time to gather the overall AIC, AICc and BIC metrics of our five candidate models (please remember that model #5 has been cut off from the prosecution of the analysis) and choose the final model.
df <- data.frame(col_1_res = c(model_1$aic, model_2$aic, model_3$aic, model_4$aic, model_6$aic), col_2_res = c(model_1$aicc, model_2$aicc, model_3$aicc, model_4$aicc, model_6$aicc), col_3_res = c(model_1$bic, model_2$bic, model_3$bic, model_4$bic, model_6$bic)) colnames(df) <- c("AIC", "AICc", "BIC") rownames(df) <- c("ARIMA(1,1,1)", "ARIMA(1,0,0)(0,0,1)[10]", "ARIMA(1,0,0)(1,0,0)[10]", "ARIMA(0,0,0)(0,0,1)[10] with level shift", "ARIMA(1,0,0) with level shift") df AIC AICc BIC ARIMA(1,1,1) -329.8844 -329.5727 -322.7010 ARIMA(1,0,0)(0,0,1)[10] -344.4575 -343.9380 -334.8306 ARIMA(1,0,0)(1,0,0)[10] -344.1906 -343.6711 -334.5637 ARIMA(0,0,0)(0,0,1)[10] with level shift -348.8963 -348.3768 -339.2694 ARIMA(1,0,0) with level shift -343.3529 -342.8334 -333.7260
The model which provides the best AIC, AICc and BIC metrics at the same time is model #4, ARIMA(0,0,0)(0,0,1)[10] with level shift.
Note: In case of structural changes, the (augmented) Dickey-Fuller test is biased toward the non rejection of the null, as ref. [2] explains. This may justify why the null hypothesis of unit root presence was not rejected by such test and model #1 performs worse than the other ones in terms of AIC metric. Further, you may verify that with d=1 in models from #2 up to #6, the AIC metric does not improve.
I take advantage of the forecast() function provided within forecast package using model #4 as input. The xreg variable is determined as a constant level equal to one congruently with the structural analysis results.
h_fut <- 20 plot(forecast(model_4, h = h_fut, xreg = rep(1, h_fut)))
So we observed a level shift equal approximately to 2.25% in the boys vs girls births excess ratio occurring on year 1670. Two questions arises:
1. What could have influenced the sex-ratio at birth ?
2. What was happening in London during the 17-th century ?
There are studies showing results in support of the fact that sex-ratio at birth is influenced by war periods. Studies such as ref. [7], [8] and [9] suggest an increase of boys births during and/or after war time. General justification is for an adaptive response to external conditions and stresses. Differently, studies as ref. [10] state that there is no statistical evidence of war time influence on sex-ratio at births. Under normal circumstances, the boys vs girls sex ratio at birth is approximately 105:100 on average, (according to some sources), and generally between 102:100 and 108:100.
Our time series covers most of the following eras of the England history (ref. [11]):
* Early Stuart era: 1603-1660
* Later Stuart era 1660-1714
The English Civil War occured during the Early Stuart era and consisted of a series of armed conflicts and political machinations that took place between Parliamentarians (known as Roundheads) and Royalists (known as Cavaliers) between 1642 and 1651. The English conflict left some 34,000 Parliamentarians and 50,000 Royalists dead, while at least 100,000 men and women died from war-related diseases, bringing the total death toll caused by the three civil wars in England to almost 200,000 (see ref. [12]).
If we step back to view the first plot showing the abhutondot dataset, we can observe a sharp drop on births (both boys and girls) between 1642 and 1651, as testify a period of scarce resources and famine, and we can infer it was due to the civil war. Let us run a quick analysis on the total number of births.
abhutondot.ts <- ts(abhutondot$boys + abhutondot$girls, frequency = 1 , start = abhutondot$year[1]) autoplot(abhutondot.ts)
I then verify if any structural breaks are there with respect a constant level as regressor.
summary(lm(abhutondot.ts ~ 1)) Call: lm(formula = abhutondot.ts ~ 1) Residuals: Min 1Q Median 3Q Max -5829.7 -2243.0 371.3 3281.3 4703.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11442 358 31.96 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3242 on 81 degrees of freedom
The regression is reported as significative. Going on with the search for structural breaks, if any.
(break_point <- breakpoints(abhutondot.ts ~ 1)) Optimal 4-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: 15 33 52 Corresponding to breakdates: 1643 1661 1680
plot(break_point)
summary(break_point) Optimal (m+1)-segment partition: Call: breakpoints.formula(formula = abhutondot.ts ~ 1) Breakpoints at observation number: m = 1 39 m = 2 33 52 m = 3 15 33 52 m = 4 15 33 52 68 m = 5 15 32 44 56 68 Corresponding to breakdates: m = 1 1667 m = 2 1661 1680 m = 3 1643 1661 1680 m = 4 1643 1661 1680 1696 m = 5 1643 1660 1672 1684 1696 Fit: m 0 1 2 3 4 5 RSS 851165494 211346686 139782582 63564154 59593922 62188019 BIC 1566 1461 1436 1380 1383 1396
The BIC minimum value is reached with m = 3. Let us plot the original time series against its structural breaks and their confidence intervals.
plot(abhutondot.ts) fitted.ts <- fitted(break_point, breaks = 3) lines(fitted.ts, col = 4) lines(confint(break_point, breaks = 3))
The fitted levels and the breakpoints dates are as follows.
unique(as.integer(fitted.ts)) [1] 9843 6791 11645 14902
breakdates(break_point, breaks = 3) [1] 1648 1663 1676
So it is confirmed that the total number of births went through a sequence of level shift due to exogeneous shocks. The civil war is very likely determining the first break and its end the second one. It is remarkable the total births decrease from the 9843 level down to 6791 occurring around year 1648. As well remarkable, the level up shift happened on year 1663, afterwards the civil war end. The excess sex ratio structural break on year 1670 occurred at a time approximately in between total births second and third breaks.
The fitted multiple level shifts (as determined by the structural breaks analysis) can be used as intervention variable to fit an ARIMA model, as shown below.
fitted.ts <- fitted(break_point, breaks = 3) autoplot(fitted.ts)
abhutondot_xreg <- Arima(abhutondot.ts, order = c(0,1,1), xreg = fitted.ts, include.mean = TRUE) summary(abhutondot_xreg) Series: abhutondot.ts Regression with ARIMA(0,1,1) errors Coefficients: ma1 fitted.ts -0.5481 0.5580 s.e. 0.1507 0.1501 sigma^2 estimated as 743937: log likelihood=-661.65 AIC=1329.3 AICc=1329.61 BIC=1336.48 Training set error measures: ME RMSE MAE MPE MAPE MASE ACF1 Training set 71.60873 846.5933 622.1021 0.0132448 5.92764 1.002253 0.08043183
coeftest(abhutondot_xreg) z test of coefficients: Estimate Std. Error z value Pr(>|z|) ma1 -0.54809 0.15071 -3.6368 0.0002761 *** fitted.ts 0.55799 0.15011 3.7173 0.0002013 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
checkresiduals(abhutondot_xreg)
LjungBoxTest(residuals(abhutondot_xreg), k=1, lag.max=20) m Qm pvalue 1 1.39 0.2387836 2 2.26 0.1325458 3 2.31 0.3156867 4 2.69 0.4416912 5 2.93 0.5701546 6 3.32 0.6503402 7 4.14 0.6580172 8 4.53 0.7170882 9 4.54 0.8054321 10 7.86 0.5479118 11 8.51 0.5791178 12 8.54 0.6644773 13 8.69 0.7292904 14 9.31 0.7491884 15 11.16 0.6734453 16 11.50 0.7163115 17 12.58 0.7035068 18 12.60 0.7627357 19 13.01 0.7906889 20 14.83 0.7334703
sarima(abhutondot.ts, p=0, d=1, q=1, xreg = fitted.ts)
The residuals are white noise.
To have spot seasonality and level shifts were important aspects in our ARIMA models analysis. The seasonal component allowed to define models with white noise residuals. The structural breaks analysis allowed to define models with improved AIC metric introducing as regressor the identified level shifts. We also had the chance to go through some historical remarks of the history of England and get to know about sex ratio at birth studies.
If you have any questions, please feel free to comment below.
References
Related Post
Related Post
]]>Keeping the above assumptions in mind, let’s look at our dataset.
The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y). The dataset can be downloaded from here.
import pandas as pd import numpy as np from sklearn import preprocessing import matplotlib.pyplot as plt plt.rc("font", size=14) from sklearn.linear_model import LogisticRegression from sklearn.cross_validation import train_test_split import seaborn as sns sns.set(style="white") sns.set(style="whitegrid", color_codes=True)
The dataset provides the bank customers’ information. It includes 41,188 records and 21 fields.
data = pd.read_csv('bank.csv', header=0) data = data.dropna() print(data.shape) print(list(data.columns)) (41188, 21) ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y']
y — has the client subscribed a term deposit? (binary: “1”, means “Yes”, “0” means “No”)
sns.countplot(x='y',data=data, palette='hls') plt.show()
data.isnull().sum() age 0 job 0 marital 0 education 0 default 0 housing 0 loan 0 contact 0 month 0 day_of_week 0 duration 0 campaign 0 pdays 0 previous 0 poutcome 0 emp_var_rate 0 cons_price_idx 0 cons_conf_idx 0 euribor3m 0 nr_employed 0 y 0 dtype: int64
sns.countplot(y="job", data=data) plt.show()
sns.countplot(x="marital", data=data) plt.show()
sns.countplot(x="default", data=data) plt.show()
sns.countplot(x="housing", data=data) plt.show()
sns.countplot(x="loan", data=data) plt.show()
sns.countplot(x="poutcome", data=data) plt.show()
Our prediction will be based on the customer’s job, marital status, whether he(she) has credit in default, whether he(she) has a housing loan, whether he(she) has a personal loan, and the outcome of the previous marketing campaigns. So, we will drop the variables that we do not need.
data.drop(data.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]], axis=1, inplace=True)
In logistic regression models, encoding all of the independent variables as dummy variables allows easy interpretation and calculation of the odds ratios, and increases the stability and significance of the coefficients.
data2 = pd.get_dummies(data, columns =['job', 'marital', 'default', 'housing', 'loan', 'poutcome'])
data2.drop(data2.columns[[12, 16, 18, 21, 24]], axis=1, inplace=True) data2.columns Index(['y', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'marital_divorced', 'marital_married', 'marital_single', 'default_no', 'default_yes', 'housing_no', 'housing_yes', 'loan_no', 'loan_yes', 'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success'], dtype='object')
Perfect! Exactly what we need for the next steps.
sns.heatmap(data2.corr()) plt.show()
X = data2.iloc[:,1:] y = data2.iloc[:,0] X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train.shape (30891, 23)
Great! Now we can start building our logistic regression model.
classifier = LogisticRegression(random_state=0) classifier.fit(X_train, y_train) LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=0, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
The confusion_matrix() function will calculate a confusion matrix and return the result as an array.
y_pred = classifier.predict(X_test) from sklearn.metrics import confusion_matrix confusion_matrix = confusion_matrix(y_test, y_pred) print(confusion_matrix) [[9046 110] [ 912 229]]
The result is telling us that we have 9046+229 correct predictions and 912+110 incorrect predictions.
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(classifier.score(X_test, y_test))) Accuracy of logistic regression classifier on test set: 0.90
To quote from Scikit Learn:
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_test.
from sklearn.metrics import classification_report print(classification_report(y_test, y_pred)) precision recall f1-score support 0 0.91 0.99 0.95 9156 1 0.68 0.20 0.31 1141 avg / total 0.88 0.90 0.88 10297
Interpretation: Of the entire test set, 88% of the promoted term deposit were the term deposit that the customers liked. Of the entire test set, 90% of the customer’s preferred term deposits that were promoted.
The purpose of this section is to visualize logistic regression classsifiers’ decision boundaries. In order to better vizualize the decision boundaries, we’ll perform Principal Component Analysis (PCA) on the data to reduce the dimensionality to 2 dimensions.
from sklearn.decomposition import PCA X = data2.iloc[:,1:] y = data2.iloc[:,0] pca = PCA(n_components=2).fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(pca, y, random_state=0) plt.figure(dpi=120) plt.scatter(pca[y.values==0,0], pca[y.values==0,1], alpha=0.5, label='YES', s=2, color='navy') plt.scatter(pca[y.values==1,0], pca[y.values==1,1], alpha=0.5, label='NO', s=2, color='darkorange') plt.legend() plt.title('Bank Marketing Data Set\nFirst Two Principal Components') plt.xlabel('PC1') plt.ylabel('PC2') plt.gca().set_aspect('equal') plt.show()
def plot_bank(X, y, fitted_model): plt.figure(figsize=(9.8,5), dpi=100) for i, plot_type in enumerate(['Decision Boundary', 'Decision Probabilities']): plt.subplot(1,2,i+1) mesh_step_size = 0.01 # step size in the mesh x_min, x_max = X[:, 0].min() - .1, X[:, 0].max() + .1 y_min, y_max = X[:, 1].min() - .1, X[:, 1].max() + .1 xx, yy = np.meshgrid(np.arange(x_min, x_max, mesh_step_size), np.arange(y_min, y_max, mesh_step_size)) if i == 0: Z = fitted_model.predict(np.c_[xx.ravel(), yy.ravel()]) else: try: Z = fitted_model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:,1] except: plt.text(0.4, 0.5, 'Probabilities Unavailable', horizontalalignment='center', verticalalignment='center', transform = plt.gca().transAxes, fontsize=12) plt.axis('off') break Z = Z.reshape(xx.shape) plt.scatter(X[y.values==0,0], X[y.values==0,1], alpha=0.8, label='YES', s=5, color='navy') plt.scatter(X[y.values==1,0], X[y.values==1,1], alpha=0.8, label='NO', s=5, color='darkorange') plt.imshow(Z, interpolation='nearest', cmap='RdYlBu_r', alpha=0.15, extent=(x_min, x_max, y_min, y_max), origin='lower') plt.title(plot_type + '\n' + str(fitted_model).split('(')[0]+ ' Test Accuracy: ' + str(np.round(fitted_model.score(X, y), 5))) plt.gca().set_aspect('equal'); plt.tight_layout() plt.legend() plt.subplots_adjust(top=0.9, bottom=0.08, wspace=0.02) model = LogisticRegression() model.fit(X_train,y_train) plot_bank(X_test, y_test, model) plt.show()
As you can see, the PCA has reduced the accuracy of our Logistic Regression model. This is because we use PCA to reduce the amount of the dimension, so we have removed information from our data. We will cover PCA in a future post.
The Jupiter notebook used to make this post is available here. I would be pleased to receive feedback or questions on any of the above.
Related Post
]]>Related Post
]]>Generated data like that used in Parts 1 and 2 is great for sake of example, but not very interesting to work with. So let’s get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data that we can use including some that are already included in R and some of its packages. We’ll use some of this data in examples. But I’d like to expand our horizons a bit.
Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:
install.packages('Quandl')
You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl
function and pass it the code of the series we want. We also specify “ts” for the type
argument so that the data is imported as an R ts
object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date
argument because I want the most recent data available. You can find this data on the web here.
library(Quandl) hpidata <- Quandl("YALE/NHPI", type="ts", start_date="1990-01-01") plot.ts(hpidata, main = "Robert Shiller's Nominal Home Price Index")
While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl
website. The data are located on the web here, here, and here.
gdpdata <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01") pidata <- Quandl("FRED/PINCOME", type="ts", start_date="1990-01-01") umdata <- Quandl("UMICH/SOC43", type="ts")[, 1] plot.ts(cbind(gdpdata, pidata), main="US GPD and Personal Income, billions $")
plot.ts(umdata, main = "University of Michigan Consumer Survey, Selling Conditions for Houses")
The Quandl API also has some basic options for data preprocessing. The US GDP data is in quarterly frequency, but assume we want annual data. We can use the collapse
argument to collapse the data to a lower frequency. Here we covert the data to annual as we import it.
gdpdata_ann <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", collapse="annual") frequency(gdpdata_ann) [1] 1
We can also transform our data on the fly as its imported. The Quandl
function has a argument transform
that allows us to specify the type of data transformation we want to perform. There are five options – “diff“, ”rdiff“, ”normalize“, ”cumul“, ”rdiff_from“. Specifying the transform
argument as”diff” returns the simple difference, “rdiff” yields the percentage change, “normalize” gives an index where each value is that value divided by the first period value and multiplied by 100, “cumul” gives the cumulative sum, and “rdiff_from” gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.
For example, here we get the data in percent change form:
gdpdata_pc <- Quandl("FRED/GDP", type="ts", start_date="1990-01-01", transform="rdiff") plot.ts(gdpdata_pc * 100, ylab= "% change", main="US Gross Domestic Product, % change")
You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).
In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.
Related Post
]]>