(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**A**s I discovered (!) the Annals of Applied Statistics in my mailbox just prior to taking the local train to Dauphine for the first time in 2017 (!), I started reading it on the way, but did not get any further than the first discussion paper by Pengsheng Ji and Jiashun Jin on coauthorship and citation networks for statisticians. I found the whole exercise intriguing, I must confess, with little to support a whole discussion on the topic. I may have read the paper too superficially as a métro pastime, but to me it sounded more like a *post-hoc* analysis than a statistical exercise, something like looking at the network, or rather at the output of a software tool representing networks, and making sense of clumps and sub-networks *a posteriori*. (In a way this reminded me of my first SAS project at school, on the patterns of vacations in France. It was in 1983, on punched cards. And we spent a while cutting & pasting, in a literal sense, the 80-column graphs produced by SAS on endless listings.)

It may be that part of the interest in the paper is self-centred. I do not think analysing a similar dataset in another field like deconstructionist philosophy or Korean raku would have attracted the same attention. Looking at the clusters and the names on the pictures obviously makes sense, if more at a curiosity than a scientific level, as I do not think this brings much in terms of ranking and evaluating research (despite what Bernard Silverman suggests in his preface) or understanding collaborations (beyond the fact that people in the same subfield or same active place like Duke tend to collaborate). Speaking of curiosity, I was quite surprised to spot my name in one network and even more to see that I was part of the “High-Dimensional Data Analysis” cluster, rather than of the “Bayes” cluster. I cannot fathom how I ended up in that theme, as I cannot think of a single paper of mine pertaining to either high dimensions or data analysis [to force the trait just a wee bit!]. Maybe thanks to my joint paper with Peter Mueller. (I tried to check the data itself but cannot trace my own papers in the raw datafiles.)

I also wonder what is the point of looking at solely four major journals in the field, missing for instance most of computational statistics and biostatistics, not to mention machine learning or econometrics. This results in a somewhat narrow niche, if obviously recovering the main authors in the [corresponding] field. Some major players in computational stats still make it to the lists, like Gareth Roberts or Håvard Rue, but under the wrong categorisation of spatial statistics.

Filed under: Books, pictures, R, Statistics, University life Tagged: Annals of Applied Statistics, Annals of Statistics, Biometrika, citation map, coauthors, JASA, Journal of the Royal Statistical Society, JRSSB, network, Series B

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.


(This article was first published on ** R – Real Data**, and kindly contributed to R-bloggers)

During my Monday morning ritual of avoiding work, I found this publication that is written in R, for people who use R – R Weekly. The authors do a pretty awesome job of aggregating useful, entertaining, and informative content about what’s happening surrounding our favorite programming language. Check it out, give the authors some love on GitHub, and leave a like if you find something useful there.

Have a good week,

Kiefer Smith

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Real Data**.


(This article was first published on ** R – TomazTsql**, and kindly contributed to R-bloggers)

Recently, I did a session at a local user group in Ljubljana, Slovenia, where I introduced the new algorithms that are available with the MicrosoftML package for Microsoft R Server 9.0.3.

For the datasets, I used two from (still currently) running Kaggle competitions. In the last part, I did image detection and prediction on the MNIST dataset and compared the performance and accuracy between the packages.

The MNIST handwritten digit database is available here.

Starting off with rxNeuralNet, we have to build a NET# model, i.e. the neural network definition, for it to work with.

Model for Neural network:

```
const { T = true; F = false; }
input Picture [28, 28];
hidden C1 [5 * 13^2] from Picture convolve {
  InputShape  = [28, 28];
  UpperPad    = [ 1,  1];
  KernelShape = [ 5,  5];
  Stride      = [ 2,  2];
  MapCount = 5;
}
hidden C2 [50, 5, 5] from C1 convolve {
  InputShape  = [ 5, 13, 13];
  KernelShape = [ 1,  5,  5];
  Stride      = [ 1,  2,  2];
  Sharing     = [ F,  T,  T];
  MapCount = 10;
}
hidden H3 [100] from C2 all;
// Output layer definition.
output Result [10] from H3 all;
```

Once we have this, we can put the rxNeuralNet algorithm to work:

```
model_DNN_GPU <- rxNeuralNet(label ~ .
                   ,data = dataTrain
                   ,type = "multi"
                   ,numIterations = 10
                   ,normalize = "no"
                   #,acceleration = "gpu" # enable this if you have a CUDA driver
                   ,miniBatchSize = 64    # set to 1; else set to 64 if you have CUDA driver problems
                   ,netDefinition = netDefinition
                   ,optimizer = sgd(learningRate = 0.1, lRateRedRatio = 0.9, lRateRedFreq = 10)
                   )
```

Then do the prediction and calculate accuracy matrix:

```
DNN_GPU_score <- rxPredict(model_DNN_GPU, dataTest, extraVarsToWrite = "label")
sum(DNN_GPU_score$label == DNN_GPU_score$PredictedLabel) / dim(DNN_GPU_score)[1]
```

Accuracy for this model is:

[1] 0.9789

When working with the H2O package, the following code was executed to get the same parameters for the neural network:

```
model_h20 <- h2o.deeplearning(x = 2:785
               ,y = 1                               # label column
               ,training_frame = train_h2o
               ,activation = "RectifierWithDropout"
               ,input_dropout_ratio = 0.2           # % of inputs dropout
               ,hidden_dropout_ratios = c(0.5, 0.5) # % for nodes dropout
               ,balance_classes = TRUE
               ,hidden = c(50, 100, 100)
               ,momentum_stable = 0.99
               ,nesterov_accelerated_gradient = T   # use it for speed
               ,epochs = 15)
```

When the test dataset is run against the learned model:

```
h2o.confusionMatrix(model_h20)
100 - (416/9978) * 100
```

the result is a confusion matrix, giving an accuracy of:

# [1] 95.83083
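More generally, accuracy can be read off any confusion matrix as the share of observations on the diagonal. A minimal sketch with made-up counts (not the actual MNIST results):

```r
# Toy confusion matrix (rows = actual, columns = predicted); counts are invented
cm <- matrix(c(50, 2, 3, 45), nrow = 2)
accuracy <- sum(diag(cm)) / sum(cm) * 100  # correct predictions / all predictions
accuracy  # 95
```

The `100 - (416/9978) * 100` line above is the same calculation expressed via the misclassified count instead of the diagonal.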

For comparison, I have added xgBoost (eXtreme Gradient Boosting), but this time I will not focus on it.

Time comparison between the packages (in seconds), from left to right: H2O, MicrosoftML with GPU acceleration, MicrosoftML without GPU acceleration, and xgBoost.

As for the accuracy of the trained model, here are results (based on my tests):

MicrosoftML – Neural Network – 97.8%

H2O – Deep Learning – 95.3%

xgBoost – 94.9%

As always, code and dataset are available at GitHub.

Happy R-ing

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – TomazTsql**.


(This article was first published on ** Rsome - A blog about some R stuff**, and kindly contributed to R-bloggers)

I am pleased to announce my package `strcode`, a package that should make structuring code easier. You can install it from GitHub; a CRAN submission is planned at a later stage.

The main feature of the package is its function to insert code breaks. These are helpful in breaking the code down into smaller sections. We suggest three levels of granularity for code structuring, wherein higher-level blocks can contain lower-level blocks:

You can notice from the above that:

- The number of `#`'s used in front of the break character (`___`, `...`, `. .`) corresponds to the level of granularity of the separator.
- The break characters `___`, `...`, `. .` were chosen such that they reflect the level of granularity: `___` has a much higher visual density than `. .`.
- Each block has an (optional) short title on what that block is about.

Every title line ends with `####`. Therefore, the titles are recognized by RStudio as sections. This has the advantage that you can get a quick summary of your code in RStudio’s code pane, as depicted below.

In addition, it means that you can fold sections just as you can fold function declarations or if statements.

The package `strcode` provides an RStudio add-in to insert each of the three separators presented above – with a single click. To invoke the add-in, simply click on the *Addins* button in your RStudio window and select the separator you want to insert.

By default, a Shiny Gadget will open in the viewer pane where you can also specify the title of the section (optional) and whether or not a unique identifier/anchor should be added to the section (see below).

If you prefer to insert the separator without the Shiny Gadget, you can change the option `strcode$insert_with_shiny` to `FALSE`, which will insert only the break. The separators all have length 80. The value is looked up in the global option `strcode$char_length` and can therefore be changed by the user as well. The length of the separators is meant to correspond to the character width limit you use.

Sometimes it is required to refer to a code section, which can be done by a title. A better way, however, is to use a unique hash sequence – let us call it a code anchor – to create an arguably unique reference to that section. A code anchor in `strcode` is enclosed by `#<` and `>#`, so all anchors can be found using regular expressions. You can add section breaks that include a hash: simply tick the box when you insert the break via the Shiny Gadget. The outcome might look like this:
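Because every anchor is delimited by `#<` and `>#`, locating them is a one-line regular expression. A minimal sketch (the hash value below is made up, not one generated by strcode):

```r
# Lines of a hypothetical script; the anchor hash is invented for illustration
lines <- c("x <- 1", "#< 3d0f8a >#  read data ####", "y <- x + 1")
anchors <- grep("#<.*>#", lines, value = TRUE)  # keep only lines containing an anchor
anchors
```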

Once code has been structured with the separators introduced above, it can easily be summarized in a compact form. This is particularly handy when the code base is large, when a lot of people work on the code, or when new people join a project. The function `sum_str` is designed to extract separators and their comments, in order to provide high-level code summaries. Thanks to RStudio’s API, you can even create summaries of the file you are working on, simply by typing `sum_str()` in the console.

The outcome might look like this:

`sum_str` is highly customizable and flexible, with a host of options. For example, you can omit level-2 or level-3 sections in the summary, summarize multiple files at once, write the summaries to a file, and more.

Code anchors might prove helpful in other situations where one wants to anchor a single line. That is also possible with `strcode`. An example of a code anchor is the following:

The hash sequences in `strcode` are produced with the R package `digest`.

To wrap it up, `strcode` tries to make structuring code easy and fun while keeping things simple. Give it a go. Feedback on the package is most welcome. If you find any bugs (in particular in `sum_str`) or if you have any suggestions for extensions of the package, feel free to open an issue on the GitHub repo of the package.

As an appendix to this post, I would like to give a real-world example of how using strcode can improve the legibility of code.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Rsome - A blog about some R stuff**.


(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is absolutely necessary for them to have some basic knowledge of data science. This series aims to help people in and around the medical field enhance their data science skills.

We will work with a health-related dataset, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the fourth part of the series, and it aims to partially cover the subject of inferential statistics. Researchers rarely have the capability of testing many patients, or trying a new treatment on many patients, so making inferences from a sample is a necessary skill. This is where inferential statistics comes into play.

Before proceeding, it might be helpful to look over the help pages for `sample`, `mean`, `sd`, `sort`, and `pnorm`. Moreover, it is crucial to be familiar with the Central Limit Theorem.

You may also need to load the `ggplot2` library.

```
install.packages("moments")
library(moments)
```

Please run the code below in order to load the data set and transform it into a proper data frame format:

```
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
data <- read.table(url, fileEncoding="UTF-8", sep=",")
names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')
colnames(data) <- names
data <- data[-which(data$mass == 0),]
```

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Generate (10,000 iterations) a sampling distribution of sample size 50 for the variable `mass`.

You are encouraged to experiment with different sample sizes and iteration counts in order to see the impact they have on the distribution (standard deviation, skewness, and kurtosis). Moreover, you can plot the distributions to get a better sense of what you are working with.
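As a sketch of what such a simulation can look like, here simulated values stand in for `data$mass` (the numbers are invented, not the Pima data):

```r
set.seed(42)
population <- rnorm(750, mean = 32, sd = 7)  # hypothetical stand-in for data$mass
# sampling distribution of the mean: 10000 samples of size 50
samp_dist <- replicate(10000, mean(sample(population, 50)))
mean(samp_dist)  # close to mean(population)
sd(samp_dist)    # close to sd(population) / sqrt(50), the standard error
```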

**Exercise 2**

Find the mean and standard error (standard deviation) of the sampling distribution.

You are encouraged to use the values from the original distribution (`data$mass`) in order to understand how the mean and standard deviation are derived, as well as the influence the sample size has on the distribution.

**Exercise 3**

Find the skewness and kurtosis of the distribution you generated before.

**Exercise 4**

Suppose we ran an experiment in which a sample of 50 people from the population followed an organic food diet. Their average `mass` was 30.5. What is the z-score for a mean of 30.5?

**Exercise 5**

What is the probability of drawing a sample of 50 with a mean less than 30.5? Use the z-table if you feel you need to.
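In R, `pnorm` gives this probability directly; the mean and standard error below are illustrative placeholders for the values you derive from the sampling distribution:

```r
mu <- 32            # hypothetical population mean
se <- 7 / sqrt(50)  # hypothetical standard error for samples of size 50
z  <- (30.5 - mu) / se
pnorm(z)                         # P(sample mean < 30.5), via the z-score
pnorm(30.5, mean = mu, sd = se)  # the same probability, without standardising
```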

**Exercise 6**

Suppose you repeated the experiment with a larger sample size of 150 and found the average `mass` to be 31. Compute the z-score for this mean.

**Exercise 7**

What is the probability of drawing a sample of 150 with mean less than 31?

**Exercise 8**

Suppose everybody adopted the diet from the experiment. Find the margin of error for 95% of sample means.

**Exercise 9**

What would be our interval estimate that, with 95% likelihood, contains the population mean if everyone in the population adopted the organic diet?

**Exercise 10**

Find the interval estimates for 98% and 99% likelihood.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.


(This article was first published on ** François Husson**, and kindly contributed to R-bloggers)

Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering.

But principal component methods also provide a framework to visualize data. Thus, clustering results can be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D on the principal component map (using the first two components obtained with PCA). A partition has then been made, and individuals are coloured according to the cluster they belong to.

Thus, the graph simultaneously conveys the information given by the principal component map, the hierarchical tree, and the clusters (see the function HCPC in the FactoMineR package).

```
library(FactoMineR)
temperature <- read.table("http://factominer.free.fr/livre/temperat.csv",
                          header=TRUE, sep=";", dec=".", row.names=1)
res.pca <- PCA(temperature[1:23,], scale.unit=TRUE, ncp=Inf,
               graph=FALSE, quanti.sup=13:16, quali.sup=17)
res.hcpc <- HCPC(res.pca)
```

The approaches complement one another in two ways:

- firstly, a continuous view (the trend identified by the principal components) and a discontinuous view (the clusters) of the same data set are both represented in a unique framework;
- secondly, the two-dimensional map provides no information about the position of the individuals in the other dimensions; the tree and the clusters, defined from more dimensions, offer some information “outside of the map”; two individuals close together on the map can be in the same cluster (and therefore not too far from one another along the other dimensions) or in two different clusters (as they are far from one another along other dimensions).

So why should we have to choose, when we want to better visualize the data?

The example shows the common use of PCA and clustering methods, but rather than PCA we can use correspondence analysis on contingency tables, or multiple correspondence analysis on categorical variables.

If you want to learn more, you can watch this video, enroll in this (free) MOOC, or read this unpublished paper.

To **leave a comment** for the author, please follow the link and comment on their blog: ** François Husson**.


Use filter() for subsetting data by rows. It takes logical expressions as inputs, and returns all rows of your data for which those expressions are true.

To demonstrate, let’s start by loading the tidyverse library (which includes dplyr), and we’ll also load the gapminder data.

```
library(tidyverse)
library(gapminder)
```

Here’s how filter() works:

```
filter(gapminder, lifeExp < 30)
```

Produces this output:

```
# A tibble: 2 × 6
      country continent  year lifeExp     pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
1 Afghanistan      Asia  1952  28.801 8425333  779.4453
2      Rwanda    Africa  1992  23.599 7290203  737.0686
```

The pipe operator is one of the great features of the tidyverse. In base R, you often find yourself calling functions nested within functions nested within… you get the idea. The pipe operator `%>%` takes the object on the left-hand side and “pipes” it into the function on the right-hand side.

For example:

```
> gapminder %>% head()
# A tibble: 6 × 6
      country continent  year lifeExp      pop gdpPercap
       <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
1 Afghanistan      Asia  1952  28.801  8425333  779.4453
2 Afghanistan      Asia  1957  30.332  9240934  820.8530
3 Afghanistan      Asia  1962  31.997 10267083  853.1007
4 Afghanistan      Asia  1967  34.020 11537966  836.1971
5 Afghanistan      Asia  1972  36.088 13079460  739.9811
6 Afghanistan      Asia  1977  38.438 14880372  786.1134
```

This is the equivalent of saying “head(gapminder).” So far, that doesn’t seem a lot easier… but wait a bit and you’ll see the beauty of the pipe.

We talked about using filter() to subset data by rows. We can use select() to do the same thing for columns:

```
> select(gapminder, year, lifeExp)
# A tibble: 1,704 × 2
    year lifeExp
   <int>   <dbl>
1   1952  28.801
2   1957  30.332
3   1962  31.997
4   1967  34.020
5   1972  36.088
6   1977  38.438
7   1982  39.854
8   1987  40.822
9   1992  41.674
10  1997  41.763
# ... with 1,694 more rows
```

Here’s the same thing, but using pipes, and sending it through “head()” to make the display more compact:

```
> gapminder %>%
+   select(year, lifeExp) %>%
+   head(4)
# A tibble: 4 × 2
   year lifeExp
  <int>   <dbl>
1  1952  28.801
2  1957  30.332
3  1962  31.997
4  1967  34.020
```

We are going to be making some changes to the gapminder data, so let’s start by creating a copy of the data. That way, we don’t have to worry about changing the original data.

new_gap <- gapminder

mutate() is a function that defines a new variable and inserts it into your tibble. For example, gapminder has GDP per capita and population; if we multiply these we get the GDP.

new_gap %>% mutate(gdp = pop * gdpPercap)

Note that the above code creates the new field and displays the resulting tibble; we would have to use the `<-` operator to save the new field in our tibble.

arrange() reorders the rows in a data frame. The gapminder data is currently arranged by country, and then by year. But what if we wanted to look at it by year, and then by country?

new_gap %>% arrange(year, country)

```
# A tibble: 1,704 × 6
       country continent  year lifeExp      pop  gdpPercap
        <fctr>    <fctr> <int>   <dbl>    <int>      <dbl>
1  Afghanistan      Asia  1952  28.801  8425333   779.4453
2      Albania    Europe  1952  55.230  1282697  1601.0561
3      Algeria    Africa  1952  43.077  9279525  2449.0082
4       Angola    Africa  1952  30.015  4232095  3520.6103
5    Argentina  Americas  1952  62.485 17876956  5911.3151
6    Australia   Oceania  1952  69.120  8691212 10039.5956
7      Austria    Europe  1952  66.800  6927772  6137.0765
8      Bahrain      Asia  1952  50.939   120447  9867.0848
9   Bangladesh      Asia  1952  37.484 46886859   684.2442
10     Belgium    Europe  1952  68.000  8730405  8343.1051
# ... with 1,694 more rows
```

The group_by() function adds grouping information to your data, which then allows you to do computations by groups. The summarize() function is a natural partner for group_by(). summarize() takes a dataset with *n* observations, calculates the requested summaries, and returns a dataset with 1 observation:

new_gap %>% group_by(continent) %>% summarize(n = n())

The functions you’ll apply within `summarize()` include classical statistical summaries, like `mean()`, `median()`, `var()`, `sd()`, `mad()`, `IQR()`, `min()`, and `max()`. Remember that they are functions that take *n* inputs and distill them down into 1 output.

new_gap %>% group_by(continent) %>% summarize(avg_lifeExp = mean(lifeExp))

```
# A tibble: 5 × 2
  continent avg_lifeExp
     <fctr>       <dbl>
1    Africa    48.86533
2  Americas    64.65874
3      Asia    60.06490
4    Europe    71.90369
5   Oceania    74.32621
```

To fully appreciate the wonders of the pipe command and the dplyr data manipulation commands, take a look at this example. It comes from Jenny Bryan‘s excellent course, STAT545, at the University of British Columbia (to whom I owe a debt for much of the information included in this series of blog posts).

```
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp in year i) - (lifeExp in year i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE)) %>%
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(-1, wt = worst_le_delta) %>%
  arrange(worst_le_delta)
```

```
Source: local data frame [5 x 3]
Groups: continent [5]

  continent     country worst_le_delta
     <fctr>      <fctr>          <dbl>
1    Africa      Rwanda        -20.421
2      Asia    Cambodia         -9.097
3  Americas El Salvador         -1.511
4    Europe  Montenegro         -1.464
5   Oceania   Australia          0.170
```

To quote Jenny: “Ponder that for a while. The subject matter and the code. Mostly you’re seeing what genocide looks like in dry statistics on average life expectancy.”


The post Is my time series additive or multiplicative? appeared first on Locke Data.

Time series data is an important area of analysis, especially if you do a lot of web analytics. To be able to analyse time series effectively, it helps to understand how general seasonality in activity and the underlying trend interact.

The interactions between trend and seasonality are typically classified as either additive or multiplicative. This post looks at how we can classify a given time series as one or the other to facilitate further processing.

It’s important to understand the difference between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:

- **trend**: how things are changing overall
- **seasonality**: how things change within a given period, e.g. a year, month, week, or day
- **error**/**residual**/**irregular**: activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

In a multiplicative time series, the components multiply together to make the time series. If you have an increasing trend, the amplitude of seasonal activity increases. Everything becomes more exaggerated. This is common when you’re looking at web traffic.

In an additive time series, the components add together to make the time series. If you have an increasing trend, you still see roughly the same size peaks and troughs. This is often seen in indexed time series, where the absolute value is growing but changes stay relative.

You can have a time series that is somewhere in between the two, a statistician’s “it depends”, but I’m interested in attaining a quick classification so I won’t be handling this complication here.
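A toy example (entirely synthetic numbers) makes the contrast concrete: the same trend and seasonal pattern, combined the two ways:

```r
t <- 1:48                                       # four years of monthly data
trend    <- 100 + 2 * t                         # steadily increasing trend
seasonal <- 10 * sin(2 * pi * t / 12)           # 12-period seasonal cycle
additive       <- trend + seasonal              # peaks keep a constant amplitude
multiplicative <- trend * (1 + seasonal / 100)  # peaks grow with the trend
```

Subtracting `trend` from `multiplicative` leaves seasonal swings that grow over time, while for `additive` they stay the same size throughout.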

When I first started doing time series analysis, the only way to visualise how a time series splits into different components was to use base R. About the time I was feeling the pain, someone released a ggplot2 time series extension! I’ll be using ggseas where I can.

We’ll use the `nzbop` data set from ggseas to, first of all, examine a single time series, and then process all the time series in the dataset to determine if they’re multiplicative or additive.

```
sample_ts<-nzdata[Account == "Current account" & Category=="Services; Exports total",
.(TimePeriod, Value)]
```

TimePeriod | Value |
---|---|
1971-06-30 | 55 |
1971-09-30 | 56 |
1971-12-31 | 60 |
1972-03-31 | 65 |
1972-06-30 | 65 |
1972-09-30 | 63 |

I’ll be using other packages (like data.table) and will only show relevant code snippets as I go along. You can get the whole script in a GIST.

To be able to determine if the time series is additive or multiplicative, the time series has to be split into its components.

Existing functions to decompose the time series include `decompose()`, which allows you to pass whether the series is multiplicative or not, and `stl()`, which is only for additive series without transforming the data. I could use `stl()` with a multiplicative series if I transform the time series by taking the log. For either function, I need to know whether it’s additive or multiplicative first.

The first component to extract is the trend. There are a number of ways you can do this, and some of the simplest ways involve calculating a moving average or median.

```
sample_ts[,trend := zoo::rollmean(Value, 8, fill=NA, align = "right")]
```

TimePeriod | Value | trend |
---|---|---|
2014-03-31 | 5212 | 4108.625 |
2014-06-30 | 3774 | 4121.750 |
2014-09-30 | 3698 | 4145.500 |
2014-12-31 | 4752 | 4236.375 |
2015-03-31 | 6154 | 4376.500 |
2015-06-30 | 4543 | 4478.875 |

A moving median is less sensitive to outliers than a moving mean. It doesn’t work well, though, if you have a time series that includes periods of inactivity: lots of 0s can result in very weird trends.
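A quick base-R sketch of that sensitivity (toy numbers; `stats::filter` and `runmed` stand in here for the rolling functions used above):

```r
x <- c(10, 11, 12, 200, 13, 14, 15)                   # one outlier in a flat series
mov_mean <- stats::filter(x, rep(1/3, 3), sides = 2)  # centred 3-point moving mean
mov_med  <- stats::runmed(x, 3)                       # 3-point running median
mov_mean[4]  # (12 + 200 + 13) / 3 = 75: dragged up by the outlier
mov_med[4]   # 13: the outlier is ignored
```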

Seasonality is the cyclical pattern that occurs in our time series once the trend has been removed.

Of course, the way we de-trend the data needs to be additive or multiplicative depending on the type of your time series. Since we don’t know the type of time series at this point, we’ll do both.

```
sample_ts[,`:=`( detrended_a = Value - trend, detrended_m = Value / trend )]
```

TimePeriod | Value | trend | detrended_a | detrended_m |
---|---|---|---|---|
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509 |
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305 |
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516 |
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137 |
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465 |
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172 |

To work out the seasonality we need to work out what the typical de-trended values are over a cycle. Here I will calculate the mean value for the observations in Q1, Q2, Q3, and Q4.

```
sample_ts[,`:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
seasonal_m = mean(detrended_m, na.rm = TRUE)),
by=.(quarter(TimePeriod)) ]
```

TimePeriod | Value | trend | detrended_a | detrended_m | seasonal_a | seasonal_m |
---|---|---|---|---|---|---|
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509 | 574.1919 | 1.2924422 |
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305 | -111.2878 | 1.0036648 |
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516 | -219.8363 | 0.9488803 |
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137 | 136.7827 | 1.1202999 |
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465 | 574.1919 | 1.2924422 |
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172 | -111.2878 | 1.0036648 |

My actual needs aren’t over long economic periods, so I’m not using a more sophisticated seasonality system for this blog post. There are some much better mechanisms than this.

Now that we have our two components, we can calculate the residual in both situations and see which has the better fit.

```
sample_ts[,`:=`( residual_a = detrended_a - seasonal_a,
residual_m = detrended_m / seasonal_m )]
```

TimePeriod | Value | trend | detrended_a | detrended_m | seasonal_a | seasonal_m | residual_a | residual_m |
---|---|---|---|---|---|---|---|---|
2014-03-31 | 5212 | 4108.625 | 1103.375 | 1.2685509 | 574.1919 | 1.2924422 | 529.1831 | 0.9815146 |
2014-06-30 | 3774 | 4121.750 | -347.750 | 0.9156305 | -111.2878 | 1.0036648 | -236.4622 | 0.9122871 |
2014-09-30 | 3698 | 4145.500 | -447.500 | 0.8920516 | -219.8363 | 0.9488803 | -227.6637 | 0.9401098 |
2014-12-31 | 4752 | 4236.375 | 515.625 | 1.1217137 | 136.7827 | 1.1202999 | 378.8423 | 1.0012620 |
2015-03-31 | 6154 | 4376.500 | 1777.500 | 1.4061465 | 574.1919 | 1.2924422 | 1203.3081 | 1.0879763 |
2015-06-30 | 4543 | 4478.875 | 64.125 | 1.0143172 | -111.2878 | 1.0036648 | 175.4128 | 1.0106135 |

I’ve done the number crunching, but you could also perform a visual decomposition. ggseas gives us a function, `ggsdc()`, which we can use.

```
ggsdc(sample_ts, aes(x = TimePeriod, y = Value), method = "decompose",
frequency = 4, s.window = 8, type = "additive")+ geom_line()+
ggtitle("Additive")+ theme_minimal()
```

The different decompositions produce differently distributed residuals. We need to assess these to identify which decomposition is a better fit.

After decomposing our data, we need to compare the residuals. As we’re just trying to classify the time series, we don’t need to do anything particularly sophisticated – a big part of this exercise is to produce a quick function that could be used to perform an initial classification in a batch processing environment so simpler is better.

We’re going to check how much correlation between data points is still encoded within the residuals. This is the autocorrelation function (ACF), and R has a function, `acf()`, for calculating it. As some of the correlations could be negative, we will select the type with the smallest sum of squared correlation values.

```
ssacf<- function(x) sum(acf(x, na.action = na.omit)$acf^2)
compare_ssacf<-function(add,mult) ifelse(ssacf(add)< ssacf(mult), "Additive", "Multiplicative")
sample_ts[,.(ts_type = compare_ssacf(residual_a, residual_m ))]
```

ts_type |
---|
Multiplicative |

This isn’t a fully generalized function (it doesn’t have configurable lags, medians, seasonality, etc.), but if I had to run this exercise over multiple time series from this dataset, my overall function and usage would look like:

```
ssacf<- function(x) sum(acf(x, na.action = na.omit, plot = FALSE)$acf^2)
compare_ssacf<-function(add,mult) ifelse(ssacf(add)< ssacf(mult),
"Additive", "Multiplicative")
additive_or_multiplicative <- function(dt){
m<-copy(dt)
m[,trend := zoo::rollmean(Value, 8, fill="extend", align = "right")]
m[,`:=`( detrended_a = Value - trend, detrended_m = Value / trend )]
m[Value==0,detrended_m:= 0]
m[,`:=`(seasonal_a = mean(detrended_a, na.rm = TRUE),
seasonal_m = mean(detrended_m, na.rm = TRUE)),
by=.(quarter(TimePeriod)) ]
m[is.infinite(seasonal_m),seasonal_m:= 1]
m[,`:=`( residual_a = detrended_a - seasonal_a,
residual_m = detrended_m / seasonal_m)]
compare_ssacf(m$residual_a, m$residual_m )
}
# Applying it to all time series in table
sample_ts<-nzdata[ , .(Type=additive_or_multiplicative(.SD)),
.(Account, Category)]
```

Account | Category | Type |
---|---|---|
Current account | Inflow total | Additive |
Current account | Balance | Multiplicative |
Current account | Goods; Exports (fob) total | Additive |
Current account | Services; Exports total | Multiplicative |
Current account | Primary income; Inflow total | Multiplicative |
Current account | Secondary income; Inflow total | Multiplicative |
Current account | Goods balance | Multiplicative |
Current account | Services balance | Multiplicative |
Current account | Primary income balance | Additive |
Current account | Outflow total | Additive |
Current account | Goods; Imports (fob) total | Additive |
Current account | Services; Imports total | Additive |
Current account | Primary income; Outflow total | Additive |
Current account | Secondary income; Outflow total | Multiplicative |
Capital account | Inflow total | Multiplicative |
Capital account | Outflow total | Multiplicative |
NA | Net errors and omissions | Multiplicative |
Financial account | Foreign inv. in NZ total | Multiplicative |
Financial account | Balance | Additive |
Financial account | Foreign inv. in NZ; Direct inv. liabilities | Additive |
Financial account | Foreign inv. in NZ; Portfolio inv. liabilities | Multiplicative |
Financial account | Foreign inv. in NZ; Other inv. liabilities | Multiplicative |
Financial account | NZ inv. abroad total | Multiplicative |
Financial account | NZ inv. abroad; Direct inv. assets | Multiplicative |
Financial account | NZ inv. abroad; Portfolio inv. assets | Additive |
Financial account | NZ inv. abroad; Financial derivative assets | Multiplicative |
Financial account | NZ inv. abroad; Other inv. assets | Additive |
Financial account | NZ inv. abroad; Reserve assets | Multiplicative |

This is a very simple way of quickly assessing whether multiple time series are additive or multiplicative. It gives an effective starting point for conditionally processing batches of time series. Get the GIST of the code used throughout this blog to work through it yourself. If you’ve got an easier way of classifying time series, let me know in the comments!
