Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. One of the commensurate challenges is how to effectively model and analyse these data.

This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.

The workshop programme will commence at 8.30am and close at 5pm. Registration is free; however, numbers are strictly limited, so please ensure you register when you receive your invitation via email. Morning and afternoon tea will be provided; participants will need to purchase their own lunch.

Further details will be made available in early January.

Queensland University of Technology

Garden’s Theatre

X Block, Gardens Point Precinct

2 George Street (next to the City Botanic Gardens)

Brisbane QLD 4001

Contact Kerrie Mengersen: k.mengersen@qut.edu.au

The model was first described in Hyndman and Fan (2010). We are continually improving it, and the latest version is described in the model documentation, which will be updated from time to time.

The package is being released under a GPL licence, so anyone can use it. All we ask is that our work is properly cited.

Naturally, we are not able to provide free technical support, although we welcome bug reports. We are available to undertake paid consulting work in electricity forecasting.


The data are measurements from a medical diagnostic machine which takes 1 measurement every second, and after 32–1000 seconds, the time series must be classified into one of two classes. Some pre-classified training data is provided. It is not necessary to classify all the test data, but you do need to have relatively high accuracy on what is classified. So you could find a subset of more easily classifiable test time series, and leave the rest of the test data unclassified.

Accuracy is measured using

Accuracy = (TP + TN) / (TP + FP + TN + FN),

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives respectively.
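As a quick sketch (the function and argument names are ours, not part of the competition), the metric can be computed in R as:

```r
# Accuracy over the classified subset, from the usual confusion-matrix counts
accuracy <- function(tp, fp, tn, fn) {
  (tp + tn) / (tp + fp + tn + fn)
}
accuracy(40, 10, 40, 10)  # 0.8
```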

The prizes are:

- $5000 for using at least 50% of the test samples and achieving at least 0.75 accuracy.
- $15000 for using at least 50% of the test samples and achieving at least 0.85 accuracy.
- For any accuracy above 0.75 while using less than 50% of the test samples (but at least 25%), each additional 0.05 increase in accuracy grants an additional $2K. For example, if you use 30% of the test samples and achieve an accuracy of 0.85, the prize will be $5K + $4K = $9K.
The winner will be:

The entry with the highest accuracy using the largest proportion of samples,

OR

The first entry that achieves 0.85 accuracy with at least 50% of the data,

OR

The first entry that achieves 0.9 accuracy with at least 30% of the data.

In the link below you will find a text file that explains the data and how to access it, and a png image explaining how the time series to classify were built and how the classes were assigned. The link also includes the actual train and test samples to be used for the challenge, and some plots of the time series.

https://drive.google.com/folderview?id=0BxmzB6Xm7Ga1MGxsdlMxbGllZnM&usp=sharing

Entries should include:

- Proof of accuracy
- R code, with the organizer granted full rights to use it
- R code to support new additional test samples.

The prizes create some strange discontinuities. Someone with accuracy of 0.75 using 50% of the data gets $5K, but someone with accuracy of 0.76 using only 25% of the data gets more. On the other hand, someone using 49% of the test with 0.85 accuracy gets $9K, but if they use 50% of the test they get $15K. Surely a continuous bivariate function of accuracy and percentage would have been better.
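To make the discontinuities concrete, here is a rough R sketch of the prize rules as stated (the announcement is ambiguous about fractional 0.05 steps, so this is one interpretation, not the official schedule):

```r
# Hypothetical encoding of the stated prize rules (one interpretation)
prize <- function(accuracy, frac_used) {
  if (accuracy < 0.75 || frac_used < 0.25) return(0)
  if (frac_used >= 0.5) {
    # At least half the test samples used
    if (accuracy >= 0.85) return(15000) else return(5000)
  }
  # Between 25% and 50% used: $5K base plus $2K per 0.05 above 0.75
  5000 + 2000 * floor((accuracy - 0.75) / 0.05 + 1e-9)
}
prize(0.85, 0.30)  # $9K, matching the organizer's example
prize(0.85, 0.49)  # $9K
prize(0.85, 0.50)  # $15K: the jump at 50%
```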

I also think this would have been better on Kaggle or CrowdAnalytix, but instead it has been posted on the R group on LinkedIn.

For all further questions, either ask via the comments on LinkedIn, or email the organizer, Roni Kass.

This is (roughly) what I said.

Statisticians seem to go through regular periods of existential crisis as they worry about other groups of people who do data analysis. A common theme is: all these other people (usually computer scientists) are doing our job! Don’t they know that statisticians are the best people to do data analysis? How dare they take over our discipline!

I take a completely different view. I think our discipline is in the best position it has ever been in. The demand for data analysis skills is greater than ever. Our graduates are highly sought after, and well paid. Being a statistician has even been described as a sexy profession (which presumably is a good thing to be!).

The different perspectives are all about inclusiveness. If we treat statistics as a narrow discipline, fitting models to data, and studying the properties of those models, then statistics is in trouble. But if we treat what we do as a broad discipline involving data analysis and understanding uncertainty, then the future is incredibly bright.

Here are two quotes from well-known bloggers in the last year or two:

April 2013: Larry Wasserman blog

Data science: the end of statistics?

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

November 2013: Andrew Gelman blog

Statistics is the least important part of data science

There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics as a subset of data science … Statistics is important—don’t get me wrong—statistics helps us correct biases … estimate causal effects … regularize so that we’re not overwhelmed by noise … fit models … visualize data … I love statistics! But it’s not the most important part of data science, or even close.

How can two professors of statistics have such different views on their discipline? The same perspectives can be seen in the following two diagrams (both reproduced with permission).

In the first narrow view, to be a data scientist you have to know a great deal about statistics, mathematics, computer science, programming, and the application discipline. If that’s true, I’ve never met a data scientist. I don’t believe they exist.

In the second broader view, everyone here is a data scientist, although we have different specializations and different perspectives and training.

I take the broad inclusive view. I am a data scientist because I do data analysis, and I do research on the methodology of data analysis. The way I would express it is that I’m a data scientist with a statistical perspective and training. Other data scientists will have different perspectives and different training.

We are comfortable with having medical specialists, and we will go to a GP, endocrinologist, physiotherapist, etc., when we have medical problems. We also need to take a team perspective on data science.

None of us can realistically cover the whole field, and so we specialise on certain problems and techniques. It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in statistics, mathematics, computing, programming, the application discipline, etc. Instead, we need teams of data scientists with different skills, with each being aware of the boundary of their expertise, and who to call in for help when required.

Let’s not be too sectarian about our disciplines, thinking everyone not trained in the same way we were is a heretic.

It reminds me of a famous joke, written by comedian Emo Philips:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. I immediately ran over and said “Stop! Don’t do it!”

“Why shouldn’t I?” he said.

I said, “Well, there’s so much to live for!”

“Like what?”

“Well … are you religious or atheist?”

“Religious.”

“Me too! Are you Christian or Jewish?”

“Christian.”

“Me too! Are you Catholic or Protestant?”

“Protestant.”

“Me too! What franchise?”

“Baptist.”

“Wow! Me too! Northern Baptist or Southern Baptist?”

“Northern Baptist.”

“Me too! Are you Northern Conservative Baptist or Northern Liberal Baptist?”

“Northern Conservative Baptist.”

“Me too! Are you Northern Conservative Fundamentalist Baptist or Northern Conservative Reformed Baptist?”

“Northern Conservative Fundamentalist Baptist.”

To which I said, “Die, heretic scum!” and pushed him off.

He is in a unique position to write such a paper as he has been doing forecasting research longer than anyone else on the planet — his first published paper on forecasting appeared in 1959. Herman is now 82 years old, and is still very active in research. Only a couple of months ago, he wrote to me with some new research ideas he had been thinking about, asking me for some feedback. He is also an extraordinarily conscientious and careful associate editor of the *IJF* and a delight to work with. He is truly “a scholar and a gentleman” and I am very happy that we can honor Herman in this manner. Thanks to Tara Sinclair, Prakash Loungani and Fred Joutz for putting this tribute together.

We also published an interview with Herman in the *IJF* in 2010 which contains some information about his early years, graduate education and first academic jobs.

In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers which describe two prediction competitions:

- Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
- Roy et al (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.

Some points of discussion:

- The old style of competition where participants make a single submission and the results are compiled by the organizers is much less effective than competitions involving feedback and a leaderboard (such as those hosted on Kaggle). The feedback seems to encourage participants to do better, and the results often improve substantially during the competition.
- Too many submissions result in over-fitting to the test data. Therefore the final scores need to be based on a different test data set than the data used to score the submissions during the competition. Kaggle does not do this, although they partially address the problem by computing the leaderboard scores on a subset of the final test set.
- The metric used in the competition is important, and this is sometimes not thought through carefully enough by competition organizers.
- There are several competition platforms available now including Kaggle, CrowdAnalytix and Tunedit.
- The best competitions are focused on specific domains and problems. For example, the GEFcom 2014 competitions are about specific problems in energy forecasting.
- Competitions are great for advancing knowledge of what works, but they do not lead to data scientists being well paid as many people compete but few are rewarded.
- The IJF likes to publish papers from winners of prediction competitions because of the extensive empirical evaluation provided by the competition. However, a condition of publication is that the code and methods are fully revealed, and winners are not always happy to comply.
- The IJF will only publish competition results if they present new information about prediction methods, or tackle new prediction problems, or measure predictive accuracy in new ways. Just running another competition like the previous ones is not enough. It still has to involve genuine research results.
- I would love to see some serious research about prediction competitions, but that would probably require a company like Kaggle to make their data public. See Frank Diebold’s comments on this too.
- A nice side effect of some competitions is that they create a benchmark data set with well tested benchmark methods. This has worked well for the M3 data, for example, and new time series forecasting algorithms can be easily tested against these published results. However, over-study of a single benchmark data set means that methods are probably over-fitting to the published test data. Therefore, a wider range of benchmarks is desirable.
- Prediction competitions are a fun way to hone your skills in forecasting and prediction, and every student in this field is encouraged to compete in a few competitions. I can guarantee you will learn a great deal about the challenges of predicting real data — something you don’t always learn in classes or via textbooks.

The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

- All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
- Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
- Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the `hmd.mx` and `hmd.e0` functions from the demography package. Tim has his own package on GitHub that provides a more extensive interface.
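A minimal usage sketch, assuming you have registered for a (free) HMD account; the country code is real but the credentials below are placeholders:

```r
library(demography)

# Read Australian mortality rates and life expectancy from the
# Human Mortality Database (requires registration at mortality.org;
# the username and password here are placeholders)
aus.mx <- hmd.mx("AUS", username = "your@email", password = "yourpassword")
aus.e0 <- hmd.e0("AUS", username = "your@email", password = "yourpassword")

plot(aus.mx, series = "female")
```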

It made me think about my own efforts to communicate future uncertainty through graphics. Of course, for time series forecasts I normally show prediction intervals. I prefer to use more than one interval at a time because it helps convey a little more information. The default in the forecast package for R is to show both an 80% and a 95% interval like this:

It is sometimes preferable to use a 50% and a 95% interval, rather like a boxplot:

In some circles (especially macroeconomic forecasting), fan charts are popular:

Personally, I don’t like these at all as they lose any specific probabilistic interpretability. What does the darker shaded region actually refer to? At least in the previous version, it is clear that the dark region contains 50% of the probability.

The above three examples are easily produced using the forecast package:

```r
fit <- ets(hsales)
plot(forecast(fit), include=120)
plot(forecast(fit, level=c(50,95)), include=120)
plot(forecast(fit, fan=TRUE), include=120)
```

For multi-modal distributions I like to use highest density regions. Here is an example applied to Nicholson’s blowfly data using a threshold model:

The dark region has 50% coverage and the light region has 95% coverage. The forecast distributions become bimodal after the first ten iterations, and so the 50% region is split in two to show that. This graphic was taken from a *J. Forecasting* paper I wrote in 1996, so these ideas have been around for a while!

It is easy enough to produce forecast HDRs for time series objects. Here is some R code to do it:

```r
# HDR for time series object
# Assumes that forecasts can be computed and futures simulated from object
forecasthdr <- function(object, h = ifelse(object$m > 1, 2 * object$m, 10),
                        nsim = 2000, plot = TRUE, level = c(50, 95),
                        xlim = NULL, ylim = NULL, ...)
{
  require(hdrcde)
  # Compute forecasts
  fc <- forecast(object)
  ft <- time(fc$mean)
  # Simulate future sample paths
  sim <- matrix(0, nrow = h, ncol = nsim)
  for(i in 1:nsim)
    sim[, i] <- simulate(object, nsim = h)
  # Compute HDRs
  nl <- length(level)
  hd <- array(0, c(h, nl, 10))
  mode <- numeric(h)
  for(k in 1:h)
  {
    hdtmp <- hdr(sim[k, ], prob = level)
    hd[k, , 1:ncol(hdtmp$hdr)] <- hdtmp$hdr
    mode[k] <- hdtmp$mode
  }
  # Remove unnecessary sections of HDRs
  nz <- apply(abs(hd), 3, sum) > 0
  hd <- hd[, , nz]
  dimnames(hd)[[1]] <- 1:h
  dimnames(hd)[[2]] <- level
  # Produce plot if required
  if(plot)
  {
    if(is.null(xlim))
      xlim <- range(time(object$x), ft)
    if(is.null(ylim))
      ylim <- range(object$x, hd)
    plot(object$x, xlim = xlim, ylim = ylim, ...)
    # Show HDRs
    cols <- rev(colorspace::sequential_hcl(52))[level - 49]
    for(k in 1:h)
    {
      for(j in 1:nl)
      {
        hdtmp <- hd[k, j, ]
        nint <- length(hdtmp) / 2
        for(l in 1:nint)
        {
          polygon(ft[k] + c(-1, -1, 1, 1) / object$m / 2,
                  c(hdtmp[2*l-1], hdtmp[2*l], hdtmp[2*l], hdtmp[2*l-1]),
                  col = cols[j], border = FALSE)
        }
      }
      points(ft[k], mode[k], pch = 19, col = "blue", cex = 0.8)
    }
    #lines(fc$mean, col = 'blue', lwd = 2)
  }
  # Return HDRs
  return(list(hdr = hd, mode = mode, level = level))
}
```

We can apply it using the example I started with:

```r
z <- forecasthdr(fit, xlim=c(1986,1998), nsim=5000,
                 xlab="Year", ylab="US monthly housing sales")
```

The dots are modes of the forecast distributions, and the 50% and 95% highest density regions are also shown. In this case, the distributions are unimodal, and so all the regions are intervals.

- Electricity price forecasting: A review of the state-of-the-art with a look into the future by Rafał Weron.
- The challenges of pre-launch forecasting of adoption time series for new durable products by Paul Goodwin, Sheik Meeran, and Karima Dyussekeneva.

Both tackle very important topics in forecasting. Weron’s paper contains a comprehensive survey of work on electricity price forecasting, coherently bringing together a large body of diverse research — I think it is the longest paper I have ever approved at 50 pages. Goodwin, Meeran and Dyussekeneva review research on new product forecasting, a problem every company that produces goods or services has faced; when there are no historical data available, how do you forecast the sales of your product?

We have a few other review papers in progress, so keep an eye out for them in future issues.


I have two large time series data. One is separated by seconds intervals and the other by minutes. The length of each time series is 180 days. I’m using R (3.1.1) for forecasting the data. I’d like to know the value of the “frequency” argument in the ts() function in R, for each data set. Since most of the examples and cases I’ve seen so far are for months or days at the most, it is quite confusing for me when dealing with equally separated seconds or minutes. According to my understanding, the “frequency” argument is the number of observations per season. So what is the “season” in the case of seconds/minutes? My guess is that since there are 86,400 seconds and 1440 minutes a day, these should be the values for the “freq” argument. Is that correct?

The same question was asked on crossvalidated.com.

Yes, the “frequency” is the number of observations per season. This is the opposite of the definition of frequency in physics, or in Fourier analysis, where “period” is the length of the cycle, and “frequency” is the inverse of period. When using the `ts()` function in R, the following choices should be used.

| Data | frequency |
|---|---|
| Annual | 1 |
| Quarterly | 4 |
| Monthly | 12 |
| Weekly | 52 |

Actually, there are not 52 weeks in a year, but 365.25/7 = 52.18 on average. However, most functions which use `ts` objects require integer frequency.

Once the frequency of observations is smaller than a week, there is usually more than one way of handling the frequency. For example, hourly data might have a daily seasonality (frequency=24), a weekly seasonality (frequency=24×7=168) and an annual seasonality (frequency=24×365.25=8766). If you want to use a `ts` object, then you need to decide which of these is the most important.
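For instance (a sketch using simulated data), hourly observations treated as having only a daily seasonality would be set up like this:

```r
library(forecast)

# Ten days of simulated hourly data with a daily cycle;
# frequency = 24 makes the day the "season"
set.seed(123)
hours <- 1:240
x <- ts(sin(2 * pi * hours / 24) + rnorm(240, sd = 0.2), frequency = 24)

fit <- ets(x)
plot(forecast(fit, h = 48))  # forecast the next two days
```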

An alternative is to use an `msts` object (defined in the `forecast` package) which handles multiple seasonal time series. Then you can specify all the frequencies that might be relevant. It is also flexible enough to handle non-integer frequencies.

| Data | minute | hour | day | week | year |
|---|---|---|---|---|---|
| Daily | | | | 7 | 365.25 |
| Hourly | | | 24 | 168 | 8766 |
| Half-hourly | | | 48 | 336 | 17532 |
| Minutes | | 60 | 1440 | 10080 | 525960 |
| Seconds | 60 | 3600 | 86400 | 604800 | 31557600 |

You won’t necessarily want to include all of these frequencies — just the ones that are likely to be present in the data. For example, natural phenomena (e.g., sunshine hours) are unlikely to have a weekly period, and if your data are measured in one-minute intervals over a 3 month period, there is no point including an annual frequency.

For example, the `taylor` data set from the `forecast` package contains half-hourly electricity demand data from England and Wales over about 3 months in 2000. It was defined as

```r
taylor <- msts(x, seasonal.periods=c(48,336))
```

One convenient model for multiple seasonal time series is a TBATS model:

```r
taylor.fit <- tbats(taylor)
plot(forecast(taylor.fit))
```

(Warning: this takes a few minutes.)

If an `msts` object is used with a function designed for `ts` objects, the largest seasonal period is used as the “frequency” attribute.
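For example (with simulated data rather than the real demand series), you can check this directly:

```r
library(forecast)

# Four weeks of simulated half-hourly data with daily and weekly seasonality
y <- msts(rnorm(336 * 4), seasonal.periods = c(48, 336))

frequency(y)      # the largest seasonal period, 336
attr(y, "msts")   # both periods are retained: 48 and 336
```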