The package is based on this paper with Earo and Nikolay.

The basic idea is to measure a range of features of the time series (such as strength of seasonality, an index of spikiness, first order autocorrelation, etc.). Then a principal component decomposition of the feature matrix is calculated, and outliers are identified in the two-dimensional space of the first two principal component scores.

We use two methods to identify outliers.

- A bivariate kernel density estimate of the first two PC scores is computed, and the points are ordered based on the value of the density at each observation. This gives us a ranking from most outlying (lowest density) to least outlying (highest density).
- A series of α-convex hulls are computed on the first two PC scores with decreasing α, and points are classified as outliers when they become singletons separated from the main hull. This gives us an alternative ranking, with the most outlying points having separated at the highest value of α, and the remaining outliers separating at decreasing values of α.
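The density-ranking step can be sketched in a few lines of R. This is only an illustration with simulated feature values, not the package's implementation; `MASS::kde2d` stands in here for whatever kernel estimator is actually used.

```r
# A rough sketch of the density-ranking step (simulated features; the real
# package computes features such as seasonal strength from each series).
library(MASS)  # for kde2d

set.seed(1)
features <- matrix(rnorm(200), ncol = 4)  # 50 series x 4 features

# First two principal component scores of the feature matrix
pc <- prcomp(features, scale. = TRUE)$x[, 1:2]

# Bivariate kernel density estimate, evaluated (approximately) at each
# observation by looking up the nearest grid cell of the kde2d estimate
d  <- kde2d(pc[, 1], pc[, 2], n = 50)
ix <- round(approx(d$x, seq_along(d$x), pc[, 1], rule = 2)$y)
iy <- round(approx(d$y, seq_along(d$y), pc[, 2], rule = 2)$y)
dens <- d$z[cbind(ix, iy)]

# Order from most outlying (lowest density) to least outlying
outlier_order <- order(dens)
```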

I explained the ideas in a talk last Tuesday given at a joint meeting of the Statistical Society of Australia and the Melbourne Data Science Meetup Group. Slides are available here. A link to a video of the talk will also be added there when it is ready.

The density-ranking of PC scores was also used in my work on detecting outliers in functional data. See my 2010 JCGS paper and the associated rainbow package for R.

There are two versions of the package: one under an ACM licence, and a limited version under a GPL licence. Eventually we hope to make the GPL version contain everything, but we are currently dependent on the alphahull package which has an ACM licence.

The changes are all outlined in the ChangeLog file as usual. I will highlight some of the more important changes since v5.0 here.

One of the most used functions in the package is `ets()`, and it provides a stock forecasting engine for many organizations. The default model selection is now restricted to exclude multiplicative trend models, as these often give very poor forecasts due to the extrapolation of exponential trends. Multiplicative trend models can still be fitted if required. I compared the new default settings with the old defaults on the M3 data, and found a considerable difference in forecast accuracy:

| | MAPE | sMAPE | MASE |
|---|---|---|---|
| ETS | 17.38 | 13.13 | 1.43 |
| ETS (old) | 18.04 | 13.36 | 1.52 |
| AutoARIMA | 19.12 | 13.85 | 1.47 |

Here “ETS” denotes the new default approach (without multiplicative trends) and “ETS (old)” is the old default approach, possibly including multiplicative trends. For comparison, the results from applying `auto.arima` to the same data are also shown.
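As a sketch of the new behaviour (the argument name `allow.multiplicative.trend` is taken from the package documentation):

```r
library(forecast)

# New default: the model search excludes multiplicative trend models
fit <- ets(WWWusage)

# Multiplicative trends can still be considered if explicitly allowed
fit_mult <- ets(WWWusage, allow.multiplicative.trend = TRUE)
```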

The `auto.arima()` function is now stricter on near unit roots. Even if a model can be estimated, it will not be selected if the characteristic AR or MA roots are too close to the unit circle, as this can cause numerical instabilities. Previously the roots had to be at least 0.001 away from the unit circle; now they must be at least 0.01 away.

There is a new `allowmean` argument in `auto.arima` which can be used to prevent a mean term being included in a model.
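For example (a sketch; `lh` is a stationary series shipped with base R):

```r
library(forecast)

# The default search may include a non-zero mean for stationary models
fit1 <- auto.arima(lh)

# allowmean = FALSE excludes a mean term from all candidate models
fit2 <- auto.arima(lh, allowmean = FALSE)
```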

There is a new `plot.Arima()` function which plots the characteristic roots of an ARIMA model. This is based on a blog post I wrote last year.
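Since `plot.Arima()` is an S3 method, a plain `plot()` call on a fitted model is all that is needed:

```r
library(forecast)

fit <- auto.arima(WWWusage)

# Plot the characteristic roots of the fitted ARIMA model
plot(fit)
```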

It is now possible to easily obtain the fitted model order for use in other functions. The function `arimaorder` applied to a fitted ARIMA model (such as that returned by `auto.arima`) will return a numeric vector of the form (p,d,q) for a nonseasonal model and (p,d,q,P,D,Q,m) for a seasonal model. Similarly, `as.character` applied to the object returned by `Arima` or `auto.arima` will give a character string with the fitted model, suitable for use in plotting or reports.
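A quick sketch using a monthly series from base R:

```r
library(forecast)

fit <- auto.arima(USAccDeaths)  # a monthly (seasonal) series

arimaorder(fit)    # (p,d,q,P,D,Q,m) for a seasonal fit
as.character(fit)  # a label such as "ARIMA(0,1,1)(0,1,1)[12]"
```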

The models returned by `tbats` and `bats` were occasionally unstable. This problem has been fixed, again by restricting the roots to be further away from the unit circle.

`stlf` and `forecast.stl` combine forecasting with seasonal decomposition. The seasonally adjusted series is forecast, and then the forecasts are re-seasonalized. These functions now have a `forecastfunction` argument to allow user-specified methods to be used in the forecasting step.
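A sketch of a user-supplied forecasting step, assuming (as the argument name suggests) that the supplied function receives the seasonally adjusted series and a horizon, and returns a forecast object:

```r
library(forecast)

# Hypothetical user-supplied step: model the seasonally adjusted series
# with a damped additive-trend ETS model
myfc <- function(x, h, ...) {
  forecast(ets(x, model = "AAN", damped = TRUE), h = h)
}

fc <- stlf(USAccDeaths, h = 24, forecastfunction = myfc)
```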

There is a new `stlm` function and a corresponding `forecast.stlm` function to allow the model estimation to be separated from the forecasting, thus matching most other forecasting methods in the package. This allows more flexible specification of the model to be used for the seasonally adjusted series.
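The fit/forecast split then looks like this (a sketch):

```r
library(forecast)

# Estimate the model once...
fit <- stlm(USAccDeaths, method = "ets")

# ...then produce forecasts from the stored fit
fc <- forecast(fit, h = 12)
```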

The `Acf` function replaces the `acf` function to provide better plots of the autocorrelation function. The horizontal axis now highlights the seasonal lags.
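A minimal usage sketch (`USAccDeaths` is a monthly series shipped with base R):

```r
library(forecast)

# Acf() behaves like acf(), with the lag axis labelled in seasonal periods
Acf(USAccDeaths, lag.max = 36)
```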

I have added two new functions, `taperedacf` and `taperedpacf`, to implement the estimates and plots proposed in this recent paper.

The `fourier()` and `fourierf()` functions produce a matrix of Fourier terms for use in regression models for seasonal time series. These were updated to work with `msts` objects so that multiple seasonalities can be fitted.
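For example, using the `taylor` half-hourly electricity demand data included in the package (a sketch; `K` sets the number of Fourier term pairs per seasonal period):

```r
library(forecast)

# Half-hourly demand with daily (48) and weekly (336) seasonal periods
y <- msts(taylor, seasonal.periods = c(48, 336))

X <- fourier(y, K = c(3, 5))
ncol(X)  # one column per sin/cos term: 2*(3+5) = 16
```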

Occasionally, the period of the seasonality may not be known. The `findfrequency()` function will estimate it. This is based on an earlier version I wrote for this blog post.
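For example:

```r
library(forecast)

# Drop the frequency attribute, then recover the period from the data
y <- as.numeric(USAccDeaths)
findfrequency(y)  # we would expect 12 for this monthly series
```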

The `forecast.ts()` function takes a time series and returns some forecasts, without the user necessarily knowing what is going on under the hood. It will use `ets` with default settings (if the data are non-seasonal or the seasonal period is 12 or less) and `stlf` (if the seasonal period is 13 or more). If the seasonal period is unknown, there is an option (`find.frequency=TRUE`) to estimate it first.
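Putting the two pieces together (a sketch):

```r
library(forecast)

# A numeric vector with no declared frequency (the true period is 12)
y <- as.numeric(USAccDeaths)

# find.frequency = TRUE estimates the seasonal period before modelling
fc <- forecast(y, find.frequency = TRUE)
```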


**Graham Elliott** has decided to step down from the IJF editorial board after many years of service. Graham is best known for his research on optimal forecast combination, and forecasting under asymmetric and flexible loss functions. He is also the co-editor of the highly regarded multi-volume Handbook of Economic Forecasting (published by North-Holland). Graham became an IJF editor in 2007, and over the last 8 years he has made a major contribution to the journal. He has held authors to the high research standards that are evident in the excellent papers published on his watch. He has also used his extensive contacts in the forecasting world to encourage some new associate editors to get involved with the journal. We will miss him as an editor, but look forward to seeing more of his forecasting research in the pages of the journal.

In light of the increasing number of submissions we are handling, we have decided to increase the number of editors. I am delighted that George Kapetanios and Mike McCracken have both agreed to become IJF editors. Mike has been an IJF associate editor for about five years, while George is new to the editorial board.

**Michael McCracken** is Assistant Vice-President of the Federal Reserve Bank of St Louis, USA. He is well-known for his research on forecast evaluation, macroeconomic forecasting and real-time data. His 2001 *Journal of Econometrics* paper with Todd Clark on “Tests of equal forecast accuracy and encompassing for nested models” has received more than 700 citations.

**George Kapetanios** is a Professor of Economics at Queen Mary University of London, UK. His best-known research is in the area of unit-root and cointegration tests. He has also worked extensively on exchange rate forecasting, and with the Bank of England on inflation and GDP forecasting.

With these appointments, the team of editors now consists of Dick van Dijk, George Kapetanios, Mike McCracken, Dilek Önkal, Esther Ruiz, and me.

In addition to these changes, we have also made some changes to the associate editor panel. Obviously we have recently lost two associate editors in Michael McCracken and Dilek Önkal who have both become editors. In addition, two more associate editors have retired.

**Wilpen Gorr** is retiring after 30 years on the editorial board. He joined the board in 1985, and served as an editor from 1996–2002, before returning to the role of associate editor for the last 13 years. I would like to acknowledge the extraordinarily long time Wil has been involved with the journal; his contributions and experience have been extremely valuable. Wil has worked on a wide range of forecasting problems, including crime forecasting, forecasting by analogy, application of the receiver operating characteristic (ROC) framework to forecasting, and the application of forecasting tools in public policy analysis. Wil’s contributions to the IJF, and to the International Institute of Forecasters more generally, were acknowledged in 2005 when he was inducted as a fellow of the institute. I wish him well in his retirement, and I hope we still see him at forecasting events in the future.

**Bruce McCullough** served as an associate editor from 1999–2015, focusing on forecasting software and computational issues. For several years we published “Software reviews” and Bruce looked after that section of the journal. I would like to thank him for his services to the IJF over the years. Bruce is best-known for his papers on assessing the numerical reliability of software, and especially for his papers showing the unreliability of the statistical procedures in Microsoft Excel. I hope Bruce keeps up this valuable line of research and that software vendors take notice and improve numerical reliability of their products!

We have appointed three new associate editors in the last few weeks: Tommaso Proietti, Marcelo Medeiros, and Sébastien Laurent.

**Tommaso Proietti** is Professor in Economic Statistics at the University of Rome Tor Vergata in Italy. He will handle papers on time series forecasting, state space models, frequency domain methods, unobserved components models, seasonal models, economic trends and cycles, and temporal disaggregation.

**Marcelo Medeiros** is Associate Professor in the Department of Economics at the Pontifical Catholic University of Rio de Janeiro, Brazil. His expertise is in machine learning, volatility forecasting, high-dimensional models and nonlinear time series.

**Sébastien Laurent** is Professor of Econometrics at the Aix-Marseille University, France. His areas of expertise include GARCH models, realized volatility, jumps, correlations, and Value at Risk forecasting.

It’s a pleasure to welcome them to the IJF editorial board.

]]>

- Bellotti, T., & Crook, J. (2012). Loss given default models incorporating macroeconomic variables for credit cards. IJF, 28(1), 171–182.

The first rule for the award of best paper should be that the paper clearly reflects the value of the new method/approach when compared to established alternatives in the particular problem context chosen by the researchers. This paper examines alternative models in the important problem of predicting loss from defaulting consumers. The problem context is clear and important — the appraisal of the inclusion of macroeconomic variables and the comparison with other specifications is thorough. It should have impact on the many users of these models.

- Clements, M. P. (2012). Do professional forecasters pay attention to data releases? IJF, 28(2), 297–308.

This paper is important because it seeks to determine how forecasters make their forecasts and whether they incorporate new information into their predictions. The methodology again is applicable to all fields.

- Diebold, F. X., & Yilmaz, K. (2012). Better to give than to receive: Predictive directional measurement of volatility spillovers. IJF, 28(1), 57–66.

This is a methodological paper developing ways to estimate spillovers from one market to others. They use a generalized vector autoregressive framework in which forecast error variance decompositions are invariant to the variable ordering. Even though Diebold and Yilmaz used the method to look at volatility spillovers internationally in the time domain, the procedure is usable more generally in cross-sectional data with spatial interconnections too.

- Galvão, A. B. (2013). Changes in predictive ability with mixed frequency data. IJF, 29(3), 395–410.

The premise of this paper is just so ‘common sense’: if we have disaggregated data, why don’t we use it? The manuscript links disaggregated data (with different frequencies) with non-linear features of models, which tend to disappear when the data are aggregated. It is a smart way to use all available information and, at the same time, to learn which features are interesting in the production of a forecast. This approach makes the study of nonlinearities valuable.

- Genre, V., Kenny, G., Meyler, A., & Timmermann, A. (2013). Combining expert forecasts: Can anything beat the simple average? IJF, 29(1), 108–121.

The authors explore an extensive set of methods to show that, on aggregating forecasts, a simple average is a benchmark that is very difficult to beat by more sophisticated aggregation schemes. Although the finding per se is not new (we have numerous studies examining the “forecast combination puzzle”), the rigorous approach to comparison of methods makes this manuscript very relevant.

- González-Rivera, G., & Yoldas, E. (2012). Autocontour-based evaluation of multivariate predictive densities. IJF, 28, 328–342.

This paper deals with an important problem from the point of view of empirical forecasting which is measuring the uncertainty of forecasts. Second, the problem considered is interesting from the methodological point of view. Third, the procedure proposed can be implemented in practice as it is not extremely complicated. Therefore, the balance between methodology and empirical interest is appropriate.

- Jordà, Ò., Knüppel, M., & Marcellino, M. (2013). Empirical simultaneous prediction regions for path-forecasts. IJF, 29(3), 456–468.

This paper will revive interesting discussion on simultaneous confidence bands, path forecasts, or whatever name different communities use for multi-step-ahead probabilistic forecasts in their various forms. Work on this topic is fairly rare, while it is of utmost importance to further develop probabilistic forecasting in that direction. The authors have done previous work (in the *Journal of Applied Econometrics*) on the verification of these so-called path forecasts. In this paper, they push it further by linking them to simultaneous confidence regions obtained in a hypothesis-testing framework. They also show the practical interest of their proposal through a relevant case study.

- Lahiri, K., & Wang, J. G. (2013). Evaluating probability forecasts for GDP declines using alternative methodologies. IJF, 29(1), 175–190.

The paper deals with an important macroeconomic topic: predicting recessions. The failure to forecast recessions is one of the main failures in that field. The paper also is important because it presents methodologies that can be used in other areas of forecasting.

- Lanne, M., Luoto, J., & Saikkonen, P. (2012). Optimal forecasting of noncausal autoregressive time series. IJF, 28(3), 623–631.

This paper is innovative in that it brings unexplored issues, such as the noncausal representation of AR processes, into the forecasting literature. It opens new lines of inquiry. It may offer advantages for non-Gaussian processes, which are so prevalent in financial data.

- Ng, J., Forbes, C. S., Martin, G. M., & McCabe, B. P. M. (2013). Non-parametric estimation of forecast distributions in non-Gaussian, non-linear state space models. IJF, 29(3), 411–430.

This paper deals with an important problem from the point of view of empirical forecasting which is measuring the uncertainty of forecasts. Second, the problem considered is interesting from the methodological point of view. Third, the procedure proposed can be implemented in practice as it is not extremely complicated. Therefore, the balance between methodology and empirical interest is appropriate.

- Soyer, E., & Hogarth, R. M. (2012). The illusion of predictability: How regression statistics mislead experts. IJF, 28(3), 695–711.

This paper shows how regression analysis is misunderstood and misused by leading scholars when they do regression analyses. The paper has attracted much attention. It adds to the research showing that leading scholars made serious errors in papers that they publish in leading economics journals, and this problem has gotten worse over time. The implication is that if even the best and the brightest get it wrong, how can we expect others to get it right? There are more effective ways to analyze data and Soyer and Hogarth suggest one approach.

My talk is on *“Exploring the boundaries of predictability: what can we forecast, and when should we give up?”* Essentially I will start with some of the ideas in this post, and then discuss the features of hard-to-forecast time series.

So if you’re in the San Francisco Bay area, please come along. Otherwise, it will be streamed live on the Yahoo Labs website.

**Abstract**

Why is it that we can accurately forecast a solar eclipse in 1000 years’ time, but we have no idea whether Yahoo’s stock price will rise or fall tomorrow? Or why can we forecast electricity consumption next week with remarkable precision, but we cannot forecast exchange rate fluctuations in the next hour?

In this talk, I will discuss the conditions we need for predictability, how to measure the uncertainty of predictions, and the consequences of thinking we can predict something more accurately than we can.

I will draw on my experiences in forecasting Australia’s health budget for the next few years, in developing forecasting models for peak electricity demand in 20 years’ time, and in identifying unpredictable activity on Yahoo’s mail servers.

We encourage students to attend conferences, and provide funding for them to attend one international conference and one local conference during their PhD candidature. Thilaksha was previously funded to attend last year’s COMPSTAT in Geneva, Switzerland and IMS conference in Sydney. Having exhausted local funding, she has now convinced several other organizations to support her conference habit.

Now she just has to finish that thesis…

It’s nice to see that it has been getting some good reviews. It is rated 4.6 stars on Amazon.com with 6 out of 8 reviewers giving it 5 stars (the 3 reviewers on Amazon.co.uk all gave it 5 stars).

My favourite Amazon review is this one:

The book is well written and up to date — the online edition is likely to continue to be updated frequently. Hyndman is an inspiration. His blog is very interesting if you are a statistician, and written in a very clear style. His research group is the author of the R forecast package used in this book. He and his collaborators have made great strides in systematizing smoothing methods. You are not only reading a clear introductory textbook, you are reading one that’s up to date with modern forecasting practice (excluding the more exotic data mining methods, which clearly go beyond introductory texts). (I’m sure George Athanasopoulos has many fine qualities, but I’m less familiar with him.)

This book is ideal for self-study because the associated website has the answers to the exercises.

Isn’t he nice? Yes, George is a great guy!

However, the review is not entirely accurate. The website does not contain answers to the exercises. We do have solutions that we can give to instructors using the book, but at this stage they are not available more widely.

The only bad review (3 stars) was this one:

Online version much better than printed version.

Seriously? The only difference between the online and print versions is that a small number of typos have been corrected in the online version.

A few reviewers have called for an index. The reason it doesn’t have one is that the online version can be searched. Look up anything you like using the search box at the top of every page. Surely that’s better than me constructing an index by hand.

Another review appeared yesterday on the Information Management site, also very positive.

It’s nice to know that our work is appreciated, but we are well aware that there are areas for improvement. George and I hope to work on a revision to the book this year. If anyone has any suggestions for the next edition, please let us know in the comments below.


I use neither. I did use Mendeley for several years (and blogged about it a few years ago), but it became slower and slower to sync as my reference collection grew. Eventually it simply couldn’t handle the load. I have over 11,000 papers in my collection, and I was spending several minutes every day waiting for Mendeley just to update the database.

Then I came across **Paperpile**, which is not so well known as some of its competitors, but it is truly awesome. I’ve now been using it for over a year, and I have grown to depend on it every day to keep track of all the papers I read, and to create my bib files.

Paperpile is not free, but it is relatively inexpensive — $2.99 per month for academic users. Even poor research students can afford that.

It works differently from Mendeley and Zotero, in that everything is stored in the cloud and is accessible on any device with the Chrome browser. So there is no software to install other than a Chrome extension. It is blindingly fast and, like all good software, just works.

Just over 2000 of my references have attached pdfs and they are accessible on every device. A local copy is cached so if you come back to the same pdf later, it will not download a new copy.

There is a marvellous Chrome extension that detects references in the current browser tab and imports the details into your Paperpile collection with the click of a button. There is also tight integration with Google Scholar.

References can be shared by email from within Paperpile, including any attached pdfs.

Papers can be assigned to folders (which are actually more like tags as a paper can appear in multiple folders). I tend to set up a folder for each paper I am writing, and then export a bib file for that folder.

The switch from Mendeley was very easy — Paperpile simply imported the whole library, and it was ready to go.

There are some things I’ve lost in making the move from Mendeley to Paperpile:

- Paperpile’s search facility is limited to metadata (titles, authors, journal name, year, abstract and notes). It will not allow searching within pdfs, unlike Mendeley, which is particularly good in this area. My workaround is to sync my Paperpile library with Google Drive, so all the pdfs are stored in Google Drive as well as in Paperpile. I can easily search within pdfs on GDrive.
- Paperpile does not allow annotation of pdfs. It is easy to add notes to each paper, but it is not possible to highlight or annotate pdfs directly. But since I hardly ever did that, it wasn’t much of a loss.
- BibTeX keys were all replaced when I imported my library from Mendeley. I still have the bib files from my old Mendeley library, so old papers can be compiled without a problem. But with new papers, I need to either use the keys generated by Paperpile, or manually change the keys within Paperpile.

None of these have been serious enough to make me want to go back to Mendeley, and the speed and simplicity of Paperpile have won me over and made me more productive.

The data set comprises real traffic to Yahoo services, along with some synthetic data. There are 367 time series in the data set, each of which contains between 741 and 1680 observations recorded at regular intervals. Each series is accompanied by an indicator series with a 1 if the observation was an anomaly, and 0 otherwise. The anomalies in the real data were determined by human judgement, while those in the synthetic data were generated algorithmically. For the synthetic data, some information about the components used to construct the data is also provided.

Although the Yahoo announcement claims that the data are publicly available, in fact they are only available to people with a .edu address. Further, you have to apply to use them, and it takes about 24 hours before approval is granted. I have suggested that they remove these restrictions, and make the data available without restriction to anyone who wants to use them.

Research on anomaly detection in time series seems to be growing in popularity. Twitter has also released their own Anomaly Detection R package. Their approach has some similarities with my own `tsoutliers` function in the `forecast` package. The `tso` function in the `tsoutliers` package is another approach to the same problem.
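As a quick sketch of the `tsoutliers` function from the forecast package:

```r
library(forecast)

# `gold` (daily morning gold prices, included in the package) is a
# series with a well-known outlier
out <- tsoutliers(gold)

out$index         # positions identified as outliers
out$replacements  # suggested replacement values
```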

Hopefully having a large public data set available will lead to improvements in time series outlier detection methods, at least for detecting outliers in internet traffic data.

I’ve seen authors citing as many references as possible to try to please potential referees. Many of those references are low quality papers though. Any general guidance about a typical length for the reference section?

It depends on the subject and style of the paper. I’ve written a paper with over 900 citations, but that was a review of time series forecasting over a 25 year period, and so it had to include a lot of references.

I’ve also written a paper with just four citations. As it was a commentary, it did not need a lot of contextual information.

Rather than provide guidance on the length of the reference section, I think it is better to follow some general principles of citation in research.

Think about the purpose of citations: they are there to provide support for statements that you make or to provide context for your own research. They are not there just to keep an editor or a reviewer happy.

In particular, a literature review is meant to provide context — showing the development of ideas that led to your paper, and showing the connection between your paper and related ideas. You are telling a story, and the references you cite are those that are important to your story. If your story cannot be told without referring to many papers, then do it. But if the citations involve tangents or irrelevant details, leave them out.

Missing important papers, especially if they are recent, demonstrates a lack of awareness of the research context in which you are working.

At least make sure you’ve searched on Google scholar for any related work, and cite anything that looks important to the context and development of your own ideas.

If you know of a seminal paper, but it is a few years old, track its citations on Google scholar to find if there has been any important follow-up work.

Check if the journal you are submitting to has published any related work in the last five years and make sure you cite the most important of those papers. (If there is nothing in the journal related to your work, perhaps you are submitting to the wrong journal.)

I’ve written on this elsewhere on this blog, so I won’t repeat myself again. But please make sure you check everything you cite, and don’t just copy references from other papers.
