Big data is now endemic in business, industry, government, environmental management, medical science, social research and so on. A commensurate challenge is how to model and analyse these data effectively.

This workshop will bring together national and international experts in statistical modelling and analysis of big data, to share their experiences, approaches and opinions about future directions in this field.

**Paul Goodwin** was appointed as an associate editor in 1999 and as an editor in 2010. Paul is retiring from his position as Professor of Management Science at the University of Bath, UK, and has decided to also retire from the IJF editorial board. He has provided wonderful service to the IJF and I’ve really appreciated having someone with his expertise and experience to handle all the papers that are outside the domains that the other editors cover. I think the journal is in excellent shape, and that is due to the great work of the editorial board and especially the editors. So thanks to Paul for his wonderful contributions, and best wishes for the future.

**Dilek Önkal** has been an associate editor of the IJF since 2002, covering many topics including judgmental forecasting, psychological aspects of forecasting, organizational aspects of forecasting, forecasting support systems, and supply chains. She is a Professor in the Faculty of Business Administration at Bilkent University in Turkey, and until a few months ago was Dean of the faculty. I’m delighted that she has agreed to take on greater responsibility on the editorial board, and welcome her to the team of IJF editors.

There are five IJF editors:

- Rob Hyndman (Editor-in-Chief)
- Dick van Dijk
- Graham Elliott
- Dilek Önkal
- Esther Ruiz

along with a team of about 40 associate editors.

For those who like this sort of thing (as I do), there is a nice collection of statistical poetry here.

David Gordon Goddard

An inference that’s very often made –

‘a population tends to closely share

some feature that a smaller group displayed’ –

is why my students need to be aware

of standard error. Yet, it’s hard to learn,

and students’ drive to listen fades with each

attempt of mine to strive for ways to earn

attention till my message has its reach.

The teaching cycle brings around today

my yearly chance to make this topic clear.

I’ll aid my students, shorten what I say,

and they can later go and persevere …

… ask how it differs from the things it’s like,

and why we need it, what we couldn’t do

if it weren’t here. But can my students strike

the hours in busy lives to see this through …

… to think and tell themselves of things they know …

related things … then try to make the link

with standard error, find their gaps and so

seek remedy where knowledge meets its brink?

It’s four months on. A student whom I taught

then calls to see me and in measured way

explains she liked the insight I had brought

to ‘estimation’, then goes on to say:

“My research gathered data from a group,

defined a mob of which the group’s just part;

a feature in the group at six per cent

would be, I thought, like echoed in the mob.

I viewed the group as sample of the mob;

but samples vary some from whence they’re drawn.

Your standard error helped me calculate

how far from six my estimate might stray.”

She smiles and says she’d thought I’d like to know

that standard error served to underlie

her thinking. Yes, I’m happy that is so –

but more, she’d thought it worth enough to try.

The theme of the 2015 competition is around the analysis of **climate-related data** and your primary data must come from one or more climate-related databases. There are several websites containing such databases. **Your entry must be submitted as a poster in PDF format**.

You must clearly specify in your poster the source of your data, i.e., by listing the relevant URLs and the steps required to obtain the data.

In your analysis, you may concentrate on a single country (e.g., your own country), a region, a continent or even the entire world.

You are allowed to work individually or in a small group of up to five participants on your poster.

Posters will be judged according to these criteria:

- Appropriateness of analysis
- Novelty of approaches used in the analysis
- Clarity of objectives, approaches, displays, and results
- Significance of findings
- Generalizability of approaches to data sets in other arenas
- Overall quality of poster

Not all posters are expected to meet all criteria to the same degree. Your poster may (but need not) be accompanied by a short description (maximum five pages). All materials must be submitted in PDF format.

Some examples of websites from which you may obtain your data are given below. However, you may also obtain your data from any other climate-related database.

- National Aeronautics and Space Administration Goddard Institute for Space Studies: http://www.giss.nasa.gov/
- National Climatic Data Center: http://www.ncdc.noaa.gov/monitoring-references/faq/anomalies.php
- University of East Anglia Climatic Research Unit: http://www.cru.uea.ac.uk/cru/data/temperature/

You are expected to connect information from different data sets, which need not all come from the same database, in order to reach interesting and original conclusions.

Final submissions (PDF) are due on 15th April 2015.

Submit to: iasc.competition@gmail.com

Web page of the IASC Data Analysis Competition

Web page for Joint Meeting of IASC-ABE Satellite Conference

For all inquiries, contact:

Associate Professor Ann Maharaj iasc.competition@gmail.com

There are about 90 journals on the list, mostly in statistics, but some from machine learning, operations research and econometrics. I excluded probability journals, and areas of application that are well outside my research interests (such as bioinformatics, psychology and pharmacology). But I included every statistical methodology journal that was rated A, A* or B by the Australian ERA exercise in 2010, and several of the C journals as well (including, of course, the grossly under-rated *Journal of Statistical Software*). I also included the good new journals that have appeared since then including *Annual Reviews* and *Statistics & Public Policy*. I included the best regional journals (including *ANZJS*, *Statistica Neerlandica*, *Canadian J Statistics*, *Scandinavian J. Statistics*, and *J. Korean Statistical Society*). The two forecasting journals are on the list of course, plus the four A* journals and a couple of A journals in econometrics. Finally, I included the best machine learning, data mining, and operations research journals — as rated on the ERA 2010 list.

Where possible, I use the “new articles” feed so that articles appear as soon as they are online rather than after they appear in print. Some publishers seem to be stuck in the print era, and only provide a feed for articles in print.

Unfortunately, some of the publishers also make it difficult to get the appropriate RSS feed from the journal website. Wiley is great — requiring just one click from the front page of the journal. Springer requires two clicks if you know where to look. Elsevier has an appalling procedure requiring about 5 or 6 clicks, and when you finally get the feed into feedly, the title is wrong (every journal becomes “ScienceDirect Publications”). Some publishers had the feed hidden so deep that they clearly don’t want anyone using it.

My final beef with the publishers is that they occasionally change the RSS feeds without warning, and then the system breaks. I spent several hours fixing up my feeds because Springer and Elsevier went and broke things that previously worked.

In addition to the journal feeds, I have also included in the collection any new working papers that appear in the Statistics section of arXiv, plus any new forecasting papers that appear on RePEc (in the NEP-FOR report).
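For readers curious how such feeds are consumed programmatically, here is a minimal Python sketch of pulling article titles out of an RSS 2.0 feed. The inline feed XML is invented for illustration; in practice you would download a journal's feed URL first (a reader like feedly does essentially this at scale).

```python
# Minimal sketch: extract the journal name and article titles from an
# RSS 2.0 feed. The feed XML below is a made-up example, not a real journal.
import xml.etree.ElementTree as ET

feed_xml = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Statistics Journal</title>
    <item><title>Paper one</title><link>http://example.org/1</link></item>
    <item><title>Paper two</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""

def feed_titles(xml_text):
    """Return (journal name, list of article titles) for an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    journal = channel.findtext("title")
    titles = [item.findtext("title") for item in channel.findall("item")]
    return journal, titles

journal, titles = feed_titles(feed_xml)
print(journal, titles)
```

In real use you would fetch the XML with `urllib.request` (or a dedicated feed-parsing library) and poll periodically for new items.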

Thanks to feedly for allowing me to publish this as a “feedly collection”. This is a new feature in feedly that is not yet available to all users, but I was given advanced access in order to demonstrate how it could be used.

College of Technology Management, Institute of Service Science,

National Tsing Hua University, Hsinchu

Time and venue: 2015.1.7 (Wed.), 5pm, Room 622, 6F, TSMC Building (Sun Yun-suan Memorial Center, 6F, TSMC Building)

Many applications require a large number of time series to be forecast completely automatically. For example, manufacturing companies often require weekly forecasts of demand for thousands of products at dozens of locations in order to plan distribution and maintain suitable inventory stocks. In these circumstances, it is not feasible for time series models to be developed for each series by an experienced analyst. Instead, an automatic forecasting algorithm is required. In addition to providing automatic forecasts when required, these algorithms also provide high quality benchmarks that can be used when developing more specific and specialized forecasting models.

I will describe some algorithms for automatically forecasting univariate time series that have been developed over the last 20 years. The role of forecasting competitions in comparing the forecast accuracy of these algorithms will also be discussed.
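To give a flavour of what “automatic” means here, the toy sketch below chooses the smoothing parameter of simple exponential smoothing by grid search, with no analyst input, and returns the resulting flat forecast. This is only an illustration of the idea; the algorithms discussed in the talk search far richer model classes. The demand values are invented.

```python
# Toy "automatic forecasting": fit simple exponential smoothing (SES) by
# grid-searching the smoothing parameter alpha to minimise the one-step
# in-sample sum of squared errors, then forecast with the final level.

def ses_sse(y, alpha):
    """One-step in-sample SSE for SES with the given alpha, plus final level."""
    level = y[0]
    sse = 0.0
    for obs in y[1:]:
        sse += (obs - level) ** 2
        level = alpha * obs + (1 - alpha) * level
    return sse, level

def auto_ses(y):
    """Pick alpha automatically; SES forecasts are flat at the final level."""
    grid = [i / 100 for i in range(1, 100)]
    best_alpha = min(grid, key=lambda a: ses_sse(y, a)[0])
    _, level = ses_sse(y, best_alpha)
    return best_alpha, level

demand = [12, 14, 13, 15, 16, 15, 17, 18, 17, 19]  # invented weekly demand
alpha, forecast = auto_ses(demand)
print(alpha, forecast)
```

Run over thousands of series, a procedure like this (with a much larger model class and proper information criteria) is what replaces the per-series analyst.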

Institute of Statistical Science, Academia Sinica

Time: Monday, 2015/01/12, 11:00

Venue: 2F lounge, Institute of Statistical Science, Academia Sinica

Note: Tea reception at 10:40am in the 2F lounge of the Institute of Statistical Science

Time series can often be naturally disaggregated in a hierarchical or grouped structure. For example, a manufacturing company can disaggregate total demand for their products by country of sale, retail outlet, product type, package size, and so on. As a result, there can be millions of individual time series to forecast at the most disaggregated level, plus additional series to forecast at higher levels of aggregation.

The first problem with handling such large numbers of time series is how to produce useful graphics to uncover structures and relationships between series. I will demonstrate some data visualization tools that help in exploring big time series data.

The second problem is that the disaggregated forecasts need to add up to the forecasts of the aggregated data. This is known as reconciliation. I will show that the optimal reconciliation method involves fitting an ill-conditioned linear regression model where the design matrix has one column for each of the series at the most disaggregated level. For problems involving huge numbers of series, the model is impossible to estimate using standard regression algorithms. I will also discuss some fast algorithms for estimating this model that make it practicable in business contexts.
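As a concrete illustration of the reconciliation idea, here is a minimal sketch for the smallest possible hierarchy (Total = A + B), applying the OLS projection S(S'S)⁻¹S' to a vector of incoherent base forecasts. The numbers are invented, and real applications need the fast sparse algorithms mentioned above rather than this dense arithmetic.

```python
# OLS reconciliation for the hierarchy Total = A + B. The summing matrix S
# maps the bottom-level series (A, B) to all series (Total, A, B); the
# reconciled forecasts are S (S'S)^{-1} S' times the base forecasts.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

S = [[1, 1], [1, 0], [0, 1]]        # rows: Total, A, B; cols: A, B
base = [[100.0], [45.0], [50.0]]    # incoherent base forecasts: 45 + 50 != 100

St = transpose(S)
P = matmul(S, matmul(inv2(matmul(St, S)), St))  # projection S(S'S)^{-1}S'
reconciled = matmul(P, base)
print([round(r[0], 2) for r in reconciled])     # now Total = A + B exactly
```

After the projection, the adjusted forecasts satisfy the aggregation constraint while staying as close as possible (in least-squares terms) to the base forecasts.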

]]>Di is a world leader in data visualization, and is well-known for her work on interactive graphics. She is also the academic supervisor of several leading data scientists including Hadley Wickham and Yihui Xie, both of whom work for RStudio.

Di has a great deal of energy and enthusiasm for computational statistics and data visualization, and will play a key role in developing and teaching our new subjects in business analytics.

The Monash Business School is already exceptionally strong in econometrics (ranked 7th in the world on RePEc), and forecasting (ranked 11th on RePEc), and we have recently expanded into actuarial science. With Di joining the department, we will be extending our expertise in the area of data visualization as well.


The model was first described in Hyndman and Fan (2010). We are continually improving it, and the latest version is described in the model documentation, which will be updated from time to time.

The package is being released under a GPL licence, so anyone can use it. All we ask is that our work is properly cited.

Naturally, we are not able to provide free technical support, although we welcome bug reports. We are available to undertake paid consulting work in electricity forecasting.


The data are measurements from a medical diagnostic machine which takes 1 measurement every second, and after 32–1000 seconds, the time series must be classified into one of two classes. Some pre-classified training data is provided. It is not necessary to classify all the test data, but you do need to have relatively high accuracy on what is classified. So you could find a subset of more easily classifiable test time series, and leave the rest of the test data unclassified.

Accuracy is measured using

Accuracy = (TP + TN) / (TP + FP + TN + FN),

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives respectively.

The prizes are:

- $5000 for using at least 50% of the test samples and achieving 0.75 accuracy.
- $15000 for using at least 50% of the test samples and achieving accuracy of 0.85 or higher.
- For any accuracy above 0.75 while using less than 50% of the test samples (but at least 25%), each additional 0.05 increase in accuracy grants an additional $2K. For example, if you use 30% of the test samples and achieve an accuracy of 0.85, the prize will be $5K + $4K = $9K.
The winner will be:

The entry with the highest accuracy using the largest proportion of the test samples

OR

THE FIRST ONE that achieves 0.85 accuracy with at least 50% of data

OR

THE FIRST ONE that achieves 0.9 accuracy with at least 30% of data.

The link below contains a text file explaining the data and how to access it, and a PNG image explaining how the time series to classify were built and how the classes were assigned. The link also includes the actual train and test samples to be used for the challenge, and some plots of the time series.

https://drive.google.com/folderview?id=0BxmzB6Xm7Ga1MGxsdlMxbGllZnM&usp=sharing

Entries should include:

- Proof of accuracy
- R code, with the organizer granted full rights to use it
- R code to support new additional test samples.

The prizes create some strange discontinuities. Someone with accuracy of 0.75 using 50% of the data gets $5K, but someone with accuracy of 0.76 using only 25% of the data gets more. On the other hand, someone using 49% of the test with 0.85 accuracy gets $9K, but if they use 50% of the test they get $15K. Surely a continuous bivariate function of accuracy and percentage would have been better.
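To make the discontinuities explicit, here is a sketch of the prize rules in Python. The interpretation of the boundary cases is my own reading of the announcement, not an official implementation.

```python
# Prize rules as stated in the announcement (my reading of the boundary cases).

def prize(accuracy, fraction_used):
    """Prize in dollars for a given accuracy and fraction of test samples used."""
    if fraction_used >= 0.5:
        if accuracy >= 0.85:
            return 15000
        if accuracy >= 0.75:
            return 5000
        return 0
    if fraction_used >= 0.25 and accuracy > 0.75:
        # $5K base plus $2K for each additional 0.05 of accuracy above 0.75.
        steps = int((accuracy - 0.75) / 0.05 + 1e-9)
        return 5000 + 2000 * steps
    return 0

print(prize(0.85, 0.30))                     # the announcement's example: $9K
print(prize(0.85, 0.49), prize(0.85, 0.50))  # jumps from $9K to $15K
```

Note the jump from $9K to $15K as the fraction of test samples used crosses 50%, with no change in accuracy.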

I also think this would have been better on Kaggle or CrowdAnalytix, but instead it has been posted on the R group on LinkedIn.

For all further questions, either ask via the comments on LinkedIn, or email the organizer Roni Kass

This is (roughly) what I said.

Statisticians seem to go through regular periods of existential crisis as they worry about other groups of people who do data analysis. A common theme is: all these other people (usually computer scientists) are doing our job! Don’t they know that statisticians are the best people to do data analysis? How dare they take over our discipline!

I take a completely different view. I think our discipline is in the best position it has ever been in. The demand for data analysis skills is greater than ever. Our graduates are highly sought after, and well paid. Being a statistician has even been described as a sexy profession (which presumably is a good thing to be!).

The different perspectives are all about inclusiveness. If we treat statistics as a narrow discipline, fitting models to data, and studying the properties of those models, then statistics is in trouble. But if we treat what we do as a broad discipline involving data analysis and understanding uncertainty, then the future is incredibly bright.

Here are two quotes from well-known bloggers in the last year or two:

April 2013: Larry Wasserman blog

Data science: the end of statistics?

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

November 2013: Andrew Gelman blog

Statistics is the least important part of data science

There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics as a subset of data science … Statistics is important—don’t get me wrong—statistics helps us correct biases … estimate causal effects … regularize so that we’re not overwhelmed by noise … fit models … visualize data … I love statistics! But it’s not the most important part of data science, or even close.

How can two professors of statistics have such different views on their discipline? The same perspectives can be seen in the following two diagrams (both reproduced with permission).

In the first narrow view, to be a data scientist you have to know a great deal about statistics, mathematics, computer science, programming, and the application discipline. If that’s true, I’ve never met a data scientist. I don’t believe they exist.

In the second broader view, everyone here is a data scientist, although we have different specializations and different perspectives and training.

I take the broad inclusive view. I am a data scientist because I do data analysis, and I do research on the methodology of data analysis. The way I would express it is that I’m a data scientist with a statistical perspective and training. Other data scientists will have different perspectives and different training.

We are comfortable with having medical specialists, and we will go to a GP, endocrinologist, physiotherapist, etc., when we have medical problems. We also need to take a team perspective on data science.

None of us can realistically cover the whole field, and so we specialise on certain problems and techniques. It is crazy to think that a doctor must know everything, and it is just as crazy to think a data scientist should be an expert in statistics, mathematics, computing, programming, the application discipline, etc. Instead, we need teams of data scientists with different skills, with each being aware of the boundary of their expertise, and who to call in for help when required.

Let’s not be too sectarian about our disciplines, thinking everyone not trained in the same way we were is a heretic.

It reminds me of a famous joke, written by comedian Emo Philips:

I was walking across a bridge one day, and I saw a man standing on the edge, about to jump off. I immediately ran over and said “Stop! Don’t do it!”

“Why shouldn’t I?” he said.

I said, “Well, there’s so much to live for!”

“Like what?”

“Well … are you religious or atheist?”

“Religious.”

“Me too! Are you Christian or Jewish?”

“Christian.”

“Me too! Are you Catholic or Protestant?”

“Protestant.”

“Me too! What franchise?”

“Baptist.”

“Wow! Me too! Northern Baptist or Southern Baptist?”

“Northern Baptist.”

“Me too! Are you Northern Conservative Baptist or Northern Liberal Baptist?”

“Northern Conservative Baptist.”

“Me too! Are you Northern Conservative Fundamentalist Baptist or Northern Conservative Reformed Baptist?”

“Northern Conservative Fundamentalist Baptist.”

To which I said, “Die, heretic scum!” and pushed him off.