
In demography, “coherent” forecasts are those in which males and females (or other sub-groups) do not diverge over time. (Essentially, we require the difference between the groups to be stationary.) When we wrote the 2008 paper, we did not know how to constrain the forecasts to be coherent in a functional data context, and so this was not discussed. My later 2013 paper provided a way of imposing coherence. This blog post shows how to implement both ideas using R.
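As a toy illustration of this idea (simulated data, nothing to do with the demographic models below): two series that share a common stochastic trend can each wander, while their difference remains stationary, so they never diverge.

```
# Toy illustration of coherence: "male" and "female" share a common
# stochastic trend, so each series is non-stationary but their
# difference is stationary noise.
set.seed(42)
common <- cumsum(rnorm(200))            # shared random-walk trend
male   <- common + rnorm(200, sd=0.2)   # sex-specific stationary deviations
female <- common + rnorm(200, sd=0.2)
sd(male)            # large: the series wanders
sd(male - female)   # small: the gap is just stationary noise
```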

First we need the historical mortality and fertility rates, and the population data for 1 January of the year following the last year of mortality and fertility data. As mortality data up to 2009 are easily available for Australia, I will use population data for 1 January 2010 in this example.

The population and mortality data can be obtained from the Human Mortality Database. You need a username and password, but registration is free.

```
library(demography)
# Get population data for 1 Jan 2010
pop2010 <- hmd.pop("AUS", username, password, "Australia")
pop2010 <- extract.ages(pop2010, age=0:100)
# Get mortality data
aus.mort <- hmd.mx("AUS", username, password, "Australia")
aus.mort <- extract.years(extract.ages(aus.mort, 0:100, combine.upper=FALSE),
                          year=1950:max(aus.mort$year))
```

The fertility data for some countries are available on the Human Fertility Database, but not for Australia. Instead, I have obtained the data from the Australian Bureau of Statistics and made them available in an R workspace (along with `pop2010` and `aus.mort`). Download it here.

Now we can construct the net migration data as the differences between the populations in consecutive years, taking account of any deaths or births.

```
# Compute net migration
aus.mig <- netmigration(aus.mort, aus.fert, mfratio=1.05)
```

Before continuing, it is important to plot the data to make sure everything looks in order.

```
# Population
plot(pop2010, series="male")
plot(pop2010, series="female")
# Mortality
plot(aus.mort, 'male', ylim=c(-10.4,-0.5))
plot(aus.mort, 'female', ylim=c(-10.4,-0.5))
# Fertility
plot(aus.fert)
# Net migration
plot(aus.mig, 'male', ylim=c(-4100,7800))
plot(aus.mig, 'female', ylim=c(-4100,7800))
```

Next we smooth all data using penalized regression splines.

```
ausmort.sm <- smooth.demogdata(aus.mort, obs.var="theoretical")
ausfert.sm <- smooth.demogdata(aus.fert, obs.var="theoretical")
ausmig.sm <- smooth.demogdata(aus.mig)
```

Now we are ready to fit some functional data models. This part is different from the approach in Hyndman and Booth (2008) as we can now fit coherent models to prevent forecasts of the two sexes from diverging.

```
# Mortality
mort.fit <- coherentfdm(ausmort.sm)
plot(residuals(mort.fit$product))
plot(residuals(mort.fit$ratio$male))
# Fertility
fert.fit <- fdm(ausfert.sm)
plot(residuals(fert.fit))
# Migration
mig.fit <- coherentfdm(ausmig.sm)
plot(residuals(mig.fit$product))
plot(residuals(mig.fit$ratio$male))
```

The residuals for each model are plotted to ensure things look ok. In fact, the fertility residuals show up some weird problems with the data, but I’ll ignore that for now and proceed.

Although the components of migration are called “product” and “ratio”, they are actually sums and differences as we do not use a log transformation for migration.
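To see why the names still make sense, here is a minimal numeric sketch of the product-ratio decomposition, assuming the definitions of the 2013 paper (geometric mean for the product, square-root ratio for the ratio); the rates below are made up for illustration, not Australian data.

```
# Hypothetical male/female mortality rates at two ages
mM <- c(0.012, 0.008)
mF <- c(0.010, 0.007)
p <- sqrt(mM * mF)   # "product": geometric mean of the two rates
r <- sqrt(mM / mF)   # "ratio": square root of the male/female ratio
# The decomposition is exact: p*r and p/r recover the original rates
all.equal(p * r, mM)  # TRUE
all.equal(p / r, mF)  # TRUE
```

On the log scale (used for mortality), the product and ratio become sums and differences, which is exactly what the migration components are on the untransformed scale.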

Next we fit forecasting models to the coefficients of each functional data model. Some plots are produced in each case to check that the results look reasonable.

```
# Mortality
mortf <- forecast(mort.fit, h=50, max.d=1)
plot(mortf$product, 'c', comp=2)
plot(mort.fit$product$y, col='gray', ylim=c(-11,-0.5),
     main="Mortality forecasts product: 2010-2059")
lines(mortf$product)
plot(mortf$ratio$male, 'c', comp=2)
plot(mort.fit$ratio$male$y, col='gray',
     main="Mortality forecasts ratio (M/F): 2010-2059")
lines(mortf$ratio$male)
# Fertility
fertf <- forecast(fert.fit, h=50, max.d=1)
plot(fertf, 'c', comp=2)
plot(ausfert.sm, col='gray', main="Fertility forecasts: 2010-2059")
lines(fertf)
# Migration
migf <- forecast(mig.fit, h=50, stationary=TRUE)
plot(migf$product, 'c', comp=2)
plot(mig.fit$product$y, col='gray', main="Migration forecasts product: 2010-2059")
lines(migf$product)
plot(migf$ratio$male, 'c', comp=2)
plot(mig.fit$ratio$male$y, col='gray',
     main="Migration forecasts ratio (M/F): 2010-2059")
lines(migf$ratio$male)
```

Finally, we can simulate future populations:

```
aus.sim <- pop.sim(mortf, fertf, migf, firstyearpop=pop2010, N=1000)
```

The value of `N` determines the number of sample paths to simulate. I normally set it to a very low number (e.g., 5) at first to ensure there are no bugs in the code and that the results look reasonable. Then I set it to at least 1000 and leave it to run while I make myself a coffee.

The object `aus.sim` is a list of two arrays of dimension 101 x 50 x `N`, one for males and one for females. The first dimension is age (0:100), the second dimension is the forecast horizon (2010:2059), while the third dimension gives the simulated replicates.

The simulations can be used to compute any quantities that can be derived from populations numbers by sex and age. For example, the mean male population for each age for the next 50 years:

```
popm.mean <- apply(aus.sim$male, c(1,2), mean)
```
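Other quantities are just as easy. For example, here is a sketch of an 80% interval for the total population in a given forecast year. So that the snippet is self-contained, it uses a toy array with the same shape as `aus.sim$male`; with the real simulations you would use `aus.sim` directly.

```
# Toy arrays shaped like aus.sim: 101 ages x 50 years x N replicates
set.seed(1)
N <- 200
sim <- list(male   = array(rexp(101*50*N, rate=1/1000), dim=c(101,50,N)),
            female = array(rexp(101*50*N, rate=1/1000), dim=c(101,50,N)))
# Total population in each forecast year, for each replicate
tot <- apply(sim$male + sim$female, c(2,3), sum)
# 80% prediction interval for the 20th forecast year
quantile(tot[20,], prob=c(0.1, 0.9))
```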

We can also plot population pyramids with prediction intervals.

```
# Means and intervals
popm.mean <- apply(aus.sim$male, c(1,2), mean)
popm.lo <- apply(aus.sim$male, c(1,2), quantile, p=0.1)
popm.hi <- apply(aus.sim$male, c(1,2), quantile, p=0.9)
popf.mean <- apply(aus.sim$female, c(1,2), mean)
popf.lo <- apply(aus.sim$female, c(1,2), quantile, p=0.1)
popf.hi <- apply(aus.sim$female, c(1,2), quantile, p=0.9)
# Plot of total population age structure in 2030
par(mar=c(4.3,3.9,3.0,3.6))
plot(c(0,100), c(0,100), type="n", main="Forecast population: 2030",
     yaxt="n", xlab="Population ('000)", ylab="Age", xaxt="n",
     xlim=c(-1,1)*max(abs(popm.hi[,20]), abs(popf.hi[,20]))/1e3)
mtext("Male Female")
abline(v=0)
axis(1, at=seq(-300,0,by=50), labels=rev(seq(0,300,by=50)))
axis(1, at=seq(50,300,by=50))
axis(2, at=seq(0,100,by=10), labels=seq(0,100,by=10))
axis(4, at=seq(0,100,by=10), labels=seq(0,100,by=10))
axis(2, at=seq(0,100,by=20), labels=seq(0,100,by=20))
axis(4, at=seq(0,100,by=20), labels=seq(0,100,by=20))
axis(4, at=50, line=1.5, labels="Age", col=2, tick=FALSE)
polygon(c(-popm.lo[,20], rev(-popm.hi[,20]))/1000, c(0:100,100:0),
        col=rgb(0,0,1,alpha=0.4), border=FALSE)
polygon(c(popf.lo[,20], rev(popf.hi[,20]))/1000, c(0:100,100:0),
        col=rgb(0,0,1,alpha=0.4), border=FALSE)
lines(-popm.mean[,20]/1000, 0:100, col="blue", lwd=2)
lines(popf.mean[,20]/1000, 0:100, col="blue", lwd=2)
```

The shaded blue regions denote 80% prediction intervals for each age. Notice how certain we can be of the elderly population even twenty years ahead (due to the predictability of mortality rates) but how uncertain we are of the future young population (due to the lack of predictability in fertility rates).


```
# Compute AR roots
arroots <- function(object)
{
  if(class(object) != "Arima" & class(object) != "ar")
    stop("object must be of class Arima or ar")
  if(class(object) == "Arima")
    parvec <- object$model$phi
  else
    parvec <- object$ar
  if(length(parvec) > 0)
  {
    last.nonzero <- max(which(abs(parvec) > 1e-08))
    if(last.nonzero > 0)
      return(structure(
        list(roots=polyroot(c(1,-parvec[1:last.nonzero])), type="AR"),
        class='armaroots'))
  }
  return(structure(list(roots=numeric(0), type="AR"), class='armaroots'))
}

# Compute MA roots
maroots <- function(object)
{
  if(class(object) != "Arima")
    stop("object must be of class Arima")
  parvec <- object$model$theta
  if(length(parvec) > 0)
  {
    last.nonzero <- max(which(abs(parvec) > 1e-08))
    if(last.nonzero > 0)
      return(structure(
        list(roots=polyroot(c(1,parvec[1:last.nonzero])), type="MA"),
        class='armaroots'))
  }
  return(structure(list(roots=numeric(0), type="MA"), class='armaroots'))
}

plot.armaroots <- function(x, xlab="Real", ylab="Imaginary",
    main=paste("Inverse roots of", x$type, "characteristic polynomial"), ...)
{
  oldpar <- par(pty='s')
  on.exit(par(oldpar))
  plot(c(-1,1), c(-1,1), xlab=xlab, ylab=ylab,
       type="n", bty="n", xaxt="n", yaxt="n", main=main, ...)
  axis(1, at=c(-1,0,1), line=0.5, tck=-0.025)
  axis(2, at=c(-1,0,1), labels=c("-i","0","i"), line=0.5, tck=-0.025)
  # Draw the unit circle and axes
  circx <- seq(-1, 1, l=501)
  circy <- sqrt(1 - circx^2)
  lines(c(circx,circx), c(circy,-circy), col='gray')
  lines(c(-2,2), c(0,0), col='gray')
  lines(c(0,0), c(-2,2), col='gray')
  if(length(x$roots) > 0)
  {
    inside <- abs(x$roots) > 1
    points(1/x$roots[inside], pch=19, col='black')
    if(sum(!inside) > 0)
      points(1/x$roots[!inside], pch=19, col='red')
  }
}
```

The `arroots` function will return the autoregressive roots from the AR characteristic polynomial, while the `maroots` function will return the moving average roots from the MA characteristic polynomial. Both functions take an `Arima` object as their only argument. If a seasonal ARIMA model is passed, the roots from both polynomials are computed in each case.

The `plot.armaroots` function will plot the inverse of the roots on the complex unit circle. A causal invertible model should have all the roots outside the unit circle. Equivalently, the inverse roots should lie inside the unit circle.
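As a quick numeric check of the idea (using an AR(1) with φ = 0.5, chosen purely for illustration):

```
# AR(1) with phi = 0.5: the characteristic polynomial is 1 - 0.5z
root <- polyroot(c(1, -0.5))
Mod(root)        # 2: the root lies outside the unit circle
Mod(1/root) < 1  # TRUE: its inverse lies inside, so the model is stationary
```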

Here are a couple of examples demonstrating their use.

A simple example with three AR roots:

```
library(forecast)
plot(arroots(Arima(WWWusage, c(3,1,0))))
```

A more complicated example with ten AR roots and four MA roots. (This is not actually the best model for these data.)

```
library(forecast)
fit <- Arima(woolyrnq, order=c(2,0,0), seasonal=c(2,1,1))
par(mfrow=c(1,2))
plot(arroots(fit), main="Inverse AR roots")
plot(maroots(fit), main="Inverse MA roots")
```

Finally, here is an example where two inverse roots lie outside the unit circle (shown in red).

```
library(fma)
plot(arroots(ar.ols(jcars)))
```

Note that the `Arima` function will never return a model with inverse roots outside the unit circle. The `auto.arima` function is even stricter and will not select a model with roots close to the unit circle either, as such models are unlikely to be good for forecasting.

I won’t add these functions to the `forecast` package as I don’t think enough people would use them, and the package is big enough as it is. So I’m making them available here instead for anyone who wants to use them.


Last week at my research group meeting, I spoke about some of the differences I have noticed. Coincidentally, Andrew Gelman blogged about the same issue a day later.

Econometrics is often “theory driven” while statistics tends to be “data driven”. I discovered this in the interview for my current job when someone criticized my research for being “data driven” and asked me to respond. I was confused because I thought statistical research *should* be driven by data analytic issues, not by some pre-conceived theory, but that was not the perspective of the people interviewing me. (Fortunately, I was hired anyway.) Typically, econometricians test theory using data, but often do little if any exploratory data analysis. On the other hand, I tend to build models after looking at data sets. I think this distinction also extends to many other areas where statistics is applied.

As a result of this distinction, econometricians do a lot of hypothesis testing but produce few graphics. Many research seminars in our department involve someone describing a model, applying it to some data, and showing the estimated parameters, standard errors, results of various hypothesis tests, etc. They do all that without ever plotting the data to start with! This seems bizarre to me, and I still get annoyed about it even though I’ve seen it at least a hundred times. I teach my students to first spend time getting to know their data through plots and other data visualization methods before even thinking about fitting a model or doing a hypothesis test.

Probably because of the emphasis that econometricians place on their theoretical models, they tend to fall in love with them and even seem to believe they are true. This is evident by the phrase “data generating process” (or its acronym DGP) that econometricians commonly use to describe a statistical model. I never think of my models as data generating processes. The data come from some real world, complicated, messy, nonlinear, nonstationary, nonGaussian process. At best, my model is a crude approximation. I often cite Box’s maxim that “All models are wrong, but some are useful”, and while my colleagues would agree in principle, they still behave as if their models are the true data generating processes.

When I first joined an econometrics department, I was struck by how much everyone knew about time series and regression, and how little they knew about a lot of other topics. There are vast areas of statistics that econometricians typically know little about including survey sampling, discriminant analysis, clustering, and the design of experiments. My training was much broader but in some ways shallower. There were standard undergraduate topics in econometrics that I knew nothing about — cointegration, endogeneity, ARCH/GARCH models, seemingly unrelated regression, the generalized method of moments, and so on.

Because of the nature of economic data, econometricians have developed some specific techniques for handling time series and regression problems. In particular, econometricians have thought very carefully about causality, because it is usually not possible to conduct experiments within economics and finance, and so they have developed several methods to help identify potentially causal relationships. These developments do not always filter back to the general statistical community, although they can be very useful. For example, the method of instrumental variables (which allows consistent estimation when the explanatory variables are correlated with the error term of a regression model) can be used to help identify potentially causal relationships. Tests for “Granger causality” are another useful econometric development.

For some reason, econometricians have never really taken on the benefits of the generalized linear modelling framework. So you are more likely to see an econometrician use a probit model than a logistic regression, for example. Probit models tended to go out of fashion in statistics after the GLM revolution prompted by Nelder and Wedderburn (1972).

The two communities have developed their own sets of terminology that can be confusing. Sometimes they have different terms for the same concept; for example, “longitudinal data” in statistics is “panel data” in econometrics; “survival analysis” in statistics is “duration modelling” in microeconometrics.

In other areas, they use the same term for different concepts. For example, a “robust” estimator in statistics is one that is insensitive to outliers, whereas a “robust” estimator in econometrics is insensitive to heteroskedasticity and autocorrelation. A “fixed effect” in statistics is a non-random regression term, while a “fixed effect” in econometrics means that the coefficients in a regression model are time-invariant. This obviously has the potential for great confusion, which is evident in the Wikipedia articles on fixed effects and robust regression.

I’ve stayed in a (mostly) econometrics department for so long because it is a great place to work, full of very nice people, and is much better funded than most statistics departments. I’ve also learned a lot, and I think the department has benefited from having a broader statistical influence than if they had only employed econometricians.

I would encourage econometricians to read outside the econometrics literature so they are aware of what is going on in the broader statistical community. These days, most research econometricians do pay some attention to *JASA* and *JRSSB*, so the gap between the research communities is shrinking. However, I would suggest that econometricians add *Statistical Science* and *JCGS* to their reading list, to get a wider perspective.

I would encourage statisticians to keep abreast of methodological developments in econometrics. A good place to start is Hayashi’s graduate textbook *Econometrics* which we use at Monash for our PhD students.

One thing I have noticed in the last seventeen years is that the two communities are not so far apart as they once were. Nonparametric methods were once hardly mentioned in econometrics (too “data-driven”), and now the main econometrics journals are full of nonparametric asymptotics. There are special issues of statistical journals dedicated to econometrics (e.g., CSDA has regular special issues dedicated to computational econometrics).

Just as US television has made the Australian culture rather less distinctive than it once was, statistical ideas are infiltrating econometrics, and vice-versa. But until I hear a research seminar on Visualization of Macroeconomic Data, I don’t think I will ever feel entirely at home.

Some of these thoughts were prompted by this discussion on crossvalidated.com.

The article has been updated to reflect some of the comments made below. Thanks for the feedback.


The simplest approach is to estimate the model on a single set of training data, and then compute one-step forecasts on the remaining test data. This can be handled by applying the fitted model to the whole data set, and then extracting the “fitted values” which are simply one-step forecasts.

```
library(fpp)
train <- window(hsales, end=1989.99)
fit <- auto.arima(train)
refit <- Arima(hsales, model=fit)
fc <- window(fitted(refit), start=1990)
```

For multi-step forecasts, a loop is required. The following example computes 5-step forecasts:

```
h <- 5
train <- window(hsales, end=1989.99)
test <- window(hsales, start=1990)
n <- length(test) - h + 1
fit <- auto.arima(train)
fc <- ts(numeric(n), start=1990+(h-1)/12, freq=12)
for(i in 1:n)
{
  x <- window(hsales, end=1989.99 + (i-1)/12)
  refit <- Arima(x, model=fit)
  fc[i] <- forecast(refit, h=h)$mean[h]
}
```

An alternative approach is to extend the training data and re-estimate the model at each iteration, before each forecast is computed. This is what I call “time series cross-validation” because it is analogous to leave-one-out cross-validation for cross-sectional data. This time, I will store the forecasts from 1 to 5 steps ahead at each iteration.

```
# Multi-step, re-estimation
h <- 5
train <- window(hsales, end=1989.99)
test <- window(hsales, start=1990)
n <- length(test) - h + 1
fit <- auto.arima(train)
order <- arimaorder(fit)
fcmat <- matrix(0, nrow=n, ncol=h)
for(i in 1:n)
{
  x <- window(hsales, end=1989.99 + (i-1)/12)
  refit <- Arima(x, order=order[1:3], seasonal=order[4:6])
  fcmat[i,] <- forecast(refit, h=h)$mean
}
```

A variation on this also re-selects the model at each iteration. Then the second line in the loop is replaced with

```
refit <- auto.arima(x)
```

Information about past SAS-IIF awards is given on the IIF website. It is interesting to see the range of topics covered. Here are the winning projects in the last two years:

- Jeffrey Stonebraker: “Probabilistic Forecasting of the Global Demand for the Treatment of Hemophilia B.”
- Yongchen (Herbert) Zhao: “Robust Real-Time Automated Forecast Combination in SAS: Development of a SAS Procedure and a Comprehensive Evaluation of Recently Developed Combination Methods.”
- Zoe Theocharis, Nigel Harvey, Leonard Smith: “Improving judgmental input to hurricane forecasts in the insurance and reinsurance sector.”
- Elena-Ivona Dumitrescu, Janine Christine Balter, Peter Reinhard Hansen: “Forecasting Exchange Rate Volatility: Multivariate Realized GARCH Framework.”
- Yorghos Tripodis: “Forecasting the Cognitive Status in an Aging Population.”

It’s a nice introduction to trees, bagging and forests, plus a very brief entree to the LASSO and the elastic net, and to spike-and-slab regression. Not enough to be able to use them, but ok if you’ve no idea what they are.

It was more disappointing on boosting (completely ignoring the fact that boosting can be applied in a regression context as well as a classification context), and his comments on causality seemed curiously naive. His suggested approach involved forecasting using all variables but the one that is considered causal, and then comparing the results against what actually happened. That seems at least as likely to lead to false conclusions on causality as instrumental variables or differences-in-differences. Although Varian cites Pearl’s work approvingly, I doubt that Pearl would return the favour.

On a positive note, his Bayesian Structural Time Series model (which I heard him speak about in Rome 12 months ago) seems interesting and very useful. I wonder when the promised R package will appear?


All aggregation structures can be represented as hierarchies or as cross-products of hierarchies. For example, a hierarchical time series may be based on geography: country, state, region, store. Often there is also a separate product hierarchy: product groups, product types, packet size. Forecasts of all the different types of aggregation are required; e.g., product type A within region X. The aggregation structure is a cross-product of the two hierarchies.

This framework includes even apparently non-hierarchical data: consider the simple case of a time series of deaths split by sex and state. We can consider sex and state as two very simple hierarchies with only one level each. Then we wish to forecast the aggregates of all combinations of the two hierarchies.

Any number of separate hierarchies can be combined in this way. Non-hierarchical factors such as sex can be treated as single-level hierarchies.

The `hts` package stores the data only at the bottom (most disaggregated) level, and records information about the various types of aggregates that are of interest. The `hts()` function is appropriate for a single hierarchy (i.e., strictly hierarchical data). More complicated aggregation structures can be specified using the more general `gts()` function.

Here is an example, based on a question asked on stackoverflow. The problem involves a geographical hierarchy and an industrial classification hierarchy.

Suppose there are two states with four and five counties respectively, and two industries with three and two sub-industries respectively. So there are 9 x 5 = 45 series at the most disaggregated level (sub-industry x county combinations). I will call the states A and B, and the counties A1,A2,A3,A4 and B1,B2,B3,B4,B5. I will call the industries X and Y, with sub-industries Xa,Xb,Xc and Ya,Yb respectively. Suppose you have the bottom level series (the most disaggregated level) in a matrix `y`, with one column per series, and columns in the following order:

```
County A1, industry Xa
County A1, industry Xb
County A1, industry Xc
County A1, industry Ya
County A1, industry Yb
County A2, industry Xa
County A2, industry Xb
County A2, industry Xc
County A2, industry Ya
County A2, industry Yb
...
County B5, industry Xa
County B5, industry Xb
County B5, industry Xc
County B5, industry Ya
County B5, industry Yb
```

So that we have a reproducible example, I will create `y` randomly:

```
y <- ts(matrix(rnorm(900),ncol=45,nrow=20))
```

Then we can construct labels for the columns of this matrix as follows:

```
blnames <- paste(c(rep("A",20),rep("B",25)),     # State
                 rep(c(1:4,1:5),each=5),         # County
                 rep(c("X","X","X","Y","Y"),9),  # Industry
                 rep(c("a","b","c","a","b"),9),  # Sub-industry
                 sep="")
colnames(y) <- blnames
```

For example, the first series in the matrix has name `"A1Xa"`, meaning state A, county 1, industry X, sub-industry a.

We can then easily create the grouped time series object using

```
gy <- gts(y, characters=list(c(1,1),c(1,1)))
```

Only the bottom level series are contained in `y`. The `characters` argument specifies what aggregations are of interest. In this case, it indicates that there are two hierarchies (two elements in the list); the first hierarchy is specified by the first two characters, and the second hierarchy by the next two characters. Each level of each hierarchy is specified using a single character (hence the 1s).

A slightly more complicated but analogous example (with labels taking more than one character each) is given in the help file for `gts` in v4.3 of the `hts` package.

It is possible to specify the grouping structure without using column labels. Then you have to specify the groups matrix which defines what aggregations are of interest. In the example above, the groups matrix is given by

```
gps <- rbind(
c(rep(1,20),rep(2,25)), # State
rep(1:9,each=5), # County
rep(c(1,1,1,2,2),9), # Industry
rep(1:5, 9), # Sub-industry
c(rep(c(1,1,1,2,2),4),rep(c(3,3,3,4,4),5)), # State x industry
c(rep(1:5, 4),rep(6:10, 5)), # State x Sub-industry
rep(1:18, rep(c(3,2),9)) # County x industry
)
```

The order of the rows does not matter. Each row specifies an aggregation of the bottom level series that is of interest.
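To see how one row of the matrix acts, here is a self-contained sketch (toy data, columns in the order given above): summing the bottom-level columns that share a label in the “Industry” row produces the two industry aggregates.

```
# Toy bottom-level data: 20 periods x 45 series
y <- ts(matrix(rnorm(900), ncol=45, nrow=20))
industry <- rep(c(1,1,1,2,2), 9)       # the "Industry" row of gps
ind.agg <- t(rowsum(t(y), industry))   # 20 x 2: aggregates for X and Y
# Column 1 is industry X: the sum of all Xa, Xb, Xc columns
all.equal(ind.agg[,1], rowSums(y[, industry == 1]))  # TRUE
```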

Then

```
gy <- gts(y, groups=gps)
```

The advantage of using the `characters` argument is that the cross-products are handled for you. Also, if your data already come with helpful column names that can be interpreted as specifying levels of one or more hierarchies, then there is really nothing to do but figure out what the `characters` argument should be.

Once the `gts` object has been created using the `gts()` function, you can proceed to forecast. For example,

```
fc <- forecast(gy)
```

will generate forecasts for all the bottom level series and all the aggregate series specified in the call to `gts()`. Then it will reconcile the forecasts so that they add up for all the specified aggregations, and finally it returns only the reconciled bottom level series. The reconciled aggregated series can easily be constructed from these when they are required.
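For instance, once you have the reconciled bottom-level forecasts as a matrix (time x 45 series, in the column order above), any aggregate is just a sum over the relevant columns. A sketch with a stand-in matrix:

```
# Stand-in for reconciled bottom-level forecasts: 10 horizons x 45 series
bf <- matrix(rnorm(450), nrow=10, ncol=45)
state <- c(rep(1,20), rep(2,25))       # the "State" row of gps
state.fc <- t(rowsum(t(bf), state))    # reconciled forecasts for states A and B
dim(state.fc)                          # 10 2
```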

**24–25 June**. Functional time series with applications in demography. Humboldt University, Berlin.


Instead, I produced a cut-down version of my beamer slides, leaving out some of the animations and other features that will not print easily. Then I produced a pdf file with several slides per page.

The beamer manual suggests using the handout option with the pgfpages package, like this:

```
\documentclass[handout]{beamer}
\usepackage{pgfpages}
\pgfpagesuselayout{4 on 1}[a4paper,border shrink=5mm]
```

Unfortunately that won’t work for me because I also use the `textpos` package to enable precisely positioned overlays, and pgfpages does not respect the positioning of the overlays.

Instead, the following works. First, add the handout option:

```
\documentclass[14pt,handout]{beamer}
```

Note that I always use `14pt` to make the text large enough to be easily read in a large lecture theatre.

Then I wrap some of the slides in `\mode<beamer>{` … `}`. These won’t appear in the handout, but they will appear when I remove the handout option from `\documentclass`. Typically the slides I remove contain animations, photographs, and some dynamic overlays.
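For example, a hypothetical frame containing a photograph could be wrapped like this (the frame title and image file are made up for illustration):

```
% This frame appears only in presentation mode; with the handout
% option active, it is dropped entirely.
\mode<beamer>{
\begin{frame}{A slide with a photo}
  \centering
  \includegraphics[height=0.8\textheight]{photo}% photo is a hypothetical image file
\end{frame}
}
```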

When the file is compiled, each slide should now be printable. But I want several slides on each page. So I then use the **pdfnup** command (sorry Windows users, you need a real computer to do this):

```
pdfnup --a4paper --keepinfo --nup 1x3 --frame true --scale 0.92 --no-landscape --outfile handout.pdf slides.pdf
```

To see an example, here are some slides from a talk I am giving in Switzerland, and here is the handout.


**Experfy** provides a way for companies to find statisticians and other data scientists, either for short-term consultancies, or to fill full-time positions. They describe their “providers” as “Data Engineers, Data Scientists, Data Mining Experts, Data Analyst/Modelers, Big Data Solutions Architects, Visualization Designers, Statisticians, Applied Physicists, Mathematicians, Econometricians and Bioinformaticians.” Data scientists can sign up as “Providers”, companies can sign up as “Clients”.

Experfy makes money by taking a cut of any transaction made on the site.

Providers are subject to a selection process, although it is not clear what the criteria are. I heard about Experfy because they invited me to be a provider. I have far too much work already without needing more, but it looks like an interesting and useful service.

Experfy is a more specialized version of Zombal which covers science professionals generally, and not just data scientists.

**SnapAnalytx** takes a different approach: data scientists post their algorithms, which are then hosted on the site. Anyone wanting to use an algorithm can upload their own data, train the model and get predictions. The “model author” can interact with the users on the site. So essentially SnapAnalytx provides the cloud hosting and computing infrastructure, and a way for data scientists and their clients to interact online.

The “unit” for sale with SnapAnalytx is a model or algorithm, whereas the “unit” for sale with Experfy is a person.

I imagine that the SnapAnalytx approach would be more suited to some problems than others. My algorithm for hierarchical forecasting would probably work well on such a platform as it takes a lot of computing power (for large hierarchies) and is suited to parallel processing. (I assume SnapAnalytx allows multiple processes.) It also works out-of-the-box for a lot of problems.

On the other hand, my algorithm for electricity demand forecasting would probably not work well on this platform as we have to tailor the model carefully to each particular region, so having a generic cloud-hosted algorithm is unlikely to give effective forecasts.

SnapAnalytx makes money from both model providers and model users. They charge $99 per month per model to providers to list models in the catalog (which seems to me like a huge cost, but perhaps there is a lot of manual work in setting up every model), and then each user is charged a fee for using the model. SnapAnalytx retains part of the user fees, and the rest goes to the model provider.

It will be interesting to see if these market places survive, and if any competition develops. Feel free to add links to competing services, or other data science market places in the comments below.
