parsimonious pursuits

Native R versus tensorflow for matrix math speed comparison

2019-02-01T12:27:00.000-08:00

https://www.dummies.com/web-design-development/other-web-software/create-vector-matrix-operations-tensorflow/

Out of curiosity, I coded up a simple speed comparison of a matrix multiplication problem in native R versus tf$matmul in the tensorflow package in R. This was on my laptop, and tensorflow is meant for distributed computing on multiple CPUs/GPUs, so this isn't the right sort of application for tensorflow. Nonetheless, tensorflow has tensor multiplication functions which would make some of the MARSS EM calculations more compact, avoiding for loops.

For this test, I had a 30x60x900 array. I do the following tests:

X[j,,] %*% t(X[j,,]) and sum across 1st dim
X[,,j] %*% t(X[,,j]) and sum across 3rd dim

The second test requires 2 calls to tf$transpose which slows things down quite a bit, but matches the way I have arrays stored in MARSS (with time in 3rd dim). The 2 calls to transpose can be avoided by using tf$einsum but that doesn't seem to give any speed gains.
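As a base R point of comparison (not part of the timings reported here), both sums can also be written without a for loop by flattening the array into one wide matrix and calling tcrossprod once. This is only a sketch; the function names `sum_dim1` and `sum_dim3` are hypothetical and do not come from MARSS or tensorflow.

```r
# Sum over j of X[j,,] %*% t(X[j,,]) with a single tcrossprod call.
sum_dim1 <- function(X) {
  d <- dim(X)
  # move dim 2 to the front, then flatten dims (1,3) into columns
  tcrossprod(matrix(aperm(X, c(2, 1, 3)), nrow = d[2]))
}

# Sum over j of X[,,j] %*% t(X[,,j]); the array is already laid out
# as [X[,,1] X[,,2] ...] in column-major order.
sum_dim3 <- function(X) {
  tcrossprod(matrix(X, nrow = dim(X)[1]))
}

set.seed(1)
X <- array(rnorm(3 * 4 * 5), dim = c(3, 4, 5))

# explicit loops for comparison
loop1 <- Reduce(`+`, lapply(1:3, function(j) tcrossprod(X[j, , ])))
loop3 <- Reduce(`+`, lapply(1:5, function(j) tcrossprod(X[, , j])))

all.equal(sum_dim1(X), loop1)
all.equal(sum_dim3(X), loop3)
```

Whether this is faster than the loop or than tensorflow was not timed here.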

Test 1: tensorflow (funy) is about 2x faster than native R (funx). So not much gain. Note it is only faster for big matrices; it is much slower for small matrices.

Unit: milliseconds
    expr      min       lq     mean   median       uq       max neval
  funy() 36.75568 43.13249 48.92774 46.30748  51.9293  70.86605   100
 funx(X) 79.40066 86.42067 97.32483 95.05786 103.7837 229.00092   100

Test 2: tensorflow (funy) is about 2x slower than native R (funx) due to the extra tf$transpose calls. However, this depends on the size of the 3rd dimension. When I changed it so that the array was n^2 x 2n x n, the tensorflow code was faster.

Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval
  funy() 73.45847 80.07081 85.67677 86.22445 91.47107 97.61655    10
 funx(X) 42.00226 43.86015 46.94928 46.94643 48.94002 55.19185    10

Code is below.

# This takes a 3D array, multiplies and sums up along 1st dimension
# (uses the TensorFlow 1.x graph API: placeholders and sessions)

library(tensorflow)
library(microbenchmark)
library(ggplot2) # provides the autoplot() generic

x <- tf$placeholder("float", shape=shape(NULL,NULL,NULL))
y <- tf$matmul(x, tf$transpose(x, perm=c(0L,2L,1L)))
z <- tf$reduce_sum(y, c(0L))
sess <- tf$Session() # session used by funy() below

# write comparison funcs

funx <- function(x){
  tmp <- tcrossprod(x[1,,])
  for(i in 2:dim(x)[1]){
    tmp <- tcrossprod(x[i,,])+tmp
  }
  tmp
}
funy <- function(){ sess$run(z, feed_dict = dict(x = X)) }

n <- 30
# fill the whole array; rnorm(n+2*n+3*n) would be silently recycled
X <- array(rnorm(n * 2*n * n^2), dim=c(n, 2*n, n^2))
tmp <- sess$run(z, feed_dict = dict(x = X))

mx <- microbenchmark( funy(), funx(X))
autoplot(mx)

######

# This takes a 3D array, multiplies and sums up along 3rd dimension

library(tensorflow)
library(microbenchmark)
library(ggplot2) # provides the autoplot() generic

x <- tf$placeholder("float", shape=shape(NULL,NULL,NULL))
y <- tf$matmul(tf$transpose(x, perm=c(2L,0L,1L)), tf$transpose(x, perm=c(2L,1L,0L)))
# note you can write this more succinctly with einsum but it doesn't speed things up
# y = tf$einsum('ijl,kjl->ikl', x, x) # time index is last here, so sum with tf$reduce_sum(y, c(2L))
z <- tf$reduce_sum(y, c(0L))
sess <- tf$Session() # session used by funy() below

# write comparison funcs

funx <- function(x){
  tmp <- tcrossprod(x[,,1])
  for(i in 2:dim(x)[3]){
    tmp <- tcrossprod(x[,,i])+tmp
  }
  tmp
}
funy <- function(){ sess$run(z, feed_dict = dict(x = X)) }

n <- 30
# fill the whole array; rnorm(n+2*n+3*n) would be silently recycled
X <- array(rnorm(n * 2*n * n^2), dim=c(n, 2*n, n^2))
tmp <- sess$run(z, feed_dict = dict(x = X))

mx <- microbenchmark( funy(), funx(X), times=10)
autoplot(mx)

Notes on computing the Fisher Information matrix for MARSS models. Part IV Recursion in Harvey 1989

2017-05-31T17:19:00.000-07:00

MathJax and blogger can be iffy. Try reloading if the equations don't show up.

Notes on computing the Fisher Information matrix for MARSS models Part I Background, Part II Louis 1982, Part III Overview of Harvey 1989.

Part III introduced the approach of Harvey (1989) for computing the expected and observed Fisher Information matrices by using the prediction error form of the log-likelihood function. Here I show the Harvey (1989) recursion on page 143 for computing the derivatives in his equations.

Derivatives needed for the 2nd derivative of the conditional log-likelihood

Equations 3.4.66 and 3.4.69 in Harvey (1989) have first and second derivatives of $v_t$ and $F_t$ with respect to $ \theta_i $ and $\theta_j$. These in turn involve derivatives of the parameter matrices and of $\tilde{x}_{t|t}$ and $\tilde{V}_{t|t}$. Harvey shows all the first derivatives, and it is easy to compute the second derivatives by taking the derivatives of the first.

The basic idea of the recursion is simple, if a bit tedious.

First we set up matrices for all the first derivatives of the parameters.
Then starting from t=1 and working forward, we will do the recursion (described below) for all $\theta_i$ and we store the first derivatives of $v_t$, $F_t$, $\tilde{x}_{t|t}$ and $\tilde{V}_{t|t}$ with respect to $\theta_i$.
Then we go through the parameter vector a second time, to get all the second derivatives with respect to $\theta_i$ and $\theta_j$.
We input the first and second derivatives of $v_t$ and $F_t$ into equations 3.4.66 and 3.4.69 to get the observed Fisher Information at time t and add to the Fisher Information from the previous time step. The Fisher Information matrix is symmetric, so we can use an outer loop from $\theta_1$ to $\theta_p$ ($p$ is the number of parameters) and an inner loop from $\theta_i$ to $\theta_p$. Because the inner loop includes the diagonal, that will be $p(p+1)/2$ loops for each time step.

The end result will be the observed Fisher Information matrix using equation 3.4.66 and using 3.4.69.

Outline of the loops in the recursion

This is a forward recursion starting at t=1. We will save the previous time step's $ \partial v_t / \partial \theta_i $ and $ \partial F_t / \partial \theta_i $. That will be p x 2 (n x 1) vectors and n x 2 (n x n) matrices. We do not need to store all the previous time steps since this is a one-pass recursion, unlike the Kalman smoother, which is forward-backward.

Set-up
Number of parameters = p.
Create Iijt and oIijt which are p x p matrices.
Create dvit which is a n x p matrix. n Innovations and p $\theta_i$.
Create d2vijt which is a n x p x p array. n Innovations and p $\theta_i$.
Create dFit which is a n x n x p array. n x n Sigma matrix and p $\theta_i$.
Create d2Fijt which is a n x n x p x p array. n x n Sigma matrix and p $\theta_i$.

Outer loop from t=1 to t=T.
Inner loop over all MARSS parameters: x0, V0, Z, a, R, B, u, Q. This is par$Z, e.g., and is a vector of the estimated parameter elements in Z.
Inner loop over parameters in parameter matrix, so, e.g. over the rows in the column vector par$Z.
Keep track of what parameter element I am on via p counter.

The form of the parameter derivatives

Within the recursion, we have terms like $ \partial M/\partial \theta_i$, where M means some parameter matrix. We can write M as $ vec(M)=f+D\theta_m $, where $\theta_m$ is the vector of parameters that appear in M. This is the way that matrices are written in Holmes (2010). So \begin{equation} \begin{bmatrix}2a+c&b\\b&a+1\end{bmatrix} \end{equation} is written in vec form as \begin{equation} \begin{bmatrix}0\\0\\0\\1\end{bmatrix}+\begin{bmatrix}2&0&1\\ 0&1&0\\ 0&1&0\\ 1&0&0 \end{bmatrix}\begin{bmatrix}a\\b\\c\end{bmatrix} \end{equation} The derivative of this with respect to $ \theta_i=a$ drops the constant $f$ and leaves \begin{equation} \label{dpar} \begin{bmatrix}2&0&1\\ 0&1&0\\ 0&1&0\\ 1&0&0 \end{bmatrix}\begin{bmatrix}1\\0\\0\end{bmatrix} = \begin{bmatrix}2\\0\\0\\1\end{bmatrix} \end{equation} So in MARSS, $ \partial M/\partial \theta_i$ would be

dthetai=matrix(0,ip,1); dthetai[i,]=1 #set up the d theta_i bit.
dM=unvec(D%*%dthetai,dim(M)) #f is constant and drops out; unvec only needed if M is matrix

The reason is that MARSS allows any linear constraint of the form $\alpha+\beta a + \beta_2 b$, etc. The vec form allows me to work with a generic linear constraint without having to know the exact form of that constraint. The model and parameters are all specified in vec form with f, D, and p matrices (lower case = column vector).

The second derivative of a parameter matrix with respect to $ \theta_j $ is always 0 since \ref{dpar} has no parameters in it, only constants.
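To make the vec form concrete, here is a small base R sketch of the 2 x 2 example matrix with entries 2a+c, b, b, a+1. It does not use the MARSS internals; `build_M` is a hypothetical helper, and base R's matrix() stands in for unvec. Because the form is linear in the parameters, the derivative with respect to one parameter is just the matching column of D, and a finite difference reproduces it.

```r
# vec(M) = f + D %*% theta for M = [2a+c, b; b, a+1], theta = (a, b, c)
f <- matrix(c(0, 0, 0, 1), 4, 1)
D <- matrix(c(2, 0, 0, 1,    # column for a
              0, 1, 1, 0,    # column for b
              1, 0, 0, 0),   # column for c
            4, 3)

# hypothetical helper: rebuild the 2 x 2 matrix from the vec form
build_M <- function(theta) matrix(f + D %*% theta, 2, 2)

theta <- c(1, 2, 3)          # a = 1, b = 2, c = 3
M <- build_M(theta)

# derivative with respect to a: select the first column of D
dthetai <- matrix(c(1, 0, 0), 3, 1)
dM_da <- matrix(D %*% dthetai, 2, 2)

# finite-difference check
h <- 1e-6
dM_fd <- (build_M(theta + c(h, 0, 0)) - build_M(theta)) / h
```

Here `M` has rows (5, 2) and (2, 2), and `dM_da` has rows (2, 0) and (0, 1), which contain no parameters, so all second derivatives are zero.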

Derivatives of the innovations and variance of innovations

Equation 3.4.71b in Harvey shows $ \partial v_t / \partial \theta_i$. Store result in dvit[,p]. \begin{equation} \frac{\partial v_t}{\partial \theta_i}= -Z_t \frac{\partial \tilde{x}_{t|t-1}}{\partial \theta_i}- \frac{\partial Z_t}{\partial \theta_i}\tilde{x}_{t|t-1}- \frac{\partial a_t}{\partial \theta_i} \end{equation} $\tilde{x}_{t|t-1}$ is the one-step ahead prediction of the state output from the Kalman filter, and in MARSSkf is xtt1[,t]. Next, use equation 3.4.73, to get $ \partial F_t / \partial \theta_i$. Store result in dFit[,,p]. \begin{equation} \frac{\partial F_t}{\partial \theta_i}= \frac{\partial Z_t}{\partial \theta_i} \tilde{V}_{t|t-1} Z_t^\top + Z_t \frac{\partial \tilde{V}_{t|t-1}}{\partial \theta_i} Z_t^\top + Z_t \tilde{V}_{t|t-1} \frac{\partial Z_t^\top}{\partial \theta_i} + \frac{\partial (H_t R_t H_t^\top)}{\partial \theta_i} \end{equation} $\tilde{V}_{t|t-1}$ is the one-step ahead prediction covariance output from the Kalman filter, and in MARSSkf is denoted Vtt1[,,t].
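The two derivative formulas for $v_t$ and $F_t$ translate almost line for line into R. The sketch below is not MARSS code: the function names `d_innov` and `d_Ft` are hypothetical, and $H_t$ is assumed to be the identity so that the last term is just the derivative of $R$. A toy model in which $Z$, $x$ and $V$ depend linearly on a single parameter is used to check the formulas against central finite differences.

```r
# d v_t / d theta_i
d_innov <- function(Z, xtt1, dZ, dxtt1, da) {
  -Z %*% dxtt1 - dZ %*% xtt1 - da
}
# d F_t / d theta_i, assuming H_t = I so d(H R H')/d theta = dR
d_Ft <- function(Z, Vtt1, dZ, dVtt1, dR) {
  dZ %*% Vtt1 %*% t(Z) + Z %*% dVtt1 %*% t(Z) + Z %*% Vtt1 %*% t(dZ) + dR
}

# toy model: everything is linear in a scalar parameter th (illustration only)
Z0 <- matrix(c(1, 0.5, 0, 1), 2, 2); Z1 <- matrix(c(0, 1, 0, 0), 2, 2)
V0 <- diag(2);                       V1 <- matrix(c(0.2, 0.1, 0.1, 0.3), 2, 2)
x0 <- c(0.5, -1);                    x1 <- c(1, 2)
R <- diag(0.5, 2); y <- c(1, 2)

vfun <- function(th) y - (Z0 + th * Z1) %*% (x0 + th * x1)
Ffun <- function(th) {
  Z <- Z0 + th * Z1
  Z %*% (V0 + th * V1) %*% t(Z) + R
}

th <- 0.3; h <- 1e-5
dv_an <- d_innov(Z0 + th * Z1, x0 + th * x1, Z1, x1, 0)
dF_an <- d_Ft(Z0 + th * Z1, V0 + th * V1, Z1, V1, matrix(0, 2, 2))
dv_fd <- (vfun(th + h) - vfun(th - h)) / (2 * h)
dF_fd <- (Ffun(th + h) - Ffun(th - h)) / (2 * h)
```

The analytical and finite-difference versions should agree to within numerical error.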

Recursion for derivatives of states and variance of states

If t=1

Case 1. $\pi=x_0$ is treated as a parameter and $V_0 = 0$. For any $\theta_i$ that is not in $\pi$, $Z$ or $a$, $\partial v_1/\partial \theta_i\ = 0$. For any $\theta_i$ that is not in $Z$ or $R$, $\partial F_1/\partial \theta_i\ = 0$ (a n x n matrix of zeros).

From equation 3.4.73a: \begin{equation} \frac{\partial \tilde{x}_{1|0}}{\partial\theta_i } = \frac{\partial B_1}{\partial \theta_i} \pi + B_1 \frac{\partial \pi}{\partial \theta_i} + \frac{\partial u_t}{\partial \theta_i} \end{equation} From equation 3.4.73b and using $V_0 = 0$: \begin{equation} \frac{\partial \tilde{V}_{1|0}}{\partial\theta_i } = \frac{\partial B_1}{\partial \theta_i} V_0 B_1^\top + B_1 \frac{\partial V_0}{\partial \theta_i} B_1^\top + B_1 V_0 \frac{\partial B_1^\top}{\partial \theta_i} + \frac{\partial (G_t Q_t G_t^\top)}{\partial \theta_i} = \frac{\partial (G_t Q_t G_t^\top)}{\partial \theta_i} \end{equation}

Case 2. $\pi=x_{1|0}$ is treated as a parameter and $V_{1|0}=0$. \[ \frac{\partial \tilde{x}_{1|0}}{\partial \theta_i}=\frac{\partial \pi}{\partial \theta_i} \text{ and } \partial V_{1|0}/\partial\theta_i = 0 \].

Case 3. $ x_0$ is specified by a fixed prior. $x_0=\pi$ and $V_0=\Lambda$. The derivatives of these are 0, because they are fixed.

From equation 3.4.73a and using $x_0 = \pi$ and $\partial \pi/\partial \theta_i = 0$: \begin{equation} \frac{\partial \tilde{x}_{1|0}}{\partial\theta_i } = \frac{\partial B_1}{\partial \theta_i} \pi + B_1 \frac{\partial \pi}{\partial \theta_i} + \frac{\partial u_t}{\partial \theta_i}=\frac{\partial B_1}{\partial \theta_i} \pi + \frac{\partial u_t}{\partial \theta_i} \end{equation} From equation 3.4.73b and using $V_0 = \Lambda$ and $\partial \Lambda/\partial \theta_i = 0$: \begin{equation} \frac{\partial \tilde{V}_{1|0}}{\partial\theta_i } = \frac{\partial B_1}{\partial \theta_i} V_0 B_1^\top + B_1 \frac{\partial V_0}{\partial \theta_i} B_1^\top + B_1 V_0 \frac{\partial B_1^\top}{\partial \theta_i} + \frac{\partial (G_t Q_t G_t^\top)}{\partial \theta_i} = \frac{\partial B_1}{\partial \theta_i} \Lambda B_1^\top + B_1 \Lambda \frac{\partial B_1^\top}{\partial \theta_i} + \frac{\partial (G_t Q_t G_t^\top)}{\partial \theta_i} \end{equation}

Case 4. $x_{1|0}$ is specified by a fixed prior. $x_{1|0}=\pi$ and $V_{1|0} = \Lambda$. $\partial V_{1|0}/\partial\theta_i = 0$ and $\partial x_{1|0}/\partial\theta_i = 0$.

Case 5. Estimate $ V_0$ or $ V_{1|0}$. That is unstable (per Harvey 1989, somewhere). I don't allow that in the MARSS package.

When coding this recursion, I will loop through the MARSS parameters (x0, V0, Z, a, R, B, u, Q) and within that loop, loop through the individual parameters within the parameter vector. So say Q is diagonal and unequal. It has m variance parameters, and I'll loop through each.

Now we have $\frac{\partial \tilde{x}_{1|0}}{\partial \theta_i}$ and $\frac{\partial \tilde{V}_{1|0}}{\partial \theta_i}$ for $t=1$ and we can proceed.

If t>1

The derivative of $\tilde{x}_{t|t-1}$ is (3.4.73a in Harvey) \begin{equation} \frac{\partial \tilde{x}_{t|t-1}}{\partial\theta_i } = \frac{\partial B_t}{\partial \theta_i} \tilde{x}_{t-1|t-1} + B_t \frac{\partial \tilde{x}_{t-1|t-1}}{\partial \theta_i} + \frac{\partial u_t}{\partial \theta_i} \end{equation} Then we take the derivative of this to get the second partial derivative. \begin{align} \frac{\partial^2 \tilde{x}_{t|t-1}}{\partial\theta_i \partial\theta_j} = \frac{\partial^2 B_t}{\partial\theta_i \partial\theta_j} \tilde{x}_{t-1|t-1} + \frac{\partial B_t}{\partial \theta_i}\frac{\partial \tilde{x}_{t-1|t-1}}{\partial \theta_j} + \frac{\partial B_t}{\partial \theta_j} \frac{\partial \tilde{x}_{t-1|t-1}}{\partial \theta_i} + B_t \frac{\partial^2 \tilde{x}_{t-1|t-1}}{\partial\theta_i \partial\theta_j} + \frac{\partial^2 u_t}{\partial\theta_i \partial\theta_j}\\ = \frac{\partial B_t}{\partial \theta_i}\frac{\partial \tilde{x}_{t-1|t-1}}{\partial \theta_j} + \frac{\partial B_t}{\partial \theta_j} \frac{\partial \tilde{x}_{t-1|t-1}}{\partial \theta_i} + B_t \frac{\partial^2 \tilde{x}_{t-1|t-1}}{\partial\theta_i \partial\theta_j} \end{align} In the equations, $\tilde{x}_{t|t}$ is output by the Kalman filter. In MARSSkf, it is called xtt[,t]. $\tilde{x}_{t-1|t-1}$ would be called xtt[,t-1]. The derivatives of $\tilde{x}_{t-1|t-1}$ come from the next part of the recursion (below).

The derivative of $\tilde{V}_{t|t-1}$ is (3.4.73b in Harvey) \begin{equation} \label{derivVtt1} \frac{\partial \tilde{V}_{t|t-1}}{\partial\theta_i } = \frac{\partial B_t}{\partial \theta_i} \tilde{V}_{t-1|t-1} B_t^\top + B_t \frac{\partial \tilde{V}_{t-1|t-1}}{\partial \theta_i} B_t^\top + B_t \tilde{V}_{t-1|t-1} \frac{\partial B_t^\top}{\partial \theta_i} + \frac{\partial (G_t Q_t G_t^\top)}{\partial \theta_i} \end{equation} The second derivative of $\tilde{V}_{t|t-1}$ is obtained by taking the derivative of \ref{derivVtt1} and eliminating any second derivatives of parameters: \begin{align} \frac{\partial^2 \tilde{V}_{t|t-1}}{\partial\theta_i \partial\theta_j} = \frac{\partial B_t}{\partial \theta_i} \frac{\partial \tilde{V}_{t-1|t-1}}{\partial\theta_j} B_t^\top + \frac{\partial B_t}{\partial \theta_i} \tilde{V}_{t-1|t-1} \frac{\partial B_t^\top}{\partial \theta_j} + \frac{\partial B_t}{\partial \theta_j} \frac{\partial \tilde{V}_{t-1|t-1}}{\partial \theta_i} B_t^\top + B_t \frac{\partial^2 \tilde{V}_{t-1|t-1}}{\partial\theta_i \partial\theta_j} B_t^\top + \\ B_t \frac{\partial \tilde{V}_{t-1|t-1}}{\partial \theta_i} \frac{\partial B_t^\top}{\partial \theta_j} + \frac{\partial B_t}{\partial \theta_j} \tilde{V}_{t-1|t-1} \frac{\partial B_t^\top}{\partial \theta_i} + B_t \frac{\partial \tilde{V}_{t-1|t-1}}{\partial\theta_j} \frac{\partial B_t^\top}{\partial \theta_i} \end{align} In the derivatives, $\tilde{V}_{t|t}$ is output by the Kalman filter. In MARSSkf, it is called Vtt[,,t]. $\tilde{V}_{t-1|t-1}$ would be called Vtt[,,t-1]. The derivatives of $\tilde{V}_{t-1|t-1}$ come from the rest of the recursion (below).
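The first-derivative prediction step (3.4.73a and 3.4.73b) can be sketched as one small R function. This is an illustration, not the MARSS implementation: `d_predict` is a hypothetical name and $G_t$ is assumed to be the identity, so the last term is simply the derivative of $Q$. The check uses a toy case where one parameter enters $B$ and $Q$ linearly and the previous filtered state and variance are held fixed.

```r
# Derivatives of x_{t|t-1} and V_{t|t-1} with respect to one theta_i.
# Assumes G_t = I, so d(G Q G')/d theta = dQ.
d_predict <- function(B, xtt, Vtt, dB, dxtt, dVtt, du, dQ) {
  list(
    dxtt1 = dB %*% xtt + B %*% dxtt + du,
    dVtt1 = dB %*% Vtt %*% t(B) + B %*% dVtt %*% t(B) +
      B %*% Vtt %*% t(dB) + dQ
  )
}

# toy case: B = B0 + th * B1, Q = Q0 + th * Q1 (illustration only)
B0 <- matrix(c(0.8, 0.1, 0, 0.5), 2, 2); B1 <- matrix(c(1, 0, 0, 0), 2, 2)
Q0 <- diag(0.1, 2);                      Q1 <- diag(c(1, 0))
V <- matrix(c(1, 0.2, 0.2, 2), 2, 2); x <- c(1, -1)

pred <- function(th) {
  B <- B0 + th * B1
  list(x = B %*% x, V = B %*% V %*% t(B) + Q0 + th * Q1)
}

th <- 0.2; h <- 1e-5
d <- d_predict(B0 + th * B1, x, V, B1,
               dxtt = c(0, 0), dVtt = matrix(0, 2, 2), du = c(0, 0), dQ = Q1)
fd_x <- (pred(th + h)$x - pred(th - h)$x) / (2 * h)
fd_V <- (pred(th + h)$V - pred(th - h)$V) / (2 * h)
```

In the full recursion `dxtt` and `dVtt` would be the stored derivatives from time t-1 rather than zeros.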

The rest of the recursion equations are the same for all t. From equation 3.4.74a: \begin{equation} \frac{\partial \tilde{x}_{t|t}}{\partial\theta_i } = \frac{\partial \tilde{x}_{t|t-1}}{\partial \theta_i} + \frac{\partial \tilde{V}_{t|t-1}}{\partial \theta_i} Z_t^\top F_t^{-1}v_t + \tilde{V}_{t|t-1} \frac{\partial Z_t^\top}{\partial \theta_i} F_t^{-1}v_t - \tilde{V}_{t|t-1} Z_t^\top F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}v_t + \tilde{V}_{t|t-1} Z_t^\top F_t^{-1}\frac{\partial v_t}{\partial \theta_i} \end{equation} $\tilde{V}_{t|t-1}$ is output by the Kalman filter. In MARSSkf, it is called Vtt1[,,t]. $v_t$ are the innovations. In MARSSkf, they are called Innov[,t].

From equation 3.4.74b: \begin{equation} \begin{split} \frac{\partial \tilde{V}_{t|t}}{\partial\theta_i } = & \frac{\partial \tilde{V}_{t|t-1}}{\partial \theta_i} - \frac{\partial \tilde{V}_{t|t-1}}{\partial \theta_i} Z_t^\top F_t^{-1}Z_t \tilde{V}_{t|t-1} - \tilde{V}_{t|t-1} \frac{\partial Z_t^\top}{\partial \theta_i} F_t^{-1}Z_t \tilde{V}_{t|t-1} + \tilde{V}_{t|t-1} Z_t^\top F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}Z_t \tilde{V}_{t|t-1} - \\ &\tilde{V}_{t|t-1} Z_t^\top F_t^{-1}\frac{\partial Z_t}{\partial \theta_i} \tilde{V}_{t|t-1} - \tilde{V}_{t|t-1} Z_t^\top F_t^{-1}Z_t \frac{\partial \tilde{V}_{t|t-1}}{\partial \theta_i} \end{split} \end{equation} Repeat for next element in parameter matrix.
Repeat for parameter matrix.
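The update-step derivatives (3.4.74a and 3.4.74b) can be sketched in R as follows. This is a hedged illustration rather than MARSS code; `d_update` and `upd` are hypothetical names. The check uses a toy case in which a single parameter enters only through $\tilde{V}_{t|t-1}$, so the derivatives of $Z$, $v_t$ and $\tilde{x}_{t|t-1}$ are zero and $\partial F_t/\partial\theta_i = Z \, \partial \tilde{V}_{t|t-1}/\partial\theta_i \, Z^\top$.

```r
# Derivatives of x_{t|t} and V_{t|t} with respect to one theta_i.
d_update <- function(Z, Vtt1, Finv, v, dZ, dxtt1, dVtt1, dF, dv) {
  K <- Vtt1 %*% t(Z) %*% Finv              # V Z' F^{-1}
  dxtt <- dxtt1 + dVtt1 %*% t(Z) %*% Finv %*% v +
    Vtt1 %*% t(dZ) %*% Finv %*% v -
    K %*% dF %*% Finv %*% v + K %*% dv
  dVtt <- dVtt1 - dVtt1 %*% t(Z) %*% Finv %*% Z %*% Vtt1 -
    Vtt1 %*% t(dZ) %*% Finv %*% Z %*% Vtt1 +
    K %*% dF %*% Finv %*% Z %*% Vtt1 -
    K %*% dZ %*% Vtt1 - K %*% Z %*% dVtt1
  list(dxtt = dxtt, dVtt = dVtt)
}

# toy case: V_{t|t-1} = V0 + th * V1, everything else fixed
Z <- matrix(c(1, 0, 0.5, 1), 2, 2); R <- diag(0.5, 2)
V0 <- matrix(c(1, 0.2, 0.2, 2), 2, 2); V1 <- diag(c(1, 0.5))
x <- c(0.3, -0.2); y <- c(1, 0.5)

upd <- function(th) {                      # plain Kalman update
  V <- V0 + th * V1
  K <- V %*% t(Z) %*% solve(Z %*% V %*% t(Z) + R)
  list(x = x + K %*% (y - Z %*% x), V = V - K %*% Z %*% V)
}

th <- 0.4; h <- 1e-5
V <- V0 + th * V1
Finv <- solve(Z %*% V %*% t(Z) + R)
d <- d_update(Z, V, Finv, v = y - Z %*% x, dZ = matrix(0, 2, 2),
              dxtt1 = c(0, 0), dVtt1 = V1,
              dF = Z %*% V1 %*% t(Z), dv = c(0, 0))
fd_x <- (upd(th + h)$x - upd(th - h)$x) / (2 * h)
fd_V <- (upd(th + h)$V - upd(th - h)$V) / (2 * h)
```

The finite-difference values `fd_x` and `fd_V` should match `d$dxtt` and `d$dVtt` to within numerical error.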

    Loop over i = 1 to p.
    Loop over j = i to p.
    Compute $I_{ij}(\theta)$ and add to previous time step. This is equation 3.4.69 with expectation dropped. Store in Iij[i,j] and Iij[j,i]. \begin{equation} I_{ij}(\theta)_t = I_{ji}(\theta)_t = \frac{1}{2}\left[ tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}\frac{\partial F_t}{\partial \theta_j}\right]\right] + \left(\frac{\partial v_t}{\partial \theta_i}\right)^\top F_t^{-1}\frac{\partial v_t}{\partial \theta_j} \end{equation}     Add on to previous one: \[ I_{ij}(\theta) = I_{ij}(\theta) + I_{ij}(\theta)_t \]     Repeat for next j.
    Repeat for next i.

Repeat for next t.

At the end, $ I_{ij}(\theta) $ is the observed Fisher Information Matrix.
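The i, j loops for one time step can be sketched as a small R function that returns the p x p increment to be added to the running total. The function name `info_increment` is hypothetical; the argument layout follows the set-up described earlier (dFit is n x n x p, dvit is n x p). The example is a scalar case chosen so that the answer can be checked by hand.

```r
# One time step's contribution I_ij(theta)_t (3.4.69 with expectation dropped).
info_increment <- function(Ft, dFit, dvit) {
  n <- nrow(Ft); p <- ncol(dvit)
  Finv <- solve(Ft)
  Iijt <- matrix(0, p, p)
  for (i in 1:p) {
    dFi <- matrix(dFit[, , i], n, n)
    for (j in i:p) {                       # symmetric, so j starts at i
      dFj <- matrix(dFit[, , j], n, n)
      val <- 0.5 * sum(diag(Finv %*% dFi %*% Finv %*% dFj)) +
        drop(t(dvit[, i]) %*% Finv %*% dvit[, j])
      Iijt[i, j] <- Iijt[j, i] <- val
    }
  }
  Iijt
}

# scalar example: F = 2, dF/dtheta_1 = 1, dF/dtheta_2 = 0,
# dv/dtheta_1 = 0, dv/dtheta_2 = 1
Ft <- matrix(2, 1, 1)
dFit <- array(c(1, 0), dim = c(1, 1, 2))
dvit <- matrix(c(0, 1), 1, 2)

Iij <- matrix(0, 2, 2)                     # running total over t
Iij <- Iij + info_increment(Ft, dFit, dvit)
```

By hand, the (1,1) element is 0.5 * (1/2)(1)(1/2)(1) = 0.125, the (2,2) element is (1)(1/2)(1) = 0.5, and the off-diagonal elements are 0.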

Note that $Q$ and $R$ do not appear in $\partial v_t/\partial \theta_i$, but all the other parameters do appear. So the second term in $I_{ij}(\theta) $ is always zero between $Q$ and $R$ and any other parameters. In $\partial F_t/\partial \theta_i$, $u$ and $a$ do not appear, but every other parameter does appear. So the first term in $I_{ij}(\theta) $ is always zero between $u$ and $a$ and any other parameters. This means that there is always zero covariance between $u$ or $a$ and $Q$ or $R$. But this will not be the case between $Q$ or $R$ and $B$ or $Z$.

Part of the motivation of implementing the Harvey (1989) recursion is that currently in MARSS, I use a numerical estimate of the Fisher Information matrix by using one of R's functions to return the Hessian. But it often returns errors. I might improve it if I constrained it. If I am only estimating $u$, $a$, $Q$ and $R$, I could do a two-step process. Get the Hessian holding the variances at the MLEs and then repeat with $u$ and $a$ at the MLEs.

Notes on computing the Fisher Information matrix for MARSS models. Part III Overview of Harvey 1989

2016-06-01T16:07:00.002-07:00


Notes on computing the Fisher Information matrix for MARSS models Part I Background, Part II Louis 1982

Part II discussed the approach by Louis (1982), which uses the full-data likelihood and its first derivative, which is part of the M-step of the EM algorithm. The conclusion of Part II was that the approach is doable but computationally expensive because it scales with at least $T^2$.

Here I will review the more common approach (Harvey 1989, pages 140-142, section 3.4.5 Information matrix) which uses the prediction error form of the likelihood function to calculate the observed Fisher Information $ \mathcal{I}(\hat{\theta},y) $. A related paper is Cavanaugh and Shumway (1996), which presents an approach for calculating the expected Fisher Information.

Harvey 1989 recursion for the expected and observed Fisher Information matrix

Harvey (1989), pages 140-142, shows how to write the Hessian of the log-likelihood function using the prediction error form of the likelihood. The prediction error form is: \begin{equation}\label{peformlogL} \log L = \sum_{t=1}^T l_t = \sum_{t=1}^T \log p(y_t|y_1^{t-1}) \end{equation} The Hessian of the log-likelihood can then be written as \begin{equation}\label{hessian} \frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j}=\sum{\frac{\partial^2 l_t}{\partial\theta_i \partial\theta_j}} \end{equation} and this can be written in terms of derivatives of the innovations $v_t$ and the variance of the innovations $F_t$. This is shown in Equation 3.4.66 in Harvey (1989). There are a couple differences between the equation below and 3.4.66 in Harvey. First, 3.4.66 has a typo; the $[I - F_t^{-1} v_t v_t^\top]$ should be within the trace (as below). Second, I have written out the derivative with respect to $\theta_j$ that appears in the first trace term. \begin{equation}\label{liket} \begin{gathered} \frac{\partial^2 l_t}{\partial\theta_i \partial\theta_j} = \frac{1}{2} tr\left[ \left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_j} F_t^{-1} \frac{\partial F_t}{\partial \theta_i} - F_t^{-1}\frac{\partial^2 F_t}{\partial\theta_i \partial\theta_j} \right] \left[I - F_t^{-1}v_t v_t^\top\right] \right] - \\ \frac{1}{2}tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}\frac{\partial F_t}{\partial \theta_j}F_t^{-1}v_t v_t^\top\right] + \\ \frac{1}{2}tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}\left[ \frac{\partial v_t}{\partial \theta_j}v_t^\top + v_t\frac{\partial v_t^\top}{\partial \theta_j}\right]\right] - \\ \frac{\partial^2 v_t^\top}{\partial\theta_i \partial\theta_j}F_t^{-1}v_t + \frac{\partial v_t^\top}{\partial \theta_i} F_t^{-1}\frac{\partial F_t}{\partial \theta_j} F_t^{-1} v_t - \frac{\partial v_t^\top}{\partial \theta_i} F_t^{-1} \frac{\partial v_t}{\partial \theta_j} \end{gathered} \end{equation} The Fisher Information matrix is the 
negative of the expected value (over all possible data) of \ref{hessian}: \begin{equation}\label{FisherInformation2} I(\theta) = -E\left[ \frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j} \right] \end{equation} Thus for the Fisher Information matrix, we take the expectation (over all possible data) of the sum (over t) of Equation 3 (3.4.66 in Harvey 1989). On pages 141-142, Harvey shows that the expected value of Equation 3 can be simplified and the i,j element of the Fisher Information matrix can be written as (Equation 3.4.69 in Harvey 1989): \begin{equation}\label{Iij} I_{ij}(\theta) = \frac{1}{2}\sum_t \left[ tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}\frac{\partial F_t}{\partial \theta_j}\right]\right] + E\left[\sum_t\left(\frac{\partial v_t}{\partial \theta_i}\right)^\top F_t^{-1}\frac{\partial v_t}{\partial \theta_j}\right] \end{equation} Equation \ref{Iij} (3.4.69 in Harvey 1989) is the Fisher Information and is evaluated at the true parameter values $ \theta $. We do not know $ \theta $ and instead we estimate the Fisher Information using our estimates of $ \theta $. The two estimates of $ I(\theta) $ that are used are called the expected and observed Fisher Information matrices. The expected Fisher Information is \begin{equation}\label{expectedFisherInformation2} I(\hat{\theta}) = -E\left[ \frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j} \right] |_{\theta=\hat{\theta}} = -E\left[ \sum{\frac{\partial^2 l_t}{\partial\theta_i \partial\theta_j}} \right] |_{\theta=\hat{\theta}} \end{equation} and the observed Fisher Information is \begin{equation}\label{observedFisherInformation2} \mathcal{I}(\hat{\theta},y) = - \frac{\partial^2 \log L}{\partial\theta_i \partial\theta_j} |_{\theta=\hat{\theta}} = - \sum{\frac{\partial^2 l_t}{\partial\theta_i \partial\theta_j}} |_{\theta=\hat{\theta}} \end{equation} The $ |_{\theta=\hat{\theta}} $ means 'evaluated at'. $ l_t $ is a function of $ \theta $. 
We take the derivative of that function and then evaluate that derivative at $ \theta = \hat{\theta} $. The expectation (which is an integral) is over the possible values of the data $ y $ which are generated from the model with $ \theta $.

The observed Fisher Information drops the expectation and the expected Fisher Information does not. The expectation is taken over all possible data, and we have only one observed data set. At first blush, it may seem that it is impossible to compute the expectation and that we must always use the observed Fisher Information. However, for some models, one can write down the expectations analytically. One could also simulate from the MLEs to get the expectations; this is the idea behind bootstrapping. In a bootstrapping approach one uses the MLE to generate data. This is an approximation since what we would like is to simulate data from the true parameters. The mean and variance of data generated from the MLEs versus data generated from the true parameters often have nice asymptotic properties.

However, it is common to use the observed Fisher Information matrix. This is what one is using when one uses the Hessian of the log-likelihood function evaluated at the MLEs. To get an analytical equation for the observed Fisher Information matrix, we use Equation 3 for $ l_t $ and take the sum to get the Hessian of the log-likelihood function (\ref{hessian}). This is the same Hessian that you can get numerically. In R, you can use the fdHess function in the nlme package or the optim function with hessian=TRUE.
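As a minimal illustration of getting the observed Fisher Information from a numerical Hessian, here is a sketch using optim on a toy model, not a MARSS model: i.i.d. normal data with known variance 1 and unknown mean. The data, seed and function name `negloglik` are made up for the example. For this model the observed information for the mean is exactly the sample size, so the numerical answer can be checked.

```r
set.seed(42)
y <- rnorm(50, mean = 2, sd = 1)

# negative log-likelihood for the mean, with the sd fixed at 1
negloglik <- function(mu) -sum(dnorm(y, mean = mu, sd = 1, log = TRUE))

# hessian = TRUE returns a numerically differentiated Hessian at the optimum
fit <- optim(par = 0, fn = negloglik, method = "BFGS", hessian = TRUE)

# Hessian of the NEGATIVE log-likelihood = observed Fisher Information
obs_info <- fit$hessian
# approximate standard error of the MLE
se_mu <- sqrt(1 / obs_info[1, 1])
```

Because the function being minimized is the negative log-likelihood, no sign change is needed; if you maximize the log-likelihood instead, the observed information is the negative of the returned Hessian.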

Partially observed, partially expected Fisher Information matrix

Equation \ref{Iij} (Equation 3.4.69 in Harvey) is a simplification of the expected value of the sum of equation 3. The simplification occurs because a number of terms in equation 3 drop out or cancel out when you take the expectation (see bottom of page 141 in Harvey 1989). The only terms that remain are those shown in equation \ref{Iij}. Harvey (1989) does not say how to compute the expectation in equation \ref{Iij} (which is his 3.4.69). Cavanaugh and Shumway (1996) do not say how to compute it either and suggest that it is infeasible (page 1 in paragraph after their equation 1). Instead they say that you can drop the expectation in equation \ref{Iij} and get the observed Fisher Information: \begin{equation}\label{obsIij} \mathcal{I}_{ij}(\theta) = \frac{1}{2}\sum_t \left[ tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i}F_t^{-1}\frac{\partial F_t}{\partial \theta_j}\right]\right] + \sum_t\left(\frac{\partial v_t}{\partial \theta_i}\right)^\top F_t^{-1}\frac{\partial v_t}{\partial \theta_j} \end{equation} This, however, is halfway between the expected Fisher Information matrix and the observed Fisher Information matrix because equation \ref{Iij} is what you get after doing the expectation and dropping some of the terms in equation 3. If you compare what you get from equation \ref{obsIij} and what you get from a numerical estimate of the Hessian of the log-likelihood function at the MLE, you will see that they are different. The variance of the former is less than the variance of the latter. This is what you expect since the former has had the expectation applied to some terms in equation 3 (Harvey's 3.4.66).

This does not mean that equation \ref{obsIij} should not be used, but rather that if you compare it to the output from a numerically computed Hessian, they will not be the same. In Part IV, I show Harvey's recursion for computing the first derivatives of $v_t$ and $F_t$ needed in equations 3 and \ref{Iij}. I extend this recursion to get the second derivative also. Once we have all these, we can use equation \ref{observedFisherInformation2} with equation 3 to compute the observed Fisher Information matrix and use equation \ref{Iij} to compute the "observed/expected" Fisher Information.

Writing Equation 3 in vec form

We can compute the Hessian of the log-likelihood by using a for loop of i from 1 to p with an inner for loop for j from i to p. The Hessian is symmetric so the inner loop only needs to go from i to p. However, we can also write the Hessian for time step t in a single line without any for loops using the Jacobian matrices for our derivatives. With the t subscripts of F and v dropped: \begin{equation} \begin{gathered} \frac{1}{2} J_F^\top ( F^{-1} \otimes F^{-1}) J_F - J_F^\top ( F^{-1}vv^\top F^{-1} \otimes F^{-1} ) J_F -\frac{1}{2} ( I_p \otimes [ F^{-1} - F^{-1} v_t v_t^\top F^{-1} ] ) \mathcal{J}_F + \\ \frac{1}{2} J_F^\top [3 F^{-1}v \otimes F^{-1} + F^{-1} \otimes F^{-1}v] J_v - \mathcal{J_v}^\top (I_p \otimes F^{-1} v) - J_v^\top F^{-1} J_v \end{gathered} \end{equation} This may or may not be faster but it is more concise. Go to Part IV to see how to compute these Jacobians using Harvey's recursion.
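To see how a Jacobian-based term replaces the double loop, here is a small R check of the first term only, $\frac{1}{2} J_F^\top ( F^{-1} \otimes F^{-1}) J_F$. The inputs are random symmetric matrices made up for the illustration, and `sym` is a hypothetical helper. If the derivatives $\partial F/\partial\theta_i$ are stored as an n x n x p array, then reshaping the array to (n*n) x p gives $J_F$ directly, since each column is the vec of one slice.

```r
set.seed(3)
n <- 2; p <- 3
sym <- function(A) (A + t(A)) / 2          # hypothetical helper: symmetrize

Fm <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)  # a positive definite F
dF <- array(0, dim = c(n, n, p))           # dF[,,i] plays the role of dF/dtheta_i
for (i in 1:p) dF[, , i] <- sym(matrix(rnorm(n * n), n, n))
Finv <- solve(Fm)

# Jacobian of vec(F): column i is vec(dF/dtheta_i)
J_F <- matrix(dF, n * n, p)

# single-line vec form of the first term
vec_form <- 0.5 * t(J_F) %*% kronecker(Finv, Finv) %*% J_F

# the same term from explicit loops over i and j
loop_form <- matrix(0, p, p)
for (i in 1:p) for (j in 1:p)
  loop_form[i, j] <- 0.5 * sum(diag(Finv %*% dF[, , j] %*% Finv %*% dF[, , i]))
```

The two p x p matrices should agree to numerical precision; which is faster was not tested here.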

Derivation of the observed Fisher Information matrix (equation 9)

Note, I am going to drop the t subscript on F and v because things are going to get cluttered; $ v_1 $ will refer to the 1st element of the $ n \times 1$ column vector v and $ F_{11} $ is the (1,1) element of the matrix F. There has to be a loop to go through all the $ F_t $ and $ v_t $ for t=1 to T.

Terms 1 and 2 of equation 3

The first term of equation 3 is \begin{equation} \begin{gathered} \frac{1}{2} tr\left[ \left[ F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \frac{\partial F}{\partial \theta_i} - F^{-1}\frac{\partial^2 F}{\partial\theta_i \partial\theta_j} \right] \left[I - F^{-1}v v^\top\right] \right] = \\ \frac{1}{2} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \frac{\partial F}{\partial \theta_i}\left[I - F^{-1}v v^\top\right]\right] - \frac{1}{2} tr\left[ F^{-1}\frac{\partial^2 F}{\partial\theta_i \partial\theta_j} \left[ I - F^{-1}v v^\top \right] \right] = \\ \frac{1}{2} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \frac{\partial F}{\partial \theta_i} \right] - \frac{1}{2} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \frac{\partial F}{\partial \theta_i}F^{-1}v v^\top \right] - \frac{1}{2} tr\left[ F^{-1}\frac{\partial^2 F}{\partial\theta_i \partial\theta_j} \left[I - F^{-1}v v^\top\right]\right] \end{gathered} \end{equation} The second term of equation 3 is \begin{equation} - \frac{1}{2} tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_i} F_t^{-1} \frac{\partial F_t}{\partial \theta_j}F_t^{-1}v_t v_t^\top \right] \end{equation} All the matrices within the traces above are symmetric. The trace is invariant under transposition and under cyclic permutation. That means that if A, B, C, and D are symmetric matrices, $ tr(ABCD) = tr(DCBA) = tr(ADCB) $, etc. Thus the second term can be rearranged (transpose the product, then cycle $ v_t v_t^\top $ back to the end) to match the middle term in the first term. 
Terms 1 + 2 of equation 3 can thus be written as \begin{equation}\label{term12eqn3} \frac{1}{2}tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_j} F_t^{-1} \frac{\partial F_t}{\partial \theta_i} \right] - tr\left[ F_t^{-1}\frac{\partial F_t}{\partial \theta_j} F_t^{-1} \frac{\partial F_t}{\partial \theta_i}F_t^{-1}v_t v_t^\top \right] - \frac{1}{2} tr\left[ F_t^{-1}\frac{\partial^2 F_t}{\partial\theta_i \partial\theta_j} \left[I - F_t^{-1}v_t v_t^\top\right]\right] \end{equation} We can write the first trace of equation \ref{term12eqn3} as a vector product using the relation $ tr(A^\top B) = vec(A)^\top vec(B) $. Note that the matrices in the traces in equation \ref{term12eqn3} are symmetric. If A is symmetric, $ A^\top = A $ and $ tr(AB) = vec(A)^\top vec(B) $. \begin{equation} \begin{gathered} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \frac{\partial F}{\partial \theta_i} \right] = vec\left( F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} \right)^\top vec\left( \frac{\partial F}{\partial \theta_i} \right) = \\ \left( ( F^{-1} \otimes F^{-1} ) vec\left( \frac{\partial F}{\partial \theta_j} \right) \right)^\top vec\left( \frac{\partial F}{\partial \theta_i} \right) = \\ vec\left( \frac{\partial F}{\partial \theta_j} \right)^\top ( F^{-1} \otimes F^{-1}) vec\left( \frac{\partial F}{\partial \theta_i} \right) \end{gathered} \end{equation} That is for the i,j element. This matrix is symmetric so it is also the j,i element. The derivative of $ vec(F) $ with respect to $ \theta $ (as opposed to the j-th element of $ \theta $) is the Jacobian matrix of $ vec(F) $. 
\begin{equation}\label{JF} J_F = \begin{bmatrix}\frac{\partial vec(F)}{\partial \theta_1} & \frac{\partial vec(F)}{\partial \theta_2} & \dots & \frac{\partial vec(F)}{\partial \theta_p}\end{bmatrix} = \begin{bmatrix} \frac{\partial F_{11}}{\partial \theta_1} & \frac{\partial F_{11}}{\partial \theta_2} & \dots & \frac{\partial F_{11}}{\partial \theta_p}\\ \frac{\partial F_{21}}{\partial \theta_1} & \frac{\partial F_{21}}{\partial \theta_2} & \dots & \frac{\partial F_{21}}{\partial \theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial F_{nn}}{\partial \theta_1} & \frac{\partial F_{nn}}{\partial \theta_2} & \dots & \frac{\partial F_{nn}}{\partial \theta_p} \end{bmatrix} \end{equation} The full matrix for the first part of equation \ref{term12eqn3} is then \begin{equation} \frac{1}{2} J_F^\top ( F^{-1} \otimes F^{-1}) J_F \end{equation}

The middle trace of equation \ref{term12eqn3} is similar to the first and we end up with: \begin{equation} \begin{gathered} vec\left( \frac{\partial F}{\partial \theta_j} \right)^\top ( F^{-1} \otimes F^{-1}) vec\left( \frac{\partial F}{\partial \theta_i} F^{-1}vv^\top \right) = \\ vec\left( \frac{\partial F}{\partial \theta_j} \right)^\top ( F^{-1} \otimes F^{-1}) ( vv^\top F^{-1} \otimes I_n) vec\left( \frac{\partial F}{\partial \theta_i} \right) = \\ vec\left( \frac{\partial F}{\partial \theta_j} \right)^\top ( F^{-1}vv^\top F^{-1} \otimes F^{-1}) vec\left( \frac{\partial F}{\partial \theta_i} \right) \end{gathered} \end{equation} We can write this in terms of the Jacobian of vec(F): \begin{equation} J_F^\top ( F^{-1}vv^\top F^{-1} \otimes F^{-1} ) J_F \end{equation}

The third part of equation \ref{term12eqn3} involves the second derivatives $ \partial^2 F/\partial\theta_i \partial\theta_j $. \begin{equation} \begin{gathered} tr\left[ F^{-1} \frac{\partial^2 F}{\partial\theta_i \partial\theta_j} [I - F^{-1}v v^\top ] \right] = tr\left[ [I - F^{-1}v v^\top ] F^{-1} \frac{\partial^2 F}{\partial\theta_i \partial\theta_j} \right] = \\ vec\left( F^{-1} - F^{-1}v v^\top F^{-1} \right)^\top vec\left( \frac{\partial^2 F}{\partial\theta_i \partial\theta_j} \right) = \\ vec\left( F^{-1} - F^{-1}v v^\top F^{-1} \right)^\top \frac{\partial vec( \partial F/\partial\theta_i )}{\partial\theta_j} \end{gathered} \end{equation} Again this is the i,j term. The term on the bottom line on the right is the $ (\theta_i,\theta_j) $ term of the Jacobian of the vec of the Jacobian of F: \begin{equation} \mathcal{J}_F = \begin{bmatrix}\frac{\partial vec(J_F)}{\partial\theta_1} & \frac{\partial vec(J_F)}{\partial\theta_2} & \dots & \frac{\partial vec(J_F)}{\partial\theta_p}\end{bmatrix} = \begin{bmatrix} \frac{\partial^2 F_{11}}{\partial\theta_1\partial\theta_1} & \frac{\partial^2 F_{11}}{\partial\theta_1\partial\theta_2} & \dots & \frac{\partial^2 F_{11}}{\partial\theta_1\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 F_{nn}}{\partial\theta_1\partial\theta_1} & \frac{\partial^2 F_{nn}}{\partial\theta_1\partial\theta_2} & \dots & \frac{\partial^2 F_{nn}}{\partial\theta_1\partial\theta_p}\\ \frac{\partial^2 F_{11}}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 F_{11}}{\partial\theta_2\partial\theta_2} & \dots & \frac{\partial^2 F_{11}}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 F_{nn}}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 F_{nn}}{\partial\theta_2\partial\theta_2} & \dots & \frac{\partial^2 F_{nn}}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 F_{11}}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 F_{11}}{\partial\theta_p\partial\theta_2} & \dots & \frac{\partial^2 F_{11}}{\partial\theta_p\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 F_{nn}}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 F_{nn}}{\partial\theta_p\partial\theta_2} & \dots & \frac{\partial^2 F_{nn}}{\partial\theta_p\partial\theta_p}\\
\end{bmatrix} \end{equation}

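The full matrix for the third part of equation \ref{term12eqn3}, leaving off the factor of $ -\frac{1}{2} $, is then \begin{equation} ( I_p \otimes vec( F^{-1} - F^{-1} v v^\top F^{-1} )^\top ) \mathcal{J}_F \end{equation} The subscript on the $ I $ indicates the size of the identity matrix. In this case, it is a $ p \times p $ matrix. The vec and the transpose are needed so that the Kronecker product is $ p \times n^2 p $ and conforms with $ \mathcal{J}_F $, which is $ n^2 p \times p $.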

Term 3 of equation 3

With the t subscripts dropped, the 3rd term of equation 3 is \begin{equation}\label{term3eqn3} \frac{1}{2} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_i}F^{-1} \left( \frac{\partial v}{\partial \theta_j}v^\top + v\frac{\partial v^\top}{\partial \theta_j}\right) \right] \end{equation} Using the same procedure as for the above terms, we can write this in terms of vecs. If $a$ and $b$ are $n \times 1$ column vectors, \begin{equation} vec(ab^\top) = (b \otimes I_n)vec(a) = (b \otimes I_n)a = (I_n \otimes a)vec(b) = (I_n \otimes a)b \end{equation} Thus, \begin{equation} \begin{gathered} vec\left( \frac{\partial v}{\partial \theta_j}v^\top\right) = (v \otimes I_n)\frac{\partial v}{\partial \theta_j} \\ vec\left( v (\partial v/\partial \theta_j)^\top \right) = (I_n \otimes v)\frac{\partial v}{\partial \theta_j} \end{gathered} \end{equation} and \begin{equation} vec\left( \frac{\partial v}{\partial \theta_j}v^\top + v(\partial v/\partial \theta_j)^\top \right) = (v \otimes I_n + I_n \otimes v)\frac{\partial v}{\partial \theta_j} \end{equation} When A is symmetric, $ tr(AB) = vec(A)^\top vec(B) $. Thus term 3 of equation 3 can be written as \begin{equation} \begin{gathered} \frac{1}{2} tr\left[ F^{-1}\frac{\partial F}{\partial \theta_i}F^{-1} \left( \frac{\partial v}{\partial \theta_j}v^\top + v\frac{\partial v^\top}{\partial \theta_j}\right) \right] = \frac{1}{2} vec\left( \frac{\partial F}{\partial \theta_i} \right)^\top (F^{-1} \otimes F^{-1}) (v \otimes I_n + I_n \otimes v)\frac{\partial v}{\partial \theta_j} = \\ \frac{1}{2} vec\left( \frac{\partial F}{\partial \theta_i} \right)^\top (F^{-1}v \otimes F^{-1} + F^{-1} \otimes F^{-1}v) \frac{\partial v}{\partial \theta_j} \end{gathered} \end{equation} This is the i,j term of the Fisher Information matrix from term 3 in equation 3.
To get all terms, we use the Jacobian of vec(F) as above and the Jacobian of v: \begin{equation} \frac{1}{2} J_F^\top (F^{-1} \otimes F^{-1}) (v \otimes I_n + I_n \otimes v) J_v = \frac{1}{2} J_F^\top [F^{-1} v \otimes F^{-1} + F^{-1} \otimes F^{-1}v] J_v \end{equation} where $ J_F $ is defined in equation \ref{JF} and $ J_v $ is \begin{equation}\label{Jv} J_v = \begin{bmatrix} \frac{\partial v_{1}}{\partial\theta_1} & \frac{\partial v_{1}}{\partial\theta_2} & \dots & \frac{\partial v_{1}}{\partial\theta_p}\\ \frac{\partial v_{2}}{\partial\theta_1} & \frac{\partial v_{2}}{\partial\theta_2} & \dots & \frac{\partial v_{2}}{\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial v_{n}}{\partial\theta_1} & \frac{\partial v_{n}}{\partial\theta_2} & \dots & \frac{\partial v_{n}}{\partial\theta_p} \end{bmatrix} \end{equation}
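
As a sanity check on the relation $ vec(ab^\top) = (b \otimes I_n)a = (I_n \otimes a)b $ used for term 3, here is the $ n=2 $ case written out. This is a worked illustration with generic entries $ a_1, a_2, b_1, b_2 $; it is not part of the derivation itself. \begin{equation} \begin{gathered} ab^\top = \begin{bmatrix} a_1 b_1 & a_1 b_2 \\ a_2 b_1 & a_2 b_2 \end{bmatrix}, \quad vec(ab^\top) = \begin{bmatrix} a_1 b_1 \\ a_2 b_1 \\ a_1 b_2 \\ a_2 b_2 \end{bmatrix} \\ (b \otimes I_2)a = \begin{bmatrix} b_1 & 0 \\ 0 & b_1 \\ b_2 & 0 \\ 0 & b_2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = \begin{bmatrix} a_1 b_1 \\ a_2 b_1 \\ a_1 b_2 \\ a_2 b_2 \end{bmatrix} \\ (I_2 \otimes a)b = \begin{bmatrix} a_1 & 0 \\ a_2 & 0 \\ 0 & a_1 \\ 0 & a_2 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} a_1 b_1 \\ a_2 b_1 \\ a_1 b_2 \\ a_2 b_2 \end{bmatrix} \end{gathered} \end{equation} All three give the same $ 4 \times 1 $ vector.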

Term 4 of equation 3

The 4th term of equation 3 is \begin{equation}\label{term4eqn3} - \frac{\partial^2 v^\top}{\partial\theta_i \partial\theta_j}F^{-1}v \end{equation} This is for the i,j term of the Fisher Information matrix. An equation for all terms can be written as a function of the Jacobian of $ vec(J_v) $: \begin{equation} \mathcal{J}_v = \begin{bmatrix}\frac{\partial vec(J_v)}{\partial\theta_1} & \frac{\partial vec(J_v)}{\partial\theta_2} & \dots & \frac{\partial vec(J_v)}{\partial\theta_p}\end{bmatrix} = \begin{bmatrix} \frac{\partial^2 v_{1}}{\partial\theta_1\partial\theta_1} & \frac{\partial^2 v_{1}}{\partial\theta_1\partial\theta_2} & \dots & \frac{\partial^2 v_{1}}{\partial\theta_1\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 v_{n}}{\partial\theta_1\partial\theta_1} & \frac{\partial^2 v_{n}}{\partial\theta_1\partial\theta_2} & \dots & \frac{\partial^2 v_{n}}{\partial\theta_1\partial\theta_p}\\ \frac{\partial^2 v_{1}}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 v_{1}}{\partial\theta_2\partial\theta_2} & \dots & \frac{\partial^2 v_{1}}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 v_{n}}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 v_{n}}{\partial\theta_2\partial\theta_2} & \dots & \frac{\partial^2 v_{n}}{\partial\theta_2\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 v_{1}}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 v_{1}}{\partial\theta_p\partial\theta_2} & \dots & \frac{\partial^2 v_{1}}{\partial\theta_p\partial\theta_p}\\ \vdots & \vdots & \vdots & \vdots \\ \frac{\partial^2 v_{n}}{\partial\theta_p\partial\theta_1} & \frac{\partial^2 v_{n}}{\partial\theta_p\partial\theta_2} & \dots & \frac{\partial^2 v_{n}}{\partial\theta_p\partial\theta_p}\\ \end{bmatrix} \end{equation} The right-hand factor of equation \ref{term4eqn3}, $ F^{-1}v $, is an $n \times 1$ matrix.
We need to write this as the $np \times p$ matrix: \begin{equation} \begin{bmatrix} F^{-1}v & 0_{n \times 1} & \dots & 0_{n \times 1}\\ 0_{n \times 1} & F^{-1}v & \dots & 0_{n \times 1}\\ \vdots & \vdots & \vdots & \vdots\\ 0_{n \times 1} & 0_{n \times 1} & \dots & F^{-1}v \end{bmatrix} = I_p \otimes F^{-1}v \end{equation} Thus the full matrix for the i,j terms in the Fisher Information matrix from term 4 of equation 3 is \begin{equation} - \mathcal{J_v}^\top (I_p \otimes F^{-1}v) \end{equation}

Term 5 of equation 3

Term 5 is \begin{equation}\label{term5eqn3} \frac{\partial v^\top}{\partial \theta_i} F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} v \end{equation} This is a scalar and thus its vec is equal to itself. We can rewrite equation \ref{term5eqn3} using the following relation: \begin{equation} vec(a^\top ABC c ) = (c^\top \otimes a^\top) vec (ABC) = a^\top (c^\top \otimes I_n) vec(ABC) = a^\top (c^\top \otimes I_n) (C^\top \otimes A) vec(B) = a^\top (c^\top C^\top \otimes A) vec(B) \end{equation} Thus equation \ref{term5eqn3} is \begin{equation} \frac{\partial v^\top}{\partial \theta_i} F^{-1}\frac{\partial F}{\partial \theta_j} F^{-1} v = \frac{\partial v^\top}{\partial \theta_i} (v^\top \otimes I_n) (F^{-1} \otimes F^{-1}) vec\left( \frac{\partial F}{\partial \theta_j} \right) \end{equation} This is for the i,j term of the Fisher Information matrix. For the full matrix, we use the Jacobian of v (equation \ref{Jv}) and the Jacobian of vec(F) (equation \ref{JF}): \begin{equation} J_v^\top (v^\top \otimes I_n) (F^{-1} \otimes F^{-1}) J_F = J_v^\top (v^\top F^{-1} \otimes F^{-1}) J_F \end{equation}

Term 6 of equation 3

Term 6 is \begin{equation}\label{term6eqn3} - \frac{\partial v^\top}{\partial \theta_i} F^{-1} \frac{\partial v}{\partial \theta_j} \end{equation} This is for the i,j term of the Fisher Information matrix and we can write it immediately as the full matrix in terms of the Jacobian of v: \begin{equation} - J_v^\top F^{-1} J_v \end{equation}

Putting all the terms together

Putting all the terms together, we have the full observed Fisher Information matrix: \begin{equation} \begin{gathered} \frac{1}{2} J_F^\top ( F^{-1} \otimes F^{-1}) J_F - J_F^\top ( F^{-1}vv^\top F^{-1} \otimes F^{-1} ) J_F -\frac{1}{2} ( I_p \otimes vec( F^{-1} - F^{-1} v v^\top F^{-1} )^\top ) \mathcal{J}_F + \\ \frac{1}{2} J_F^\top [F^{-1}v \otimes F^{-1} + F^{-1} \otimes F^{-1}v] J_v - \mathcal{J}_v^\top (I_p \otimes F^{-1}v) + J_v^\top (v^\top F^{-1} \otimes F^{-1}) J_F - J_v^\top F^{-1} J_v \end{gathered} \end{equation} We can simplify this a little by noting that all terms are symmetric matrices and the transpose of a symmetric matrix is equal to itself. \begin{equation} J_v^\top (v^\top F^{-1} \otimes F^{-1}) J_F = J_F^\top (F^{-1} v \otimes F^{-1}) J_v \end{equation} Thus the full observed Fisher Information matrix is \begin{equation} \begin{gathered} \frac{1}{2} J_F^\top ( F^{-1} \otimes F^{-1}) J_F - J_F^\top ( F^{-1}vv^\top F^{-1} \otimes F^{-1} ) J_F -\frac{1}{2} ( I_p \otimes vec( F^{-1} - F^{-1} v v^\top F^{-1} )^\top ) \mathcal{J}_F + \\ \frac{1}{2} J_F^\top [3 F^{-1}v \otimes F^{-1} + F^{-1} \otimes F^{-1}v] J_v - \mathcal{J}_v^\top (I_p \otimes F^{-1} v) - J_v^\top F^{-1} J_v \end{gathered} \end{equation}

Notes on computing the Fisher Information matrix for MARSS models. Part II Louis 1982

2016-05-19T20:48:00.000-07:00

MathJax and blogger can be iffy. Try reloading if the equations don't show up.

Part II. Background on Fisher Information is in Part I.

So how do we compute $ I(\hat{\theta}) $ or $ \mathcal{I}(\hat{\theta},y) $ (in Part I)? In particular, can we use the analytical derivatives of the full log-likelihood that are part of the EM algorithm? Many researchers have worked on this idea. My notes here were influenced by this blog post EM Algorithm: Confidence Intervals on the same topic, which got me started. This blog post is mainly a discussion of the result by Louis (1982) on calculation of the Fisher Information matrix from the 'score' function that one takes the derivative of in the M-step of the EM algorithm.

The 'score' function used in the EM algorithm for a MARSS model is \begin{equation} Q(\theta | \theta_j) = E_{X|y,\theta_j } [\log f_{XY}(X,y|\theta) ] \end{equation} It is the expected value taken over the hidden random variable $ X $ of the full data log-likelihood at $ Y=y $ [3]; full means it is a function of all the random variables in the model, which includes the hidden or latent variables. $ x, y $ is the full 'data', the left side of the $ x $ state equation and the $ y $ observation equation. We take the expectation of this full data likelihood conditioned on the observed data $ y $ and $ \theta_j $ which is the value of $ \theta $ at the j-th iteration of the EM algorithm. Although $ Q(\theta | \theta_j) $ looks a bit hairy, actually the full-data likelihood may be very easy to write down and considerably easier than the data likelihood $ f(y|\theta) $. The hard part is often the expectation step, however for MARSS models the Kalman filter-smoother algorithm computes the expectations involving $ X $ and Holmes (2010) shows how to compute the expectations involving $ Y $, which comes up when there are missing values in the dataset (missing time steps, say).

In the M-step of the EM algorithm, we take the derivative of $ Q(\theta | \theta_j) $ with respect to $ \theta $ and solve for the $ \theta $ where \[ \frac{\partial Q(\theta | \theta_j ) }{\partial \theta} = 0. \] It would be nice if one could use the following to compute the observed Fisher Information \[ - \left.\frac{\partial^2 Q(\theta | \hat{\theta}) }{\partial \theta^2 } \right|_{\theta = \hat{\theta} } \] $ Q(\theta | \hat{\theta}) $ is our score function at the end of the EM algorithm, when $ \theta = \hat{\theta} $. $ Q $ is a function of $ \theta $, the model parameters, and will have terms like $ E(X|Y=y, \hat{\theta}) $, the expected value of $ X $ conditioned on $ Y=y $ and the MLE. Those are the expectations coming out of the Kalman filter-smoother. We take the second derivative of $ Q $ with respect to $ \theta $. That is straight-forward for the MARSS equations. You take the first derivative of $ Q $ with respect to $ \theta $, which you already have from the update or M-step equations, and take the derivative of that with respect to $ \theta $.

Conceptually, this \[ - \left.\frac{\partial^2 Q(\theta | \hat{\theta}) }{\partial \theta^2 } \right|_{\theta = \hat{\theta} } = - \left.\frac{\partial^2 E_{X|y,\hat{\theta} } [\log f_{XY}(X,y|\theta) ] }{\partial \theta^2 } \right|_{\theta = \hat{\theta} } \] looks a bit like the observed Fisher Information: \begin{equation}\label{obsFI} \mathcal{I}(\hat{\theta},y) = - \left.\frac{\partial^2\log f(y|\theta)}{\partial \theta^2} \right|_{\theta=\hat{\theta}} \end{equation} except that instead of the data likelihood $ f(y|\theta) $, we use the expected likelihood $ E_{X|y,\hat{\theta} } [\log f_{XY}(X,y|\theta) ] $. The expected likelihood is the full likelihood with the $ X $ and $ XX^\top $ random variables replaced by their expected values assuming $ \theta = \hat{\theta} $ and $ Y=y $. The problem is that $ E_{X|y,\theta } [\log f_{XY}(X,y|\theta) ] $ is a function of $ \theta $ and by fixing the conditioning parameter at $ \hat{\theta}$ we are not accounting for the uncertainty in that expectation. What we need is something like

Information with X fixed at expected value - Information on expected value of X

so we account for the fact that we have over-estimated the information from the data by treating the hidden random variable as fixed. The same issue arises when we compute confidence intervals using the estimate of the variance without accounting for the fact that this is an estimate and thus has uncertainty. Louis (1982) and Oakes (1999) are concerned with how to do this correction or adjustment.

Louis 1982 approach

The following is equations 3.1, 3.2 and 3.3 in Louis (1982) translated to the MARSS case. In the MARSS model, we have two random variables, $ X(t) $ and $ Y(t) $. The joint distribution of $ \{X(t), Y(t) \} $ conditioned on $ X(t-1) $ is multivariate normal. Our full data set includes all time steps, $ \{X, Y \} $.

Let's call the full state $ \{x ,y\} $, the values of $ X $ and $ Y $ at all times t. The full state can be an unconditional random variable, $ \{X,Y\} $, or a conditional random variable $ \{X,y\} $ (conditioned on $Y=y$). Page 227 near top of Louis 1982 becomes \begin{equation} \lambda(x,y,\theta) = \log\{ f_{XY}(x,y|\theta) \} \label{lambdaz} \end{equation} \begin{equation} \lambda^*(y,\theta) = \log\{ f_Y(y|\theta) \} = \log \int_X f_{XY}(x,y|\theta)dx \label{lambday} \end{equation} $ f(.|\theta) $ is the probability distribution of the random variable conditioned on $\theta$. $ \lambda $ is the full likelihood; 'full' means it includes both $ x $ and $ y $. $ \lambda^* $ is the likelihood of $ y $ alone. It is defined by the marginal distribution of $ y $ [1], which is the integral over $ X $ on the right side of \ref{lambday}. For a MARSS model, the data likelihood can be written easily as a function of the Kalman filter recursions (which is why you can write a recursion for the information matrix based on derivatives of $ \lambda^* $; see Part III).

Next equation down. Louis doesn't say this and his notation is not totally clear, but the expectation right above section 3 (and in his eqn 3.1) is a conditional expectation. This is critical to know to follow his derivation of equation 3.1 in the appendix. $ \theta_j $ is his $ \theta(0) $; it is the value of $ \theta $ at the last EM iteration. \begin{equation}\label{expLL} E_{X|y,\theta_j}[ \lambda( X, y, \theta)] = \int_X \lambda( x, y, \theta) f_{X|Y}(x|Y=y, \theta_j) dx \end{equation} My 'expectation' notation is a little different than Louis'. The subscript on the E shows what is being integrated ($X$) and what the conditionals are. The term $ f_{X|Y}(x|Y=y, \theta_j) $ is the probability of $ x $ conditioned on $ Y=y $ and $ \theta=\theta_j $. The subscript on $f$ indicates that we are using the probability distribution of $x$ conditioned on $Y=y$. For the EM algorithm, we need to distinguish between $ \theta $ and $ \theta_j $ because we maximize with respect to $ \theta $ not $ \theta_j $. If we just need the expectation at $ \theta $, no maximization step, then we just use $ \theta $ in $ f(.|\theta) $ and the subscript on E.

Before moving on with the derivation, notice that in \ref{expLL}, we fix $ y $, the data. We are not treating that as a random variable. We could certainly treat $ E_{\theta_j}[ \lambda( \{X, y\}, \theta)] $ as some function $g(y) $ and consider the random variable $ g(Y) $. But Louis (1982) will not go that route. $ y $ is fixed. Thus we are talking about the observed Fisher Information rather than the expected Fisher Information. The latter would take an expectation over the possible $ y $ generated by our model with parameters at the MLE.

Derivation of equation 3.1 in Louis 1982

Now we can derive equation 3.1 in Louis (1982). I am going to combine the info in Louis' section 3.1 and the appendix on the derivation of 3.1. Before proceeding, Louis is using 'denominator' format for his matrix derivatives; I normally use numerator format but I will follow his convention here. $ \theta $ is a column vector of parameters and the likelihood $ f(.|\theta)$ is scalar. Under 'denominator format', $ f^\prime(.|\theta) = df(.|\theta)/d\theta $ will be a column vector. $ f^{\prime\prime}(.|\theta) = d^2f(.|\theta)/d\theta d\theta^\top $ will be a matrix in Hessian format (the first $d\theta$ goes 1 to $n$ down columns and the second $d\theta$ goes 1 to $n$ across rows).

Take the derivative of \ref{lambdaz} with respect to $ \theta $ to define $ S(x,y,\theta) $. \begin{equation} S(x,y,\theta)=\lambda^\prime(x,y,\theta)=\frac{d \log\{f_{XY}(x,y|\theta)\} }{d \theta}= \frac{df_{XY}(x,y|\theta)/d\theta}{f_{XY}(x,y|\theta)} = \frac{f_{XY}^\prime(x,y|\theta)}{f_{XY}(x,y|\theta)}\label{Sz} \end{equation} Take the derivative of the far right side of \ref{lambday} with respect to $ \theta $ to define $ S^*(y,\theta) $. For the last step (far right), I used $ f_Y(y|\theta) = \int_X f_{XY}(x,y|\theta)dx $, the definition of the marginal distribution [1], to change the denominator. \begin{equation}\label{Sy} S^*(y,\theta)=\lambda^{*\prime}(y,\theta)=\frac{ d \log \int_X f_{XY}(x,y|\theta)dx }{d \theta} = \frac{ \int_X f_{XY}^\prime(x,y|\theta) dx }{ \int_X f_{XY}(x,y|\theta)dx } = \frac{ \int_X f_{XY}^\prime(x,y|\theta) dx }{ f_Y(y|\theta) } \end{equation} Now multiply the integrand in the numerator by $ f_{XY}(x,y|\theta)/f_{XY}(x,y|\theta) $. The last step (far right) uses \ref{Sz}. \begin{equation}\label{intfprime} \int_X f_{XY}^\prime(x,y|\theta) dx = \int_X \frac{f_{XY}^\prime(x,y|\theta)f_{XY}(x,y|\theta)}{f_{XY}(x,y|\theta)} dx = \int_X \frac{f_{XY}^\prime(x,y|\theta)}{f_{XY}(x,y|\theta)}f_{XY}(x,y|\theta) dx = \int_X S(x,y,\theta) f_{XY}(x,y|\theta) dx \end{equation} We combine \ref{Sy} and \ref{intfprime}: \begin{equation}\label{Sstar} S^*(y,\theta)= \frac{ \int_X f_{XY}^\prime(x,y|\theta) dx }{ f_Y(y|\theta) } = \int_X S(x,y,\theta) \frac{ f_{XY}(x,y|\theta) }{ f_Y(y|\theta) } dx = \int_X S(x,y,\theta) f_{X|Y}(x|Y=y,\theta) dx \end{equation} The second to last step used the fact that $ f_Y(y|\theta) $ does not involve $ x $, thus we can bring it into the integral. This gives us $ f_{XY}(x,y|\theta)/f_Y(y|\theta)$. This is the probability of $ x $ conditioned on $ Y=y $ [2].

The last step in the derivation of equation 3.1 is to recognize that the far right side of \ref{Sstar} is the conditional expectation in 3.1. Louis does not actually write out the expectation in 3.1 and the notation is rather vague. But the expectation in equation 3.1 is the conditional expectation on the far right side of \ref{Sstar}. \begin{equation}\label{Louise3p1} S^*(y,\theta)=\int_X S(x,y,\theta) f_{X|Y}(x|Y=y,\theta) dx=E_{X|y,\theta} [ S(X,y,\theta) ] \end{equation} using my notation for a conditional expectation, which is slightly different than Louis'. At the MLE, $ S^*(y,\hat{\theta})=0$ since that is how the MLE is defined (it's where the derivative of the data likelihood is zero).

Derivation of equation 3.2 in Louis 1982

The meat of Louis 1982 is equation 3.2. The observed Fisher Information matrix \ref{obsFI} is \begin{equation}\label{obsFI32} \mathcal{I}(\theta,y) = B^*(y,\theta) = -S^{*\prime}(y,\theta) = - \lambda^{*\prime\prime}(y,\theta) = - \frac{\partial^2\log f_Y(y|\theta)}{\partial \theta \partial \theta^\top} \end{equation} The first 3 terms on the left just show that all of these are notation for the observed Fisher Information. The 4th term is one of the ways we can compute the observed Fisher Information at $ \theta $ and the far right term shows that derivative explicitly.

We start by taking the second derivative of \ref{lambdaz} with respect to $ \theta $ to define $ B(x,y,\theta) $. We use $ S(x,y,\theta) $ as written in \ref{Sz}. \begin{equation}\label{B1} \mathcal{I}(\theta,x,y) = B(x,y,\theta)=-\lambda^{\prime\prime}(x,y,\theta) = -S^\prime(x,y,\theta) = -\frac{d[f_{XY}^\prime(x,y|\theta)/f_{XY}(x,y|\theta)]}{d \theta^\top} \end{equation} The transpose of $d\theta $ is because we are taking the second derivative $ d^2 l/d\theta d\theta^\top $ (the Hessian of the log-likelihood); $ d\theta d\theta $ wouldn't make sense as that would be a column vector times a column vector.

To do the derivative on the far right side of \ref{B1}, we first need to recognize the form of the equation. $ f_{XY}^\prime(x,y|\theta) $ is a column vector and $ f_{XY}(x,y|\theta) $ is a scalar, thus the thing we are taking the derivative of has the form $ \overrightarrow{h}(\theta)/g(\theta) $; the arrow over $h$ is indicating that it is a (column) vector while $g()$ is a scalar. Using the quotient rule for vector derivatives, we have \[ \frac{ d (\overrightarrow{h}(\theta)/g(\theta))}{d \theta^\top} = \frac{d\overrightarrow{h}(\theta)}{d \theta^\top}\frac{1}{g(\theta)} - \frac{\overrightarrow{h}(\theta)}{ g(\theta)^2 }\frac{ d g(\theta) }{ d \theta^\top } \] Thus (notice I'm writing the equation for the negative of $ B(x,y,\theta) $), \begin{equation}\label{B2} -B(x,y,\theta) = \frac{d(f_{XY}^\prime(x,y|\theta)/f_{XY}(x,y|\theta))}{d \theta^\top} = \frac{f_{XY}^{\prime\prime}(x,y|\theta)}{f_{XY}(x,y|\theta)} - \frac{f_{XY}^\prime(x,y|\theta)f_{XY}^\prime(x,y|\theta)^\top}{ f_{XY}(x,y|\theta)^2 }= \frac{f_{XY}^{\prime\prime}(x,y|\theta)}{f_{XY}(x,y|\theta)} - S(x,y|\theta)S(x,y|\theta)^\top \end{equation}

Let's return to \ref{obsFI32} and take the derivative of $ \lambda^{*\prime}(y,\theta)$ with respect to $ \theta $ using the form shown in equation \ref{Sy}. I have replaced the integral in the denominator by $ f_Y(y|\theta) $ and used the same quotient rule used for \ref{B2}. \begin{align} \begin{split} \lambda^{*\prime\prime}(y,\theta)= d\left( \int_X f_{XY}^\prime(x,y|\theta) dx \middle/ f_Y(y|\theta) \right)/d\theta^\top = \\ \frac{\int_X f_{XY}^{\prime\prime}(x,y|\theta) dx }{f_Y(y|\theta)}- \frac{\int_X f_{XY}^\prime(x,y|\theta)dx }{f_Y(y|\theta)} \left(\frac{\int_X f_{XY}^\prime(x,y|\theta)dx}{f_Y(y|\theta)}\right)^\top = \frac{\int_X f_{XY}^{\prime\prime}(x,y|\theta) dx }{f_Y(y|\theta)}- S^*(y|\theta)S^*(y|\theta)^\top \end{split} \end{align} The last substitution uses \ref{Sy}. Thus, \begin{equation}\label{B4} \lambda^{*\prime\prime}(y,\theta)= \frac{\int_X f_{XY}^{\prime\prime}(x,y|\theta) dx }{f_Y(y|\theta)}- S^*(y|\theta)S^*(y|\theta)^\top \end{equation} Let's look at the integral of the second derivative of $f_{XY}(x,y|\theta)$ in \ref{B4}: \begin{equation}\label{B5} \left( \int_X f_{XY}^{\prime\prime}(x,y|\theta) dx \middle/ f_Y(y|\theta) \right) = \int_X \frac{f_{XY}^{\prime\prime}(x,y|\theta)}{ f_{XY}(x,y|\theta) }\frac{f_{XY}(x,y|\theta)}{ f_Y(y|\theta)} dx= \int_X \frac{f_{XY}^{\prime\prime}(x,y|\theta)}{ f_{XY}(x,y|\theta) }f_{X|Y}(x|Y=y,\theta) dx \end{equation} This is the conditional expectation $ E_{X|y,\theta} [ f_{XY}^{\prime\prime}(X,y|\theta)/f_{XY}(X,y|\theta) ] $ that we see 5 lines above the references in Louis (1982).
Using \ref{B2} we can write the ratio inside this expectation in terms of $ B(x,y|\theta) $: \begin{equation}\label{B6} \frac{f_{XY}^{\prime\prime}(x,y|\theta)}{ f_{XY}(x,y|\theta) } = -B(x,y|\theta)+S(x,y|\theta)S(x,y|\theta)^\top \end{equation} Combining \ref{B4}, \ref{B5}, and \ref{B6}, we can write the equation above the references in Louis: \begin{equation}\label{B7} \lambda^{*\prime\prime}(y,\theta)= E_{X|y,\theta} [ - B(X,y|\theta)+S(X,y|\theta)S(X,y|\theta)^\top]-S^*(y|\theta)S^*(y|\theta)^\top \end{equation} The negative of this is the observed Fisher Information (\ref{obsFI32}) which gives us equation 3.2 in Louis (1982): \begin{equation}\label{Louismain} \mathcal{I}(\theta,y) = E_{X|y,\theta} [ B(X,y|\theta)] - E_{X|y,\theta} [ S(X,y|\theta)S(X,y|\theta)^\top]+S^*(y|\theta)S^*(y|\theta)^\top \end{equation}
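
Equation 3.2 can be checked numerically on a toy model. The sketch below is not a MARSS model; it is an assumed example chosen so that everything can be written down by hand. The hidden $X$ takes values 0 or 1 with fixed probabilities, and $ Y|X=x $ is Normal with mean $ \theta c_x $ and variance 1. The probabilities, the constants $c_x$, and all function names are made up for illustration. Because $X$ is discrete, the integrals over $X$ become sums. The code compares the right side of equation 3.2 against a finite-difference second derivative of $ \log f_Y(y|\theta) $.

```python
import math

# Assumed toy model (not MARSS): X in {0, 1}, Y | X=x ~ Normal(theta * C[x], 1).
P_X = {0: 0.7, 1: 0.3}   # assumed fixed probabilities of the hidden state
C = {0: 1.0, 1: 2.0}     # assumed known constants

def log_f_xy(x, y, theta):
    # Full-data log-likelihood, lambda(x, y, theta).
    return (math.log(P_X[x]) - 0.5 * math.log(2 * math.pi)
            - 0.5 * (y - theta * C[x]) ** 2)

def log_f_y(y, theta):
    # Data log-likelihood, lambda*(y, theta): sum the joint density over x.
    return math.log(sum(math.exp(log_f_xy(x, y, theta)) for x in P_X))

def louis_information(y, theta):
    # Conditional probabilities f(x | Y=y, theta).
    w = {x: math.exp(log_f_xy(x, y, theta) - log_f_y(y, theta)) for x in P_X}
    S = {x: C[x] * (y - theta * C[x]) for x in P_X}   # d lambda / d theta
    B = {x: C[x] ** 2 for x in P_X}                   # -d^2 lambda / d theta^2
    e_B = sum(w[x] * B[x] for x in P_X)
    e_SS = sum(w[x] * S[x] ** 2 for x in P_X)
    s_star = sum(w[x] * S[x] for x in P_X)            # equation 3.1
    return e_B - e_SS + s_star ** 2                   # equation 3.2

def numeric_information(y, theta, h=1e-4):
    # Central finite difference for -d^2 log f_Y / d theta^2.
    return -(log_f_y(y, theta + h) - 2 * log_f_y(y, theta)
             + log_f_y(y, theta - h)) / h ** 2

y_obs, theta0 = 1.3, 0.8   # arbitrary values; theta0 need not be the MLE
info_louis = louis_information(y_obs, theta0)
info_numeric = numeric_information(y_obs, theta0)
print(info_louis, info_numeric)   # should agree to several decimal places
```

The check is done at an arbitrary $ \theta $ rather than at the MLE, since equation 3.2 keeps the $ S^* S^{*\top} $ term and so holds for any $ \theta $.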

Derivation of equation 3.3 in Louis 1982

Louis states that "The first term in (3.2) is the conditional expected full data observed information matrix, while the last two produce the expected information for the conditional distribution of X given $X \in R$." His X is my $ \{X,Y\}$ and $ X \in R $ means $ Y=y $ in my context. He writes this in simplified form with $X$ replaced by $XY$: \[ I_Y = I_{XY} - I_{X|Y} \] \[ \mathcal{I}(\theta,y) = E_{X|y,\theta} [\mathcal{I}(\theta,X,y)] - I_{X|Y} \] Let's see how this is the case.

The full data observed information matrix is \[ \mathcal{I}(\theta,x,y) = -\lambda^{\prime\prime}(x,y|\theta) = B(x,y,\theta)\] This is simply the definition that Louis gives to $ B(x,y,\theta) $. We do not know $x$ so we do not know the full data observed Information matrix. But we have the distribution of $ x $ conditioned on our data $ y $. \[ E_{X|y,\theta} [ B(X,y|\theta)] \] is thus the expected full data observed information matrix conditioned on our observed data $ y $. So this is the first part of his statement. The second part of his statement takes a bit more effort to work out. First we substitute $ S^*(y|\theta) $ with $ E_{X|y,\theta} [ S(X,y|\theta) ] $ from \ref{Louise3p1}. This gives us this: \begin{equation}\label{ES1} E_{X|y,\theta} [ S(X,y|\theta)S(X,y|\theta)^\top ]-S^*(y|\theta)S^*(y|\theta)^\top = E_{X|y,\theta} [ S(X,y|\theta)S(X,y|\theta)^\top ]-E_{X|y,\theta} [ S(X,y|\theta) ]E_{X|y,\theta} [ S(X,y|\theta)^\top ] \end{equation} Using the computational form of the variance, $ var(X)=E(XX^\top)-E(X)E(X)^\top $, we can see that \ref{ES1} is the conditional variance of $ S(X,y|\theta) $. \[ var_{X|y,\theta}( S(X,y|\theta) ) \] But the variance of the score, the first derivative of $ \log f(X|\theta) $ with respect to $ \theta $, is the expected Fisher Information of $ X $ [4]. In this case, it is the expected Fisher Information of the hidden state $ X $, where we specify that $ X $ has the conditional distribution $ f_{X|Y} (X | Y=y,\theta) $. Thus we have the second part of Louis' statement.

Relating Louis 1982 to the update equations in the MARSS EM algorithm

The main result in Louis (1982) (\ref{Louismain}) can be written \begin{equation}\label{Louismain2} \mathcal{I}(\theta,y) = E_{X|y,\theta} [ B(X,y|\theta)] - var_{X|y,\theta} [ S(X,y|\theta) ] \end{equation} The M-step of the EM algorithm involves the first derivative of the log-likelihood with respect to $\theta$, $ S(X,y|\theta) $, since it involves setting this derivative to zero: \begin{equation} Q^\prime(\theta | \theta_j) = d( E_{X|y,\theta_j } [\log f_{XY}(X,y|\theta) ])/d\theta = E_{X|y,\theta_j } [ d \log f_{XY}(X,y|\theta)/d\theta ] = E_{X|y,\theta_j } [ S(X,y|\theta) ] \end{equation} With the MARSS model, $ S(X,y|\theta) $ is analytical and we can also compute $ B(X,y|\theta)$, the second derivative, analytically.

The difficulty arises with this term: $ var_{X|y,\theta} [ S(X,y|\theta) ] $. The $S(X,y|\theta)$ is a summation from $t=1$ to $T$ that involves $ X_t $ or $ X_t X_{t-1}^\top $ for some parameters. When we do the cross-product, we will end up with terms like $ E[ X_t X_{t+k}^\top ] $ and $ E[ X_t X_t^\top X_{t+k}X_{t+k}^\top ] $. The latter is not a problem; all the random variables in a MARSS model are multivariate normal and the k-th central moments can be expressed in terms of the first and second moments [5]. But that will still leave us with terms like $ E[ X_t X_{t+k}^\top ] $, which require the smoothed covariance between $X$ at time $t$ and $t+k$ conditioned on all the data ($t=1:T$).

Computing these is not hard. These are the n-step apart smoothed covariances. Harvey (1989), page 148, discusses how to use the Kalman filter to get the n-step ahead prediction covariances and a similar approach can be used (presumably) to get the $ V(t,t+k) $ smoothed covariances. However this will end up being computationally expensive because we will need all of the $ t,t+k $ combinations, i.e., {1,3}, {1,4}, ..., {2,3}, {2,4}, ..., etc. That will be a lot: $ T + (T-1) + (T-2) + \dots + 1 = T(T+1)/2 $ smoothed covariances. Lystig and Hughes (2012) and Duan and Fulop (2011) discuss this issue in a related application of the approach in Louis (1982). They suggest that you do not need to include covariances with a large time separation because the covariance goes to zero. You just need to include enough time-steps.
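
To get a feel for the size of the problem, here is a small counting sketch. The function name `count_pairs` and the `max_lag` argument are hypothetical, used only to illustrate the truncation idea; this is not code from any of the cited papers. It counts the $ (t, t+k) $ pairs with $ k \ge 0 $, either for all lags or only for lags up to a cutoff.

```python
def count_pairs(T, max_lag=None):
    """Count (t, t+k) pairs with 1 <= t <= t+k <= T and, optionally, k <= max_lag."""
    if max_lag is None:
        max_lag = T - 1   # no truncation: every lag is kept
    # For each t, the allowed lags are k = 0, 1, ..., min(max_lag, T - t).
    return sum(min(max_lag, T - t) + 1 for t in range(1, T + 1))

all_pairs = count_pairs(100)              # T(T+1)/2 = 5050
truncated = count_pairs(100, max_lag=5)   # 585
print(all_pairs, truncated)
```

With $ T=100 $, keeping only lags up to 5 drops the count from 5050 to 585. Each remaining pair still requires its own smoothed covariance matrix.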

Conclusion

I think the approach of Louis (1982) is not viable for MARSS models. The derivatives $B(x,y|\theta)$ and $S(x,y|\theta)$ are straight-forward (if tedious) to compute analytically following the approach in Holmes (2010). But computing all the n-step smoothed covariances is going to be very slow and each computation involves many matrix multiplications. However, one could compute $ \mathcal{I}(\theta,y) $ via simulation using \ref{Louismain2}. It is easy enough to simulate $ X$ using the MLEs and then compute $B(x_b,y|\theta)$ and $S(x_b,y|\theta)$ for each, where $x_b$ is the bootstrapped $x$ time series and $y$ is the data. I don't think it makes sense to do that for MARSS models since there are two recursion approaches for computing the observed and expected Fisher Information using $f(y|\theta)$ and the Kalman filter equations (Harvey 1989, pages 140-142; Cavanaugh and Shumway 1996).
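
For what it is worth, the simulation idea can be sketched on a toy problem. The model below is an assumption made for illustration and is not a MARSS model: a hidden $X$ that is 0 or 1 with fixed probabilities, and $ Y|X=x $ Normal with mean $ \theta c_x $ and variance 1. All names and constants are hypothetical. The code draws $ x_b $ from the conditional distribution of $X$ given $y$, computes $B$ and $S$ for each draw, and then uses \ref{Louismain2}: the mean of $B$ minus the sample variance of $S$.

```python
import math
import random

# Assumed toy model (not MARSS): X in {0, 1}, Y | X=x ~ Normal(theta * C[x], 1).
P_X = {0: 0.7, 1: 0.3}   # assumed fixed probabilities of the hidden state
C = {0: 1.0, 1: 2.0}     # assumed known constants

def posterior_x1(y, theta):
    # P(X = 1 | Y = y, theta) by Bayes' rule.
    w = {x: P_X[x] * math.exp(-0.5 * (y - theta * C[x]) ** 2) for x in P_X}
    return w[1] / (w[0] + w[1])

def simulated_information(y, theta, n_draws=20000, seed=1):
    rng = random.Random(seed)
    p1 = posterior_x1(y, theta)
    B_draws, S_draws = [], []
    for _ in range(n_draws):
        x_b = 1 if rng.random() < p1 else 0            # draw of X given y
        B_draws.append(C[x_b] ** 2)                    # B(x_b, y | theta)
        S_draws.append(C[x_b] * (y - theta * C[x_b]))  # S(x_b, y | theta)
    mean_B = sum(B_draws) / n_draws
    mean_S = sum(S_draws) / n_draws
    var_S = sum((s - mean_S) ** 2 for s in S_draws) / n_draws
    return mean_B - var_S                              # E[B] - var(S)

def exact_information(y, theta):
    # Same quantity with the two-point conditional distribution done by hand.
    p1 = posterior_x1(y, theta)
    mean_B = (1 - p1) * C[0] ** 2 + p1 * C[1] ** 2
    S0 = C[0] * (y - theta * C[0])
    S1 = C[1] * (y - theta * C[1])
    var_S = p1 * (1 - p1) * (S1 - S0) ** 2
    return mean_B - var_S

info_mc = simulated_information(1.3, 0.8)
info_exact = exact_information(1.3, 0.8)
print(info_mc, info_exact)   # close, up to Monte Carlo error
```

In a real MARSS problem each draw would be a whole simulated state time series, and $B$ and $S$ would be matrices and vectors rather than scalars.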

Footnotes

[1] Given a joint probability distribution of $ \{X,Y\}$, the marginal distribution of $ Y $ is $ \int_X f(X,Y) dx $. Discussions of the estimators for MARSS models often use the property of the marginal distributions of a multivariate normal without actually stating that this property is being used. The step in the derivation will just say, 'Thus' with no indication of what property was just used.
Reviewed here: http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html If you have a joint likelihood of some random variables, and you want the likelihood of a subset of those random variables, then you compute the marginal distribution, i.e. you integrate over the random variables you want to get rid of: \[ L(\theta | y) = \left. \int_X L(\theta | X,Y) dx \right|_{Y=y} \]. So we integrate out $ X $ from the full likelihood and then set $ Y=y $ to get the likelihood we want to maximize to get the MLE of $ \theta $ (if we want MLEs).

The marginal likelihood is a little different. The marginal likelihood is used when you want to get rid of some of the parameters, nuisance parameters. The integral you use is different: \[ L(\theta_1|y) = \int_{\theta_2} p(y|\theta_1,\theta_2) p(\theta_2|\theta_1)d\theta_2 \] This presumes that you have $ p(\theta_2|\theta_1) $.

The expected likelihood is different yet again: \[ E_{X,Y|Y=y,\theta_j} [L(\theta | X,Y) ] = \int_X L(\theta | X,Y) p(x|Y=y, \theta_j) dx \]. On the surface it looks like the equation for $ L(\theta|y) $ but it is different. $ \theta_j $ is not $ \theta $. It is the parameter value at which we are computing the expected value of $ X $. Maximizing $ E_{X,Y|Y=y,\theta_j} [L(\theta | X,Y) ] $ will increase the likelihood but will not take you to the MLE; you have to embed this maximization in the EM algorithm that walks up the likelihood surface.
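A toy example may make the difference concrete. Assume (purely for illustration, this is not a MARSS model) a scalar hidden $ X \sim N(\theta,1) $ and a single observation $ Y|X \sim N(X,1) $, so that the marginal of $ Y $ is $ N(\theta,2) $ and $ X|Y=y,\theta_j $ is $ N((\theta_j+y)/2, 1/2) $. The function names below are mine.

```r
# Toy model (an assumption for illustration): X ~ N(theta, 1), Y | X ~ N(X, 1)
# Full-data likelihood L(theta | x, y) = f(x, y | theta)
full_lik <- function(x, y, theta) dnorm(x, theta, 1) * dnorm(y, x, 1)

# L(theta | y): integrate x out of the full likelihood, evaluated at Y = y
lik_y <- function(theta, y) {
  integrate(function(x) full_lik(x, y, theta), -Inf, Inf)$value
}

# Expected likelihood: weight the full likelihood by p(x | Y = y, theta_j).
# For this toy model X | Y = y, theta_j ~ N((theta_j + y) / 2, 1/2)
expected_lik <- function(theta, y, theta_j) {
  integrate(function(x) {
    full_lik(x, y, theta) * dnorm(x, (theta_j + y) / 2, sqrt(1 / 2))
  }, -Inf, Inf)$value
}

y <- 1.3
theta <- 0.5
lik_num <- lik_y(theta, y)                         # numerical integral
lik_exact <- dnorm(y, mean = theta, sd = sqrt(2))  # closed-form marginal
exp_lik <- expected_lik(theta, y, theta_j = 0)     # a different quantity
c(numerical = lik_num, exact = lik_exact, expected = exp_lik)
```

The first two numbers should agree, while the third depends on the extra value $ \theta_j $ and is not the likelihood of $ \theta $ given $ y $.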

[2] $ P(A|B) = P(A \cap B)/P(B) $

[3] I normally think about $ Y $ as being partially observed (missing values) so I also take the expectation over $ Y(2) $ conditioned on $Y(1)$, where (1) means observed and (2) means missing. In Holmes (2010), this is done in order to derive general EM update equations for the missing values case. But my notation is getting hairy, so for this write-up, I'm treating $Y$ as fully observed; so no $Y(2)$ and I've dropped the integrals (expectations) over $ Y(2) $.

[4] http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Fisher_info.pdf

[5] https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Higher_moments

Papers and online references

Ng, Krishnan and McLachlan 2004
The EM algorithm. Section 3.5 discusses standard errors approaches
https://www.econstor.eu/dspace/bitstream/10419/22198/1/24_tk_gm_skn.pdf
http://hdl.handle.net/10419/22198

Efron and Hinkley 1978
(argues that the observed Fisher Information is better than expected Fisher Information in many/some cases. The same paper argues for the likelihood ratio method for CIs)
Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher Information
https://www.stat.tamu.edu/~suhasini/teaching613/expected_observed_information78.pdf

Hamilton 1994
http://web.pdx.edu/~crkl/readings/Hamilton94.pdf

Hamilton's exposition assumes you know the marginal distribution of a multivariate normal. Scroll down to the bottom.
http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html

Meilijson 1989
Fast improvement to the EM algorithm on its own terms
http://www.jstor.org/stable/pdf/2345847.pdf

Oakes 1999
Direct calculation of the information matrix via the EM algorithm
http://www.jstor.org/stable/pdf/2680653.pdf?_=1463187953783

Ho, Shumway and Ombao 2006
(this has a brief statement that Oakes 1999 derivatives are hard to compute. It doesn't say why. It says nothing of Louis 1982.)
Chapter 7, The state-space approach to modeling dynamic processes
Models for Intensive Longitudinal Data
https://books.google.com/books?hl=en&lr=&id=Semo20xZ_M8C

Louis 1982
(so elegant. alas, MARSS deals with time series data...)
Finding the observed information matrix when using the EM algorithm
http://www.jstor.org/stable/pdf/2345828.pdf
http://www.markirwin.net/stat220/Refs/louis1982.pdf

Lystig and Hughes 2012
(helped me better understand why Louis 1982 is hard for MARSS models)
Exact computation of the observed information matrix for hidden Markov models
http://www.tandfonline.com.offcampus.lib.washington.edu/doi/abs/10.1198/106186002402

Duan and Fulop 2011
(also helped me better understand why Louis 1982 is hard for MARSS models)
A stable estimator for the information matrix under EM for dependent data
http://www.rmi.nus.edu.sg/DuanJC/index_files/files/EM_Variance_March%205%202007.pdf
http://link.springer.com/article/10.1007/s11222-009-9149-4

Naranjo 2007 (didn't use)
State-space models with exogenous variables and missing data, PhD U of FL
http://etd.fcla.edu/UF/UFE0021568/naranjo_a.pdf

Dempster, Laird, Rubin 1977
(didn't really use but looked up more info on the 'score' function Q)
Maximum likelihood for incomplete data via the EM algorithm
http://web.mit.edu/6.435/www/Dempster77.pdf

van Dyk, Meng and Rubin 1995
(this looks promising)
Maximum likelihood estimation via the ECM algorithm: computing the asymptotic variance
http://wwwf.imperial.ac.uk/~dvandyk/Research/95-sinica-secm.pdf

Cavanaugh and Shumway 1996
On computing the expected Fisher Information Matrix for state-space model parameters

Harvey 1989, pages 140-143, Section 3.4.5 Information matrix
Forecasting, structural time series models and the Kalman filter

Notes on computing the Fisher Information matrix for MARSS models. Part I Background

2016-05-18T17:52:00.000-07:00

MathJax and blogger can be iffy. Try reloading if the equations don't show up and then wait, like 30-60 seconds for the equations to magically appear (fingers crossed).

The Fisher Information is defined as \begin{equation}\label{FisherInformation} I(\theta) = E_{Y|\theta}\{ [\partial\log L(\theta|Y)/\partial\theta]^2 \} = \int_y [\partial\log L(\theta|y)/\partial\theta]^2 f(y|\theta)dy \end{equation} In words, it is the expected value (taken over all possible data) of the square of the gradient (first derivative) of the log-likelihood surface at $ \theta $. It is a measure of how much information the data (from our experiment or monitoring) have about $ \theta $. The log-likelihood surface is for a fixed set of data and the $ \theta $ vary. The peak is at the MLE, which is not $ \theta $, so the surface has some gradient (slope) at $ \theta $. The Fisher Information is the expected value (over possible data) of those gradients (squared).

It can be shown[1] that the Fisher Information can also be written as \[ I(\theta) = - E_{Y|\theta}\{ \partial^2\log L(\theta|Y)/\partial\theta^2 \} = -\int_y [\partial^2\log L(\theta|y)/\partial\theta^2] f(y|\theta)dy \] So the Fisher Information is the average (over possible data) convexity of the log-likelihood surface at $ \theta $. That doesn't quite make sense to me. When I imagine the surface, the convexity at a non-peak value $ \theta $ is not intuitively the information. The gradient squared, I understand, but the convexity at a non-peak?

Note, my $ y $ should be understood to be some multi-dimensional data set (multiple sites over multiple time points, say), and is comprised of multiple samples. Often in this case Fisher Information is written $ I_n(\theta) $ and if the data points are all independent, $ I(\theta)=\frac{1}{n} I_n(\theta) $. However I'm not using that notation. My $ I(\theta) $ is referring to the Fisher Information for a dataset, not individual data points within that data set. We do not know $ \theta $ so we need to use an estimator for the Fisher Information.
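The equality of the two forms can be checked by simulation for a model where the derivatives are easy. The sketch below assumes i.i.d. Poisson data with rate $ \lambda $ (my choice of toy model, not a MARSS model), for which the score is $ \sum y/\lambda - n $ and the second derivative is $ -\sum y/\lambda^2 $.

```r
# Toy check, assuming i.i.d. Poisson(lambda) data (not a MARSS model)
set.seed(1)
lambda <- 2
n <- 50

score <- function(lambda, y) sum(y) / lambda - length(y)  # d logL / d lambda
hess <- function(lambda, y) -sum(y) / lambda^2            # d^2 logL / d lambda^2

# Average over many simulated datasets, all evaluated at the true lambda
sims <- replicate(20000, {
  y <- rpois(n, lambda)
  c(score_sq = score(lambda, y)^2, neg_hess = -hess(lambda, y))
})
fisher_mc <- rowMeans(sims)
fisher_mc          # both entries should be close to the analytical value
fisher_exact <- n / lambda  # analytical Fisher Information for the dataset
fisher_exact
```

Both Monte Carlo averages are taken at the true $ \lambda $, not at the MLE of each simulated dataset, which matches the definition above.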

A common approach is to use $ I(\hat{\theta}) $, the Fisher Information at the MLE $ \hat{\theta} $, as an estimator of $ I(\theta) $ because: \[ I(\hat{\theta}) \xrightarrow{P} I(\theta) \] This is called the expected Fisher Information and is computed at the MLE: \begin{equation}\label{expectedFisherInformation} I(\hat{\theta}) = - E_{Y|\hat{\theta}}\{ \partial^2\log L(\theta|Y)/\partial \theta^2 \} |_{\theta=\hat{\theta}} \end{equation} That $ |_{\theta=\hat{\theta}} $ at the end means that after doing the derivative with respect to $ \theta $, we replace $ \theta $ with $ \hat{\theta} $. It would not make sense to do the substitution before since $ \hat{\theta} $ is a fixed value and so you cannot take the derivative with respect to it. This is a viable approach if you can take the derivative of the log-likelihood with respect to $ \theta $ and can take the expectation over the data. You could always do that expectation using simulation of course. You just need to be able to simulate data from your model with $ \hat{\theta} $.

Another approach is to drop the expectation. This is termed the observed Fisher Information: \begin{equation}\label{observedFisherInformation} \mathcal{I}(\hat{\theta},y) = - \left.\frac{\partial^2\log L(\theta|y)}{\partial \theta^2} \right|_{\theta=\hat{\theta}} \end{equation} where $ y $ is the one dataset we collected. The observed Fisher Information is the curvature of the log-likelihood function around the MLE. When you estimate the variance of the MLEs from the Hessian of the log-likelihood (output from say some kind of Newton method or any other algorithm that uses the Hessian of the log-likelihood), then you are using the observed Fisher Information matrix.
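As a concrete instance of the Hessian route, here is a sketch that pulls the observed Fisher Information out of the Hessian returned by `optim()`, assuming a toy i.i.d. Poisson model rather than a MARSS model. For this model the analytical value at the MLE is $ n/\bar{y} $, which gives something to compare against.

```r
# Observed Fisher Information from a numerical Hessian
# (toy i.i.d. Poisson model, assumed for illustration)
set.seed(2)
n <- 50
y <- rpois(n, lambda = 2)

# optim() minimizes, so work with the negative log-likelihood; its Hessian
# at the minimum is then the observed Fisher Information directly
negLL <- function(lambda) -sum(dpois(y, lambda, log = TRUE))
fit <- optim(par = 1, fn = negLL, method = "L-BFGS-B",
             lower = 1e-6, hessian = TRUE)

mle <- fit$par
obs_info <- fit$hessian[1, 1]
c(numerical = obs_info, analytical = n / mean(y))
se_mle <- sqrt(1 / obs_info)  # approximate standard error of the MLE
se_mle
```

The Hessian here comes from finite differences, so the numerical and analytical values should agree only approximately.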

Efron and Hinkley (1978) (and Fisher, they say in their article) argue that the observed Fisher Information is a better estimate of the variance of $ \hat{\theta} $[2][3], while Cavanaugh and Shumway (1996) show results from MARSS models that indicate that the expected Fisher Information has lower mean squared error (though it may be more biased; mean squared error measures both bias and precision).

So how do we compute $ I(\hat{\theta}) $ or $ \mathcal{I}(\hat{\theta},y) $? In particular, I am interested in whether I can use the analytical derivatives of the full log-likelihood that are part of the EM algorithm.

Next: Notes on computing the Fisher Information matrix for MARSS models. Part II EM

Footnotes

[1] See any detailed write-up on Fisher Information. For example page 2 of these lecture notes on Fisher Information.
[2] The motivation for computing the Fisher Information is to get an estimate of the variance of $ \hat{\theta} $ for standard errors on the parameter estimates, say. $ var(\hat{\theta}) \xrightarrow{P} \frac{1}{I(\theta)} $.
[3] Note I'm using the notation of Cavanaugh and Shumway (1996). Efron and Hinkley (1978) use $ \mathscr{I}(\theta) $ for the expected Fisher Information and $ I(\theta) $ for the observed Fisher Information. Cavanaugh and Shumway (1996) use $ I(\theta) $ for the expected Fisher Information and $ \mathcal{I}(\theta,Y) $ for the observed Fisher Information. I use the same notation as Cavanaugh and Shumway (1996) except that they use $ I_n() $ and $ \mathcal{I}_n $ to be explicit that the data have $ n $ data points. I drop the $ n $ since I'm interested in the Fisher Information of the dataset not individual data points and if I need to use the information of the j-th data point, I would just write $ I_j() $. The other difference is that I use $ y $ to refer to the data. In my notation, $ Y $ is the random variable 'data' and $ y $ is a particular realization of that random variable. In some cases, I use $ y(1) $. That is when the random variable $ Y $ is only partially observed (meaning there are missing data points or time steps); $ y(1) $ is the observed portion of $ Y $.

References I looked at while working on this

Fisher Info Lectures

http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/Fisher_info.pdf
http://www.math.umt.edu/patterson/Information.pdf
http://www.stat.umn.edu/geyer/old03/5102/notes/fish.pdf

I also studied the Wikipedia Fisher Information page. Cavanaugh and Shumway (1996) have a succinct summary of Fisher Information in their introduction and I adopted their notation.

Papers

Efron and Hinkley 1978
(argues that the observed Fisher Information is better than expected Fisher Information in many/some cases. The same paper argues for the likelihood ratio method for CIs)
Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher Information
https://www.stat.tamu.edu/~suhasini/teaching613/expected_observed_information78.pdf

Cavanaugh and Shumway 1996
On computing the expected Fisher Information Matrix for state-space model parameters

Analysis of PhD and Baccalaureate origin of math faculty (Part I)

2016-04-28T17:09:00.001-07:00

I've been pondering the educational paths of math faculty, so I decided to collect some data by visiting the faculty websites and looking at CVs. I started with the top 20 or so schools on this ranking http://www.phds.org/rankings/mathematics and then added a few. I added some schools like U of WA, U of FL and U of ID for more regional diversity. I only collected data on PhD and undergrad institution from faculty who got their PhD in the US. If they got their undergrad degree in another country, I noted the country. If no undergrad institution was listed, I recorded the undergrad institution as 'unknown'. I only included full, associate and assistant faculty. I excluded lecturers and research faculty. I took data from 30 institutions (below). I was able to get PhD data on 761 faculty (656 male/105 female) and undergrad data on 577 of these (489 male/88 female). Here is where the faculty data were collected, broken out by institution. The number in parentheses is the number of faculty for which I was able to collect data.

CalTech(5), Columbia(21), Cornell(30), Harvard(23), MIT(48), NYU(43), Penn State(36), Princeton(34), Rutgers(55), Stanford(31), U Chicago(50), U of AZ(29), U of FL(33), U of ID(14), U of IL UC(36), U of MD(25), U of Mich(57), U of MN(26), U of MN Duluth(2), U of Rochester(13), U of T Austin(58), U of Utah(32), U of WA(47), U of WI(45), UC Berk(73), UC Davis(17), UC Irvine(18), UCLA(38), UPenn(14), Yale(17)

Here is the first set of plots. These plots show where faculty (whose info was posted) got their PhDs and bachelors. Only ca 50% of faculty post CVs so this is a sample of the faculty. Only faculty are included, not lecturers or research faculty, but I did include assistant and associate faculty. Note, I excluded faculty who got their PhD in another country. That's about 10% (except at CalTech where it is about 75%).

Plot 1 is just the Group 1 institutions. Harvard, Princeton, MIT, UC Berkeley, NYU, Stanford.
Why these? You'll see in plot 2. Plot 1 shows that this group is closed. Almost all faculty within this group got their PhD from institutions within this group. For the bachelor degrees, about 30% got their undergrad degree in another country. For those that got their bachelor's in the US, 40% got their bachelors from Group 1 and 50% got their undergrad from the Ivies+MIT+Stanford (excluding UC Berk).

Plot 2 shows just faculty from OUTSIDE Group 1. These are 23 large research universities. See the figure for the list. Within this group of 23,

56% of faculty got their PhD from Group 1 (right figure). This was how I defined Group 1--the schools whose PhDs showed up disproportionately.
23% got it from a University of XYZ (excluding UC Berkeley). This includes Canada flagships (so U of Toronto) but excludes, say, U of Rochester.
2% got it from a XYZ State institution (incl SUNY)

Group 1 shows up disproportionately in the undergrad degrees too. If the faculty got the undergrad degree in the US (about 60% of them), then

35% got their undergrad degree from a Group 1 institution
35% got it from the Ivies+MIT+Stanford. However, Dartmouth is an outlier as few of its undergrads show up.
16% got it from a University of XYZ and 7% got it from a XYZ State institution. This includes Canada flagships (so U of Toronto) but excludes, say, U of Rochester.
13% got it from a small liberal arts college. 27 different LACs appear, and almost all appear only once. The exception is Reed which appears 4x.
5.6% got it from the UC system (includes UC Berkeley which is 3.6 percent)
This means that over 2x as many faculty got their undergrad from a LAC than the entire UC system (188,000 undergrads). However, there are many LAC institutions and the total sum of their enrollment is likely greater than 188,000 undergrads.
43 out of the 248 faculty in this sample got their undergrad degree from Harvard or Princeton. That's 17%! It is somewhat higher in Group 1, 25%.

Related work: There is much work on this in other fields; however, I have not seen work that also looks at baccalaureate origin.

2015 Systematic inequality and hierarchy in faculty hiring networks. See especially the list of references in this paper.

Another way to get R package download stats

2015-04-23T18:18:00.002-07:00

This is code from Mark Scheuerell that was adapted from this post by Felix Schonbrodt for a different way to get download stats: http://www.nicebread.de/finally-tracking-cran-packages-downloads/

## adapted from code by Felix Schonbrodt
## http://www.nicebread.de/finally-tracking-cran-packages-downloads/


## ======================================================================
## Step 1: Download all log files
## ======================================================================

# start & end dates 12 months prior to current date
this.year = as.numeric(format(Sys.time(), "%Y"))
start <- as.Date( paste(this.year-1,"-",format(Sys.time(), "%m-%d"),sep="") )
today <- as.Date(Sys.time())

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

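# only download the files you don't have:
dir.create("CRANlogs", showWarnings = FALSE)
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))
# match each missing day to its own URL (urls is ordered by all_days, not by missing_days)
missing_urls <- urls[match(missing_days, as.character(all_days))]

for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(missing_urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}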


## ======================================================================
## Step 2: Load single data files into one big data.table
##
##   NOTE: this step takes FOREVER to run
## ======================================================================

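# only read the daily log files (skips CRANlogs.RData if it was saved on an earlier run)
file_list <- list.files("CRANlogs", pattern="[.]csv[.]gz$", full.names=TRUE)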

logs <- list()
for (file in file_list) {
  print(paste("Reading", file, "..."))
  logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
                             dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}

# rbind together all files
library(data.table)
dat <- rbindlist(logs)

# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

setkey(dat, package, date, week, country)

save(dat, file="CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")


## ======================================================================
## Step 3: Plot results
## ======================================================================

# vector of pkgs to compare
pkgs <- c("MARSS","dlm")

# vector of plot colors
clr <- seq(length(pkgs))

# downloads of selected pkgs by week
com1 <- dat[J(pkgs), length(unique(ip_id)), by=c("week", "package")]

# total downloads to date
com1[, sum(V1), by=package]

# cumulative downloads by week
com1$C1 <- (com1[, cumsum(V1), by=package])$V1

# nicer form for plotting; cast() is from the reshape package
library(reshape)
plotdat <- cast(com1,week ~ package, value="C1")

# plot cumulative downloads over time (drop the first column, which is the week label)
matplot(plotdat[,-1],
        type="l", lty="solid", lwd=2, col=clr,
        ylab="Cumulative downloads",
        xlab="Week of 2013")

legend(x="topleft", legend=colnames(plotdat)[-1],
       lty="solid", lwd=2, col=clr)
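Since Step 2 takes so long, it can help to try the Step 3 logic on a tiny made-up log first. The sketch below uses base R only and an invented data frame `toy` with the same `date`, `package` and `ip_id` columns as the CRAN logs; the counts are random, not real download numbers.

```r
# Synthetic stand-in for the CRAN logs (values are made up)
set.seed(42)
toy <- data.frame(
  date = as.Date("2014-01-06") + sample(0:27, 200, replace = TRUE),
  package = sample(c("MARSS", "dlm"), 200, replace = TRUE),
  ip_id = sample(1:30, 200, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$week <- strftime(as.POSIXlt(toy$date), format = "%Y-%W")

# downloads by week = number of distinct ip_id values per week and package
weekly <- aggregate(ip_id ~ week + package, data = toy,
                    FUN = function(x) length(unique(x)))
names(weekly)[names(weekly) == "ip_id"] <- "downloads"

# cumulative downloads by week within each package
weekly <- weekly[order(weekly$package, weekly$week), ]
weekly$cumulative <- ave(weekly$downloads, weekly$package, FUN = cumsum)
weekly
```

The last `cumulative` value for each package equals the total of its weekly counts, which is a quick sanity check to repeat on the real `com1` table.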

More general formulation. Tests 2

2014-01-10T15:45:00.000-08:00

More tests today of formulating the MARSS model as in the napkin math of the previous post. Better. Slightly faster (3-5%). I need to think more carefully about whether the x0 treatment is identical. The data need a column of NAs added at the start, since X is x(t) stacked over y(t-1), so at t=1 you have x(1) over y(0). There is never any data at t=0.
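Written out, the stacked model that the code below sets up appears to be the following. This is read off the Bt, Zt, Qt and Ut matrices in that code rather than copied from the napkin, so treat the notation (lowercase $ u $ and $ a $, the starred $ y^* $) as mine.

\[ \begin{bmatrix} x_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} B & 0 \\ Z & 0 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} u \\ a \end{bmatrix} + \begin{bmatrix} w_t \\ v_{t-1} \end{bmatrix}, \quad \begin{bmatrix} w_t \\ v_{t-1} \end{bmatrix} \sim MVN\left( 0, \begin{bmatrix} Q & 0 \\ 0 & R \end{bmatrix} \right) \]

\[ y^*_t = \begin{bmatrix} 0 & I \end{bmatrix} \begin{bmatrix} x_t \\ y_{t-1} \end{bmatrix} \]

The observation equation has no error term (R is set to zero in the stacked model) and $ y^*_t = y_{t-1} $, which is why the data matrix is shifted one column to the right with NA in the first column.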

Update: But adding NA to the start is not the same as using x00, and I will need to recode the Kalman filter to get the right result with x = [x(t) y(t-1)]' at t=1. It happens to work here, but other tests suggest not in general. Probably not a fruitful direction since perhaps it is not really necessary to have constraints across B and Z, though it feels 'complete'.

To do: This helps, but the previous test, where I incorporated U into B, slowed things down a lot. Why is that? Most likely because of the Q=0 bit, which hits the OmgQ code. How about adding a y=1 row and setting Q=1 so as not to hit that code? With the new formulation, I can have U*y(n+1) so U*1. That should work. "working using Tt form 2.R" in MARSS sandbox dir.

Estimates are pretty similar but not identical.

library(MARSS)
library(matlab) #for tic() and toc()
#spp interaction example
royale.dat = log(t(isleRoyal[,2:3]))
z.royale.dat=(royale.dat-apply(royale.dat,1,mean,na.rm=TRUE))/
  sqrt(apply(royale.dat,1,var,na.rm=TRUE))
Q=matrix(list(0),2,2);diag(Q)=c("q1","q2")
royale.model.1=list(Z="identity", B="unconstrained",
                    Q=Q, R="diagonal and unequal",
                    U="zero", tinitx=0)
cntl.list=list(allow.degen=FALSE,maxit=200)
tic()
kemfit=MARSS(z.royale.dat, model=royale.model.1, control=list(allow.degen=FALSE))
toc()

a=summary(kemfit$model)
tinitx=a$tinitx
m=dim(a$B)[1];n=dim(a$Z)[1]
Bt=matrix(list(0),n+m,n+m);Bt[1:m,1:m]=a$B;Bt[(m+1):(n+m),1:m]=a$Z
Zt=matrix(list(0),n,m+n); Zt[1:n,(m+1):(m+n)]=diag(1,n)
Qt=matrix(list(0),m+n,m+n); Qt[1:m,1:m]=a$Q; Qt[(m+1):(n+m),(m+1):(n+m)]=a$R
x0t=rbind(a$x0,matrix(list(0),n,1))
V0t=matrix(list(0),n+m,n+m); V0t[1:m,1:m]=a$V0
Ut=rbind(a$U,a$A)

newa = list(B=Bt, Z=Zt, U=Ut, A="zero", Q=Qt, R="zero", x0=x0t, V0=V0t, tinitx=tinitx)
inits.list=list(x0=matrix(1+kemfit$model$data[,1],m,1))
ddat=cbind(NA,kemfit$model$data)
tic()
kemfita = MARSS(ddat, model=newa, control=list(allow.degen=FALSE),inits=inits.list)
toc()
p1=coef(kemfit); p2=coef(kemfita)
#second row uses the logLik from the reformulated fit
rbind(c(p1$B,p1$Z,p1$U,p1$Q,p1$R,p1$x0,kemfit$logLik),c(p2$B,p2$U,p2$Q,p2$x0,kemfita$logLik))

#Works with this kemfit too
dat = t(harborSealWA)
dat = dat[2:4,] #remove the year row
#fit a model with 1 hidden state and 3 observation time series
tic()
kemfit = MARSS(dat, model=list(U=matrix(c("N","S","S"),3,1),tinitx=0), control=list(allow.degen=FALSE))
toc()

More general formulation of the MARSS model

2014-01-06T14:48:00.000-08:00

Napkin math. I've been pondering for some time how to formulate the MARSS model in a more general way to more fully allow constraints across parameter matrices and across the X and Y parts of the model. I also want to allow X to be observed.

Update 1/6/2014: This doesn't seem to get me anywhere. The EM algorithm requires that estimated matrix elements fall on rows of Q (and R) which are non-zero. Even putting U in B (or A into Z), which adds just one row, slows down the EM algorithm. Merging the y and x together in a matrix means I have NAs in the y*, representing the unobserved x in the stacked y-x. That leads to problems estimating x_0 because R=0 for those. That problem is fixable, but the others are more intractable. Given that just putting U into B didn't seem to get me anywhere, I'm going to drop this tangent and work on other stuff. The test code for putting U into B is below.

The first napkin shows how I think I want to do this. e_t is iid 0,1 Gaussian noise.

The second napkin shows how to set this up as a standard MARSS eqn, but involves a var-cov error matrix with a bunch of 0s. That's bad because wherever 0 rows appear in Q (or R), that row of B (or Z) cannot be estimated with the EM algorithm because it falls out of the likelihood equation that you integrate to get the updated B. That's a general difficulty with the EM approach.

Also, the Q' and R' matrices (from the Cholesky transformation) above will have different constraints than the original Q and R matrices. That I think makes this bottom formulation above impossible and takes me back to the top formulation.

Reference
Here's a write up of ARMA models in state-space form, which seems to have nothing to do with the scratches above but my reformulation is motivated by thinking about (among other things) rewriting ARMA models in state-space form.
http://www-stat.wharton.upenn.edu/~stine/stat910/lectures/14_state_space.pdf
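The 'U into B' reformulation tested in the code that follows can be checked numerically without MARSS. The sketch below uses made-up values for B, U and the process errors and only verifies that the augmented state [x(t) 1]' reproduces the original x; it says nothing about EM speed.

```r
# Numerical check of putting U into B (all values are made up)
set.seed(123)
m <- 2
TT <- 20
B <- matrix(c(0.8, 0.1, -0.2, 0.7), m, m)
U <- matrix(c(0.05, -0.03), m, 1)
w <- matrix(rnorm(m * TT, sd = 0.1), m, TT)  # process errors
x0 <- matrix(c(1, 2), m, 1)

# standard form: x(t) = B x(t-1) + U + w(t)
x <- matrix(NA, m, TT)
xprev <- x0
for (t in 1:TT) {
  x[, t] <- B %*% xprev + U + w[, t]
  xprev <- x[, t, drop = FALSE]
}

# augmented form: state is [x(t) 1]', U sits in the last column of Bt,
# and the extra row has no process error (the Q = 0 row)
Bt <- cbind(rbind(B, matrix(0, 1, m)), rbind(U, 1))
xa <- matrix(NA, m + 1, TT)
xaprev <- rbind(x0, 1)
for (t in 1:TT) {
  xa[, t] <- Bt %*% xaprev + c(w[, t], 0)
  xaprev <- xa[, t, drop = FALSE]
}

all.equal(x, xa[1:m, ])  # the first m rows are the original states
```

The extra state stays fixed at 1 for all t, so it carries no information and its row of Q is zero, which is the row that causes trouble for the EM updates.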

#12-20 notes
library(matlab)
# Test of some new ways to form the marss model to allow constraints across B and U
#harborSealWA is a n=5 matrix of logged population counts
dat = t(harborSealWA)
dat = dat[2:4,] #remove the year row
#fit a model with 1 hidden state and 3 observation time series
tic() #7.94 sec
kemfit = MARSS(dat, model=list(U=matrix(c("N","S","S"),3,1)), control=list(allow.degen=FALSE))
toc()

#reformat model to put U in B;

a=summary(kemfit$model)
m=dim(a$B)[1]
Bt=cbind(rbind(a$B,matrix(0,1,m)),matrix(c(a$U,1),m+1,1)); #KFAS Tt
Zt=cbind(a$Z,a$A)
Qt=matrix(list(0),m+1,m+1); Qt[1:m,1:m]=a$Q; Qt[m+1,m+1]=0
x0t=rbind(a$x0,1)
V0t=matrix(list(0),m+1,m+1); V0t[1:m,1:m]=a$V0

newa = list(B=Bt, Z=Zt, U="zero", A="zero", Q=Qt, R=a$R, x0=x0t, V0=V0t, tinitx=a$tinitx)
#will get same value but need to run longer
tic() #9.76 sec
kemfita = MARSS(kemfit$model$data, model=newa, control=list(allow.degen=FALSE))
toc()
rbind(c(coef(kemfit,type="vector"),kemfit$logLik),c(coef(kemfita,type="vector"),kemfita$logLik))

#This is an alternate approach that uses a y=1 row 9.78sec

#and x is [x y]'

a=summary(kemfit$model)
tinitx=a$tinitx
m=dim(a$B)[1];n=dim(a$Z)[1]
Bt=matrix(list(0),n+m+1,n+m+1);Bt[1:m,1:m]=a$B;Bt[(m+1):(n+m),1:m]=a$Z; Bt[1:m,n+m+1]=a$U
Zt=matrix(list(0),n+1,m+n+1); Zt[,(m+1):(m+n+1)]=diag(1,n+1)
Qt=matrix(list(0),m+n+1,m+n+1); Qt[1:m,1:m]=a$Q; Qt[(m+1):(n+m),(m+1):(n+m)]=a$R;Qt[m+n+1,m+n+1]=1
x0t=rbind(a$x0,matrix(list(0),n+1,1)); x0t[n+m+1,1]=1;
V0t=matrix(list(0),n+m+1,n+m+1)
Ut="zero"

newa = list(B=Bt, Z=Zt, U=Ut, A="zero", Q=Qt, R="zero", x0=x0t, V0=V0t, tinitx=tinitx)
inits.list=list(x0=matrix(1+kemfit$model$data[,1],m,1))
ddat=cbind(NA,kemfit$model$data); ddat=rbind(ddat,1)
tic()
kemfita = MARSS(ddat, model=newa, control=list(allow.degen=FALSE),inits=inits.list)
toc()
p1=coef(kemfit); p2=coef(kemfita)
rbind(c(p1$B,p1$Z,p1$U,p1$Q,p1$R,p1$x0,kemfit$logLik),c(p2$B,p2$U,p2$Q,p2$x0,kemfita$logLik))

#same reformat code can be run with this kemfit
#spp interaction example
royale.dat = log(t(isleRoyal[,2:3]))
z.royale.dat=(royale.dat-apply(royale.dat,1,mean,na.rm=TRUE))/
  sqrt(apply(royale.dat,1,var,na.rm=TRUE))
royale.model.1=list(Z="identity", B="unconstrained",
                    Q="diagonal and unequal", R="diagonal and unequal",
                    U="zero", tinitx=1)
cntl.list=list(allow.degen=FALSE,maxit=200)
tic()
kemfit=MARSS(z.royale.dat, model=royale.model.1, control=list(allow.degen=FALSE))
toc()

Quantifying R package downloads using the CRAN mirror stats

2013-12-20T12:36:00.001-08:00

Because I have to justify all the time I spend working on the MARSS package, I collect stats for how much it is downloaded relative to other R packages. To be honest, I think download stats are not really helpful for getting recognition for the work, not that it hurts. The only thing that really counts is citations of the published paper on MARSS, and that relies on users citing the paper. Even then the citations are not 'worth' as much since the paper is in a software journal. Fact is, one published paper in a high-impact journal cited 3 times is still "worth" a lot more than an R package downloaded hundreds of times a day. Such is the research life.

Here is the R code to get package stats off the CRAN mirror. See also this post by Felix Schonbrodt for a completely different way to get download stats: http://www.nicebread.de/finally-tracking-cran-packages-downloads/

require(XML)
require(RCurl)
require(httr)
require(stringr)

#read in table 13 which is the download stats table
a=readHTMLTable("http://cran.r-project.org/report_cran.html", which=13, stringsAsFactors=FALSE)
b=(as.numeric(a$reqs))
filename=a$file
#detect which filenames are .tar.gz files and which are .zip. (packages)
pkg=str_detect(filename,"tar.gz") & str_detect(filename, "/src/contrib/")
pkg2=str_detect(filename,".zip") & str_detect(filename, "/bin/windows/contrib/r-release")
#detect which are documentation
docum=str_detect(filename,".pdf")

#make some plots
par(mfrow=c(3,1))
#get the pkgname---because I need to deal with multiple versions of packages and I only want to count 1 of those
pkgname=sapply(filename[pkg],function(x){ tmp=str_split(str_split(x,"_")[[1]][1],"/")[[1]]; tmp[length(tmp)] })
#go through and just get the pkg version that has the max downloads
pkgcount=c()
for(i in unique(pkgname)){
  pkgcount=c(pkgcount,max(b[pkg][pkgname==i]))
}
#figure out which filename is MARSS
marsspkg=str_detect(filename,"tar.gz") & str_detect(filename, "/src/contrib/") & str_detect(filename, "MARSS")
#max(b[marsspkg]) means uses the count for whatever MARSS version is maximum to deal with multiple versions listed
titl=paste("Index of All R Source Package Downloads\ntop ",format(100*sum(pkgcount>max(b[marsspkg]))/length(pkgcount),digits=1),"%",sep="")
hist(log(pkgcount),main=titl,xlab="log(downloads)")
abline(v=log(max(b[marsspkg])),col="red")
text(log(max(b[marsspkg])),2000,"MARSS",pos=4)

pkgname=sapply(filename[pkg2],function(x){ tmp=str_split(str_split(x,"_")[[1]][1],"/")[[1]]; tmp[length(tmp)] })
marsspkg2=str_detect(filename,".zip") & str_detect(filename, "/bin/windows/contrib/r-release") & str_detect(filename, "MARSS")
pkgcount=c()
for(i in unique(pkgname)){
  pkgcount=c(pkgcount,max(b[pkg2][pkgname==i]))
}
titl=paste("Index of All R Package Windows Binaries Downloads\ntop ",format(100*sum(pkgcount>max(b[marsspkg2]))/length(pkgcount),digits=1),"%",sep="")
hist(log(pkgcount),main=titl,xlab="log(downloads)")
abline(v=log(max(b[marsspkg2])),col="red")
text(log(max(b[marsspkg2])),1000,"MARSS",pos=4)

#marssdocum must be defined before it is used in the title
marssdocum=str_detect(filename,".pdf") & str_detect(filename, "MARSS")
titl=paste("Index of R Package Documentation Downloads\ntop ",format(100*sum(b[docum]>max(b[marssdocum]))/length(b[docum]),digits=1),"%",sep="")
hist(log(b[docum]),main=titl,xlab="log(downloads)")
abline(v=log(max(b[marssdocum])),col="red")
text(log(max(b[marssdocum])),2000,"MARSS",pos=4)

#dlm compared to marss
dlmpkg=str_detect(filename,"tar.gz") & str_detect(filename, "/src/contrib/") & str_detect(filename, "/dlm_")
max(b[dlmpkg])
max(b[marsspkg]) #max to deal with different package versions and only use one
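The version-handling step above (strip the version from the file name, then keep the largest count per package) can be tried on a few invented file names without hitting the CRAN report page. This sketch uses base R's `basename()` and `sub()` in place of the `str_split()` calls; the file names and counts are made up.

```r
# Made-up rows in the style of the CRAN report table
filename <- c("/src/contrib/MARSS_3.8.tar.gz",
              "/src/contrib/MARSS_3.9.tar.gz",
              "/src/contrib/dlm_1.1-3.tar.gz",
              "/src/contrib/KFAS_1.0.2.tar.gz")
reqs <- c(45, 120, 300, 80)

# package name = file name up to the first underscore
pkgname <- sub("_.*$", "", basename(filename))

# keep the count for the most-downloaded version of each package
pkgcount <- tapply(reqs, pkgname, max)

# percent of packages with more downloads than MARSS
top_pct <- 100 * sum(pkgcount > pkgcount["MARSS"]) / length(pkgcount)
pkgcount
top_pct
```

Using `tapply()` replaces the explicit loop over `unique(pkgname)` and keeps the package names attached to the counts.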

Automating testing of package version updates (MARSS specific)

2013-12-12T15:21:00.002-08:00

With a major update to MARSS in the works with the EM algorithm translated to C++, I realized I needed to bite the bullet and automate the testing of the package updates. The main testing in the MARSS package occurs in the code in the extensive User Guide, but one of the tedious tasks for each version update has been making sure that I don't break anything with new updates and that any differences between output using different versions are expected (due to an intended change). I didn't want to duplicate effort put into the User Guide code by making special test code. Instead I just wanted to rerun the code from different package versions and make sure everything matched. I came up with a way to automate this by having different versions in different locations (in this case the base R library versus my local library). This also has the side benefit of testing that the R code supplied with the package can be sourced (I discovered a number of problems this way).

The code is below and the comments explain what it is doing. Each User Guide chapter comes with a code file that allows users to replicate the chapter's examples. The Sweave file for each chapter is written with special labels for code chunks that are to appear in these R files, and the makefile assembles these into the R files supplied with the package. I'll have to change this a bit to facilitate using it as test code:

Avoid not exporting code to the R file. The code in the Sweave files is flagged with "CS_" if I want it to appear in the R files for the users. I need to make sure I export all code that makes stuff in the chapters. Otherwise I risk not testing some of the code in the User Guide. Right now I pick-and-choose a bit and export mostly examples.
Avoid reusing object names. I tend to use "kem" and "kemfit" over and over in the chapter code. If I do that there won't be a separate object created for each bit of code.
Use set.seed() to ensure that objects from random number generators are the same.
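
The last two points can be tried out on a toy example. This is only a sketch: the temporary script and the helper run_script() are hypothetical stand-ins for a chapter's R file and are not part of MARSS. Sourcing into a fresh environment keeps each run's objects separate, and fixing the seed makes two runs directly comparable.

```r
# Write a tiny stand-in for a chapter code file (hypothetical content)
script = tempfile(fileext=".R")
writeLines(c("fit1 = rnorm(3)", "fit2 = fit1 * 2"), script)

# Hypothetical helper: run a script in its own environment with a fixed seed
# and return every object it created as a named list
run_script = function(file, seed=10){
  env = new.env()
  set.seed(seed)
  source(file, local=env)
  mget(ls(env), envir=env)
}

run1 = run_script(script)
run2 = run_script(script)
identical(run1, run2)
```

Without the set.seed() call the two lists would differ, because fit1 comes from a random number generator.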

Here is the test code. Basic idea is to load one MARSS version, run test code, save all objects as list. Repeat for 2nd version. Compare the lists from the two versions. Report any differences.

# ###########################################
# This compares output from two different MARSS versions
# using the R code in the doc folder
# How to run
# Install one version of MARSS into the base R library
# Install a second version into the local R library
# Open the unit test.R file
# RShowDoc("versiontest.R", package="MARSS")
# Source the code.
# Note: Using 'build and reload' from RStudio builds the package into the local
# library but does not install the doc or help files
# Use Install from zip and install from a .tar.gz file instead
# ###########################################

#make sure MARSS isn't loaded
try(detach(package:MARSS),silent=TRUE)

#New version should be in the local library
lib.loc = Sys.getenv("R_LIBS_USER")
unittestvrs=packageVersion("MARSS", lib.loc = lib.loc)
library(MARSS, lib.loc = lib.loc)

#Get whatever code files are in the doc directory; these are tested
unittestfiles = dir(path=paste(lib.loc,"/MARSS/doc",sep=""), pattern="*[.]R$", full.names = TRUE)

cat("Running code with MARSS version", as.character(unittestvrs), "\n")
for(unittestfile in unittestfiles){
  #clean the workspace but keep objects needed for the unit test
  rm(list = ls()[!(ls()%in%c("unittestfile","unittestfiles","unittestvrs"))])
  #set up name for log files
  tag=strsplit(unittestfile,"/")[[1]]
  tag=tag[length(tag)]
  tag=strsplit(tag,"[.]")[[1]][1]
  #run the code which will create objects
  cat("Running ",unittestfile, "\n")
  sink(paste("outputNew-",tag,".txt",sep=""))
  #wrapped in try so it keeps going if the code has a problem
  #set the seed so any random nums are the same
  set.seed(10)
  try(source(unittestfile))
  sink()
  #make a list of objects created by the test code
  testNew = mget(ls()[!(ls()%in%c("unittestfile","unittestfiles","unittestvrs"))])
  save(testNew,file=paste(tag,unittestvrs,".Rdata",sep=""))
}
#detach the new version
detach(package:MARSS)

#Repeat for an older version of MARSS which is in the R library (no local library)
lib.loc = paste(Sys.getenv("R_HOME"),"/library",sep="")
unittestvrs=packageVersion("MARSS", lib.loc = lib.loc)
library(MARSS, lib.loc = lib.loc)
cat("\n\nRunning code with MARSS version", as.character(unittestvrs), "\n")
for(unittestfile in unittestfiles){
  rm(list = ls()[!(ls()%in%c("unittestfile","unittestfiles","unittestvrs"))])
  tag=strsplit(unittestfile,"/")[[1]]
  tag=tag[length(tag)]
  tag=strsplit(tag,"[.]")[[1]][1]
  cat("Running ",unittestfile, "\n")
  sink(paste("outputOld-",tag,".txt",sep=""))
  set.seed(10)
  try(source(unittestfile))
  sink()
  testOld = mget(ls()[!(ls()%in%c("unittestfile","unittestfiles","unittestvrs"))])
  save(testOld,file=paste(tag,unittestvrs,".Rdata",sep=""))
}
detach(package:MARSS)

#Now start comparing the lists made using different versions of MARSS
cat("\n\nStarting object comparisons\n")
for(unittestfile in unittestfiles){
  #Get the file name
  tag=strsplit(unittestfile,"/")[[1]]
  tag=tag[length(tag)]
  tag=strsplit(tag,"[.]")[[1]][1]
  #Load in the 2 lists, testNew and testOld
  vrs=packageVersion("MARSS", lib.loc = Sys.getenv("R_LIBS_USER"))
  load(file=paste(tag,vrs,".Rdata",sep=""))
  lib.loc = paste(Sys.getenv("R_HOME"),"/library",sep="")
  vrs=packageVersion("MARSS", lib.loc = lib.loc)
  load(file=paste(tag,vrs,".Rdata",sep=""))
  
  #Compare the lists and report any differences
  cat("Checking ", tag, "\n")
  if(!identical(names(testNew), names(testOld))){
    cat("ERROR: Names of the test lists not identical\n\n")
    next
  }
  good=rep(TRUE,length(names(testNew)))
  for(ii in 1:length(names(testNew))){
    if(!identical(testNew[[ii]], testOld[[ii]])) good[ii] = FALSE
  }
  if(!all(good)){
    cat("ERROR: The following objects are not identical\n")
    cat(names(testNew)[!good])
    cat("\n\n")
  }else{
    cat("PASSED\n\n")
  }
}
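
One caveat with identical() is that it flags any difference at all, including tiny floating point differences that a C++ translation might introduce. If that turns out to be too strict, a tolerance-based comparison is one option. The sketch below is an assumption about how that could look; compare_objects() and the toy lists are hypothetical and not part of the test code above.

```r
# Hypothetical helper: return the names of objects that differ by more
# than a numeric tolerance (non-numeric objects must match exactly)
compare_objects = function(new, old, tolerance=1e-8){
  same = sapply(names(new), function(nm){
    isTRUE(all.equal(new[[nm]], old[[nm]], tolerance=tolerance))
  })
  names(new)[!same]
}

# Toy lists standing in for testNew and testOld
toyNew = list(a=1, b=c(1,2), c="x")
toyOld = list(a=1+1e-12, b=c(1,2.5), c="x")
compare_objects(toyNew, toyOld)
```

Here the tiny difference in a is ignored while the real difference in b is reported.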

More native R versus RcppArmadillo speed test comparisons for EM algorithm

2013-12-11T14:58:00.001-08:00

Following on my previous post, I continue to evaluate whether time spent writing C++ code for some of my EM algorithm in MARSS is time well-spent. Today I wrote a small function for one of the update equations, the R update. Below is a little function in R to do the biggest part of that update:

test = function(Z, A, dR, kf, Ey){
sum1 = t.dR.dR = 0
TT = dim(kf[["xtT"]])[2]
t.dR.dR = t.dR.dR + crossprod(dR)
for (i in 1:TT) {
    hatyt = Ey[["ytT"]][,i,drop=FALSE]; hatyxt=sub3D(Ey[["yxtT"]],t=i); hatOt = sub3D(Ey[["OtT"]],t=i)
    hatPt = kf[["VtT"]][,,i]+tcrossprod(kf[["xtT"]][,i,drop=FALSE])
    hatxt = kf[["xtT"]][,i,drop=FALSE]
    sum1a = (hatOt - tcrossprod(hatyxt, Z) - tcrossprod(Z, hatyxt)- tcrossprod(hatyt, A) - tcrossprod(A, hatyt) + tcrossprod(Z%*%hatPt, Z) + tcrossprod(Z%*%hatxt, A) + tcrossprod(A, Z%*%hatxt) + tcrossprod(A)) #A%*%t.A
    sum1 = sum1 + crossprod(dR, vec(sum1a))
}
return(sum1)
}
Z is a matrix, A a matrix, dR a 3D array, kf a list with a 3D array and 2D matrix I need, Ey is a list with 2 3D arrays and 1 2D matrix I need.
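
The function relies on two MARSS internals, sub3D() and vec(). For readers without MARSS loaded, here is a rough idea of what they do. The functions below are hypothetical stand-ins written from the description in this post, not the actual MARSS code.

```r
# Hypothetical stand-in for MARSS:::sub3D: pull slice t out of a 3D array
# and always return a 2D matrix, even when a dimension has length 1
sub3D_sketch = function(x, t=1){
  matrix(x[,,t], dim(x)[1], dim(x)[2])
}

# Hypothetical stand-in for MARSS:::vec: stack a matrix into a column vector
vec_sketch = function(x){
  matrix(x, length(x), 1)
}

M = array(1:15, dim=c(1,3,5))
dim(M[,,2])                 # plain subsetting drops to a vector, so dim is NULL
dim(sub3D_sketch(M, t=2))   # the helper keeps the 1 x 3 shape
dim(vec_sketch(matrix(1:6, 2, 3)))
```

The point of the helper is the first case: M[,,2] silently loses its matrix shape when the first dimension is 1, which breaks later matrix products.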

Here's some RcppArmadillo (C++) to replicate the function above:

// [[Rcpp::depends(RcppArmadillo)]]

#include <RcppArmadillo.h>

using namespace arma;

// [[Rcpp::export]]
vec Rupdate2(mat Z, mat A, Rcpp::NumericVector vecdR, Rcpp::List& kf, Rcpp::List& Ey) {
Rcpp::NumericVector kfVtT = kf["VtT"], EyyxtT = Ey["yxtT"], EyOtT = Ey["OtT"];
mat ytT = Ey["ytT"], xtT = kf["xtT"];
vec dRDim = vecdR.attr("dim"), VtTDim = kfVtT.attr("dim"), yxtTDim = EyyxtT.attr("dim"), OtTDim = EyOtT.attr("dim");
unsigned int TT = xtT.n_cols, m = xtT.n_rows, n = ytT.n_rows, p = dRDim[1];
cube VtT(kfVtT.begin(), VtTDim[0], VtTDim[1], VtTDim[2], false);
cube yxtT(EyyxtT.begin(), yxtTDim[0], yxtTDim[1], yxtTDim[2], false);
cube OtT(EyOtT.begin(), OtTDim[0], OtTDim[1], OtTDim[2], false);
cube cubedR(vecdR.begin(), dRDim[0], dRDim[1], dRDim[2], false);
vec hatyt(n), hatxt(m), sum1=zeros(p);
mat hatyxt(n,m), hatOt(n,n), hatPt(m,m), sum1a(n,n), dR=cubedR.slice(0);
for (unsigned int i = 0; i<TT; i++) {
    if(dRDim[2]>1) dR=cubedR.slice(i);
    hatyt=ytT.col(i); hatxt=xtT.col(i);
    hatOt=OtT.slice(i); hatyxt=yxtT.slice(i);
    hatPt=VtT.slice(i) + hatxt * hatxt.t();
    sum1a = hatOt - hatyxt * Z.t() - Z * hatyxt.t()- hatyt * A.t() - A * hatyt.t() + Z * hatPt * Z.t() + (Z * hatxt) * A.t() + A * (hatxt.t() * Z.t()) + A*A.t();
    sum1 = sum1 + dR.t() * vectorise(sum1a);
}
return sum1;
}

A few notes

Rcpp::List& is to make it pass the list by reference instead of copying the memory
cubedR(vecdR.begin(), dRDim[0], dRDim[1], dRDim[2], false) is the way to do the same thing when you need to construct a cube; the final false tells Armadillo to reuse the memory of the R object rather than copy it. See comments by Rcpp developer on this SO post.

I tried it on a 15 x 20 matrix of data and a 15 x 154 matrix, and both times got about a 5-fold increase in speed:

  replications elapsed relative
2         1000    3.15    1.010
1         1000   16.42    5.263

So not orders of magnitude like I'd hoped, but probably enough to speed up the EM part by 50% when all is said and done.

################################################
## Self-contained benchmark example for blog
################################################
require(utils)
require(Rcpp)
require(RcppArmadillo)
require(rbenchmark)
require(MARSS)

#these are internal functions to MARSS since, surprisingly, R doesn't have these
#in a 3D array say M[2,3,5] get the 2D matrix M[2,3,1] -> sub3D(M, t=1)
sub3D=MARSS:::sub3D
# turn a 2D matrix into a column vector
vec=MARSS:::vec

test = function(Z, A, dR, kf, Ey){
  sum1 = t.dR.dR = 0
  TT = dim(kf[["xtT"]])[2]
  t.dR.dR = t.dR.dR + crossprod(dR)
  for (i in 1:TT) {
    hatyt = Ey[["ytT"]][,i,drop=FALSE]
    hatyxt= sub3D(Ey[["yxtT"]],t=i)
    hatOt = sub3D(Ey[["OtT"]],t=i)
    hatPt = kf[["VtT"]][,,i]+tcrossprod(kf[["xtT"]][,i,drop=FALSE])
    hatxt = kf[["xtT"]][,i,drop=FALSE]
    sum1a = (hatOt - tcrossprod(hatyxt, Z) - tcrossprod(Z, hatyxt)- tcrossprod(hatyt, A) - tcrossprod(A, hatyt) + tcrossprod(Z%*%hatPt, Z) + tcrossprod(Z%*%hatxt, A) + tcrossprod(A, Z%*%hatxt) + tcrossprod(A)) #A%*%t.A
    #sum1a = symm(sum1a) #enforce symmetry function from MARSSkf
    sum1 = sum1 + crossprod(dR, vec(sum1a))
  }
  return(sum1)
}

#if this fun saved to file Rupdate.cpp, use sourceCpp("Rupdate.cpp")
sourceCpp(code='
// [[Rcpp::depends(RcppArmadillo)]]

#include <RcppArmadillo.h>

using namespace arma;

// [[Rcpp::export]]
vec Rupdate(mat& Z, mat& A, Rcpp::NumericVector vecdR, Rcpp::List& kf, Rcpp::List& Ey) {
Rcpp::NumericVector kfVtT = kf["VtT"], EyyxtT = Ey["yxtT"], EyOtT = Ey["OtT"];
mat ytT = Ey["ytT"], xtT = kf["xtT"];
vec dRDim = vecdR.attr("dim"), VtTDim = kfVtT.attr("dim"), yxtTDim = EyyxtT.attr("dim"), OtTDim = EyOtT.attr("dim");
unsigned int TT = xtT.n_cols, m = xtT.n_rows, n = ytT.n_rows, p = dRDim[1];
cube VtT(kfVtT.begin(), VtTDim[0], VtTDim[1], VtTDim[2], false);
cube yxtT(EyyxtT.begin(), yxtTDim[0], yxtTDim[1], yxtTDim[2], false);
cube OtT(EyOtT.begin(), OtTDim[0], OtTDim[1], OtTDim[2], false);
cube cubedR(vecdR.begin(), dRDim[0], dRDim[1], dRDim[2], false);
vec hatyt(n), hatxt(m), sum1=zeros(p);
mat hatyxt(n,m), hatOt(n,n), hatPt(m,m), sum1a(n,n), dR=cubedR.slice(0);
for (unsigned int i = 0; i<TT; i++) {
    if(dRDim[2]>1) dR=cubedR.slice(i);
    hatyt=ytT.col(i); hatxt=xtT.col(i);
    hatOt=OtT.slice(i); hatyxt=yxtT.slice(i);
    hatPt=VtT.slice(i) + hatxt * hatxt.t();
    sum1a = hatOt - hatyxt * Z.t() - Z * hatyxt.t()- hatyt * A.t() - A * hatyt.t() + Z * hatPt * Z.t() + (Z * hatxt) * A.t() + A * (hatxt.t() * Z.t()) + A*A.t();
    sum1 = sum1 + dR.t() * vectorise(sum1a);
}
return sum1;
}'
)

#test w 15 x t matrix of data
nr=15 #rows
for(t in c(20, 154)){
  dat = t(apply(matrix(rnorm(nr*t),nr,t),1,cumsum)) #create nr random walks
  fit = MARSS(dat, silent=TRUE)
  Ey = print(fit, what="Ey", silent=TRUE)
  kf = print(fit, what="kfs", silent=TRUE)
  model = coef(fit, type="matrix")
  dR=fit$model$free[["R"]]
  Z=model$Z
  A=model$A
  res <- benchmark(test(Z, A, dR, kf, Ey), Rupdate(Z, A, dR, kf, Ey),
                   columns = c("test", "replications","elapsed", "relative"),
                   order="relative",replications=1000)
  cat("test with ",nr,"x",t," matrix\n")
  print(res[,1:4])
  cat("are the results the same? ")
  cat(identical(unname(test(Z, A, dR, kf, Ey)), Rupdate(Z, A, dR, kf, Ey)))
  cat("\n")
}

Speed comparisons using native R and RcppArmadillo

2013-12-10T22:58:00.002-08:00

I spent some time today learning the RcppArmadillo package that allows you to run C++ code from the Armadillo linear algebra library. I developed my example (below) from the examples directory in the inst directory of RcppArmadillo, but I also learned a lot from other posts and the actual documentation for Armadillo. And I spent some time on this C++ tutorial since I don't actually know C++ (though eons ago I programmed in Fortran and C, and C++ reminds me why I like matlab and R better). I downloaded the source files for RcppArmadillo from CRAN and extracted the tar file to get example files for "kalman" which I edited for my purposes.

The following code compares this computation in native R versus Armadillo C++

Y=0
for(i in 1:nrow(A)) Y = Y + (A%*%B) %*% t(A[i,,drop=FALSE]) + B %*% t(A[i,,drop=FALSE])

Here's the benchmark comparison and you can see that Armadillo C++ is considerably faster. Note, I had to futz a bit to find a computation where C++ was much faster, but this particular computation is very similar to one I make in the EM step for the MARSS package.

                test replications elapsed relative
2 crossprodCpp(A, B)           10    0.14    1.000
1    crosstest(A, B)           10    9.50   67.857
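
For anyone unfamiliar with the rbenchmark output, the relative column is just each elapsed time divided by the fastest one. A quick check using the numbers from the table above:

```r
# Elapsed times copied from the benchmark table above
elapsed = c(crossprodCpp=0.14, crosstest=9.50)
relative = elapsed / min(elapsed)
round(relative, 3)
```

So the native R version took roughly 68 times as long as the C++ version for this computation.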

Here is the code to run this particular example. Just install the RcppArmadillo and rbenchmark R packages and you should be good to go. You don't need to install Armadillo. Just source the code below. Unfortunately, MARSS stores things in arrays and it turns out that passing these to RcppArmadillo is tedious and my initial speed tests were not promising, but that's another post.

Update: Turns out I picked a problem where Armadillo excels and used tcrossprod where it is not so efficient. If I'd used
    Y = Y + A%*%(B%*%t(a))+B%*%t(a)
Rcpp is only 4x faster. If I'd used
    Y = Y + tcrossprod(A%*%B,B) + tcrossprod(B,B)
or
    Y = Y +A%*%B%*%B + B%*%B
Rcpp is no faster.
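
Part of what is going on in the update above is the order of multiplication. The sketch below is my own illustration with made-up random matrices, not the benchmark from this post: (A%*%B)%*%t(a) forms a full matrix-matrix product before multiplying by the vector, while A%*%(B%*%t(a)) only ever does matrix-vector products, yet both give the same answer.

```r
set.seed(1)
n = 200
A = matrix(rnorm(n*n), n, n)
B = matrix(rnorm(n*n), n, n)
a = A[1,,drop=FALSE]          # one row of A as a 1 x n matrix

y1 = (A %*% B) %*% t(a)       # matrix-matrix product first: order n^3 work
y2 = A %*% (B %*% t(a))       # two matrix-vector products: order n^2 work
all.equal(y1, y2)

# rough timing; the actual numbers depend on the machine and BLAS
system.time(for(i in 1:50) (A %*% B) %*% t(a))
system.time(for(i in 1:50) A %*% (B %*% t(a)))
```

Inside a loop over rows, the first form repeats the expensive A%*%B on every pass unless it is computed once outside the loop.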

CODE ----------------------------------------------------
src='
// [[Rcpp::depends(RcppArmadillo)]]

#include <RcppArmadillo.h>

using namespace arma;

// [[Rcpp::export]]
mat crossprodCpp(mat A, mat B) {
unsigned int n = A.n_cols;
colvec a;
mat Y = zeros(n, 1);
for (unsigned int i = 0; i<n; i++) {
a = A.row(i).t();
Y = Y + (A*B) * a + B * a;
}
return Y;
}'

require(utils)
require(Rcpp) #provides sourceCpp()
require(RcppArmadillo)
require(rbenchmark)

sourceCpp(code=src)

crosstest = function(A,B){
Y = 0
for(i in 1:dim(A)[1]){
    a=A[i,,drop=FALSE]
    Y = Y + tcrossprod(A%*%B,a)+tcrossprod(B,a)
}
return(Y)
}

A=diag(200); B=diag(200)

res <- benchmark(crosstest(A,B), crossprodCpp(A,B),
                 columns = c("test", "replications",
                             "elapsed", "relative"),
                 order="relative",
                 replications=10)

print(res[,1:4])

Speeding up R (MARSS specific)

2013-11-26T14:48:00.001-08:00

Some notes on MARSS 3.6 update.

MARSS is generally slow since it is in native R but is slower than needed since it is doing a lot of matrix manipulations. I got a 10-20 increase in speed for large matrix (10,000 rows, 100s cols) problems (which arise when say n=120 and 120 r's are being estimated) by the following in order of how much it helped

Replace all instances of t(A)%*%B and A%*%t(B) with crossprod(A,B) and tcrossprod(A,B), respectively.
vectorize the one case where I was using for(i in 1:nrow){ for(j in 1:ncol) {} } to do something to each element of matrix, element by element. This only was done once, but ground the code to a halt for big matrices.
Used R profiling to find that a slow diagonal matrix test was slowing my Kalman filter function. Replaced that with a fast test.
Found all cases where I was subsetting arrays, like A[,,i], and replaced with code like this if(dim(A)[3]==1) dim(A)=dim(A)[1:2]. This torched the dimnames, so I needed to be careful to reset those if needed.
Made sure I was not recreating matrices unnecessarily. The diagonal matrices created for degenerate R and Q were getting created over and over. Made a flag so that they are created only once and only updated if new 0s appear.
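
Two of the items above are easy to check on toy data. This is a minimal sketch with made-up matrices, not code from MARSS: it confirms the crossprod/tcrossprod replacements give the same result, and shows the dim trick for a 3D array with a single time step, including putting the dimnames back.

```r
set.seed(1)
A = matrix(rnorm(200*50), 200, 50)
B = matrix(rnorm(200*50), 200, 50)

# same results, but crossprod/tcrossprod avoid forming the transpose
all.equal(t(A) %*% B, crossprod(A, B))
all.equal(A %*% t(B), tcrossprod(A, B))

# dropping the 3rd dim of an n x m x 1 array by resetting dim
X = array(1:6, dim=c(2,3,1), dimnames=list(c("r1","r2"), c("c1","c2","c3"), NULL))
dn = dimnames(X)                  # save the dimnames first
if(dim(X)[3]==1) dim(X) = dim(X)[1:2]
is.null(dimnames(X))              # resetting dim removes the dimnames
dimnames(X) = dn[1:2]             # so put the row and column names back
X
```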

Things I tried that didn't help

using the Matrix package and sparse matrices, but that only helped when n was really big and hurt when n was small.
vectorize the for loop over time using block diagonal matrices and the Matrix package. See previous blog post on that test. I was really bummed that didn't speed things up dramatically.

To do:

Bust out the degen code into a bit with if(allow.degen==TRUE).
Clean up the set-up and testing code for marssMODELs
Maybe do. If there are no zeros on the diagonals in A, then the solve(A)%*%b call can be sped up with solve(A,b). If I isolate the degen code, then I could use that. Right now I need to do a robust inverse that deals with 0s on the diagonal and structures that prevent solve() from working. I bet I cannot use this though because the problem is not just 0s on the diagonal.
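
For the "maybe do" item, here is what the swap would look like on a well-behaved matrix. This is a sketch with a made-up positive definite A (an assumption; it deliberately avoids the degenerate cases mentioned above, where solve() would fail).

```r
set.seed(1)
n = 100
A = crossprod(matrix(rnorm(n*n), n, n)) + diag(n)  # made-up invertible matrix
b = rnorm(n)

x1 = solve(A) %*% b   # forms the full inverse, then multiplies
x2 = solve(A, b)      # solves the linear system directly
all.equal(as.vector(x1), x2)
```

solve(A, b) skips computing the inverse, which is where the speed gain would come from.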

References
Faster R notes: http://pj.freefaculty.org/blog/?p=122

Some speed testing and profiling code
require(MARSS)
plankdat=lakeWAplanktonTrans
plankdat=plankdat[plankdat[,"Year"]>=1980 & plankdat[,"Year"]<1990,]
# create vector of phytoplankton group names
phytoplankton = c("Cryptomonus", "Diatoms", "Greens",
"Bluegreens", "Unicells", "Other.algae")
# get only the phytoplankton
dat.spp.1980 = as.matrix(plankdat[,phytoplankton])
dat.spp.1980 = t(dat.spp.1980)

cntl.list = list(maxit=50)
model.list = list(m=2, R="diagonal and equal")
model.list = list(m=2, R="diagonal and unequal")

#quick speed testing using matlab
require(matlab)
n=10
a=c()
for(i in 1:n) a=rbind(a,dat.spp.1980)
a=a+matrix(rnorm(length(a),0,.1),dim(a)[1],dim(a)[2])
R=matrix(list(0),dim(a)[1],dim(a)[1])
diag(R)=as.character(1:dim(a)[1])
model.list = list(m=2, R=R)
tic()
kemz.2 = MARSS(a, model=model.list, z.score=TRUE, form="dfa", control=cntl.list)
toc()

#R profiling
Rprof(tmp<-tempfile())
kemz.2 = MARSS(dat.spp.1980, model=model.list, z.score=TRUE, form="dfa", control=list(maxit=50))
Rprof()
summaryRprof(tmp)

#system.time
system.time(MARSS(dat.spp.1980, model=model.list, z.score=TRUE, form="dfa", control=list(maxit=50)))

Speed test using Matrix to do block summation

2013-11-25T14:38:00.000-08:00

I've spent the last week speeding up the MARSS package by getting rid of some expensive matrix manipulations. It turns out that subscripting a large matrix and taking the transpose is really slow in R. Some of my design matrices have 100s of thousands of rows, so that was getting slow.

In the process, I've been thinking about how to speed up the EM algorithm by getting rid of the "for" loops over time. One idea is to use the Matrix package and use block diagonal matrices instead of arrays to hold the time-varying matrices coming out of the Kalman filter. Preliminary speed test was not promising however. It was slow to use matrix multiplication to do a simultaneous summation. Speed tests are below. Doing the summation with matrix multiplication took consistently about twice the time.

Idea is to replace the for loop over 3rd dim (time) of array:
for(i in 1:TT) sum1=sum1+a[,,i]%*%b[,,i]

with this
sum1 = II.row %*% a.blockdiag %*% b.blockdiag %*% II.col
where a.blockdiag is a block diagonal with each a[,,i] a block down the diagonal, b.blockdiag is similar, II.row is a row of TT identity matrices and II.col is a col of TT identity matrices.
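
To see that the two forms really give the same sum, here is a small check in base R. It is only a sketch: blockdiag() is a hypothetical dense stand-in for bdiag() from the Matrix package, and the arrays are made-up random numbers.

```r
set.seed(1)
TT = 4; n = 3
a = array(rnorm(n*n*TT), dim=c(n,n,TT))
b = array(rnorm(n*n*TT), dim=c(n,n,TT))

# the for loop over the 3rd dim (time)
sum1 = matrix(0, n, n)
for(i in 1:TT) sum1 = sum1 + a[,,i] %*% b[,,i]

# hypothetical dense stand-in for Matrix::bdiag: x[,,i] down the diagonal
blockdiag = function(x){
  n = dim(x)[1]; TT = dim(x)[3]
  out = matrix(0, n*TT, n*TT)
  for(i in 1:TT){
    idx = ((i-1)*n+1):(i*n)
    out[idx, idx] = x[,,i]
  }
  out
}

II.row = do.call(cbind, rep(list(diag(n)), TT))  # row of TT identity matrices
II.col = t(II.row)                               # col of TT identity matrices
sum2 = II.row %*% blockdiag(a) %*% blockdiag(b) %*% II.col
all.equal(sum1, sum2)
```

The block diagonal product lines up a[,,i] %*% b[,,i] in block i, and the identity rows and columns add the blocks together.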

Here's some speed test code
require(Matrix)
TT=100; n=20
a=array(1:(n*n),dim=c(n,n,TT))
xtT=matrix(1:n,n,TT)

#set up the II.row and II.col
II.row = II.col = Diagonal(n)
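#cbind/rbind work on Matrix objects; the older cBind/rBind are deprecated in newer versions of Matrix
for(i in 1:(TT-1)) II.row = cbind(II.row,Diagonal(n))
for(i in 1:(TT-1)) II.col = rbind(II.col,Diagonal(n))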
I.row=Matrix(1,1,TT)
I.col=Matrix(1,TT,1)

#set up the block diag matrices; Matrix wants a list of matrices for bdiag()
b=list()
for(i in 1:TT) b[[i]]=a[,,i]
d=bdiag(b)
for(i in 1:TT) b[[i]]=xtT[,i,drop=FALSE]
e=bdiag(b)

#speed test
require(matlab)
tic()
for(j in 1:100)
sum2=II.row%*%d%*%e%*%I.col
toc()

tic()
for(j in 1:100){
sum1=0; for(i in 1:TT) sum1=sum1+a[,,i]%*%xtT[,i]
}
toc()

tic()
for(j in 1:100){
sum2=I.row%*%crossprod(e,d)%*%II.col
}
toc()

tic()
for(j in 1:100){
sum1=0; for(i in 1:TT) sum1=sum1+crossprod(xtT[,i,drop=FALSE],a[,,i])
}
toc()

Fitting big state-space models with glmnet?

2013-11-05T17:45:00.002-08:00

Brian Dennis once showed me an algorithm for fitting state-space models using a big matrix of all the data. I viewed the approach as unworkable for anything but small data sets. Maybe glmnet could be used?

Dixon and Coles tests repeated with glmnet(..., lambda=0)

2013-10-11T14:01:00.003-07:00

I repeated the tests from my previous post using glmnet(..., lambda=0). Given how well this worked with real soccer data, I was surprised that it would not work with 'better' simulated data. Maybe I do need to constrain the model?

Update: I figured out that this is a problem with family="poisson" when mu is pathologically huge (like 1e365). Works fine on more realistic mu. I tested on various soccer fits with speedglm() versus glmnet() and they gave the same estimates. I recoded rank.teams() in fbRanks to allow use of glmnet.

In all tests, the network of games played (who plays who) is real. See previous post for a description of the tests.
Test 1: attack and defense strength for teams are drawn i.i.d. NON-CONVERGENCE
Test 2: attack and defense strength drawn with same mean and variance but age groups do not have different means. NON-CONVERGENCE
Test 3: attack and defense strengths set from estimated values from speedglm using real data. glmnet works well and seems a little better than speedglm.

What is up with test 1 and test 2? Works fine if I don't pass in lambda=0, while test 3 works fine if I do. Maybe too many 0s if I draw strengths randomly? Variance within age groups is maybe less than 1 or 2 (real data)?

Follow-up on the glmnet problems I had for fitting poisson models

2013-10-11T14:00:00.000-07:00

After emailing the maintainer of glmnet, I figured out what was going on with glmnet. It is not a bug with intercept estimation as I had thought, but rather that when mu (that generated the poisson counts) gets very large, as in 1e+13 large, the glmnet algorithm likelihood surface (or whatever surface it is maximizing) gets very, very, very flat. So it shows convergence at the default thresh of 1e-7 long before it is near the correct maximum.

Here is simulated data:

library(glmnet)
N=1000; p=50
nzc=p
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta
mu=exp(f)
y=rpois(N,mu)

#intercept should be 0; it's not anywhere close
fit=glmnet(x,y,family="poisson")
coef(fit)[1,]
fit2=glm(y~x,family="poisson")
coef(fit2)[1]

So there are 2 problems
1) I should have used exact=TRUE in my coef() call
cor(coef(fit2),as.numeric(coef(fit,s=0)))
[1] 0.4459366
cor(coef(fit2),as.numeric(coef(fit,s=0,exact=TRUE)))
[1] 0.8289489
2) I should have set thresh much lower
fit=glmnet(x,y,family="poisson",thresh=1e-10)
cor(coef(fit2),as.numeric(coef(fit,s=0,exact=TRUE)))
[1] 0.9995521

Nonetheless, this still doesn't work
fit=glmnet(x,y,family="poisson",thresh=1e-10,intercept=FALSE)
Warning message: from glmnet Fortran code (error code -2); Convergence for 2th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned

Maybe it would converge if I set maxit much higher, but glm doesn't have trouble:
fit2=glm(y~-1+x,family="poisson")

At this point, I decided to upgrade R from 2.15.3 to 3.0.2. After the upgrade, my glmnet() call threw a y<0 error about 1/4 of the time even though y was never < 0. Here's the email I sent to the maintainer that sorted out all the oddness:

-------------------------------

Hi,

Thanks for the response and edited code. I had tried standardize=FALSE and setting thresh=1e-10 before writing, but I had still been getting the result so thought that the slow convergence was being caused by an intercept estimation issue. Here's an example with some sample output that shows what I was seeing every so often:

N=1000; p=50; nzc=p
x=matrix(rnorm(N*p),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta #intercept is 0
mu=exp(f)
y=rpois(N,mu)

fit=glmnet(x,y,family="poisson",standardize=FALSE,thresh=1e-10,maxit=1e7)
coef(fit,s=0,exact=TRUE)[1] #big intercept
[1] 8.289213
cor(coef(fit2),as.numeric(coef(fit,s=0,exact=TRUE))) #low correlation with glm()
[1] 0.4740912

The reason I thought it was an intercept issue was that I had gotten this error for what seemed an easy problem.

fit=glmnet(x,y,family="poisson",intercept=FALSE)
Warning message:
from glmnet Fortran code (error code -2); Convergence for 2th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned

while this had no trouble

fit2=glm(y~-1+x,family="poisson")
I thought it odd that glmnet() had trouble on what seemed a relatively simple problem and I had never observed glmnet to not converge when glm did. I've been using glmnet for awhile and had never seen it have trouble with a model that glm solved easily. I thought it had something to do with a problem with setting intercept=FALSE and that led to the query and code I originally sent you.

However, after writing you I updated from R 2.15.3 to R 3.0.2. And then glmnet started returning an error that y<0 a quarter of the time with my sample code. I tracked this down to a change in the rpois() behavior in 3.0.2:

R 2.15.2 and 2.15.3
rpois(1,1e10)

[1] 10000025096 (e.g.)

R 3.0.2
rpois(1,1e10)
[1] NA

When those NA appeared, glmnet reported some y<0 and thus returned an error. At that point, I realized that the example code was producing mu's that were exceedingly large. And when mu was very large, that's when I was seeing the low correlation. For example, for the example shown above max(mu) was 4.831217e+12 . After reading your email, I see that even though thresh=1e-10, with this big of mu, the glmnet() algorithm was not near the maximum and was approaching the maximum very, very slowly so thresh would need to be even smaller than 1e-10.

But if I change the code so that mu is more reasonable, then glmnet() has no convergence issues:

#draw x from normal with smaller variance

N=1000; p=50; nzc=p
x=matrix(rnorm(N*p,0,0.1),N,p)

beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta #intercept is 0
mu=exp(f)
y=rpois(N,mu)

fit=glmnet(x,y,family="poisson",standardize=FALSE)
fit2=glm(y~x,family="poisson")
cor(coef(fit2),as.numeric(coef(fit,s=0,exact=TRUE)))
[1] 1
Now this works fine too
fit=glmnet(x,y,family="poisson",intercept=FALSE)

So, no bug just interesting (different) behavior for glm vs glmnet for really large mu.

Regards,

Eli

-----------------------------------------------------------

And now this code works fine
N=1000; p=50
nzc=p
x=matrix(rnorm(N*p,0,0.1),N,p)
beta=rnorm(nzc)
f = x[,seq(nzc)]%*%beta #intercept is 0
mu=exp(f)
y=rpois(N,mu)
fit=glmnet(x,y,family="poisson",intercept=FALSE,lambda=0)
fit2=glm(y~-1+x,family="poisson")
cor(coef(fit2),as.numeric(coef(fit)[-1]))
[1] 1
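
The difference between the two simulations comes down to the scale of mu. This can be seen without fitting anything. The sketch below reuses the same set-up but is my own illustration: it scales the same x draws by 0.1 instead of redrawing them, so the two cases are directly comparable.

```r
set.seed(1)
N=1000; p=50
x = matrix(rnorm(N*p), N, p)
beta = rnorm(p)

mu.big = exp(x %*% beta)            # x with sd 1, as in the problem case
mu.small = exp((0.1*x) %*% beta)    # same draws scaled to sd 0.1
c(max(mu.big), max(mu.small))
```

With sd 1 the linear predictor has a standard deviation of about sqrt(sum(beta^2)), so its exponent can be astronomically large in the tail, while the scaled version stays at a size that looks like real count data.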

glm, speedglm, glmnet comparison (part 1 repeated with lambda=0)

2013-10-11T13:53:00.000-07:00

glmnet does OLS when you set alpha=1 and lambda=0, so should return the same values as glm. I repeated the part 1 tests with lambda=0. Works fine for family="gaussian". Crashes for family="poisson".

Update: what's going on is I inadvertently created mu's for the poisson that are huge: max(y) = 1e8 to 1e10. When facs = 20, with enough draws, every so often sum(betas) > 17 or so, and then exp(sum(betas)) is HUGE. That is where the problem is. When I used smaller betas, so that sum(betas) is never greater than 10 (say), the convergence problem disappears.
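
A back-of-the-envelope check of the numbers in the update, as a rough sketch only: it treats sum(betas) for one observation as a sum of facs independent normal(0,1) draws, which ignores the fact that all observations share the same beta matrix.

```r
facs = 20
sd.sum = sqrt(facs)   # sd of a sum of facs independent normal(0,1) betas

# chance that a single sum exceeds 17 under that approximation
p.big = pnorm(17, mean=0, sd=sd.sum, lower.tail=FALSE)
p.big

# the Poisson mean implied by sum(betas) = 17
exp(17)
```

The chance for any one observation is small, but with n = 10000 observations per simulation it is not surprising to see a few, and each one has a mean in the tens of millions.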

Test is a model with facs factors and levs levels. betas drawn from normal(0,1). I took care in this test to set the first level of each factor to 0, same as glm does, and estimated the intercept.

with family gaussian all is fine

  facs   glm speedglm glmnet
a    5  0.09     0.03   0.02
a   10  0.20     0.06   0.05
a   20  0.67     0.22   0.15
a   50  3.67     1.27   0.09
a  100 14.24     1.67   0.17

with family poisson it doesn't work

Warning messages:
1: from glmnet Fortran code (error code -1); Convergence for 1th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned 
2: In getcoef(fit, nvars, nx, vnames) :
  an empty model has been returned; probably a convergence issue

timings

facs       glm  speedglm    glmnet
a  5      0.22      0.11      0.05
a 10      0.44      0.22      0.52
a 20      1.33      0.75     86.09

R code
library(glmnet)
library(speedglm)

n = 10000
levs = 10

res=obj=c()
for(facs in c(5,10,20)){
beta=matrix(rnorm(levs*facs,0,1),levs,facs)

levx = c()
for(i in 1:facs) levx=cbind(levx,sample(levs,n,replace=TRUE))

y <- apply(levx, 1, function(x){ rpois(1,exp(sum(beta[x+seq(0,levs*(facs-1),by=levs)]))) })
x = data.frame(levx)
for(i in 1:facs) x[,i]=factor(x[,i],levels=1:levs)
dat = cbind(y=y,x)

cat(facs);cat("\n")

#set up the formula
fooform = "y~1"
for(i in 1:facs) fooform=paste(fooform,"+X",i,sep="")

#fit glm
#don't do glm if facs>500; too slow
if(facs<500){
    a=c(facs, system.time(fit<-glm(formula(fooform), data=dat, family="poisson"))[1])
    b=c(facs, object.size(fit))
}else{
    a=c(facs, NA)
    b=c(facs, NA)
}

#fit speedglm
a=c(a,system.time(fit2<-speedglm(as.formula(fooform), data=dat, family=poisson(log)))[1])
b=c(b, object.size(fit2))

#fit glmnet
sx.j = as.vector(levx+matrix(seq(0,levs*(facs-1),by=levs),n,facs,byrow=TRUE))
sx.i = rep(1:n,facs)
#remove the 1st level of each factor; do sx.i first because test depends on sx.j
sx.i = sx.i[!(sx.j %in% seq(1,levs*facs,by=levs))]
sx.j = sx.j[!(sx.j %in% seq(1,levs*facs,by=levs))]
sx=sparseMatrix(sx.i,sx.j,
                  x=1,dims=c(n,levs*facs))
a=c(a, system.time(fit<-glmnet(sx, y, lambda=0, family="poisson"))[1])
b=c(b, object.size(fit))

res=rbind(res,a)
obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of explanatory variables",ylim=c(0,500))
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

#plot 2
plot(obj[,1],log(obj[,2]),type="l", ylab="object size (log(M))", xlab="number of explanatory variables",ylim=c(10,23))
lines(obj[,1],log(obj[,3]),col="red",lty=2)
lines(obj[,1],log(obj[,4]),col="blue",lty=3)
abline(h=log(8042*1e6))
abline(h=log(2000*1e6))
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

Dixon and Coles 2-player model with glmnet: Tests with simulated real data

2013-10-10T12:35:00.001-07:00

Update: all these tests are using the default lambda for glmnet. Since posting this I discovered that passing in lambda=0 to glmnet forces it to return the equivalent glm estimates. speedglm and glmnet estimates are much more similar and glmnet seems more robust with lambda=0. Next post repeats the analyses here but with lambda=0 passed into glmnet.

Now I will take real soccer match data, with all its non-random mixing, but substitute randomly generated scores from teams with known attack and defend strengths. I do a series of tests to try to replicate the problem shown in Dixon and Coles model fit with glmnet: Test 1 with real data. Test 1 is with i.i.d. attack strengths and didn't reproduce the problem. Test 2 is a little more realistic and has correlated attack and defend strengths (teams with strong attack tend to have strong defense) mimicking the real data. This didn't reproduce the problem either. Test 3 uses the attack and defense strengths that came out of speedglm used on the real data. These have the property that different age groups have different strengths. This finally replicates the problem and shows that it is glmnet that is producing the bad estimates (cannot recover the values used to produce the simulated data). The full R code to run all the tests is at the bottom. Note all these results are using glmnet coefficients at lambda = max(lambda).

Test 1: Attack and defend strengths are i.i.d normal (the bold bit)
sim_data=rank_data
sim_data$scores = rank_data$scores[!(is.na(rank_data$scores$home.score) & is.na(rank_data$scores$away.score)),]
teams = unique(c(as.character(sim_data$scores$home.team),as.character(sim_data$scores$away.team)))
nteams = length(teams)
#start of the "bold bit": draw the attack and defend strengths
sim.attack = rnorm(nteams)
sim.defend = rnorm(nteams)-.4
#end of the "bold bit"
sim.scores = sim_data$scores
ngames = dim(sim.scores)[1]
for(i in 1:ngames){
sim.scores$home.score[i] = rpois(1,exp(sim.attack[which(teams==sim.scores$home.team[i])] - sim.defend[which(teams==sim.scores$away.team[i])]))
sim.scores$away.score[i] = rpois(1,exp(sim.attack[which(teams==sim.scores$away.team[i])] - sim.defend[which(teams==sim.scores$home.team[i])]))
}
sim_data$scores=sim.scores
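
The loop above looks up each team with which() one game at a time. The same expected scores can be computed in one shot with match(). The sketch below uses a tiny made-up schedule (the teams and games are hypothetical, not the real data) and checks the vectorized means against the loop version.

```r
# Hypothetical toy schedule standing in for sim_data$scores
teams = c("A","B","C")
toy.scores = data.frame(home.team=c("A","B","C","A"),
                        away.team=c("B","C","A","C"),
                        stringsAsFactors=FALSE)
set.seed(1)
sim.attack = rnorm(length(teams))
sim.defend = rnorm(length(teams))-.4

# vectorized: index of the home and away team for every game
h = match(toy.scores$home.team, teams)
a = match(toy.scores$away.team, teams)
lambda.home = exp(sim.attack[h] - sim.defend[a])
lambda.away = exp(sim.attack[a] - sim.defend[h])

# loop version of the home means, as in the code above
lambda.loop = numeric(nrow(toy.scores))
for(i in 1:nrow(toy.scores)){
  lambda.loop[i] = exp(sim.attack[which(teams==toy.scores$home.team[i])] -
                       sim.defend[which(teams==toy.scores$away.team[i])])
}
all.equal(lambda.home, lambda.loop)

# draw all the scores at once
toy.scores$home.score = rpois(nrow(toy.scores), lambda.home)
toy.scores$away.score = rpois(nrow(toy.scores), lambda.away)
```

The draws will not match the loop version number for number, since the random numbers are consumed in a different order, but they come from the same distributions.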

Make sure the sim and real data look remotely similar (the 0.4 subtracted from sim.defend was to get the mean goals scored similar and # of 0s similar).

par(mfrow=c(1,2))
hist(sim.scores$home.score,breaks=0:500,xlim=c(0,20),ylim=c(0,10000),main="simulated",xlab="home score")
hist(rank_data$scores$home.score,breaks=0:500,xlim=c(0,20),ylim=c(0,10000),main="real",xlab="home score")

Not horrible. I'll go with that.

Fit the simulated data with speedglm and glmnet
glmnet run with default settings (except family="poisson")
#using fbRanks 2.0
age=c("B01","B00","B99","B98","B97","B96")
fbRanks.sim=rank.teams(scores=sim_data$scores, teams=sim_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="glmnet")
p.sim=print(fbRanks.sim,silent=TRUE)$ranks[[1]]
fbRanks.sim.spdglm=rank.teams(scores=sim_data$scores, teams=sim_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="speedglm")
p.spd=print(fbRanks.sim.spdglm,silent=TRUE)$ranks[[1]]

Plot estimates against true
par(mfrow=c(1,2))
plot(log(p.sim$attack),sim.attack[match(p.sim$team,teams)],
main="glmnet",xlab="estimated attack",ylab="true attack")
plot(log(p.spd$attack),sim.attack[match(p.spd$team,teams)],
main="speedglm",xlab="estimated attack",ylab="true attack")

Compare to each other:
#compare to each other
par(mfrow=c(1,1))
teamsb=union(as.character(p.sim$team),as.character(p.spd$team))
plot(p.spd$total[match(teamsb,p.spd$team)],p.sim$total[match(teamsb,p.sim$team)])
abline(0,1)

That looks pretty good, except for the points about 2 below the 1-1 line.

Test 2: attack and defense strengths are correlated (the bold bit)
For this test, I am going to generate attack and defend strengths with the same mean and var-cov matrix as in the real data. The previous post HERE got estimates of attack and defend strength from the real data. I use those estimates (called p.b) here.

Replace the bolded bit in the Test 1 code with this
library(MASS) #for mvrnorm()
mu=apply(cbind(log(p.b$attack),-1*log(p.b$defense)),2,mean,na.rm=TRUE)
Sigma = cov(cbind(log(p.b$attack),-1*log(p.b$defense)),use="na.or.complete")
tmp=mvrnorm(nteams,mu=mu,Sigma=Sigma)
sim.attack = tmp[,1]
sim.defend = tmp[,2]

Now my attack and defend strengths have the same structure as the estimates from speedglm from the real data (the p.b$attack and p.b$defense).
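
To see what mvrnorm() is doing here, a small sketch with a made-up mean vector and var-cov matrix (stand-ins for the mu and Sigma computed from p.b, which I don't reproduce here). The sample moments of the draws should come out close to what was passed in.

```r
library(MASS) # mvrnorm() lives in MASS
set.seed(42)
# made-up values standing in for the mu and Sigma computed from p.b
mu = c(0.5, -0.2)
Sigma = matrix(c(1.0, 0.8,
                 0.8, 1.0), 2, 2)
tmp = mvrnorm(5000, mu=mu, Sigma=Sigma)
sim.attack = tmp[,1]
sim.defend = tmp[,2]
# sample moments, to compare against mu and Sigma
colMeans(tmp)
cov(tmp)
cor(sim.attack, sim.defend)
```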

Plot estimates against true
par(mfrow=c(1,2))
mse.sim = mean((log(p.sim$attack)-sim.attack[match(p.sim$team,teams)])^2)
plot(log(p.sim$attack),sim.attack[match(p.sim$team,teams)],
main=paste("glmnet\nmse =",mse.sim),xlab="estimated attack",ylab="true attack")
mse.spd = mean((log(p.spd$attack[p.spd$attack!=0])-sim.attack[match(p.spd$team[p.spd$attack!=0],teams)])^2,na.rm=TRUE)
plot(log(p.spd$attack),sim.attack[match(p.spd$team,teams)],
main=paste("speedglm\nmse =",mse.spd),xlab="estimated attack",ylab="true attack")

That looks ok.

Look at total strength which combines attack and defense strengths. print(fbRanks) is subtracting the mean total, so I'm doing my abline with the mean added back on.
par(mfrow=c(1,2))
sim.total = (sim.attack+sim.defend)/log(2)
plot(p.sim$total,sim.total[match(p.sim$team,teams)],
main="glmnet",xlab="estimated",ylab="true")
abline(mean(sim.total[match(p.sim$team,teams)]),1)
abline(mean(sim.total[match(p.sim$team,teams)])+sin(pi/4),1,col="red")
abline(mean(sim.total[match(p.sim$team,teams)])-1*sin(pi/4),1,col="red")
plot(p.spd$total,sim.total[match(p.spd$team,teams)],
main="speedglm",xlab="estimated",ylab="true")
abline(mean(sim.total[match(p.spd$team,teams)]),1)
abline(mean(sim.total[match(p.spd$team,teams)])+sin(pi/4),1,col="red")
abline(mean(sim.total[match(p.spd$team,teams)])-1*sin(pi/4),1,col="red")

I really want to be within those red lines, which are off the true total by +/- 1.0. A lot of my estimates are outside that. Let's look at what fraction fall outside the +/- 1.0 for different numbers of games played. I expect that estimates are better for teams that have played more games.

#compare total against games played
par(mfrow=c(1,2))
sim.total = (sim.attack+sim.defend)/log(2)
nrange=1:30
fracbig = c()
for(nlim in nrange){
fracbig = c(fracbig, sum(abs(p.sim$total[p.sim$n>nlim]-sim.total[match(p.sim$team[p.sim$n>nlim],teams)])>1,na.rm=TRUE)/sum(p.sim$n>nlim))
}
par(mfrow=c(1,1))
plot(nrange,fracbig,type="l",xlab="number of games played")
fracbig = c()
for(nlim in nrange){
fracbig = c(fracbig, sum(abs(p.spd$total[p.spd$n>nlim]-sim.total[match(p.spd$team[p.spd$n>nlim],teams)])>1,na.rm=TRUE)/sum(p.spd$n>nlim))
}
lines(nrange,fracbig,col="red")
title("fraction outside abs(est total - true total)>1\nglmnet black and speedglm red")
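
The two fracbig loops do the same calculation for glmnet and speedglm, so they could be wrapped in one function. A sketch, where frac.outside() is a hypothetical helper of mine (not part of fbRanks) and it assumes the estimated and true totals have already been lined up team by team, e.g. with match() as above.

```r
# hypothetical helper, not part of fbRanks
# est, truth and ngames must be aligned: element k is team k in all three
frac.outside = function(est, truth, ngames, nrange=1:30, tol=1){
  sapply(nrange, function(nlim){
    keep = ngames > nlim  # only teams with more than nlim games
    sum(abs(est[keep]-truth[keep]) > tol, na.rm=TRUE)/sum(keep)
  })
}

# toy check with made-up numbers: errors are 0.5, 2, 0.1 and 3
est = c(0.5, 2.0, 0.1, 3.0)
truth = c(0, 0, 0, 0)
ngames = c(1, 5, 10, 20)
frac.outside(est, truth, ngames, nrange=c(0,4,9))
# 2 of 4, then 2 of 3, then 1 of 2 are off by more than 1
```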

This suggests that glmnet is performing better than speedglm when I use a minimum number of games (so just don't show estimates for teams with few games).

But I still haven't replicated the problem seen in the real data. Test 3, make the age groups have different means.

Test 3: use estimates from speedglm as attack and defense strength. Now attack and defense are correlated, but also the different age groups have different mean strengths (older teams tend to be stronger). For this I am just going to use the estimates from cluster.1 in fbRanks.b from the previous Dixon & Coles post. It so happens that rank_data has 9 unique clusters.

Replace the attack and defend strength generating code with this
tmp=p.b[!is.na(p.b$attack) & !is.na(p.b$defense) & !(p.b$defense==0) & !(p.b$attack==0),]
teams = as.character(tmp$team)
nteams = length(teams)
sim.attack = log(tmp$attack)
sim.defend = log(tmp$defense)
#get rid of data from teams that are not in cluster.1
sim.scores = sim_data$scores[sim_data$scores$home.team %in% teams & sim_data$scores$away.team %in% teams,]

FINALLY, this replicates the problem of multiple parallel lines.

And it is glmnet that is having problems:

And the consequences for the total strength estimate are very bad.

Next I'll work on trying to tweak glmnet's settings and see if I can get around this problem.

Full R code for running these simulations and tests
depends on fbRanks 2.0
library(fbRanks)
library(MASS) #for mvrnorm() in test 2
#load in rank_data from RData file and run code to produce p.b
#takes 10 min or so
#fbRanks.b=rank.teams(scores=rank_data$scores, teams=rank_data$teams, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="speedglm")
#p.b=print(fbRanks.b,silent=TRUE)$ranks[[1]] #cluster.1

#pick one test; the last uncommented assignment wins
#testt = 1 #Test with i.i.d. strengths
#testt = 2 #Test with correlated strengths
testt = 3 #use the speedglm estimates from the real data
sim_data=rank_data
sim_data$scores = rank_data$scores[!(is.na(rank_data$scores$home.score) & is.na(rank_data$scores$away.score)),]
teams = unique(c(as.character(sim_data$scores$home.team),as.character(sim_data$scores$away.team)))
nteams = length(teams)
sim.scores = sim_data$scores
if(testt == 1){
sim.attack = rnorm(nteams)
sim.defend = rnorm(nteams)-.4
}
if(testt==2){
mu=apply(cbind(log(p.b$attack),-1*log(p.b$defense)),2,mean,na.rm=TRUE)
Sigma = cov(cbind(log(p.b$attack),-1*log(p.b$defense)),use="na.or.complete")
tmp=mvrnorm(nteams,mu=mu,Sigma=Sigma)
sim.attack = tmp[,1]
sim.defend = tmp[,2]
}
if(testt==3){
tmp=p.b[!is.na(p.b$attack) & !is.na(p.b$defense) & !(p.b$defense==0) & !(p.b$attack==0),]
teams = as.character(tmp$team)
nteams = length(teams)
sim.attack = log(tmp$attack)
sim.defend = log(tmp$defense)
sim.scores = sim_data$scores[sim_data$scores$home.team %in% teams & sim_data$scores$away.team %in% teams,]
}
ngames = dim(sim.scores)[1]
for(i in 1:ngames){
sim.scores$home.score[i] = rpois(1,exp(sim.attack[which(teams==sim.scores$home.team[i])] - sim.defend[which(teams==sim.scores$away.team[i])]))
sim.scores$away.score[i] = rpois(1,exp(sim.attack[which(teams==sim.scores$away.team[i])] - sim.defend[which(teams==sim.scores$home.team[i])]))
}
sim_data$scores=sim.scores
age=c("B01","B00","B99","B98","B97","B96")
fbRanks.sim=rank.teams(scores=sim_data$scores, teams=sim_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="glmnet")
fbRanks.sim.spdglm=rank.teams(scores=sim_data$scores, teams=sim_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="speedglm")
p.sim=print(fbRanks.sim,silent=TRUE)$ranks
p.spd=print(fbRanks.sim.spdglm,silent=TRUE)$ranks
if(testt!=3){ #use cluster 1; there are many
p.sim=p.sim[[1]]
p.spd=p.spd[[1]]
}

#compare estimates against true
par(mfrow=c(1,2))
mse.sim = mean((log(p.sim$attack[p.sim$attack!=0])-sim.attack[match(p.sim$team[p.sim$attack!=0],teams)])^2,na.rm=TRUE)
plot(log(p.sim$attack),sim.attack[match(p.sim$team,teams)],
     main=paste("glmnet\nmse =",mse.sim),xlab="estimated attack",ylab="true attack")
abline(0,1)
mse.spd = mean((log(p.spd$attack[p.spd$attack!=0])-sim.attack[match(p.spd$team[p.spd$attack!=0],teams)])^2,na.rm=TRUE)
plot(log(p.spd$attack),sim.attack[match(p.spd$team,teams)],
     main=paste("speedglm\nmse =",mse.spd),xlab="estimated attack",ylab="true attack")
abline(0,1)

#compare total
par(mfrow=c(1,2))
sim.total = (sim.attack+sim.defend)/log(2)
plot(p.sim$total,sim.total[match(p.sim$team,teams)],
     main="glmnet",xlab="estimated",ylab="true")
abline(mean(sim.total[match(p.sim$team,teams)]),1)
abline(mean(sim.total[match(p.sim$team,teams)])+sin(pi/4),1,col="red")
abline(mean(sim.total[match(p.sim$team,teams)])-1*sin(pi/4),1,col="red")
plot(p.spd$total,sim.total[match(p.spd$team,teams)],
     main="speedglm",xlab="estimated",ylab="true")
abline(mean(sim.total[match(p.spd$team,teams)]),1)
abline(mean(sim.total[match(p.spd$team,teams)])+sin(pi/4),1,col="red")
abline(mean(sim.total[match(p.spd$team,teams)])-1*sin(pi/4),1,col="red")

#compare total against games played
sim.total = (sim.attack+sim.defend)/log(2)
nrange=1:30
fracbig = c()
for(nlim in nrange){
fracbig = c(fracbig, sum(abs(p.sim$total[p.sim$n>nlim]-sim.total[match(p.sim$team[p.sim$n>nlim],teams)])>1,na.rm=TRUE)/sum(p.sim$n>nlim))
}
par(mfrow=c(1,1))
plot(nrange,fracbig,type="l",xlab="number of games played",ylim=c(0,.6))
fracbig = c()
for(nlim in nrange){
fracbig = c(fracbig, sum(abs(p.spd$total[p.spd$n>nlim]-sim.total[match(p.spd$team[p.spd$n>nlim],teams)])>1,na.rm=TRUE)/sum(p.spd$n>nlim))
}
lines(nrange,fracbig,col="red")
title("fraction outside abs(est total - true total)>1\nglmnet black and speedglm red")

#compare to each other
par(mfrow=c(1,2))
teamsb=union(as.character(p.sim$team),as.character(p.spd$team))
plot(log(p.spd$attack[match(teamsb,p.spd$team)]),log(p.sim$attack[match(teamsb,p.sim$team)]),
     xlab="speedglm estimate",ylab="glmnet estimate",main="attack estimates")
abline(0,1)
plot(log(p.spd$defense[match(teamsb,p.spd$team)]),log(p.sim$defense[match(teamsb,p.sim$team)]),
     xlab="speedglm estimate",ylab="glmnet estimate",main="defend estimates")
abline(0,1)

#plot 1 make sure the sim and real data look kind of similar
par(mfrow=c(1,2))
hist(sim.scores$home.score,breaks=0:500,xlim=c(0,20),ylim=c(0,10000),main="simulated",xlab="home score")
hist(rank_data$scores$home.score,breaks=0:500,xlim=c(0,20),ylim=c(0,10000),main="real",xlab="home score")

Dixon and Coles 2-player model fit with glmnet: Test 1 with real data

2013-10-10T09:32:00.000-07:00

Follow up on Dixon and Coles 2-player model in glm, speedglm and glmnet

Initial tests with simulated data with random mixing---meaning the graph of interactions across players has no 'clusters'---were promising and suggested that glmnet is both much faster and more robust. However, real social networks (and 2-player systems can be thought of as a type of social network) are highly non-random. The norm is a network with clusters in which players interact strongly and where there is less (and potentially very little) interaction across clusters. The result is going to be a likelihood surface with strong ridges. I don't fully understand the algorithm used by glmnet, but if it is using any kind of ascent algorithm, it might get stuck on these ridges.

It looks like this might be happening. Here is a plot of speedglm versus glmnet for some real soccer match data spanning 6 age groups. Age groups are clusters and within age groups there are further clusters (states and leagues). Update: I was using glmnet default lambda which does not quite replicate glm behavior. I was using the coefficients at min(fit$lambda) which was almost a saturated model but still not quite the same. See update below where I pass in lambda=0 to force glmnet to return glm-equivalent estimates. The speedglm and glmnet estimates are now identical. See updated plot below.

The estimates should be parallel to the 1-1 line (but not necessarily on it). Notice the data seem to fall on multiple 1-1 lines. This suggests that individual clusters (age groups) are ok, but glmnet is stopping before getting to a solution that gets all of those onto the same 1-1 line. However, to understand what is going wrong, I need to work with simulated data where I know the "truth".

R code to produce plot above.
rank_data is a data.frame of the 2013 match data for WA and OR youth boys select soccer teams. About 2500 teams. The data includes ages B02, B95 and B94 (B=boys, 02=birth year), but I left those off to speed up speedglm. The problem shown in the plot above is much, much worse with those ages added.

#using fbRanks R package 2.0
library(fbRanks)
#this was using default lambda for glmnet and getting coefficients using coef(fit, s=min(fit$lambda))
#where fit is what is returned from the fit=glmnet() call
age=c("B01","B00","B99","B98","B97","B96")
fbRanks.a=rank.teams(scores=rank_data$scores, teams=rank_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="glmnet")
fbRanks.b=rank.teams(scores=rank_data$scores, teams=rank_data$teams, age=age, min.date=min.date, max.date=max.date, silent=TRUE, time.weight.eta=best.eta, date.format="%m/%d/%Y", fun="speedglm")
p.a=print(fbRanks.a,silent=TRUE)$ranks[[1]]
p.b=print(fbRanks.b,silent=TRUE)$ranks[[1]]
teams=union(as.character(p.a$team),as.character(p.b$team))
plot(p.a$total[match(teams,p.a$team)],p.b$total[match(teams,p.b$team)],ylab="speedglm",xlab="glmnet")
abline(0,1)

Update
Passing in lambda=0 to the glmnet call fixes the problem.
Same code as above but with glmnet(...., lambda=0) call in the rank.teams() function.
The estimates are not identical but my analysis with simulated data suggests that glmnet estimates are more robust.

Dixon and Coles' 2-player model in glm, speedglm and glmnet

2013-10-08T15:50:00.002-07:00

Back to Dixon and Coles model applied to a huge team pool. This is part of a series of posts comparing glm, speedglm and glmnet and is related to stuff I have been playing with regarding massive 2-player estimation problems.

Dixon, M. J. and Coles, S. G. (1997), Modelling Association Football Scores and Inefficiencies in the Football Betting Market. Journal of the Royal Statistical Society: Series C (Applied Statistics), 46: 265–280. doi: 10.1111/1467-9876.00065

The last post used a model that is pretty similar to Dixon and Coles model. I'll tweak it a bit and do the same series of tests. Some oddities about how I set up the model, which are irrelevant for the purpose of these speed tests, but I note them anyhow.

In the simulated data, attack and defend strengths of a team are uncorrelated. This is not true of real data, where they are highly correlated. This is btw why I can't use glmer on real data, i.e. can't just treat the strengths as random effects. I tried and failed to figure out how to treat different random effects as correlated with glmer in the lme4 R package. Easy enough in a bayesian-glmer, but that is slow....
The model I passed to glm is unidentifiable. One of the coefficients needs to be set to 0. But glm and speedglm are smart enough to figure out what to do, though they complain a bit. Easy enough to fix. Just specify the glm model as having 2 factors (attack and defend) and treat the subjects as levels. Easy but not relevant for this test. Btw I normally set this model up the correct way for glm. I did it the other way here since my code from previous posts was closer to that.
For glmnet, I don't know how to specify lambda for the model with 1 coefficient equal to 0. I'm sure it is possible to compute that, but I don't know how. So I used the default lambda. This will lead to some variable number of 0s. Seems to work ok; in fact, it works better than glm, no doubt because the extra constraints help.

The model
Two subjects in each "competition". One is attacking, other is defending. Each subject has an attack and defend strength. The outcome of the competition is a poisson distributed random variable with mean = exp(attack strength of attacker + defend strength of defender). If a subject has a low (very negative) defend strength, then only strong attackers can score against them. If a subject has a high attack strength, they score against all but the strongest defenders.
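
As a tiny worked example of the model, here is one competition between two subjects. The strengths are made-up numbers and the names s1 and s2 are just labels for this sketch.

```r
set.seed(1)
# made-up strengths for two subjects
attack = c(s1=0.3, s2=-0.5)
defend = c(s1=-1.0, s2=0.2)

# s1 attacks, s2 defends: mean = exp(attack of attacker + defend of defender)
mean.score = exp(attack["s1"] + defend["s2"])  # exp(0.3 + 0.2)
y = rpois(1, mean.score)                       # the observed outcome

# the same attacker against a stronger (more negative) defense scores less on average
mean.vs.strong = exp(attack["s1"] + defend["s1"])  # exp(0.3 - 1.0)
c(mean.score, mean.vs.strong)
```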

The simulation
I draw attack and defend strengths randomly from normal distributions. I sample 2 subjects from the pool and assign them randomly to attack or defend. I compute the 'result' as a random number generated from the appropriate poisson distribution. Repeat for 10*nSubjects number of competitions. So the size of my dataset increases with the number of subjects, which I do to make the data a bit more realistic. This gives me about 10 competitions per subject. So on average 5 competitions to get their attack strength and 5 to get their defend strength.

Results
glm is desperately slow. 25 minutes for 1500 subjects! And that's a small test case. Speedglm is considerably speedier at 7.5 minutes. But glmnet is HALF A SECOND for this problem and more robust.

glm is not particularly robust for this problem. Top plot shows glmnet versus true. Shows about what I expect. Looks pretty good to me. It widens out for low attack or defend strengths because those are cases where you get 0s from the poisson. The way I set it up (adding the attack and defend strengths as opposed to subtracting defend strength from attack strength), negative defend strength equals strong defense and negative attack equals weak attack.

Next plot shows glm versus true. Bah, look at all those 1e-15 values. Ok, I made it hard for glm with sometimes only 1-2 competitions with which to estimate a strength, but glmnet did a lot better by setting some coef to 0s and thus constraining the problem a bit.

When the values were not 1e-15, they matched glmnet's values:

R code
library(glmnet)
library(speedglm)

#Treat subject as a explanatory variable (x). There will be nSubject x variables
#This allows us to take into account that subject 1 in the player1, player2, or player3 is the same subject
nPlayers=2 #how many 1s in each row of data
res=obj=c()
for(nSubjects in c(100,200,500,1000,1500)){
n=10*nSubjects
    #variance of the distribution of the player pool x's
    mean.x = 0
    sig2.x = 1
    #true.x is what we are trying to estimate
    #rnorm takes a sd, not a variance
    true.attack = rnorm(nSubjects, mean.x, sqrt(sig2.x))
    true.defend = rnorm(nSubjects, mean.x, sqrt(sig2.x))

    #There are 2 ways to set this up. Treat subject as level in factors attack and defend
    #or create 2 (attack+defend) x n.x explanatory variables that are 0/1
    #they are mathematically equivalent to glmnet
levx=y=c() #levx is holder for player #; y is data
x = matrix(0,n,2*nSubjects) #x is the n x 2nSubjects explanatory variable matrix needed by glm
for(i in 1:n){
    #draw nPlayers randomly for each of the n competitions
    levx=rbind(levx,sample(nSubjects, 2, replace=FALSE))
    x[i,c(levx[i,]+c(0,nSubjects))]=1 #set x var for attacksubject = 1 if subject is present; same for defendsubject
    y=c(y,rpois(1,exp(true.attack[levx[i,1]]+true.defend[levx[i,2]])))
}
#set the colnames on the explanatory variables
colnames(x)=c(paste("attack",1:nSubjects,sep=""),
        paste("defend",1:nSubjects,sep=""))
dat = cbind(y=y,x)
dat = data.frame(dat) #glm wants a data frame

cat(nSubjects); cat(" ")

    #set up the formula
    fooform = "y~-1"
    fooform=paste(fooform,paste("+attack",1:nSubjects,collapse="",sep=""),
             paste("+defend",1:nSubjects,collapse="",sep=""),sep="")

    #fit glm
    a=c(nSubjects, system.time(fit1<-glm(formula(fooform), data=dat, family="poisson"))[1])
    b=c(nSubjects, object.size(fit1))

    #fit speedglm; family must be set since speedglm defaults to gaussian
    a=c(a,system.time(fit2<-speedglm(formula(fooform), data=dat, family=poisson()))[1])
    b=c(b, object.size(fit2))

    #fit glmnet
    #1 row for each player; col is just the player number
    #need to t() levx so as.vector works by row
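    #add that 0,nSubjects bit because I have 1:nSubjects for attack and another 1:nSubjects for defend
    #dims makes sure sx has 2*nSubjects columns even if the last subject never defends
    sx=sparseMatrix(rep(1:n,each=nPlayers),as.vector(t(levx+matrix(c(0,nSubjects),n,2,byrow=TRUE))),x=1,dims=c(n,2*nSubjects))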
    a=c(a, system.time(fit3<-glmnet(sx, y, intercept=FALSE, family="poisson"))[1])
    b=c(b, object.size(fit3))

    res=rbind(res,a)
    obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of subjects")
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)
title("Time to compute on my laptop")

#plot 2
plot(res[,1],res[,2]/res[,4],xlab="number of subjects",ylab="glm (or speedglm) speed/glmnet speed", type="l")
lines(res[,1],res[,3]/res[,4], col="red")
title("relative speed of glm (black)\n and speedglm (red) to glmnet")

#plot 3
par(mfrow=c(1,2))
coef.glm = coef(fit1) #since I didn't est an intercept
coef.speedglm = coef(fit2)
coef.glmnet.attack = coef(fit3,s=min(fit3$lambda))[2:(nSubjects+1)]
coef.glmnet.defend = coef(fit3,s=min(fit3$lambda))[(nSubjects+2):(2*nSubjects+1)]
#true to glmnet
plot(true.attack,coef.glmnet.attack,ylab="estimated beta",main="glmnet")
plot(true.defend,coef.glmnet.defend,ylab="estimated beta",main="glmnet")
#glm to glmnet
plot(coef.glm[1:nSubjects],coef.glmnet.attack,ylab="from glmnet",xlab="from glm",main="attack")
plot(coef.glm[(nSubjects+1):(2*nSubjects)],coef.glmnet.defend,ylab="from glmnet",xlab="from glm",main="defend")
#glm to true
plot(true.attack,coef.glm[1:nSubjects],ylab="estimated beta",main="glm")
plot(true.defend,coef.glm[(nSubjects+1):(2*nSubjects)],ylab="estimated beta",main="glm")

Multiplayer problem revisited with glm vs speedglm vs glmnet

2013-10-08T12:19:00.003-07:00

Here I discuss a specific problem: using glm to estimate player "strengths" in a multi-player problem where the player pool is huge. This is part of a series of posts I did comparing glm, speedglm and glmnet. And this is related to stuff I have been playing with regarding massive 2-player estimation problems.

Scenario
Imagine we have a series of competitions where each competition consists of nPlayers randomly chosen from a pool of nSubjects. Each subject has a 'strength' and the outcome of the competition is some function of the additive strengths. Some examples might be:

players are pulling on a rope and we measure force (on um a scale of -Inf to Inf....). y ~ normal(sum(strengths))
players are playing a game where they score 0-10 (or so). y ~ poisson(exp(sum(strengths)))
players are playing a win/lose game. y ~ binomial(inverse logit(sum(strengths)))
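
The three examples can be sketched in a few lines of R for a single competition. The strengths are made-up numbers; for the win/lose case I use plogis(), which is the inverse logit and so turns the summed strength into a probability.

```r
set.seed(1)
strengths = c(0.4, -0.1, 0.3)  # made-up strengths of the 3 players in one competition
eta = sum(strengths)           # additive strength

y.normal = rnorm(1, mean=eta)                  # rope pull: any real number
y.poisson = rpois(1, lambda=exp(eta))          # score: 0, 1, 2, ...
y.binom = rbinom(1, size=1, prob=plogis(eta))  # win/lose: 0 or 1
c(normal=y.normal, poisson=y.poisson, binomial=y.binom)
```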

Here I just use a normal to be simple.

Set up in glm framework
We don't want to treat the subjects as levels and player1, player2, etc as a factor since subject i could be player1, player2, etc in any one competition but they are still the same subject. Instead we treat subject as a 0/1 explanatory variable. 1 = subject was in competition. 0 = they were not. Our data consists of nSubjects explanatory variables with nPlayers 1s in each row. So...most of our explanatory variable data is all zeros and we are going to have a huge number of explanatory variables. We can expect that glmnet will excel here.
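
Here is a toy version of that set-up, small enough to print. The four competitions are made up. It builds the dense 0/1 matrix that glm wants and the sparse matrix that glmnet can take, and checks they are the same matrix. I pass dims to sparseMatrix() so the number of columns does not depend on which subjects happen to show up.

```r
library(Matrix) # sparseMatrix(); the Matrix package ships with R
nSubjects = 6
nPlayers = 3
# made-up competitions: each row lists the subjects who took part
levx = rbind(c(1,2,3), c(2,5,6), c(1,4,6), c(3,4,5))
n = nrow(levx)

# dense 0/1 matrix, one column per subject
x = matrix(0, n, nSubjects)
for(i in 1:n) x[i, levx[i,]] = 1

# sparse version: each row index repeats nPlayers times, column index is the subject number
sx = sparseMatrix(i=rep(1:n, each=nPlayers), j=as.vector(t(levx)), x=1, dims=c(n, nSubjects))

all(as.matrix(sx) == x)  # same matrix, stored differently
rowSums(x)               # nPlayers 1s in each row
```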

The R code below shows how to set up the simulated data and then set the model up for glm, speedglm and glmnet. It's pretty similar to the R code from my previous posts. But here I show some plots of estimated betas (subject strengths) versus true values, which requires getting the coefficients out of glmnet. Read this post on what glmnet does, how to get coefficients out of glmnet and why I pass lambda=1e-6 into my glmnet call.

Results of speed test (R code below)

Now that I pass in lambda=1e-6, glmnet didn't get slower as nSubjects increased. Wow. It was basically instantaneous for these tests while glm took about a minute.

Relative speed correspondingly skyrockets for glm and speedglm versus glmnet for this problem.

Estimates however are basically identical. First plot shows glmnet versus true and the next two show glmnet versus glm and speedglm versus glm estimates. Yes, they are on the 1-1 line.

R code
library(glmnet)
library(speedglm)

#3 players
#Treat subject as a explanatory variable (x). There will be nSubject x variables
#This allows us to take into account that subject 1 in the player1, player2, or player3 is the same subject
nPlayers = 3
n = 10000

res=obj=c()
for(nSubjects in c(100,200,500,1000,1500)){
    #subject strength is drawn from a normal
    beta=rnorm(nSubjects,0,1)
    levx=y=c() #levx is holder for player #; y is data
    x = matrix(0,n,nSubjects) #x is the n x nSubjects explanatory variable matrix needed by glm
    for(i in 1:n){
      #draw nPlayers randomly for each of the n competitions
      levx=rbind(levx,sample(nSubjects, nPlayers, replace=FALSE))
      x[i,levx[i,]]=1 #set x var for subject = 1 if subject is present in this competition
      #outcome of competition is normal (could be binomial-win/loss, or poisson, or whatever)
      y=c(y,rnorm(1,sum(beta[levx[i,]])))
    }
    #set the colnames on the explanatory variables
    colnames(x)=paste("X",1:nSubjects,sep="")
    dat = cbind(y=y,x)
    dat = data.frame(dat) #glm wants a data frame

    cat(nSubjects); cat(" ")

    #set up the formula
    fooform = "y~-1"
    for(i in 1:nSubjects) fooform=paste(fooform,"+X",i,sep="")

    #fit glm
    a=c(nSubjects, system.time(fit1<-glm(formula(fooform), data=dat))[1])
    b=c(nSubjects, object.size(fit1))

    #fit speedglm
    a=c(a,system.time(fit2<-speedglm(formula(fooform), data=dat))[1])
    b=c(b, object.size(fit2))

    #fit glmnet
    #1 row for each player; col is just the player number; need to t() levx so as.vector works by row
    sx=sparseMatrix(rep(1:n,each=nPlayers),as.vector(t(levx)),x=1)
    a=c(a, system.time(fit3<-glmnet(sx, y, intercept=FALSE, lambda=1e-6))[1])
    b=c(b, object.size(fit3))

    res=rbind(res,a)
    obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of subjects")
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)
title("Time to compute on my laptop")

#plot 2
plot(res[,1],res[,2]/res[,4],xlab="number of subjects",ylab="glm (or speedglm) speed/glmnet speed", type="l")
lines(res[,1],res[,3]/res[,4], col="red")
title("relative speed of glm (black)\n and speedglm (red) to glmnet")

#plot 3
par(mfrow=c(3,1))
coef.glm = coef(fit1) #since I didn't est an intercept
coef.speedglm = coef(fit2)
coef.glmnet = coef(fit3,s=min(fit3$lambda))[2:(nSubjects+1)]
plot(beta,coef.glmnet,ylab="estimated beta",main="glmnet estimates vs true")
plot(coef.glm,coef.glmnet,ylab="from glmnet",xlab="estimated beta from glm")
plot(coef.glm,coef.speedglm,ylab="from speedglm",xlab="estimated beta from glm")

Getting coefficients out of glmnet

2013-10-08T11:29:00.000-07:00

Surprisingly, figuring out how to get the coefficients out of a glmnet fit took me about 2 hours of reading posts on stackexchange and R forums. I would have given up except I saw a blog where someone said they used glmnet to do glms, so I knew it was possible. Turns out it is really easy but you need to know what coef(glmnet.fit) is outputting. The problem was that I was avoiding reading the paper accompanying glmnet and couldn't really understand the output until I bucked up and read the paper. This is part of a series of posts I did comparing glm, speedglm and glmnet.

This paper describes the glmnet package and its algorithms
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL http://www.jstatsoft.org/v33/i01/.

glmnet is used to find a reduced regression model that leads to minimized mean squared error---so you have many, many explanatory variables but most of these don't increase the predictive value of the model and you want to find the optimal reduced model. The parameter 'lambda' is the penalty weight and controls the model size: the larger lambda is, the more coefficients are set to 0. When you type coef(fit) after a glmnet fit, you get all the fits for the lambdas used. The first columns are for large lambda and so small models, so most of the coef are 0. As you increase column number, lambda decreases and the models get bigger and bigger. The idea, in a normal glmnet use, is to find the size of model that minimizes mean squared error. Here's one of the plots from their paper showing mean square error of predictions (cross-validation) versus size of model fit (with glmnet):

In this example they had 100 explanatory variables. The size of the model is at the top. The corresponding lambda is at the bottom. In normal use, you do some cross-validation (glmnet has functions for that) and use that to make a plot like above and select the lambda (size of model) that minimizes the mean squared error for your problem.
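
For reference, a minimal sketch of that normal use with cv.glmnet(). The data are made up (100 explanatory variables of which only the first 5 matter), so the numbers that come out are only illustrative.

```r
library(glmnet)
set.seed(1)
# made-up data: 100 explanatory variables, only the first 5 have an effect
n = 200
p = 100
x = matrix(rnorm(n*p), n, p)
y = as.vector(x[,1:5] %*% c(2,-2,1,-1,0.5)) + rnorm(n)

cvfit = cv.glmnet(x, y)          # 10-fold cross-validation over the lambda path
plot(cvfit)                      # cross-validated error along the path, model size on top
cvfit$lambda.min                 # lambda with the smallest cross-validated error
b = coef(cvfit, s="lambda.min")  # (1+p) x 1 sparse matrix, intercept in row 1
sum(b[-1,1] != 0)                # size of the selected model
```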

The output for coef(fit), where fit is a glmnet fit, is a 1+#coefs X #lambdas matrix. You can call coef() with the argument s to specify the lambda level you want. Why "s" and not "lambda"?? Anyhow "s" is "lambda" in the coef call. So let's say I wanted the coefficients at log(lambda) = -2, about the minimum in the figure above. Since s is on the lambda scale, not the log scale, that is s=exp(-2) (log(-2) is NaN in R). I would use the following:

fit3<-glmnet(sx, y)
coef.glmnet = coef(fit3, s=exp(-2))[2:(nSubjects+1)]

However, I am not using glmnet that way. I'm not trying to find a reduced model. I want to fit a saturated model---meaning I want to estimate all the coefficients. I want an estimate of strength for every subject in my model. Though I suppose 0 is an estimate, I don't want that. So I want to force glmnet to fit the saturated model (careful, sometimes you do need to fix a coef to 0 to have a solvable model; e.g. models with factors with multiple levels).

I had trouble figuring out how to force glmnet to do this. [update: turns out that setting lambda=0 makes glmnet return the "glm" parameters. See below.] The argument lambda.min.ratio should do the trick but seemed to have no effect. However, passing in the argument lambda to set your own lambda values seems to work. Friedman et al (2010) says to not pass in just one lambda value because the algorithm works better with a "warm start". Hmm, I'm not sure what a "warm start" is but my guess is that it is a saturated model. In other words, that the algorithm works better if you start with the full model and work down. So...I'm just going to start with the full model. Need to make sure that the full model is identifiable!

I should be able to compute the lambda for the saturated model and send that to glmnet, but I couldn't figure out how to compute that. So after some futzing, it seemed like passing in lambda=1e-6 forced glmnet to fit the saturated model for my toy problems. So my call to glmnet and corresponding coef() call looks like so:

fit3<-glmnet(sx, y, lambda=1e-6)
coef.glmnet = coef(fit3)

I don't need to pass s into coef(fit) since I only have 1 column because I passed in one value of lambda.

Update, a week later
Fitting the saturated or near-saturated models worked for most of my test cases---until I tried to use it on real soccer match data which has a clustered structure. See this post. In the process of trying to come up with a work-around for the problems explored more fully here, I came across a forum post where someone wrote that you can pass in lambda=0, to get glmnet to duplicate the behavior of glm. I had already tried passing in lambda really small to get it to fit the saturated model. The estimates were almost exactly the same but not quite and I ran into the problem that glmnet complained when I gave it a saturated model that is non-identifiable. glm deals with this by setting one of the factor coefficients to 0 to make the model identifiable, but I couldn't figure out how to set that constraint for glmnet. But turns out passing in lambda=0 works just fine for those non-identifiable models.
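
A quick way to check the lambda=0 claim on a toy problem is to fit the same model with glm and with glmnet and put the coefficients side by side. This is only a sketch on made-up, well-behaved poisson data (identifiable, no clusters), so it says nothing about the clustered case. The two columns should be close but need not agree to the last digit, since glmnet stops at its own convergence threshold.

```r
library(glmnet)
set.seed(1)
n = 500
p = 5
x = matrix(rnorm(n*p), n, p)
beta = c(0.5, -0.3, 0.2, 0, 0.4)  # made-up true coefficients
y = rpois(n, exp(as.vector(x %*% beta)))

fit.glm = glm(y ~ x - 1, family="poisson")
fit.net = glmnet(x, y, family="poisson", intercept=FALSE, lambda=0)

# coef() on a glmnet fit always has an intercept row first; drop it
est = cbind(glm=coef(fit.glm), glmnet=as.vector(coef(fit.net))[-1])
est
```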

glm, speedglm, glmnet comparison (Part 4: models with a few categorical variables but many levels)

2013-10-07T15:05:00.002-07:00

Part 3 talked about models with many categorical variables but fairly low numbers of categories within those variables. Part 2 showed that the size of the model matrix relative to its sparse matrix representation scales with levels/2, so we might expect glm to get even worse relative to glmnet when the model has categorical variables with many (1000s or 10000s of) levels. This is part of a series of posts I did comparing glm, speedglm and glmnet.

In this test, I use again n=10,000. I use 2 categorical explanatory variables and allow the number of levels to go up.

Speed comparison

Here's the relative comparison. glm is getting slower and slower relative to glmnet as the number of categories goes up. For 1000 categories (and 2 factors), it was 750 times slower.

The bottom line is for speedglm. It's not red as the title suggests. The little jiggle for glm at 200 is because glmnet was so fast that I didn't get a good speed estimate (likely affected by the surfing I was doing while running the test).

RAM notes
At about 2500 categories, I use up the 8G of RAM on my laptop with glm. There is no discernible jump in RAM use at 2500 categories for glmnet. I don't have a good way to measure RAM use during the function calls. I can watch it using the Windows performance monitor while running code, but I don't know how to get the max memory used with R code. Rprofmem() didn't seem to get me what I wanted nor does gc().
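One thing that might get closer: gc() reports a "max used" column, gc(reset=TRUE) resets it, and a later gc() call shows the maximum since the reset. A minimal sketch, assuming that is the number wanted. peak_mb() is a helper name made up here, and it only sees memory that R itself manages, so it will not match the Windows performance monitor exactly.

```r
# peak_mb() is a hypothetical helper, not part of any package
peak_mb <- function(expr) {
  gc(reset = TRUE)                 # reset the "max used" counters
  force(expr)                      # evaluate the (lazily passed) expression
  g <- gc()                        # read the counters back
  i <- which(colnames(g) == "max used")
  sum(g[, i + 1])                  # the column after "max used" is its size in Mb
}

# example: 1e7 doubles is roughly 76 Mb on top of whatever R already holds
peak_mb(x <- numeric(1e7))
```

Wrapping a glm() or glmnet() call the same way, e.g. peak_mb(fit <- glm(...)), should give a rough peak for that call.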

R code

library(glmnet)
library(speedglm)

n = 10000
facs = 2

res=obj=c()
for(levs in c(50,100,200,300,500,1000)){
beta=matrix(rnorm(levs*facs,0,1),levs,facs)

levx = c()
for(i in 1:facs) levx=cbind(levx,sample(levs,n,replace=TRUE))

#column i of beta starts at offset levs*(i-1), so step by levs (not a hard-coded 10)
y <- apply(levx, 1, function(x){ sum(beta[x+seq(0,levs*(facs-1),by=levs)]) }) + rnorm(n)
x = data.frame(levx)
for(i in 1:facs) x[,i]=factor(x[,i],levels=1:levs)
dat = cbind(y=y,x)

cat(levs);cat("\n")

#set up the formula
fooform = "y~-1"
for(i in 1:facs) fooform=paste(fooform,"+X",i,sep="")

#fit glm
a=c(levs, system.time(fit<-glm(formula(fooform), data=dat))[1])
b=c(levs, object.size(fit))

#fit speedglm
a=c(a,system.time(fit<-speedglm(as.formula(fooform), data=dat))[1])
b=c(b, object.size(fit))

#fit glmnet
sx=sparseMatrix(rep(1:n,facs),as.vector(levx+matrix(seq(0,levs*(facs-1),by=levs),n,facs,byrow=TRUE)),x=1)
a=c(a, system.time(fit<-glmnet(sx, y))[1])
b=c(b, object.size(fit))

res=rbind(res,a)
obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of categories")
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)
title(paste(facs,"explanatory variables"))

#plot 2
plot(res[,1],res[,2]/res[,4],xlab="number of categories",ylab="glm (or speedglm) speed/glmnet speed", type="l")
lines(res[,1],res[,3]/res[,4], col="red")
title("relative speed of glm (black) and speedglm (red) to glmnet")

#plot 3
#object.size() is in bytes; the horizontal lines mark roughly 8G and 2G
plot(obj[,1],log(obj[,2]),type="l", ylab="log(object size in bytes)", xlab="number of categories",ylim=c(10,23))
lines(obj[,1],log(obj[,3]),col="red",lty=2)
lines(obj[,1],log(obj[,4]),col="blue",lty=3)
abline(h=log(8042*1e6))
abline(h=log(2000*1e6))
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

glm, speedglm, glmnet comparison (Part 3: models with many categorical explanatory variables)

2013-10-07T14:19:00.001-07:00

In Part 1 of the glm, speedglm, glmnet comparison, I looked at models with continuous explanatory variables. In Part 3, I look at speeds for models with lots of categorical explanatory variables. This will use the sparse matrix representation of a model matrix for models with categorical explanatory variables. Part 2 talked about that. This is part of a series of posts I did comparing glm, speedglm and glmnet.

Here I have a model that looks like this

y ~ factor1 + factor2 + factor3 + ... + factor-k

where k is big or # of levels gets big. This is the case where the model matrix gets huge, and we might expect glm to really bog down.

First test. # of categorical explanatory variables gets big. I set n to 10,000 and number of levels per explanatory variable at 10. So yes glm is getting really slow relative to glmnet. In fact, I didn't do more than 300 variables for glm since it was slowing down so much. speedglm also shows trouble as the number of explanatory variables gets big. R code to generate this is below.

But look at glm with categorical explanatory variables versus continuous explanatory variables. Yipes. Using glm with large numbers (1000s) of categorical explanatory variables is not going to work. We are slowing down quickly and using up RAM.

Here is the object size

R code

library(glmnet)
library(speedglm)

n = 10000
levs = 10

res=obj=c()
for(facs in c(50,100,200,300,500,1000)){
  beta=matrix(rnorm(levs*facs,0,1),levs,facs)
  
  levx = c()
  for(i in 1:facs) levx=cbind(levx,sample(levs,n,replace=TRUE))
  
  #step through beta by levs so this still works if levs is changed from 10
  y <- apply(levx, 1, function(x){ sum(beta[x+seq(0,levs*(facs-1),by=levs)]) }) + rnorm(n)
  x = data.frame(levx)
  for(i in 1:facs) x[,i]=factor(x[,i],levels=1:levs)
  dat = cbind(y=y,x)

  cat(facs);cat("\n")
  
  #set up the formula
  fooform = "y~-1"
  for(i in 1:facs) fooform=paste(fooform,"+X",i,sep="")

  #fit glm
  #don't do glm if facs>=500; too slow
  if(facs<500){
  a=c(facs, system.time(fit<-glm(formula(fooform), data=dat))[1])
  b=c(facs, object.size(fit))
  }else{
    a=c(facs, NA)
    b=c(facs, NA)
  }
  
  #fit speedglm
  a=c(a,system.time(fit<-speedglm(as.formula(fooform), data=dat))[1])
  b=c(b, object.size(fit))
  
  #fit glmnet
  sx=sparseMatrix(rep(1:n,facs),as.vector(levx+matrix(seq(0,levs*(facs-1),by=levs),n,facs,byrow=TRUE)),x=1)
  a=c(a, system.time(fit<-glmnet(sx, y))[1])
  b=c(b, object.size(fit))
  
res=rbind(res,a)
obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of explanatory variables",ylim=c(0,500))
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

#plot 2
#object.size() is in bytes; the horizontal lines mark roughly 8G and 2G
plot(obj[,1],log(obj[,2]),type="l", ylab="log(object size in bytes)", xlab="number of explanatory variables",ylim=c(10,23))
lines(obj[,1],log(obj[,3]),col="red",lty=2)
lines(obj[,1],log(obj[,4]),col="blue",lty=3)
abline(h=log(8042*1e6))
abline(h=log(2000*1e6))
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

Writing a model matrix in sparse matrix form

2013-10-07T12:53:00.000-07:00

The last post glm, speedglm, glmnet comparison (part 1) showed that glmnet gives us a big speed and object size advantage for a vanilla regression when we have many explanatory variables (1000s). In my next posts, I will look at models with lots of factors. Here glmnet has an even bigger advantage because we can use sparse matrix notation to pass in our model. Before using sparse matrices with glmnet, I want to review how to specify a sparse matrix in R. This uses the Matrix R package. This is part of a series of posts I did comparing glm, speedglm and glmnet.

Let's say we had 5 data points and 2 explanatory variables X1 and X2. Each has 2 levels, "a" and "b". Our data look like so, a 5x2 matrix:

X1 X2
a  a
b  a
b  a
b  b
a  b

glm will represent this with a model matrix that expands it out into binary form with a column for each level-factor combination. It'll look something like this*, a 5x4 matrix:

X1a X1b X2a X2b
1   0   1   0
0   1   1   0
0   1   1   0
0   1   0   1
1   0   0   1

*Ok, assume that no intercept is estimated and ignore that we have to set one factor to 0 (or something) to make the problem identifiable.

This model matrix will have (#data points) rows and (#factors * #levels) columns AND it is almost all zeros. As the number of factors or levels gets big, this gets very wasteful: we run out of RAM and everything slows down.

We can represent this more concisely in sparse matrix form. For sparse matrix form, we just need the row and columns that are 1s.

The sparse matrix representation has (#factors * #data points) rows and 2 cols (a col for row # and one for col #). The ratio of the size of the original model matrix to the sparse model matrix is #levels/2. So as number of levels and factors gets big, sparse model matrix form will save lots of space.
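To put numbers on that ratio, here is a back-of-the-envelope calculation with made-up sizes (10,000 data points, 2 factors, 1000 levels each), counting matrix cells rather than bytes:

```r
n    <- 10000   # data points
facs <- 2       # factors
levs <- 1000    # levels per factor

dense_cells  <- n * (facs * levs)  # rows x columns of the full model matrix
sparse_cells <- (n * facs) * 2     # one (row, col) pair per 1 in the matrix

dense_cells / sparse_cells         # equals levs / 2
```

Note that the number of factors cancels out of the ratio; only the number of levels matters.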

How to write the model matrix in R

Here's a little piece of code to make the sparse matrix from a data frame where each column of the data frame is a factor. It uses the numeric representation of a factor, so as.numeric(factor).

#stringsAsFactors=TRUE is needed in R >= 4.0, where strings are no longer converted to factors by default
x=data.frame(X1=c("a","b","b","b","a"), X2=c("a","a","a","b","b"), stringsAsFactors=TRUE)
cols=0
sx=c() #the sparse matrix representation
for(i in 1:ncol(x)){
  #we need to add on the numbers of cols used for previous factors
  sx=rbind(sx,cbind(1:nrow(x),as.numeric(x[,i])+cols))
  cols = cols + length(levels(x[,i]))
}
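The (row, col) pairs in sx can be turned into an actual sparse matrix with sparseMatrix() from the Matrix package. Here is a quick check on the 5-point example, assuming the columns are factors. The dense matrix is built with model.matrix() purely for comparison.

```r
library(Matrix)

x <- data.frame(X1 = factor(c("a","b","b","b","a")),
                X2 = factor(c("a","a","a","b","b")))

# same loop as above: (row, col) pairs for the 1s
cols <- 0
sx <- c()
for (i in 1:ncol(x)) {
  sx <- rbind(sx, cbind(1:nrow(x), as.numeric(x[, i]) + cols))
  cols <- cols + nlevels(x[, i])
}

# turn the pairs into a sparse matrix with 1s at those positions
sm <- sparseMatrix(i = sx[, 1], j = sx[, 2], x = 1)

# the dense binary matrix from the example, for comparison
dense <- cbind(model.matrix(~ X1 - 1, x), model.matrix(~ X2 - 1, x))
all(as.matrix(sm) == dense)
```

sm is the kind of object that can be passed to glmnet in place of a dense model matrix.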

glm, speedglm versus glmnet comparison tests (part 1)

2013-10-03T19:47:00.000-07:00

I spent part of today learning glmnet, another R package for speedier generalized linear regression. Read this post for some background on what glmnet does. This is for massive linear regression problems where you are trying to find a minimal model and where the model matrix is so huge that it is maxing out your RAM and the computation is getting slow. The RAM is the kicker. If the computation is just slow, you can wait, but if it requires more RAM than you have then you are stuck. glm() is very RAM hungry due to the model.matrix that it constructs. This gets enormous as the number of explanatory variables gets huge. This post is based on this one by someone else: using-sparse-matrices-in-r. This is the first of a whole series of posts I did comparing glm, speedglm and glmnet.

Speed test #1. My first speed test used a simple Gaussian errors regression with continuous explanatory variables (meaning not factors, not categorical). First plot shows speed in seconds (on my laptop). Model is y ~ x1 + x2 + ... + xn, family="gaussian". R code for the test is below (and shows how to set each up for glm(), speedglm() and glmnet() ).

Plot 2 shows object size. The top 2 lines show 2G RAM and 8G RAM.

library(glmnet)
library(speedglm)
n =10000 #number of data points (y)
res = obj = c() #holders for output
#p is number of explanatory variables
for(p in c(100,500,1000,1500,2000)){
#create random covariate values
x = matrix(rnorm(n * p), n, p)
beta = rnorm(p) #random betas
y = x %*% beta + rnorm(n) #the response variable

cat(p);cat("\n")
#vanilla glm
#glm.fit = glm(y ~ x)
a=c(p, system.time(fit<-glm(y ~ x))[1])
b=c(p, object.size(fit))

#speedglm
da=data.frame(y=y, x)
#spdglm.fit = speedglm(y ~ x, data=da)
a=c(a,system.time(fit<-speedglm(y ~ x, data=da))[1])
b=c(b, object.size(fit))

#glmnet
#glmnet.fit = glmnet(x, y)
a=c(a, system.time(fit<-glmnet(x, y))[1])
b=c(b, object.size(fit))

res=rbind(res,a)
obj=rbind(obj,b)
}

#plot 1
plot(res[,1],res[,2],type="l", ylab="seconds", xlab="number of explanatory variables")
lines(res[,1],res[,3],col="red",lty=2)
lines(res[,1],res[,4],col="blue",lty=3)
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

#plot 2
#object.size() is in bytes; the horizontal lines mark roughly 8G and 2G
plot(obj[,1],log(obj[,2]),type="l", ylab="log(object size in bytes)", xlab="number of explanatory variables",ylim=c(10,23))
lines(obj[,1],log(obj[,3]),col="red",lty=2)
lines(obj[,1],log(obj[,4]),col="blue",lty=3)
abline(h=log(8042*1e6))
abline(h=log(2000*1e6))
legend("topleft",c("glm","speedglm","glmnet"),col=c("black","red","blue"),lty=1:3)

Test with real data

2013-09-27T12:58:00.000-07:00

The previous test indicated that the 2 factor model is working. But test with real data is problematic.

First good news. Estimates are correlated within well connected groups. Here's the soccer data

It's not a cloud. That's something. What's with it not being on the 1-1 line? Wrong prior?

But when I add more groups that are loosely connected to each other, it breaks down and starts looking like a cloud.

This makes sense as there is not a whole lot of information to sort out groups against each other.

Add a smoother step to condition on all the data. ALAS it depends on retaining the covariances. But still I don't need to retain all of them and make an nxn var-cov matrix (which is what is hogging the RAM). That matrix is incredibly sparse and working with the whole thing is the problem. Look into methods for storing sparse matrices.
Do some sims to see if being off the 1-1 is from the prior.
The filterglm() is still working ok for a well-mixed group so still has potential. Do some sims to understand how the lack of mixing hurts. Elo starts all new players at a low level. I'm kind of doing that too. Do some time-series to understand how players move "up". I have priors for the groups. Why not use that as a better prior?
Or perhaps a hierarchical approach? Where I define 'groups' and try to estimate the group mean? It would work, but it's too pedantic. I like the organic 'crowd-sourced' ranking idea better. The structure is idiosyncratic to the problem and I don't see how I'd organically get the groups.

Code to run this test
formula=y~-1+attack+defend
dat=read.csv("2013 match data/2013-2014/boys-scores-master.csv", stringsAsFactors=FALSE)
attack = c(dat$home.team, dat$away.team)
defend = c(dat$away.team, dat$home.team)
y = c(dat$home.score,dat$away.score)
moddat = data.frame(y=y,attack=attack, defend=defend,stringsAsFactors=FALSE)
#No NaN allowed in this approach
moddat = moddat[!is.na(moddat$y),]
#now make the factors
levs = unique(c(moddat$attack,moddat$defend))
moddat = data.frame(y=moddat$y,attack=factor(moddat$attack, levels=levs),
                    defend=factor(moddat$defend, levels=levs))
test=filterglm(formula, moddat)
test2=as.data.frame(test)

#load in the fbRanks speedglm object for the 9-24 data above
test3=print(fbRanks,silent=TRUE,age="B00",region="WA")$ranks[[1]]
#resort to match whatever is output by print.fbRanks
test4=test2[match(test3$team,rownames(test2)),]
#print.fbRanks is showing exp(attack)
plot(log(test3$attack),test4$attack.mean,xlab="speedglm estimate",ylab="filterglm estimate")
abline(a=-1*mean(log(test3$attack)),b=1)

par(mfrow=c(2,2))
for(i in c("B01","B00","B99","B98")){
#load in the fbRanks speedglm object for the 9-24 data above
test3=print(fbRanks,silent=TRUE,age=i,region=c("OR","WA"))$ranks[[1]]
#resort to match whatever is output by print.fbRanks
test4=test2[match(test3$team,rownames(test2)),]
#print.fbRanks is showing exp(attack)
plot(log(test3$attack)-mean(log(test3$attack),na.rm=TRUE),test4$attack.mean-mean(test4$attack.mean,na.rm=TRUE),xlab="speedglm estimate",ylab="filterglm estimate")
title(i)
abline(a=0,b=1)
}

par(mfrow=c(2,2))
i = c("B00","B99","B98")
#load in the fbRanks speedglm object for the 9-24 data above
test3=print(fbRanks,silent=TRUE,age=i,region=c("OR","WA"))$ranks[[1]]
#resort to match whatever is output by print.fbRanks
test4=test2[match(test3$team,rownames(test2)),]
#print.fbRanks is showing exp(attack)
plot(log(test3$attack)-mean(log(test3$attack),na.rm=TRUE),test4$attack.mean-mean(test4$attack.mean,na.rm=TRUE),xlab="speedglm estimate",ylab="filterglm estimate")
title(i)
abline(a=0,b=1)

New functions
filterglm = function(formula, data, weights = NULL){
#this code is specific to the soccer ranking problem
#it requires that factors have the same levels; not a requirement
#the data must be a dataframe with attack and defend
tf <- terms(formula)
M <- model.frame(tf, data)
names.x = levels(data$attack)
n.x = length(names.x)
M = lapply(M,function(x){if(is.factor(x))x=as.numeric(x) else x})
M = as.data.frame(M)
#mean and variance
est.x=matrix(c(0, 0,1,1), n.x, 4, byrow=TRUE)
rownames(est.x)=names.x
colnames(est.x)=c(paste(colnames(attr(tf,"factors")),".mean",sep=""),
     paste(colnames(attr(tf,"factors")),".var",sep=""))
n.trials = dim(M)[1]
for(i in 1:n.trials){
    #go through each contest sequentially and update the factors estimates
    prior.xtt = matrix(c(est.x[M[i,2],1],est.x[M[i,3],2]),2,1)
    prior.Ptt = diag(c(est.x[M[i,2],3],est.x[M[i,3],4]))
    out=filter.update(M[i,1],prior.xtt=prior.xtt, prior.Ptt=prior.Ptt)
    est.x[M[i,2],1]=out$post.xtt[1]
    est.x[M[i,3],2]=out$post.xtt[2]
    #post.Ptt is not a diagonal matrix! Think about it. It shouldn't be.
    #but I don't retain the information regarding covariance between player estimates
    #this is where this approach loses efficiency relative to an approach that
    #analyzes all the data jointly. But I'm assuming I never have the data.....
    est.x[M[i,2],3]=diag(out$post.Ptt)[1]
    est.x[M[i,3],4]=diag(out$post.Ptt)[2]
}
return(est.x)
}

filter.update=function(y, prior.xtt = matrix(0,2,1), prior.Ptt = diag(1,2), Q=diag(0,2)){
    require(KFAS)
    n=1; TT=1; m=2
    B=diag(1,2); t.B=B
    Z=matrix(c(1,-1),1,2)
    #Q comes from the function argument (default diag(0,2)); don't overwrite it here
    P1inf=matrix(0,m,m)
    if(packageVersion("KFAS")=="0.9.11")
      stop("KFAS 1.0.0 required and you have old version")
      #kfas.model=SSModel(y, Z=Z, T=B, R=diag(1,m), Q=Q, a1=prior.xtt, P1=prior.Ptt, P1inf=P1inf, distribution="Poisson")
    else
      kfas.model=SSModel(y ~ -1+SSMcustom( Z=Z, T=B, R=diag(1,m), Q=Q, a1=prior.xtt, P1=prior.Ptt, P1inf=P1inf), distribution="poisson")
    ks.out=KFS(kfas.model)
    return(list(post.xtt=ks.out$alphahat[1:2],post.Ptt=ks.out$V[1:2,1:2,1]))
}

Test of the 2-player filter model with attack and defend

2013-09-27T11:46:00.002-07:00

Same ideas as yesterday, except now players have different types of x's depending on whether they are player 1 or 2 (i.e. attacking and defending). I did this after testing the idea on a real dataset and seeing no correlation to speedglm; I then tested against a known dataset and saw no correlation there either. So I did this to see if the problem is the 2 factor types or a bug in the code I wrote this AM. Looks like the latter.

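Here's the 2nd test. It calls simple.update() with distribution="poisson", so it needs a Poisson version of the simple.update() function from yesterday's post (the version posted there is Gaussian only). Requires KFAS 1.0.0. I tried 0.9.11, and though it looks to have the Poisson, it returns NaN if the count is 0.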

sim.poisson.test2 = function(n.x=1000, n.trials=10*1000){
#variance of the distribution of the player pool x's
mean.x = 0
sig2.x = 1
#true.x is what we are trying to estimate
#rnorm takes the sd, not the variance
true.attack = rnorm(n.x, mean.x, sqrt(sig2.x))
true.defend = rnorm(n.x, mean.x, sqrt(sig2.x))
dat = matrix(0,n.trials,3)
for(i in 1:n.trials){
    dat[i,2:3] = sample(1:n.x,2)
    dat[i,1] = rpois(1,exp(true.attack[dat[i,2]]-true.defend[dat[i,3]]))
}

#start everyone with an estimate and uncertainty
#corresponding to the player pool mean and variance
est.x=matrix(c(mean.x, mean.x, sig2.x, sig2.x),n.x,4,byrow=TRUE)
for(i in 1:n.trials){
    #go through each contest sequentially and update the player x's
    prior.xtt = matrix(c(est.x[dat[i,2],1],est.x[dat[i,3],2]),2,1)
    prior.Ptt = diag(c(est.x[dat[i,2],3],est.x[dat[i,3],4]))
    out=simple.update(dat[i,1],prior.xtt=prior.xtt, prior.Ptt=prior.Ptt, distribution="poisson")
    est.x[dat[i,2],1]=out$post.xtt[1]
    est.x[dat[i,3],2]=out$post.xtt[2]
    #post.Ptt is not a diagonal matrix! Think about it. It shouldn't be.
    #but I don't retain the information regarding covariance between player estimates
    #this is where this approach loses efficiency relative to an approach that
    #analyzes all the data jointly. But I'm assuming I never have the data.....
    est.x[dat[i,2],3]=diag(out$post.Ptt)[1]
    est.x[dat[i,3],4]=diag(out$post.Ptt)[2]
}
par(mfrow=c(1,2))
plot(est.x[,1],true.attack,xlab="estimated attack",ylab="true attack")
plot(est.x[,2],true.defend,xlab="estimated defend",ylab="true defend")
}

Sept 23 2013 Papers

2013-09-26T12:30:00.003-07:00

25 Years of Forecasting w Time Series Models
http://www.est.uc3m.es/esp/nueva_docencia/comp_col_get/lade/tecnicas_prediccion/Practicas0708/Practica1/25%20years%20of%20time%20series%20forecasting%20%28Gooijer%20and%20Hyndman%29.pdf

update equations for the 2-player contest with a poisson link function

2013-09-26T12:29:00.000-07:00

The equations for various link functions are here but it's for a univariate case. Crud. I need univariate y and bivariate x.
Harvey, A.C. and C. Fernandes, 1989, Time series models for count or qualitative observations, J. Bus. Econ. Statist., 7, pp. 407-423.

http://www.tandfonline.com/doi/abs/10.1080/07350015.1989.10509750#.UkSIXxD0eUM

Toy example of a filter for computing the 2-player hidden x's

2013-09-26T11:03:00.002-07:00

Follow-up on relationship-of-elo-algorithm-to-logistic-regression

This morning I coded up an implementation of the idea for an Elo-like (or Kalman-like or filter-esque or Bayesian) update algorithm for 2-player data. It is Elo-like because the idea is that you never store the data nor the player information. Rather each player knows their own estimated mean x and uncertainty in that estimate. They meet another player and have a contest. At the end of the contest, they exchange their priors (their mean x and uncertainty before the contest) and each updates their own estimate of their x.

I'm going to tackle first an easy problem for which I already know the update equation. Ultimately, I want to use this for a problem where I will have to derive the update equation. I would prefer a closed form update equation, but the principle will work even if I have to do a numerical update using an MCMC algorithm.

Set-up of the problem:
Assume a large player pool. Player x's are drawn from a Normal distribution with mean mu and variance pi. We assume that we know what this distribution is but don't know the individual player x's. Our objective is to estimate those x's. Two players are drawn at random. One is chosen (randomly) to be #1 (attacker) and the other is #2 (defender). They have a contest. The outcome of this contest is drawn from a Normal distribution with mean (x.attacker - x.defender) and variance contest.var. Again we assume we know a lot about the nature of this contest, so we know the contest variance and we know the outcome is normally distributed. But we don't know the x's of the players in the contest. Our players start with an estimated x and uncertainty of mu and pi (the distribution of x's in the player population). They head out and randomly encounter other players and have contests with them. After each contest, the individual players update their x estimate and their uncertainty in that estimate.

Code is below. This plot summarizes the results with 1000 players and 10,000 or 5,000 contests. The attacker and defender were chosen randomly. The mean number of contests per player was 20 for 10,000 contests (each contest includes 2 players) and 10 for 5,000 contests.

First, cool. This works!! Second, it depends a lot on the characteristics of the contest. If the contest involves a lot of luck (bottom row), there is only a loose relationship between mu1-mu2 and the outcome, and many contests are needed to get a good estimate of players' x's. If the contest outcome is closely related to mu1-mu2 (top row), then fewer contests are needed.

The R code:
simple.update=function(y, prior.xtt = matrix(0,2,1), prior.Ptt = diag(1,2), Q=diag(0,2), R=.1){
#This is a Kalman filter
#y is the response (data point)
#Q is how mean x varies in time
#R is how y (response) is variable with given mu1-mu2
#y ~ N(mu1-mu2,R)
Z=matrix(c(1,-1),1,2); tZ = t(Z)
Ptt1 = prior.Ptt + Q
xtt1 = prior.xtt
Kt = Ptt1%*%tZ%*%solve(Z%*%Ptt1%*%tZ + R)
xtt = xtt1 + Kt%*%(y-Z%*%xtt1)
Ptt = (diag(1,2)-Kt%*%Z)%*%Ptt1
return(list(post.xtt=xtt,post.Ptt=Ptt))
}

sim.test = function(r=1, n.x=1000, n.trials=10*1000){
#variance of the distribution of the player pool x's
mean.x = 0
sig2.x = 1
#true.x is what we are trying to estimate
true.x = rnorm(n.x, mean.x, sqrt(sig2.x)) #rnorm takes the sd, not the variance
dat = matrix(0,n.trials,3)
for(i in 1:n.trials){
dat[i,2:3] = sample(1:n.x,2)
#r is the contest variance, so the sd is sqrt(r)
dat[i,1] = rnorm(1,true.x[dat[i,2]]-true.x[dat[i,3]],sqrt(r))
}

#start everyone with an estimate and uncertainty
#corresponding to the player pool mean and variance
est.x=matrix(c(mean.x, sig2.x),n.x,2,byrow=TRUE)
for(i in 1:n.trials){
#go through each contest sequentially and update the player x's
prior.xtt = matrix(est.x[dat[i,2:3],1])
prior.Ptt = diag(est.x[dat[i,2:3],2])
#pass the contest variance to the filter; otherwise the default R=.1 is always used
out=simple.update(dat[i,1],prior.xtt=prior.xtt, prior.Ptt=prior.Ptt, R=r)
est.x[dat[i,2:3],1]=out$post.xtt
#post.Ptt is not a diagonal matrix! Think about it. It shouldn't be.
#but I don't retain the information regarding covariance between player estimates
#this is where this approach loses efficiency relative to an approach that
#analyzes all the data jointly. But I'm assuming I never have the data.....
est.x[dat[i,2:3],2]=diag(out$post.Ptt)
}

plot(est.x[,1],true.x,xlab="estimated x",ylab="true x")
}

par(mfrow=c(3,3))
r=.1
hist(rnorm(1000,0,sqrt(r)),main="Dist of contest outcomes\nr=.1",xlab="contest outcome")
sim.test(r=r)
title("Mean 20 contests\nper player")
sim.test(r=r,n.trials=5*1000)
title("Mean 10 contests\nper player")

r=1
hist(rnorm(1000,0,sqrt(r)),main="r=1",xlab="contest outcome")
sim.test(r=r)
sim.test(r=r,n.trials=5*1000)

r=2
hist(rnorm(1000,0,sqrt(r)),main="r=2",xlab="contest outcome")
sim.test(r=r)
sim.test(r=r,n.trials=5*1000)

Ok, that's great. This is nothing new. It's just an implementation of Elo's idea but

in a slightly different context
different link function between response variable and hidden variables
players retain information about the uncertainty in their estimated x

But it now points me in the direction of an algorithm for a generic contest link function (Bernoulli for a success-fail contest, Poisson for a contest with points or goals, Negative binomial, etc). The Gaussian link function is nice since the update equation (Kalman filter) is closed form. If I have to resort to a numerical updater (Gibbs or MCMC), it's going to get slow.
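For a feel of what one closed form update does, here is a single Gaussian update worked through with numbers. It repeats the equations in simple.update() with made-up inputs (prior variance 1 for both players, contest variance 0.1, observed outcome 1).

```r
Z <- matrix(c(1, -1), 1, 2)        # outcome depends on x1 - x2
P <- diag(1, 2)                    # prior: two independent players, variance 1
x <- matrix(0, 2, 1)               # prior means
R <- 0.1                           # contest variance (made up)
y <- 1                             # observed outcome (made up)

K     <- P %*% t(Z) %*% solve(Z %*% P %*% t(Z) + R)  # Kalman gain
xpost <- x + K %*% (y - Z %*% x)                     # posterior means
Ppost <- (diag(1, 2) - K %*% Z) %*% P                # posterior covariance

xpost
Ppost   # the off-diagonal entries are not zero
```

Both players move by the same amount in opposite directions, and the posterior covariance has non-zero off-diagonal terms. That is the information that gets dropped when only the diagonal is kept.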

Relationship of Elo algorithm to logistic regression

2013-09-25T12:21:00.001-07:00

Follow up on strategy-for-asynchronous-update
See ** at bottom for where this all is going.
Huh, what's this have to do with Elo algorithm? The Elo algorithm is a solution to a problem similar to the reverse 2-player logistic regression described half-way down.

y ~ b, link=f(b)*

y is 0,1 data (success, failure). In a typical logistic regression, we use the logistic function to link "t" to probability of success.

prob of success = p = 1/(1+exp(-t))

Then we assume some function that relates our covariate x to t. Vanilla approach is a linear relationship: t = a + bx

* except that we think of this in the inverse (logit). a+bx = g(x) = log(p/(1-p)) or log odds is a linear function of x.

The objective of this simple logistic regression is to estimate a and b, given x's associated with y's (0,1 data). An iterative algorithm is used where we start with some estimate of a and b, and then keep updating that (e.g. Newton method).

So now let's reverse the problem.

We know (assume) a and b but we do not know x. We want an algorithm that gets us the x(i) where i is our i-th individual (say). That seems easy enough but we need multiple trials for each i. Then we get an estimate of p = successes/trials. We plug in p, a and b into the logistic equation and solve for x.
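A small sketch of that reverse problem for one individual. The values of a, b and x_true are made up for illustration; plogis() and qlogis() are the base R logistic and logit functions.

```r
a <- -1; b <- 2          # assumed known
x_true <- 0.7            # the unknown we want to recover (made up)
set.seed(1)
trials <- 5000
y <- rbinom(trials, 1, plogis(a + b * x_true))  # 0/1 outcomes for one individual

p_hat <- mean(y)                      # successes / trials
x_hat <- (qlogis(p_hat) - a) / b      # invert the logistic equation for x
x_hat
```

With only a handful of trials p_hat can be exactly 0 or 1, and then the logit is infinite, which is one reason many trials per individual are needed.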

So that's not very interesting. It becomes more interesting when we have 2 players in each trial.

t = a + bx(i) + bx(j)

We want to solve for the x's. How to do that? First imagine that you have the data** on a bunch of trials.

** Don't you always have to 'have the data'? No. Where this is going is an algorithm where no one has the data. Each player has an estimate of their x and their uncertainty about this estimate. Two players come together and have a trial. Each updates their estimate and uncertainty given their information about both players' x's. Then they go off and find another player to have a trial with. The 'data' is never kept; players only keep their current estimate of their x and their estimate of its uncertainty.

Strategy for asynchronous update algorithm for Dixon and Cole's model

2013-09-25T11:05:00.000-07:00

Problem: Synchronous updating for Dixon and Cole's model--speedglm(family=poisson(log)), is ultimately unscalable. So glm(y ~ factor(x)) at some point reaches a limit as the levels in x go to infinity.

It works until one maxes out the RAM. Second, it does not allow parallelization. But parallel is not the right idea. Parallel means each agent (agent is an analogy for something that does a computation) works on an isolated part of the computation. I want something more like the exercise that Rachel led at the ISEES workshop. The post-it notes are all on the wall. Many agents come up and move the post-it notes at once, each messing up the others' work. There is no compartmentalization. But there is something like 'importance sampling'. The contentious post-it notes are moving more. The non-contentious ones are quickly settled. Idea is to 'set loose' many 'bugs' in the data and these go to work on the data.

Imagine rating an effectively infinite number of 'players'. I'm using 'teams' but this isn't about sports but about estimating a model from enormous 2-player datasets with effectively infinite numbers of players. Players could be cell-phone numbers and the contest something about a call between 2 phones and you are trying to rank some characteristic of the phone numbers.

Relation to EM algorithm. At each step the LL increases. Ultimately the max is reached.
* compute expected value of hidden state conditioned on all the data
- forward/backward smoother
* compute ML of parameters conditioned on data and expected value of hidden state

Relation to Bayesian algorithm
* Start with prior on hidden state
* Get 1 data point, update to posterior of hidden state
* Need a closed form update equation
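To make the "closed form update" idea concrete, here is a sketch for the simplest case I know: a Normal prior on a scalar hidden state and a Normal data point with known variance. This is not Dixon and Cole's model, and bayes.update() is just a name made up for this example.

```r
# prior on the hidden state: x ~ N(m, v)
# one data point: y ~ N(x, r), with r known
bayes.update <- function(y, m, v, r) {
  k <- v / (v + r)                 # weight given to the new data point
  list(mean = m + k * (y - m),     # posterior mean
       var  = (1 - k) * v)         # posterior variance
}

post <- list(mean = 0, var = 1)    # start at the prior
for (y in c(1.2, 0.8, 1.1)) {      # data arrive one at a time
  post <- bayes.update(y, post$mean, post$var, r = 0.5)
}
post
```

In this conjugate case, updating one data point at a time gives the same posterior as using all three at once, so nothing is lost by never storing the data.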

Relation to MCMC
* MCMC algorithm is getting the posterior surface
* Same idea but I want the 'strengths surface'. The x-axis is 'player'. It is a factor in glm lingo or random effect in glmer lingo. It is discrete, but effectively infinite. The y-axis is strength.

Does biological complexity add realism?

2007-09-28T12:54:00.000-07:00

From a report that will remain unnamed:
"At the other end of the spectrum are formulations such as IBMs which require detailed knowledge of physiological and metabolic processes and how these influence the vital rates of fecundity and survivorship. The realism of these approaches is further enhanced through incorporation of density influences on or stochastic variation in these processes. Such data are difficult to obtain, yet their inclusion into appropriate models permits the most detailed assessments."

As usual, addition of biological complexity into a model is equated with adding realism. Realism is good and permits a better risk assessment. I completely disagree with this general statement -- BECAUSE details are unknown. I would argue unknowable, but most would disagree.

But back to the idea that a detailed mechanistic model is better for risk assessment. Let's use an analogy. I am a witness to a crime. I get a brief glimpse of the perpetrator. I report the crime and am working with a police artist to create a composite of the criminal.

Artist: Male or Female
Me: Female
Artist: Hair?
Me: Black
Artist: Eye color?
Me: I didn't see that.
Artist: Hmm, well we know that all humans have eye color so to make this realistic we need to pick an eye color.
Me: I didn't see her eyes.
Artist: Ok, let's use the maximum likelihood estimate and make them brown.
Artist: Height?
Me: Average
Artist: Hmm, well to make this realistic let's use the average height of women, 5' 6".
Artist: Clothing?
Me: Dark pants and light t-shirt. I didn't see the shoes.
Artist: Ok, let's add some realism. Blue jeans, sound ok?, light t-shirt..hmm, a woman wouldn't wear a regular t-shirt, let's make it a v-neck. Shoes...crocs, everyone is wearing those nowadays.
Me: Well, I really don't recall the specifics, that could be what she was wearing.
Artist: Ok, I'm going to go off and make a detailed photorealistic picture of this woman.....

The artist comes back with a photorealistic picture. It definitely looks like a real human woman, but it does not look like the criminal. In this case, more realism just hinders the investigation. It would be better to stick with "black haired average height woman" even though that is vague. It might not end up being all that useful, but it rules out many suspects.

neutral models of metapopulation dynamics

2007-02-26T13:26:00.000-08:00

Discuss the population patterns that occur via neutral models of dispersal. Illustrate that these patterns occur in large collections of spatially-structured populations. Illustrate that complex patterns of population density can occur via patterns of dispersal. Analogous to Hubbell's work on neutral models of diversity.

Neutral models of population distributions

Colloquially, people think of different rates of population growth or decline as an indication of population robustness. However, can we see what is going on?

Is it possible to detect habitat heterogeneity? At low dispersal, we see the effect of heterogeneity but as dispersal increases,

Within a metapopulation, there is a canonical relationship between the year-to-year variability within the total population and the variability in growth rates between sub-populations.
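A minimal R sketch of the kind of neutral simulation these notes point to. Everything in it is an assumption for illustration: the function name simulate_neutral, 20 sites, 50 years, lognormal growth with the same mean and sd at every site, and a fraction disp of each site redistributed evenly each year. None of it comes from an existing analysis. Because every site follows the same growth process, any differences in site trends come from chance and dispersal alone.

```r
# Hypothetical neutral metapopulation: all sites share one growth-rate
# distribution; they differ only by chance and through dispersal.
simulate_neutral <- function(n_sites = 20, n_years = 50, mu = 0, sigma = 0.2,
                             disp = 0.1, seed = 1) {
  set.seed(seed)
  logN <- matrix(NA_real_, nrow = n_years, ncol = n_sites)
  N <- rep(100, n_sites)
  logN[1, ] <- log(N)
  for (yr in 2:n_years) {
    # identical stochastic growth at every site (the "neutral" assumption)
    N <- N * exp(rnorm(n_sites, mean = mu, sd = sigma))
    # a fraction `disp` leaves each site and is shared equally among all sites
    emigrants <- disp * N
    N <- N - emigrants + sum(emigrants) / n_sites
    logN[yr, ] <- log(N)
  }
  logN
}

logN  <- simulate_neutral()
years <- seq_len(nrow(logN))
# per-site trend: slope of log abundance against year
trend <- apply(logN, 2, function(y) coef(lm(y ~ years))[2])
range(trend)
```

Rerunning with disp = 0 and then with a larger disp is one way to look at how dispersal changes the spread of site trends.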

Rweb

2006-10-10T14:06:00.000-07:00

An example of folks with an R server up and running

http://www.stat.umn.edu/geyer/old03/5601/examp/parm.html

http://www.math.montana.edu/Rweb/

http://bayes.math.montana.edu/Rweb/Resources.html

prediction error

2006-06-22T13:19:00.000-07:00

Efron, B. 2004. The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association 99: 619-632.

predictive vs interpolative accuracy

2006-06-03T14:22:00.000-07:00

Predictive Accuracy as an Achievable Goal of Science
Malcolm R. Forster
University of Wisconsin-Madison

What has science actually achieved? A theory of achievement should (1) define what has been achieved, (2) describe the means or methods used in science, and (3) explain how such methods lead to such achievements. Predictive accuracy is one truth-related achievement of science, and there is an explanation of why common scientific practices (of trading off simplicity and fit) tend to increase predictive accuracy. Akaike’s explanation for the success of AIC is limited to interpolative predictive accuracy. But therein lies the strength of the general framework, for it also provides a clear formulation of many open problems of research.

http://philosophy.wisc.edu/forster/papers/PSA2000.pdf

ok so LR depends on nested-ness

2006-02-14T10:28:00.000-08:00

since the models are not nested, the usual LR test statistic will not have an asymptotic Chi-square distribution and hence the statistic you compute will not have a meaningful interpretation.

http://www.biostat.wustl.edu/archives/html/s-news/2004-03/msg00200.html

However, Burnham and Anderson (p. 88) argue that the ranking of models with AICc is not limited by nestedness.
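A small R sketch of the distinction, using simulated data and base R only. The data, the model names m0, m1, m2 and the helper aicc are all made up for illustration. The nested pair gets the usual chi-square LR test; the non-nested pair is only ranked by AICc.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)   # simulated data; x2 plays no role

m0 <- lm(y ~ 1)    # nested inside m1
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x2)   # neither m1 nor m2 is a special case of the other

# Nested pair: the LR statistic can be referred to a chi-square distribution
lr      <- 2 * (as.numeric(logLik(m1)) - as.numeric(logLik(m0)))
df_diff <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
p_value <- pchisq(lr, df = df_diff, lower.tail = FALSE)

# Non-nested pair: rank by AICc (AIC with the small-sample correction)
aicc <- function(fit) {
  k     <- attr(logLik(fit), "df")  # estimated parameters, incl. residual variance
  n_obs <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n_obs - k - 1)
}
c(m1 = aicc(m1), m2 = aicc(m2))
```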

page 61 in PRNN

Ripley says that the NIC criterion is based on the penalty 2p*, where

p* = trace[KJ^-1]

If the model is adequate (or true), J=K, and p* is the number of parameters and NIC becomes AIC. These results are based on asymptotic normality of the parameter estimates.
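A toy R check of the trace[KJ^-1] term for a one-parameter model. I am assuming the usual definitions here, which are not spelled out in these notes: J is minus the expected second derivative of the log density and K is the variance of the score. The working model is N(mu, 1), so J = 1 and the trace reduces to K/J. The function name trace_KJinv and the simulated data are made up for illustration.

```r
set.seed(42)

# Working model: x ~ N(mu, 1), with mu the only parameter
trace_KJinv <- function(x) {
  mu_hat <- mean(x)        # MLE of mu under the working model
  score  <- x - mu_hat     # per-observation score, d/dmu of the log density
  K <- mean(score^2)       # empirical variance of the score
  J <- 1                   # minus the second derivative of the log density
  K / J                    # trace[K J^-1] when there is a single parameter
}

x_ok  <- rnorm(5000, mean = 2, sd = 1)  # working model is adequate
x_bad <- rnorm(5000, mean = 2, sd = 2)  # variance is misspecified

trace_KJinv(x_ok)   # should be near 1, the number of parameters
trace_KJinv(x_bad)  # should be near 4, so the penalty is no longer the AIC one
```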

Moody 1991, 1992 (uses effective number of parameters)
Murata et al 1991 (on the effective number of parameters)
cf. maybe first Draper 1995 JRSS

Fisher information
http://en.wikipedia.org/wiki/Fisher_information_matrix

Determining correct model complexity

2006-02-14T10:17:00.000-08:00

x ∈ X (X is the set of possible data)

Let's specify some statistic t(x)

from x estimate the deviance [t(X)-t(x)]^2 = s^hat

On average how big is this deviance?

Akaike -> 2p
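One reading of the note above, sketched in R: fit a model with p parameters, then compare its deviance on the fitting data with its deviance on fresh data. This is my own toy setup, not taken from any of the sources here. It assumes a linear model with known sigma = 1, so the deviance is just the residual sum of squares, and fresh data drawn at the same X. Under those assumptions the average gap should be about 2p.

```r
set.seed(7)
n <- 50; p <- 5; reps <- 2000
X    <- cbind(1, matrix(rnorm(n * (p - 1)), nrow = n))  # fixed design, p columns
beta <- rep(0.5, p)
mu   <- drop(X %*% beta)

gap <- replicate(reps, {
  y     <- mu + rnorm(n)                 # data used for fitting
  y_new <- mu + rnorm(n)                 # fresh data at the same X
  fit   <- lm.fit(X, y)$fitted.values    # least-squares fitted values
  # out-of-sample minus in-sample deviance (sigma = 1)
  sum((y_new - fit)^2) - sum((y - fit)^2)
})
mean(gap)   # expected to be near 2 * p
```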

http://www.stat.columbia.edu/~cook/movabletype/archives/2004/12/against_parsimo.html
Against parsimony
Occam’s Razor and the Relational Nature of Evidence
Tutorial
ftp://ftp.cs.utoronto.ca/pub/radford/bayes-tut.ps