<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:blogger='http://schemas.google.com/blogger/2008' xmlns:georss='http://www.georss.org/georss' xmlns:gd="http://schemas.google.com/g/2005" xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5979497974446854318</id><updated>2024-09-14T12:53:19.122+08:00</updated><category term="R"/><category term="Tutorial"/><category term="Python"/><category term="Descriptive Statistics"/><category term="LaTeX"/><category term="Mathematics"/><category term="Parametric Inference"/><category term="Probability Theory"/><category term="Data Mining"/><category term="Installation"/><category term="Interactive Visualization"/><category term="C and CPP"/><category term="Image Analysis"/><category term="Machine Learning"/><category term="Mapping"/><category term="Multivariate Analysis"/><category term="SAS"/><category term="Spatial Analysis"/><category term="Statistical Learning"/><category term="Animation"/><category term="Book Review"/><category term="Infographic"/><category term="Linear Models"/><category term="Mathematica"/><category term="Meetup"/><category term="Packages"/><category term="Real Analysis"/><category term="Signal Processing"/><category term="Ubuntu"/><category term="Video"/><category term="ALUES"/><category term="Julia"/><category term="Nonlinear Models"/><category term="Optimization Algorithm"/><category term="Sampling Analysis"/><category term="Shiny Apps"/><category term="Windows"/><title type='text'>Analysis with Programming</title><subtitle type='html'>. . . a love story between theory and practice . . 
.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default?redirect=false'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default?start-index=26&amp;max-results=25&amp;redirect=false'/><author><name>AL</name><uri>http://www.blogger.com/profile/09263478137359820882</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>68</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-7061899207800133848</id><published>2017-06-12T13:01:00.000+08:00</published><updated>2017-11-07T12:35:18.730+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Installation"/><category scheme="http://www.blogger.com/atom/ns#" term="Julia"/><title type='text'>Julia: Installation and Editors</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: justify;&quot; trbidi=&quot;on&quot;&gt;
If you have been following this blog, you may have noticed that I haven&#39;t posted any update for more than a year now. The reason is that I&#39;ve been busy with my research and my work, and I promised not to share anything here until I finished my degree (Master of Science in Statistics). Anyway, at this point I think it&#39;s time to share with you what I&#39;ve learned in the past year. It&#39;s been a good year for Statistics, especially in the Philippines. In fact, on November 15, 2016, a team of local data scientists made a huge step for Big Data by organizing the &lt;a href=&quot;http://www.bigdataconferenceph.com/&quot; target = &quot;_blank&quot;&gt;first ever conference on this topic&lt;/a&gt;. A few months before that, the &lt;a href=&quot;https://psa.gov.ph/ncs/13th&quot; target = &quot;_blank&quot;&gt;13th National Convention on Statistics&lt;/a&gt;, organized by the &lt;a href=&quot;https://psa.gov.ph/&quot; target = &quot;_blank&quot;&gt;Philippine Statistics Authority&lt;/a&gt;, invited a keynote speaker from &lt;a href=&quot;http://www.paris21.org/&quot; target = &quot;_blank&quot;&gt;Paris21&lt;/a&gt; to tackle Big Data and its use in government. &lt;br/&gt;&lt;br/&gt;

So without further ado, in this post I would like to share a new programming language which I&#39;ve used for several months now, called &lt;a target = &quot;_blank&quot; href=&quot;https://julialang.org/&quot;&gt;Julia&lt;/a&gt;. This language is by far my favorite; as many would say, it is a well-thought-out language, for many reasons. The first, of course, is its speed, the second is its grammar, and there are many more. I can&#39;t list them all here, but I suggest you visit the &lt;a href=&quot;https://julialang.org/&quot; target = &quot;_blank&quot;&gt;official website&lt;/a&gt; and try it for yourself.
&lt;h2&gt;Installation&lt;/h2&gt;
The installation is straightforward: simply go to &lt;a target = &quot;_blank&quot; href=&quot;https://julialang.org/downloads/&quot;&gt;Julia&#39;s official download page&lt;/a&gt; and download the binaries for your operating system. Alternatively, you can install Julia by downloading &lt;a target = &quot;_blank&quot; href=&quot;https://juliacomputing.com/products/&quot;&gt;JuliaPro&lt;/a&gt; from the &lt;a href=&quot;https://juliacomputing.com/&quot;&gt;Julia Computing&lt;/a&gt; products. This will set up everything you need, including the &lt;a target = &quot;_blank&quot; href=&quot;https://atom.io/&quot;&gt;Github Atom Editor&lt;/a&gt;, out of the box. After installation, the first time you launch the command-line version, you&#39;ll see the following window: &lt;br/&gt;&lt;br/&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center; margin-top:-20px;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyH49Xj1D6Ktyc1ZetszkF_0a9UrbyVRL7Y8Kw3OzjdQt-WbSgeo1KFRU2DBSbiT5l4WTx1Ec7V2wtiQjI40sKfBl5Sg3bPY_9-PgqQRMSTVdDJv5dB48JBLh-nToZEhwUSKcvdbT0YQJH/s1600/Screen+Shot+2017-06-11+at+1.51.53+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyH49Xj1D6Ktyc1ZetszkF_0a9UrbyVRL7Y8Kw3OzjdQt-WbSgeo1KFRU2DBSbiT5l4WTx1Ec7V2wtiQjI40sKfBl5Sg3bPY_9-PgqQRMSTVdDJv5dB48JBLh-nToZEhwUSKcvdbT0YQJH/s1600/Screen+Shot+2017-06-11+at+1.51.53+PM.png&quot; data-original-width=&quot;1410&quot; data-original-height=&quot;956&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p style = &quot;margin-top: -50px;&quot;&gt;&lt;/p&gt;
Working with the command-line version is actually fun, and personally I think Julia has the best command-line interface of the three, compared with R and Python, in terms of features. For example, you can switch to &lt;b&gt;shell mode&lt;/b&gt; by simply pressing &lt;kbd&gt;;&lt;/kbd&gt; at the Julia prompt, and activate &lt;b&gt;help mode&lt;/b&gt; with &lt;kbd&gt;?&lt;/kbd&gt;. It also has autocompletion: press &lt;kbd&gt;Tab&lt;/kbd&gt; after typing the first few letters of a name. The LaTeX-to-Unicode autocompletion is also one of its best features, and almost any symbol/character can be used as a variable name, even an emoticon, as shown below:
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGZ6fK6K39zLUeFIBFqq2mzLYcyfg-8Sukhgb7naQVNZJf368IK267szlrAlVvu2zPgTPNrXn-9L4dTtv16NdZijhyphenhyphenqGmzbUEaUM5oBLbJ_aUoYd2Kbj2wfjPBFxK4k9vc2WFcmXpZ9mSy/s1600/Screen+Shot+2017-06-12+at+9.10.19+AM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGZ6fK6K39zLUeFIBFqq2mzLYcyfg-8Sukhgb7naQVNZJf368IK267szlrAlVvu2zPgTPNrXn-9L4dTtv16NdZijhyphenhyphenqGmzbUEaUM5oBLbJ_aUoYd2Kbj2wfjPBFxK4k9vc2WFcmXpZ9mSy/s1600/Screen+Shot+2017-06-12+at+9.10.19+AM.png&quot; data-original-width=&quot;1422&quot; data-original-height=&quot;1246&quot; /&gt;&lt;/a&gt;&lt;/div&gt;


&lt;h2 style = &quot;margin-top: -50px;&quot;&gt;Editor&lt;/h2&gt;
While Julia&#39;s command-line version is loaded with good features, working on a huge project calls for a better front-end editor. Like &lt;a target = &quot;_blank&quot; href=&quot;https://www.rstudio.com/&quot;&gt;RStudio&lt;/a&gt; for &lt;a target = &quot;_blank&quot; href=&quot;https://cran.r-project.org/&quot;&gt;R&lt;/a&gt; and &lt;a target = &quot;_blank&quot; href=&quot;https://www.jetbrains.com/pycharm/&quot;&gt;PyCharm&lt;/a&gt; for &lt;a target = &quot;_blank&quot; href=&quot;https://www.python.org/&quot;&gt;Python&lt;/a&gt;, Julia can run in &lt;a target = &quot;_blank&quot; href=&quot;http://jupyter.org/&quot;&gt;Jupyter&lt;/a&gt; (also available for R and Python), the &lt;a target = &quot;_blank&quot; href=&quot;https://atom.io/&quot;&gt;Github Atom Editor&lt;/a&gt;, and &lt;a target = &quot;_blank&quot; href=&quot;https://code.visualstudio.com/&quot;&gt;Microsoft Visual Studio Code&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3Hzkzu7ir_vQaWYUBU8fSusSY_AVPLxqK6y71WzNlHMzYPlYww-fpNGmzqqETF4u67Mi7Eecc4i0VTrFHMBXwDOv80eL7KVkGVN7b1NdDXWAPX_8tF_6DKpn4o2DW6JRRxB04pnv8BhIo/s1600/Screen+Shot+2017-06-11+at+2.56.10+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3Hzkzu7ir_vQaWYUBU8fSusSY_AVPLxqK6y71WzNlHMzYPlYww-fpNGmzqqETF4u67Mi7Eecc4i0VTrFHMBXwDOv80eL7KVkGVN7b1NdDXWAPX_8tF_6DKpn4o2DW6JRRxB04pnv8BhIo/s1600/Screen+Shot+2017-06-11+at+2.56.10+PM.png&quot; data-original-width=&quot;1600&quot; data-original-height=&quot;969&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: center;&quot; trbidi=&quot;on&quot;&gt;Julia in Jupyter Notebook&lt;/div&gt;&lt;br/&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh_TebM2bablWafsQvmibEIy2r_zsSmhSbKq-_LqySfHWjrR_IdFmgWQ8oLCyF8uDIvQQa7bj-KyFt0K6Er9qLbxyHikhAeeX1EV0lSo5C-Pc1ktOl4icY3Oa2B2DTpIkiIxP6NW7_RNbj/s1600/Screen+Shot+2017-06-11+at+2.34.04+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh_TebM2bablWafsQvmibEIy2r_zsSmhSbKq-_LqySfHWjrR_IdFmgWQ8oLCyF8uDIvQQa7bj-KyFt0K6Er9qLbxyHikhAeeX1EV0lSo5C-Pc1ktOl4icY3Oa2B2DTpIkiIxP6NW7_RNbj/s1600/Screen+Shot+2017-06-11+at+2.34.04+PM.png&quot; data-original-width=&quot;1600&quot; data-original-height=&quot;1000&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: center;&quot; trbidi=&quot;on&quot;&gt;Julia in Github Atom Editor&lt;/div&gt;
&lt;br/&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3QoGw2daSeu77KrI16do_PUXUR1rdVmHM9YmOdtVkKV52pKilyKSo2f9-QVQMx9lbJ6-aOOgeMHMP4sBQjhXQtSbk-GrpPKM1vByePZZu3eI5AzrzPRJmsoKbCSY4MuHm8an7CMtOYt6f/s1600/Screen+Shot+2017-06-11+at+2.22.42+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3QoGw2daSeu77KrI16do_PUXUR1rdVmHM9YmOdtVkKV52pKilyKSo2f9-QVQMx9lbJ6-aOOgeMHMP4sBQjhXQtSbk-GrpPKM1vByePZZu3eI5AzrzPRJmsoKbCSY4MuHm8an7CMtOYt6f/s1600/Screen+Shot+2017-06-11+at+2.22.42+PM.png&quot; data-original-width=&quot;1600&quot; data-original-height=&quot;1000&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: center;&quot; trbidi=&quot;on&quot;&gt;Julia in Microsoft Visual Studio Code&lt;/div&gt;
&lt;br/&gt;
To install the Jupyter notebook, simply run the following code:
&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/986d299cee2b620162ae1b76c0ee734a.js&quot;&gt;&lt;/script&gt;
In the screenshot above, I tweaked the theme of the notebook using the script from &lt;a target = &quot;_blank&quot; href=&quot;https://github.com/dunovank/jupyter-themes&quot;&gt;this repository&lt;/a&gt;.

As mentioned, to set up Julia in the Github Atom Editor, I recommend downloading JuliaPro, or you can follow the instructions on the &lt;a target = &quot;_blank&quot; href=&quot;http://junolab.org/&quot;&gt;Juno Lab website&lt;/a&gt;. After installation, you can add Atom extensions like &lt;a target = &quot;_blank&quot; href=&quot;https://atom.io/packages/minimap&quot;&gt;Minimap&lt;/a&gt;, which is not installed by default; and in case you are interested, the syntax theme I used in the screenshot is &lt;a target = &quot;_blank&quot; href=&quot;https://atom.io/themes/gruvbox-plus-syntax&quot;&gt;Gruvbox Plus&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;

Further, to set up Julia in Microsoft Visual Studio Code, open the program, press &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;P&lt;/kbd&gt;, paste &lt;code&gt;ext install language-julia&lt;/code&gt;, and hit &lt;kbd&gt;Enter&lt;/kbd&gt;. This will install the Julia extension for Visual Studio Code. After installation, you can load the Julia REPL by pressing &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;Shift&lt;/kbd&gt;+&lt;kbd&gt;P&lt;/kbd&gt; (Windows) or &lt;kbd&gt;Cmd&lt;/kbd&gt;+&lt;kbd&gt;Shift&lt;/kbd&gt;+&lt;kbd&gt;P&lt;/kbd&gt; (Mac), typing &lt;code&gt;julia start repl&lt;/code&gt;, and pressing &lt;kbd&gt;Enter&lt;/kbd&gt;. If you get an error, you may need to point the extension to your Julia executable. To do this, go to &lt;b&gt;Preferences &gt; Settings&lt;/b&gt;. Then, in the .json user settings file, enter the following:&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f296a604c168f84dea3418577f118f71.js&quot;&gt;&lt;/script&gt;
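In case the gist above does not load, the setting boils down to a single JSON entry pointing the extension to the Julia binary. A sketch is below; the &lt;code&gt;julia.executablePath&lt;/code&gt; key is the one used by the julia-vscode extension, while the exact path shown is only illustrative:

```json
{
    "julia.executablePath": "C:/Users/MyName/AppData/Local/Julia-0.6.0-rc3/bin/julia.exe"
}
```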
Of course, you need to adjust the path by replacing &lt;code&gt;Julia-0.6.0-rc3&lt;/code&gt; (Windows) or &lt;code&gt;Julia-0.6.app&lt;/code&gt; (Mac) with your installed Julia version, and &lt;code&gt;C:/Users/MyName&lt;/code&gt; with the correct path on your machine. Further, I use the following setting in my .json file to adjust my Minimap, similar to the screenshot above.&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d2fb3f89ed03e5267a237437ece382e1.js&quot;&gt;&lt;/script&gt;
Lastly, to toggle the cursor&#39;s focus between the script pane and the integrated Julia terminal using &lt;kbd&gt;Ctrl&lt;/kbd&gt;+&lt;kbd&gt;`&lt;/kbd&gt;, I use the following &lt;a target = &quot;_blank&quot; href=&quot;https://stackoverflow.com/questions/42796887/switch-focus-between-editor-and-integrated-terminal-in-visual-studio-code&quot;&gt;Keybindings&lt;/a&gt; (go to &lt;b&gt;Preferences &gt; Keyboard Shortcuts &gt; keybindings.json&lt;/b&gt;).&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/76d61a566f03e96a1410c5cc0416d423.js&quot;&gt;&lt;/script&gt;
For more on this topic, visit the &lt;a target = &quot;_blank&quot; href=&quot;https://github.com/JuliaEditorSupport/julia-vscode&quot;&gt;official GitHub page&lt;/a&gt;. The three editors above each have advantages and disadvantages, but my primary editor is Visual Studio Code, because it is fast and loaded with features as well. Its major limitation is the lack of LaTeX-to-Unicode autocompletion, which is available in the Atom Editor. There are third-party packages like &lt;a target = &quot;_blank&quot; href=&quot;https://marketplace.visualstudio.com/items?itemName=oijaz.unicode-latex&quot;&gt;Unicode LaTeX&lt;/a&gt; that can do the job indirectly, or alternatively you can generate the Unicode characters using the console (the integrated Julia terminal in Visual Studio Code). I don&#39;t think this is a big deal, and maybe in the near future this capability will be added. The Atom Editor, on the other hand, has more features for Julia, for example the plot pane, the workspace viewer, and many more. The only problem is that it&#39;s kind of slow: when working with several datasets in your workspace, plus plots, plus very long lines of code, scrolling is not smooth. Nevertheless, let&#39;s be positive and hope that more improvements are coming to these editors.

Finally, for those who want to start using Julia, visit the &lt;a target = &quot;_blank&quot; href=&quot;https://docs.julialang.org/en/stable/&quot;&gt;Official Documentation&lt;/a&gt; and &lt;a target = &quot;_blank&quot;  href=&quot;https://julialang.org/learning/&quot;&gt;Learning Materials&lt;/a&gt;; ask questions on &lt;a target = &quot;_blank&quot;  href=&quot;https://discourse.julialang.org/&quot;&gt;Julia Discourse&lt;/a&gt; and join the &lt;a target = &quot;_blank&quot; href=&quot;https://gitter.im/JuliaLang/julia&quot;&gt;Julia Gitter&lt;/a&gt;.

&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/7061899207800133848/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2017/06/julia-installation-and-editors.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/7061899207800133848'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/7061899207800133848'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2017/06/julia-installation-and-editors.html' title='Julia: Installation and Editors'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyH49Xj1D6Ktyc1ZetszkF_0a9UrbyVRL7Y8Kw3OzjdQt-WbSgeo1KFRU2DBSbiT5l4WTx1Ec7V2wtiQjI40sKfBl5Sg3bPY_9-PgqQRMSTVdDJv5dB48JBLh-nToZEhwUSKcvdbT0YQJH/s72-c/Screen+Shot+2017-06-11+at+1.51.53+PM.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-6446820968926186166</id><published>2015-12-22T22:20:00.003+08:00</published><updated>2017-04-15T20:31:50.077+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="Nonlinear Models"/><category scheme="http://www.blogger.com/atom/ns#" term="Optimization Algorithm"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Statistical Learning"/><title type='text'>R and Python: Gradient 
Descent</title><content type='html'>&lt;div style=&quot;text-align: justify;&quot;&gt;
One of the problems often dealt with in Statistics is the minimization of an objective function. And contrary to linear models, there is no analytical solution for models that are nonlinear in the parameters, such as logistic regression, neural networks, and nonlinear regression models (like the Michaelis-Menten model). In such situations, we have to use mathematical programming or optimization. One popular optimization algorithm is &lt;i&gt;gradient descent&lt;/i&gt;, which we&#39;re going to illustrate here. To start with, let&#39;s consider a simple function with a closed-form solution, given by
\begin{equation}
f(\beta) \triangleq \beta^4 - 3\beta^3 + 2.
\end{equation}
We want to minimize this function with respect to $\beta$. The quick solution, as calculus taught us, is to compute the first derivative of the function, that is
\begin{equation}
\frac{\text{d}f(\beta)}{\text{d}\beta}=4\beta^3-9\beta^2.
\end{equation}
Setting this to 0 to obtain the stationary points gives us
\begin{align}
\frac{\text{d}f(\beta)}{\text{d}\beta}&amp;amp;\overset{\text{set}}{=}0\nonumber\\
4\hat{\beta}^3-9\hat{\beta}^2&amp;amp;=0\nonumber\\
\hat{\beta}^2(4\hat{\beta}-9)&amp;amp;=0\nonumber\\
\hat{\beta}&amp;amp;=0\quad\text{or}\quad\hat{\beta}=\frac{9}{4}.
\end{align}
Since the second derivative, $12\beta^2-18\beta$, vanishes at $\hat{\beta}=0$ (an inflection point) and is positive at $\hat{\beta}=\frac{9}{4}$, the minimum is attained at $\hat{\beta}=\frac{9}{4}$.
&lt;/div&gt;
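We can sanity-check this closed-form answer numerically; a quick sketch (the function names here are mine):

```python
# f and its first two derivatives, taken from the definitions above
def f(b):
    return b**4 - 3*b**3 + 2

def f_prime(b):
    return 4*b**3 - 9*b**2

def f_double_prime(b):
    return 12*b**2 - 18*b

beta_hat = 9 / 4
print(f_prime(beta_hat))         # 0.0: a stationary point
print(f_double_prime(beta_hat))  # 20.25 > 0: a local minimum
```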
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;The following plot shows the minimum of the function at $\hat{\beta}=\frac{9}{4}$ (red line in the plot below).&lt;br /&gt;
&lt;br /&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/g1.svg&quot; width=&quot;500&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
&lt;i&gt;R Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a097d6c4b1a0e8d693b7.js&quot;&gt;&lt;/script&gt;
Now let&#39;s consider solving this minimization problem using gradient descent with the following algorithm:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Initialize $\mathbf{x}_{r},r=0$&lt;/li&gt;
&lt;li&gt;&lt;b&gt;while&lt;/b&gt; $\lVert \mathbf{x}_{r}-\mathbf{x}_{r+1}\rVert &amp;gt; \nu$&lt;/li&gt;
&lt;li&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $\mathbf{x}_{r+1}\leftarrow \mathbf{x}_{r} - \gamma\nabla f(\mathbf{x}_r)$&lt;/li&gt;
&lt;li&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; $r\leftarrow r + 1$&lt;/li&gt;
&lt;li&gt;&lt;b&gt;end while&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;b&gt;return&lt;/b&gt; $\mathbf{x}_{r}$ and $r$&lt;/li&gt;
&lt;/ol&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
where $\nabla f(\mathbf{x}_r)$ is the &lt;i&gt;gradient&lt;/i&gt; of the cost function, $\gamma$ is the &lt;i&gt;learning-rate&lt;/i&gt; parameter of the algorithm, and $\nu$ is the &lt;i&gt;precision&lt;/i&gt; parameter. For the function above, let the initial guess be $\hat{\beta}_0=4$ and $\gamma=.001$ with $\nu=.00001$. Then $\nabla f(\hat{\beta}_0)=112$, so that 
\[\hat{\beta}_1=\hat{\beta}_0-.001(112)=3.888.\] 
And $|\hat{\beta}_1 - \hat{\beta}_0| = 0.112 &amp;gt; \nu$. Repeat the process until, at some $r$, $|\hat{\beta}_{r}-\hat{\beta}_{r+1}| \ngtr \nu$. It turns out that 350 iterations are needed to satisfy the desired inequality; the steps are plotted in the following figure, with estimated minimum $\hat{\beta}_{350}=2.250483\approx\frac{9}{4}$.&lt;/div&gt;
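The loop above is easy to sketch in code. Here is a minimal Python version of the update rule, with the same $\gamma$ and $\nu$ as in the text (function and variable names are mine):

```python
def gradient_descent(grad, x0, gamma=0.001, nu=0.00001, max_iter=100000):
    """Minimize a univariate function, given its gradient `grad`,
    by repeating x <- x - gamma * grad(x) until successive iterates
    differ by at most nu."""
    x_old, r = x0, 0
    while r < max_iter:
        x_new = x_old - gamma * grad(x_old)
        r += 1
        if abs(x_new - x_old) <= nu:
            break
        x_old = x_new
    return x_new, r

# gradient of f(beta) = beta**4 - 3*beta**3 + 2
grad_f = lambda b: 4*b**3 - 9*b**2

beta_hat, iters = gradient_descent(grad_f, x0=4.0)
print(beta_hat, iters)  # approaches 9/4 ~ 2.2505 after about 350 iterations
```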
&lt;br /&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/z2.svg&quot; width=&quot;500&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;R Script with Plot&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ad7c715acf07aa54ae0f.js&quot;&gt;&lt;/script&gt;
&lt;i&gt;Python Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ee5dacbd1ef59efd458b.js&quot;&gt;&lt;/script&gt;
Obviously the convergence is slow, and we can address this by tuning the learning-rate parameter. For example, if we increase it to $\gamma=.01$ (change &lt;code&gt;gamma&lt;/code&gt; to .01 in the codes above), the algorithm converges at the 42nd iteration. To support that claim, see the steps of the gradient in the plot below.&lt;/div&gt;
&lt;br /&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/z3.svg&quot; width=&quot;500&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
If we try to change the starting value from 4 to .1 (change &lt;code&gt;beta_new&lt;/code&gt; to .1) with $\gamma=.01$, the algorithm converges at the 173rd iteration with estimate $\hat{\beta}_{173}=2.249962\approx\frac{9}{4}$ (see the plot below).&lt;/div&gt;
&lt;br /&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/z4.svg&quot; width=&quot;500&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
Now let&#39;s consider another function known as &lt;a href=&quot;https://en.wikipedia.org/wiki/Rosenbrock_function&quot; target=&quot;_blank&quot;&gt;Rosenbrock&lt;/a&gt; defined as
\begin{equation}
f(\mathbf{w})\triangleq(1 - w_1) ^ 2 + 100 (w_2 - w_1^2)^2.
\end{equation}
The gradient is
\begin{align}
\nabla f(\mathbf{w})&amp;amp;=[-2(1 - w_1) - 400(w_2 - w_1^2) w_1]\mathbf{i}+200(w_2-w_1^2)\mathbf{j}\nonumber\\
&amp;amp;=\left[\begin{array}{c} 
-2(1 - w_1) - 400(w_2 - w_1^2) w_1\\
200(w_2-w_1^2)
\end{array}\right].
\end{align}
Let the initial guess be $\hat{\mathbf{w}}_0=\left[\begin{array}{c}-1.8\\-.8\end{array}\right]$, $\gamma=.0002$, and $\nu=.00001$. Then $\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -2914.4\\-808.0\end{array}\right]$. So that 
\begin{equation}\nonumber
\hat{\mathbf{w}}_1=\hat{\mathbf{w}}_0-\gamma\nabla f(\hat{\mathbf{w}}_0)=\left[\begin{array}{c} -1.21712
\\-0.63840\end{array}\right].
\end{equation}
And $\lVert\hat{\mathbf{w}}_0-\hat{\mathbf{w}}_1\rVert=0.6048666&amp;gt;\nu$. Repeat the process until, at some $r$, $\lVert\hat{\mathbf{w}}_r-\hat{\mathbf{w}}_{r+1}\rVert\ngtr \nu$. It turns out that 23,374 iterations are needed to satisfy the desired inequality, with estimate $\hat{\mathbf{w}}_{23375}=\left[\begin{array}{c} 0.9464841\\0.8956111\end{array}\right]$; the contour plot is depicted in the figure below.
&lt;/div&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/z5.svg&quot; width=&quot;550&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
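The same update rule carries over directly to the vector case. A compact NumPy sketch, using the gradient derived above (names are mine):

```python
import numpy as np

def rosenbrock_grad(w):
    # gradient of (1 - w1)^2 + 100 (w2 - w1^2)^2
    return np.array([
        -2 * (1 - w[0]) - 400 * (w[1] - w[0]**2) * w[0],
        200 * (w[1] - w[0]**2),
    ])

w_old = np.array([-1.8, -0.8])   # initial guess
gamma, nu = 0.0002, 0.00001      # learning rate and precision
r = 0
while r < 50000:                 # safety cap on the iteration count
    w_new = w_old - gamma * rosenbrock_grad(w_old)
    r += 1
    if np.linalg.norm(w_new - w_old) <= nu:
        break
    w_old = w_new

print(w_new, r)  # near [0.946, 0.896] after roughly 23,000 iterations
```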
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;R Script with Contour Plot&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/6b77eadaf6d34d809b4a.js&quot;&gt;&lt;/script&gt;
&lt;i&gt;Python Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f90a17ce13b7504037d4.js&quot;&gt;&lt;/script&gt;
Notice that I did not use ggplot for the contour plot; this is because the plot needs to be updated 23,374 times just to accommodate the arrows for the trajectory of the gradient vectors, and ggplot is simply too slow for that. Finally, we can also visualize the gradient points on the surface, as shown in the following figure.
&lt;/div&gt;
&lt;center&gt;
&lt;embed src=&quot;https://cdn.rawgit.com/alstat/SampleImages/master/z6.svg&quot; width=&quot;450&quot;&gt;&lt;/embed&gt;
&lt;/center&gt;
&lt;div style=&quot;text-align: justify;&quot;&gt;
&lt;i&gt;R Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/3ea6112e9aaf3945e4af.js&quot;&gt;&lt;/script&gt;
In a future blog post, I hope to apply this algorithm to statistical models like linear/nonlinear regression models for a simple illustration.&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/6446820968926186166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/12/r-and-python-gradient-descent.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6446820968926186166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6446820968926186166'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/12/r-and-python-gradient-descent.html' title='R and Python: Gradient Descent'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-8984078873601215128</id><published>2015-12-15T18:55:00.000+08:00</published><updated>2015-12-16T19:02:15.111+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Interactive Visualization"/><category scheme="http://www.blogger.com/atom/ns#" term="Linear Models"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>R and Python: Theory of Linear Least Squares</title><content type='html'>In my &lt;a href=&quot;http://alstatr.blogspot.com/2015/08/r-python-and-sas-getting-started-with.html&quot; target=&quot;_blank&quot;&gt;previous article&lt;/a&gt;, we talked about implementations of linear regression models in R, Python and SAS. On the theoretical sides, however, I briefly mentioned the estimation procedure for the parameter $\boldsymbol{\beta}$. 
So to help us understand how software performs the estimation procedure, we&#39;ll look at the mathematics behind it. We will also perform the estimation manually in R and in Python; that is, we&#39;re not going to use any special packages, which will help us appreciate the theory.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Linear Least Squares&lt;/h3&gt;
Consider the linear regression model,
\[
y_i=f_i(\mathbf{x}|\boldsymbol{\beta})+\varepsilon_i,\quad\mathbf{x}_i=\left[
\begin{array}{cccc}
1&amp;x_{i1}&amp;\cdots&amp;x_{ip}
\end{array}\right],\quad\boldsymbol{\beta}=\left[\begin{array}{c}\beta_0\\\beta_1\\\vdots\\\beta_p\end{array}\right],
\]
where $y_i$ is the &lt;i&gt;response&lt;/i&gt; or the &lt;i&gt;dependent&lt;/i&gt; variable at the $i$th case, $i=1,\cdots, N$. The $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the deterministic part of the model that depends on both the parameters $\boldsymbol{\beta}\in\mathbb{R}^{p+1}$ and the predictor variable $\mathbf{x}_i$, which in matrix form, say $\mathbf{X}$, is represented as follows
\[
\mathbf{X}=\left[
\begin{array}{cccccc}
1&amp;x_{11}&amp;\cdots&amp;x_{1p}\\
1&amp;x_{21}&amp;\cdots&amp;x_{2p}\\
\vdots&amp;\vdots&amp;\ddots&amp;\vdots\\
1&amp;x_{N1}&amp;\cdots&amp;x_{Np}\\
\end{array}
\right].
\]
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
$\varepsilon_i$ is the error term at the $i$th case, which we assume to be Gaussian distributed with mean 0 and variance $\sigma^2$. So that
\[
\mathbb{E}y_i=f_i(\mathbf{x}|\boldsymbol{\beta}),
\]
i.e. $f_i(\mathbf{x}|\boldsymbol{\beta})$ is the &lt;i&gt;expectation function&lt;/i&gt;. The uncertainty around the response variable is also modelled by a Gaussian distribution. Specifically, if $Y=f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon$, then for any realized value $y$,
\begin{align*}
\mathbb{P}[Y\leq y]&amp;=\mathbb{P}[f(\mathbf{x}|\boldsymbol{\beta})+\varepsilon\leq y]\\
&amp;=\mathbb{P}[\varepsilon\leq y-f(\mathbf{x}|\boldsymbol{\beta})]=\mathbb{P}\left[\frac{\varepsilon}{\sigma}\leq \frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\\
&amp;=\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right],
\end{align*}
where $\Phi$ denotes the standard Gaussian cumulative distribution function, with density denoted by $\phi$ below. Hence $Y\sim\mathcal{N}(f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2)$. That is,
\begin{align*}
\frac{\operatorname{d}}{\operatorname{d}y}\Phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]&amp;=\phi\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]\frac{1}{\sigma}=\mathbb{P}[y|f(\mathbf{x}|\boldsymbol{\beta}),\sigma^2]\\
&amp;=\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y-f(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}.
\end{align*}
If the data are independent and identically distributed, then the log-likelihood function of $y$ is,
\begin{align*}
\mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&amp;=\mathbb{P}[\mathbf{y}|\mathbf{X},\boldsymbol{\beta},\sigma]=\prod_{i=1}^N\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\
&amp;=\frac{1}{(2\pi)^{\frac{N}{2}}\sigma^N}\exp\left\{-\frac{1}{2}\sum_{i=1}^N\left[\frac{y_i-f_i(\mathbf{x}|\boldsymbol{\beta})}{\sigma}\right]^2\right\}\\
\log\mathcal{L}[\boldsymbol{\beta}|\mathbf{y},\mathbf{X},\sigma]&amp;=-\frac{N}{2}\log2\pi-N\log\sigma-\frac{1}{2\sigma^2}\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2.
\end{align*}
The likelihood function measures the plausibility of the parameter $\boldsymbol{\beta}$ in explaining the sample data, so we want the estimate of $\boldsymbol{\beta}$ that most likely generated the sample. Our goal is therefore to maximize the likelihood, which is equivalent to maximizing the log-likelihood with respect to $\boldsymbol{\beta}$, and that is done by taking the partial derivative with respect to $\boldsymbol{\beta}$. The first two terms on the right-hand side of the equation above can be disregarded, since they do not depend on $\boldsymbol{\beta}$. Also, the location of the maximum of the log-likelihood with respect to $\boldsymbol{\beta}$ is not affected by multiplication by an arbitrary positive scalar, so the factor $\frac{1}{2\sigma^2}$ can be omitted. We are left with the following expression,
\begin{equation}\label{eq:1}
-\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2.
\end{equation}
One last thing: instead of maximizing the log-likelihood function, we can equivalently minimize its negative. Hence we are interested in minimizing the negative of Equation (\ref{eq:1}), which is
\begin{equation}\label{eq:2}
\sum_{i=1}^N\left[y_i-f_i(\mathbf{x}|\boldsymbol{\beta})\right]^2,
\end{equation}
popularly known as the &lt;i&gt;residual sum of squares&lt;/i&gt; (RSS). So minimizing the RSS is a consequence of maximizing the log-likelihood under the Gaussian assumption on the uncertainty around the response variable $y$. For models with two parameters, say $\beta_0$ and $\beta_1$, the RSS can be visualized like the one in my &lt;a href=&quot;http://alstatr.blogspot.com/2015/08/r-python-and-sas-getting-started-with.html&quot; target=&quot;_blank&quot;&gt;previous article&lt;/a&gt;, that is
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/479/&quot; target=&quot;_blank&quot; title=&quot;Error Surface&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/479.png&quot; alt=&quot;Error Surface&quot; style=&quot;max-width: 100%;width: 500px;&quot;  width=&quot;500&quot; onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:479&quot;  src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
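To make the RSS/log-likelihood equivalence concrete, here is a small numerical sketch in Python with NumPy (the data and variable names here are my own illustration, not the gists used later in this post): for any two candidate values of $\boldsymbol{\beta}$, the one with the smaller RSS always has the larger Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta_true = np.array([3.5, 2.8])
sigma = np.sqrt(7.0)
y = X @ beta_true + rng.normal(scale=sigma, size=50)

def rss(beta):
    # residual sum of squares, Equation (2)
    r = y - X @ beta
    return r @ r

def loglik(beta):
    # Gaussian log-likelihood with known sigma; a decreasing function of the RSS
    n = len(y)
    return -n / 2 * np.log(2 * np.pi) - n * np.log(sigma) - rss(beta) / (2 * sigma**2)

b_good = np.array([3.4, 2.9])   # close to the truth
b_bad = np.array([0.0, 0.0])    # far from the truth
assert rss(b_bad) > rss(b_good)
assert loglik(b_good) > loglik(b_bad)
```

Since the log-likelihood equals a constant minus the scaled RSS, ranking candidates by RSS and by likelihood always agree.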

Performing differentiation with respect to the $(p+1)$-dimensional parameter $\boldsymbol{\beta}$ is manageable in the context of linear algebra, so Equation (\ref{eq:2}) is equivalent to
\begin{align*}
\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&amp;=\langle\mathbf{y}-\mathbf{X}\boldsymbol{\beta},\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rangle=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{y}+(\mathbf{X}\boldsymbol{\beta})^{\text{T}}\mathbf{X}\boldsymbol{\beta}\\
&amp;=\mathbf{y}^{\text{T}}\mathbf{y}-\mathbf{y}^{\text{T}}\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{y}+\boldsymbol{\beta}^{\text{T}}\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}.
\end{align*}
And the derivative with respect to the parameter is 
\begin{align*}
\frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert^2&amp;=-2\mathbf{X}^{\text{T}}\mathbf{y}+2\mathbf{X}^{\text{T}}\mathbf{X}\boldsymbol{\beta}
\end{align*}
Taking the critical point by setting the above equation to the zero vector and dividing through by 2, we have
\begin{align}
\frac{\operatorname{\partial}}{\operatorname{\partial}\boldsymbol{\beta}}\lVert\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\rVert^2&amp;\overset{\text{set}}{=}\mathbf{0}\nonumber\\
-\mathbf{X}^{\text{T}}\mathbf{y}+\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&amp;=\mathbf{0}\nonumber\\
\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&amp;=\mathbf{X}^{\text{T}}\mathbf{y}\label{eq:norm}
\end{align}
Equation (\ref{eq:norm}) is called the &lt;i&gt;normal equation&lt;/i&gt;. If $\mathbf{X}$ has full column rank, then $\mathbf{X}^{\text{T}}\mathbf{X}$ is invertible and we can solve for $\hat{\boldsymbol{\beta}}$,
\begin{align}
\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&amp;=\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\
(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{X}\hat{\boldsymbol{\beta}}&amp;=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}\nonumber\\
\hat{\boldsymbol{\beta}}&amp;=(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}.\label{eq:betahat}
\end{align}
That&#39;s it, since both $\mathbf{X}$ and $\mathbf{y}$ are known.&lt;br/&gt;&lt;br/&gt;
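As a quick numerical check of Equation (\ref{eq:betahat}), the following is a minimal NumPy sketch (the matrix and variable names are mine, not from the gists below): with a noise-free $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}$, the normal equation recovers $\boldsymbol{\beta}$ exactly, up to floating point. In practice one solves the normal equation directly rather than forming the inverse explicitly.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 7.0]])          # full column rank
beta = np.array([3.5, 2.8])
y = X @ beta                        # noise-free response

# Normal equation: (X'X) beta_hat = X'y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_hat, beta)
```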
&lt;h3&gt;Prediction&lt;/h3&gt;
Suppose $\mathbf{X}$ has full column rank and its columns span the subspace $V\subseteq\mathbb{R}^N$, so that $\mathbb{E}\mathbf{y}=\mathbf{X}\boldsymbol{\beta}\in V$. Then the predicted values of $\mathbf{y}$ are given by,
\begin{equation}\label{eq:pred}
\hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{P}_{V}\mathbf{y}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y},
\end{equation}
where $\mathbf{P}_{V}$ is the projection matrix onto the space $V$. For a proof of the projection matrix in Equation (\ref{eq:pred}), please refer to reference (1) below. Notice that this is equivalent to
\begin{equation}\label{eq:yhbh}
\hat{\mathbf{y}}=\mathbb{E}\mathbf{y}=\mathbf{X}\hat{\boldsymbol{\beta}}.
\end{equation}
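This equivalence can also be verified numerically. The NumPy sketch below (variable names are my own) checks that $\mathbf{P}_{V}=\mathbf{X}(\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}$ behaves like a projection, i.e. it is symmetric and idempotent, and that applying it to $\mathbf{y}$ gives the same predictions as $\mathbf{X}\hat{\boldsymbol{\beta}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))        # full column rank with probability one
y = rng.normal(size=20)

P = X @ np.linalg.inv(X.T @ X) @ X.T           # projection onto the column space of X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equation estimate

assert np.allclose(P, P.T)                     # symmetric
assert np.allclose(P @ P, P)                   # idempotent
assert np.allclose(P @ y, X @ beta_hat)        # projected y equals X beta_hat
```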
&lt;br/&gt;
&lt;h3&gt;Computation&lt;/h3&gt;
Let's fire up R and Python and see how we can apply the equations we derived. For purposes of illustration, we're going to simulate data from a Gaussian-distributed population. To do so, consider the following codes:
&lt;br/&gt;&lt;br/&gt;
&lt;i&gt;R Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9c838343aefc64609e8b.js&quot;&gt;&lt;/script&gt;
&lt;i&gt;Python Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/0679705fbfc6dc85ace1.js&quot;&gt;&lt;/script&gt;
Here we have two predictors &lt;code&gt;x1&lt;/code&gt; and &lt;code&gt;x2&lt;/code&gt;, and our response variable &lt;code&gt;y&lt;/code&gt; is generated with parameters $\beta_1=3.5$ and $\beta_2=2.8$ plus Gaussian noise with variance 7. Although we set the same random seed in both R and Python, we should not expect the generated values to be identical, since seeds do not correspond across languages; rather, the two samples are independent and identically distributed (iid). For visualization, I will use Python Plotly; you can also translate it to R Plotly.&lt;br/&gt;&lt;br/&gt;
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated/5/&quot; target=&quot;_blank&quot; title=&quot;&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated/5.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%;width: 600px;&quot;  width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated:5&quot;  src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/90650219b47f4412f789.js&quot;&gt;&lt;/script&gt;
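For readers who cannot load the gists, the simulation and estimation just described can be sketched in plain NumPy as follows (the seed and variable names here are mine, so the numbers will differ from the gists):

```python
import numpy as np

rng = np.random.default_rng(123)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2])

# y generated with beta1 = 3.5, beta2 = 2.8 and Gaussian noise of variance 7
y = 3.5 * x1 + 2.8 * x2 + rng.normal(scale=np.sqrt(7.0), size=n)

# estimate via the normal equation
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates near [3.5, 2.8]
```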
Now let's estimate the parameter $\boldsymbol{\beta}$, whose true values we set to $\beta_1=3.5$ and $\beta_2=2.8$. We will use Equation (\ref{eq:betahat}) for estimation, so that we have&lt;br/&gt;&lt;br/&gt;
&lt;i&gt;R Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c628ccdc642c8dbdc390.js&quot;&gt;&lt;/script&gt;
&lt;i&gt;Python Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/31cc00ba2af9fe88980c.js&quot;&gt;&lt;/script&gt;
That's a good estimate. Again, a reminder: the estimates in R and in Python differ because the random samples differ; the important thing is that both are iid. To proceed, we'll do prediction using Equation (\ref{eq:pred}). That is,
&lt;br/&gt;&lt;br/&gt;
&lt;i&gt;R Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/87212b699aea6049b5ab.js&quot;&gt;&lt;/script&gt;
&lt;i&gt;Python Script&lt;/i&gt;
&lt;script src=&quot;https://gist.github.com/alstat/35f8b83dcaacb98eb231.js&quot;&gt;&lt;/script&gt;
The first column above is the data &lt;code&gt;y&lt;/code&gt; and the second column is the prediction from Equation (\ref{eq:pred}). Thus, if we expand the predictions into an expectation plane, we have&lt;br/&gt;&lt;br/&gt;
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated/10/&quot; target=&quot;_blank&quot; title=&quot;&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated/10.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%;width: 600px;&quot;  width=&quot;600&quot; onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated:10&quot;  src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8907a7679a2a7ee8c13f.js&quot;&gt;&lt;/script&gt;
By the way, you have to rotate the plot to see the plane; I still can't figure out how to change the default view in Plotly. At this point we could proceed to compute other statistics, like the variance of the error, and so on, but I will leave that for you to explore. Our aim here is simply to understand what happens inside the internals of our software when we estimate the parameters of linear regression models.
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;ol&gt;&lt;li&gt;
Arnold, Steven F. (1981). &lt;i&gt;The Theory of Linear Models and Multivariate Analysis&lt;/i&gt;. Wiley.
&lt;/li&gt;
&lt;li&gt;
&lt;a href=&quot;http://web.stanford.edu/~mrosenfe/soc_meth_proj3/matrix_OLS_NYU_notes.pdf&quot; target = &quot;_blank&quot;&gt;OLS in Matrix Form&lt;/a&gt;
&lt;/li&gt;&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/8984078873601215128/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/12/r-and-python-theory-of-linear-least.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8984078873601215128'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8984078873601215128'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/12/r-and-python-theory-of-linear-least.html' title='R and Python: Theory of Linear Least Squares'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-5640154164563165613</id><published>2015-08-17T10:38:00.001+08:00</published><updated>2015-08-24T17:29:00.409+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Data Mining"/><category scheme="http://www.blogger.com/atom/ns#" term="Interactive Visualization"/><category scheme="http://www.blogger.com/atom/ns#" term="Linear Models"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="SAS"/><title type='text'>R, Python, and SAS: Getting Started with Linear Regression</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Consider the linear regression model,
$$
y_i=f_i(\boldsymbol{x}|\boldsymbol{\beta})+\varepsilon_i,
$$
where $y_i$ is the &lt;i&gt;response&lt;/i&gt; or the &lt;i&gt;dependent&lt;/i&gt; variable at the $i$th case, $i=1,\cdots, N$ and the &lt;i&gt;predictor&lt;/i&gt; or the &lt;i&gt;independent&lt;/i&gt; variable is the $\boldsymbol{x}$ term defined in the mean function $f_i(\boldsymbol{x}|\boldsymbol{\beta})$. For simplicity, consider the following simple linear regression (SLR) model,
$$
y_i=\beta_0+\beta_1x_i+\varepsilon_i.
$$
To obtain the best estimates of $\beta_0$ and $\beta_1$, we minimize the residual sum of squares (RSS) given by,
$$
S=\sum_{i=1}^{n}\varepsilon_i^2=\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2.
$$
Now suppose we want to fit the model to the following data, &lt;b&gt;Average Heights and Weights for American Women&lt;/b&gt;, where &lt;i&gt;weight&lt;/i&gt; is the response and &lt;i&gt;height&lt;/i&gt; is the predictor. The data is available in R by default.
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a05a87bfe2140c230cfe.js&quot;&gt;&lt;/script&gt;
The following is the plot of the residual sum of squares based on the SLR model over $\beta_0$ and $\beta_1$; note that we standardized the variables before plotting,
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/479/&quot; target=&quot;_blank&quot; title=&quot;Error Surface&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/479.png&quot; alt=&quot;Error Surface&quot; style=&quot;max-width: 100%;width: 500px;&quot;  width=&quot;500&quot; onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:479&quot;  src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
If you are interested in the code for the above figure, please click &lt;a href=&quot;https://gist.github.com/alstat/9833b04843449747ba1b&quot; target = &quot;_blank&quot;&gt;here&lt;/a&gt;.
To minimize this elliptic paraboloid, we differentiate with respect to the parameters, set the derivatives to zero to obtain the stationary point, and finally solve for $\beta_0$ and $\beta_1$. For more on the derivation of the parameter estimates, see reference 1.&lt;br/&gt;&lt;br/&gt;
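The resulting closed-form solutions are $\hat{\beta}_1=S_{xy}/S_{xx}$ and $\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$, where $S_{xx}$ and $S_{xy}$ are the sums of squared deviations and cross-deviations. We can check them directly on the women data with a short Python sketch (the dataset's values are typed in by hand here):

```python
# Average Heights and Weights for American Women (R's built-in dataset)
height = list(range(58, 73))
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139,
          142, 146, 150, 154, 159, 164]

n = len(height)
xbar = sum(height) / n
ybar = sum(weight) / n
sxx = sum((x - xbar) ** 2 for x in height)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(height, weight))

b1 = sxy / sxx           # slope
b0 = ybar - b1 * xbar    # intercept
print(b0, b1)            # matches the lm output discussed below
```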
&lt;h3&gt;
Simple Linear Regression in R&lt;/h3&gt;
In R, we can fit the model using the &lt;code&gt;lm&lt;/code&gt; function, which stands for linear models, i.e.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/30459979336fc95974da.js&quot;&gt;&lt;/script&gt;
The formula, written above as &lt;code&gt;{response ~ predictor}&lt;/code&gt;, is a handy way of specifying the model to fit in R. Mathematically, our model is
$$
weight = \beta_0 + \beta_1 (height) + \varepsilon.
$$
Its summary is obtained by running &lt;code&gt;model %&gt;% summary&lt;/code&gt; or, for non-magrittr users, &lt;code&gt;summary(model)&lt;/code&gt;, given the &lt;code&gt;model&lt;/code&gt; object defined in the previous code,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/bb4ca0b911c335591c60.js&quot;&gt;&lt;/script&gt;
The Coefficients section above returns the estimated coefficients of the model: $\hat{\beta}_0 = -87.51667$ and $\hat{\beta}_1=3.45000$ (it should be clear that we used the unstandardized variables to obtain these estimates). Both estimates are significant, with p-values below the .05 and even the .01 level of the test. Using the estimated coefficients along with the residual standard error, we can now construct the fitted line and its confidence interval as shown below.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDImaXQJOaEM8X54ZbNDVUq4xGXADZHF4K44UYvODiDxSmDNxRrDlPVNLxw761xjK1snwI8xbQ0j-uwiZniOWPvRY0BugWCyrv6rapgzuMFpmIsMQToFB620Ck_0h9kF94XwxKke3gkAnj/s1600/Rplot02.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Fig 1. Plot of the Data and the Predicted Values in R.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f40b85ca7bb0799db221.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
Simple Linear Regression in Python&lt;/h3&gt;
In Python, two modules have implementations of linear regression modelling: one is in &lt;a href=&quot;http://scikit-learn.org/stable/index.html&quot; target=&#39;_blank&#39;&gt;scikit-learn&lt;/a&gt; (&lt;code&gt;sklearn&lt;/code&gt;) and the other is in &lt;a href=&quot;http://statsmodels.sourceforge.net/&quot; target = &#39;_blank&#39;&gt;Statsmodels&lt;/a&gt; (&lt;code&gt;statsmodels&lt;/code&gt;). For example, we can model the above data using &lt;code&gt;sklearn&lt;/code&gt; as follows:
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e23b7fd39ae3772852a4.js&quot;&gt;&lt;/script&gt;
The above output is the estimate of the parameters. To obtain the predicted values and plot these along with the data points as we did in R, we can wrap the functions above into a class, say &lt;code&gt;linear_regression&lt;/code&gt;, that requires the &lt;a href=&quot;http://stanford.edu/~mwaskom/software/seaborn/&quot;&gt;Seaborn&lt;/a&gt; package for neat plotting; see the codes below:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ea8747cd0cfe81a07e56.js&quot;&gt;&lt;/script&gt;
Using this class and its methods, fitting the model to the data is coded as follows:
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/1146870dd5ec8c1a5256.js&quot;&gt;&lt;/script&gt;
The predicted values of the data points are obtained using the &lt;code&gt;predict&lt;/code&gt; method,
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/684f3a9869aa1b377603.js&quot;&gt;&lt;/script&gt;
Figure 2 below shows the plot of the predicted values along with their confidence interval and the data points.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimIGa4H-5AZMJjusg43s_F1eh7wMZSEhVFsUm73g3pnJ53hriu0AKADvgVvMkunZq5qgRuewyGIZNNmzhw4ue_C8j6I64SZsBLhA8KsUZx95KwJ-_b9gNF6-uHA38P_HirC4XHnw0jEiJU/s1600/plot1.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Fig 2. Plot of the Data and the Predicted Values in Python.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e5e68803df06fff20852.js&quot;&gt;&lt;/script&gt;
If one is only interested in the estimates of the model, then &lt;code&gt;LinearRegression&lt;/code&gt; of scikit-learn is sufficient; but if other statistics, such as those returned in the R model summary, are of interest, the said module can still do the job, though one might need to program the additional routines. &lt;code&gt;statsmodels&lt;/code&gt;, on the other hand, returns a complete summary of the fitted model, comparable to the R output above, which is useful for studies with particular interest in this information. Modelling the data using simple linear regression is then done as follows:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ee2bc1a96c584aaeb3ee.js&quot;&gt;&lt;/script&gt;
Clearly, we can save time with statsmodels, especially in diagnostic checking involving test statistics such as the &lt;a href=&quot;https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic&quot;&gt;Durbin-Watson&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test&quot;&gt;Jarque-Bera&lt;/a&gt; tests. We can of course add some diagnostic plotting, but I prefer to discuss that in a separate entry.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Simple Linear Regression in SAS&lt;/h3&gt;
Now let's consider running the data in SAS. I am using SAS Studio, and in order to import the data, I first saved it as a CSV file with columns height and weight, then uploaded it to SAS Studio. The following codes import the data.&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/69f1ba19db43e16aff9f.js&quot;&gt;&lt;/script&gt;
Next we fit the model to the data using the &lt;code&gt;REG&lt;/code&gt; procedure,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f4ff898dc4c275d570e2.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Number of Observations Read&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Number of Observations Used&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;Analysis of Variance&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;b header&quot; scope=&quot;col&quot;&gt;Source&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;DF&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Sum of&lt;br&gt;Squares&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Mean&lt;br&gt;Square&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;F Value&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Pr&amp;nbsp;&amp;gt;&amp;nbsp;F&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Model&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3332.70000&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3332.70000&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1433.02&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Error&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;13&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;30.23333&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2.32564&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Corrected Total&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;14&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3362.93333&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Root MSE&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1.52501&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;R-Square&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.9910&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Dependent Mean&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;136.73333&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Adj R-Sq&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.9903&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Coeff Var&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1.11531&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;Parameter Estimates&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;b header&quot; scope=&quot;col&quot;&gt;Variable&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;DF&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Parameter&lt;br&gt;Estimate&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Standard&lt;br&gt;Error&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;t&amp;nbsp;Value&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Pr&amp;nbsp;&amp;gt;&amp;nbsp;|t|&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Intercept&lt;/th&gt;
&lt;th class=&quot;r data&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-87.51667&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5.93694&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-14.74&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;height&lt;/th&gt;
&lt;th class=&quot;r data&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;3.45000&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.09114&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;37.86&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7nJ6aFRjJxNzKG3dIAK8Xr3igJ2soY9-BHlDbIRPKuz6Ps07U-KbY-IEmd7tYEHgU6CAtp99h8SwfXkLurGsNdyz6RSiQRQ090mxtA_9E-5DDNIV5DtLPf5dsQ2Sj0ODZkFDd78giQP6d/s1600/p1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7nJ6aFRjJxNzKG3dIAK8Xr3igJ2soY9-BHlDbIRPKuz6Ps07U-KbY-IEmd7tYEHgU6CAtp99h8SwfXkLurGsNdyz6RSiQRQ090mxtA_9E-5DDNIV5DtLPf5dsQ2Sj0ODZkFDd78giQP6d/s400/p1.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvHA3voiF4-9sLjy6fO4Zh0syrc0A34w0S94dZLNLxNJaYnQ_frNnQgqVqCq_DSjcgrtICRwMCCI27bk1Vc50cgjBUo7wkkPLuBgjAiCOoe_3QDDUylLSeDIha0_edMwpXEIX3txwqRRPI/s1600/p2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvHA3voiF4-9sLjy6fO4Zh0syrc0A34w0S94dZLNLxNJaYnQ_frNnQgqVqCq_DSjcgrtICRwMCCI27bk1Vc50cgjBUo7wkkPLuBgjAiCOoe_3QDDUylLSeDIha0_edMwpXEIX3txwqRRPI/s400/p2.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf7ICwXKHJtMls3RpLlGUbl9lIvbWrTwTOfNAIYP8k4ymWxoVZwq-J4GdPbs0cPcKqJAI-LO2R2fe-SDUkW4B07Ii8sMcRFbQ89DCSGcL9YszMFxSitxXANDWgoklZuEnjC3D0d9gPzXJn/s1600/p3.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgf7ICwXKHJtMls3RpLlGUbl9lIvbWrTwTOfNAIYP8k4ymWxoVZwq-J4GdPbs0cPcKqJAI-LO2R2fe-SDUkW4B07Ii8sMcRFbQ89DCSGcL9YszMFxSitxXANDWgoklZuEnjC3D0d9gPzXJn/s400/p3.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/center&gt;
Now that's a lot of output, probably the complete one. But like I said, I am not going to discuss each of these values and plots, as some of them are used for diagnostic checking (you can read more on that in reference 1 and in other applied linear regression books). For now, let's just confirm the coefficients obtained: both estimates are the same as those in R and Python.
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Multiple Linear Regression (MLR)&lt;/h3&gt;
To extend SLR to MLR, we'll demonstrate by simulation. Using the formula-based &lt;code&gt;lm&lt;/code&gt; function of R, and assuming we have $x_1$ and $x_2$ as our predictors, the following is how we do MLR in R:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d0ba72899808667ec59f.js&quot;&gt;&lt;/script&gt; 
Although we did not use an intercept in simulating the data, the obtained estimates for $\beta_1$ and $\beta_2$ are close to the true parameters (.35 and .56). The intercept, however, helps capture the noise term we added in the simulation. &lt;br/&gt;&lt;br/&gt;
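The same MLR estimation can also be sketched with plain NumPy least squares (my own seed, variable names, and noise scale; only the true slopes .35 and .56 come from the setup above):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# true slopes .35 and .56, no intercept, plus a small Gaussian noise term
y = 0.35 * x1 + 0.56 * x2 + rng.normal(scale=0.1, size=n)

# design matrix with an intercept column, as lm includes by default
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # intercept near 0, slopes near 0.35 and 0.56
```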
Next we&#39;ll try MLR in Python using statsmodels, consider the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/0bdfe480e4e2aa3feeb4.js&quot;&gt;&lt;/script&gt;
It should be noted that the estimates in R and in Python need not be the same, since these are simulated values from different software. Finally, we can perform MLR in SAS as follows:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/6244ad3429c96c53fadc.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Number of Observations Read&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Number of Observations Used&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;Analysis of Variance&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;b header&quot; scope=&quot;col&quot;&gt;Source&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;DF&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Sum of&lt;br&gt;Squares&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Mean&lt;br&gt;Square&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;F Value&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Pr&amp;nbsp;&amp;gt;&amp;nbsp;F&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Model&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;610.86535&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;305.43268&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;303.88&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Error&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;97&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;97.49521&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1.00511&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Corrected Total&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;99&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;708.36056&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Root MSE&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1.00255&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;R-Square&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.8624&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Dependent Mean&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;244.07327&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Adj R-Sq&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.8595&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Coeff Var&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.41076&lt;/td&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;&amp;nbsp;&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;nbsp;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;Parameter Estimates&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;b header&quot; scope=&quot;col&quot;&gt;Variable&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;DF&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Parameter&lt;br&gt;Estimate&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Standard&lt;br&gt;Error&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;t&amp;nbsp;Value&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Pr&amp;nbsp;&amp;gt;&amp;nbsp;|t|&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;Intercept&lt;/th&gt;
&lt;th class=&quot;r data&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;18.01299&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;11.10116&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1.62&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.1079&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;X1&lt;/th&gt;
&lt;th class=&quot;r data&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.31770&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.01818&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;17.47&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;rowheader&quot; scope=&quot;row&quot;&gt;X2&lt;/th&gt;
&lt;th class=&quot;r data&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;0.58276&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.03358&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;17.35&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&amp;lt;.0001&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWfDi3dta-1ujEZ6oZA490kUp7f261UCJ9Wf57Ycshq4FvU8qQMf2Y-C3cgMQFFsGvrMD4MRE09ZM8r-hjeARBqs4OwVyKm4ZLQlYjDaBjPNImfJSfqfn1SXVOL6xjha6ap2rHXUYtCVFd/s1600/v1.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWfDi3dta-1ujEZ6oZA490kUp7f261UCJ9Wf57Ycshq4FvU8qQMf2Y-C3cgMQFFsGvrMD4MRE09ZM8r-hjeARBqs4OwVyKm4ZLQlYjDaBjPNImfJSfqfn1SXVOL6xjha6ap2rHXUYtCVFd/s400/v1.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY3aA7XHfmbUV06ItoKLoQbNzMXc9cCixImST5HsmvhPpaKquJ6M13BpZ8JU5TRk80awEYUB35cbDbS11CvhSNp9-e6kSKzB9ywP0Pe6lIwTjZls4GjhSNZH2sgvAB4gZMBk-3qFQv2anH/s1600/v2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY3aA7XHfmbUV06ItoKLoQbNzMXc9cCixImST5HsmvhPpaKquJ6M13BpZ8JU5TRk80awEYUB35cbDbS11CvhSNp9-e6kSKzB9ywP0Pe6lIwTjZls4GjhSNZH2sgvAB4gZMBk-3qFQv2anH/s400/v2.png&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/center&gt;
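For readers without a SAS license, the same kind of summary can be approximated in a few lines of Python. The sketch below is purely illustrative: the data are re-simulated with hypothetical coefficients (18, 0.32, 0.58, chosen to echo the parameter estimates above), so its numbers will not reproduce the SAS table exactly.

```python
import numpy as np

# Hand-rolled OLS via least squares, mirroring the quantities PROC REG reports.
# The data here are hypothetical stand-ins, not the post's simulated dataset.
rng = np.random.default_rng(123)
n = 100
X1 = rng.normal(500.0, 30.0, n)
X2 = rng.normal(120.0, 15.0, n)
y = 18.0 + 0.32 * X1 + 0.58 * X2 + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), X1, X2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # parameter estimates
resid = y - X @ beta
ss_err = resid @ resid                          # Error sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()            # Corrected total SS
r2 = 1 - ss_err / ss_tot                        # R-Square
root_mse = np.sqrt(ss_err / (n - 3))            # Root MSE (error DF = 97)
print(beta, r2, root_mse)
```

Standard errors, t values, and the ANOVA F test can be built from the same pieces, which is essentially what statsmodels in Python and summary(lm(...)) in R automate.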
&lt;h3&gt;Conclusion&lt;/h3&gt;
In conclusion, SAS saves a lot of work, since it returns a complete summary of the model by default; no doubt this, along with its active customer support, is why companies prefer it. R and Python, on the other hand, being open-source, can compete well with SAS, although reproducing all of the SAS output requires programming skill. I think that&#39;s the exciting part, though: it makes you think and manage your time, and achieving the same results in R and Python is fulfilling. Hope you&#39;ve learned something; feel free to share your thoughts in the comments below.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley &amp; Sons, Inc. United States of America.
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://scikit-learn.org/stable/documentation.html&quot; target = &quot;_blank&quot;&gt;Scikit-learn Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://statsmodels.sourceforge.net/documentation.html&quot; target =&quot;_blank&quot;&gt;Statsmodels Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://support.sas.com/documentation/&quot; target = &quot;_blank&quot;&gt;SAS Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.amazon.com/The-Little-SAS-Book-Edition-ebook/dp/B00B29H9HU/ref=pd_sim_kstore_3?ie=UTF8&amp;refRID=1831EFXWNWTBVA83EMDY&quot; target = &quot;_blank&quot;&gt;Delwiche, Lora D., and Susan J. Slaughter. 2012. The Little SAS® Book: A Primer, Fifth Edition. Cary, NC: SAS Institute Inc.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter1/sasreg1.htm&quot; target = &quot;_blank&quot;&gt;Regression with SAS. Institute for Digital Research and Education. UCLA. Retrieved August 13, 2015.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://plot.ly/python/getting-started/&quot;&gt;Python Plotly Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;style&gt;
.header {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    padding: 2px 5px 2px 5px;
}


.rowheader {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    text-align: center;
    padding: 2px 5px 2px 5px;
}


.data, .dataemphasis {
    background-color: #FFF;
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    text-align: right;
    padding: 2px 5px 2px 5px;
}

.table {
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 1px 1px 1px 1px;
    border-collapse: collapse;
    border-spacing: 0px;
    padding: 5px 5px 5px 5px;
    margin-bottom: 1em;
}

.body {
    color: #000;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    line-height: 1.231;
}
&lt;/style&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/5640154164563165613/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/08/r-python-and-sas-getting-started-with.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/5640154164563165613'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/5640154164563165613'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/08/r-python-and-sas-getting-started-with.html' title='R, Python, and SAS: Getting Started with Linear Regression'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDImaXQJOaEM8X54ZbNDVUq4xGXADZHF4K44UYvODiDxSmDNxRrDlPVNLxw761xjK1snwI8xbQ0j-uwiZniOWPvRY0BugWCyrv6rapgzuMFpmIsMQToFB620Ck_0h9kF94XwxKke3gkAnj/s72-c/Rplot02.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-1597733395520390528</id><published>2015-07-21T07:05:00.000+08:00</published><updated>2015-07-21T07:05:27.000+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Parametric Inference: Karlin-Rubin Theorem</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div class=&quot;definition&quot;&gt;
A family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ for a univariate random variable $T$ with real-valued parameter $\theta$ has a &lt;i&gt;monotone likelihood ratio&lt;/i&gt; (MLR) if, for every $\theta_2&amp;gt;\theta_1$, $g(t|\theta_2)/g(t|\theta_1)$ is a monotone (nonincreasing or nondecreasing) function of $t$ on $\{t:g(t|\theta_1)&amp;gt;0\;\text{or}\;g(t|\theta_2)&amp;gt;0\}$. Note that $c/0$ is defined as $\infty$ if $0&amp;lt; c$.
&lt;/div&gt;
&lt;div class=&quot;theorem&quot; style=&quot;content: &#39;Karlin-Rubin Theorem.&#39;;&quot;&gt;
Consider testing $H_0:\theta\leq \theta_0$ versus $H_1:\theta&amp;gt;\theta_0$. Suppose that $T$ is a sufficient statistic for $\theta$ and the family of pdfs or pmfs $\{g(t|\theta):\theta\in\Theta\}$ of $T$ has an MLR. Then for any $t_0$, the test that rejects $H_0$ if and only if $T &amp;gt;t_0$ is a UMP level $\alpha$ test, where $\alpha=P_{\theta_0}(T &amp;gt;t_0)$.
&lt;/div&gt;
&lt;b&gt;Example 1&lt;/b&gt;&lt;br/&gt;
To better understand the theorem, consider a single observation, $X$, from $\mathrm{n}(\theta,1)$, and test the following hypotheses:
$$
H_0:\theta\leq \theta_0\quad\mathrm{versus}\quad H_1:\theta&amp;gt;\theta_0.
$$
Take $\theta_1&amp;gt;\theta_0$; then the likelihood ratio test statistic is
$$
\lambda(x)=\frac{f(x|\theta_1)}{f(x|\theta_0)}.
$$
The null hypothesis is rejected if $\lambda(x)&amp;gt;k$. To check whether the distribution of the sample has the MLR property, we simplify the ratio as follows:
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
$$
\begin{aligned}
\lambda(x)&amp;amp;=\frac{\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-\theta_1)^2}{2}\right]}{\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-\theta_0)^2}{2}\right]}\\
&amp;amp;=\exp
\left[-\frac{x^2-2x\theta_1+\theta_1^2}{2}+\frac{x^2-2x\theta_0+\theta_0^2}{2}\right]\\
&amp;amp;=\exp\left[\frac{2x\theta_1-\theta_1^2-2x\theta_0+\theta_0^2}{2}\right]\\
&amp;amp;=\exp\left[\frac{2x(\theta_1-\theta_0)-(\theta_1^2-\theta_0^2)}{2}\right]\\
&amp;amp;=\exp\left[x(\theta_1-\theta_0)\right]\times\exp\left[-\frac{\theta_1^2-\theta_0^2}{2}\right]
\end{aligned}
$$
which is increasing as a function of $x$, since $\theta_1&amp;gt;\theta_0$.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPqtDpMdd70cQJjLJU-1dYl2EYm31o0YhF8_8IsbQcMZIZ60jzJ3oex5ou6YTbvNZdwN4H6V4IuP-RYi_6pWhYVPrI4JMod_egy2ZfgxSk_zjZgBGjCXUz1PPgeSW0jJ1I5VOjeDiBbaZh/s1600/m1.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 1. Normal Densities with $\mu=1,2$.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d49f8ccc7482bc2a32df.js&quot;&gt;&lt;/script&gt;
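The monotonicity derived above can also be checked numerically. A quick sketch (Python here, standing in for the post's R gist) using the same hypothetical values as Figure 1, $\theta_0=1$ and $\theta_1=2$:

```python
import math

def lam(x, t0=1.0, t1=2.0):
    """Likelihood ratio f(x given theta1) / f(x given theta0) for N(theta, 1)."""
    def f(x, t):
        return math.exp(-(x - t) ** 2 / 2) / math.sqrt(2 * math.pi)
    return f(x, t1) / f(x, t0)

xs = [i / 10 for i in range(-30, 31)]
ratios = [lam(x) for x in xs]

# The ratio is strictly increasing in x, matching the closed form
# exp[x(t1 - t0)] * exp[-(t1^2 - t0^2)/2] derived above.
print(all(b > a for a, b in zip(ratios, ratios[1:])))   # True
```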
As an illustration, consider Figure 1. The plot of the likelihood ratio of these models is monotone increasing, as seen in Figure 2, where rejecting $H_0$ if $\lambda(x)&amp;gt;k$ is equivalent to rejecting it if $T\geq t_0$.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCONTh9cNqxbETGNiZWbAzhLUh4nRRoMI4JW_n4cTrtkrBK-xcmI1rzqpWWt4OIUD4ikweapnlFfpRRdALTm2RlGNWz1KMNbtFhpyIY5AahtW4nmGfKdJa1SSEw1qUVWkHaguwqlzaV86U/s1600/m2.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 2. Likelihood Ratio of the Normal Densities.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/bbf49740f68f40719c08.js&quot;&gt;&lt;/script&gt;
By the factorization theorem, the likelihood ratio test statistic can be written as a function of the sufficient statistic, since the term $h(x)$ cancels. That is,
$$
\lambda(t)=\frac{g(t|\theta_1)}{g(t|\theta_0)}.
$$
By the Karlin-Rubin theorem, the test with rejection region $R=\{t:t&amp;gt;t_0\}$ is a uniformly most powerful level-$\alpha$ test, where $t_0$ satisfies the following:
$$
\begin{aligned}
\mathrm{P}(T&amp;gt;t_0|\theta_0)&amp;amp;=\mathrm{P}(T\in R|\theta_0)\\
\alpha&amp;amp;=1-\mathrm{P}(X\leq t_0|\theta_0)\\
1-\alpha&amp;amp;=\int_{-\infty}^{t_0}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(x-\theta_0)^2}{2}\right]\operatorname{d}x
\end{aligned}
$$
Hence $t_0=\theta_0+z_{\alpha}$, where $z_{\alpha}$ is the $(1-\alpha)$ quantile of the standard normal, and thus we reject $H_0$ if $T&amp;gt;\theta_0+z_{\alpha}$.&lt;br/&gt;&lt;br/&gt;
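As a small numerical check (assuming $\theta_0=0$ and $\alpha=.05$), the cutoff $t_0$ is just a standard normal quantile, available from Python's standard library:

```python
from statistics import NormalDist

alpha = 0.05
theta0 = 0.0
z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-tail quantile, about 1.645
t0 = theta0 + z_alpha                       # reject H0 when T exceeds t0
print(round(t0, 3))                         # 1.645
```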
&lt;b&gt;Example 2&lt;/b&gt;&lt;br/&gt;
Now consider testing the hypotheses $H_0:\theta\geq \theta_0$ versus $H_1:\theta&amp;lt; \theta_0$ using a single observation $X$ from Beta($\theta$, 2); to be specific, let $\theta_0=4$ and $\theta_1=3$. Can we apply Karlin-Rubin?
Of course! Visually, we have something like Figure 3. 
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI3mKWraNNNHa2TvrfE7sDorZ3gFTfYa8q-2_1E1xIfVumHKi58E5KbVl8p_wClOBfqnXs6D75pvB_JO4175-9Qn8GP4FIN25tXgnLujRIoIPEthXkrkV5B1I0IdKScXp9lybEj_YR0_Rm/s1600/m3.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 3. Beta Densities Under Different Parameters.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/bf0d0bae4ad61c0b1985.js&quot;&gt;&lt;/script&gt;
Note that for this test, $\theta_1&amp;lt;\theta_0$, and so the likelihood ratio test statistic simplifies as follows:
$$
\begin{aligned}
\lambda(x)&amp;amp;=\frac{f(x|\theta_1=3, 2)}{f(x|\theta_0=4, 2)}=\frac{\displaystyle\frac{\Gamma(\theta_1+2)}{\Gamma(\theta_1)\Gamma(2)}x^{\theta_1-1}(1-x)^{2-1}}{\displaystyle\frac{\Gamma(\theta_0+2)}{\Gamma(\theta_0)\Gamma(2)}x^{\theta_0-1}(1-x)^{2-1}}\\
&amp;amp;=\frac{\displaystyle\frac{\Gamma(5)}{\Gamma(3)\Gamma(2)}x^{2}(1-x)}{\displaystyle\frac{\Gamma(6)}{\Gamma(4)\Gamma(2)}x^{3}(1-x)}=\frac{\displaystyle\frac{12\Gamma(3)}{\Gamma(3)\Gamma(2)}x^{2}(1-x)}{\displaystyle\frac{20\Gamma(4)}{\Gamma(4)\Gamma(2)}x^{3}(1-x)}\\
&amp;amp;=\frac{3}{5x},
\end{aligned}
$$
which is decreasing as a function of $x$; see the plot in Figure 4. Thus $H_0$ is rejected if $\lambda(x) &gt; k$, equivalently if $T &lt; t_0$, where $t_0$ satisfies the following:
$$
\begin{aligned}
\mathrm{P}(T &lt; t_0|\theta_0)&amp;=\mathrm{P}(X &lt; t_0|\theta_0)\\
\alpha&amp;=\int_{0}^{t_0}\frac{\Gamma(\theta_0+2)}{\Gamma(\theta_0)\Gamma(2)}x^{\theta_0-1}(1-x)^{2-1}\operatorname{d}x\\
\alpha&amp;=\int_{0}^{t_0}\frac{\Gamma(6)}{\Gamma(4)\Gamma(2)}x^{3}(1-x)\operatorname{d}x.
\end{aligned}
$$
Hence $t_0$ is the $\alpha$ quantile of the Beta($4,2$) distribution, $x_{\alpha}=t_0$, and thus we reject $H_0$ if $T &lt; x_{\alpha}$.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLiH2EYgwqTX-dd99U1bpdqhhLxttnKuvruxqr-qcKcGCAQergyB0YJAKp6mY3A0tOtr4QTM9WZ7eWJJphgEGEgZN_bm0cTIX1ABMV8fpFBIfffrAL6OR6uBBSCwrpqHf1ESN73iAEKath/s1600/m4.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 4. Likelihood Ratio of the Beta Densities.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;script src=&quot;https://gist.github.com/alstat/87844496a84cdbc56171.js&quot;&gt;&lt;/script&gt;
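The cutoff here needs no special functions either: for Beta($4,2$) the CDF integrates in closed form to $F(t)=5t^4-4t^5$, so $t_0$ can be found by bisection. A minimal sketch, assuming $\alpha=.05$:

```python
def beta42_cdf(t):
    # CDF of Beta(4, 2): the integral of 20 x^3 (1 - x) from 0 to t
    return 5 * t ** 4 - 4 * t ** 5

def beta42_quantile(alpha):
    # Bisection works because the CDF is increasing on [0, 1].
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-12:
        mid = (lo + hi) / 2
        if alpha > beta42_cdf(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t0 = beta42_quantile(0.05)   # reject H0 when T falls below t0
print(round(t0, 4))
```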
&lt;/div&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/1597733395520390528/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/07/parametric-inference-karlin-rubin.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1597733395520390528'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1597733395520390528'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/07/parametric-inference-karlin-rubin.html' title='Parametric Inference: Karlin-Rubin Theorem'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPqtDpMdd70cQJjLJU-1dYl2EYm31o0YhF8_8IsbQcMZIZ60jzJ3oex5ou6YTbvNZdwN4H6V4IuP-RYi_6pWhYVPrI4JMod_egy2ZfgxSk_zjZgBGjCXUz1PPgeSW0jJ1I5VOjeDiBbaZh/s72-c/m1.png" height="72" width="72"/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-3643474548584730338</id><published>2015-05-23T16:30:00.002+08:00</published><updated>2015-05-23T16:42:51.183+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Parametric Inference: Likelihood Ratio Test Problem 2</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
More on Likelihood Ratio Test, the following problem is originally from Casella and Berger (2001), exercise 8.12.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Problem&lt;/h3&gt;
For samples of size $n=1,4,16,64,100$ from a normal population with mean $\mu$ and known variance $\sigma^2$, plot the power function of the following LRTs (Likelihood Ratio Tests). Take $\alpha = .05$.
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;
$H_0:\mu\leq 0$ versus $H_1:\mu&amp;gt;0$&lt;/li&gt;
&lt;li&gt;$H_0:\mu=0$ versus $H_1:\mu\neq 0$&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Solution&lt;/h3&gt;
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;
The LRT statistic is given by
$$
\lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty&amp;lt;\mu&amp;lt;\infty}\mathcal{L}(\mu|\mathbf{x})}, \;\text{since }\sigma^2\text{ is known}.
$$
The denominator can be expanded as follows:
$$
\begin{aligned}
\sup_{-\infty&amp;lt;\mu&amp;lt;\infty}\mathcal{L}(\mu|\mathbf{x})&amp;amp;=\sup_{-\infty&amp;lt;\mu&amp;lt;\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;amp;=\sup_{-\infty&amp;lt;\mu&amp;lt;\infty}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\\
&amp;amp;\quad\text{since }\bar{x}\text{ is the MLE of }\mu.\\
&amp;amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\
&amp;amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\
\end{aligned}
$$
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
while the numerator is evaluated as follows:
$$
\begin{aligned}
\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&amp;amp;=\sup_{\mu\leq 0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;amp;=\sup_{\mu\leq 0}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right].
\end{aligned}
$$
The expression above attains its maximum when the sum inside the exponential is smallest. The unconstrained maximizer is $\bar{x}$; when $\bar{x}&amp;gt;0$, the sum $\sum_i(x_i-\mu)^2$ grows as $\mu$ moves further below $\bar{x}$, so over $\mu\leq 0$ the supremum is attained at the boundary $\mu=\mu_0=0$. Hence, 
$$
\begin{aligned}
\sup_{\mu\leq 0}\mathcal{L}(\mu|\mathbf{x})&amp;amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu_0)^2}{2\sigma^2}\right]\\
=\frac{1}{(2\pi\sigma^2)^{n/2}}&amp;amp;\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\mu_0)^2}{2\sigma^2}\right]\\
=\frac{1}{(2\pi\sigma^2)^{n/2}}&amp;amp;\exp\left\{-\displaystyle\sum_{i=1}^{n}\left[\frac{(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu_0)+(\bar{x}-\mu_0)^2}{2\sigma^2}\right]\right\}\\
=\frac{1}{(2\pi\sigma^2)^{n/2}}&amp;amp;\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\mu_0)^2}{2\sigma^2}\right], \\
&amp;amp;\text{since the middle term is 0.}\\
=\frac{1}{(2\pi\sigma^2)^{n/2}}&amp;amp;\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right], \text{since }\mu_0=0.\\
\end{aligned}
$$
So that
$$
\begin{equation}
\label{eq:lrtre}
\begin{aligned}
\lambda(\mathbf{x})&amp;amp;=\frac{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right]}{\frac{1}{(2\pi\sigma^2)^{1/n}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right]}\\
&amp;amp;=\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right].\\
\end{aligned}
\end{equation}
$$
And we reject the null hypothesis if $\lambda(\mathbf{x})\leq c$, that is
$$
\begin{aligned}
\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]&amp;amp;\leq c\\
-\frac{n\bar{x}^2}{2\sigma^2}&amp;amp;\leq \log c\\
\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}&amp;amp;\geq\sqrt{-2\log c}=c&#39;.
\end{aligned}
$$
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihyphenhyphenPbZeiZG14r_0R4mp4N5kXAYmtYIKb0eBzMjd8PKhH0iPIpdnLJ2FssyKVv3GUAObCE7UE96rQAx2IKcbiBAg4ALz5PmNOuUFD8cFKirRik23oeS2gwygM9e0eINYFhfzKaFJK-rQCLy/s1600/Rplot05.png&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 1: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br/&gt;

&lt;script src=&quot;https://gist.github.com/alstat/9d0f59f757950d48bbde.js&quot;&gt;&lt;/script&gt;
Hence, rejecting the null hypothesis if $\lambda(\mathbf{x})\leq c$ is equivalent to rejecting $H_0$ if $\frac{\bar{x}}{\sigma/\sqrt{n}}\geq c&#39;\in[0,\infty)$. Figure 1 depicts the plot of the LRT. The shaded region lies on the positive side because that is where the alternative is, $H_1:\mu&amp;gt;0$: if the LRT is small enough to reject $H_0$, the parameter values in the alternative explain the sample better than those in the null. In that case we expect the sample to come from the model proposed by $H_1$, so the sample mean $\bar{x}$, an unbiased estimator of $\mu$ and a function of the LRT statistic, should fall on the alternative&#39;s side (the shaded region).&lt;br /&gt;&lt;br /&gt;
So the power function, the probability of rejecting $H_0$ as a function of $\mu$ (which, for $\mu$ in the null region, is the probability of a Type I error), is
$$
\begin{aligned}
\beta(\mu)&amp;amp;=\mathrm{P}\left[\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}\geq c&#39;\right],\quad\mu_0=0\\
&amp;amp;=1-\mathrm{P}\left[\frac{\bar{x}+\mu-\mu-\mu_0}{\sigma/\sqrt{n}}&amp;lt; c&#39;\right]\\
&amp;amp;=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}} + \frac{\mu-\mu_0}{\sigma/\sqrt{n}}&amp;lt; c&#39;\right]\\
&amp;amp;=1-\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}&amp;lt; c&#39;+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\
&amp;amp;=1-\Phi\left[c&#39;+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right].
\end{aligned}
$$
As $\mu$ increases, the argument of $\Phi$ decreases, so $\Phi$ decreases and hence $\beta(\mu)$ is an increasing function of $\mu$. So for $\alpha=.05$,
$$
\begin{aligned}
\alpha&amp;amp;=\sup_{\mu\leq \mu_0}\beta(\mu)\\
.05&amp;amp;=\beta(\mu_0)\Rightarrow\beta(\mu_0)=1-\Phi(c&#39;)\\
.95&amp;amp;=\Phi(c&#39;)\Rightarrow c&#39;=1.645.
\end{aligned}
$$
Since,
$$
\begin{aligned}
\Phi(1.645)=\int_{-\infty}^{1.645}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x=.9500151.
\end{aligned}
$$
Therefore for $c&#39;=1.645,\mu_0=0,\sigma=1$, the plot of the power function as a function of $\mu$ for different sample sizes $n$ is shown in Figure 2. For example, for $n=1$ we compute
\begin{equation}
\label{eq:powcomp}
\begin{aligned}
\beta(\mu)&amp;amp;=1-\Phi\left[c&#39;+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\
&amp;amp;=1-\Phi\left[1.645+ \frac{0-\mu}{1/\sqrt{1}}\right]\\
&amp;amp;=1-\int_{-\infty}^{\left(1.645+ \frac{0-\mu}{1/\sqrt{1}}\right)}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x.
\end{aligned}
\end{equation}
The obtained values give the $y$-coordinates of the curve. For $n = 64$,
$$
\begin{aligned}
\beta(\mu)&amp;amp;=1-\Phi\left[c&#39;+ \frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right]\\
&amp;amp;=1-\Phi\left[1.645+ \frac{0-\mu}{1/\sqrt{64}}\right]\\
&amp;amp;=1-\int_{-\infty}^{\left(1.645+ \frac{0-\mu}{1/\sqrt{64}}\right)}\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{x^2}{2}\right]\operatorname{d}x,
\end{aligned}
$$
and so on.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiGJmgbTJ2HqDGAcTPWOZ5TcFayUljxccLRL4LfOcvFh2Q-vlMlis5cfxCnXPE8ld_jMDt6ABUlzrjhICaz1Mm6gO2qP-BeISoV1CFqx29kSd-6yDPWj2DYR7lY1FceqeHBkSzQi2UJJzui/s1600/p1.png&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 2: Power Function for Different Values of $n$.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/056d1f4e3684f374c6bc.js&quot;&gt;&lt;/script&gt;
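The curves in Figure 2 can also be reproduced with Python's standard library alone. A sketch, assuming $c&#39;=1.645$, $\mu_0=0$, and $\sigma=1$ as above:

```python
from statistics import NormalDist

def power(mu, n, c=1.645, mu0=0.0, sigma=1.0):
    # beta(mu) = 1 - Phi(c + (mu0 - mu) / (sigma / sqrt(n)))
    return 1 - NormalDist().cdf(c + (mu0 - mu) * n ** 0.5 / sigma)

# Power at mu = 0.5 grows quickly with the sample size.
for n in (1, 4, 16, 64, 100):
    print(n, round(power(0.5, n), 4))
```

At $\mu=\mu_0=0$ the function returns the size of the test, approximately $\alpha=.05$, for every $n$.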
&lt;/li&gt;
&lt;li&gt;
The LRT statistic is given by
$$
\lambda(\mathbf{x})=\frac{\displaystyle\sup_{\mu= 0}\mathcal{L}(\mu|\mathbf{x})}{\displaystyle\sup_{-\infty&lt;\mu&lt;\infty}\mathcal{L}(\mu|\mathbf{x})}, \;\text{since }\sigma^2\text{ is known}.
$$
The denominator can be expanded as follows:
$$
\begin{aligned}
\sup_{-\infty&lt;\mu&lt;\infty}\mathcal{L}(\mu|\mathbf{x})&amp;=\sup_{-\infty&lt;\mu&lt;\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;=\sup_{-\infty&lt;\mu&lt;\infty}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\\
&amp;\quad\;\text{since }\bar{x}\text{ is the MLE of }\mu.\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\
\end{aligned}
$$
and the numerator is evaluated as follows:
$$
\begin{aligned}
\sup_{\mu=0}\mathcal{L}(\mu|\mathbf{x})&amp;=\sup_{\mu=0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;=\sup_{\mu=0}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-0)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2+n\bar{x}^2}{2\sigma^2}\right],
\end{aligned}
$$
we skip some steps in the simplification above since they were already carried out in part (a). And by Equation (1), $\lambda(\mathbf{x})=\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]$. So that $\lambda(\mathbf{x})\leq c$ would be
$$
\begin{aligned}
\exp\left[-\frac{n\bar{x}^2}{2\sigma^2}\right]&amp;\leq c\\
-\frac{n\bar{x}^2}{2\sigma^2}&amp;\leq \log c\\
\frac{\lvert\bar{x}-\mu_0\rvert}{\sigma/\sqrt{n}}&amp;\geq\sqrt{-2\log c}=c&#39;,\quad \mu_0=0.
\end{aligned}
$$
So rejecting the null hypothesis if $\lambda(\mathbf{x})\leq c$ is equivalent to rejecting $H_0$ if $\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c&#39;$. And since $H_1$ is two-sided, we reject $H_0$ if $\frac{\bar{x}}{\sigma/\sqrt{n}}\geq c&#39;$ or $\frac{\bar{x}}{\sigma/\sqrt{n}}\leq -c&#39;$. To illustrate this, consider Figure 3, where the two shaded regions are the lower and upper rejection regions.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjR1qESpUyjhyrtn46O577Fqr8LwxsxjPmoHWIhzd3uGjcWQA1Op7RCZmoh4rQ0H66QT8Gbli5jWTu6MQ62qT-lCGguXn8simjsqAwXv2I20hHllz9h7xDM1574D1xzNoDkCGE2Gno3FIpW/s1600/Rplot04.png&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 3: Plot of Likelihood Ratio Test Statistic for $n = 4,\sigma = 1$.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2dc31448dede2706b261.js&quot;&gt;&lt;/script&gt;
So that the power function is,

$$
\begin{aligned}
\beta(\mu)&amp;=\mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}\geq c&#39;\right]\\
&amp;=1 - \mathrm{P}\left[\frac{\lvert\bar{x}\rvert}{\sigma/\sqrt{n}}&lt; c&#39;\right]\\
&amp;=1 - \mathrm{P}\left[-c&#39;&lt;\frac{\bar{x}}{\sigma/\sqrt{n}}&lt; c&#39;\right]\\
&amp;=1 - \left\{\mathrm{P}\left[\frac{\bar{x}}{\sigma/\sqrt{n}}&lt; c&#39;\right]-\mathrm{P}\left[\frac{\bar{x}}{\sigma/\sqrt{n}}&lt; -c&#39;\right]\right\}\\
&amp;=1 - \left\{\mathrm{P}\left[\frac{\bar{x}+\mu-\mu}{\sigma/\sqrt{n}}&lt; c&#39;\right]-\mathrm{P}\left[\frac{\bar{x}+\mu-\mu}{\sigma/\sqrt{n}}&lt; -c&#39;\right]\right\}\\
&amp;=1 - \mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}&lt; c&#39;-\frac{\mu}{\sigma/\sqrt{n}}\right]+\mathrm{P}\left[\frac{\bar{x}-\mu}{\sigma/\sqrt{n}}&lt; -c&#39;-\frac{\mu}{\sigma/\sqrt{n}}\right]\\
&amp;=\underbrace{1 - \Phi\left[c&#39;-\frac{\mu}{\sigma/\sqrt{n}}\right]}_{\Phi_1}+\underbrace{\Phi\left[-c&#39;-\frac{\mu}{\sigma/\sqrt{n}}\right]}_{\Phi_2}.
\end{aligned}
$$
Notice that $\Phi_1$ is increasing in $\mu$, while $\Phi_2$ is decreasing in $\mu$. We expect this since the alternative hypothesis is two-sided, and so the power function rises in both directions away from $\mu_0=0$. To see this, consider Figure 4 for different values of $n$.
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwKsCla95GYIlyArs6_JvTi41hxJQ6pI__qBzGGzxvVcHS6mVR1J_6FUI_DVzSPyi-bPZtVppPEYsRKofMFHrT54NHq1ThFc7uaDyxv9FiH7O37hUqzfidP4ZNxqJSovIbSqCV5hlmLtHz/s1600/p3.png&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Figure 4: Two-Sided Power Function for Different $n$.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/dfbb443b94fa558477b9.js&quot;&gt;&lt;/script&gt;
The points in the plot are computed by substituting values of $\mu$ (with $\mu_0=0$, $\sigma=1$, and the given $n$) into the power function, just as we did in Equation (2).
&lt;/li&gt;
&lt;/ol&gt;
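The two-sided power function derived above is easy to check numerically. Below is a minimal Python sketch (the critical value $c&#39;=1.959964$, giving a size of $0.05$, is an illustrative choice):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(mu, n, sigma=1.0, c=1.959964):
    # beta(mu) = 1 - Phi(c - mu*sqrt(n)/sigma) + Phi(-c - mu*sqrt(n)/sigma)
    shift = mu * sqrt(n) / sigma
    return 1.0 - Phi(c - shift) + Phi(-c - shift)

# at mu = 0 the power equals the size of the test, alpha = 0.05
print(round(power(0.0, n=4), 4))
# the power at a fixed alternative mu = 0.5 grows with n
print([round(power(0.5, n), 3) for n in (4, 16, 64)])
```

The first value printed is the size of the test, and the second line shows the power at $\mu=0.5$ growing with $n$, matching the steepening curves in Figure 4.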
&lt;/div&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.nicebread.de/shading-regions-of-the-normal-the-stanine-scale/&quot;&gt;Felix Schönbrodt. &lt;i&gt;Shading regions of the normal: The Stanine scale.&lt;/i&gt; Retrieved May 2015.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/3643474548584730338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-likelihood-ratio_23.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3643474548584730338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3643474548584730338'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-likelihood-ratio_23.html' title='Parametric Inference: Likelihood Ratio Test Problem 2'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihyphenhyphenPbZeiZG14r_0R4mp4N5kXAYmtYIKb0eBzMjd8PKhH0iPIpdnLJ2FssyKVv3GUAObCE7UE96rQAx2IKcbiBAg4ALz5PmNOuUFD8cFKirRik23oeS2gwygM9e0eINYFhfzKaFJK-rQCLy/s72-c/Rplot05.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-6951754898016482363</id><published>2015-05-21T15:25:00.000+08:00</published><updated>2015-05-22T14:19:12.991+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><title type='text'>Parametric Inference: Likelihood Ratio Test Problem 1</title><content type='html'>Another post for mathematical statistics, the problem below is originally from Casella and Berger (2001) (&lt;i&gt;see&lt;/i&gt; Reference 1), exercise 8.6.
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Problem&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Suppose that we have two independent random samples: $X_1,\cdots, X_n$ are exponential($\theta$), and $Y_1,\cdots, Y_m$ are exponential($\mu$).
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt; Find the LRT (Likelihood Ratio Test) of $H_0:\theta=\mu$ versus $H_1:\theta\neq\mu$.&lt;/li&gt;
&lt;li&gt; Show that the test in part (a) can be based on the statistic
$$
T=\frac{\sum X_i}{\sum X_i+\sum Y_i}.
$$
&lt;/li&gt;
&lt;li&gt; Find the distribution of $T$ when $H_0$ is true.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;ol&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;
The likelihood ratio test statistic is given by
$$
\lambda(\mathbf{x},\mathbf{y}) = \frac{\displaystyle\sup_{\theta = \mu,\mu&gt;0}\mathrm{P}(\mathbf{x},\mathbf{y}|\theta,\mu)}{\displaystyle\sup_{\theta &gt; 0,\mu&gt;0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)},
$$
where the denominator is evaluated as follows:
$$
\sup_{\theta &gt; 0,\mu&gt;0}\mathrm{P}(\mathbf{x}, \mathbf{y}|\theta,\mu)=
\sup_{\theta &gt; 0}\mathrm{P}(\mathbf{x}|\theta)\sup_{\mu &gt; 0}\mathrm{P}(\mathbf{y}|\mu),\quad\text{by independence.}
$$
So that,
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
$$
\begin{aligned}
\sup_{\theta &gt; 0}\mathrm{P}(\mathbf{x}|\theta)&amp;=\sup_{\theta&gt;0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]=\sup_{\theta&gt;0}\frac{1}{\theta^n}\exp\left[-\frac{\sum_{i=1}^{n}x_i}{\theta}\right]\\
&amp;=\frac{1}{\bar{x}^n}\exp\left[-\frac{\sum_{i=1}^{n}x_i}{\bar{x}}\right]=\frac{1}{\bar{x}^n}\exp[-n],
\end{aligned}
$$
since the sample mean $\bar{x}$ is the MLE of $\theta$. Also,
$$
\begin{aligned}
\sup_{\mu &gt; 0}\mathrm{P}(\mathbf{y}|\mu)&amp;=\sup_{\mu&gt;0}\prod_{j=1}^{m}\frac{1}{\mu}\exp\left[-\frac{y_j}{\mu}\right]=\sup_{\mu&gt;0}\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^{m}y_j}{\mu}\right]\\
&amp;=\frac{1}{\bar{y}^m}\exp\left[-\frac{\sum_{j=1}^{m}y_j}{\bar{y}}\right]=\frac{1}{\bar{y}^m}\exp[-m].
\end{aligned}
$$
Now the numerator is evaluated as follows,
$$
\begin{aligned}
\sup_{\theta = \mu,\mu&gt;0}\mathrm{P}(\mathbf{x},\mathbf{y}|\theta,\mu)&amp;=\sup_{\theta=\mu,\mu&gt;0}\mathrm{P}(\mathbf{x}|\theta)\mathrm{P}(\mathbf{y}|\mu),\quad\text{by independence.}\\
&amp;=\sup_{\theta=\mu,\mu&gt;0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\prod_{j=1}^{m}\frac{1}{\mu}\exp\left[-\frac{y_j}{\mu}\right]\\
&amp;=\sup_{\theta=\mu,\mu&gt;0}\frac{1}{\theta^n}\exp\left[-\frac{\sum_{i=1}^nx_i}{\theta}\right]\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^m y_j}{\mu}\right]\\
&amp;=\sup_{\mu&gt;0}\frac{1}{\mu^n}\exp\left[-\frac{\sum_{i=1}^nx_i}{\mu}\right]\frac{1}{\mu^m}\exp\left[-\frac{\sum_{j=1}^m y_j}{\mu}\right]\\
&amp;=\sup_{\mu&gt;0}\frac{1}{\mu^{n+m}}\exp\left\{-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\}
\end{aligned}
$$
Note that $\mu$ is still unknown under $H_0$, and so we also maximize over its domain. To do that, we take the log-likelihood function first,
$$
\begin{aligned}
\ell(\mu|\mathbf{x},\mathbf{y})&amp;=-\log(\mu^{n+m})-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\\
&amp;=-(n+m)\log(\mu)-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right].
\end{aligned}
$$
Taking the derivative with respect to $\mu$ gives us
$$
\frac{\operatorname{d}}{\operatorname{d}\mu}\ell(\mu|\mathbf{x},\mathbf{y})=-(n+m)\frac{1}{\mu}+\frac{1}{\mu^2}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right],
$$
equate this to zero to obtain the stationary point,
$$
\begin{aligned}
-(n+m)\frac{1}{\mu}+\frac{1}{\mu^2}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]&amp;=0\\
-(n+m)\mu+\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]&amp;=0\\
\mu&amp;=\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right].
\end{aligned}
$$
To verify that this is the MLE, we apply the second derivative test to the log-likelihood function, evaluating at the stationary point where $\sum_{i=1}^nx_i+\sum_{j=1}^m y_j=(n+m)\hat{\mu}$:
$$
\frac{\operatorname{d}^2}{\operatorname{d}\mu^2}\ell(\mu|\mathbf{x},\mathbf{y})\bigg|_{\mu=\hat{\mu}}=(n+m)\frac{1}{\hat{\mu}^2}-\frac{2}{\hat{\mu}^3}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]=(n+m)\frac{1}{\hat{\mu}^2}-\frac{2(n+m)}{\hat{\mu}^2}=-\frac{n+m}{\hat{\mu}^2}&lt;0,
$$
implying $\hat{\mu}=\displaystyle\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]$ is the MLE of $\mu$.
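This maximization can also be checked numerically: the pooled mean $\hat{\mu}$ should give a log-likelihood at least as large as nearby values of $\mu$. Below is a small Python sketch with simulated exponential data (the sample sizes and $\theta$ are arbitrary choices):

```python
from math import log
import random

random.seed(1)
n, m, theta = 12, 8, 2.0
# expovariate takes the rate 1/theta, so the draws have mean theta
x = [random.expovariate(1.0 / theta) for _ in range(n)]
y = [random.expovariate(1.0 / theta) for _ in range(m)]
S = sum(x) + sum(y)

def loglik(mu):
    # l(mu | x, y) = -(n + m) log(mu) - S / mu
    return -(n + m) * log(mu) - S / mu

mu_hat = S / (n + m)  # the stationary point derived above
# the log-likelihood at mu_hat is at least as large as at nearby values
print(all(loglik(mu_hat) >= loglik(mu_hat * f) for f in (0.5, 0.9, 1.1, 2.0)))
```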
Thus the LRT, $\lambda(\mathbf{x},\mathbf{y})$ would be,
$$
\begin{aligned}
\lambda(\mathbf{x},\mathbf{y})&amp;=\frac{\sup_{\mu&gt;0}\displaystyle\frac{1}{\mu^{n+m}}\exp\left\{-\frac{1}{\mu}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\}}{\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)]}\\
&amp;=\left(\frac{1}{\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\times\right.\\
&amp;\qquad\left.\exp\left\{-\frac{1}{\frac{1}{n+m}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]\right\}\right)\bigg/\\
&amp;\qquad\qquad\qquad\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)]
\end{aligned}
$$
$$
\begin{aligned}
&amp;=\frac{\displaystyle\frac{1}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\displaystyle\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\times\exp[-(n+m)]}{\displaystyle\frac{1}{\bar{x}^n}\frac{1}{\bar{y}^m}\exp[-(n+m)]}\\[.3cm]
&amp;=\frac{\displaystyle \bar{x}^n \bar{y}^m}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}.
\end{aligned}
$$
And we say that $H_0$ is rejected if $\lambda(\mathbf{x},\mathbf{y})\leq c$.
&lt;/li&gt;
&lt;li&gt;
If we do some algebra on the LRT in part (a), we obtain the following:
$$
\begin{aligned}
\lambda(\mathbf{x},\mathbf{y})&amp;=\frac{\displaystyle \bar{x}^n \bar{y}^m}{\displaystyle\frac{1}{{(n+m)}^{n+m}}\left[\sum_{i=1}^nx_i+\sum_{j=1}^m y_j\right]^{n+m}}\\
&amp;=\frac{\displaystyle\frac{1}{n^n}\left(\sum_{i=1}^{n}x_i\right)^{n}\frac{1}{m^{m}}\left(\sum_{j=1}^{m}y_j\right)^{m}}{\displaystyle\frac{1}{(n+m)^{n+m}}\left[\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j\right]^{n+m}}\\
&amp;=\frac{\displaystyle (n+m)^{n+m}\left(\sum_{i=1}^{n}x_i\right)^{n}\left(\sum_{j=1}^{m}y_j\right)^{m}}{\displaystyle n^{n}m^{m}\left[\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j\right]^{n+m}}\\
&amp;=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[\frac{\displaystyle \sum_{j=1}^{m}y_j}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{m}\left[\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{n}\\
&amp;=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[1-\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{m}\left[\frac{\displaystyle \sum_{i=1}^{n}x_i}{\displaystyle\sum_{i=1}^{n}x_i+\sum_{j=1}^{m}y_j}\right]^{n}\\
&amp;=\frac{(n+m)^{n+m}}{n^nm^{m}}\left[1-T\right]^{m}\left[T\right]^{n}.
\end{aligned}
$$
Hence, the LRT can be based on the statistic $T$.
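This algebra is easy to verify numerically: computing $\lambda(\mathbf{x},\mathbf{y})$ directly from part (a) and through the statistic $T$ gives the same value. A small Python sketch (the sample sizes and rates are arbitrary choices):

```python
import random

random.seed(7)
n, m = 10, 15
x = [random.expovariate(0.5) for _ in range(n)]
y = [random.expovariate(0.8) for _ in range(m)]
sx, sy = sum(x), sum(y)
xbar, ybar = sx / n, sy / m

# LRT from part (a): xbar^n ybar^m (n+m)^(n+m) / (sum x + sum y)^(n+m)
lam_a = xbar ** n * ybar ** m * (n + m) ** (n + m) / (sx + sy) ** (n + m)
# the same quantity written in terms of T, as in part (b)
T = sx / (sx + sy)
lam_b = (n + m) ** (n + m) / (n ** n * m ** m) * (1 - T) ** m * T ** n
print(lam_a, lam_b)  # the two agree up to floating-point error
```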
&lt;/li&gt;
&lt;li&gt;
The distribution of $\sum X_i$ is obtained using the MGF (Moment Generating Function) technique, that is
$$
\begin{aligned}
\mathrm{M}_{\Sigma X_i}(t)&amp;=\mathrm{E}\exp[t\Sigma X_i]=\mathrm{E}\exp[tX_1 +\cdots + tX_n]\\
&amp;=\mathrm{E}\exp[tX_1]\times\cdots\times\mathrm{E}\exp[tX_n],\quad\text{by independence.}\\
&amp;=\frac{1}{1-\theta t}\times\cdots\times\frac{1}{1-\theta t}\\
&amp;=\left(\frac{1}{1-\theta t}\right)^{n}=\text{MGF of gamma}(n,\theta).
\end{aligned}
$$
Now, $\sum X_i$ is gamma($n,\theta$); and when $H_0$ is true, $\sum Y_j$ is likewise gamma($m,\theta$). For brevity, let $X=\sum_{i=1}^{n} X_i$ and $Y=\sum_{j=1}^{m}Y_j$. The joint distribution of $X$ and $Y$ is given below,
$$
f_{XY}(x, y)=\frac{1}{\Gamma (n)\theta^{n}}x^{n-1}\exp[-x/\theta]\times\frac{1}{\Gamma (m)\theta^{m}}y^{m-1}\exp[-y/\theta].
$$
Let $U=\frac{X}{X+Y}$ and $V=X+Y$. The support of $(X,Y)$ is $\mathcal{A}=\left\{(x,y)\in \mathbb{R}^{+}\times \mathbb{R}^{+}\right\}$, and since the transformation $(x,y)\mapsto(u,v)$ is one-to-one and onto, the image is $\mathcal{B}=\left\{(u,v)\in [0,1]\times \mathbb{R}^{+}\right\}$. Consider the following transformations
$$
u=g_{1}(x,y)=\frac{x}{x+y}\quad\text{and}\quad v=g_{2}(x,y)=x+y.
$$
Then,
\begin{equation}
\label{eq:bvt1}
u=\frac{x}{x+y}\Rightarrow x=\frac{uy}{1-u}
\end{equation}
and
\begin{equation}
\label{eq:bvt2}
v=x+y\Rightarrow y = v-x.
\end{equation}
Substituting Equation (\ref{eq:bvt2}) into Equation (\ref{eq:bvt1}), then
$$
\begin{aligned}
x=\frac{u(v-x)}{1-u}&amp;\Rightarrow x(1-u)=u(v-x)\\
x-ux=uv-ux&amp;\Rightarrow x=uv=h_{1}(u,v).
\end{aligned}
$$
Substitute $x$ above to Equation (\ref{eq:bvt2}) to obtain,
$$y=v(1-u)=h_2(u,v).$$
And the Jacobian determinant is,
$$
\mathbf{J}=\bigg|
\begin{array}{cc}
v&amp;u\\[.2cm]
-v&amp;1-u
\end{array}
\bigg|=v(1-u)+uv=v.
$$
So that,
$$
\begin{aligned}
f_{UV}(u,v)&amp;=f_{XY}(h_1(u,v),h_2(u,v))\lvert \mathbf{J}\rvert=f_{XY}(uv,v(1-u))\lvert v\rvert\\
&amp;=\frac{1}{\Gamma (n)\theta^{n}}(uv)^{n-1}\exp[-uv/\theta]\times\\
&amp;\quad\;\frac{1}{\Gamma (m)\theta^{m}}(v(1-u))^{m-1}\exp[-v(1-u)/\theta]v\\
&amp;=\frac{1}{\Gamma (n)\theta^{n}}(uv)^{n-1}\exp[-uv/\theta]\times\\
&amp;\quad\;\frac{1}{\Gamma (m)\theta^{m}}(v(1-u))^{m-1}\exp[-v/\theta]\exp[uv/\theta]v\\
&amp;=\frac{1}{\Gamma (n)\theta^{n}}u^{n-1}v^{n-1}\times\frac{1}{\Gamma (m)\theta^{m}}v^{m-1}(1-u)^{m-1}\exp[-v/\theta]v\\
&amp;=\frac{1}{\Gamma (n)}\underbrace{u^{n-1}(1-u)^{m-1}}_{\text{Beta}(n,m)\text{ kernel}}\frac{1}{\Gamma (m)\theta^{m+n}}v^{m-1}v^{n-1}\exp[-v/\theta]v\\
&amp;=\frac{\Gamma(m)\Gamma(m+n)}{\Gamma(m)\Gamma(m+n)}\frac{u^{n-1}(1-u)^{m-1}}{\Gamma (n)}\times\\
&amp;\quad\;\frac{1}{\Gamma (m)\theta^{m+n}}v^{m-1}v^{n}\exp[-v/\theta]\\
&amp;=\underbrace{\frac{\Gamma(m+n)}{\Gamma (n)\Gamma(m)}u^{n-1}(1-u)^{m-1}}_{\text{Beta}(n,m)}\times\\
&amp;\quad\;\underbrace{\frac{1}{\Gamma(m+n)\theta^{m+n}}v^{m+n-1}\exp[-v/\theta]}_{\text{Gamma}(m+n,\theta)}.
\end{aligned}
$$
So that the marginal density of $U=T=\displaystyle\frac{\sum X_i}{\sum X_i +\sum Y_j}$ is Beta($n,m$) when $H_0$ is true.
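This result can be checked by simulation: under $H_0$ the empirical mean and variance of $T$ should match the Beta($n,m$) moments $\frac{n}{n+m}$ and $\frac{nm}{(n+m)^2(n+m+1)}$. A minimal Monte Carlo sketch (the sample sizes, $\theta$, and the number of replications are arbitrary choices):

```python
import random

random.seed(42)
n, m, theta, reps = 5, 7, 2.0, 20000

def draw_T():
    # one realization of T under H0: both samples are exponential(theta)
    sx = sum(random.expovariate(1.0 / theta) for _ in range(n))
    sy = sum(random.expovariate(1.0 / theta) for _ in range(m))
    return sx / (sx + sy)

ts = [draw_T() for _ in range(reps)]
emp_mean = sum(ts) / reps
emp_var = sum((t - emp_mean) ** 2 for t in ts) / reps

# Beta(n, m) moments: mean n/(n+m), variance n*m/((n+m)^2 (n+m+1))
print(round(emp_mean, 3), round(n / (n + m), 3))
print(round(emp_var, 4), round(n * m / ((n + m) ** 2 * (n + m + 1)), 4))
```

The empirical and theoretical values agree up to Monte Carlo error.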
&lt;/li&gt;
&lt;/ol&gt;
&lt;/ol&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/6951754898016482363/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-likelihood-ratio.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6951754898016482363'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6951754898016482363'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-likelihood-ratio.html' title='Parametric Inference: Likelihood Ratio Test Problem 1'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-2562292771002515136</id><published>2015-05-01T15:17:00.000+08:00</published><updated>2015-08-17T10:41:57.752+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Interactive Visualization"/><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><title type='text'>Parametric Inference: The Power Function of the Test</title><content type='html'>In Statistics, we model random phenomena and make conclusions about the underlying population. For example, consider an experiment to determine the true heights of the students in a university. Suppose we take a sample from the population of the students, and consider testing the null hypothesis that the average height is 5.4 ft against the alternative hypothesis that the average height is greater than 5.4 ft.
Mathematically, we can represent this as $H_0:\theta=\theta_0$ vs $H_1:\theta&gt;\theta_0$, where $\theta$ is the true value of the parameter and $\theta_0=5.4$ is the testing value set by the experimenter. And because we only consider a subset (the sample) of the population for testing the hypotheses, we should expect to commit errors. To understand these errors, suppose the above test results in rejecting $H_0$ given that $\theta\in\Theta_0$, where $\Theta_0$ is the parameter space of the null hypothesis; in other words, we mistakenly reject $H_0$. In this case we have committed a Type I error. On the other hand, if the above test results in accepting $H_0$ given that $\theta\in\Theta_0^c$, where $\Theta_0^c$ is the parameter space of the alternative hypothesis, then we have committed a Type II error. To summarize this, consider the following table,&lt;br/&gt;&lt;br/&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th align = &quot;center&quot;&gt;Truth&lt;/th&gt;&lt;th colspan = &quot;2&quot; align = &quot;center&quot;&gt;Decision&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;4&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: Two Types of Errors in Hypothesis Testing.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;/td&gt;&lt;td align = &quot;center&quot;&gt;Accept $H_0$&lt;/td&gt;&lt;td align = &quot;center&quot;&gt;Reject $H_0$&lt;/td&gt;&lt;/tr&gt;
&lt;tr class = &quot;alt&quot;&gt;&lt;td&gt;$H_0$&lt;/td&gt;&lt;td&gt;Correct Decision&lt;/td&gt;&lt;td&gt;Type I Error&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;$H_1$&lt;/td&gt;&lt;td&gt;Type II Error&lt;/td&gt;&lt;td&gt;Correct Decision&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br/&gt;
Let&#39;s formally define the power function, from Casella and Berger (2001), see reference 1.
&lt;blockquote&gt;
&lt;b&gt;Definition 1&lt;/b&gt;. The &lt;i&gt;power function&lt;/i&gt; of a hypothesis test with rejection region $R$ is the function of $\theta$ defined by $\beta(\theta)=\mathrm{P}_{\theta}(\mathbf{X}\in R)$.
&lt;/blockquote&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
To relate the definition to the above problem, let $R$ be the rejection region of $H_0$. Then we make a mistake if the observed sample $\mathbf{x}\in R$ given that $\theta\in\Theta_0$; that is, for $\theta\in\Theta_0$, $\beta(\theta)=\mathrm{P}_{\theta}(\mathbf{X}\in R)$ is the probability of a Type I error. Let&#39;s consider an example, one that is popularly used in testing the sample mean. The example below is the combined problem of Example 8.3.3 and Exercise 8.37 (a) of Reference 1.
&lt;br/&gt;&lt;br/&gt;
&lt;b&gt;Example 1&lt;/b&gt;. Let $X_1,\cdots, X_n\overset{r.s.}{\sim}N(\theta,\sigma^2)$ -- a normal population where $\sigma^2$ is known. Consider testing $H_0:\theta\leq \theta_0$ vs $H_1:\theta&gt; \theta_0$; obtain the likelihood ratio test (LRT) statistic and its power function.
&lt;br/&gt;&lt;br/&gt;
&lt;i&gt;Solution:&lt;/i&gt;
The LRT statistic is given by
$$
\lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{-\infty&lt;\theta&lt;\infty}L(\theta|\mathbf{x})},
$$
where
$$
\begin{aligned}
\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})&amp;=\sup_{\theta\leq\theta_0}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\
&amp;=\sup_{\theta\leq\theta_0}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta_0)^2}{2\sigma^2}\right],\;\text{for }\bar{x}&gt;\theta_0\text{ (else the sup is at }\bar{x}\text{)},\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\theta_0)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\displaystyle\sum_{i=1}^{n}\left[\frac{(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\theta_0)+(\bar{x}-\theta_0)^2}{2\sigma^2}\right]\right\}\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\theta_0)^2}{2\sigma^2}\right], \text{since the middle term sums to 0.}
\end{aligned}
$$
And
$$
\begin{aligned}
\sup_{-\infty&lt;\theta&lt;\infty}L(\theta|\mathbf{x})&amp;=\sup_{-\infty&lt;\theta&lt;\infty}\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left[-\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\
&amp;=\sup_{-\infty&lt;\theta&lt;\infty}\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\theta)^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right],\quad\text{since }\bar{x}\text{ is the MLE of }\theta.\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{n-1}{n-1}\displaystyle\sum_{i=1}^{n}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right]\\
&amp;=\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right],\\
\end{aligned}
$$
so that
$$
\begin{aligned}
\lambda(\mathbf{x})&amp;=\frac{\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2+n(\bar{x}-\theta_0)^2}{2\sigma^2}\right]}{\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{(n-1)s^2}{2\sigma^2}\right]}\\
&amp;=\exp\left[-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}\right].\\
\end{aligned}
$$
And from my previous &lt;a href=&quot;http://alstatr.blogspot.com/2015/04/parametric-inference-likelihood-ratio.html&quot; target = &quot;_blank&quot;&gt;entry&lt;/a&gt;, $H_0$ is rejected if $\lambda(\mathbf{x})$ is small, such that $\lambda(\mathbf{x})\leq c$ for some $c\in[0,1]$. Hence,
$$
\begin{aligned}
\lambda(\mathbf{x})&amp;=\exp\left[-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}\right]&lt; c\\&amp;\Rightarrow-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}&lt;\log c\\
&amp;\Rightarrow\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}&gt;\sqrt{-2\log c}.
\end{aligned}
$$
So that $H_0$ is rejected if $\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}&gt; c&#39;$ for some $c&#39;=\sqrt{-2\log c}\in[0,\infty)$. Now the power function of the test is the probability of rejecting the null hypothesis as a function of $\theta$; for $\theta\in\Theta_0$ it is the probability of a Type I error. It is given by,
$$
\begin{aligned}
\beta(\theta)&amp;=\mathrm{P}\left[\frac{\bar{x}-\theta_0}{\sigma/\sqrt{n}}&gt; c&#39;\right]\\
&amp;=\mathrm{P}\left[\frac{\bar{x}-\theta+\theta-\theta_0}{\sigma/\sqrt{n}}&gt; c&#39;\right]\\
&amp;=\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}+\frac{\theta-\theta_0}{\sigma/\sqrt{n}}&gt; c&#39;\right]\\
&amp;=\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}&gt; c&#39;-\frac{\theta-\theta_0}{\sigma/\sqrt{n}}\right]\\
&amp;=1-\mathrm{P}\left[\frac{\bar{x}-\theta}{\sigma/\sqrt{n}}\leq c&#39;+\frac{\theta_0-\theta}{\sigma/\sqrt{n}}\right]\\
&amp;=1-\Phi\left[c&#39;+\frac{\theta_0-\theta}{\sigma/\sqrt{n}}\right].
\end{aligned}
$$
To illustrate this, consider $\theta_0=5.4,\sigma = 1,n=30$ and $c&#39;=1.645$. Then the plot of the power function as a function of $\theta$ is,
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/203/&quot; target=&quot;_blank&quot; title=&quot;Power Function&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/203.png&quot; alt=&quot;Power Function&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:203&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
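The plotted curve can be reproduced numerically; the sketch below evaluates $\beta(\theta)=1-\Phi\left[c&#39;+\frac{\theta_0-\theta}{\sigma/\sqrt{n}}\right]$ at a few values of $\theta$, using the same $\theta_0=5.4$, $\sigma=1$, $n=30$, and $c&#39;=1.645$ as in the plot:

```python
from math import erf, sqrt

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(theta, theta0=5.4, sigma=1.0, n=30, c=1.645):
    # beta(theta) = 1 - Phi(c + (theta0 - theta) * sqrt(n) / sigma)
    return 1.0 - Phi(c + (theta0 - theta) * sqrt(n) / sigma)

for theta in (5.2, 5.4, 5.6, 5.8):
    print(theta, round(power(theta), 4))
```

At $\theta=\theta_0=5.4$ the printed value is the size of the test, about $0.05$.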
Since $\beta$ is an increasing function with unit range, then
$$
\alpha = \sup_{\theta\leq\theta_0}\beta(\theta)=\beta(\theta_0)=1-\Phi(c&#39;).
$$
So that, using the values we set for the above graph, $\alpha=0.049985\approx 0.05$. Here $\alpha$ is called the &lt;i&gt;size of the test&lt;/i&gt; since it is the supremum of the power function over $\theta\leq\theta_0$; see Reference 1 for the &lt;i&gt;level of the test&lt;/i&gt;. Now let&#39;s investigate the power function above: the probability of committing a Type I error, $\beta(\theta), \forall \theta\leq \theta_0$, is acceptably small. However, the probability of committing a Type II error, $1-\beta(\theta)$, for $\theta &gt; \theta_0$ near $\theta_0$, is too high, as we can see in the following plot,
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/230/&quot; target=&quot;_blank&quot; title=&quot;Type II Error&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/230.png&quot; alt=&quot;Type II Error&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:230&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
Therefore, it&#39;s better to investigate the error structure when considering the power of the test. From Casella and Berger (2001), the ideal power function is 0 $\forall\theta\in\Theta_0$ and 1 $\forall\theta\in\Theta_0^c$. Except in trivial situations, this ideal cannot be attained. Qualitatively, a good test has power function near 1 for most $\theta\in\Theta_0^c$ and near 0 for most $\theta\in\Theta_0$, implying one that has a steeper power curve.&lt;br/&gt;&lt;br/&gt;

Now an interesting fact about the power function is that it depends on the sample size $n$. Suppose in our experiment above we want the Type I error to be 0.05 and the Type II error to be 0.1 if $\theta\geq \theta_0+\sigma/2$. Since the power function is increasing, we have
$$
\beta(\theta_0)=0.05\Rightarrow c&#39;=1.645\quad\text{and}\quad 1 - \beta(\theta_0+\sigma/2)=0.1\Rightarrow\beta(\theta_0+\sigma/2)=0.9.
$$
Where
$$
\begin{aligned}
\beta(\theta_0+\sigma/2)&amp;=1-\Phi\left[c&#39; +\frac{\theta_0-(\theta_0+\sigma/2)}{\sigma/\sqrt{n}}\right]\\
&amp;=1-\Phi\left[c&#39; - \frac{\sqrt{n}}{2}\right]\\
0.9&amp;=1-\Phi\left[1.645 - \frac{\sqrt{n}}{2}\right]\\
0.1&amp;=\Phi\left[1.645 - \frac{\sqrt{n}}{2}\right].\\
\end{aligned}
$$
Hence, $n$ is chosen such that it solves the above equation. That is,
$$
\begin{aligned}
1.645 - \frac{\sqrt{n}}{2}&amp;=-1.28155,\quad\text{since }\Phi(-1.28155)=0.1\\
\frac{3.29 - \sqrt{n}}{2}&amp;=-1.28155\\
3.29 - \sqrt{n}&amp;=-2.5631\\
n&amp;=(3.29+2.5631)^2=34.25878,\;\text{take }n=35.
\end{aligned}
$$
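The same computation can be done with Python&#39;s standard library, solving $1.645 - \frac{\sqrt{n}}{2}=\Phi^{-1}(0.1)$ for $n$ (using exact normal quantiles rather than the rounded values above, so the result differs slightly in the last decimals):

```python
from math import ceil
from statistics import NormalDist

nd = NormalDist()
c = nd.inv_cdf(0.95)    # critical value with beta(theta0) = 0.05
z = nd.inv_cdf(0.10)    # Phi(z) = 0.1
root_n = 2.0 * (c - z)  # solve c - sqrt(n)/2 = z for sqrt(n)
n = root_n ** 2
print(round(n, 5), ceil(n))
```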
For purposes of illustration, we&#39;ll consider the non-rounded value of $n$. Below is the plot of this,
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/238/&quot; target=&quot;_blank&quot; title=&quot;Power Function with Sample Size&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/238.png&quot; alt=&quot;Power Function with Sample Size&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:238&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
And for different values of $n$, consider the following power functions
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/286/&quot; target=&quot;_blank&quot; title=&quot;Effect of Sample Size on Power Function&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/286.png&quot; alt=&quot;Effect of Sample Size on Power Function&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:286&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
From the above plot, the larger the sample size $n$, the steeper the curve, implying a better error structure. To see this, try hovering over the lines in the plot, and you&#39;ll witness a faster transition across the unit range for larger values of $n$; this characteristic contributes to the sensitivity of the test.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Plot&#39;s Python Codes&lt;/h3&gt;
In case you want to reproduce the above plots, click &lt;a href=&quot;https://gist.github.com/alstat/cfb927fffea2cc15afe2&quot; target = &quot;_blank&quot;&gt;here&lt;/a&gt; for the source code.
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://plot.ly/python/&quot; target = &quot;_blank&quot;&gt;Plotly Python Library Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/2562292771002515136/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-power-function-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2562292771002515136'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2562292771002515136'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/05/parametric-inference-power-function-of.html' title='Parametric Inference: The Power Function of the Test'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-6414633907837650938</id><published>2015-04-27T17:21:00.000+08:00</published><updated>2015-08-17T10:46:27.584+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Interactive Visualization"/><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><title type='text'>Parametric Inference: Likelihood Ratio Test by Example</title><content type='html'>Hypothesis testing has been used extensively in different disciplines of science. And in this post, I will attempt to discuss the basic theory behind this, the Likelihood Ratio Test (LRT), defined below from Casella and Berger (2001); see Reference 1.
&lt;blockquote&gt;
&lt;b&gt;Definition&lt;/b&gt;. The &lt;i&gt;likelihood ratio test statistic&lt;/i&gt; for testing $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_0^c$ is
\begin{equation}
\label{eq:lrt}
\lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta\in\Theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\in\Theta}L(\theta|\mathbf{x})}.
\end{equation}
A &lt;i&gt;likelihood ratio test&lt;/i&gt; (LRT) is any test that has a rejection  region of the form $\{\mathbf{x}:\lambda(\mathbf{x})\leq c\}$, where $c$ is any number satisfying $0\leq c \leq 1$.
&lt;/blockquote&gt;
The numerator of equation (\ref{eq:lrt}) is the supremum of the likelihood of the sample, $\mathbf{x}$, taken over the restricted domain (the null hypothesis, $\Theta_0$) of the parameter space $\Theta$; that is, the largest joint probability of the sample attainable when $\theta$ is confined to $\Theta_0$. The denominator is the supremum of the likelihood over the unrestricted domain, $\Theta$. Therefore, if $\lambda(\mathbf{x})$ is small, say $\lambda(\mathbf{x})\leq c$ for some $c\in [0, 1]$, then the values of the parameter that are plausible in explaining the sample are likely to be in the alternative hypothesis, $\Theta_0^c$.&lt;br/&gt;&lt;br/&gt;
&lt;b&gt;Example 1&lt;/b&gt;. Let $X_1,X_2,\cdots,X_n\overset{r.s.}{\sim}f(x|\theta)=\frac{1}{\theta}\exp\left[-\frac{x}{\theta}\right],x&gt;0,\theta&gt;0$. From this sample, consider testing $H_0:\theta = \theta_0$ vs $H_1:\theta&lt;\theta_0$.
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;br/&gt;&lt;br/&gt;
&lt;i&gt;Solution:&lt;/i&gt;&lt;br/&gt;
The parameter space $\Theta$ is the set $(0,\theta_0]$, where $\Theta_0=\{\theta_0\}$. Hence, using the likelihood ratio test, we have
$$
\lambda(\mathbf{x})=\frac{\displaystyle\sup_{\theta=\theta_0}L(\theta|\mathbf{x})}{\displaystyle\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})},
$$
where,
$$
\begin{aligned}
\sup_{\theta=\theta_0}L(\theta|\mathbf{x})&amp;=\sup_{\theta=\theta_0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\\
&amp;=\sup_{\theta=\theta_0}\left(\frac{1}{\theta}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta}\right]\\
&amp;=\left(\frac{1}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right],
\end{aligned}
$$
and
$$
\begin{aligned}
\sup_{\theta\leq\theta_0}L(\theta|\mathbf{x})&amp;=\sup_{\theta\leq\theta_0}\prod_{i=1}^{n}\frac{1}{\theta}\exp\left[-\frac{x_i}{\theta}\right]\\
&amp;=\sup_{\theta\leq\theta_0}\left(\frac{1}{\theta}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta}\right]=\sup_{\theta\leq\theta_0}f(\mathbf{x}|\theta).
\end{aligned}
$$
Now the supremum of $f(\mathbf{x}|\theta)$ over all values of $\theta\leq\theta_0$ is attained at the MLE (maximum likelihood estimator) of $\theta$, which is $\bar{x}$, provided that $\bar{x}\leq \theta_0$.&lt;br/&gt;&lt;br/&gt;
So that, 
$$
\begin{aligned}
\lambda(\mathbf{x})&amp;=\frac{\left(\frac{1}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]}
{\left(\frac{1}{\bar{x}}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\bar{x}}\right]},\quad\text{provided that}\;\bar{x}\leq \theta_0\\
&amp;=\left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]\exp[n].
\end{aligned}
$$
We then reject $H_0$ if $\lambda(\mathbf{x})\leq c$. That is,
$$
\begin{aligned}
\left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]\exp[n]&amp;\leq c\\
\left(\frac{\bar{x}}{\theta_0}\right)^n\exp\left[-\displaystyle\frac{\sum_{i=1}^{n}x_i}{\theta_0}\right]&amp;\leq c&#39;,\quad\text{where}\;c&#39;=\frac{c}{\exp[n]}\\
n\log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{n}{\theta_0}\bar{x}&amp;\leq \log c&#39;\\
\log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0}&amp;\leq \frac{1}{n}\log c&#39;\\
\log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0}&amp;\leq \frac{1}{n}\log c-1.
\end{aligned}
$$
Now let $h(x)=\log x - x$; then $h&#39;(x)=\frac{1}{x}-1$, so the critical point of $h(x)$ is $x=1$. To test whether this is a maximum or a minimum, we apply the second derivative test. That is,
$$
h&#39;&#39;(x)=-\frac{1}{x^2}&lt;0,\forall x.
$$
Thus, $x=1$ is a maximum. Hence,
$
\log\left(\frac{\bar{x}}{\theta_0}\right)-\frac{\bar{x}}{\theta_0}
$
is maximized when $\frac{\bar{x}}{\theta_0}=1$, that is, when $\bar{x}=\theta_0$. To see this, consider the following plot:
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/152/&quot; target=&quot;_blank&quot; title=&quot;LRT and its Critical Value&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/152.png&quot; alt=&quot;LRT and its Critical Value&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:152&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
The figure above plots the function $h(\bar{x})$ with $\theta_0=1$. Under the assumption that $\bar{x}\leq \theta_0$, and letting $R=\frac{1}{n}\log c-1$ denote the orange line above, we reject $H_0$ if $h(\bar{x})\leq R$, which happens if and only if $\bar{x}\leq k$. In practice, $k$ is specified to satisfy
$$
\mathrm{P}(\bar{x}\leq k|\theta=\theta_0)\leq \alpha,
$$
where $\alpha$ is called the level of the test.&lt;br/&gt;&lt;br/&gt;
Under $H_0$, $X_i|\theta = \theta_0\overset{r.s.}{\sim}\exp[\theta_0]$, so $\mathrm{E}X_i=\theta_0$ and $\mathrm{Var}X_i=\theta_0^2$. Let $\bar{x}=\frac{1}{n}\sum_{i=1}^{n}X_i$, and let $G_n$ be the distribution of $\frac{(\bar{x}_n-\theta_0)}{\sqrt{\frac{\theta_0^2}{n}}}$. By the CLT (central limit theorem), $G_n$ converges to the standard normal distribution as $n\to\infty$. That is, $\bar{x}|\theta = \theta_0\overset{r.s.}{\sim}AN\left(\theta_0,\frac{\theta_0^2}{n}\right)$, where $AN$ means asymptotically normal. &lt;br/&gt;&lt;br/&gt;
Thus,
$$
\mathrm{P}(\bar{x}\leq k|\theta=\theta_0)=\Phi\left(\frac{k-\theta_0}{\theta_0/\sqrt{n}}\right),\quad\text{for large }n.
$$
So that,
$$
\mathrm{P}(\bar{x}\leq k|\theta=\theta_0)=\Phi\left(\frac{k-\theta_0}{\theta_0/\sqrt{n}}\right)\leq \alpha.
$$
Plotting this gives us,
&lt;div&gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/115/&quot; target=&quot;_blank&quot; title=&quot;CDF&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/115.png&quot; alt=&quot;CDF&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:115&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
with corresponding PDF given by,
&lt;div &gt;
    &lt;a href=&quot;https://plot.ly/~alstated1a61/128/&quot; target=&quot;_blank&quot; title=&quot;Density Function&quot; style=&quot;display: block; text-align: center;&quot;&gt;&lt;img src=&quot;https://plot.ly/~alstated1a61/128.png&quot; alt=&quot;Density Function&quot; style=&quot;max-width: 100%;&quot;  onerror=&quot;this.onerror=null;this.src=&#39;https://plot.ly/404.png&#39;;&quot; /&gt;&lt;/a&gt;
    &lt;script data-plotly=&quot;alstated1a61:128&quot; src=&quot;https://plot.ly/embed.js&quot; async&gt;&lt;/script&gt;
&lt;/div&gt;
Implying,
$$
\frac{k-\theta_0}{\theta_0/\sqrt{n}}=z_{\alpha}\Rightarrow k=\theta_0+z_{\alpha}\frac{\theta_0}{\sqrt{n}}.
$$
Therefore, a level-$\alpha$ test of $H_0:\theta=\theta_0$ vs $H_1:\theta&lt;\theta_0$ is the test that rejects $H_0$ when $\bar{x}\leq\theta_0+z_{\alpha}\frac{\theta_0}{\sqrt{n}}$.&lt;br/&gt;&lt;br/&gt;
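To make the last statement concrete, here is a short simulation (my addition, assuming $\theta_0=1$, $n=100$, $\alpha=0.05$, and using SciPy for the normal quantile) that computes the cutoff $k=\theta_0+z_{\alpha}\frac{\theta_0}{\sqrt{n}}$ and checks that the test rejects roughly $100\alpha\%$ of the time under $H_0$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
theta0, n, alpha = 1.0, 100, 0.05

# z_alpha is the lower alpha-quantile of the standard normal (negative here)
z_alpha = norm.ppf(alpha)
k = theta0 + z_alpha * theta0 / np.sqrt(n)

# Simulate sample means under H0 and estimate the actual size of the test
reps = 20000
xbars = rng.exponential(scale=theta0, size=(reps, n)).mean(axis=1)
size = np.less_equal(xbars, k).mean()  # should be close to alpha
print(k, size)
```

The estimated size differs slightly from $\alpha$ because the normal distribution is only an asymptotic approximation to the distribution of $\bar{x}$.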
&lt;h3&gt;Plot&#39;s Python Codes&lt;/h3&gt;
In case you are wondering how the above plots were generated:
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/34f4a22120a658e0a980.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://plot.ly/python/&quot; target = &quot;_blank&quot;&gt;Plotly Python Library Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/6414633907837650938/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/04/parametric-inference-likelihood-ratio.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6414633907837650938'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6414633907837650938'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/04/parametric-inference-likelihood-ratio.html' title='Parametric Inference: Likelihood Ratio Test by Example'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-8297983243016383510</id><published>2015-04-25T15:29:00.000+08:00</published><updated>2015-04-25T15:29:40.614+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Descriptive Statistics"/><category scheme="http://www.blogger.com/atom/ns#" term="Probability Theory"/><category scheme="http://www.blogger.com/atom/ns#" term="SAS"/><title type='text'>SAS&amp;reg;: Getting Started with PROC IML</title><content type='html'>Another powerful SAS procedure, and my favorite one, that I would like to share is PROC IML (Interactive Matrix Language). This procedure treats all objects as matrices and is very useful for scientific computations involving vectors and matrices. To get started, we are going to demonstrate and discuss the following:
&lt;ul&gt;
&lt;li&gt;Creating and Shaping Matrices;&lt;/li&gt;
&lt;li&gt;Matrix Query;&lt;/li&gt;
&lt;li&gt;Subscripts;&lt;/li&gt;
&lt;li&gt;Descriptive Statistics;&lt;/li&gt;
&lt;li&gt;Set Operations;&lt;/li&gt;
&lt;li&gt;Probability Functions and Subroutine;&lt;/li&gt;
&lt;li&gt;Linear Algebra;&lt;/li&gt;
&lt;li&gt;Reading and Creating Data;&lt;/li&gt;
&lt;/ul&gt;
The outline above is based on the IML tip sheet (see Reference 1). To begin with the first bullet, consider the following code:
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2497c8d1404f99c11d21.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_659b59db-692c-4d47-b4ee-f8af267a4eaf&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;scalar&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;row_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;col_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;num_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;chr_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Hello,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;world! :D&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX5&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;i_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX6&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;mat_2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX7&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;trow_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX8&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;mat1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;

With the help of the comments in the code, it shouldn&#39;t be difficult to understand what each line does, so I will only explain line 33. In SAS, defined variables are not automatically saved to the workspace; one must store them first and then call them in other procedures by loading the storage, which we&#39;ll see in the next entry -- Matrix Query. The functions we&#39;ll discuss in matrix query involve extracting the number of columns, rows, and so on; below is a sample code for this:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/544f32b4555ea6ea5851.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_7bace06f-a0ca-4a3c-b6e0-8760c483c5ce&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;pre class=&quot;batch&quot; style=&quot;border-spacing: 1px; width:500px&quot;&gt; SYMBOL     ROWS   COLS TYPE   SIZE                     
 ------   ------ ------ ---- ------                     
 CHR_MAT       2      1 char      9                     
 COL_VEC       6      1 num       8                     
 I_MAT         6      6 num       8                     
 MAT1          3      2 num       8                     
 MAT_2         3      1 num       8                     
 NUM_MAT       2      3 num       8                     
 ROW_VEC       1      6 num       8                     
 SCALAR        1      1 num       8                     
 TROW_VEC      6      1 num       8                     
  Number of symbols = 10  (includes those without values)
&lt;/pre&gt;&lt;br/&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;nmat_row&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;nmat_col&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;nmat_dim&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;cmat_len&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX5&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;cmat_nlen&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX6&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;nmat_typ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;N&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX7&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;cmat_typ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;C&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
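For readers following along without SAS, the queries shown in the tables above have close NumPy analogues (a rough sketch of my own, not the post's code; the IML functions named in the comments are nrow, ncol, length, and type from the tip sheet):

```python
import numpy as np

# Matrices mirroring the ones created in the gist above
num_mat = np.array([[1, 2, 3], [4, 5, 6]])
chr_mat = np.array(["Hello,", "world! :D"])

nmat_row = num_mat.shape[0]            # cf. IML nrow(num_mat)   -- 2
nmat_col = num_mat.shape[1]            # cf. IML ncol(num_mat)   -- 3
nmat_dim = num_mat.shape               # rows and columns        -- (2, 3)
cmat_len = [len(s) for s in chr_mat]   # cf. IML length(chr_mat) -- [6, 9]
cmat_nlen = max(cmat_len)              # longest element length  -- 9
nmat_typ = "N" if num_mat.dtype.kind in "if" else "C"  # cf. IML type()
print(nmat_row, nmat_col, nmat_dim, cmat_len, cmat_nlen, nmat_typ)
```

The printed values match the SAS output tables above.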
To load all variables stored in the workspace, we use line 3. The succeeding lines are not difficult to understand, and this is what I love about SAS: the statements and functions are self-explanatory. With that, we proceed to subscripting on matrices; below is the code for it:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9ec1455cb62d691f99ad.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_c86abb13-bf8d-4735-b91e-f90589cb9998&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;NUM_MAT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;n22_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;nr1_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;ir12_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;ic12_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX5&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ngm_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX6&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;ncm_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2.5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3.5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4.5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX7&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;nrm_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX8&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ngs_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX9&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;nrs_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;17&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;29&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;45&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX10&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ncs_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;77&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX11&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;nss_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;91&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX12&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;nrs_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;17&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;29&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;45&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX13&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ncs_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;77&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
Line 17 computes the grand mean of the matrix by simply inserting the &lt;code&gt;:&lt;/code&gt; symbol in the subscript. So if we have &lt;code&gt;num_mat[:, 1]&lt;/code&gt;, the mean is computed over the row entries, giving us the column mean, in this case for the first column. The same goes for &lt;code&gt;num_mat[1, :]&lt;/code&gt;, which computes the mean over the column entries, giving us the row mean. If we replace the symbol in the subscript with &lt;code&gt;+&lt;/code&gt;, we get the sum of the entries instead. Further, the &lt;code&gt;##&lt;/code&gt; symbol returns the sum of squares of the elements, and reducing this to &lt;code&gt;#&lt;/code&gt; returns the product of the elements.&lt;br/&gt;&lt;br/&gt;
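As a rough cross-check outside SAS, the same subscript-reduction ideas can be sketched in Python with NumPy. The matrix below is a hypothetical example, not the one from the gist:

```python
import numpy as np

# Hypothetical 2x3 matrix standing in for num_mat
num_mat = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

grand_mean = num_mat.mean()         # like num_mat[:] in IML
col_means  = num_mat.mean(axis=0)   # like num_mat[:, j], mean over row entries
row_means  = num_mat.mean(axis=1)   # like num_mat[i, :], mean over column entries
total      = num_mat.sum()          # like the + reduction
ssq        = (num_mat ** 2).sum()   # like the ## reduction (sum of squares)
prod       = num_mat.prod()         # like the # reduction (product)
```

The axis argument plays the role of the IML subscript position: axis 0 collapses rows, axis 1 collapses columns.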

Now let&#39;s proceed to the next bullet, which is about Descriptive Statistics.&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/28c4498a94bc8acd622f.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_bc574f49-7da4-479e-a6f4-9f85a2bc330a&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;6&quot; scope=&quot;colgroup&quot;&gt;csr_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;10&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;15&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;csn_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;10&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;15&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;mnr_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;mnn_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;mxr_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX5&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;mxn_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX6&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;smr_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX7&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;smn_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX8&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ssr_vec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;91&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX9&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;ssn_mat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;91&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
To generate random numbers from, say, a normal distribution and compute the mean, standard deviation, and other statistics, consider the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9466735e59f624f2cf07.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_b4cc1ec2-9911-43bc-9bae-76b53bf609ab&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2642335&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.0747269&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.8179241&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.552775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.5401449&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1.233822&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.141535&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.0420036&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.0657322&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.225259&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.148304&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2901233&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1.149394&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.482548&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.452974&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2738675&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.224133&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.218553&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.420015&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.246356&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;54.993687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;58.167325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;59.147705&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;40.74794&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;45.813645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;53.460273&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;57.877839&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;51.98273&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;49.875743&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;52.570553&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;54.097005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;46.936325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;57.509082&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;50.463228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;42.775346&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;39.376643&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;53.303455&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;54.494482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;55.747821&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;44.512206&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;x12&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2642335&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;54.993687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.0747269&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;58.167325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.8179241&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;59.147705&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.552775&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;40.74794&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.5401449&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;45.813645&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1.233822&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;53.460273&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.141535&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;57.877839&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.0420036&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;51.98273&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.0657322&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;49.875743&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1.225259&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;52.570553&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.148304&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;54.097005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2901233&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;46.936325&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1.149394&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;57.509082&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.482548&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;50.463228&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.452974&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;42.775346&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.2738675&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;39.376643&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.224133&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;53.303455&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.218553&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;54.494482&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.420015&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;55.747821&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.246356&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;44.512206&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;x12_cor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.001531&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.001531&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;x12_cov&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.5645625&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.006864&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.006864&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;35.614684&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX5&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x1_mu&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.1126712&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX6&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x2_std&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;5.967804&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
Line 2 above sets the initial random seed for the random numbers generated in line 8. Line 5 allocates a 20-by-1 matrix to the &lt;code&gt;x1&lt;/code&gt; variable, and that&#39;s done using the &lt;code&gt;j&lt;/code&gt; function. The number of rows of &lt;code&gt;x1&lt;/code&gt; represents the sample size of the random numbers needed. One can also set &lt;code&gt;x1&lt;/code&gt; to a row vector, in which case the number of columns represents the sample size. The two sets of random samples, &lt;code&gt;x1&lt;/code&gt; and &lt;code&gt;x2&lt;/code&gt;, generated from the same family of distributions (Gaussian/Normal), are then concatenated column-wise (&lt;code&gt;||&lt;/code&gt;) in line 13 to form a 20-by-2 matrix. Using this new matrix, &lt;code&gt;x12&lt;/code&gt;, we can compute the correlation and covariance of the two columns using the &lt;code&gt;corr&lt;/code&gt; and &lt;code&gt;cov&lt;/code&gt; functions, respectively; the output above tells us that there is almost no relationship between the two columns.&lt;br/&gt;&lt;br/&gt;
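A minimal NumPy sketch of the same workflow, assuming the same distributions (N(0, 1) and N(50, 6)) but a hypothetical seed, so the draws will not match the SAS output:

```python
import numpy as np

rng = np.random.default_rng(12345)     # hypothetical seed, analogue of the IML randseed

x1 = rng.normal(0, 1, size=(20, 1))    # 20 draws from N(0, 1)
x2 = rng.normal(50, 6, size=(20, 1))   # 20 draws from N(50, 6)
x12 = np.hstack([x1, x2])              # column-wise concatenation, like || in IML

x12_cor = np.corrcoef(x12, rowvar=False)  # 2x2 correlation matrix, like corr()
x12_cov = np.cov(x12, rowvar=False)       # 2x2 covariance matrix, like cov()
x1_mu   = x1.mean()                       # sample mean of the first column
x2_std  = x2.std(ddof=1)                  # sample standard deviation of the second
```

Since the two columns are drawn independently, the off-diagonal correlation should be near zero, mirroring the SAS result.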
SAS can also perform set operations, and it&#39;s easy. Consider the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5a3d8e5ee39bc8762858.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_aac492bc-1421-4b71-a5a2-43c5bffcd20d&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;4&quot; scope=&quot;colgroup&quot;&gt;B_comp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;a&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;i&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;m&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;5&quot; scope=&quot;colgroup&quot;&gt;A_comp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;e&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;h&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;r&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;t&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;y&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;10&quot; scope=&quot;colgroup&quot;&gt;AuB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;a&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;e&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;h&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;i&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;m&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;o&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;r&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;t&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;x&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;y&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;AnB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;o&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;10&quot; scope=&quot;colgroup&quot;&gt;AB_unq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;a&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;e&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;h&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;i&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;m&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;o&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;r&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;t&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;x&lt;/td&gt;
&lt;td class=&quot;data&quot;&gt;y&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
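The same set operations can be sketched with Python&#39;s built-in sets. The two character sets below are chosen to be consistent with the output shown above (their union, intersection, and differences match the AuB, AnB, A_comp, and B_comp tables), though the original gist may define them differently:

```python
# Hypothetical character sets standing in for A and B in the SAS code
A = {"e", "h", "o", "r", "t", "y"}
B = {"a", "i", "m", "o", "x"}

AuB    = sorted(A | B)   # union, like the UNION function
AnB    = sorted(A & B)   # intersection, like XSECT
A_comp = sorted(A - B)   # elements of A not in B, like SETDIF
B_comp = sorted(B - A)   # elements of B not in A
```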
The next bullet is about Probability Functions and Subroutines. For example, consider an experiment defined by the random variable $X$, which follows an exponential distribution with mean $\beta = .5$. What is the probability that $X$ is at most 2, $\mathrm{P}(X\leq 2)$? To solve this we use the &lt;code&gt;CDF&lt;/code&gt; function, but note that the exponential density in SAS is given by
$$f(x|\beta)=\frac{1}{\beta}\exp\left[-\frac{x}{\beta}\right].$$
So to compute the probability, we evaluate the following integral,
$$
\mathrm{P}(X\leq 2)=\int_{0}^{2}\frac{1}{.5}\exp\left[-\frac{x}{.5}\right]\operatorname{d}x = 0.9816844
$$
To confirm this in SAS, run the following
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/fd66e2ddb1a7918de2b7.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_9eec5826-d049-4fe3-a8eb-5160994efcb6&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;px&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.9816844&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
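The closed-form CDF of this exponential distribution is $1 - e^{-x/\beta}$, so the SAS value can be checked with a one-liner in Python:

```python
from math import exp

beta = 0.5              # mean of the exponential distribution
x = 2.0
px = 1 - exp(-x / beta) # closed-form CDF: P(X <= 2) = 1 - e^(-4)
```

This agrees with the 0.9816844 reported by the &lt;code&gt;CDF&lt;/code&gt; function above.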
If we take the derivative of the Cumulative Distribution Function (CDF), the resulting expression is what we call the Probability Density Function (PDF). In SAS, we work with this using the &lt;code&gt;PDF&lt;/code&gt; function. For example, we can confirm the above probability by integrating the PDF. To do so, run the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/79470055c1d76cb1926d.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_c6bfb9e4-c541-45b4-a50a-e12563253a89&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;px&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.9816844&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
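The same numerical integration of the PDF can be sketched in plain Python. Here composite Simpson&#39;s rule stands in for whatever quadrature the SAS subroutine uses internally:

```python
from math import exp

beta = 0.5
def pdf(x):
    """Exponential density f(x | beta) = (1/beta) * exp(-x/beta)."""
    return (1.0 / beta) * exp(-x / beta)

# Composite Simpson's rule on [0, 2] with n (even) subintervals
n = 1000
a, b = 0.0, 2.0
h = (b - a) / n
s = pdf(a) + pdf(b)
for i in range(1, n):
    s += (4 if i % 2 else 2) * pdf(a + i * h)
px = s * h / 3  # numerical value of P(X <= 2)
```

The result matches the CDF value 0.9816844 to well beyond the printed precision.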
To end this topic, consider the inverse of the CDF, which is the quantile function. To compute the quantile at the popular significance level $\alpha = 0.05$ from the standard normal distribution, which is $z_{\alpha} = -1.645$ for the lower tail, run &lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f5b4a7a6cc66efe8fa41.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_95ef86ca-49ff-4cc1-9df2-eb84e99f317d&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;z_a&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1.644854&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
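The lower-tail quantile can be verified with Python&#39;s standard library, which provides the inverse CDF of the normal distribution:

```python
from statistics import NormalDist

alpha = 0.05
z_a = NormalDist().inv_cdf(alpha)  # lower-tail quantile of the standard normal
```

This reproduces the -1.644854 returned by the SAS quantile function.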
The next entry is about Linear Algebra, the topic on which this procedure is based. Linear algebra is very useful in Statistics, especially in Regression, Nonlinear Regression, and Multivariate Analysis. To perform it in SAS, consider &lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/335af0c7dd49cf3bd6a5.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_0dbdf489-6427-4e8f-b189-ef771ec3a453&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;xm_det&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;xm_inv&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-3&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-1&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;4.441E-16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX2&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x_evl&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;11.344814&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.1709152&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.515729&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX3&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;3&quot; scope=&quot;colgroup&quot;&gt;x_evc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.3279853&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.591009&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.7369762&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.591009&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.736976&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.3279853&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;0.7369762&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.3279853&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-0.591009&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX4&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x_coef&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap&quot;&gt;-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
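The determinant, inverse, and eigen-decomposition can be cross-checked with NumPy. The matrix below is an assumption inferred from the output: its determinant (-1) and inverse match the xm_det and xm_inv tables above, though the gist itself may define it differently:

```python
import numpy as np

# Symmetric matrix consistent with the determinant and inverse shown above
xm = np.array([[1.0, 2.0, 3.0],
               [2.0, 4.0, 5.0],
               [3.0, 5.0, 6.0]])

xm_det = np.linalg.det(xm)          # determinant, like DET in IML
xm_inv = np.linalg.inv(xm)          # inverse, like INV
x_evl, x_evc = np.linalg.eigh(xm)   # eigenvalues/eigenvectors of a symmetric matrix

# Solving the linear system xm * b = y, like the SOLVE function
y = np.array([1.0, 2.0, 3.0])
b = np.linalg.solve(xm, y)
```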
Finally, one of the coolest capabilities of SAS/IML is reading and creating SAS data sets. The following code demonstrates how to read a SAS data set.&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/42957027513b0b8f7918.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;div id=&quot;div_6df58228-c26c-4a98-ba2b-4132cfa11cda&quot; class=&quot;c body&quot;&gt;
&lt;section data-name=&quot;IML&quot; data-sec-type=&quot;proc&quot;&gt;
&lt;div id=&quot;IDX&quot; class=&quot;systitleandfootercontainer&quot; style=&quot;border-spacing: 1px&quot;&gt;
&lt;/div&gt;
&lt;article&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;x_dat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Acura&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Audi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Audi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td class=&quot;data&quot;&gt;Audi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;article id=&quot;IDX1&quot;&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;c b header&quot; scope=&quot;col&quot;&gt;hp_mean&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;215.88551&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/article&gt;
&lt;/section&gt;
&lt;/div&gt;
&lt;/center&gt;
And to create a SAS data set, run the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/b58693b72de397cc3b50.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;col&gt;&lt;col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;COL1&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;COL2&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;COL3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;2&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
To end this post, I want to say that &lt;i&gt;I am loving SAS because of IML&lt;/i&gt;. There are still hidden capabilities of this procedure that I would love to explore and share with my readers, so stay tuned. Another great blog about SAS/IML is &lt;a href=&quot;http://blogs.sas.com/content/iml&quot; target = &quot;_blank&quot;&gt;The DO Loop&lt;/a&gt;, whose author, &lt;a href=&quot;http://blogs.sas.com/content/iml/author/rickwicklin&quot; target = &quot;_blank&quot;&gt;Dr. Rick Wicklin&lt;/a&gt;, is also the principal developer of the procedure and of &lt;a href=&quot;http://support.sas.com/rnd/app/studio/index.html&quot; target = &quot;_blank&quot;&gt;SAS/IML Studio&lt;/a&gt;; do check it out.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;SAS/IML Tip Sheet. &lt;i&gt;&lt;a href=&quot;http://blogs.sas.com/content/iml/files/2011/10/IMLTipSheet.pdf&quot; target = &quot;_blank&quot;&gt;Frequently Used SAS/IML Functions and Subroutines&lt;/a&gt;&lt;/i&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://support.sas.com/documentation/cdl/en/imlug/63541/PDF/default/imlug.pdf&quot; target = &quot;_blank&quot;&gt;SAS/IML 13.2 User Guide&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blogs.sas.com/content/iml/2011/05/06/how-to-numerically-integrate-a-function-in-sas.html&quot; target = &quot;_blank&quot;&gt;Rick Wicklin. The DO Loop. &lt;i&gt;How to numerically integrate a function in SAS&lt;/i&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;style&gt;
.header {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    padding: 2px 5px 2px 5px;
}


.rowheader {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    text-align: center;
    padding: 2px 5px 2px 5px;
}


.data, .dataemphasis {
    background-color: #FFF;
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    text-align: right;
    padding: 2px 5px 2px 5px;
}

.table {
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 1px 1px 1px 1px;
    border-collapse: collapse;
    border-spacing: 0px;
    padding: 5px 5px 5px 5px;
    margin-bottom: 1em;
}

.body {
    color: #000;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    line-height: 1.231;
}
&lt;/style&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/8297983243016383510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/04/sas-getting-started-with-proc-iml.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8297983243016383510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8297983243016383510'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/04/sas-getting-started-with-proc-iml.html' title='SAS&amp;reg;: Getting Started with PROC IML'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-2701868276603194637</id><published>2015-04-16T14:02:00.000+08:00</published><updated>2015-08-17T10:47:45.208+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Interactive Visualization"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Sampling Analysis"/><title type='text'>Python and R: Basic Sampling Problem</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
In this post, I would like to share a simple problem in sampling analysis and demonstrate how to solve it using Python and R. The first two problems are originally from the book &lt;i&gt;Sampling: Design and Analysis&lt;/i&gt; by Sharon Lohr.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Problems&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Let $N=6$ and $n=3$. For purposes of studying sampling distributions, assume that all population values are known.
&lt;br/&gt;&lt;br/&gt;
&lt;center&gt;
&lt;table width = 50%&gt;
&lt;tr&gt;&lt;td&gt;$y_1 = 98$&lt;/td&gt;&lt;td&gt;$y_2 = 102$&lt;/td&gt;&lt;td&gt;$y_3=154$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;$y_4 = 133$&lt;/td&gt;&lt;td&gt;$y_5 = 190$&lt;/td&gt;&lt;td&gt;$y_6=175$&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;br/&gt;
We are interested in $\bar{y}_U$, the population mean. Consider the eight possible samples below.
&lt;br/&gt;&lt;br/&gt;
&lt;center&gt;
&lt;table width = 40%&gt;
&lt;tr&gt;&lt;td&gt;Sample No.&lt;/td&gt;&lt;td&gt;Sample, $\mathcal{S}$&lt;/td&gt;&lt;td&gt;$P(\mathcal{S})$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;1&lt;/td&gt;&lt;td&gt;$\{1,3,5\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;2&lt;/td&gt;&lt;td&gt;$\{1,3,6\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;3&lt;/td&gt;&lt;td&gt;$\{1,4,5\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;4&lt;/td&gt;&lt;td&gt;$\{1,4,6\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;5&lt;/td&gt;&lt;td&gt;$\{2,3,5\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;6&lt;/td&gt;&lt;td&gt;$\{2,3,6\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;7&lt;/td&gt;&lt;td&gt;$\{2,4,5\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;center&quot;&gt;8&lt;/td&gt;&lt;td&gt;$\{2,4,6\}$&lt;/td&gt;&lt;td&gt;$1/8$&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;br/&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;What is the value of $\bar{y}_U$?&lt;/li&gt;
&lt;li&gt;Let $\bar{y}$ be the mean of the sample values. For each sampling plan, find
&lt;ol type = &quot;i&quot;&gt;
&lt;li&gt;$\mathrm{E}\bar{y}$;&lt;/li&gt;
&lt;li&gt;$\mathrm{Var}\bar{y}$;&lt;/li&gt;
&lt;li&gt;$\mathrm{Bias}(\bar{y})$;&lt;/li&gt;
&lt;li&gt;$\mathrm{MSE}(\bar{y})$;&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
Mayr et al. (1994) took an SRS of 240 children who visited their pediatric outpatient clinic. They found the following frequency distribution for the age (in months) of free (unassisted) walking among the children:&lt;br/&gt;&lt;br/&gt;
&lt;center&gt;
&lt;table width = 85%&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;Age (months)&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;9&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;10&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;11&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;12&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;13&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;14&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;15&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;16&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;17&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;18&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;19&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;20&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;Number of Children&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;13&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;35&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;44&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;69&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;36&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;24&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;7&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;3&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;2&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;5&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;1&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;br/&gt;
Find the mean and SE of the age for onset of free walking.
&lt;/li&gt;
&lt;li&gt;
Table 1 gives the cultivated area, in acres, in 1981 for 40 villages in a region (from &lt;i&gt;Theory and Method of Survey&lt;/i&gt;).
Using the (random) arrangement of data in the table, draw a systematic sample of size 8 with random start $r = 2$.&lt;br/&gt;&lt;br/&gt;
&lt;center&gt;
&lt;table width = 80%&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;Village&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;$Y_j$&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;Village&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;$Y_j$&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;Village&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;$Y_j$&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;Village&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;$Y_j$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;1&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;105&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;11&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;319&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;21&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;70&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;31&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;16&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;2&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;625&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;12&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;72&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;22&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;249&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;32&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;439&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;3&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;47&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;13&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;109&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;23&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;384&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;33&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;123&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;4&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;312&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;14&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;91&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;24&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;482&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;34&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;207&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;5&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;327&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;15&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;152&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;25&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;378&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;35&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;145&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;6&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;230&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;16&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;189&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;26&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;111&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;36&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;666&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;7&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;240&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;17&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;365&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;27&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;534&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;37&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;338&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;8&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;203&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;18&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;70&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;28&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;306&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;38&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;624&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;9&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;535&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;19&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;249&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;29&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;655&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;39&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;501&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align = &quot;right&quot;&gt;10&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;275&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;20&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;384&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;30&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;102&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;40&lt;/td&gt;&lt;td align = &quot;right&quot;&gt;962&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Solutions&lt;/h3&gt;
To appreciate the code, I will first share some of the theory behind the solution; but our main focus here is solving the problem computationally using Python and R.
&lt;ol&gt;
&lt;li&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;The value of $\bar{y}_U$ is coded as follows:&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/d98cc60fdb46270f7311.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/9afce899f14128ea12d0.js&quot;&gt;&lt;/script&gt;
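For readers who prefer an inline example, here is a minimal self-contained Python sketch of the same computation (the gists above hold the original code):

```python
# Population values from the problem (N = 6)
y = [98, 102, 154, 133, 190, 175]

# Population mean, ybar_U
ybar_U = sum(y) / len(y)
print(ybar_U)  # 142.0
```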
&lt;/li&gt;
&lt;li&gt;
To obtain the samples with the indices given in the table above, we first generate all ${6\choose 3} = 20$ combinations of three population indices, where the first two combinations are $\{1,2,3\}$ and $\{1,2,4\}$, and so on. From this list we then draw the combinations listed in the table as our samples; the first sample index, $\{1,3,5\}$, corresponds to the population units $\{98, 154, 190\}$. The following code implements this sampling design:&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/bca1e32f6ab9c7e2a6f2.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/9b67c3e8444c6fcc620c.js&quot;&gt;&lt;/script&gt;
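As an inline sketch of the sampling design just described (the linked gists contain the original code), one way to express it in Python is:

```python
from itertools import combinations

y = [98, 102, 154, 133, 190, 175]

# All 6-choose-3 = 20 combinations of population indices (1-based)
all_combos = list(combinations(range(1, 7), 3))

# The eight samples listed in the table
chosen = [(1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6),
          (2, 3, 5), (2, 3, 6), (2, 4, 5), (2, 4, 6)]

# Map each sample's indices to the population units
samples = [[y[i - 1] for i in s] for s in chosen]
print(samples[0])  # [98, 154, 190]
```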
&lt;ol type = &quot;i&quot;&gt;
&lt;li&gt;
Now to obtain the expected value of the sample mean, we compute $\mathrm{E}\bar{y}=\sum_{k}\bar{y}_k\mathrm{P}(\bar{y}_k)=\sum_{k}\bar{y}_k\mathrm{P}(\mathcal{S}_k)$, $\forall k\in\{1,\cdots,8\}$. So for $k = 1$, 
$$
\begin{aligned}
\bar{y}_1\mathrm{P}(\mathcal{S}_1)&amp;=\frac{98+154+190}{3}\mathrm{P}(\mathcal{S}_1)\\
&amp;=\frac{98+154+190}{3}\left(\frac{1}{8}\right)=18.41667.
\end{aligned}
$$ 
Applying this to the remaining seven values of $k$ and summing the terms gives $\mathrm{E}\bar{y}$. The following code is the equivalent:&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/ef0892e9be3024231b3a.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/3e5813edd196ab931014.js&quot;&gt;&lt;/script&gt;
From the above code, the output tells us that $\mathrm{E}\bar{y}=142$.
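The expectation can also be sketched inline in Python, independently of the gists above:

```python
y = [98, 102, 154, 133, 190, 175]
chosen = [(1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6),
          (2, 3, 5), (2, 3, 6), (2, 4, 5), (2, 4, 6)]

# Each of the eight samples has probability 1/8
p = 1 / 8
sample_means = [sum(y[i - 1] for i in s) / 3 for s in chosen]

# E(ybar) as the probability-weighted sum of the sample means
e_ybar = sum(m * p for m in sample_means)
print(round(e_ybar, 4))
```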
&lt;/li&gt;
&lt;li&gt;
Next we compute the variance of $\bar{y}$, which is $\mathrm{Var}\bar{y}=\mathrm{E}\bar{y}^{2}-(\mathrm{E}\bar{y})^2$. We therefore need a function for $\mathrm{E}\bar{y}^2$, whose first term ($k=1$) is $\bar{y}_1^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2\mathrm{P}(\mathcal{S}_1)=\left(\frac{98+154+190}{3}\right)^2\left(\frac{1}{8}\right)=2713.3889$. Applying this to the other terms and summing them up, we have the following code:&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/965349d8ec12455a96c6.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/9594aaf87ad6dfe4fde4.js&quot;&gt;&lt;/script&gt;
Using the above output, 20182.94, and subtracting $(\mathrm{E}\bar{y})^2$ from it gives the variance; hence the succeeding code:&lt;br/&gt;&lt;br/&gt;
Python Code:
&lt;script src=&quot;https://gist.github.com/alstat/c213a63513ba21774ec3.js&quot;&gt;&lt;/script&gt;
R Code:
&lt;script src=&quot;https://gist.github.com/alstat/390b6ba65650532f96de.js&quot;&gt;&lt;/script&gt;
So the variance of $\bar{y}$ is $18.9444$.
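The same variance computation can be sketched inline (a minimal Python version, not the gist code):

```python
y = [98, 102, 154, 133, 190, 175]
chosen = [(1, 3, 5), (1, 3, 6), (1, 4, 5), (1, 4, 6),
          (2, 3, 5), (2, 3, 6), (2, 4, 5), (2, 4, 6)]
p = 1 / 8

sample_means = [sum(y[i - 1] for i in s) / 3 for s in chosen]
e_ybar = sum(m * p for m in sample_means)           # E(ybar)
e_ybar_sq = sum(m ** 2 * p for m in sample_means)   # E(ybar squared)

# Var(ybar) = E(ybar squared) minus the square of E(ybar)
var_ybar = e_ybar_sq - e_ybar ** 2
print(round(e_ybar_sq, 2), round(var_ybar, 4))
```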
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
The $\mathrm{Bias}$ is the difference between the expected value of the estimator and the true value. Since the estimator is unbiased ($\mathrm{E}\bar{y}=142=\bar{y}_U$), $\mathrm{Bias}(\bar{y})=142-142=0$.
&lt;/li&gt;
&lt;li&gt;
$\mathrm{MSE}(\bar{y})=\mathrm{Var}\bar{y}+(\mathrm{Bias}\bar{y})^2$, and since $\mathrm{Bias}\bar{y}=0$, $\mathrm{MSE}(\bar{y})=\mathrm{Var}\bar{y}$.
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
First we need the probability of each age, obtained by dividing the number of children at each age by the total number of children. This is what the &lt;code&gt;p_s&lt;/code&gt; function defined below computes. With the probabilities in hand, we can compute the expected value using the &lt;code&gt;expectation&lt;/code&gt; function we defined earlier.&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/7c1c0090a03c389fd935.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/b4eb9a7692ed163538d6.js&quot;&gt;&lt;/script&gt;
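The probability-weighted mean can be sketched inline as follows (a self-contained Python version of the same idea):

```python
ages = list(range(9, 21))  # ages 9 through 20 months
counts = [13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1]

n = sum(counts)  # 240 children in the SRS
probs = [c / n for c in counts]

# Expected value of the age at onset of free walking
mean_age = sum(a * p for a, p in zip(ages, probs))
print(round(mean_age, 2))  # 12.08
```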

It should be clear from the data that the average age is about 12 months; the plot is shown below.

&lt;iframe src=&quot;http://cdn.rawgit.com/alstat/Analysis-with-Programming/master/2015/Python/Sampling-Design/samp1.html&quot; seamless height=&quot;515px&quot; width=&quot;100%&quot;, frameborder = 0&gt;&lt;/iframe&gt;

For the code of the above plot, please click &lt;a href=&quot;https://gist.github.com/alstat/594c7074890f4dc8b4d1&quot; target = &quot;_blank&quot;&gt;here&lt;/a&gt;. Next we compute the standard error, which is just the square root of the variance of the sample:&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/34b12a892e8cb311c9e5.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/26d69c90d12d68bdd7f8.js&quot;&gt;&lt;/script&gt;
So the standard error of the age is 1.920824.
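The square-root-of-variance computation can be sketched inline as well (again a minimal Python version, not the gist code):

```python
import math

ages = list(range(9, 21))
counts = [13, 35, 44, 69, 36, 24, 7, 3, 2, 5, 1, 1]
n = sum(counts)
probs = [c / n for c in counts]

mean_age = sum(a * p for a, p in zip(ages, probs))
# Variance as E(X squared) minus the squared mean, then the square root
var_age = sum(a ** 2 * p for a, p in zip(ages, probs)) - mean_age ** 2
se_age = math.sqrt(var_age)
print(round(se_age, 6))  # 1.920824
```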
&lt;/li&gt;
&lt;li&gt;
Let me give a brief discussion of systematic sampling to help you understand the code. The idea is that, given the population units numbered from 1 to $N$, we compute the sampling interval $k = \frac{N}{n}$, where $n$ is the number of units needed for the sample. We then choose a random start, a number between $1$ and $k$. The random start is the first sample unit, the second unit is obtained by adding the sampling interval to the random start, and so on. There are two types of systematic sampling: linear and circular. Circular systematic sampling treats the population units numbered $1$ to $N$ as arranged in a circle, so that if an increment step goes beyond $N$, say to $N+2$, the sampled unit is the $2^{nd}$ element of the population, and so on. The code I share below can be used for both linear and circular sampling, but only for this particular problem, since some rules of the linear scheme (for example, the case where $k$ is not a whole number) are not handled by the function; you can always extend it to a more general one.&lt;br/&gt;&lt;br/&gt;
Python Code
&lt;script src=&quot;https://gist.github.com/alstat/06e459265e7c1dfa9b17.js&quot;&gt;&lt;/script&gt;
R Code
&lt;script src=&quot;https://gist.github.com/alstat/77674937787928723e02.js&quot;&gt;&lt;/script&gt;
You may notice in the output above that the index returned in Python is not the same as the index returned in R. This is because Python indexing starts at 0, while R indexing starts at 1. That is why the same population units are sampled in the two languages despite the difference in the returned indices.
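The circular scheme described above can be sketched in Python as follows; the function name and structure are my own for illustration, not the gist code:

```python
def systematic_sample(N, n, r):
    """Circular systematic sampling: returns 1-based indices,
    starting at random start r and stepping by k = N // n,
    wrapping around the population when an index passes N."""
    k = N // n
    return [(r - 1 + i * k) % N + 1 for i in range(n)]

# Acreage of the 40 villages, in village order (Table 1)
acreage = [105, 625, 47, 312, 327, 230, 240, 203, 535, 275,
           319, 72, 109, 91, 152, 189, 365, 70, 249, 384,
           70, 249, 384, 482, 378, 111, 534, 306, 655, 102,
           16, 439, 123, 207, 145, 666, 338, 624, 501, 962]

idx = systematic_sample(40, 8, 2)
sample = [acreage[i - 1] for i in idx]
print(idx)     # [2, 7, 12, 17, 22, 27, 32, 37]
print(sample)  # [625, 240, 72, 365, 249, 534, 439, 338]
```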
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Sampling-Design-Analysis-Advanced-Series/dp/0495105279&quot; target=&quot;_blank&quot;&gt;Lohr, Sharon (2009). &lt;i&gt;Sampling: Design and Analysis&lt;/i&gt;. Cengage Learning.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/2701868276603194637/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/04/python-and-r-basic-sampling-problem.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2701868276603194637'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2701868276603194637'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/04/python-and-r-basic-sampling-problem.html' title='Python and R: Basic Sampling Problem'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-6122766998085853717</id><published>2015-03-06T19:27:00.000+08:00</published><updated>2015-03-06T19:27:09.889+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="Probability Theory"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><title type='text'>Probability Theory: Convergence in Distribution Problem</title><content type='html'>&lt;div class=&quot;plotdiv&quot; id=&quot;31e7723b-f40b-4bd6-a02f-14762191a179&quot;&gt;
Let&#39;s solve a theoretical problem in probability, specifically on convergence. The problem below is originally Exercise 5.42 of Casella and Berger (2001), and I just want to share my solution. If there is an incorrect argument below, I would be happy if you could point it out to me.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Problem&lt;/h3&gt;
Let $X_1, X_2,\cdots$ be iid (independent and identically distributed) and $X_{(n)}=\max_{1\leq i\leq n}X_i$.
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;If $X_i\sim$ beta(1,$\beta$), find a value of $\nu$ so that $n^{\nu}(1-X_{(n)})$ converges in distribution;&lt;/li&gt;
&lt;li&gt;If $X_i\sim$ exponential(1), find a sequence $a_n$ so that $X_{(n)}-a_n$ converges in distribution.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Solution&lt;/h3&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;Let $Y_n=n^{\nu}(1-X_{(n)})$. We say that $Y_n\rightarrow Y$ in distribution if
$$\lim_{n\rightarrow \infty}F_{Y_n}(y)=F_Y(y).$$
Then,
$$
\begin{aligned}
\lim_{n\rightarrow\infty}F_{Y_n}(y)&amp;=\lim_{n\rightarrow\infty}P(Y_n\leq y)=\lim_{n\rightarrow\infty}P(n^{\nu}(1-X_{(n)})\leq y)\\
&amp;=\lim_{n\rightarrow\infty}P\left(1-X_{(n)}\leq \frac{y}{n^{\nu}}\right)\\
&amp;=\lim_{n\rightarrow\infty}P\left(-X_{(n)}\leq \frac{y}{n^{\nu}}-1\right)=\lim_{n\rightarrow\infty}\left[1-P\left(-X_{(n)}&gt; \frac{y}{n^{\nu}}-1\right)\right]\\
&amp;=\lim_{n\rightarrow\infty}\left[1-P\left(\max\{X_1,X_2,\cdots,X_n\}&lt; 1-\frac{y}{n^{\nu}}\right)\right]\\
&amp;=\lim_{n\rightarrow\infty}\left[1-P\left(X_1&lt; 1-\frac{y}{n^{\nu}},X_2&lt; 1-\frac{y}{n^{\nu}},\cdots,X_n&lt; 1-\frac{y}{n^{\nu}}\right)\right]\\
&amp;=\lim_{n\rightarrow\infty}\left[1-P\left(X_1&lt; 1-\frac{y}{n^{\nu}}\right)^n\right],\;\text{since}\;X_i&#39;s\;\text{are iid.}
\end{aligned}
$$
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
And because $X_i\sim$ beta(1,$\beta$), the density is
$$
f_{X_1}(x)=\begin{cases}
\beta(1-x)^{\beta - 1}&amp;\beta&gt;0, 0\leq x\leq 1\\
0,&amp;\mathrm{Otherwise}
\end{cases}
$$
Implies,
$$
\begin{aligned}
\lim_{n\to \infty}P(Y_n\leq y)&amp;=\lim_{n\to \infty}\left\{1-\left[\int_0^{1-\frac{y}{n^{\nu}}}\beta(1-t)^{\beta-1}\,\mathrm{d}t\right]^n\right\}\\
&amp;=\lim_{n\to \infty}\left\{1-\left[-\int_1^{\frac{y}{n^{\nu}}}\beta u^{\beta-1}\,\mathrm{d}u\right]^{n}\right\}\\
&amp;=\lim_{n\to \infty}\left\{1-\left[-\beta\frac{u^{\beta}}{\beta}\bigg|_{u=1}^{u=\frac{y}{n^{\nu}}}\right]^{n}\right\}\\
&amp;=1-\lim_{n\to \infty}\left[1-\left(\frac{y}{n^{\nu}}\right)^{\beta}\right]^{n}
\end{aligned}
$$
We can simplify the limit if $\nu=\frac{1}{\beta}$, that is
$$
\lim_{n\to\infty}P(Y_n\leq y)=1-\lim_{n\to\infty}\left[1-\frac{y^{\beta}}{n}\right]^{n}=1-e^{-y^{\beta}}
$$
To confirm this in Python, run the following code using the sympy module&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a33ba59e2d9dedf0c1ef.js&quot;&gt;&lt;/script&gt;
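A similar symbolic check can be sketched inline, assuming the sympy module is available (this is my own minimal version; the gist above holds the original):

```python
import sympy as sp

n = sp.symbols('n', positive=True)
y, beta = sp.symbols('y beta', positive=True)

# With nu = 1/beta, the probability in question is [1 - y**beta / n]**n;
# its limit as n grows without bound should be exp(-y**beta)
expr = (1 - y**beta / n)**n
lim = sp.limit(expr, n, sp.oo)
print(lim)
```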
Therefore, if $1-e^{-y^{\beta}}$ is a distribution function of $Y$, then $Y_n=n^{\nu}(1-X_{(n)})$ converges in distribution to $Y$ for $\nu=\frac{1}{\beta}$.&lt;br/&gt;
$\hspace{12.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;
$$
\begin{aligned}
P(X_{(n)}-a_{n}\leq y) &amp;= P(X_{(n)}\leq y + a_n)=P(\max\{X_1,X_2,\cdots,X_n\}\leq y+a_n)\\
&amp;=P(X_1\leq y+a_n,X_2\leq y+a_n,\cdots,X_n\leq y+a_n)\\
&amp;=P(X_1\leq y+a_n)^n,\;\text{since}\;X_i&#39;s\;\text{are iid}\\
&amp;=\left[\int_{-\infty}^{y+a_n}f_{X_1}(t)\,\mathrm{d}t\right]^n
\end{aligned}
$$
Since $X_i\sim$ exponential(1), then the density is
$$
f_{X_1}=\begin{cases}
e^{-x},&amp;0\leq x\leq \infty\\
0,&amp;\mathrm{otherwise}
\end{cases}
$$
So that,
$$
\begin{aligned}
P(X_{(n)}-a_{n}\leq y)&amp;=\left[\int_{0}^{y+a_n}e^{-t}\,\mathrm{d}t\right]^n=\left\{-\left[e^{-(y+a_n)}-1\right]\right\}^n\\
&amp;=\left[1-e^{-(y+a_n)}\right]^n
\end{aligned}
$$
If we let $Y_n=X_{(n)}-a_n$, then we say that $Y_n\rightarrow Y$ in distribution if
$$
\lim_{n\to\infty}P(Y_n\leq y)=P(Y\leq y)
$$
Therefore,
$$
\begin{aligned}
\lim_{n\to\infty}P(Y_n\leq y) &amp;= \lim_{n\to\infty}P(X_{(n)}-a_n\leq y)=\lim_{n\to \infty}\left[1-e^{-y-a_n}\right]^n\\
&amp;=\lim_{n\to\infty}\left[1-\frac{e^{-y}}{e^{a_n}}\right]^n
\end{aligned}
$$ 
We can simplify the limit if $a_n=\log n$, that is
$$
\lim_{n\to\infty}\left[1-\frac{e^{-y}}{e^{\log n}}\right]^n=\lim_{n\to\infty}\left[1-\frac{e^{-y}}{n}\right]^n=e^{-e^{-y}}
$$
Check this in Python by running the following code,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/17eee1aa1839d88e10b7.js&quot;&gt;&lt;/script&gt;
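The limit can likewise be checked symbolically, assuming the sympy module (again my own minimal sketch rather than the gist code):

```python
import sympy as sp

n = sp.symbols('n', positive=True)
y = sp.symbols('y', real=True)

# With a_n = log(n), the probability in question is [1 - exp(-y)/n]**n;
# its limit as n grows without bound should be exp(-exp(-y))
expr = (1 - sp.exp(-y) / n)**n
lim = sp.limit(expr, n, sp.oo)
print(lim)
```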
In conclusion, if $e^{-e^{-y}}$ is a distribution function of $Y$, then $Y_n=X_{(n)}-a_n$ converges in distribution to $Y$ for the sequence $a_n=\log n$.&lt;br/&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/6122766998085853717/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/03/probability-theory-convergence-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6122766998085853717'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/6122766998085853717'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/03/probability-theory-convergence-in.html' title='Probability Theory: Convergence in Distribution Problem'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-48310061215213779</id><published>2015-02-26T22:34:00.001+08:00</published><updated>2015-02-27T19:58:59.090+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Infographic"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>R: How to Layout and Design an Infographic</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
As promised from my &lt;a href=&quot;http://alstatr.blogspot.com/2015/02/philippine-infographic-recapitulation.html&quot; target=&quot;_blank&quot;&gt;recent article&lt;/a&gt;, here&#39;s my tutorial on how to layout and design an infographic in R. This article will serve as a template for more infographic design that I plan to share on future posts. Hence, we will go through the following sections:

&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Layout - mainly handled by the &lt;a href=&quot;http://cran.r-project.org/web/packages/grid/index.html&quot; target=&quot;_blank&quot;&gt;grid&lt;/a&gt; package.&lt;/li&gt;
&lt;li&gt;Design - style of the elements in the layout.
&lt;ul&gt;
&lt;li&gt;Texts - use the &lt;a href=&quot;http://cran.r-project.org/web/packages/extrafont/index.html&quot; target=&quot;_blank&quot;&gt;extrafont&lt;/a&gt; package for custom fonts;&lt;/li&gt;
&lt;li&gt;Shapes (lines and point characters) - use &lt;a href=&quot;http://cran.r-project.org/web/packages/grid/index.html&quot; target=&quot;_blank&quot;&gt;grid&lt;/a&gt;. Although this package no longer appears on CRAN (as of February 26, 2015; only the archived source remains), grid is in fact a base package included with R by default, so there is usually no need to install it. Check whether it is already available before installing anything.&lt;/li&gt;
&lt;li&gt;Plots - several choices for plotting data in R: base graphics, the &lt;a href=&quot;http://cran.r-project.org/web/packages/lattice/index.html&quot; target=&quot;_blank&quot;&gt;lattice&lt;/a&gt; package, or the &lt;a href=&quot;http://cran.r-project.org/web/packages/ggplot2/index.html&quot; target=&quot;_blank&quot;&gt;ggplot2&lt;/a&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
The Infographic&lt;/h3&gt;
We aim to obtain the following layout and design in the final output of our code:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtxW-CXybvA3BcxzObxK9mqlqRJh9-RuQ6dtAMR6u72Z61b-ZrCHMqOVSLeQhvDt8s3kTSngiLplRC2EMzsJj5yOPW36rxesXylN0RlhXzqtdb_dJArP09KW4OA0v24LRqCKMPhmZ4rUcX/s1600/Infographics.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtxW-CXybvA3BcxzObxK9mqlqRJh9-RuQ6dtAMR6u72Z61b-ZrCHMqOVSLeQhvDt8s3kTSngiLplRC2EMzsJj5yOPW36rxesXylN0RlhXzqtdb_dJArP09KW4OA0v24LRqCKMPhmZ4rUcX/s1600/Infographics.png&quot; height=&quot;640&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
To start with, we need to set up our data first. For illustration purposes, we will use simulated data:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/0af7bfceae62e5c37115.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;Design: Colour&lt;/h3&gt;
The aesthetic of an infographic depends not only on the shapes and plots, but also on the colours. So if you are not an artist, I suggest looking first at a list of sample infographics to get some inspiration. Once you have found the theme for your chart, grab its colours. To grab a colour, use the eyedropper tool from software such as Photoshop, Affinity Designer, etc. There is also a free add-on for Mozilla Firefox called &lt;a href=&quot;http://www.colorzilla.com/firefox/&quot; target=&quot;_blank&quot;&gt;ColorZilla&lt;/a&gt;; I haven&#39;t tried it, but you could explore it. For the above theme, there are five colours with the following hexadecimal colour codes: &lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Colour Name&lt;/th&gt;&lt;th&gt;Hexadecimal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: Colours Used in the Chart.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Dark Violet&lt;/td&gt;&lt;td&gt;&lt;code&gt;#552683&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;Dark Yellow&lt;/td&gt;&lt;td&gt;&lt;code&gt;#E7A922&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;White&lt;/td&gt;&lt;td&gt;&lt;code&gt;#FFFFFF&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;Gray (Infographic Text)&lt;/td&gt;&lt;td&gt;&lt;code&gt;#A9A8A7&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dark Yellow (Crime Text)&lt;/td&gt;&lt;td&gt;&lt;code&gt;#CA8B01&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
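For convenience in the code that follows, these colours can be collected in a named vector. This is a minimal sketch: the object and entry names here are mine, while the hex codes are the ones listed in Table 1.

```r
# A sketch: the vector and entry names are hypothetical;
# the hex codes are taken from Table 1 above
kobe_colours <- c(
  dark_violet = "#552683",
  dark_yellow = "#E7A922",
  white       = "#FFFFFF",
  gray_text   = "#A9A8A7",
  crime_text  = "#CA8B01"
)
```

Referring to, say, `kobe_colours["dark_violet"]` is then less error-prone than retyping the hex strings throughout the script.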
&lt;h3&gt;
Design: Data Visualization&lt;/h3&gt;
At this point, we&#39;ll prepare the elements in the layout, beginning with the plots. Below is the bar plot of &lt;code&gt;y1&lt;/code&gt; in the data frame &lt;code&gt;dat&lt;/code&gt;, in three groupings, &lt;code&gt;grp&lt;/code&gt;. Note that the plot you&#39;ll obtain will not be the same as the one below, since the data change every time we run the simulation above.
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3603SHYC8RBWqdffrgJuY4mqA1NHDfwCKuekuyZ0Q9VBNi72DaesLYeoN8vCkja4XmIttrCxOXlmdOaYXtjTDzwdTetpbiqz6Sxvt6AODp_jfsf_Cqz5JKNQxyDCTkAXR5ww-YM9m-cNh/s1600/Rplot08.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/7f1369f458163f8be96b.js&quot;&gt;&lt;/script&gt;
So that&#39;s the default theme of ggplot2, and we want to customize it using the &lt;code&gt;theme&lt;/code&gt; function. One of the elements in the plot that will be tweaked is the font. To deal with this, we need to import the fonts using the extrafont package. That is,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/650a227c7871f40ecb2e.js&quot;&gt;&lt;/script&gt;
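In case the gist above doesn't load, the import step is roughly the following. This is a sketch; the exact calls may vary with your extrafont version and platform.

```r
library(extrafont)

font_import()              # scans and registers all fonts installed on the machine
loadfonts(device = "win")  # on Windows, make the fonts available to the graphics device
```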
What happens above is that all fonts installed on your machine are imported. It&#39;s better to import all of them so that we&#39;ll have several choices to play with. For the above infographic, the font used is called &lt;a href=&quot;http://en.wikipedia.org/wiki/Impact_(typeface)&quot; target=&quot;_blank&quot;&gt;Impact&lt;/a&gt;, which is available on Windows and, I think, on Mac as well. If you don&#39;t have it, download and install it first before running the above code. To arrive at the design of the bar plot in the infographic, we use the following theme,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5af5422879f07f6148c6.js&quot;&gt;&lt;/script&gt;
I named it &lt;code&gt;kobe_theme&lt;/code&gt; since, if you recall from my &lt;a href=&quot;http://alstatr.blogspot.com/2015/02/philippine-infographic-recapitulation.html&quot; target=&quot;_blank&quot;&gt;previous article&lt;/a&gt;, the above chart is inspired by the &lt;a href=&quot;http://www.nba.com/lakers/multimedia/121205kobe30Kinfographic&quot; target=&quot;_blank&quot;&gt;Kobe Bryant Infographic&lt;/a&gt;. Applying this to the plot, we&#39;ll have the following,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTpqaK8jvjf7wRH8tt0h504pdCNMI_Vmrjv2Ho8iCPp9qAYG_4fkAXcm_ufy7xeeylE46S1LyhFfy0rocrsoRnhZ-Ko81Llfjcsi6pnQBInuaxuAYZclY0YC5rw3bpFWi8QXiLjY2_cZDI/s1600/Rplot09.png&quot; /&gt;&lt;/div&gt;
This is obtained by running &lt;code&gt;p1 + kobe_theme()&lt;/code&gt;. If you want to reorder the ticks on the x-axis, starting with A at the top and ending with L at the bottom, simply run the following,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/635ae74604e94e9e6891.js&quot;&gt;&lt;/script&gt;
And you&#39;ll have
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXAnFULQ0c6PhMHl1F70Hoy9L3xdG_xLnBgUgpeljYjEToauaygI1cRn6f_4yvWjSIrgO45JQNgbJ03725IelEDvpVM-y1LjVw_fR8Q4Rmn8EkThasLKqDDlqbn8bXYzUCWPmlOgYsw-dq/s1600/Rplot10.png&quot; /&gt;&lt;/div&gt;
So that&#39;s our first plot. Next is to plot &lt;code&gt;y2&lt;/code&gt; from the &lt;code&gt;dat&lt;/code&gt; data frame, this time using a line plot.
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwd9r86n0nKbW3CZiXESkMFKYS60_I6WZftZ2iaCIye7172Acc9_yj3BdjOaRA1un23aQvskqat_BWjF9O42qhZTqWl1bWXzlfsMfPOmjdoiQ6lb6qA3SdWwM_XOShl670EIhXFONybIgq/s1600/Rplot11.png&quot; /&gt;&lt;/div&gt;
This is obtained by running the following code:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/02a2141db4b4c7e49d29.js&quot;&gt;&lt;/script&gt;
Applying &lt;code&gt;kobe_theme&lt;/code&gt; will give us
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDcLFbp87cZtQkMwELEVqYVSol7pVIHpUA5RHQqHZ9Qi9aLXsWaBpKjKLDBIUEecC2Dmitlfjg9W6bWcCyxue76BmROBa5FClRZ34-S8hQwKivt1sNLHc4f_HaSWCS5whHCwZM5z9VD_16/s1600/Rplot12.png&quot; /&gt;&lt;/div&gt;
The above plot is generated by running &lt;code&gt;p2 + kobe_theme()&lt;/code&gt;. We should expect this, since &lt;code&gt;kobe_theme&lt;/code&gt; was designed for the bar plot with the &lt;code&gt;coord_flip&lt;/code&gt; option enabled, which affects the orientation of the grid lines. So instead, we do a little tweak on the current theme; see for yourself the difference:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/91b98a7834f244ba6c47.js&quot;&gt;&lt;/script&gt;
So that we have the following result:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsFqLS5o8X78PmdA0OkKfO81zYyGzC_sxmgUe3iLXybmEAKG6znEBxEKSHlMHndpjSfd_hD_ja3Ype9LF9cfFVM6_OWuR_W0YWoYXLIvVuGmwwIJogRmZew_2icKYvgf9KZ8Tu5w1MvlWN/s1600/Rplot13.png&quot; /&gt;&lt;/div&gt;
Now that&#39;s better. One more issue is the title label for the legend. To change the label, run the following code:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8991af247f658ac346c0.js&quot;&gt;&lt;/script&gt;
And that will give us
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNdepKcVMzC7FISTve7ccS92WzYuSCdXPH-1dXS09SRedamfDoXzZIHc1gqzzYjdkrVcAaR6FEa-OR12zdraoxgySg3k4VzZ2gBB6GQWke-JDv50irNSAPZkXYp9SpD7Ngjo3rcQIKQVlw/s1600/Rplot14.png&quot; /&gt;&lt;/div&gt;
Finally, the &lt;code&gt;y3&lt;/code&gt; variable is plotted using the following code:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/dc8ca2604cc8895fcd95.js&quot;&gt;&lt;/script&gt;
By default, we have
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4O30KgZg0y8U98nyMN9VajpRo56lbj0Hq6uzTWE7y3mVXWsDmTrTBpidAuN4HrM74QGxYrjLNvRTAB3QrRz1sP2ukB0vwtykJImZvHtiIWVyZ2RlIWfWBU0_N2DlIGzZbB3llQ5EVdllg/s1600/Rplot19.png&quot; /&gt;&lt;/div&gt;
Applying &lt;code&gt;kobe_theme2()&lt;/code&gt;,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfn615hcPRVQyfelTea4uUPQadpc2igjjWddWi9ZsjJPITG9-FFE2IgCjo88z0EoG3CiTbHi3_y32RhhE8w_SgrDwr5gigkX6_qi0M1VxpJsiGGSqnIJhM-d05ZNZQxr1WWdg7RN28gqzD/s1600/Rplot20.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/b715c391f075ac5d396f.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
Layout&lt;/h3&gt;
All plots are now set; next is to place them in the layout. The following steps explain the procedure:
&lt;ol&gt;
&lt;li&gt;Start by creating a new grid page with &lt;code&gt;grid.newpage()&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;Next, define the layout of the grid. Think of this as a matrix of plots, where a 2 by 2 matrix plot gives us 4 windows (two rows and two columns). These windows will serve as placeholders for the plots. So to achieve a matrix plot with 4 rows and 3 columns, we run&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/887cb68f25adc25acec0.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;Next is the background colour; this will be the background colour of the infographic. For the given chart, we run the following:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/47ae8fe3e88f9222b0ec.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;Next is to insert texts into the layout using the &lt;code&gt;grid.text&lt;/code&gt; function. The position of objects/elements such as texts in the grid is defined by (x, y) coordinates. By default the grid is bounded by the unit square (although the aspect ratio can be modified), so the support of x and y is $[0,1]^2$;&lt;/li&gt;
&lt;li&gt;To insert a plot into a specific window of the matrix plot, use the &lt;code&gt;vplayout&lt;/code&gt; function for the coordinates of the placeholder, and &lt;code&gt;print&lt;/code&gt; for pasting. Say we want to insert the first plot in the first row, second column; we code it this way
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/4a9917a11facd5d34cb1.js&quot;&gt;&lt;/script&gt;
Now, to place it in the first row stretched over all (three) columns, run
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9646f44df3cc7029214a.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;/ol&gt;
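Putting the steps above together, a minimal skeleton might look like the following. This is a sketch, not the full infographic code: &lt;code&gt;vplayout&lt;/code&gt; is the usual helper from grid tutorials, the title string is a placeholder, and &lt;code&gt;p1&lt;/code&gt;, &lt;code&gt;p2&lt;/code&gt; stand for the plots built earlier.

```r
library(grid)

# helper: a viewport at row x, column y of the pushed layout
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)

grid.newpage()
pushViewport(viewport(layout = grid.layout(4, 3)))       # 4 rows, 3 columns
grid.rect(gp = gpar(fill = "#552683", col = "#552683"))  # background colour
grid.text("MY TITLE", y = 0.92, gp = gpar(col = "#E7A922", cex = 3))
print(p1, vp = vplayout(1, 2))    # first row, second column
print(p2, vp = vplayout(2, 1:3))  # second row, stretched over all three columns
```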
Using the above procedure, we have the following code for the infographic. Enjoy!
&lt;br /&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/6a38f2fd4927d1d892c7.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
PNG Output&lt;/h3&gt;&lt;br/&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhvUzWRhXAAfprBgXO46dxhwDb5alkPfwLu-vZ7jhI3ntoLDeHpcz-zuSvFdbdyBZb52d8z-pLj8x48Y-YTgY7FSBXQbY_ZI5bScFddf7BqPFIR2f0eMMBXobDxpngTHYqOp-YNEpPnaQkd/s1600/Infographics1.png&quot; height=&quot;640&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;&lt;br/&gt;
&lt;h3&gt;
PDF Output&lt;/h3&gt;&lt;br/&gt;
&lt;iframe class=&quot;scribd_iframe_embed&quot; data-aspect-ratio=&quot;undefined&quot; data-auto-height=&quot;false&quot; frameborder=&quot;0&quot; height=&quot;600&quot; id=&quot;doc_45465&quot; scrolling=&quot;no&quot; src=&quot;https://www.scribd.com/embeds/256799197/content?start_page=1&amp;amp;view_mode=scroll&amp;amp;show_recommendations=true&quot; width=&quot;100%&quot;&gt;&lt;/iframe&gt;&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://docs.ggplot2.org/current/&quot; target = &quot;_blank&quot;&gt;ggplot2&lt;/a&gt; Documentation.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.cookbook-r.com/Graphs/&quot; target = &quot;_blank&quot;&gt;Cookbook for R&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/48310061215213779/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/02/r-how-to-layout-and-design-infographic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/48310061215213779'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/48310061215213779'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/02/r-how-to-layout-and-design-infographic.html' title='R: How to Layout and Design an Infographic'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtxW-CXybvA3BcxzObxK9mqlqRJh9-RuQ6dtAMR6u72Z61b-ZrCHMqOVSLeQhvDt8s3kTSngiLplRC2EMzsJj5yOPW36rxesXylN0RlhXzqtdb_dJArP09KW4OA0v24LRqCKMPhmZ4rUcX/s72-c/Infographics.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-1174353664769461422</id><published>2015-02-18T23:54:00.000+08:00</published><updated>2015-02-25T18:02:43.111+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Infographic"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>Philippine Infographic: Recapitulation on Incidents Involving Motorcycle Riding in Tandem Criminals for 2011-2013</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
The Philippine government launched &lt;a href=&quot;http://data.gov.ph/&quot; target=&quot;_blank&quot;&gt;Open Data Philippines&lt;/a&gt; (data.gov.ph) last year, on January 16, 2014. Accordingly, data.gov.ph aims to make national government data searchable, accessible, and useful, with the help of the different agencies of government and with the participation of the public.&amp;nbsp;This website consolidates the data sets of different government agencies, allowing users to find specific information from a rich and continuously growing collection of public data sets.&lt;br /&gt;
&lt;br /&gt;
Data.gov.ph provides information on how to access these datasets and tools, such as infographics and other applications, to make the information easy to understand. Users may not only view the datasets, but also share and download them as spreadsheets and in other formats, for their own use.&lt;br /&gt;
&lt;br /&gt;
The primary goal of data.gov.ph is to foster a citizenry empowered to make informed decisions, and to promote efficiency and transparency in government. For more, check out the video:&lt;br /&gt;
&lt;br /&gt;
&lt;center&gt;
&lt;iframe allowfullscreen=&quot;&quot; frameborder=&quot;0&quot; height=&quot;281&quot; mozallowfullscreen=&quot;&quot; src=&quot;//player.vimeo.com/video/84140522&quot; webkitallowfullscreen=&quot;&quot; width=&quot;500&quot;&gt;&lt;/iframe&gt;
&lt;/center&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
Admittedly, I only discovered this accidentally a few weeks ago, but it&#39;s still good news for me. I&#39;ve been frustrated about our government data since college; it was difficult to do case studies and research on crime, rainfall, and other interesting variables due to the lack of data available online. With the launch of Open Data Philippines, and believing that data can improve our country, it&#39;s a win-win for me. So, as a first exploration of it, I decided to use the data from the &lt;a href=&quot;http://pnp.gov.ph/portal/&quot; target=&quot;_blank&quot;&gt;Philippine National Police (PNP)&lt;/a&gt; agency about the incidents involving motorcycle riding in tandem criminals. Check out my first infographic below,&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
The Infographic (PNG)&lt;/h3&gt;&lt;br/&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxi-n6FzNc8loi4gE2XFefkJwKviDPDLE1mHZPWQx10M5UBnkAVThkG_bUs5mYHCpEzRD-cl8-NqKUsRaAvqvuj6F98PNXixf7u338CIVWmpL_ZU14cDth3ngjCgsRJRfqh86Gz1uV7iSp/s1600/Infographics.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxi-n6FzNc8loi4gE2XFefkJwKviDPDLE1mHZPWQx10M5UBnkAVThkG_bUs5mYHCpEzRD-cl8-NqKUsRaAvqvuj6F98PNXixf7u338CIVWmpL_ZU14cDth3ngjCgsRJRfqh86Gz1uV7iSp/s1600/Infographics.png&quot; height=&quot;320&quot; width=&quot;160&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;h3&gt;
PDF Version&lt;/h3&gt;
&lt;br /&gt;
&lt;iframe class=&quot;scribd_iframe_embed&quot; data-aspect-ratio=&quot;undefined&quot; data-auto-height=&quot;false&quot; frameborder=&quot;0&quot; height=&quot;600&quot; id=&quot;doc_2732&quot; scrolling=&quot;no&quot; src=&quot;https://www.scribd.com/embeds/256148060/content?start_page=1&amp;amp;view_mode=scroll&amp;amp;show_recommendations=true&quot; width=&quot;100%&quot;&gt;&lt;/iframe&gt;

&lt;br /&gt;
&lt;h3&gt;
What software did I use for creating this infographic?&lt;/h3&gt;
Well, I designed it entirely using &lt;a href=&quot;http://alstatr.blogspot.com/search/label/R&quot; target=&quot;_blank&quot;&gt;R&lt;/a&gt;, with the help of the &lt;a href=&quot;http://cran.r-project.org/web/packages/ggplot2/index.html&quot; target=&quot;_blank&quot;&gt;ggplot2&lt;/a&gt;, &lt;a href=&quot;http://cran.r-project.org/web/packages/grid/index.html&quot; target=&quot;_blank&quot;&gt;grid&lt;/a&gt;, and &lt;a href=&quot;http://cran.r-project.org/package=extrafont&quot; target=&quot;_blank&quot;&gt;extrafont&lt;/a&gt; packages. The above infographic is inspired by the &lt;a href=&quot;http://www.nba.com/lakers/multimedia/121205kobe30Kinfographic&quot; target=&quot;_blank&quot;&gt;Kobe Bryant Infographic&lt;/a&gt;. The hexadecimal colour codes from the said chart were extracted using the eyedropper tool of &lt;a href=&quot;https://affinity.serif.com/en-gb/&quot; target=&quot;_blank&quot;&gt;Affinity Designer&lt;/a&gt;.&lt;br /&gt;
&lt;br /&gt;
I will not share any code in this post, but I will do a tutorial on how to create one, so be notified by subscribing.
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/1174353664769461422/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/02/philippine-infographic-recapitulation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1174353664769461422'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1174353664769461422'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/02/philippine-infographic-recapitulation.html' title='Philippine Infographic: Recapitulation on Incidents Involving Motorcycle Riding in Tandem Criminals for 2011-2013'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxi-n6FzNc8loi4gE2XFefkJwKviDPDLE1mHZPWQx10M5UBnkAVThkG_bUs5mYHCpEzRD-cl8-NqKUsRaAvqvuj6F98PNXixf7u338CIVWmpL_ZU14cDth3ngjCgsRJRfqh86Gz1uV7iSp/s72-c/Infographics.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-4358173786127971101</id><published>2015-02-09T17:37:00.000+08:00</published><updated>2015-02-22T20:35:05.185+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Descriptive Statistics"/><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><title type='text'>Python: Getting Started with Data Analysis</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;a href=&quot;http://alstatr.blogspot.com&quot; target = &quot;_blank&quot;&gt;Analysis with Programming&lt;/a&gt; has recently been syndicated to &lt;a href=&quot;http://planetpython.org&quot; target = &quot;_blank&quot;&gt;Planet Python&lt;/a&gt;. As a first post as a contributing blog on the said site, I would like to share how to get started with data analysis in Python. Specifically, I would like to do the following:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Importing the data
&lt;ul&gt;
&lt;li&gt;Importing CSV file both locally and from the web;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Data transformation;&lt;/li&gt;
&lt;li&gt;Descriptive statistics of the data;&lt;/li&gt;
&lt;li&gt;Hypothesis testing
&lt;ul&gt;
&lt;li&gt;One-sample t test;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Visualization; and&lt;/li&gt;
&lt;li&gt;Creating custom function.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Importing the data&lt;/h3&gt;
This is the crucial step: we need to import the data in order to proceed with the succeeding analysis. Oftentimes data are in CSV format; if not, they can at least be converted to CSV. In Python we can do this using the following code:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c460ddf86c7485a4839b.js&quot;&gt;&lt;/script&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
To read a CSV file locally, we need the &lt;code&gt;pandas&lt;/code&gt; module, which is a Python data analysis library. Its &lt;code&gt;read_csv&lt;/code&gt; function can read data both locally and from the web.&lt;br /&gt;
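As a self-contained illustration, the sketch below reads from an in-memory buffer, since the actual CSV location is specific to each reader; &lt;code&gt;read_csv&lt;/code&gt; accepts a local path or a URL in exactly the same way (the commented path and URL are hypothetical).

```python
import io
import pandas as pd

# A minimal sketch of reading CSV data with pandas.
# Column names follow the provinces used later in this post.
csv_text = "Abra,Apayao,Benguet\n1243,2934,148\n4158,9235,4287\n"
df = pd.read_csv(io.StringIO(csv_text))

# df = pd.read_csv("data.csv")                       # hypothetical local file
# df = pd.read_csv("https://example.com/data.csv")   # hypothetical URL
```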
&lt;br /&gt;
&lt;h3&gt;
Data transformation&lt;/h3&gt;
Now that we have the data in the workspace, next is to do transformations. Statisticians and scientists often do this step to remove unnecessary data not included in the analysis. Let&#39;s view the data first:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/89e95aa0504f25c3d4fd.js&quot;&gt;&lt;/script&gt;
To R programmers, the above is the equivalent of &lt;code&gt;print(head(df))&lt;/code&gt;, which prints the first six rows of the data, and &lt;code&gt;print(tail(df))&lt;/code&gt;, which prints the last six rows. In Python, however, the number of rows shown by head is 5 by default, unlike in R where it is 6. So the equivalent of the R code &lt;code&gt;head(df, n = 10)&lt;/code&gt; in Python is &lt;code&gt;df.head(n = 10)&lt;/code&gt;. The same goes for the tail of the data.&lt;br /&gt;
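The head/tail correspondence can be sketched on a toy data frame (standing in for the post's dataset):

```python
import pandas as pd

# Toy data frame with 12 rows; column names are illustrative
df = pd.DataFrame({"Abra": range(12), "Apayao": range(12)})

print(df.head())      # first 5 rows (R's head() shows 6 by default)
print(df.tail(n=10))  # last 10 rows, mirroring R's tail(df, n = 10)
```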
&lt;br /&gt;
Column and row names of the data are extracted using the &lt;code&gt;colnames&lt;/code&gt; and &lt;code&gt;rownames&lt;/code&gt; functions in R, respectively. In Python, we extract it using the &lt;code&gt;columns&lt;/code&gt; and &lt;code&gt;index&lt;/code&gt; attributes. That is,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/92a80fb9b76c1bc0e5a6.js&quot;&gt;&lt;/script&gt;
Transposing the data is obtained using the &lt;code&gt;T&lt;/code&gt; attribute,
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c7ca9cce05d8cf41a45f.js&quot;&gt;&lt;/script&gt;
Other transformations, such as sorting, can be done using the &lt;code&gt;sort&lt;/code&gt; attribute. Now let&#39;s extract a specific column. In Python, we do it using either the &lt;code&gt;iloc&lt;/code&gt; or &lt;code&gt;ix&lt;/code&gt; attributes; &lt;code&gt;ix&lt;/code&gt; is more robust, and thus I prefer it. Assuming we want the head of the first column of the data, we have
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/839451ab69658bccdbc8.js&quot;&gt;&lt;/script&gt;
By the way, indexing in Python starts at 0, not 1. To slice the index and the first three columns of the 11th to 21st rows, run the following:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d2dae542c7edf12c0b23.js&quot;&gt;&lt;/script&gt;
This is equivalent to &lt;code&gt;print df.ix[10:20, [&#39;Abra&#39;, &#39;Apayao&#39;, &#39;Benguet&#39;]]&lt;/code&gt;&lt;br /&gt;
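Note that &lt;code&gt;ix&lt;/code&gt; has since been removed from pandas, so in current versions the same positional slice is written with &lt;code&gt;iloc&lt;/code&gt;. A sketch on a toy data frame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Abra": range(30), "Apayao": range(30), "Benguet": range(30)})

# Positional slicing: rows 10..20 (the 11th to 21st rows) and the first
# three columns. iloc's end bound is exclusive, so 10:21 yields 11 rows.
subset = df.iloc[10:21, 0:3]
```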
&lt;br /&gt;
To drop columns in the data, say columns 1 (Apayao) and 2 (Benguet), use the &lt;code&gt;drop&lt;/code&gt; attribute. That is,
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9bd963e1ff5637b9e693.js&quot;&gt;&lt;/script&gt;
The &lt;code&gt;axis&lt;/code&gt; argument above tells the function to drop with respect to columns; if &lt;code&gt;axis = 0&lt;/code&gt;, the function drops with respect to rows instead.&lt;br/&gt;
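A sketch of the drop step on a toy data frame (column names as in the post's dataset):

```python
import pandas as pd

df = pd.DataFrame({"Abra": [1, 2], "Apayao": [3, 4], "Benguet": [5, 6]})

# axis=1 drops columns; axis=0 would drop row labels instead.
# drop returns a new frame and leaves df itself unchanged.
dropped = df.drop(["Apayao", "Benguet"], axis=1)
```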
&lt;br /&gt;
&lt;h3&gt;
Descriptive Statistics&lt;/h3&gt;
The next step is to do descriptive statistics for a preliminary analysis of our data, using the &lt;code&gt;describe&lt;/code&gt; attribute:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a3c6f2aa6e1af7179270.js&quot;&gt;&lt;/script&gt;
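A sketch of what &lt;code&gt;describe&lt;/code&gt; computes, on toy data rather than the post's dataset:

```python
import pandas as pd

df = pd.DataFrame({"Abra": [1243, 4158, 1787], "Apayao": [2934, 9235, 1922]})

# Per column: count, mean, std, min, quartiles, and max
summary = df.describe()
```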
&lt;h3&gt;
Hypothesis Testing&lt;/h3&gt;
Python has a great package for statistical inference: the &lt;a href=&quot;http://docs.scipy.org/doc/scipy/reference/stats.html&quot; target = &quot;_blank&quot;&gt;stats&lt;/a&gt; module of scipy. The one-sample t-test is implemented in the &lt;code&gt;ttest_1samp&lt;/code&gt; function. So, if we want to test the mean of Abra&#39;s volume of palay production against a hypothesized population mean of 15000, we have
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e84461f482a3a778e8fe.js&quot;&gt;&lt;/script&gt;
The values returned are a tuple of the following:
&lt;ul&gt;
&lt;li&gt;t : float or array&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;t-statistic&lt;/li&gt;
&lt;li&gt;prob : float or array&lt;br/&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;two-tailed p-value&lt;/li&gt;
&lt;/ul&gt;
From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha=0.05$; hence there is not sufficient evidence to conclude that the average volume of palay production differs from 15000. Applying this test to all variables, again against a population mean of 15000, we have&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2ac5ae476629eb5a21e5.js&quot;&gt;&lt;/script&gt;
The first array returned is the t-statistic of the data, and the second array is the corresponding p-values.&lt;br/&gt;&lt;br/&gt;
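The test call in the gists above can be sketched on simulated data (not the post's palay dataset, so the resulting statistic and p-value will differ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=15000, scale=2000, size=50)  # simulated production volumes

# H0: the population mean equals 15000 (two-tailed test)
t_stat, p_value = stats.ttest_1samp(sample, 15000)
```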
&lt;h3&gt;
Visualization&lt;/h3&gt;
There are several modules for visualization in Python, and the most popular one is the matplotlib library. To mention a few more, we also have the bokeh and seaborn modules to choose from. In my previous &lt;a href=&quot;http://alstatr.blogspot.com/2014/03/python-numerical-description-of-data.html&quot; target = &quot;_blank&quot;&gt;post&lt;/a&gt;, I demonstrated the matplotlib package, which produced the following box-whisker plot,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQmWlqmaVBi8s9mQvecVlgaPOxew07t80JFN7KDj9OgeMdVpWBjKzh-5lrnm5kC3NyAzvAVdwj3OiuftChJtOfoYi02_OGKCKdvHCtfkuiBmRjez-6iYJuO3o-nuZsPbjH8CrIQ1C1VPSY/s1600/myfigmat.png&quot; height=&quot;395&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ea3780959ac7b8a92a37.js&quot;&gt;&lt;/script&gt;
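A runnable approximation of the box-whisker step, on toy data with a headless backend so it runs anywhere; the commented line uses matplotlib's built-in &quot;ggplot&quot; style, a modern alternative for the ggplot-like restyling discussed next:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for the production dataset
df = pd.DataFrame({"Abra": [1243, 4158, 1787], "Apayao": [2934, 9235, 1922]})

# plt.style.use("ggplot")  # uncomment for a ggplot2-like theme
df.boxplot()               # one box-whisker per column
plt.savefig("boxwhisker.png")
```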
Now, plotting via the pandas module can restyle the above plot after the theme of the popular R plotting package, &lt;a href=&quot;http://docs.ggplot2.org/current/index.html&quot; target = &quot;_blank&quot;&gt;ggplot&lt;/a&gt;. To use the ggplot theme, just add one more line to the above code,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/747394fc4e9ebafebab2.js&quot;&gt;&lt;/script&gt;
And you&#39;ll have the following,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3iOubBXYXmfG_em5dX5QQvHf0W0CaoAi4CXqnuSo1yHxyKaWozv6E3qxFWJ5pAMPIrszrkYqgHx0nHdS-OmEnLEkhdwzgocuwCbN3zXddndOscSx1KYOHebePOpFb9P2FOKdlZTG7wuMb/s1600/myfigmatg.png&quot; height=&quot;397&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
Even neater than the default matplotlib.pyplot theme. But in this post, I would like to introduce the seaborn module, which is a statistical data visualization library. With it, we have the following
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAsKNQ7xRccTGUraDNmQ7Q9cYbPe5jKb-nC-73hiajaiEaDpXUEnTpUcGMdH6m7XOvJcmYckUnANyreQ4sMKdymuX5pb1fX77orN6k3y9cwiTFAKs5Wl6C1v3OLBzAdttJEj6ZorqFPxCu/s1600/myfig3.png&quot; height=&quot;397&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/57aeb24456ff6ebb9250.js&quot;&gt;&lt;/script&gt;
Sexy boxplot; scroll down for more.
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3LfhJqVKyx_A4PG8v_PVNbwbmjNQQ-UL1OPXLUUHzNxm1PPSea-IIoP0Wd5uYEGMK3zVKrkaxfTgIo-fdaELriPJbhB_TlSSu8UMGJr1-ucFY2UHQYp33MYphMpY3aiJaPuacn9vLnF3E/s1600/myfig4.png&quot; height=&quot;392&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/01c1edd8cb6a8952e2c1.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPYC_aupHvC_dLAgmpN4iDOtVPMxzLhe54fwVr2Qz1Tbzlsogc_ngzv3qsuyUczKM-uIVsp8pTOroR4NklHq8-1oqQFJgTKtIwhxHKJJwvaoDFYGxQOj7iClEBAdD5Ql0i6GeUpx3HFkzx/s1600/myfig1.png&quot; height=&quot;390&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/1c23218afc7f57307918.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhc-9mp89aJxBNV6l6h-XTCn2RwdxP4WRaGAk8mVefZV4jNvWeBlY-ppgXC-g1plhMxvLRAyfkj-DN0BiMinpYS1bRd3fP6CG-VqnuAy99t7FTkxkdGZm2GdC4oPseXFObYfVR8jGr3mhdp/s1600/myfig2.png&quot; height=&quot;387&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/33bc4428309b2c7e392c.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjraDN3b1_XQbbDDTqDk1-sRWw0CZbWtGonmt3I7xawCbls05_Wgr5OeWF1OWgb9uANV5JnMJRfiyODatxFc7nLjQIt5sj6qe6wiIvAz6EDQHDTgXqaSDigGORYfY3JZSPOzBV6ApT5mj0N/s1600/myfig5.png&quot; height=&quot;400&quot; width=&quot;392&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/65832727c54f560413d2.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;Creating custom function&lt;/h3&gt;
To define a custom function in Python, we use the &lt;code&gt;def&lt;/code&gt; keyword. For example, say we want a function that adds
two numbers; we define it as follows,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ba50893788cfc3a4cb0f.js&quot;&gt;&lt;/script&gt;
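In case the gist does not load, a minimal version of such a function is (the name &lt;code&gt;add_numbers&lt;/code&gt; is mine):

```python
def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b

print(add_numbers(2, 3))  # 5
```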
By the way, in Python indentation is important: the indented block defines the scope of the function, which in R we do with braces &lt;code&gt;{...}&lt;/code&gt;. Now here&#39;s an algorithm from my previous &lt;a href=&quot;http://alstatr.blogspot.com/2014/01/python-and-r-is-python-really-faster.html&quot; target = &quot;_blank&quot;&gt;post&lt;/a&gt;,
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;Generate samples of size 10 from Normal distribution with $\mu$ = 3 and $\sigma^2$ = 5;&lt;/li&gt;
&lt;li&gt;Compute the $\bar{x}$ and $\bar{x}\mp z_{\alpha/2}\displaystyle\frac{\sigma}{\sqrt{n}}$ using the 95% confidence level;&lt;/li&gt;
&lt;li&gt;Repeat the process 100 times; then&lt;/li&gt;
&lt;li&gt;Compute the percentage of the confidence intervals containing the true mean.&lt;/li&gt;
&lt;/ol&gt;
Coding this in Python we have,
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8725353.js&quot;&gt;&lt;/script&gt;
The above code might be easy to read, but it&#39;s slow at replication. Below is an improved version of the code, thanks to Python gurus; see the &lt;a href=&quot;http://alstatr.blogspot.com/2014/01/python-and-r-is-python-really-faster.html#disqus_thread&quot; target = &quot;_blank&quot;&gt;comments&lt;/a&gt; on my previous post.&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8748774.js&quot;&gt;&lt;/script&gt;
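In case the gists do not load, here is a self-contained sketch of the four steps using only the Python standard library (the function name and seed are mine):

```python
import random
import statistics

def ci_coverage(reps=100, n=10, mu=3.0, sigma2=5.0, seed=42):
    """Fraction of z-based 95% confidence intervals containing the true mean."""
    random.seed(seed)
    sigma = sigma2 ** 0.5
    z = 1.959964                      # z_{alpha/2} for a 95% confidence level
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        # Step 1: sample of size n from Normal(mu, sigma^2)
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        # Step 2: check whether the interval xbar -/+ margin contains mu
        xbar = statistics.mean(sample)
        if xbar - margin < mu < xbar + margin:
            hits += 1
    # Steps 3-4: repeat reps times and report the coverage percentage
    return hits / reps

print(ci_coverage())  # roughly 0.95
```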
&lt;h3&gt;Update&lt;/h3&gt;
For those who are interested in the IPython notebook version of this article, please click &lt;a href=&quot;http://nuttenscl.be/Python_Getting_Started_with_Data_Analysis.html&quot; target = &quot;_blank&quot;&gt;here&lt;/a&gt;. This article was converted to an IPython notebook by &lt;a href=&quot;https://twitter.com/NuttensC&quot; target=&quot;_blank&quot;&gt;Nuttens Claude&lt;/a&gt;.&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;
Data Source&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://countrystat.bas.gov.ph/&quot; target=&quot;_blank&quot;&gt;Philippine Bureau of Agricultural Statistics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://pandas.pydata.org/pandas-docs/stable/&quot; target = &quot;_blank&quot;&gt;Pandas&lt;/a&gt;, &lt;a href=&quot;http://docs.scipy.org/doc/&quot; target = &quot;_blank&quot;&gt;Scipy&lt;/a&gt;, and &lt;a href=&quot;http://stanford.edu/~mwaskom/software/seaborn/&quot; target = &quot;_blank&quot;&gt;Seaborn&lt;/a&gt; Documentations.&lt;/li&gt;
&lt;li&gt;Wes McKinney &amp; PyData Development Team (2014). &lt;i&gt;pandas: powerful Python data analysis toolkit&lt;/i&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/4358173786127971101/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/02/python-getting-started-with-data.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/4358173786127971101'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/4358173786127971101'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/02/python-getting-started-with-data.html' title='Python: Getting Started with Data Analysis'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQmWlqmaVBi8s9mQvecVlgaPOxew07t80JFN7KDj9OgeMdVpWBjKzh-5lrnm5kC3NyAzvAVdwj3OiuftChJtOfoYi02_OGKCKdvHCtfkuiBmRjez-6iYJuO3o-nuZsPbjH8CrIQ1C1VPSY/s72-c/myfigmat.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-928073872476658308</id><published>2015-01-30T21:46:00.000+08:00</published><updated>2015-01-31T12:16:03.160+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="LaTeX"/><category scheme="http://www.blogger.com/atom/ns#" term="Probability Theory"/><title type='text'>Multiple Random Variables Problems</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
To probability lovers, I just want to share (and discuss) a few simple problems I solved in Chapter 4 of &lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2002). Statistical Inference&lt;/a&gt;.
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;
A random point $(X,Y)$ is distributed uniformly on the square with vertices $(1, 1),(1,-1),(-1,1),$ and $(-1,-1)$. That is, the joint pdf is $f(x,y)=\frac{1}{4}$ on the square. Determine the probabilities of the following events.

&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;$X^2 + Y^2 &amp;lt; 1$&lt;/li&gt;
&lt;li&gt;$2X-Y&amp;gt;0$&lt;/li&gt;
&lt;li&gt;$|X+Y|&lt;1$ (modified since the original $|X+Y|&lt;2$ is trivial.)&lt;/li&gt;
&lt;/ol&gt;
&lt;i&gt;Solutions:&lt;/i&gt;

&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;$X^2 + Y^2 &amp;lt; 1$&lt;br /&gt;
We first need to consider the boundary of this inequality within the square, so below is the plot of $X^2 + Y^2 = 1$,
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjazWu4wwnTB929g3g7G3sreEGlCO2DHPI_ROr8mXhyphenhyphenxE8V3y6M9UebdAsQD31Jxf2y0stbHfxEQLPjtgotoLSkNFd-XjdgQ0ULFsRGGYQEC1kUbu8wkWFYWSzLtnzOBTFup6GbDwf6Fn6I/s1600/Screenshot+from+2014-10-29+23:25:05.png&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/7d60a55d60fba0671f29&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;br /&gt;
Hence, we are interested in the region inside the circle above, since $X^2 + Y^2$ is less than 1. To compute its probability, notice that the regions in the 4 quadrants are identical except for orientation; thus we can integrate over the first quadrant and simply multiply by 4 to cover the whole region.
\begin{equation}\nonumber
\begin{aligned}
P(X^2 + Y^2 &amp;lt; 1) &amp;amp;= 4\int_{0}^{1}\int_{0}^{\sqrt{1 - x^2}}\frac{1}{4}\operatorname{d}y\operatorname{d}x\\
&amp;amp;= \int_{0}^{1}y\Bigg|_{y=0}^{y=\sqrt{1 - x^2}}\operatorname{d}x = \int_{0}^{1}\sqrt{1 - x^2}\operatorname{d}x\\
&amp;amp;=\left(\frac{x}{2} \sqrt{- x^{2} + 1} + \frac{1}{2} \operatorname{sin}^{-1}{\left (x \right )}\right)\Bigg|_{x=0}^{x=1}\\
&amp;amp;=\frac{\pi}{4}-0=\frac{\pi}{4}.
\end{aligned}
\end{equation}
Confirm this using python symbolic computation,&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2d8afd2b01520e4d3eec.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;Given $2X-Y&amp;gt;0$, we have&lt;br /&gt;
\begin{equation}\nonumber
P(2X-Y&amp;gt;0)=P(-Y&gt;-2X) = P(Y&lt;2X)
\end{equation}
The plot of $Y=2X$ is shown below,
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxkA3398o3emzT12YzXlimxVxLhRSPzJymDkgtYJJ6VHwmY0CarA_dNbCBy2zx3jW-nm6D_BKsph8L_ONXos9IpKUjNYjcRehuey9KB9opW7AB_AAoNlraszohsKYGWl2cnB4gLB6_1ONI/s1600/Screenshot+from+2014-10-30+12:04:08.png&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/f98409701254f446d145&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;
&lt;br/&gt;
The shaded region is the event we are interested in, then
\begin{equation}\nonumber
\begin{aligned}
P(Y &lt; 2X) &amp;= \int_{-1}^{1}\int_{\frac{y}{2}}^{1}\frac{1}{4}\operatorname{d}x\operatorname{d}y=\int_{-1}^{1}\frac{x}{4}\Bigg|_{x=\frac{y}{2}}^{x=1}\operatorname{d}y\\
&amp;=\int_{-1}^{1}\left(\frac{1}{4}-\frac{y}{8}\right)\operatorname{d}y=\left(\frac{y}{4}-\frac{y^2}{16}\right)\Bigg|_{-1}^{1}\\
&amp;=\left(\frac{1}{4}-\frac{1}{16}\right)-\left(-\frac{1}{4}-\frac{1}{16}\right)=\frac{1}{2}.
\end{aligned}
\end{equation}
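This result can be double-checked numerically; a quick Monte Carlo sketch using only the Python standard library (sample size and seed are arbitrary):

```python
import random

# Monte Carlo estimate of P(2X - Y > 0) for (X, Y) uniform on the square
random.seed(2015)
n = 200_000
hits = sum(1 for _ in range(n)
           if 2 * random.uniform(-1, 1) - random.uniform(-1, 1) > 0)
print(hits / n)  # close to 1/2
```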
&lt;/li&gt;
&lt;li&gt; Given $|X+Y| &lt; 1$, we have
\begin{equation}\nonumber
P(|X+Y|&lt;1) = P(-1 &lt; X+Y &lt; 1) = P(-1-X &lt; Y &lt; 1-X)
\end{equation}
The region bounded by the two lines ($Y=1-X$ and $Y=-1-X$) is shaded below
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgptR1LGgZeLQgFn3lMGSIbioDtETuneMKb0GV-pK3PfxSuah4VvtHkPxd1TrIRwlTKN_XdYHkMLxCUbnzuV8PdhrPSD6Lgob50W1sfbWRUpw4vz6jfLa8Ithpa0msZadPng03QgSwUSoXO/s1600/Screenshot+from+2014-10-30+13:40:57.png&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/aef57fc8bbf4f7056afb&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;&lt;br/&gt;
Hence, we have
\begin{equation}\nonumber
\begin{aligned}
P(-1-X &lt; Y &lt; 1-X) &amp;= 2\int_{0}^{1}\int_{-1}^{1-x}\frac{1}{4}\operatorname{d}y\operatorname{d}x\\
&amp;=\int_{0}^{1}\frac{y}{2}\Bigg|_{-1}^{1-x}\operatorname{d}x=\int_{0}^{1}\left(\frac{1-x}{2}+\frac{1}{2}\right)\operatorname{d}x\\
&amp;=\left(x-\frac{x^2}{4}\right)\Bigg|_{0}^{1}=\frac{3}{4}.
\end{aligned}
\end{equation}
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
A pdf is defined by
\begin{equation}\nonumber
f(x,y) = \begin{cases}
C (x+2y) &amp; \text{if}\;0 &lt; y &lt; 1\;\text{and}\;0 &lt; x &lt; 2\\
0 &amp; \text{otherwise}.
\end{cases}
\end{equation}
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;Find the value of $C$.&lt;/li&gt;
&lt;li&gt;Find the marginal distribution of $X$.&lt;/li&gt;
&lt;li&gt;Find the joint cdf of $X$ and $Y$.&lt;/li&gt;
&lt;/ol&gt;
&lt;i&gt;Solutions:&lt;/i&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;Find the value of $C$.&lt;br/&gt;
To solve for the value of $C$, we integrate the given pdf over $x$ and $y$ and set the result equal to 1, that is
\begin{equation}\nonumber
\begin{aligned}
1&amp;=\int_{0}^{1}\int_{0}^{2}C (x+2y)\operatorname{d}x\operatorname{d}y=
C\int_{0}^{1} \left(\frac{x^2}{2}+2xy\right)\Bigg|_{x=0}^{x=2}\operatorname{d}y\\
&amp;=C\int_{0}^{1}(2+4y)\operatorname{d}y=C\left(2y+4\frac{y^2}{2}\right)\Bigg|_{y=0}^{y=1}\\
1&amp;=4C\Rightarrow C=\frac{1}{4}
\end{aligned}
\end{equation}
&lt;script src=&quot;https://gist.github.com/alstat/24d157ff530416a54c0e.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;Find the marginal distribution of $X$.
\begin{equation}\nonumber
\begin{aligned}
f_X(x)&amp;=\int_{0}^{1}f(x,y)\operatorname{d}y = \frac{1}{4}\int_{0}^{1}(x+2y)\operatorname{d}y\\
&amp;=\frac{1}{4}(xy+y^2)\Bigg|_{y=0}^{y=1}=\begin{cases}
\frac{1}{4}(x+1),&amp;0  &lt; x &lt; 2\\
0,&amp;\text{elsewhere}
\end{cases}
\end{aligned}
\end{equation}
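As a sanity check, this marginal should integrate to 1 over $0 &lt; x &lt; 2$; a quick midpoint-rule sketch in Python:

```python
# Midpoint-rule check that f_X(x) = (x + 1)/4 integrates to 1 on (0, 2)
n = 10_000
h = 2 / n
total = sum(((i + 0.5) * h + 1) / 4 * h for i in range(n))
print(total)  # approximately 1.0 (the midpoint rule is exact for a linear integrand)
```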
&lt;/li&gt;
&lt;li&gt;Find the joint cdf of $X$ and $Y$.&lt;br/&gt;
If $0 &lt; x &lt; 2$ and $0 &lt; y &lt; 1$, then
\begin{equation}\nonumber
\begin{aligned}
F_{XY}(x,y)&amp;=P(X\leq x, Y\leq y) = \frac{1}{4}\int_{0}^{x}\int_{0}^{y}(u+2v)\operatorname{d}v\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{x}(uv+v^2)\Bigg|_{v=0}^{v=y}\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{x}(uy+y^2)\operatorname{d}u\\
&amp;=\frac{1}{4}\left(\frac{u^2y}{2}+uy^2\right)\Bigg|_{u=0}^{u=x}=\frac{x^2y}{8}+\frac{xy^2}{4}
\end{aligned}
\end{equation}
If $x\geq 2$ and $0 &lt; y &lt; 1$, then
\begin{equation}\nonumber
\begin{aligned}
F_{XY}(x,y)&amp;=P(X\leq x, Y\leq y) = \frac{1}{4}\int_{0}^{2}\int_{0}^{y}(u+2v)\operatorname{d}v\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{2}(uv+v^2)\Bigg|_{v=0}^{v=y}\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{2}(uy+y^2)\operatorname{d}u\\
&amp;=\frac{1}{4}\left(\frac{u^2y}{2}+uy^2\right)\Bigg|_{u=0}^{u=2}=\frac{y}{2}+\frac{y^2}{2}
\end{aligned}
\end{equation}
If $0 &lt; x &lt; 2$ and $y \geq 1$, then
\begin{equation}\nonumber
\begin{aligned}
F_{XY}(x,y)&amp;=P(X\leq x, Y\leq y) = \frac{1}{4}\int_{0}^{x}\int_{0}^{1}(u+2v)\operatorname{d}v\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{x}(uv+v^2)\Bigg|_{v=0}^{v=1}\operatorname{d}u\\
&amp;=\frac{1}{4}\int_{0}^{x}(u+1)\operatorname{d}u\\
&amp;=\frac{1}{4}\left(\frac{u^2}{2}+u\right)\Bigg|_{u=0}^{u=x}=\frac{x^2}{8}+\frac{x}{4}
\end{aligned}
\end{equation}
&lt;/li&gt;
Hence below is the summary of the cdf,
\begin{equation}\nonumber
F_{XY}(x,y)=\begin{cases}
0,&amp;x\leq 0, y\leq 0\\
\frac{x^2y}{8}+\frac{xy^2}{4},&amp; 0 &lt; x &lt; 2\;\text{and}\;0 &lt; y &lt; 1\\
\frac{y}{2}+\frac{y^2}{2},&amp; x\geq 2\;\text{and}\;0 &lt; y &lt; 1\\
\frac{x^2}{8}+\frac{x}{4}, &amp; 0 &lt; x &lt; 2\;\text{and}\;y \geq 1\\
1,&amp;x\geq 2\;\text{and}\;y\geq 1
\end{cases}
\end{equation}
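Note that the piecewise cases above can be collapsed into a single expression by clamping $(x, y)$ to the support, since the cdf is constant beyond it; a Python sketch (the function name is mine):

```python
def joint_cdf(x, y):
    """Joint cdf F_{XY}(x, y) for f(x, y) = (x + 2y)/4 on (0,2) x (0,1).

    Clamping (x, y) to the support reproduces every piecewise case,
    because the cdf is constant outside the support.
    """
    x = min(max(x, 0.0), 2.0)
    y = min(max(y, 0.0), 1.0)
    return x**2 * y / 8 + x * y**2 / 4

print(joint_cdf(2, 1))    # 1.0 -- total probability
print(joint_cdf(1, 0.5))  # 0.125
```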
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;Find $P(X &gt; \sqrt{Y})$ if $X$ and $Y$ are jointly distributed with pdf
\begin{equation}
f(x,y)=x+y,\;0\leq x\leq 1,\;0\leq y\leq 1.
\end{equation}
&lt;/li&gt;
&lt;li&gt;Find $P(X^2 &lt; Y &lt; X)$ if $X$ and $Y$ are jointly distributed with pdf
\begin{equation}
f(x,y)=2x,\;0\leq x\leq 1,\; 0\leq y \leq 1.
\end{equation}&lt;/li&gt;
&lt;/ol&gt;
&lt;i&gt;Solutions:&lt;/i&gt;
&lt;ol type = &quot;a&quot;&gt;
&lt;li&gt;$P(X &gt; \sqrt{Y})=P(Y &lt; X^2)$. Now the plot of $y=x^2$ is shown below
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3X9v9y5AtBNZ1g8tNQL-BqE_3EUuBsPqbaDVD7j-VKp_y78MHZYai9EZGcucyKABQJ4KzbeqCLx5d0X8SqkfaS7PfhdYU3oc81g9ma6Znqwh-VB22jiY-KJtC9opAi1wWAhAAs4MaCw_e/s1600/Screenshot+from+2014-11-01+20:34:27.png&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/8b6f1689526ff3f8951f&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;&lt;br/&gt;
The probability of the blue region above is computed as follows,
\begin{equation}\nonumber
\begin{aligned}
P(Y &lt; X^2)&amp;=\int_{0}^{1}\int_{0}^{x^2}(x + y)\operatorname{d}y\operatorname{d}x\\
&amp;=\int_{0}^{1}\left(xy+\frac{y^2}{2}\right)\Bigg|_{y=0}^{y=x^2}\operatorname{d}x\\
&amp;=\int_{0}^{1}\left(x^3+\frac{x^4}{2}\right)\operatorname{d}x\\
&amp;=\left(\frac{x^4}{4}+\frac{x^5}{10}\right)\Bigg|_{0}^{1}\\
&amp;=\frac{1}{4}+\frac{1}{10}=\frac{7}{20}
\end{aligned}
\end{equation}
&lt;script src=&quot;https://gist.github.com/alstat/a26ebf05903731c5d17b.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;So we are interested in the region between $y=x$ and $y=x^2$, as shown below
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTh-SHQyflpSzw_EKnox0k4f5BUHI7xusnDb-Yb1b0ZIMRN7FY3siljPvRdziFaTzNpHIkn0_FCHsdzF6LfElz0uY02fCgw-kYliiQ-XZqq3aM9zhzaqCuJ7ZVwftd9LONiM2zkXZ6HzqD/s1600/Screenshot+from+2014-11-01+20:52:39.png&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/838cc9189d6bbb46d0e2&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;&lt;br/&gt;
Thus,
\begin{equation}\nonumber
\begin{aligned}
P(X^2 &lt; Y &lt; X) &amp;=\int_{0}^{1}\int_{x^2}^{x}2x\operatorname{d}y\operatorname{d}x\\
&amp;=\int_{0}^{1}2xy\Bigg|_{y=x^2}^{y=x}\operatorname{d}x=\int_{0}^{1}(2x^2-2x^3)\operatorname{d}x\\
&amp;=\left(\frac{2x^3}{3}-\frac{x^4}{2}\right)\Bigg|_{0}^{1}\\
&amp;=\frac{2}{3}-\frac{1}{2}=\frac{1}{6}
\end{aligned}
\end{equation}
&lt;script src=&quot;https://gist.github.com/alstat/16383ce6d85657f31bad.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/928073872476658308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/01/multiple-random-variables-problems.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/928073872476658308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/928073872476658308'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/01/multiple-random-variables-problems.html' title='Multiple Random Variables Problems'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjazWu4wwnTB929g3g7G3sreEGlCO2DHPI_ROr8mXhyphenhyphenxE8V3y6M9UebdAsQD31Jxf2y0stbHfxEQLPjtgotoLSkNFd-XjdgQ0ULFsRGGYQEC1kUbu8wkWFYWSzLtnzOBTFup6GbDwf6Fn6I/s72-c/Screenshot+from+2014-10-29+23:25:05.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-8937118107731420127</id><published>2015-01-15T21:41:00.000+08:00</published><updated>2015-04-20T22:09:17.292+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Descriptive Statistics"/><category scheme="http://www.blogger.com/atom/ns#" term="Parametric Inference"/><category scheme="http://www.blogger.com/atom/ns#" term="SAS"/><title type='text'>New Toy: SAS&amp;reg; University Edition</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
So I started using SAS® University Edition, which is a FREE version of SAS® software. Again, it&#39;s FREE, and that&#39;s the main reason why I want to relearn the language. The software was &lt;a href=&quot;http://www.sas.com/en_us/news/press-releases/2014/march/analytics-u-sgf14.html&quot; target=&quot;_blank&quot;&gt;announced on March 24, 2014&lt;/a&gt;, and the &lt;a href=&quot;http://blogs.sas.com/content/academic/2014/03/24/new-academic-offerings-announced-at-sas-global-forum/&quot; target=&quot;_blank&quot;&gt;download became available in May of that year&lt;/a&gt;. And for that, I salute &lt;a href=&quot;http://en.wikipedia.org/wiki/James_Goodnight&quot; target=&quot;_blank&quot;&gt;Dr. Jim Goodnight&lt;/a&gt;. At least we can learn SAS® without paying the expensive price tag, especially for a single user like me.&lt;br /&gt;
&lt;br /&gt;
The software runs on top of a virtual machine and requires a 64-bit processor. To install it, just follow the instructions in this &lt;a href=&quot;https://www.youtube.com/watch?v=sVFAxyLkc3g&quot; target=&quot;blank&quot;&gt;video&lt;/a&gt;. Although the installation in the video is done on Windows, it also works on Mac. Below is a screenshot of my SAS® Studio running on Safari.
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFh_HkDD_e53fDtoEk-9h-dDC0kum3znqlSEkqPVVGfp3zujUiUbbjyj0GSbmnHBXy4SiyU_1T5_awqvZXYhNP95dlv0GUMdMYxWofx9vgo7mGBS2ky-3r38sSx3PwIRVEVyAfHuWv6Osi/s1600/Screen+Shot+2015-01-12+at+10.26.42+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFh_HkDD_e53fDtoEk-9h-dDC0kum3znqlSEkqPVVGfp3zujUiUbbjyj0GSbmnHBXy4SiyU_1T5_awqvZXYhNP95dlv0GUMdMYxWofx9vgo7mGBS2ky-3r38sSx3PwIRVEVyAfHuWv6Osi/s1600/Screen+Shot+2015-01-12+at+10.26.42+PM.png&quot; height=&quot;250&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;h3&gt;
What&#39;s in the box?&lt;/h3&gt;
The software includes the following libraries:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Base SAS® - Make programming fast and easy with the SAS® programming language, ODS graphics and reporting procedure;&lt;/li&gt;
&lt;li&gt;SAS/STAT® - Trust SAS® proven reliability with a wide variety of statistical methods and techniques;&lt;/li&gt;
&lt;li&gt;SAS/IML® - Use this matrix programming language for more specialized analyses and data exploration;&lt;/li&gt;
&lt;li&gt;SAS Studio - Reduce your programming time with autocomplete for hundreds of SAS® statements and procedures, as well as built-in syntax help;&lt;/li&gt;
&lt;li&gt;SAS/ACCESS® - Seamlessly connect with your data, no matter where it resides.&lt;/li&gt;
&lt;/ol&gt;
For more about SAS® University Edition please refer to the &lt;a href=&quot;http://www.sas.com/content/dam/SAS/en_us/doc/factsheet/sas-university-edition-107140.pdf&quot; target=&quot;_blank&quot;&gt;fact sheet&lt;/a&gt;. &lt;br /&gt;
&lt;br /&gt;
If you&#39;ve been following this blog, I have been promoting free software (R, Python, and C/C++) for analysis, and the introduction of SAS® University Edition can only mean one thing: a new topic to discuss in succeeding posts. So let&#39;s welcome this software by doing some analysis in it. &lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Analysis&lt;/h3&gt;
Our goal here is to address the basics in order to proceed with the analysis, and thus we have the following:&lt;br /&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;Importing and transforming the data;&lt;/li&gt;
&lt;li&gt;Descriptive statistics;&lt;/li&gt;
&lt;li&gt;Hypothesis testing: one-sample t test;&lt;/li&gt;
&lt;li&gt;Creating a function; and,&lt;/li&gt;
&lt;li&gt;Visualization.&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;h3&gt;
Data&lt;/h3&gt;
We&#39;ll again use the Volume of Palay Production (1994 to 2013, quarterly) data from the Cordillera Administrative Region (CAR), Philippines. To reproduce this article, please click &lt;a href=&quot;https://raw.githubusercontent.com/alstat/Analysis-with-Programming/master/2015/SAS/New%20Toy%20SAS%20University%20Edition/palay.csv&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt; to download the data.
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;b&gt;Importing and transforming the data&lt;/b&gt;&lt;br /&gt;
Working in SAS® Studio requires you to upload your data into it. To do this, hover over the sidebar, click on the Folders tab, and there you will find the &quot;up arrow&quot; for uploading. See the picture below&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBI9Kx1enwVBlzP3IFcozkyfrRKutw0H7LtMDwwynpYxhjxpAAEilaELf-p8gT1qQzrXRZjLXQW_YUKcm6Rdt7EYhtSPtFB8q3uJSZjCvMOkXJztSl-BhZ1T7GjW9Bjt91FyJwc_wt5loV/s1600/Screen+Shot+2015-01-12+at+10.56.35+PM.png&quot; height=&quot;239&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
You are now set to import the data using the following code. In my case, the location of the uploaded data, as seen in the photo above, is &quot;/folders/myfolders/palay.csv&quot;,&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/149ebdab13f603066c18.js&quot;&gt;&lt;/script&gt;
In SAS®, &lt;code&gt;proc&lt;/code&gt; refers to procedure, where in this case we perform the &lt;code&gt;import&lt;/code&gt; procedure. &lt;code&gt;out&lt;/code&gt; is the path where the SAS® data is saved, here we saved it in &quot;Work&quot; folder with filename &quot;palay&quot;. &lt;code&gt;getnames&lt;/code&gt; determines whether to generate SAS® variable names from the data values in the first 
record of the imported file. Finally, &lt;code&gt;datarow&lt;/code&gt; starts reading data from the specified row number in the delimited text file. &lt;br /&gt;&lt;br /&gt;
I want to emphasize that the description of the arguments of the statements and procedures above is available in the software itself; SAS® Studio&#39;s autocomplete for hundreds of SAS® statements and procedures is very handy. So in the succeeding codes, we will describe selected statements only. Below is the autocomplete feature of SAS® Studio seen in action,&lt;br /&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNuIsoWw8Bh2EyZlO2Za3jt8fB1TUrt510F8SUBpwDKiQuBKjdbNmxVMHlkQQDoqYgMXl85mrKtorTh7ADAUlbd0XPEeUxyWG08bzfYK7UtNmltnXP2uMsQvKYP_xUWyqBN6yzN0GczfJm/s1600/Screen+Shot+2015-01-13+at+8.21.57+PM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNuIsoWw8Bh2EyZlO2Za3jt8fB1TUrt510F8SUBpwDKiQuBKjdbNmxVMHlkQQDoqYgMXl85mrKtorTh7ADAUlbd0XPEeUxyWG08bzfYK7UtNmltnXP2uMsQvKYP_xUWyqBN6yzN0GczfJm/s1600/Screen+Shot+2015-01-13+at+8.21.57+PM.png&quot; height=&quot;178&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
Now that we have the data in our workspace, let&#39;s do some transformation on it. In R, we always start by viewing the head of the data, that is, the first few observations, coded as &lt;code&gt;head(data)&lt;/code&gt;. Out of that habit, here&#39;s how to do it in SAS®, in this case for the first five observations,&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/03a6634abf6500379eb0.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Abra&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Apayao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Benguet&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Ifugao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Kalinga&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Mt_Province&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1243&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2934&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;148&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3300&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;10553&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2675&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;2&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;4158&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;9235&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4287&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;8063&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;35257&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1920&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;3&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1787&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1922&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1955&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1074&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4544&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6955&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;4&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;17152&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;14501&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3536&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;19607&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;31687&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2715&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;5&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1266&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2385&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2530&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3315&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;8520&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2601&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
If you want to start and end on specific rows, you can do the following. In this case, from the 5th row to the 10th row:
&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5b500443b9270b49a685.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Abra&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Apayao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Benguet&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Ifugao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Kalinga&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Mt_Province&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;5&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1266&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2385&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2530&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3315&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;8520&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2601&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;6&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;5576&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;7452&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;771&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;13134&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;28252&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1242&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;7&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;927&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1099&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2796&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5134&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3106&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;9145&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;8&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;21540&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;17038&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2463&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;14226&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;36238&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2465&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;9&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1039&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1382&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2592&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6842&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4973&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2624&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;10&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;5424&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;10588&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1064&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;13828&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;40140&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1237&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
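If you prefer to follow along in Python, here is a rough pandas sketch of the same row-range subset. This is not the SAS code in the gist above; the frame below is a stand-in (the first two Abra values are made up, the rest are copied from the printed tables):

```python
import pandas as pd

# Stand-in for the palay production data; the first two Abra values are
# made up, the rest are copied from the printed tables. Obs is 1-based.
df = pd.DataFrame({"Abra": [1243, 4158, 1787, 17152, 1266,
                            5576, 927, 21540, 1039, 5424]},
                  index=range(1, 11))

# Print from the 5th to the 10th observation (analogous to SAS's
# firstobs=5 obs=10 data set options); .loc is inclusive of both ends.
subset = df.loc[5:10]
print(subset)
```

Note that `.loc` slicing on integer labels includes both endpoints, which conveniently matches the inclusive SAS row range.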
Now, what about subsetting the variables of the data? Say we want to view a specific column only, for example observations 15 to 20 of the Benguet variable. The following code does exactly that,
&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/72c518c5da87505b7b0f.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Benguet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;15&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2847&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;16&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2942&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;17&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2119&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;18&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;734&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;19&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2302&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;20&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;2598&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
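For cross-reference, a hedged pandas sketch of the same single-column view, using the Benguet values printed above, might be:

```python
import pandas as pd

# Benguet values for observations 15 to 20, as printed in the table above.
df = pd.DataFrame({"Benguet": [2847, 2942, 2119, 734, 2302, 2598]},
                  index=range(15, 21))

# A row range of a single variable: rows 15-20 of the Benguet column.
view = df.loc[15:20, "Benguet"]
print(view)
```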
For viewing multiple columns, simply enumerate the names of the variables using either &lt;code&gt;keep&lt;/code&gt; -- which keeps the listed variables in the output -- or &lt;code&gt;drop&lt;/code&gt; -- which excludes the listed variables from the printing.
&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c28f152c6f0a3431fea3.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Abra&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Apayao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Benguet&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Ifugao&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Kalinga&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;15&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1048&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1427&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2847&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5526&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;4402&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;16&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;25679&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;15661&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2942&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;14452&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;33717&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;17&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1055&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2191&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2119&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;5882&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;7352&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;18&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;5437&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6461&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;734&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;10477&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;24494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;19&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;1029&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1183&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2302&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;6438&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;3316&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;20&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;23710&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;12222&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;2598&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;8446&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;26659&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
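The keep/drop idea maps naturally to column selection versus column dropping in pandas. A small illustrative sketch (the first five values come from the Obs 15 row above; the Mt_Province value is made up):

```python
import pandas as pd

# One illustrative row; the first five values come from the Obs 15 row
# above, the Mt_Province value is made up.
df = pd.DataFrame([[1048, 1427, 2847, 5526, 4402, 1800]],
                  columns=["Abra", "Apayao", "Benguet",
                           "Ifugao", "Kalinga", "Mt_Province"])

# keep: enumerate the variables to return.
kept = df[["Abra", "Apayao", "Benguet", "Ifugao", "Kalinga"]]

# drop: name the variables to exclude instead.
dropped = df.drop(columns=["Mt_Province"])

print(kept.equals(dropped))  # True: both yield the same five columns
```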
I think the demonstrations above are enough for data transformation.
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Perform descriptive statistics&lt;/b&gt;&lt;br /&gt;
And as always, the next step is to look at the descriptive statistics of the data, and here&#39;s how to do it,
&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e6103617086bc4c6d1b4.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;b header&quot; scope=&quot;col&quot;&gt;Variable&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;N&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Mean&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Std Dev&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Minimum&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Maximum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
Abra&lt;/div&gt;
&lt;div&gt;
Apayao&lt;/div&gt;
&lt;div&gt;
Benguet&lt;/div&gt;
&lt;div&gt;
Ifugao&lt;/div&gt;
&lt;div&gt;
Kalinga&lt;/div&gt;
&lt;div&gt;
Mt_Province&lt;/div&gt;
&lt;/div&gt;
&lt;/th&gt;
&lt;td class=&quot;r data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;div&gt;
79&lt;/div&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
12874.38&lt;/div&gt;
&lt;div&gt;
16860.65&lt;/div&gt;
&lt;div&gt;
3237.39&lt;/div&gt;
&lt;div&gt;
12414.62&lt;/div&gt;
&lt;div&gt;
30446.42&lt;/div&gt;
&lt;div&gt;
4506.20&lt;/div&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
16746.47&lt;/div&gt;
&lt;div&gt;
15448.15&lt;/div&gt;
&lt;div&gt;
1588.54&lt;/div&gt;
&lt;div&gt;
5034.28&lt;/div&gt;
&lt;div&gt;
22245.71&lt;/div&gt;
&lt;div&gt;
3815.71&lt;/div&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
927.0000000&lt;/div&gt;
&lt;div&gt;
401.0000000&lt;/div&gt;
&lt;div&gt;
148.0000000&lt;/div&gt;
&lt;div&gt;
1074.00&lt;/div&gt;
&lt;div&gt;
2346.00&lt;/div&gt;
&lt;div&gt;
382.0000000&lt;/div&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;&lt;div class=&quot;stacked-values&quot;&gt;
&lt;div&gt;
60303.00&lt;/div&gt;
&lt;div&gt;
54625.00&lt;/div&gt;
&lt;div&gt;
8813.00&lt;/div&gt;
&lt;div&gt;
21031.00&lt;/div&gt;
&lt;div&gt;
68663.00&lt;/div&gt;
&lt;div&gt;
13038.00&lt;/div&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
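For readers following along in Python, the same default summary (N, mean, standard deviation, minimum, maximum) can be sketched with pandas. The values below are a small illustrative subset, not the full 79 observations:

```python
import pandas as pd

# Illustrative subset (observations 5-10 from the tables above); the
# real data has 79 rows per province.
df = pd.DataFrame({
    "Abra":    [1266, 5576, 927, 21540, 1039, 5424],
    "Benguet": [2530, 771, 2796, 2463, 2592, 1064],
})

# N, mean, std dev, minimum, maximum -- the statistics PROC MEANS
# prints by default.
summary = df.agg(["count", "mean", "std", "min", "max"])
print(summary)
```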
In case you want to view fewer or additional statistics, you can try
&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e6316e9184d360a18e23.js&quot;&gt;&lt;/script&gt;
We&#39;ll end this section with the following scatter plot matrix,&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuKRNLjtgU7wz7BzemVJs8BHZ2ry8zLi8tz26XGrfJ05y9ogwo-OZtxjLiXw_KqRTFc1EU-gDL6HXoprInpkeQmIiELrehAyWF-F2jTCHWf6QK4hKwgQYJQ5sympx4T97lNZOpwDQyXTIl/s1600/SGScatter6.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuKRNLjtgU7wz7BzemVJs8BHZ2ry8zLi8tz26XGrfJ05y9ogwo-OZtxjLiXw_KqRTFc1EU-gDL6HXoprInpkeQmIiELrehAyWF-F2jTCHWf6QK4hKwgQYJQ5sympx4T97lNZOpwDQyXTIl/s1600/SGScatter6.png&quot; height=&quot;400&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e3f4d5a4e5f6d5413caf.js&quot;&gt;&lt;/script&gt;
As a quick analysis, we see a strong positive relationship between Kalinga and Apayao, and a relationship between Ifugao and Benguet, based on the above scatter plot matrix.
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Hypothesis testing: One-sample t test&lt;/b&gt;&lt;br /&gt;
Let&#39;s perform a simple hypothesis test: the one-sample t test. Using a 0.05 level of significance, we&#39;ll test whether the true mean of Abra differs from 15000.&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/b785a6c4992aed2d5fc6.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;N&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Mean&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Std&amp;nbsp;Dev&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Std&amp;nbsp;Err&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Minimum&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Maximum&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;79&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;12874.4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;16746.5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;1884.1&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;927.0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;60303.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Mean&lt;/th&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;95% CL Mean&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Std&amp;nbsp;Dev&lt;/th&gt;
&lt;th class=&quot;c b header&quot; colspan=&quot;2&quot; scope=&quot;colgroup&quot;&gt;95% CL Std Dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;12874.4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;9123.4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;16625.4&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;16746.5&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;14480.9&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;19859.1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;DF&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;t&amp;nbsp;Value&lt;/th&gt;
&lt;th class=&quot;r b header&quot; scope=&quot;col&quot;&gt;Pr&amp;nbsp;&amp;gt;&amp;nbsp;|t|&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td class=&quot;r data&quot;&gt;78&lt;/td&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-1.13&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;0.2627&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
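As a cross-check of the mechanics, here is a minimal Python sketch of the one-sample t statistic on simulated data (the sample below is synthetic, not the actual Abra series):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for the Abra series (n = 79).
abra = rng.gamma(shape=1.0, scale=12000.0, size=79)

# One-sample t statistic for H0: mu = 15000 (two-sided alternative).
n, mu0 = abra.size, 15000.0
t_stat = (abra.mean() - mu0) / (abra.std(ddof=1) / np.sqrt(n))
print("t =", round(t_stat, 4), "with df =", n - 1)
```

SAS additionally reports Pr &gt; |t|, the two-sided p-value from the t distribution with n - 1 = 78 degrees of freedom; in Python that step would typically use scipy.stats.ttest_1samp.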
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzY9sxw2x9zjEirEtsHPsTio9hiQUh10j8XoDWsLyjvpy-cwREZbBjQTdwCbQDlebOPaAJGbpaR8eM__r6Nod90iiG7CqVIChuZL2Wz1qP3sF3pW1FUar6p32-xXefvdeYCJseol8YAfpZ/s1600/SummaryPanel5.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzY9sxw2x9zjEirEtsHPsTio9hiQUh10j8XoDWsLyjvpy-cwREZbBjQTdwCbQDlebOPaAJGbpaR8eM__r6Nod90iiG7CqVIChuZL2Wz1qP3sF3pW1FUar6p32-xXefvdeYCJseol8YAfpZ/s1600/SummaryPanel5.png&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSAj17lfhrzBYKo6hg9vKwOacuKpyVPoOKL-8CKnzJ2xkgxRpZCZZl95u9snHfE3pqaZjJ0Cd1aCRpnGvOW0RAzWHnA9Xvb_iCmBelXeAHNcEGroysNxnfz75BUdgUH7p_ACWPOWg02chA/s1600/QQPlot5.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSAj17lfhrzBYKo6hg9vKwOacuKpyVPoOKL-8CKnzJ2xkgxRpZCZZl95u9snHfE3pqaZjJ0Cd1aCRpnGvOW0RAzWHnA9Xvb_iCmBelXeAHNcEGroysNxnfz75BUdgUH7p_ACWPOWg02chA/s1600/QQPlot5.png&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
From the above numerical output, we see that the p-value = 0.2627 is greater than $\alpha = 0.05$; hence there is insufficient evidence to conclude that the average volume of palay production differs from 15000. Graphically, the observations of the Abra variable are not normally distributed based on the Q-Q plot; although that judgment is subjective, the points clearly deviate from the line.
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Creating a function&lt;/b&gt;&lt;br /&gt;
Let&#39;s create a function using the &lt;code&gt;fcmp&lt;/code&gt; procedure. For illustration purposes, consider the standard normal density function,
$$
\phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{x^2}{2}\right\}
$$
In SAS&amp;reg; we code it as follows,&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/68be9e5388302c5d18fb.js&quot;&gt;&lt;/script&gt;
To generate data from this function using a &lt;code&gt;do&lt;/code&gt; loop, consider the following:&lt;br /&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/b3484eeee29c0edcc68c.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table class=&quot;table&quot; style=&quot;border-spacing: 0;&quot;&gt;
&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;&lt;colgroup&gt;&lt;col&gt;&lt;/col&gt;&lt;col&gt;&lt;/col&gt;&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;Obs&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;x&lt;/th&gt;
&lt;th class=&quot;r header&quot; scope=&quot;col&quot;&gt;y&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;1&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-5.0&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;.000001487&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;2&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-4.9&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;.000002439&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;3&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-4.8&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;.000003961&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;4&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-4.7&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;.000006370&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;th class=&quot;r rowheader&quot; scope=&quot;row&quot;&gt;5&lt;/th&gt;
&lt;td class=&quot;r data&quot; style=&quot;white-space: nowrap;&quot;&gt;-4.6&lt;/td&gt;
&lt;td class=&quot;r data&quot;&gt;.000010141&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/center&gt;
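As a sanity check, the same density and do-loop-style grid can be sketched in Python; the first value reproduces the .000001487 in the Obs 1 row of the table above:

```python
import math

def stdnorm(x):
    # phi(x) = exp(-x^2 / 2) / sqrt(2 * pi)
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# Mirror the do loop: x from -5 to 5 in steps of 0.1.
xs = [round(-5.0 + 0.1 * i, 1) for i in range(101)]
ys = [stdnorm(x) for x in xs]

# First observation matches the table: x = -5.0, y = .000001487
print(xs[0], format(ys[0], ".9f"))
```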
And that&#39;s how you create and use a function in SAS®. For me, the function-definition procedure &lt;code&gt;fcmp&lt;/code&gt; is the best addition to SAS® version 9.2, and I&#39;m lucky to be relearning this language with the feature available, especially since it is FREE in SAS® Studio.
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Visualization&lt;/b&gt;&lt;br /&gt;
Now it&#39;s time for us to create some visual art. SAS®, being proprietary software, has a lot to offer. We&#39;ve demonstrated a few plots above already; this time let&#39;s plot the data points of &lt;code&gt;sn_data&lt;/code&gt; generated from the &lt;code&gt;stdnorm&lt;/code&gt; function we defined earlier. Here it is,&lt;br /&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDw0EkNGHtmfYWc9EBGq1Goa5HjTExgnVgm5M9Q4rRKWQt_5ni6tAI_QMorssq1ePZLjFcXlGalXTut3YiX134Me_rYrVF5MzMvWIKX_DpKQhyWTSnoib12FNhZJY-nZC-JTv_E1E9uCCa/s1600/SGPlot2-2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDw0EkNGHtmfYWc9EBGq1Goa5HjTExgnVgm5M9Q4rRKWQt_5ni6tAI_QMorssq1ePZLjFcXlGalXTut3YiX134Me_rYrVF5MzMvWIKX_DpKQhyWTSnoib12FNhZJY-nZC-JTv_E1E9uCCa/s1600/SGPlot2-2.png&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a5cc13b1c8a89fdfef3e.js&quot;&gt;&lt;/script&gt;
For other types of plots, simply go to the Snippets tab in the sidebar of SAS® Studio, where you will find template code for different plot types. See the picture below,&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_IM6Veoy6ztU0AFE-cnlNChIVY-pgczrnMq8moY1ZEjdSriB3Ab2hM66YJ9dQAP2YJMHJpN1_a4k9gOYt0x9osIhbsExpCwB40czaxPFIMav-KnLPp9Q8oz2yIvhBmFLtwjd8jbVgiqRe/s1600/Screen+Shot+2015-01-15+at+1.15.18+PM.png&quot; height=&quot;400&quot; width=&quot;328&quot; /&gt;&lt;/div&gt;
I will end this section with a histogram and a series plot.
&lt;ul&gt;
&lt;li&gt;&lt;b&gt;Histogram&lt;/b&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_ZHgaE-zyaiEWgab62iBiKwlM8nS5TA2oIaLDa15SsLh8dZAjgIdzJK4CH4RGFOK9RidRpylW_f2yz73VAh5gT29aPWmZYl2Al94gj-1hORggdr8fDFYVSnyVXwv0xLro2db_QkdOPrM/s1600/SGPlot5.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiL_ZHgaE-zyaiEWgab62iBiKwlM8nS5TA2oIaLDa15SsLh8dZAjgIdzJK4CH4RGFOK9RidRpylW_f2yz73VAh5gT29aPWmZYl2Al94gj-1hORggdr8fDFYVSnyVXwv0xLro2db_QkdOPrM/s1600/SGPlot5.png&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ade75f9e88c758f257e5.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Series plot&lt;/b&gt;&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf8ty4IjPYIphG3J8xzH1_qz8jmWGkeP_Iwx9z5IjhiKmgdjRcmPS-Klv0qqRYaq6VfYm3E6Jdj2teYP8_fbWVffdr73hWv3LE39ZXJirsAQAs_JOYQLqITWyvZQdBjhyIfaVCuQlqpf7t/s1600/SGPlot22.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjf8ty4IjPYIphG3J8xzH1_qz8jmWGkeP_Iwx9z5IjhiKmgdjRcmPS-Klv0qqRYaq6VfYm3E6Jdj2teYP8_fbWVffdr73hWv3LE39ZXJirsAQAs_JOYQLqITWyvZQdBjhyIfaVCuQlqpf7t/s1600/SGPlot22.png&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/763f1d6b9932248e1141.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Conclusion&lt;/h3&gt;
In conclusion, it wasn&#39;t difficult for me to relearn SAS®, not only because I had used it in a few papers back in college, but also because of my programming background in R and Python, which I used as a basis for understanding the grammar of the language. Overall, SAS® is a high-level language; as we saw above, a simple statement gives you complete results with graphics, without lengthy code. And although I use R and Python as my primary tools for research, I am happy to add SAS® to them. Despite the popularity of R in analysis, I look forward to seeing more learners, students, researchers, and even bloggers using SAS®. That way, we can share ideas and techniques between the R, SAS®, and Python communities.&lt;br /&gt;
&lt;br /&gt;
What about you? How&#39;s your experience with SAS® University Edition?
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Data Source&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://countrystat.bas.gov.ph/&quot; target=&quot;_blank&quot;&gt;Philippine Bureau of Agricultural Statistics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;SAS® Documentation&lt;/li&gt;
&lt;li&gt;r4stats.com: Data Import. From &lt;a href=&quot;http://r4stats.com/examples/data-import/&quot; target=&quot;_blank&quot;&gt;http://r4stats.com/examples/data-import/&lt;/a&gt; (accessed January 15, 2015)&lt;/li&gt;
&lt;li&gt;SAS Learning Module: Subsetting data in SAS. From &lt;a href=&quot;http://www.ats.ucla.edu/stat/sas/modules/subset.htm&quot;&gt;http://www.ats.ucla.edu/stat/sas/modules/subset.htm&lt;/a&gt; (accessed January 15, 2015)&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;style&gt;
.header {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    padding: 2px 5px 2px 5px;
}


.rowheader {
    background-color: #EDF2F9;
    border-color: #B0B7BB;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    color: #127;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: bold;
    text-align: center;
    padding: 2px 5px 2px 5px;
}


.data, .dataemphasis {
    background-color: #FFF;
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 0px 1px 1px 0px;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    text-align: right;
    padding: 2px 5px 2px 5px;
}

.table {
    border-color: #C1C1C1;
    border-style: solid;
    border-width: 1px 1px 1px 1px;
    border-collapse: collapse;
    border-spacing: 0px;
    padding: 5px 5px 5px 5px;
    margin-bottom: 1em;
}

.body {
    color: #000;
    font-family: Arial,&quot;Albany AMT&quot;,Helvetica,Helv;
    font-size: x-small;
    font-style: normal;
    font-weight: normal;
    line-height: 1.231;
}
&lt;/style&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/8937118107731420127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/01/new-toy-sas-university-edition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8937118107731420127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8937118107731420127'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/01/new-toy-sas-university-edition.html' title='New Toy: SAS&amp;reg; University Edition'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFh_HkDD_e53fDtoEk-9h-dDC0kum3znqlSEkqPVVGfp3zujUiUbbjyj0GSbmnHBXy4SiyU_1T5_awqvZXYhNP95dlv0GUMdMYxWofx9vgo7mGBS2ky-3r38sSx3PwIRVEVyAfHuWv6Osi/s72-c/Screen+Shot+2015-01-12+at+10.26.42+PM.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-3346169097209319093</id><published>2015-01-05T15:38:00.000+08:00</published><updated>2015-01-19T15:17:21.158+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Data Mining"/><category scheme="http://www.blogger.com/atom/ns#" term="Image Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="Multivariate Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>R: Canonical Correlation Analysis on Imaging</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; 
trbidi=&quot;on&quot;&gt;
In imaging, we deal with multivariate data, often in array form with several spectral bands. Trying to interpret the correlations across all of its dimensions is very challenging, if not impossible. For example, recall the number of spectral bands of the AVIRIS data we used in the &lt;a href=&quot;http://alstatr.blogspot.com/2014/12/principal-component-analysis-on-imaging.html&quot; target=&quot;_blank&quot;&gt;previous post&lt;/a&gt;. There are 152 bands, so in total there are 152$\cdot$152 = 23104 correlations between pairs of random variables. How would you be able to interpret that huge number of correlations? &lt;br /&gt;
&lt;br /&gt;
To address this, it might be better to group these variables into two sets and study the relationship between the sets. Such a statistical procedure can be done using canonical correlation analysis (CCA). An example from the health sciences (from Reference 2) involves variables related to exercise and health. On one hand you have variables associated with exercise: observations such as the climbing rate on a stair stepper, how fast you can run, the amount of weight lifted on bench press, the number of push-ups per minute, etc. On the other hand you have health variables such as blood pressure, cholesterol levels, glucose levels, body mass index, etc. Two sets of variables are measured, and the relationships between the exercise variables and the health variables are studied.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Methodology&lt;/h3&gt;
Mathematically, the procedure is as follows:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Divide the random variables into two groups, and assign these to the following random vectors:
\begin{equation}\nonumber
\mathbf{X} = [X_1,X_2,\cdots, X_p]^T\;\text{and}\;\mathbf{Y} = [Y_1,Y_2,\cdots, Y_q]^T
\end{equation}
&lt;/li&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;li&gt;Analogous to principal component analysis (PCA), we aim to find linear combinations
\begin{equation}\nonumber
\begin{aligned}
U_1 = &amp;amp;\mathbf{a}_1^T\mathbf{X} = a_{11}X_1 + a_{12}X_2+\cdots + a_{1p}X_p\\
U_2 = &amp;amp;\mathbf{a}_2^T\mathbf{X} = a_{21}X_1 + a_{22}X_2+\cdots + a_{2p}X_p\\
&amp;amp;\qquad\quad\qquad\vdots\qquad\qquad\vdots\\
U_p = &amp;amp;\mathbf{a}_p^T\mathbf{X} = a_{p1}X_1 + a_{p2}X_2+\cdots + a_{pp}X_p
\end{aligned}
\end{equation}
and
\begin{equation}\nonumber
\begin{aligned}
V_1 = &amp;amp;\mathbf{b}_1^T\mathbf{Y}=b_{11}Y_1 + b_{12}Y_2+\cdots + b_{1q}Y_q\\
V_2 = &amp;amp;\mathbf{b}_2^T\mathbf{Y}=b_{21}Y_1 + b_{22}Y_2+\cdots + b_{2q}Y_q\\
&amp;amp;\qquad\quad\qquad\vdots\qquad\qquad\vdots\\
V_q = &amp;amp;\mathbf{b}_q^T\mathbf{Y}=b_{q1}Y_1 + b_{q2}Y_2+\cdots + b_{qq}Y_q\\
\end{aligned}
\end{equation}
that will maximize the correlation
\begin{equation}\nonumber
Corr(U_i,V_i)=\frac{Cov(U_i,V_i)}{\sqrt{Var(U_i)}\sqrt{Var(V_i)}},\quad i=1,2,\cdots,n
\end{equation}
where $n = \min{(p, q)}$.
&lt;/li&gt;
&lt;li&gt;The first pair of canonical variables is defined by
\begin{equation}\nonumber
Corr(U_1, V_1)=\rho_1=\sqrt{\rho_1^2},
\end{equation}
where $\rho_1$, the first canonical correlation, is the square root of the largest of the eigenvalues $\rho_1^2\geq \rho_2^2\geq \cdots \geq \rho_n^2$ of the matrix $\mathbf{\Sigma}_{XX}^{-1/2}\mathbf{\Sigma}_{XY}\mathbf{\Sigma}_{YY}^{-1}\mathbf{\Sigma}_{XY}^{T}\mathbf{\Sigma}_{XX}^{-1/2}$. Here $\mathbf{\Sigma}_{XX}$ is the variance-covariance matrix of $\mathbf{X}$, $\mathbf{\Sigma}_{YY}$ is the variance-covariance matrix of $\mathbf{Y}$, and $\mathbf{\Sigma}_{XY}$ is the cross-covariance matrix of $\mathbf{X}$ and $\mathbf{Y}$. The second pair of canonical variables is then given by
\begin{equation}\nonumber
Corr(U_2, V_2)=\rho_2=\sqrt{\rho_2^2},
\end{equation}
and so on.
&lt;/li&gt;
&lt;/ol&gt;
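As a quick numerical illustration of step 3, the sketch below (on simulated data, not the grass image) computes the canonical correlations directly from the eigenvalues of the matrix described above and checks them against R&#39;s built-in cancor function:

```r
# Canonical correlations from first principles (simulated data).
set.seed(1)
n = 200; p = 3; q = 2
X = matrix(rnorm(n * p), n, p)
Y = X[, 1:q] + matrix(rnorm(n * q), n, q)   # Y correlated with X by construction

Sxx = cov(X); Syy = cov(Y); Sxy = cov(X, Y)

# symmetric inverse square root of Sxx via its eigendecomposition
ex = eigen(Sxx, symmetric = TRUE)
Sxx.inv.sqrt = ex$vectors %*% diag(1 / sqrt(ex$values)) %*% t(ex$vectors)

M = Sxx.inv.sqrt %*% Sxy %*% solve(Syy) %*% t(Sxy) %*% Sxx.inv.sqrt
rho = sqrt(sort(eigen(M, symmetric = TRUE)$values, decreasing = TRUE)[1:min(p, q)])

all.equal(rho, cancor(X, Y)$cor)   # should be TRUE
```

The agreement holds because the matrix above is scale-free in the covariance estimates, so the choice of divisor in the sample covariances does not matter.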
For a more detailed theory of CCA, please refer to References 1 and 2 below. To continue, let&#39;s apply this methodology to an image. We will use the Grass data from (Bajorski, 2012) and analyze it using R. Below is the proper description of the data.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Data&lt;/h3&gt;
The Grass data is a spectral image of a grass texture, 64 by 64 pixels. Each pixel is represented by a spectral reflectance curve in 42 spectral bands, with reflectance given in percent.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Analysis&lt;/h3&gt;
To begin, let&#39;s display the data in an image form:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJeHlSjCEo0GJ8GmjaVNXg5aGtz65OjshUtvYG4JVOO6D6O-a9lATm8FlJOpyaxXh395Y1gCzuJr85-airoiq1H8XyfFMSkbza7V7NjTQ-tl9txgt-SRhBzWH2MdEyqVGNHJU3ADodJety/s1600/Rplot03.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJeHlSjCEo0GJ8GmjaVNXg5aGtz65OjshUtvYG4JVOO6D6O-a9lATm8FlJOpyaxXh395Y1gCzuJr85-airoiq1H8XyfFMSkbza7V7NjTQ-tl9txgt-SRhBzWH2MdEyqVGNHJU3ADodJety/s1600/Rplot03.png&quot; height=&quot;238&quot; width=&quot;320&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f337ab80631bbc6b4ec8.js&quot;&gt;&lt;/script&gt;
The code generates the first 12 spectral bands of the data, where we observe a significant change in brightness of the twelfth band compared to the first. The signature of all pixels across these bands is shown below:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU-MWT8R-bSiPum8u2WxsWKZr3AbUUHg4ODKYRoXE6wlbnpe6DaB7sSMsvRE1uY8M3bVVD1g_5xkN8-RPhrYU5-28EZB899HA5eHfn2KhYL3BjjtVhxn_wKw8VCrGaAjHq9RuYcvXm2KkB/s1600/Rplot04.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/3fb1851f99dfd6e76bc0.js&quot;&gt;&lt;/script&gt;
Examining the above plot suggests that almost all bands are correlated; that is, if the reflectance of a given pixel on the $i$th band increases (or decreases), the $j$th band, $i\neq j$, is also expected to increase (or decrease), except on bands 30 and 31, where there seems to be no clear pattern. But that&#39;s subjective; we cannot tell exactly, because there are 4096 signatures (lines in the plot) that are likely to overlap other important information. So to see the relationships between all variables properly, here is the correlation matrix of all the spectral bands,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8-pj-kHgRgoACRQTUywT02mECRltqgptm1HMk7P_J50HomBMNfjDWbKcxwccBrsU-kNt-3TWae7FiVkQ8QjSmThpPSn97PapLbGduIQwNiew6JqaN2F8lNHxtbyk2WaTl8o4iGLx01Qaa/s1600/Rplot05.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/803738fd9d8f6a19d442.js&quot;&gt;&lt;/script&gt;
The cyan colour engulfing almost 60 percent of the region indicates high correlation between the corresponding spectral bands, while the pronounced fuchsia colour tells us those bands have low correlation. Now let&#39;s divide this data into two groups: from 42 bands we could have two equal sets of variables (each with 21 dimensions), but for purposes of illustration we&#39;ll consider unequal sets, say the first 15 bands as the first group and the remaining bands 16-42 as the second group, hence $p=15$ and $q=27$. There are then $\min(p,q)=n=15$ pairs of canonical variables. Applying CCA we have,
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2017513ca2cbacfa2389.js&quot;&gt;&lt;/script&gt;
The numerical output above contains the $n=15$ canonical correlations. As we can see, the first five canonical correlations are very large, implying that the linear combinations obtained for the first five pairs of canonical variables are highly correlated with each other. Subsequent correlations are interpreted in a similar way. Next, we&#39;ll examine the coefficients of the first five canonical variables to see which bands are most strongly represented in the above canonical correlations. The &lt;code&gt;cancor&lt;/code&gt; function returns the following components:
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;cor&lt;/code&gt; - correlations;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xcoef&lt;/code&gt; - estimated coefficients for the x variables;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ycoef&lt;/code&gt; - estimated coefficients for the y variables;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;xcenter&lt;/code&gt; - the values used to adjust the x variables; and,&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ycenter&lt;/code&gt; - the values used to adjust the y variables.&lt;/li&gt;
&lt;/ol&gt;
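A minimal sketch of how these components can be accessed, with simulated data standing in for the grass reflectance matrix (the split into bands 1-15 and 16-42 mirrors the grouping above):

```r
# Simulated stand-in for the 4096 x 42 grass reflectance matrix.
set.seed(1)
dat = matrix(rnorm(64 * 64 * 42), ncol = 42)

cc = cancor(dat[, 1:15], dat[, 16:42])   # first group vs second group

length(cc$cor)    # min(p, q) = 15 canonical correlations
head(cc$xcoef)    # coefficients of the U_i (x group)
head(cc$ycoef)    # coefficients of the V_i (y group)
```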
We are interested in &lt;code&gt;xcoef&lt;/code&gt; and &lt;code&gt;ycoef&lt;/code&gt;; the coefficients of the first three canonical variables $U_i$ and $V_i$ are plotted below,
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkiVRY6sskuXx0i0cBvvgMJb9Jdjyd3KmGLQa8EKS2yfyqDNbSBoUDRqXetx30wzV_yRsiothyt8WaFb-ncaBAmv3830I2SLydyqrCY-T4lwd3Nnxt7HhQIKIhtDRr5ydNzFWid1FsP3cp/s1600/Rplot08.png&quot; /&gt;&lt;/div&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXAtyMg7sTFLNwNnMT0wSDc_D5pqWe8qzW7NLU5OcYjOpBysEKSR_HDOeGMtInCCrES8iIBQ3k2Be8pmqTBgXKMqS5trt62fezZamroTJRtxG1F6Hy_HYFScEmqlHJKkZOMHBNZ5sn4H8U/s1600/Rplot10.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/cf7cf403193d80b1ca01.js&quot;&gt;&lt;/script&gt;
A closer look at the plot of the coefficients of the first three $U_i$ random variables shows loadings fluctuating between negative and positive values, so that $U_1, U_2,$ and $U_3$ are contrasts of the spectral bands. A similar situation is observed in the plot of the coefficients of the first three $V_i$ random variables, and because of that we cannot give a more specific interpretation of these bands. 
&lt;br/&gt;&lt;br/&gt;
&lt;h3&gt;Test of Canonical Dimension&lt;/h3&gt;
The dimension of the canonical variates above is $n = 15$; let&#39;s check whether all of these are statistically significant. We&#39;ll use the &lt;a href=&quot;http://cran.r-project.org/web/packages/CCP/index.html&quot; target = &quot;_blank&quot;&gt;CCP&lt;/a&gt; (Significance Tests for Canonical Correlation Analysis) R package, whose &lt;code&gt;p.asym&lt;/code&gt; function will do the job for us.
&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8455ccc15a92a631104d.js&quot;&gt;&lt;/script&gt;
The above output tells us that at the 0.05 level of significance, only the first 13 of the 15 canonical dimensions are significant. &lt;br/&gt;&lt;br/&gt;
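For reference, here is a self-contained sketch of the test, with simulated data standing in for the grass image (assuming the CCP package has been installed via install.packages(&quot;CCP&quot;)):

```r
library(CCP)   # Significance Tests for Canonical Correlation Analysis

set.seed(1)
dat = matrix(rnorm(64 * 64 * 42), ncol = 42)   # stand-in for the grass data
cc  = cancor(dat[, 1:15], dat[, 16:42])

# Wilks' lambda test; other options are "Hotelling", "Pillai", and "Roy"
out = p.asym(cc$cor, N = nrow(dat), p = 15, q = 27, tstat = "Wilks")
```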
For more on CCA using R, please check Reference 3. If you want to perform it in SAS, you might want to check Reference 2, and for more on imaging I suggest Reference 1. 
&lt;br /&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align: left;&quot;&gt;
Reference&lt;/h3&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;&lt;a href=&quot;http://www.amazon.com/Statistics-Imaging-Optics-Photonics-Bajorski/dp/0470509457&quot; target=&quot;_blank&quot;&gt;Bajorski, P. (2012). &lt;i&gt;Statistics for Imaging, Optics, and Photonics&lt;/i&gt;. John Wiley &amp;amp; Sons, Inc.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://onlinecourses.science.psu.edu/stat505/node/63&quot;&gt;Stat 505 - Applied Multivariate Statistical Analysis. &lt;i&gt;Lesson 8: Canonical Correlation Analysis&lt;/i&gt;. Eberly College of Science, Pennsylvania State University (Penn State).&lt;/a&gt; (accessed January 2, 2015)&lt;/li&gt;
&lt;li&gt;R Data Analysis Examples: Canonical Correlation Analysis. UCLA: Statistical Consulting Group. From &lt;a href=&quot;http://www.ats.ucla.edu/stat/r/dae/canonical.htm&quot; target = &quot;_blank&quot;&gt;http://www.ats.ucla.edu/stat/r/dae/canonical.htm&lt;/a&gt; (accessed January 4, 2015)&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/3346169097209319093/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2015/01/canonical-correlation-analysis-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3346169097209319093'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3346169097209319093'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2015/01/canonical-correlation-analysis-on.html' title='R: Canonical Correlation Analysis on Imaging'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJeHlSjCEo0GJ8GmjaVNXg5aGtz65OjshUtvYG4JVOO6D6O-a9lATm8FlJOpyaxXh395Y1gCzuJr85-airoiq1H8XyfFMSkbza7V7NjTQ-tl9txgt-SRhBzWH2MdEyqVGNHJU3ADodJety/s72-c/Rplot03.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-3681315616899008407</id><published>2014-12-25T20:26:00.001+08:00</published><updated>2015-12-27T09:52:43.876+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Data Mining"/><category scheme="http://www.blogger.com/atom/ns#" term="Image Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="LaTeX"/><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="Multivariate Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" 
term="Statistical Learning"/><title type='text'>R: Principal Component Analysis on Imaging</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Ever wonder what&#39;s the mathematics behind face recognition on most gadgets like digital cameras and smartphones? Well, for the most part it has something to do with statistics. One statistical tool capable of such a feature is Principal Component Analysis (PCA). In this post, however, we will not do face recognition (sorry to disappoint you), as we reserve this for a future post while I&#39;m still doing research on it. Instead, we go through its basic concept and use it for data reduction on the spectral bands of an image using R.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Let&#39;s view it mathematically&lt;/h3&gt;
Consider a line $L$ in a parametric form described as a set of all vectors $k\cdot\mathbf{u}+\mathbf{v}$ parameterized by $k\in \mathbb{R}$, where $\mathbf{v}$ is a vector orthogonal to a &lt;a href=&quot;http://en.wikipedia.org/wiki/Unit_vector&quot; target=&quot;_blank&quot;&gt;normalized vector&lt;/a&gt; $\mathbf{u}$. Below is the graphical equivalent of the statement:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgex0TuNA-RxBG_c9TPXYTjNLO30lB-zUqF6A8LxgOGVv11juKjAnrZuBxcD_9NtTui2NWt3h0rLQVzpQcGnyGaEgSzkEg85GeoM1O8QPt6dm7YgrUjSJF3lhDExxse6B6jbZtGGTlh-zka/s1600/Screen+Shot+2014-12-21+at+6.49.01+PM.png&quot; height=&quot;217&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;So if given a point $\mathbf{x}=[x_1,x_2]^T$, the orthogonal projection of this point on the line $L$ is given by $(\mathbf{u}^T\mathbf{x})\mathbf{u}+\mathbf{v}$. Graphically, we mean
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiznlFMfI4-7jkpk1X7U4ROYSPt4AC8CocHhHUlrh9-AncvFqOGy-23rukpUlcFz9CXtczlNVCe-SU3hBNPfSSKMhCWsj4wIzxdK9-N1QQlAOQY_-3L4A94wxGRU5wF6rwCTHDHFSpukl-V/s1600/Screen+Shot+2014-12-25+at+2.25.12+PM.png&quot; height=&quot;267&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/b392b73d73fdb1b000ee&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;
&lt;br /&gt;
$Proj$ is the projection of the point $\mathbf{x}$ onto the line, and its position along the line is given by the scalar $\mathbf{u}^{T}\mathbf{x}$. Therefore, if we let $\mathbf{X}=[X_1, X_2]^T$ be a random vector, then the random variable $Y=\mathbf{u}^T\mathbf{X}$ describes the variability of the data in the direction of the normalized vector $\mathbf{u}$; $Y$ is a linear combination of the $X_i, i=1,2$. &lt;i&gt;The principal component analysis identifies linear combinations of the original variables $\mathbf{X}$ that contain most of the information, in the sense of variability, contained in the data. The general assumption is that useful information is proportional to the variability. PCA is used for data dimensionality reduction and for interpretation of data. (Ref 1. Bajorski, 2012)&lt;/i&gt;&lt;br /&gt;
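The projection formula can be checked numerically with made-up vectors $\mathbf{u}$, $\mathbf{v}$, and point $\mathbf{x}$:

```r
u = c(1, 2) / sqrt(5)    # normalized direction vector of the line L
v = c(-2, 1) / sqrt(5)   # offset vector, orthogonal to u
x = c(3, 1)              # the point to project

proj = as.numeric(crossprod(u, x)) * u + v   # (u^T x) u + v

# the displacement from the projection back to x is orthogonal to u
sum((x - proj) * u)   # 0 (up to floating point)
```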
&lt;br /&gt;
To better understand this, consider two dimensional data set, below is the plot of it along with two lines ($L_1$ and $L_2$) that are orthogonal to each other:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiaL8EuNiiCDSYPMU-Xxf3Y55qRAfnrcaUIDItJuUb18MNBMzjwEgyBFY8NTo0l9E4nTOO0PqH-XEsAHD4PrfBilrlpr2Ih1r4bs3GJP2LDNZIWVcL3B6vf1xTDRBGwScNzxlaV7l1NUvbO/s1600/Screen+Shot+2014-12-22+at+8.53.11+PM.png&quot; height=&quot;400&quot; width=&quot;313&quot; /&gt;&lt;/div&gt;
If we project the points orthogonally to both lines we have,

&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUeuXn23UhsLlfWIUpqdI2_FEtbvRHlT6xiQY5KhIXahHxYbe89I71wxMXSUdBZc5ot0SLW7LVULKor-JadxFicykR03MIVUrULpAVstm9XpAat3ZhzIagAeLB7uyPUY12wHwz1oS7YoUE/s1600/Screen+Shot+2014-12-22+at+9.11.24+PM.png&quot; height=&quot;400&quot; width=&quot;312&quot; /&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/ec5bfa3342b95029b9b7&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;
&lt;br /&gt;
So that if the normalized vector $\mathbf{u}_1$ defines the direction of $L_1$, then the variability of the points on $L_1$ is described by the random variable $Y_1=\mathbf{u}_1^T\mathbf{X}$. Likewise, if $\mathbf{u}_2$ is a normalized vector that defines the direction of $L_2$, then the variability of the points on this line is described by the random variable $Y_2=\mathbf{u}_2^T\mathbf{X}$. The first principal component is the one with maximum variability. In this case we can see that $Y_2$ is more variable than $Y_1$, since the points projected on $L_2$ are more dispersed than those on $L_1$. In practice, the linear combinations $Y_i = \mathbf{u}_i^T\mathbf{X}, i=1,2,\cdots,p$ are maximized sequentially, so that $Y_1$ is the linear combination of the first principal component, $Y_2$ of the second principal component, and so on. Further, the estimate of the direction vector $\mathbf{u}$ is simply the normalized eigenvector $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$, and the variability explained by the principal component is the corresponding eigenvalue $\lambda$. For more details on the theory of PCA refer to (Bajorski, 2012) in Reference 1 below.&lt;br /&gt;
&lt;br /&gt;
As promised, we will do dimensionality reduction using PCA. We will use the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data from (Bajorski, 2012); you can use other locations of AVIRIS data that can be downloaded &lt;a href=&quot;http://aviris.jpl.nasa.gov/data/get_aviris_data.html&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;. However, since in most cases the AVIRIS data contains thousands of bands, for simplicity we will stick with the data given in (Bajorski, 2012), which has been cleaned down to 152 bands only.&lt;br /&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align: left;&quot;&gt;
What are spectral bands?&lt;/h3&gt;
&lt;div&gt;
In imaging, spectral bands refer to the third dimension of the image, usually denoted $\lambda$. For example, an RGB image contains red, green and blue bands, as shown below along with the first two dimensions $x$ and $y$ that define the resolution of the image.&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqkXXbdzZdToQVa3U37P8hwVlqpF4N0r3BFa_tk5T9LZg57wICyT-bdgatLbt8IUBps6MzZc0RcoHvIfkcx1T8A9qIf2uXT7YG1meqHBUWcn9H_93Rl4eVDabdTwlSx_XPu1QXaH2FN7Bi/s1600/Screen+Shot+2014-12-23+at+12.11.08+AM.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqkXXbdzZdToQVa3U37P8hwVlqpF4N0r3BFa_tk5T9LZg57wICyT-bdgatLbt8IUBps6MzZc0RcoHvIfkcx1T8A9qIf2uXT7YG1meqHBUWcn9H_93Rl4eVDabdTwlSx_XPu1QXaH2FN7Bi/s1600/Screen+Shot+2014-12-23+at+12.11.08+AM.png&quot; height=&quot;200&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://gist.github.com/alstat/4c333c590cbead3fa067&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;LaTeX Code&quot; /&gt;&lt;/center&gt;
&lt;br /&gt;
&lt;div&gt;
These are a few of the bands that are visible to our eyes; there are other bands that are not visible to us, like infrared, and many others in the &lt;a href=&quot;http://en.wikipedia.org/wiki/Electromagnetic_spectrum&quot; target=&quot;_blank&quot;&gt;electromagnetic spectrum&lt;/a&gt;. That is why in most cases AVIRIS data contains a huge number of bands, each capturing different characteristics of the image. Below is the proper description of the data.&lt;br /&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align: left;&quot;&gt;
Data&lt;/h3&gt;
The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), is a sensor collecting spectral radiance in the range of wavelengths from 400 to 2500 nm. It has been flown on various aircraft platforms, and many images of the Earth’s surface are available. A 100 by 100 pixel AVIRIS image of an urban area in Rochester, NY, near the Lake Ontario shoreline is shown below. The scene has a wide range of natural and man-made material including a mixture of commercial/warehouse and residential neighborhoods, which adds a wide range of spectral diversity. Prior to processing, invalid bands (due to atmospheric water absorption) were removed, reducing the overall dimensionality to 152 bands. This image has been used in Bajorski et al. (2004) and Bajorski (2011a, 2011b). The first 152 values in the AVIRIS Data represent the spectral radiance values (a spectral curve) for the top left pixel. This is followed by spectral curves of the pixels in the first row, followed by the next row, and so on. (Ref. 1 Bajorski, 2012)&lt;/div&gt;
&lt;br /&gt;
To load the data, run the following code:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5ca5842c5ffe3173e1fb.js&quot;&gt;&lt;/script&gt;

&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihIEpHRRCY4EjRO9RZ1WoEffkthQhXlIwkZRtT2nWjrXhm4YcmkAkDoffIcbxhnQR5IEAW0VYcVmydYeo4W4fvU3guvkYf8IQr1X-tClbeVl582LxgbX7PedXtPxwVS5QNvg1NfroNvPwo/s1600/Rplot.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihIEpHRRCY4EjRO9RZ1WoEffkthQhXlIwkZRtT2nWjrXhm4YcmkAkDoffIcbxhnQR5IEAW0VYcVmydYeo4W4fvU3guvkYf8IQr1X-tClbeVl582LxgbX7PedXtPxwVS5QNvg1NfroNvPwo/s1600/Rplot.png&quot; height=&quot;255&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
The above code uses the EBImage package, which can be installed following my &lt;a href=&quot;http://alstatr.blogspot.com/2014/09/r-image-analysis-using-ebimage.html&quot; target=&quot;_blank&quot;&gt;previous post&lt;/a&gt;.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Why do we need to reduce the dimension of the data?&lt;/h3&gt;
Before we jump into our analysis, you may ask: why? Well, sometimes it&#39;s just difficult to do analysis on high dimensional data, especially when it comes to interpretation. This is because there are dimensions that aren&#39;t significant (redundant, for example), which adds to the difficulty of the analysis. So in order to deal with this, we remove those nuisance dimensions and deal with the significant ones.&lt;br /&gt;
&lt;br /&gt;
To perform PCA in R, we use the function &lt;code&gt;princomp&lt;/code&gt; as seen below:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/bec19685dc2a9cb63ce3.js&quot;&gt;&lt;/script&gt;
The output of &lt;code&gt;princomp&lt;/code&gt; is the list shown above; we will describe selected components. Others can be found in the documentation of the function by executing &lt;code&gt;?princomp&lt;/code&gt;.
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;sdev&lt;/code&gt; - standard deviation, the square root of the eigenvalues $\lambda$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, &lt;code&gt;dat.mat&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;loadings&lt;/code&gt; - eigenvectors $\mathbf{e}$ of the variance-covariance matrix $\mathbf{\Sigma}$ of the data, &lt;code&gt;dat.mat&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scores&lt;/code&gt; - the principal component scores.&lt;/li&gt;
&lt;/ul&gt;
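These relationships can be verified on simulated data; note that princomp estimates the covariance matrix with divisor $n$ rather than $n-1$:

```r
set.seed(1)
X  = matrix(rnorm(500 * 4), ncol = 4)
pc = princomp(X)

S  = cov(X) * (nrow(X) - 1) / nrow(X)   # sample covariance with divisor n
ev = eigen(S, symmetric = TRUE)

# sdev^2 are the eigenvalues; loadings are the eigenvectors (up to sign)
all.equal(as.numeric(pc$sdev^2), ev$values)
all.equal(abs(unclass(pc$loadings)), abs(ev$vectors), check.attributes = FALSE)
```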
Recall that the objective of PCA is to find a linear combination $Y=\mathbf{u}^T\mathbf{X}$ that maximizes the variance $Var(Y)$. From the output, the estimates of the components of $\mathbf{u}$ are the entries of the &lt;code&gt;loadings&lt;/code&gt;, which is a matrix of eigenvectors whose columns correspond to the eigenvectors of the successive principal components. That is, if the first principal component is given by $Y_1=\mathbf{u}_1^T\mathbf{X}$, then the estimate of $\mathbf{u}_1$, which is $\mathbf{e}_1$ (an eigenvector), is the set of coefficients in the first column of the &lt;code&gt;loadings&lt;/code&gt;. The variability explained by the first principal component is the square of the first standard deviation in &lt;code&gt;sdev&lt;/code&gt;, the variability explained by the second principal component is the square of the second, and so on. Now let&#39;s interpret the loadings (coefficients) of the first three principal components. Below is the plot of this,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhDLVF1vTLJTDq76Q9ySAViej5pJcdGcVniJA7MhfJQMtpY_KxaWuasRx7K0Wwe3R5l8NsGigxY_z6e4UjIIx4KzRVRkiD6HPpDNr-iheliM2YnP-ZGNs5Ywp23nyA9vRvlvZxTOMQXinVs/s1600/Rplot01.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a1ebd7d3ae1661569170.js&quot;&gt;&lt;/script&gt;
Based on the above, the coefficients of the first principal component (PC1) are almost all negative. On closer look, the variability in this principal component is mainly explained by a weighted average of the radiance of spectral bands 35 to 100. Analogously, PC2 mainly represents the variability of a weighted average of the radiance of spectral bands 1 to 34. Further, the fluctuation of the coefficients of PC3 makes it difficult to tell which bands contribute most to its variability. Aside from examining the loadings, another way to see the impact of the PCs is through the &lt;i&gt;impact plot&lt;/i&gt;, where the &lt;i&gt;impact curves&lt;/i&gt; $\sqrt{\lambda_j}\mathbf{e}_j$ are plotted; I encourage you to explore that. &lt;br /&gt;
&lt;br /&gt;
Moving on, let&#39;s investigate the percent of variability in $X_i$ explained by the $j$th principal component, which is given by the formula
\begin{equation}\nonumber
\frac{\lambda_j\cdot e_{ij}^2}{s_{ii}},
\end{equation}
where $s_{ii}$ is the estimated variance of $X_i$. Below is the percent of explained variability in $X_i$ for the first three principal components, including the cumulative percent variability (the sum over PC1, PC2, and PC3),
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivdWT5oKaDMlBQv17qYi20OebfcScGiNEs6j9wp927C_CBNSZnNbRSd1lyrrpcQHbKFTvE5p5xq5zVCeVTxDR1Y80sRMxHXhCFkEkp4BbP7z8s_KCnSusIu4vf8DEnZayloouWp8D4_6ha/s1600/Rplot03.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d8a599e8dd802a009cae.js&quot;&gt;&lt;/script&gt;
For the variability of the first 33 bands, PC2 accounts for about 90 percent of the explained variability, as seen in the above plot, and it also contributes substantially to bands 102 to 152. On the other hand, from bands 37 to 100, PC1 explains almost all the variability, with PC2 and PC3 explaining only 0 to 1 percent. The sum of the percentages of explained variability of these principal components is indicated by the orange line in the above plot, which is the cumulative percent variability.&lt;br /&gt;
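The computation behind such per-band percentages can be sketched as follows on simulated data; over all $p$ principal components the percentages sum to 100 for each variable:

```r
set.seed(1)
X  = matrix(rnorm(300 * 5), ncol = 5)
pc = princomp(X)

lambda = as.numeric(pc$sdev^2)                     # eigenvalues
E      = unclass(pc$loadings)                      # eigenvectors, columns e_j
s.ii   = diag(cov(X)) * (nrow(X) - 1) / nrow(X)    # Var(X_i), divisor n

# pct[i, j] = 100 * lambda_j * e_ij^2 / s_ii
pct = 100 * sweep(E^2 %*% diag(lambda), 1, s.ii, "/")

rowSums(pct)   # each row sums to 100: the PCs jointly explain all of Var(X_i)
```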
&lt;br /&gt;
To wrap up this section, here is the percentage of the explained variability of the first 10 PCs.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;PC1&lt;/th&gt;&lt;th&gt;PC2&lt;/th&gt;&lt;th&gt;PC3&lt;/th&gt;&lt;th&gt;PC4&lt;/th&gt;&lt;th&gt;PC5&lt;/th&gt;&lt;th&gt;PC6&lt;/th&gt;&lt;th&gt;PC7&lt;/th&gt;&lt;th&gt;PC8&lt;/th&gt;&lt;th&gt;PC9&lt;/th&gt;&lt;th&gt;PC10&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;10&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: Variability Explained by the First Ten Principal Components for the AVIRIS data.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;82.057&lt;/td&gt;&lt;td&gt;17.176&lt;/td&gt;&lt;td&gt;0.320&lt;/td&gt;&lt;td&gt;0.182&lt;/td&gt;&lt;td&gt;0.094&lt;/td&gt;&lt;td&gt;0.065&lt;/td&gt;&lt;td&gt;0.037&lt;/td&gt;&lt;td&gt;0.029&lt;/td&gt;&lt;td&gt;0.014&lt;/td&gt;&lt;td&gt;0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
The above variabilities were obtained by noting that the variability explained by a principal component is simply the corresponding eigenvalue (the square of the &lt;code&gt;sdev&lt;/code&gt;) of the variance-covariance matrix $\mathbf{\Sigma}$ of the original variable $\mathbf{X}$. Hence the percentage of variability explained by the $j$th PC is its eigenvalue $\lambda_j$ divided by the overall variability, the sum of the eigenvalues $\sum_{j=1}^{p}\lambda_j$, as we see in the following code,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a19e345d766c64eb88ad.js&quot;&gt;&lt;/script&gt;
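The same computation, as a self-contained check on simulated data:

```r
set.seed(1)
X   = matrix(rnorm(200 * 6), ncol = 6)
pc  = princomp(X)

pve = 100 * pc$sdev^2 / sum(pc$sdev^2)   # percent of variability per PC
round(pve, 3)
sum(pve)   # 100
```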

&lt;h3&gt;
Stopping Rules&lt;/h3&gt;
Given the percentages of variability explained by the PCs in Table 1, how many principal components should we retain to best represent the variability of the original data? To answer that, we introduce the following stopping rules that will guide us in deciding the number of PCs:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Scree plot;&lt;/li&gt;
&lt;li&gt;Simple fair-share;&lt;/li&gt;
&lt;li&gt;Broken-stick; and,&lt;/li&gt;
&lt;li&gt;Relative broken-stick.&lt;/li&gt;
&lt;/ol&gt;
The scree plot is the plot of the variability of the PCs, that is, the plot of the eigenvalues, where we look for an elbow, or sudden drop, of the eigenvalues. For our example we have
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6tXQZGLnx70S-U-E9RBigKg3UyGQAJLx4jJ3Q4vnjb-z5UH1UoNNv6BpYmrIIrtPMfcbzIHHAgSi6JmT2fnQ26loa_RGCZnXrEtDxZcxf2f89KXDfofr49alNLXTfLzBUw7oU_22eISTJ/s1600/Rplot04.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/80133cd378bd3ff9ea02.js&quot;&gt;&lt;/script&gt;
Therefore, we retain the first two principal components based on the elbow shape. However, if the eigenvalues differ by orders of magnitude, it is recommended to use the logarithmic scale, which is illustrated below,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4u9R9Q3mVT3n89iLZgzAwvIRSiZ2KMbsVh4NYp2G-p1wEeguvfNM3X7S6SC-qpiMhRAMJy8tRm3xrh3cSpR-QFNUXc21QuKkKnW0V155KUK_RAixnOpKpdS1U_T0PH4kkRh5sIZ4aXSvL/s1600/Rplot05.png&quot; /&gt;&lt;/div&gt;
Unfortunately, sometimes this won&#39;t work, as we can see here; it&#39;s just difficult to determine where the elbow is.
The succeeding discussions of the last three stopping rules are based on (Bajorski, 2012). The &lt;i&gt;simple fair-share&lt;/i&gt; stopping rule identifies the largest $k$ such that $\lambda_k$ is larger than its fair share, that is, larger than $(\lambda_1+\lambda_2+\cdots+\lambda_p)/p$. To illustrate this, consider the following:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a1d446cea1dea3f9878e.js&quot;&gt;&lt;/script&gt;
Thus, we need to stop at the second principal component.&lt;br /&gt;
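The arithmetic of the fair-share rule can be sketched in a few lines of Python (NumPy assumed; the eigenvalues below are illustrative, not those of the image data):

```python
import numpy as np

def fair_share_k(eigvals):
    """Largest k such that lambda_1, ..., lambda_k all exceed the
    fair share (lambda_1 + ... + lambda_p) / p."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    share = lam.mean()
    k = 0
    for value in lam:
        if value <= share:          # stop at the first eigenvalue below the fair share
            break
        k += 1
    return k

print(fair_share_k([4.0, 2.0, 0.5, 0.3, 0.2]))  # mean is 1.4, so k = 2
```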
&lt;br /&gt;
If one is concerned that the above method retains too many principal components, a &lt;i&gt;broken-stick rule&lt;/i&gt; can be used. This rule identifies the largest $k$ such that $\lambda_j/(\lambda_1+\lambda_2+\cdots +\lambda_p)&amp;gt;a_j$, for all $j\leq k$, where
\begin{equation}\nonumber
a_j = \frac{1}{p}\sum_{i=j}^{p}\frac{1}{i},\quad j =1,\cdots, p.
\end{equation}
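The thresholds $a_j$ and the resulting $k$ can be sketched in Python (NumPy assumed; illustrative eigenvalues, not those of the image data):

```python
import numpy as np

def broken_stick_k(eigvals):
    """Largest k with lambda_j / sum(lambda) > a_j for all j <= k,
    where a_j = (1/p) * sum_{i=j}^{p} 1/i."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = lam.size
    # broken-stick thresholds a_1, ..., a_p
    a = np.array([np.sum(1.0 / np.arange(j, p + 1)) / p for j in range(1, p + 1)])
    prop = lam / lam.sum()          # proportion of variability per PC
    k = 0
    for j in range(p):
        if prop[j] > a[j]:
            k += 1
        else:
            break
    return k

print(broken_stick_k([4.0, 2.0, 0.5, 0.3, 0.2]))  # 2
```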
Let&#39;s try it,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/160b9d116e94a55bb0b2.js&quot;&gt;&lt;/script&gt;
The result above coincides with the first two stopping rules. The drawback of the simple fair-share and broken-stick rules is that they do not work well when the eigenvalues differ by orders of magnitude. In that case, we use the &lt;i&gt;relative broken-stick&lt;/i&gt; rule, where we analyze $\lambda_j$ as the first eigenvalue in the set $\lambda_j\geq \lambda_{j+1}\geq\cdots\geq\lambda_{p}$, where $j &amp;lt; p$. The dimensionality $k$ is chosen as the largest value such that $\lambda_j/(\lambda_j+\cdots +\lambda_p)&amp;gt;b_j$, for all $j\leq k$, where

\begin{equation}\nonumber
b_j = \frac{1}{p-j+1}\sum_{i=1}^{p-j+1}\frac{1}{i}.
\end{equation}
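The relative rule differs from the plain broken-stick rule in that each $\lambda_j$ is compared against the eigenvalues still remaining, not the full set; a Python sketch (NumPy assumed, illustrative eigenvalues):

```python
import numpy as np

def relative_broken_stick_k(eigvals):
    """Largest k with lambda_j / (lambda_j + ... + lambda_p) > b_j
    for all j <= k, where b_j = (1/(p-j+1)) * sum_{i=1}^{p-j+1} 1/i."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    p = lam.size
    k = 0
    for j in range(1, p):                    # the rule requires j < p
        m = p - j + 1                        # number of remaining eigenvalues
        b_j = np.sum(1.0 / np.arange(1, m + 1)) / m
        rel = lam[j - 1] / lam[j - 1:].sum() # share within the remaining set
        if rel > b_j:
            k += 1
        else:
            break
    return k

print(relative_broken_stick_k([4.0, 2.0, 0.5, 0.3, 0.2]))
```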
Applying this to the data we have,
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhCqbGmhUro-O3dXSslqOJAfxwovLIHunSblE9buPVTV9dA0I7tHhnaU4tqiXniJk3Q6YQRusJakdgOIae5z4cXIf0CNcdkj1jJJDbvOf_xl41-umkjf0c2ndUUOxPsFGG0zKtuR2hmK6Pa/s1600/Rplot06.png&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/02733c62f0cd338d54dc.js&quot;&gt;&lt;/script&gt;
According to the numerical output, the first 34 principal components are enough to represent the variability of the original data.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Principal Component Scores&lt;/h3&gt;
The principal component scores are the new data set obtained from the linear combinations $Y_j=\mathbf{e}_j^{\prime}(\mathbf{x}-\bar{\mathbf{x}}),\; j = 1,\cdots, p$. So if we use the first three stopping rules, then below are the scores (as an image) of PC1 and PC2,
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkegveyLr4vtQkT2KHcEIEHIejQKnWjn3x1mDi62MFY7qeyfW5WNj0HVQkvjTK8Lc7tRzJJwhBJmQnrIrl35vF2H0g5p5m4xzrmiJrKWJ7cseBbLyexuqFSlzk_E2xP5mlTLTdpDe1HgHn/s1600/Rplot07.png&quot; height=&quot;255&quot; width=&quot;400&quot; /&gt;&lt;/div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/acc2d9cfc047332a9a54.js&quot;&gt;&lt;/script&gt;
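The score computation itself is just centering the data and projecting onto the leading eigenvectors of the sample covariance matrix; a minimal Python sketch (the function name &lt;code&gt;pc_scores&lt;/code&gt; and the random data are illustrative, not from the post):

```python
import numpy as np

def pc_scores(X, k):
    """Project the centered rows of X onto the first k eigenvectors
    of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    E = eigvecs[:, order[:k]]               # leading k eigenvectors as columns
    return Xc @ E

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
S = pc_scores(X, 2)
print(S.shape)  # (100, 2)
```

Because the data are centered first, each score column has mean zero by construction.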
If we base the decision on the relative broken-stick rule, then we retain the first 34 PCs; below are the corresponding scores (as an image).
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfUjXk8JDKOcG-gA9rMxJ6SQ_skC31eQA2ji5u90tAQtiqI7rwSOuxpJeIOcGYH6P7WTO5vyoHgGzAasnAQL3AbiUdU8o_-bdopsxg6WPYqP3VPAO3X0_3PdE7hoPcvy0pHrI6TL9nkVxG/s1600/Rplot08.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfUjXk8JDKOcG-gA9rMxJ6SQ_skC31eQA2ji5u90tAQtiqI7rwSOuxpJeIOcGYH6P7WTO5vyoHgGzAasnAQL3AbiUdU8o_-bdopsxg6WPYqP3VPAO3X0_3PdE7hoPcvy0pHrI6TL9nkVxG/s400/Rplot08.png&quot; height=&quot;255&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Click on the image to zoom in.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;h3&gt;
Residual Analysis&lt;/h3&gt;
Of course, PCA incurs errors unless one retains all the PCs, which would defeat the purpose: why apply PCA while still keeping all the dimensions? An overview of the errors in PCA, without going through the theory, is that the overall error is simply the excluded variability, that is, the variability explained by the $(k+1)$th to $p$th principal components when the first $k$ are retained.
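Concretely, the unexplained proportion after keeping $k$ PCs is the sum of the excluded eigenvalues over the total; a small Python sketch with illustrative eigenvalues:

```python
import numpy as np

def pca_residual_variance(eigvals, k):
    """Variability left unexplained after keeping the first k PCs:
    the sum of the excluded eigenvalues as a proportion of the total."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return lam[k:].sum() / lam.sum()

# keeping 2 of 5 PCs leaves (0.5 + 0.3 + 0.2) / 7.0 of the variability
print(pca_residual_variance([4.0, 2.0, 0.5, 0.3, 0.2], 2))
```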
&lt;br /&gt;
&lt;br /&gt;

&lt;h3 style=&quot;text-align: left;&quot;&gt;
Reference&lt;/h3&gt;
&lt;div&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;&lt;a href=&quot;http://www.amazon.com/Statistics-Imaging-Optics-Photonics-Bajorski/dp/0470509457&quot; target=&quot;_blank&quot;&gt;Bajorski, P. (2012). &lt;i&gt;Statistics for Imaging, Optics, and Photonics&lt;/i&gt;. John Wiley &amp;amp; Sons, Inc.&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Download PDF Version&lt;/h3&gt;
&lt;center&gt;
&lt;input onclick=&quot;window.open(&#39;https://github.com/alstat/Analysis-with-Programming/raw/master/2014/R/Principal%20Component%20Analysis%20on%20Imaging/Principal%20Component%20Analysis%20on%20Imaging.pdf&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;Click here to download&quot; /&gt;&lt;/center&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/3681315616899008407/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/12/principal-component-analysis-on-imaging.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3681315616899008407'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/3681315616899008407'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/12/principal-component-analysis-on-imaging.html' title='R: Principal Component Analysis on Imaging'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgex0TuNA-RxBG_c9TPXYTjNLO30lB-zUqF6A8LxgOGVv11juKjAnrZuBxcD_9NtTui2NWt3h0rLQVzpQcGnyGaEgSzkEg85GeoM1O8QPt6dm7YgrUjSJF3lhDExxse6B6jbZtGGTlh-zka/s72-c/Screen+Shot+2014-12-21+at+6.49.01+PM.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-2623482324233122897</id><published>2014-10-26T23:21:00.000+08:00</published><updated>2014-10-27T11:14:56.394+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="ALUES"/><category scheme="http://www.blogger.com/atom/ns#" term="Packages"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><title type='text'>ALUES: Agricultural Land Use Evaluation System, R package</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Authors:&lt;br /&gt;
&lt;b&gt;Arnold R. Salvacion&lt;/b&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;input onclick=&quot;window.open(&#39;https://github.com/alstat/ALUES/raw/master/vignette/ALUES.pdf&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;Download PDF Version&quot; /&gt;
&lt;br /&gt;
&lt;i&gt;arsalvacion@gmail.com&lt;/i&gt;&lt;br /&gt;
&lt;a href=&quot;http://r-nold.blogspot.com/&quot; target=&quot;_blank&quot;&gt;Data Analysis and Visualization using R&lt;/a&gt; (blog)&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;input onclick=&quot;window.open(&#39;https://github.com/alstat/ALUES&#39;, &#39;_blank&#39;)&quot; type=&quot;button&quot; value=&quot;Github Repository&quot; /&gt;&lt;br /&gt;&lt;br/&gt;
&lt;b&gt;Al-Ahmadgaid B. Asaad&lt;/b&gt; (maintainer)&lt;br /&gt;
&lt;i&gt;alstated@gmail.com&lt;/i&gt;&lt;br /&gt;
&lt;br /&gt;
Agricultural Land Use Evaluation System (ALUES) is an R package that evaluates land suitability for different crops. The package is based on the Food and Agriculture Organization (FAO) and the International Rice Research Institute (IRRI) methodology for land evaluation. The development of ALUES was inspired by a similar land-evaluation tool, the Land Use Suitability Evaluation Tool (LUSET). The package uses a fuzzy logic approach to evaluate the land suitability of a particular area based on inputs such as rainfall, temperature, topography, and soil properties. The membership functions used for fuzzy modeling are Triangular, Trapezoidal and Gaussian. Methods for computing the overall suitability of a particular area are also included: Minimum, Maximum, Product, Sum, Average, Exponential and Gamma. Finally, ALUES uses the Rcpp library for efficient computation.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
INSTALLATION&lt;/h3&gt;
The package is not yet on CRAN and is currently under development on GitHub. To install it, run the following:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/383a072fc51b4f6e526d.js&quot;&gt;&lt;/script&gt;
We would love to hear your feedback; if you have any suggestions or issues regarding this package, please submit them &lt;a href=&quot;https://github.com/alstat/ALUES/issues&quot; target=&quot;_blank&quot;&gt;here&lt;/a&gt;.&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;h3 style=&quot;text-align: left;&quot;&gt;
DATASET&lt;/h3&gt;
The package contains several datasets which can be categorized into two:
&lt;br /&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;&lt;b&gt;Land Units&#39; Attributes&lt;/b&gt; - datasets that contain the attributes of the land units of a given location.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Crop Requirements&lt;/b&gt; - datasets that contain the required values of factors of a particular crop for the land units.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Land Units&#39; Attributes&lt;/h3&gt;
The package contains sample dataset of land units&#39; attributes from two countries:&lt;br /&gt;
&lt;ol style=&quot;text-align: left;&quot;&gt;
&lt;li&gt;&lt;b&gt;Marinduque, Philippines&lt;/b&gt;:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;MarinduqueLT&lt;/code&gt; - a dataset consisting of the land and terrain characteristics of the land units of Marinduque, Philippines;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MarinduqueTemp&lt;/code&gt; - a dataset consisting of the temperature characteristics of the land units of Marinduque, Philippines; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;MarinduqueWater&lt;/code&gt; - a dataset consisting of the water characteristics of the land units of Marinduque, Philippines.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Lao Cai, Vietnam&lt;/b&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;LaoCaiLT&lt;/code&gt; - a dataset consisting of the land and terrain characteristics of the land units of Lao Cai, Vietnam;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LaoCaiTemp&lt;/code&gt; - a dataset consisting of the temperature characteristics of the land units of Lao Cai, Vietnam; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LaoCaiWater&lt;/code&gt; - a dataset consisting of the water characteristics of the land units of Lao Cai, Vietnam.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
For example, the first six land units in &lt;code&gt;MarinduqueLT&lt;/code&gt; are shown below&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d43faeaec0ed1475c809.js&quot;&gt;&lt;/script&gt;
The complete list of factors is available in the pdf version.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Crop Requirements&lt;/h3&gt;
The crops available in the package are listed in Table 1.
&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;&lt;th&gt;Crops&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;4&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: Crops Dataset Available in ALUES.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;BANANA-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Banana&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;CASSAVA-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Cassava&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;COCOA-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Cocoa&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;COCONUT-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Coconut&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;COFFEEAR-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Arabica Coffee&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;COFFEERO-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Robusta Coffee&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;RICEBR-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Rainfed Bunded Rice&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;RICEIW-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Irrigated Rice&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;RICENF-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Rice Cultivation Under Natural Floods&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;RICEUR-&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Rainfed Upland Rice&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
From the table, the codes are suffixed with the land units&#39; characteristics (&lt;code&gt;TerrainCR&lt;/code&gt;, &lt;code&gt;SoilCR&lt;/code&gt;, &lt;code&gt;WaterCR&lt;/code&gt; and &lt;code&gt;TemperatureCR&lt;/code&gt;) required for the crop. For example, below are the required values for the terrain characteristics of the land units for cultivating coconut:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/dcb86c3e65433a27e391.js&quot;&gt;&lt;/script&gt;
For the required soil, water and temperature characteristics for cultivating coconut, the codes are &lt;code&gt;COCONUTSoilCR&lt;/code&gt;, &lt;code&gt;COCONUTWaterCR&lt;/code&gt; and &lt;code&gt;COCONUTTemperatureCR&lt;/code&gt;, respectively.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
R FUNCTIONS&lt;/h3&gt;
The package contains the following functions:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;suitability&lt;/code&gt; - computes the suitability scores and classes of the land units based on the requirements of the crop.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;overall_suit&lt;/code&gt; - computes the overall suitability of the land units, using the suitability scores obtained from the &lt;code&gt;suitability&lt;/code&gt; function.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
Suitability&lt;/h3&gt;
In this section, we go into the details of the &lt;code&gt;suitability&lt;/code&gt; function.&lt;br /&gt;
&lt;b&gt;Usage&lt;/b&gt;&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/140638b109dc94d9eadc.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table style=&quot;width: 550px;&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;x&lt;/code&gt;&lt;/td&gt;&lt;td&gt;a data frame consisting of the properties of the land units;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;y&lt;/code&gt;&lt;/td&gt;&lt;td&gt;a data frame consisting of the crop (e.g. coconut, cassava, etc.) requirements for a given characteristic (terrain, soil, water or temperature);&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;mf&lt;/code&gt;&lt;/td&gt;&lt;td&gt;membership function; the default is &lt;code&gt;&quot;triangular&quot;&lt;/code&gt;. The other fuzzy models are &lt;code&gt;&quot;Trapezoidal&quot;&lt;/code&gt; and &lt;code&gt;&quot;Gaussian&quot;&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;sow.month&lt;/code&gt;&lt;/td&gt;&lt;td&gt;sowing month of the crop. Takes integers from 1 to 12 (inclusive), representing the twelve months of the year. If set to 1, the function assumes a January sowing month.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;min&lt;/code&gt;&lt;/td&gt;&lt;td&gt;factor&#39;s minimum value. If &lt;code&gt;NULL&lt;/code&gt; (default), &lt;code&gt;min&lt;/code&gt; is set to 0. If numeric of length one, say 0.5, then the minimum is set to 0.5 for all factors. If the factors of the land units (&lt;code&gt;x&lt;/code&gt;) have different minimums, these can be concatenated into a vector of &lt;code&gt;min&lt;/code&gt;s whose length equals the number of factors in &lt;code&gt;x&lt;/code&gt;. However, if set to &lt;code&gt;&quot;average&quot;&lt;/code&gt;, then &lt;code&gt;min&lt;/code&gt; is computed as follows:&lt;br /&gt;
&lt;br /&gt;
Let X be a factor with suitability classes S3, S2 and S1, and let the scores of these classes be $a, b$ and $c$, respectively. Then,
$$\mathrm{min} = a - \displaystyle\frac{(b - a) + (c - b)}{2}$$
For factors with suitability classes S3, S2, S1, S1, S2 and S3 and scores $a, b, c, d, e$ and $f$, respectively, &lt;code&gt;min&lt;/code&gt; is computed as,
$$\mathrm{min} = a - \displaystyle\frac{(b - a) + (c - b) + (d - c) + (e - d) + (f - e)}{5}$$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;max&lt;/code&gt;&lt;/td&gt;&lt;td&gt;factor&#39;s maximum value. The default is &lt;code&gt;&quot;average&quot;&lt;/code&gt;. If numeric of length one, say 50, then the maximum is set to 50 for all factors. If the factors of the land units (&lt;code&gt;x&lt;/code&gt;) have different maximums, these can be concatenated into a vector of &lt;code&gt;max&lt;/code&gt;s whose length equals the number of factors in &lt;code&gt;x&lt;/code&gt;. However, if set to &lt;code&gt;&quot;average&quot;&lt;/code&gt;, then &lt;code&gt;max&lt;/code&gt; is computed from the equation below:
$$\mathrm{max}=c + \displaystyle\frac{(b-a) + (c-b)}{2}$$
For factors with suitability classes S3, S2, S1, S1, S2 and S3 and scores $a, b, c, d, e$ and $f$, respectively,
$$\mathrm{max} = f + \displaystyle\frac{(b - a) + (c - b) + (d - c) + (e - d) + (f - e)}{5}$$&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;interval&lt;/code&gt;&lt;/td&gt;&lt;td&gt;domain for every suitability class (S1, S2, S3, and N). If &lt;code&gt;&quot;fixed&quot;&lt;/code&gt;, the interval would be 0 to 0.25 for N (Not Suitable), 0.25 to 0.50 for S3 (Marginally Suitable), 0.50 to 0.75 for S2 (Moderately Suitable), and 0.75 to 1 for S1 (Highly Suitable). If &lt;code&gt;&quot;unbias&quot;&lt;/code&gt;, then the interval is set to 0 to $\displaystyle\frac{a}{\mathrm{max}}$ for N, $\displaystyle\frac{a}{\mathrm{max}}$ to $\displaystyle\frac{b}{\mathrm{max}}$ for S3, $\displaystyle\frac{b}{\mathrm{max}}$ to $\displaystyle\frac{c}{\mathrm{max}}$ for S2, and $\displaystyle\frac{c}{\mathrm{max}}$ to $\displaystyle\frac{\mathrm{max}}{\mathrm{max}}$ for S1.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
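The &lt;code&gt;&quot;average&quot;&lt;/code&gt; formulas for &lt;code&gt;min&lt;/code&gt; and &lt;code&gt;max&lt;/code&gt; amount to offsetting the outermost class scores by the mean gap between consecutive scores; a Python sketch (the helper names &lt;code&gt;average_min&lt;/code&gt; and &lt;code&gt;average_max&lt;/code&gt; are hypothetical, not part of ALUES, and the scores are illustrative):

```python
def average_min(scores):
    """'Average' minimum: first score minus the mean gap between
    consecutive suitability-class scores."""
    gaps = [scores[i + 1] - scores[i] for i in range(len(scores) - 1)]
    return scores[0] - sum(gaps) / len(gaps)

def average_max(scores):
    """Symmetric 'average' maximum: last score plus the mean gap."""
    gaps = [scores[i + 1] - scores[i] for i in range(len(scores) - 1)]
    return scores[-1] + sum(gaps) / len(gaps)

# scores a=29, b=30, c=50: mean gap is ((30-29) + (50-30)) / 2 = 10.5
print(average_min([29, 30, 50]))  # 29 - 10.5 = 18.5
print(average_max([29, 30, 50]))  # 50 + 10.5 = 60.5
```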
&lt;br /&gt;
&lt;b&gt;Output&lt;/b&gt;&lt;br /&gt;
The function returns the following output:&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Actual Factors Evaluated;&lt;/li&gt;
&lt;li&gt;Suitability Score;&lt;/li&gt;
&lt;li&gt;Suitability Class;&lt;/li&gt;
&lt;li&gt;Factors&#39; Minimum Values; and,&lt;/li&gt;
&lt;li&gt;Factors&#39; Maximum Values.&lt;/li&gt;
&lt;/ol&gt;
&lt;i&gt;Example&lt;/i&gt;: To test the suitability of the land units in Marinduque, Philippines, for terrain requirements of coconut, we have&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/dac48c584bc347a35924.js&quot;&gt;&lt;/script&gt;
Before we run the function, let&#39;s check the possible output. From the land units (&lt;code&gt;MarinduqueLT&lt;/code&gt;), the only factor available to be evaluated against the required soil characteristics of coconut is &lt;code&gt;CFragm&lt;/code&gt;. The first land unit has 11% coarse fragment (CFragm), which falls within the S1 domain of the required soil characteristics, [&lt;code&gt;min&lt;/code&gt;, 15%), where &lt;code&gt;min&lt;/code&gt; defaults to 0. The second to sixth land units are also highly suitable, as they fall within the same domain. Let&#39;s confirm this using the function,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/11a2a689e924e0c901c8.js&quot;&gt;&lt;/script&gt;
Extract the first 6 of the outputs,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/a7d4ed27a086b36dd813.js&quot;&gt;&lt;/script&gt;
Indeed, just as we argued earlier.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;
Options for &lt;code&gt;mf&lt;/code&gt; (Membership Function)&lt;/b&gt;&lt;br /&gt;
The membership function is an option for the type of fuzzy model, the available models are the following:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Triangular;&lt;/li&gt;
&lt;li&gt;Trapezoidal; and,&lt;/li&gt;
&lt;li&gt;Gaussian.&lt;/li&gt;
&lt;/ol&gt;
The suitability scores are computed based on these fuzzy models.&lt;br /&gt;
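As an illustration of the idea (a generic sketch, not the exact ALUES internals), a triangular membership function rises linearly to 1 at a peak value and falls back to 0 outside its support:

```python
def triangular_mf(x, lo, peak, hi):
    """Triangular membership: 0 outside (lo, hi), rising linearly
    to 1 at the peak and falling linearly back to 0."""
    if x <= lo or x >= hi:
        return 0.0
    if x <= peak:
        return (x - lo) / (peak - lo)
    return (hi - x) / (hi - peak)

print(triangular_mf(5.0, 0.0, 5.0, 10.0))  # 1.0 at the peak
print(triangular_mf(2.5, 0.0, 5.0, 10.0))  # 0.5 halfway up the rising edge
```

The trapezoidal and Gaussian variants differ only in the shape of this curve; the score for each factor is the membership value of the observed measurement.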
&lt;br /&gt;
&lt;b&gt;Options for &lt;code&gt;sow.month&lt;/code&gt; (Sowing Month)&lt;/b&gt;&lt;br /&gt;
The &lt;code&gt;sow.month&lt;/code&gt; argument is the sowing month, which takes integers from 1 to 12 representing the twelve months of the year. If set to 1, the function assumes a January sowing month. This argument is only used for water and temperature characteristics.&lt;br /&gt;
&lt;br /&gt;
To illustrate this, we will test the land units of Marinduque for the required water and temperature for rainfed bunded rice. Thus, we have&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/733313cc615dc901ea92.js&quot;&gt;&lt;/script&gt;
We will first test the land units for water; here are the water requirements for rainfed bunded rice,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/3928701add1a2f612e04.js&quot;&gt;&lt;/script&gt;
The factors to be evaluated here are the following:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;WmAv1&lt;/code&gt; - Mean precipitation of first month (mm);&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv2&lt;/code&gt; - Mean precipitation of second month (mm);&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv3&lt;/code&gt; - Mean precipitation of third month (mm); and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv4&lt;/code&gt; - Mean precipitation of fourth month (mm).&lt;/li&gt;
&lt;/ol&gt;
If sowing month is set to November, then we have
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;WmAv1&lt;/code&gt; - November;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv2&lt;/code&gt; - December;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv3&lt;/code&gt; - January; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv4&lt;/code&gt; - February.&lt;/li&gt;
&lt;/ol&gt;
So for November, we see the first land unit falls within the domain of S1, that is, 277 mm falls within [175, 500 mm). The same holds for the first land unit in December: highly suitable. Let&#39;s fire up the function to confirm that,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/eca0f662e2629689dcb1.js&quot;&gt;&lt;/script&gt;
You will get this error if there are no factors to be evaluated. What happened here is that the function treated the data as neither water nor temperature characteristics, and thus ignored the &lt;code&gt;WmAv1&lt;/code&gt;, &lt;code&gt;WmAv2&lt;/code&gt;, &lt;code&gt;WmAv3&lt;/code&gt; and &lt;code&gt;WmAv4&lt;/code&gt; factors. But if we specify the sowing month (&lt;code&gt;sow.month&lt;/code&gt;) as November (&lt;code&gt;11&lt;/code&gt;), then we have&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/103f35abf3c55b8547f4.js&quot;&gt;&lt;/script&gt;
The first land unit for November is indeed confirmed as S1, but December is not; S2 is given instead. This problem will be discussed in detail under the &lt;code&gt;interval&lt;/code&gt; argument later.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Options for &lt;code&gt;min&lt;/code&gt; (Factors&#39; Minimum Value)&lt;/b&gt;&lt;br /&gt;
By default, &lt;code&gt;min = 0&lt;/code&gt; for all factors. It can be assigned any nonnegative number; for example, using the cassava soil requirements,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/da1400c9412120a91be4.js&quot;&gt;&lt;/script&gt;
Now let&#39;s try different minimums for the factors; we will use the following:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;CECc&lt;/th&gt;&lt;th&gt;pHH20&lt;/th&gt;&lt;th&gt;CFragm&lt;/th&gt;&lt;th&gt;SoilTe&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;4&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 2: Custom &lt;code&gt;min&lt;/code&gt;.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;0.4&lt;/td&gt;&lt;td&gt;0.6&lt;/td&gt;&lt;td&gt;0.1&lt;/td&gt;&lt;td&gt;0.3&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c74262c7a59515855b10.js&quot;&gt;&lt;/script&gt;
So we got an error. This is expected, since the length of the &lt;code&gt;min&lt;/code&gt; vector should equal the number of factors in &lt;code&gt;x&lt;/code&gt;, which is 6. Since we are not interested in the latitude (&lt;code&gt;X&lt;/code&gt;) and longitude (&lt;code&gt;Y&lt;/code&gt;) factors of the dataset, we can omit the two and rerun the code,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/839483a52b652c31bd29.js&quot;&gt;&lt;/script&gt;
Only CECc and SoilTe are returned since these are the factors evaluated.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Options for &lt;code&gt;max&lt;/code&gt; (Factors&#39; Maximum Value)&lt;/b&gt;&lt;br /&gt;
By default &lt;code&gt;max = &#39;average&#39;&lt;/code&gt;; just like &lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt; can be assigned any positive number, for example:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/7b6c3213361b7c0c1a59.js&quot;&gt;&lt;/script&gt;
For a different maximum value for every factor, we will use the following, omitting the first two factors in &lt;code&gt;MarinduqueLT&lt;/code&gt; as we did in the previous section.&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;CECc&lt;/th&gt;&lt;th&gt;pHH20&lt;/th&gt;&lt;th&gt;CFragm&lt;/th&gt;&lt;th&gt;SoilTe&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;4&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 3: Custom &lt;code&gt;max&lt;/code&gt;.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;52.5&lt;/td&gt;&lt;td&gt;8.8&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/1eebf0604e0fccc1f141.js&quot;&gt;&lt;/script&gt;
&lt;b&gt;Options for &lt;code&gt;interval&lt;/code&gt; (Domain of Suitability Scores)&lt;/b&gt;&lt;br /&gt;
The domain of the suitability scores defaults to &lt;code&gt;&#39;fixed&#39;&lt;/code&gt;; with this option, the domains of the suitability scores are,&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Class&lt;/th&gt;&lt;th&gt;N&lt;/th&gt;&lt;th&gt;S3&lt;/th&gt;&lt;th&gt;S2&lt;/th&gt;&lt;th&gt;S1&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;5&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 4: Domain for &lt;code&gt;&#39;fixed&#39;&lt;/code&gt;.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Domain&lt;/td&gt;&lt;td&gt;[0, 0.25)&lt;/td&gt;&lt;td&gt;[0.25, 0.5)&lt;/td&gt;&lt;td&gt;[0.5, 0.75)&lt;/td&gt;&lt;td&gt;[0.75, 1]&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
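Under the &lt;code&gt;&#39;fixed&#39;&lt;/code&gt; option, classification reduces to comparing a score against these cut points; a Python sketch (the function name &lt;code&gt;classify_fixed&lt;/code&gt; is hypothetical):

```python
def classify_fixed(score):
    """Map a suitability score in [0, 1] to its class under the
    fixed domains: [0, 0.25) N, [0.25, 0.5) S3, [0.5, 0.75) S2, [0.75, 1] S1."""
    if score < 0.25:
        return "N"
    if score < 0.50:
        return "S3"
    if score < 0.75:
        return "S2"
    return "S1"

print(classify_fixed(0.3714))  # S3 under the fixed domains
```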
&lt;br /&gt;
An example of &lt;code&gt;interval = &#39;fixed&#39;&lt;/code&gt; is the one illustrated in &lt;i&gt;Options for &lt;code&gt;sow.month&lt;/code&gt; (Sowing Month)&lt;/i&gt; above. Let us investigate that output; here are the crop requirements for water (the crop of interest being rainfed bunded rice),
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/7f6255ca8cf4166ac55e.js&quot;&gt;&lt;/script&gt;
Given that the starting sowing month assigned is November, then the following factors are evaluated: 
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;WmAv1&lt;/code&gt; - November;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv2&lt;/code&gt; - December;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv3&lt;/code&gt; - January; and&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WmAv4&lt;/code&gt; - February.&lt;/li&gt;
&lt;/ol&gt;
So we extract these factors from the dataset, &lt;code&gt;MarinduqueWater&lt;/code&gt;,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/ed5fda094d3126c3cfd1.js&quot;&gt;&lt;/script&gt;
The suitability scores and classes of this would be,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/c782e7fe6d7ac8c26a72.js&quot;&gt;&lt;/script&gt;
Focus your attention on the suitability scores of the &lt;code&gt;Feb&lt;/code&gt; factor for the first three land units: 0.3714, 0.3714 and 0.3771. Based on Table 4, these fall into classes S3, S3 and S3. But if we refer to the original data, the first three data points of the &lt;code&gt;Feb&lt;/code&gt; factor are all 65, and &lt;code&gt;WmAv4&lt;/code&gt; is the corresponding requirement for the &lt;code&gt;Feb&lt;/code&gt; factor, with the following scores:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Factor&lt;/th&gt;&lt;th&gt;S3&lt;/th&gt;&lt;th&gt;S2&lt;/th&gt;&lt;th&gt;S1&lt;/th&gt;&lt;th&gt;S1&lt;/th&gt;&lt;th&gt;S2&lt;/th&gt;&lt;th&gt;S3&lt;/th&gt;&lt;th&gt;Weight&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;8&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 5: &lt;code&gt;WmAv4&lt;/code&gt;’s Suitability Requirements.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;WmAv4&lt;/td&gt;&lt;td&gt;29&lt;/td&gt;&lt;td&gt;30&lt;/td&gt;&lt;td&gt;50&lt;/td&gt;&lt;td&gt;300&lt;/td&gt;&lt;td&gt;500&lt;/td&gt;&lt;td&gt;600&lt;/td&gt;&lt;td&gt;NA&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
Then it is easy to pinpoint which suitability class the scores of the land units fall into: all first three land units fall within class S1. See the problem with the &lt;code&gt;&#39;fixed&#39;&lt;/code&gt; interval? The same problem occurs for other factors such as &lt;code&gt;Dec&lt;/code&gt; (December), where instead of S1 we got S2. Users can change the domains, though: instead of using the &lt;code&gt;&#39;fixed&#39;&lt;/code&gt; option, users can assign, for example, &lt;code&gt;interval = c(0, 0.33, 0.56, 0.89, 1)&lt;/code&gt;, which is equivalent to:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Class&lt;/th&gt;&lt;th&gt;N&lt;/th&gt;&lt;th&gt;S3&lt;/th&gt;&lt;th&gt;S2&lt;/th&gt;&lt;th&gt;S1&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;5&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 6: Custom Domains.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Domain&lt;/td&gt;&lt;td&gt;[0, 0.33)&lt;/td&gt;&lt;td&gt;[0.33, 0.56)&lt;/td&gt;&lt;td&gt;[0.56, 0.89)&lt;/td&gt;&lt;td&gt;[0.89, 1]&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
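To make the score-to-class mapping concrete, here is a small Python sketch of interval-based classification (my own illustration, not ALUES code; the package itself is in R). A score falls in the class whose half-open domain contains it, using either the default breakpoints from the &#39;fixed&#39; option or custom ones like Table 6.

```python
# Hypothetical sketch (not part of ALUES): classify a suitability score
# into N/S3/S2/S1 given interval breakpoints, as described above.
def classify(score, breaks=(0.0, 0.25, 0.50, 0.75, 1.0),
             labels=("N", "S3", "S2", "S1")):
    """Return the class whose domain contains the score."""
    # scan class lower bounds from highest to lowest
    for b, label in reversed(list(zip(breaks[:-1], labels))):
        if score >= b:
            return label
    return labels[0]
```

With the default (fixed) breaks, a score of 0.3714 lands in S3; with the custom breaks c(0, 0.33, 0.56, 0.89, 1) it also lands in S3, which is why changing the breakpoints alone does not fix the issue discussed next.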
Assigning new values to the parameters of &lt;code&gt;interval&lt;/code&gt; won&#39;t solve the problem by itself, but this argument has one more option that does solve it: changing &lt;code&gt;interval = &#39;fixed&#39;&lt;/code&gt; to &lt;code&gt;interval = &#39;unbias&#39;&lt;/code&gt;. Let&#39;s try it,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/e9cd269c31466e6aa57c.js&quot;&gt;&lt;/script&gt;
And that supports our argument above.&lt;br /&gt;
&lt;br /&gt;
&lt;b&gt;Weighting&lt;/b&gt;&lt;br /&gt;
The function &lt;code&gt;suitability&lt;/code&gt; also considers the weights of the factors. An example of a crop with no weights is the soil requirements for coconut,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/da8acd4b0d0cb8667662.js&quot;&gt;&lt;/script&gt;
The weights are assigned in the last column, &lt;code&gt;Weight.class&lt;/code&gt;. And here are the soil requirements for cassava, with a weight on each factor:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/21f45eb8035e8436b7a7.js&quot;&gt;&lt;/script&gt;
If a given factor has a weight, then the function will compute the corresponding suitability and then use the weighting score to obtain the appropriate suitability score. The weights of the factors for the default interval (&lt;code&gt;interval = &#39;fixed&#39;&lt;/code&gt;) are in Table 7:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Suitability&lt;/th&gt;&lt;th colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;Factor Weights&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;4&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 7: Weights of the Factors for &lt;code&gt;&#39;fixed&#39;&lt;/code&gt; Interval.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;b&gt;Class&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;1&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;2&lt;/b&gt;&lt;/td&gt;&lt;td&gt;&lt;b&gt;3&lt;/b&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;S1&lt;/td&gt;&lt;td&gt;0.833&lt;/td&gt;&lt;td&gt;0.916&lt;/td&gt;&lt;td&gt;1.000&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;S2&lt;/td&gt;&lt;td&gt;0.583&lt;/td&gt;&lt;td&gt;0.667&lt;/td&gt;&lt;td&gt;0.750&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;S3&lt;/td&gt;&lt;td&gt;0.333&lt;/td&gt;&lt;td&gt;0.416&lt;/td&gt;&lt;td&gt;0.500&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;N&lt;/td&gt;&lt;td&gt;0.083&lt;/td&gt;&lt;td&gt;0.167&lt;/td&gt;&lt;td&gt;0.250&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
Thus the function simply divides the interval of each suitability class into three equal parts, one for each of the three weights.&lt;br /&gt;
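That rule can be sketched in Python (my own illustration, not ALUES code): each class interval has width 0.25 and starts at a lower bound, and the score under weight k is the lower bound plus k times a third of the width. The entries of Table 7 appear to be these values rounded or truncated to three decimals.

```python
# Hypothetical sketch (not ALUES code): reproduce Table 7 by dividing each
# suitability class interval of width 0.25 into three equal parts, one per weight.
def weight_score(class_lower, weight):
    """Score for a class starting at class_lower, under weight k in {1, 2, 3}."""
    return round(class_lower + weight * 0.25 / 3, 3)

# assumed class lower bounds under the 'fixed' interval
lowers = {"S1": 0.75, "S2": 0.50, "S3": 0.25, "N": 0.00}
table7 = {cls: [weight_score(lo, k) for k in (1, 2, 3)] for cls, lo in lowers.items()}
```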
&lt;br /&gt;
&lt;h3&gt;
Overall Suitability&lt;/h3&gt;
The overall suitability combines the suitability scores of a land unit across characteristics into a single score and class. Its arguments are:&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/fe9ad310f98b203d4361.js&quot;&gt;&lt;/script&gt;
&lt;center&gt;
&lt;table style=&quot;width: 550px;&quot;&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;x&lt;/code&gt;&lt;/td&gt;&lt;td&gt;a data frame consisting the suitability scores of a given characteristics (terrain, soil, water and temperature) for a given crop (e.g. coconut, cassava, etc.);&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;method&lt;/code&gt;&lt;/td&gt;&lt;td&gt;the method for computing the overall suitability, which includes the minimum, maximum, sum, product, average, exponential and gamma. If &lt;code&gt;NULL&lt;/code&gt;, minimum is used.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;interval&lt;/code&gt;&lt;/td&gt;&lt;td&gt;if &lt;code&gt;NULL&lt;/code&gt;, the intervals used are the following: 0-0.25 (Not suitable, N), 0.25-0.50 (Marginally Suitable, S3), 0.50-0.75 (Moderately Suitable, S2), and 0.75-1 (Highly Suitable, S1).&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: right;&quot; valign=&quot;top&quot;&gt;&lt;code&gt;output&lt;/code&gt;&lt;/td&gt;&lt;td&gt;the output to be returned, either the scores or class. If &lt;code&gt;NULL&lt;/code&gt;, both are returned.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/center&gt;
&lt;br /&gt;
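Some of the aggregation methods listed above (minimum, maximum, average, product) can be sketched in a few lines of Python; this is my own illustration under assumed names, not the package's R implementation, and it omits the exponential and gamma variants.

```python
# Hypothetical sketch (not ALUES code): combine per-characteristic
# suitability scores into one overall score with a chosen method.
def overall(scores, method="minimum"):
    if method == "minimum":          # most limiting factor governs
        return min(scores)
    if method == "maximum":
        return max(scores)
    if method == "average":
        return sum(scores) / len(scores)
    if method == "product":
        result = 1.0
        for s in scores:
            result *= s
        return result
    raise ValueError("unknown method: " + method)
```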
&lt;h3&gt;
DEMONSTRATION&lt;/h3&gt;
Let&#39;s assume we are interested in the land units of Lao Cai, Vietnam, for cultivating irrigated rice. Here are the first 6 land units in the said location,
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/b020a5802d9a8fb33d24.js&quot;&gt;&lt;/script&gt;
And here are the required values for factors of soil, terrain, temperature and water characteristics for irrigated rice,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9ae335b43d60d894e683.js&quot;&gt;&lt;/script&gt;
Now, we are going to take the suitability scores for every characteristic,&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/952fca3606fd0e04c2ae.js&quot;&gt;&lt;/script&gt;
Next, we will take the overall suitability on all factors in each land unit using the &lt;code&gt;&quot;average&quot;&lt;/code&gt; method (default is &lt;code&gt;&quot;minimum&quot;&lt;/code&gt;).&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/877596db520ba2688bba.js&quot;&gt;&lt;/script&gt;
Finally, take the overall suitability from these characteristics using the &lt;code&gt;&quot;maximum&quot;&lt;/code&gt; method.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5ecf671028fde46b86e9.js&quot;&gt;&lt;/script&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/2623482324233122897/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/10/alues-agricultural-land-use-evaluation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2623482324233122897'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2623482324233122897'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/10/alues-agricultural-land-use-evaluation.html' title='ALUES: Agricultural Land Use Evaluation System, R package'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-8924714867856909647</id><published>2014-09-21T11:56:00.000+08:00</published><updated>2014-09-21T18:55:33.434+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="LaTeX"/><category scheme="http://www.blogger.com/atom/ns#" term="Probability Theory"/><category scheme="http://www.blogger.com/atom/ns#" term="Python"/><title type='text'>Probability Theory Problems</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Let&#39;s have some fun with probability theory; here is my first problem set in the said subject.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Problems&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
It was noted that statisticians who follow the deFinetti school do not accept the Axiom of Countable Additivity, instead adhering to the Axiom of Finite Additivity.
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;
Show that the Axiom of Countable Additivity implies Finite Additivity.&lt;/li&gt;
&lt;li&gt;Although, by itself, the Axiom of Finite Additivity does not imply Countable Additivity, suppose we supplement it with the following. Let $A_1\supset A_2\supset\cdots\supset A_n\supset \cdots$ be an infinite sequence of nested sets whose limit is the empty set, which we denote by $A_n\downarrow\emptyset$. Consider the following:&lt;br /&gt;&lt;br /&gt;
&lt;div style=&quot;text-align: center;&quot;&gt;
&lt;b&gt;Axiom of Continuity:&lt;/b&gt; If $A_n\downarrow\emptyset$, then $P(A_n)\rightarrow 0$
&lt;/div&gt;
&lt;br /&gt;
Prove that the Axiom of Continuity and the Axiom of Finite Additivity imply Countable Additivity.
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Prove each of the following statements. (Assume that any conditioning event has positive probability.)
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt; If $P(B)=1$, then $P(A|B)=P(A)$ for any $A$.&lt;/li&gt;
&lt;li&gt; If $A\subset B$, then $P(B|A)=1$ and $P(A|B)=P(A)/P(B)$.&lt;/li&gt;
&lt;li&gt; If $A$ and $B$ are mutually exclusive, then
\begin{equation}\nonumber
P(A|A\cup B) = \displaystyle\frac{P(A)}{P(A)+P(B)}.
\end{equation}&lt;/li&gt;
&lt;li&gt; $P(A\cap B\cap C)=P(A|B\cap C)P(B|C)P(C)$.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;li&gt;Prove that the following functions are cdfs.
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt; $\frac{1}{2}+\frac{1}{\pi}\arctan(x), x\in (-\infty, \infty)$&lt;/li&gt;
&lt;li&gt; $(1+e^{-x})^{-1},x\in (-\infty,\infty)$&lt;/li&gt;
&lt;li&gt; $e^{-e^{-x}}, x\in (-\infty, \infty)$&lt;/li&gt;
&lt;li&gt; $1-e^{-x}, x\in (0,\infty)$&lt;/li&gt;
&lt;li&gt; the function defined in (1.5.6), (Check in the reference below.)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;A cdf $F_X$ is &lt;i&gt;stochastically&lt;/i&gt; greater than a cdf $F_{Y}$ if $F_{X}(t)\leq F_{Y}(t)$ for all $t$ and $F_{X}(t) &amp;lt; F_{Y}(t)$ for some $t$. Prove that if $X\sim F_X$ and $Y\sim F_Y$, then
\begin{equation}\nonumber
P(X&amp;gt;t) \geq P(Y&amp;gt;t)\;\text{for every}\;t
\end{equation}
and 
\begin{equation}\nonumber
P(X&amp;gt;t)&amp;gt;P(Y&amp;gt;t),\;\text{for some}\; t
\end{equation}
that is, $X$ tends to be bigger than $Y$.&lt;/li&gt;
&lt;li&gt; Let $X$ be a continuous random variable with pdf $f(x)$ and cdf $F(x)$. For a fixed number $x_0$, define the function
\begin{equation}\nonumber
g(x) = \begin{cases}
f(x) / [1-F(x_0)]&amp;amp; x \geq x_0\\
0 &amp;amp; x &amp;lt; x_0.
\end{cases}
\end{equation}
Prove that $g(x)$ is a pdf. (Assume that $F(x_0)&lt;1$.)&lt;/li&gt;
&lt;li&gt;For each of the following, determine the value of $c$ that makes $f(x)$ a pdf.
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt; $f(x)=\mathrm{c}\sin x, 0 &amp;lt; x &amp;lt; \pi/2$&lt;/li&gt;
&lt;li&gt; $f(x)=\mathrm{c}e^{-|x|},-\infty &amp;lt; x &amp;lt; \infty$&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;h3&gt;
Solutions&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. Let $\mathscr{B}$ be a $\sigma$-algebra and consider $A_1,A_2,\cdots\in \mathscr{B}$ are pairwise disjoint, then by countable additivity
\begin{equation}\nonumber
P\left(\displaystyle\bigcup_{i=1}^{\infty}A_i\right)=\displaystyle\sum_{i=1}^{\infty}P(A_i).
\end{equation}
Now, 
\begin{equation}
\begin{aligned}
P\left(\displaystyle\bigcup_{i=1}^{\infty}A_i\right)&amp;amp;=
P\left(\displaystyle\bigcup_{i=1}^{n}A_i\cup\displaystyle
\bigcup_{i=n+1}^{\infty}A_i\right)\\
&amp;amp;=
P\left(\displaystyle\bigcup_{i=1}^{n}A_i\right)+P\left(\displaystyle
\bigcup_{i=n+1}^{\infty}A_i\right),\;(\text{since}\;A_i&#39;s\;\text{are disjoints})\\
&amp;amp;=P(A_1)+\cdots+P(A_n)+P\left(\displaystyle
\bigcup_{i=n+1}^{\infty}A_i\right),\\
&amp;amp;\quad(\text{by finite additivity})\\
&amp;amp;=\displaystyle\sum_{i=1}^{n}P(A_i)+P\left(\displaystyle
\bigcup_{i=n+1}^{\infty}A_i\right)
\end{aligned}\nonumber
\end{equation}
Notice that for any $n$, we can take $A_i=\emptyset$ for $i&amp;gt;n$, implying
\begin{equation}\nonumber
P\left(\displaystyle\bigcup_{i=n+1}^{\infty}A_i\right)=\displaystyle
\sum_{i=n+1}^{\infty}P(A_i)=P(\emptyset)+P(\emptyset)+\cdots,
\end{equation}
that is,
\begin{equation}\nonumber
\begin{aligned}
P\left(\displaystyle\bigcup_{i=1}^{\infty}A_i\right)&amp;amp;=
\displaystyle\sum_{i=1}^{n}P(A_i)+\sum_{i=n+1}^{\infty}P(A_i)\\
&amp;amp;=\displaystyle\sum_{i=1}^{n}P(A_i)+P(\emptyset)+P(\emptyset)+\cdots
\end{aligned}
\end{equation}
$\therefore$ countable additivity implies finite additivity. &lt;br /&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;From (a), we have shown that countable additivity implies finite additivity, i.e.,
\begin{equation}
P\left(\displaystyle\bigcup_{i=1}^{\infty}A_i\right)=\displaystyle\sum_{i=1}^{n}P(A_i)+P\left(\displaystyle
\bigcup_{i=n+1}^{\infty}A_i\right)
\nonumber
\end{equation}
Now define the tail unions of the pairwise disjoint sets $A_i$,
\begin{equation}\nonumber
B_k=\bigcup_{i=k}^{\infty}A_i,\;\text{so that}\;B_{k+1}\subset B_k\;\text{and}\;B_k\downarrow\emptyset.
\end{equation}
By the Axiom of Continuity (see also &lt;a href=&quot;http://alstatr.blogspot.com/2014/09/monotonic-sequential-continuity.html&quot; target=&quot;_blank&quot;&gt;Monotone Sequential Continuity&lt;/a&gt;), $\displaystyle\lim_{k\to\infty}P(B_k)=0$.
Thus, by finite additivity plus the Axiom of Continuity, we have
\begin{equation}\nonumber
\begin{aligned}
P\left(\bigcup_{i=1}^{\infty}A_i\right)&amp;amp;=\lim_{n\to\infty}\left(
\sum_{i=1}^{n}P(A_i)+P(B_{n+1})\right)\\
&amp;amp;=\lim_{n\to\infty}\left(\sum_{i=1}^{n}P(A_i)\right)+\lim_{n\to\infty}
P(B_{n+1})\\
&amp;amp;=\sum_{i=1}^{\infty}P(A_i)+0,\;(\text{by axiom of continuity}).
\end{aligned}
\end{equation}
Implying countable additivity.&lt;br /&gt;
$\hspace{12.5cm}\blacksquare$&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;
&lt;i&gt;Proof&lt;/i&gt;. Suppose $P(B)=1$. Then $P(B^c)=0$, and since $A\cap B^c\subseteq B^c$, we have $P(A\cap B^c)=0$, so that $P(A\cap B)=P(A)$. Therefore
\begin{equation}\nonumber
P(A|B)=\displaystyle\frac{P(A\cap B)}{P(B)}=\displaystyle\frac{P(A)}{P(B)}=P(A)
\end{equation}
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. If $A\subseteq B$ then
\begin{equation}\nonumber
P(B|A)=\displaystyle\frac{P(A\cap B)}{P(A)}=\displaystyle\frac{P(A)}{P(A)}=1
\end{equation}
and,
\begin{equation}\nonumber
P(A|B)=\displaystyle\frac{P(A\cap B)}{P(B)}=\displaystyle\frac{P(A)}{P(B)}
\end{equation}
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. If $A$ and $B$ are mutually exclusive, then
\begin{equation}
\nonumber
\begin{aligned}
P(A|A\cup B)&amp;amp;=\displaystyle\frac{P(A\cap (A\cup B))}{P(A\cup B)}\\
&amp;amp;=\displaystyle\frac{P(A)}{P(A)+ P(B)},\;\text{since}\;A\cap(A\cup B)=A\;\text{and}\;P(A\cup B)=P(A)+P(B)
\end{aligned}
\end{equation}$\hspace{12.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. Consider,
\begin{equation}\nonumber
P(A|B\cap C)=\displaystyle\frac{P(A\cap B\cap C)}{P(B\cap C)}
\end{equation}
Hence,
\begin{equation}\nonumber
P(A\cap B\cap C) = P(A|B\cap C)P(B\cap C)
\end{equation}
Now $P(B\cap C)=P(B|C)P(C)$, therefore
\begin{equation}\nonumber
P(A\cap B\cap C) = P(A|B\cap C)P(B|C)P(C)
\end{equation}$\hspace{12.5cm}\blacksquare$&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;$F(x)$ is a cdf if it satisfies the following conditions:
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt;$\displaystyle\lim_{x\to-\infty}F(x)=0$ and $\displaystyle\lim_{x\to\infty}F(x)=1$&lt;/li&gt;
&lt;li&gt;$F(x)$ is nondecreasing.&lt;/li&gt;
&lt;li&gt;$F(x)$ is right-continuous.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;.
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt; $F(x)=\frac{1}{2}+\frac{1}{\pi}\arctan(x), x\in (-\infty, \infty)$
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLlRsznedngHHOItJ4pr1dG54MwxmYvM7ugD47BVjBTUACBv1E8gHkcMeVLy5AfVrWoBiResGloZvemrCVNTteWoiRxrl9M9X2TAIWiOLmbEKybXnkqLpIderZ0BMvaAGZ6KqFU75WMe7A/s1600/Screenshot+from+2014-09-20+21:11:29.png&quot; /&gt;&lt;/div&gt;
Above figure was generated by the following $\mathrm{\LaTeX}$ codes:&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/79dee5ee372d810e0e1b.js&quot;&gt;&lt;/script&gt;
\begin{equation}\nonumber
\begin{aligned}
\displaystyle\lim_{x\to-\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to-\infty}
\left(\frac{1}{2}+\frac{1}{\pi}\arctan(x)\right)\\
&amp;amp;=\frac{1}{2}+\frac{1}{\pi}\displaystyle\lim_{x\to-\infty}\left(\arctan(x)\right)\\
&amp;amp;=\frac{1}{2}+\frac{1}{\pi}
\left(\frac{-\pi}{2}\right),\;\text{since}\;\displaystyle\lim_{x\to-\frac{\pi}{2}}\frac{\sin(x)}{\cos(x)}=-\infty\\
&amp;amp;=0\\[0.5cm]
\displaystyle\lim_{x\to\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to\infty}
\left(\frac{1}{2}+\frac{1}{\pi}\arctan(x)\right)\\
&amp;amp;=\frac{1}{2}+\frac{1}{\pi}\displaystyle\lim_{x\to\infty}\left(\arctan(x)\right)\\
&amp;amp;=\frac{1}{2}+\frac{1}{\pi}
\left(\frac{\pi}{2}\right),\;\text{since}\;\displaystyle\lim_{x\to\frac{\pi}{2}}\frac{\sin(x)}{\cos(x)}=\infty\\
&amp;amp;=1
\end{aligned}
\end{equation}&lt;/li&gt;
&lt;li&gt; To test if $F(x)$ is nondecreasing, recall from Calculus that the first derivative tells us whether a function is increasing or decreasing. In particular, $\frac{dF(x)}{dx}&amp;gt;0$ tells us that the function is increasing on a given interval of $x$. Thus,
\begin{equation}
\nonumber
\frac{dF(x)}{dx}=\frac{d}{dx}\left(\frac{1}{2}+\frac{1}{\pi}\arctan(x)\right)=\frac{1}{\pi(1+x^2)}
\end{equation}
Confirm the above differentiation with Python using &lt;a href=&quot;http://sympy.org/&quot; target = &quot;_blank&quot;&gt;sympy&lt;/a&gt; module. &lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/f58b1a6f567cd7b1f302.js&quot;&gt;&lt;/script&gt;
Since $1+x^2&amp;gt;0$ for all $x$, we have $\frac{dF(x)}{dx}&amp;gt;0$, implying $F(x)$ is increasing.&lt;/li&gt;
&lt;li&gt; $F(x)$ is continuous, which implies that $F(x)$ is right-continuous.&lt;/li&gt;
&lt;/ol&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;.
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt;$
F(x)=\displaystyle\frac{1}{1+e^{-x}}, x\in(-\infty,\infty)
$
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcKjJQoxB0sHDAxMxz-D9iBn0cwV6WAf-9FohWCx2gdvfCJCn0_L9u5PSdZ8YKMTMfJWj2cIIiNNNx7ZX_p5Zbwj6L8LVp6RHAO6U49mZYAv59tHPkyyvE47nCjN7ZkGv6rwrPtQ9RUr0Q/s1600/Screenshot+from+2014-09-20+21:13:04.png&quot; /&gt;&lt;/div&gt;
\begin{equation}\nonumber
\begin{aligned}
\displaystyle\lim_{x\to-\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to-\infty}
\left(\frac{1}{1+e^{-x}}\right)\\
&amp;amp;=0\\[0.5cm]
\displaystyle\lim_{x\to\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to\infty}
\left(\frac{1}{1+e^{-x}}\right)\\
&amp;amp;=\displaystyle\lim_{x\to\infty}
\left(\frac{1}{1+\frac{1}{e^{x}}}\right)\\
&amp;amp;=1
\end{aligned}
\end{equation}
Confirm these in Python,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/cc73f0976d9827f4ff0c.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt; Using the same method we did in (a), we have
\begin{equation}
\nonumber
\begin{aligned}
\frac{dF(x)}{dx}&amp;amp;=\frac{d}{dx}\left(\displaystyle\frac{1}{1+e^{-x}}\right)\\
&amp;amp;=\frac{e^{-x}}{(1+e^{-x})^2}
\end{aligned}
\end{equation}
$\frac{dF(x)}{dx}=\frac{e^{-x}}{(1+e^{-x})^2}&amp;gt;0,\;\forall\;x\in(-\infty,\infty)$. Thus the function is increasing in the interval of $x$.&lt;/li&gt;
&lt;li&gt; $F(x)$ is continuous, which implies the function is right-continuous.
&lt;/li&gt;
&lt;/ol&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. 
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt;$F(x)=e^{-e^{-x}}, x\in (-\infty, \infty)$
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdsi4F01eRC2BXzbvgq4m96rFKRykg-npSjmjS-pF3Md3EGtxhiDUBnaxsXDtm-h4oqoPLm0VVKt-PGn4rTE-m-sJAO-uBoBtMExtFhIelpDKuIFDb4XVnCmeSCumLVFzm_3vExtRKt1Dd/s1600/Screenshot+from+2014-09-20+21:16:51.png&quot; /&gt;&lt;/div&gt;
\begin{equation}\nonumber
\begin{aligned}
\displaystyle\lim_{x\to-\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to-\infty}
\left(e^{-e^{-x}}\right)\\
&amp;amp;=\displaystyle\lim_{x\to-\infty}
\left(\frac{1}{e^{\frac{1}{e^{x}}}}\right)\\
&amp;amp;=0\\[0.5cm]
\displaystyle\lim_{x\to\infty}F(x)&amp;amp;=\displaystyle\lim_{x\to\infty}
\left(e^{-e^{-x}}\right)\\
&amp;amp;=\displaystyle\lim_{x\to\infty}
\left(\frac{1}{e^{\frac{1}{e^{x}}}}\right)\\
&amp;amp;=1
\end{aligned}
\end{equation}&lt;/li&gt;
&lt;li&gt;Like what we did above, $\frac{dF(x)}{dx}$ is,
\begin{equation}
\nonumber
\frac{dF(x)}{dx}=\frac{d}{dx}\left(e^{-e^{-x}}\right)=e^{-x}e^{-e^{-x}}&amp;gt;0
\end{equation}
Because $e^{-x}e^{-e^{-x}}&amp;gt;0,\;\forall\; x\in(-\infty,\infty)$, $F(x)$ is an increasing function on this interval.&lt;/li&gt;
&lt;li&gt;$F(x)$ is continuous, which implies that $F(x)$ is right-continuous.&lt;/li&gt;
&lt;/ol&gt;
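As with parts (a) and (b), the limits and the derivative here can be confirmed with Python's sympy module. The following check is my own addition, in the spirit of the gist confirmations above:

```python
# Sympy check of F(x) = e^(-e^(-x)): limits at the endpoints and its derivative.
import sympy as sp

x = sp.symbols("x")
F = sp.exp(-sp.exp(-x))
lower = sp.limit(F, x, -sp.oo)   # expected 0
upper = sp.limit(F, x, sp.oo)    # expected 1
dF = sp.diff(F, x)               # expected e^(-x) * e^(-e^(-x))
```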
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;.
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt;$F(x)=1-\displaystyle\frac{1}{e^{x}}, x\in(0,\infty)$
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiVTU9Azh43Ni6CNnjphUEyQPN1shEVAKA0epGLkIFaaE-X1wElbrA8Yt-kUQzDgO7GnPCkg49a5vBR3igLPcX_AX5WQNrAOUuf0fwBYAcXAjpWwdpXpSxtUDGd1o4RxfaBLu2jTaSI6zPV/s1600/Screenshot+from+2014-09-20+21:31:10.png&quot; /&gt;&lt;/div&gt;
\begin{equation}\nonumber
\begin{aligned}
\displaystyle\lim_{x\to 0^{+}}F(x)&amp;amp;=1-\displaystyle\lim_{x\to 0^{+}}
\left(\frac{1}{e^{x}}\right)
=0\\[0.5cm]
\displaystyle\lim_{x\to\infty}F(x)&amp;amp;=1-
\displaystyle\lim_{x\to\infty}
\left(\frac{1}{e^{x}}\right)=1
\end{aligned}
\end{equation}
&lt;/li&gt;
&lt;li&gt;
\begin{equation}\nonumber
\frac{dF(x)}{dx}=\frac{d}{dx}\left(1-\frac{1}{e^{x}}\right)=0-(-e^{-x})=\frac{1}{e^{x}}
\end{equation}
$F(x)$ is an increasing function since $\frac{1}{e^{x}}&amp;gt;0,\;\forall\;x\in(0,\infty)$.
&lt;/li&gt;
&lt;li&gt;$F(x)$ is right-continuous, since it is continuous.&lt;/li&gt;
&lt;/ol&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;.
The function in Equation (1.5.6) is given by,
\begin{equation}
F_Y(y)=\begin{cases}
\displaystyle\frac{1-\varepsilon}{1+e^{-y}}&amp;\text{if}\;y&lt;0,\; \text{for some}\;\varepsilon, 1&gt;\varepsilon&gt;0\\
\varepsilon+\displaystyle\frac{1-\varepsilon}{1+e^{-y}}&amp;\text{if}\;y\geq 0,\;\text{for some}\;\varepsilon, 1&gt;\varepsilon&gt;0
\end{cases}\nonumber
\end{equation}
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvjoqtZNkSivQY-obbSq9fi8qWsqgqRcIKEI8OZt6us2NR13TthdlZ-4DuRPE-v_4BHpWhMRQsqNYMI9e9ipiQBsiqMjxT3uGmUFCSMtZWW0x6-qqf-24uX0xTZ1HTNcsliD0vXeUGwmHQ/s1600/Screenshot+from+2014-09-20+21:51:06.png&quot; /&gt;
&lt;ol type=&quot;i&quot;&gt;
&lt;li&gt;
\begin{equation}\nonumber
\begin{aligned}
\displaystyle\lim_{y\to-\infty}F_Y(y)&amp;amp;=\displaystyle\lim_{y\to-\infty}
\left(\displaystyle\frac{1-\varepsilon}{1+e^{-y}}\right)=\displaystyle\lim_{y\to-\infty}
\left(\displaystyle\frac{1-\varepsilon}{1+\frac{1}{e^{y}}}\right)=0\\[0.5cm]
\displaystyle\lim_{y\to\infty}F(y)&amp;amp;=\displaystyle\lim_{y\to\infty}
\left(\varepsilon+\displaystyle\frac{1-\varepsilon}{1+e^{-y}}\right)=\varepsilon + \displaystyle\lim_{y\to\infty}
\left(\displaystyle\frac{1-\varepsilon}{1+\frac{1}{e^{y}}}\right)=1
\end{aligned}
\end{equation}
&lt;/li&gt;
&lt;li&gt;For $y&lt;0$, we have
\begin{equation}
\begin{aligned}
\frac{d}{dy}\left(\frac{1-\varepsilon}{1+e^{-y}}\right)&amp;=(1-\varepsilon)\frac{d}{dy}\left(\frac{1}{1+e^{-y}}\right)\\
&amp;=(1-\varepsilon)\frac{(1+e^{-y})\cdot 0 - 1\cdot e^{-y}\cdot(-1)}{(1+e^{-y})^2}\\
&amp;=\frac{(1-\varepsilon)e^{-y}}{(1+e^{-y})^2}
\end{aligned}\nonumber
\end{equation}
$(1-\varepsilon)&gt;0$ since $0&lt;\varepsilon&lt;1$. Thus for all $y &lt; 0$, $\frac{(1-\varepsilon)e^{-y}}{(1+e^{-y})^2}&gt;0$, implying that the function is increasing. &lt;br /&gt;&lt;br /&gt;
For $y\geq 0$, 
\begin{equation}
\begin{aligned}
\frac{d}{dy}\left(\varepsilon+\frac{1-\varepsilon}{1+e^{-y}}\right)&amp;=\frac{(1-\varepsilon)e^{-y}}{(1+e^{-y})^2}
\end{aligned}\nonumber
\end{equation}
The function is increasing since $\frac{(1-\varepsilon)e^{-y}}{(1+e^{-y})^2}&gt;0$ for all $y\geq 0$.&lt;/li&gt;
&lt;li&gt;Since the function is continuous, it is right-continuous.&lt;/li&gt;
&lt;/ol&gt;
$\hspace{12.5cm}\blacksquare$
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. We know that,
\begin{equation}\nonumber
P(X&amp;gt;t)=1-P(X\leq t)=1-F_X(t)
\end{equation}
and
\begin{equation}\nonumber
P(Y&amp;gt;t)=1-P(Y\leq t)=1-F_Y(t)
\end{equation}
Hence we have,
\begin{equation}\nonumber
\begin{aligned}
P(X&amp;gt;t)=1-F_X(t)\;&amp;amp;\overset{?}{\geq}\;1-F_Y(t)=P(Y&amp;gt;t)\\
\end{aligned}
\end{equation}
Since $F_X(t)\leq F_Y(t)$ for all $t$, we have $1-F_X(t)\geq 1-F_Y(t)$. Thus for all $t$, $P(X&amp;gt;t)\geq P(Y&amp;gt;t)$.&lt;br /&gt;&lt;br /&gt;
Now if $F_X(t) &amp;lt; F_Y(t)$ for some $t$, then by the same argument, $P(X&amp;gt;t) &gt; P(Y&amp;gt;t)$ for some $t$. &lt;br /&gt;
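A small numeric illustration of this ordering (my own example, using exponential cdfs not mentioned in the problem): if X is exponential with mean 2 and Y is exponential with mean 1, then F_X lies below F_Y, so X is stochastically greater than Y.

```python
# Illustration (my own example): X ~ Exp(rate 1/2) is stochastically greater
# than Y ~ Exp(rate 1), since F_X(t) = 1 - e^(-t/2) lies below F_Y(t) = 1 - e^(-t).
import math

def surv_X(t):
    return math.exp(-t / 2)   # P(X > t) = 1 - F_X(t)

def surv_Y(t):
    return math.exp(-t)       # P(Y > t) = 1 - F_Y(t)

# survival of X dominates survival of Y at every checked t
checks = all(surv_X(t) >= surv_Y(t) for t in [0.0, 0.5, 1.0, 2.0, 5.0])
```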
$\hspace{13.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. For a function to be a pdf, it has to satisfy the following:
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;$g(x)\geq 0$ for all $x$; and,&lt;/li&gt;
&lt;li&gt; $\displaystyle\int_{-\infty}^{\infty}g(x)\,dx=1$.&lt;/li&gt;
&lt;/ol&gt;
By assumption, $F(x_0)&lt;1$, so $1-F(x_0)&gt;0$; and since $f(x)\geq 0$ for all $x$, it follows that $g(x)\geq 0$ for all $x$. Now,
\begin{equation}
\begin{aligned}
\int_{-\infty}^{\infty}g(x)\,dx&amp;=
\int_{-\infty}^{x_0}g(x)\,dx+
\int_{x_0}^{\infty}g(x)\,dx\\
&amp;=\int_{x_0}^{\infty}g(x)\,dx\\
&amp;=\int_{x_0}^{\infty}\frac{f(x)}{(1-F(x_0))}\,dx\\
&amp;=\frac{1}{1-F(x_0)}\int_{x_0}^{\infty}f(x)\,dx\\
&amp;=\frac{1}{1-F(x_0)}[F(\infty)-F(x_0)]\\
&amp;=\frac{1}{1-F(x_0)}[1-F(x_0)]=1,\;\text{since}\;\lim_{x\to \infty}F(x)=1\\
\end{aligned}\nonumber
\end{equation}
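This can also be sanity-checked with Python's sympy module; the following is my own illustration, using an assumed standard exponential f with F(x) = 1 - e^(-x) and an assumed cutoff x0 = 1:

```python
# Sympy check (an illustration with an assumed f, not part of the problem):
# take f(x) = e^(-x) on (0, oo), truncate at x0 = 1, and verify that
# g(x) = f(x)/(1 - F(x0)) integrates to 1 over [x0, oo).
import sympy as sp

x = sp.symbols("x", positive=True)
x0 = 1
f = sp.exp(-x)
F_x0 = 1 - sp.exp(-x0)          # F(x0) for the assumed exponential cdf
g = f / (1 - F_x0)              # the truncated pdf
total = sp.integrate(g, (x, x0, sp.oo))
```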
$\hspace{13.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;In order for $f(x)$ to be a pdf, it must be nonnegative and integrate to 1; nonnegativity holds in both cases for $\mathrm{c}&amp;gt;0$, so it remains to find $\mathrm{c}$ such that the integral equals 1.
&lt;ol type=&quot;a&quot;&gt;
&lt;li&gt;\begin{equation}
\begin{aligned}
\int_{-\infty}^{\infty}f(x)&amp;amp;=\int_{0}^{\frac{\pi}{2}}\mathrm{c}\sin x=\displaystyle\left.-(\mathrm{c})\cos x\displaystyle\right\rvert_{0}^{\frac{\pi}{2}}\\
&amp;amp;=-\mathrm{c}\left(\cos\left(\frac{\pi}{2}\right)-\cos(0)\right)\\
&amp;amp;=-\mathrm{c}(0-1)=\mathrm{c}
\end{aligned}\nonumber
\end{equation}
Hence, $\mathrm{c}=1$. Confirm this with Python,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5c547c22bb8584743aed.js&quot;&gt;&lt;/script&gt;
&lt;/li&gt;
&lt;li&gt;
\begin{equation}
\begin{aligned}
\int_{-\infty}^{\infty}f(x)&amp;amp;=\int_{-\infty}^{\infty}
\mathrm{c}\,e^{-|x|}\\
&amp;amp;=\mathrm{c}\left(\int_{-\infty}^{0}
\,e^{x}\,dx+\int_{0}^{\infty}
e^{-x}\,dx\right)\\
&amp;amp;=\mathrm{c}\left[(e^{0}-e^{-\infty})-(e^{-\infty}-e^{0})\right]\\
&amp;amp;=\mathrm{c}(1+1) = 2\mathrm{c}
\end{aligned}\nonumber
\end{equation}
Hence, $\mathrm{c}=\frac{1}{2}$. Confirm this with Python,&lt;br/&gt;&lt;br/&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8f69c125655ec96dedd5.js&quot;&gt;&lt;/script&gt;
&lt;br/&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9PmcmtVXW6eSPhVWRmQz_-ibAIS_U-VuyFFJe2UszyrsCyxqvAoGoj3Lmm_bpM6qUMB-xL62Tzap3TRKmW8sPj6Ab6NZKtARD-ELHR6mO7zd7BplVKHGon1sgSXijelEfHNTwOmHWS-DP/s1600/Screenshot+from+2014-09-20+22:48:04.png&quot; /&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126&quot; target=&quot;_blank&quot;&gt;Casella, G. and Berger, R.L. (2001). &lt;i&gt;Statistical Inference&lt;/i&gt;. Thomson Learning, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/8924714867856909647/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/09/probability-theory-problems.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8924714867856909647'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/8924714867856909647'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/09/probability-theory-problems.html' title='Probability Theory Problems'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLlRsznedngHHOItJ4pr1dG54MwxmYvM7ugD47BVjBTUACBv1E8gHkcMeVLy5AfVrWoBiResGloZvemrCVNTteWoiRxrl9M9X2TAIWiOLmbEKybXnkqLpIderZ0BMvaAGZ6KqFU75WMe7A/s72-c/Screenshot+from+2014-09-20+21:11:29.png" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-2606700552526644808</id><published>2014-09-12T11:08:00.000+08:00</published><updated>2015-12-27T09:52:43.872+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Data Mining"/><category scheme="http://www.blogger.com/atom/ns#" term="Image Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="Machine Learning"/><category scheme="http://www.blogger.com/atom/ns#" term="Multivariate Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Signal Processing"/><category scheme="http://www.blogger.com/atom/ns#" 
term="Statistical Learning"/><title type='text'>R: k-Means Clustering on Imaging</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Enough with the theory we recently published; let&#39;s take a break and have fun with an application of statistics used in data mining and machine learning: &lt;i&gt;k&lt;/i&gt;-means clustering.&lt;br /&gt;
&lt;blockquote class=&quot;tr_bq&quot;&gt;
&lt;i&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/K-means_clustering&quot; target=&quot;_blank&quot;&gt;k-means clustering&lt;/a&gt;&lt;/i&gt; is a method of &lt;a href=&quot;http://en.wikipedia.org/wiki/Vector_quantization&quot; target=&quot;_blank&quot;&gt;vector quantization&lt;/a&gt;, originally from signal processing, that is popular for cluster analysis in data mining. &lt;i&gt;k&lt;/i&gt;-means clustering aims to partition &lt;i&gt;n&lt;/i&gt; observations into &lt;i&gt;k&lt;/i&gt; clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. (Wikipedia, Ref 1.)
&lt;/blockquote&gt;
We will apply this method to an image, grouping its pixels into &lt;i&gt;k&lt;/i&gt; clusters. Below is the image we are going to use:
&lt;br /&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgubnFMq4Ex3R0Csr6YexaRJ5NYpgXp94xHo58rlluSPwCTaaBZ_t1J8XUXKjUcF1TAuQ7MxoU5iXRhojop0NPLTjm1_NsaqTWzxT4GawvVRbfvWakepNpqqZohCWrrzIzXMoimpsOMECZa/s1600/ColorfulBird.jpg&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgubnFMq4Ex3R0Csr6YexaRJ5NYpgXp94xHo58rlluSPwCTaaBZ_t1J8XUXKjUcF1TAuQ7MxoU5iXRhojop0NPLTjm1_NsaqTWzxT4GawvVRbfvWakepNpqqZohCWrrzIzXMoimpsOMECZa/s1600/ColorfulBird.jpg&quot; height=&quot;263&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Colorful Bird From &lt;a href=&quot;http://www.wall321.com/Animals/Birds/colorful_birds_tropical_head_3888x2558_wallpaper_6566&quot; target=&quot;_blank&quot;&gt;Wall321&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
We will utilize the following packages for input and output:
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;http://cran.r-project.org/package=jpeg&quot; target=&quot;_blank&quot;&gt;jpeg&lt;/a&gt;&lt;/b&gt; - Read and write JPEG images; and,&lt;/li&gt;
&lt;li&gt;&lt;b&gt;&lt;a href=&quot;http://cran.r-project.org/package=ggplot2&quot; target=&quot;_blank&quot;&gt;ggplot2&lt;/a&gt;&lt;/b&gt; - An implementation of the Grammar of Graphics.&lt;/li&gt;
&lt;/ol&gt;&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;h3&gt;
Download and Read the Image&lt;/h3&gt;
Let&#39;s get started by downloading the image to our workspace and telling R that the data is a JPEG file.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/1fe9287c70625f74b0ef.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
Cleaning the Data&lt;/h3&gt;
Extract the necessary information from the image and organize this for our computation:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/542baaa83523f949adb4.js&quot;&gt;&lt;/script&gt;
The image is represented by a large array of pixels with dimensions &lt;i&gt;rows&lt;/i&gt; by &lt;i&gt;columns&lt;/i&gt; by &lt;i&gt;channels&lt;/i&gt; -- red, green, and blue (RGB).
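Since the actual code lives in the gist above, here is a minimal base-R sketch of this reshaping step, using a small synthetic rows-by-columns-by-channels array in place of the real image; the variable names are illustrative, not necessarily those used in the gist:

```r
# Synthetic stand-in for the image: a 3 x 4 x 3 array
# (rows x columns x channels), with intensities in [0, 1].
set.seed(42)
img <- array(runif(3 * 4 * 3), dim = c(3, 4, 3))

# One row per pixel: its (x, y) position plus its R, G, B values.
# R stores arrays column-major, so as.vector() unrolls each channel
# column by column; x and y are built to match that order.
img_df <- data.frame(
  x = rep(seq_len(dim(img)[2]), each = dim(img)[1]),
  y = rep(rev(seq_len(dim(img)[1])), dim(img)[2]),
  R = as.vector(img[, , 1]),
  G = as.vector(img[, , 2]),
  B = as.vector(img[, , 3])
)
```

Each row of `img_df` now describes one pixel, which is the tidy form that `kmeans` and `ggplot2` expect.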
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Plotting&lt;/h3&gt;
Plot the original image using the following code:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9d4f7f19152dabe9e89d.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqoH2FoQjdEXyorJFolDesk_3VFGeOgN8J7wPZGboSgn7rLie7SsNneZd-3HJiHhV4BGdYkBkYgGXfB2M0O7mpcdsslCfjJvT9diYBiS51rBsQxikSQLmoNtqGyQCVojZ5QYjcYD-t4HMV/s1600/Bird.png&quot; /&gt;&lt;/div&gt;
&lt;h3&gt;
Clustering&lt;/h3&gt;
Apply &lt;i&gt;k&lt;/i&gt;-Means clustering on the image:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/39330b3992c51380143f.js&quot;&gt;&lt;/script&gt;
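As a sketch of what this step does (the exact call in the gist may differ), base R&#39;s `kmeans` from the `stats` package can cluster the pixel rows directly; here `k = 6` and the synthetic `pixels` data frame are assumptions for illustration:

```r
set.seed(1)
# Synthetic pixel data: 200 pixels with R, G, B intensities in [0, 1].
pixels <- data.frame(R = runif(200), G = runif(200), B = runif(200))

k   <- 6
fit <- kmeans(pixels, centers = k)

# Replace each pixel's colour by its cluster centre, then convert
# the centres to hex colours for plotting.
centres  <- fit$centers[fit$cluster, ]
hex_cols <- rgb(centres[, "R"], centres[, "G"], centres[, "B"])
```

After this step at most `k` distinct colours remain, which is why the clustered images below flatten into a few tones.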
Plot the clustered colours:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/57584dc6e72dc7a0875c.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVsMe1Wb-2zj2L8ZagOT8v-VU2j5mr_WiipAUc6GpmwL8c5No8akD4EPMmsdxFTkTwGhQQnJJ0hKx9iV6A1v-jqfN35KThkIhcY3wwCuhCQ55EFm-Idn5_hK1v67zNok9W3N3zIQfHFqq0/s1600/Bird2.png&quot; /&gt;&lt;/div&gt;
Results for different numbers of clusters &lt;i&gt;k&lt;/i&gt;:
&lt;br /&gt;&lt;br/&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original&lt;/th&gt;&lt;th&gt;&lt;i&gt;k&lt;/i&gt; = 6&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: &lt;i&gt;k&lt;/i&gt;-Means clustering for different values of &lt;i&gt;k&lt;/i&gt;.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgzesECAb6iy17WPSC2cWzVw0ZV-pnkiWNghXgl0u6QOCK5Nt_WSc-mlWghifrEkoFjIbDaL5jitg5CUQ97QQZcZrPrBXph4wJpa7hOAUOLirFGcvIpiDa6UO6kuYT3rXDfs9mmt4eZuuV/s1600/Bird.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgzesECAb6iy17WPSC2cWzVw0ZV-pnkiWNghXgl0u6QOCK5Nt_WSc-mlWghifrEkoFjIbDaL5jitg5CUQ97QQZcZrPrBXph4wJpa7hOAUOLirFGcvIpiDa6UO6kuYT3rXDfs9mmt4eZuuV/s1600/Bird.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn-RCm4Yx4vsGGHkTSxNIFwJY7SyVpXjlex5OLNEXg2Jqyv-zRdyvCr45Nmk_tX2AGQPy-FYqwleEaX3EDn9Mp4MjNDIxpR7OQw5aUslDUl760Zvfm6u5d3tVifJ_w0GABs76enjJRh9Ym/s1600/Bird3.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgn-RCm4Yx4vsGGHkTSxNIFwJY7SyVpXjlex5OLNEXg2Jqyv-zRdyvCr45Nmk_tX2AGQPy-FYqwleEaX3EDn9Mp4MjNDIxpR7OQw5aUslDUl760Zvfm6u5d3tVifJ_w0GABs76enjJRh9Ym/s1600/Bird3.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;i&gt;k&lt;/i&gt; = 5&lt;/th&gt;&lt;th&gt;&lt;i&gt;k&lt;/i&gt; = 4&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJPse5un5EC_juFT41MX4XWKUvMHX0512Jq-lBrea_NwsrS9JZzE4etpy8tqQeYS8e8A0-BBNeE1TfA-r7udT4zVRhAIYLhyphenhyphenqckSfRHi0yKHHafyCR42HiNUZ8f2RBTh7kAVb3PwLwhpIN/s1600/Bird4.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJPse5un5EC_juFT41MX4XWKUvMHX0512Jq-lBrea_NwsrS9JZzE4etpy8tqQeYS8e8A0-BBNeE1TfA-r7udT4zVRhAIYLhyphenhyphenqckSfRHi0yKHHafyCR42HiNUZ8f2RBTh7kAVb3PwLwhpIN/s1600/Bird4.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpP5g3zXI7UqOlWEi4IuF26HyQM_0RwESm6mfqfWGqUwHC-oGS7P3_1ryAApE1y5VSDtja9FRCuTaz1wESok7t3W_crWRUtdPBG-1jvQ1CWdIrJdU53XsOgqTmcq8DS_e8g3WAxvqcx7QS/s1600/Bird5.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpP5g3zXI7UqOlWEi4IuF26HyQM_0RwESm6mfqfWGqUwHC-oGS7P3_1ryAApE1y5VSDtja9FRCuTaz1wESok7t3W_crWRUtdPBG-1jvQ1CWdIrJdU53XsOgqTmcq8DS_e8g3WAxvqcx7QS/s1600/Bird5.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;i&gt;k&lt;/i&gt; = 3&lt;/th&gt;&lt;th&gt;&lt;i&gt;k&lt;/i&gt; = 2&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisnoZEEfRshGVaMkHtXNtQDevH0mdDIO9cZ4pnMhOzcL9sn_DyrCXWgJHT4DDGt5IRY0UCQG1BubJk8Kt3S42lCbZ4B-8rfvcLebNlaxdLwPN_nroTDYLPJurVfWrCBtm0pruQYKnuw60T/s1600/Bird2.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisnoZEEfRshGVaMkHtXNtQDevH0mdDIO9cZ4pnMhOzcL9sn_DyrCXWgJHT4DDGt5IRY0UCQG1BubJk8Kt3S42lCbZ4B-8rfvcLebNlaxdLwPN_nroTDYLPJurVfWrCBtm0pruQYKnuw60T/s1600/Bird2.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnIH96QSX_cVxOTKg_YqlySXtV-RW-RLrMLfFiRVNsHW177FauCcJl16_xwcgHLh5WwSNzWggLODUt6NKQZ02Yha5Zi8NtoU1HjLzWXXXQZEJ0k4KAWYErwlSBucRaBnkATG52UQMFOPBn/s1600/Bird6.png&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: 1em; margin-right: 1em;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnIH96QSX_cVxOTKg_YqlySXtV-RW-RLrMLfFiRVNsHW177FauCcJl16_xwcgHLh5WwSNzWggLODUt6NKQZ02Yha5Zi8NtoU1HjLzWXXXQZEJ0k4KAWYErwlSBucRaBnkATG52UQMFOPBn/s1600/Bird6.png&quot; height=&quot;140&quot; width=&quot;200&quot; /&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br/&gt;
I suggest you try it!
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/k-means_clustering&quot; target=&quot;_blank&quot;&gt;K-means clustering&lt;/a&gt;. &lt;i&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Main_Page&quot; target=&quot;_blank&quot;&gt;Wikipedia&lt;/a&gt;&lt;/i&gt;. Retrieved September 11, 2014.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/2606700552526644808/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/09/r-k-means-clustering-on-image.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2606700552526644808'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/2606700552526644808'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/09/r-k-means-clustering-on-image.html' title='R: k-Means Clustering on Imaging'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgubnFMq4Ex3R0Csr6YexaRJ5NYpgXp94xHo58rlluSPwCTaaBZ_t1J8XUXKjUcF1TAuQ7MxoU5iXRhojop0NPLTjm1_NsaqTWzxT4GawvVRbfvWakepNpqqZohCWrrzIzXMoimpsOMECZa/s72-c/ColorfulBird.jpg" height="72" width="72"/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-7457991516473879219</id><published>2014-09-08T23:09:00.000+08:00</published><updated>2014-10-03T17:43:14.230+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Mathematics"/><category scheme="http://www.blogger.com/atom/ns#" term="Real Analysis"/><title type='text'>Lebesgue Measure and Outer Measure Problems</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
More proofs, still on real analysis. These are my solutions; if you find any errors, do let me know.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Problems&lt;/h3&gt;
&lt;u&gt;Lebesgue Measure&lt;/u&gt;: Let $\mu$ be a set function defined for all sets in a $\sigma$-algebra $\mathscr{F}$, with values in $[0,\infty]$. Assume $\mu$ is countably additive over countable disjoint collections of sets in $\mathscr{F}$.
&lt;br /&gt;
&lt;ol&gt;
&lt;li&gt;Prove that if $A$ and $B$ are two sets in $\mathscr{F}$, with $A\subseteq B$, then $\mu(A)\leq \mu(B)$. This property is called &lt;i&gt;monotonicity&lt;/i&gt;.&lt;/li&gt;
&lt;li&gt;Prove that if there is a set $A$ in the collection $\mathscr{F}$ for which $\mu(A)&amp;lt;\infty$, then $\mu(\emptyset)=0$.&lt;/li&gt;
&lt;li&gt;Let $\{E_{k}\}_{k=1}^{\infty}$ be a countable collection of sets in $\mathscr{F}$. Prove that $\mu\left(\displaystyle\bigcup_{k=1}^{\infty}E_{k}\right)\leq \displaystyle\sum_{k=1}^{\infty}\mu(E_k)$&lt;/li&gt;
&lt;/ol&gt;
&lt;u&gt;Lebesgue Outer Measure&lt;/u&gt;:
&lt;br /&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Using properties of the outer measure, prove that the interval $[0,1]$ is not countable.&lt;/li&gt;
&lt;li&gt;Let $A$ be the set of irrational numbers in the interval $[0,1]$. Prove that $\mu^{*}(A)=1$.&lt;/li&gt;
&lt;li&gt;Let $B$ be the set of rational numbers in the interval $[0,1]$, and let $\{I_k\}_{k=1}^{n}$ be a finite collection of open intervals that covers $B$. Prove that $\displaystyle\sum_{k=1}^{n}\mu^{*}(I_k)\geq 1$.&lt;/li&gt;
&lt;li&gt;Prove that if $\mu^{*}(A)=0$, then $\mu^{*}(A\cup B)=\mu^{*}(B).$&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
&lt;h3&gt;
Solutions&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. If $A\subseteq B$, then $B= A\cup (B\cap A^c)\Rightarrow B= A\cup (B\backslash A)$. Thus,
\begin{equation}\nonumber
\begin{aligned}
\mu(B)&amp;amp;= \mu(A\cup (B\backslash A))\\
&amp;amp;= \mu(A)+\mu(B\backslash A)\\
&amp;amp;(\text{since}\;\mu\;\text{is countably additive on disjoint sets})
\end{aligned}
\end{equation}
Hence $\mu(B)\geq \mu(A)$, since $\mu(B\backslash A) \geq 0$.
$\hspace{13.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. For any set $A$ in $\mathscr{F}$ such that $\mu(A)&lt;\infty$, $A\cup \emptyset = A$. Thus,
\begin{equation}\nonumber
\begin{aligned}
\mu(A)&amp;amp;=\mu(A\cup \emptyset)=\mu(A)+\mu(\emptyset)\\
\mu(\emptyset)&amp;amp;=0,\;\text{since}\;\mu(A)&lt;\infty\;\text{may be subtracted from both sides}.
\end{aligned}
\end{equation}
$\hspace{13.5cm}\blacksquare$&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. We define a sequence $\{A_n\}_{n=1}^{\infty}\subseteq\mathscr{F}$, such that $A_1=E_1$ and
\begin{equation}\nonumber
A_n = E_n \backslash \bigcup_{k=1}^{n-1}E_k,\;\text{for}\;n&gt;1
\end{equation}
It is easy to see that the $A_n$ are pairwise disjoint, $\bigcup_{n=1}^{\infty}A_n=\bigcup_{k=1}^{\infty}E_k$, and $A_n\subseteq E_n$ for each $n$. Thus, by the countable additivity and monotonicity of $\mu$, we have 
\begin{equation}\nonumber
\begin{aligned}
\mu\left(\bigcup_{k=1}^{\infty}E_k\right)&amp;=\mu\left(\bigcup_{n=1}^{\infty}A_n\right)\\
&amp;=\sum_{n=1}^{\infty}\mu(A_n)\\
&amp;\leq \sum_{k=1}^{\infty}\mu(E_k)\;(\text{by monotonicity}).
\end{aligned}
\end{equation}
$\hspace{13.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. We argue by contradiction: assume the interval $[0,1]$ is countable. Every countable set has outer measure zero, so we would have $\mu^{*}([0,1])=0$. Now for any $\varepsilon &amp;gt;0$, the single interval $I = (0-\varepsilon, 1 + \varepsilon)$ covers $[0,1]$, and by the property of outer measure that $\mu^{*}$ of an interval is its length, we have
\begin{equation}\nonumber
\mu^{*}([0,1]) \leq \ell (I) = (1+\varepsilon) - (0-\varepsilon) = 1+2\varepsilon
\end{equation}
Since this bound holds for every $\varepsilon &amp;gt;0$, and no collection of open intervals of total length less than $1$ can cover $[0,1]$, we get $\mu^{*}([0,1])=1\neq 0$, a contradiction. Hence $[0,1]$ is not countable.$\hspace{2.13cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. If $A=\{\mathbb{Q}^c\cap [0,1]\}$ is the set of irrational numbers in the interval $[0,1]$, then $A^c=\{\mathbb{Q}\cap [0,1]\}$ is the set of rational numbers in the interval $[0,1]$. Now consider the following,
\begin{equation}\nonumber
\begin{aligned}
\mu^{*}([0,1])&amp;amp;=\mu^{*}(A)+\mu^{*}(A^{c})\\
\mu^{*}(A)&amp;amp;=\mu^{*}([0,1]) - \mu^{*}(A^{c})\\
&amp;amp;=1 -\mu^{*}(A^{c})
\end{aligned}
\end{equation}
We need to show that $A^{c}$ has outer measure zero. Since the rationals in $[0,1]$ are countable, enumerate them as $a_1,a_2,\cdots$. For $\varepsilon &gt; 0$, take $I_n = \left(a_n - \frac{\varepsilon}{2^{n+1}}, a_n + \frac{\varepsilon}{2^{n+1}}\right)$. Then $\bigcup_{n=1}^{\infty}I_n$ covers $A^{c}$, and by the definition of outer measure,
\begin{equation}\nonumber
\begin{aligned}
\mu^{*}(A^c)&amp; \leq \sum_{n=1}^{\infty}\ell(I_n)\\
&amp;= \sum_{n=1}^{\infty}\left(a_n + \frac{\varepsilon}{2^{n+1}} - a_n + \frac{\varepsilon}{2^{n+1}}\right)\\
&amp;= \sum_{n=1}^{\infty}\frac{\varepsilon}{2^{n}}\\
&amp;= \varepsilon.
\end{aligned}
\end{equation}
Since this holds for every $\varepsilon &gt; 0$, we conclude $\mu^{*}(A^c)=0$.
Thus, $\mu^{*}(A)=1-0=1$.&lt;br/&gt;
$\hspace{13.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. The rational numbers are dense in $\mathbb{R}$, so every point of $[0,1]$, rational or irrational, is a point of closure of $B$; that is, $\bar{B}=[0,1]$. Since $B\subseteq \bigcup_{k=1}^{n}I_k$, taking closures gives $\bar{B}\subseteq \overline{\bigcup_{k=1}^{n}I_k} = \bigcup_{k=1}^{n}\bar{I}_k$, the last equality because the union is finite. Thus, by the properties of outer measure,
\begin{equation}\nonumber
\begin{aligned}
1=\mu^{*}([0,1])&amp;=\mu^{*}(\bar{B})\leq \mu^{*}\left(\bigcup_{k=1}^{n}\bar{I}_k\right)\\
&amp;\leq \sum_{k=1}^{n}\mu^{*}(\bar{I}_k)=\sum_{k=1}^{n}\mu^{*}(I_k).
\end{aligned}
\end{equation}
Thus,
\begin{equation}\nonumber\sum_{k=1}^{n}\mu^{*}(I_k)\geq 1.
\end{equation}
$\hspace{13.5cm}\blacksquare$
&lt;/li&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. We need to show that,
\begin{equation}\nonumber
\begin{aligned}
&amp;\mu^{*}(A\cup B)\leq \mu^{*}(B)\\
&amp;\mu^{*}(B)\leq \mu^{*}(A\cup B)
\end{aligned}
\end{equation}
&lt;ol type = &quot;i&quot;&gt;
&lt;li&gt; By subadditivity of the outer measure,
\begin{equation}
\nonumber
\begin{aligned}
\mu^{*}(A\cup B)&amp;\leq \mu^{*}(A)+\mu^{*}(B)\\
&amp;= \mu^{*}(B),\;\text{since}\;\mu^{*}(A)=0.
\end{aligned}
\end{equation}
&lt;/li&gt;
&lt;li&gt;Since $B\subseteq A\cup B$, then from property of outer measure that if $A\subseteq B$, then $\mu^{*}(A)\leq \mu^{*}(B)$. Hence,
\begin{equation}\nonumber
\mu^{*}(B)\leq \mu^{*}(A\cup B)
\end{equation}
&lt;/li&gt;
&lt;/ol&gt;
$\hspace{13.5cm}\blacksquare$
&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.co.uk/gp/product/013143747X/ref=pd_lpo_sbs_dp_ss_3/275-0027308-5123953?pf_rd_m=A3P5ROKL5A1OLE&amp;amp;pf_rd_s=lpo-top-stripe&amp;amp;pf_rd_r=1GV7YACR2ZFEJSRSFD3N&amp;amp;pf_rd_t=201&amp;amp;pf_rd_p=479289247&amp;amp;pf_rd_i=0135113555&quot; target=&quot;_blank&quot;&gt;Royden, H.L. and Fitzpatrick, P.M. (2010). &lt;i&gt;Real Analysis&lt;/i&gt;. Pearson Education, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/7457991516473879219/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/09/lebesgue-measure-and-outer-measure.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/7457991516473879219'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/7457991516473879219'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/09/lebesgue-measure-and-outer-measure.html' title='Lebesgue Measure and Outer Measure Problems'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-875594146541861571</id><published>2014-09-07T11:15:00.000+08:00</published><updated>2014-09-08T23:20:16.290+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Mathematics"/><category scheme="http://www.blogger.com/atom/ns#" term="Real Analysis"/><title type='text'>Translation Invariant of Lebesgue Outer Measure</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Another proving problem, this time on Real Analysis.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Problem&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
Prove that the Lebesgue outer measure is translation invariant. (Use the property that the length $l$ of an interval is translation invariant.)
&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;h3&gt;
Solution&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;i&gt;Proof&lt;/i&gt;. The outer measure is translation invariant if for $y\in \mathbb{R}$,
\begin{equation}\nonumber
\mu^{*}(A)=\mu^{*}(A+y)
\end{equation}
Hence, we need to show both Case 1: $\mu^{*}(A+y)\leq \mu^{*}(A)$; and Case 2: $\mu^{*}(A)\leq \mu^{*}(A+y)$.&lt;br /&gt;&lt;br /&gt;
&lt;u&gt;Case 1&lt;/u&gt;: Consider a countable collection $\{I_n\}_{n=1}^{\infty}$, and let
\begin{equation}\nonumber
W = \left\{\displaystyle\sum_{n=1}^{\infty}l(I_n)\mid A\subseteq\displaystyle\bigcup_{n=1}^{\infty}I_n\right\}
\end{equation}
Then the outer measure of $A$ is,
\begin{equation}\nonumber
\mu^{*}(A)=\inf\,\{W\}.
\end{equation}
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;
Now take any $x\in W$; then there is a particular collection $\{\hat{I}_n\}$ that covers $A$ with $\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n)=x$. The translated collection $\{\hat{I}_n+y\}$ covers $A+y$, that is, $A+y\subseteq \displaystyle\bigcup_{n=1}^{\infty}\{\hat{I}_n + y\}$. And from this cover, we obtain the bound
\begin{equation}\nonumber
\begin{aligned}
\mu^{*}(A+y)&amp;amp;\leq\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n+y)\\
&amp;amp;=\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n),\;\text{since}\;l\;\text{is translation invariant}\\
&amp;amp;=x.
\end{aligned}
\end{equation}
And therefore, $W\subseteq\left\{\displaystyle\sum_{n=1}^{\infty}l(I_n)\mid A+y\subseteq \displaystyle\bigcup_{n=1}^{\infty}I_n\right\}$; since the infimum over the larger set is no greater, this implies $\mu^{*}(A+y)\leq \mu^{*}(A)$.&lt;br /&gt;&lt;br /&gt;
&lt;u&gt;Case 2&lt;/u&gt;: Using the same flow of reasoning as in Case 1, consider a countable collection $\{I_n\}_{n=1}^{\infty}$, and let
\begin{equation}\nonumber
V = \left\{\displaystyle\sum_{n=1}^{\infty}l(I_n)\mid A+y\subseteq\displaystyle\bigcup_{n=1}^{\infty}I_n\right\}
\end{equation}
Then the outer measure of $A+y$ is,
\begin{equation}\nonumber
\mu^{*}(A+y)=\inf\,\{V\}.
\end{equation}
Now take any $x\in V$; then there is a particular collection $\{\hat{I}_n\}$ that covers $A+y$ with $\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n)=x$. The translated collection $\{\hat{I}_n+(-y)\}$ covers $A$, that is, $A\subseteq \displaystyle\bigcup_{n=1}^{\infty}\{\hat{I}_n + (-y)\}$. And from this cover, we obtain the bound
\begin{equation}\nonumber
\begin{aligned}
\mu^{*}(A)&amp;amp;\leq\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n+(-y))\\
&amp;amp;=\displaystyle\sum_{n=1}^{\infty}l(\hat{I}_n),\;\text{since}\;l\;\text{is translation invariant}\\
&amp;amp;=x.
\end{aligned}
\end{equation}
And therefore, $V\subseteq\left\{\displaystyle\sum_{n=1}^{\infty}l(I_n)\mid A\subseteq \displaystyle\bigcup_{n=1}^{\infty}I_n\right\}$; since the infimum over the larger set is no greater, this implies $\mu^{*}(A)\leq \mu^{*}(A+y)$.&lt;br /&gt;&lt;br /&gt;
Since we have shown both cases, then $\mu^{*}(A)=\mu^{*}(A+y).\hspace{3.7cm}\blacksquare$
&lt;/li&gt;
&lt;/ol&gt;
&lt;br /&gt;
&lt;h3&gt;
Reference&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;a href=&quot;http://www.amazon.co.uk/gp/product/013143747X/ref=pd_lpo_sbs_dp_ss_3/275-0027308-5123953?pf_rd_m=A3P5ROKL5A1OLE&amp;amp;pf_rd_s=lpo-top-stripe&amp;amp;pf_rd_r=1GV7YACR2ZFEJSRSFD3N&amp;amp;pf_rd_t=201&amp;amp;pf_rd_p=479289247&amp;amp;pf_rd_i=0135113555&quot; target=&quot;_blank&quot;&gt;Royden, H.L. and Fitzpatrick, P.M. (2010). &lt;i&gt;Real Analysis&lt;/i&gt;. Pearson Education, Inc.&lt;/a&gt; 
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/875594146541861571/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/09/translation-invariant-of-lebesgue-outer.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/875594146541861571'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/875594146541861571'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/09/translation-invariant-of-lebesgue-outer.html' title='Translation Invariant of Lebesgue Outer Measure'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5979497974446854318.post-1341742472849113803</id><published>2014-09-05T20:55:00.000+08:00</published><updated>2014-09-12T10:57:54.712+08:00</updated><category scheme="http://www.blogger.com/atom/ns#" term="Image Analysis"/><category scheme="http://www.blogger.com/atom/ns#" term="Mathematica"/><category scheme="http://www.blogger.com/atom/ns#" term="R"/><category scheme="http://www.blogger.com/atom/ns#" term="Signal Processing"/><title type='text'>R: Image Analysis using EBImage</title><content type='html'>&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
&lt;div dir=&quot;ltr&quot; style=&quot;text-align: left;&quot; trbidi=&quot;on&quot;&gt;
Currently, I am taking &lt;i&gt;Statistics for Image Analysis&lt;/i&gt; in my master&#39;s program, and I have been exploring this topic in R. One package with capabilities in this field is &lt;a href=&quot;http://www.bioconductor.org/packages/release/bioc/html/EBImage.html&quot; target=&quot;_blank&quot;&gt;EBImage&lt;/a&gt; from &lt;a href=&quot;http://www.bioconductor.org/&quot; target=&quot;_blank&quot;&gt;Bioconductor&lt;/a&gt;, which will be showcased in this post.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Installation&lt;/h3&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/83573359af6c57dcea4f.js&quot;&gt;&lt;/script&gt;
For those using Ubuntu, you are likely to encounter this error:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/3bec32e1ad17515d7b3a.js&quot;&gt;&lt;/script&gt;
It has something to do with the &lt;code&gt;tiff.h&lt;/code&gt; &lt;a href=&quot;http://en.wikipedia.org/wiki/Include_directive&quot; target=&quot;_blank&quot;&gt;C header file&lt;/a&gt;, but it&#39;s not that serious, since &lt;a href=&quot;http://mytechscribblings.wordpress.com/&quot; target=&quot;_blank&quot;&gt;mytechscribblings&lt;/a&gt; has an effective &lt;a href=&quot;http://mytechscribblings.wordpress.com/2013/06/28/installing-ebimage-package-for-r-using-rstudio-in-ubuntu/&quot; target=&quot;_blank&quot;&gt;solution&lt;/a&gt; for it; do check that out.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Importing Data&lt;/h3&gt;
To import a raw image, consider the following codes:&lt;br /&gt;
&lt;a name=&#39;more&#39;&gt;&lt;/a&gt;&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/40778c752b73d5de4186.js&quot;&gt;&lt;/script&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;a href=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPEXI_-DFfoRDUEqFOSRSBbeXbiZegBTr41GW35WGMp1UFlpg6Pb3BDJzQ1GhYxYe8EHC0YyyNP9drQhiYddjME3qOue2Ej2KX9hscBNjIj358L1jrUnnGpc4a2CxZs2PBC-dak-ZA3ypB/s1600/tinago.JPG&quot; imageanchor=&quot;1&quot; style=&quot;margin-left: auto; margin-right: auto;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPEXI_-DFfoRDUEqFOSRSBbeXbiZegBTr41GW35WGMp1UFlpg6Pb3BDJzQ1GhYxYe8EHC0YyyNP9drQhiYddjME3qOue2Ej2KX9hscBNjIj358L1jrUnnGpc4a2CxZs2PBC-dak-ZA3ypB/s1600/tinago.JPG&quot; height=&quot;300&quot; width=&quot;400&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
Yes, this is the photo that we are going to use for our analysis. Needless to say, that&#39;s me and my friends. In the proceeding section we will do image manipulation and other processing.
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Image Properties&lt;/h3&gt;
So what do we get from our raw image? To answer that, simply run &lt;code&gt;print(Image)&lt;/code&gt;. This will return the properties of the image, including the array of pixel values. With this information, we can apply mathematical and statistical operations to enhance the image.&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/abd3f84e2eb19fea58f9.js&quot;&gt;&lt;/script&gt;
There are two sections (a summary and the array of pixel values) in the above output, with the following entries in the first section:&lt;br /&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 1: Information from 1st section of &lt;code&gt;print(Image)&lt;/code&gt;.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;colormode&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Color&lt;/td&gt;&lt;td&gt;The type (Color/Grayscale) of the color of the image.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;storage.mode&lt;/code&gt;&lt;/td&gt;&lt;td&gt;double&lt;/td&gt;&lt;td&gt;Type of values in the array.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;dim&lt;/code&gt;&lt;/td&gt;&lt;td&gt;1984 1488 3&lt;/td&gt;&lt;td&gt;Dimension of the array, (x, y, z).&lt;/td&gt;&lt;/tr&gt;
&lt;tr class=&quot;alt&quot;&gt;&lt;td&gt;&lt;code&gt;nb.total.frames&lt;/code&gt;&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;Number of channels in each pixel, the z entry in &lt;code&gt;dim&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;nb.render.frames&lt;/code&gt;&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;Number of channels rendered.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
The second section contains the values obtained by mapping the pixels of the image onto the real line between 0 and 1 (inclusive). The endpoints of this interval [0, 1] correspond to black and white, respectively; hence, pixels with values closer to 0 appear darker, and those closer to 1 appear lighter. And because the pixels are stored in a large array, we can use all of the matrix manipulations available in R for processing.
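To make this concrete, here is a minimal base-R sketch (no EBImage required) of a tiny grayscale &quot;image&quot; as a matrix of intensities in [0, 1], following the 0 = black, 1 = white convention; the values are made up for illustration:

```r
# A tiny 3 x 3 grayscale "image": 0 = black, 1 = white
img = matrix(c(0.0, 0.5, 1.0,
               0.2, 0.5, 0.8,
               0.0, 0.5, 1.0), nrow = 3, byrow = TRUE)

# Because it is an ordinary matrix, any matrix operation applies,
# e.g. the mean intensity of the whole image:
mean(img)  # 0.5
```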
&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
Adjusting Brightness&lt;/h3&gt;
It is best to start with the basics, one of which is brightness. Since the pixels are numeric values, brightness can be adjusted simply with &lt;code&gt;+&lt;/code&gt; or &lt;code&gt;-&lt;/code&gt;:&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/9994581ad659b5fdb124.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lighter&lt;/th&gt;&lt;th&gt;Darker&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 2: Adjusting Brightness.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBhuneKWbkGzwxDMM9MS1e7Xp0ONwJuH7YAilVMQ4RBODJFhVsV5feHiTNfUcNsPj2d6h8FLJ7LHSfsiAKtphV7SzvBUf-8jFqDbKlCYefBSCap2rbYf1MN6MxfxkMCDXItEegr9sHNT2-/s200/Image3.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image1)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZdJ7h_xa8fwAD5H23p0QAl68kVJsELDRQ2d9stbqc2L3AKpNI2Ogdbi2CaMYaZ_-B4t9UJdyek_95bGYVED6RKwziV5LEnILWSQikXBDWk0UezwH4nDmTDl1mtNXQ5_Lrf-WFn9w7FWCh/s200/Image2.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image2)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
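Under the hood, adding a constant shifts every pixel toward white and subtracting shifts it toward black. A minimal base-R sketch of the same idea on a plain intensity matrix; the clipping back into [0, 1] is an assumption added here for illustration:

```r
img = matrix(c(0.2, 0.4, 0.6, 0.8), nrow = 2)

# Shift every pixel by a constant, then clip back into [0, 1]
brighten = function(x, c) pmin(pmax(x + c, 0), 1)

lighter = brighten(img,  0.3)  # shifted toward white
darker  = brighten(img, -0.3)  # shifted toward black

lighter[1, 1]  # 0.5
darker[2, 2]   # 0.5
```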
&lt;br /&gt;
&lt;h3&gt;
Adjusting Contrast&lt;/h3&gt;
Contrast can be adjusted using the multiplication operator (&lt;code&gt;*&lt;/code&gt;):
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/5ece4021aeef8c08d38d.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Low&lt;/th&gt;&lt;th&gt;High&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 3: Adjusting Contrast.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHZcTD4WEqWNOttrO0SZ3fscnA3TOyELCs-TzMapMyiyPT2RcApfaBq_l5FeBxRep-ex8nqp6A49L4u5kDHZE0i5w1oz_tGgAFVa-1OjQeN4jl0Ie-t1bgIphYf48OctMAZqey6G4ntJH0/s200/Image4.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image3)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4JjxKyTEwCIA5mkrHHBDg44mM_soleomfG0QBP49awf5GORAjENCCPccV2vlRasKhyWnWxHIG7twIEZZppX5-XW3v6T-a5s6RrRfSMlmD0yoOcd1PsxexDVn1pz74tYnnUcEYPxjWhmPm/s200/Image5.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image4)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
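Multiplication scales every pixel, stretching the spread of intensities away from black (factor above 1) or compressing it toward black (factor below 1). A sketch on a plain matrix, again with illustrative clipping:

```r
img = matrix(c(0.2, 0.4, 0.6, 0.8), nrow = 2)

# Scale every pixel by a factor, then clip back into [0, 1]
contrast = function(x, f) pmin(pmax(x * f, 0), 1)

low  = contrast(img, 0.5)  # values huddle near black: lower contrast
high = contrast(img, 2.0)  # values spread out (and clip): higher contrast

range(low)   # 0.1 0.4
range(high)  # 0.4 1.0
```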
&lt;br /&gt;
&lt;h3&gt;
Gamma Correction&lt;/h3&gt;
Gamma correction is the name of a nonlinear operation used to encode and decode luminance or tristimulus values in video or still image systems, defined by the following power-law expression:
\begin{equation}\nonumber
V_{\mathrm{out}} = AV_{\mathrm{in}}^{\gamma}
\end{equation}
where $A$ is a constant and the input and output values are non-negative real values; in the common case of $A = 1$, inputs and outputs are typically in the range $[0, 1]$. A gamma value $\gamma &amp;lt; 1$ is sometimes called an &lt;b&gt;encoding gamma&lt;/b&gt; (Wikipedia, Ref. 1).&lt;br /&gt;
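With $A = 1$ the power law is just elementwise exponentiation of the pixel array, which is what the operator &lt;code&gt;^&lt;/code&gt; does in R. A small sketch of the formula on a few sample intensities:

```r
# V_out = A * V_in ^ gamma, with A defaulting to 1
gamma_correct = function(v, gamma, A = 1) A * v ^ gamma

v = c(0.25, 0.5, 0.75)

gamma_correct(v, 2)    # gamma above 1 darkens midtones: 0.0625 0.2500 0.5625
gamma_correct(v, 0.7)  # encoding gamma below 1 lightens midtones
```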
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/24632941aba052c44641.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;$\gamma = 2$&lt;/th&gt;&lt;th&gt;$\gamma = 0.7$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 4: Adjusting Gamma Correction.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9gq2y6gBLH7BIocxW-ZzcsfvqSQXcF_eqi1-a-OTVKS6ZYig4-FUntAudcUJoTaVxS9UQaMkPiYn02XYCicWCwTGoJHWoWmeCsztiiDHZ8xOV846LtBy07BrFJfKpFthyphenhyphenKu-PQIcM1MiD/s200/Image6.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image5)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDfm8AHAZyurELK69vF76Qo7P3V4_sJq5obzq5yFO4hgQcOV33TMkd_1lhE2vyjmqPCx519SX_qh4oJ_f50IhdUp2SGp_NrPfTx6TcvxQLDFJe0fuHiOPiFkx5fCh3ZrFKDur_CPLNE38f/s200/Image7.png&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(Image6)&lt;/code&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;h3&gt;
Cropping&lt;/h3&gt;
Slicing the array of pixels simply means cropping the image.
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8a5b5cf36dc74be36574.js&quot;&gt;&lt;/script&gt;
&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgEDSwuGs0NR7z1hEdQAA49DhOcOvQbjM1GPgPaT6dWwDCPf7ldrDIG8_OJrkfM69-CGZbkw9d9jAnyWNW268slmGaRo2tvi6dWpr0lzdO-MuwR6QbazI_x8TZSmsgbcWFk-NUzhrI6_3HW/s1600/Image8.png&quot; height=&quot;187&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of the above code.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;/div&gt;
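Since cropping is just array slicing, the same works on any matrix; a sketch (the index ranges here are illustrative, not the ones used in the gist above):

```r
img = matrix(runif(100 * 80), nrow = 100, ncol = 80)

# Keep rows 21-60 and columns 11-50: a 40 x 40 crop
crop = img[21:60, 11:50]

dim(crop)  # 40 40
```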
&lt;h3&gt;
Spatial Transformation&lt;/h3&gt;
Spatial manipulations like rotation (&lt;code&gt;rotate&lt;/code&gt;), flipping (&lt;code&gt;flip&lt;/code&gt;), and translation (&lt;code&gt;translate&lt;/code&gt;) are also available in the package. Check this out:
&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8fde183b0944c1044fb2.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWmXdxfB-CSmX-4RsyU-tEe12uZLL-31ap6lx11E4BlO0X2dTgJub9GHQ9xJN7sww6sh3y5V2fm4a7ownC-LxHudDIISVoxSnw0eHUqGPaH-k8pcp-E7kim1CeUl-MZS1g2ZtmWFkDEpH6/s1600/Image9.png&quot; height=&quot;240&quot; width=&quot;320&quot; /&gt;&lt;/div&gt;
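For intuition, the same transformations can be expressed with base-R index gymnastics on a plain matrix (EBImage&#39;s &lt;code&gt;rotate&lt;/code&gt; is more general, interpolating at arbitrary angles; this sketch only covers the 90-degree case):

```r
img = matrix(1:6, nrow = 2)           # a 2 x 3 toy "image"

flip_vertical = img[nrow(img):1, ]    # reverse the rows: vertical flip
rotate90      = t(img)[, nrow(img):1] # transpose, then reverse the
                                      # columns: 90-degree clockwise turn

dim(rotate90)  # 3 2
```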
&lt;br /&gt;
&lt;h3&gt;
Color Management&lt;/h3&gt;
The array of pixels has three axes; in our case its dimension is 1984 x 1488 x 3. The third axis is the slot for the three channels: red, green, and blue (RGB). Hence, transforming the &lt;code&gt;colormode&lt;/code&gt; from &lt;code&gt;Color&lt;/code&gt; to &lt;code&gt;Grayscale&lt;/code&gt; disjoins the three channels from a single rendered frame (three channels for each pixel) into three separate arrays of pixels for the red, green, and blue frames.&lt;br /&gt;
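In array terms the split is just slicing along the third axis. A sketch with a tiny random RGB array (dimensions shrunk from 1984 x 1488 to keep it readable; the grayscale weights are the usual Rec. 601 luminosity weights, an assumption here rather than what EBImage uses):

```r
rgb = array(runif(4 * 3 * 3), dim = c(4, 3, 3))  # x, y, channel

red   = rgb[, , 1]
green = rgb[, , 2]
blue  = rgb[, , 3]

# One common grayscale conversion: a luminosity-weighted average
gray = 0.299 * red + 0.587 * green + 0.114 * blue
dim(gray)  # 4 3
```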
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/7c2348ec3131ae6a173f.js&quot;&gt;&lt;/script&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original&lt;/th&gt;&lt;th&gt;Red Channel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 5: Color Mode Transformation.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhlV1YQolUACP35HDj3oWIde02mlfW2NQXcAGfqkWEVHfdg5b36E9eE6A-uyE02_kb81qB8QfXbntuQecyiribslXdIY6Lkl78GF1rcC8D2DphsIWnWCha23jATPVjQoYvbmdf3my5LgYJr/s200/Image1.png&quot; /&gt;&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhRCof9diG8VM1MWZqXXVtS676c8OdGY9wmWmQVw2F_D76icPtyFdhatk2LiLszI6iR1TJHAcZtFyG-ruaavGrmBo8TT6DWvo05sIXzb-qQPUtUtrJiZ5TFu8dMQE5QEYABiE5Dm3WX3-hB/s200/Image10.png&quot; /&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Green Channel&lt;/th&gt;&lt;th&gt;Blue Channel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmNAFfvSAcukgR3Efi-lS2GIzzhUtKATA06vxpDbxx6lb_nX-CrLm6jewe3KywLuBTNWv9uTSOpRG8XLVcaL4ENH2W3ztalIt-Zje6JI170UEza4G9Yvpmge4tc0wA3KPwWzgRviinNgsu/s200/Image10-g.png&quot; /&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg92Qst7awrgOp62tE_WECVewnRieGeUgKPqI0nW13SntXBfsaDt62GbulL9J_7wKoR77wbyKZ-qjArqQtZsHqwslL7ZNTdG-UdYDR9k4fteh68qwry03JnFr3b7z1bPTMB02mKK_tciHoK/s200/Image10-b.png&quot; /&gt;&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
To revert the color mode, simply run&lt;br /&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/2e92358020609764e8a6.js&quot;&gt;&lt;/script&gt;
&lt;h3&gt;
Filtering&lt;/h3&gt;
In this section, we will do smoothing/blurring using a low-pass filter, and edge detection using a high-pass filter. In addition, we will investigate the median filter for removing noise.&lt;br /&gt;
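Both filters are convolutions with a small kernel: a low-pass kernel averages a pixel with its neighbors (blur), while a high-pass kernel subtracts the neighborhood from the center (edges). A naive base-R convolution sketch; EBImage performs this far more efficiently, and the kernels below are illustrative:

```r
# Naive 2D convolution; borders are left at zero for simplicity
convolve2d = function(img, k) {
  r = (nrow(k) - 1) / 2                  # kernel radius (assumes odd size)
  out = matrix(0, nrow(img), ncol(img))
  for (i in (1 + r):(nrow(img) - r)) {
    for (j in (1 + r):(ncol(img) - r)) {
      patch = img[(i - r):(i + r), (j - r):(j + r)]
      out[i, j] = sum(patch * k)
    }
  }
  out
}

low_pass  = matrix(1 / 9, 3, 3)          # 3 x 3 mean blur
high_pass = matrix(c(-1, -1, -1,
                     -1,  8, -1,
                     -1, -1, -1), 3, 3)  # classic edge-detection kernel

img   = matrix(runif(25), 5, 5)
blur  = convolve2d(img, low_pass)
edges = convolve2d(img, high_pass)       # near zero in flat regions
```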
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Low-Pass (Blur)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 6: Image Filtering.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-xVmQCajznGSO02H1cJICfhkXK4UNXbQMY3DeUcWsS2-IniFlXtYYjOyneWST63H3gXVTzwFAm3Q6QyJXFgSrLafsu7-yg4HF_YeLVgvs5EUW7aly3YzKWhCb-QYgLg5mb2lMD2JID0mG/s320/Image11.png&quot; /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/8c0f1a2ed48383277f8f.js&quot;&gt;&lt;/script&gt;&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;High Pass&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYgRqNz7FC1-uL8nHRi_Cd8o0Esld1dnHN1fzVAH33C4g3_1FfqoGJ0LhbG_lDCNY9aIgfix_mS7CvDP308gIBxVhQ7w0G6_yDU6EQcaKfAUix-x6VBUTJO4or2PizbR-NNPMR6q_Cfc79/s320/Image12.png&quot; /&gt;&lt;/div&gt;
&lt;div&gt;
&lt;script src=&quot;https://gist.github.com/alstat/1ef92750b21735e39366.js&quot;&gt;&lt;/script&gt;
&lt;/div&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;div class=&quot;datagrid&quot;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Original&lt;/th&gt;&lt;th&gt;Filtered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tfoot&gt;
&lt;tr&gt;&lt;td colspan=&quot;3&quot; style=&quot;text-align: center;&quot;&gt;&lt;div id=&quot;paging&quot;&gt;
&lt;i&gt;Table 7: Median Filter.&lt;/i&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tfoot&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiD-ltkZKGoEewCHPpjj7n81VG3e-pB79XrWIedRnH-XXJuQnSCghPr-CoRnO9JI_4xfQA24lH1s24uzr1B44D9Prud_VDp-MBh1ouPn67PK8mR-ev5J17HPtcOwlcBtKjiS-E0fQ77H__q/s1600/peppersalt.jpg&quot; height=&quot;200&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; width=&quot;200&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;From Google, Link &lt;a href=&quot;http://my.fit.edu/~vkepuska/ece3552/esp_book/adsp/chap10/image_files/noisy_image/peppersalt.bmp&quot; target=&quot;_blank&quot;&gt;Here&lt;/a&gt;.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;margin-left: 1em; margin-right: 1em; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;td&gt;&lt;table align=&quot;center&quot; cellpadding=&quot;0&quot; cellspacing=&quot;0&quot; class=&quot;tr-caption-container&quot; style=&quot;margin-left: auto; margin-right: auto; text-align: center;&quot;&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style=&quot;text-align: center;&quot;&gt;&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXKuanmODf2FVYMWCp8mA9WFeHhNGzjzYCkjVj1ea92BLATQVcAr2h6mej7QTO5jHUB4KQlUZSCBXRBRz8F5YGg5WPATYNsWTXVXRN_Fga6iT3wloMyTRUGq-gByaoK0yS-kiJ1geufuDb/s1600/psfiltered.png&quot; height=&quot;200&quot; style=&quot;margin-left: auto; margin-right: auto;&quot; width=&quot;200&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;tr-caption&quot; style=&quot;text-align: center;&quot;&gt;Output of &lt;code&gt;display(medFltr)&lt;/code&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;div class=&quot;separator&quot; style=&quot;margin-left: 1em; margin-right: 1em; text-align: center;&quot;&gt;
&lt;/div&gt;
&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;br /&gt;
&lt;script src=&quot;https://gist.github.com/alstat/d0630e5c6420fc6cad9a.js&quot;&gt;&lt;/script&gt;
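The median filter replaces each pixel with the median of its neighborhood, which is what removes isolated salt-and-pepper specks. A naive sketch on a first (3 x 3) neighborhood, ignoring borders for simplicity:

```r
# Replace each interior pixel with the median of its 3 x 3 neighborhood
median_filter3 = function(img) {
  out = img
  for (i in 2:(nrow(img) - 1)) {
    for (j in 2:(ncol(img) - 1)) {
      out[i, j] = median(img[(i - 1):(i + 1), (j - 1):(j + 1)])
    }
  }
  out
}

img = matrix(0.5, 5, 5)
img[3, 3] = 1                    # an isolated "salt" speck
median_filter3(img)[3, 3]        # 0.5: the speck is gone
```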
For comparison, I ran a median filter on the first neighborhood in Mathematica, and I got this:
&lt;br /&gt;
&lt;div class=&quot;separator&quot; style=&quot;clear: both; text-align: center;&quot;&gt;
&lt;img border=&quot;0&quot; src=&quot;https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg9MU8ILvWLZR_xoHY6K0mtUSgwar5iC-VgjJ1ZDa0JYbIi5hWQkLRgggPoKioCD0bQrnhTUpU4Ce3pE4hKda9NGTsXHkTqL3jS4Ur1Rgp-Qn2oBtC-SvcUPbyPOpgQBqWIEuBB3TiY7Z4b/s1600/M10MedFilter.png&quot; /&gt;&lt;/div&gt;
Clearly, Mathematica gives better enhancement than R for this particular filter. But R already has a good foundation, as we witnessed with EBImage. There are still lots of interesting functions in the said package worth exploring, and I suggest you check them out.&lt;br /&gt;
&lt;br /&gt;
We will stop here for the meantime, hoping to play more with this topic in a succeeding post.&lt;br /&gt;
&lt;br /&gt;
&lt;h3&gt;
References&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Gamma_correction&quot; target=&quot;_blank&quot;&gt;Gamma Correction&lt;/a&gt;. &lt;i&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Main_Page&quot; target=&quot;_blank&quot;&gt;Wikipedia&lt;/a&gt;&lt;/i&gt;. Retrieved August 31, 2014.&lt;/li&gt;
&lt;li&gt;Gregoire Pau, Oleg Sklyar, Wolfgang Huber (2014). &lt;i&gt;&lt;a href=&quot;http://www.bioconductor.org/packages/release/bioc/vignettes/EBImage/inst/doc/EBImage-introduction.pdf&quot; target=&quot;_blank&quot;&gt;Introduction to EBImage, an image processing and analysis toolkit for R&lt;/a&gt;&lt;/i&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content><link rel='replies' type='application/atom+xml' href='http://alstatr.blogspot.com/feeds/1341742472849113803/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://alstatr.blogspot.com/2014/09/r-image-analysis-using-ebimage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1341742472849113803'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5979497974446854318/posts/default/1341742472849113803'/><link rel='alternate' type='text/html' href='http://alstatr.blogspot.com/2014/09/r-image-analysis-using-ebimage.html' title='R: Image Analysis using EBImage'/><author><name>Unknown</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='https://img1.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPEXI_-DFfoRDUEqFOSRSBbeXbiZegBTr41GW35WGMp1UFlpg6Pb3BDJzQ1GhYxYe8EHC0YyyNP9drQhiYddjME3qOue2Ej2KX9hscBNjIj358L1jrUnnGpc4a2CxZs2PBC-dak-ZA3ypB/s72-c/tinago.JPG" height="72" width="72"/><thr:total>0</thr:total></entry></feed>