Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neural activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. This paper proposes a model-based functional data analysis methodology, via Gaussian mixtures, for clustering data from such visualizations. The methodology is theoretically justified, and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.
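The core building block of such an approach can be sketched in a few lines: summarize each trace by a feature and fit a Gaussian mixture by EM. Everything below (two groups, the mean-fluorescence feature, the EM initialization) is an illustrative assumption, not the paper's actual model.

```python
# Sketch: cluster simulated calcium-trace features with a two-component
# Gaussian mixture fitted by EM. Feature choice and parameters are
# hypothetical, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Simulate 40 "active" and 40 "quiet" traces of length 100.
active = rng.normal(5.0, 1.0, size=(40, 100))
quiet = rng.normal(0.0, 1.0, size=(40, 100))
traces = np.vstack([active, quiet])

x = traces.mean(axis=1)            # one summary feature per trace

# EM for a one-dimensional two-component Gaussian mixture.
mu = np.array([x.min(), x.max()])  # initialize components at the extremes
sigma = np.array([x.std(), x.std()])
pi = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibilities of each component for each trace.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update mixing weights, means, standard deviations.
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

labels = r.argmax(axis=1)          # hard cluster assignments
```

With well-separated groups the mixture recovers the two activity regimes; a functional version would replace the scalar feature with basis coefficients of each curve.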

Classification is an important tool with many useful applications. Fisher's linear discriminant analysis (LDA) is a traditional model-based classification method which makes use of Gaussian distributional information. However, in the high-dimensional, low-sample-size setting, LDA cannot be deployed directly because the sample covariance matrix is not invertible. While there are modern methods designed for high-dimensional data, they may not fully use the distributional information the way LDA does. Hence, in some situations, it is still desirable to use a model-based method for classification. This paper exploits the potential of LDA in a more complicated data setting. In many real applications, it is costly to manually label observations; consequently, often only a small portion of the data is labeled, while a large number of observations are left without labels. It is a great challenge to obtain good classification performance from the labeled data alone, especially in the high-dimensional setting. To overcome this issue, we propose a semisupervised sparse LDA classifier that takes advantage of the seemingly useless unlabeled data, which helps to boost classification performance in some situations. A direct estimation method is used to reconstruct LDA and achieve sparsity; meanwhile, we employ the difference-convex algorithm to handle the nonconvex loss function associated with the unlabeled data. Theoretical properties of the proposed classifier are studied. Our simulated examples help clarify when and how the information extracted from the unlabeled data can be useful. A real data example further illustrates the usefulness of the proposed method.
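A toy sketch conveys the flavor of borrowing strength from unlabeled data: fit Fisher's rule on a few labeled points, pseudo-label the unlabeled pool, and refit. This simple self-training loop is only an illustration; the paper's direct sparse estimator and difference-convex algorithm are not shown, and all data dimensions and sample sizes are made up.

```python
# Toy self-training sketch of semisupervised LDA (not the paper's method).
import numpy as np

rng = np.random.default_rng(1)
d = 5

def lda_rule(X, y):
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    cov = np.cov(X.T) + 0.1 * np.eye(d)       # ridge keeps it invertible
    return np.linalg.solve(cov, mu1 - mu0), (mu0 + mu1) / 2

def predict(X, beta, mid):
    return ((X - mid) @ beta > 0).astype(int)

# Two Gaussian classes; only 10 labeled points, 200 unlabeled.
shift = np.r_[3.0, np.zeros(d - 1)]
Xlab = np.vstack([rng.normal(0, 1, (5, d)),
                  rng.normal(0, 1, (5, d)) + shift])
ylab = np.r_[np.zeros(5), np.ones(5)].astype(int)
Xun = np.vstack([rng.normal(0, 1, (100, d)),
                 rng.normal(0, 1, (100, d)) + shift])

beta, mid = lda_rule(Xlab, ylab)
for _ in range(3):                             # self-training iterations
    yhat = predict(Xun, beta, mid)
    beta, mid = lda_rule(np.vstack([Xlab, Xun]), np.r_[ylab, yhat])

ytrue = np.r_[np.zeros(100), np.ones(100)].astype(int)
acc = (predict(Xun, beta, mid) == ytrue).mean()
```

The refit uses two hundred pseudo-labeled points to stabilize the covariance and mean estimates that ten labeled points alone would estimate poorly.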

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data: they can handle mixed types of missing data, they adapt to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. The algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
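The proximity idea can be illustrated without growing a forest: fill a missing cell with a weighted average over similar cases. In a real RF the proximity of two cases is how often they land in the same terminal node; the distance kernel below is merely a stand-in for that, and the data, kernel bandwidth, and seed are all illustrative choices.

```python
# Sketch of proximity-style imputation with a distance kernel standing in
# for random forest proximities (illustration only).
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=n)   # column 2 tracks column 0
truth = X[50, 2]
X[50, 2] = np.nan                               # knock one value out

def proximity_impute(X):
    """Fill each missing cell with a proximity-weighted average."""
    Xf = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    Xf[miss] = np.take(col_means, np.where(miss)[1])   # initial mean fill
    for i, j in zip(*np.where(miss)):
        others = np.delete(np.arange(X.shape[1]), j)   # observed columns
        d = np.linalg.norm(Xf[:, others] - Xf[i, others], axis=1)
        w = np.exp(-4.0 * d ** 2)                      # proximity stand-in
        w[i] = 0.0                                     # exclude the case itself
        valid = ~np.isnan(X[:, j])
        Xf[i, j] = np.average(X[valid, j], weights=w[valid])
    return Xf

imputed = proximity_impute(X)[50, 2]
```

Because column 2 is strongly correlated with column 0, cases close in the observed columns carry most of the information about the missing value, which is exactly the regime where the abstract reports RF imputation performing best.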

In this paper we propose a Bayesian semiparametric regression model to estimate and test the effect of a genetic pathway on prostate-specific antigen (PSA) measurements for patients with prostate cancer. The underlying functional relationship between the genetic pathway and PSA is modeled using reproducing kernel Hilbert space (RKHS) theory. The RKHS formulation makes our model highly flexible, allowing it to capture the complex multidimensional relationship between the genes in a genetic pathway and the response. Moreover, higher-order and nonlinear interactions among the genes in a pathway are automatically modeled through our kernel-based representation. We illustrate the connection between our RKHS-based semiparametric regression and a linear mixed model by choosing a special prior distribution on the model parameters. To test the significance of a genetic pathway for a phenotypic response such as PSA, we propose a Bayesian hypothesis testing scheme based on the Bayes factor. An efficient Markov chain Monte Carlo algorithm is designed to estimate the model parameters, Bayes factors, and the genetic pathway effect simultaneously. We illustrate the effectiveness of our model through five simulation studies and an analysis of real prostate cancer gene expression data.
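The frequentist analogue of placing an RKHS prior on the pathway effect is kernel ridge regression, which makes the flexibility of the formulation concrete. The Gaussian kernel, bandwidth, penalty, and simulated "pathway" below are illustrative assumptions, not the paper's specification.

```python
# Sketch: estimating a nonparametric pathway effect h(z) by kernel ridge
# regression, the point-estimate analogue of an RKHS/Gaussian-process prior.
import numpy as np

rng = np.random.default_rng(3)
n = 80
Z = rng.normal(size=(n, 2))                    # expression of 2 "genes"
y = np.sin(Z[:, 0]) * Z[:, 1] + 0.1 * rng.normal(size=n)  # nonlinear interaction

def gauss_kernel(A, B, rho=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * rho ** 2))

K = gauss_kernel(Z, Z)
lam = 0.1                                      # roughness penalty
alpha = np.linalg.solve(K + lam * np.eye(n), y)
fitted = K @ alpha                             # estimated pathway effect h(Z)
```

Note that the response depends on a product of the two inputs, a nonlinear interaction that a linear model in the gene expressions would miss entirely but that the kernel representation captures automatically.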

In this paper, we consider the problem of modeling a matrix of count data, in which multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. In order to take into account the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features and simultaneously cluster the samples into homogeneous groups. We show how our approach improves clustering accuracy relative to more standard approaches for the analysis of count data, by means of a simulation study and an application to a *bag-of-words* benchmark data set, where the features are represented by the frequencies of occurrence of each word.
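The zero-inflated Poisson building block is simple to state: a count is zero with extra probability pi, and otherwise Poisson with rate lam. The parameter values below are arbitrary, chosen only to make the excess of zeros visible.

```python
# Minimal sketch of the zero-inflated Poisson (ZIP) distribution: compare
# its zero rate with that of a plain Poisson with the same rate.
import numpy as np

rng = np.random.default_rng(4)
pi, lam, n = 0.3, 4.0, 100_000                 # illustrative parameters

inflated = rng.random(n) < pi                  # structural zeros
counts = np.where(inflated, 0, rng.poisson(lam, n))

zip_zero_rate = (counts == 0).mean()           # expect pi + (1-pi)*exp(-lam)
poisson_zero_rate = np.exp(-lam)               # plain Poisson zero probability
```

A plain Poisson with rate 4 produces zeros only about 1.8% of the time, so the empirical zero rate above 30% is the overdispersion signature that motivates the zero-inflated component; the paper's framework then mixes several such components across clusters of samples.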

This paper deals with noise-affected economic time series: realizations of stochastic processes exhibiting complex and possibly nonlinear dynamics. Time series found in economics notoriously suffer from problems such as low signal-to-noise ratios, asymmetric cycles, and multiregime patterns. In such a framework, even sophisticated statistical models might generate suboptimal predictions, whose quality can further deteriorate unless time-consuming updating or deeper model revision procedures are carried out on a regular basis. However, when the models' outcomes are expected to be disseminated in a timely manner (as in the case of central banks or national statistical offices), such modifications might not be viable due to time constraints. On the other hand, while the application of simpler linear models usually entails relatively easier tuning-up procedures, this comes at the expense of the quality of the predictions yielded. A mixed, self-tuning forecasting method is therefore proposed. It is an automatic, 2-stage procedure able to generate predictions by exploiting the denoising capabilities of wavelet theory in conjunction with a compound forecasting generator. Its out-of-sample performance is evaluated through an empirical study carried out on macroeconomic time series.
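The denoising stage can be sketched with the simplest wavelet available: a one-level Haar transform whose detail coefficients are soft-thresholded. The synthetic "business cycle," noise level, and threshold constant are illustrative choices; the method's full multiresolution analysis and compound forecaster are not reproduced here.

```python
# Sketch: one-level Haar wavelet denoising with soft thresholding.
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(256)
signal = np.sin(2 * np.pi * t / 64)              # smooth "cycle" component
y = signal + 0.5 * rng.normal(size=t.size)       # noisy observations

# One-level Haar analysis: pairwise averages and differences.
approx = (y[0::2] + y[1::2]) / np.sqrt(2)
detail = (y[0::2] - y[1::2]) / np.sqrt(2)

# Soft-threshold the detail coefficients (universal-style threshold).
thr = 0.5 * np.sqrt(2 * np.log(y.size))
detail = np.sign(detail) * np.maximum(np.abs(detail) - thr, 0.0)

# Synthesis back to the original sampling grid.
den = np.empty_like(y)
den[0::2] = (approx + detail) / np.sqrt(2)
den[1::2] = (approx - detail) / np.sqrt(2)

mse_raw = np.mean((y - signal) ** 2)
mse_den = np.mean((den - signal) ** 2)
```

Because the smooth cycle lives almost entirely in the approximation coefficients, thresholding removes noise with little distortion of the signal, and any downstream forecaster then works from a cleaner input.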

A nonparametric statistic, called the roughness of concomitant ranks, is proposed for testing whether 2 quantitative vectors are dependent. The new testing procedure is computationally efficient and simple, and exhibits competitive empirical performance in simulations and 2 microarray data analyses. We apply the new method to screen variables for high-dimensional data analysis. For low signal-to-noise settings, we suggest using data binning to increase the power of the test. Simulation results show the favorable performance of the proposed method compared with existing screening methods.
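The concomitant-rank construction itself is easy to illustrate: sort the pairs by x and look at the resulting sequence of y-ranks. The mean squared successive difference used below is only a stand-in roughness measure for illustration (the paper defines its own statistic): it is small when the sequence trends smoothly under dependence and large when the sequence is an essentially random permutation under independence.

```python
# Illustration of concomitant ranks with a stand-in roughness measure
# (mean squared successive difference, normalized); not the paper's statistic.
import numpy as np

rng = np.random.default_rng(6)
n = 500

def roughness(x, y):
    order = np.argsort(x)
    ranks = np.argsort(np.argsort(y))          # ranks of y (0..n-1)
    conc = ranks[order]                        # concomitant ranks along x
    return np.mean(np.diff(conc.astype(float)) ** 2) / n ** 2

x = rng.normal(size=n)
y_dep = x + 0.2 * rng.normal(size=n)           # strongly dependent
y_ind = rng.normal(size=n)                     # independent of x

r_dep, r_ind = roughness(x, y_dep), roughness(x, y_ind)
```

Under independence the normalized value concentrates near 1/6, while dependence drives it toward zero, so the statistic needs only two sorts and a pass over the ranks, which is what makes screening many variables cheap.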

Quadratic and linear discriminant analysis (QDA and LDA) are the classification rules most often applied under normality. In QDA, a separate covariance matrix is estimated for each group. If there are more variables than observations in the groups, the usual estimates are singular and can no longer be used. Assuming homoscedasticity, as in LDA, reduces the number of parameters to estimate; this rather strong assumption is, however, rarely verified in practice. Regularized discriminant techniques that are computable in high dimension and cover the path between the 2 extremes QDA and LDA have been proposed in the literature. However, these procedures rely on sample covariance matrices, and as such become inappropriate in the presence of cellwise outliers, a type of outlier that is very likely to occur in high-dimensional data sets. In this paper, we propose cellwise robust counterparts of these regularized discriminant techniques by inserting cellwise robust covariance matrices. Our methodology yields a family of discriminant methods that (1) are robust against outlying cells, (2) cover the gap between LDA and QDA, and (3) are computable in high dimension. The good performance of the new methods is illustrated through simulated and real data examples. As a by-product, visual tools are provided for the detection of outliers.
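The regularization path between the two extremes can be sketched in Friedman-style form: each group covariance is shrunk toward the pooled one, with alpha = 1 recovering QDA and alpha = 0 recovering LDA. The two-group simulation and the choice alpha = 0.5 are illustrative, and the cellwise robust covariance estimators that are the paper's contribution are not shown; plain sample covariances stand in for them.

```python
# Sketch of the QDA-to-LDA regularization path (sample covariances used
# in place of the paper's cellwise robust estimators).
import numpy as np

rng = np.random.default_rng(7)
n, d = 100, 3
X0 = rng.normal(0, 1, (n, d))
X1 = rng.normal(2, 2, (n, d))     # different mean and scale

S0, S1 = np.cov(X0.T), np.cov(X1.T)
Sp = (S0 + S1) / 2                # pooled covariance

def reg_cov(Sk, Sp, alpha):
    """alpha=1 -> group covariance (QDA); alpha=0 -> pooled (LDA)."""
    return alpha * Sk + (1 - alpha) * Sp

def qda_score(x, mu, S):
    """Gaussian discriminant score (equal priors)."""
    _, logdet = np.linalg.slogdet(S)
    diff = x - mu
    return -0.5 * (logdet + diff @ np.linalg.solve(S, diff))

alpha = 0.5
mu0, mu1 = X0.mean(0), X1.mean(0)
S0r, S1r = reg_cov(S0, Sp, alpha), reg_cov(S1, Sp, alpha)

def classify(x):
    return int(qda_score(x, mu1, S1r) > qda_score(x, mu0, S0r))
```

A cellwise robust version would keep this rule intact and only swap the sample covariances for estimators that downweight individual outlying cells rather than whole observations.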
