Renewable energy researchers use computer simulation to aid the design of lithium ion storage devices. The underlying models contain several physical input parameters that affect model predictions. Effective design and analysis require an understanding of the sensitivity of model predictions to changes in model parameters, but global sensitivity analyses become increasingly challenging as the number of input parameters increases. Active subspaces are part of an emerging set of tools for discovering and exploiting low-dimensional structures in the map from high-dimensional inputs to model outputs. We extend linear and quadratic model-based heuristics for active subspace discovery to time-dependent processes and apply the resulting technique to a lithium ion battery model. The results reveal low-dimensional structure and sensitivity metrics that a designer may exploit to study the relationship between parameters and predictions.
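The linear model-based heuristic extended in this work can be sketched in a few lines: fit a plane to samples of the model output, and take the normalized gradient of the fit as the estimated one-dimensional active direction. The toy model, direction `w`, and sample sizes below are hypothetical stand-ins, not the battery model from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: the output depends on the inputs only through one
# hidden linear combination (the active direction).
w = np.array([0.6, 0.8, 0.0])
f = lambda X: np.exp(X @ w)

X = rng.uniform(-1, 1, size=(200, 3))   # sampled input parameters
y = f(X)

# Linear-model heuristic: regress y on x; the normalized gradient of
# the fitted plane estimates the active subspace direction.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b = coef[1:]
direction = b / np.linalg.norm(b)
```

Even though the response is nonlinear, the fitted gradient aligns closely with the hidden direction, and the near-zero third component flags an insensitive parameter.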

This paper deals with noise-affected economic time series, realizations of stochastic processes exhibiting complex and possibly nonlinear dynamics. Time series found in economics notoriously suffer from problems such as low signal-to-noise ratios, asymmetric cycles, and multiregime patterns. In such a framework, even sophisticated statistical models might generate suboptimal predictions, whose quality can deteriorate further unless time-consuming updating or deeper model revision procedures are carried out on a regular basis. However, when the models' outcomes are expected to be disseminated in a timely manner (as in the case of central banks or national statistical offices), their modification might not be a viable solution due to time constraints. On the other hand, while simpler linear models usually entail relatively easier tuning procedures, this comes at the expense of the quality of the predictions they yield. A mixed, self-tuning forecasting method is therefore proposed: an automatic, two-stage procedure able to generate predictions by exploiting the denoising capabilities provided by wavelet theory in conjunction with a compounded forecasting generator. Its out-of-sample performance is evaluated through an empirical study carried out on macroeconomic time series.
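A minimal sketch of such a two-stage scheme, with a one-level Haar transform and a naive AR(1) forecaster standing in for the paper's wavelet denoiser and compounded forecasting generator; the series and threshold below are synthetic:

```python
import numpy as np

def haar_denoise(x, thresh):
    """Stage 1 stand-in: one-level Haar wavelet decomposition with soft
    thresholding of the detail coefficients."""
    x = np.asarray(x, dtype=float)
    n = len(x) - len(x) % 2                   # truncate to even length
    a = (x[0:n:2] + x[1:n:2]) / np.sqrt(2)    # approximation coefficients
    d = (x[0:n:2] - x[1:n:2]) / np.sqrt(2)    # detail coefficients
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)  # soft threshold
    y = np.empty(n)                           # inverse transform
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

# Stage 2 stand-in: a naive AR(1) forecast fitted to the denoised series.
rng = np.random.default_rng(1)
t = np.arange(256)
series = np.sin(2 * np.pi * t / 32) + 0.3 * rng.standard_normal(256)
clean = haar_denoise(series, thresh=0.3)
phi = np.dot(clean[:-1], clean[1:]) / np.dot(clean[:-1], clean[:-1])
forecast = phi * clean[-1]
```

Thresholding the detail coefficients removes part of the noise before any model is fitted, which is the mechanism the method relies on to keep the downstream forecaster simple.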

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, we assessed the imputation performance of various RF algorithms under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
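The missForest-style iterative scheme generalized here has a simple skeleton: initialize missing entries with column means, then cycle through the variables, re-imputing each from a model fit on the others until the imputations stabilize. In this dependency-free sketch a k-nearest-neighbor regressor stands in for the random forest, and the data are synthetic:

```python
import numpy as np

def iterative_impute(X, n_iter=5, k=5):
    """missForest-style loop with a k-NN regressor standing in for the RF:
    initialize with column means, then repeatedly re-impute each column by
    regressing it on the others using the currently completed matrix."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()
    means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        filled[miss[:, j], j] = means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)   # predictors for column j
            obs = ~miss[:, j]
            for i in np.where(miss[:, j])[0]:
                dist = np.linalg.norm(others[obs] - others[i], axis=1)
                nn = np.argsort(dist)[:k]           # k nearest observed rows
                filled[i, j] = filled[obs][nn, j].mean()
    return filled

# Synthetic demo: two correlated columns, ten entries missing in the second.
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
truth = np.column_stack([x1, x1 + 0.1 * rng.normal(size=100)])
X_missing = truth.copy()
X_missing[:10, 1] = np.nan
X_imputed = iterative_impute(X_missing)
```

The loop structure is the point here: the improvement of RF-based variants over simpler regressors grows with the correlation among variables, consistent with the abstract's findings.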

In this paper we propose a Bayesian semiparametric regression model to estimate and test the effect of a genetic pathway on prostate-specific antigen (PSA) measurements for patients with prostate cancer. The underlying functional relationship between the genetic pathway and PSA is modeled using reproducing kernel Hilbert space (RKHS) theory. The RKHS formulation makes our model highly flexible, allowing it to capture the complex multidimensional relationship between the genes in a genetic pathway and the response. Moreover, higher-order and nonlinear interactions among the genes in a pathway are automatically modeled through our kernel-based representation. We illustrate the connection between our RKHS-based semiparametric regression and a linear mixed model by choosing a special prior distribution on the model parameters. To test the significance of a genetic pathway for a phenotypic response such as PSA, we propose a Bayesian hypothesis testing scheme based on the Bayes factor. An efficient Markov chain Monte Carlo algorithm is designed to estimate the model parameters, Bayes factors, and the genetic pathway effect simultaneously. We illustrate the effectiveness of our model through five simulation studies and an analysis of a real prostate cancer gene expression data set.
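By the representer theorem, an RKHS estimate of a pathway effect takes the form f(z) = Σᵢ αᵢ k(zᵢ, z); with a ridge penalty (the frequentist analogue of the mixed-model connection mentioned above), the coefficients solve a single linear system. A minimal sketch with a Gaussian kernel and entirely synthetic "expression" data, not the paper's Bayesian estimator:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) reproducing kernel between two sets of points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

# Synthetic stand-in for a pathway: 3 covariates with a nonlinear,
# interacting effect on the response.
rng = np.random.default_rng(3)
Z = rng.normal(size=(80, 3))
y = np.sin(Z[:, 0]) * Z[:, 1] + 0.1 * rng.standard_normal(80)

lam = 0.1                 # ridge penalty; a variance ratio in the mixed-model view
K = rbf_kernel(Z, Z)
alpha = np.linalg.solve(K + lam * np.eye(80), y)   # representer coefficients
fitted = K @ alpha
```

The same Gram matrix K is what reappears as a random-effects covariance in the linear mixed model formulation, which is what makes the Bayesian prior construction in the paper natural.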

In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. In order to take into account the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates a model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features and simultaneously cluster the samples into homogeneous groups. We show how our approach improves on the accuracy of the clustering with respect to more standard approaches for the analysis of count data, by means of a simulation study and an application to a *bag-of-words* benchmark data set, where the features are represented by the frequencies of occurrence of each word.
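The building block of such a framework is the zero-inflated Poisson likelihood, under which a count is a structural zero with probability π and an ordinary Poisson(λ) draw otherwise. A short self-contained sketch (the names `pi` and `lam` are illustrative, not the paper's notation):

```python
from math import exp, factorial

def zip_pmf(k, pi, lam):
    """P(K = k) under a zero-inflated Poisson: a structural zero with
    probability pi, otherwise a Poisson(lam) count."""
    poisson = exp(-lam) * lam ** k / factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson
```

The inflation term raises the mass at zero above what any single Poisson can produce, which is what makes the mixture suitable for sparse count matrices such as bag-of-words data.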

A new method is introduced for combining information from multiple sources to support one-class classification. The contributing sources may represent measurements taken by different sensors of the same physical entity, repeated measurements by a single sensor, or numerous features computed from a single measured image or signal. The approach utilizes the theory of statistical hypothesis testing, and applies Fisher's technique for combining *p*-values, modified to handle nonindependent sources. Classifier outputs take the form of fused *p*-values, which may be used to gauge the consistency of unknown entities with one or more class hypotheses. The approach enables rigorous assessment of classification uncertainties, and allows for traceability of classifier decisions back to the constituent sources, both of which are important for high-consequence decision support. Application of the technique is illustrated in two challenge problems, one for skin segmentation and the other for terrain labeling. The method is seen to be particularly effective for relatively small training samples.
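In its unmodified form (assuming independent sources, an assumption the method above relaxes), Fisher's technique maps n p-values to the statistic T = -2 Σ log pᵢ, which is chi-square with 2n degrees of freedom under the null; for even degrees of freedom the chi-square survival function has a closed form, so a fused p-value needs only the standard library:

```python
from math import log, exp

def fisher_combine(pvalues):
    """Fisher's method for independent p-values: T = -2 * sum(log p_i) is
    chi-square with 2n degrees of freedom under the null. For even df the
    survival function is exp(-t/2) * sum_{i<n} (t/2)^i / i!."""
    n = len(pvalues)
    t = -2.0 * sum(log(p) for p in pvalues)
    term, sf = 1.0, 0.0
    for i in range(n):          # accumulate sum_{i=0}^{n-1} (t/2)^i / i!
        sf += term
        term *= (t / 2.0) / (i + 1)
    return exp(-t / 2.0) * sf
```

The returned fused p-value plays the role of the classifier output described above: small values indicate inconsistency with the class hypothesis, and the individual pᵢ remain available for traceability.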

We compare some well-known Bayesian global optimization methods in four distinct regimes, corresponding to high and low levels of measurement noise and to high and low levels of “quenched noise” (a term we use to describe the roughness of the function we are trying to optimize). We isolate the two stages of this optimization in terms of a “regressor,” which fits a model to the data measured so far, and a “selector,” which identifies the next point to be measured. The focus of this paper is to investigate the choice of selector when the regressor is well matched to the data.
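A standard concrete instance of a selector (not necessarily one of those compared in the paper) is expected improvement, which scores a candidate using only the regressor's predictive mean and standard deviation; the candidate values below are hypothetical:

```python
from math import erf, exp, pi, sqrt

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: (f_best - mu) * Phi(z) + sigma * phi(z), with
    z = (f_best - mu) / sigma, where (mu, sigma) are the regressor's
    predictive mean and standard deviation at the candidate point."""
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))     # standard normal CDF
    phi = exp(-0.5 * z * z) / sqrt(2.0 * pi)   # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

# The selector proposes the candidate maximizing EI: a point with a slightly
# worse mean but low uncertainty can still beat a highly uncertain one.
f_best = 0.15
candidates = [(0.2, 0.05), (0.5, 0.4), (0.1, 0.01)]   # hypothetical (mu, sigma)
next_idx = max(range(len(candidates)),
               key=lambda i: expected_improvement(*candidates[i], f_best))
```

The split mirrors the paper's decomposition: the regressor supplies (mu, sigma), and everything about exploration versus exploitation lives in the selector's scoring rule.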

The achievement of inertial confinement fusion ignition on the National Ignition Facility relies on the collection and interpretation of a limited (and expensive) set of experimental data. These data are therefore supplemented with state-of-the-art multidimensional radiation-hydrodynamic simulations to provide a better understanding of implosion dynamics and behavior. We present a relatively large number (∼ 4000) of systematically perturbed 2D simulations to probe our understanding of low-mode fuel and ablator asymmetries seeded by asymmetric illumination. We find that Gaussian process surrogate models are able to predict both the total neutron yield and the degradation in performance due to asymmetries. The surrogates are then applied to simulations containing new sources of degradation to quantify the impact of the new source.
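The surrogate idea reduces to standard Gaussian process regression: given (perturbation, yield) pairs from the simulations, the posterior mean predicts yield at unseen perturbations. A minimal sketch with an RBF kernel; the toy "asymmetry to yield" map and all values below are hypothetical, not the radiation-hydrodynamic data:

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, length_scale=1.0, noise=1e-6):
    """Gaussian process surrogate with an RBF kernel: posterior mean at
    the test points given the training simulations (noise is a small
    jitter for numerical stability)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * length_scale ** 2))
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    return k(X_test, X_train) @ np.linalg.solve(K, y_train)

# Hypothetical stand-in for "asymmetry amplitude -> neutron yield":
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(40, 2))        # perturbation parameters
y = np.exp(-4 * (X ** 2).sum(1))           # yield degrades with asymmetry
X_new = np.array([[0.1, 0.1], [0.8, 0.8]])
pred = gp_predict(X, y, X_new, length_scale=0.3)
```

Once trained, the same surrogate can be queried at perturbations containing a new degradation source, which is how the paper quantifies its impact without rerunning the full simulation campaign.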
