A nonparametric statistic, called the roughness of concomitant ranks, is proposed for testing whether two quantitative vectors are dependent. The new testing procedure is computationally efficient and simple, and it exhibits competitive empirical performance in simulations and two microarray data analyses. We apply the new method to screen variables for high-dimensional data analysis. For low signal-to-noise settings, we suggest the use of data binning to increase the power of the test. Simulation results show that the proposed method performs well relative to existing screening methods.
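To make the construction concrete, here is a minimal sketch in Python of one plausible formulation: sort the pairs by the first vector, take the ranks of the second vector in that order (the concomitant ranks), and measure roughness as the mean squared difference of successive ranks, calibrated by permutation. The statistic and calibration shown are illustrative assumptions, not necessarily the authors' exact definitions.

```python
import numpy as np

def roughness_statistic(x, y):
    """Mean squared difference of successive concomitant ranks:
    sort pairs by x, rank y in that order. Small values suggest
    dependence, since y's ranks then vary smoothly along x."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x)
    ranks = np.argsort(np.argsort(y[order])).astype(float)
    return np.mean(np.diff(ranks) ** 2)

def roughness_pvalue(x, y, n_perm=999, seed=0):
    """One-sided permutation p-value (small roughness = dependence)."""
    rng = np.random.default_rng(seed)
    obs = roughness_statistic(x, y)
    null = [roughness_statistic(x, rng.permutation(np.asarray(y)))
            for _ in range(n_perm)]
    return (1 + sum(s <= obs for s in null)) / (n_perm + 1)
```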

We deal with noise-affected economic time series: realizations of stochastic processes exhibiting complex and possibly nonlinear dynamics. Time series found in economics notoriously suffer from problems such as low signal-to-noise ratios, asymmetric cycles, and multiregime patterns. In such a framework, even sophisticated statistical models may generate suboptimal predictions, whose quality can deteriorate further unless time-consuming updating or deeper model revision procedures are carried out on a regular basis. However, when a model's outcomes must be disseminated in a timely manner (as is the case for central banks or national statistical offices), such modification may not be viable due to time constraints. On the other hand, while simpler linear models usually entail relatively easier tuning procedures, this comes at the expense of prediction quality. A mixed, self-tuning forecasting method is therefore proposed: an automatic, two-stage procedure that generates predictions by exploiting the denoising capabilities of wavelet theory in conjunction with a compounded forecasting generator. Its out-of-sample performance is evaluated through an empirical study of macroeconomic time series.
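A minimal sketch of the two-stage idea, assuming PyWavelets and statsmodels are available: stage one denoises the series by soft-thresholding wavelet detail coefficients; stage two fits a simple autoregression to the denoised series. The AR model is a stand-in for the paper's compounded forecasting generator, and the wavelet, level, and threshold choices are placeholder assumptions.

```python
import numpy as np
import pywt
from statsmodels.tsa.ar_model import AutoReg

def wavelet_denoise(series, wavelet="db4", level=3):
    """Stage 1: soft-threshold wavelet detail coefficients
    (universal threshold with a MAD-based noise estimate)."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(series)))
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(series)]

def two_stage_forecast(series, horizon=4, lags=4):
    """Stage 2: fit a simple AR model (a stand-in for the paper's
    compounded forecasting generator) to the denoised series."""
    smooth = wavelet_denoise(np.asarray(series, dtype=float))
    fit = AutoReg(smooth, lags=lags).fit()
    return fit.forecast(horizon)
```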

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of handling mixed types of missing data, adapting to interactions and nonlinearity, and potentially scaling to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, we assessed the imputation performance of various RF algorithms under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting; the latter class represents a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving as correlation increases. Performance was good under moderate to high missingness and, in certain cases, even when data were missing not at random.
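For readers unfamiliar with the missForest idea, the sketch below shows a missForest-style iterative scheme using scikit-learn's IterativeImputer with a random forest regressor; it is an illustration of the general approach, not any of the specific algorithms compared in the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# missForest-style scheme: iteratively regress each variable with
# missing entries on all the others using a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])
X_imputed = imputer.fit_transform(X)
```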

In this paper we propose a Bayesian semiparametric regression model to estimate and test the effect of a genetic pathway on prostate-specific antigen (PSA) measurements for patients with prostate cancer. The underlying functional relationship between the genetic pathway and PSA is modeled using reproducing kernel Hilbert space (RKHS) theory. The RKHS formulation makes our model highly flexible, allowing it to capture the complex multidimensional relationship between the genes in a pathway and the response. Moreover, higher-order and nonlinear interactions among the genes in a pathway are automatically modeled through our kernel-based representation. We illustrate the connection between our RKHS-based semiparametric regression and a linear mixed model by choosing a special prior distribution on the model parameters. To test the significance of a genetic pathway for a phenotypic response such as PSA, we propose a Bayesian hypothesis testing scheme based on the Bayes factor. An efficient Markov chain Monte Carlo algorithm is designed to estimate the model parameters, Bayes factors, and the genetic pathway effect simultaneously. We illustrate the effectiveness of our model through five simulation studies and an analysis of a real prostate cancer gene expression data set.
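The mixed-model connection can be sketched in a few lines: if the pathway effect h has prior N(0, τK) for a kernel matrix K, then h enters the model as a random effect and its posterior mean is a kernel ridge (BLUP-type) smoother. The Gaussian kernel and fixed hyperparameters below are illustrative assumptions; the paper estimates everything by MCMC and tests via Bayes factors, neither of which is shown.

```python
import numpy as np

def gaussian_kernel(Z, gamma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over gene profiles Z."""
    sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def pathway_effect_posterior_mean(y, Z, tau=1.0, sigma2=1.0, gamma=1.0):
    """Posterior mean of h in y = h(Z) + eps with prior h ~ N(0, tau*K):
    the linear-mixed-model (BLUP / kernel ridge) view of the RKHS fit.
    Fixed hyperparameters are illustrative; the paper samples them."""
    K = gaussian_kernel(Z, gamma)
    n = len(y)
    return tau * K @ np.linalg.solve(tau * K + sigma2 * np.eye(n), y)
```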

In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data generating mechanism, such data are often characterized by a high number of zeros and overdispersion. To account for the skewness and heterogeneity of the data, some form of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero-inflated Poisson mixture modeling framework that incorporates model-based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features and simultaneously cluster the samples into homogeneous groups. By means of a simulation study and an application to a *bag-of-words* benchmark data set, where the features are the frequencies of occurrence of each word, we show how our approach improves clustering accuracy relative to more standard approaches for the analysis of count data.
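As a small illustration of the likelihood at the core of such models, the sketch below evaluates a zero-inflated Poisson log-likelihood for a vector of counts; the full framework's mixture components, mean-constrained normalization priors, and feature selection mechanism are not shown.

```python
import numpy as np
from scipy.stats import poisson

def zip_loglik(counts, pi, lam):
    """Zero-inflated Poisson log-likelihood: with probability pi a
    count is a structural zero, otherwise it is Poisson(lam)."""
    x = np.asarray(counts)
    ll = np.where(
        x == 0,
        np.log(pi + (1.0 - pi) * np.exp(-lam)),
        np.log1p(-pi) + poisson.logpmf(x, lam),
    )
    return ll.sum()

# Example: many zeros plus a few larger counts (overdispersed).
print(zip_loglik([0, 0, 0, 1, 0, 7, 0, 3], pi=0.4, lam=2.5))
```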

Renewable energy researchers use computer simulation to aid the design of lithium ion storage devices. The underlying models contain several physical input parameters that affect model predictions. Effective design and analysis require understanding the sensitivity of model predictions to changes in model parameters, but global sensitivity analyses become increasingly challenging as the number of input parameters increases. Active subspaces are part of an emerging set of tools for discovering and exploiting low-dimensional structure in the map from high-dimensional inputs to model outputs. We extend linear and quadratic model-based heuristics for active subspace discovery to time-dependent processes and apply the resulting technique to a lithium ion battery model. The results reveal low-dimensional structure and sensitivity metrics that a designer may exploit to study the relationship between parameters and predictions.
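One of the model-based heuristics mentioned above can be sketched briefly: fit a global linear model to input-output samples and take the normalized coefficient vector as a one-dimensional active direction. The extension to time-dependent outputs is the paper's contribution and is not reproduced here.

```python
import numpy as np

def linear_active_direction(X, f):
    """Linear-model heuristic: fit f(x) ~ a + b.x by least squares on
    samples (rows of X, values f) and return the normalized coefficient
    vector b as a one-dimensional active direction."""
    A = np.column_stack([np.ones(len(f)), X])
    coef, *_ = np.linalg.lstsq(A, f, rcond=None)
    b = coef[1:]
    return b / np.linalg.norm(b)
```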

Dynamic model reduction in power systems is necessary for improving computational efficiency. Traditional model reduction using linearized models or offline analysis is not adequate to capture the dynamic behaviors of the power system, especially with the new mix of intermittent generation and intelligent consumption that makes the power system more dynamic and nonlinear. Real-time dynamic model reduction has emerged to fill this important need. This paper explores using clustering techniques to analyze real-time phasor measurements to identify groups of generators with similar behavior, as well as a representative generator from each group, for dynamic model reduction. Two clustering techniques, graph clustering and *k*-means, are considered and compared with a previously developed dynamic model reduction approach using singular value decomposition. Two sample power grid datasets are used to test these model reduction techniques, and recommendations for practical use are provided based on the algorithms' relative performance.
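A minimal sketch of the *k*-means variant: treat each generator's measured response as a feature vector, cluster, and keep the member nearest each centroid as the group's representative. The signal choice and cluster count below are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def coherent_groups(signals, n_groups=3):
    """Cluster generators by their measured dynamic response (rows of
    `signals`, one time series per generator, e.g. frequency deviations
    from PMU data) and pick the member nearest each centroid as the
    group's representative for model reduction."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
    labels = km.fit_predict(signals)
    reps = []
    for g in range(n_groups):
        members = np.where(labels == g)[0]
        dist = np.linalg.norm(signals[members] - km.cluster_centers_[g], axis=1)
        reps.append(int(members[np.argmin(dist)]))
    return labels, reps
```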

Seismic inversions produce seismic models: three-dimensional (3D) images of wave velocity throughout the planet, retrieved by fitting seismic measurements made on records of past earthquakes or other seismic events. The computing power of the teraflop era, along with the data flow from new, very dense seismic arrays, has led to a new generation of 3D seismic Earth models with an unprecedented level of resolution.

Here we compare two recent models of the western United States from the Dynamic North America (DNA) seismic imaging effort. The two models differ only in the wave propagation theory used for their inversion: one is based on ray theory (RT), the other on finite frequency (FF). We evaluate the two models using an independent numerical method and statistical tests. We show that they differ in how well they reproduce seismic signals from a subset of earthquakes that were used in the original inversion and recorded on the US array. This is especially true for measurements in the Yellowstone area, which hosts a large negative seismic anomaly. This result matters for seismologists, who have been debating the practical benefit of using FF in ill-posed Earth inversions. Model evaluation, such as the one reported here, represents an opportunity for collaboration between the geophysical and statistical communities. More opportunities should arise in the upcoming exascale era, which will provide enough computational power to jointly explore several sources of error in models with thousands of parameters, opening the way to uncertainty quantification of seismic models.
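The statistical comparison can be illustrated schematically: given per-event misfits of each model against independently computed signals, a paired test asks whether one model fits systematically better. The data below are synthetic placeholders, not values from the study.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic placeholder misfits for 50 held-out events; in the real
# evaluation these would come from the independent wave simulations.
misfit_rt = rng.gamma(shape=2.0, scale=1.0, size=50)
misfit_ff = misfit_rt - rng.normal(0.1, 0.2, size=50)

# Paired test: do the two models fit the same events differently?
stat, p = wilcoxon(misfit_rt, misfit_ff)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.3g}")
```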

We are interested in detecting and analyzing global changes in dynamic networks (networks that evolve with time). More precisely, we consider changes in the activity distribution within the network, in terms of density (i.e., edge existence) and intensity (i.e., edge weight). Detecting change in local properties, as well as in individual measurements or metrics, has been well studied and often reduces to traditional statistical process control. In contrast, detecting change in the larger-scale structure of the network is more challenging and not as well understood. We address this problem by proposing a framework for detecting change in network structure built from separate pieces: a probabilistic model for partitioning nodes by their behavior, a label-unswitching heuristic, and an approach to change detection for sequences of complex objects. We examine the performance of one instantiation of such a framework, using mostly previously available pieces, on the publicly available New York City Taxi and Limousine Commission dataset, which covers all taxi trips in New York City since 2009. Using it, we investigate the evolution of an ensemble of networks under different spatiotemporal resolutions. We identify the community structure by fitting a weighted stochastic block model, and we offer insights on different node ranking and clustering methods, their ability to capture the rhythm of life in the Big Apple, and their potential usefulness in highlighting changes in the underlying network structure.
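The change-detection piece can be sketched independently of the block-model fit: summarize each network snapshot as a vector (for example, block-level densities and intensities) and flag snapshots that sit far from a trailing window of recent summaries. The control-chart-style rule below is a simple illustrative choice, not the paper's exact detector.

```python
import numpy as np

def flag_changes(summaries, window=8, threshold=3.0):
    """Flag snapshots whose summary vector is far from the trailing
    window's mean, in robust (MAD-scaled) units. `summaries` is a
    (snapshots x features) array of per-snapshot network summaries."""
    S = np.asarray(summaries, dtype=float)
    flags = []
    for t in range(window, len(S)):
        ref = S[t - window : t]
        mu = ref.mean(axis=0)
        d = np.linalg.norm(S[t] - mu)
        hist = np.linalg.norm(ref - mu, axis=1)
        scale = 1.4826 * np.median(np.abs(hist - np.median(hist))) + 1e-12
        if (d - np.median(hist)) / scale > threshold:
            flags.append(t)
    return flags
```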

Magnetohydrodynamics (MHD), the study of electrically conducting fluids, can be harnessed to produce efficient, low-emission power generation. Today, computational modeling assists engineers in studying candidate designs for such generators. However, these models are computationally expensive, so thoroughly studying the effects of the model's many input parameters on output predictions is typically infeasible. We study two approaches for reducing the input dimension of the models: (i) classical dimensional analysis based on the inputs' units and (ii) active subspaces, which reveal low-dimensional subspaces in the space of inputs that affect the outputs the most. We also review the mathematical connection between the two approaches that leads to consistent application. We study both the simplified Hartmann problem, which admits closed-form expressions for the quantities of interest, and a large-scale computational model with adjoint capabilities that enable the derivative computations needed to estimate the active subspaces. The dimension reduction yields insights into the driving factors in MHD power generation models, which may aid generator designers who employ high-fidelity computational models.
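Since the large-scale model's adjoints supply gradients, the active subspace can be estimated in the standard way: average the outer products of sampled gradients and eigendecompose the result. This is a generic sketch of that estimator, not the paper's code.

```python
import numpy as np

def active_subspace(grads, k=2):
    """Estimate an active subspace from sampled gradients (e.g. from
    an adjoint solver): eigendecompose C = E[grad grad^T]. A sharp
    drop in the eigenvalues after the k-th indicates a k-dimensional
    active subspace; the leading eigenvectors span it."""
    G = np.asarray(grads)          # shape: (n_samples, n_params)
    C = G.T @ G / G.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    idx = np.argsort(eigvals)[::-1]
    return eigvals[idx], eigvecs[:, idx[:k]]
```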

Subsurface applications, including geothermal energy, geological carbon sequestration, and oil and gas, typically involve maximizing either the extraction of energy or the storage of fluids. Fractures form the main pathways for flow in these systems, and locating them is critical for predicting flow. However, fracture characterization is a highly uncertain process, and data from multiple sources, such as flow and geophysical measurements, are needed to reduce this uncertainty. We present a nonintrusive, sequential inversion framework for integrating geophysical and flow data to constrain fracture networks in the subsurface. In this framework, we first estimate bounds on the statistics of the fracture orientations using microseismic data. These bounds are estimated through a combination of focal mechanism analysis (a physics-based approach) and clustering analysis (a statistical approach) of the seismic data. The fracture lengths are then constrained using flow data. The efficacy of this inversion is demonstrated through a representative example.
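The clustering step for orientations can be sketched as follows, assuming fracture strike angles are available from the focal-mechanism analysis. Strikes are axial data (θ and θ + 180° describe the same fracture), so they are embedded as (cos 2θ, sin 2θ) before clustering; the per-cluster circular means then summarize the orientation sets. The details here are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def orientation_sets(strikes_deg, n_sets=2):
    """Cluster fracture strike angles into orientation sets. Strikes
    are axial (theta and theta + 180 are the same fracture), so embed
    each as (cos 2*theta, sin 2*theta) before k-means, then report the
    circular mean strike (degrees) and size of each set."""
    theta = np.deg2rad(np.asarray(strikes_deg, dtype=float))
    emb = np.column_stack([np.cos(2 * theta), np.sin(2 * theta)])
    labels = KMeans(n_clusters=n_sets, n_init=10, random_state=0).fit_predict(emb)
    sets = []
    for g in range(n_sets):
        t = theta[labels == g]
        mean2 = np.arctan2(np.sin(2 * t).mean(), np.cos(2 * t).mean())
        sets.append((np.rad2deg(mean2 / 2) % 180.0, int(len(t))))
    return labels, sets
```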

Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure from a combination of human knowledge about the *partial ordering* of variables and statistical inference of conditional dependencies from observed data. Our approach leverages this complementary information to produce networks that reflect human beliefs about the system while also fitting the observed data. Applying prior beliefs about partial orderings of variables is distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning at least as much as the edge prior, even though order priors are more general. Our primary motivation is characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm discovers true dependencies that existing malware phylogeny algorithms miss.
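A toy illustration of how a partial-order prior constrains structure search: if human knowledge assigns (possibly partial) ranks to variables, candidate parents that come strictly later in the order can be excluded, equivalently given zero prior mass. The rank-based representation and helper below are simplifying assumptions, not the paper's implementation.

```python
def allowed_parents(node, candidates, order_rank):
    """Restrict a node's candidate parent set under a partial-order
    prior: a candidate whose known rank is strictly greater than the
    node's cannot be a parent. Nodes without a known rank are left
    unconstrained, reflecting that the ordering is only partial."""
    r = order_rank.get(node)
    return [
        c for c in candidates
        if r is None or order_rank.get(c) is None or order_rank[c] <= r
    ]

# Example: analysts believe variant "A" predates "B"; "C" is unknown.
print(allowed_parents("A", ["B", "C"], {"A": 0, "B": 1}))  # -> ["C"]
```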