Predictions from science and engineering models depend on the values of the model's input parameters. As the number of parameters increases, algorithmic parameter studies such as optimization and uncertainty quantification require many more model evaluations. One way to combat this curse of dimensionality is to seek an alternative parameterization with fewer variables that produces comparable predictions. The *active subspace* is a low-dimensional linear subspace defined by important directions in the model's input space; input perturbations along these directions change the model's prediction more, on average, than perturbations orthogonal to the important directions. We describe a method for checking if a model admits an exploitable active subspace and apply this method to a single-diode solar cell model with five input parameters. We find that the maximum power of the solar cell has a dominant one-dimensional active subspace, which enables us to perform thorough parameter studies in one dimension instead of five. © 2015 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015
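The active-subspace check described above rests on the eigendecomposition of the average outer product of the model's gradient. The following sketch illustrates that computation on a toy five-parameter function whose gradient is, by construction, parallel to a single direction; it is a stand-in, not the paper's single-diode solar cell model.

```python
import numpy as np

# Sketch: estimate the active subspace from Monte Carlo samples of the
# gradient. The function f(x) = sin(a . x) is an illustrative stand-in
# (its gradient always points along a, so the active subspace is the
# 1-D span of a); it is NOT the single-diode model from the paper.

def grad_f(x):
    a = np.array([1.0, 0.5, -0.2, 0.1, 0.05])
    return np.cos(a @ x) * a

rng = np.random.default_rng(0)
N, m = 2000, 5
C = np.zeros((m, m))
for _ in range(N):
    g = grad_f(rng.uniform(-1.0, 1.0, m))  # gradient at a random input
    C += np.outer(g, g) / N                # Monte Carlo average of g g^T

eigvals, eigvecs = np.linalg.eigh(C)
eigvals = eigvals[::-1]                    # sort descending
# A large gap after the first eigenvalue signals a dominant
# one-dimensional active subspace, as found for the solar cell model.
print(eigvals)
```

In practice the gap between consecutive eigenvalues is what indicates how many dimensions the active subspace needs.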

Plutonium-238 is an important specialized power source that radiates heat, which can be converted into electricity. This case study models the thermal output of samples of Pu-238, in which the underlying theoretical model of its decay summarizes a large portion of the observed behavior. A discrepancy function is used to account for structure seen in the observed data but not captured by the physical model. The combined model sums the assumed physics model, the discrepancy, and the experimental error in an expression of the form *f*(*x*,*θ*) + *δ*(*x*) + *ɛ*. The combined model improves prediction of future observations by accounting for shortcomings or omissions in the physical model, and it provides quantitative summaries of the relative contributions of the discrepancy and the physics model. In this work, we illustrate how to visualize the discrepancy function when it is modeled using a Gaussian process. With this visualization, scientists can gain understanding of the differences between the observed data and the current scientific model and develop proposals for improving the model. A secondary example illustrates how the visualization methods can aid understanding in higher dimensions.
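The decomposition *f*(*x*,*θ*) + *δ*(*x*) + *ɛ* can be sketched in a few lines: fit a Gaussian process to the residuals between observations and the assumed physics model, then evaluate its posterior mean on a grid (the quantity one would plot to visualize the discrepancy). The decay-style model, kernel settings, and data below are synthetic stand-ins, not the article's Pu-238 model.

```python
import numpy as np

# Sketch of a GP discrepancy delta(x): condition a squared-exponential
# GP on residuals y - f(x, theta). Synthetic stand-in, not the Pu-238
# thermal-output model from the article.

def physics_model(x, theta):
    return theta * np.exp(-0.1 * x)          # assumed decay-style model

def sq_exp_kernel(x1, x2, ell=1.0, var=1.0):
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
x_obs = np.linspace(0.0, 10.0, 15)
true_delta = 0.3 * np.sin(x_obs)             # structure the model misses
y_obs = physics_model(x_obs, 5.0) + true_delta + rng.normal(0, 0.05, x_obs.size)

# GP posterior mean of delta on a grid, conditioning on the residuals.
resid = y_obs - physics_model(x_obs, 5.0)
x_grid = np.linspace(0.0, 10.0, 50)
K = sq_exp_kernel(x_obs, x_obs) + 0.05**2 * np.eye(x_obs.size)  # + noise
K_star = sq_exp_kernel(x_grid, x_obs)
delta_mean = K_star @ np.linalg.solve(K, resid)
# Plotting delta_mean (with a credible band) over x_grid is the
# visualization the article advocates.
```

Here the GP recovers the sinusoidal structure the physics model omits, which is exactly the kind of pattern the visualization is meant to surface.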

This paper brings explicit considerations of distributed computing architectures and data structures into the rigorous design of Sequential Monte Carlo (SMC) methods. A theoretical result established recently by the authors shows that adapting interaction between particles to suitably control the effective sample size (ESS) is sufficient to guarantee stability of SMC algorithms. Our objective is to leverage this result and devise algorithms that are thus guaranteed to work well in a distributed setting. We make three main contributions to achieve this. First, we study mathematical properties of the ESS as a function of matrices and graphs that parameterize the interaction among particles. Second, we show how these graphs can be induced by tree data structures which model the logical network topology of an abstract distributed computing environment. Finally, we present efficient distributed algorithms that achieve the desired ESS control, perform resampling, and operate on forests associated with these trees.

As with any type of forensics, nuclear forensics seeks to infer historical information using models and data. This article connects nuclear forensics and calibration. We present statistical analyses of a calibration experiment that connect several responses to the associated set of input values and then ‘make a measurement’ using the calibration model. Previous and upcoming real experiments involving production of PuO_{2} powder motivate this article. Both frequentist and Bayesian approaches are considered, and we report findings from a simulation study that compares different analysis methods for different underlying relationships between inputs and responses, different numbers of responses, different amounts of natural variability, and replicated or non-replicated calibration experiments and new measurements. Published 2015. This article is a U.S. Government work and is in the public domain in the USA. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015
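The 'make a measurement' step can be sketched with the simplest frequentist case: regress a single response on known input levels, then invert the fitted line to estimate the input for a newly observed response. The linear response, replicated design, and noise level below are illustrative assumptions, not the article's PuO_{2} experiment.

```python
import numpy as np

# Sketch of classical (frequentist) calibration with one response:
# fit y = a + b*x on a replicated design, then use inverse prediction
# to 'make a measurement' from a new response. Synthetic stand-in data.

rng = np.random.default_rng(3)
x_levels = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 3)   # replicated design
y = 0.5 + 2.0 * x_levels + rng.normal(0, 0.1, x_levels.size)

# Least-squares fit of the calibration curve.
A = np.column_stack([np.ones_like(x_levels), x_levels])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

# Inverse prediction for a new specimen with observed response y_new.
y_new = 7.3
x_hat = (y_new - a_hat) / b_hat
```

A Bayesian analysis would instead place priors on (a, b, x) and report a posterior for the new input; the simulation study compares such variants across designs and noise levels.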

Understanding social tie development among users is crucial for user engagement in social networking services. In this paper, we analyze the social interactions, both online and offline, of users and investigate the development of their social ties using the data trail of ‘how social ties grow’ left in mobile and social networking services. To the best of our knowledge, this is the first research attempt to study social tie development by considering both online and offline interactions in heterogeneous yet realistic relationships. In this study, we aim to answer three key questions: (i) is there a correlation between online and offline interactions? (ii) how do social ties develop via heterogeneous interaction channels? and (iii) is the development of a social tie between two users affected by their common friends? To achieve our goal, we develop a *Social-aware Hidden Markov Model (SaHMM)* that explicitly takes the factor of common friends into account when measuring social tie development. Our experiments show that, compared with results obtained using a standard HMM and other heuristic methods, the social tie development captured by our SaHMM is significantly more consistent with users' lifetime profiles.
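As a baseline for intuition, a plain HMM over interaction observations can be sketched in a few lines: hidden states are tie strengths, observations are interaction channels, and the forward recursion yields the posterior over the current tie state. The state/emission structure and all probabilities below are illustrative assumptions; the paper's SaHMM additionally conditions on common friends, which this sketch does not reproduce.

```python
import numpy as np

# Standard HMM forward pass as a baseline sketch of tie development.
# Hidden states: 0 = weak tie, 1 = strong tie.
# Observations:  0 = online interaction, 1 = offline interaction.
# All parameters are illustrative, not estimated from the paper's data.

pi = np.array([0.8, 0.2])          # initial distribution: mostly weak ties
A = np.array([[0.9, 0.1],          # weak ties tend to persist...
              [0.2, 0.8]])         # ...strong ties persist even more
B = np.array([[0.9, 0.1],          # weak ties: mostly online contact
              [0.5, 0.5]])         # strong ties: offline contact as well

obs = [0, 0, 1, 1, 1]              # online, online, then three offline

alpha = pi * B[:, obs[0]]          # forward recursion (unnormalized)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]
state_post = alpha / alpha.sum()   # P(state at final step | observations)
```

After a run of offline interactions the posterior shifts toward the 'strong tie' state, which matches the intuition that offline contact signals a developed tie.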

Visualization can help in model building, diagnosis, and in developing an understanding about how a model summarizes data. This paper proposes three strategies for visualizing statistical models: (i) display the model in the data space, (ii) look at all members of a collection, and (iii) explore the process of model fitting, not just the end result. Each strategy is accompanied by examples, including MANOVA, classification algorithms, hierarchical clustering, ensembles of linear models, projection pursuit, self-organizing maps, and neural networks.
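Strategy (i), displaying the model in the data space, amounts to evaluating the fitted model over a grid that spans the data, then drawing those predictions alongside the data themselves. A minimal sketch, using a synthetic two-class data set and a nearest-centroid classifier chosen purely for simplicity:

```python
import numpy as np

# Sketch of 'display the model in the data space': evaluate a fitted
# classifier on a grid covering the data; the grid of predicted
# classes is exactly what one would shade beneath a scatterplot.
# Data and classifier are illustrative stand-ins.

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)),   # class 0 cluster
               rng.normal(1, 0.5, (50, 2))])   # class 1 cluster
y = np.repeat([0, 1], 50)

# Nearest-centroid classifier: about the simplest model to display.
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

g = np.linspace(-2.5, 2.5, 40)
grid = np.array([[u, v] for u in g for v in g])          # data-space grid
d = np.linalg.norm(grid[:, None, :] - centroids[None, :, :], axis=2)
regions = d.argmin(axis=1)       # predicted class at each grid point
# regions.reshape(40, 40) gives the decision regions to draw under the data.
```

The same idea scales to the paper's richer examples (hierarchical clustering, neural networks) by replacing the classifier while keeping the grid-in-data-space display.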

Mining labeled subgraphs is a popular research task in data mining because of its potential application in many different scientific domains. All the existing methods for this task explicitly or implicitly solve the subgraph isomorphism task, which is computationally expensive, and thus they suffer from a lack of scalability when the graphs in the input database are large. In this work, we propose FS^{3}, a sampling-based method that mines a small collection of subgraphs that are most frequent in the probabilistic sense. FS^{3} performs Markov chain Monte Carlo (MCMC) sampling over the space of fixed-size subgraphs such that the potentially frequent subgraphs are sampled more often. In addition, FS^{3} is equipped with an innovative queue manager: it stores the sampled subgraphs in a finite queue over the course of mining in such a manner that the top-*k* positions in the queue contain the most frequent subgraphs. Our experiments on databases of large graphs show that FS^{3} is efficient and obtains subgraphs that are the most frequent among the subgraphs of a given size.
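The core mechanism, an MCMC walk over fixed-size subgraphs, can be sketched on a toy graph: the state is a connected *k*-node subgraph, and a move swaps one member node for a neighboring node, rejecting moves that disconnect the state. This is a simplified uniform-proposal walk for illustration only; it omits FS^{3}'s frequency-biased sampling and finite-queue manager.

```python
import random
from collections import Counter

# Sketch of MCMC sampling over fixed-size connected subgraphs
# (simplified: uniform node-swap proposals; no frequency biasing or
# queue manager as in FS3). Toy unlabeled graph as an adjacency map.

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 4}, 3: {1, 4}, 4: {2, 3}}
k = 3

def connected(nodes):
    # DFS inside the induced subgraph to check connectivity.
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(adj[u] & (nodes - seen))
    return seen == nodes

random.seed(5)
state = {0, 1, 2}                        # an initial connected k-subgraph
counts = Counter()
for _ in range(5000):
    u = random.choice(sorted(state))     # member node to swap out
    frontier = set().union(*(adj[v] for v in state)) - state
    w = random.choice(sorted(frontier))  # neighboring node to swap in
    candidate = (state - {u}) | {w}
    if connected(candidate):             # reject disconnecting moves
        state = candidate
    counts[frozenset(state)] += 1        # visit counts over subgraphs
```

In FS^{3} the visit counts would feed the finite queue so that the top-*k* positions converge to the most frequent subgraphs of the chosen size.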