Many synoptic surveys are observing large parts of the sky multiple times. The resulting time series of light measurements, called lightcurves, provide a wonderful window into the dynamic nature of the Universe. However, there are many significant challenges in analyzing these lightcurves. We describe a modeling-based approach using Gaussian process regression for generating critical measures for the classification of such lightcurves. This method has key advantages over other popular nonparametric regression methods in its ability to deal with censoring, a mixture of sparsely and densely sampled curves, the presence of annual gaps (caused by objects not being visible throughout the year from a given position on Earth), and known but variable measurement errors. We demonstrate that our approach performs better by showing it has a higher correct classification rate than past methods popular in astronomy. Finally, we provide future directions for use in sky surveys that are growing bigger by the day. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
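To make the key advantage concrete, here is a minimal sketch of Gaussian process regression in which known, point-specific measurement errors enter the noise model directly. The squared-exponential kernel, the hyperparameter values, and all names are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def gp_predict(x_train, y_train, sigma_train, x_test,
               length_scale=1.0, amplitude=1.0):
    """GP regression with known, point-specific measurement errors.
    Kernel choice and hyperparameters are illustrative assumptions."""
    def kernel(a, b):
        d = a[:, None] - b[None, :]
        return amplitude**2 * np.exp(-0.5 * (d / length_scale)**2)

    # The known-but-variable measurement errors go on the noise diagonal.
    K = kernel(x_train, x_train) + np.diag(sigma_train**2)
    Ks = kernel(x_test, x_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    cov = kernel(x_test, x_test) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

Because each observation carries its own variance, densely sampled, high-precision points constrain the curve tightly while sparse, noisy points are down-weighted automatically.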

We introduce a framework to build a survival/risk bump hunting model with a censored time-to-event response. Our survival bump hunting (SBH) method is based on a recursive peeling procedure that uses a specific survival peeling criterion derived from non-/semi-parametric statistics such as the hazard ratio, the log-rank test or the Nelson–Aalen estimator. To optimize the tuning parameter of the model and validate it, we introduce an objective function based on survival- or prediction-error statistics, such as the log-rank test and the concordance error rate. We also describe two alternative cross-validation techniques adapted for the joint task of decision-rule making by recursive peeling and survival estimation. Numerical analyses show the importance of replicated cross-validation and the differences between criteria and techniques in both low- and high-dimensional settings. Although several non-parametric survival models exist, none address the problem of directly identifying local extrema. We show how SBH efficiently estimates extreme survival/risk subgroups, unlike other models. This provides an insight into the behavior of commonly used models and suggests alternatives to be adopted in practice. Finally, our SBH framework was applied to a clinical dataset, where we identified subsets of patients characterized by clinical and demographic covariates with a distinct extreme survival outcome, for which tailored medical interventions could be made. An R package, Patient Rule Induction Method in Survival, Regression and Classification settings (PRIMsrc), is available on the Comprehensive R Archive Network (CRAN) and GitHub.
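One of the survival statistics named above, the Nelson–Aalen estimator of the cumulative hazard, is simple enough to sketch: at each distinct event time, add (number of events) / (number still at risk). This is a generic illustration of the estimator itself, not the SBH peeling criterion.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard H(t).
    times: observed times; events: 1 if an event, 0 if censored.
    Returns (time, H) pairs at each distinct event time."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    H, curve = 0.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = r = 0  # events and total subjects observed at time t
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]
            r += 1
            i += 1
        if d:
            H += d / at_risk  # hazard increment at this event time
            curve.append((t, H))
        at_risk -= r  # everyone observed at t leaves the risk set
    return curve
```

Censored observations reduce the risk set without contributing a hazard increment, which is how the estimator handles a censored time-to-event response.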

High-dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite the large number of candidate classification techniques available, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have a higher tendency to overfit. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the composite large margin (CLM) classifier, to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function splits the data into two parts, and each part is then classified by its own linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.
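The three-linear-function structure can be sketched in a few lines: one linear gate routes a point to a subclass, and a subclass-specific linear classifier labels it. The weight names and the intercept convention below are illustrative, not the paper's notation, and the weights would in practice be learned jointly.

```python
import numpy as np

def clm_predict(x, w_gate, w_left, w_right):
    """Composite large-margin style rule (illustrative sketch):
    a linear gate splits the input space into two parts, and each part
    is labeled by its own linear classifier. The last entry of every
    weight vector is an intercept."""
    xb = np.append(np.asarray(x, dtype=float), 1.0)  # add intercept term
    w = w_left if xb @ w_gate <= 0 else w_right      # gate picks a subclass
    return 1 if xb @ w > 0 else -1
```

The composite of three linear pieces can represent decision boundaries (such as XOR-like structure) that no single linear classifier can, while each piece stays individually interpretable.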

Many empirical settings involve the specification of models leading to complicated likelihood functions, for example, finite mixture models that arise in causal inference when using Principal Stratification (PS). Traditional asymptotic results cannot be trusted for the associated likelihood functions, whose logarithms are not close to being quadratic and may be multimodal even with large sample sizes. We first investigate the shape of the likelihood function with models based on PS by providing diagnostic tools for evaluating ellipsoidal approximations based on the second derivatives of the log-likelihood at a mode. In these settings, inference based on standard approximations is inappropriate, and other forms of inference are required. We explore the use of a direct likelihood approach for parsimonious model selection and, specifically, propose comparing values of scaled maximized likelihood functions under competitive models to select preferred models. An extensive simulation study provides guidelines for calibrating the use of scaled log-likelihood ratio statistics as functions of the complexity of the models being compared.

Linear regression models depend directly on the design matrix and its properties. Techniques that efficiently estimate model coefficients by partitioning rows of the design matrix are increasingly becoming popular for large-scale problems because they fit well with modern parallel computing architectures. We propose a simple measure of *concordance* between a design matrix and a subset of its rows that estimates how well a subset captures the variance-covariance structure of a larger data set. We illustrate the use of this measure in a heuristic method for selecting row partition sizes that balance statistical and computational efficiency goals in real-world problems.
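One plausible way to score how well a row subset captures the variance-covariance structure is to compare the normalized Gram matrix of the subset against that of the full design matrix. The measure below is an assumed sketch of this idea, not necessarily the paper's exact definition.

```python
import numpy as np

def concordance(X, rows):
    """Illustrative concordance score between a design matrix X and a
    row subset: 1 minus the relative Frobenius distance between the
    normalized Gram matrices X'X/n of the subset and the full data.
    Values near 1 mean the subset preserves the covariance structure."""
    G_full = X.T @ X / X.shape[0]
    Xs = X[rows]
    G_sub = Xs.T @ Xs / Xs.shape[0]
    return 1.0 - np.linalg.norm(G_full - G_sub, 'fro') / np.linalg.norm(G_full, 'fro')
```

A partition-size heuristic could then grow the subset until this score exceeds a tolerance, trading statistical fidelity against per-partition compute cost.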

We present a new method to perform a principal axes analysis of symbolic histogram variables. In the symbolic data analysis framework, several Histogram Principal Component Analysis (Histogram PCA) approaches have been proposed. Some focus on the relationships between specific features of histograms, such as the means or the quantiles. Others use association measures for distributional variables based on the squared Wasserstein distance. In this paper, we propose two new approaches. The first uses new correlation measures, based on Fisher's z scores, between corresponding bins of the histogram variables; we also suggest the use of the estimator proposed by Olkin and Pratt. In this first approach, histogram variables must have the same number of bins. The second approach, by contrast, extends the previously proposed correlations by considering corresponding quantiles, and can therefore be used when histograms do not have the same number of bins.
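The role of Fisher's z score here is to put correlations on a scale where they can be combined linearly. As a generic illustration (not the paper's estimator, and omitting the Olkin–Pratt correction), bin-wise correlations can be averaged on the z scale and then back-transformed:

```python
import math

def fisher_z(r):
    """Fisher's z transform of a correlation coefficient r in (-1, 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def average_correlation(bin_correlations):
    """Combine per-bin correlations between two histogram variables by
    averaging on the z scale and back-transforming (tanh is the inverse
    of Fisher's z). An illustrative sketch, not the paper's measure."""
    z_mean = sum(fisher_z(r) for r in bin_correlations) / len(bin_correlations)
    return math.tanh(z_mean)
```

Averaging on the z scale rather than the raw correlation scale respects the bounded, skewed sampling distribution of correlation coefficients.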

We propose a new approach for mixture modeling based only upon pairwise distances via the concept of hypothetical local mapping (HLM). This work is motivated by the increasingly commonplace applications involving complex objects that cannot be effectively represented by tractable mathematical entities. The new modeling approach consists of two steps. A distance-based clustering algorithm is applied first. Then, HLM takes as input the distances between the training data and their corresponding cluster centroids to estimate the model parameters. In the special case where all the training data are taken as cluster centroids, we obtain a distance-based counterpart of the kernel density. The classification performance of the mixture models is compared with other state-of-the-art distance-based classification methods. Results demonstrate that HLM-based algorithms are highly competitive in terms of classification accuracy and are computationally efficient. Furthermore, the HLM-based modeling approach adapts readily to incremental learning. We have developed and tested two schemes of incremental learning scalable for dynamic data arriving at a high speed.

This paper brings explicit considerations of distributed computing architectures and data structures into the rigorous design of Sequential Monte Carlo (SMC) methods. A theoretical result established recently by the authors shows that adapting interaction between particles to suitably control the effective sample size (ESS) is sufficient to guarantee stability of SMC algorithms. Our objective is to leverage this result and devise algorithms which are thus guaranteed to work well in a distributed setting. We make three main contributions to achieve this. First, we study mathematical properties of the ESS as a function of matrices and graphs that parameterize the interaction among particles. Secondly, we show how these graphs can be induced by tree data structures which model the logical network topology of an abstract distributed computing environment. Finally, we present efficient distributed algorithms that achieve the desired ESS control, perform resampling and operate on forests associated with these trees. © 2015 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015
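The quantity at the heart of the stability result, the effective sample size of a weighted particle set, has a standard closed form, ESS = (Σw)² / Σw². A minimal version (the distributed, graph-structured computation in the paper is of course more involved):

```python
def effective_sample_size(weights):
    """Effective sample size of a weighted particle population:
    ESS = (sum w)^2 / sum(w^2). Equal weights give ESS = N; a single
    dominant weight drives ESS toward 1, signaling degeneracy that
    adaptive interaction/resampling is meant to control."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)
```

An SMC implementation typically triggers resampling, or in this paper's setting increases interaction between particles, when the ESS falls below a threshold such as N/2.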

Understanding social tie development among users is crucial for user engagement in social networking services. In this paper, we analyze the social interactions, both online and offline, of users and investigate the development of their social ties using the data trail of 'how social ties grow' left in mobile and social networking services. To the best of our knowledge, this is the first research attempt to study social tie development by considering both online and offline interactions in a heterogeneous yet realistic setting. In this study, we aim to answer three key questions: (i) is there a correlation between online and offline interactions? (ii) how are social ties developed via heterogeneous interaction channels? and (iii) is the development of a social tie between two users affected by their common friends? To achieve our goal, we develop a *Social-aware Hidden Markov Model (SaHMM)* that explicitly takes into account the factor of common friends in measuring social tie development. Our experiments show that, compared with results obtained using a standard HMM and other heuristic methods, the social tie development captured by our SaHMM is significantly more consistent with the lifetime profiles of users.

Predictions from science and engineering models depend on the values of the model's input parameters. As the number of parameters increases, algorithmic parameter studies such as optimization and uncertainty quantification require many more model evaluations. One way to combat this curse of dimensionality is to seek an alternative parameterization with fewer variables that produces comparable predictions. The *active subspace* is a low-dimensional linear subspace defined by important directions in the model's input space; input perturbations along these directions change the model's prediction more, on average, than perturbations orthogonal to the important directions. We describe a method for checking if a model admits an exploitable active subspace and apply this method to a single-diode solar cell model with five input parameters. We find that the maximum power of the solar cell has a dominant one-dimensional active subspace, which enables us to perform thorough parameter studies in one dimension instead of five.
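A common way to estimate an active subspace, and the one sketched here as a generic illustration, is to eigendecompose the average outer product of sampled gradients, C = E[∇f ∇fᵀ]; a sharp drop after the first eigenvalue is the kind of evidence that a model such as the solar cell's maximum power admits a one-dimensional active subspace.

```python
import numpy as np

def active_subspace(gradients, k=1):
    """Estimate an active subspace from sampled gradients: form
    C = (1/N) * sum of outer products grad_i grad_i', eigendecompose,
    and return the top-k eigenvalues and eigenvectors (the 'important
    directions'). Illustrative sketch of the standard estimator."""
    G = np.asarray(gradients)           # rows are sampled gradients
    C = G.T @ G / G.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]   # eigh returns ascending order
    return eigvals[order[:k]], eigvecs[:, order[:k]]
```

Once the dominant direction w is found, parameter studies can run over the scalar coordinate wᵀx instead of the full five-dimensional input.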

Plutonium-238 is an important specialized power source that radiates heat, which can be converted into electricity. This case study models the thermal output of samples of Pu-238, in which the underlying theoretical model of its decay summarizes a large portion of the observed behavior. A discrepancy function is used to account for structure seen in the observed data but missing from the physical model. The combined model adds the assumed physics model, the discrepancy and the experimental error in an expression of the form *f*(*x*,*θ*) + *δ*(*x*) + *ɛ*. The combined model improves prediction of future observations by accounting for shortcomings or omissions in the physical model, and provides quantitative summaries of the relative contributions of the discrepancy and physics model. In this work, we illustrate how to visualize the discrepancy function when it is modeled using a Gaussian process. With the visualization, scientists can gain understanding of the differences between the observed data and the current scientific model and develop proposals for how to improve it. A secondary example illustrates how the visualization methods can help with understanding in higher dimensions.
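The additive structure *f*(*x*,*θ*) + *δ*(*x*) + *ɛ* can be written down directly. Below, a toy exponential-decay curve stands in for the Pu-238 physics term (the study's actual *f*, fitted *δ*, and error model differ); the point is only the separation of physics, discrepancy and noise.

```python
import numpy as np

def combined_model(x, theta, delta):
    """Mean of the combined model f(x, theta) + delta(x); the
    experimental-error term eps would be added on top as noise.
    The exponential-decay f here is an assumed stand-in, not the
    study's physics model."""
    f = theta[0] * np.exp(-theta[1] * np.asarray(x, dtype=float))
    return f + delta(x)
```

Plotting `delta(x)` alone, with its uncertainty band when it is a Gaussian process, is exactly the visualization the abstract describes: it shows where, and by how much, the data depart from the physics.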

This study aimed to organize a body of trajectories in order to identify, search for and classify both common and uncommon behaviors among objects such as aircraft and ships. Existing comparison functions such as the Fréchet distance are computationally expensive and yield counterintuitive results in some cases. We propose an approach using feature vectors whose components represent succinctly the salient information in trajectories. These features incorporate basic information such as the total distance traveled and the distance between start/stop points as well as geometric features related to the properties of the convex hull, trajectory curvature and general distance geometry. Additionally, these features can generally be mapped easily to behaviors of interest to humans who are searching large databases. Most of these geometric features are invariant under rigid transformation. We demonstrate the use of different subsets of these features to identify trajectories similar to an exemplar, cluster a database of several hundred thousand trajectories and identify outliers.
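Two of the basic features named above are cheap to compute and easy to interpret; here is a minimal sketch for 2-D trajectories (the full feature set, including convex-hull and curvature features, is larger).

```python
import math

def trajectory_features(points):
    """Two simple trajectory features: total path length and the
    straight-line distance between the start and stop points.
    points: a sequence of (x, y) positions. Both features are
    invariant under rigid transformations."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    total = sum(dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    start_stop = dist(points[0], points[-1])
    return {"total_distance": total, "start_stop_distance": start_stop}
```

The ratio of these two features already distinguishes behaviors: near 1 for direct transits, large for loitering or circling tracks, which is the kind of human-meaningful mapping the abstract emphasizes.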

With large data collection projects such as the Dark Energy Survey underway, data from distant supernovae (SNe) are becoming increasingly available. As the quantity of information increases, the ability to quickly and accurately classify SNe has become essential. An area of great interest is the development of a strictly photometric classification mechanism. The first step in the advancement of modern photometric classification is the estimation of individual supernova (SN) light curves. We propose the use of hierarchical Gaussian processes to model light curves. Individual SN light curves are assigned a Gaussian process prior centered at a type-specific mean curve, which is itself assigned a Gaussian process prior. Properties inherent in this Bayesian nonparametric form of modeling yield flexible yet smooth curve estimates with a unique quantification of the error surrounding them. The hierarchical structure relates individual SN light curves in such a way that borrowing strength across curves is possible, which allows SN light curves to be estimated in their entirety even when data are sparse. It also yields a meaningful representation of SN class differences in the form of mean curves, whose differences may eventually allow for classification of SNe.

A two-stage Pareto front approach can improve the process of making a decision about which input values simultaneously optimize multiple responses. However, ignoring estimation uncertainty and natural variability in the responses can potentially lead to suboptimal choices about those input values. A simulation-based approach is used to quantify and examine the impact that variability has on the superior solutions identified on the Pareto front and their performance. Because each optimization scenario has its own unique characteristics, including responses with different amounts of natural variability, the impact of variability on the solutions varies from situation to situation. We study how varying the amount of response variability affects the locations identified for the front and the characteristics of the most promising solutions on the front. We illustrate the method with an application involving process improvement through variance reduction.
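The deterministic starting point of such an analysis is the Pareto front itself; the simulation-based approach then re-computes fronts like this under responses perturbed by their natural variability and examines how stable the superior solutions are. A minimal front computation (minimizing every response):

```python
def pareto_front(points):
    """Return the non-dominated points of a set of response tuples,
    minimizing every response. A point is dominated if another point is
    no worse in every response and strictly better in at least one."""
    def dominates(q, p):
        return (all(qi <= pi for qi, pi in zip(q, p))
                and any(qi < pi for qi, pi in zip(q, p)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Solutions that remain on the front across most simulated replicates are the robust choices; solutions that appear only under favorable noise draws are the potentially suboptimal ones the abstract warns about.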

The single-diode model is a widely used representation of the electrical performance of a photovoltaic (PV) device. This model relates the PV device's terminal current and voltage at a given irradiance and temperature. To obtain reasonable estimates of single-diode model parameters from noisy data, one ought to be able to characterize their estimability. Here, we apply an established method called data cloning to check for evidence of inestimability.
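The single-diode model is implicit in the current: I = I_ph − I_0·(exp((V + I·R_s)/(n·V_t)) − 1) − (V + I·R_s)/R_sh. A sketch of solving it by damped fixed-point iteration is below; symbol names follow common PV conventions, and the parameter values in the usage note are illustrative, not fitted.

```python
import math

def single_diode_current(V, I_ph, I_0, R_s, R_sh, n, V_t=0.02585):
    """Solve the implicit single-diode equation
        I = I_ph - I_0*(exp((V + I*R_s)/(n*V_t)) - 1) - (V + I*R_s)/R_sh
    for the terminal current I at voltage V by damped fixed-point
    iteration. I_ph: photocurrent, I_0: saturation current,
    R_s/R_sh: series/shunt resistances, n: ideality factor,
    V_t: thermal voltage (~25.85 mV at 300 K)."""
    I = I_ph  # short-circuit guess
    for _ in range(200):
        g = (I_ph - I_0 * math.expm1((V + I * R_s) / (n * V_t))
             - (V + I * R_s) / R_sh)
        I = 0.5 * I + 0.5 * g  # damping keeps the iteration stable
    return I
```

The strong nonlinearity in I_0 and n, and the near-flat likelihood directions it induces, is exactly why estimability of these parameters from noisy I-V data is worth checking with a tool like data cloning.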

Geospatial semantic graphs provide a robust foundation for representing and analyzing remote sensor data. In particular, they support a variety of pattern search operations that capture the spatial and temporal relationships among the objects and events in the data. However, in the presence of large data corpora, even a carefully constructed search query may return a large number of unintended matches. This work considers the problem of calculating a quality score for each match to the query, given that the underlying data are uncertain. We present a preliminary evaluation of three methods for determining both match quality scores and associated uncertainty bounds, illustrated in the context of an example based on overhead imagery data.

A targeted network intrusion typically evolves through multiple phases, termed the attack chain. When appropriate data are monitored, these phases will generate multiple events across the attack chain on a compromised host. It is shown empirically that events in different parts of the attack chain are largely independent under nonattack conditions. This suggests that a powerful detector can be constructed by combining across events spanning the attack. This article describes the development of such a detector for a larger network. To construct events that span the attack chain, multiple data sources are used, and the detector combines across events observed on the same machine, across local neighborhoods of machines linked by network communications, as well as across events observed on multiple computers. A probabilistic approach for evaluating the combined events is developed, and empirical investigations support the underlying assumptions. The detection power of the approach is studied by inserting plausible attack scenarios into observed network and host data, and an application to a real-world intrusion is given.

As with any type of forensics, nuclear forensics seeks to infer historical information using models and data. This article connects nuclear forensics and calibration. We present statistical analyses of a calibration experiment that connect several responses to the associated set of input values and then 'make a measurement' using the calibration model. Previous and upcoming real experiments involving production of PuO_{2} powder motivate this article. Both frequentist and Bayesian approaches are considered, and we report findings from a simulation study that compares different analysis methods for different underlying relationships between inputs and responses, different numbers of responses, different amounts of natural variability, and replicated or non-replicated calibration experiments and new measurements.
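The core calibration idea, fit responses against known inputs and then invert a new response to 'make a measurement', can be sketched in its simplest frequentist, single-response, straight-line form (the study considers multiple responses and Bayesian analyses as well):

```python
def calibrate(xs, ys):
    """Fit a straight-line calibration y = a + b*x by least squares to
    known (input, response) pairs and return a function that inverts a
    new response y0 to the estimated input x0 = (y0 - a)/b.
    A minimal single-response frequentist sketch."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    return lambda y0: (y0 - a) / b
```

Natural variability in the responses propagates through this inversion, which is why the simulation study varies the noise level and the degree of replication when comparing analysis methods.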