Prototypes, as Rosch (1973) defined the term in the cognitive sciences, are ideal exemplars that summarize and represent groups of objects (or categories) and that are “typical” according to their internal resemblance and external dissimilarity *vis-à-vis* other groups or categories. In line with the cognitive approach, we propose a data-driven procedure for identifying prototypes that is based on archetypal analysis and compositional data analysis. The procedure presented here exploits the properties of archetypes, both in terms of their external dissimilarity in relation to other points in the data set and in terms of their ability to represent the data through compositions in a simplex in which it is possible to cluster all of the observations. The proposed procedure is useful not only for the usual real-valued data points; it may also be used for interval-valued data, functional data, and relational data, and it provides well-separated and clearly profiled prototypes. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
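The idea of representing observations as compositions in a simplex and clustering them there can be illustrated with a minimal sketch. The archetypes, points, and the use of barycentric coordinates as the compositional weights below are hypothetical illustration, not the authors' procedure: each point in the plane is expressed as a convex combination of three fixed archetypes and assigned to the archetype carrying the largest weight.

```python
# Minimal sketch (hypothetical data): express 2-D points as compositions
# over three archetypes (barycentric coordinates) and cluster each point
# with the archetype that receives the largest compositional weight.

def compositions(p, a, b, c):
    """Barycentric coordinates of point p w.r.t. triangle (a, b, c)."""
    (x, y), (x1, y1), (x2, y2), (x3, y3) = p, a, b, c
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    return (w1, w2, 1.0 - w1 - w2)

archetypes = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]  # hypothetical prototypes
points = [(0.5, 0.5), (3.0, 0.5), (0.5, 3.0)]

for p in points:
    w = compositions(p, *archetypes)
    cluster = max(range(3), key=lambda k: w[k])
    print(p, [round(v, 3) for v in w], "-> archetype", cluster)
```

Each weight vector sums to one, so the points live in a 2-simplex; the argmax rule is one simple way to turn the compositions into a clustering.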

Eigen-functions are of key importance in graph mining, since they can be used to approximate many graph parameters, such as node centrality, the epidemic threshold, and graph robustness, with high accuracy. As real-world graphs change over time, these parameters may change sharply as well. Taking a virus propagation network as an example, new connections between infected and susceptible people appear all the time, and some of these crucial infections may cause a large decrease in the epidemic threshold of the network. As a consequence, the virus would spread through the network quickly. However, if we can keep track of the epidemic threshold as the graph structure changes, those crucial infections can be identified in a timely fashion, so that countermeasures can be taken proactively to contain the spreading process. In this paper, we propose two online eigen-function tracking algorithms that can effectively monitor these key parameters with linear complexity. Furthermore, we propose a general attribution analysis framework that can be used to identify important structural changes in the evolving process. In addition, we introduce an error estimation method for the proposed eigen-function tracking algorithms to estimate the tracking error at each time stamp. Finally, extensive evaluations are conducted to validate the effectiveness and efficiency of the proposed algorithms.
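The paper's algorithms are not reproduced here, but a common building block for this kind of tracking can be sketched: when an undirected edge (i, j) is added to a graph, first-order matrix perturbation theory gives the leading adjacency eigenvalue update lambda' ≈ lambda + 2·u_i·u_j, where u is the unit leading eigenvector before the change. The toy path graph below is an assumption for illustration only.

```python
# Sketch (toy graph, not the paper's algorithm): track the leading
# adjacency eigenvalue under an edge insertion via the first-order
# perturbation  lambda' ~ lambda + 2 * u[i] * u[j].

def leading_eig(A, iters=500):
    """Power iteration: leading eigenvalue/eigenvector of symmetric A."""
    n = len(A)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        lam = norm
    return lam, v

# Path graph 0-1-2-3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

lam, u = leading_eig(A)
i, j = 0, 3                      # new edge to insert
approx = lam + 2 * u[i] * u[j]   # first-order tracking estimate

A[i][j] = A[j][i] = 1            # apply the structural change
exact, _ = leading_eig(A)
print(round(approx, 4), round(exact, 4))
```

The estimate costs O(1) per edge once u is available, which is how such trackers avoid a full eigen-decomposition at every time stamp; the gap between `approx` and `exact` is the kind of tracking error the abstract's error-estimation method addresses.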

The seemingly unrelated regression (SUR) model is a generalization of the linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard feasible generalized least squares (FGLS) estimator is efficient, as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277–288, 2000) can accommodate outliers, but it is hard to compute. First, we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real dataset from economics. Next, we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the general multivariate chain ladder (GMCL) model, which employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.
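For orientation, here is a sketch of the standard (non-robust) FGLS estimator that the abstract takes as its starting point, on simulated two-equation data; the robust FastSUR algorithm itself is not reproduced. Data, seed, and coefficients are assumptions for illustration.

```python
import numpy as np

# Hedged sketch of standard FGLS for a two-equation SUR system
# (simulated data; NOT the robust FastSUR estimator of the paper).
rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
Sigma = np.array([[1.0, 0.7], [0.7, 1.0]])        # contemporaneous errors
E = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
y1 = X1 @ np.array([1.0, 2.0]) + E[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + E[:, 1]

# Step 1: equation-by-equation OLS to estimate the error covariance.
b1 = np.linalg.lstsq(X1, y1, rcond=None)[0]
b2 = np.linalg.lstsq(X2, y2, rcond=None)[0]
R = np.column_stack([y1 - X1 @ b1, y2 - X2 @ b2])
S = R.T @ R / n                                   # estimated Sigma

# Step 2: GLS on the stacked system with block-diagonal design.
X = np.block([[X1, np.zeros_like(X2)], [np.zeros_like(X1), X2]])
y = np.concatenate([y1, y2])
W = np.kron(np.linalg.inv(S), np.eye(n))          # (Sigma^-1 kron I_n)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta.round(2))  # should be close to the true [1, 2, -1, 0.5]
```

Because `W` is built from the estimated residual covariance `S`, a single outlying observation can distort `S` and hence every coefficient, which is exactly the sensitivity that motivates the robust estimator.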

Voting on legislative bills to form new laws is a key function of most legislatures. Predicting the votes of such deliberative bodies leads to a better understanding of government policies and generates actionable strategies for social good. However, it is very difficult to predict legislative votes due to the myriad factors that affect the political decision-making process. In this paper, we present a novel prediction model that maximizes the use of publicly accessible heterogeneous data, i.e., bill text and lawmakers' profile data, to carry out effective legislative prediction. In particular, we design a probabilistic prediction model that achieves high consistency with past vote records while ensuring minimum uncertainty in the vote prediction, reflecting the firm legal ground often held by lawmakers. In addition, the proposed legislative prediction model enjoys the following properties: an inductive and analytical solution, the ability to deal with prediction on new bills and new legislators, and robustness to the missing-vote issue. We conduct an extensive empirical study using real legislative data from the joint sessions of the United States Congress and compare with other representative methods from both the quantitative political science and data mining communities. The experimental results clearly corroborate that the proposed method provides superior prediction accuracy with a visible performance gain.

The quality of academic research is difficult to measure and rather controversial. Hirsch has proposed the h index [1], a measure that has the advantage of summarizing in a single statistic the information contained in the citation counts of each scientist. Although the h index has received a great deal of interest, only a few papers have analyzed its statistical properties and implications. We claim that statistical modeling can add considerable value over a simple summary like the h index. To show this, in this paper we propose a negative binomial distribution to jointly model the two main components of the h index: the number of papers and their citations. We then propose a Bayesian model that allows us to obtain posterior inferences on the parameters of the distribution and, in addition, a predictive distribution for the h index itself. Such a predictive distribution can be used to compare scientists on fairer ground, and in terms of their future contribution rather than their past performance.
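The quantity being modeled is easy to state concretely: the h index is the largest h such that at least h of a scientist's papers have at least h citations each. A minimal computation (the citation vectors below are made up; the Bayesian modeling itself is beyond a short sketch):

```python
# Computing the h index from a vector of per-paper citation counts.

def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    cites = sorted(citations, reverse=True)
    h = 0
    while h < len(cites) and cites[h] >= h + 1:
        h += 1
    return h

print(h_index([10, 8, 5, 4, 3]))   # -> 4 (four papers with >= 4 citations)
print(h_index([25, 8, 5, 3, 3]))   # -> 3
```

Because h depends jointly on the number of papers and how citations distribute over them, a joint model of both components, as the abstract proposes, is what a predictive distribution for h requires.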

Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? We review experience with five problems where probabilistic forecasting played an important role. This leads us to identify five types of potential users: low stakes users, who do not need probabilistic forecasts; general assessors, who need an overall idea of the uncertainty in the forecast; change assessors, who need to know if a change is out of line with expectations; risk avoiders, who wish to limit the risk of an adverse outcome; and decision theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and consider their goals. Cognitive research tells us that calibration is important for trust in probability forecasts and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest often seem to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role but in a limited range of applications.
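The two recommended summaries are simple to compute once the probabilistic forecast is available as a predictive sample. A minimal sketch (the sample and the threshold are hypothetical; the nearest-rank percentile is one of several conventions):

```python
# Reducing a predictive sample to the two summaries the text recommends:
# a single percentile and the probability of an adverse event.

def percentile(sample, q):
    """Empirical q-th percentile (nearest-rank method)."""
    s = sorted(sample)
    k = max(0, min(len(s) - 1, int(round(q / 100 * len(s))) - 1))
    return s[k]

def prob_exceeds(sample, threshold):
    """Estimated probability that the outcome exceeds `threshold`."""
    return sum(x > threshold for x in sample) / len(sample)

forecast = [float(x) for x in range(1, 101)]  # toy predictive sample
print(percentile(forecast, 90))               # -> 90.0
print(prob_exceeds(forecast, 95.0))           # -> 0.05
```

Which percentile (or which adverse-event threshold) to report is exactly the user-dependent choice the abstract emphasizes: a risk avoider may want a high percentile, while a general assessor may prefer a central one.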

Although Big Data has the potential to help researchers in science and industry solve large and complex problems, basic statistical ideas are often ignored in the Big Data literature. It is not true that simply having massive amounts of data renders subject-matter models and experiments obsolete, alleviates the need to ensure data quality, or removes the requirement that variables accurately measure what they are supposed to measure. We refer to these fundamentals as missing links in the Big Data process. In this paper, we illustrate the challenges of making decisions from Big Data through a series of case studies. We offer some strategies to help ensure that projects based on Big Data analyses are successful.

Detection of clustering and estimation of incidence risks are important and useful in public health and epidemiological research. The popular spatial regression models for disease risks, such as conditional autoregressive (CAR) models, assume a known spatial dependence structure for the error distribution and a set of common regression parameters for the mean structure. While it is often difficult to justify the structural assumption on spatial dependence, the assumption of a common regression surface may not be practical for a large spatial domain. We conceptualize a study region as a union of spatially connected clusters, where a cluster is composed of geographically adjacent regions. We propose a regression model with cluster-wise varying regression parameters. Our model is able to capture a spatial clustering structure, while the corresponding cluster-wise regression parameters are estimated given the estimated clustering configuration. The proposed model is flexible in terms of regional and global shrinking as well as the number of clusters, cluster memberships, and cluster locations. We develop an algorithm based on the reversible jump Markov chain Monte Carlo (MCMC) method for model estimation. The numerical study shows the effectiveness of the proposed methodology. The method is computationally efficient and thus amenable to *large* datasets.
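The notion of cluster-wise varying regression parameters can be made concrete with a toy sketch. The regions, covariates, and the fixed cluster labels below are hypothetical, and the fit is plain per-cluster least squares, not the paper's reversible jump MCMC, which also estimates the clustering configuration itself.

```python
# Minimal sketch (hypothetical regions and a GIVEN cluster configuration):
# each cluster of regions gets its own regression parameters, here a
# per-cluster least-squares intercept and slope.

def ols_line(xs, ys):
    """Least-squares (intercept, slope) for one cluster's regions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# region -> (covariate, observed risk, cluster label)
regions = [(0.0, 1.0, "A"), (1.0, 3.0, "A"), (2.0, 5.0, "A"),
           (0.0, 4.0, "B"), (1.0, 3.0, "B"), (2.0, 2.0, "B")]

params = {}
for label in {c for _, _, c in regions}:
    xs = [x for x, _, c in regions if c == label]
    ys = [y for _, y, c in regions if c == label]
    params[label] = ols_line(xs, ys)
print(params)   # cluster A: positive slope; cluster B: negative slope
```

A single common regression surface would average the two opposing slopes away, which is precisely the limitation of the common-parameter assumption that the cluster-wise model removes.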

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an Occupational Safety and Health Administration (OSHA) database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death), and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). For a particular topic of interest (e.g., mental health disability, or carbon monoxide exposure), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can incorporate phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of tuning parameters and model constraints. We evaluate this tool by comparing the computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, **textreg**. Overall, we argue that sparse methods have much to offer in text analysis and that this is a branch of research that should be considered further in this context.
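The core regress-labels-onto-counts idea can be sketched at toy scale. The terms, counts, labels, and the lasso-by-coordinate-descent fit below are hypothetical illustration (the term-count columns are kept non-overlapping so the solution is exact), not the **textreg** implementation or its branch-and-bound phrase search.

```python
# Toy-scale sketch of the core idea: regress a document labeling onto
# term counts with an L1 penalty (lasso, fit by cyclic coordinate
# descent) and harvest the positively predictive terms as the summary.

def lasso_cd(X, y, lam, iters=100):
    """Coordinate descent for min_b 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            rho, ss = 0.0, 0.0
            for i in range(n):
                fit = sum(X[i][k] * b[k] for k in range(p))
                rho += X[i][j] * (y[i] - fit + X[i][j] * b[j])
                ss += X[i][j] ** 2
            if rho > lam:
                b[j] = (rho - lam) / ss      # soft-thresholding update
            elif rho < -lam:
                b[j] = (rho + lam) / ss
            else:
                b[j] = 0.0
    return b

terms = ["carbon monoxide", "ladder", "claim"]
X = [[2, 0, 0],   # per-document term counts (hypothetical corpus)
     [1, 0, 0],
     [0, 2, 0],
     [0, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]
y = [1, 1, -1, -1, 1, -1]   # +1 = on-topic documents

b = lasso_cd(X, y, lam=2.0)
summary = [t for t, w in zip(terms, b) if w > 1e-8]
print(summary)   # only the positively predictive term survives
```

The L1 penalty forces most coefficients to exactly zero, so the harvested summary is a short list of phrases rather than a ranking of the whole vocabulary, which is the compromise between frequency lists and LDA that the abstract describes.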