We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an Occupational Safety and Health Administration (OSHA) database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death), and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency-based methods currently in wide use, and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). For a particular topic of interest (e.g., mental health disability, or carbon monoxide exposure), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of predictive phrases is then harvested as the summary. Using a branch-and-bound approach, this method can incorporate phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of tuning parameters and model constraints. We evaluate this tool by comparing the computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, **textreg**. Overall, we argue that sparse methods have much to offer in text analysis and are a branch of research that deserves further attention in this context. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
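The core idea can be illustrated with a minimal numpy sketch (synthetic counts and phrases; this is not the authors' branch-and-bound **textreg** implementation): run an L1-penalized regression of document labels on phrase-count columns, then harvest the phrases with positive coefficients as the summary.

```python
import numpy as np

def lasso_ista(X, y, lam, steps=2000):
    """Proximal-gradient (ISTA) solver for 0.5/n * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(steps):
        grad = X.T @ (X @ b - y) / n
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold
    return b

# Toy document-by-phrase count matrix: 6 documents, 4 hypothetical phrases.
phrases = ["carbon monoxide", "confined space", "ladder", "forklift"]
X = np.array([[3, 2, 0, 0],
              [2, 1, 0, 0],
              [4, 3, 0, 1],
              [0, 0, 2, 3],
              [0, 1, 3, 2],
              [0, 0, 1, 4]], dtype=float)
y = np.array([1., 1., 1., 0., 0., 0.])         # 1 = document labeled on-topic

b = lasso_ista(X, y, lam=0.05)
summary = [ph for ph, coef in zip(phrases, b) if coef > 0]
print(summary)
```

The L1 penalty drives most coefficients to exactly zero, so only the few phrases that genuinely predict the topic label survive into `summary`.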

Eigen-functions are of key importance in graph mining since they can be used to approximate many graph parameters, such as node centrality, epidemic threshold, and graph robustness, with high accuracy. As real-world graphs change over time, those parameters may change sharply as well. Taking a virus propagation network as an example, new connections between infected and susceptible people appear all the time, and some of the crucial infections may lead to a large decrease in the epidemic threshold of the network. As a consequence, the virus would spread through the network quickly. However, if we can keep track of the epidemic threshold as the graph structure changes, those crucial infections can be identified in a timely manner so that countermeasures can be taken proactively to contain the spread. In this paper, we propose two online eigen-function tracking algorithms which can effectively monitor those key parameters with linear complexity. Furthermore, we propose a general attribution analysis framework which can be used to identify important structural changes in the evolving process. In addition, we introduce an error estimation method for the proposed eigen-function tracking algorithms to estimate the tracking error at each time stamp. Finally, extensive evaluations are conducted to validate the effectiveness and efficiency of the proposed algorithms.
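A generic version of the update such trackers exploit (a sketch of the standard first-order perturbation argument, not the authors' exact algorithms): when an edge (i, j) is added to a symmetric adjacency matrix, the leading eigenvalue, which governs the epidemic threshold, shifts by approximately u^T ΔA u = 2 u_i u_j, where u is the leading eigenvector. This allows an incremental update instead of a full recomputation.

```python
import numpy as np

# Symmetric adjacency matrix of a small connected graph.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

vals, vecs = np.linalg.eigh(A)
lam0, u = vals[-1], vecs[:, -1]               # leading eigenpair

# Add edge (0, 4): ΔA has ones at (0, 4) and (4, 0).
i, j = 0, 4
lam_est = lam0 + 2 * u[i] * u[j]              # first-order estimate u^T ΔA u

A2 = A.copy()
A2[i, j] = A2[j, i] = 1.0
lam_true = np.linalg.eigh(A2)[0][-1]          # exact recomputation for comparison

print(round(lam_est, 3), round(lam_true, 3))
```

The estimate costs O(1) per edge given u, which is the kind of saving that makes linear-complexity online tracking feasible.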

We explore whether free agents in Major League Baseball meet the expectations set forth by newly signed contracts. The value and duration of these contracts are negotiated between the player (and his agent) and the signing team and are based primarily on the player's performance to date, projected future performance, and potential marketing value to the team. We develop two classes of models to explore this problem using a variety of regression- and tree-based machine learning algorithms. The market model uses player and team data to predict the market value of a player's performance (i.e., average contract salary). The performance model uses the same data to predict wins above replacement as a surrogate for overall player performance. We translate this measure into dollars using position-based conversion factors. Analysis of these models demonstrates that the performance model more consistently predicts and assesses player value with respect to their free agent contracts. Together, these models can be used to target or avoid free agents (or other players) whose performance-based value differs significantly from their market value.

Most work on predicting the outcome of basketball matches so far has focused on National Collegiate Athletic Association Basketball (NCAAB) games. Since NCAAB and professional (National Basketball Association, NBA) basketball have a number of differences, it is not clear to what degree these results can be transferred. We explore a number of different representations, training settings, and classifiers, and contrast their results on NCAAB and NBA data. We find that adjusted efficiencies work well for the NBA, the NCAAB regular season is not ideal for training to predict its post-season, the two leagues require classifiers with different bias, and Naïve Bayes predicts the outcome of NBA playoff series well.
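To make the classifier concrete, here is a minimal one-feature Gaussian Naïve Bayes sketch on synthetic data (the efficiency-margin feature and its distributions are hypothetical illustrations, not the paper's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: home team's adjusted-efficiency margin over the away
# team, observed in past games (class 1 = home win, class 0 = home loss).
margin_win  = rng.normal(4.0, 5.0, 200)   # games the home team won
margin_loss = rng.normal(-3.0, 5.0, 200)  # games the home team lost

def gaussian_nb_predict(x, pos, neg):
    """One-feature Gaussian Naive Bayes with equal class priors."""
    def loglik(v, sample):
        return -0.5 * ((v - sample.mean()) / sample.std()) ** 2 - np.log(sample.std())
    return int(loglik(x, pos) > loglik(x, neg))

print(gaussian_nb_predict(6.0, margin_win, margin_loss))   # strong home edge
print(gaussian_nb_predict(-5.0, margin_win, margin_loss))  # strong away edge
```

Despite its independence assumption, this simple generative classifier is the one the authors found to predict NBA playoff series well.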

Analysis of dual sports performance typically involves observational techniques to gather data samples during actual competition. These techniques are limited by the amount of data that can be collected and the need to define the observable variables in advance. Today's advanced technologies have considerably overcome these limitations, enabling high-volume data collection for post-recording analysis. The present study was based on the three-dimensional kinematic data recorded by the automated ball-tracking Hawk-Eye system between the 2003 and 2008 seasons in elite tennis tournaments, which provided a database of 262 596 points. The analysis consisted of an examination of the relationships between the various characteristics of the serve summed up by the resulting ball trajectory and winning-point probabilities. The influence of factors such as serve speed, serve location, court surface, gender differences, and spin intensity on the winning-point rate was assessed to gain insight into efficient serve tendencies in world-class tennis. The implications for practitioners are highlighted and directions for future research in tennis performance analysis based on automatic ball tracking are proposed.

In this paper, we present two approaches to analyzing pass event data to uncover sometimes-nonobvious insights into the game of soccer. We illustrate the utility of our methods by applying them to data from the 2012–2013 La Liga season. We first show that teams are characterized by where on the pitch they attempt passes, and can be identified by their passing styles. Using heatmaps of pass locations as features, we achieved a mean accuracy of 87% in a 20-team classification task. We also investigated using pass locations over the course of a possession to predict shots. For this task, we achieved an area under the receiver operating characteristic (AUROC) of 0.785. Finally, we used the weights of the predictive model to rank players by the value of their passes. Unsurprisingly, Cristiano Ronaldo and Lionel Messi topped the rankings.
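A minimal sketch of the heatmap-as-feature idea on synthetic pass locations (the two teams, the unit-square pitch coordinates, and the nearest-centroid classifier are illustrative assumptions, not the paper's exact pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)

def heatmap(xy, bins=4):
    """Normalized 2D histogram of pass locations on a unit pitch."""
    h, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins, range=[[0, 1], [0, 1]])
    return (h / h.sum()).ravel()

# Two hypothetical teams with distinct passing styles: team A passes high up
# the pitch, team B mostly in its own half. Five training matches each.
def matches(center, n_matches=5):
    return [rng.normal(center, 0.12, size=(300, 2)).clip(0, 1) for _ in range(n_matches)]

train_a, train_b = matches([0.7, 0.5]), matches([0.3, 0.5])
centroid_a = np.mean([heatmap(m) for m in train_a], axis=0)
centroid_b = np.mean([heatmap(m) for m in train_b], axis=0)

# Classify an unseen match by its nearest style centroid.
test_match = rng.normal([0.7, 0.5], 0.12, size=(300, 2)).clip(0, 1)
h = heatmap(test_match)
pred = "A" if np.linalg.norm(h - centroid_a) < np.linalg.norm(h - centroid_b) else "B"
print(pred)
```

Binning pass locations into a coarse grid turns each match into a fixed-length feature vector, which is what makes a 20-team classification task tractable.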

The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard feasible generalized least squares (FGLS) estimator is efficient as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277–288, 2000) can accommodate outliers, but it is hard to compute. First, we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real dataset from economics. Next, we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the general multivariate chain ladder (GMCL) model that employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.

Voting on legislative bills to form new laws is a key function of most legislatures. Predicting the votes of such deliberative bodies leads to a better understanding of government policies and generates actionable strategies for social good. However, it is very difficult to predict legislative votes due to the myriad factors that affect the political decision-making process. In this paper, we present a novel prediction model that maximizes the use of publicly accessible heterogeneous data, i.e., bill text and lawmakers' profile data, to carry out effective legislative prediction. In particular, we design a probabilistic prediction model that achieves high consistency with past vote records while minimizing the uncertainty of the vote prediction, reflecting the firm legal ground often held by lawmakers. In addition, the proposed legislative prediction model enjoys the following properties: an inductive and analytical solution, the ability to predict votes on new bills and by new legislators, and robustness to missing votes. We conduct an extensive empirical study using real legislative data from the joint sessions of the United States Congress and compare with other representative methods from both the quantitative political science and data mining communities. The experimental results corroborate that the proposed method provides superior prediction accuracy with a visible performance gain.

Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? We review experience with five problems where probabilistic forecasting played an important role. This leads us to identify five types of potential users: low stakes users, who do not need probabilistic forecasts; general assessors, who need an overall idea of the uncertainty in the forecast; change assessors, who need to know if a change is out of line with expectations; risk avoiders, who wish to limit the risk of an adverse outcome; and decision theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and consider their goals. Cognitive research tells us that calibration is important for trust in probability forecasts and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest often seem to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role but in a limited range of applications.

In real life, many important datasets are not publicly accessible due to various reasons, including privacy protection and maintenance of business competitiveness. However, knowledge discovery and pattern mining on these datasets can bring enormous benefits to both the data owner and external entities. In this paper, we propose a novel solution for this task, based on Markov chain Monte Carlo (MCMC) sampling of frequent patterns. Instead of returning all the frequent patterns, the proposed paradigm sends back a small set of randomly selected patterns so that the confidentiality of the dataset can be maintained. Our solution also allows interactive sampling, so that the sampled patterns can fulfill the user's requirements effectively. We show experimental results from several real-life datasets to validate the capability and usefulness of our solution. In particular, we show examples in which, by using our proposed solution, an eCommerce marketplace can allow pattern mining on user session data without disclosing the data to the public; such a mining paradigm can help the sellers in the marketplace, which eventually can boost the market's own revenue.
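A minimal sketch of MCMC sampling over frequent patterns (a toy transaction database and a plain Metropolis-Hastings walk targeting the uniform distribution over frequent itemsets; the paper's sampler and its interactive features may differ):

```python
import random

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"},
                {"b", "c"}, {"a", "b", "c", "d"}]
items = sorted({i for t in transactions for i in t})
MINSUP = 2

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

def neighbors(state):
    """Frequent itemsets reachable by adding or removing one item."""
    out = []
    for i in items:
        nxt = state | {i} if i not in state else state - {i}
        if nxt and support(nxt) >= MINSUP:
            out.append(frozenset(nxt))
    return out

random.seed(3)
state = frozenset({"a"})
samples = []
for _ in range(3000):
    cand = random.choice(neighbors(state))
    # Metropolis-Hastings correction for unequal neighborhood sizes makes the
    # stationary distribution uniform over all frequent itemsets.
    if random.random() < len(neighbors(state)) / len(neighbors(cand)):
        state = cand
    samples.append(state)

print(sorted({tuple(sorted(s)) for s in samples[500:]}))
```

The data owner only ever releases the sampled patterns, never the transactions themselves, and biasing the target distribution instead of using a uniform one is one way such a sampler could be made interactive.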

This paper brings explicit considerations of distributed computing architectures and data structures into the rigorous design of Sequential Monte Carlo (SMC) methods. A theoretical result established recently by the authors shows that adapting interaction between particles to suitably control the effective sample size (ESS) is sufficient to guarantee stability of SMC algorithms. Our objective is to leverage this result and devise algorithms which are thus guaranteed to work well in a distributed setting. We make three main contributions to achieve this. First, we study mathematical properties of the ESS as a function of matrices and graphs that parameterize the interaction among particles. Secondly, we show how these graphs can be induced by tree data structures which model the logical network topology of an abstract distributed computing environment. Finally, we present efficient distributed algorithms that achieve the desired ESS control, perform resampling and operate on forests associated with these trees. © 2015 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015
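The ESS that drives the adaptive interaction is a standard quantity for weighted particle systems; a minimal, numerically stable sketch of its computation (generic, not the paper's distributed implementation):

```python
import numpy as np

def ess(log_weights):
    """Effective sample size 1 / sum(w_i^2) of normalized importance weights,
    computed from log-weights with the max subtracted for stability."""
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

rng = np.random.default_rng(4)
uniform = np.zeros(100)            # equal weights -> ESS equals N
skewed = rng.normal(0, 3, 100)     # highly uneven weights -> ESS collapses
print(ess(uniform), ess(skewed))
```

ESS ranges from 1 (one particle carries all the weight) to N (perfectly even weights); SMC schemes typically trigger resampling or particle interaction once it falls below a threshold such as N/2.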

Linear regression models depend directly on the design matrix and its properties. Techniques that efficiently estimate model coefficients by partitioning rows of the design matrix are increasingly becoming popular for large-scale problems because they fit well with modern parallel computing architectures. We propose a simple measure of *concordance* between a design matrix and a subset of its rows that estimates how well a subset captures the variance-covariance structure of a larger data set. We illustrate the use of this measure in a heuristic method for selecting row partition sizes that balance statistical and computational efficiency goals in real-world problems.
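The paper's concordance measure is its own construction; purely as an illustration of the underlying question, one simple proxy is the relative Frobenius distance between the covariance matrix of a row subset and that of the full design matrix, which shrinks as the subset grows:

```python
import numpy as np

rng = np.random.default_rng(5)

# Full design matrix with correlated columns.
n, p = 20000, 5
L = rng.normal(size=(p, p))
X = rng.normal(size=(n, p)) @ L

def cov_discrepancy(X, rows):
    """Relative Frobenius distance between subset and full-sample covariance."""
    C_full = np.cov(X, rowvar=False)
    C_sub = np.cov(X[rows], rowvar=False)
    return np.linalg.norm(C_sub - C_full) / np.linalg.norm(C_full)

small = cov_discrepancy(X, rng.choice(n, 100, replace=False))
large = cov_discrepancy(X, rng.choice(n, 5000, replace=False))
print(round(small, 3), round(large, 3))
```

A partition-size heuristic in this spirit would grow the subset until the discrepancy falls below a tolerance, trading statistical fidelity against per-partition compute.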

Clinical trials often lack power to identify rare adverse drug events (ADEs) and therefore cannot address the threat rare ADEs pose, thus motivating the need for new ADE detection techniques. Emerging national patient claims and electronic health record databases have inspired post-approval early detection methods like the Bayesian self-controlled case series (BSCCS) regression model. Existing BSCCS models do not account for multiple outcomes, where pathology may be shared across different ADEs. We integrate a pathology hierarchy into the BSCCS model by developing a novel informative hierarchical prior linking outcome-specific effects. Considering shared pathology drastically increases the dimensionality of the already massive models in this field. We develop an efficient method for coping with the dimensionality expansion by reducing the hierarchical model to a form amenable to existing tools. Through a synthetic study we demonstrate decreased bias in risk estimates for drugs when using conditions with different true risk and unequal prevalence. We also examine observational data from the MarketScan Lab Results dataset, exposing the bias that results from aggregating outcomes, as previously employed to estimate risk trends of warfarin and dabigatran for intracranial hemorrhage and gastrointestinal bleeding. We further investigate the limits of our approach by using extremely rare conditions. This research demonstrates that analyzing multiple outcomes simultaneously is feasible at scale and beneficial.

How can we correlate the neural activity in the human brain as it responds to typed words with properties of these terms (like ‘edible’, ‘fits in hand’)? In short, we want to find latent variables that jointly explain both the brain activity and the behavioral responses. This is one of many settings of the *Coupled Matrix-Tensor Factorization* (CMTF) problem.

Can we enhance *any* CMTF solver, so that it can operate on potentially very large datasets that may not fit in main memory? We introduce Turbo-SMT, a meta-method capable of doing exactly that: it boosts the performance of *any* CMTF algorithm, parallelizing it with speedups of up to *65-fold*, while producing sparse and interpretable solutions. Additionally, we improve upon ALS, the work-horse algorithm for CMTF, with respect to efficiency and robustness to missing values.

We apply Turbo-SMT to BrainQ, a dataset consisting of a (nouns, brain voxels, human subjects) tensor and a (nouns, properties) matrix, with coupling along the nouns dimension. Turbo-SMT is able to find meaningful latent variables, as well as to predict brain activity with competitive accuracy. Finally, we demonstrate the generality of Turbo-SMT by applying it to a FACEBOOK dataset (users, ‘friends’, wall-postings); there, Turbo-SMT spots spammer-like anomalies.