Clustering methods for time series have been widely studied and applied in a range of fields. They are generally based on the choice of a relevant metric. The aim of this paper is to propose and discuss a clustering technique in the frequency domain for stationary time series. The idea of the new procedure is to analyze the discrete component of the spectrum, avoiding the introduction of any metric for classifying the time series. The novel technique is suitable for time series that show strong periodic components and is based on an efficient algorithm requiring less computational and memory resources, making it appropriate for large and complex temporal databases. The problem of selecting the optimal partition is also addressed, with a proposal that takes into account the stability of the clusters and the efficiency of the procedure in classifying the time series among the different groups. The results of a simulation study show the relative merits of the proposed procedure compared with other spectral-based approaches. An application to a large time-series database provided by a major electric utility is also discussed. In this application the proposed technique performed well, classifying the time series into a few groups of customers with homogeneous electricity demand patterns.
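To make the metric-free idea concrete, here is a minimal sketch of frequency-domain grouping: each series is assigned to a cluster by the location of its dominant periodogram peak, so no pairwise distance between series is ever computed. This is an illustration of the general principle only, not the paper's actual procedure; the function `dominant_frequency` and the toy data are invented for the example.

```python
import numpy as np

def dominant_frequency(x):
    """Index of the largest periodogram ordinate (excluding frequency 0).

    Illustrative only: series sharing the same dominant periodic component
    fall into the same group, so no inter-series metric is required.
    """
    x = np.asarray(x, dtype=float)
    pgram = np.abs(np.fft.rfft(x - x.mean())) ** 2   # periodogram via FFT
    return int(np.argmax(pgram[1:]) + 1)             # skip the zero frequency

# Toy database: six series with one of two periodicities plus noise.
rng = np.random.default_rng(4)
t = np.arange(256)
series = [np.sin(2 * np.pi * 8 * t / 256) + 0.3 * rng.normal(size=256)
          for _ in range(3)]
series += [np.sin(2 * np.pi * 20 * t / 256) + 0.3 * rng.normal(size=256)
           for _ in range(3)]

clusters = {}
for i, x in enumerate(series):
    clusters.setdefault(dominant_frequency(x), []).append(i)
print(clusters)   # two groups: series peaking at bin 8 and at bin 20
```

Because the grouping key is a property of each series alone, the pass over the database is linear in the number of series, which is what makes this style of procedure attractive for large temporal databases.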

Power use in data centers and high-performance computing (HPC) facilities has grown in tandem with increases in the size and number of these facilities. Substantial innovation is needed to enable meaningful reduction in energy footprints in leadership-class HPC systems. In this paper, we focus on characterizing and investigating application-level power usage. We demonstrate potential methods for predicting power usage based on a priori and in situ characteristics. Finally, we highlight a potential use case of this method through a simulated power-aware scheduler using historical jobs from a real scientific HPC system.

Eigen-functions are of key importance in graph mining, since they can be used to approximate many graph parameters, such as node centrality, the epidemic threshold, and graph robustness, with high accuracy. As real-world graphs change over time, these parameters may change sharply in response. Taking a virus propagation network as an example, new connections between infected and susceptible people appear all the time, and a few crucial infections may lead to a large drop in the epidemic threshold of the network. As a consequence, the virus would spread through the network quickly. However, if we can keep track of the epidemic threshold as the graph structure changes, those crucial infections can be identified promptly, so that countermeasures can be taken proactively to contain the spreading process. In this paper, we propose two online eigen-function tracking algorithms that can effectively monitor these key parameters with linear complexity. Furthermore, we propose a general attribution analysis framework that can be used to identify important structural changes in the evolving process. In addition, we introduce an error estimation method for the proposed eigen-function tracking algorithms to estimate the tracking error at each time stamp. Finally, extensive evaluations are conducted to validate the effectiveness and efficiency of the proposed algorithms. © 2016 Wiley Periodicals, Inc. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2016
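A standard way to track a leading eigenvalue in linear time per update, and a plausible building block for this kind of online monitoring, is first-order eigenvalue perturbation: for a symmetric adjacency matrix, Δλ ≈ uᵀ(ΔA)u, which costs O(#edge updates) instead of a full eigendecomposition. The sketch below illustrates that classic identity; it is not the authors' algorithm, and `track_leading_eigenvalue` is a name invented for the example.

```python
import numpy as np

def track_leading_eigenvalue(lam, u, edge_updates):
    """First-order update of the leading eigenvalue of a symmetric
    adjacency matrix after a batch of edge-weight changes.

    lam, u       : current leading eigenvalue / unit eigenvector
    edge_updates : list of (i, j, w): A[i, j] and A[j, i] change by w

    Perturbation theory gives d(lam) ~ u^T dA u = sum 2 w u_i u_j.
    """
    dlam = sum(2.0 * w * u[i] * u[j] for i, j, w in edge_updates)
    return lam + dlam

# Toy check against a full eigendecomposition.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, (50, 50))
A = np.triu(A, 1)
A = (A + A.T).astype(float)              # symmetric adjacency matrix
vals, vecs = np.linalg.eigh(A)
lam, u = vals[-1], vecs[:, -1]

updates = [(0, 1, 1.0), (2, 3, 1.0)]     # two edge-weight increments
lam_hat = track_leading_eigenvalue(lam, u, updates)

for i, j, w in updates:                   # apply updates and recompute exactly
    A[i, j] += w
    A[j, i] += w
lam_true = np.linalg.eigvalsh(A)[-1]
print(abs(lam_hat - lam_true))            # small second-order error
```

The approximation error is second order in the size of the perturbation, which is why periodic re-eigendecomposition (or an explicit error estimator, as the paper proposes) is needed over long update streams.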

The seemingly unrelated regression (SUR) model is a generalization of the linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard feasible generalized least squares (FGLS) estimator is efficient because it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277–288, 2000) can accommodate outliers, but it is hard to compute. First, we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real dataset from economics. Next, we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the general multivariate chain ladder (GMCL) model, which employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.
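For readers unfamiliar with SUR, the standard (non-robust) FGLS estimator that the abstract contrasts with FastSUR can be sketched in a few lines: fit each equation by OLS, estimate the contemporaneous error covariance Σ from the residuals, then run GLS on the stacked system with covariance Σ ⊗ Iₙ. This is the textbook estimator, not FastSUR; `sur_fgls` and the toy data are invented for the illustration.

```python
import numpy as np

def sur_fgls(X_list, y_list):
    """One-step feasible GLS for a SUR system.

    X_list, y_list : per-equation design matrices (n x k_m) and responses (n,)
    Returns the stacked coefficient vector and the estimated error covariance.
    """
    n, M = len(y_list[0]), len(X_list)
    # Step 1: equation-by-equation OLS residuals.
    resid = []
    for X, y in zip(X_list, y_list):
        beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid.append(y - X @ beta_ols)
    E = np.column_stack(resid)                   # n x M residual matrix
    Sigma = (E.T @ E) / n                        # Step 2: contemporaneous covariance
    # Step 3: GLS on the stacked system, Cov(errors) = Sigma kron I_n.
    X_big = np.zeros((n * M, sum(X.shape[1] for X in X_list)))
    row = col = 0
    for X in X_list:                             # block-diagonal design
        X_big[row:row + n, col:col + X.shape[1]] = X
        row += n
        col += X.shape[1]
    y_big = np.concatenate(y_list)
    Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
    XtOi = X_big.T @ Omega_inv
    beta = np.linalg.solve(XtOi @ X_big, XtOi @ y_big)
    return beta, Sigma

# Two-equation toy system with correlated errors.
rng = np.random.default_rng(1)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
L = np.array([[1.0, 0.0], [0.8, 0.6]])           # induces cross-equation correlation
E = rng.normal(size=(n, 2)) @ L.T
y1 = X1 @ np.array([1.0, 2.0]) + E[:, 0]
y2 = X2 @ np.array([-1.0, 0.5]) + E[:, 1]
beta, Sigma = sur_fgls([X1, X2], [y1, y2])
print(np.round(beta, 2))   # close to the true [1, 2, -1, 0.5]
```

The sensitivity to outliers mentioned in the abstract enters through both the OLS step and the residual covariance Σ, each of which a single gross error can distort; robust alternatives such as the Bilodeau–Duchesne estimator replace these with outlier-resistant counterparts.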

Voting on legislative bills to form new laws is a key function of most legislatures. Predicting the votes of such deliberative bodies leads to a better understanding of government policies and generates actionable strategies for social good. However, it is very difficult to predict legislative votes because of the myriad factors that affect the political decision-making process. In this paper, we present a novel prediction model that maximizes the use of publicly accessible heterogeneous data, i.e., bill text and lawmakers' profile data, to carry out effective legislative prediction. In particular, we design a probabilistic prediction model that achieves high consistency with past vote records while ensuring minimum uncertainty in the vote prediction, reflecting the firm legal ground often held by lawmakers. In addition, the proposed legislative prediction model enjoys the following properties: an inductive, analytical solution; the ability to handle predictions on new bills and new legislators; and robustness to missing votes. We conduct an extensive empirical study using real legislative data from the joint sessions of the United States Congress and compare with other representative methods from both the quantitative political science and data mining communities. The experimental results clearly corroborate that the proposed method provides superior prediction accuracy with a visible performance gain.

Prototypes, as Rosch (1973) defined the term in the cognitive sciences, are ideal exemplars that summarize and represent groups of objects (or categories) and that are “typical” according to their internal resemblance and external dissimilarity *vis-à-vis* other groups or categories. In line with the cognitive approach, we propose a data-driven procedure for identifying prototypes that is based on archetypal analysis and compositional data analysis. The procedure presented here exploits the properties of archetypes, both in terms of their external dissimilarity in relation to other points in the data set and in terms of their ability to represent the data through compositions in a simplex in which it is possible to cluster all of the observations. The proposed procedure is useful not only for the usual real data points; it may also be used for interval-valued data, functional data, and relational data, and it provides well-separated and clearly profiled prototypes.

Biomonitoring techniques are widely used to assess environmental damage through the changes occurring in the composition of species communities. Among the living organisms used as bioindicators, epiphytic lichens are recognized as reliable indicators of air pollution. However, lichen biodiversity studies are generally based on the analysis of a scalar measure that omits the species composition. For this reason, we propose to analyze lichen data through diversity profiles and the functional data analysis approach. Indeed, diversity profiles may be naturally considered as functional data because they are expressed as functions of the species abundance vector on a fixed domain. The peculiarity of these data is that the functional space is constituted by a set of curves belonging to the same family. In this context, simultaneous confidence bands are obtained for the mean diversity profile through the Karhunen-Loève (KL) decomposition. The novelty of our method lies in exploiting the known form of the function underlying the data. This allows us to work directly in the functional space, avoiding smoothing techniques. The confidence band procedure is applied to a real data set concerning lichens in the Tuscany region (central Italy). Keywords: confidence bands; functional data analysis; intrinsic diversity profile; lichen data; mean function; KL expansion.
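To fix ideas, a generic simultaneous band for a mean curve can be built from the sample KL (eigen) decomposition of the covariance: simulate mean-zero Gaussian curves with the estimated covariance of the sample mean and take a quantile of their sup-norm. The sketch below shows that generic construction only; it does not use the paper's known-family structure, and `simultaneous_band` and the toy curves are invented for the example.

```python
import numpy as np

def simultaneous_band(curves, level=0.95, n_sim=2000, seed=0):
    """Simultaneous confidence band for the mean curve of a sample of
    curves evaluated on a common grid (rows = curves).

    Generic KL/simulation construction: eigendecompose the covariance of
    the sample mean, simulate Gaussian curves with that covariance, and
    calibrate the band half-width on the sup-norm of the simulations.
    """
    rng = np.random.default_rng(seed)
    n, m = curves.shape
    mean = curves.mean(axis=0)
    C = np.cov(curves, rowvar=False) / n      # covariance of the mean curve
    vals, vecs = np.linalg.eigh(C)
    vals = np.clip(vals, 0.0, None)           # guard tiny negative eigenvalues
    Z = rng.normal(size=(n_sim, m))
    sims = Z @ (vecs * np.sqrt(vals)).T       # Gaussian curves with covariance C
    c = np.quantile(np.abs(sims).max(axis=1), level)
    return mean - c, mean + c

# Toy data: noisy sine curves on a common grid.
grid = np.linspace(0, 1, 50)
rng = np.random.default_rng(3)
curves = np.sin(2 * np.pi * grid) + 0.2 * rng.normal(size=(30, 50))
lo, hi = simultaneous_band(curves)
```

Calibrating on the sup-norm (rather than pointwise quantiles) is what makes the band simultaneous: with probability close to `level`, the whole mean curve lies inside the band at once.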

We review two alternative ways of modeling stability and change in longitudinal data by using time-fixed and time-varying covariates for the observed individuals. Both methods build on the foundation of finite mixture models and are commonly applied in many fields, but they look at the data from different perspectives. We compare them when the ordinal nature of the response variable is of interest.

The latent Markov model uses time-varying latent variables to explain the observable behavior of the individuals. It is proposed in a semiparametric formulation, as the latent process has a discrete distribution and is characterized by a Markov structure. The growth mixture model is based on a latent categorical variable that accounts for the unobserved heterogeneity in the observed trajectories and on a mixture of Gaussian random variables that accounts for the variability in the growth factors. We refer to a real-data example on self-reported health status to illustrate their peculiarities and differences.

Temporal data describe processes and phenomena that evolve over time. In many real-world applications, temporal data are characterized by temporal autocorrelation, which expresses the dependence of time-stamped data over a certain time lag. Often such processes and phenomena involve evolving complex entities, which we can represent with evolving networks of data. In this scenario, a task that deserves attention is regression inference on temporal network data. In this paper, we investigate how to improve predictive inference on network data by accommodating the temporal autocorrelation of the historical data in the learning process of the prediction models. Historical data are temporal data in which most of the elements have already been stored. In practice, we study how to explicitly consider the influence of a network observed in the past in order to enhance predictions on the same network observed at present. The proposed approach relies on a model ensemble built from individual predictors learned on historical network data. The predictors are trained on summary networks, which synthesize the effect of the autocorrelation in distinct sequences of network observations. Summary networks are identified with a sliding-window model. Finally, the model ensemble combines the predictors with a weighting schema that reflects the degree of influence of a predictor with respect to the network observed at present. In this way, we accommodate the temporal autocorrelation both in the data and in the prediction model. Empirical evaluation demonstrates that the proposed approach can boost regression performance on real-world network data.
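The ensemble idea can be sketched generically: fit one regressor per sliding window of past snapshots (a crude stand-in for the paper's summary networks) and combine their predictions with weights that decay with the age of the window. This is an illustration of the sliding-window/recency-weighting pattern only, not the authors' method; the function names, the exponential weighting, and the toy data are all invented for the example.

```python
import numpy as np

def fit_window_models(snapshots, window=3):
    """Train one least-squares predictor per sliding window of snapshots.

    Each snapshot is (X, y): node features and node targets at one time.
    The window's observations are stacked into a single design, a simple
    stand-in for a summary network synthesizing several observations.
    """
    models = []
    for start in range(len(snapshots) - window + 1):
        Xw = np.vstack([X for X, _ in snapshots[start:start + window]])
        yw = np.concatenate([y for _, y in snapshots[start:start + window]])
        beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        models.append(beta)
    return models

def ensemble_predict(models, X_now, decay=0.5):
    """Weight each window model by decay**age, so the most recent window
    influences the present prediction the most."""
    ages = np.arange(len(models))[::-1]      # last model has age 0
    w = decay ** ages
    w = w / w.sum()
    preds = np.column_stack([X_now @ b for b in models])
    return preds @ w

# Toy example: a slowly drifting linear relation over 6 snapshots.
rng = np.random.default_rng(2)
snaps = []
for t in range(6):
    X = rng.normal(size=(40, 2))
    coef = np.array([1.0 + 0.1 * t, -0.5])   # first coefficient drifts upward
    snaps.append((X, X @ coef + 0.05 * rng.normal(size=40)))
models = fit_window_models(snaps, window=3)
X_now = rng.normal(size=(5, 2))
y_hat = ensemble_predict(models, X_now)
```

Because the weights favor recent windows, the ensemble tracks the drifting coefficient more closely than a single model pooled over all history would, which is the intuition behind weighting predictors by their relevance to the network observed at present.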

Monetary policies, whether actual or perceived, cause changes in monetary interest rates. These changes impact the economy through financial institutions, which react to changes in monetary policy by changing their administered rates on both deposits and lending. In this paper, we provide a dynamic model describing how administered bank interest rates react to changes in money market rates in a multicountry setting; in addition, by means of hierarchical equations, we take into account how such changes are affected by the macroeconomic fundamentals of each country. The paper applies the proposed models to interest rates on different loans (to corporates and families) in seven European economies, showing how monetary policy and the specific situation of each country have differently impacted lending over time.
