Nonparametric and parametric subset selection procedures are used to analyze state motor vehicle traffic fatality rates (MVTFRs) for the years 1994 through 2012, identifying subsets of states that contain the ‘best’ (lowest MVTFR) and ‘worst’ (highest MVTFR) states with a prescribed probability. A new Bayesian model is developed and applied to the traffic fatality data, and its results are contrasted with those obtained from the subset selection procedures. All analyses are carried out within the context of a two-way block design.

For further resources related to this article, please visit the WIREs website.

This figure shows the states selected by nonparametric subset selection rules to contain the state with the lowest traffic fatality rate (red) and the state with the highest traffic fatality rate (blue), with a probability of correct selection of 0.90. The analysis is based on data for the years 1994 through 2012.

Gaussian Process (GP) models provide a very flexible nonparametric approach to modeling location-and-time indexed datasets. However, the storage and computational requirements of GP models are infeasible for large spatial datasets. Nearest Neighbor Gaussian Processes (NNGP; Datta A, Banerjee S, Finley AO, Gelfand AE. Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. *J Am Stat Assoc* 2016) provide a scalable alternative by using local information from a few nearest neighbors. Scalability is achieved by using the neighbor sets in a conditional specification of the model. We show how this is equivalent to sparse modeling of Cholesky factors of large covariance matrices. We also discuss a general approach to constructing scalable Gaussian Processes using sparse local kriging. We present a multivariate data analysis which demonstrates how the nearest neighbor approach yields inference indistinguishable from the full-rank GP despite being several times faster. Finally, we propose a variant of the NNGP model for automating the selection of the neighbor set size. *WIREs Comput Stat* 2016, 8:162–171. doi: 10.1002/wics.1383
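
The conditional specification described in this abstract can be pictured with a small numerical sketch. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes one-dimensional locations, an exponential covariance function, and dense NumPy storage for readability (a practical implementation would store only the nonzero entries). The names `exp_cov`, `nngp_factors`, and `nngp_precision` are hypothetical. Each location is regressed on at most `m` nearest previously ordered locations, which gives a sparse strictly lower-triangular matrix `A` and conditional variances `f`; the implied precision matrix is `(I - A)^T F^{-1} (I - A)` with `F = diag(f)`.

```python
import numpy as np

def exp_cov(s, t, sigma2=1.0, phi=2.0):
    """Exponential covariance between 1D location vectors s and t (assumed kernel)."""
    return sigma2 * np.exp(-phi * np.abs(s[:, None] - t[None, :]))

def nngp_factors(locs, m, cov=exp_cov):
    """Kriging weights A and conditional variances f from nearest-neighbor conditioning.

    Row i of A holds the weights of location i on its (at most m) nearest
    neighbors among locations 0..i-1, so A is strictly lower triangular.
    """
    n = len(locs)
    A = np.zeros((n, n))  # dense only for clarity; at most m nonzeros per row
    f = np.empty(n)
    for i in range(n):
        prev = np.arange(i)
        # indices of the m nearest previously ordered locations
        nb = prev[np.argsort(np.abs(locs[prev] - locs[i]))[:m]]
        c_ii = cov(locs[[i]], locs[[i]])[0, 0]
        if nb.size == 0:
            f[i] = c_ii  # first location has nothing to condition on
            continue
        C_nn = cov(locs[nb], locs[nb])
        c_ni = cov(locs[nb], locs[[i]])[:, 0]
        b = np.linalg.solve(C_nn, c_ni)  # local kriging weights
        A[i, nb] = b
        f[i] = c_ii - c_ni @ b           # conditional variance
    return A, f

def nngp_precision(A, f):
    """Precision matrix (I - A)^T F^{-1} (I - A) implied by the factors."""
    I_minus_A = np.eye(len(f)) - A
    return I_minus_A.T @ (I_minus_A / f[:, None])

locs = np.linspace(0.0, 1.0, 20)
A, f = nngp_factors(locs, m=3)
Q = nngp_precision(A, f)
```

When `m` is set to `len(locs) - 1`, every location conditions on all earlier ones and the construction recovers the exact inverse of the full covariance matrix; smaller `m` trades exactness for sparsity.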

Data Science, considered as a science in its own right, is, in general terms, the extraction of knowledge from data. Symbolic data analysis (SDA) offers a new way of thinking in Data Science by extending the standard input to a set of classes of individual entities. Hence, classes of a given population are considered to be units of a higher-level population to be studied. Such classes often represent the real units of interest. To take the variability between the members of each class into account, classes are described by intervals, distributions, sets of categories or numbers (sometimes weighted), and the like. In that way, we obtain new kinds of data, called ‘symbolic’ because they cannot be reduced to numbers without losing much information. The first step in SDA is to build the symbolic data table, in which the rows are classes and the variables can take symbolic values. The second step is to study and extract new knowledge from these new kinds of data by, at least, an extension of Computer Statistics and Data Mining to symbolic data. SDA is a new paradigm which opens up a vast domain of research and applications by giving results complementary to those of classical methods applied to standard data. SDA also addresses big data and complex data challenges: big data can be reduced and summarized by classes, and complex data with multiple unstructured data tables and unpaired variables can be transformed into a structured data table with paired symbolic-valued variables. *WIREs Comput Stat* 2016, 8:172–205. doi: 10.1002/wics.1384
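
The first step named in this abstract, building a symbolic data table whose rows are classes, can be sketched in a few lines. This is a minimal illustration and not a method from the article: the function name `build_symbolic_table`, the choice of min/max intervals for numeric variables, relative-frequency distributions for categorical variables, and the toy records are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_symbolic_table(records, class_key, numeric_vars, categorical_vars):
    """Aggregate individual records into one symbolic row per class.

    Numeric variables become (min, max) intervals; categorical variables
    become relative-frequency distributions over the observed categories.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[class_key]].append(rec)

    table = {}
    for label, members in groups.items():
        row = {}
        for var in numeric_vars:
            values = [m[var] for m in members]
            row[var] = (min(values), max(values))  # interval-valued cell
        for var in categorical_vars:
            counts = Counter(m[var] for m in members)
            row[var] = {cat: c / len(members) for cat, c in counts.items()}
        table[label] = row
    return table

# Made-up individual-level data, for illustration only
records = [
    {"city": "A", "age": 23, "transport": "bus"},
    {"city": "A", "age": 41, "transport": "car"},
    {"city": "A", "age": 35, "transport": "bus"},
    {"city": "B", "age": 52, "transport": "car"},
    {"city": "B", "age": 30, "transport": "car"},
]
symbolic = build_symbolic_table(records, "city", ["age"], ["transport"])
```

Here each city is a class, and its row keeps the within-class variability (an age interval and a distribution over transport modes) that a single mean or mode per class would discard.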

http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fwics.1384

The top-K tau-path screen for monotone association in subpopulations
Srinath Sampath, Adriano Caloiaro, Wayne Johnson, Joseph S. Verducci. Overview. John Wiley & Sons, Inc. 2016-06-30. doi:10.1002/wics.1382. Pages 206–218. http://onlinelibrary.wiley.com/resolve/doi?DOI=10.1002%2Fwics.1382

A pair of variables that tend to rise and fall either together or in opposition are said to be monotonically associated. For certain phenomena, this tendency is causally restricted to a subpopulation; for example, the severity of an allergic reaction may trend with the concentration of an air pollutant only among susceptible individuals. Previously, Yu et al. (*Stat Methodol* 2011, 8:97–111) devised a method of rearranging observations to test paired data for such an association in a subpopulation. However, the computational intensity of the method limited its application to relatively small samples, and the test itself only judges whether association is present in some subpopulation; it does not clearly identify the subsample that came from this subpopulation, especially when the whole sample tests positive. The present study adds a ‘top-*K*’ feature (Sampath S, Verducci JS. *Stat Anal Data Min* 2013, 6:458–471), based on a multistage ranking model, that identifies a concise subsample likely to contain a high proportion of observations from the subpopulation in which the association is supported. Computational improvements incorporated into this top-*K* tau-path algorithm now allow the method to be extended to thousands of pairs of variables measured on sample sizes in the thousands. A description of the new algorithm, along with measures of computational complexity and practical efficiency, helps to gauge its potential use in different settings. Simulation studies catalog its accuracy in various settings, and an example from finance illustrates its step-by-step use. *WIREs Comput Stat* 2016, 8:206–218. doi: 10.1002/wics.1382
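
To make the idea of rearranging observations concrete, the sketch below reorders a paired sample so that the most mutually concordant observations come first, then reports Kendall's tau (tau-a) on each leading subsample. It is a naive greedy illustration written for this summary, not the algorithm of Yu et al. and not the top-*K* tau-path algorithm; the function names and the backward-elimination rule (repeatedly drop the observation with the lowest net concordance) are assumptions, and the cost is far higher than the improved algorithm the abstract describes.

```python
import numpy as np

def concordance_matrix(x, y):
    """C[i, j] is +1 if pair (i, j) is concordant, -1 if discordant, 0 if tied."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sign(x[:, None] - x[None, :]) * np.sign(y[:, None] - y[None, :])

def kendall_tau(C, idx):
    """Kendall's tau-a on the observations in idx, from the concordance matrix."""
    k = len(idx)
    sub = C[np.ix_(idx, idx)]
    # each unordered pair appears twice in sub, hence k * (k - 1) in the denominator
    return sub.sum() / (k * (k - 1))

def greedy_tau_sequence(x, y):
    """Naive reordering: repeatedly remove the least concordant observation.

    Returns the reordered indices (most concordant first) and the tau values
    for the first k observations, k = 2, ..., n.
    """
    C = concordance_matrix(x, y)
    remaining = list(range(len(x)))
    removed = []
    while len(remaining) > 2:
        sub = C[np.ix_(remaining, remaining)]
        worst = int(np.argmin(sub.sum(axis=1)))  # lowest net concordance
        removed.append(remaining.pop(worst))
    order = remaining + removed[::-1]
    taus = [kendall_tau(C, order[:k]) for k in range(2, len(order) + 1)]
    return order, taus

# Toy data: four observations increase together, the last one breaks the pattern
order, taus = greedy_tau_sequence([1, 2, 3, 4, 5], [1, 2, 3, 4, 0])
```

In this toy sample the discordant observation is placed last, so the tau values stay at 1 on the leading subsamples and drop only when the final observation is included.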
