An action plan to expand the technical areas of statistics focuses on the data analyst. The plan sets out six technical areas of work for a university department and advocates a specific allocation of resources for research and for courses in each area. The value of technical work is judged by the extent to which it benefits the data analyst, either directly or indirectly. The plan is also applicable to government research labs and corporate research organizations.

Recent advances in data mining have integrated kernel functions with Bayesian probabilistic analysis of Gaussian distributions. These machine-learning approaches can combine prior information with new data to produce probabilistic rather than deterministic values for unknown parameters. This article extensively analyzes a specific Bayesian kernel model that uses a kernel function to update a conjugate beta prior to a beta posterior distribution. Numerical testing of the beta kernel model on several benchmark datasets shows that its accuracy is comparable to that of the support vector machine (SVM), relevance vector machine, naive Bayes, and logistic regression, and that it runs faster than all of these algorithms except logistic regression. When one class occurs much more frequently than the other, the beta kernel model often outperforms other strategies for handling imbalanced datasets, including under-sampling, over-sampling, and the Synthetic Minority Over-Sampling Technique. When data arrive sequentially over time, the beta kernel model updates the probability distribution easily and quickly, and it is more accurate than an incremental SVM algorithm for online learning.
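The conjugate-update idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary outcome, a Gaussian (RBF) kernel, and the convention that kernel similarities to training points act as pseudo-counts added to a Beta(a0, b0) prior; the function names and parameters are hypothetical.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    # Gaussian (RBF) similarity between a query point and one training point.
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def beta_kernel_predict(x, X_train, y_train, a0=1.0, b0=1.0, gamma=1.0):
    """Posterior mean of P(y = 1 | x): a Beta(a0, b0) prior is updated
    with kernel-weighted pseudo-counts from the labeled training data."""
    w = np.array([rbf_kernel(x, xi, gamma) for xi in X_train])
    a = a0 + np.sum(w[y_train == 1])  # kernel-weighted positive evidence
    b = b0 + np.sum(w[y_train == 0])  # kernel-weighted negative evidence
    return a / (a + b)                # mean of the Beta(a, b) posterior

# Because new observations only add to the pseudo-counts, a sequentially
# arriving point updates the posterior in constant time, with no retraining.
```

This additive structure is what makes the online setting cheap: the posterior after n + 1 points is obtained from the posterior after n points by a single kernel evaluation per query.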

In several modern applications, ranging from genomics to neuroimaging, there is a need to compare measurements across different populations, such as those collected from samples of healthy and diseased individuals. The interest is in detecting a group effect, and typically many thousands or even millions of tests must be performed simultaneously, as exemplified in genomics, where a test is applied to each gene across the genome. Traditional procedures, such as multivariate analysis of variance (MANOVA), are not suitable for non-vector-valued data structures such as functional or graph-structured observations. In this article, we discuss an existing distance-based MANOVA-like approach, the distance-based F (DBF) test, for detecting such differences. The null sampling distribution of the DBF test statistic depends on the distribution of the measurements and the chosen distance measure, and is generally unavailable in closed form. In practice, Monte Carlo permutation methods are deployed, which introduce errors in the estimation of small *p*-values and inflate familywise type I error rates when too few permutations are used. In this work, we propose an approximate distribution for the DBF test statistic that allows inferences to be drawn without the need for costly permutations. This is achieved by using moment matching to approximate, with a Pearson type III distribution, the permutation distribution that would be obtained by enumerating all permutations. The use of the Pearson type III distribution is motivated by empirical observations with real data. We provide evidence with real and simulated data that the resulting approximate null distribution of the DBF test is flexible enough to work well with a range of distance measures. Through extensive simulations involving different data types and distance measures, we provide evidence that the proposed methodology yields the same statistical power that would otherwise only be achievable if many millions of Monte Carlo permutations were performed.
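The moment-matching step can be illustrated with a short sketch. This is not the DBF procedure itself, in which the moments of the full permutation distribution can be obtained without sampling; here, as a simplified stand-in, the mean, standard deviation, and skewness are estimated from a modest sample of permutation statistics and plugged into SciPy's Pearson type III parameterization (skew, loc, scale) to evaluate the tail probability.

```python
import numpy as np
from scipy import stats

def pearson3_pvalue(observed_stat, perm_stats):
    """Approximate a permutation p-value by moment-matching a Pearson
    type III distribution to the permutation statistics, then taking
    its upper-tail probability at the observed statistic."""
    mu = np.mean(perm_stats)
    sigma = np.std(perm_stats, ddof=1)
    skew = stats.skew(perm_stats, bias=False)
    # scipy.stats.pearson3 takes the skewness as its shape parameter.
    return stats.pearson3.sf(observed_stat, skew, loc=mu, scale=sigma)
```

Once the three moments are fixed, arbitrarily small *p*-values can be read off the fitted distribution's tail, which is exactly what a finite set of Monte Carlo permutations cannot resolve.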