Wiley: Statistical Analysis and Data Mining: An ASA Data Science Journal: Table of Contents

Prompting Shapes the Statistical Tails of LLM‐Generated Biomedical Data

Andrej Novak, Milana Grbić, Matej Ivaniček, Dragan Matić — Sun, 07 Jun 2026 23:23:16 -0700

ABSTRACT

We evaluate four large language models (gpt-4o, gpt-4.1, o3, o4-mini) on six biomedically inspired tasks motivated by domains that often exhibit heavy-tailed or strongly skewed behavior. We compare three prompt styles, Natural (N), Mixed (M), and Constrained (C), that differ in distributional pressure; temperature and top-$p$ are varied where supported. The study is designed to quantify how prompt-conditioned distributional pressure modulates the tail-risk profile of generated numeric outputs, rather than to assess fidelity to an external biomedical ground-truth distribution. Outliers are flagged using a stabilized MAD–$z$ rule ($z_{\mathrm{MAD}} > 3$), and threshold robustness is summarized by the normalized area under outlier–threshold curves (nAURC). Tail behavior is assessed via complementary diagnostics (exceedances, high quantiles, Moors' kurtosis, and a secondary Pareto-equivalent exceedance index), with uncertainty quantified using pooled binomial aggregation (Wilson/Newcombe intervals), logit-based CIs for nAURC, dominance areas over threshold grids, and random-effects meta-analysis across problems. Three results emerge. (1) Prompt style is the primary driver of extremes. Under our evaluation criteria, Natural prompts yield the most conservative tail-risk profile: they consistently minimize outlier rate and nAURC and produce the lightest tails. The ordering between Mixed and Constrained varies by model, while Constrained often produces heavier tails, as expected under explicit tail instructions. (2) These patterns persist across thresholds and problems: Natural robustly dominates both Mixed and Constrained in robustness-curve dominance area, and meta-analytic gaps in outlier share are consistently positive for C–N and M–N across models. (3) Sampling (temperature/top-$p$) modulates risk within a fixed (model, prompt) cell: effects are consistently positive for outlier incidence in our temperature contrasts and can be substantial in some settings (notably gpt-4.1 under Natural prompting), but they do not overturn the qualitative prompt ordering in pooled summaries. Overall, prompt design is the most reliable lever for controlling outliers and heavy-tail risk in LLM-generated numeric data, with sampling parameters providing additional combination-specific control. These conclusions apply most directly to the heavy-tailed biomedical-inspired settings studied here.
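
As a rough illustration of the MAD-based outlier rule mentioned in the abstract above, the following sketch computes robust z-scores and flags values whose absolute score exceeds 3. The abstract does not say how the rule is stabilized, so the `eps` floor on the MAD is purely an assumption, as are the function names and the 0.6745 normal-consistency constant.

```python
from statistics import median

def mad_z_scores(values, eps=1e-9):
    """Robust z-scores: 0.6745 * (x - median) / MAD.

    The eps floor on the MAD is a hypothetical stabilization that avoids
    division by zero; the paper's actual stabilization is not specified.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = max(mad, eps)
    return [0.6745 * (v - med) / scale for v in values]

def flag_outliers(values, threshold=3.0):
    """Return one boolean per value: True where |z_MAD| > threshold."""
    return [abs(z) > threshold for z in mad_z_scores(values)]

# Six values near 10 and one extreme value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
flags = flag_outliers(sample)
```

Only the last value in `sample` is flagged, because the median and MAD are barely affected by the single extreme point.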

Mixture‐Based Estimation of Multivariate Data Hypervolume

Luca Scrucca — Mon, 01 Jun 2026 06:01:10 -0700

ABSTRACT

Estimating the hypervolume occupied by multivariate data is a fundamental problem in statistics and data science, with applications ranging from ecology and machine learning to multi-objective optimization and Bayesian inference. Traditional approaches rely on geometric approximations, kernel density estimation, or convex-hull constructions, which often suffer from restrictive assumptions or do not scale well in higher dimensions. We introduce a novel methodology for hypervolume estimation based on finite Gaussian mixture models. The proposed approach defines the hypervolume as a high-probability region of the fitted mixture density and estimates its volume using efficient Monte Carlo techniques, such as Latin hypercube sampling and importance sampling. An automatic, data-driven procedure selects the density threshold that determines the region over which the hypervolume is computed. Across simulations, the proposed mixture-based estimator proves broadly applicable and achieves accuracy, flexibility, and computational efficiency equal to or superior to those of existing methods. Applications to anomaly detection and ecological niche estimation illustrate the method's practical utility and interpretability in complex multivariate settings.
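
The idea of measuring the volume of a high-density region of a mixture can be sketched in a few lines. The example below is a simplification, not the author's estimator: it assumes an already fitted 2-D mixture of isotropic Gaussians, a user-supplied density threshold, and plain uniform Monte Carlo over a bounding box in place of the Latin hypercube or importance sampling named in the abstract. All names are hypothetical.

```python
import math
import random

def mixture_density(point, components):
    """Density of a 2-D mixture of isotropic Gaussians.

    components: list of (weight, (mean_x, mean_y), sigma) tuples.
    """
    x, y = point
    total = 0.0
    for weight, (mx, my), sigma in components:
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        norm = 2.0 * math.pi * sigma ** 2
        total += weight * math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm
    return total

def region_volume(components, threshold, box, n_samples=20000, seed=0):
    """Estimate the area of {x : density(x) >= threshold} inside `box`.

    Uniform Monte Carlo: box area times the fraction of uniform draws
    that land in the high-density region.
    """
    rng = random.Random(seed)
    (x_lo, x_hi), (y_lo, y_hi) = box
    box_area = (x_hi - x_lo) * (y_hi - y_lo)
    hits = 0
    for _ in range(n_samples):
        p = (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))
        if mixture_density(p, components) >= threshold:
            hits += 1
    return box_area * hits / n_samples

# Sanity check on a single standard Gaussian: the region where the density
# is at least exp(-2) / (2*pi) is a disk of radius 2, with area 4*pi.
single = [(1.0, (0.0, 0.0), 1.0)]
threshold = math.exp(-2.0) / (2.0 * math.pi)
estimate = region_volume(single, threshold, box=((-4.0, 4.0), (-4.0, 4.0)))
```

With a single component the exact answer is known, which makes it a convenient check; for a real mixture the bounding box must be wide enough to contain the whole region.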

Pattern Matching for Multivariate Time Series Forecasting

Noé Lebreton, Julien Ah‐Pine, Julien Jacques, Matthieu Neveu — Fri, 29 May 2026 08:40:24 -0700

ABSTRACT

This article presents a new approach to multivariate time series forecasting. While most existing techniques in the literature focus on forecasting a single time series, forecasting multiple time series is a common goal in many applications. To deal with this, we introduce a new method, Weighted Nearest Neighbors for multivariate time series (WNN_multi). This method forecasts the future of a given series by identifying similar patterns not only in its own past but also in the past of related time series. Once the k nearest neighbors are identified, forecasts are made by averaging their future values. We evaluate the proposed approach on several real-world datasets and compare its performance against state-of-the-art forecasting techniques. The results demonstrate that our method achieves comparable or significantly improved performance, showcasing its effectiveness.
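
The pattern-matching idea described above can be illustrated with a minimal nearest-neighbor forecaster. This is a sketch under stated assumptions, not the WNN_multi method itself: it uses Euclidean distance between windows and an unweighted average of the neighbors' futures, since the abstract does not give the weighting scheme. The function and parameter names are hypothetical.

```python
def knn_pattern_forecast(target, related, window, horizon, k):
    """Forecast `horizon` steps of `target` by averaging what followed
    the k past windows most similar to its most recent window.

    Candidate windows are taken from the target's own past and from
    every series in `related`.
    """
    query = target[-window:]
    candidates = []
    for series in [target] + list(related):
        # The last admissible start leaves room for `horizon` known values.
        last_start = len(series) - window - horizon
        for start in range(last_start + 1):
            pattern = series[start:start + window]
            future = series[start + window:start + window + horizon]
            dist = sum((a - b) ** 2 for a, b in zip(pattern, query)) ** 0.5
            candidates.append((dist, future))
    if not candidates:
        raise ValueError("series too short for this window and horizon")
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return [sum(f[h] for _, f in nearest) / len(nearest) for h in range(horizon)]

# A periodic target and one related series with the same repeating pattern.
target = [1, 2, 3, 1, 2, 3, 1, 2]
related = [[1, 2, 3, 1, 2, 3]]
forecast = knn_pattern_forecast(target, related, window=2, horizon=1, k=3)
```

Here the most recent window `[1, 2]` was always followed by `3`, in both the target and the related series, so the forecast is `[3.0]`.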

Wasserstein Regression, Forecasting, and Change‐Point Detection for Daily Traffic Flow Distributions

Abdolnasser Sadeghkhani — Fri, 29 May 2026 08:37:41 -0700

ABSTRACT

We develop a distribution-valued framework for modeling, forecasting, and monitoring traffic flow counts by treating each day as a probability distribution summarized by jittered empirical quantile signatures. Inference is conducted under the 2-Wasserstein geometry, which in one dimension is isometric to the $L^2(0,1)$ metric on quantile functions. This representation preserves the empirical distribution of within-day traffic intensities beyond mean aggregation while deliberately abstracting away from the chronological ordering of the intraday curve. We introduce Wasserstein-based distributional regression, one-step-ahead forecasting, and a Wasserstein CUSUM statistic for change-point detection and localization. Our theory provides finite-sample and asymptotic guarantees under the two-stage sampling structure of traffic data, with error bounds that separate the roles of the number of days $T$ and the within-day resolution $m$. Simulations show competitive performance under location shifts and substantial gains under dispersion or shape changes. An analysis of publicly available interstate traffic volumes illustrates quantile-dependent covariate effects and interpretable regime changes via quantile-shift diagnostics.
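
The one-dimensional fact used above, that the 2-Wasserstein distance equals the L2 distance between quantile functions, can be sketched as follows. Each day's counts are reduced to a quantile signature on a midpoint grid, and the distance is the root mean squared difference between two signatures. The grid choice and function names are assumptions, and the jittering step mentioned in the abstract is omitted.

```python
def quantile_signature(sample, m):
    """Empirical quantiles of `sample` at the m midpoints (j + 0.5) / m."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[min(n - 1, int((j + 0.5) / m * n))] for j in range(m)]

def wasserstein2(sig_a, sig_b):
    """Approximate 2-Wasserstein distance between two 1-D distributions,
    computed as the discretized L2 distance between quantile signatures."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must use the same quantile grid")
    mean_sq = sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)) / len(sig_a)
    return mean_sq ** 0.5

# A pure location shift of 5 moves every quantile by 5, so the distance is 5.
day_a = [3, 1, 4, 1, 5, 9, 2, 6]
day_b = [x + 5 for x in day_a]
distance = wasserstein2(quantile_signature(day_a, 4), quantile_signature(day_b, 4))
```

Because the signature depends only on sorted values, reordering the within-day observations leaves the distance unchanged, which matches the abstract's remark about abstracting away from chronological ordering.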

Matrix‐Variate Skew Normal Distribution: Properties and Estimation

Atila P. Correia, Carlos A. R. Diniz, Victor H. Lachos — Fri, 29 May 2026 08:30:27 -0700

ABSTRACT

In this article, we introduce a matrix-variate skew normal distribution and its extended version for modeling asymmetric matrix-valued data. We investigate the main theoretical properties of these models and develop an EM-type algorithm for maximum likelihood estimation. The proposed methodology is assessed through simulation studies and illustrated with an application to historical Dow Jones dividend data, highlighting its practical relevance for modeling asymmetric dependence structures.

BOST‐LAWS: A Bayesian Online Spatio‐Temporal Prediction Framework With Likelihood‐Adjusted Weighted Smoothing for Disease Surveillance

I. M. L. Nadeesha Jayaweera, Yanzhao Wang, Jian Zou — Fri, 15 May 2026 00:54:10 -0700

ABSTRACT

Accurate prediction of disease transmission is challenged by its dynamic nature, influenced by factors such as population density. This study introduces a novel approach that enhances predictions of disease surveillance data by integrating likelihood weighting into the integrated nested Laplace approximation (INLA) framework, specifically tailored to account for population density within a spatiotemporal Bayesian methodology. Our method prioritizes recent information for non-stationary outbreak time series online prediction by employing calibrated discounting on historical data through weight adjustments on their likelihoods. Empirical analysis of real COVID-19 daily case count data from Massachusetts counties demonstrates the effectiveness of this approach, revealing improved prediction accuracy compared to existing methods. The INLA-based method with weighted smoothing offers a promising avenue for enhancing infectious disease forecasting models, with significant potential applications in public health decision-making and resource allocation.

STEC‐Net: A Spatiotemporal Graph Neural Framework for Community Discovery in Dynamic Social Networks

Yingnan Xu, Shuangshuang Chu — Tue, 12 May 2026 00:00:00 -0700

ABSTRACT

Community discovery is a central problem in the analysis of dynamic social networks. Traditional community discovery methods mainly focus on the formation and dissolution of links between nodes, and therefore often fail to capture the richer spatial structure and temporal dependency underlying network evolution. To address this limitation, we propose STEC-Net, a spatiotemporal graph neural framework for community discovery in dynamic social networks. STEC-Net integrates spatial structure and temporal dynamics within a unified embedding architecture. First, Graph Convolutional Networks (GCNs) are used to learn snapshot-level node representations from network topology. To adapt the spatial encoder to structural evolution, a GRU-based weight evolution mechanism is introduced to update the GCN parameters over time. Then, a second Gated Recurrent Unit (GRU) is employed to model temporal dependencies across snapshot embeddings and to learn spatiotemporal node representations. Finally, a Self-Organizing Map (SOM) is applied to the learned embeddings to cluster nodes and infer their community affiliations. Experiments on four types of dynamic networks show that STEC-Net consistently outperforms traditional community discovery methods in terms of purity, normalized mutual information, homogeneity, and completeness. These results demonstrate that STEC-Net can effectively uncover evolving community structures in dynamic social networks.

Unsupervised Time‐Event Probabilistic Classification Using Large Panels of Time Series

Máximo Camacho, Javier Palarea‐Albaladejo, Manuel Ruiz Marín — Sun, 10 May 2026 20:32:29 -0700

ABSTRACT

This study presents a framework to perform unsupervised time-event probabilistic classification using time series data of large cross-sectional dimension. These datasets often exhibit complexities such as non-linearities, structural breaks, asynchronicity, missing data, and outliers, which hamper their analysis and modeling. To address these challenges, the proposed approach integrates symbolic analysis, compositional data analysis, and Markov-switching time series modeling into a unified methodology. A Monte Carlo simulation study demonstrates the robustness of the method in various challenging scenarios. The practical applicability of the framework is illustrated through two economic case studies: (i) identifying recurrent recession and expansion regimes in the US economy using state-level data, and (ii) detecting breakpoints in high-volatility episodes in the US stock market using data from all assets in the S&P 500 index.

Wavelet‐Based Single‐Index Additive Models With Irregular Link and Additive Functions

Anestis Antoniadis, Umberto Amato, Italia De Feis, Irène Gijbels — Tue, 05 May 2026 02:02:47 -0700

ABSTRACT

Because of the complexity of data sets in practice, there has been much interest in developing statistical analysis tools for problems involving high-dimensional covariates. Examples include partial linear additive models (PLAMs) and single-index models (SIMs). A common feature of these models is that they achieve dimension reduction to circumvent the “curse of dimensionality” while retaining the flexibility of nonparametric regression. In the statistical and machine learning literature, fitting the additive parts in PLAMs and the link function in SIMs by nonparametric methods usually requires smooth additive components and regular link functions, and it is usually achieved using kernel methods or spline smoothing. In this work, we present a novel, intrinsically interpretable combination of these two models with competitive predictive performance. We relax the smoothness assumptions and develop a nonparametric estimation procedure for the additive components and the link function that uses wavelet basis expansions adapted to non-equispaced designs. Simulation studies and real data analyses are employed to demonstrate the usefulness of the approach. Computer codes are provided as Supporting Information.

Causal Mediation Analysis With Latent Subgroups for Survival Model

Yerong Sun, Yuejin Zhou, Tao Hu, Tiejun Tong, WenWu Wang — Sun, 03 May 2026 22:47:31 -0700

ABSTRACT

Causal mediation analysis is an effective method for understanding the mechanism between the exposure and the outcome, often assuming that the mediation model is consistent for each individual in the target population. In practice, however, the natural indirect effect (NIE) may vary across individuals due to their distinct characteristics. As a result, the population can be partitioned into subgroups according to the varying sizes of the NIEs. Distinguishing subgroups within the study population enables the development of more precise and targeted treatment strategies. In this paper, we propose an identifiable mixture mediation model with latent subgroups for survival data, where the outcome follows an accelerated failure time model and the mediator is Gaussian distributed. We further employ three information criteria, the AIC, BIC, and singular BIC (sBIC), to select the number of subgroups, followed by the expectation–maximization (EM) algorithm to estimate the model parameters and NIEs. A simulation study shows that the sBIC is the most robust and efficient criterion for selecting the number of subgroups; therefore, we recommend the sBIC-EM algorithm for practical use. Lastly, we apply our algorithm to lung cancer data and discover two latent groups with opposing NIEs.

New Robust M‐Estimator for Random‐Coefficients Panel Data Model: Algorithms, Simulation, and Application to Insurance Data

Mohamed R. Abonazel, Amr R. Kamel, Ahmed H. Youssef — Sun, 03 May 2026 22:32:09 -0700

ABSTRACT

This paper proposes a new class of M-estimators based on an innovative objective function that provides highly robust and efficient estimates. The resulting estimator, referred to as the robust AKY estimator, is introduced as an alternative to the random coefficient regression (RCR) estimator in the presence of outliers. The results show that, for normal and clean data, the proposed robust AKY estimator performs almost as well as the RCR estimator. However, it demonstrates significantly greater resistance to outliers when applied to contaminated datasets within the random coefficients panel data (RCPD) model. To evaluate its performance, a Monte Carlo simulation study was conducted under various data-generation scenarios with different levels of outlier contamination. The results were compared with those of the non-robust RCR estimator and several existing robust M-estimators, including Huber, Hampel, Andrew, and Bisquare. In addition, the proposed robust AKY estimator was evaluated using a real insurance dataset. The findings from both the simulation study and the empirical application indicate that the robust estimators (Huber, Hampel, Andrew, Bisquare, and AKY) outperform the RCR estimator in the presence of outliers in the RCPD model. Moreover, the proposed robust AKY estimator is more efficient than the other robust M-estimators.

Issue Information

Tue, 28 Apr 2026 23:37:46 -0700

Statistical Analysis and Data Mining: An ASA Data Science Journal, Volume 19, Issue 3, June 2026.

Bayesian Dirichlet Process Copula Mixtures for Heterogeneous Multi‐Cluster Data: Methods and an NBA Player Stats Application

Yujian Liu, Siyi Yu — Tue, 28 Apr 2026 23:37:06 -0700

ABSTRACT

We propose an approach for fitting multi-cluster data using copula-based Dirichlet process mixture models (DPM). Unlike conventional finite mixture models, our framework uses Sklar's theorem to accommodate heterogeneous marginal distributions and complex inter-variable dependencies. We adopt a slice-sampling MCMC scheme to enable full Bayesian inference, which makes the posterior distribution on the number of clusters and the cluster-specific copula parameters simultaneously available. Simulation studies show that this DPM-copula approach can accurately capture and recover heavy-tailed or skewed clusters, while the Gaussian mixture model cannot. We apply our method to real NBA player-level advanced statistics from the 2021–2024 seasons, demonstrating how the model discovers distinct subgroups that exhibit different correlation structures and marginal shapes. These insights show the advantages of a flexible, copula-based approach for multi-cluster data analysis in sports analytics and beyond.