ExcitingAds! arXiv

Neural Estimation of Pairwise Mutual Information in Masked Discrete Sequence Models

Jai Sharma, Yifan Wang, Bryan Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20187v1 Announce Type: new Abstract: Understanding dependencies between variables is critical for interpretability and efficient generation in masked diffusion models (MDMs), yet these models primarily expose marginal conditional distributions and do not explicitly represent inter-variable dependence. We propose a neural framework for estimating pairwise conditional mutual information (MI) directly from the hidden states of a pretrained MDM, using ground-truth MI computed from the model's own conditional distributions for supervision. The resulting estimator captures the model's internal belief about dependency structure and predicts the full MI matrix in a single forward pass, enabling MI-guided parallel decoding by identifying conditionally independent subsets of variables. We evaluate our approach on Sudoku and protein sequence generation with ESM-C, where the MI maps recover known structural constraints and enable a 3-5x magnitude reduction in inference-time forward passes compared to sequential decoding, while preserving generative quality and outperforming entropy-based parallelization methods.

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

Krati Saxena, Tomohiro Shibata — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20188v1 Announce Type: new Abstract: Recommending safe and effective medication combinations from electronic health records (EHRs) is a core clinical AI problem, yet it remains difficult because patient trajectories are long, noisy, and clinically heterogeneous. Existing methods typically excel at either temporal modeling across visits or pharmacological knowledge integration (e.g., drug-drug interactions, DDIs), but rarely achieve both while robustly suppressing noise. We present GraphDiffMed, a knowledge-constrained medication recommendation framework built on dual-scale Differential Attention v2. Differential attention is applied at both intra-visit and inter-visit levels to filter spurious signals within encounters and across longitudinal history, while pharmacological constraints are incorporated during learning. Experiments on MIMIC-III and ablation studies show that this design consistently improves recommendation quality and ranking over strong baselines while achieving a more favorable safety performance balance. We further find that the strongest-performing configuration uses only demographic auxiliary features under our experimental setting. Overall, GraphDiffMed demonstrates that combining noise-aware attention with pharmacological constraints yields more reliable and clinically meaningful medication recommendation. We open-source our code at https://github.com/saxenakrati09/GraphDiffMed.

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

Nitin Vetcha, Dianbo Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20189v1 Announce Type: new Abstract: Despite the remarkable success of large language models (LLMs), they still face bottlenecks while deploying in dynamic, real-world settings with primary challenges being concept drift and the high cost of gradient-based adaptation. Traditional fine-tuning (FT) struggles to adapt to non-stationary data streams without resulting in catastrophic for getting or requiring extensive manual data curation. To address these limitations within the streaming and continual learning paradigm, we propose the Self-Optimizing Lifelong Autonomous Reasoner (SOLAR) which is an open-ended autonomous agent that leverages parameter-level meta-learning to self-improve, treating model weights as an environment for exploration. It initiates the process by consolidating a strong prior over common-sense knowledge making it effective for transfer-learning. By utilizing a multi-level reinforcement learning approach, SOLAR autonomously discovers adaptation strategies, enabling efficient test-time adaptation to unseen domains. Crucially, SOLAR maintains an evolving knowledge base of valid modification strategies, implicitly acting as an episodic memory buffer to balance plasticity (adaptation to new tasks) and stability (retention of meta-knowledge). Experiments demonstrate that SOLAR outperforms strong baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks, marking a significant step toward autonomous agents capable of lifelong adaptation in evolving environments.

Tool-Augmented Agent for Closed-loop Optimization,Simulation,and Modeling Orchestration

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20190v1 Announce Type: new Abstract: Iterative industrial design-simulation optimization is bottlenecked by the CAD-CAE semantic gap: translating simulation feedback into valid geometric edits under diverse, coupled constraints. To fill this gap, we propose COSMO-Agent (Closed-loop Optimization, Simulation, and Modeling Orchestration), a tool-augmented reinforcement learning (RL) framework that teaches LLMs to complete the closed-loop CAD-CAE process. Specifically, we cast CAD generation, CAE solving, result parsing, and geometry revision as an interactive RL environment, where an LLM learns to orchestrate external tools and revise parametric geometries until constraints are satisfied. To make this learning stable and industrially usable, we design a multi-constraint reward that jointly encourages feasibility, toolchain robustness, and structured output validity. In addition, we contribute an industry-aligned dataset that covers 25 component categories with executable CAD-CAE tasks to support realistic training and evaluation. Experiments show that COSMO-Agent training substantially improves small open-source LLMs for constraint-driven design, exceeding large open-source and strong closed-source models in feasibility, efficiency, and stability.

Shiny Stories, Hidden Struggles: Investigating the Representation of Disability Through the Lens of LLMs

Marco Bombieri, Simone Paolo Ponzetto, Marco Rospocher — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20191v1 Announce Type: new Abstract: Modern Large Language Models (LLMs) have recently attracted much attention for their ability to simulate human behavior and generate text that reflects personas and demographic groups. While these capabilities can open up a multitude of diverse applications across fields, it is crucial to examine how such models represent various target groups since LLMs can perpetuate and amplify biases or discrimination against historically marginalized communities or, alternatively, as a result of debiasing efforts, overcorrect by portraying overly positive stereotypes. This overcompensation can idealize these groups, erasing the complexities and challenges they face in favor of unrealistic depictions. In this paper, we investigate how LLMs represent disability by simulating the perspectives of individuals with disabilities in generating social media posts. These posts are then compared with those written by real people with disabilities, focusing on emotional tone, sentiment, and representative words and themes. Our analysis reveals two key findings: (1) LLMs often idealize the experiences of people with disabilities, producing overly positive stereotypes that, despite appearing uplifting, fail to authentically capture their lived realities; and (2) a comparative analysis of posts simulating individuals with and without disabilities highlights a negative bias, where certain topics, such as career and entertainment, are disproportionately associated with nondisabled individuals. This reinforces exclusionary narratives and over-idealized portrayals of disability, misrepresenting the actual challenges faced by this community. These findings align with broader concerns and ongoing research showing that LLMs struggle to reflect the diverse realities of society, particularly the nuanced experiences of marginalized groups, and underscore the need for critical scrutiny of their representations.

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20192v1 Announce Type: new Abstract: Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.

Improving Quantized Model Performance in Qualitative Analysis with Multi-Pass Prompt Verification

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20193v1 Announce Type: new Abstract: Quantized Large Language Models (LLMs) are used more often in qualitative analysis because they run fast and need fewer computing resources. This study examines how different lower bits quantization levels (8-bit, 4-bit, 3-bit, and 2-bit) and quantization types affect the performance of LLaMA-3.1 (8B) on qualitative analysis. The study uses expert and non-expert responses from 82 interview transcripts. Low-bit models often produce higher levels of hallucinations and unstable results, especially when reading non-expert language with unclear terms. To improve performance, we propose a quantization-aware multi-pass prompt verification method. This method guides the model through controlled steps that reduce hallucinations. It removes unreliable content and passes the results to the next transcript after verification, improving accuracy. To validate performance, human coders analyzed transcripts using NVivo and BF16 LLaMA. BF16 LLaMA-3.1 produced high-precision output but had semantic drift and hallucination. These errors were corrected manually. The corrected BF16 output and NVivo human coding were combined to create a gold-standard ground truth (GSGT) for thematic extraction and frequency analysis. The results show that 8-bit models stay closest to the GSGT. The 4-bit models lose accuracy but become stable when the proposed method is applied. The 3-bit and 2-bit models drop in performance because of heavy compression, but they improve with the proposed prompt design and verification. The study also finds that models at the same bit level behave differently depending on quantization type. Overall, the method helps low-resource LLMs become more stable, accurate, and suitable for qualitative research at lower cost.

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

Aisvarya Adeseye, Jouni Isoaho, Adeyemi Adeseye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20194v1 Announce Type: new Abstract: Large language models (LLMs) have been increasingly used to analyze text. However, they are often plagued with contextual reasoning limitations when analyzing long documents. When long documents are processed sequentially, early or dominant concepts can overshadow less visible but meaningful interpretations, leading to cumulative analytical bias, omission error, and over-generalization. Additionally, independently generated outputs are often merged without systematic grounding, introducing redundancy, conceptual drift, and unsupported claims. This study proposes a structured framework combining parallel chunk-level processing with evidence-anchored consolidation. Texts are first divided into semantically coherent chunks and processed independently in parallel to remove influence from earlier processing. The independently generated interpretations are then consolidated using explicit evidence anchoring and prioritization that reduces dominance and over-generalization while improving traceability. Experiments with multiple model types and sizes indicate that parallel processing significantly reduces omission error by approximately 84%, increases evidence traceability by up to 130%, and reduces unsupported claims by up to 91%. Smaller models benefited most, suggesting that efficient parallel chunking and consolidation play a critical role in achieving reliable and scalable textual analysis.

Pseudo-Siamese Network for Planning in Target-Oriented Proactive Dialogues

Xinyue Kang, Maodong Li, Yibin Zheng, Fang Kong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20195v1 Announce Type: new Abstract: A target-oriented proactive dialogue system is designed to steer conversations toward predefined targets while actively providing suggestions. The core paradigm of such a system is to plan a reasonable dialogue path and subsequently guide language models (e.g., pre-trained or large language models) to generate responses, where dialogue path planning serves as the central component-a novel yet under-explored problem. In this work, we propose a Forward-Focused Bidirectional Pseudo-Siamese Network (FF-BPSN) for dialogue path planning toward predefined dialogue targets. FF-BPSN employs two identical transformer-based decoders for forward and backward planning, together with a forward-focused module that integrates bidirectional information to construct the final forward path. This path benefits from bidirectional planning while prioritizing forward information. We then employ the planned path to guide language models in response generation. Extensive experiments on DuRecDial and DuRecDial 2.0 demonstrate that FF-BPSN achieves state-of-the-art performance in dialogue path planning and significantly enhances the effectiveness of target-oriented proactive dialogue systems.

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

Zihui Song, Shihao Ji, Hongxi Li, Shuaizhi Cheng, Chunlin Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20196v1 Announce Type: new Abstract: We investigate the hypothesis that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than by token-frequency tails alone. We work with a suffix-automaton representation of text corpora and define a data-intrinsic global-KL predictive contribution spectrum, in which each state contributes according to its empirical mass times its KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum is already strongly correlated with the empirical data-scaling exponent of a fixed small GPT learner. We then go beyond slope correlation and define, for each training size N, an effective truncation rank K(N) by matching the observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is close to linear in log N, with pooled R^2 about 0.96 for the raw spectrum and R^2 about 0.90 for the smoothed spectrum. These findings provide strong empirical support for a simple mechanism picture: training scale advances an effective frontier through a predictive state spectrum, and the residual tail mass of that spectrum tracks the remaining excess loss.

MedicalBench: Evaluating Large Language Models Toward Improved Medical Concept Extraction

Zhichao Yang, Gregory D. Lyng, Sanjit Singh Batra, Robert E. Tillman — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20197v1 Announce Type: new Abstract: Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts instead of implicit concepts. We present MedicalBench, a benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

Augmented Analytics and Decision Quality: The Role of Trust among Non-Technical BI Users

Thuy Pham Thi Phuong, Ha Nguyen Manh, Ngan Nguyen Thi Thuy, Lan Hoang Thi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20198v1 Announce Type: new Abstract: Augmented analytics has transformed how business intelligence (BI) systems support managerial decision-making. This is especially true for users without technical backgrounds, who increasingly rely on automated insights rather than manual analysis. BI research has previously concentrated on system adoption and user intention, with very little research examining the impact of AI-enabled analytics on decision quality and the cognitive mechanisms in between. Using the theory of cognitive delegation, this paper investigates the role of trust in augmented analytics and decision-making quality among non-technical BI users. 250 business professionals completed the survey, and the data were analyzed using partial least squares structural equation modeling (PLS-SEM). The results show that augmented analytics capabilities lead to a significant increase in perceived ease of use, perceived usefulness, and trust in BI systems. In addition, trust and usefulness influence BI adoption and improve decision quality. Furthermore, trust has a direct and positive impact on decision quality, highlighting its importance as an enabler of reliance on AI-generated insights. This study considers augmented analytics as a form of cognitive delegation and expands the scope of BI adoption research to include decision-making outcomes.

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

Runzhe Zhang, Letian Chen, Wenpeng Zhang, Zhouhan Lin, Peilin Zhao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20199v1 Announce Type: new Abstract: We present FlowLM, a flow matching language model transformed from pre-trained diffusion language models via efficient fine-tuning. By re-aligning the curved sampling trajectories of diffusion models into straight-line flows, FlowLM enables high quality few-step generation that rivals or even outperforms the quality of 2,000-step diffusion sampling with very few training epochs. Remarkably, finetuned FlowLM reaches performance saturation with only half as many training epochs as training from scratch, both approaches greatly outperforming the original diffusion model, thereby validating our method. Furthermore, we validate a more effective training objective for flow matching: predicting clean data to consistently guide the sampling process towards the true data distribution. Empirical results demonstrate that our approach is highly effective for high-quality, few-step text generation.

Evaluating multimodal emotion recognition in proactive conversational agents: A user study

Adnana Dragut, Raquel Lacuesta, F. Xavier Gaya-Morey, Jose M. Buades-Rubio — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20200v1 Announce Type: new Abstract: This article presents a multimodal emotion recognition module integrated into a proactive Socially Interactive Agent (SIA) powered by generative artificial intelligence. The system evaluates real-time affective states through two distinct channels: a computer vision-based facial recognition module and a semantic linguistic analysis engine. To validate the framework, an empirical study was conducted with 20 users who engaged in dynamic, unscripted dialogues with the conversational agent. The findings reveal a significant discrepancy between automated visual cues and actual internal emotional states. When interacting with the AI, users consistently exhibited a "poker face" effect, displaying serious, concentrated facial expressions even when experiencing positive emotions. Consequently, the generative AI linguistic analysis proved significantly more reliable, by contextualizing the users' verbal expressions. Furthermore, an analysis of the interaction dynamics demonstrated that SIAs can effectively elicit specific emotions by adapting conversational themes and employing structured linguistic patterns, such as empathetic or humorous language. However, the study also noted that instances of uncalibrated proactivity occasionally led to user disengagement and a perception of artificiality. Ultimately, this research highlights the necessity of refining SIAs to dynamically adapt to users' emotional evolution, relying on deep linguistic context to foster more natural, human-like interactions.

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20201v1 Announce Type: new Abstract: Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

Rana Muhammad Usman — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20202v1 Announce Type: new Abstract: I study whether emotionally framed evaluation follow-ups change both the behavior and the calm-relative internal representations of small, locally deployed language models. Our main benchmark uses Qwen 3.5 0.8B on four impossible-constraint coding tasks and eight follow-up framings: calm, pressure, urgency, approval, shame, curiosity, encouragement, and threat. In the 0.8B eight-condition sweep (160 conversations), pressure produces the strongest shortcut markers (11/20 runs) and the clearest overfit pattern (3/20), while calm and curiosity preserve explicit honesty more often (7/20 and 6/20). For all seven non-baseline conditions, the corresponding calm-relative direction vectors peak at the final transformer layer. An exploratory PCA of the layer-23 direction vectors reveals a dominant first component (59.5% explained variance) aligned with a hand-labeled positive/negative split (cosine alignment 0.951); approval and urgency are nearly identical internally (cosine 0.957), whereas curiosity points away from urgency (-0.252). In a separate calm-vs.-pressure rerun used for scale comparison, Qwen 3.5 2B shows higher honest rates under calm framing and directionally consistent activation steering on a small 4-prompt A/B probe, whereas the 0.8B steering result reverses. I interpret these results as evidence for measurable prompt-sensitive control directions in small open models, while stopping short of claiming intrinsic emotional states.

GrandGuard: Taxonomy, Benchmark, and Safeguards for Elderly-Chatbot Interaction Safety

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20203v1 Announce Type: new Abstract: As older adults increasingly use LLM-based chatbots for companionship and assistance, a safety gap is emerging. Older adults may face vulnerabilities from social isolation, limited digital literacy, and cognitive decline, yet existing safety benchmarks largely target general harms and overlook elderly-specific risks. For example, a prompt such as "how to repair a ceiling light alone in the dark" may be benign for most users but poses a serious fall risk for older adults with mobility limitations. We introduce GrandGuard, the first comprehensive framework for assessing and mitigating elderly-specific contextual risks in LLM interactions. We develop a three-level taxonomy with 50 fine-grained risk types across mental well-being, financial, medical, toxicity, and privacy domains, grounded in real-world incidents, community discussions, and analysis of stakeholder studies. Using this taxonomy, we construct a benchmark of 10,404 labeled prompts and responses, showing that several leading LLMs mishandle elderly-specific contextual risks in over 50% of cases. We mitigate these failures with two safeguards: a fine-tuned Llama-Guard-3 and a policy-enhanced gpt-oss-safeguard-20b, achieving up to 96.2% and 90.9% unsafe-prompt detection accuracy, respectively. GrandGuard lays the groundwork for AI systems that move beyond general safety to support aging populations.

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20204v1 Announce Type: new Abstract: LLM-based user simulation is the primary mechanism for end-to-end agent evaluation, yet simulated users are poor proxies for real humans: unconstrained LLM defaults produce a Formalism Ceiling (style match rates of 6-8% against real users), while hand-crafted behavioral directives trigger Directive Amplification, where models hyper-interpret instructions into unnatural behavioral extremes that vary dramatically across simulator models. We present RealUserSim, the first user simulation framework grounded in real behavioral data. From 14,000+ authentic human-LLM conversations (WildChat), we extract 7,275 executable behavioral profiles and use them to ground LLM simulators. A fidelity benchmark (PT3) on 600 conversations across 71+ domains with anti-leakage controls shows that grounded simulation raises match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with 6 simulator models and extensive analysis shows that grounded simulation acts as a realistic stress test, surfacing three failure mechanisms invisible to cooperative simulators (mean -3.2% to -3.5% task success degradation), while Directive Amplification in existing benchmarks produces unrealistic behavior that compromises the validity of agent evaluation.

Challenges in Working Towards Patient Engagement in Developing Technology Prototypes

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20205v1 Announce Type: new Abstract: Creating supportive technologies for people living with multiple chronic conditions is extremely challenging. These patients are often faced with substantial visible and invisible treatment work as well as their everyday responsibilities, including coordinating across providers, tracking information, and repeating communication in emotionally charged contexts. In the Cumulative Complexity Model (CuCoM), the balance between patient workload and patient capacity shapes what patients can realistically take on, including whether a digital tool can be adopted and sustained. In this paper, we report engagement lessons from implementing MyCareCompass, a patient-facing digital health intervention (DHI) intended to support day-to-day self-management for people living with multiple chronic conditions. We define engagement as patient uptake and sustained use during a two-month pilot study of our platform, drawing on usage analytics and follow-up feedback, and distill three implementation lessons for designing for engagement in complex chronic care.

PrivacyAkinator: Articulating Key Privacy Design Decisions by Answering LLM-Generated Multiple-choice Questions

Qiyu Li, Yuen Sum Wong, Yuen Kei Wong, Longxuan Yu, Haojian Jin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20206v1 Announce Type: new Abstract: NIST's Privacy Risk Assessment Methodology (PRAM) provides a structured framework for privacy experts to assess privacy risks. However, its complexity and reliance on expert knowledge make it difficult for novice developers to use effectively. This paper explores methods to lower these barriers. We first performed an observational study with 12 participants using PRAM in real-world scenarios, and found that novice developers struggled most with articulating privacy-related design decisions. We then developed PrivacyAkinator, an interactive tool that helps developers articulate key privacy decisions by answering LLM-generated multiple-choice questions. PrivacyAkinator introduces three innovations: a universal privacy representation that abstracts privacy-related design decisions into data flows and stakeholder interactions; a domain-aware design space mined from 10K privacy-related news articles; and a dynamic question-generation workflow to prioritize relevant questions. Our user study with 24 participants suggests that developers using PrivacyAkinator identified 47% more key decisions in 73% less time compared to PRAM.

HealthTale: A Patient-Centric Health Story Visualization Tool

Ryan Smith, Kyle D. Chin, Tamara Munzner — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20207v1 Announce Type: new Abstract: Patients often struggle to communicate coherent accounts of their health histories during time-constrained clinical encounters. These accounts, which we refer to as health stories, include both clinical events and lived experiences. Existing systems prioritize structured, clinician-centered data and provide limited support for eliciting and communicating patient-generated narratives. We present HealthTale, a patient-centric visualization system designed to elicit health stories from patients and structure them to facilitate communication during initial clinical conversations. Its design arises from a multi-stage qualitative investigation across domain expert discussions, online narratives (n=20), patient (n=11) and clinician (n=6) interviews, and elicited health stories (n=22), identifying recurring patterns in how individuals construct and communicate their health stories. HealthTale transforms freeform narratives into structured timeline representations, grounded in a data abstraction that models health stories as events that are grouped by health concern and time, capturing both clinical and contextual information, with the flexibility to handle temporally imprecise data and non-linear distributions of events across time. Through evaluation with patients (n=34) and clinicians (n=3), we find that HealthTale supports recall, organization, and self-advocacy, while enabling clinicians to rapidly interpret patient-generated narratives and establish a shared understanding.

Artificial Pancreas Implantables -- How Healthcare Professionals May Deal With DIY Bio Cases

Austin James, Xavier-Lewis Palmer, Lucas Potter, Celisha Oscar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20208v1 Announce Type: new Abstract: Automated insulin delivery (AID) and artificial pancreas systems increasingly serve as safety-critical cyber-physical technologies in clinical care, integrating sensors, algorithms, software, and insulin-delivery hardware to automate a life-sustaining therapy. While regulated commercial systems are supported by formal approval pathways, manufacturer governance, and post-market surveillance, clinicians are also encountering patients who rely on do-it-yourself (DIY) artificial pancreas systems that operate outside conventional regulatory and institutional control structures. This paper examines how routine clinical handling practices intersect with cyberbiosecurity risk across both regulated and DIY AID systems. When insulin delivery systems are fundamentally reconfigured into a bespoke AID system, with the patient-user becoming the primary threat vector by assuming manufacturer-level roles without mandated governance, the entire ecosystem of stakeholders is placed in legal and clinical uncertainty.

NaP-Control: Navigating Diffusion Prior for Versatile and Fast Character Control

Chia-Wen Chen, Yan Wu, Korrawe Karunratanakul, Siyu Tang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20209v1 Announce Type: new Abstract: Achieving precise, versatile whole-body character control in physics-based animation remains challenging. Recent diffusion-based policies generate rich and expressive motions but typically rely on gradient-based test-time guidance to satisfy task objectives, which is slow and can reduce robustness. We introduce NaP-Control (Navigating Diffusion Prior for Versatile and Fast Character Control), abbreviated as NaP. Our method uses reinforcement learning to manipulate the latent noise of a task-agnostic diffusion policy prior, steering it toward task-specific behaviors for fast, robust control with high motion fidelity. In contrast to methods that rely solely on offline training, NaP interacts with the environment during training to correct motions and optimize task rewards, improving success rates and enabling adaptation to challenging scenarios. By directly predicting task-optimized diffusion noise, NaP eliminates iterative guidance during denoising and enables efficient inference. Experiments show that NaP attains higher success rates and faster inference while preserving natural motion across diverse tasks.

Governance by Design: Architecting Agentic AI for Organizational Learning and Scalable Autonomy

Nelly Dux, Cristina Alaimo, Philippe Roussiere, Abhishek Kumar Mishra — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20210v1 Announce Type: new Abstract: Agentic AI systems - systems that can pursue goals through multi-step planning and tool-mediated action with limited direct supervision - are moving from experimental prototypes to enterprise deployments. This transition introduces tensions in implementation, scaling, and governance: organizations seek scalable autonomy for knowledge and coordination work, yet must preserve accountability, safety, cost control, and responsibility as systems initiate actions, access enterprise data, and evolve through iterative updates. Building on an in-depth qualitative case of a large IT services company's 2025 development and staged rollout of an agentic system integrated with enterprise tools; we show that governance is implemented through concrete architectural and working arrangements that determine what the system is allowed to do, which tools and data it can use, how memory is handled, and how performance improvements are introduced over time. We then distill seven lessons that explain how to build effective governance into agentic AI during operationalization and scaling.

Leveraging Vision-Language Models to Detect Attention in Educational Videos

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20211v1 Announce Type: new Abstract: Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to a multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately found that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

Measuring Decidability as Related to Busy Beaver Numbers

Gurpreet Tandi, Josue Gonzalez-Hendrix, Jonathan Brown — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20215v1 Announce Type: new Abstract: The theoretical existence of Busy Beaver numbers provides a new notion for decidability and corresponding heuristic for conjectures. The minimum number of states in which a conjecture can be modeled gives a classification of what logic system can describe said conjecture. In this work, we construct explicit Turing machines that search for a solution to Brocard's problem greater than 7 and a Fermat prime beyond the 4th which halt if and only if such a solution exists.

Advanced Scientific Methodology Plays Rossini

Silvia Licciardi, Daniela Macchione, Emmanuel Caronna, Elisa Francomano — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20220v1 Announce Type: new Abstract: A musical score provides the essential instructions for its performance while containing indications - at times implicit - regarding the composer's intentions. The presence of authorial variants, and even more so complex series of revisions associated with a single text, presents a challenging path for analytical study. This research, situated within the application of Scientific Methodologies to Music Philology, proposes a methodological approach oriented toward the structural analysis of one of the many settings composed by Gioachino Rossini on the same Metastasio arietta ``Mi lagner\`o tacendo''. Through Computational Analysis - incorporating parsing, data mining, and graph theory - the melodic, harmonic, and textual compositional choices have been rigorously explored. The results constitute a significant unicum in the field, laying the foundation for a systematic study that supports philological research and paves the way for the use of generative models to investigate the creative process.

Why Latent Actions Fail, and How to Prevent It

Jung Min Lee, Taehyun Cho, Li Zhao, Jungwoo Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20223v1 Announce Type: new Abstract: Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observation; and (2) learning in a representation space that focuses on endogenous components is a key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action-supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20233v1 Announce Type: new Abstract: Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU), robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

Cormac Cureton, Narges Armanfard — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20234v1 Announce Type: new Abstract: Prior-Data Fitted networks (PFNs) have been very successful in tabular contexts, handling prediction tasks in context. However, they are designed for single-task inference, meaning that predicting several target values within a context requires repeated forward calls and precludes inter-task information sharing. We propose TabPFN-MT, which is trained on an expanded multi-target synthetic prior to capture inter-task dependencies in context. This model uses an expanded $y$-encoder and a shared decoder head to enable multitask in-context learning and simultaneous inference. The model is uniquely specialized for small-to-medium datasets by relying on in-context learning rather than traditional gradient-based training. Within this regime (averaging fewer than 1,000 samples), extensive evaluations across 344 datasets demonstrate that TabPFN-MT establishes a new state-of-the-art for deep tabular multitask learning. Furthermore, despite the inherent compute asymmetry of joint optimization, our model remains highly competitive with the latest state-of-the-art single-task ensembles. Notably, on multitask datasets it achieves an overall Accuracy rank of 4.89, the highest average rank among all models tested. Crucially, TabPFN-MT delivers this highly competitive performance while reducing the inference cost for $T$ tasks from $O(T)$ to $O(1)$ forward passes, offering a massive computational efficiency improvement for multi-target tabular applications.

Provably Learning Diffusion Models under the Manifold Hypothesis: Collapse and Refine

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20235v1 Announce Type: new Abstract: Diffusion models generate high-dimensional data with remarkable quality, yet how their training efficiently learns the score function, bypassing the curse of dimensionality when data is supported on low-dimensional manifolds, remains theoretically unexplained. We identify a collapse-and-refine mechanism driven by the geometry of the score function itself: at small noise scales, the diverging singularity of the score drives a rapid dimensional collapse of the induced denoising map onto the data manifold projection; at moderate noise scales, training refines the intrinsic density on the learned manifold. We instantiate this principle as Score-induced Latent Diffusion (SiLD), a two-stage framework in which both manifold learning and density estimation emerge from a single denoising score matching objective, replacing the heuristic KL regularization of VAE-based latent diffusion models. We prove that the resulting sample complexity depends on the intrinsic dimension rather than the ambient dimension. Experiments on Stacked MNIST, CelebA variants, and molecular generation benchmarks show that SiLD matches or outperforms VAE-based LDMs in generation quality and consistently improves reconstruction, validating our theoretical predictions.

Information Redistribution Under Reductions in NP Search

Jing-Yuan Wei — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20236v1 Announce Type: new Abstract: Using reductions from structured P-matrix violation search to classical NP-complete formulations such as 3-SAT and Subset Sum, we examine the relationship between representational expansion, auxiliary variables, local inferability, and information accessibility. Rather than viewing reductions purely as computational transformations, we interpret them as mechanisms that redistribute hidden witness information across enlarged representations. From this perspective, reductions, gadgets, and auxiliary structures may expose globally encoded witness information to local propagation and inference, while search algorithms act as decoding procedures attempting to recover the original hidden witness. The resulting observations suggest that representational expansion may improve local inferability by introducing auxiliary variables and consistency structures, while preserving the need to recover the underlying witness information. This work is exploratory in nature and proposes a conceptual framework for understanding how reductions reshape information accessibility in NP search.

AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation

Yixuan Han — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20237v1 Announce Type: new Abstract: We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Based on CLIP emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.

MagBridge-Battery: A Synthetic Bridge Dataset for Li-ion Magnetometry and State-of-Health Diagnostics

Sakthi Prabhu Gunasekar, Prasanna Kumar Rangarajan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20240v1 Announce Type: new Abstract: Battery health diagnostics today rely overwhelmingly on electrochemical signals measured at the cell terminals. A parallel literature has shown that magnetic sensing can resolve information that terminal-only measurements miss, but method development is limited by the absence, to the best of our knowledge, of public battery magnetic-measurement datasets paired with degradation labels. We release MagBridge-Battery v1.0, a synthetic dataset of 6,760 magnetic-field signatures that bridges real magnetic morphology from the Mohammadi-Jerschow Open Science Framework (OSF) archive with state-of-health (SOH) labels from the PulseBat dataset. The release contains 5,600 PulseBat-conditioned grounded samples, 600 synthetic sensor-anomaly samples derived from clean parents, and 560 low-voltage Regime-B extrapolation samples. A cell-disjoint, parent-child-leakage-free primary benchmark split is verified to contain zero overlapping cells, zero cross-split parent-child pairs, and zero sample-ID overlap. We define three primary benchmark tasks: SOH regression, second-life classification, and anomaly detection, plus an auxiliary anomaly-subtype classification task. A controlled label-shuffle ablation collapses SOH regression from R^2 approximately 0.77 to approximately 0, confirming that the bridge encodes input SOH non-trivially rather than producing label-aligned artifacts. The dataset is released on Zenodo under CC-BY-4.0, and the bridge code and benchmark suite are released under Apache-2.0. This work provides a public benchmark for magnetic-sensing battery diagnostics while paired magnetic-electrochemical measurements remain scarce.

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Woo Seob Sim, Yu Rang Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20241v1 Announce Type: new Abstract: Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones ($1.2$B--$70$B) and seven safety benchmarks, Geometry-Lite improves over single-layer probes while remaining close to raw multi-layer score stacking, making it a useful instrument for analyzing the multi-layer safety signal. The decomposition shows that safety evidence is expressed primarily through persistent boundary-position geometry: final or extremal margins and unsafe-side layer occupancy dominate aggregate detection performance. In contrast, finite-difference drift and structural summaries add little to pooled AUROC, although drift can provide small recall-oriented corrections under shifted low-FPR thresholds. Under benchmark shift, optimized linear boundaries are sharp on the training mixture, whereas class-conditional mean geometry retains separation more reliably on a predefined hard held-out subset. Overall, prompt-level safety evidence is not primarily a layer-to-layer motion signal, but a persistent layer-wise margin geometry whose useful components and readout-level biases become visible in decision-critical regimes.

LEAP: A closed-loop framework for perovskite precursor additive discovery

Xin-De Wang, Zhi-Rui Chen, Ze-Feng Gao, Peng-Jie Guo, Cheng Mu, Zhong-Yi Lu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20242v1 Announce Type: new Abstract: Efficient discovery of precursor additives is essential for improving the performance of perovskite solar cells, yet the large chemical space makes conventional trial-and-error screening inefficient. We develop LEAP(LLM-driven Exploration via Active Learning for Perovskites), an expert-in-the-loop closed framework that couples a domain-specialized large language model(LLM) with active learning for iterative additive prioritization. The LLM is trained to extract mechanism-relevant knowledge from the perovskite additive literature and to represent candidate molecules through interpretable descriptors, which are further integrated into a Bayesian optimization workflow for uncertainty-aware prioritization under low-data conditions. Benchmark results on unseen literature show that the domain-specialized model outperforms general-purpose models in mechanism-consistent reasoning. Experimental validation in an expert-in-the-loop proof-of-concept study suggests improved additive prioritization across three screening rounds, leading to average device PCEs of 20.13% and 20.87% for the later-round 6-CDQ- and 2-CNA-treated devices, respectively, compared with 19.25% for the control, with a champion PCE of 21.32%. These results provide preliminary evidence that literature-grounded mechanistic descriptors, when coupled with Bayesian optimization and expert feasibility review, can support mechanism-aware additive prioritization in perovskite photovoltaics.

Lean Refactor: Multi-Objective Controllable Proof Optimization via Agentic Strategy Search

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20244v1 Announce Type: new Abstract: We present Lean Refactor, a plug-and-play retrieval-augmented agentic framework for multi-objective, controllable, and version-robust refactoring of Lean proofs. LLM-generated proofs are notoriously correct-but-verbose and brittle across library versions, yet existing refactoring works overlook three practical challenges: 1) Lean refactoring is natively multi-objective (proof length, compilation cost, and version compatibility are often in tension); 2) Lean repositories have fragile compatibility, whereas LLM releases are unaware of Lean/Mathlib versions; 3) Training-based pipelines require repeated fine-tuning with each new LLM release, scaling neither with model churn nor with Lean's release cycle. Lean Refactor steers a frozen agentic LLM with retrievals from a curated database of multi-objective refactoring strategies, each densely annotated with metadata such as supported Lean/Mathlib versions and expected compilation-cost reduction. Experiments show over $70\%$ token-level compression on competition benchmarks, over $20\%$ on research repositories, and up to $60\%$ compilation-time reduction, outperforming prior work and Claude Code. Version-filtered retrieval further improves compression on the target Lean version, and refactored miniF2F proofs exhibit stronger zero-shot version transfer to future Lean releases than their unrefactored counterparts.

Prism: Structural Symmetry Scanning via Duality-Constrained Laplacian Projection

Jiatong Xie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20245v1 Announce Type: new Abstract: We introduce \textbf{Prism}, a framework for structural symmetry diagnosis in complex networks. Given a graph Laplacian $L$ and a duality operator $P$ (a symmetric involution), Prism computes the \emph{duality defect} $\delta(L,P) = \|LP - PL\|_F / \|L\|_F$ -- a scalar measuring how far the network deviates from structural self-consistency. When $P$ encodes the network's true symmetry, $\delta$ starts near zero and rises monotonically as structure degrades; an arbitrary $P$ gives noise. We prove that the optimal $L'$ satisfying $[L', P] = 0$ is given by a closed-form block-diagonal projection, and provide an unsupervised alternating optimization that learns $P$ from the graph's own Fiedler vector. Experiments on synthetic networks show the true-$P$ defect is $3.38\times$ more sensitive to structural degradation than an index-reversal baseline and more sensitive than modularity. On Zachary's Karate Club with edge noise, Prism achieves $94.5\%$ community detection accuracy at $5\%$ noise versus $76.6\%$ for the raw Laplacian baseline. Applied to live S\&P~500 data (2026-05-17), Prism detects rising structural stress (defect $0.43 \to 0.73$ over 90 days) while surface correlations remain low -- a signal invisible to correlation-based methods. In a historical backtest spanning five major stress events (2011--2020), the duality defect exhibits a consistent pattern: it reaches elevated levels \emph{before} the correlation spike that accompanies each crisis, and sustains high readings during periods of structural fragility that conventional metrics classify as calm. The duality defect is a first-principles structural admissibility condition, requiring no training data and computable in milliseconds.

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20246v1 Announce Type: new Abstract: Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while the advanced reinforcement learning (RL) algorithm, specifically Group Relative Policy Optimization (GRPO), has not been effectively employed for multi-turn RL in these tasks because standard GRPO requires full trajectories as training samples which leads to excessively long context and noise. To address this issue, we propose GROW, a RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples, and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

Yang Liu, Toan Nguyen, Flora D. Salim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20247v1 Announce Type: new Abstract: Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision--language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20248v1 Announce Type: new Abstract: In the transductive setting, where the full graph is observed but node labels are only partially available, progress in semi-supervised node classification has largely focused on architectural innovation. In this paper, we revisit an orthogonal axis: the training objective. We start from a simple observation: transductive models produce predictions for every node during training, including nodes without labels. These unlabeled-node predictions may contain useful training signal, but standard supervised objectives discard them because no ground-truth labels are available. Inspired by the decomposition of cross-entropy into a label-dependent alignment term and a label-independent entropy term, we propose prediction confidence as a natural way to extract this signal in the absence of labels. This motivates Transductive Sharpening (TS): a loss-level modification that minimizes prediction entropy on unlabeled nodes while counterbalancing this effect on labeled nodes. We evaluate Transductive Sharpening across a wide range of node-classification benchmarks and observe consistent performance improvements without requiring any changes to the backbone architecture. Code is available at https://github.com/transductive-sharpening/tunedGNN.

Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization

Taeyoung Yun, Woocheol Shin, Inhyuck Song, Jaewoo Lee, Jinkyoo Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20249v1 Announce Type: new Abstract: Gaussian Process (GP) kernels are central to Bayesian optimization (BO), yet designing effective kernels for high-dimensional problems still relies on extensive manual engineering. Existing automated approaches struggle in high dimensions for two bottlenecks: their kernel search space is limited to additions and multiplications of base kernels, and LLM-based approaches require conditioning on raw observations, which becomes infeasible due to context-length limits and the difficulty of extracting meaningful patterns. We introduce \textbf{Kernel Discovery}, a LLM-driven evolutionary framework for high-dimensional BO that searches a broader kernel space beyond predefined composition rules and does not require conditioning on observations. Motivated by the observation that directly prompting an LLM to generate kernel code yields syntactically varied but functionally identical kernels, we adopt a two-stage approach: an LLM first proposes novel mathematical forms, then a second LLM call converts each form into validated, executable code. We also propose a leave-one-out continuous ranked probability score (LOO-CRPS) as a selection criterion that penalizes overfitted kernels. On five high-dimensional BO benchmarks, our method achieves an average rank of \textbf{1.2 out of 17}, outperforming competitive baselines. We further analyze the discovered kernels to identify which kernels lead to improvements in high-dimensional BO.

Physics-informed convolutional neural networks for fluid flow through porous media

Rafa{\l} Topolnicki, Pawe{\l} D{\l}otko, Maciej Matyka — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20250v1 Announce Type: new Abstract: Accurate simulation of fluid flow in porous media is challenging due to complex pore-space geometries and the computational cost of solving the Navier-Stokes equations. This difficulty is particularly important when repeated simulations are required, as standard numerical solvers may converge slowly in intricate porous domains. We present a neural-network-based framework for predicting pore-scale velocity fields directly from sample geometry. The method uses a convolutional encoder-decoder architecture with skip connections to preserve spatial detail while extracting multi-scale features. Physical consistency is encouraged through a custom loss function combining velocity reconstruction with incompressibility, no-flow conditions inside solids, periodicity constraints, and agreement with the global tortuosity index. We analyze the influence of the corresponding loss weights and quantify the contribution of individual loss components to prediction accuracy. Several CNN backbones are evaluated to identify architectures providing accurate and robust predictions. The generalization ability of the trained model is tested on samples outside the training distribution, including changes in obstacle geometry, boundary conditions, porosity, and realistic porous structures. Finally, we demonstrate a practical use of the predicted velocity fields as initial conditions for Lattice-Boltzmann simulations. This warm-start strategy accelerates solver convergence, reducing the number of iterations in over 90% of tested cases.

ProcBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20251v1 Announce Type: new Abstract: Existing benchmarks for LLM coding agents mainly evaluate final outcomes, such as task completion, compilation success, and test pass rates. While these metrics are useful for measuring end-task capability, they provide limited visibility into how an execution unfolds and often miss recurrent process-level failures that arise during multi-step operation. We present ProcBench, a benchmark-oriented framework for evaluating coding-agent trajectories through process defects and control preservation. ProcBench organizes execution failures into a reusable ontology, standardizes heterogeneous logs into a unified trajectory representation, and reports calibrated risk-based scorecards instead of relying only on final outcomes. We instantiate ProcBench on an annotated set of 200 trajectories and apply it across three coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Our results suggest that ProcBench can be instantiated with useful reliability, that calibration improves the empirical interpretability of defect findings relative to direct thresholding, and that process-aware scorecards provide diagnostic distinctions beyond conventional outcome-based evaluation. We also discuss limitations, including annotation dependence, partial observability for some defect classes, and the need for broader external validation.

Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting

Amritansh Maurya, Navjot Singh, Mohammed Javed, Omar Moured — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20254v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown promising results on NLP tasks, however, their performance on tabular data still needs research attention, because Table Question-Answering (TQA) requires precise cell retrieval and multi-step structured reasoning. Existing work improves TQA either by fine-tuning or training LLMs on task-specific tabular data, but often lacks verifiable control over how the model navigates tables and derives answers. In this work, we propose a training-free TQA approach with two structured prompting frameworks: TableGrid Navigation (TGN), which iteratively navigates rows and columns via a three-module loop to locate evidence and refine answers, and Progressive Inference Prompting (PIP), which enforces columns identification for explicit progressive row selection constraint according to the query. We evaluate 17 LLMs against 6 baselines on TableBench and FeTaQa dataset. On TableBench, TGN improves over the strongest baseline by 3.8 points, and on FeTaQa, PIP achieves SOTA performance over ReAct and Chain-of-Thought. Beyond inference-time gains, PIP and TGN can also serve as supervision templates to fine-tune small models, narrowing the performance gap to much larger architectures in resource-constrained settings, offering versatile and cost-efficient solution for TQA.

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20255v1 Announce Type: new Abstract: Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. This paper describes a MARL environment in which an SDC and 12 pedestrians are co-trained using Multi-Agent Proximal Policy Optimization (MAPPO). Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high-level go/wait decisions. Jaywalking probability depends on a per-pedestrian personality trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule-based baseline. A speed differential metric shows that the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating that jaywalking encounters were not anticipated. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions. Co-training with MARL pedestrians reduced collisions by 30% relative to single-agent RL, as pedestrians learned to wait when the SDC approached at speed.

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20256v1 Announce Type: new Abstract: Reinforcement learning has become a cornerstone for aligning and unlocking the reasoning capabilities of large-scale models. At its core, the training loop of GRPO and its variants alternates between rollout sampling and policy update. Unlike supervised learning, where each gradient step is anchored to an explicit ground-truth target, the optimal gradient direction for updating model parameters in this setting is not known a priori; the high-quality rollouts drawn during the sampling stage therefore act as the implicit "teacher" that guides every parameter update. However, GRPO adopt a simple sampling scheme that conditions all rollouts on the same original prompt. When a task lies beyond the policy model's current capability, this sampling scheme rarely yields a high-quality rollout, leaving the policy model without a meaningful gradient direction when updating its parameters, which causes training to stall. To address this issue, we propose FBOS-RL, a Feedback-Driven Bi-Objective Synergistic reinforcement learning framework. Specifically, we let the model perform Feedback-Guided Exploration Enhancement based on the feedback provided by the environment, and on top of this we design two mutually reinforcing training objectives: Exploitation-oriented Policy Alignment(EPA) and Exploration-oriented Capability Cultivation(ECC). Extensive experiments demonstrate that EPA and ECC can mutually reinforce each other, forming a positive flywheel effect that significantly improves both the training efficiency and the final performance ceiling of reinforcement learning. Specifically, under an identical number of rollouts, FBOS-RL learns substantially faster than GRPO and feedback-based baselines and ultimately attains a higher performance ceiling, while exhibiting higher policy entropy and lower gradient norms throughout training.

Instance Discrimination for Link Prediction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20257v1 Announce Type: new Abstract: Recently, instance discrimination models have emerged as a major solution for self-supervised learning. Having already demonstrated its effectiveness in the image domain, instance discrimination learning is now proving equally convincing in the graph domain, in particular for node classification. However, fewer contributions have tackled the link prediction task. In this contribution, we propose to adapt existing methods to this context. We first provide a rigorous evaluation of existing self-supervised models in the field of link prediction, showing that the main performance depends on the augmentation process (like in computer vision). We then propose a new structural augmentation based on the community structure that is relevant for link prediction. Our main contribution introduces two new models, L-GRACE and L-BGRL, based on link representations instead of node representations, which improve the performance of the existing methods, especially on unattributed graphs, and we show that they perform on par with the state of the art, both in supervised and self-supervised contexts.

It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20258v1 Announce Type: new Abstract: Contextual Integrity (CI) defines privacy not merely as keeping information hidden, but as governing information flows according to the norms of a given context. As large language models are increasingly deployed as personal agents handling sensitive workflows, adhering to CI becomes critical. However, even frontier models remain unreliable in making disclosure decisions, and existing mitigation strategies often degrade underlying task performance. To overcome this privacy-utility trade-off, we propose SELFCI, a complementary self-distillation framework that decouples information suppression from task resolution. SELFCI jointly optimizes two independent reverse KL divergences over distinct teacher distributions derived from feedback: one encourages preserving task-relevant information for utility, while the other enforces minimal and appropriate disclosure. This complementary formulation induces a Product-of-Experts (PoE) target, aligning the policy with the intersection of capability and privacy requirements. Empirical evaluations demonstrate that SELFCI, without relying on costly external supervision, consistently outperforms competitive baselines such as online reinforcement learning algorithms (e.g., GRPO). These trends further extend to out-of-domain settings involving agentic workflows and accumulated private context, suggesting that SELFCI provides a practical path toward CI alignment.

Programmable Participatory Governance -- A Formal Framework for Transparent, Accountable, and Citizen-Responsive Democratic Systems: From Deliberative Theory to Decentralised Architecture

Sergio Montenegro (Independent Researcher) — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20261v1 Announce Type: new Abstract: Public confidence in democratic institutions has declined across many OECD countries over recent decades, while political participation and policy influence remain unevenly distributed across socioeconomic groups. Concurrently, democratic backsliding, declining electoral participation, and persistent concerns regarding institutional transparency and accountability have raised questions about whether existing governance structures are capable of sustaining broad-based legitimacy in complex modern societies. These developments motivate a central institutional design question: can governance systems be restructured to expand participation, improve transparency, and strengthen accountability without undermining stability or decision quality? This thesis proposes Programmable Participatory Governance (PPG), a formal governance framework designed to address these institutional deficits through the integration of democratic theory, institutional economics, and cryptographically verifiable distributed systems. PPG synthesises insights from deliberative and participatory democracy, collective action theory, direct democratic governance, and distributed computation to define a programmable architecture for transparent, verifiable, and scalable civic coordination. The framework is formally specified and evaluated through simulation and systems-oriented architectural analysis. The thesis examines how programmable governance mechanisms can support participatory decision-making while preserving procedural integrity, auditability, and institutional resilience under conditions of large-scale coordination. The objective is not to replace existing democratic institutions outright, but to explore how computationally mediated governance structures may augment or improve contemporary democratic processes in contexts where conventional institutions exhibit persistent structural limitations.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

Bryce Hinkley, Peyman Najafirad — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20262v1 Announce Type: new Abstract: We study selective refusal editing as a three-way control problem: induce non-refusal on designated edit prompts while preserving benign behavior and harmful refusals outside the edit set. We introduce Residual Paving, a routed residual editing method for frozen instruction-tuned transformers that separates route selectivity, whether to intervene, from residual-edit capacity, what edit to apply. An early-layer router predicts a scalar gate and expert mixture; when active, prompt-conditioned bottleneck residual experts apply later-layer residual updates while leaving the backbone unchanged. This decomposition supports an oracle-routing diagnostic where only the learned scalar gate is replaced with the held-out edit/keep label, leaving the residual editor and frozen backbone fixed. On the primary Gemma-3-4B-IT held-out split, learned Residual Paving reduces edit refusal from 88.6% to 4.0%, with 95.5% benign distribution preservation and 87.3% harmful distribution preservation. Same-protocol one-direction steering controls are much weaker on edit success, leaving edit refusal at 86.8% for Edit-target ActAdd and 78.9% for DIM-style refusal steering. The remaining failure is off-target harmful-keep degradation: harmful refusal remains below the frozen-base rate, 65.3% vs. 81.6%. Across six backbones, oracle routing improves the keep-side diagnostic score on every reported row, with median gain +12.9 pp, supporting the interpretation that learned route selectivity is the main observed bottleneck. Trajectory diagnostics on two backbones further suggest directed movement toward edit-target continuations rather than generic refusal suppression.

Adaptive Human-Robot Collaboration for Masonry Construction Under Material and Assembly Uncertainty

Jutang Gao (Princeton University), Arash Adel (Princeton University) — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20264v1 Announce Type: new Abstract: Human-robot collaboration in construction is often challenged by limited robot-to-human communication and the need to adapt to tolerance accumulation arising from material and assembly uncertainties. We present an adaptive human-robot collaborative workflow for masonry construction that addresses communication limitations and tolerance accumulation, demonstrated through a brickwork case study in which a robot places bricks while a human applies adhesive. This workflow is enabled by two complementary mechanisms: 1) an end-effector-mounted projector that provides spatially registered, just-in-time projection guidance for manual adhesive application, and 2) laser scanning for feedback-driven grasping and placement pose correction. Together, these mechanisms enable adjustment of human and robotic actions in response to material variability and accumulated assembly tolerances. Full-scale experiments across conventional running-bond and nonstandard configurations demonstrate that projection guidance improves adhesive application consistency and reduces application time, while laser-based correction maintains level courses and avoids collision-prone failures associated with open-loop execution. These results indicate that integrating spatial projection with feedback-driven adaptation, enabled by material and as-built sensing, can mitigate tolerance accumulation and improve precision and robustness in human-robot collaborative construction.

A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20266v1 Announce Type: new Abstract: The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable performance, the escalation of LALMs' capabilities has significantly outpaced the development of systemic frameworks to ensure their trustworthiness. This survey provides a comprehensive investigation into the endogenous mechanisms of LALMs, detailing the architectural innovations and alignment algorithms that facilitate emergent reasoning. Specifically, we analyze how the transition to unified end-to-end frameworks and the integration of continuous acoustic signals inherently expand the attack surface. To rigorously evaluate the risks within these paradigms, we establish a comprehensive taxonomy of trustworthiness, categorizing critical vulnerabilities such as cross-modal jailbreaking, latent acoustic backdoors, and biometric privacy leakage. We review the state-of-the-art through six analytical pillars: hallucination, robustness, safety, privacy, fairness, and authentication. The profound imbalance between a mature offensive landscape and underdeveloped defenses further validates the critical trustworthiness gaps and multidimensional risks facing audio-centric intelligence. Finally, we propose a strategic roadmap advocating for "Defense-in-Depth" architectures, causal auditory world modeling, and intrinsic representation engineering to bridge the gap between empirical performance and intrinsically trustworthy audio intelligence. Our project has been uploaded to GitHub https://github.com/Kwwwww74/Awesome-Trustworthy-AudioLLMs.

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20267v1 Announce Type: new Abstract: Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

Paul Quinlan, Jeremy Levasseur, Qingguo Li, Xiaodan Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20268v1 Announce Type: new Abstract: Real-world time series come with text: metadata, descriptions, news, reports. Yet time series foundation models process numerical sequences in isolation, and the multimodal text-and-time-series models that attempt to bridge the two all adapt a pretrained language model post hoc, inheriting representations shaped without ever seeing temporal data. These models are also evaluated almost exclusively against other multimodal baselines, not against the strongest unimodal foundation models in either domain, leaving open whether joint training is needed at all. We present Chronicle, a compact 324M-parameter decoder-only transformer trained from scratch on natural language and time series within a single unified architecture. Both modalities share the same transformer blocks, attention mechanism, and residual stream; the bulk of pretraining uses unimodal batches so cross-modal capability emerges purely from shared parameters, with a short alignment stage that interleaves the two. To our knowledge, Chronicle is the first model jointly pretrained on text and time series from scratch, and the first multimodal model evaluated against dedicated foundation models in both domains. It matches Gemma-3-270M-PT on 19 NLU tasks, sets a new bar for frozen-embedding time series classification on 24 UCR/UEA datasets, and produces multimodal forecasts on Time-MMD that beat every supervised fusion baseline, all from a single backbone.

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

Hamed Khosravi, Xiaoming Huo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20269v1 Announce Type: new Abstract: Many bandit deployments (recommendation, clinical dosing, ad targeting) share two facts prior work handles only in isolation: rewards live on a low-dimensional latent subspace, and that subspace drifts. Stationary low-rank bandits exploit rank but break under subspace change; non-stationary linear bandits adapt to drift but pay ambient rate $\widetilde{O}(d\sqrt{T})$. We study piecewise-stationary low-rank linear contextual bandits with scalar feedback: $\theta_t = B_k^\star w_t$ with rank-$r$ factor $B_k^\star\in\mathbb{R}^{d\times r}$ constant within each of $K$ unknown segments and able to shift at boundaries. Our results are tight along three axes. (i) Identification boundary. With single-play scalar rewards, the moving subspace is recoverable through quadratic functionals of rewards iff three probe-side conditions hold: known noise variance, bounded state-noise coupling, and full-dimensional probe support. Each is necessary in the unrestricted-second-moment problem, and jointly they are sufficient, characterizing the boundary of the solvable region. (ii) Algorithm and dynamic regret. SPSC interleaves isotropic probes with windowed projected ridge-UCB exploitation inside the learned $r$-dimensional subspace; a CUSUM-style variant discovers segment boundaries online. The costed dynamic regret is $\widetilde{O}(r\sqrt{T})+\widetilde{O}(T^{2/3})+O(W\,V_{\mathrm{in}})$, replacing the ambient $d\sqrt{T}$ rate with the intrinsic rank. (iii) Empirics. On eleven benchmarks spanning synthetic, UCI/MovieLens, semi-synthetic clinical, and ZOZOTOWN production-log data, SPSC outperforms non-stationary and low-rank baselines whenever $d-r\gtrsim T^{1/6}$, matching the analytical crossover. To our knowledge, this is the first work to characterize the identification boundary and attain the intrinsic-rank dynamic-regret rate in this setting.

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Hamed Khosravi, Xiaoming Huo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20270v1 Announce Type: new Abstract: A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

Smaller Abstract State Spaces Enable Cross-Scale Generalization in Reinforcement Learning

Nasehatul Mustakim, Lucas Lehnert — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20272v1 Announce Type: new Abstract: While humans readily generalize abstract concepts to more complex or larger tasks, building Reinforcement Learning (RL) systems with this ability remains elusive. Here, we present the first theoretical model of how such Out-of-Distribution (OOD) generalization can be achieved in RL agents. Our approach considers Partially Observable Markov Decision Processes (POMDPs) and assumes that an intelligent agent uses an abstraction function to determine which experiences can be treated as equivalent and which must be distinguished. First, we extend the existing state abstraction framework and proof techniques to POMDPs. Then, we define a successor-weighted model reduction, a model reduction variant that enables compression into smaller abstract spaces than prior definitions allow. We derive a bound on the agent's OOD test performance, thereby defining the conditions under which OOD generalization is achievable. This bound decomposes an agent's performance loss into approximation and estimation errors, revealing how reducing an agent's abstract state space size improves test performance and OOD generalization. Our analysis suggests that constraining an agent to operate over a small, finite set of abstract states is necessary for achieving generalization to more complex tasks. Our results motivate further research into learning RL architectures that scale across tasks of varying complexity levels.

Modality-Decoupled Online Recursive Editing

Siyuan Li, Youyuan Zhang, Fangming Liu, Jing Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20273v1 Announce Type: new Abstract: Online model editing for multimodal large language models (MLLMs) requires assimilating a stream of corrections under tight compute and memory budgets. Yet editors developed for text-only LLMs often degrade on MLLMs: visually dominant activations skew the statistics that shape updates, causing cross-modal conflict, while sequential writes become entangled in a shared edit space and amplify long-horizon interference, causing inter-edit interference. To address these, we propose M-ORE, a modality-decoupled online recursive editor for lifelong MLLM adaptation. M-ORE is derived from a unified proximal-projection formulation and admits a closed-form update with a Sherman-Morrison recursion, yielding constant per-edit overhead. It maintains module-wise locality statistics for the text stack and the visual projector to avoid visually dominated update shaping and performs continual updates in a fixed orthogonal low-rank edit subspace via a Sherman-Morrison recursion to mitigate long-horizon interference. Experiments on multiple MLLM backbones and online editing benchmarks show that our M-ORE method consistently improves reliability, generality, and locality over strong baselines, while achieving favorable quality-efficiency scaling. Our code is publicly available at https://github.com/lab-klc/M-ORE.

PolycubeNet: A Dual-latent Diffusion Model for Polycube-Based Hexahedral Mesh Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20274v1 Announce Type: new Abstract: Hexahedral meshes are widely used in simulation pipelines, yet automatic generation remains challenging for complex CAD geometries. Polycube-based hexahedral meshing is a representative approach due to its regular, parameterization-friendly structure, but existing polycube construction methods often rely on intricate surface segmentation and local heuristics, which can produce artifacts or fail on difficult shapes. In this paper, we propose an end-to-end framework for polycube generation based on conditional diffusion models. Given an input geometry represented as a point cloud, our method directly produces a corresponding polycube point cloud, eliminating the need for explicit surface segmentation or predefined polycube templates. At the core of our approach is a dual-latent conditional diffusion architecture that confines computationally expensive self-attention operations to a fixed-capacity, low-dimensional latent space. This design effectively decouples computational complexity from the resolution of both the input geometry and the output polycube, thereby avoiding the quadratic cost typical of point cloud self-attention mechanisms while supporting flexible input and output resolutions. To obtain a hexahedral mesh, the generated polycube is aligned to the input shape via rigid and non-rigid point cloud registration to establish surface correspondence, followed by a polycube-to-hex pipeline. We additionally create and release a paired dataset of CAD meshes and their corresponding polycube meshes, together with the core implementation of our model. Experiments show that PolycubeNet generalizes to complex CAD models with arbitrary genus and produces high-quality polycube structures within seconds, improving robustness and efficiency over prior learning-based approaches.

You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection

Sana Alamgeer, Ronish Kumar, Awatif Yasmin, Muhammad Irshad, Anne H. H. Ngu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20275v1 Announce Type: new Abstract: Existing deep learning approaches for wearable fall detection systems rely on self-attention mechanisms that impose quadratic computational overhead, distributing weights across all time steps. This global weight distribution impairs the precise localization of the brief impact signatures that characterize falls within short, fixed-length windows. To overcome this challenge, we propose Gated-CNN, a lightweight dual-stream architecture that processes accelerometer and gyroscope streams through independent one-dimensional convolutional feature extractors, followed by (i) a sigmoid gating module that selectively suppresses uninformative background activations while amplifying fall-discriminative features, (ii) a global average pooling layer that compresses each stream into a compact fixed-length descriptor, and (iii) a shared classification head that fuses both descriptors for binary fall prediction. For offline evaluation, we evaluate the model across five wrist-mounted inertial measurement unit (IMU) datasets, achieving average F1-scores of 93%, 93%, 90%, 91%, and 90% on SmartFallMM, WEDA-Fall, FallAllD, UMAFall, and UP-Fall, outperforming Transformer baselines. For real-time evaluation, we deployed the model on a Google Pixel Watch 3 and tested across 12 participants. The model achieves an average F1-score of 97% and an accuracy of 98% with zero missed falls, showing that sigmoid gating offers a more structurally aligned and computationally efficient alternative to attention for commodity smartwatch-based fall detection.

OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20276v1 Announce Type: new Abstract: The global deployment of edge intelligence operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two incompatible optimization regimes (i.e., unbiased global gradients yet coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), resulting in that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL-FL training modes via equipping intermediate supervision and regularization (ISR) signals at multiple hidden layers. Specifically, we propose (i) to use mutual-information (MI) as intermediate supervision to align shifting internal covariate in CL and client-drifting representations in FL, and (ii) to adopt negative-entropy (NE) as intermediate regularizer to penalize overconfident prediction, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, and non-asymptotic O(1/sqrt(T)) convergence bound that shows the introduced ISR does not violate standard SGD convergence, (ii) a federated drift-bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound that indicates that CL-FL hybrid mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL-FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20277v1 Announce Type: new Abstract: Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce ``\textit{Evaluation Hallucinations}'', where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the \textbf{Clinical Abnormality Benchmarking Substrate (CABS)}, a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a ``\textit{Mechanistic Divergence}'' in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose \textbf{Trajectory-Integral Feedback GRPO (TIF-GRPO)}, a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at \href{https://github.com/ZJU4HealthCare/TIF-GRPO}{GitHub}.

ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20278v1 Announce Type: new Abstract: Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness and coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the hallucination--missing-fact balance, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained Capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20282v1 Announce Type: new Abstract: Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.

Fast algorithms for interpolation with clamped $L$-splines of order four

O. Kounchev, H. Render, G. Simeonov, Ts. Tsachev — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20283v1 Announce Type: new Abstract: Interpolation and smoothing using cubic and generalized splines are fundamental tools in data analysis and statistical modeling. Recently, fast computational algorithms were developed for natural $L$-splines of order four, which arise as piecewise solutions to the differential operator $L_{\xi}^2 = (\frac{d^2}{dt^2} - \xi^2)^2$. In this paper, we extend this mathematical framework to the important case of clamped (or complete) boundary conditions, where the first derivatives at the interval endpoints are prescribed. We explicitly construct the governing linear system for the interpolation problem and mathematically prove that the resulting tridiagonal matrix is strictly row diagonally dominant, thereby guaranteeing its invertibility and the numerical stability of the fast algorithm. The proposed method is implemented in MATLAB. Furthermore, the developed clamped $L$-splines provide a foundation for constructing multivariate clamped polysplines, which serve as a promising alternative to Physics-Informed Neural Networks (PINNs) for solving partial differential equations in Mathematical Physics.

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20284v1 Announce Type: new Abstract: Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20285v1 Announce Type: new Abstract: We tackle the question of how to scale more efficiently across the many, ever-growing stages of current LLM training pipelines. Our guiding intuition stems from the fact that the dynamics of later stages of the pipeline, e.g. post-training, can be used to inform earlier stages such as pre-training. To this end, we propose Introspective Training (or IXT), inspired by offline reward-conditioned reinforcement learning and applicable to any stage of training. IXT uses a thinking reward model to annotate data with natural language critique based feedback, enabling quality aware training from the earliest stages of the pipeline. Models are then trained by prefix-conditioning the data with the generated feedback -- ensuring that not all tokens are treated equally starting much earlier in training than usual. Comprehensive experiments on 7.5-12B transformer-based dense LLMs trained from scratch all the way up to 18 Trillion tokens seen show that our method: bends scaling curves resulting in up to 2.8x more compute efficiency generally; and reaches performance levels unachievable for models trained otherwise in domains such as math and code.

Adaptive Probe-based Steering for Robust LLM Jailbreaking

Junxi Chen, Junhao Dong, Xiaohua Xie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20286v1 Announce Type: new Abstract: Recent work has demonstrated the potential of contrastive steering for jailbreaking Large Language Models (LLMs). However, existing methods rely on limited and inherently biased contrastive prompts and require laborious manual tuning of steering strength, limiting their robustness and effectiveness. In this paper, we leverage the idea of model extraction to guide the learned steering vectors to approximate the ideal one and propose tuning the steering strength adaptively based on contrastive activations' statistics. Experiments demonstrate that our method notably improves the effectiveness and robustness of probe-based steering, without any extra contrastive prompts or laborious manual tuning. Being an attack paper, this paper focuses on revealing the breakdown of fortified LLMs, raising the average harmfulness score from 6\% to 70\%. Our code is available at https://github.com/fhdnskfbeuv/adaptiveSteering.

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20287v1 Announce Type: new Abstract: Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20289v1 Announce Type: new Abstract: ANN-to-SNN conversion offers a practical, training-free route to spiking large language models. However, current pipelines primarily focus on spike-driven realizations for Transformer linear-algebra operations, while providing limited support for key nonlinear operators. This gap limits compatibility with neuromorphic-style execution constraints, where such nonlinearities typically require division, exponentiation, or norm computations that are not naturally supported by standard leaky integrate-and-fire dynamics. To solve this problem, we propose a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities and integrates into existing ANN-to-SNN pipelines. Our method decomposes these nonlinear computations into three recurring primitives -- division, exponentiation, and $\ell_2$ norms -- and realizes them via population computation using LIF neuron groups, combined with lightweight bit-shift scaling to avoid floating-point arithmetic. By composing these primitives as modular operator blocks, our framework supports common Transformer nonlinearities (e.g., Softmax, SiLU, and normalization) without any fine-tuning. Experiments on a range of LLMs Transformers show that selectively replacing the targeted nonlinear operators incurs less than a $1\%$ accuracy drop across all evaluated tasks.

TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20290v1 Announce Type: new Abstract: Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scenelevel multi-object interactions and introduces richer, complex control types for advanced mechanicsbased manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews paired while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Fatemeh Pesaran zadeh, Seyeon Choi, Xing Han L\`u, Siva Reddy, Gunhee Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20291v1 Announce Type: new Abstract: Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domain often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.

TreeText-CTS: Compact, Source-Traceable Tree-Path Evidence for Irregular Clinical Time-Series Prediction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20292v1 Announce Type: new Abstract: Numerical time-series models can effectively process irregular electronic health record (EHR) trajectories, but they do not naturally expose the measurements and temporal patterns supporting each risk estimate as readable evidence. Existing text-based interfaces improve readability, but typically rely on either raw serialization, which is lengthy and redundant, or patient-level free-form summaries, which are difficult to trace to source measurements and time windows. To bridge this gap, we introduce TreeText-CTS (Clinical Time-Series), which converts irregular EHR trajectories into human-readable, compact, source-traceable tree-path evidence units without patient-level summarization or inference-time autoregressive decoding. TreeText-CTS routes multi-scale window summaries through frozen XGBoost models and verbalizes activated tree paths as deterministic, source-traceable evidence units composed of threshold conditions. An evidence selector assembles an informative subset of these units, which a language-model encoder then integrates for prediction. Across PhysioNet 2012 mortality, MIMIC-III mortality, and PhysioNet 2019 sepsis-onset forecasting, TreeText-CTS achieves the best AUROC and AUPRC among evaluated text-based EHR time-series interfaces, improving AUPRC by 6.0 to 9.7 absolute percentage points over the strongest prior text-based interface while remaining competitive with numerical time-series models. Ablations show that tree-path evidence construction, evidence selection, and language-model composition each contribute to performance. Because every span passed to the language-model encoder is constructed from activated tree-path threshold conditions, TreeText-CTS makes the evidence supplied to the final predictor inspectable and source-traceable.

Closed-form predictive coding via hierarchical Gaussian filters

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20293v1 Announce Type: new Abstract: Predictive coding (PC) offers a local and biologically grounded alternative to backpropagation in the training of artificial neural networks, yet to date, it remains slower, and performance degrades sharply as network depth increases. We trace both problems to a single simplification: current PC networks fix the precision matrix to the identity, discarding precision-weighted prediction errors that the variational derivation requires to be fast, local, and Bayesian. We close this gap by expressing predictive coding networks as deep hierarchical Gaussian filters (HGFs) and restore precision-weighted message passing, yielding dynamic uncertainty estimates and Hebbian-compatible update rules at every layer. The resulting networks can simultaneously learn activations, weights, and precisions under a single free-energy objective, with no global error signal, and resolve inference without requiring iterations or automatic differentiation. On FashionMNIST, our solution approaches backpropagation in epoch-level wall-clock cost while converging in fewer epochs, and outperforms it on online, data efficiency, and concept-drift tasks. We thus establish that closed-form variational inference with online precision learning provides a tractable foundation for deep predictive coding networks, retaining biological and interpretative advantages, without requiring iterative relaxation or global error signals.

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20295v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, a integer-only fully static quantization framework. It incorporates learnable quantization parameters and rotation matrices, enabling low-bit activation-weight quantization without runtime quantization parameters re-computation. Crucially, we identify that initialization and selective optimization of quantization parameters is pivotal for optimization stability, as improper initialization and naive joint optimization induce gradient instability that disrupts the optimization of rotation matrices. To address this, we propose a rotation-and-bit-width-aware initialization tailored to diverse activation profiles and a distribution-aware selective optimization (two-stage quantization pipeline) tailored to rotated and unrotated tensors. Furthermore, we introduce a sensitivity-guided adaptive mixed-precision scheme to balance accuracy with inference efficiency. Extensive experiments on real-world mobile NPUs demonstrate that Quant.npu achieves comparable accuracy to state-of-the-art methods, while reducing inference latency by up to 15.1%.

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

Aarash Abro, Muhammad Tahir — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20296v1 Announce Type: new Abstract: Fine-tuning a language model for a target task routinely degrades capabilities the training data never explicitly threatened. We study this phenomenon, known as catastrophic forgetting, and propose a post-hoc repair solution that uses only the pretrained checkpoint $W_{\mathrm{base}}$ and its fine-tuned descendant $W_{\mathrm{ft}}$. The goal is not merely to revert the model toward the base checkpoint, but to recover capabilities damaged by fine-tuning while preserving both the target-task gains and any beneficial held-out improvements. We introduce DG-Hard, a checkpoint-only spectral repair method for the fine-tuning update $\Delta = W_{\mathrm{ft}} - W_{\mathrm{base}}$. DG-Hard treats $\Delta$ as a low-rank task-aligned signal embedded in an IID-like noise residual that gradient descent has no incentive to remove, and applies the Donoho-Gavish hard singular-value threshold to each weight-delta matrix, keeping the structured high-energy part of the update and removing the spectral bulk. This reduces repair to a closed-form SVD filtering step requiring no data-dependent tuning. A central difficulty is evaluation: average accuracy hides per-benchmark failures, while naive recovery scores reward models that simply revert toward the base. We therefore introduce a partition-conditional metric that separately tracks healing, preservation, non-damage, and target-task retention. Across $14$ (model, task) settings and nine cross-domain held-out benchmarks, DG-Hard achieves the strongest balanced repair among post-hoc baselines. DG-Hard also restores safety alignment degraded by benign fine-tuning on three independent safety axes, despite using no alignment data. These results suggest that part of fine-tuning-induced capability loss is not an unavoidable consequence of specialization, but a removable spectral residue in the weight update itself.

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

Ziyuan Gao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20297v1 Announce Type: new Abstract: Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6$\times$ fewer parameters. Code is available at https://github.com/zygao930/MedCRP-CL.

Stacked Intelligent Metasurfaces for Resolution-Constrained Near-Field Range Extension in 6G Systems

Yajun Zhao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20298v1 Announce Type: new Abstract: Near-field electromagnetic focusing is central to 6G communication, sensing, and integrated sensing and communication (ISAC) systems. However, for a fixed aperture, the resolution-constrained usable range of conventional single-layer transmissive metasurfaces is far shorter than the classical Rayleigh distance. This discrepancy stems not from fundamental near-field physics limitations, but from inadequate wavefront control, implementation imperfections, and the quadratic degradation of axial resolution with distance.To quantify this gap, we distinguish the Rayleigh distance from the engineering-usable near-field distance (UNFD), defined as the maximum range where predefined focusing gain and resolution requirements are jointly satisfied. Under identical aperture, feed, and input power constraints, we investigate how stacked intelligent metasurfaces (SIMs) extend UNFD via cascaded wavefront shaping. A unified framework combining effective-phase-distance and discrete Green's-function operator perspectives is developed, interpreting multilayer SIM design as an operator approximation problem for ideal near-field focusing.We derive low-complexity analytical models revealing an inherent distance-resolution dilemma: lateral resolution degrades linearly with distance, while axial resolution degrades quadratically, making axial performance the dominant bottleneck. Multilayer stacking mitigates this by improving wavefront curvature matching and reducing residual phase error. Engineering correction factors for practical imperfections and a higher-order phase framework for extreme near-field operation are incorporated. Numerical simulations confirm that increasing layer count pushes UNFD closer to the Rayleigh limit, but gains saturate as accumulated losses offset control flexibility benefits.

Mechanisms of Misgeneralization in Physical Sequence Modeling

Kento Nishi, Raphael Tang, Karun Kumar, Core Francisco Park, Hidenori Tanaka — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20299v1 Announce Type: new Abstract: Generative sequence models are often trained to plan motion in physical domains, from robotics to mechanical simulations. When constructing a dataset to train such a model, engineers may curate demonstrations to specify how trajectories should be distributed over a physical quantity like travel distance or mechanical energy. For example, a roboticist building a maze navigation agent might choose demonstrations whose travel distances cover a fixed range uniformly, hoping to constrain the agent's expected power usage. We find that standard deep learning can violate this intent: each generated trajectory can seem plausible on its own, but the aggregate distribution over the physical quantity is wrong. We call this failure physical misgeneralization, and develop an account of its mechanism. Using controlled synthetic tasks, we show that physical misgeneralization arises when local errors typical of the model class propagate through the physical measurement to shift the recovered distribution. We estimate these errors with a data deviation kernel, and we use it to predict which physical quantities gain or lose mass in both our synthetic and more applied maze navigation and double-pendulum motion tasks. Finally, our mechanistic interpretation helps identify which mitigation strategies are structurally promising, and we use it to propose a kernel-informed intervention.

Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

Zheng Zhai, Xiaohui Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20300v1 Announce Type: new Abstract: In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20301v1 Announce Type: new Abstract: In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20302v1 Announce Type: new Abstract: Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method, prototype contrast on the unit hypersphere, and that closing the gap requires fixing each at its specific point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and lower mCE on ImageNet-C, recasting supervised learning as prototype learning on the hypersphere, with NC reached by design on both paths.

AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

Zhijie Yang, Min Tang, Qiang Zou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20303v1 Announce Type: new Abstract: Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

Terrestrial Soft Mobile Robots: A Review

Dimuthu D. K. Arachchige — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20304v1 Announce Type: new Abstract: Soft mobile robots have emerged as a promising area of research with potential applications in various disciplines including but not limited to search-and-rescue, service, surveillance, explorations, and manufacturing. In this article, we provide a comprehensive review of the current state of soft mobile robot research, focusing on wheelless terrestrial locomotive systems. We include past and present developments in locomotion strategies, actuation methods, modeling approaches, and control systems. Further, we identify key research challenges that must be overcome to enable the widespread adoption of soft mobile robots in various applications. Overall, this article provides a valuable resource for researchers and practitioners interested in the field of soft mobile robots and soft robotics.

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20306v1 Announce Type: new Abstract: We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.

SDM: A Powerful Tool for Evaluating Model Robustness

Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20308v1 Announce Type: new Abstract: Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To achieve such an effect, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and proposes a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20309v1 Announce Type: new Abstract: Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

Combined Program Analysis Techniques: A Systematic Mapping Study

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20310v1 Announce Type: new Abstract: Context. Since the eighties, the combination of program analysis techniques has been increasingly recognized as a promising approach to overcome the limitations of standalone methods. While individual techniques, based on either static or dynamic analysis, address important challenges in software dependability, their integration often yields synergistic effects on precision, coverage and insights. Objective. This paper surveys a significant portion of the modern literature on combining program analysis techniques, consisting of 248 primary studies, with the aim of cataloging the types of interactions and synergies that were exploited to define combined-program-analysis techniques so far. The goal is to provide a structured understanding of why and how program analysis techniques can be conjoined, and which benefits can arise from their interactions. Method. We devise an original taxonomy that classifies combined-program-analysis techniques according to their aimed synergistic effects, inter-analysis workflows and interaction schemata (to which we refer to as mapping functions). We then map the primary studies to the taxonomy, answering research questions on which synergistic effects those studies pursued via the combination of analysis techniques, which inter-analysis workflows they embodied, and which types of mapping functions they exploited. Conclusion. Our taxonomy and literature mapping reveal the commonalities and the differences, in terms of goals and patterns, in the design of combined-program-analysis techniques. Thereby we provide a framework of concepts that can foster the ability of researchers and practitioners to reason on existing combined-program-analysis techniques, and steer further research on new useful combined-program-analysis techniques and analysis frameworks.

WaveGraphNet: Physics-Consistent Guided-Wave Damage Localization through Coupled Inverse-Forward Graph Learning

Vinay Sharma, Aditya Bharade, Olga Fink — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20311v1 Announce Type: new Abstract: Guided-wave structural health monitoring enables damage localization in composite plates using sparse networks of bonded piezoelectric transducers. However, inferring the spatial location of defects from pitch-catch measurements remains weakly constrained when only a limited set of damage locations is available for training. As a result, models trained to predict defect locations may perform well on seen cases but generalize poorly to unseen regions of the structure. This paper proposes WaveGraphNet, a coupled inverse--forward graph learning framework for guided-wave damage localization in Carbon Fiber Reinforced Polymer (CFRP) plates. The sensing layout is explicitly modeled as a graph, where transducers are represented as nodes and measured propagation paths define the graph connectivity. An inverse branch maps graph-structured spectral descriptors of differential guided-wave responses to a damage location, while a forward branch predicts the path-wise energy-deviation patterns of measured wave responses associated with a candidate location. During training, the forward branch serves as a physics-consistent regularizer, discouraging location estimates that are numerically plausible but inconsistent with the measured redistribution of wave-response energy. This coupling encourages agreement between inferred damage coordinates and the underlying wave propagation behavior. Within this benchmark, the proposed graph-based formulation provides a strong localization model for sparse guided-wave sensing and demonstrates improved robustness in extrapolation to held-out regions compared to both non-graph and graph baselines. These results highlight the potential of coupled inverse-forward graph learning as an effective strategy for guided-wave localization under limited spatial coverage.

Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks

Ravi Kiran Kadaboina — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20312v1 Announce Type: new Abstract: Autonomous agents deployed in regulated domains must produce a verification artifact per consequential output: a record an auditor can re-execute offline, capturing what was claimed, against what source, by whom, when, and how. Production verification today splits into two unstandardized halves. Probabilistic verdict patterns (self-consistency voting, reviewer LLM ensembles) produce judgments, not artifacts. Artifact-producing patterns (RAG, tool-augmented traces, generator-verifier loops) produce vendor-specific records no external auditor can reconstruct without bespoke integration. Pramana defines the missing wire format. Every consequential agent output is wrapped in a typed ClaimAttestation with one of four variants (measurement, inference, analogy, citation), each paired with a verify() operation against the recorded source. verify() is deterministic for MeasurementClaim and CitationClaim. For InferenceClaim and AnalogyClaim, determinism is conditional on the oracle (audit-replayable when LLM-backed). The four-way typology derives from classical Indian epistemology (pramana, valid means of knowledge). The lifecycle is specified in TLA+ and exhaustively verified under TLC across three symmetry-reduced models: 38,563 distinct reachable states, zero invariant violations. The Python reference implementation passes 84 tests. An A2A and MCP wire-extension manifest layers three deployment-grade invariants: reachability, SLA bound, and offline re-verifiability. An exploratory pilot (n=100, 2,275 reviewer calls) probes LLM-as-judge in code generation. The strongest observation is a 40-percentage-point raw FPR delta across corpora, consistent with reference-solution quality contributing significantly. The pilot does not validate Pramana on its own; the structural argument and formal verification do that.

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

Jingwen Liu, Ezra Edelman, Surbhi Goel, Bingbin Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20314v1 Announce Type: new Abstract: This work investigates the ``small-vs-large gap'', where repeating on fewer samples can lead to compute saving during training compared to using a larger dataset. This is observed across algorithmic tasks, architectures and optimizers and cannot be explained using prior theory. We argue that the speedup comes from appropriate layer-wise growth enabled by sampling biases, which is more pronounced when the dataset size is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results suggest that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive biases for optimization, particularly in reasoning tasks.

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Haiquan Lu, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20315v1 Announce Type: new Abstract: LLM agents have recently emerged as a powerful paradigm for solving complex tasks through planning, tool use, memory retrieval, and multi-step interaction. However, these agentic workflows often introduce substantial input-side overhead, making the compute-intensive prefilling stage a key bottleneck in long-context, multi-turn inference. In this work, we propose Mix-Quant, a simple and effective phase-aware quantization framework for fast agentic inference. We first investigate FP4 quantization in agentic LLM workflows and observe that quantizing the entire inference process can incur significant performance degradation. In contrast, the prefilling stage exhibits substantial quantization redundancy and can therefore be quantized with minimal accuracy loss, despite being the dominant source of computation. Based on this insight, we apply high-throughput NVFP4 quantization to the prefilling phase while preserving BF16 precision for decoding. By decoupling prefilling acceleration from decoding quality, Mix-Quant combines phase-aware algorithmic quantization with hardware-efficient NVFP4 execution to alleviate the inference bottleneck in LLM agents. Extensive experiments across long-context and agentic benchmarks demonstrate that Mix-Quant largely preserves task performance while delivering significant efficiency improvements, achieving up to a 3x speedup during prefilling.

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation

Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20316v1 Announce Type: new Abstract: Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision--language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision--language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text$\rightarrow$image, image$\rightarrow$text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text$\rightarrow$image FID from $62.7$ to $31.6$ and image$\rightarrow$text CIDEr from $2.0$ to $99.4$ over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ${\sim}84$\,GB to ${\sim}38$\,GB and raising throughput by ${\sim}8\times$ on two RTX A5000 GPUs in under 24 hours, training only ${\sim}5\%$ of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision--language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20337v1 Announce Type: new Abstract: How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than $15{,}000$ behavioral responses, analyzing the $13{,}400$ responses from the $377$ participants who passed our pre-specified quality checks. Foundation models are consistently *less* interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20341v1 Announce Type: new Abstract: Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd) where k<

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20342v1 Announce Type: new Abstract: Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

Symmetrization of Loss Functions for Robust Training of Neural Networks in the Presence of Noisy Labels

Alexandre Lemire Paquin, Brahim Chaib-Draa, Philippe Gigu\`ere — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20347v1 Announce Type: new Abstract: Labeling a training set is often expensive and susceptible to errors, making the design of robust loss functions for label noise an important problem. The symmetry condition provides theoretical guarantees for robustness to such noise. In this work, we study a symmetrization method arising from the unique decomposition of any multi-class loss function into a symmetric component and a class-insensitive term. In particular, symmetrizing the cross-entropy loss leads to a linear multi-class extension of the unhinged loss. Unlike in the binary case, the multi-class version must have specific coefficients in order to satisfy the symmetry condition. Under suitable assumptions, we show that this multi-class unhinged loss is the unique convex multi-class symmetric loss. We also show that it has a fundamental local role: the linear approximation of any symmetric loss around score vectors with equal components is equivalent to the multi-class unhinged loss. We then introduce SGCE and alpha-MAE, two loss functions that interpolate between the multi-class unhinged loss and the Mean Absolute Error while allowing control of the beta-smoothness of the loss. Experiments on standard noisy-label benchmarks show competitive performance compared with existing robust loss functions.

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

Richard J. Young, Gregory D. Moody — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20351v1 Announce Type: new Abstract: The evaluation of large language model refusal on malicious-coding tasks now spans at least thirteen publicly released prompt corpora (AdvBench, the CyberSecEval family, RMCBench, RedCode, MCGMark, JailbreakBench, CySecBench, MalwareBench, CIRCLE, MOCHA, ASTRA, Scam2Prompt / Innoc2Scam-bench, and JAWS-Bench), each constructed under a different protocol, released under different licensing terms, and validated (or not) against different inter-rater reliability standards. Existing surveys treat code security, jailbreak taxonomy, or vulnerability detection as the central object and mention these corpora only in passing. This paper reverses that framing: it treats the prompt datasets themselves as the unit of analysis. Following a PRISMA-style protocol, we specify a search strategy, screen the recent literature on coding-LLM refusal evaluation, apply a uniform extraction template to each in-scope corpus, and synthesize the resulting catalogue along construction methodology, prompt-construction taxonomy (modality, turn structure, elicitation style), reproducibility and licensing, and malware-category coverage. The synthesis surfaces three recurring methodological gaps: the absence of human-annotator baselines against which LLM-judge labels can be calibrated, the absence of cross-corpus comparability with refusal-rate statistics measuring non-equivalent constructs, and the fragmentation of malware-category taxonomies, with no canonical schema spanning the thirteen in-scope corpora. The review concludes with proposed methodological directions for next-generation corpora, including pre-registration of inclusion criteria, vendor-diverse multi-judge validation, Fleiss' kappa with bootstrap CI as the reliability baseline, and a candidate canonical taxonomy.

Synchronous and Asynchronous Parallelism Approaches for Generalized Canonical Polyadic Tensor Decomposition with GenTen

Jeremy M. Myers, Eric T. Phipps — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20353v1 Announce Type: new Abstract: The Canonical Polyadic (CP) tensor decomposition is a well-known method for interpretable analysis of high-dimensional data. Recently, the Generalized CP method (GCP) was introduced by Hong and Kolda to allow for flexible choice of the loss function in the optimization problem defining the CP model, enabling more interpretable decompositions of strongly non-Gaussian data such as count or binary data. Furthermore, Kolda and Hong introduced a version of GCP that leverages randomization and stochastic optimization to address scalability to large, sparse data sets. In this work, we take these ideas a step further and consider synchronous and asynchronous algorithms for parallel GCP tensor decomposition through the GenTen software package, exploiting both shared and distributed memory parallelism. We build on shared memory parallel CP decomposition algorithms utilizing Kokkos for portability across CPU and GPU architectures to support the random sampling and stochastic optimization methods required by GCP. We then couple this approach to the well-known medium-grained distributed memory parallelism scheme developed for traditional CP decompositions through MPI, providing a synchronous, hybrid MPI+Kokkos, parallel GCP decomposition capability. Finally, we propose an asynchronous distributed parallelism approach building on related techniques for federated learning to achieve even better scalability to large data sets. We study the effectiveness of the proposed synchronous and asynchronous approaches vis-a-vis computational cost and accuracy on synthetic and publicly-available real-world datasets of varying sizes, dimensions, and sparsity patterns using several loss functions.

Proximal State Nudging: Reducing Skill Atrophy from AI Assistance

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20355v1 Announce Type: new Abstract: Skill atrophy, the gradual decline of human capability under AI assistance, poses a safety risk in shared-control of semi-autonomous systems, where operators may be unable to distinguish their own inputs from autonomous corrections. We propose Proximal State Nudging (PSN), a shared autonomy algorithm that jointly optimizes for skill development and task performance by nudging users toward states estimated to be most learnable. We first show that PSN outperforms existing shared autonomy baselines in balancing student improvement in unassisted reward with overall shared performance, using simulated students in the classic LunarLander environment. We then present, to the best of our knowledge, the first human subject studies of a planner incorporating learning-compatible shared autonomy: across two driving tasks in the CARLA simulator (High Performance Racing and Parallel Parking, n = 60), PSN produces up to 7x larger gains in unassisted skill than standard blended shared autonomy, while incurring 50% fewer collisions than unassisted self-practice.

Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models

Pablo Riera, Pablo Brusco, Cristina Kuo, Marcelo Sancinetti, S. R. K. Branavan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20356v1 Announce Type: new Abstract: Full-duplex spoken dialogue models (SDMs) can listen and speak simultaneously, enabling interaction dynamics closer to human conversation than turn-based systems. Inspired by neural coupling in human communication, we study how such models coordinate their internal representations during interaction. We simulate full-duplex dialogues between two instances of the pretrained \textit{Moshi} model under controlled conditions, manipulating channel noise and decoding bias. Synchronization is measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues are probed from delayed internal activations using causal LSTM models, from both speaker and listener perspectives. We find strong representational synchronization under no noise conditions, peaking near zero lag and degrading with noise, and we show that internal states encode anticipatory information that supports turn-taking prediction ahead of time.

Consistently Informative Soft-Label Temperature for Knowledge Distillation

Hoang-Chau Luong, Nghia Van Vo, Kaiqi Zhao, Lingwei Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20357v1 Announce Type: new Abstract: Knowledge distillation (KD) transfers knowledge from a high-capacity teacher to a compact student by matching their predictive distributions, with temperature scaling serving as a central mechanism for smoothing teacher predictions and exposing informative "dark knowledge" beyond the hard label. However, the standard fixed-temperature design is inherently sample-agnostic. Since samples differ in logit scale and learning difficulty, a single global temperature produces teacher soft labels with highly inconsistent entropy: some predictions remain overly sharp and provide limited inter-class information, whereas others become over-smoothed and lose class-discriminative information. Moreover, sharing the same temperature between teacher and student further imposes rigid logit-scale alignment despite their capacity mismatch. To address these limitations, we propose CIST (Consistently Informative Soft-label Temperature), which assigns separate sample-wise adaptive temperatures to the teacher and student. This design produces consistently informative teacher soft labels while relaxing rigid teacher--student logit-scale matching. It also reweights the distillation objective according to teacher confidence and student learning difficulty. Theoretically, we show that teacher-label entropy is largely governed by the ratio between the maximum teacher logit and the temperature, providing a principled basis for adaptive smoothing. Empirically, CIST mitigates the inconsistency induced by fixed temperature, and experiments on both vision and language distillation tasks show consistent improvements over standard KD and strong baselines with negligible computational overhead.

HAPS: Rethinking Image Similarity for Virtual Staining

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20362v1 Announce Type: new Abstract: Virtual staining of histopathology images (e.g., H&E-IHC) is an emerging tool in digital pathology, enabling faster and cheaper workflows by synthesizing target stains from routinely acquired slides. Yet, the quality of virtual staining models is still predominantly assessed with generic metrics such as SSIM, PSNR, and LPIPS. Originally developed for natural images, these metrics are inherently misaligned with the domain-specific characteristics of histological data, failing to capture tissue morphology preservation and biomarker expression patterns. Consequently, a robust, domain-specific standard for quantifying similarity across diverse histological modalities remains a critical gap in the field. In this work, we formalize histology image similarity as a standalone problem and systematically evaluate a broad set of full-reference metrics against a dataset of H&E-IHC patch pairs annotated with expert similarity scores. We further analyze metrics sensitivity to controlled geometric distortions (shifts, rotations and non-rigid deformations) that mimic realistic registration errors between serial sections. Guided by these observations, we propose the Histology-Aware Perceptual Similarity (HAPS) metric. HAPS computes distances in the feature space of a frozen encoder pretrained on histopathology data, adding a linear head to aggregate feature-level differences into a final score that aligns with expert assessments. Finally, we demonstrate the practical value of HAPS for quality control of training data. By quantifying the similarity of training pairs in the MIST dataset and filtering low-scoring samples, we create a cleaner training set. Virtual staining models trained on this refined data outperform those trained on the original, unfiltered dataset.

Mapping the Winds of Stance Dynamics using Potential Landscape Models

Benjamin Steel, Derek Ruths — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20363v1 Announce Type: new Abstract: From changing fashion trends to views on world leaders and economic policies, large-scale shifts in group positions happen regularly and unexpectedly. How can we track these in the wild? How can we characterize them? Existing work has primarily leveraged stance detection to track shifts of specific groups on a single issue. However, such methods will only find shifts when they accurately pick exactly the right group and right issue. They do not capture the multi-dimensional, multi-resolution stance landscape in which these shifts actually happen. To better model drift and shift in public opinion, we require a framework that can track change at the population level, across a diverse range of issues. We propose a method to infer the potential landscape of stance dynamics, the gradient of which shows large-scale stance shifts, and apply it to show en-mass stance shifts by prominent Canadian political figures across multiple platforms and years. We do this using large-scale stance detection to find stance expressions, dimensionality reduction to find the low-dimensional linear latent space, and potential landscape neural networks to find the potential landscape of that space. This allows us to find a coherent, linear, three-dimensional space that explains 45\% of the variance in stance, where we can explain the specific characteristics of each dimension. We show that while the predictive performance is sufficient to validate its descriptive-ness, in practice its predictive performance is mixed.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

Jinlong Liu, Mohammed Bahja, Mark Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20364v1 Announce Type: new Abstract: Automatic evaluation of long-form literary writing remains challenging, as generic LLM-as-Judge approaches may not fully capture creativity-related dimensions such as originality and flexibility. Although the Torrance Test of Creative Writing (TTCW) provides a structured creativity framework, and prior work has demonstrated reference-based TTCW evaluation at the pairwise level, no large-scale dataset exists for long-form TTCW-based literary review generation. We address this gap by constructing a dataset of 263,911 long-form stories, each annotated with scalar scores and meta-synthesised review comments across 14 TTCW-based dimensions. Using this dataset, we fine-tune Qwen3 models at two scales, 4B and 8B, under two conditions: with and without reasoning content. Results show that non-reasoning fine-tuning achieves stronger and more stable performance, with the best setting reaching an evaluation score of 0.6820. Further analysis shows that reasoning-supervised models are more prone to parse failures, often continuing with irrelevant or repetitive reasoning-style text rather than completing the required 14-metric review report. These results suggest that, for fixed-format rubric-based review generation, reasoning supervision is not straightforwardly beneficial, and precise metric-aligned scoring remains challenging even after task-specific fine-tuning.

A Penalty-Free Asymmetric Nitsche's Method for Edge Elements

Tianwei Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20367v1 Announce Type: new Abstract: We show the stability of a penalty-free asymmetric Nitsche's method using N\'ed\'elec edge elements for solving curl-curl-type problems with tangential Dirichlet boundary conditions imposed weakly. The main result is an inf-sup stability estimate for the asymmetric bilinear form under an isolated patch condition on the tetrahedral mesh. Applications to a curl-elliptic problem and a magnetic advection-diffusion problem are discussed.

Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System

Ivan Dobrovolskyi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20368v1 Announce Type: new Abstract: Organizations that scan documents for sensitive information face a practical problem. Cloud services require data to be sent to external infrastructure, while rule-based tools often miss threats that depend on context. This study presents TorchSight, an open-source local system for security document classification built around a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data covering seven security categories and 51 subcategories. In the main evaluation on 1,000 documents, the model reached 95.0% category-level accuracy (95% confidence interval: 93.5-96.2). The tested commercial models scored 75.4-79.9% under the same prompting protocol. On a separate external set of 500 held-out samples, the model reached 93.8% accuracy, which suggests that performance extends beyond the main benchmark, although the margin depends on dataset composition and difficult boundary cases. The results show that a fine-tuned local model can support accurate security document classification while keeping document processing under local control.

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20369v1 Announce Type: new Abstract: Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL

Clove: Object-Level CXL Memory Management in Managed Runtimes

Sam Son, Zhihong Luo, Wen Zhang, Sylvia Ratnasamy, Scott Shenker — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20370v1 Announce Type: new Abstract: Object-level management of tiered memory has been studied to address the inefficiencies in page-based systems. However, object-level management for CXL-tiered memory remains underexplored due to CXL's tight performance budget and load/store interface. As a result, existing approaches remain limited in scope, primarily targeting unmanaged-language applications with bespoke runtimes or compiler support. This paper identifies and explores a new design point for object-level CXL management: managed languages and their runtimes. The key observation is that existing managed runtimes already provide highly optimized mechanisms for problems closely related to object-level management, including object relocation and dynamic code generation. However, they still lack the features needed for tiered memory management, such as hotness tracking and relocation policies, and thus must be carefully extended to fully realize this direction. We present Clove, a system that extends existing managed runtimes to support object-level CXL management for managed-language applications. Clove combines profile-guided object hotness tracking with object relocation techniques and policies. Our JVM prototype demonstrates that this extension enables high utilization of fast-tier memory while bounding runtime overhead, reducing application slowdown by 22-84% compared to page-based systems.

Arbitrary-order structure-preserving discretizations for geometric curvature flows

Ganghui Zhang, Boris D. Andrews, Patrick E. Farrell — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20371v1 Announce Type: new Abstract: Geometric flows, where an immersed manifold evolves in time according to its own geometry, exhibit important structural properties. For example, surface diffusion dissipates surface area while conserving volume; it is desirable to preserve these properties on discretization. This has motivated a substantial body of research on structure-preserving discretizations for these flows, albeit at low order in time. In this work, we present the first discretization of geometric curvature flows (curve shortening/mean curvature flow and curve/surface diffusion) that preserves the evolution of area and volume at arbitrary order in space and time. The key idea is to introduce auxiliary variables in a particular way so that the derivation of the area dissipation law can be replicated after discretization with continuous Petrov--Galerkin in time. These auxiliary variables are indicated by a general strategy for structure-preservation in time that applies to many other problems. The proposed scheme also preserves mesh quality in the same manner as the minimal deformation rate strategy. We demonstrate its structure-preserving properties and high-order convergence on several benchmark examples.

Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities

Irem Ulku, \"O. \"Ozg\"ur Tanr{\i}\"over, Erdem Akag\"und\"uz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20372v1 Announce Type: new Abstract: Multimodal semantic segmentation benefits remote sensing analysis by combining complementary information from different sensor modalities. In real-world remote sensing applications, one or more modalities may be unavailable due to sensor failures, adverse atmospheric conditions, or data acquisition problems. Even with pretrained multimodal representations and existing fine-tuning or adaptation strategies, performance may remain limited because all modality availability scenarios are typically treated as equally informative during training. In this paper, we propose a novel training strategy that learns a scenario sampling distribution directly from the pretrained latent space. Instead of relying on uniform random modality dropout, the proposed method guides fine-tuning toward more informative modality availability scenarios. More specifically, we quantify the effect of each scenario independently based on the distortion it induces in the shared latent representation. We then capture scenario relations using a radial basis function kernel and derive refined scenario scores through a regularized kernel smoothing. These scores are then converted into a probability distribution during scenario sampling for fine-tuning. We evaluate this strategy on three remote sensing image sets, namely DSTL, Potsdam, and Hunan, using CBC-SLP, CBC, and CMX backbones. The experimental results with different image sets and backbones show that our method outperforms standard fine-tuning and LoRA-based adaptation. These findings suggest that the pretrained latent representation can serve as an effective basis for sampling during missing modality fine-tuning. Code is available at https://github.com/iremulku/Latent-Space-Guided-Scenario-Sampling

SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20373v1 Announce Type: new Abstract: Building humanoid robots capable of generalizable whole-body loco-manipulation in the real world remains a fundamental challenge. Existing methods either rely on laborious task-specific reward engineering, rigidly replay reference motions that fail to generalize, or depend on costly teleoperation that limits scalability. While human videos capture diverse human behaviors, motion priors inferred from them are inherently imperfect, suffering from occlusion, contact artifacts, and retargeting errors that render them unsuitable for direct policy learning. To address this, we present SUGAR, a scalable data-driven framework that converts diverse human videos into deployable humanoid loco-manipulation skills, without any task-specific reward engineering or reference-motion conditioning at inference. SUGAR proceeds in three stages. First, a fully automated pipeline extracts kinematic interaction priors including human-object motion trajectories and contact labels from unstructured human videos. Second, a privileged physics-based refiner uses a unified mimic reward and progressive state pool to transform imperfect priors into physically feasible, high-fidelity skills. Third, refined skills are distilled into a hierarchical autonomous policy consisting of a command generator and a command tracker. We evaluate SUGAR on six representative loco-manipulation tasks in simulation and real-world humanoid hardware. Our method substantially outperforms reference-tracking baselines, and performance scales clearly with the amount of human video data. It also achieves zero-shot real-world transfer with reliable closed-loop execution, autonomous failure recovery, and stable long-horizon performance under external perturbations. Project Page: https://tianshuwu.github.io/sugar-humanoid/

Super-Beamforming in Holographic MIMO

Andrea Pizzo, Angel Lozano — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20377v1 Announce Type: new Abstract: The conventional linear scaling of beamforming gain with the number of antennas is not a fundamental physical limitation, but rather a consequence of the half-wavelength spacings that minimize mutual coupling. Relaxing this constraint facilitates beamforming gains exceeding those of uncoupled arrays along specific directions. This paper shows that, when antenna losses remain sufficiently small, mutual coupling enables the synthesis of super-beams whose endfire gain scales quadratically with the number of antennas. Notably, this quadratic scaling does not necessarily require vanishing spacings, but emerges for spacings slightly below half wavelength as the array aperture increases.

A Meshtastic-based LoRa Mesh System for Smart Campus Applications: From Solar-Powered Sensing to Containerized Data Management

Rafael Garzon Andosilla, Jos\'e de Jes\'us Rugeles Uribe — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20379v1 Announce Type: new Abstract: This work presents the design, implementation, and evaluation of a LoRa-based mesh network using the Meshtastic protocol for Smart Campus applications at Universidad Militar Nueva Granada (UMNG). The system integrates heterogeneous hardware nodes including a solar-powered ecological sensing node built around a Raspberry Pi Pico and a Semtech SX1262 transceiver, and mobile trackers based on the Seeed SenseCAP T1000-E managed through a containerized edge gateway running on a Raspberry Pi 4. A Docker Compose microservices stack handles data ingestion via Node-RED, time-series storage in InfluxDB, and real-time visualization through Grafana dashboards. The architecture's performance was evaluated under realistic propagation scenarios at the UMNG Cajic\'a campus, characterizing link quality using Received Signal Strength Indicator (RSSI) and Signal-to-Noise Ratio (SNR) metrics. Experimental results demonstrate robust mesh connectivity across key university facilities, including an extended-range link of approximately 2.47 km linking the campus gateway to a remote station at Mirador La Cumbre (n = 62 packets received, mean RSSI = -110 dBm, mean SNR = +2.75 dB). This architecture demonstrates that open-source mesh protocols combined with containerized microservices offer an autonomous, highly reproducible infrastructure for environmental monitoring and asset tracking, supporting the transition toward data-driven "Smart Campus" ecosystems without reliance on centralized commercial LoRaWAN operators.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

Carolina Camassa, Derek Shiller — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20382v1 Announce Type: new Abstract: Language models are trained to follow instructions, but they are also powerful pattern completers. What happens when these two objectives conflict? We construct conversations in which a user instruction to behave in a target way T (e.g., always output a specific token, answer in a particular language, or adopt a persona) is opposed by N hardcoded assistant turns demonstrating a competing pattern P. We then measure instruction-following (IF) rates in this setting, across 13 models and 16 different instructions, for up to 50 turns. Average instruction-following rates range from 1% to 99% across models, largely uncorrelated with standard capability benchmarks. The transition from instruction-following to pattern-following is universal but highly model-dependent. Robustness is modulated both by instruction content, with models resisting induction longer when instructions align with their trained value priors, and by output format, with diverse multi-token responses proving substantially more resistant than single-token outputs. Chain-of-thought reasoning improves robustness but does not eliminate susceptibility, and can produce dissociation between correct deliberation and incorrect output. When asked to predict their behavior in this setting, models achieve 83.5% accuracy on average but systematically underestimate their own resistance to induction pressure. These results suggest that instruction-following remains brittle under induction pressure even for otherwise capable models, and that output diversity, rather than semantic engagement with the input, is the primary factor predicting robustness.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20385v1 Announce Type: new Abstract: Recent progress in promptable segmentation has shifted visual perception from object-level localization toward concept-level understanding. However, the notion of a concept remains under-specified, making it unclear whether current methods truly generalize beyond category recognition. In this work, we formalize generalized concept segmentation through a three-level taxonomy consisting of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts, which reveals a clear capability gap across increasing levels of cognitive complexity. To address this challenge, we propose ConceptSeg-R1, a unified framework that reformulates concept segmentation as rule-induced concept grounding. At the core of our method is Meta-GRPO, a meta-reinforcement learning mechanism that learns transferable task rules from visual demonstrations and verifies them through proxy reasoning. The inferred reasoning states are then translated into segmentation-ready concept prompts via a lightweight concept translation module, enabling deductive application to target images. A shortcut routing strategy further preserves the native efficiency of segmentation models on simple cases. To systematically evaluate generalized concept segmentation, we conduct extensive experiments across diverse CI, CD, and CR concept segmentation benchmarks spanning natural, industrial, medical and reasoning-intensive domains. Without bells and whistles, ConceptSeg-R1 achieves strong performance across the full concept hierarchy while maintaining the native capability of promptable segmentation backbones. As an initial step toward segmenting any concept, we hope ConceptSeg-R1 can serve as a practical baseline for advancing segmentation from object-level prediction toward concept-level understanding.

Music of Changing Lines: Toward a Culturally Situated Approach to the I-Ching

Ling Qi, Aleksandra Teng Ma, Alexandria Smith — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20386v1 Announce Type: new Abstract: The I-Ching is one of the most influential texts in Chinese intellectual history, integrating divination, cosmology, and ethical reflection. While Western experimental music, most notably John Cage, has drawn on the I-Ching as a source of chance operation, such appropriations have often detached its formal mechanisms from the interpretive and philosophical processes that give the text meaning. This work, Music of Changing Lines, presents an interactive system that re-centers the I-Ching as a meaning-bearing framework rather than a neutral randomizer. Users perform Wen Wang Fa coin casting, which is accompanied in real time through probabilistic musical processes. The resulting hexagrams and changing lines are interpreted by a large language model, Gemini, in relation to the user's inquiry. This textual interpretation is then translated into a prompt for a generative music model, Lyria, producing a responsive musical realization. By situating AI as an interpretive intermediary rather than a compositional authority, the system foregrounds the I-Ching's ritual, interpretation, and participation as the primary sonic materials. Music of Changing Lines extends process-driven traditions in computer music by demonstrating how generative AI can support participatory, meaning-driven musical processes without prescribing musical structure or replacing human agency.

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20388v1 Announce Type: new Abstract: Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

Nonlocal operator learning for fMRI encoding and decoding tasks

Andreas Kramer, Saugat Acharya, Alice Giola, Emanuele Zappala — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20389v1 Announce Type: new Abstract: Functional MRI data exhibit high-dimensional spatiotemporal structure, making both prediction and decoding challenging. In this work, we investigate neural integral-operator-based models for encoding and decoding tasks in fMRI, with particular emphasis on the role of nonlocal spatiotemporal context. We implement a latent neural integral operator framework that performs fixed point iterations in an auxiliary space from which classification and stimuli prediction is performed via a decoder. We evaluate our model on two open-source fMRI datasets. Our experiments examine both decoding of stimuli from fMRI recordings and encoding of fMRI dynamics from stimulus representations. A main focus is the effect of spatiotemporal context: we systematically compare short and long temporal windows, as well as the use of visual cortex vs whole brain recordings, and analyze their influence on performance and latent-space geometry. Across tasks and datasets, larger temporal windows generally improve results and produce more structured learned representations. In decoding experiments, the learned latent space often provides clearer class separation than the raw data. In encoding experiments, although absolute performance remains moderate due to the difficulty of the task, longer temporal windows still yield consistent gains. These findings suggest that neural integral operators provide a promising framework for modeling fMRI dynamics and that broader spatiotemporal context can be beneficial for both prediction and representation learning. More broadly, the results indicate that exploiting distributed nonlocal structure in brain dynamics requires model architectures specifically designed to capture such dependencies.

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20390v1 Announce Type: new Abstract: Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm would apply to autonomous driving perception systems due to unique challenges, such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study on systematically analyzing the impact of scale on these systems. We develop our STELLAR model based on Sparse Window Transformer, by extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior arts by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

Latent Geometry as a Structural Monitor: Eigenspace Alignment for Anomaly Detection in Anonymity Networks

Vaibhav Chhabra — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20391v1 Announce Type: new Abstract: Traditional anomaly detection marks events when measured signals cross predefined thresholds. This captures the moment of transition but not the structural pressure that precedes it. We propose treating large behavioral populations as geometric energy landscapes whose deformation can be measured before and during major transitions. The central thesis is that structure precedes geometry: the structural organization of the population is the signal, and geometric metrics are instruments for measuring it. Applied to the Tor anonymity network across 67 consecutive daily observation windows, the dual-observer pipeline identifies a stable nine-dimensional load-bearing subspace invariant across the observation period and validates this structure by Monte Carlo simulation at 16.8 sigma above the noise floor. Primary detection gates achieve 0.0% false positive rate on 24 confirmed stable windows. Forensic analysis of the February 20, 2026 confirmed infrastructure event formally falsifies the relay-departure hypothesis, identifying connectivity degradation without topology change as a detectable network failure mode. The result is a candidate structural-monitoring framework for behavioral populations with sufficient telemetry.

VBT-MPC: Vision-Based Tactile MPC for Contour Following

Edison Velasco-Sanchez, Luis F. Recalde, Guanrui Li, Pablo Gil — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20392v1 Announce Type: new Abstract: Tactile sensing plays a key role in robotic manipulation, particularly in tasks like surface inspection. Successful execution requires maintaining contact while accurately tracking object contours. In this work, we propose a Vision-Based Tactile Model Predictive Control (VBT-MPC) framework for robotic contour following using a Vision-Based Tactile Sensor (VBTS) mounted in an eye-in-hand configuration. The proposed controller operates directly in contour features space, thereby avoiding the need for separate pose-estimation modules or complex force-control architectures. We further compare our VBT-MPC with visual-servoing strategies adapted to tactile features, and evaluate contour tracking on objects with diverse geometries and materials in both simulation and real-world experiments.

Scalable Multi-robot Motion Planning via Hierarchical Subproblem Expansion and Workspace Decomposition Refinement

Isaac Ngui, Courtney McBeth, James D. Motes, Marco Morales, Nancy M. Amato — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20395v1 Announce Type: new Abstract: A fundamental challenge in multi-robot motion planning is achieving sufficient coordination to avoid inter-robot conflicts without incurring the large computational expense of searching the joint configuration space of the robot group. In this work, we present a method for multiple mobile robot motion planning that achieves an improvement in planning time up to an order of magnitude by leveraging the insight that we can use discrete search over a workspace decomposition to provide coordination between robots during planning. While prior work uses workspace topology to inform when coordination between robots is needed and then composes robots into their joint configuration space, we take a step further by iteratively refining our workspace representation to allow our planner to search smaller, decoupled configuration spaces.

Score-Based Causal Discovery of Latent Variable Causal Models

Ignavier Ng, Xinshuai Dong, Haoyue Dai, Biwei Huang, Peter Spirtes, Kun Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20396v1 Announce Type: new Abstract: Identifying latent variables and the causal structure involving them is essential across various scientific fields. While many existing works fall under the category of constraint-based methods (with e.g. conditional independence or rank deficiency tests), they may face empirical challenges such as testing-order dependency, error propagation, and choosing an appropriate significance level. These issues can potentially be mitigated by properly designed score-based methods, such as Greedy Equivalence Search (GES) (Chickering, 2002) in the specific setting without latent variables. Yet, formulating score-based methods with latent variables is highly challenging. In this work, we develop score-based methods that are capable of identifying causal structures containing causally-related latent variables with identifiability guarantees. Specifically, we show that a properly formulated scoring function can achieve score equivalence and consistency for structure learning of latent variable causal models. We further provide a characterization of the degrees of freedom for the marginal over the observed variables under multiple structural assumptions considered in the literature, and accordingly develop both exact and continuous score-based methods. This offers a unified view of several existing constraint-based methods with different structural assumptions. Experimental results validate the effectiveness of the proposed methods.

A Semantic-Web Oriented Competency Model for Engineering Programs

Nicolas Evain (LIUPPA), Ernesto Exposito (LIUPPA), Philippe Arnould (LIUPPA) — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20401v1 Announce Type: new Abstract: Despite comprehensive Bodies of Knowledge (BoKs) documenting core knowledge across software engineering, computer science, information systems, and emerging computing fields, a critical gap persists: methodologies for integrating this knowledge into coherent competency-based curricula that prepare graduates for professional careers remain underdeveloped. This paper presents a competency-mapping methodology that bridges Bodies of Knowledge and competency frameworks to design computing curricula. We demonstrate this methodology through ISANUM, a five-year engineering degree program featuring 23 competencies organized into five thematic blocks, each with explicit mappings to 494 knowledge topics from 34 Computing Knowledge areas defined in Computing Curricula 2020. The program integrates three specialized pathways (Software Engineering, Data Engineering \& Data Science, and Information Technology) with mandatory work-study programs, ensuring graduates develop both theoretical foundations and practical workplace competencies. Our contribution provides computing educators with a replicable methodology for translating Bodies of Knowledge into assessable competency frameworks, supported by a semantic wiki infrastructure (ISANUMpedia) enabling collaborative curriculum understanding, maintenance and evolution.

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Xiaocan Li, Shiliang Wu, Zheng Shen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20402v1 Announce Type: new Abstract: MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms upon interpreting how quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias, Outlier Fallback recovers deadzone entries, but also partially reduces scale bias induced error, and Adaptive Quantization Noise (AQN) for controlling the policy entropy. On Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts model, the targeted corrections recover BF16 accuracy to within 0.7% and 3.0% respectively.

Puzzled By ChatGPT? No more! A Jigsaw Puzzle to Promote AI Literacy and Awareness

Francesca Padovani, Malvina Nissim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20404v1 Announce Type: new Abstract: The rapid adoption of Generative AI, including LLM-based chatbots like ChatGPT, has highlighted the need for accessible ways to support public understanding and AI literacy. To address this need, we introduce a game-based, interactive approach in the form of a jigsaw puzzle whose completed image is a comic-based infographic illustrating the workings, capabilities, limitations, and societal implications of these technologies. Each comic sketch also functions as a standalone informational card, providing focused explanations of specific facets of AI use, design, and impact. The visual content was created in a live collaborative session with a professional illustrator and a multidisciplinary group of experts and non experts, combining structured knowledge with informal, exploratory reflections shared during the discussion. By integrating hands-on assembly, visual storytelling, and collaborative interaction, the puzzle provides an engaging and playful tool for exploring the mechanisms, perks, and perils of AI systems in informal learning contexts.

Spectral Souping: A Unified Framework for Online Preference Alignment

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20408v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) effectively aligns Large Language Models (LLMs) with aggregate human preferences but often fails to address the diverse and conflicting needs of individual users. To overcome this issue, we introduce Spectral Souping, a unified framework for efficient, online preference alignment. Our contribution is the discovery of a universal spectral representation within LLMs, which is proven to be highly amenable to model merging. This theoretical insight enables a two-phase methodology: we first learn a basis of specialized policies offline, each focused on a distinct, fine-grained preference dimension. An online adaptation algorithm then efficiently ``soups'' these policies at inference time, either by merging their outputs or parameters, enabling rapid model adaptation without the need for costly online retraining w.r.t. tailored preference rewards. Experiments on online preference alignment benchmarks demonstrate that our method achieves significant performance improvements over existing state-of-the-art approaches, presenting a scalable and computationally efficient solution for dynamically adapting LLMs to individual user preferences.

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20410v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in socially sensitive settings despite substantial documentation that they encode gender biases. Chain-of-Thought (CoT) prompting has been proposed as a bias-mitigation approach. However, existing evaluations primarily focus on changes in LLM benchmark performance, providing limited insight into whether apparent bias reductions reflect meaningful changes in a model's internal mechanisms. In this work, we investigate how CoT prompting affects gender bias in LLMs, combining benchmark-based evaluation with mechanistic interpretability techniques and reasoning chain failure analysis. Our results confirm a stereotypical bias present in LLM outputs across benchmarks, showing that CoT prompting does not consistently reduce the bias gap. Mechanistic analyses reveal that although CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, indicating only superficial mitigation. Inspection of reasoning chains further suggests that these improvements stem from memorization and familiarity with the dataset rather than genuine understanding of bias.

Max-Entropy Moment Filtering for Stochastic Hybrid Systems

Kaito Iwasaki, Tejaswi K. C., Anthony Bloch, Maani Ghaffari, Taeyoung Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20411v1 Announce Type: new Abstract: Stochastic hybrid systems combine continuous-time stochastic dynamics with discrete reset events, producing intrinsically non-Gaussian and often multimodal uncertainty. A consistent propagation law must also account for boundary-induced probability flux across guard sets, making direct density propagation through hybrid Fokker-Planck equations expensive. We develop a hybrid extension of the Max-Entropy Moment Kalman Filter (MEM-KF) that performs filtering from partial statistical information by propagating a finite collection of moments through stochastic hybrid dynamics and reconstructing beliefs using moment-constrained maximum-entropy distributions. The key step is a moment propagation rule derived from Dynkin's formula with a jump-sum, in which reset effects appear as a boundary-flux correction over the guard set. This yields tractable moment dynamics without solving the underlying hybrid PDE. In a stochastic bouncing-ball example, the proposed method captures reset-induced non-Gaussianity through corrected moment equations while retaining the MEM-KF's optimization-based maximum-entropy representation.

Supervised Latent Restructuring for Small-Data Quantum Learning in Plant Phenomics

Alakananda Mitra, David H. Fleisher, Vangimalla Reddy, Chittaranjan Ray — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20413v1 Announce Type: new Abstract: High-dimensional biological data often exhibit a severe mismatch between feature dimensionality and sample size, making reliable classification difficult in extremely small-data regimes. In these settings, kernel methods can lose discriminative power when latent compression fails to preserve class-separating structure. We study this problem in fine-grained plant phenomics and propose a hybrid workflow that compresses 1280-dimensional deep image embeddings into a 64-dimensional PCA space and then restructures them into an 11-dimensional supervised latent space using Linear Discriminant Analysis (LDA), followed by GPU-accelerated Quantum Kernel Alignment (QKA) on NVIDIA L40S hardware. Empirically, supervised latent restructuring substantially improves the geometric separability of the compressed representation, increasing the Silhouette coefficient from 0.003 in the raw embedding space and -0.006 in PCA-64 to 0.197 in the supervised LDA-11 space. However, downstream classical evaluation reveals a clear compression trade-off: Linear SVM and XGBoost improve in the restructured latent space, whereas RBF-SVM and Random Forest degrade under the same 11-dimensional bottleneck. Under a constrained optimization budget, QKA in this regime remains challenging, indicating that latent geometry alone is not sufficient for strong trainable quantum performance. These findings position representation geometry as a central design variable in small-data quantum learning and expose the practical difficulty of recovering nonlinear discriminative structure from aggressively compressed biological representations.

Miller-Index-Based Latent Crystallographic Fracture Plane Reasoning with Vision-Language Models

Qinwu Xu, Yifan Jiang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20416v1 Announce Type: new Abstract: We study whether multimodal large language models (MLLMs) can leverage crystallographic plane indices (Miller indices) as a structured latent representation for reasoning about fracture geometry. We formulate Miller indices $z = (h,k,l)$ as a latent variable governing idealized planar fracture and evaluate two complementary capabilities: (i) latent inference, where the model maps visual observations to plane hypotheses under physically valid conditions, and (ii) latent applicability assessment, where the model determines whether such a representation is meaningful for a given fracture image. Through extensive experiments spanning synthetic data, controlled 2D--3D geometric pairs, and real-world fracture images across multiple material classes -- including ceramics, glass, metals, and concrete -- we show that MLLMs can reliably perform latent inference in idealized settings and, critically, can reject the latent representation when the underlying physics does not support it. These results suggest that MLLMs can act as physics-aware reasoning systems conditioned on structured latent priors, provided that the domain of validity is explicitly modeled.

Intersecting Dense Automata

Dmitry Chistikov, Neha Rino — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20421v1 Announce Type: new Abstract: We observe that the classical Cartesian product construction for the intersection of (languages of) nondeterministic finite automata (NFA) is non-optimal in the worst case, if the automata have many transitions. For a fixed alphabet, the product of two NFA may have $\Theta(m^2)$ transitions if these NFA have at most $n$ states and $m$ transitions each. We describe alternative constructions with $O(m n)$ transitions: or $O(m n^{k-1})$ for the intersection of $k$ NFA (for fixed $k \ge 2$ and alphabet $\Sigma$). This gives a faster algorithm for deciding NFA intersection emptiness. The new algorithm is optimal, unless there exists a breakthrough combinatorial algorithm for detecting $(k+1)$-cliques in undirected graphs. This also leads to a more efficient certification scheme for NFA intersection emptiness.

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20423v1 Announce Type: new Abstract: Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Shuaike Shen, Wenduo Cheng, Shike Wang, Mingqian Ma, Jian Ma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20425v1 Announce Type: new Abstract: Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

Multi-Week, In-Class Deployments of Telepresence Robots With Four Homebound K-12 Students: Benefits, Challenges, and Recommendations

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20431v1 Announce Type: new Abstract: Missing significant amounts of school during K-12 education is known to put students' cognitive and social development at risk. Alternatives such as home instruction and online learning are common, but lack sufficient interaction with peers and teachers in the classroom. Mobile remote presence systems, or telepresence robots, are promising for homebound students because they provide embodiment and mobility in addition to the real-time participation offered by video conferencing technologies. Research is needed, however, for telepresence robots to meet the complex needs of homebound students participating remotely in the K-12 classroom context. We present findings from four multi-week deployments with homebound K-12 students attending classes via telepresence robots. The homebound students' experiences were documented in a total of 15 interviews and analyzed qualitatively as case studies. The homebound student participants and their deployment contexts differed from one another along multiple dimensions, and while some benefits of mobile remote attendance were enjoyed by all participants, each participant also experienced unique benefits. Some challenges with hearing, seeing, and moving the robot around the classroom warranted improvements to the design of the telepresence system. Other challenges suggested priorities for managing a classroom deployment, such as ensuring that the remote student is included in classroom activities, accountable to the teacher, and treated with respect by classmates. Based on insights from the study, we make recommendations for real-world deployment procedures in similar contexts.

Spacetime Optimal-Transport Attention for Visuo-Haptic Imitation Learning of Contact-Rich Manipulation

Yue Feng, Weicheng Huang, I-Ming Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20433v1 Announce Type: new Abstract: Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.

Hiding in Plain Sight: Finding MAHA on Reddit

Sabit Ahmed, Subigya Nepal, Henry Kautz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20435v1 Announce Type: new Abstract: Make America Healthy Again (MAHA) is a national health movement that encompasses a striking mix of beliefs, from broadly accepted concerns about good diet and exercise to controversial takes on organic and genetically modified food, childhood vaccination, science, and institutions. Various influencers and promoters of the MAHA movement on social media are scattered throughout the online space. Investigating the structure, discourse, and contagion of MAHA beliefs requires large-scale fine-grained digital footprints. Constructing structured data covering different MAHA themes from vast unstructured social media data is challenging. We introduce a Reddit dataset that spans six years (2020-2025), comprising 19.4M posts from 4M users. Containing the natural and thematic context of 12 MAHA-aligned beliefs, this dataset offers researchers from various domains the opportunity to study the dynamics of the MAHA movement, its structural and functional components, and the linguistic and behavioral patterns of its proponents.

Lighting-aware Unified Model for Instance Segmentation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20436v1 Announce Type: new Abstract: Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing \textit{Lighting Convolutional-Attention (\lca{})}, an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. \lca{} employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize \lca{} through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

Closing the Motivation Gap: Incentives Enhance Visual Misinformation Discernment and Verification

Sijia Qian, Cuihua Shen, Jingwen Zhang, Magdalena Wojcieszak — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20438v1 Announce Type: new Abstract: Cheapfakes, or real images presented misleadingly or in unrelated contexts, are an increasingly prominent form of visual misinformation. While media literacy interventions can enhance individuals' ability to detect such content, motivational barriers often hinder the adoption of image verification. This study examines whether incorporating different mechanisms and types of incentives into a digital media literacy intervention improves visual misinformation discernment and image verification behavior, both immediately and over time. We conducted a pre-registered two-wave between-subjects online experiment (N = 1,421) on a professionally designed social media platform. The study used a 2 (Incentive Type: symbolic vs. monetary) x 2 (Incentive Mechanism: task- vs. result-based) factorial design with additional control groups. Results show that task-based incentives, particularly monetary ones, were most effective at initiating image verification behaviors, namely reverse image search, and boosting short-term discernment, whereas result-based incentives were more effective in sustaining discernment accuracy. These findings suggest that both the mechanism and the type of incentives play a critical role in shaping the short- and long-term effectiveness of media literacy interventions, highlighting the value of multi-phased incentive strategies for combating visual misinformation in digital environments.

Can Conversational XAI Improve User Performance? An Experimental Study

Sven Kruschel, Julian Rosenberger, Lasse Bohlen, Mathias Kraus, Patrick Zschech — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20439v1 Announce Type: new Abstract: Explainable AI (XAI) techniques aim to provide insights into predictive models and enhance user performance, yet they often fall short of these expectations. Conversational XAI assistants promise to overcome such limitations, but empirical evidence on their impact on objective performance measures remains limited. We propose an experimental design for evaluating explanation assistance through prediction accuracy, model understanding, and error identification. Using an explainable-by-design prediction model, we create conditions where users can outperform the model by identifying and compensating for systematic errors. We compare conversational assistance against Q&A-based assistance to assess which better supports users in working with model explanations. Preliminary results from testing our experimental design show that participants (N=42) in both treatments significantly outperformed the model but reveal no performance differences between assistance types and modest engagement overall. These findings inform refinements for our planned full study, including enhanced engagement interventions and investigation of the mechanisms driving improved predictions.

Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20440v1 Announce Type: new Abstract: We introduce the $\star_G$ tensor algebra, in which any finite group $G$ defines the multiplication rule, making equivariance an intrinsic algebraic property rather than an architectural constraint. The framework rests on three machine-verified theoretical pillars: (i)~an Eckart-Young optimality guarantee for the $\star_G$-SVD: the first such result for symmetry-preserving tensor approximation, exact and polynomial-time; (ii)~a Kronecker factorization that composes multiple symmetries by replacing $F_G$ with $F_{G_1} \otimes F_{G_2}$ with no architectural redesign; and (iii)~a 600-line Lean~4 formalization of the $\star_G$ algebra. The framework provides capabilities that equivariant neural networks (ENNs) structurally cannot: a closed-form per-irreducible-representation decomposition of every prediction, and data-driven discovery of the symmetry group that best fits a dataset. As a non-trivial empirical demonstration, decomposing QM9 molecular geometry over the chiral octahedral subgroup of SO(3) recovers the Wigner--Eckart selection rules of angular momentum from data alone, with no quantum mechanical input: scalar properties are A$_1$-dominated, dipole components are T$_1$-dominated, the isotropic polarizability is uniquely insensitive to $l\!=\!1$ as the rank-2-trace decomposition $l\!=\!0 \oplus l\!=\!2$ requires, and the T$_1$/A$_1$ predictive-power ratio separates vector observables from scalar observables by a factor of five. On full QM9 (130{,}831 molecules), $\star_G$-SVD with ridge regression provides closed form predictions at $\sim50-90\times$ fewer parameters than parameter-matched MLPs. Algebraic equivariance thus complements architectural equivariance not as a faster-better-cheaper alternative but as a different mathematical affordance: provably-optimal symmetry-preserving compression, per-irrep interpretability, and data-driven physical discovery.

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Lucky Verma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20441v1 Announce Type: new Abstract: Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $\lambda_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $\nu=0.757$ (CI [0.725, 0.799]). Reference exponents $\nu=1/2$ and 3D Ising $\nu \approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $\nu$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $\lambda=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $\lambda_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.

Modeling Emotional Dynamics in Agent-to-Agent Interactions on Moltbook

Syed Mhamudul Hasan, Abdur R. Shahid — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20442v1 Announce Type: new Abstract: Generative AI systems are increasingly deployed as interactive agents in online environments, such as a social network called Moltbook. In Moltbook, large-scale agentic AIs can post, comment, and engage in activities generated at scale by AI-driven text. Yet these agent behavioral characteristics remain insufficiently understood, particularly in complex, multi-agent interaction. In this study, we analyze the emotional dynamics of agent interactions within Moltbook. We construct an emotion-aware framework that maps textual interactions to a predefined set of fine-grained emotional categories, enabling the extraction of structured emotion profiles across agents and interaction contexts. To further evaluate behavioral reliability, we introduce an emotion-based domain called Persona-Stimulus-Reaction (PSR) that captures the alignment of emotional responses across similar contexts. Our analysis shows distinct emotional patterns and varying levels of behavioral stability across agents. Our analysis reveals that agents exhibit distinct emotional signatures with varying levels of behavioral stability influenced by interaction context.

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20445v1 Announce Type: new Abstract: COVID-19 was a significant challenge that led to the loss of numerous lives daily. Not only a certain country was involved in this outbreak, but even the world has suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for the COVID-19 or any other pandemic disease screening process. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural networks (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung pictures. We used two different sets of X-ray images of the lungs in addition to two different sets of CT scans and the classification is done using a variety of networks that have been pre-trained such as VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), Mobile net (V2), Xception Inception (V3, Resnet V2), Efficient net (B0) and Nasnet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20448v1 Announce Type: new Abstract: Vision--language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20449v1 Announce Type: new Abstract: Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning

Mohammad Partohaghighi, Roummel Marcia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20450v1 Announce Type: new Abstract: Differentially private stochastic gradient descent (DP-SGD) enables private deep learning through per-example clipping and calibrated Gaussian noise, but its high-variance updates can reduce utility on challenging datasets. We propose \textbf{SMA-DP-SGD}, a \textbf{Spectral Memory-Aware Differentially Private Stochastic Gradient Descent} method that augments DP-SGD with a fractional memory branch built only from previously privatized noisy releases. WeightWatcher-inspired power-law spectral exponents provide group-wise reliability signals, instantiated layer-wise in our experiments, to adapt the decay and effective memory depth. Private-history alignment, norm matching, and warm-up activation stabilize the memory contribution. Privacy remains transparent: conditioned on the private release history, the memory branch is fixed, and the only newly data-dependent term is the current clipped sum scaled by a fixed coefficient $\beta$. Hence, SMA-DP-SGD preserves a clean conditional sensitivity structure and exactly recovers group-wise DP-SGD when $\beta=1$. Experiments on CIFAR-100, CIFAR-10, and MNIST show competitive or superior accuracy over several DP optimization baselines, with the largest gains on CIFAR-100 and CIFAR-10. CIFAR-10 ablations show that $\beta$ controls the privacy--utility trajectory, while spectral and memory diagnostics confirm a controlled short-to-moderate effective memory depth and a small memory-branch ratio. Runtime analysis shows that the mechanism incurs additional overhead, about $2.94\times$ DP-SGD in our CIFAR-10 implementation, revealing a practical trade-off between adaptive private memory and computational cost.

Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

Christopher Koch — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20456v1 Announce Type: new Abstract: Agentic AI coding systems can inspect repositories, plan implementation steps, edit files, call tools, run tests, and submit pull requests. These capabilities make software and hardware development faster in some settings, but current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes. Controlled studies report productivity gains in some enterprise tasks, slowdowns in mature open-source work, moderate but heterogeneous meta-analytic effects, and persistent failures in repository setup, dependency handling, permission gating, and hardware verification. This paper argues that the central problem is no longer prompt engineering; it is engineering process control. It synthesizes evidence from agentic software engineering, GitHub-scale adoption studies, repository-level agent configuration, productivity trials, issue-resolution benchmarks, and hardware/RTL verification research. It proposes Agentic Agile-V, a process framework that uses Agile-V as the lifecycle backbone and a task-level SCOPE-V loop - Specify, Constrain, Orchestrate, Prove, Evolve, and Verify - to convert conversational intent into structured engineering artifacts and acceptance evidence. The paper contributes: (i) a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; (ii) a conversation-to-contract gate that separates exploratory dialogue from implementation; (iii) risk-adaptive feature, bug-fix, testing, and hardware workflows; and (iv) an evidence-bundle acceptance model for agent-generated artifacts. The paper concludes that agentic AI does not eliminate engineering discipline; it increases the value of requirements, constraints, traceability, independent verification, and human approval.

The Structure and Dynamics of the Online MAHA-sphere

Sabit Ahmed, Subigya Nepal, Henry Kautz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20457v1 Announce Type: new Abstract: The "Make America Healthy Again" (MAHA) movement has created a complex ideological ecosystem within online communities, where advocacy for healthier lifestyles and whole-food diets coexists with vaccine skepticism and anti-science attitudes. Understanding how these interconnected beliefs interact, overlap, and evolve is critical for public health communication and intervention. We uncover the functional overlaps, network structures, engagement patterns, opinion dynamics, and linguistic differences across the full spectrum of MAHA ideologies. Using large-scale Reddit data spanning six years, we identified 12 MAHA-adjacent themes, including mainstream topics such as exercise, whole food, and screen use, as well as contentious topics such as vaccines, masks, GMOs, fluoride, and others. We developed a tree-based few-shot LLM pipeline to classify stances (pro, anti, neutral) across all themes, then computed user-level opinion scores to examine cross-theme interactions and opinion shifts over time. We find that MAHA-aligned users exhibit strong cross-theme bundling and coherent network structure, whereas anti-MAHA users do not bundle beyond chance. MAHA users cluster in a few mainstream subreddits, but post in a wide ecosystem of MAHA-related communities. During the pandemic, anti-fluoride and anti-mask posters transitioned into anti-vaccination posts, and later moved to broader anti-science narratives, suggesting that vaccine skepticism may serve as an entry point into wider anti-science engagement. Pro- and anti-MAHA communities also exhibit distinct psycholinguistic profiles, reflecting deeper ideological and rhetorical divides.

ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

Erick O. Rodrigues, Aura Conci, Panos Liatsis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20458v1 Announce Type: new Abstract: Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20459v1 Announce Type: new Abstract: In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20460v1 Announce Type: new Abstract: Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.

Understanding Model Behavior in Monocular Polyp Sizing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20461v1 Announce Type: new Abstract: Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.

Art Card Game (ACG): Embedding Illustration in Gameplay to Mitigate Artist Self-Criticism

Catherine Mullings, Michael S. Bernstein — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20465v1 Announce Type: new Abstract: Persistent self-criticism--harsh evaluative self-talk--can undermine illustrators' performance and well-being. Traditional interventions draw on psychotherapeutic approaches (e.g., compassion training) but sit outside the illustration workflow, requiring time, facilitation, and skill transfer. We propose an in-workflow alternative: evaluative off-centering, a mechanism redirecting self-critical evaluation away from an inherently self-evaluative task (like illustration) by embedding it in an alternative activity. We instantiate evaluative off-centering in Art Card Game (ACG) that integrates illustration into a card customization game: players illustrate cards that become playable assets in a head-to-head battle. In a four-day randomized controlled study with hobbyist and professional illustrators (N=38), ACG outperformed a control condition with identical illustration constraints but no evaluative off-centering mechanisms (e.g. multiplayer, gameplay), yielding significantly higher pride in produced artwork and activity enjoyment. Pride and enjoyment--positive affect states linked to lower self-criticism--help explain how ACG reduces self-criticism. We discuss design implications for creativity support tools that apply evaluative off-centering across creative domains.

Fifty Years of Transaction Processing Research (extended)

Philip A. Bernstein (Microsoft Research) — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20466v1 Announce Type: new Abstract: In this short paper, I recount some early history of transaction research (including some of my own), explain why transaction research continues to this day (even though it seems to be a solved problem), and speculate about its future. This is an extended version of the paper that appeared in the Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion '25).

High Quality Embeddings for Horn Logic Reasoning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20467v1 Announce Type: new Abstract: Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.

CASCADE Conformal Prediction: Uncertainty-Adaptive Prediction Intervals for Two-Stage Clinical Decision Support

Ricardo Diaz-Rincon, Muxuan Liang, Adolfo Ramirez-Zamora, Benjamin Shickel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20468v1 Announce Type: new Abstract: Effective medication management in Parkinson's Disease (PD) is challenging due to heterogeneous disease progression, variable patient response, and medication side effects. While AI models can forecast levodopa equivalent daily dose (LEDD) as a measure of medication needs, standard uncertainty quantification often fails to communicate the reliability of these predictions, treating high and low confidence clinical decisions identically. We introduce CASCADE (Calibrated Adaptive Scaling via Conformal And Distributional Estimation), a novel conformal prediction framework that propagates epistemic uncertainty from a screening classifier to adapt downstream predictions. Unlike standard conformal methods that rely on auxiliary residual regression, we leverage epistemic uncertainty from a primary classification task (identifying whether a medication change is needed) to dynamically scale the prediction intervals of a secondary regression task (predicting how much change). By mapping Venn-Abers multi-probabilistic uncertainty directly to non-conformity scores, our framework achieves continuous risk adaptation. We demonstrate that this ``cascade effect'' produces highly efficient intervals for confident patients (38.9% narrower than standard conformal baselines) while automatically expanding intervals to ensure robust coverage for uncertain cases, bridging the gap between discrete clinical decision-making and continuous dose forecasting in PD.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

Haoyu Wang, Zitong Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20469v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20470v1 Announce Type: new Abstract: Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. Unlike common image domain equivariance, we exploit the fact that an in plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle shifted projections of the paired target CT, yielding a physics consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and paired clinical data using patient wise splits, and perform single and mixed domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieved substantial improvements, +7.4 dB (phantom) and +1.8 dB (clinical data) in PSNR compared to state of the art methods, alongside improved SSIM and HU accuracy, within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU aware synthesis for downstream radiotherapy workflows.

Code Generation by Differential Test Time Scaling

Yifeng He, Ethan Wang, Jicheng Wang, Xuanxin Ouyang, Hao Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20473v1 Announce Type: new Abstract: Test-time scaling has emerged as a promising approach for improving code generation by exploring large solution spaces at inference time. However, existing methods often rely on public test cases that are unavailable in practice, or require extensive LLM inference for candidate selection, leading to significant token consumption and time overhead. We present DiffCodeGen, a novel test-time scaling method for code generation based on coverage-guided differential analysis. DiffCodeGen generates diverse code candidates using various sampling and prompting strategies, then applies coverage-guided fuzzing to synthesize inputs without requiring any existing tests or large language models. By executing all candidates on these inputs, DiffCodeGen captures their dynamic behavior and clusters candidates based on behavioral similarity. DiffCodeGen selects the medoid of the largest cluster as the final output. Unlike prior test-time scaling methods that invoke additional LLM inference for candidate selection, DiffCodeGen performs selection without any extra model calls, incurring little to no additional token consumption. DiffCodeGen is fully asynchronous, naturally suited to the current trend of agentic coding, and is thus efficient and highly scalable. We evaluate DiffCodeGen across 4 large language models, demonstrating consistent improvements over baselines. Compared to state-of-the-art test-time scaling methods, DiffCodeGen achieves competitive or superior performance while using only a fraction of time and tokens. DiffCodeGen is model-agnostic and can be combined with reasoning models to further boost performance.

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20476v1 Announce Type: new Abstract: Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan $2.1$ $+$ VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-$2.3$ across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

Training Language Agents to Learn from Experience

Yuval Shalev, Zifeng Ding, Mateja Jamnik — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20477v1 Announce Type: new Abstract: Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

Chen Shen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20478v1 Announce Type: new Abstract: LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.

Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

Jianmin Liao, Lixin Shen, Yuesheng Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20479v1 Announce Type: new Abstract: Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

Quadratic Characterizations for Reachability Analysis of Neural Networks

Elias Khalife, Mazen Farhood, Pierre-Loic Garoche — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20482v1 Announce Type: new Abstract: Quadratic constraints (QCs) are widely used to characterize nonlinearities and uncertainties, but generic analytical characterizations can be conservative on bounded domains. This paper develops a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are locally generated by solving convex quadratic programs using samples from the relation and exterior sample points. They are then verified globally using sum-of-squares certificates over an exact semialgebraic description or, in the case of nonpolynomial relations, over relaxed polynomial descriptions. The resulting verified constraints define a sound overapproximation of the scalar relations over the considered domains. These constraints are directly compatible with existing analysis frameworks based on QCs and pointwise integral quadratic constraints (IQCs) for static nonlinearities and uncertainties, and they can also be embedded in QC-based semidefinite programs for reachability and safety analysis of feedforward neural networks. For smooth activations such as $\tanh$, the method yields domain-dependent quadratic characterizations that constitute an alternative to generic sector- or slope-based descriptions. For ReLU networks, we give methods to reduce conservatism in QC-based reachability analysis of feedforward networks by exploiting dependencies between neurons and tighter local bounds. Numerical examples demonstrate improved reachability results for smooth activations, reduced conservatism for ReLU networks, and applicability beyond neural networks through an example involving saturation.

Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry

L\'eon Perruchot-Triboulet, Luc Jaulin, Kai Xiao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20484v1 Announce Type: new Abstract: Autonomous navigation in GNSS-denied environments remains a core challenge for legged robots, where exteroceptive sensors such as LiDAR are prone to elevation drift in geometrically sparse or repetitive scenes. We present a factor graph architecture that augments the LIO-SAM framework with a parallel kinematic lane driven by proprioceptive leg odometry, coupled to the main LiDAR-inertial lane via an identity relative pose constraint with a selective noise model. Applied to a Linxai D50 quadruped platform across two outdoor loops totaling over one kilometer, our approach reduces elevation drift from over 30m to under 30cm and enables convergence in a scene where the baseline pipeline fails entirely. These results suggest that proprioceptive data, already computed onboard for gait control, constitutes a lightweight and effective vertical anchor for SLAM in GNSS-denied settings.

ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

May Hamri, Inbal Talgam-Cohen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20485v1 Announce Type: new Abstract: As autonomous agents increasingly execute end-to-end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected, to how to spend it effectively. Existing budget-aware methods typically control reasoning step-by-step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the composing phases of a multi-agent pipeline at inference time. We propose ZEBRA, a zero-shot framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split. Additive and multiplicative aggregations are unified under the same solver. On a $150$-task APPS coding benchmark, both ZEBRA variants outperform LLM-direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of $\alpha = 0.5$ of the unconstrained spend, ZEBRA recovers $94.4\%$ of unconstrained quality, versus $88.1\%$ for LLM-direct. The advantage is statistically significant and transfers beyond coding: on a $3$-phase HotpotQA pipeline, ZEBRA beats LLM-direct by $14.3$pp, with allocations empirically robust to curve-estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near-balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi-agent systems.

\ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

Lautaro Estienne, Erik Ernst, Mat\'ias Vera, Pablo Piantanida, Luciana Ferrer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20490v1 Announce Type: new Abstract: In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -- to accept or reject predictions based on application-specific cost trade-offs. Such uncertainty-augmented (UA) systems -- i.e., systems that output both predictions and uncertainty scores -- are currently being assessed in the literature in a variety of ways, using separate metrics to evaluate the predictions and the uncertainty scores, setting a cost function with a fixed rejection cost or integrating over a coverage-risk curve. We argue that these evaluation approaches are inadequate for assessing overall performance of the UA system for decision making under uncertainty and propose a novel family of metrics, \ECUAS{n}, formulated as proper scoring rules for the task of interest. The parameter $n$ controls the trade-off between the cost of incorrect predictions and imperfect uncertainties depending on the needs of the use-case. We demonstrate the advantages of the \ECUAS{n} metrics both theoretically and empirically, through experiments on diverse classification and generation datasets, including a manually annotated subset of TriviaQA.

A Simple GPU-Accelerated Solver for the Schr\"odinger Operator with Applications to Ground States and Hamiltonian Simulation

Xinyu Liu, Xiangxiong Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20491v1 Announce Type: new Abstract: We extend the tensor-product direct solver from the Laplacian to the Schr\"odinger operator $-\Delta + V$. When the potential $V_1$ is separable, the operator $-\Delta + V_1$ is inverted or exponentiated at cost $O(N^{1+1/d})$ in $d$ dimensions via per-axis eigendecomposition. On a single NVIDIA A100 GPU, this costs less than one second for $10^9$ degrees of freedom in 3D. For non-separable potentials $V = V_1 + V_2$, the same solver provides a preconditioner $(-\Delta + V_1)^{-1}$ for the preconditioned conjugate gradient (PCG) method and a propagator for operator-splitting time integrators. For bounded $V_2$, we prove that the preconditioned operator has a bounded condition number and a clustered spectrum with at most finitely many outlier eigenvalues, independently of the mesh size, and also independently of the domain size when $V_1$ is a confining potential. This explains the mesh- and domain-independent PCG iteration counts observed in practice. We apply this method to ground state computation via inverse iteration for linear problems and via the $a_u$ gradient flow for Gross--Pitaevskii energy in 3D, and also Hamiltonian simulation via the approximated qHOP and Magnus-2 splitting methods from 3D to 9D on a single NVIDIA GH200 GPU.

A 10,000-Year Global Stochastic Tropical Cyclone Catalog with Wind-Dependent Track Transitions (WHITS)

Jennifer Nakamura, Upmanu Lall — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20494v1 Announce Type: new Abstract: Reliable assessment of tropical cyclone (TC) risk is limited by the brevity and spatial sparsity of the historical record, particularly for the rare, high-intensity landfalls that dominate insured loss. We present WHITS (Wind-focused Hurricane Interactive Track Simulator), a non-parametric semi-Markov track generator that extends the HITS framework of Nakamura et al. (2015) in three ways: transitions between historical track segments are conditioned on local wind speed in addition to position, age, and forward vector; the kernel selection on the comparative-vector term is sharpened to suppress dynamically inconsistent jumps; and a short smoothing window is applied across each transition to remove the position and wind discontinuities reported by downstream surge users. WHITS is fit to the full available best-track record in each of six basins in IBTrACS, extending in the North Atlantic to 1851 and in other basins to the earliest year of reliable best-track data. The resulting 10,000-yr global synthetic catalog reproduces observed track density and the annual hurricane/typhoon-force wind-hit probability across all basins. The catalog is intended for catastrophe-risk applications where a large, low-bias sample of physically plausible tracks is more useful than a small, statistically corrected one.

A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

Abhiram Kandiyana, Ankur Mali, Lawrence O. Hall, Peter R. Mouton, Dmitry Goldgof — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20495v1 Announce Type: new Abstract: Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

Hypergraph Partitioning on GPU with Distinct Incident Hyperedges and Size Constraints

Marco Ronzani, Cristina Silvano — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20497v1 Announce Type: new Abstract: Hypergraph partitioning is a recurring NP-hard problem in engineering; its efficient solution at scale hinges on parallelism. This work proposes a GPU-centric algorithm for multi-level hypergraph partitioning aimed at a specific set of problem constraints: limited size and distinct inbound hyperedges per partition. Manipulating hypergraphs requires deeply nested traversals and concurrent decision-making; our constraints impose further set operations amidst that. In turn, we design algorithms around the GPU's hierarchical parallelism and our problem's specifics. When forming partitions, we materialize the hypergraph's incidence structure and unique neighborhoods in memory to exploit set sparsity and batch node-pairing scores in shared memory. Upon refining partitions, we chain node moves into improving paths and cycles, checking their validity via cumulative set size variations reduced in parallel over moves. Thus, our dominant kernels exhibit a span linear in local hypergraph parameters. Results show an average 380x speedup and a 1.2-2.0x reduction in connectivity compared to a sequential multi-level partitioner. With minor changes, we also support k-way balanced partitioning, running 5x faster than CPU methods with a ~5% quality loss for k=2, outperforming an existing GPU partitioner at comparable runtime, with no measurable overhead from the added constraints handling logic.

A Multi-Layer Testing Framework for Automated Data Quality Assurance in Cloud-Native ELT Pipelines

Ismail Gargouri, Hassan Reza — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20500v1 Announce Type: new Abstract: Ensuring data quality in cloud-native Extract-Load-Transform (ELT) pipelines is increasingly challenging due to heterogeneous data sources, evolving schemas, and multi-backend execution environments. This paper presents a unified, multi-layer testing framework that integrates orchestration-level validation, declarative dbt tests, large language model (LLM)-generated semantic tests, and cross-store consistency checking between DuckDB and Snowflake, orchestrated through Apache Airflow. Controlled anomaly-injection experiments demonstrate that a manual-only baseline detected 7 of 16 injected anomalies. In contrast, both a manually expanded comparator and the proposed LLM-augmented configuration detected all 16, representing a 128.57% relative improvement in detection rate over the baseline. Post-migration cross-store validation confirmed exact agreement across all three curated tables. Of 25 LLM-generated test assertions, 9 were classified as useful, 4 as redundant, and 12 as executable but low-value. The complete workflow executed in 106.58 seconds across eight instrumented pipeline stages. These results demonstrate that LLM-driven semantic test synthesis can materially strengthen validation coverage while remaining operationally practical for production ELT environments.

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

Neelkamal Bhuyan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20502v1 Announce Type: new Abstract: We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts -- global domain changes, semantic divergence, texture differences, and covariate corruptions -- through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from ID data alone and introduce EncMin2L -- an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics: $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions) -- quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.

Privacy-by-Design Adaptive Group Assignment for Digital Lifestyle Coaching at Scale

Nariman Mani, Salma Attaranasl — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20505v1 Announce Type: new Abstract: Digital lifestyle coaching systems must personalize peer support as user behavior and engagement evolve while preventing personally identifiable information (PII) and sensitive health information from leaking into analytics and AI pipelines. This creates a practical tension: personalization requires longitudinal linkability, while privacy engineering requires minimization, separation, and controlled re-identification. We present PRISM-Coach, a stakeholder-centered architecture and adaptive peer-group assignment method for privacy-preserving lifestyle coaching. PRISM-Coach separates each user into four bounded views: Identity, Operational, Learning, and Coaching, each with distinct access controls and risk profiles. Building on this separation, the system uses vault-based controlled identity restoration, a privacy-constrained contextual bandit to assign users to eligible peer groups under coach-capacity and stability constraints, and a human-in-the-loop coaching assistant that generates de-identified summaries and draft messages without sending raw PII or PHI to external AI services. We instantiate PRISM-Coach in a commercially deployed lifestyle coaching platform and evaluate it using three years of telemetry from approximately 2,800 users and an in-app needs assessment survey. At the population level, daily check-in adherence increases from 0.35 to 0.68, and engagement rises to 1.35 baseline. In a matched 19-week comparison window, the AI-enabled workflow achieves adherence of 0.74 versus 0.48 under static grouping and higher average weight loss: 5.2 kg versus 3.1 kg. Survey results show that 82% report positive perceived benefit, and 92% report increased privacy confidence after transparency disclosures. These results position PRISM-Coach as a practical blueprint for privacy-by-design adaptive learning systems in everyday wellness.

Reinforcing Human Behavior Simulation via Verbal Feedback

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20506v1 Announce Type: new Abstract: Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20510v1 Announce Type: new Abstract: Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.

Creating Learning Scaffolds for Engineering Design Using Concept Catalyst

Madhuri Singh, Gennie Mansi, Mark Owen Riedl — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20511v1 Announce Type: new Abstract: K-12 teachers employ Engineering Design Challenges to help students learn about the Engineering Design Process hands-on. They use techniques like hard scaffolding questions to guide the students as they think through the different stages of the engineering design process. While useful, the creation of these questions adds to the teacher's preparation time for their classes. Concept Catalyst uses Large Language Models to assist teachers with the rapid creation of scaffold questions for engineering design challenges. Unlike open-ended chat, Concept Catalyst uses LLMs to summarize and decompose an engineering design challenge into the concepts that students will engage with, allow the teacher to visually manipulate and link related concepts, and to propose scaffolding questions for the teacher to modify or accept.

Framing an AI with Values Reduces AI Reliance in AI-supported Writing Tasks

Alice Gao, Andrew N. Meltzoff, Maarten Sap, Katharina Reinecke — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20512v1 Announce Type: new Abstract: Despite a global user base adopting large language models (LLMs) for daily writing tasks, model suggestions tend to align with Western values. Research has shown users commonly accept a high fraction of these AI suggestions, homogenizing writing styles and rendering outputs more ``Western'' than intended. While this suggests a need to reduce AI reliance, it remains unknown what kind of interventions could achieve this. Can framing the AI with specific values, and comparing it to one's own, make users less susceptible to overreliance and support more unique writing? We tested this hypothesis in a between-subjects online experiment with Indian and American participants (n=149) in which they were asked to perform AI-supported writing tasks, either 1) without an intervention, 2) after seeing an overview of the AI's framed values, or 3) after seeing an overview of the AI's framed values compared to their own. Our results show that seeing the AI's framed values reduces AI reliance, i.e., the proportion of the final essay generated by the AI, by an average of 20\%. Additionally, when participants saw an overview of the AI's framed values (without comparison to their own values), the final essays contain more unique text than without intervention. Our findings emphasize the importance of educating users about potential value biases in AI, showing that raising awareness with a simple overview of values encourages users to personalize their writing.

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20514v1 Announce Type: new Abstract: We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell's equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.

Online Conformal Prediction with Corrupted Feedback

Bowen Wang, Matteo Zecchin, Osvaldo Simeone — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20515v1 Announce Type: new Abstract: Modern artificial intelligence systems require calibrated uncertainty estimates that remain reliable in sequential and non-stationary environments. Online conformal prediction (OCP) addresses this challenge through adaptively updated prediction sets that provide deterministic long-run miscoverage guarantees. These guarantees, however, hinge on the assumption of perfect feedback about the coverage of past prediction sets. In practice, the observed miscoverage indicator may be corrupted by noise, communication failures, or adversarial manipulation, which can severely degrade OCP's calibration guarantees. In this paper, we study OCP under corrupted feedback. We first model feedback corruption as an arbitrary binary flip sequence, and analyze how feedback corruption affects and degrades the miscoverage performance of standard OCP. We then propose two robust schemes: robust OCP via filtering, which leverages the structural properties of the predicted threshold to filter corrupted feedback, and robust OCP via active compensation, which incorporates an active compensation mechanism to mitigate the effect of corrupted feedback. For both methods, we establish explicit miscoverage guarantees, which are further specialized for an independent stochastic flip model and for an arbitrary error model with memory bounds. Experiments on real-world datasets validate the proposed approach, showing markedly improved calibration and significantly smaller prediction sets compared with baseline OCP methods under corrupted feedback.

Codec-Robust Attacks on Audio LLMs

Jaechul Roh, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansdar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20519v1 Announce Type: new Abstract: Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. We show that the codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.

Open-World Evaluations for Measuring Frontier AI Capabilities

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20520v1 Announce Type: new Abstract: Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

An exponential mechanism based on quadratic approximations for fine-tuning machine learning models with privacy guarantees

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20521v1 Announce Type: new Abstract: Fine-tuning adapts a pretrained machine learning model to a small, sensitive dataset, but this process risks memorizing individual new data points, making the model vulnerable to adversaries who seek to extract sensitive information. In this work, we develop a randomized algorithm based on the exponential mechanism for fine-tuning while ensuring differential privacy. Our key idea is to construct a simple utility function that combines a local quadratic approximation of the pretrained model with information from the new dataset. The resulting exponential mechanism admits exact sampling from a multivariate normal distribution in closed form. We establish theoretical privacy guarantees, sensitivity bounds, and accuracy estimations for our method. We further introduce a random-projection strategy that makes the approach scalable to high-dimensional models. Numerical experiments on the MNIST benchmark and the MIMIC clinical dataset demonstrate competitive performance against existing differentially private fine-tuning techniques.

Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models

Athanasios Angelakis, Gabriele De Vito, Eleni-Myrto Trifylli, Filomena Ferrucci — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20523v1 Announce Type: new Abstract: Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20525v1 Announce Type: new Abstract: We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from $>$80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

An $O(n^5)$-Time Algorithm for Optimal Broadcast Domination

Kleitos Papadopoulos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20526v1 Announce Type: new Abstract: Broadcast domination assigns a nonnegative integer power to every vertex of a graph so that every vertex is within the assigned power of some broadcasting vertex, and the objective is to minimize the sum of the powers. Heggernes and Lokshtanov proved that the problem is polynomial-time solvable on arbitrary connected unweighted graphs by showing that some optimal efficient broadcast has a domination graph that is a path or a cycle, and by reducing the general case to an $O(n^6)$-time algorithm. This paper gives an efficient algorithm of the path-case. Instead of building one auxiliary acyclic graph for every possible left endpoint vertex, we build a single directed acyclic graph whose states are oriented broadcast balls together with their two possible residual sides. The resulting path-case algorithm runs in $O(n^3)$ time and $O(n^3)$ space on an $n$-vertex graph. Combining this routine with the same peel-one-ball reduction of Heggernes and Lokshtanov yields an exact $O(n^5)$-time algorithm for optimal broadcast domination on arbitrary connected unweighted graphs. This resolves the quintic-time conjecture for general graphs attributed to Heggernes and S\ae ther and recorded in subsequent surveys of broadcast domination.

Modern Portfolio Theory in the Crypto-Wilderness

Ivan Vynyavskyy, Stefan Kitzler, Bernhard Haslhofer, Aviv Yaish — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20528v1 Announce Type: new Abstract: Modern Portfolio Theory (MPT) prescribes how to maximise the return of an asset portfolio for a given level of risk. The optimal trade-off between return and variance defines the efficient frontier. Whether actual cryptoasset portfolios approximate this prescription and whether proximity to the frontier translates into realised performance remain difficult to test at large scale in traditional markets due to their opaque nature and the inaccessibility of data. As we show, public blockchains make these questions measurable: every token transfer is recorded, thus enabling complete portfolio reconstruction for every account at any point in time. We leverage this transparency to reconstruct cryptoasset portfolios for over 116M Ethereum accounts across the full chain history (2015-2025), measure their distance to the constrained efficient frontier, and quantify how deviations translate into realised performance. Here we show that market entry timing, not allocation choice, is the dominant predictor of realised cryptoasset returns. On-chain wealth is highly concentrated and portfolios are pervasively under-diversified, with single-asset holdings accounting for 83.35% of accounts. Two-asset portfolios sit closest to the efficient frontier defined by their held assets, a proximity that reflects the narrowness of their opportunity set rather than deliberate optimisation. Passive market-capitalisation weighting outperforms every MPT optimisation strategy in median realised return, and entry month alone explains 70-79% of the variance in returns, far exceeding the contribution of allocation choice. Mean-variance optimisation therefore appears neither descriptive of observed behaviour nor prescriptively useful in the cryptoasset domain, even if MPT retains its value as a normative benchmark.

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

Claire Hobbs, R. Thomas McCoy — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20529v1 Announce Type: new Abstract: In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Parsa Mazaheri, Kasra Mazaheri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20530v1 Announce Type: new Abstract: Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column is no longer the right unit of comparison for deployable agents. AgentAtlas extends this line of work with four components: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a nine-category trajectory-failure taxonomy with two orthogonal hierarchical labels (primary_error_source, impact); (iii) a taxonomy-aware vs. taxonomy-blind methodology that measures how much of a model's apparent capability comes from the supervision in the prompt; and (iv) a benchmark-coverage audit mapping fifteen agent benchmarks against six behavioral axes. To demonstrate the methodology we run a small fixed eight-model set (1,342 generated items, four frontier closed and four open-weight) under both prompt modes. Removing the explicit label menu drops every model's trajectory accuracy by 14-40 pp to a tight 0.54-0.62 floor regardless of family, and no single model wins on all three of control accuracy, trajectory diagnosis, and tool-context utility retention. We treat the synthetic run as a measurement-protocol demonstration, not a benchmark release.

Pseudo-Formalization for Automatic Proof Verification

Slim Barkallah, Luke Bailey, Kaiyue Wen, Mohammed Abouzaid, Tengyu Ma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20531v1 Announce Type: new Abstract: Reliable verification of proofs remains a bottleneck for training and evaluating AI systems on hard mathematical reasoning. Fully formal proofs, in languages like Lean, are easy to verify because they are unambiguous and modular. Most proofs, particularly those written by AI systems, have neither property, and translating them into formal languages remains challenging in many frontier math settings. We propose Pseudo-Formalization (PF), a proof format that captures the modularity and precision of formal proofs while retaining the flexibility of natural language. A Pseudo-Formal proof is decomposed into self-contained modules, each stating its premises, conclusion, and proof in natural language. To verify the correctness of a regular natural language proof, an LLM translates it to Pseudo-Formal and then verifies each module independently, an algorithm we call Block Verification (BV). We evaluate PF+BV on two benchmarks spanning olympiad and research-level mathematics, where it pareto-dominates LLM-as-judge baselines on error-finding precision and recall. To support future work, we release our research-level proof verification benchmark ArxivMathGradingBench.

Hybrid Edge-HPC Systems for Low-Latency Data-Driven Inference

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20532v1 Announce Type: new Abstract: Emerging cyber-physical systems increasingly require low-latency inference from streaming sensor data while maintaining models that reflect complex and evolving physical processes. In many domains, however, model updates depend on high-fidelity simulations and training executed on remote high-performance computing (HPC) systems under batch scheduling. This creates a fundamental mismatch between the responsiveness required at the edge and the cost, throughput, and availability of simulation-driven model updates. We present RBF (Reverse Backfill), a hybrid edge-HPC learning and inference architecture that integrates low-latency edge inference with asynchronous, simulation-driven model improvement. RBF targets simulation-bounded settings in which model updates are constrained by simulation throughput and HPC scheduling delays, and reinterprets HPC backfilling by using opportunistic computation to improve model accuracy rather than system utilization. RBF decouples inference from simulation and training by deploying lightweight surrogate models at the edge while incorporating improved models asynchronously as they become available. The architecture supports pluggable surrogate models and orchestrates computation across heterogeneous infrastructure spanning edge devices, private 5G, cloud, and HPC resources. We instantiate RBF using a real-world digital agriculture deployment that couples edge sensing with computational fluid dynamics (CFD) simulations to infer airflow patterns in a large agricultural screenhouse. Our evaluation characterizes end-to-end system behavior under realistic constraints, quantifying simulation latency, training cost, inference throughput, and the impact of delayed model updates on prediction accuracy. Results demonstrate that RBF enables continuous, low-latency inference while improving model fidelity over time despite delayed and irregular model updates.

Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

Meng Zhu, Quan Xiao, Weidong Min — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20533v1 Announce Type: new Abstract: Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient-scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer-comparison protocol. The code will be released at https://github.com/mengzhu0308/Ada2MS

Axiomatizing Neural Networks via Pursuit of Subspaces

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20534v1 Announce Type: new Abstract: While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.

Rotatable Coupler Antenna Enhanced Wireless Network: Modeling and Coupler Rotation Optimization

Xiaodan Shao, Chuangye Shan, Weihua Zhuang, Xuemin Shen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20535v1 Announce Type: new Abstract: Flexible coupler antenna systems have recently received significant research interest due to their capability to intelligently reconfigure wireless channels by controlling coupler positions and/or rotations and dynamically exploiting mutual coupling. In this paper, we investigate a new type of flexible coupler antenna, termed rotatable coupler antenna (RCA), for enabling spectrum and energy efficient wireless communication cost-effectively. Specifically, an RCA consists of one fixed active antenna and multiple low-cost passive couplers, each of which can independently rotate in three-dimensional (3D) space, so as to collaboratively achieve mechanical beamforming without requiring additional radio-frequency (RF) chains for the couplers. We study an RCA-enhanced point-to-point communication system, where one RCA is deployed at the transmitter to serve a single user equipped with a fixed antenna. Based on multi-port circuit theory, we establish the channel model and characterize the mutual coupling coefficients as a function of coupler rotations. We formulate a new problem to maximize the received signal-to-noise ratio (SNR) at the user by optimizing the 3D rotations of all couplers, subject to practical coupler rotation constraints. To tackle this nonconvex problem, we develop a spherical-cap conditional-gradient-based algorithm with cross-entropy-method initialization. Simulation results demonstrate that the proposed RCA system can significantly improve communication performance in comparison with benchmark schemes, while requiring substantially fewer active antennas and RF chains.

HADS-Net:A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20536v1 Announce Type: new Abstract: Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Robert Leaman, Rezarta Islamaj, Zhiyong Lu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20537v1 Announce Type: new Abstract: Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.

Continual Segmentation under Joint Nonstationarity

Prashant Pandey, Himanshu Kumar, Devineni Sri Venkatraya Chowdary, Brejesh Lall — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20538v1 Announce Type: new Abstract: Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.

OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI

Ipsita Bhar, Huseyin Tuna Erdinc, Thales Souza, Charles Jones, Felix J. Herrmann — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20539v1 Announce Type: new Abstract: The advent of machine learning (ML) and computer vision has significantly accelerated seismic inversion workflows by reducing the computational cost of traditionally expensive iterative methods. However, the development and evaluation of ML methods remain limited by the scarcity of realistic velocity models, as most high-quality data are privately owned by oil and gas companies. To address this gap, we present OpenSeisML, a collection of real seismic datasets designed to support generative AI (Gen-AI) workflows for seismic inversion. The datasets are curated from publicly available surveys in the UK National Data Repository (NDR). When seismic volumes are in the time domain and wells are in depth, a time-to-depth conversion is required. We use checkshot data to establish the time-depth relationship and construct a velocity model through interpolation for accurate conversion of post-stack seismic data. Here, we present an automated data curation pipeline that enables seismic data preparation while ensuring reproducibility. The objective is to train a generative model that captures the statistical distribution of subsurface properties, enabling the synthesis of multiple statistically consistent realizations for uncertainty quantification which can act as a prior for seismic inversion.

Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

Huan Huang, Michele Esposito, Chen Zhao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20543v1 Announce Type: new Abstract: Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local predictions interaction. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20544v1 Announce Type: new Abstract: Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

Detecting Data Exfiltration through I2P Anonymity Networks: A Two-Phase Machine Learning Approach

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20546v1 Announce Type: new Abstract: The Invisible Internet Project (I2P) provides strong anonymity through garlic routing and distributed network architecture, making it attractive for legitimate privacy needs. Nevertheless, the same properties can be exploited by malicious actors to steal sensitive information from corporate networks without detection. Current network security measures often fail to detect I2P traffic, and existing literature has focused primarily on protocol-level traffic identification without addressing behavioral threat assessment. This paper proposes a two-stage machine-learning model for I2P traffic analysis using the SafeSurf Darknet 2025 dataset comprising 184,548 network flows. Phase 1 achieved 99.96% accuracy in distinguishing I2P traffic from normal network traffic using a Random Forest classifier, with only 2 false positives among 32,318 normal flows. Phase 2 performed behavioral analysis on traffic identified as I2P, classifying it as either exfiltration or legitimate activity, achieving 91.11% accuracy using XGBoost. The system demonstrates that tree-based ensemble methods substantially outperform deep neural networks and support vector machines for this task. Feature importance analysis indicates that the most discriminative features are packet timing and flow duration. These findings establish that accurate I2P traffic detection and threat prioritization are achievable in operational network environments, enabling security teams to focus resources on high-risk events rather than monitoring all encrypted traffic.

Latent Process Generator Matching

Lukas Billera, Hedwig Nora Nordlinder, Ben Murrell — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20547v1 Announce Type: new Abstract: Many recent flow-matching and diffusion-style generative models rely on auxiliary stochastic dynamics during training: a richer process is simulated to define conditional targets, but the auxiliary state is either intractable to sample at generation time or simply not part of the desired output. Existing Generator Matching theory formalises conditioning on static latent random variables, and several recent papers prove special cases of projection results for particular augmented-state constructions. We introduce latent process generator matching, a general framework that treats the observed generative state as a deterministic image $X_t=\Phi(Y_t)$ of a tractable Markov process $Y_t$. We show that in this setting one may learn the generator of a stochastic process on the image space which has the same one-time marginal distributions as the projected process. This generalizes and subsumes the discrete latent process results from the literature, and extends Generator Matching from static latent variables to a rich family of time-dependent latent conditional processes.

What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems

Yong Jin Chun, Iftekhar Ahmed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20548v1 Announce Type: new Abstract: Large Language Models (LLMs) have enabled collaborative Multi-Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early-stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter-agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Based on these insights, we propose Category-Aware Recovery Augmentation (technique), which enforces the presence of critical information during communication. recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at https://anonymous.4open.science/r/cara_mas

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20549v1 Announce Type: new Abstract: Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene-factors spanning background, camera, and lighting, extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20551v1 Announce Type: new Abstract: Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

Michelle A. Vaccaro, Jared R. Curhan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20554v1 Announce Type: new Abstract: According to canonical negotiation theory, people's success in a negotiation depends on how well they balance competing demands--empathizing and asserting, demonstrating concern for other and concern for self, being soft on the people and hard on the problem. Yet people struggle to manage these tensions, so researchers have lacked the ability to rigorously test the field's prescriptions under controlled conditions. AI agents do not face the same limitations, and their precision, repertoire, consistency, and scalability enable a new class of experiments to contribute to negotiation theory. In this article, we introduce personality engineering: a methodology that uses AI agents to precisely parameterize, manipulate, and evaluate negotiator personality. We propose using the interpersonal circumplex--and its two core dimensions of warmth and dominance--as a foundational coordinate system for the field. This approach offers both a rigorous methodology for testing classic negotiation theories and a practical guide for designing the personalities of AI negotiation agents.

Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

Xingwei Gan, Ying Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20555v1 Announce Type: new Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO). In contrast to Reinforcement Learning with Verifiable Rewards (RLVR) methods, our proposal does not involve a Kullback Leibler (KL) regularization or critic; the trainable policy and the reference anchor are coupled through the logit averaging structure to leverage the reasoning expertise of the trainable policy while maintaining the formatting advantage of SFT. Our method is evaluated on MATH, cn-k12, and MMLU, and the results show a higher accuracy or at least comparable accuracy relative to the canonical KL-regularized GRPO.

When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

Wen Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20558v1 Announce Type: new Abstract: Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

Flexible Coupler Antenna for Wireless Networks: Opportunities and Challenges

Xiaodan Shao, Chuangye Shan, Weihua Zhuang, Xuemin Shen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20560v1 Announce Type: new Abstract: Flexible coupler antenna (FCA) is a new technique that aims to improve the performance of wireless communication networks by smartly translating low-cost passive couplers around fixed-position active antennas to reshape the induced currents on the passive elements for radiation. Specifically, different couplers can independently control their positions/rotations at the transceiver and thereby collaboratively achieve mechanical beamforming for directional signal enhancement or nulling. The position and/or rotation reconfiguration of passive couplers provides a new and cost-effective means of enhancing wireless communication performance, while significantly reducing the antenna and radio-frequency (RF) chain costs of conventional active arrays. The compact and low form-factor structure of the FCA makes it particularly appealing for devices with stringent size, weight, and power (SWAP) constraints. In this article, we provide an overview of FCA to reveal its promising capabilities in wireless networks, including its system modeling, practical implementation, and competitive advantages over existing techniques. We present a variety of FCA-enabled performance enhancements in terms of mechanical beamforming gain, path-loss reduction, fading mitigation, spatial multiplexing gain, interference suppression, and geometric gain. Furthermore, we elaborate on the design challenges of FCA as well as promising solutions, and discuss the key applications of FCA in wireless networks. Finally, numerical results are presented to verify the substantial capacity gains enabled by FCA-aided transmission in wireless networks.

Fault-Tolerant, Rigidity-Preserving Control of Inflatable Truss Robots

James Wade, Isaac Weaver, Mihai Stanciu, Nathan Usevitch — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20561v1 Announce Type: new Abstract: Isoperimetric robotic trusses can adapt to different tasks and environments because they have a high strength-to-weight ratio, can change their own shape dramatically, and can be reconfigured into a variety of different shapes. However, motor failures in operational environments can severely limit operational capabilities if not properly addressed. This paper presents a fault-tolerant control framework for an inflatable robotic truss that maintains functionality despite motor failures, shown through three key contributions. First, we extend the kinematic optimization to handle arbitrary combinations of motor failures by imposing equality constraints to ensure failed actuators are not used. Second, we introduce discrete-time control barrier function (DTCBF) constraints that mathematically guarantee structural rigidity while maximizing workspace utilization, a critical requirement for reliable operation of truss robots under discrete-time control. Third, we implement closed-loop position control using onboard encoder feedback and a forward kinematics-based state estimator, improving positional accuracy in the presence of disturbances. We validate our approach through simulation and hardware experiments on a 2D isoperimetric truss testbed. For a 2D configuration with 6 actuators, we demonstrate >69% workspace preservation under single-motor failures and a >25% improvement in tracking accuracy with closed-loop control. These results establish a foundation for more robust and resilient isoperimetric truss robots operating under degraded actuation.

Multi-agent Collaboration with State Management

Mengyang Liu, Taozhi Chen, Zhenhua Xu, Xue Jiang, Yihong Dong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20563v1 Announce Type: new Abstract: Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures. Existing multi-agent systems address this through workspace isolation (e.g., one git worktree per agent), but this defers conflict resolution to a post-hoc merge step where recovery is expensive. In this paper, we propose STORM, i.e., STate-ORiented Management for multi-agent collaboration. Specifically, STORM manages agent states by mediating their interactions with the shared workspace, ensuring that each agent operates on a consistent view of the codebase and that conflicting edits are detected and resolved at write time. We evaluate STORM on Commit0 and PaperBench across multiple LLMs. STORM outperforms the git-worktree-based multi-agent baseline by +18.7 on Commit0-Lite and +1.4 on PaperBench, while achieving comparable or better cost efficiency. Combined with single-agent runs, STORM reaches highest scores of 87.6 and 78.2 on the two benchmarks respectively, suggesting that explicit state management is a more effective foundation for multi-agent collaboration than workspace isolation. STORM can also be plugged into any multi-agent system seamlessly.

Conflict-Aware Active Perception and Control in 3D Gaussian Splatting Fields via Control Barrier Functions

Amirhossein Mollaei Khass, Athanasios Cosse, Vivek Pandey, Nader Motee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20566v1 Announce Type: new Abstract: Active perception in uncertain environments requires robots to navigate safely while acquiring informative observations to reduce map uncertainty. These objectives inherently conflict, as informative viewpoints often lie near uncertain regions with higher collision risk. To address this challenge, we develop a conflict-aware active perception and control framework for robotic systems operating in environments represented by 3D Gaussian Splatting (3DGS). Safety is enforced using a Control Barrier Function (CBF) derived from an Average Value-at-Risk AV@R collision-risk metric that accounts for geometric uncertainty and guarantees forward invariance of a safe set. To improve perception, we propose a risk-aware Expected Information Gain (EIG) formulation for selecting the next-best-view and introduce perception barrier functions that align the camera orientation with the local information-ascent direction. To obtain a tractable formulation for these conflicting safety and perception objectives, we propose a unified safety-critical, perception-aware quadratic program that enforces safety as a hard constraint while relaxing perception constraints through slack variables. Simulation results demonstrate that the proposed method improves both safety and information acquisition compared to existing 3DGS-based approaches.

End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20569v1 Announce Type: new Abstract: Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.

$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20576v1 Announce Type: new Abstract: Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $\Delta$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $\Delta$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $\Delta$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.

Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20577v1 Announce Type: new Abstract: Riichi Mahjong is a multi-player, imperfect-information game characterized by stochasticity and high-dimensional state spaces. These attributes present a unique combination of challenges that mirror complex real-world decision-making problems in reinforcement learning. While prior research has heavily relied on supervised learning from human play logs to pre-train the policy, algorithms capable of learning \textit{tabula rasa} (from scratch) offer greater potential for general applicability, as evidenced by the AlphaZero lineage. To facilitate such research, we introduce \textbf{Mahjax}, a fully vectorized Riichi Mahjong environment implemented in JAX to enable large-scale rollout parallelization on Graphics Processing Units (GPUs). We also provide a high-quality visualization tool to streamline debugging and interaction with trained agents. Experimental results demonstrate that Mahjax achieves throughputs of up to \textbf{2 million} and \textbf{1 million steps per second} on eight NVIDIA A100 GPUs under the no-red and red rules, respectively. Furthermore, we validate the environment's utility for reinforcement learning by showing that agents can be trained effectively to improve their rank against baseline policies.

A strongly annotated passive acoustic dataset for tropical bird monitoring

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20578v1 Announce Type: new Abstract: Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.

Deep Learning Surrogates for Emulating Stochastic Climate Tipping Dynamics

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20580v1 Announce Type: new Abstract: This work explores a dynamics-informed Temporal Fusion Transformer (TFT) as a data-driven surrogate for computationally intensive Earth system simulations. Focusing on multivariate time series describing global ocean transport, we demonstrate the surrogate's ability to forecast tip events across thousands of time steps. The data involve up to 21 non-stationary time series in addition to static covariates describing free parameters and initial conditions. Modifications to the architecture and objective function yield a surrogate that anticipates the timing of Atlantic and Pacific collapses to high fidelity and captures the stochastic uncertainty in transition timing across ensemble predictions. The learned surrogate achieves a 465x computational speedup over the numerical simulator while maintaining differentiability with respect to parameters and initial conditions.

TriForces: Augmenting Atomistic GNNs for Transferable Representations

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20581v1 Announce Type: new Abstract: Machine learning interatomic potentials (MLIPs) achieve excellent accuracy when trained on large Density Functional Theory (DFT) data. To be useful in practice, they must often be adapted to target chemistries using small and expensive task-specific datasets. However, MLIPs transfer inconsistently across domains, with representations that often loose accessible composition and structure information. To address this, we present TriForces, a model-agnostic three-stream framework that separates composition and structure information, combined with self-supervised learning to preserve transferable representations. TriForces improves performance on MatBench and QM9 over baselines without needing DFT labels and enables efficient similar structure retrieval through its learned latent space. On OMat24, in limited-data training regime, TriForces reduces energy MAE by 57% at 20K samples only and improves force MAE across sample sizes. We release pretrained TriForces variants across multiple MLIP architectures with code at https://github.com/Ramlaoui/triforces.

Multilevel Isogeometric Projection Stabilization via Quasi-Interpolation for Advection-Diffusion-Reaction Equations

Zakaria El Hasnaoui, Ahmed Ratnani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20583v1 Announce Type: new Abstract: This paper presents a novel multilevel projection-based stabilization method for advection-dominated advection--diffusion--reaction problems within the framework of Isogeometric Analysis (IGA). The proposed approach extracts and penalizes fine-scale fluctuations using continuous B-spline quasi-interpolants, avoiding both the highly sensitive parameter tuning typical of SUPG methods and the discontinuous auxiliary spaces required by classical Local Projection Stabilization. Stabilization is applied hierarchically across nested levels of the discrete space via explicit mesh-dependent weights. We establish the mathematical soundness of the method by proving a priori error estimates, supplemented by a discrete inf-sup condition established for the one-dimensional setting with constant advection under a numerically validated stability hypothesis that ensures robust streamline derivative control. Numerical experiments on stringent benchmarks demonstrate the method's ability to effectively suppress spurious oscillations across a variety of regimes, including the limiting cases of pure advection and advection--reaction. Notably, despite being a fully linear formulation, the method achieves significant reduction of undershoots near sharp layers, delivering performance comparable to complex nonlinear shock-capturing schemes. Furthermore, by utilizing a robust global parameter scaling, the proposed approach significantly alleviates the parameter sensitivity that typically affects residual-based alternatives, reducing the strong dependence on problem-dependent tuning.

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

Dishanika Denipitiyage, Aruna Seneviratne, Suranga Seneviratne — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20584v1 Announce Type: new Abstract: Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlights the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.

Direct Translation between Sign Languages

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20588v1 Announce Type: new Abstract: The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique in the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% high BLEU-4) while achieving a roughly 2.3* speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

Sunday Oyinlola Ogundoyin, Muhammad Ikram, Rahat Masood — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20591v1 Announce Type: new Abstract: Medical large language models (LLMs), including custom medical GPTs (MedGPTs) and open-source models, are increasingly deployed on web platforms to provide clinical guidance. However, they pose risks of hallucination, policy noncompliance, and unsafe design. We conduct a large-scale assessment of 6,233 MedGPTs, evaluating a stratified sample of 1,500, together with 10 open-source LLMs. We introduce two frameworks: MedGPT-HEval for hallucination detection and an LLM-based pipeline for assessing policy violations and developer intent. Our results show that 25-30% of MedGPTs exhibit low factual accuracy, with bottom- and middle-tier models at highest risk; 33.6-54.3% violate operational thresholds, and 57.06% of Action-enabled models lack adequate privacy disclosures. Compared with open-source models, MedGPTs achieve higher factual accuracy and semantic alignment, though open-source models are more stable. These results reveal systemic gaps in hallucination and compliance, highlighting the need for multi-metric evaluation and stronger safeguards. We release HAA-MedGPT, a structured dataset that supports future research on the safety of web-facing medical LLMs.

ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning

Sofia R. Miskala-Dinc, Aviva Prins — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20592v1 Announce Type: new Abstract: We study model-free Q-learning in finite-horizon episodic Markov Decision Processes (MDPs) with stationary dynamics across episodes. We identify a central issue in nascent model-free posterior-sampling works: the reliance on delayed learning in order to prove theoretical guarantees. In particular, we identify three opportunities for faster learning - (i) value-function update order, (ii) update frequencies, and (iii) value-function initialization. Using Wang et al.'s RandomizedQ as a basis, we illustrate these changes and their individual (as well as cumulative) impact in multiple empirical studies. We find that our combined modifications, termed ReversedQ, improve scaled mean cumulative reward compared to RandomizedQ, from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock (BDCL), and from 21.76% to 61.81% in a chain MDP.

Intent-First Aerial V2V for Tactical Coordination and Separation: Protocol and Performance Under Density and Disturbance

Mehrnaz Sabet — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20595v1 Announce Type: new Abstract: Dense low-altitude aerial operations require more than pre-flight route coordination and last-resort collision avoidance. Once aircraft are airborne, disturbances can emerge on timescales shorter than strategic reauthorization can absorb, while collision avoidance is too late and disruptive to serve as routine traffic management. Although tactical separation is recognized as the intermediate layer, realizing it at scale requires a deployable neighborhood communication mechanism that provides fresh, trusted information for local coordination. This paper presents what is, to our knowledge, the first controller-coupled characterization of an all-airborne, sidelink-class, intent-first vehicle-to-vehicle (V2V) tactical neighborhood exchange stack for dense Unmanned Aircraft System Traffic Management (UTM) operations. Unlike awareness-only broadcast, the proposed exchange combines refreshed state and intent beacons for local awareness, cooperative perception, and degraded-mode assessment with event-triggered messages for yielding, sequencing, release, and contingency coordination. We implement and evaluate this model on an all-airborne V2V stack using sidelink-class C-V2X modules with authenticated freshness checks. Evaluation uses a scenario-driven, high-volume stress campaign supported by real-time, field-anchored infrastructure. Results show that V2V reduces stale-belief divergence, preserves observability through cooperative perception, rejects invalid tactical messages, suppresses false local inference, and structures shared-resource coordination. The implemented stack provides a viable communication layer for tactical separation in lower-to-moderate regimes, but transitions toward guarded fallback as density, impairment, and complexity increase. These findings position intent-first aerial V2V as a bounded enabler for scaling tactical coordination in disturbance-driven urban airspace.

Unsupervised clustering and classification of upper limb EMG signals during functional movements: a data-driven

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20599v1 Announce Type: new Abstract: This study presents a comprehensive approach for the clustering and classification of upper-limb surface electromyography (sEMG) signals during functional reach and grasp movements. The methodology was applied to the NINAPRO DB4 dataset, which provides multichannel EMG recordings of 52 gestures. A four-stage pipeline was designed, including signal preprocessing, fea-ture extraction, gesture selection via hierarchical clustering, and comparative model evaluation. Preprocessing involved a fourth-order low-pass filter (0.6 Hz) and Hilbert envelope transformation, effectively reducing noise and enhancing signal clarity. Feature extraction yielded 26 temporal and frequency-domain met-rics, which were later refined using visual analysis, mutual information, principal component analysis, and decision tree importance scores. A final subset of five key features was selected for classification tasks. Gesture selection was per-formed through hierarchical clustering using Mahalanobis distance, resulting in six representative movements that balanced biomechanical diversity and compu-tational efficiency. A 200 ms window was identified as optimal for temporal seg-mentation based on stability and physiological plausibility. Classifier models were evaluated in two stages. Automated comparison using PyCaret identified Extra Trees (ET) and Artificial Neural Networks (ANN) as top performers. Sub-sequent independent training confirmed their stability and generalization capac-ity, with ANN showing progressive learning and ET maintaining robust, con-sistent results. The findings support the implementation of adaptive, low-latency control strategies for myoelectric prostheses and provide a scalable pipeline for future real-time applications.

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Yunming Ye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20600v1 Announce Type: new Abstract: Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, \emph{i.e.}, a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.

Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies

Ming Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20602v1 Announce Type: new Abstract: Successive self-training on a language model's own outputs is widely characterized as a process of flattening: diversity drops, distributions narrow, and the text becomes "more like itself." We provide evidence that this characterization is incomplete. Across eleven generations of self-training on five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), language is not flattened uniformly -- it is restructured. Surface markers (discourse connectives, hedges, em-dashes) rise, while mid- and deep-syntactic structures (questions, parentheticals, passives, subjunctives) collapse. We formalize this asymmetric collapse as the Structural Depth Hypothesis (SDH): the per-generation decay rate of a linguistic feature is predicted primarily by its structural depth -- the number of nested syntactic dependencies it requires -- and only secondarily by its generation-zero output frequency. Pooling 17-feature panels from five models spanning three architecture families (N=85), the pooled Spearman correlation is rho=0.540 (p < 10^{-6}; cluster-bootstrap 95% CI [0.434, 0.634]), while frequency is a substantially weaker predictor (rho=0.225). A matched human-text fine-tuning control yields rho=0.039 (p=0.88), confirming the gradient is self-training-specific. We further document a Superficial Complexity Paradox: aggregate complexity proxies (dep-tree depth, TTR, word length) all rise as the underlying clause structure dies, with direct implications for training-data curation and LLM-text detection.

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20606v1 Announce Type: new Abstract: Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20607v1 Announce Type: new Abstract: EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20608v1 Announce Type: new Abstract: Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20609v1 Announce Type: new Abstract: Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

Gene Tangtartharakul, Katherine R. Storrs — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20610v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

Matryoshka Concept Bottleneck Models

Ziye Chen, Hongbin Lin, Xinyue Xu, Jie Li, Lijie Hu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20612v1 Announce Type: new Abstract: Concept Bottleneck Models (CBMs) have emerged as a prominent paradigm for interpretable deep learning, learning by grounding predictions in human-understandable concepts. However, their practical deployment is hindered by the high cost of test-time intervention, as correcting model errors typically requires human experts to manually inspect and verify a large set of predicted concepts. Existing approaches suffer from a fundamental structural limitation: they either adopt a single static concept set, forcing experts to exhaustively annotate concepts and incurring prohibitive intervention costs, or train multiple models tailored to different concept budgets, resulting in substantial computational and maintenance overhead. To address this challenge, we propose the Matryoshka Concept Bottleneck Model (MCBM), a unified architecture that enables adaptive concept utilization within a single model. Inspired by Matryoshka Representation Learning, MCBM organizes concepts into a nested hierarchy based on maximum relevance and minimum redundancy, allowing inference at multiple levels of conceptual granularity without retraining. Theoretically, we show that MCBM reduces the expected intervention costs from linear to logarithmic order, $O(\log K)$, while guaranteeing monotonic performance improvement. Empirically, extensive experiments demonstrate that MCBM matches the performance of independently trained models while enabling dynamic and efficient expert interaction.

HRM-Text: Efficient Pretraining Beyond Scaling

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20613v1 Announce Type: new Abstract: The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20616v1 Announce Type: new Abstract: Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20618v1 Announce Type: new Abstract: Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional \textit{jumps} to escape local minima, but often struggle to generalize across diverse instances. We introduce \textbf{COAgents}, a cooperative multi-agent framework that models the search process as a graph: nodes represent solutions, and edges correspond to either local refinements or large perturbations for diversification (i.e., jumps). A \textit{Partial Search Graph} (PSG) is dynamically constructed during search, enabling COAgents to train a Node Selection Agent and a Move Selection Agent to guide intensification, and a Jump Agent to trigger well-timed explorations of new regions. Unlike end-to-end learning approaches, COAgents cleanly separates problem-agnostic search control from compact domain-specific encoding, facilitating adaptability across tasks. Extensive experiments on the CVRP and VRPTW benchmarks show that COAgents remains competitive with several learn-to-search baselines on CVRP and sets a new state of the art among learning-based methods on the more challenging VRPTW instances, reducing the gap to the best-known solutions by 14\% at $N\!=\!100$ and 44\% at $N\!=\!50$ relative to the strongest neural solver (POMO), and by 21\% and 40\% respectively relative to ALNS. Code is available at https://github.com/mahdims/COAgents.

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

Liuyuan Jiang, Chentong Huang, Lisha Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20619v1 Announce Type: new Abstract: Scalarization is widely used in multi-objective optimization owing to its simplicity and scalability. In many applications, the goal is to generate solutions that represent diverse user preferences, ideally with uniform coverage of the Pareto front (PF). However, uniformly sampling scalarization weights usually induces non-uniform coverage of the PF. We explain this mismatch through a geometric analysis of the scalarization path. As the scalarization weight varies, the corresponding solutions trace the PF with a generally non-uniform traversal speed. This speed induces an arc-length cumulative distribution function (CDF); inverting this CDF map yields a principled rule for selecting weights that produce uniform PF coverage. Building on this insight, we propose SURF (Sampling Uniformly along the PaReto Front). For structured problems, including bi-objective bandits, we derive closed-form expressions for this CDF map and the resulting PF-aware weight sampling rule. For general problems, SURF alternates between CDF reconstruction and weight sampling. Theoretically, we show that under provable conditions, SURF converges linearly to an unavoidable finite-sampling floor. Empirically, experiments on bandits, multi-objective-gymnasium, and multi-objective LLM alignment demonstrate that SURF efficiently achieves more uniform PF coverage than baselines.

Dynamic Shapley Computation

Xuan Yang, Hsi-Wen Chen, Ming-Syan Chen, Jian Pei — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20620v1 Announce Type: new Abstract: Shapley-based data valuation provides a principled way to quantify the contribution of training data, but its high computational cost makes it impractical in dynamic settings where tasks and training players evolve. Existing methods treat Shapley computation as a one-shot process and collapse contributions into aggregated scores, preventing reuse and requiring recomputation under any change. We introduce a new perspective that represents Shapley values as a player-by-task matrix and formulates dynamic valuation as a structured matrix maintenance problem. We exploit the fact that each task depends on a small subset of training players and that similar tasks yield similar valuations, leading to utility locality and coalition locality. Based on these insights, we propose D-Shap, a dynamic valuation framework that enables efficient updates by modifying only a small portion of the matrix: new task valuations are inferred via structure-aware interpolation, while updates induced by new players are confined to affected local matrix blocks. To eliminate the need for pre-specified evaluation tasks, we introduce self-valuation, which constructs the initial matrix directly from training data, supported by scalable subset reuse and coverage-aware anchor selection. Experiments across diverse models show that D-Shap performs task updates in milliseconds and reduces the cost of player updates by up to three orders of magnitude, while achieving valuation quality competitive with full recomputation.

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20624v1 Announce Type: new Abstract: Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

Time-To-Reach Separation and Safety Filtering for Safe, Fair, and Efficient Multi-Agent Coordination

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20625v1 Announce Type: new Abstract: Advanced Air Mobility (AAM) operations are expected to significantly increase aerial traffic in urban airspace, requiring autonomous traffic management systems to ensure collision-free operations in highly congested environments. In this paper, we propose a multi-agent coordination framework that uses minimum time-to-reach (TTR) as a unifying metric for priority assignment, temporal separation, and safety filtering. We focus on the problem of coordinating multiple aerial vehicles merging into an air corridor while maintaining safe separation between vehicles. Vehicles are assigned arrival-consistent priority based on TTR, and target TTR values are used to enforce temporal spacing that induces spatial separation. A priority-consistent safety filtering layer based on Hamilton-Jacobi reachability value functions ensures collision avoidance while minimally modifying the reference guidance. Simulation results in a highly congested corridor merging scenario show that the proposed method improves safety, fairness, and efficiency compared to time-optimal guidance and priority-agnostic safety filtering.

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20626v1 Announce Type: new Abstract: We present the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Our two-stage pipeline generates a Spanish intermediate caption with Qwen2.5-VL, then produces the target-language caption using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. We achieve 164.1%, 131.7%, and 122.6% improvements over the shared task baseline for Bribri, Guaran\'i, and Orizaba Nahuatl captioning, respectively, in our dev set evaluation and maintain >150% improvements for the Bribri and Orizaba Nahuatl languages in the test set evaluation. We find retrieval is highly language-dependent, beneficial only for large, in-domain corpora, and that synthetic data augmentation accounts for around 28 chrF++ of the dev set Guaran\'i performance gain. Our submission is the overall winner of the shared task, placing second out of five finalist submissions in human evaluations of target-language captions.

Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

Sylvey Lin, Joe Menke, Shufan Ming, Dongin Nam, Neil Smalheiser, Halil Kilicoglu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20628v1 Announce Type: new Abstract: Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20630v1 Announce Type: new Abstract: Industrial asset operations workflows are latency-sensitive because a single user query may require coordination over sensor data, work orders, failure modes, forecasting tools, and domain-specific agents. We evaluate this problem on AssetOpsBench (AOB), an industrial agent benchmark whose plan-execute pipeline exposes repeated overhead from tool discovery, LLM planning, MCP tool execution, and final summarization. Existing LLM caching techniques such as KV-cache reuse and embedding-based semantic caching were designed for chatbot serving and break down when output validity depends on time, asset, or sensor parameters. We propose two complementary optimization layers for AOB plan-execute pipelines: a temporal semantic cache and a set of MCP workflow optimizations combining disk-backed tool-discovery caching and dependency-aware parallel step execution. MCP workflow optimizations corresponded to a 1.67x speedup and reduced median end-to-end latency by about 40.0% while the temporal-cache benchmark achieved a median of 30.6x speedup on cache hits. Beyond the speedup, our results expose a concrete failure mode of pure semantic caching for parameter-rich industrial queries, providing a critical analysis of how caching choices interact with evaluation correctness in MCP-backed agent benchmarks.

The General Theory of Localization Methods

Congwei Song — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20635v1 Announce Type: new Abstract: This paper proposes a general machine learning framework called the localization method, which is fundamentally built on two core concepts: localization kernels and local means -- key components that underpin the self-attention mechanism. To establish a rigorous theoretical foundation, the framework is formally defined through two essential pillars: the formulation of the local(-ized) model and the localization trick. We systematically investigate the connections between the localization method and a wide range of existing machine learning models/methods, including (but not limited to) kernel methods, lazy learning, the MeanShift algorithm, relaxation labeling, Hopfield networks, local linear embedding (LLE), fuzzy inference, and denoising autoencoders (DAEs). By dissecting these relationships, we clarify the broader theoretical significance of the localization method and demonstrate its practical applicability across diverse machine learning tasks. Furthermore, we explore advanced extensions of the framework, such as adaptive kernels, hierarchical local models, and non-local models. Notably, we show that the Transformer -- a cornerstone of modern sequence modeling -- can be constructed using hierarchical local models, revealing the ability of the localization method to unify and generalize state-of-the-art architectures. This work not only provides a unified theoretical lens to reinterpret existing models but also offers new methodological tools for designing flexible, data-adaptive learning systems.

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20640v1 Announce Type: new Abstract: Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

Yifei Wang, Tianlin Li, Xiaohan Zhang, Yida Yang, Xiaoyu Zhang, Li Pan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20641v1 Announce Type: new Abstract: Inference optimization is a vital technique for deploying LLMs at scale. Compilation is the most widely adopted optimization technique for LLMs. While it assumes semantic equivalence between the original and compiled graphs, we first uncover its numerical side effects can be maliciously exploited to implant stealthy backdoors in LLMs. We propose a unified optimization-triggered attack framework comprising two complementary strategies. Without any modification to the compiler or hardware, one strategy flips predictions for specific inputs only when the model is compiled, while the other uses a universal trigger that remains dormant under uncompiled execution but hijacks arbitrary inputs once compilation optimization is applied. Both attacks bypass standard safety evaluations run without compilation. We empirically demonstrate that these optimization-triggered backdoors achieve attack success rates averaging 90% across four mainstream open-source LLMs and four tasks, while clean accuracy is preserved at nearly 100% under all settings. Our findings reveal a novel attack surface at the intersection of optimization and security in the LLM deployment pipeline, and we investigate practical defenses to mitigate this threat.

Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions

Mirerfan Gheibi, Gashin Ghazizadeh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20642v1 Announce Type: new Abstract: When annotators disagree, that disagreement can reflect epistemic uncertainty rather than simple label noise. We study hard-label delivery as an alternative to the usual choices of collapsing votes to a single label or training directly on the empirical soft-label distribution. We focus on two primary hard-label methods: multipass, which cycles through observed votes while keeping the dataset size fixed, and stochastic label sampling (SLS), which samples one label per example at the start of each epoch. On CIFAR-10H, we find that when only a small number of annotations per example is available, hard-label delivery improves over soft-label training, with larger improvements where the sparse empirical target is farther from the full annotator distribution. When full annotator distributions are available, both hard-label methods match soft-label training. We use deterministic control as an ablation of multipass and shuffled SLS as a control that breaks the example-to-distribution match. We also show that SLS and soft-label cross-entropy optimize the same expected objective. Hard-label delivery also converges to flatter basins, with supporting descriptive evidence from OOD detection on SVHN and CIFAR-100. Overall, these results suggest that multipass is a strong practical default when raw vote counts are available, while SLS offers a lightweight alternative that remains competitive when only a few votes per example are available and matches soft-label training when full annotator distributions are available.

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20643v1 Announce Type: new Abstract: Self-distillation enables language models to learn on-policy from their own trajectories by using the same model as both student and teacher, with the teacher being conditioned on privileged information unavailable to the student. Such information can come in different types or views, such as solutions, demonstrations, feedback, or final answers. This setup provides dense token-level feedback without relying on a separate external model, but creates a fundamental asymmetry: the teacher may rely on view-specific information that the student cannot access at inference time. Moreover, the best type of privileged information is often task-dependent, making it difficult to choose a single teacher view. In this work, we address both these challenges jointly by introducing AVSD (Adaptive-View Self-Distillation), a novel method of self-distillation with multiple privileged-information views, which reconstructs token-level supervision by separating stable cross-view consensus from view-specific residual signals. AVSD identifies the consensus signal shared across views, which provides a reliable update direction, and then selectively adds the view-specific residual signal to adjust the update magnitude when it both aligns with the consensus direction and remains proportionate to the consensus signal. Experiments on math competition benchmarks (AIME24, AIME25, and HMMT25) show that AVSD consistently outperforms both single-view self-distillation baselines and GRPO, achieving average Avg@8 gains of 3.1% and 2.2% over the strongest baselines on Qwen3-8B and Qwen3-4B, respectively. Moreover, on code-generation benchmarks (Codeforces, LiveCodeBench v6) using Qwen3-8B, AVSD outperforms the single-view self-distillation baseline by 2.4% on average.

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20644v1 Announce Type: new Abstract: Design for manufacturing plays a critical role in advanced aeroengine development, where complex components necessitate careful consideration of manufacturability. However, current practices in pipe routing remain largely decoupled from down-stream manufacturing, leading to labor-intensive, trial-and-error iterations to achieve manufacturable designs. To address this problem, this study proposes the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates the routing problem as a boundary value problem in the Frenet frame. In this framework, the pipe path is represented by curvature and torsion profiles, which are generated using cubic Hermite interpolation. To integrate design and manufacturing, domain-specific manufacturing knowledge is embedded as constraints on the permissible ranges of curvature and torsion. The path optimization is performed using the proximal policy optimization algorithm with stochastic exploration and a stage-guided reward mechanism. A unified mapping formulation then translates the optimized path into motion trajectories for the bending die, enabling direct fabrication on a six-axis free-bending machine. Experimental results demonstrate that FPRO consistently generates collision-free, manufacturable paths with smoother geometric profiles compared to Cartesian-based methods. It also achieves faster convergence and superior performance in terminal alignment, path length, obstacle avoidance, and manufacturability compared to state-of-the-art reinforcement learning baselines. Real-world validation confirms the close geometric correspondence between the manufactured pipe and its digital design, validating the practical feasibility of FPRO.

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20645v1 Announce Type: new Abstract: Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. Our FogAct and FogNet are given in our project page.

DisImpact: Quantifying the Physi-Social Impact of Natural Disasters Through Social Media

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20646v1 Announce Type: new Abstract: Natural disasters not only cause large-scale physical destruction, but also cascading social consequences that are difficult to quantify with traditional surveys and reports. Social media platforms offer an alternative perspective that captures multimodal, real-time, and user-generated content that can be leveraged for disaster impacts. In this paper, we introduce DisImpact, a two-stage framework that systematically quantifies the physi-social impacts of disasters via a Multimodal Large Language Model (MLLM). The social media posts are first classified into ten disaster impact categories that cover both physical and social domains. We then construct a disaster impact index that integrates the relative prominence of each category with the intensity of public engagement on a weekly basis. This design provides a unified scale for representing disaster impacts across both individual disaster impact categories and the broader physical and social domains. The unified representation enables direct comparison across categories and allows the impacts to be flexibly aggregated to reveal higher-level patterns and overall trends. We validate the impact indices against authoritative ground-truth data, including FEMA Public Assistance data and NASA FIRMS fire detections, observing consistent lead-lag correlations that demonstrate strong validity across both social and physical impact dimensions. We further conduct temporal and spatial analyses, and the results show that physical impacts are often peak during the disasters and localized in regions that are directly affected by disasters, while social impacts often emerge later and spread more broadly across time and space. To the best of our knowledge, this is the first framework to comprehensively quantify disaster impacts across their physical and social dimensions using multimodal data from multiple social media platforms.

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20648v1 Announce Type: new Abstract: Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

Tuopusen Huang, Ding Ma, Xiangqian Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20651v1 Announce Type: new Abstract: Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Jiachen Ma, Jiawen Zhang, Xiangtian Li, Bo Zou, Chaochao Lu, Chao Yang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20654v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

Cooling Channel Design Optimization for High Power Multi-chip Packages

Michael Acquah, Zheng Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20657v1 Announce Type: new Abstract: Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip. In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous media-based representation of coolant transport, coupled with a row-wise coolant energy balance, to estimate chip temperature fields within microchannel networks. Unlike conventional designs, an interdigitated cooling architecture is parameterized using geometric variables, including channel count, width, and expansion over chip regions, enabling systematic design exploration. To enable efficient optimization, a surrogate-based approach is employed to approximate the relationship between geometric parameters and temperature metrics. The resulting model is optimized using a mixed-integer quadratic programming algorithm to minimize a weighted objective based on peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on NVIDIA GB200 architecture, consisting of two graphics processing units and one central processing unit. The results demonstrate that the optimal design reduces the peak chip temperature by 140.45{\deg}C and the average chip temperature by 35.87{\deg}C compared to the baseline configuration.

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20659v1 Announce Type: new Abstract: Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).

Design Principles and Observable Indicators for AI-Enabled Pedagogical Accompaniment: Evidence from the Amico Dual-Mode Prototype in Italy and China

Pier Paolo Benedetti — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20665v1 Announce Type: new Abstract: AI-enabled systems are increasingly introduced into educational contexts, yet their effectiveness depends less on technological sophistication than on the quality of pedagogical mediation, ethical constraints, and context-sensitive design. This paper proposes a replicable framework for AI-enabled pedagogical accompaniment, grounded in a human-in-command approach in which adult responsibility remains central and AI functions as an enabling, non-substitutive infrastructure. Building on the Amico project, we operationalize the concept of a relational bridge as a sequence of micro-mediations that lower the threshold of access to educational relationships and facilitate transitions toward meaningful human interaction with teachers, peers, and communities of practice. The contribution synthesizes a set of design principles, including transparency of system identity and limits, scaffolding toward human contact, maieutic questioning, prevention of dependency dynamics, and data minimization, and maps them to observable indicators suitable for real educational settings. The paper also outlines an initial cross-context exploration of the prototype in Italy and China and discusses how the two interaction modes, AmicoMio (structured, task-oriented) and AmicoTuo (reflective, supportive), can be used as complementary pedagogical mediations. Pilot observations and participant feedback suggested feasibility and perceived usefulness in vocational contexts, motivating the present framework, informing the subsequent doctoral research program, and supporting the proposed collaborative research agenda.

A Semantic and Occlusion-Aware GM-PHD Filter

Jovan Menezes, Mark Campbell — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20666v1 Announce Type: new Abstract: This paper proposes a new birth model including semantic information derived from deep learning to create an occlusion-aware Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter. Unlike prior approaches that rely on simplistic or uniform assumptions, the proposed Semantic-Occlusion Aware (S-OA) birth model defines initialization terms by explicitly considering regions of occlusion and by leveraging semantic information about the environment. This enables the filter to accurately represent where new objects are more likely to appear, thereby improving tracking performance in complex and high-density driving scenarios. The method is evaluated through Monte Carlo simulations and experiments on the KITTI dataset. Performance is assessed by measuring the latency between first detection and track initiation, along with the mean absolute cardinality error and the Optimal Subpattern Assignment (OSPA) metric. Results demonstrate that the S-OA birth model reduces initialization delay in occlusion-heavy settings, matching or outperforming the strongest baseline in approximately 70% of cases. A sensitivity analysis of birth model weights is also provided. Overall, the findings underscore the benefits of integrating occlusion reasoning and semantic priors into Bayesian tracking frameworks for autonomous driving.

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20667v1 Announce Type: new Abstract: Detecting small unmanned aerial vehicles from RGB-infrared remote-sensing pairs remains challenging due to tiny target scale, cluttered backgrounds, and spatial misalignment between heterogeneous sensors. Existing bimodal detectors often align or fuse features without assessing the reliability of local cross-sensor correspondence, allowing mismatch artifacts to propagate into the detection head. To address this issue, we propose LER-YOLO, a reliability-aware sparse mixture-of-experts framework for misaligned RGB-infrared UAV detection. LER-YOLO first introduces an Uncertainty-Aware Target Alignment module that resamples visible features toward the infrared reference and estimates a spatial reliability map. This reliability prior is then used by a Reliability-Guided Sparse MoE Fusion module to adaptively select k experts from RGB-dominant, infrared-dominant, and interactive fusion experts, enabling trustworthy cross-modal interaction while suppressing unreliable fusion. Experiments on the public MBU benchmark under a YOLOv5s-family protocol show that LER-YOLO achieves 89.7+/-0.2% AP50 over three independent seeds, with a best result of 89.9%. Extensive ablations, parameter-matched comparisons, synthetic-shift evaluations, and complexity analysis demonstrate that the gains mainly come from reliability-guided expert routing rather than increased model capacity.

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20668v1 Announce Type: new Abstract: With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.

GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

Jiahao Kong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20669v1 Announce Type: new Abstract: X-ray security inspection requires accurate real-time detection of prohibited items, but existing models often struggle to balance the challenges of severe occlusion, complex clutter, and strict speed requirements. To overcome these challenges, this paper proposes GSA-YOLO, a novel lightweight framework built upon the YOLOv8n architecture, specifically engineered to enhance detection robustness and inference efficiency. GSA-YOLO strategically integrates structured sparsity and adaptive knowledge transfer through three core components: Group Lasso (GL) applied to the network neck for robust feature extraction; Sparse Structure Selection (SSS) applied to the detection head for significant model slimming; and an Adaptive Knowledge Distillation (Ada-KD) mechanism for comprehensive accuracy recovery. This integrated approach synergistically enhances feature representation while pruning redundant channels, maximizing model efficiency without sacrificing performance. Rigorous evaluations on the HiXray and PIDray datasets confirm GSA-YOLO's comprehensive capability, achieving a leading inference speed of 189.62 FPS, accompanied by a reduction in computational cost from 8.7G to 8.0G. Crucially, GSA-YOLO secures mAP50:95 results of 0.531 and 0.679 on HiXray and PIDray, demonstrating 2.4% and 1.8% improvements over the baseline, respectively. Compared to other models, GSA-YOLO exhibits enhanced accuracy while maintaining computational efficiency, making it a promising solution for practical X-ray security inspection.

LT2: Linear-Time Looped Transformers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20670v1 Announce Type: new Abstract: Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally expensive and slow. We introduce LT2 (Linear-Time Looped Transformers), a family of looped architectures that replace quadratic softmax attention with subquadratic, linear-time attention. We study two variants: LT2-linear with linear attention and LT2-sparse with sparse attention. We find that looping uniquely synergizes with these variants: it enables iterative memory refinement in linear attention and progressively expands the effective receptive field in sparse attention. We formalize these benefits theoretically and demonstrate consistent empirical gains across controlled recall, state-tracking, and language modeling tasks. We then explore LT2-hybrid, which combines different attention variants in a looped setting. Two variants are especially promising: LT2-hybrid (GDN+DSA), which interleaves linear and sparse attention to maximize efficiency and matches the standard looped transformer's quality at fully linear-time cost; and LT2-hybrid (Full+GDN), which interleaves GDN with a small fraction of full attention layers to maximize quality, surpassing the standard looped transformer in both performance and efficiency. We also show how to convert a pre-trained LT into an LT2-hybrid model. With about 1B tokens of training, our converted model, Ouro-hybrid-1.4B, outperforms industry-level 1B models and is competitive with industry-level 4B models while retaining the speed benefits of linear-time attention. Together, these results show a clear path toward making looped transformers more scalable and advancing efficient, capable small language models.

Persistent-Homology-Guided Topology Scanning of Qualitative Indicators for Acoustic Inverse Scattering

Xiaomei Yang, Jiaying Jia, Zhiliang Deng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20673v1 Announce Type: new Abstract: Qualitative methods such as the linear sampling method and the factorization method reconstruct acoustic scatterers through sampling indicators. In practice, these indicators are gray-scale fields on a prescribed sampling window and a binary obstacle shape is obtained only after thresholding. The choice of threshold is usually empirical and may be unstable when the indicator contains noise-induced artifacts or when the scatterer has nontrivial topology, such as multiple components or holes. This paper proposes a topology-aware postprocessing framework based on persistent homology. Given any normalized qualitative indicator, we scan the persistent homology of its superlevel sets and use the resulting zero- and one-dimensional persistent features to estimate or impose the topology of the unknown scatterer. A topology-guided threshold is then selected by minimizing a Betti-number discrepancy together with mild geometric penalties. The method is indicator-agnostic: it can be applied to the linear sampling indicator, the factorization-method indicator, or a normalized fusion of indicators. The main formulation is single-frequency and therefore remains close to the classical qualitative inverse scattering setting. We present the mathematical construction, an automatic topology detection rule based on persistence lifetimes and lifetime gaps, and a detailed algorithmic protocol for numerical implementation. Numerical tests verify that the proposed method is effective.

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

Herman Bergstr\"om, Aditya Mehrotra, Rahul G. Krishnan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20674v1 Announce Type: new Abstract: We introduce CoMET, \textit{\textbf{C}omposing \textbf{M}odality \textbf{E}ncoders with \textbf{T}abular foundation models}, a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and concatenate as input into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices to act as an adaptor yielding strong, robust performance across modalities. When the \texttt{CLS} tokens of the foundation model align poorly with downstream tasks, we propose \textbf{PALPooling}, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, our approach enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that the composition of foundation models is a simple, yet powerful, out-of-the-box solution for multimodal learning, challenging the necessity of complex, end-to-end training pipelines for new problems.

An Event-Driven Tool for Context-Aware Code Smell Detection Using SmellDSL

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20675v1 Announce Type: new Abstract: Code smells signal violations of design principles that degrade the internal quality of evolving software systems. Although many tools detect such anomalies using static metrics, they often ignore the development context in which smells arise and are resolved. This limitation can lead to misleading warnings and weak support for refactoring decisions. To address this problem, we present SmellHunter, a context-aware tool that interprets scripts written in the SmellDSL domain-specific language to detect and contextualize code smells. SmellHunter integrates static code metrics with contextual information (such as team characteristics, project stage, and geographic metadata) to produce richer, more actionable analyses. The tool adopts an event-driven architecture in which a service bus orchestrates validation, interpretation, and persistence services through asynchronous events. This architecture enables scalable analysis while minimizing disruption to developers' workflows. SmellHunter is integrated into the Eclipse development environment via a dedicated plugin and provides aggregated insights via a mobile application, allowing developers to explore smell occurrences by type, severity, and location. By linking smell detection with contextual data and collaborative visualization, SmellHunter supports developers acting as smell hunters, helping teams identify recurring quality issues emerging from a particular location and assign refactoring tasks to developers with relevant expertise. We describe the architecture of SmellHunter, the interpretation process of SmellDSL scripts, and the integration of contextual data to support more informed refactoring decisions in modern software development environments.

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20676v1 Announce Type: new Abstract: Establishing a clear link between model predictions and the visual evidence that supports them is critical for transparency and reliability in multimodal reasoning, yet current multimodal large language model (MLLM) evaluations do not explicitly enforce this alignment. Existing benchmarks assess either textual answer correctness or pixel-level localization in isolation, leaving the coupling of reasoning and grounding an open challenge. We introduce VISTAQA, a comprehensive benchmark for joint evaluation of free-form answer correctness and pixel-level evidence grounding in visual question answering. VISTAQA comprises 1,157 expert-curated samples spanning six task types and six visual domains, ranging from direct perception to compositional and relational reasoning. VISTAQA requires models to not only answer correctly, but to also provide precise segmentation masks that support their answers. It also includes hallucination-aware examples where no valid visual evidence exists. To support this enhanced evaluation, we introduce GROVE, a unified evaluation metric that enforces joint correctness by combining textual accuracy and grounding quality via a per-sample geometric mean, ensuring neither dimension can compensate for deficiencies in the other. Comprehensive experiments across grounding-aware models and hybrid pipelines with general-purpose MLLMs reveal that even the strongest systems achieve limited performance under GROVE, highlighting a substantial gap between answer accuracy and visual evidence alignment.

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20678v1 Announce Type: new Abstract: Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.

DarkShake-DVS: Event-based Human Action Recognition under Low-light andShaking Camera Conditions

Jiaqi Chen, Qinfu Xu, Liyuan Pan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20680v1 Announce Type: new Abstract: Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark that includes 18,041 realworld clips captured in low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate consistent superiority of EIS-HAR over state-of-the-art methods.

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20682v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown remarkable capability in bridging visual perception and textual reasoning, enabling zero-shot understanding across diverse industrial scenarios. However, their performance in open-vocabulary industrial anomaly detection (IAD) is often limited by domain-misaligned reasoning and hallucinated structural inferences. To address these challenges, we propose \textbf{IndusAgent}, a tool-augmented agentic framework for open-vocabulary IAD. Specifically, we first construct \textbf{Indus-CoT}, a structured dataset that integrates global visual observations, high-resolution local patches, and expert normalcy priors, providing supervision for fine-tuning the model on rigorous industrial inspection trajectories. Building on this, IndusAgent dynamically orchestrates a set of external tools, including dynamic region cropping, high-frequency feature enhancement, and prior retrieval, thus enabling the agent to actively resolve visual ambiguities and disentangle subtle anomalies. Furthermore, we introduce a gated reinforcement learning objective that jointly optimizes anomaly classification, localization accuracy, anomaly type reasoning, and efficient tool usage, ensuring that tool invocation occurs only when beneficial. Extensive evaluations on five industrial anomaly benchmarks, including MVTec-AD, VisA, MPDD, DTD, and SDD, demonstrate that IndusAgent achieves state-of-the-art zero-shot performance among all existing methods, validating our robustness and generalization capacity.

Layer-wise Token Compression for Efficient Document Reranking

Shengyao Zhuang, zhichao Xu, Ivano Lauriola — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20683v1 Announce Type: new Abstract: Transformer-based document cross-encoder rerankers are a central component of modern information retrieval systems. Despite their success, these models suffer from high computational costs due to processing long query-document sequences at inference time. A known approach to improve efficiency is token compression, which consists of aggregating groups of tokens together in the initial embedding layer, reducing the effective number of tokens, and making the computation faster. While token compression has proven to be successful for bi-encoder retrievers, we empirically observed that this approach may be ineffective for cross-encoder rerankers. In this paper, we propose Layer-wise Token Compression (LTC), which applies adaptive token pooling at intermediate transformer layers. Through extensive ablation studies on MS MARCO passage and document ranking tasks, we demonstrate that compression at middle layers preserves ranking quality while increasing inference QPS by up to 25% for passage ranking and up to 116% for document ranking. We also extend LTC to listwise LLM rerankers and show that the same approach can be easily applied to long-context listwise reranking, where the QPS improvements are even greater. More surprisingly, when applying rerankers trained on short passages to long-document ranking tasks, models trained with compression outperform their uncompressed counterparts, suggesting that compression may act as a beneficial regularizer that encourages length-invariant representations.

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20684v1 Announce Type: new Abstract: Corporate credit underwriting requires analysts to extract actionable evidence from long, heterogeneous financial documents spanning hundreds of pages and multiple languages. Standard Retrieval-Augmented Generation (RAG) pipelines optimize for semantic similarity, which frequently surfaces passages that are topically related but lack decision utility, a problem we term the similarity-utility gap. We propose a two-phase non-parametric retrieval architecture that separates high-recall candidate retrieval from high-precision utility ranking. The first phase combines lexical and dense multilingual retrieval to construct a broad candidate pool. The second phase applies an adaptive retrieval controller that filters candidates using query intent and document structure signals, followed by an LLM-as-a-Judge utility scoring mechanism that ranks passages by analytical usefulness rather than semantic proximity. A context-aware extraction module preserves structural fidelity across narrative text and complex financial tables. The system is deployed entirely on-premise to satisfy enterprise data governance requirements. Evaluated on a multilingual corpus of proprietary financial documents with analyst-curated relevance labels, the system significantly outperforms naive retrieval baselines. In production deployment across more than 800 credit analysts, document review time was reduced from several hours to approximately three minutes, demonstrating the practical value of utility-aware RAG architectures for document-intensive decision-support workflows.

DIVE: Embedding Compression via Self-Limiting Gradient Updates

Dongfang Zhao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20689v1 Announce Type: new Abstract: High-dimensional embeddings from large language models impose significant storage and computational costs on vector search systems. Recent embedding compression methods, including Matryoshka-Adaptor (EMNLP 2024), Search-Adaptor (ACL 2024), and SMEC (EMNLP 2025), enable dimensionality reduction through lightweight residual adapters, but their training objectives cause severe overfitting when labeled data is scarce, degrading retrieval performance below the frozen baseline. We propose \textsc{DIVE} (\textbf{D}imensionality reduction with \textbf{I}mplicit \textbf{V}iew \textbf{E}nsembles), a compression adapter that addresses this failure through two mechanisms. First, a self-limiting hinge-based triplet loss produces zero gradient once a triplet satisfies the margin constraint, bounding the total perturbation applied to the pretrained embedding space. Second, a head-wise NT-Xent contrastive loss treats multiple learned projections of each embedding as implicit views, providing dense self-supervised gradients that compensate for the sparsity of the triplet signal on small datasets. Across six BEIR datasets, \textsc{DIVE} outperforms all three baseline adapters on every dataset and at every evaluated compression ratio, with a 14M-parameter open-source implementation.

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

Shanshan Ye, Duo Lu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20690v1 Announce Type: new Abstract: Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

Interpretable Discriminative Text Representations via Agreement and Label Disentanglement

Tong Wang, Yiqing Xu, Leo Yang Yang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20693v1 Announce Type: new Abstract: Interpretable text representations should expose coordinates that are not only predictive, but also meaningful enough for independent auditors to apply. Existing discriminative representations often use anonymous embedding directions, while concept-bottleneck and LLM-assisted methods attach natural-language names to features without ensuring that those definitions are reproducible or distinct from the target label. We propose an operational criterion for interpretable discriminative text representations: each coordinate should satisfy conceptual clarity, measured by chance-adjusted agreement between independent annotators applying the feature definition, and label disentanglement, meaning the feature should not merely paraphrase the prediction target. We instantiate this criterion in LLM-assisted Feature Discovery (LFD), an iterative method that proposes lexical and semantic features from contrastive outcome-opposed text pairs, screens candidates using cross-LLM Cohen's $\kappa$, and selects features by residual held-out predictive gain. A stylized analysis connects the $\kappa$ screen to a per-feature annotation-noise bound, formalizing agreement as a reliability check. Across ten text-classification tasks spanning seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing substantially clearer and less label-entangled features. Human audits with 232 raters show that LFD features achieve higher human--human and human--LLM agreement than baseline concepts, and raters consistently judge them as less label-leaking. These results suggest that agreement-tested, label-disentangled coordinates provide a practical auditability standard for interpretable text classification.

Distributed Direct Preference Optimization

Zhanhong Jiang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20696v1 Announce Type: new Abstract: Preference-based reinforcement learning (RL) is a key paradigm for aligning policies with human judgments, yet its theoretical behavior in distributed settings where preference data are fragmented across heterogeneous users remains poorly understood. Direct Preference Optimization (DPO) avoids explicit reward modeling but lacks convergence guarantees under federated and decentralized training, where communication constraints and non-IID preferences fundamentally alter optimization dynamics. We provide the first convergence and time-complexity analysis of DPO in distributed environments. Modeling personalized offline RL with user-specific preference distributions, we characterize the induced global optimization landscape. For federated DPO, we derive convergence rates that quantify the impact of client drift, communication frequency, and preference heterogeneity; for decentralized DPO, we establish convergence over general communication graphs and show how spectral connectivity governs optimization speed and consensus. Empirically, we corroborate our theoretical insights on standard alignment benchmarks, demonstrating that our proposed methods not only enjoy strong theoretical guarantees but also deliver robust and scalable performance in practice. The code base is available here.

CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20701v1 Announce Type: new Abstract: Clinicians are expected to disclose harmful medical errors to patients and families in line with ethical, regulatory, and patient care standards, yet these conversations remain challenging because of their emotional complexity and limited training opportunities. Most physicians still learn primarily through lectures and observation, while static video tools-though available-are underused, lack adaptability across specialties, and deliver delayed, generic feedback. These gaps restrict skill development, reduce self-efficacy, and contribute to avoidance of disclosure conversations, ultimately compromising patient care and eroding trust. To address these needs, we designed CandorMD -- an AI-assisted simulation system that provides real-time practice, actionable feedback, and diverse practice environments tailored to individual learning needs. We conducted semi-structured interviews with physicians, risk managers, patient advocates, and communication experts to understand current practices, identify gaps, and collect feedback on CandorMD. Based on these insights, we present findings and design recommendations for the future of AI-supported medical communication training.

Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms

Saurabh Deochake — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20704v1 Announce Type: new Abstract: Autonomous AI agents that spawn sub-agent swarms create a safety gap: existing credential revocation mechanisms, OAuth~2.0 introspection, OCSP, and W3C Status Lists, require network connectivity to a central authority, leaving ``zombie agents'' executing privileged operations for minutes to hours after operator shutdown. We present Heartbeat-Bound Hierarchical Credentials (HBHC), a cryptographic protocol that binds credential validity to periodic parent liveness proofs. Verifiers enforce freshness using only a cached public key and local clock; no network round-trip is required. When heartbeat generation ceases, all descendant credentials become unusable within a deterministically bounded window $W_z \le W_{\max} + \Delta_h + \epsilon$, conditional on bounded clock skew and parent keys held in secure enclaves. Evaluation at the protocol layer and with real LLM-backed agent swarms (GPT-4o-mini) demonstrates a 90$\times$ reduction in the zombie window over OAuth~2.0, 0.26~ms full authentication in Rust, 18,000+ verifications per second under concurrent HTTP load, and stable per-verification latency from 10 to 10,000 agents. Real-agent experiments show 0.71\% end-to-end overhead on tool calls, zero post-revocation tool calls under prompt injection that bypasses application-layer guardrails, and cascading revocation across a 49-agent four-level hierarchy within the theoretical bound.

Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20706v1 Announce Type: new Abstract: Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language models and four model weight formats. We compare LlamaWeb against existing browser-based LLM frameworks and find that LlamaWeb requires 29-33% less memory across several combinations of device, browser, and operating system. We also evaluate LlamaWeb's performance against these frameworks and find that it increases decode throughput by 45-69% across four GPUs from separate vendors. In addition, we compare LlamaWeb's performance against other llama.cpp backends, where it is competitive with and even beats vendor-specific backend performance on some devices.

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20708v1 Announce Type: new Abstract: Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

Kavya Manohar, Arghya Bhattacharya, Kush Juvekar, Kumarmanas Nethil — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20712v1 Announce Type: new Abstract: Automatic speech recognition replaces typing only when correction costs less than manual entry, a threshold determined by error types, not counts: fixing a misrecognized domain term costs far more than inserting a comma. Word error rate (WER) fails on two fronts: it collapses distinct error categories into a single scalar, and it structurally penalizes agglutinative languages where valid sandhi merges inflate scores. We introduce SCRIBE, a diagnostic framework that provides categorical error decomposition into lexical, punctuation, numeral, and domain-entity rates through sandhi-tolerant alignment with domain vocabulary injection. Human validation confirms SCRIBE aligns with expert judgment where WER does not. We release SCRIBE, an LLM curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20713v1 Announce Type: new Abstract: Multimodal IE in social media is difficult because a post may attach multiple images that are weakly related, redundant, or even misleading with respect to the text. In this setting, always-on multimodal fusion wastes computation and can amplify spurious visual cues. The core challenge is to decide, for each candidate span or marked entity pair, whether vision should be consulted at all and, if so, which small subset of images provides trustworthy evidence. We propose SAVER, a selective vision-as-needed framework for multimodal named entity recognition and multimodal relation extraction. SAVER uses a Conformal Groundability Gate (CGG) to estimate span-level visual groundability in MNER, derive pair-level activation in MRE from the two marked entities, and calibrate the activation threshold on a held-out split via a conformal-style procedure with Clopper--Pearson upper bounds. When activated, a submodular relevance--diversity selector chooses a compact evidence subset across images, which is then aggregated by a Set Transformer. An energy-inspired joint scoring head combines text, optional visual evidence, text--image consistency, and sparse routing for entity typing or relation classification. Experiments show that SAVER consistently improves F1 over strong text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage at a fixed risk level, and lowering FLOPs and P90 latency.

Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification

Youngjoon Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20716v1 Announce Type: new Abstract: Random forests aggregate tree votes by simple majority, treating all trees as equally informative. We observe that the topological pattern along each tree's root-to-leaf decision path -- where and how often the dominant class label flips along it -- carries a signal of tree reliability that is exploitable for per-sample reweighting. The naive use of this signal is structurally confounded with the predicted class, so we propose a class-conditional ratio weighting that guarantees zero expected class bias by construction. On 30 binary classification benchmarks under a shared-forest, shared-split protocol with 30 repeats, the proposed method is the only one among four compared schemes -- RF, weighted RF, KNORA-Eliminate, KNORA-Union -- to yield a statistically significant accuracy improvement over RF (Wilcoxon p = 0.018), while the three alternatives all fail to do so (p > 0.5). It is also the only scheme without majority-recall regressions, with minority-recall regressions limited to 3/30 datasets -- a one-sided loss to which classical dynamic ensemble selection methods are susceptible. The gain is robust across forest sizes from 100 to 1000 trees.

E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference

Ankit Kumar Tenwar, Mukul Lokhande, Santosh Kumar Vishvakarma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20717v1 Announce Type: new Abstract: This work presents E-ReCON, a 16 Kb energy and resource-efficient digital compute-in-memory (DCIM) macro based on a compact 3T1R ReRAM bitcell for edge-AI inference. The proposed bitcell occupies only 0.85 um^2 and supports reliable AND-based in-memory multiplication for both conventional convolutional neural network (CNN) and spiking neural network (SNN) workloads. To reduce accumulation overhead, a novel interleaved 10T/28T adder tree is introduced, reducing transistor count and power consumption by 37% and 28%, respectively, compared to a conventional 28T RCA-based design. Implemented in 65 nm CMOS at 1.2 V, the proposed macro achieves a minimum latency of 0.48 ns, throughput of 2.31-3.1 TOPS, and energy efficiency of up to 419 TOPS/W. When evaluated on LeNet-5, AlexNet, and CNN-8 models, the macro achieves 97.81%, 93.23%, and 96.51% accuracy on MNIST/A-Z, CIFAR10, and SVHN datasets, respectively. In addition, 40% pruning preserves nearly 99.8% of the original accuracy while reducing MAC operations and computation cycles. For SNN-oriented workloads, the proposed AND-type bitcell efficiently supports spike-weight multiplication with low switching activity, where the 2A2W configuration achieves accuracy close to the FP32 baseline across VGG-8, VGG-16, and ResNet-18 networks on CIFAR-10, CIFAR-100, and ImageNet-1K datasets. Compared to prior ADC-based ReRAM-CIM designs, the proposed architecture improves latency and energy efficiency by nearly 30-40% while maintaining robust operation under full PVT and ReRAM variability. Overall, E-ReCON provides a scalable, low-latency, and energy-efficient nvCIM platform for next-generation edge-AI, IoT, biomedical sensing, and neuromorphic applications.

Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework

Zongyu Li, Xuanyu Liu, Gongce Cao, Shirui Sun, Yaqi Fang, Yongshuai Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20721v1 Announce Type: new Abstract: Learning from implicit feedback in recommender systems is fundamentally challenged by pervasive label noise. While conventional denoising approaches often discard noisy instances to ensure robustness, this strategy inevitably suffers from low data utilization. Alternative methods that employ a Bayes-label transition matrix (BLTM) can leverage all available data, but their estimates tend to be biased in practical recommendation scenarios. To address these limitations, this paper proposes a Robust GMM-weighted Bayes-label Transition Matrix framework (RGBT). Our solution utilizes a Gaussian Mixture Model (GMM) to derive instance-specific reliability scores, which systematically calibrate the BLTM estimation to mitigate bias. Theoretical analysis confirms that our approach, by leveraging the BLTM framework with GMM calibration, simultaneously ensures full sample utilization, delivers consistent estimation, and critically, achieves a significant reduction in estimation variance. Extensive experiments on multiple real-world and synthetically flipped datasets demonstrate that RGBT not only utilizes noisy samples more effectively than mainstream reliable sample-based denoising methods, but also achieves significantly superior calibration capability of the transition matrix compared to state-of-the-art transition matrix-based denoising approaches.

AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20722v1 Announce Type: new Abstract: Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on GSM8K and 40.5% on MATH. Gains transfer to Llama-3-8B and Gemma-2-9B, and ablations confirm both modules are complementary. Our implementation is publicly available at https://github.com/wandugu/paper_agpo.

Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20723v1 Announce Type: new Abstract: Deploying large deep neural networks on memory-constrained mobile devices is a central challenge in edge ML. While compression, pruning, and quantization reduce per-parameter cost, transformer-based models remain too large for the 3.3-7.4 GB RAM envelope of commodity Android handsets. We present the DNN pipeline scheduling subsystem of CROWDio, which achieves practical ONNX inference across resource-constrained Android workers without model modification, by distributing memory pressure across devices via five mechanisms: JIT deferred partition loading, a single-partition-resident constraint, a 4-tier affinity scheduler, a zlib-compressed tensor transport, and a streaming 1:1 dependency model. Evaluated on DistilBERT (Sanh et al., 2019) (approximately 67 M parameters, SST-2) across five Android handsets over ten runs, our system holds peak per-device RSS to 43+-2 MB and limits battery draw to 50+-3 mAh per run, while streaming concurrency cuts batch latency 34% below barrier synchronisation.

CALMem : Application-Layer Dual Memory for Conversational AI

Rajendra Narayan Jena, Rajan Padmanabhan, Sankar Arumugam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20724v1 Announce Type: new Abstract: Large language models (LLMs) operate within fixed context windows that fundamentally limit conversational continuity. When context fills, compaction discards history irreversibly; when sessions end, all memory resets to zero. Existing solutions-larger context windows, retrieval-augmented generation for knowledge bases, and memory-augmented architectures such as MemGPT-either require model modification, impose provider lock-in, or do not address the compaction continuity problem. We present CALMem (Conversational Application-Layer Memory), an application-layer dual memory architecture that gives LLM-based conversational assistants virtually unbounded effective context without any modification to the underlying model. CALMem combines two complementary memory subsystems: an episodic memory layer built on sliding-window vector embeddings of conversation history, and a semantic memory layer of agent-writable structured facts. A token-budget-adaptive injection mechanism, called the MOIM (Message of Injected Memory), automatically retrieves and injects relevant past context each turn, scaling injection depth inversely with context pressure. A key contribution is intra-session retrieval: compacted away turns from the current session remain searchable, closing a gap unaddressed by prior work. The system is implemented as a pure application layer in a production Rust codebase, is provider-agnostic, and degrades to original LLM behaviour with zero overhead when disabled. We describe the architecture, design decisions, and performance characteristics, and analyse the trade-offs that guided each implementation choice.

Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label

Jingyang Mao, Ningkang Peng, Yanhui Gu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20725v1 Announce Type: new Abstract: Learning with noisy labels in multimedia classification often combines external annotations and model predictions into a single reliability weight, even though the two sources can fail for different reasons. We instead estimate disentangled reliabilities: bilevel meta-learning produces two batch-normalized scalars per sample, alpha for the given label and beta for the pseudo-label, without constraining them to sum to one. Holistic Reliability Propagation (HRP) then routes them to different objectives, using reliability-aware Mixup with global gating on the input branch and beta-gated pseudo-label positives on the contrastive branch. On synthetic and real-world benchmarks, HRP improves average accuracy over strong baselines and remains competitive at the highest noise rates.

GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20727v1 Announce Type: new Abstract: Deep neural networks (DNNs) experience significant performance degradation when processing noisy labels, primarily due to overfitting on mislabeled data. Current mainstream approaches attempt to mitigate this issue by passively filtering clean samples during training. However, simple sample filtering within feature spaces degraded by noise struggles to distinguish between challenging samples and noisy samples, creating a bottleneck for model performance. We highlight for the first time the fundamental importance of actively reshaping feature space geometry for learning from noisy data. We propose a novel Geometry-aware Manifold Regularization Paradigm whose core idea is to explicitly construct energy barriers between data manifolds by actively synthesizing virtual outlier samples. By imposing geometric constraints that promote intra-class compactness and inter-class separation, this approach enhances the discriminability between hard and noisy samples, leading to the learning of more robust representations. Our regularization mechanism exhibits high universality, with effectiveness independent of any prior assumptions about noise patterns. It can be integrated as a standalone mechanism into existing sample selection frameworks, providing stronger robustness against diverse noisy environments. Experiments demonstrate that our paradigm achieves performance surpassing current state-of-the-art (SOTA) methods on multiple benchmarks, including CIFAR-10, with particularly pronounced advantages under more challenging asymmetric noise conditions. Furthermore, this paradigm significantly enhances the model's capability in Out-of-Distribution (OOD) detection, ensuring superior reliability and safety for deployment in open-world scenarios.

Early High-Frequency Injection for Geometry-Sensitive OOD Detection

Chuanjie Cheng, Ningkang Peng, Chenxi Liu, Yifan He, Peirong Ma, Yanhui Gu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20728v1 Announce Type: new Abstract: Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20729v1 Announce Type: new Abstract: Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research at https://github.com/rangehow/mtr-suite.

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning

Jihoon Kwon, Jiwon Choi, Jy-yong Sohn — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20730v1 Announce Type: new Abstract: In-context learning (ICL) allows large language models (LLMs) to adapt to new tasks through demonstrations, yet it suffers from escalating inference costs as context length increases. While task vectors offer a promising alternative by compressing demonstrations into compact hidden-state representations, their quality has been evaluated only through downstream task accuracy. This indirect criterion provides limited insight into how to design more effective task vector extraction methods. In this paper, we posit that inference using task vectors should align their predictive distribution with that of ICL. To quantify this, we introduce $d_{\text{NTP}}$, a metric that measures the discrepancy in next-token probabilities between task vector-based and ICL-based inference. Our empirical analysis reveals that $d_{\text{NTP}}$ serves as a performance proxy, exhibiting a strong negative correlation with downstream accuracy. Motivated by this, we develop Linear Task Vector (LTV), a method designed to minimize $d_{\text{NTP}}$ via a closed-form linear mapping that estimates demonstration effects through regression. Across eight classification benchmarks and five LLMs, LTV consistently outperforms existing task vector baselines, improving average accuracy by 9.2\% while reducing inference latency. We further show that LTV outperforms the baselines on regression tasks. Moreover, we investigate the transferability of LTV across different model scales; an aspect that has remained nascent in task vector research. Specifically, we empirically show that task vectors from a larger model can enhance a smaller model's performance by 6.4\%, suggesting a new utility for extracted task representations.

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20731v1 Announce Type: new Abstract: Text-to-image models produce graphic design at production scale, but their supervision comes from photo-style preference data with a single overall verdict per comparison. Designers evaluate along several distinct axes, including typography, visual hierarchy, color harmony, layout, and brief fidelity, and a single label collapses them. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.): ten professional designers ranked outputs from four current text-to-image models on nine criteria across two disjoint cohorts, yielding 1,600 ratings per criterion plus per-image hallucination flags on the holistic-preference cohorts. We pair the dataset with three contributions. First, a criterion-agnostic signal test framework, using Kendall's tau, majority probability, and Condorcet cycles against exact iid-uniform nulls at p = 4 and R = 5, places designer agreement on graphic design between food and movie preferences and photo-style image quality, with every TASTE criterion rejecting the random-rater null. Second, no pre-trained system in our benchmark, including six open-weight VLM judges from 3B to 33B parameters and three dedicated T2I scorers, HPSv2.1, PickScore-v1, and LAION-Aesthetic-V2, exceeds 0.55 macro agreement with the 5-designer majority; VLM judges trade off position bias against content sensitivity, so scaling moves along this frontier without improving accuracy. Third, a small pairwise-difference head trained on TASTE reaches 0.611, closing roughly half the gap to the 0.741 single-rater ceiling.

Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations

Kin Whye Chew, Jingxian Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20732v1 Announce Type: new Abstract: Convolutional Neural Networks (CNNs) often exploit spurious correlations in datasets, learning superficially predictive yet causally irrelevant features, leading to poor generalization and fairness issues. Deep Feature Reweighting (DFR) is a post-hoc technique that reduces a trained model's reliance on spurious correlations by retraining its classification head on a target dataset. However, we show that DFR is fundamentally constrained by operating on entangled features, limiting its ability to amplify the core features while simultaneously suppressing the spurious ones. We trace this entanglement to the ubiquitous Global Average Pooling (GAP) layer, which indiscriminately collapses spatially distinct core and spurious features into a single representation. To address this, we propose Deep Attention Reweighting (DAR), a post-hoc attention-based aggregation module that replaces GAP and is retrained jointly with the classification head. DAR computes an adaptive weighting of spatial locations across feature maps, enabling selective suppression of spurious features before the collapse into entangled features. Across various datasets, metrics, and ablations, DAR consistently outperforms DFR, demonstrating that our attention-based aggregation mitigates GAP-induced entanglement and reduces spurious reliance.

Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches

Wenda Wang, Anqi Liu, Junqi Yang, Lei He, Luying Wang, Jiachen Lu, Weixin Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20733v1 Announce Type: new Abstract: Converting hand-drawn sketches into structured 3D geometries remains challenging due to the difficulty of representing non-Euclidean surfaces and maintaining topological consistency. Existing generative models such as GANs, NeRFs, and diffusion architectures often fail to produce editable manifolds directly usable in downstream design workflows. We present Sketch2MinSurf, a hybrid vision-language and geometric optimization framework that integrates vision-language guidance with minimal-surface theory to generate smooth and editable 3D surfaces from hand-drawn sketches. The core of our approach is a spatial-topological encoding that represents geometry as tuples of node coordinates and real/virtual edge skeletons, enabling stable topological control during generation. We further introduce the Sketch2MinSurf Structural Loss (S2MS-Loss), a reward-modulated objective that jointly constrains geometric reconstruction and topological coherence. On a test set of 100 sketches, Sketch2MinSurf achieves a topological similarity score of 0.844, outperforming existing sketch-to-shape baselines. The generated manifolds are directly editable and free from non-manifold artifacts. A public art installation at a university showcases the method's potential for human-intent-driven 3D form generation. The dataset and code are available at https://anonymous.4open.science/r/Sketch2MinSurf/.

An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress

Alfredo Metere — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20734v1 Announce Type: new Abstract: A large language model (LLM) agent that sends messages can leak data inside them. Destination allowlists and content scanners do not police whether an otherwise-benign payload is itself a covert channel: a compromised agent encodes bits in zero-width characters, homoglyphs, whitespace, base64, JavaScript Object Notation (JSON) key ordering, message timing or size -- and, in binary egress, in least-significant-bit (LSB) pixel planes, per-image mean luminance, inter-image sequence permutation, ultrasonic tones, or audible-band sonified data. Our egress reference monitor has three contributions. (i) A text pipeline of ten capacity-reducing stages, a per-sink leaky-bucket capacity ledger, and a staged posture that enforces lossless stages from day one. (ii) Two media scramblers (a Fourier-domain audio band-limiter and a red-green-blue (RGB) image bit-depth and mean-luminance bucketer) gated by a boot-time cryptographic legitimacy attestation: an auditor publishes at boot the trusted Ed25519 keys and {kind, data-class} pairs; only payloads with a verifying signature for an authorized class are exempt. The attestation sidesteps the intractable content-based discrimination between real media and data sonified or rasterized as a carrier; unsigned media is suspect by default; a content-addressed canonicalizer closes the inter-image permutation channel. (iii) Residual capacity is the Miller--Madow corrected mutual information between embedded and recovered bits (zero when destroyed), measured by an adversarial ensemble of fifteen working encoders across text, image and audio. The reference implementation drives residual capacity to zero on every destroyable channel and to a stated bound on the one (per-image mean luminance) that cannot be destroyed without ruining the image.

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20735v1 Announce Type: new Abstract: This paper proposes two new open-source iris recognition algorithms, providing both Python and IREX-compliant C++ implementations to be submitted to the official IREX X program. This work has two primary goals: (a) to conduct the first-ever assessment of open-source iris recognition solutions according to IREX testing protocols, and (b) to offer a model C++ submission that significantly facilitates the entry of other teams' open-source methods into the IREX evaluation. The new methods consist of two Neural Networks trained with: (i) Triplet loss with Batch-Hard Triplet mining (TripletIris), and (ii) ArcFace loss (ArcIris). The paper also provides open-source IREX-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Except for CRYPTS, which faced timing constraints during 1:N search, these methods have undergone the official IREX X evaluation and have also been assessed using several popular academic benchmarks: Quality-Face/Iris Research Ensemble, Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2. Finally, this paper also provides open-source models for iris segmentation and circle estimation that can be incorporated into any new iris recognition method.

Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors

Siqi Wei, Hongbin Xu, Feng Xiao, Tian Lan, Chun Li, Ming Li, Qiuxia Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20737v1 Announce Type: new Abstract: Existing approaches for unsupervised 3D point cloud segmentation predominantly rely on a purely visual similarity-based learning-by-clustering paradigm, which suffers from a fundamental limitation: long-tail ambiguity. In such a paradigm, features of minor classes are consistently absorbed by dominant clusters, leading to severely imbalanced predictions. To address this issue, we propose LangTail, a language-guided hierarchical learning framework that leverages the balanced world knowledge encoded in language models to mitigate long-tail ambiguity in unsupervised 3D segmentation. The key idea is to establish multi-level associations between language-derived semantic priors and visually underrepresented minor classes, thereby compensating for the biased attention of purely visual clustering toward dominant classes. Specifically, LangTail first constructs an entity-level semantic prior from language models, capturing balanced and fine-grained world knowledge across categories. These priors are injected into a hierarchical clustering framework via contrastive alignment. This guides multi-granularity semantic structure formation and prevents minor classes from being absorbed by dominant clusters, yielding more discriminative representations for underrepresented categories. Extensive experiments on ScanNet-v2, S3DIS, and nuScenes demonstrate that LangTail consistently outperforms existing methods by significant margins, \ie, +13.5, +12.9, and +8.9 mIoU, respectively. These results demonstrate the effectiveness of language priors in improving the representation of minority classes in 3D point clouds. The code will be released at: https://github.com/Whisky0129/langtail_official.

STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection

Yaoteng Zhang, Qing Zhou, Junyu Gao, Qi Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20738v1 Announce Type: new Abstract: Remote sensing imagery typically arrives in the form of continuous data streams. Traditional detectors often forget previously learned categories when learning new ones; therefore, research on Remote Sensing Incremental Object Detection (RS-IOD) is of great significance. However, existing methods largely overlook the intra-class scale variations prevalent in remote sensing scenes, which undermines the effectiveness of knowledge transfer and old knowledge preservation. Moreover, RS-IOD also suffers from missing annotations, which cause the model to misclassify old-class instances as background. To address these challenges, we propose a novel framework, STAR-IOD. First, we introduce a Subspace-decoupled Topology Distillation (STD) module to transfer structural knowledge, explicitly aligning inter-class topological relationships and mitigating intra-class representation discrepancies induced by scale shifts. Furthermore, we introduce the Clustering-driven Pseudo-label Generator (CPG), a plug-and-play module that leverages K-Means clustering to dynamically identify class-specific thresholds, thereby guaranteeing an accurate distinction between true positive targets and background noise and alleviating the issue of missing annotations for old classes. We also constructed two Remote Sensing Incremental Object Detection datasets, DIOR-IOD and DOTA-IOD to facilitate research on RS-IOD. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches by 1.7% and 2.1% mAP on DIOR-IOD and DOTA-IOD, respectively, effectively alleviating catastrophic forgetting while preserving strong detection performance on both base and novel classes. The code and dataset are released at: https://github.com/zyt95579/STAR-IOD.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20740v1 Announce Type: new Abstract: Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

Joey Chan, Zhen Chen, Ershun Pan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20742v1 Announce Type: new Abstract: With the rapid proliferation of electric vehicles, the safety and reliability of lithium-ion batteries have become critical concerns. Effective anomaly detection is essential for ensuring safe battery operation. However, as battery systems and operating scenarios become increasingly complex, battery fault diagnosis and maintenance require stronger cross-domain adaptability and human-AI collaboration. Traditional fault detection and diagnosis methods are usually designed for specific scenarios and predefined workflows, making them less effective in complex real-world applications. To address the scarcity of open-source battery fault report corpora and the lack of unified maintenance knowledge representation, this study proposes a descriptive text modeling approach for battery signal reports. Monitoring signals, statistical features, anomaly records, and state assessment results are transformed into structured and readable natural language descriptions, forming a language corpus for battery health diagnosis and maintenance. Based on this corpus, we propose VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade battery systems. VBFDD-Agent integrates descriptive battery-state texts, historical case retrieval, local maintenance manuals, and large language model reasoning to generate structured diagnostic results and maintenance recommendations. Experiments show that the proposed framework can accurately perform anomaly monitoring based on descriptive textual representations and provide flexible, efficient, and actionable maintenance suggestions. Expert evaluation further confirms the practical value of the generated recommendations. Overall, VBFDD-Agent extends traditional battery diagnosis from label prediction to interpretable and maintenance-oriented decision support.

Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction

Juncheng Hu, Jiawei Du, Xin Zhang, Joey Tianyi Zhou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20743v1 Announce Type: new Abstract: Vision-language models solve geometry problems with rising accuracy, yet their intermediate states remain latent and unverifiable: a relation expressed in textual reasoning or drawing code carries no guarantee that a constraint-satisfying configuration realizes it. We observe that existing externalization methods based on rendered pixels or one-shot scripts fail to provide exact, per-action geometric guarantees. Enforcing geometric relations by algebraic definition closes this gap: the workspace becomes a constraint-checked evolving canvas. We present Draw2Think, a framework that recasts geometric reasoning from latent spatial inference into agentic interaction with the GeoGebra constraint engine. In a Propose-Draw-Verify loop, Draw2Think externalizes hypotheses onto an executable canvas, measures exact geometric quantities, and feeds structured observations back to the model, so subsequent reasoning proceeds from checked canvas state grounded by the shared workspace. This externalization makes two properties separately auditable: model-level Construction Fidelity (whether the canvas realizes the intended configuration) and engine-level Measurement Faithfulness (exact values and relations from canvas constraints). Across construction, outcome, and rendering evaluations, Draw2Think builds canvases that pass 95.9% predicate-level and 84.0% strict problem-level construction checks on GeoGoal, improves outcome accuracy by up to 4.1%/16.4% on planar/solid benchmarks, and attains 68.2%/90.5% strict/relaxed rendering scores on GenExam-math. Project page is available at https://draw2think.github.io/

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

Amit Roth, Ankur Samanta, Matan Halevy, Yoav Levine, Yonathan Efroni — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20744v1 Announce Type: new Abstract: Aligning autonomous agents with human intent remains a central challenge in modern AI. A key manifestation of this challenge is reward hacking, whereby agents appear successful under the evaluation signal while violating the intended objective. Reward hacking has been observed across a wide range of settings, yet methods for reliably measuring it at scale remain lacking. In this work, we introduce a new evaluation paradigm for measuring reward hacking. Whereas prior studies have primarily analyzed it post hoc by inspecting agent trajectories, we instead embed detectable reward hacking opportunities directly into environments. This makes their exploitation verifiable by design, enabling deterministic and automated measurement of whether and how agents exploit such vulnerabilities. We instantiate this approach in $\textit{TextArena}$ and release $\textit{Hack-Verifiable TextArena}$, a testbed in which reward hacking can be measured reliably. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments and settings. We open source the code at https://github.com/MajoRoth/hack-verifiable-environments/.

The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering

Yefan Zhou, Yilun Zhou, Austin Xu, Soroush Vosoughi, Shafiq Joty, Jiang Gui — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20745v1 Announce Type: new Abstract: Generative verifiers have emerged as a promising paradigm for step-wise verification, but their verification behavior is often poorly calibrated: they may be under-critical and miss erroneous steps, or over-critical and reject correct reasoning. We refer to this tendency to be overly lenient or overly critical as verifier strictness. In this work, we study whether verifier strictness can be controlled through hidden-state intervention. We uncover a verification-specific hidden-state signal: in step-wise verification, a verifier's tendency to accept or reject a solution step is encoded near the boundary of the corresponding verification paragraph. Exploiting this signal, we show that hidden-state steering can directly modulate verifier strictness without fine-tuning. However, uniform steering induces a trade-off between error detection and correctness certification. To address this, we propose VerifySteer, which exploits latent correctness signals for sample-level routing and selectively intervenes on paragraph boundaries. Experiments on ProcessBench and Hard2Verify show that VerifySteer outperforms prompt optimization and activation steering baselines, and is competitive with self-consistency while requiring 4-7x less inference compute. VerifySteer is also complementary to verification fine-tuning, providing further gains on top of fine-tuned verifiers. The code is available at https://github.com/YefanZhou/VerifySteer.

The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?

Xingyu Lyu, Qianqian Xu, Zhiyong Yang, Peisong Wen, Qingming Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20749v1 Announce Type: new Abstract: Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently outperform their non-gated counterparts, yet the underlying reasons for this advantage remain unclear. In this work, we study GLU by analyzing two-layer networks in the neural tangent kernel (NTK) regime. Our analysis reveals that the GLU structure reshapes the NTK spectrum, leading to a smaller condition number and a more compact eigenvalue distribution. Building on this finding, we further analyze the resulting training dynamics and show how the reshaped spectrum leads to faster convergence of GLU models, including a characteristic loss-crossing phenomenon observed between GLU and non-GLU models. Finally, we empirically observe that GLU has limited impact in reducing the generalization gap on various models, including ViT and GPT-2, suggesting that its primary benefit lies in accelerating optimization rather than reducing the generalization gap.

PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG

Canyu Lei, David Repaske, Jianxin Xie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20751v1 Announce Type: new Abstract: Effective diabetes management requires continuous monitoring of glycemic levels. Clinically, glycemic control is assessed using metrics such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), typically derived from continuous glucose monitoring (CGM). However, many patients rely on self-monitoring of blood glucose (SMBG) due to the high cost and limited accessibility of CGM. Unlike CGM, SMBG provides sparse and irregular measurements, making accurate estimation of these metrics challenging. Conventional supervised learning approaches struggle under such sparsity, leading to poor generalization and unstable performance. To address this, we propose PACD-Net, a self-supervised contrastive knowledge distillation framework for estimating glycemic control from SMBG. Pseudo-SMBG samples with richer temporal coverage are used as teacher signals to guide learning from sparse observations. In addition, multi-view contrastive learning enforces representation consistency across diverse sampling patterns. The model adopts a hybrid Swin Transformer-CNN backbone to capture temporal dependencies in sparse SMBG sequences. Experimental results demonstrate that PACD-Net consistently outperforms existing methods in estimating TAR, TIR, and TBR from real-world SMBG data, achieving improved accuracy as well as enhanced stability and generalization under extremely sparse observation settings. The proposed framework provides a practical tool for clinical SMBG interpretation and offers a generalizable approach for learning from sparse and irregularly sampled sensor data in broader applications.

GaussianDream: A Feed-Forward 3D Gaussian World Model for Robotic Manipulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20752v1 Announce Type: new Abstract: Vision-language-action (VLA) policies have advanced language-conditioned robotic manipulation by transferring semantic priors from pretrained vision-language models to action generation. Yet, standard action-imitation training often provides limited explicit supervision for 3D geometry, dense visual structure, and short-horizon environment evolution, which are critical for physically precise manipulation. We introduce \textbf{GaussianDream}, a feed-forward 3D Gaussian world-model plug-in that turns robot trajectories into structured spatial-temporal supervision. The key idea is to couple current Gaussian reconstruction with horizon-conditioned future Gaussian prediction during training, forcing a compact spatio-temporal prefix to be decodable into renderable 3D Gaussian states. This enables dense RGB rendering, depth, and pseudo 3D scene-flow supervision without requiring test-time Gaussian decoding. At inference, GaussianDream discards all auxiliary decoding heads and retains only the learned prefix to condition action generation, avoiding rendering, video rollout, or additional planning during closed-loop control. Experiments on LIBERO, RoboCasa Human-50, and real-robot tasks demonstrate strong and highly competitive performance, achieving \textbf{98.4\%} average success on LIBERO, \textbf{52.6\%} on RoboCasa Human-50, and \textbf{50.0\%} in real-world evaluation.

Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20756v1 Announce Type: new Abstract: Preconditioned optimizers are central to language model training, but their stochastic update rules are usually treated as direct approximations to population preconditioned descent. We show that this view misses two finite-sample biases. First, the gradient and preconditioner are typically estimated from the same minibatch, introducing gradient--preconditioner coupling bias. Second, even when the preconditioner estimate is unbiased, its inverse or inverse-root is generally biased because inversion is nonlinear. We propose a single-batch bias-correction framework that addresses both effects: cross-fitted preconditioning estimates the numerator and preconditioner from independent microbatch groups, while variance-corrected inversion uses microbatch variability to subtract the leading delta-method bias term. The framework applies to diagonal moment, diagonal curvature, and matrix preconditioning methods, instantiated in AdamW, Sophia, and Shampoo. Bias correction reduces held-out pretraining loss on Qwen2.5-0.5B by $0.15$, $0.07$, and $0.11$ nats, respectively; the effects on mixed-quality pretraining and downstream instruction tuning are consistently neutral-to-positive. Together, these results establish bias correction as a practical mechanism for reducing finite-sample update bias and improving the performance of preconditioned optimizers.

Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20758v1 Announce Type: new Abstract: Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance ($g^\text{car}$), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate $g^\text{car}$ across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that $g^\text{car}$ effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.

Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders

Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20759v1 Announce Type: new Abstract: Single-turn safety evaluation is a poor proxy for real fraud defense, where attackers escalate across multiple rounds. This paper evaluates fraud defenders under replay and adaptive multi-round attacks and measures when a defender refuses, not just whether it eventually refuses. On a frozen multi-round suite built from Fraud-R1, graph-context defenders improve early safe refusal relative to text-only baselines under both replay and adaptive fraud pressure, but they also produce substantially more benign over-refusal. Direct probing of the trained graph encoder, together with paired shuffle-risk ablations on both fraud and benign sides replicated across two seeds on the Qwen-1.5B backbone, localises this cost to how the defender LLM consumes structured context rather than to graph-encoder quality: the encoder cleanly separates fraud from benign, while the LLM responds primarily to the presence of structured graph fields and only secondarily, and asymmetrically, to risk-score magnitude. Temporal graph context is directionally stronger than static and significantly better grounded, but is not yet conclusively superior on the main refusal metrics. The contribution is evaluative and measurement-oriented: robust fraud assessment must be multi-round, must report refusal timing, must account for benign false positives alongside fraud-side safety gains, and must localize observed costs to the graph signal or to how the LLM consumes it.

SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation

K S Nithurshen, Saurabh J. Shigwan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20760v1 Announce Type: new Abstract: Automated segmentation of the vertebral column in Computed Tomography (CT) scans is a prerequisite for pathological assessment and surgical planning. However, state-of-the-art methods, particularly those based on Transformers or large-scale ensembles, demand substantial GPU resources, creating a barrier for clinical adoption in resource-constrained environments or on edge devices. To address this, we introduce SpineContextResUNet, a computationally efficient 3D Residual U-Net designed for rapid spinal localization. Our architecture integrates a lightweight Context Block that employs parallel multi-dilated convolutions to capture long-range anatomical dependencies without the high latency of Recurrent Neural Networks (RNNs) or the memory overhead of Self-Attention mechanisms. Extensive validation on two public benchmarks, VerSe2020 and CTSpine1K, demonstrates that our model achieves a Dice score of 88.17% and 88.13% respectively. To evaluate performance under strict hardware constraints, we compared our model against a bottlenecked SwinUNETR scaled to match our ~1.7M hardware footprint. While the constrained Transformer suffers severe performance degradation due to a lack of spatial inductive biases in a limited-data regime, our CNN-based approach successfully maintains high accuracy. Crucially, heavy baselines like TotalSegmentator fail due to memory exhaustion on commodity hardware (Intel Core i5, 8GB RAM), our model performs robust inference, making it a viable solution for point-of-care diagnostics and deployment on edge platforms like the Nvidia Jetson Orin Nano.

Findings of the Counter Turing Test: AI-Generated Text Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20761v1 Announce Type: new Abstract: The rapid proliferation of AI-generated text has introduced significant challenges in maintaining the integrity of digital content. Advanced generative models such as GPT-4, Claude 3.5, and Llama can produce highly coherent and human-like text, making it increasingly difficult to differentiate between human-written and AI-generated content. While these models have transformative applications, their misuse has raised concerns about misinformation, biased narratives, and security threats. This paper provides a comprehensive analysis of state-of-the-art AI-generated text detection techniques and evaluates their effectiveness through the Counter Turing Test (CT2) shared tasks. Task A (Binary Classification) required participants to distinguish between human-written and AI-generated text, while Task B (Model Attribution) focused on identifying the specific language model responsible for generating a given text. The results demonstrated high performance in binary classification, with the top system achieving an F1 score of 1.0000, but significantly lower scores in model attribution, where the best system achieved 0.9531, highlighting the increased complexity of this task. The top-performing teams leveraged fine-tuned transformer models, ensemble learning, and hybrid detection approaches, with DeBERTa-based and BART-based methods demonstrating strong results. However, the lower scores in Task B underscore the challenges of distinguishing outputs from different LLMs, necessitating further research into adversarial robustness, feature extraction, and cross-domain generalization.

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20763v1 Announce Type: new Abstract: Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently-available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.

HyFrac.fun: A 3D Hydraulic Fracturing Simulator on Cloud

Jing Hu, Qian Liu, Jaroon Rungamornrat — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20764v1 Announce Type: new Abstract: When multiple hydraulic fractures propagate simultaneously from a horizontal wellbore, elastic stress-shadow interactions generate complex non-planar three-dimensional geometries whose effect on subsequent reservoir drainage has infrequently been quantified, because the propagation and production solvers have historically been incompatible stand-alone tools. This paper presents HyFrac.fun, a cloud-native platform that bridges this gap by exploiting a structural isomorphism between the two SGBEM--FEM governing operator systems. The platform enables automated zero-conversion handoff of the evolved 3D fracture mesh directly to the steady-state Darcy production solver for realizing a fully integrated lifecycle simulation of multi-stage non-planar hydraulic fractures. The lifecycle analysis reveals a double shadow phenomenon: the mechanical stress shadow that suppresses inner-fracture growth during stimulation mirrors a fluid pressure shadow that reduces the inner fracture's drawout rate at small cluster spacing. Critically, switching to a shear-thinning power-law fracturing fluid leaves the fracture trajectories and production rates almost unchanged, demonstrating that stress-shadow-controlled fracture geometry instead of fluid rheology is the primary determinant of long-term production efficiency at equal injection rates. These physics findings are accessible from integrated fracture propagation and production simulations.

Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection

Zhu Liu, Yuanhang Yao, Ping Qian, Zihang Chen, Risheng Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20766v1 Announce Type: new Abstract: Point supervision has become a scalable solution to address dense annotation for infrared small target detection, but its performance is limited by two coupled bottlenecks: unstable pseudo-label evolution in cluttered, low-contrast infrared imagery and severe sample-distribution imbalance. In this paper, we present a more adaptive and stable framework to address these issues. Leveraging the intrinsic consistency between thermal radiation patterns and heat diffusion, we propose a physics-induced annotation strategy that expands single-point labels into reliable pseudo-masks. To further enhance supervision and alleviate sample imbalance, we develop a bi-level dual-update framework that jointly optimizes detector weights, sample weights, and diffusion parameters. A meta-classifier dynamically predicts sample-wise loss weights, while a differentiable diffusion module refines pseudo-labels with detection feedback, enabling adaptive interaction between training and hyperparameter optimization. Extensive experiments across multiple datasets demonstrate five-fold annotation acceleration, superior detection accuracy, and comparable performance with 30% of the training data, validating the efficiency and practicality of our approach. Our code is available at https://github.com/yuanhang-yao/diffuse-to-detect.

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20767v1 Announce Type: new Abstract: Large language models (LLMs) show potential as simulators of human behavior, offering a scalable way to study responses to interventions. However, because LLMs are trained largely on observational data, interventions in experiments with LLM-simulated synthetic users can induce unintended shifts in latent user attributes, causing user drift where the implicit simulated population differs across treatment conditions, potentially distorting effect estimates. We formalize the confounding or selection bias that can arise due to user drift and show how intervention-dependent shifts can inflate or attenuate observed differences in user responses under intervention. To diagnose confounding, we propose using negative control outcomes--attributes that should remain invariant under intervention--to identify distribution shifts across intervention conditions, providing evidence of user drift. To mitigate drift, we study adjusting the persona specification by eliciting additional confounders, finding that targeted, setting-relevant confounders can substantially reduce bias across survey-style and multi-turn agent evaluations.

Data-informed posterior approximation for Bayesian linear inverse problems

Haibo Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20770v1 Announce Type: new Abstract: Computing posterior distributions in large-scale Bayesian linear inverse problems is challenging due to the high dimensionality of the parameter space. In this work, we develop a data-informed framework that shifts the computational focus from the parameter space to the data space. We rigorously characterize an intrinsically low-dimensional data space, establish its isometric embedding into the parameter space, and show that the prior-to-posterior update is confined to a data-informed subspace. This perspective allows posterior inference to be carried out in a reduced data-informed subspace. Based on this formulation, we propose a quotient-space Golub--Kahan bidiagonalization method to construct data-informed Krylov subspaces, and integrate empirical Bayesian inference into the iterative framework, enabling simultaneous hyperparameter estimation and posterior approximation in a matrix-free manner. Numerical experiments on representative problems support the theoretical framework and demonstrate the effectiveness of the resulting method.

Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations

Kin Whye Chew, Jingxian Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20771v1 Announce Type: new Abstract: Spurious correlations in real-world datasets cause machine learning models to rely on irrelevant patterns, undermining reliability, generalization, and fairness. Active learning offers a promising way to address this failure mode by querying informative samples that distinguish core features from spurious ones. However, standard active-learning methods simply append queried examples to the labeled set, effectively updating only the likelihood term. In deep learning regimes, the influence of these informative samples can be diluted by the larger labeled set and memorized by overparameterized models. We propose Cumulative Active Meta-Learning (CAML), an active-learning framework that uses queried examples to meta-learn the prior, or inductive bias, governing how the model adapts. CAML casts each active-learning round as a meta-learning task: the current labeled set serves as meta-train data for adaptation, while the newly queried batch serves as meta-test data for evaluating generalization. Unlike conventional meta-learning, which treats tasks as independent and identically distributed, CAML exploits the sequential dependence between active-learning rounds by maintaining a cumulative inductive bias that is progressively refined. Theoretically, we show that this cumulative formulation introduces interaction terms that couple earlier meta-learned inductive biases with later query-induced objectives, capturing dependencies absent from standard meta-learning. Empirically, CAML improves minority-group accuracy across spurious-correlation benchmarks and acquisition strategies, with gains of up to 27.8% on Dominoes, 29.9% on Waterbirds, 14.3% on SpuCo, and 24.0% on CivilComments.

VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering

Jiayi Chen, Benteng Ma, Zehui Liao, Winston Chong, Yasmeen George, Jianfei Cai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20772v1 Announce Type: new Abstract: While medical Multimodal Large Language Models (MLLMs) have shown promise in assisting diagnosis, they still frequently generate hallucinated responses that appear linguistically plausible but lack visual evidence. Such hallucinations pose risks to clinical decision-making and necessitate effective detection. Existing introspective detection methods primarily perform uncertainty estimation or logical verification by analyzing model responses conditioned on original or perturbed inputs. However, such external perturbations are often heuristic and context-agnostic, which overlooks the internal cross-modal dependency between generated tokens and related visual tokens during decoding. To address this issue, we propose VIHD, a Visual Intervention-based Hallucination Detection method that leverages targeted visual token masking to calibrate semantic entropy for more effective hallucination detection. VIHD locates visually dominant decoder layers via Visual Dependency Probing (VDP), executes Visual Intervention Decoding (VID) via token masking to calibrate the semantic distribution, and quantifies the resulting Calibrated Semantic Entropy (CSE) as a reliable hallucination signal. Extensive experiments on three medical VQA benchmarks with two medical MLLMs demonstrate that VIHD consistently outperforms state-of-the-art methods, underscoring the importance of fine-grained visual dependency for hallucination detection. The code will be available at https://github.com/Jiayi-Chen-AU/VIHD

VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

Alex S. Huang, Jiahui Zhang, Shiqing Tang, Yu Xiang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20774v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

Manogna Sreenivas, Rohit Kumar, Soma Biswas — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20777v1 Announce Type: new Abstract: Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic method to ensure if fine-grained attributes such as color and textures of clothing, accessories are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through AttriLoss objective designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements on incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project-page:https://manogna-s.github.io/attristory/

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20780v1 Announce Type: new Abstract: Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce **REPA-P**, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing **zero overhead**. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at [https://github.com/Hxxxz0/REPA-P](https://github.com/Hxxxz0/REPA-P).

Causal Machine Learning Is Not a Panacea: A Roadmap for Observational Causal Inference in Health

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20782v1 Announce Type: new Abstract: Objective: The growing availability of large-scale observational clinical datasets and challenges in conducting randomized controlled trials have spurred enthusiasm in using causal machine learning (ML) for causal inference in observational data. We present a roadmap for applying causal ML to observational data. Materials and methods: We outline the importance of assessing validity assumptions within available data and applying causal ML responsibly for clinical experts using causal ML and ML practitioners with limited clinical expertise. Observations: Despite advances in causal ML, its limitations remain largely under-appreciated across disciplines. This gap in shared knowledge may impact the validity of findings. Discussion: Causal assumptions must be satisfied and modeling choices justified. Otherwise, these approaches risk producing biased or misleading results, with consequences for clinical research and patient care. Conclusion: Causal ML can be a powerful tool for generating causal hypotheses. We provide a template to strengthen the rigor and interpretability of causal analyses.

Interaction Locality in Hierarchical Recursive Reasoning

Yosuke Miyanishi, Tetsuro Morimura — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20784v1 Announce Type: new Abstract: Spatial reasoning requires both location-bound computation and location-invariant structure: agents must make local moves while preserving route, object, or constraint-level plans. We propose interaction locality, a task-geometry-aware framework for measuring whether information flow stays within nearby cells or semantic segments, or crosses them. We instantiate the framework with sparse-autoencoder feature ablations and finite-noise activation patching, with structural Jacobian and attention checks reported in the appendix, and apply it to HRM and TRM, two compact hierarchical and recursive reasoning models, on Maze-Hard, Sudoku Extreme, and ARC-AGI. Across these models, activation patching gives the clearest architectural fingerprint: high-level recurrent states tend to write information within nearby cells or same-segment units, while repeated recursive updates accumulate these local writes into broader solution structure. This pattern holds across maze paths, Sudoku constraints, and ARC-AGI object neighborhoods, with the strongest concentration in TRM. To test whether interaction locality extends beyond toy-yet-challenging grid benchmarks, we also apply it to MTU3D, a large-scale embodied 3D scene-grounding model. In this MTU3D setting, causal spatial locality appears primarily at the transition where visual scene features are handed to the downstream grounding module, rather than uniformly throughout the visual encoder. This contrast suggests that the local-to-global handoff observed in HRM and TRM is tied to explicit recursive reasoning dynamics, while embodied 3D models may concentrate causal spatial structure at module boundaries. Interaction locality turns the intuitive local-execution/global-planning story into a reproducible measurement framework for recursive and embodied spatial reasoning.

Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems

Wajdi Zaghouani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20786v1 Announce Type: new Abstract: This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.

Findings of the Counter Turing Test: AI-Generated Image Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20787v1 Announce Type: new Abstract: The rapid advancements in generative AI technologies, such as Stable Diffusion, DALL-E, and Midjourney, have significantly transformed the creation of synthetic visual content. While these models enable innovation across industries, they also pose serious challenges, including misinformation, disinformation, and biased content generation. The increasing realism of AI-generated images makes their detection a pressing concern for researchers, policymakers, and industry stakeholders. In this paper, we present the findings of the Defactify 4.0 workshop, which introduced the Counter Turing Test (CT2) for AI-Generated Image Detection. The competition consisted of two key tasks: (1) binary classification of images as either AI-generated or real and (2) identification of the specific generative model responsible for an AI-generated image. To facilitate this, we developed the MS COCOAI dataset, consisting of 50,000 synthetic images from multiple generative models alongside real-world images from the MS COCO dataset. Participants employed diverse detection strategies, including convolutional neural networks (CNNs), Vision Transformers (ViTs), frequency-based analysis, contrastive learning, and multimodal techniques. The results demonstrated that while AI-generated images can be detected with high accuracy (F1-score > 0.83), identifying the exact model used remains significantly more challenging (highest F1-score: 0.4986). These findings highlight the need for improved model fingerprinting, adversarial robustness, and real-time detection mechanisms.

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software

Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Xin Lian, Hangyu Cheng, Jiayin Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20788v1 Announce Type: new Abstract: Software defect detection is a critical task in software engineering. However, no prior studies have specifically addressed defect detection in bioinformatics software. Given that the performance of defect detection tasks is primarily influenced by both models and datasets, our experiments controlled for model-related factors and confirmed the limitations of existing datasets in bioinformatics software. To address this issue, we introduce BioDefect, the first dataset specifically designed for defect detection in bioinformatics software, aiming to overcome the limitations of existing datasets in this context. Unlike prior datasets, BioDefect includes complete source code repositories, preserving the actual contextual information of defective code, thereby more accurately reflecting real-world defect scenarios in bioinformatics software. Additionally, BioDefect mitigates issues related to label inconsistency and data leakage, ensuring high data quality and experimental reliability. To evaluate the effectiveness of BioDefect, we conduct a systematic assessment on nine language models (LMs), including DeepSeek-R1. The results demonstrate that BioDefect significantly enhances defect detection performance for bioinformatics software. Compared to existing datasets, BioDefect achieves an average F1-score improvement of 29.61% to 38.04% across all models, highlighting its superior advantages. This study fills a critical research gap in bioinformatics software defect detection, laying a foundation for future studies in this field and offering new insights for improving bioinformatics software quality assurance.

Assessing socio-economic climate impacts from text data

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20793v1 Announce Type: new Abstract: Recent advances in natural language processing (NLP) and large language models (LLMs) have enabled the systematic use of large-scale textual data from news, social media, and reports to create datasets with socio-economic impacts of climate hazards such as floods, droughts, storms, and multi-hazard events. As the field of text-as-data for impact assessment expands, so does its methodological complexity. Yet research remains fragmented, with no clear guidelines for defining what constitutes an impact, handling temporal and spatial biases, and selecting appropriate modeling and post-processing strategies. This lack of coherence limits transparency and comparability across studies. Here, we address this gap by synthesising common practices, describing key challenges specific to the use of text-as-data methods for analyzing socio-economic impact data, and proposing recommendations to address them. By providing guidance on best practices, we aim to support the construction of robust text-derived socio-economic impact datasets that can more accurately inform disaster risk management and attribution studies.

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20795v1 Announce Type: new Abstract: Flow matching based video generative models have been increasingly relying on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs of meta-query and connector in the existing video editing models. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.

CMC-Opt: Constraint Manifold with Corners for Inequality-Constrained Optimization

Yetong Zhang, Frank Dellaert — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20796v1 Announce Type: new Abstract: We introduce a manifold-based framework for addressing optimization problems with equality and inequality constraints found in robotics. Our approach transforms the original problem into an unconstrained optimization problem directly on the constrained state space. To achieve this, we introduce ``constraint manifolds with corners" to represent the state space satisfying mixed nonlinear equality and inequality constraints. We further extend manifold optimization algorithms to operate on this new topological structure. We demonstrate the power and robustness of our framework in the context of a large-scale kinodynamic planning problem, successfully generating dynamically feasible trajectories where standard methods fail.

Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization

Yiliang Yuan, Xiang Shi, Mustafa Misir — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20797v1 Announce Type: new Abstract: The present paper introduces a new representation-driven approach to per-instance algorithm selection, applied to black-box optimization, for automatically choosing the most promising solver from a fixed portfolio. Prior work in continuous optimization largely relies on numerical descriptors, including Exploratory Landscape Analysis features and learned embeddings such as Deep-ELA. This work studies a complementary representation: contour-map visualizations of probed landscapes. A CNN regressor takes multiple instance-specific contour views (stacked or encoded per view and aggregated) and predicts per-solver performance, enabling selection by the predicted best value. On the standard BBOB 2009 single-objective protocol, the resulting selectors significantly outperform the single best solver (SBS) and are competitive with feature-based baselines. A subsequent bi-objective evaluation under the DeepELA setting further indicates that the same image-based principle can be competitive when using windowed contour views. Overall, the results suggest that simple vision models can exploit spatial structure in probed landscapes for algorithm selection without handcrafted ELA features.

Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

Yang Zhao, Jiahao Lu, Bin Huang, Guhua Zhang, Jie Zhou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20798v1 Announce Type: new Abstract: Narang et al. (2021) evaluated 40+ Transformer modifications at T5-base scale and concluded that most did not transfer. Five years later, the typical working regime has moved to 1-3B parameters, downstream evaluation has replaced pretraining perplexity, and a substantially different catalogue of modifications has emerged. We revisit their question by testing 20 post-2021 Transformer modifications at 1.2B and 3B under strict iso-data, iso-compute, iso-recipe control, with a multi-seed baseline noise floor and CLIMB-12 downstream evaluation as the primary metric. The central finding reproduces theirs at this curated set: most modifications do not transfer. Of the 20 modifications, only two clear Bonferroni correction at 1.2B; one of those two further fails to train stably at 3B under the shared recipe. We also find that the loss-downstream gap reported by Tay et al. (2023) enlarges several-fold for attention-output modifications: two significant failures converge to within 2-3% of baseline validation loss yet drop 6-16 CLIMB-points. We conclude that noise-floor reporting, downstream evaluation, and cross-scale stability testing are now prerequisites for architecture comparisons at 1-3B.

Instant GPU Efficiency Visibility at Fleet Scale

Connor Pedersen, Dong H. Ahn, Michel Migdal, Collin Neale, Nik Konyuchenko — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20799v1 Announce Type: new Abstract: We present Overall FLOP Utilization (OFU), a hardware-level, precision-agnostic GPU efficiency metric for AI workloads on HPC systems, derived from two on-chip performance counters: Tensor Pipe Activity and SM clock frequency. OFU requires no application instrumentation and works across GPU generations and numeric precisions. We characterize five properties of the OFU approximation -- tile quantization, floating-point precision scaling, clock sampling noise, Tensor Core clock domains, and non-tensor undercounting -- through controlled GEMM experiments on H100 and GB200 across FP16, TF32, FP8, and NVFP4. After tile-quantization correction, OFU predicts application-level MFU to within <=2 percentage points. Against 608 production training jobs, OFU achieves r = 0.78 correlation with application-level MFU and surfaces two framework-level FLOPs miscalculations. Deployed across large-scale GPU fleets, OFU has detected a 2.5x efficiency regression and tracked precision-dependent utilization changes in mixed-precision pretraining. Our evaluation and operational experience suggest OFU is a practical, deployment-ready complement to application-level MFU for continuous fleet-wide efficiency monitoring.

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20801v1 Announce Type: new Abstract: Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.

ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20802v1 Announce Type: new Abstract: Spiking neural networks (SNNs) exploit event-driven and addition-only computation to substantially improve efficiency for intelligent computation. A key temporal property of SNNs, elastic inference, allows outputs to emerge progressively, enabling responses to salient inputs much earlier than full evaluation. However, existing SNN-specific accelerators cannot capitalize on this property. Layer-by-layer designs emit outputs only after all layers are complete, while time-step-by-time-step designs rely on coarse-grained, layer-wise pipelines that require synchronizing all spines/tokens within a layer. This barrier prevents results from being forwarded immediately, delaying the earliest possible response and forfeiting the benefits of elastic inference. To address these challenges, we propose ELSA, a near-SRAM dataflow architecture that realizes true elastic inference through a fine-grained spine/token-wise pipeline and hardware optimizations tailored to SNNs. ELSA forwards each spine/token immediately upon production, forming a continuous streaming pipeline that substantially reduces the latency to the first response. To enhance this lightweight execution, ELSA introduces a bundled address event representation protocol to lower communication traffic of network-on-chip (NoC), and leverages mini-batch spiking Gustavson-product to cut memory access and exploit inherent sparsity. Combined with mapping and scheduling optimizations, ELSA achieves efficient, event-driven computation without compromising accuracy. Experiments show that SNNs can outperform quantized artificial neural networks (QANNs) while maintaining on-par accuracy. For a 4-bit ResNet-50, ELSA achieves 3.4$\times$ speedup and 13.6$\times$ higher energy efficiency over the SOTA QANN accelerator (ANT), and 2.9$\times$ speedup and 22.1$\times$ energy efficiency gains over the SOTA SNN accelerator (PAICORE).

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

Kei Hiroshima, Kento Uchida, Shinichi Shirakawa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20803v1 Announce Type: new Abstract: Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to their deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. The experimental result on CL benchmark tasks demonstrates that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where the preferences of each task performance differ.

OlmoEarth v1.1: A more efficient family of OlmoEarth models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20804v1 Announce Type: new Abstract: We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training ($1.7 \times$ reduction in GPU hours required to train our Base models) and inference ($2.9\times$ reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

Hanzhong Guo, Yizhou Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20807v1 Announce Type: new Abstract: Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

Jinjin Zhang, Xiefan Guo, Di Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20808v1 Announce Type: new Abstract: Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.

Refining and Reusing Annotation Guidelines for LLM Annotation

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20809v1 Announce Type: new Abstract: While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20811v1 Announce Type: new Abstract: Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20813v1 Announce Type: new Abstract: Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95$\times$ end-to-end speedup over FlashAttention across several context lengths.

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

Peter Fernandes, Ria Kanjilal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20815v1 Announce Type: new Abstract: Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20818v1 Announce Type: new Abstract: In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments from long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that effectively leverages the strong video-language reasoning capability of multimodal large language model (MLLM) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from existing localization model OSGNet, and then employ MLLM to select the segment that best matches the given query, thereby refining the final prediction. Ultimately, our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at https://github.com/iLearn-Lab/CVPR25-OSGNet.

AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

Zhaojie Zeng, Yuesong Wang, Yawei Luo, Tao Guan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20820v1 Announce Type: new Abstract: 2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict--Optimize--Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160--300\,ms. Code: https://github.com/whoiszzj/AIR.git

VSCD: Video-based Scene Change Detection in Unaligned Scenes

Jiae Yoon, Ue-Hwan Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20821v1 Announce Type: new Abstract: Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

Jiae Yoon, Ue-Hwan Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20822v1 Announce Type: new Abstract: In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at https://github.com/AutoCompSysLab/TERDNet.

RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, Tuan Kiet Pham, Sui Yang Guang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20823v1 Announce Type: new Abstract: Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission

Markovian Circuit Tracing for Transformer State Dynamic

Abdullah X — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20824v1 Announce Type: new Abstract: Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline

Forward asymmetric numeral systems coding for natural language text compression

Mykyta Kharin, Igor Zavadskyi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20826v1 Announce Type: new Abstract: Compression based on asymmetric numeral systems (ANS) combines high encoding and decoding speeds with a compression ratio close to Shannon entropy, while forward modeling of the information source makes it possible to obtain an estimated compressed message size that is less than the entropy. This paper proposes combining these modeling and adaptive coding methods. In addition to ensuring high data processing speeds and compression ratios, this approach enables one to implement the adaptive ANS, which has long remained an important scientific and practical problem.

HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

Yaoyao Yue, J\'er\^ome Schmid, Xiaoshuang Li, Eduardo Delamare, Jinman Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20827v1 Announce Type: new Abstract: Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods ($p < 0.05$), achieving a 25.76 dB PSNR, 85.70\% SSIM, and an 83.83\% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4\% Dice) and the inferior alveolar canal (72.2\% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.

Hamiltonian and Symplectic Tensors in the T-product Algebra

Susana Lopez-Moreno, Taehyeong Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20829v1 Announce Type: new Abstract: We study Hamiltonian and symplectic tensor structures in the T-product algebra. We define T-Hamiltonian and T-symplectic tensors and characterize them through their Fourier-domain slices. For T-Hamiltonian tensors we establish the standard block form and the spectral symmetry of T-eigenvalues, while for T-symplectic tensors we derive the inverse and exponential-map properties. Our main result is a constructive T-Williamson normal form for tensors whose Fourier-domain slices are real symmetric positive-definite matrices. We also show that, under the Hermitian symplectic convention adopted here, this decomposition does not extend directly to arbitrary Hermitian positive-definite Fourier-domain slices, and we derive a real-valued recovery criterion under Fourier conjugate symmetry. Numerical experiments verify the construction, exhibit runtime trends consistent with the slice-wise complexity $O(pn^3)$, and illustrate the framework on a Fourier-domain encoding of covariance-matrix families arising in continuous-variable quantum dynamics.

MemGym: a Long-Horizon Memory Environment for LLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20833v1 Announce Type: new Abstract: Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20834v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPOs' guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20837v1 Announce Type: new Abstract: Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

USV: Towards Understanding the User-generated Short-form Videos

Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20838v1 Announce Type: new Abstract: Several large-scale video datasets have been published these years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has achieved plausible improvement these years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks: topic recognition and video-text retrieval on USV. We propose two unified and effective baseline methods Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition task and video-text retrieval respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang, Jonathan Gregory, Grigorios G. Chrysos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20839v1 Announce Type: new Abstract: Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.

Efficient and simple fourth-order compact finite difference methods for convection-diffusion-reaction equations on arbitrary curved domains

Qiwei Feng, Bin Han, Peter Minev — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20842v1 Announce Type: new Abstract: In this paper, we discuss the 2D convection-diffusion-reaction equation with variable smooth coefficients and the Dirichlet boundary condition on a complicated, thin, and curved domain. We propose the fourth-order compact FDM at every grid point with the uniform Cartesian mesh. For the regular stencil center, we utilize the fourth-order compact 9-point FDM to approximate the solution. According to the preliminary analysis, we use vertical and horizontal transformations to derive fourth-order compact FDMs in 10 cases for all irregular stencil centers. To obtain the left-hand side of the stencil of the fourth-order FDM in each case, we only need to solve an at most $6 \times 24$ linear system which is presented with the explicit formula. The right-hand side of the FDM is constructed in explicit expression for any irregular stencil centers too. To achieve the fourth-order consistency, up to second-order partial derivatives of convection, diffusion, reaction, and source terms are used for the FDM at the regular stencil center, and the FDM at an irregular stencil center only requires first-order partial derivatives of convection, diffusion, reaction, and source terms, and up to third-order derivatives of the Dirichlet boundary function and the parametric expression of the boundary curve. We test challenging domains with 100-leaf, high-curvature, high-frequency, sharply varying, and nearly overlapping boundary curves, the proposed FDM produces the high accuracy and the stable fourth-order convergence rate in $l_2$ and $l_{\infty}$ norms. All stencils of our FDMs have a simple desired structure by only keeping grid points inside $\Omega$ in the standard compact 9-point stencil for both regular stencils and boundary stencils, but without assuming any information outside the domain $\Omega$.

SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP

Shihao Li, Naohiko Sugita — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20850v1 Announce Type: new Abstract: Objective: Stage-wise workflows that separate model scaling and inverse kinematics can induce morphology-posture compensation, resulting in anatomically inconsistent yet numerically acceptable solutions, especially in weakly observed directions. We present SmoCap, a leakage-resistant canonicalization framework that estimates morphology and posture jointly in each local trust-region quadratic program (QP) within a sparse control subspace. Methods: SmoCap solves a constrained trust-region QP with analytical proxy-mapped pose and scale Jacobians. The low dimensional proxy map stabilizes weakly observed directions and drives coordinated structures. An optional pre-solve provides warm starts in difficult configurations. The framework is evaluated using cohort fluoroscopy knee motion, anthropometric ground truth, and extreme yoga sequences. Results: SmoCap achieved 2.9 degree knee flexion RMSE against fluoroscopy, and a pooled anthropometric endpoint error around 3%. In the leakage audit against segment wise scaling, SmoCap also reduced marker RMSE, FE error, and anthropometric endpoint error. Proxy coupling preserved expressive and coordinated spine motion with marginal fitting error increase (+0.14 mm, +0.6%) against baseline models in yoga ablation. Median marker RMSE was around 20 mm, and median runtime was 0.204-0.332 ms/frame, achieved with consistently 2-3 iterations. Conclusion: SmoCap provides an externally validated unified coupling-aware scale-pose framework, making externally consistent motion canonicalization practical at dataset scale.

SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20853v1 Announce Type: new Abstract: Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but continuous recording generates large amounts of non-informative audio, creating challenges for storage, power consumption, and long-term edge deployment. Bird audio detection (BAD), which identifies bird vocalizations, can reduce this burden by filtering irrelevant recordings before downstream analysis. However, most BAD systems are trained on temperate datasets despite tropical soundscapes being denser, more species-rich, and acoustically unpredictable. To address this gap, we introduce SEABAD (Southeast Asian Bird Activity Detection), a dataset of 50,000 curated three-second clips from Southeast Asian soundscapes, evenly balanced between bird-present and bird-absent samples. The dataset spans 1,677 bird species and is standardized to 16 kHz mono audio for embedded and low-power inference. We developed a dual-branch curation pipeline: a six-stage positive-label workflow applied to Xeno-Canto recordings, alongside six source-specific negative-label extractions from environmental datasets. These procedures reduced class imbalance by 13.7% (Gini coefficient: 0.601 to 0.519). A manual audit of 1,000 positive clips confirmed 97.8% +/- 0.9% labeling accuracy. Baseline experiments using MobileNetV3-Small achieved 99.57% +/- 0.25% accuracy and 0.9985 +/- 0.0002 AUC across three random seeds. SEABAD and the full curation pipeline are publicly released to support tropical BAD research and energy-efficient acoustic monitoring.

Finite-Time Regret Analysis of Retry-Aware Bandits

Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20854v1 Announce Type: new Abstract: We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20856v1 Announce Type: new Abstract: Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $\pi_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: {https://github.com/ReNginx/DISC}.

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20863v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has recently unlocked strong reasoning capabilities in large language models (LLMs), triggering rapid exploration of new algorithms and data. However, RLVR training is notoriously inefficient: long-tailed rollouts, tool-induced stalls, and asymmetric resource requirements between rollout and training introduce substantial idle time that cannot be eliminated by job-local optimizations such as synchronous pipelining, asynchronous rollout, or colocated execution. We argue that this inefficiency is structural. While idle gaps are unavoidable within individual RLVR jobs, they are largely anti-correlated across jobs and therefore exploitable at the cluster level. Leveraging this observation, we present PlexRL, a cluster-level runtime for multiplexing unified LLM services across RLVR jobs. By centrally managing model placement, state transitions, and function-level scheduling under strict affinity constraints, PlexRL time-slices LLM execution across jobs to fill otherwise idle periods without expensive model migration. Our implementation and evaluations demonstrate that PlexRL significantly improves effective cluster capacity and reduces user GPU hour cost by maximum 37.58% while preserving algorithmic flexibility and introducing minimal per-job overhead.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20865v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

Yassine Maziane, Ammar Mahran, Artavazd Maranjyan, Peter Richt\'arik — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20866v1 Announce Type: new Abstract: Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

Yingjia Xu, Jiulong Wu, Bowen Zhang, Baokui Guo, Siyuan Chai, Min Cao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20867v1 Announce Type: new Abstract: Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.

Runtime-Certified Bounded-Error Quantized Attention

Dean Calver — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20868v1 Announce Type: new Abstract: KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA~3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

SeungJeh Chung, Geonho Park, Misong Kim, HyeongYeop Kang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20872v1 Announce Type: new Abstract: Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20873v1 Announce Type: new Abstract: Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.

Governance by Construction for Generalist Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20874v1 Announce Type: new Abstract: Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20876v1 Announce Type: new Abstract: Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

CIG: Exploration via Conditional Information Gain

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20878v1 Announce Type: new Abstract: Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.

NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity

Kaifeng Wei, Teng Liu, Liang Dong, Xiubo Liang, Yuke Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20879v1 Announce Type: new Abstract: Graph Anomaly Detection (GAD) is increasingly shifting to Generalist GAD (GGAD) for cross-domain "one-for-all" detection, but existing GGAD methods predominantly rely on the neighbor consistency principle, falling into the \textbf{Node-to-Neighbor Consistency Paradigm} for anomaly quantification. These methods suffer from complex training pipelines, heavy training data dependency, high computational costs, and unstable cross-domain generalization. To address these limitations, we propose NeighborDiv, a training-free generalist graph anomaly detection framework based on neighbor diversity. Departing from the dominant Node-to-Neighbor Consistency Paradigm, we shift the focus to the \textbf{Neighbor-to-Neighbor Diversity Paradigm}, and uncover that the internal structural dispersion of a node's neighbor set is a powerful, independently discriminative anomaly signal. We quantify neighbor diversity via the variance of inter-neighbor feature similarities, which captures how a node organizes its local graph environment, and operates independently of conventional node-to-neighbor consistency frameworks. Extensive experiments under two standard GGAD evaluation paradigms show NeighborDiv achieves state-of-the-art performance, with relative gains of 10.25% in average AUC and 17.78% in average AP over the second-best baseline under Single-Domain Independent Training (SDIT), and 6.89%/9.58% in AUC/AP under Unified Multi-Domain Training (UMDT), respectively. Notably, NeighborDiv yields zero performance volatility across all datasets, eliminating training-set dependency and establishing a lightweight and highly practical GGAD framework.

Learning fMRI activations dictionaries across individual geometries via optimal transport

Sonia Mazelet, R\'emi Flamary, Bertrand Thirion — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20883v1 Announce Type: new Abstract: Dictionary learning is a powerful tool for creating interpretable representations. When applied to functional magnetic resonance imaging (fMRI) data, the resulting patterns of brain activity can be used for various downstream tasks, such as brain state classification or population-level analysis. However, a major challenge is the variability in brain geometry across individuals. This is usually addressed by projecting each individual brain geometry onto a common template, which removes subject-specific information. In this work, we introduce a novel approach to dictionary learning on fMRI data that explicitly accounts for this variability. We use the optimal transport-based Fused Gromov-Wasserstein (FGW) distance to compare graphs with different geometries and features. To address the challenge of computing multiple FGW distances for large graphs such as those arising from fMRI data, we rely on amortized optimization to learn a neural network that predicts an approximation of the optimal transport plans, which substantially reduces the computational cost. Additionally, we learn dictionary atoms that depend on the FGW trade-off parameter, which controls the balance between feature alignment and structural consistency. Numerical experiments on the HCP dataset demonstrate that the proposed approach captures different levels of geometric variability in the data and provides representations that preserve essential information.

Solving Multivariate Polynomial Systems and Rectangular Multiparameter Eigenvalue Problems with MacaulayLab

Christof Vermeersch, Bart De Moor — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20884v1 Announce Type: new Abstract: We present the Matlab toolbox MacaulayLab, which implements numerical linear algebra algorithms for solving multivariate polynomial systems and rectangular multiparameter eigenvalue problems. Its structure and functionality are the result of several years of research and algorithmic development. We demonstrate how the software works and compare its performance with other software packages, such as PNLA, PHCpack, and MultiParEig. Some core features of MacaulayLab are the fact that it solves two key problems via one common approach, works independently of the chosen polynomial basis and monomial order, and is capable of dealing with positive-dimensional solution sets at infinity. The toolbox (including its future updates) and a large collection of test problems are freely available online.

Training distribution determines the ceiling of drug-blind cancer sensitivity prediction

Taekyung Heo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20885v1 Announce Type: new Abstract: Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.

Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20889v1 Announce Type: new Abstract: Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20891v1 Announce Type: new Abstract: Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (\eg Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect;(2) lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a \underline{H}ierarchical \underline{D}ecoupling-Fusion \underline{M}ixture-\underline{o}f-\underline{E}xperts (HDMoE) framework with two levels of MoE and \underline{R}andom \underline{F}eature \underline{R}eorganization (RFR) modules.In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at https://github.com/ZJUMAI/HDMoE.

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu, Junhui Li, Ruitong Lu, Jialu Li, Youshan Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20892v1 Announce Type: new Abstract: Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49\% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Haoran Huang, Haonan Dong, Huixu Dong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20894v1 Announce Type: new Abstract: Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.

GenAI-Driven Threat Detection with Microsoft Security Copilot

Scott Freitas, Amir Gharib — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20896v1 Announce Type: new Abstract: Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.

Creating Robust and Fair Graph Structures for Connectivity and Clustering

Kushagra Chatterjee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20897v1 Announce Type: new Abstract: Graph algorithms are central to large-scale applications such as navigation systems, social networks, and data analysis platforms. This thesis studies two important challenges in such systems: robustness to failures and fairness in clustering outcomes. In the first part, we investigate fault-tolerant reachability preservers in directed graphs. We present the first non-trivial constructions of dual fault-tolerant pairwise reachability preservers that remain resilient to two edge or vertex failures, achieving a sparse construction of size $O(n^{4/3}|\mathcal{P}|^{1/3})$. In the second part, we study fair clustering algorithms that ensure balanced representation of protected groups. We develop approximation algorithms for fair consensus clustering and introduce the framework of closest fair clustering, establishing hardness results and efficient algorithms for multi-group settings. Building on this framework, we obtain improved guarantees for fair correlation clustering and design the first streaming algorithm for fair consensus clustering using only logarithmic memory. Together, these results contribute toward the design of graph algorithms that are both robust and socially responsible.

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20901v1 Announce Type: new Abstract: We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20904v1 Announce Type: new Abstract: We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.

ParaCell: Paravirtualized Secure Containers with Lightweight Intra-Container Isolation and Intent-Driven Memory Management

Yiyang Wu, Xunjie Wang, Jinyu Gu, Haibo Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20906v1 Announce Type: new Abstract: Secure containers isolate each container with its own kernel, mitigating shared-kernel attacks prevalent in traditional container systems. However, existing designs still face a fundamental isolation--performance trade-off. Nested-cloud deployments amplify the cost of VM exits and page-table management, while emerging agentic workloads expose bursty memory demand that requires fine-grained elasticity. We attribute this trade-off to two root causes. First, existing designs lack lightweight intra-container isolation primitives for frequent container user--kernel transitions. Second, the host treats container memory management as opaque, forcing reactive secondary faults and coarse-grained huge page mappings to amortize their cost. This paper presents ParaCell, a paravirtualized secure container runtime built on two insights. First, intra-address-space hardware protection primitives can provide lightweight intra-container isolation. ParaCell uses MPK-based XGates to isolate the container user and container kernel within a single address space, turning frequent user--kernel transitions into direct domain switches. Second, container kernel allocators already encode memory-management intent. ParaCell introduces Pager to interpose on allocation and free events, batch proactive GPA to HPA bindings and unbindings, and avoid reactive shadow page-table faults while preserving fine-grained memory elasticity. ParaCell is implemented as a drop-in replacement for RunV. Our experiments demonstrate that, across traditional cloud and emerging agent applications, ParaCell reduces latency by up to 57% and 79% over PVM, and by up to 33% and 88% over RunV, in bare-metal and nested setups, respectively. On agent workloads, ParaCell saves up to 35.6% memory compared with the state-of-the-art VM memory reclamation technique, HyperAlloc.

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20908v1 Announce Type: new Abstract: Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the \emph{Synergy Concept-Based Model (SynCB)} framework, that combines a CB branch with a complementary neural branch, and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and CB branches through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20910v1 Announce Type: new Abstract: Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via \emph{Tweedie matching} to enforce both \textbf{manifold constraint and temporal consistency} across overlap regions. \emph{Stochastic early-phase sampling} then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

Hoang Hai Nguyen, Kurt Driessens, Dennis J. N. J. Soemers — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20911v1 Announce Type: new Abstract: Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20912v1 Announce Type: new Abstract: The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20914v1 Announce Type: new Abstract: Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbf{RISE}, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

Divyaksh Shukla, Ashutosh Modi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20915v1 Announce Type: new Abstract: Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.

Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

Yaping Chai, Haoran Xie, Joe S. Qin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20916v1 Announce Type: new Abstract: Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.

SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20917v1 Announce Type: new Abstract: Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (Martian Lava Tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap, which relates to the unavailability of a large-scale simulation-based benchmarking infrastructure for rigorous statistical evaluation of robotic autonomy, due to which it is common for SubT research articles to present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case-studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is widely tested for identification of patterns and trends in the algorithm behavior and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Emma Leonhart — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20919v1 Announce Type: new Abstract: Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

Vinicius Ribeiro, Yves Laprie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20920v1 Announce Type: new Abstract: Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps exploring additional dimensions on speech articulation synthesis.

Distance between Road Networks: A Macroscopic Method for Road Network Datasets Comparison Using Traffic-weighted Geographic Distribution

Hengyi Zhong, Toru Seo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20921v1 Announce Type: new Abstract: In transportation network analysis, various types of road network data can be used even when focusing on the same region. Since different road network datasets can make different performance in analyses, it is necessary to compare them and make appropriate selections in a qualitative manner. However, many of the existing methods for comparing road network datasets are limited to specific topological evaluations and do not consider transportation. This study proposes a method for quantitative comparison of different road network datasets with explicit consideration for traffic flows on them. The method first conducts a static traffic assignment with hypothetical demand for each dataset, and then compare the results using Wasserstein distance on two dimensional plane. Case study on different sources of road network datasets and their simplifications suggests the potential use of the proposed method in evaluating and selecting road network datasets.

Winfree Oscillatory Neural Network

Jiawen Dai, Yue Song — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20922v1 Announce Type: new Abstract: Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.

Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows

Benedikt Bollig — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20923v1 Announce Type: new Abstract: Distributed LLM agent workflows should not be monitored as if they produced a single sequential log. In an asynchronous execution, a decision can only depend on events that are causally visible to the lifeline that makes it: an event that appears earlier in some log may still be unknown locally. We extend the ZipperGen agent-workflow framework with Causal Past Logic (CPL), a small past-time temporal logic for guards in conditionals and while loops. In addition to standard past-time modalities such as previous and since, a guard can inspect the latest causally visible event of another lifeline and selected variables stored there. The formula is a source-level guard: it is evaluated online by the owner lifeline and can influence control flow at runtime. We give a vector-clock monitor with latest-value views and prove that the locally computed monitor value coincides with the denotational semantics of the guard at the current event. Thus runtime verification becomes part of the coordination language itself, rather than a post-hoc check over an execution log.

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20924v1 Announce Type: new Abstract: Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.

MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts

Zhen Tao, Jinxiang Zhao, Peng Liu, Dinghao Xi, Yanfang Chen, Wei Xu, Zhiyu Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20926v1 Announce Type: new Abstract: Long-term memory systems enable conversational agents based on large language models (LLMs) to retain, retrieve, and apply user-specific information across multi-session interactions. However, existing evaluations mainly assess outcome-level performance or temporal updating, providing limited insight into how systems retrieve and rank temporally valid, factually correct, and contextually applicable memory evidence under conflicting alternatives. To address this gap, we propose MemConflict, a diagnostic framework that treats memory validity as a query-conditioned fitness-for-use problem. MemConflict formalizes dynamic, static, and conditional conflicts over temporal validity, factual correctness, and contextual applicability. It simulates controlled long-horizon histories from structured user profiles, introduces cross-session conflicts, and injects semantically similar distractors to create competition among memory candidates. The resulting multi-session dialogue benchmark supports black-box evaluation of final answers and white-box analysis of supporting-memory retrieval and ranking. Experiments on six representative long-term memory systems show uneven strengths across conflict types, with answer correctness often diverging from memory retrieval and ranking. Sensitivity analyses reveal that longer histories, distractors, implicit queries, and larger conflict distances degrade performance. Diagnostics show failures from missing supporting memories and ineffective use of retrieved memories. Collectively, MemConflict advances principled long-term memory governance through retrieval-aware, conflict-aware reliability assessment.

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20929v1 Announce Type: new Abstract: We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.

WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation

Shintaro Inoue, Kento Kawaharazuka, Temma Suzuki, Sota Yuzaki, Kei Okada — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20932v1 Announce Type: new Abstract: Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have been developed. These robots have been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the roles of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurpose its legs as arms to perform object manipulation, (e.g., rescuing a dog (stuffed animal)), and tool utilization (e.g., harvesting an apple (mockup) with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.

Conditioning and backward errors for nonlinear eigenvalue problems with eigenvector nonlinearities

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20933v1 Announce Type: new Abstract: We consider eigenvalue condition numbers and backward errors for a class of symmetric nonlinear eigenvalue problems with eigenvector nonlinearities. For both of these quantities, we derive explicit and computable expressions that can be evaluated with little computational effort for a given eigenpair, assuming the matrix perturbations are measured by the spectral or Frobenius norm. We also show how symmetric perturbations can be exploited in the analysis. By means of two numerical experiments we demonstrate that problems incorporating eigenvector nonlinearities potentially need to be treated with additional care, when compared to the linear or eigenvalue-nonlinear theory.

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20936v1 Announce Type: new Abstract: Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.

Toward 6G-enabled Brain Computer Interfaces: Technical Requirements, Use Cases, Challenges, and Future Trends

Houda Hafi, Bouziane Brik, Nuraini Jamil, Abdelkader Nasreddine Belkacem — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20939v1 Announce Type: new Abstract: Brain computer interface (BCI) enables the brain to directly control an external device by converting neural signals into actionable outputs. However, effective real-time translation of brain activity strongly depends on the quality of neural communication between the brain and the external device. 6G is the next generation of wireless communication, expected to provide unprecedented levels of data rates, data security, and automation capabilities. In this context, integrating 6G into BCI systems would not only enhance the performance of brain-device communication, but would also create new opportunities for innovative applications. This work provides a comprehensive study on how BCI technology can be built effectively on top of 6G wireless networks by introducing several technical aspects and use cases. We first provide an overview of BCI and 6G, following their progression from early development to convergence through cognitive communication and advanced neural interfaces. We then highlight the need for the upcoming 6G systems toward BCI technology in every aspect, including 6G technologies such as intelligent edge and zero-touch networks, and 6G use cases such as digital twin, immersive communication, and internet of minds. Furthermore, we identify key technical challenges, open issues, and future research directions related to the 6G-enabled BCI paradigm.

3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20940v1 Announce Type: new Abstract: Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm$^3$ of the non-distilled RT to 639.93 mm$^3$ and 644.62 mm$^3$, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.

PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

Yunge Wen, Yuancheng Shen, Paul Pu Liang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20941v1 Announce Type: new Abstract: We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

Lena Wild, Katie Z Luo, Marco Pavone — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20942v1 Announce Type: new Abstract: Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

Privacy-Preserving Distributed Optimization Under Time Constraints Using Secure Multi-Party Computation and Evolutionary Algorithms

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20944v1 Announce Type: new Abstract: In distributed optimization, multiple parties collaborate to find an optimal solution to a problem. Privacy-preserving distributed optimization uses techniques, such as secure multi-party computation (MPC), to protect the private inputs of each party. In time-critical settings, the runtime overhead introduced by privacy-preserving computations may prevent the optimization from finishing within the deadline. This paper presents an approach for privacy-preserving distributed optimization in time-critical settings that combines evolutionary algorithms for solution search and MPC for the evaluation of solutions. The approach reduces the impact of privacy-preserving computations on runtime and allows to return solution within the deadline. Obfuscation of evaluation results provides additional protection for private inputs from an honest-but-curious platform provider, but introduces a potential trade-off between protection and solution quality. This trade-off is investigated in experiments using a genetic algorithm for both the single-objective assignment problem and the traveling salesperson problem, as well as NSGA-II for the multi-objective assignment problem.

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20946v1 Announce Type: new Abstract: The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20948v1 Announce Type: new Abstract: Scaling conditional memory offers a promising way to increase language-model capacity, but existing methods such as Engram learn large memory tables from scratch during pre-training, making memory scaling expensive and sometimes ineffective. We propose Memory Grafting, a conditional memory scaling method that utilizes frozen hidden states from a grafting model as conditional n-gram memory. Given frequent local n-grams, we run the grafting model offline, store final-token hidden representations as memory values, and let the recipient model retrieve them through exact longest-match suffix lookup. Retrieved memories are adapted by lightweight projections and gates, while a hash-based Engram fallback preserves coverage for unmatched contexts. Since the grafting model is only run offline and exact lookup has expected O(1) complexity with respect to memory-bank size, Memory Grafting expands external latent capacity with limited training and inference overhead. Experiments under matched recipient architectures and pre-training budgets show that Memory Grafting improves over both MoE and vanilla Engram baselines. In the 2.8B-scale setting, it improves the average benchmark score from 51.95 for MoE and 52.43 for vanilla Engram to 53.86. In the 0.92B-scale setting, all grafting-model variants improve over the baselines, with Qwen3.5-35B-A3B giving the strongest gains. These results suggest that pretrained models can serve as reusable constructors of external latent memory, providing a practical step toward scaling future language models beyond trainable parameters alone.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

Yulin Zhao, Yun Wang, Dehua Zheng, Borui jiang, Zheng Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20950v1 Announce Type: new Abstract: Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the \textit{Focus-then-Context} mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.

Ark: Offchain Transaction Batching in Bitcoin

Pim Keer, Matteo Maffei, Marco Argentieri, Andrew Camilleri, Zeta Avarikioti — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20952v1 Announce Type: new Abstract: Bitcoin is the cryptocurrency with the largest market capitalisation, but its widespread adoption is fundamentally limited by the scalability constraints of its consensus algorithm, which requires every transaction to be confirmed onchain. To address this, several Layer-2 scalability solutions have been proposed to move payments offchain -- most notably, the Lightning Network. However, their deployment remains hindered by cumbersome setup requirements: users must lock funds onchain to participate and engage in complex auxiliary protocols (e.g., for channel rebalancing, top-ups, and routing). Other solutions, like payment pools, sidechains and rollups, cannot be implemented in a non-custodial way on Bitcoin due to its limited scripting capabilities, or require all protocol participants to update the offchain state. In this work, we present Ark, the first Bitcoin-compatible commit-chain. Ark enables offchain transactions of virtual UTXOs (VTXOs), through an untrusted operator who aggregates them into succinct onchain commitments. A distinctive feature of Ark is its ease of deployment: users can receive offchain payments without locking any funds beforehand and Ark state updates can be performed only requiring the users involved in that update. We formally define the Ark protocol and prove its security. During this process, we identified two attacks affecting the testnet implementation, which we responsibly disclosed and proposed fixes for, which have been now integrated into the mainnet implementation. Our experimental evaluation demonstrates that Ark can commit onchain to batches of arbitrarily many VTXOs with a constant-sized footprint of approximately 200 vB. Cooperative exits add one output per user, while unilateral exits require $\mathcal{O}(\log n)$ transactions of roughly 150 vB per VTXO for a batch of $n$ VTXOs.

DrawMotion: Generating 3D Human Motions by Freehand Drawing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20955v1 Announce Type: new Abstract: Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20956v1 Announce Type: new Abstract: Conformal triage converts predictive scores into deployment actions that either release a case, flag it for urgent attention, or defer it to human review. Under prevalence shift, however, the usual summaries of marginal coverage and human-review rate can miss the safety-critical question of whether patients who truly experience the target event are released without review. To address this gap, we introduce a leakage-aware deployment audit for release-side conformal triage. It first assigns target subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation. This separation then lets the audit evaluate release directly: how many event-positive patients are cleared without review, whether the pilot has enough event labels for calibration, and how the safety-review trade-off shifts. Applying this audit to a retrospective NSCLC pilot shows why lower review can be misleading: after prevalence correction, the pooled conformal branch lowers review by releasing more patients, some of whom are event-positive. Within the audit, the classwise branch acts as a scarcity diagnostic: the pilot has too few event labels to certify safe low-review release.

JobArabi: An Arabic Corpus and Analysis of Job Announcements from Social Media

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20960v1 Announce Type: new Abstract: This paper introduces JobArabi, a large-scale corpus of Arabic job announcements collected from social media between January 2024 and October 2025. The dataset contains 20,528 public posts from X and captures more than two years of employment-related discourse across Arabic-speaking online communities. The corpus was compiled using a linguistically informed query framework covering 21 Arabic keyword families that reflect gendered, plural, formal, and dialectal expressions of recruitment language. The resulting dataset includes posts from institutional, commercial, and individual accounts and provides metadata such as timestamps, engagement indicators, and geolocation when available, enabling temporal and regional analysis of employment discourse. Quantitative analysis reveals several sociolinguistic patterns in online recruitment, including the persistence of gendered hiring language, regional variation in occupational demand, and the emotional framing of recruitment messages. These findings highlight the potential of Arabic social media as a resource for studying labor market communication and linguistic change. The JobArabi corpus, together with documentation and collection scripts, will be released to support research in Arabic NLP, computational social science, and digital labor studies.

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20961v1 Announce Type: new Abstract: Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20963v1 Announce Type: new Abstract: The proliferation of unmanned aerial vehicles (UAVs) has created urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows MFDNet achieves +6.2% AP50 improvement over best RGB-only methods, demonstrating spectral cues provide complementary material evidence beyond spatial cues. This work provides foundational dataset, strong baseline, and benchmark for multispectral UAV monitoring research.

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20965v1 Announce Type: new Abstract: Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization

Wajdi Zaghouani, Kais Attia, Md. Rafiul Biswas, Fadhl Eryani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20967v1 Announce Type: new Abstract: Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offers a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

On the Complexity of Hop Domination and 2-Step Domination in Graph Classes

Sandip Das, Sweta Das, Sk Samim Islam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20970v1 Announce Type: new Abstract: The domination problem is a well-studied problem in graph theory. In this paper, we study two natural variants: the hop domination problem and the $2$-step domination problem. Let $G$ be a graph with vertex set $V$ and edge set $E$. For a graph $G$, a subset $S \subseteq V(G)$ is called an \emph{hop dominating set} if every vertex not in $S$ lies at distance of exactly $2$ from at least one vertex in $S$. For $v\in V(G)$, let $N(v,2)$ denote the set of vertices in $V(G)$ that are at distance exactly $2$ from $v$. For a graph $G$, a subset $S \subseteq V(G)$ is called an \emph{$2$-step dominating set} if every vertex $v\in V(G)$ lies at a distance of exactly $2$ from at least one vertex in $S$. The \textsc{Hop Domination} (HD) problem and the \textsc{$2$-Step Domination} ($2$SD) problems ask whether a graph contains a hop domination set or a $2$-step domination set of size at most $k$, respectively. We study the computational complexity of these problems, and show that both are NP-complete, even when restricted to $d$-regular graphs for every $d\geq 3$, claw-free graphs and also unit disk graphs.

Comparative Evaluation of Deep Learning Models for Fake Image Detection

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20971v1 Announce Type: new Abstract: The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

Dibyayan Patra, Simit Raval, Pasindu Ranasinghe, Bikram Banerjee, Ismet Canbulat — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20973v1 Announce Type: new Abstract: The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.

Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning

Adda Akram Bendoukha, Heber Hwang Arcolezi, Nesrine Kaaniche, Aymen Boudguiga — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20975v1 Announce Type: new Abstract: Federated Learning enables collaborative model training across decentralized data sources without data transfer. Averaging-based FL is limited by the presence of non-IID data, which negatively impacts convergence speed and final model accuracy. Conventional alternatives suffer from significant inefficiency. Clients with noisy or highly heterogeneous data contribute expensive gradient computations that are either discarded or heavily down-weighted before aggregation. These reactive approaches waste computational resources, require more communication rounds and result in unnecessary privacy exposure. In this paper, we propose a proactive client selection framework that aims to find an optimal federation of clients whose combined data match utility and fairness requirements before training begins. Our method relies on mutual information computed from differentially private contingency tables to quantify the relevance of cross-feature correlations in the union dataset. We introduce a Potential Federation Loss (PFL) over the set of fixed-size federations, which balances two objectives. Maximizing collective data utility while ensuring fair cross-features correlations to prevent group unfairness. Client selection is expressed as an optimal subset search problem over the PFL objective, which we solve using simulated annealing under strong differential privacy guarantees for clients' local statistics. Experimental results on four benchmarks show faster, fairer, and more accurate models trained on optimally found federations, compared to uniform sampling, even when state-of-the-art adaptive aggregation or sampling strategies are employed.

Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20978v1 Announce Type: new Abstract: Graph Network Simulators (GNSs) have emerged as powerful surrogates for complex physics-based simulation, offering inherent differentiability and orders-of-magnitude speedups over traditional solvers. However, GNSs typically assume access to the underlying material parameters, such as stiffness or viscosity, severely limiting their utility in realistic experimental settings. While recent meta-learning approaches address the parameter dependency by inferring properties from mesh trajectories, reconstructing a mesh from an observed scene is challenging. In this work, we introduce Point Cloud Encoding for Accurate Context Handling (PEACH), a novel framework that applies in-context learning on point clouds to adapt a learned simulator to unseen physical properties during inference. Our approach relies on a novel spatio-temporal point cloud sequence encoder, as well as two forms of auxiliary supervision to help improve simulation fidelity. We demonstrate that PEACH is capable of accurate zero-shot sim-to-real transfer on a challenging, dynamic scene. Experiments on simulation scenes show that PEACH even outperforms mesh-based baselines on prediction accuracy, while being much more practical for real-world deployment.

An IoT-Enabled Smart Home Automation System for Energy Efficiency with Web-Based Control

Amaan Ahmed, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20981v1 Announce Type: new Abstract: This paper illustrates the design and implementation of a smart home automation system for the conservation of energy and user control with the help of environmental sensors and Raspberry Pi 5. It monitors real-time conditions like motion, temperature, humidity, light and smoke to automatically control the device's behavior and save energy. A prototype single two-room was developed which uses GPIO/I2C interfaces to integrate sensors and actuators. The fan speed and LED brightness was dynamically controlled using PWM. Manual control and real-time monitoring are made possible through a web dashboard that was developed using Flask and graphical displays, and CSV logs of the energy are taken every 30 seconds. It was designed in an iterative model of sprints and the energy savings during testing was more than 46% over an always-on model. The results prove that with the help of these low-cost, modular devices it is possible to improve sustainability and usability in the home as part of the IoT.

Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

Bole Ma, Jan Eitzinger, Harald Koestler, Gerhard Wellein — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20982v1 Announce Type: new Abstract: AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.

Domijn: The Security of Domain Registrars and the Risk of a Domain Name Takeover

Koen van Hove, Jeroen van der Ham-de Vos, Roland van Rijswijk-Deij — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20984v1 Announce Type: new Abstract: Domain names are key assets for organisation. They anchor an organisation's online presence and reputation, and serve as linking pin for web services and, e.g., email. Consequently, a malicious takeover of a domain can lead to significant damages. Organisations register domain names through so-called registrars, a type of business that plays a key role in the domain name industry. This implies that registrars play an important part in safeguarding against malicious takeovers of domains. In this paper we empirically study how registrars implement security controls to prevent against such takeovers. We focus on the top 10 most popular registrars for the .nl ccTLD. We present the results of this study in light of a model for the impact of domain takeovers, that analyses the possible consequence of a takeover. We contrast this against the impact of two other well-known threats: ransomware and DDoS attacks. We find that all registrars in our study implement relatively effective security measures, but that they fall short in more advanced security controls, such as the proper implementation of two-factor authentication. We also find that a domain takeover can have significant impact, potentially equalling that of a ransomware attack.

A Sharper Picture of Generalization in Transformers

Paul Lintilhac, Sair Shaikh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20988v1 Announce Type: new Abstract: We study transformers' generalization behavior on boolean domains from the perspective of the Fourier Spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger and Tewari, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.

Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport

Mehmet Yigit Balik, Harri L\"ahdesm\"aki — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20989v1 Announce Type: new Abstract: Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are sensitive to overfitting and lack careful considerations of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.

CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

Hao Xu, Yilin Liu, Yinqiao Wang, Chi-Wing Fu, Niloy J. Mitra — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20992v1 Announce Type: new Abstract: We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module to predict ray-depth corrections and rectify hand-object relative placement, then derive initial per-frame contact correspondences on the rectified geometry. Last, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

Towards Context-Invariant Safety Alignment for Large Language Models

Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20994v1 Announce Type: new Abstract: Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

Hojin Ko, Jeonggyu Huh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20996v1 Announce Type: new Abstract: Most value-based and actor--critic reinforcement learning methods rely on Bellman-style recursions, yet these recursions collapse under non-exponential discounting common in human preferences and survival processes. We show the breakdown is structural: exponential discounting sits at a fragile intersection of multiplicativity and time homogeneity, and violating either property breaks standard dynamic programming. To overcome this, we propose Pontryagin-Guided Direct Policy Optimization (PG-DPO), a variational framework that abandons recursion and couples the Pontryagin Maximum Principle with Monte Carlo rollouts via an Adjoint-MC projection enforcing pointwise Hamiltonian maximization. Across multi-dimensional hyperbolic and survival-discount benchmarks, PG-DPO improves accuracy and stability where equation-driven solvers and critic-based baselines diverge.

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20997v1 Announce Type: new Abstract: Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, a ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, that constrains the learning process through a PM. While the features used for training and inversion where selected to ensure the physical consistency of the solutions, they could not resolve all height / structure and baseline / terrain slope ambiguities in the data. To improve this, the extension of the feature space with optical Landsat data is proposed able to provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Gabonese Lop\'e national park site and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20998v1 Announce Type: new Abstract: Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

Convergence Analysis of Evolution Strategies for Mixed-Integer Optimization

Ryoki Hamano, Kento Uchida, Shinichi Shirakawa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21000v1 Announce Type: new Abstract: Mixed-integer extensions of evolution strategies (ES) that discretize selected coordinates of sampled continuous vectors often impose a lower bound on the standard deviation of integer variables to prevent premature convergence. While these methods show promising empirical results, this handling can slow the convergence of continuous variables, and its impact has lacked a clear theoretical account. In this paper, we provide a convergence analysis of evolution strategies for mixed-integer optimization, inspired by the drift analysis of the (1+1)-ES in the continuous domain. Specifically, we consider two (1+1)-ES variants for mixed-integer domains: (1+1)-LB-ES, which introduces a lower bound on the standard deviation for integer variables, and (1+1)-LUB-ES, which combines both lower and upper bounds to enhance the convergence of the continuous variables. Focusing on the optimization phase after the integer variables have been optimized, we rigorously analyze their convergence behavior on a benchmark function designed for mixed-integer domains. Our results show that (1+1)-LB-ES can suffer from premature convergence when the number of integer variables is large, while (1+1)-LUB-ES achieves linear convergence under suitable parameter settings. These findings provide theoretical insights into the impact of integer handling on convergence performance and guidance for the design of mixed-integer ES.

DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

Daniel Eskandar, Berna Kabadayi, Garvita Tiwari, Gerard Pons-Moll — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21001v1 Announce Type: new Abstract: Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/

Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts

Gustav Olaf Yunus Laitinen-Fredriksson Lundstr\"om-Imanov, Nurana Abdullayeva — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21002v1 Announce Type: new Abstract: Generative artificial intelligence now synthesizes photorealistic imagery, audio, and video at a cost that defeats traditional forensic intuition. The legal consequences span three regimes studied so far in isolation: international operational law, domestic procedure, and product regulation. This article presents a unified evidentiary framework that maps cryptographic content provenance, robust statistical watermarking, and zero knowledge attestation to the proof requirements of each regime. We define a five tier threat model spanning naive regeneration, adversarial laundering, cross model regeneration, active watermark removal, and insider provenance forgery. We release a public benchmark of 12000 generated items across image, audio, and video modalities under six laundering pipelines for 72000 evaluation samples. We evaluate four representative schemes and report true positive rate at fixed false positive rate, robustness area under the curve, computational overhead, and a regime conditioned legal sufficiency score. We translate empirical detection bounds into legal sufficiency thresholds for command decisions under the law of armed conflict, for criminal and civil admissibility under domestic procedure, and for persistence audits under the European Union Artificial Intelligence Act and analogous regimes. The result is a reproducible reference pipeline, a public benchmark, and model annexes that lawyers, engineers, and operators can deploy together.

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21006v1 Announce Type: new Abstract: We study the effect of different persona on \textbf{sycophancy}: model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny reduces sycophancy to approximately $68\%$ and $98\%$ of CAA's effect, and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21007v1 Announce Type: new Abstract: Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose \textbf{LiteViLNet}, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36\% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.

Rectangular Multispectral Perturbation Theory

Christof Vermeersch, Sarthak De, Bart De Moor — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21013v1 Announce Type: new Abstract: We provide a first systematic treatment of so-called rectangular multispectral perturbation theory. With their paper from 2003, Hochstenbach and Plestenjak ["Backward Error, Condition Numbers, and Pseudospectra for the Multiparameter Eigenvalue Problem" in Linear Algebra and its Applications] extended perturbation theory from one-parameter eigenvalue problems to multiple spectral parameters. After two decades, we take it one step further and consider a different manifestation of the multiparameter eigenvalue problem that consists of one matrix equation with rectangular coefficient matrices. We perform a norm-wise backward error analysis, define condition numbers for both eigenvalues and eigenvectors, and introduce the pseudospectrum while also considering the computational implications of working with multiple spectral parameters. The rectangular shape hampers a direct application of the existing definitions and properties. For example, the left null space at a given eigenvalue is non-trivial and the dimensions of the left and right eigenvectors are different. Through numerical examples, we illustrate and link the different concepts from the perturbation theory. A system identification application seem to suggest that, in optimization-driven problems for which multiparameter reformulations exist, the globally optimal solutions tend to coincide with the best-conditioned eigenvalues.

Partially Observable Restless Bandits for Age-Optimal Scheduling over Markov Channels

Xijun Wang, Shuying Gan, Yanzhi Huang, Xiaoyu Zhao, Chao Xu, Xiang Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21016v1 Announce Type: new Abstract: There is a surge of need for fresh information with the overwhelming proliferation of the Internet of Things (IoT) applications. To characterize the information freshness perceived by the destination, the age of information (AoI) has been proposed. In this paper, we consider an IoT system with multiple devices sending status update packets to a central controller through time-correlated Markov channels and assume that the instantaneous channel states are not available to the central controller before making scheduling decisions. To ensure information freshness, we investigate a timely scheduling problem that minimizes the total expected time-average AoI under a strict communications bandwidth constraint. We formulate this problem as a partially observable restless multi-armed bandit problem. Using Lagrangian relaxation, we decouple the relaxed problem into multiple sub-problems and prove the threshold structure of their optimal policies. Armed with this property, we establish the indexability for the decoupled problem and design an algorithm to compute the Whittle's index. To reduce implementation complexity, we further derive the Whittle-like index in closed-form for low-complexity scheduling. Simulation results show that the proposed index-based policies outperform the baselines, remain close to the optimal policy or relaxed lower bound, and are especially effective when scheduling resources are limited or the network size is large.

The Knowledge Gap in a High-Choice Media Environment: Experimental Evidence from Online Search

Roberto Ulloa, Tiedemann Leonard, Peter Selb, Celina Kacperski — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21019v1 Announce Type: new Abstract: Persistent inequalities in political knowledge are a central concern in political communication. We organize the mechanisms underlying the knowledge-gap literature by distinguishing between individual preconditions, structural features of the information environment, and topic characteristics. Within this framework, we note that self-directed information seeking, a prototypical form of intentional exposure, has received little attention despite its importance in navigating today's complex information environment. We conducted a field experiment in Germany combining randomized encouragements and passive browser tracking to examine how individuals with varying education levels acquire policy-specific knowledge through online search. Participants were randomly assigned to one of three conditions (verbal encouragement, financial encouragement, or control) to seek information on three salient policy topics differing in divisiveness and complexity (child support, energy transition, and cannabis legalization). We estimate both intention-to-treat (ITT) and local average treatment effects (LATE) of information seeking on post-search knowledge outcomes, with a focus on education and civic knowledge as moderators. While the interventions equalized information-seeking behavior, the results provide some support for the knowledge gap hypothesis: knowledge gains were concentrated among participants with higher education or baseline civic knowledge, who, according to our post-hoc exploratory analyses, appeared more effective at navigating search results. These findings indicate that a narrowing of knowledge inequalities goes beyond motivation: it calls for both individual-level interventions to strengthen citizens' skills and structural-level adaptations to foster more equitable learning environments.

Component Influence-Driven Fastener Reduction for Robotic Disassemblability-Aware Design Simplification

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21026v1 Announce Type: new Abstract: To accelerate automated remanufacturing, robotic disassembly must be considered during the product design phase. However, designers currently lack quantitative feedback to identify which structural elements hinder robotic operations. To address this, this study proposes an analytical framework that provides actionable redesign guidance focused on fastener reduction, as fasteners are numerous and ubiquitous components found in almost all manufactured products. Using a Computer-Aided Design (CAD) model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores. These scores reflect how often a component causes structural constraint violations or evaluation objective deteriorations in the robotic disassembly sequence. To visually highlight structural hindrances, the framework projects these scores onto the CAD geometry as 3D heatmaps. The system then analytically simulates the removal of highly influential fasteners. It reports the expected reductions in structural constraints, tool changes, and robot travel distances, while preventing structurally unsafe modifications by evaluating geometric stability metrics. Experiments on seven household appliances demonstrate that the framework successfully targets redundant fasteners. Removing the recommended fasteners simplified the structural dependencies by eliminating between 8 and 132 structural constraints on the graph depending on each product's structural configuration. Furthermore, it improved robotic operational efficiency by eliminating unnecessary tool change operations and shortening travel distances by 165 to 1675 millimeters wherever structurally permissible.

Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21027v1 Announce Type: new Abstract: Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21028v1 Announce Type: new Abstract: Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves dynamic degree over strong baselines while also achieving higher temporal quality. The code and model weights will be released at https://github.com/yebo0216best/DySink.

Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

Stephen Meisenbacher, Peter Norlander — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21029v1 Announce Type: new Abstract: Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

Modeling and Control of a Pneumatic Morphing Soft Quadrotor based on the SOFA Framework for Dynamic Soft Robotic Simulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21031v1 Announce Type: new Abstract: This article presents a novel SOFA based finite element method for the soft body modeling and the corresponding dynamic simulation and control of a pneumatic morphing soft quadrotor. The proposed modeling preserves the physical interpretability and control structure of traditional quadrotor dynamics, while capturing the complex, time-varying behavior of pneumatically actuated soft arms. In SOFA, the soft pneumatically actuated arms are discretized as a tetrahedral mesh following an elastic material law that produces internal forces adequate to the real dynamic behavior of the body. Pneumatic actuation governed by both periodic and error-based control signals is applied within the internal cavities to analyze the morphing capability. Finally, a proportional-integral controller is proposed to study the controlled dynamic behavior and morphing capabilities of the pneumatic arm, wherein the pneumatic actuation to the soft arm is controlled to achieve the desired target position. The simulation results show the effectiveness of the proposed novel modeling framework and the related controller design.

Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21032v1 Announce Type: new Abstract: High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments demonstrate that our method not only maintains stable NVS capabilities but also demonstrates superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.

Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification

Guangyi Zhang, Lutz Oettershagen, Lixu Wang, Aristides Gionis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21033v1 Announce Type: new Abstract: Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.

The Quiet Path from Seemingly Minor Design Errors to Workplace AI Incidents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21035v1 Announce Type: new Abstract: Recent human-computer interaction (HCI) research has revealed a widespread misalignment between how developers design workplace artificial intelligence (AI) systems, and what workers actually need from them. Yet, little research has examined the effects of this gap, or how it may cause harm. We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors. Using an Large Language Model (LLM)-as-an-expert approach, we extracted the main traits of the AI systems involved in those incidents using an established framework of twelve traits. We then compared them with the traits that 202 workers highly familiar with those tasks would have preferred. We found that as many as 83\% of workplace incidents stem from worker-AI misalignments. In most cases, workers wanted systems that are precise, insightful, or personal, but instead received systems that are basic, simple, or general. Over the years, fast AI caused a considerable number of incidents, yet these declined, and imaginative AI, with the mass introduction of generative AI, started to cause incidents. We also compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred. If the traits causing the incidents were the same as those designed by developers, then developers may be responsible for those incidents. We found that 74\% of task misalignments could be attributed to developers who tended to overfocus on efficiency and speed, especially for systems performing tasks in people-facing occupations such as those in the human resources sector. Our results call for design interventions that better align AI development with workers' needs, as without such corrections, workplace AI incidents are likely to persist, causing the invisible erosion of worker agency and organizational productivity.

Dynamic Video Generation: Shaping Video Generation Across Time and Space

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21042v1 Announce Type: new Abstract: Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to 7 times speedup on HunyuanVideo and HunyuanVideo-1.5, and 18 times when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in supplementary material and will be released on Github.

Stochastic Galerkin and Monte-Carlo methods for parabolic problems: Numerical performance of variational matrix-free approximations

Moataz Dawor, Nils Margenberg, Markus Bause — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21046v1 Announce Type: new Abstract: Stochastic Galerkin methods offer unexplored potential for the numerical simulation of parabolic problems with random variables, in particular if they are combined with variational discretizations of the space and time variables. Due to the high dimensionality, the solution of the arising algebraic systems do not become feasible without efficient solvers, preconditioners, and software architectures. A stochastic Galerkin discretization with an embedded slabwise finite element approximation of the space and time variables is proposed and analyzed numerically. For solving the linear systems, GMRES iterations are block-preconditioned by a geometric multigrid (GMG) technique using a local Vanka smoother for the space-time subsystems. Monte-Carlo methods are also used for solving random parabolic problems and studied here for the purpose of comparison. The Monte-Carlo approach is built on the space-time finite element formulation together with the GMRES-GMG solver technology. All algorithms have been implemented in a unified matrix-free framework based on the deal.II software library. Comparative numerical evaluations illustrate the performance properties of both approaches, including convergence of the discretizations and statistics of the algebraic solver. Superiority of the stochastic Galerkin approach is observed.

Cross-lingual robustness of LLM-brain alignment and its computational roots

Ni Yang, Rui He, Philipp Homan, Iris Sommer, Davide Staub, Wolfram Hinzen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21049v1 Announce Type: new Abstract: Large language models (LLMs) reliably predict neural activity during language comprehension and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, overlaps spatially across languages, and what the computational roots of such alignment are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages: Mandarin, English, and French during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks like limbic, ventral attention, default mode network, and subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

Perception of Social Robots as Communication Partners in Healthcare for Older Adults

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21053v1 Announce Type: new Abstract: Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults during human-robot interaction (HRI). This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+). Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.

DeTox-Fed: Detecting Toxic Conversations in the Fediverse with Federated Graph Neural Networks

Pantelitsa Leonidou, Nikos Salamanos, Sotiris Gypsiotis, Michael Sirivianos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21054v1 Announce Type: new Abstract: The rise of decentralized social networks (DSNs), and in particular the rapid uptake of the Fediverse (e.g., Pleroma, Mastodon, Lemygrad), introduces new challenges in content moderation. Independent instances host their own data, follow different moderation policies, and often observe only partial views of conversations. We present DeTox-Fed, a federated graph-learning framework for detecting toxic conversations in DSNs without requiring instances to share raw conversations or moderation labels. Each instance constructs a local conversation graph, where nodes represent conversation trees and edges capture shared user participation across conversations. A Graph Neural Network (GNN) is then trained in a federated learning setup, allowing instances to collaboratively learn a toxicity classifier while preserving data locality. Unlike text-only moderation approaches, DeTox-Fed combines conversational structure, user-interaction patterns, conversation-level statistics, and aggregate sentiment signals. We evaluate the framework on a large Pleroma conversation dataset and show that it achieves stable toxic conversation detection under limited local labels, partial client participation, and varying moderation thresholds. Our results indicate that federated graph-based moderation is a promising direction for semi-automated moderation in decentralized social networks.

Genetic Programming with Transformer-Based Mutation for Approximate Circuit Design

Ondrej Galeta, Lukas Sekanina — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21055v1 Announce Type: new Abstract: A recent trend is to leverage machine learning models to improve the evolutionary design and optimization process. We propose a novel transformer-based mutation operator for Cartesian genetic programming (CGP) for the automated design of approximate arithmetic circuits. We introduce a hybrid scheme for CGP in which the proposed mutation operator is switched with the standard mutation operator to prevent stagnation of the circuit approximation process. We also develop a new training scheme for the underlying transformer that utilizes training vectors composed of thousands of CGP chromosomes representing various approximate multipliers. For several target error constraints, the approximate multipliers evolved with CGP utilizing the transformer-based mutation achieve better trade-offs than the highly optimized designs available in the state-of-the-art EvoApproxLib library of approximate circuits. Although both training and evolutionary processes are computationally demanding, they appear to be necessary steps for improving existing approximate circuits and producing new, potentially patentable circuit designs.

On Unified and Sharpened CMI Bounds for Generalization Errors

Yang Lu, Matthias Frey, Margreta Kuijper, Jingge Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21056v1 Announce Type: new Abstract: We present a new family of information-theoretic generalization bounds within the framework of conditional mutual information (CMI). Most of our results are established based on the leave-$m$-out (L$m$O) cross-validation error, with $m$ denoting the number of the hold-out supersamples. Under this setting, we propose a unified CMI-based bound, allowing to envelop and reproduce many known CMI-based bounds and also bridge the gap between the MI- and CMI-based bounds when $m$ tends to infinity. The proposed framework not only provides a unified description of the existing bounds but also develops new, sharper bounds. We show the benefits of the proposed bounds through several simple examples, where the existing results are either inapplicable or looser. Moreover, under the premise that the loss function is bounded, we tighten the CMI quantities involved in the proposed bounds by reducing the number of conditional terms, thereby enhancing the proposed framework. We show empirically that the resulting new bounds improve upon the previously known ones.

SG-LegalCite: A Principle-Augmented Benchmark for Legal Citation Retrieval in Singapore Law

Shannon Lee Yueh Ern, Kaidong Feng, Yingpeng Du, Chloe Lee En Jia, Zhu Sun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21057v1 Announce Type: new Abstract: Legal citation in common-law systems depends not only on factual similarity, but also on the legal principle for which a precedent is invoked. However, existing benchmarks for legal citation retrieval use case facts, citation context, or full judgments as inputs, where the governing legal principle is often missing or only implicitly expressed and entangled with broader context. As a result, models may retrieve precedents that are factually similar yet doctrinally irrelevant. This limitation is particularly consequential in Singapore, where the legal system has evolved independently: only domestic precedents are binding, while foreign authorities serve merely as persuasive references. Thus, we propose a new retrieval paradigm that ranks cited cases based on queries integrating case facts and explicit legal principles, inspired by real-world legal reasoning workflows. To support this paradigm, we introduce SG-LegalCite, a dataset of 100,890 case-principle pairs extracted from 8,523 Singapore Supreme Court judgments spanning from 2000 to 2025. Experiments across 11 baselines demonstrate the effectiveness of our principle-augmented retrieval paradigm, showing that explicit legal principles provide strong discriminative signals for legal citation retrieval.

A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21058v1 Announce Type: new Abstract: Causal representation learning (CRL) and traditional representation learning have largely developed along different trajectories. Traditional representation learning has been driven mainly by applications and empirical objectives, whereas CRL has focused more on theoretical questions, particularly identifiability. This difference in emphasis has created a gap between the two fields in terminology, problem formulation, and evaluation, limiting communication and sometimes leading to disconnected or redundant efforts. In this paper, we argue that these two fields should be brought into dialogue rather than treated as separate paradigms. To this end, we introduce a unified formulation in which the representation learning is characterized by two components: a task component, which specifies what information the learned representation is required to preserve, and a constraint component, which specifies what structure is imposed on the latent space. Under this formulation, the benefits run in both directions. CRL provides theoretical tools for understanding when structured latent constraints are useful or necessary, while traditional representation learning offers practical insights on task design and objective choice that can improve the development of CRL methods. To illustrate this interaction, we experimentally study how different task components affect the behavior of CRL methods under different structured constraints. Results on CausalVerse show that the effectiveness of causal constraints depends strongly on the tasks with which they are paired.

Multimodal LLMs under Pairwise Modalities

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21059v1 Announce Type: new Abstract: Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable with only observing pairwise modalities. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. Specifically, in the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pair-wise contrastive learning. We also incorporate an inductive bias in the contrastive learning process by partially aligning and minimal latent specification. In stage two, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21060v1 Announce Type: new Abstract: Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.

Grounding Driving VLA via Inverse Kinematics

Junsung Park, Hyunjung Shim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21061v1 Announce Type: new Abstract: Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

Philipp Spohn, Leander Girrbach, Zeynep Akata — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21063v1 Announce Type: new Abstract: Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.

Robust Personalized Recommendation under Hidden Confounding in MNAR

Zongyu Li, Wanting Su, Tianyu Xia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21066v1 Announce Type: new Abstract: Recommender systems often rely on observational user--item interaction data, which is prone to selection bias due to users' selective interactions with items. Inverse propensity weighting and doubly robust estimators effectively mitigate selection bias under observed confounding, but are unreliable in the presence of hidden confounders. Existing approaches relying on randomized controlled trials (RCTs) or global sensitivity bounds are constrained in practice: RCTs demand costly experimental data, while global sensitivity bounds presume a uniformly bounded effect of unmeasured confounders on propensities through sensitivity analysis, thereby neglecting heterogeneity across user--item interactions. To overcome this limitation, we propose a novel framework, which estimates user--item level sensitivity bounds, thereby substantially relaxing the homogeneity assumption inherent in global sensitivity bounds named Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID). To ensure both robustness and predictive accuracy, we further develop an adversarial optimization strategy and propose a benchmark-guided variant (BPUID) that incorporates pre-trained models as stabilizing references. Extensive experiments on three real-world datasets demonstrate that our approach significantly outperforms global methods under hidden confounding, without requiring RCT data.

Towards Understanding Self-Pretraining for Sequence Classification

Omar Coser, Loredana Zollo, Paolo Soda, Antonio Orvieto — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21070v1 Announce Type: new Abstract: Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.

Fine-grained Claim-level RAG Benchmark for Law

Souvick Das, Sallam Abualhaija, Domenico Bianculli — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21071v1 Announce Type: new Abstract: The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21072v1 Announce Type: new Abstract: Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.

SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21075v1 Announce Type: new Abstract: Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

GradeLegal: Automated Grading for German Legal Cases

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21076v1 Announce Type: new Abstract: Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.

VDFP: Video Deflickering with Flicker-banding Priors

Zhiyi Zhou, Libo Zhu, Zihan Zhou, Yulun Zhang, Xiaokang Yang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21079v1 Announce Type: new Abstract: Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We firstly construct DeViD, a real-world dataset in various scenes to deal with the lack of available datasets.Then we propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce a Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM) capable of synthesizing complex multi-banding scenarios. Second, we present a spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at~ https://github.com/ZhiyiZZhou/VDFP.

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

Shinnosuke Taksuka, Hideo Mukai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21081v1 Announce Type: new Abstract: This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events-pitch, bar number, onset, duration, and velocity in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21082v1 Announce Type: new Abstract: Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

Alexi Canesse, Beno\^it Goupil, Jesse Read, Sonia Vanier — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21085v1 Announce Type: new Abstract: Communication enables coordination in multi-agent reinforcement learning (MARL), but many real-world applications, e.g., search-and-rescue with drone swarms, operate under severe bandwidth constraints. Many communication architectures still expose a coupled bottleneck in which a shared latent representation is used for both policy execution and inter-agent communication. Consequently, reducing message size directly limits the policy's latent space, often leading to significant performance degradation. We address this with two contributions. First, we introduce $\beta$, a normalised per-agent bandwidth budget that unifies sparsity, rounds, and message dimension into a single comparable constraint. Second, we provide SLIM, a minimal architecture that decouples the communication pathway from the policy's latent representation, allowing us to isolate the effect of bandwidth from the effect of policy capacity while benefiting from in-step communication. We evaluate our method on several partially-observable MARL benchmarks, where communication is essential. Our approach achieves state-of-the-art performance and exhibits scalability and robustness under limited communication, with only marginal degradation as bandwidth is reduced.

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21086v1 Announce Type: new Abstract: While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

Reviving Error Correction in Modern Deep Time-Series Forecasting

Minh Hoang Nguyen, Dai Do, Huu Hiep Nguyen, Dung Nguyen, Kien Do, Hung Le — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21088v1 Announce Type: new Abstract: Modern deep-learning models have achieved remarkable success in time-series forecasting. Yet, their performance degrades in long-term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time-series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture-agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal-Trend Decomposition (UEC-STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time-series models. Code is available at https://github.com/DA2I2-SLM/UEC-STD.

An Evidence-driven Protocol for Trustworthy CI Pipelines

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21089v1 Announce Type: new Abstract: Enterprise software supply chains are increasingly vulnerable to infrastructure attacks, resulting in financial and reputational damage. Ensuring the integrity and provenance of software artifacts remains a significant challenge, where re-execution of the build and tests by every consumer to guarantee provenance produces a verification bottleneck and credibility reduction. This paper presents an evidence-driven protocol for trustworthy Continuous Integration (CI) pipelines that combines Deterministic Build Systems (DBS) with Trusted Execution Environments (TEEs). The approach provides cryptographically verifiable guarantees of integrity, authenticity, and attestation for CI artifacts in distributed environments, reducing implicit trust without requiring costly re-execution by consumers. We introduce a protocol that binds deterministic builds with TEE-based attestations, formalizing the evidence life cycle, together with a practical implementation using Nix and Intel TDX. Experimental results show that artifact verification is reduced from redundant computation to lightweight signature and policy checks. These findings demonstrate that evidence-driven CI pipelines establish scalable and verifiable trust in digital infrastructure, effectively amortizing the initial computational overhead introduced by TEEs.

TextSculptor: Training and Benchmarking Scene Text Editing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21090v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems

Donggyu Lee, Taekyung Lee, Jaewoong Choi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21094v1 Announce Type: new Abstract: We investigate unpaired image inverse problems, a challenging setting where only independent, non-paired sets of noisy measurements and clean target signals are available for training. We propose a novel inverse problem solver based on Unbalanced Optimal Transport, called Unbalanced Optimal Transport Map for Inverse Problems (UOTIP). Our method formulates the reconstruction task, predicting clean target signals from noisy measurements, as learning a UOT Map from noisy measurement distribution to clean signal distribution by incorporating a likelihood-based cost function. By relaxing the exact marginal constraint, the UOT framework provides key advantages to our model: robustness to multi-level observation noise, adaptability to class imbalance between noisy and clean datasets, and generalizability to diverse noise-type scenarios. Furthermore, we theoretically demonstrate that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map by satisfying the twist condition, even for ill-posed inverse problems. Our experiments demonstrate that UOTIP achieves state-of-the-art performance on unpaired image inverse problem benchmarks, across linear and nonlinear inverse problems.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

Matteo Pistillo, Samantha Faraone, Joshua Herman — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21095v1 Announce Type: new Abstract: Affordances and permissions are promising and timely safety levers for mitigating Loss of Control (LoC) threats in high-stakes deployment contexts, such as national security. Deployers in defense and intelligence could rely on several approaches to identify which affordances and permissions should be prioritized, such as structured threat modelling, pre-deployment agentic evaluations, post-deployment continuous monitoring, and AI safety cases. This paper proposes a complementary and empirical methodology that leverages existing use-case-specific benchmarks: backchaining LoC mitigations from the errors an AI system makes on national security benchmarks. The approach proceeds in three steps and allows national security deployers to start building LoC mitigations today, from evidence they can generate themselves. First, deployers evaluate AI systems on mission-specific benchmarks approximating real use-cases. Second, deployers concentrate on the incorrect responses that the AI system provides to the benchmark questions, and backchain the affordances and permissions that would enable the AI system to cause downstream harm if it pursued the actions described in the incorrect answers. Third, deployers intervene selectively on those affordances and permissions, bottlenecking the paths to harm while preserving the AI system's ability to carry out the correct action. We illustrate this methodology through a demonstrative benchmark question on derivative security classification.

WCXB: A Multi-Type Web Content Extraction Benchmark

Murrough Foley — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21097v1 Announce Type: new Abstract: Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21099v1 Announce Type: new Abstract: Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.

NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21100v1 Announce Type: new Abstract: Modern serving systems for Mixture-of-Experts (MoE) models adopt hybrid data-expert parallelism: expert parallelism (EP) shards experts across GPUs to scale capacity, while data parallelism (DP) replicates attention layers across instances to process independent requests. Existing systems bind each request's attention, MoE communication, and KV cache to a single instance. Because attention latency scales with KV cache size while MoE communication latency scales with batch size, this binding cannot balance both simultaneously, producing EP stragglers; it also fragments KV memory across instances, inflating tail latency under long contexts. While existing context parallelism (CP) mitigates these constraints, its uniform parallelism degree incurs prohibitive communication and attention-side overheads. We present \work, which decouples MoE communication from KV cache placement and achieves dual balance through dynamic context parallelism (DCP). DCP assigns each request a context-parallel degree sized to its KV footprint: long requests distribute attention across multiple instances; short requests remain local. This dynamic parallelism effectively liquefies the KV cache across the cluster, balancing both the per-instance KV cache occupancy and batch sizes without unnecessary load-balancing costs. To bridge DCP with static execution, \work introduces an ahead-of-time (AOT) graph engine paired with a custom routing-based communication backend. Experimental results show that \work maintains up to $1.88\times$--$3.27\times$ higher request rates under strict time-per-output-token (TPOT) service level objectives (SLOs). Furthermore, \work significantly mitigates stragglers, reducing P99 tail latency by up to $1.79\times$--$2.12\times$.

ACL-Verbatim: hallucination-free question answering for research

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21102v1 Announce Type: new Abstract: Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).

A Typed Tensor Language for Federated Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21103v1 Announce Type: new Abstract: Federated learning and analytics are often described as collections of separate protocols, even when they share the same mathematical form: client-local tensor computation, mergeable aggregation into shared state, and shared-only post-processing. We introduce a typed tensor language that formalizes this structure. The language distinguishes federated tensors, whose records are partitioned across clients along a tracked record axis, from shared tensors, which are available globally. Its semantics are defined by comparison with a virtual global tensor, used only as a reference object. The main result is a shared-state factorization theory. We show that typed one-round programs factor through fixed-dimensional shared state whose size is independent of the number of clients and records, computed from client-local tensor expressions and merged across clients. We also prove a converse representability result; factorizations whose encoders and decoders are expressible in the language are realized by typed one-round programs, and the correspondence extends to iterative programs whose cross-round state is shared. This gives a formal account of the computations in the language that can be expressed as encode, merge, and decode procedures. We then develop a differentiable fragment for learning. If a per-record loss and its per-record gradient are represented by client-local tensor expressions, the global gradient is represented by record-axis summation of the federated gradient tensor. This yields typed iterative programs for server-side gradient descent and shared-linear-algebra second-order updates. The framework characterizes a broad class of federated learning computations whose communication passes through fixed-dimensional shared state.

HORST: Composing Optimizer Geometries for Sparse Transformer Training

Tom Jacobs, Rohan Jain, Rebekka Burkholz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21104v1 Announce Type: new Abstract: Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias favoring stability, yet, sparsity requires an $L_1$ bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.

Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction

Dhruv Sarkar, Abhishek Sinha — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21107v1 Announce Type: new Abstract: We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ($\mathsf{CCV}$). For strongly convex losses, state-of-the-art algorithms achieve $O(\log T)$ regret and $O(\sqrt{T \log T})$ $\mathsf{CCV}.$ The corresponding best-known bounds for convex losses is $O(\sqrt{T})$ regret and $O(\sqrt{T} \log T)$ $\mathsf{CCV}$. In this paper, we give a simple projection-based algorithm that simultaneously achieves $O(\log T)$ regret and $O(\log T)$ $\mathsf{CCV}$ for strongly-convex losses, yielding an exponential improvement in the $\mathsf{CCV}$. For the convex losses, our algorithm improves the $\mathsf{CCV}$ to $O(\sqrt{T})$ while maintaining the optimal $O(\sqrt{T})$ regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.

Efficient Learning of Deep State Space Models via Importance Smoothing

John-Joseph Brady, Nikolas Nusken, Yunpeng Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21108v1 Announce Type: new Abstract: Latent state space systems are ubiquitous in statistical modelling, arising naturally when a time series is observed through a noisy measurement function, however training deep state space models (DSSM) at scale remains difficult. Two largely distinct strategies and literatures have developed around the training of DSSMs. Firstly, auto-encoding DSSMs train generative DSSMs by optimising a variational lower bound. Secondly, DSSMs trained by back-propagating the outputs of a classical sequential Monte Carlo algorithm (SMC). Such approaches can train DSSMs for discriminative as well as generative tasks, however, due to the sequentiality of their forward pass, scale poorly on modern hardware. We propose a new training method \emph{parallel variational Monte Carlo} (PVMC) that bridges the gap between the paradigms, and can be used robustly to train DSSMs for both discriminative and generative tasks. Our method achieves state-of-the-art or better results on a set of baseline experiments and trains $10\times$ faster than the fastest competing SMC approach.

Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21109v1 Announce Type: new Abstract: Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.

Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21111v1 Announce Type: new Abstract: Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new \acf{ehd} formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.

RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

Weiyi Xiong, Bing Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21112v1 Announce Type: new Abstract: 4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

On the Complexity of Entailment for Cumulative Propositional Dependence Logics

Kai Sauerwald, Juha Kontinen, Arne Meier — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21113v1 Announce Type: new Abstract: This paper establishes and proves complexity results for entailment for cumulative propositional dependence logic and for cumulative propositional logic with team semantics. As recently shown, cumulative logics are famously characterised by System~C and exactly captured by the cumulative models of Kraus, Lehmann and Magidor. This gives rise to the entailment problem via relational models, which is specifically considered here.

A Unified Framework for Uncertainty-Aware Explainable Artificial Intelligence: A Case Study in Power Quality Disturbance Classification

Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21114v1 Announce Type: new Abstract: Post-hoc explainable AI (XAI) methods typically produce deterministic attribution maps, whereas Bayesian neural networks (BNNs) induce a distribution over explanations. Capturing the variability of this distribution is important for uncertainty-aware decision-making. This paper formalises the \emph{explanation distribution} as the push-forward measure of the BNN posterior through any Lipschitz-continuous attribution operator. It further proposes the uncertainty-aware relevance attribution operator (UA-RAO), a general family of operators that summarises the explanation distribution using the mean, variance, coefficient of variation, quantiles, and set-theoretic aggregation measures. Theoretical support is provided through Monte Carlo accessibility and Wasserstein approximation bounds. The framework is evaluated on a 15-class power quality disturbance (PQD) classification benchmark, comparing three BNN approximations paired with three attribution operators using relevance mass accuracy and intersection-over-union as localisation metrics. Results show that deep ensembles with the mean UA-RAO improve localisation over the deterministic baseline, while other UA-RAO summaries reveal uncertainty patterns absent from point-estimate attributions. Qualitative results on measured signals further suggest that these patterns generalise beyond the synthetic training distribution. The framework is domain-agnostic and can be applied to any BNN paired with a Lipschitz-continuous attribution operator.

Automated Byzantine-Resilient Clustered Decentralized Federated Learning for Battery Intelligence in Connected EVs

Mouhamed Amine Bouchiha, Abdelaziz Amara Korba, Yacine Ghamri-Doudane — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21115v1 Announce Type: new Abstract: Federated learning (FL) has emerged as a promising paradigm for managing electric vehicle (EV) battery data in intelligent transportation systems (ITS), enabling privacy-preserving tasks such as anomaly detection and capacity estimation. However, most existing frameworks rely on centralized aggregation schemes, which pose critical limitations in terms of security and trust. To address these challenges, we propose ABC-DFL, an automated Byzantine-resilient clustered decentralized federated learning (C-DFL) framework for connected EVs. The proposed incentive-driven C-DFL system replaces the central server with an open-permissioned blockchain, featuring a new dynamic Quorum Byzantine Fault Tolerance (QBFT) protocol and an oracle-based aggregation layer, to enhance trust, security, and automation. At the core of ABC-DFL lies FLECA (Filtered Layered Enhanced Clustering Aggregation), a robust hierarchical aggregation protocol that mitigates Byzantine attacks by having each EV filter malicious updates using an adaptive threshold based on deviations from its reference model update. Oracle nodes, responsible for inter-group aggregation, employ robust clustering to isolate and aggregate model updates from trustworthy EV groups. Comprehensive experimental evaluations demonstrate that FLECA matches FedProx convergence under benign conditions and significantly outperforms existing defenses with attack impact scores below 0.10 in adaptive adversarial scenarios. Furthermore, several learning experiments with multitask models confirm the effectiveness and fairness of the incentive mechanism. Finally, on-chain and off-chain benchmarks validate the practicality of ABC-DFL.

Image Encryption via Data-Identified Discrete Chaotic Maps

Wenyuan Lia, Xiao-Yun Wang, Zhigang Zhu, Xiaofeng Zhang, Li Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21118v1 Announce Type: new Abstract: In this work, we propose a data-driven image encryption framework that identifies chaotic maps directly from data using the SINDy-PI algorithm. Unlike conventional encryption schemes relying on predefined maps, our method learns the full explicit dynamics -- including cross-terms and higher-order nonlinearities -- from observational data. The validity of this approach is verified on three distinct chaotic systems: the H{\'e}non map, the three-dimensional logistic map, and the piecewise-linear Lozi map, demonstrating its generality. The encryption key consists solely of initial conditions; the map structure itself becomes data-dependent, introducing an extra layer of security. Moreover, even when the initial conditions are fixed, different training data (e.g., with a tiny noise seed) lead to slightly different maps, which produce completely different ciphertexts (NPCR $\approx 99.6\%$, UACI $\approx 33.5\%$). Numerical experiments on the H{\'e}non system show near-ideal information entropy ($\approx 8$ bits), negligible inter-pixel correlation, and extreme sensitivity to initial conditions: a perturbation of $10^{-16}$ causes total decryption failure. The scheme resists both differential and statistical attacks, with NPCR and UACI values matching theoretical ideals. Our results establish a new paradigm for chaos-based cryptography beyond fixed maps.

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21121v1 Announce Type: new Abstract: Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

Kesong Li, Yixuan Xu, Kuo-kun Tseng, Weiyi Lu, Kan Liu, Tao Lan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21123v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks.\ In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21125v1 Announce Type: new Abstract: Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at https://qingyonghu.github.io/AVSPO.

Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

Lukas Twist, Helen Yannakoudakis, Jie M. Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21127v1 Announce Type: new Abstract: Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21130v1 Announce Type: new Abstract: Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose \textbf{VersusQ}, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21131v1 Announce Type: new Abstract: Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21132v1 Announce Type: new Abstract: Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21133v1 Announce Type: new Abstract: In this paper, we explore spatial-aware humanoid whole-body manipulation task. Compared with tabletop settings, this task poses two key challenges: 1) Spatial understanding is challenging in complex 3D environments with diverse spatial relations. 2) Action generation is difficult to generalize, as limited and costly real-robot data restricts data-driven models generalization. To address these challenges, we propose a generalizable humanoid loco-manipulation framework that leverages the spatial perception and action generation capabilities of multi-agent large models. Specifically, our framework includes two components: Active Spatial Brain for active spatial perception and decision-making, and Generalizable Action Cerebellum for executable robot action generation. The first component actively perceives the spatial scene and makes decisions on task planning and subtask decomposition. The second component generate executable robot actions based on the decisions made by the first module without needs of task-specific real robot data. To benchmark our framework, we design a set of spatial manipulation tasks from two perspectives: evaluating spatial perception and understanding, and assessing real-robot task performance. The results demonstrate strong performance on both aspects across diverse tasks and environments.

Complete Supermartingale Certificates for $\omega$-Regular Properties

Alessandro Abate, Mirco Giacobbe, Sergey Ichtchenko, Diptarko Roy — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21134v1 Announce Type: new Abstract: We introduce a general methodology for the construction of sound and complete proof rules for the almost-sure and quantitative acceptance of reactivity properties on time-homogeneous Markov chains with general state spaces. Reactivity captures $\omega$-regular properties and subsumes linear temporal logic. Our core technical result establishes that every reactivity property admits decomposition into multiple obligations of almost-sure termination into absorbing regions, and that appropriate absorbing regions always exist on general state spaces. This enables the extension of every complete proof rule for almost-sure termination into a proof rule for reactivity that is complete in the almost-sure case, and complete up to an arbitrarily small $\varepsilon$-approximation in the quantitative case. We apply our new methodology to recent results on sound and complete supermartingale certificates for almost-sure termination in the special case of countably infinite state spaces, alongside standard results on quantitative safety. As a result, we obtain the first sound and complete supermartingale certificates for almost-sure $\omega$-regular properties and the first sound and $\varepsilon$-complete supermartingale certificates for quantitative $\omega$-regular properties on time-homogeneous Markov chains with countably infinite state spaces.

Smarter edits? Post-editing with error highlights and translation suggestions

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21135v1 Announce Type: new Abstract: As MT quality increases, interest in enhanced post-editing features such as QE-derived error highlights is growing, yet evidence for their usefulness remains limited. In this work, we explore the usefulness of LLM-derived error highlights and correction suggestions based on automatic post-editing (APE). We conduct a study where professional translators (En-Nl) post-edit translations using APE error highlights and correction suggestions and compare productivity, quality and user experience to regular PE and PE with QE-derived highlights. While no condition yielded productivity or quality gains compared to regular PE, APE highlights were better received than QE-derived highlights, and correction suggestions improved overall user experience.

LoRa and LoRaWAN simulator-cum-emulator with CAD and capture effect in Python

Matthijs Reyers, Niels Hokke, R. R. Venkatesha Prasad — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21136v1 Announce Type: new Abstract: Existing LoRaWAN/LoRa simulators consist of large, complicated C++ codebases and often do not support all device classes. This paper presents the design of a simple to use, Python-based discrete-event simulator that addresses these gaps while also introducing a novel method for evaluating real device firmware in the simulator. The simulator is built on a custom asyncio-based simulation kernel, a three-phase packet delivery model that reproduces the capture effect, a full LoRaWAN 1.0.4 stack, and a containerized firmware system that cross-compiles real STM32 C firmware and redirects HAL calls into the simulator via CFFI. The simulator is distributed as a Python package via Github (https://github.com/MatthijsReyers/lora-simulator) and requires no external simulation framework or dependencies.

Safety-Critical Control for Smoothed Implicit Contact Dynamics

Haegu Lee, Yitaek Kim, Christoffer Sloth — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21138v1 Announce Type: new Abstract: Smoothed implicit contact dynamics enables gradient-based planning and control for contact-rich tasks without predefined mode sequences. However, safety-critical control remains challenging because implicit contact dynamics makes safety-filter design nontrivial. The smoothing parameter $\kappa$ relaxes contact complementarity constraints, which makes the dynamics smooth but affects the contact force. This paper provides a method for bounding the actual contact force despite the use of relaxed complementarity constraints. We show that constraint violations can be non-monotonic in $\kappa$. Smaller $\kappa$ reduces force-approximation error, but it does not necessarily improve safety performance. To address this issue, we introduce boundary-focused rollouts to screen $\kappa$ by comparing the safety margin with the approximation error. We then develop a discrete-time control barrier function (CBF) framework based on a first-order Taylor approximation of the implicitly defined contact force. To account for possible force under-prediction, we augment the resulting safety constraint with a fixed robust margin. Simulations on four contact-rich systems show that the proposed method eliminates force violations observed under a standard CBF.

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21139v1 Announce Type: new Abstract: Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a CognitivePhysical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

CoarseSoundNet: Building a reliable model for ecological soundscape analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21143v1 Announce Type: new Abstract: A soundscape is composed of three types of sound: biophony (sounds made by animals), geophony (natural abiotic sounds) and anthropophony (sounds made by humans). A key research question in the field of soundscape ecology is how these components interact with each other, specifically how biophony responds to geophony and anthropophony. Nevertheless, as of today, there are not many analytical instruments that enable the distinct quantification of these elements. Recent machine learning (ML) approaches aim to support automated analysis but often rely on task-specific or clean data, limiting generalisation to noisy passive acoustic monitoring (PAM) recordings. This study presents a clear and reproducible structure to build ML models for coarse soundscape classification and introduces CoarseSoundNet, a deep learning model trained to distinguish biophony, geophony, and anthropophony under realistic PAM conditions. We systematically investigate model architectures, the influence of an additional training class, data composition, and evaluation strategies. Our findings suggest that model performance improves with additional PAM data, especially when similar to the target domain, and by introducing an explicit silence class during training. Class-specific decision thresholds and duration-based constraints further enhance performance, particularly for anthropophony and geophony. Error analyses exhibit challenges for anthropophony due to masking effects and confusions for silence and insect sounds for geophony and biophony. Finally, we conduct an ecological case study which shows that pre-filtering recordings with CoarseSoundNet yields acoustic index trends comparable to ground-truth filtering, supporting its use as an effective preprocessing tool for ecoacoustic analyses.

A Bernoulli phase-fitted finite difference method and wavenumber-explicit analysis for the one-dimensional Helmholtz equation

Ansgar J\"ungel, Panchi Li, Zhiwei Sun, Zhiwen Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21144v1 Announce Type: new Abstract: We propose a Bernoulli phase-fitted (BPF) finite difference method for the Helmholtz equation on the interval $(0, L)$ with impedance boundary conditions. The scheme is derived from a complexified Scharfetter--Gummel discretization of the one-way factorization of the Helmholtz operator. It yields both a phase-fitted interior discretization and exact discrete impedance boundary closures. For the homogeneous problem, the method is exact for plane waves, so the scheme introduces neither numerical dispersion in the interior nor artificial reflection at the boundaries. For the inhomogeneous problem, we prove well-posedness, derive wavenumber-explicit stability estimates, and establish second-order consistency and convergence valid for all $kh\notin\pi\mathbb Z$, where $k$ is the wavenumber and $h$ the grid size. In particular, under the fixed-resolution condition $kh\le s_0$ for some $0

Cloud-Native Operation of Roadside Infrastructure Enabling Demand-Driven Collective Perception via V2X

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21145v1 Announce Type: new Abstract: Intelligent roadside infrastructure is a key enabler for cooperative intelligent transport systems (C-ITS), supporting vehicles equipped with automated driving systems (ADS), e.g., through enhanced environment perception. With a growing number and an expanding functional scope of roadside units, scalable and efficient operation becomes a challenge. This paper presents a cloud-native architecture for the operation of distributed roadside infrastructure based on a Kubernetes cluster spanning roadside units and a cloud server. Building on this architecture, a demand-driven orchestration approach is implemented to dynamically deploy resource-intensive services only when required. As a representative use case, a V2X-based collective perception application is deployed on-demand when a connected vehicle is nearby. The approach is validated in a real-world experiment in our test field in Aachen, demonstrating that the collective perception application starts in time for the vehicle to benefit from it. Without any demand, the application remains inactive, reducing energy consumption, channel congestion, and hardware wear. Beyond the primary evaluation, V2X recordings from the test field are analyzed to estimate the energy-saving potential of demand-driven operation. In summary, the results demonstrate the practical feasibility of cloud-native, demand-driven operation of roadside infrastructure and indicate its potential to improve scalability and (energy) efficiency in future C-ITS deployments.

Detecting Trojaned DNNs via Spectral Regression Analysis

Samuele Pasini, Jinhan Kim, Paolo Tonella — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21146v1 Announce Type: new Abstract: Modern DNNs are repeatedly fine-tuned to incorporate new data and functionality. This evolutionary workflow introduces a security risk when updated data cannot be fully trusted, as adversaries may implant Trojans during fine-tuning. We present MIST, a Trojan detection approach that analyzes how a model's internal representations change during fine-tuning. Rather than attempting to reconstruct trigger conditions, MIST characterizes benign model evolution using pre-activation spectra and flags updates whose spectral deviations are inconsistent with this reference. This framing treats Trojan detection as a regression problem over model updates. An empirical evaluation across four datasets and eight Trojan attacks shows that spectral distances reliably distinguish Trojaned updates from clean fine-tuning. MIST outperforms state-of-the-art detection accuracy after a single update, without requiring any knowledge about the poisoned data or the trigger, and remains effective under multi-step benign evolution, with graceful and bounded degradation. These results indicate that spectral evolution provides a stable and assumption-light signal for detecting malicious model updates.

SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21147v1 Announce Type: new Abstract: As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity. Theory suggests that LoRA fine-tuning with rank r converges toward the top r singular values of the pre-trained weight matrix. As the rank increases, more principal singular directions are preserved, which generally improves the model's performance. However, a larger rank also introduces more trainable parameters, leading to higher computational cost. To overcome this dilemma, we propose SMoA, a \textbf{S}pectrum \textbf{Mo}dulation \textbf{A}dapter that enlarges the accessible family of spectrum-aware updates under a smaller parameter budget. SMoA partitions the layer into multiple aligned spectral blocks and applies one in-block Hadamard-modulated low-rank branch to each diagonal block, yielding broader coverage of pretrained spectral directions. We provide theoretical analysis and empirical results on multiple tasks. In our experiments, SMoA improves average performance in the current lower-budget setting over LoRA and competitive LoRA-style baselines.

EllipseLIO: Adaptive LiDAR Inertial Odometry with an Ellipsoid Representation

Rowan Border, Margarita Chli — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21150v1 Announce Type: new Abstract: LiDAR Inertial Odometry (LIO) is a critical component for many mobile robots that need to navigate without relying on external positioning (e.g., GPS). Platforms that operate autonomously in different environments and with heterogeneous LiDAR sensors require a LIO approach that can adapt to these different scenarios without human intervention. Existing LIO approaches can typically provide reliable and accurate odometry in scenarios with similar environments and sensors when suitably tuned. However, many approaches struggle to retain robust odometry across heterogeneous environments and sensors while using a consistent configuration. This paper presents EllipseLIO, a real-time LIO approach that generalises between scenarios by using methods for LiDAR scan filtering and registration that adapt to the sensor capabilities and environment without requiring scenario-specific tuning. Experiments with EllipseLIO and state-of-the-art LIO approaches on five datasets with diverse and challenging scenarios demonstrate that EllipseLIO is the best-performing approach overall. It achieves a 38% lower odometry error on average than the second-best approach and is the only approach that does not diverge in any experiment. An open-source version of EllipseLIO will be available at github.com/v4rl-ucy/ellipselio.

Coordinated Optimal Power Quality Management in Distribution Systems Using The Residual Capacity of Community IBRs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21153v1 Announce Type: new Abstract: This letter proposes a network-wide coordinated optimization model to mitigate voltage unbalance (VU) by unleashing the remaining capacity of community inverter-based resources (IBRs). Existing single-sequence strategies ignore coupled capacity constraints and cause idle headroom. Meanwhile, they fail to harness the collective governance capabilities of community IBRs. To solve this discrepancy and exploit the unused potential, we developed a sequence-domain network model in dual commonly shared synchronous reference frames. Strict phase current and apparent power limits are formulated and convexified via polyhedral approximations. A quadratic objective function flexibly balances sequence capacity allocation. Simulation and experimental results validate the effectiveness of the proposed strategy.

Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21154v1 Announce Type: new Abstract: Mental health has become a global priority, leading to a massive administrative burden in the coding of clinical diagnoses. This study proposes the automation of psychiatric diagnostic analysis by mapping free-text descriptions to the International Classification of Diseases (ICD) using Natural Language Processing (NLP) and Machine Learning (ML) techniques. Utilizing a specialized dataset of 145,513 Spanish psychiatric descriptions, various text representation paradigms were evaluated, ranging from classical frequency-based models (BoW, TF-IDF) to state-of-the-art Large Language Models (LLMs) such as e5\_large, BioLORD, and Llama-3-8B. Results indicate that transformer-based embeddings consistently outperform traditional methods by capturing implicit semantic cues and nuanced medical terminology. The e5\_large model, through end-to-end fine-tuning, achieved the highest performance with a $F1_{micro}$ score of 0.866. This research demonstrates that adapting LLMs to specific clinical nomenclature is essential for overcoming the challenges of ``long-tail'' label distributions and the inherent ambiguity of psychiatric discourse.

Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21157v1 Announce Type: new Abstract: In modern warfare, drones are becoming an essential part of intelligence gathering and carrying out precise attacks in different kinds of hostile environments. Their ability to operate in real-time and hostile environments from a safe distance makes them invaluable for surveillance and military operations. The KIIT-MiTA dataset is comprised of images of different military scenarios taken from drones, and these provide a foundation for detecting military objects, but it does not take into account the various types of real-world scenarios. With that in mind, to evaluate how the models are performing under varying conditions, four different types of datasets are created: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision. These simulate the real-world environments such as low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and used to detect objects across diverse settings. This research boosts the performance and reliability of drone-based operations by contributing to the development of advanced detection systems in both defensive and offensive missions.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning

Jingfeng Zhong, Zhengxiang Liu, Zhijie Wang, Shuai Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21160v1 Announce Type: new Abstract: The discovery of first integrals is of fundamental scientific importance for understanding conservation laws in dynamical systems. However, existing symbolic computation tools and Large Language Models (LLMs) remain limited on this task because high-quality training data are scarce and successful solutions often depend on mathematical intuition. This paper presents FISolver, an LLM-based solver developed to address this challenge. First, we introduce a "Backward Generation" algorithm that systematically builds large-scale datasets of (differential equation, first integral) pairs by deriving differential equations from sampled integrals, thereby alleviating the data scarcity bottleneck. Second, we apply supervised fine-tuning to a compact mathematical model and further improve its performance through reinforcement learning with a Levenshtein Distance-based shaped reward. In addition, we design data synthesis and blending strategies that support effective adaptation to difficult problem families from sparse examples. Experiments show that FISolver, while requiring substantially lower computational cost, significantly outperforms larger mathematical LLMs and commercial solvers such as Mathematica on challenging benchmarks, indicating a new data-driven route for automated discovery of first integrals.

A Least-Squares Weak Galerkin Finite Element Scheme for Cauchy Problems in Helmholtz

Chunmei Wang, Shangyou Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21162v1 Announce Type: new Abstract: This paper introduces and rigorously analyzes a least-squares weak Galerkin (LS-WG) finite element method for the severely ill-posed Cauchy problem associated with the Helmholtz equation. By utilizing a weak Laplacian operator defined on a space of discontinuous functions, the proposed framework facilitates the seamless treatment of complex boundary conditions and internal interfaces. We emphasize the geometric flexibility of the LS-WG scheme on general polygonal and polyhedral partitions. Furthermore, we prove the uniqueness of the numerical solution and derive optimal-order error estimates with respect to a specifically designed discrete energy norm. Extensive numerical experiments validate the theoretical convergence rates and demonstrate the algorithm's robustness and efficiency over traditional Galerkin approaches.

Q-SYNTH: Hybrid Quantum-Classical Adversarial Augmentation for Imbalanced Fraud Detection

Adam Innan, Mansour El Alami, Nouhaila Innan, Muhammad Shafique, Mohamed Bennai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21164v1 Announce Type: new Abstract: Credit card fraud detection is fundamentally challenged by extreme class imbalance, where fraudulent transactions are rare yet operationally critical. This imbalance often biases supervised learners toward the legitimate class, leading to high overall accuracy but weaker fraud-class recall and F1-score. This paper introduces Q-SYNTH, a hybrid classical--quantum generative adversarial framework in which a parameterized quantum circuit serves as the generator and a classical neural network serves as the discriminator. Q-SYNTH is designed for minority-class fraud synthesis in tabular data and is evaluated along two dimensions: statistical fidelity to real fraud samples and downstream performance for fraud detection. To this end, generated samples are assessed using distributional similarity measures based on Kolmogorov-Smirnov statistics and Wasserstein distances, real-vs-synthetic detectability measured by AUC-ROC, and downstream classification performance across both quantum and classical classifiers. Under the reported protocol, Q-SYNTH reduces marginal distribution mismatch relative to a classical GAN baseline while maintaining competitive downstream fraud-detection performance. Although SMOTE achieves the strongest feature-wise similarity and the classical GAN attains the highest downstream performance in several settings, Q-SYNTH offers a favorable compromise between distributional fidelity and downstream performance, supporting the feasibility of hybrid quantum augmentation for imbalanced fraud detection.

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Qiyu Ruan, Yuxuan Wang, He Li, Zhenning Li, Cheng-zhong Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21168v1 Announce Type: new Abstract: Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $\sigma$ with an online-learned AV-risk predictor $\Phi$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.

FTerViT: Fully Ternary Vision Transformer

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21171v1 Announce Type: new Abstract: Ternary Vision Transformers offer substantial model compression, however state-of-the-art methods only ternarize the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer in which \emph{all} weight matrices and normalization parameters are ternarized (FTerViT). To this end, we introduce two novel operators : TernaryBitConv2d with per-channel scaling for patch embedding and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384$\times$384 resolution achieves 82.43\% ImageNet-1K top-1 at 6.09\,MB (${\sim}$15$\times$ compression, $-$2.42\,pp vs.\ FP32), outperforming prior ternary ViTs methods up to 8 pp. Finally, we demonstrate the first implementation of ternary vision transformers on a dual cores XTensa LX7 microcontroller inside the ESP32-S3 system-on-chip. By deploying FTerViT-Small (based on DeiT-III-Small at 224$\times$224 resolution, 5.81\,MB), we achieve 79.64\% ImageNet-1K top-1 accuracy.

ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21177v1 Announce Type: new Abstract: This work presents \textsc{ChunkFT}, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. \textsc{ChunkFT} enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of \textsc{ChunkFT} in the deterministic setting. Empirically, we apply \textsc{ChunkFT} to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2$\times$ H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of \textsc{ChunkFT} in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that \textsc{ChunkFT} consistently outperforms existing memory-efficient baselines. Notably, \textsc{ChunkFT} achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on https://github.com/misonsky/chunk.

Metaphors in Literary Post-Editing: Opening Pandora's Box?

Aletta G. Dorst, Mayra O. Nas, Katinka Zeven — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21178v1 Announce Type: new Abstract: This paper investigates how post-editors of literary texts react and respond to the way metaphors have been translated by Neu ral Machine Translation (NMT) and Large Language Models (LLMs). The results show that one in three metaphors in the output were changed by the post-editors, demonstrating that the translation of fig urative language is indeed problematic in literary MT (LitMT). The responses indi cate that the post-editors were aware of overly literal translations, though mostly for multiword expressions. Moreover, at times they found it difficult to determine whether solutions were acceptable. They rated the overall quality of the MT out put as quite poor and stated that the post editing was more work and more effort than it would have been translating from scratch. This supports previous studies ar guing that post-editing constrains transla tors in their creativity and diminishes their sense of text ownership.

KSOS-BO: Improving Sampling in Bayesian Optimization via Kernel Sum of Squares

Buqing Ou, Frederike D\"umbgen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21179v1 Announce Type: new Abstract: Bayesian Optimization (BO) is an effective framework for globally optimizing functions whose evaluations are expensive. It is particularly effective for optimizing functions defined over continuous domains and explicitly handles stochastic noise in evaluations. As a result, it is widely applied in areas such as hyperparameter tuning, robotics policy search, and scientific experiment design, where sample efficiency is essential. Its two-step procedure consists of model fitting followed by optimization of the acquisition function, which is often treated as a generic black-box problem despite its structured nature. In this work, we introduce KSOS-BO, a kernel-based derivative-free framework for BO acquisition optimization. KSOS-BO formulates the optimization of the acquisition function as a semidefinite program with kernel-induced representations, enabling a structured global search. Across a diverse set of benchmark functions with varying landscape properties, KSOS-BO consistently outperforms derivative-free baselines using Sobol Search, Differential Evolution, or CMA-ES to optimize the acquisition function, achieving an average regret improvement of 81.16% on 10/15 benchmarks. In particular, KSOS-BO demonstrates strong performance in highly multimodal and unimodal but ill-conditioned functions, indicating its applicability to diverse landscape structures. Despite a higher per-iteration computational cost, it converges faster in wall-clock time with an average improvement of 93.55% on 10/15 benchmarks, as it reaches high-quality solutions with fewer evaluations. Limitations include reduced effectiveness on functions with steep drops or plate-shaped regions.

Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21180v1 Announce Type: new Abstract: Large language models show strong potential for automated code generation, but lack guarantees for correctness, quality, safety, and domain-specific constraints. For instance in robotics, where code generation is increasingly being used for planning and executing actions, awareness of the environment and physical constraints is critical. To facilitate the adaption of code-generating LLMs to diverse requirements, including domain-specific ones, we present a reinforcement learning framework that fine-tunes pre-trained LLMs using proximal policy optimization. Our customizable execution-aware reward formula captures and optimizes syntax, functional correctness, code style, security, and simulator executability. A token-level reward mapping mechanism enables effective credit assignment from execution outcomes to generated tokens. The framework is evaluated on general-purpose code generation (MBPP/MBPP+) and robotic program synthesis (RoboEval). The results show substantial improvements in functional correctness and simulator executability, including an absolute pass@1 increase of 19% on MBPP and a reduction in execution failures by 51% on RoboEval. These findings demonstrate that structured reinforcement learning can effectively align language models to correct program generation and domain-specific requirements.

On the Identifiability of Semi-Blind Estimation in Cell-Free Massive MIMO Networks

Christian Forsch, Laura Cottatellucci — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21181v1 Announce Type: new Abstract: Semi-blind joint channel estimation and data detection (JCD) is a promising approach to mitigate pilot contamination in cell-free massive multiple-input multiple-output (CF-MaMIMO) networks. The effectiveness of such methods fundamentally depends on identifiability, i.e., the ability to unambiguously recover the unknown channel coefficients and transmitted data signals from the received uplink observations. In this work, we investigate the identifiability of semi-blind JCD from a large-scale system design perspective. We consider a CF-MaMIMO network in which access points (APs) and user equipments (UEs) are spatially distributed according to Poisson point processes (PPPs). The resulting network topology is modeled as bipartite random geometric graph (BRGG) that captures local connectivity induced by wireless propagation. To enable a tractable analysis, the spatially dependent graph model is approximated by a surrogate independent-edge random graph with matched degree distributions. Building on this model, we develop a recursive probabilistic analysis that characterizes the conditions under which semi-blind recovery succeeds with high probability. The proposed analysis reveals an identifiability region as a function of key system parameters, including AP and UE densities and the connectivity radius beyond which channel coefficients are assumed negligible. Monte Carlo simulations validate the predicted identifiability region and assess the accuracy of the proposed graph approximation. The proposed framework provides system level insights into how network density and connectivity affect identifiability in large-scale CF-MaMIMO systems and offers guidelines for selecting deployment parameters and pilot sequence lengths that enable reliable semi-blind recovery.

Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding

Jeonghun Baek, Atsuyuki Miyai, Shota Onohara, Hikaru Ikuta, Kiyoharu Aizawa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21182v1 Announce Type: new Abstract: Manga is a culturally distinctive multimodal medium and one of the most influential forms of Japanese popular culture. As AI systems increasingly target manga understanding, OCR, and translation, Manga109 has become a foundational dataset for manga-related AI research. However, the current Manga109 dataset contains transcription errors and coarse annotations, which do not align well with modern OCR and multimodal manga understanding tasks. In this work, we revisit the dialogue text annotations of Manga109 and identify five categories of annotation issues, including transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. To address these issues, we combine OCR-based issue detection and manual revision to construct Manga109-v2026, revising approximately 29,000 dialogue annotations. Our revisions better align Manga109 with modern OCR and multimodal manga understanding systems while preserving expressive structures characteristic of manga.

Collaborative Optimization of Battery Charging / Swapping Stations for eVTOLs Based on Closed-Loop Supply Chain and Space-Time Network

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21183v1 Announce Type: new Abstract: Against the backdrop of the burgeoning global low-altitude economy, countries have successively introduced a series of policies to accelerate the application and commercialization of electric vertical take-off and landing (eVTOL) aircraft. Nevertheless, purely electric eVTOLs confront constraints including limited battery energy density, high operational power requirements, and challenges associated with rapid energy replenishment, which collectively restrict their flight endurance and application scenarios. Furthermore, while eVTOL deployment is scaling up, supporting charging infrastructure and regulations remain underdeveloped. This situation presents emerging power distribution networks with new challenges in maintaining adequate electricity supply and ensuring operational continuity. To tackle these issues, following an investigation into battery energy replenishment strategies, a closed-loop supply chain-based model for eVTOL battery charging and swapping is proposed. Time-space network methods are utilized to characterize the scheduling of batteries and logistics throughout the system. Subsequently, aiming to maximize the operational revenue of the model, optimized management of battery swapping, transportation, and charging processes is implemented, facilitating coordinated operation among eVTOLs, swapping stations, and charging stations. Finally, the model is solved by Gurobi, verifying its feasibility. Simulation results further indicate that the model alleviates range anxiety for eVTOLs, offering strong support for their commercialization. Moreover, it enables coordinated scheduling between eVTOLs and the distribution network, thereby facilitating the network's gradual improvement and upgrading.

Information Leakage Envelopes

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21185v1 Announce Type: new Abstract: We study privacy guarantees in the framework of pointwise maximal leakage (PML) that satisfy two requirements: they are robust under post-processing and upper bound the failure probability, i.e., the probability that the information leakage exceeds a given threshold. We first examine two candidate definitions inspired by (approximate) differential privacy and show that neither one satisfies both requirements simultaneously. We then introduce the notion of the PML envelope, which quantifies the largest amount of information leakage about a secret after arbitrary post-processing of a mechanism's output. By construction, the PML envelope satisfies both requirements. We discuss basic structural properties of the envelope, such as monotonicity, and derive general upper and lower bounds. We further analyze the envelope for two widely used privacy mechanisms: the PML-extremal mechanisms in the high-privacy regime and randomized response. Overall, this work establishes the PML envelope as a natural and operationally meaningful definition for providing privacy guarantees that are preserved under arbitrary downstream transformations.

SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21186v1 Announce Type: new Abstract: Interpretability in object detection provides crucial confidence support for clinical auxiliary diagnosis. However, in tiny bacteria detection, traditional explanation methods often suffer from blurred foreground boundaries and diffuse feature attribution due to the extreme sparsity of target morphological features and severe interference from complex backgrounds. Such limitations hinder the provision of logically coherent morphological evidence. To bridge this gap, we propose a novel eXplainable AI (XAI) framework, SAM-Sode. The framework innovatively transforms initial feature attribution maps into geometry-aware prompts, leveraging the prior knowledge of the foundation model (SAM3) to achieve spatial refinement and morphological reconstruction of the explanatory mappings. Furthermore, we introduce a dual-constraint mechanism based on physical significance and geometric alignment to perform instance-level denoising, generating coherent explanations that better align with human expert intuition. Experimental results on our self-constructed bacteria dataset with complex circuit backgrounds (containing 2,524 images) and other public datasets demonstrate that the proposed method effectively suppresses background redundancy and significantly enhances the decision-making transparency of tiny object detection.

High-speed Networking for Giga-Scale AI Factories

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21187v1 Announce Type: new Abstract: As distributed model training scales to span hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach to provide fast reaction to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: 98% of the theoretical line rate with low jitter-free latency; strong cross-tenant isolation for concurrent workloads; robust, capacity-proportional bisection bandwidth and 7% latency increase for 10% fabric link failures; and rapid reaction to host and fabric link flaps during LLM training workloads.

A Terrain-Adaptive epsilon-Constraint MPC for Uneven Terrain Kinodynamic Planning

Otobong Jerome, Geesara Kalathunga, Tiago Nascimento — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21188v1 Announce Type: new Abstract: Kinodynamic planning for car-like vehicles on uneven terrain requires simultaneously optimizing competing objectives such as path efficiency and pose stability. This work presents an adaptive epsilon-constraint method integrated into a Model Predictive Control (MPC) framework, where the epsilon bounds are dynamically adjusted based on terrain descriptors to explore the Pareto front in real time. To capture vehicle-terrain dynamics, we develop a semi-parametric model combining analytical vehicle dynamics with a Sparse Gaussian Process (SGP) trained on the same terrain descriptors. The proposed epsilon-MPC is evaluated against MPPI and GAKD baselines, achieving a 94% navigation success rate while reducing maximum orientation deviation by 24% and improving multi-objective trade-off quality by 23%.

Semantic Granularity Navigation in Image Editing

Liangsi Lu, Minzhe Guo, Xuhang Chen, Yang Shi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21190v1 Announce Type: new Abstract: Despite the generative capabilities of diffusion and flow models, real-image editing remains constrained by a persistent trade-off between semantic editability and structural fidelity. We trace a primary cause of this limitation to the implicit coupling of edit progress with model scale in existing paradigms. Under this coupling, stronger edits typically require visiting noisier states, which spends computation on destabilizing layout before the semantic change is well localized. We introduce NaviEdit, a training-free inference-time controller that decouples edit progress from model scale traversal through a strict self-consistency contract. NaviEdit operates at the rollout level and leaves the underlying pretrained model unchanged. It treats scale as a control input and reallocates a fixed step budget toward semantically responsive intermediate scales instead of destructive high-noise regimes. Experiments show positive average gains across compatible editors and flow backbones, supporting decoupling as a portable inference-time control principle.

The Statistical Significance of the Inclusion of Graph Neural Networks in the Financial Time Series Forecasting Problem

Marco Gregnanin, Johannes De Smedt, Giorgio Gnecco, Maurizio Parton — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21192v1 Announce Type: new Abstract: Forecasting univariate time series in the financial market is a challenging endeavor. While numerous statistical and machine learning models have been introduced to address this challenge, they typically concentrate solely on analyzing temporal patterns within the time series data. In this research, we study the statistical significance of the inclusion of geometric patterns in enhancing forecasting accuracy within the context of time series analysis. We introduce the Time-Geometric model, a combination of models designed to exploit both geometric and temporal patterns. The contribution of this research lies in advancing the domain of univariate time series prediction,as demonstrated through extensive empirical evaluations. Our findings underscore that leveraging geometric patterns, captured through Graph Neural Networks, yields statistically significant improvements in forecasting accuracy.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21195v1 Announce Type: new Abstract: Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity--alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.

SURGE: An Event-Centric Social Media Sentiment Time Series Benchmark with Interaction Structure

Chen Su, Pengsen Cheng, Yuanhe Tian, Yan Song — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21198v1 Announce Type: new Abstract: Public events on social media generate large volumes of discussion whose collective dynamics carry direct value for opinion forecasting and crisis response. Capturing how these dynamics evolve across an event's lifecycle requires organizing fragmented posts into event-level time series. Existing datasets cover only a small number of events within a single category, and typically discard the interaction structure between posts when constructing time series, which restricts both transfer across event types and controlled study of how interactions shape the resulting collective dynamics. We present SURGE, a multi-event social media benchmark that pairs event-level time series with aligned text and interaction structure linking posts within an event. SURGE is built through an automated pipeline that produces calendar-aligned time series at three temporal granularities, covering 67 events and more than 800K posts across five event categories. Each time bin is paired with flat and structured textual views derived from the same selected posts, enabling controlled evaluation of whether social interaction structure affects forecasting behavior. On top of SURGE we define benchmark protocols for numerical-only forecasting, text-augmented forecasting, high-interaction evaluation, and leave-one-category-out generalization. Experiments with representative time-series and multimodal forecasting models reveal three properties of the benchmark: a strong local-persistence regime in which naive baselines remain hard to beat under absolute error, limited transfer of existing text-augmented forecasters to event-driven social-media data, and increased difficulty on reply-dense periods that aggregate metrics tend to obscure. We further include a lightweight structure-aware probe as a reference implementation, illustrating how SURGE can support interaction-aware forecasting research.

Tao's Equational Proof Challenge Accepted (Technical Report)

Lydia Kondylidou, Jasmin Blanchette, Marijn J. H. Heule — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21200v1 Announce Type: new Abstract: In the context of the Equational Theories Project, Terence Tao posed the challenge of finding alternatives to a complicated 62-step proof found by the Vampire superposition prover. We introduce a proof minimization tool called Krympa. Using a combination of brute force and heuristics, and exploiting both Vampire and the Twee equational prover, the tool reduces the 62-step proof to 20 steps, each corresponding to a rewrite. In an empirical evaluation, it also performs well on 1431 equational problems originating from the same project, reducing in particular a 151-step proof to only 10 steps.

Supporting Dynamic Control-Flow Execution for Runtime Reconfigurable Processors

Hassan Nassar, Rafik Youssef, Lars Bauer, J\"org Henkel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21203v1 Announce Type: new Abstract: As the need for more computing power grows, traditional methods are hitting limits. To boost performance, we're expanding Central Processing Unit (CPU) capabilities and using specialized hardware accelerators. For example, mobile devices usually have cameras, video encoding, and audio accelerators. To perform the different tasks, these accelerators execute microcode programs. These accelerators, however, take up space and often sit idle. Reconfigurable processors offer a solution. They have a normal core connected to several accelerator slots. These accelerator slots can be filled during runtime to accommodate the application running. Once one application finishes and another application is running, the accelerators can be switched. For example, playing music after using the camera. In this work, we introduce dynamic control-flow execution for the microcode of runtime reconfigurable processors, i.e., support for loops, conditional jumps, and exception handling. We benchmark using four different applications from four domains (object detection, ocean movement simulation, artificial intelligence and security) that all are compute-intensive and would require the dynamic control-flow when executed on reconfigurable processors. We show that the dynamic control-flow allows different applications to be executed with significant speedup in comparison with execution on general-purpose processors.

PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

Xiaoyu Zhou, Jianwei Fei, Peipeng Yu, Jingchang Xie, Chong Cheng, Zhihua Xia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21207v1 Announce Type: new Abstract: The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at https://github.com/xiaoyu6868/PGC.

Reinforcement Learning-based Control via Y-wise Affine Neural Networks: Comparative Case Studies for Chemical Processes

Austin Braniff, Yuhe Tian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21211v1 Announce Type: new Abstract: In this work we present an efficient and practically implementable approach for the application of reinforcement learning (RL)-based control in chemical process systems. This is an area that has yet to widely adopt RL-based control largely due to inherent challenges in trusting RL algorithms and the time-consuming process of training reliable agents. To address these challenges, we leverage a class of RL algorithms termed Y-wise Affine Neural Network (YANN)- RL, which we have developed in our prior work (Braniff and Tian, 2025a). By strategically initializing actor and critic networks YANN-RL algorithms provide confident and interpretable starting points within control schemes. We apply this RL-based control approach to three different process engineering case studies publicly available on the PC-Gym library (Bloor et al., 2026): (i) a continuous stirred tank reactor (CSTR), (ii) a four-tank system, and (iii) a multistage extraction column. Our approach is compared to several popular RL algorithms (PPO, SAC, DDPG, and TD3) and is benchmarked against nonlinear model predictive control (NMPC). These case studies demonstrate that YANN-RL can greatly reduce the training time and data needed, can be deployed with confidence for chemical process systems, and can approach the performance of NMPC without the knowledge of a full nonlinear model.

Behavior-Consistent Deep Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21214v1 Announce Type: new Abstract: Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that na\"ively increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.

ECHO-PPI: Trustworthy AI for Evidence-Bundled Detection of Overlapping Protein Modules in Protein-Protein Interaction Networks

Sima Soltani, Mehrdad Jalali, Yahya Forghani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21216v1 Announce Type: new Abstract: Protein-protein interaction networks provide a graph-level view of cellular organization, yet their functional modules are overlapping, noisy, and difficult to interpret from cluster assignments alone. Existing community-detection methods can recover candidate protein complexes, but they rarely explain why an individual protein is assigned to a specific module or whether that assignment should be treated as core, peripheral, or uncertain. Here we introduce ECHO-PPI, an evidence-bundled framework for interpretable overlapping protein-module detection in protein-protein interaction networks. ECHO-PPI integrates weighted network topology, semantic protein profiles, and Gene Ontology evidence to identify evidence-potential nuclei, construct candidate modules, perform overlap-aware assignment, and export hierarchical confidence labels. The framework supports trustworthy computational decision support through assignment-level interpretability: each protein-module assignment is accompanied by topology, semantic, and Gene Ontology evidence scores and a hierarchical confidence label, enabling curators to inspect, rank, and triage overlapping module predictions. Evaluation on yeast protein-interaction data shows that ECHO-PPI preserves the behaviour of strong overlap-aware baselines while adding evidence-bundled auditability. Rather than claiming universal predictive superiority, ECHO-PPI addresses a complementary need: making overlapping protein-module predictions inspectable, confidence-aware, and reproducible for downstream biological interpretation.

ASIND: Alternating Sparse Identification for Predicting Network Dynamics Without Knowledge

Mingyu Kang, Jianxi Gao, Wenwu Yu, Linyuan Lv — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21220v1 Announce Type: new Abstract: Identifying network dynamics is a critical yet challenging task to to understand the mechanism of real-world social systems. There are two types of algorithms, and one requires the knowledge of self-dynamics function, interactive function, and interactive network to sparsely identify the network dynamics. Another one does not require any knowledge, but use simple functions to universally approximate complex functions. However, this type of algorithms lack interpretability, and the functional space is too extensive to search efficiently. Thus, to address this issue, this work proposes an Alternating Sparse Identification of Network Dynamics (ASIND) algorithm to sparsely identify the self-dynamics function, interactive function and interactive network alternatively. Extensive experiments are conducted to show the state-of-the-art identification and 100-steps prediction performance compared to the baseline. The experimental results also show the weak identifiability of interactive network, that means different networks can generate highly similar trajectories of network dynamics. The code is available at https://github.com/KMY-SEU/ASIND.

PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment

Richa Verma, Bavish Kulur, Sanjay Chawla, Balaraman Ravindran — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21225v1 Announce Type: new Abstract: We address the problem of making a pre-trained reinforcement learning (RL) policy safety-aware by incorporating cost constraints without retraining it from scratch. While costs could be numerically encoded, we assume a more general setting is when costs are provided as preferences. Given a reward-optimized policy and a small dataset of preferred (low-cost) and dispreferred (high-cost) trajectories, our goal is to fine-tune the policy to generate low-cost behaviors while retaining high rewards. Unlike standard RLHF in language models, where preferences are defined over responses to the same prompt, our setting involves trajectory-level preferences in continuous control environments. We introduce PREFINE: Preference-based Implicit Reward and Cost Fine-Tuning for Safety Alignment which is a preference-based fine-tuning method that adapts Direct Preference Optimization (DPO), which is now widely used for LLM fine-tuning, to the sequential decision making setting. PREFINE constructs policy-sampled counterfactual trajectories to establish meaningful preference contrasts and jointly optimizes for reward retention and safety alignment. Empirically, PREFINE reduces constraint violations and catastrophic failures by over 60% while maintaining original reward behavior. PREFINE produces policies that achieve low-cost, high-reward performance with significantly improved data and computational efficiency compared to full offline RL or imitation learning, bridging preference alignment and safe policy adaptation in continuous domains.

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Mark Boss, Vikram Voleti, Simon Donn\'e, Shimon Vainer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21226v1 Announce Type: new Abstract: The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization. Project Page: https://octopus-quant.github.io/

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Nina Hosseini-Kivanani — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21227v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.

The Team Order Problem: Maximizing the Probability of Matching Being Large Enough

Haris Aziz, Jiarui Gan, Grzegorz Lisowski, Ali Pourmiri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21234v1 Announce Type: new Abstract: We consider a matching problem, which is meaningful in team competitions, as well as in information theory, recommender systems, and assignment problems. In the competitions which we study, each competitor in a team order plays a match with the corresponding opposing player. The team that wins more matches wins. We consider a problem where the input is the graph of probabilities that a team 1 player can win against the team 2 player, and the output is the optimal ordering of team 1 players given the fixed ordering of team 2. Our central result is a polynomial-time approximation scheme (PTAS) to compute a matching whose winning probability is at most epsilon less than the winning probability of the optimal matching. We also provide tractability results for several special cases of the problem, as well as an analytical bound on how far the winning probability of a maximum weight matching of the underlying graph is from the best achievable winning probability.

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21235v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.

RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

Xuan Yang, Xiaohan Yuan, Hao Li, Lingyu Chen, Yanan Liu, Lei Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21237v1 Announce Type: new Abstract: Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single frame Bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex wise motion descriptors and clustering yields a data driven functional partition, providing an explicit motion derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving localized specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region specific dynamics.

Beyond the Tip of the Iceberg: Understanding SATD in Dockerfiles through the Lens of Co-evolution

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21238v1 Announce Type: new Abstract: Dockerfiles enable the creation of portable container-based execution environments for the application code, and have become an important part of the modern software development process. As Dockerfiles are a form of Infrastructure-as-Code (IaC), they can include temporary workarounds and other suboptimal implementations, leading to the accrual of technical debt that affects their reliability, security, and maintainability in the future. Prior work characterized self-admitted technical debt (SATD) in Dockerfile comments and the surrounding file chunks. This single-file view is incomplete since source code evolution involves changes across different types of software artifacts such as production, test, build, and other configuration files. Thus, we address this gap by studying SATD events in Dockerfiles alongside the related source code. We find that approximately 27% of admission events and 40% of repayment events are coupled to non-Dockerfile artifacts, and coupling sources are subtype-specific. We also observed that coupled SATD in general are repaid significantly faster overall (p = 0.0201), while coupled SATD regarding missing functionalities persists longer than its isolated counterparts; Lastly, we conducted open and axial coding of coupled SATD events, and we observe that external dependency issues, more particularly regarding unreleased upstream packages and bug fixes, are the most common cause of admission triggers in the source code; we also observe that architectural refactoring is the most common prerequisite for the repayment of SATD in Dockerfiles. These findings indicate that both practitioners (e.g. developers and project managers) and SATD researchers should integrate the source code-side co-evolution, rather than the single-file view, as the primary unit of analysis.

Multimodal Emotion Recognition with Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21239v1 Announce Type: new Abstract: Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm shift has been unveiled in MER, from leveraging small-scale, task-specific models to Large Language Models (LLMs). We refer to the latter as the MER-with-LLMs paradigm, which offers unprecedented generality, spurring numerous empirical attempts, even alongside speculation about LLMs' potential to achieve general emotional intelligence. However, with these new opportunities come new challenges, including the scarcity of emotionally annotated data, the affective gap both within and across modalities, and the opacity of affective interpretation. To systematically review existing research and guide future exploration, this paper categorizes prior works according to their focus on addressing these challenges into three directions: Affective Data Augmentation, Multimodal Affective Representation, and Multimodal Affective Reasoning. By thoroughly tracing the development, emerging trends, and remaining issues within each direction, this paper aims to provide a clear academic map of the MER-with-LLMs paradigm and foster its structured advancement.

APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21240v1 Announce Type: new Abstract: LLM agents have shown strong performance across a wide range of complex tasks, including interactive environments that require long-horizon decision making. But these agents cannot learn on the fly at test time. Self-evolving agents address this by accumulating memory and reflection across episodes rather than requiring model-weight updates. However, these agents often suffer from exploration collapse: as memory grows, behavior concentrates around familiar high-reward routines, reducing the chance of discovering better alternatives. To address this problem, we propose Autonomous Policy EXploration (APEX), which builds and maintains an explicit strategy space through a strategy map-a directed acyclic graph of milestones with prerequisite dependency edges. In APEX, Fork Discovery expands the map with evidence-grounded unexplored directions, while Policy Selection balances exploration and exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, a realistic web interaction benchmark, APEX outperforms all baselines. Extensive ablations validate each component's contribution and demonstrate robustness across diverse settings, demonstrating APEX's effectiveness for sustained exploration in self-evolving agents.

Divide and Contrast: Learning Robust Temporal Features without Augmentation

Abdul-Kazeem Shamba, Kerstin Bach, Gavin Taylor — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21241v1 Announce Type: new Abstract: Self-supervised learning for time-series representation aims to reduce reliance on labeled data while maintaining strong downstream performance, yet many existing approaches incur high computational costs or rely on assumptions that do not hold across diverse temporal dynamics. In this work, we introduce Divide and Contrast (Di-COT), an unsupervised framework that avoids data augmentation and multiple encoder passes by contrasting informative substructures within a window rather than individual timesteps. Di-COT stochastically partitions each window into a small number of overlapping sub-blocks per iteration, enabling efficient and meaningful contrast while mitigating false positives during temporal transitions. To further improve scalability, we adopt a contrastive objective whose computation depends on the batch size and the number of sub-blocks, making loss computation independent of sequence length. Extensive experiments on six large-scale real-world datasets, as well as the UCR and UEA benchmarks, demonstrate that Di-COT learns semantically structured and transferable representations, achieving state-of-the-art performance on classification, clustering, $k$NN, and cross-dataset transfer, while substantially reducing training time. The source code is publicly available at https://github.com/sfi-norwai/Di-COT.

To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble

Haechan Mark Bong, Simon Roy, Euhid Aman, Giovanni Beltrame — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21242v1 Announce Type: new Abstract: As robot fleets become more heterogeneous, including humanoids, rovers, quadrupeds, and drones, selecting the right robot for a task becomes a core systems problem. We study robot skill prediction: mapping a natural-language task description to the physical capabilities required to execute it, such as fly, wheels, legs, surface water, under water and hands. Since labelled data that maps natural-language task descriptions to robot's physical capabilities does not exist, we construct a synthetic task-to-skill dataset using LLM-assisted generation and targeted label auditing. Trained on this data, a ~133M-parameter ensemble of two fine-tuned sentence encoders (mpnet + MiniLM) reaches 83.5% task-to-skill matching on a stratified 200 task dataset, outperforming Kimi K2 (1T MoE) at 72.0%, GPT-OSS-120B at 71.5%, and Llama-4-Scout-17B at 69.0% under the same zero-shot prompt. These results suggest that, for fixed robot skill taxonomies, small specialized models trained on synthetic data can outperform much larger general-purpose LLMs for fleet-level task routing.

SR-Ground: Image Quality Grounding for Super-Resolved Content

Artem Borisov, Evgeney Bogatyrev, Khaled Abud, Dmitriy Vatolin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21244v1 Announce Type: new Abstract: Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.

Profiling User Vulnerability to Phishing Through Psychological and Behavioral Factors

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21246v1 Announce Type: new Abstract: Phishing remains one of the most pervasive cybersecurity threats, shifting the focus from technological vulnerabilities to human cognitive and psychological factors. In coherence with the trend of studies on phishing to increasingly focus on human aspects and vulnerable users profiling, this study investigates the multidimensional nature of user susceptibility by analyzing data from the Spamley dataset, involving 1,086 participants evaluated through a realistic phishing detection task. Using Exploratory Factor Analysis (EFA), five latent constructs were identified, named: Seniority, Expertise, Creativity, Stability, and Vulnerability. Behavioral findings, validating self-reported impulsivity through its negative correlation with response times, demonstrate that faster decision-making significantly distinguishes vulnerable users from resilient ones. A K-Means clustering procedure, driven by the dimensions of Seniority (F1) and Creativity (F3), revealed two distinct user profiles: the Aware User and the High-Risk User. The results demonstrate that technical knowledge alone is insufficient to guarantee resilience; rather, the interaction between operational maturity, decision-making speed, and cognitive approach determines effectiveness. The findings suggest that the majority of users fall into the High-Risk category, characterized by hasty evaluation processes and lower critical analysis. These results underline the urgent need to move beyond "one-size-fits-all" training toward personalized, adaptive cybersecurity programs that actively address cognitive biases and behavioral tendencies.

Graph Navier Stokes Networks

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21247v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have emerged as a cornerstone of deep learning, with most existing methods rooted in graph signal processing and diffusion equations to model message passing. However, these approaches inherently suffer from the oversmoothing problem, where node features become indistinguishable as the network depth increases. Inspired by the Navier Stokes equations, we introduce Graph Navier Stokes Networks (GNSN), a novel architecture that transcends conventional diffusion-based message passing by incorporating convection into graph structures. GNSN defines a dynamic velocity field on the graph to govern convection, enabling more efficient and direct message propagation. By adaptively balancing convection and diffusion, GNSN is able to efficiently handle datasets with varying levels of homophily. Extensive evaluations across twelve real-world datasets demonstrate that GNSN consistently outperforms state-of-the-art baselines in classification accuracy. Moreover, experimental results further emphasize its effectiveness in alleviating the oversmoothing problem.

Distributed Stochastic Graph Algorithms

Keren Censor-Hillel, Aditi Dudeja, George Giakkoupis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21248v1 Announce Type: new Abstract: We study stochastic graph optimization problems in a novel distributed setting. As in the standard centralized setting, a random subgraph $G^*$ of a known base graph $G$ is realized by including each edge $e$ independently with a known probability $p_e$, and we must solve an optimization problem on $G^*$ despite uncertainty about its edges. In the standard setting, to cope with this uncertainty, the algorithm can query any edge of $G$ to learn if the edge exists in $G^*$, and its complexity is the number of queried edges. The distributed setting incorporates uncertainty in a natural manner, by having each vertex know only about its own edges in $G^*$ (and only communicate over them), and the complexity is measured by the number of synchronous communication rounds. We establish that distributed stochastic algorithms can be drastically faster than their non-stochastic counterparts and overcome known lower bounds, by showing fast distributed approximation algorithms for maximum matching, minimum vertex cover, and minimum dominating set.

Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

Rodrigo Morales-S\'anchez, Soto Montalvo, Raquel Mart\'inez — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21256v1 Announce Type: new Abstract: Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain.

Reinforcement Learning for Risk Adaptation via Differentiable CVaR Barrier Functions

Xinyi Wang, Taekyung Kim, Bardh Hoxha, Georgios Fainekos, Dimitra Panagou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21257v1 Announce Type: new Abstract: Planning through crowded environments under uncertain obstacle motions remains difficult, as stochastic interactions often induce overly conservative behavior or reduced efficiency. To address this challenge, we propose an end-to-end risk adaptation framework for crowd navigation under obstacle-motion uncertainty modeled by a Gaussian mixture model. The framework combines reinforcement learning~(RL) with a differentiable quadratic-program safety layer based on Conditional Value-at-Risk~(CVaR) barrier functions, jointly learning nominal control input, risk level, and safety margin and enforcing explicit probabilistic safety constraints. This design enables context-aware adaptation, promoting efficient behavior while invoking caution only when necessary. We conduct extensive evaluations in dynamic, uncertain, and crowded environments across varying obstacle densities and robot models, and further assess generalization under three out-of-distribution cases. Comparisons across optimization-based, RL-based, and integrated RL and optimization methods are provided, and the proposed method is shown to deliver the strongest overall performance in safety, efficiency, and generalization under uncertainty.

Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21258v1 Announce Type: new Abstract: Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation-structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information, effectively combining the expressiveness of implicit representations with the structural priors of explicit ones. In addition, informed by shared design choices in prior work, we develop a streamlined, efficient 3DGS-based rendering pipeline that is deliberately kept lightweight, improving efficiency while leaving greater representational capacity to the front-end latent module. Extensive evaluations on RLBench, ManiSkill2, and a real-robot platform demonstrate consistent gains in task success, sample efficiency, and robustness to viewpoint and scene variations over strong baselines. Ablation studies further confirm that each component of our framework is critical to overall performance.

On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective

Yue Zhang, Zhiyi Dong, Tommaso Cesari, Yongyi Mao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21260v1 Announce Type: new Abstract: We develop a learning-theoretic framework for understanding Chain of Thought (CoT). We model CoT as the interaction between an answer map and a chain rule that generates intermediate questions autoregressively, and define the reasoning risk of a hypothesis under this interaction. Our first result is a tight canonical decomposition of this risk into two terms with opposing roles: an oracle-trajectory risk (OTR), which captures the benefit of CoT and reduces to a target-domain risk in a domain adaptation problem, and a trajectory-mismatch risk (TMR), which captures the cost of CoT through error accumulation along mismatched reasoning trajectories. We then show that this cost is unavoidable without structure: if any one of the loss, the hypothesis answer map, or the chain rule lacks stability, the TMR can be arbitrarily large even when the OTR is zero and the hypothesis is uniformly close to the ground truth. Conversely, under stability, we prove a tight upper bound on the TMR governed by an exact amplification factor that identifies bounded, linear, and exponential error-growth regimes. Together, these results give a precise theory of when CoT helps, when it hurts, and what controls the transition between the two.

STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval

Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21261v1 Announce Type: new Abstract: Training-free zero-shot composed image retrieval models are recently gaining increasing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploring the compositional ability behind the LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and text modification, where the image contains much more details than the text; 2) the point-to-point alignment during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it through a transition vector in the embedding space and make it closer to the target image. Combining LLMs with user instruction, the refined caption concentrates more on the core modification intent and thus filters out unnecessary noise. Moreover, to explore diverse alignment during the retrieval stage, we model the caption and image as discrete distributions and reformulate the retrieval task as a set-to-set alignment task. Finally, a bidirectional transportation distance is developed to consider fine-grained alignments across modalities and calculate the retrieval score. Extensive experiments demonstrate that our method can be general, effective, and beneficial for many CIR tasks.

Systematic Design of Separation Logics

Roberto Bruni, Lorenzo Gazzella, Roberta Gori — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21262v1 Announce Type: new Abstract: Thanks to the locality principle, separation logics support modular, scalable analysis of large codebases by relying on local axioms and frame rules to focus only on the heap fragments required for verification. However, depending on the direction (forward vs. backward) and sense of approximation (over vs. under) of the analysis, designing the corresponding proof systems can require some ingenuity. In his work on the calculational design of program logics, Patrick Cousot outlines a methodology for deriving proof systems directly from program semantics using abstract interpretation, covering both correctness and incorrectness analyses. Unfortunately, when applied to heap-manipulating programs, Cousot's calculational approach cannot handle the locality principle, because it does not provide a calculational way to derive frame rules and produces axioms that refer to the global heap. In this paper, we propose a general methodology for systematically deriving local axioms in which the locality principle is embedded by construction. For heap-manipulating primitives, we can derive the minimal required heap and the corresponding pre- and postconditions, complemented by universal frame rules without additional syntactic side conditions. Our method is parametric w.r.t. a set of semantic closure properties that are exploited to design local axioms; it can deal with different memory models; it favors the reuse of many inference rules across over- and under-approximation; and it produces logical systems capable of deriving a broader range of triples w.r.t. existing, cleverly designed, program logics for (in)correctness, ranging from Separation Logic and Incorrectness Separation Logic to Separation Sufficient Incorrectness Logic. Furthermore, we demonstrate the flexibility of our methodology by applying it to design a novel proof system for inferring necessary preconditions with separation logic.

Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity

Xiangyu Yang, Feng Xu, Jian-Qiang Hu, Jiaqiao Hu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21263v1 Announce Type: new Abstract: Firms increasingly rely on dynamic pricing to respond to evolving customer demand, yet in many applications they observe only the revenue generated by a single posted price in each period. At the same time, market conditions may shift gradually or abruptly due to changes in customer preferences, competition, or external shocks. These features create two intertwined challenges: learning the revenue--demand relationship from limited feedback and adapting pricing decisions to a changing environment. We study how a seller can learn and earn effectively under these constraints, without assuming a specific parametric form for demand. We develop a learning framework that updates prices using revenue-based gradient approximations constructed from one observation per period. To address environmental changes, we incorporate a restarting mechanism that periodically refreshes the learning process so that outdated information is discounted. When the degree of nonstationarity is unknown, we further introduce a meta-learning layer to adaptively hedge across multiple restarting schedules. We provide performance guarantees for our approach, showing how cumulative revenue loss relative to a fully informed benchmark depends on both the time horizon and the magnitude of market variation. Simulation experiments using synthetic and real-world data illustrate the effectiveness of the proposed procedures.

FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs

Penglin Dai, Fulian Li, Xincao Xu, Junhua Wang, Lixin Duan, Xiao Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21264v1 Announce Type: new Abstract: Federated Learning (FL) has emerged as a promising paradigm for privacy-preserving distributed learning. However, existing FL methods face a fundamental challenge. Traditional averaging-based approaches suffer from parameter divergence under non-IID conditions, while personalized FL methods overfit to local data and fail to generalize to new clients (cold-start problem). Mixture-of-Experts naturally addresses this by routing heterogeneous data to specialized experts rather than forcing uniform aggregation. In this paper, we propose FedCoE, a Federated Coordinated dual-level mixture-of-Experts framework that effectively balances global generalization with local personalization. FedCoE maintains multiple independent global expert models on the server and employs a shared gating network to dynamically model client-expert correlations during aggregation, effectively mitigating expert drift and gating inconsistency. To address the cold-start challenge, we introduce an adaptive mechanism that enables new clients to immediately leverage the global expert pool without extensive local training. Extensive experiments demonstrate that FedCoE achieves 78.00% global accuracy and 89.32% personalized accuracy on average, outperforming the baseline by 8.82% and 29.19%, respectively. In cold-start scenarios, FedCoE delivers 77.27% accuracy without any local fine-tuning, outperforming baselines by over 12.54%.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

Richa Verma, Balaraman Ravindran — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21266v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for reasoning in language models, with GRPO as its primary example. However, GRPO requires continuous online rollout generation, making it computationally expensive and difficult to scale. While Direct Preference Optimization (DPO) offers a stable and efficient offline alternative, it is typically expected to underperform w.r.t. online RL methods such as GRPO when trained on rollouts from a cold supervised fine-tuned (SFT) policy. We introduce G2D (GRPO to DPO)}, a three-stage pipeline that performs a short GRPO warm-up, constructs a static preference dataset, and fine-tunes a model offline with DPO. Across a set of values of the number of online steps (K) in GRPO on Qwen2.5-7B and Llama-3.1-8B, we find that offline DPO with moderate warm-up matches or outperforms GRPO at substantially lower compute cost in our setting. On Qwen2.5-7B, G2D at K=150 achieves 62.4% on MATH-500, outperforming GRPO (51.6%) by 10.8% at ~4x lower compute. On Llama-3.1-8B, G2D at K=500 achieves 49.4%, surpassing GRPO in our experimental setting. We show that performance is not governed by the number of preference pairs, which does not vary much w.r.t. K, but by their informativeness. Moderate warm-up produces rollouts with calibrated uncertainty, yielding stronger contrastive signal, while excessive warm-up leads to overconfident policies and less informative data. Our results recast the offline-online gap in RLVR as primarily a data informativeness problem, and identify short online RL warm-up with appropriate difficulty calibration of the fine-tuning dataset as a compute-efficient alternative to online RL.

Towards Single Exponential Time for Temporal and Spatial Reasoning: A Study via Redundancy and Dynamic Programming

Victor Lagerkvist, Johanna Groven, Leif Eriksson — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21267v1 Announce Type: new Abstract: The region connection calculus ($RCC$) and Allen's interval algebra ($IA$) are two well-known NP-hard spatial-temporal qualitative reasoning problems. They are solvable in $2^{O(n \log n)}$ time, where $n$ is the number of variables, and $IA$ is additionally known to be solvable in $o(n)^n$ time. However, no improvement over exhaustive search is known for $RCC$, and if they are also solvable in single exponential time $2^{O(n)}$ is unknown. We investigate multiple avenues towards reaching such bounds. First, we show that branching is insufficient since there are too many non-redundant constraints. Concretely, we classify the maximum number of non-redundant constraints in $RCC$ and $IA$. Algorithmically, we make two significant contributions based on dynamic programming (DP). The first algorithm runs in $4^n$ time and is applicable to a non-trivial, NP-hard fragment of $IA$, which includes the well-known interval graph sandwich problem of Golumbic and Shamir (1993). For the richer $RCC$ problem with 8 basic relations we use a more sophisticated approach which asymptotically matches the $o(n)^n$ bound for $IA$.

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

Arun D. Kulkarni — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21268v1 Announce Type: new Abstract: Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architecture for remote sensing land use scene classification. Representative CNN models, such as AlexNet, is evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.

Transforming Privacy Artifacts into Accessible Reports for Non-Technical Stakeholders

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21269v1 Announce Type: new Abstract: The transition toward Industry 5.0 is reshaping industrial work environments with an emphasis on human-centricity, enabling close collaboration between humans and machines to enhance productivity and flexibility. However, such systems typically require monitoring of human workers and operators, often involving sensitive data, raising significant privacy concerns. As a result, affected workers and unions frequently reject human-machine collaboration features due to a lack of transparency regarding privacy threats and implemented mitigation strategies. To enable early stakeholder involvement, establish trust, and support informed decision-making, privacy implications must be communicated in a way understandable to non-technical stakeholders. Yet, current Requirements Engineering (RE) practices provide limited methodological support for making privacy threats and mitigations accessible to non-technical stakeholders (e.g., individual workers or their representative unions). In this RE@Next paper, we propose a conceptual framework that guides software design from human monitoring-related use cases and requirements to informed decision-making guidance focusing on non-technical stakeholders. Building on principles such as Privacy by Design, the framework leverages Large Language Models (LLMs) to transform technical artifacts into accessible privacy reports. We share initial insights from two industry use cases, evaluate the quality of the generated reports, and outline future research directions toward integrating privacy transparency into RE processes for human-centric industrial systems.

Enhanced-BLE: A Hybrid BLE-ESB Framework for Dynamically Reconfigurable and Energy-Efficient 2.4 GHz IoT Communication

Ziyao Zhou, Chen Shen, Tiancheng Cao, Hen-Wei Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21270v1 Announce Type: new Abstract: Bluetooth Low Energy (BLE) is widely used in IoT systems because of its low power consumption, interoperability, and reliable bidirectional communication. However, its connection-oriented architecture introduces trade-offs among wake-up latency, throughput, and energy efficiency, limiting its suitability for burst-mode and on-demand sensing applications. Enhanced ShockBurst (ESB), a lightweight connectionless protocol supported by the same 2.4 GHz Nordic Semiconductor hardware, enables fast wake-up and efficient data transmission, but does not provide BLE-level robustness for sustained bidirectional communication. This work systematically benchmarks BLE and ESB on a unified Nordic nRF54L15 platform and proposes Enhanced-BLE, a hybrid framework that integrates the two protocols to extend conventional BLE operation. Experimental results show that ESB nearly halves packet transmission time and energy compared with BLE, doubles the achievable forward throughput, and reduces wake-up latency and energy by nearly twentyfold during intermittent operation. However, ESB reverse transmission may suffer packet loss, whereas BLE maintains reliable bidirectional communication. Enhanced-BLE addresses this trade-off through adaptive radio scheduling and coexistence-aware connection management, combining ESB-based high-throughput forward transmission with BLE-based reliable reverse communication. The framework enables BLE-to-ESB handover within approximately 18 ms and restores BLE operation within 49 ms from standby mode. Enhanced-BLE also achieves approximately twofold higher forward throughput than BLE while reducing wake-up latency. These results demonstrate a practical and hardware-compatible strategy for low-latency, high-throughput, energy-efficient, and reliable 2.4 GHz IoT communication.

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21272v1 Announce Type: new Abstract: Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.

DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang zhao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21273v1 Announce Type: new Abstract: Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and being automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory--meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.

Let EEG Models Learn EEG

Yifan Wang, Yijia Ma, Wen Li, Chenyu You — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21280v1 Announce Type: new Abstract: High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/ .

\textit{Stochastic} MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21282v1 Announce Type: new Abstract: Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

A Mechanistic Study of Tabular Foundation Models

Marin Bilo\v{s}, James T. Wilson, Anderson Schneider, Yuriy Nevmyvaka — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21288v1 Announce Type: new Abstract: Tabular foundation models with different architectures converge in accuracy across a range of classification and regression tasks. This raises questions a leaderboard cannot answer: (i) whether the models execute the same in-context algorithm, (ii) where row, column, and class-permutation invariances originate, and (iii) how robust they are under perturbations engineered against the inferred mechanism. We characterize all three. The model families realize qualitatively distinct similarity-based readouts: from an attention-weighted vote over context labels to a class-conditional mean readout, each confirmed by causal intervention. We find that the representation collapse highlighted in prior work is not a practical concern for them. Each model's permutation invariances trace to specific positional parameters whose removal preserves accuracy and makes approximate invariance exact. Perturbations engineered against each readout reproduce predicted failure modes; hub and rank attacks isolate them from refit baselines. Together these results give a mechanistic account of contemporary tabular foundation models and identify which inductive biases govern both their accuracy and characteristic failures.

TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21295v1 Announce Type: new Abstract: Longitudinal passive sensing enables continuous health prediction, yet models often fail under cross-dataset distribution shifts. Traditional ML overfits cohort-specific artifacts, while Large Language Models (LLMs) struggle to reason reliably over long, heterogeneous time-series. We introduce TimeSRL, a two-stage LLM framework that routes predictions through an explicit semantic bottleneck. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these abstractions alone. This forces the model to reason over semantic concepts that we argue generalize better than raw numbers. We optimize this process end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR), learning outcome-aligned abstractions without gold intermediate annotations. Instantiated on mental-health prediction, TimeSRL achieves state-of-the-art performance on a benchmark designed to stress-test cross-cohort generalization under a rigorous leave-one-dataset-out (LOSO) protocol, reducing mean absolute error (MAE) over strong non-LLM ML and LLM baselines by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression (all $p$s<0.05). TimeSRL significantly outperforms prior methods in cross-benchmark transfer across different sensing pipelines, rivaling its own within-domain performance without target-domain fine-tuning. These results demonstrate that semantic abstractions are reusable and point to a new direction for generalizable behavior modeling via RL-tuned LLMs.

Tracing the ongoing emergence of human-like reasoning in Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21299v1 Announce Type: new Abstract: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas If you are hungry, there is pizza in the oven implies that pizza is available regardless of the hearers hunger. Large Language Models - LLMs - show human-like performance on many tasks, yet it remains unclear whether they reason like humans. To address this, we conducted a population-matching experiment assessing how twentyfive LLMs compute conditional inferences across four languages, compared to an equal number of humans per language. We find that humans enrich logical reasoning through pragmatic inferences across languages. Model behavior is more variable. Some LLMs perfectly follow the truth-table of conditionals but they ignore pragmatic inferences, while others deviate from the truth-table, adhering to a single interpretation across the board, thus reflecting accurate rule-based processing but not human-like reasoning. Overall, LLMs are accurate semantic operators, but fail to capture the pragmatic enrichments characteristic of human reasoning. Crucially, LLM accuracy is neither predicted nor boosted by open vs. closed status, training orientation, or architecture type, suggesting that pragmatic reasoning is still an emerging ability in the cognitive toolkit of artificial systems.

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Meng Shen, Minghao Wu, Deepu Rajan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21300v1 Announce Type: new Abstract: Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21301v1 Announce Type: new Abstract: In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on a MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

Nura Aljaafari, Danilo S. Carvalho, Andre Freitas — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21303v1 Announce Type: new Abstract: Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

Deformba: Vision State Space Model with Adaptive State Fusion

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21308v1 Announce Type: new Abstract: State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21309v1 Announce Type: new Abstract: Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computation overhead. Our approach is architecture-agnostic, and can be seamlessly integrating with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X

DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning

Bibek Poudel, Lei Zhu, Kevin Heaslip, Sai Swaminathan, Weizi Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21311v1 Announce Type: new Abstract: Modern vision systems can detect, track, and forecast urban actors at scale, yet translating perception outputs to urban design remains limited. We introduce DeCoR, a two-stage reinforcement learning framework that leverages flow observations to co-optimize crosswalk layout and network-level signal control. The design stage encodes the pedestrian network as a graph and learns a generative policy that parameterizes a Gaussian mixture model over crosswalk location and width, from which new crosswalks are sampled. For each layout, a shared control policy learns adaptive signal timings to minimize joint pedestrian and vehicle delay. On a 750 m real-world urban corridor with demand sensed from video and Wi-Fi logs, DeCoR learns a layout that reduces pedestrian arrival time to their nearest crosswalk by 23% while using fewer crosswalks than existing configurations. On the control side, DeCoR reduces pedestrian and vehicle wait time by 79% and 65%, respectively, relative to fixed-time signalization. Further, the control policy generalizes to demands outside of training and is robust to layout changes without retraining.

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21312v1 Announce Type: new Abstract: Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration.

A New Framework to Analyse the Distributional Robustness of Deep Neural Networks

Divij Khaitan, Subhashis Banerjee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21313v1 Announce Type: new Abstract: Deep neural networks have achieved impressive performance on a variety of tasks, but their brittleness to distributional shifts remains a significant barrier to real-world deployment. In this paper, we propose a framework to analyse and quantify the distributional robustness of neural networks by studying the interactions between layer weights and activations. We model these interactions using Bernoulli distributions, using the separation between classes as a diagnostic proxy for robustness. We demonstrate the usefulness of this framework through models trained on CIFAR-10 and ImageNet. We show that our proposed metrics can distinguish between networks that have memorised their training data and those that have not. We also perform analogous experiments in the activation space and find that the same properties do not hold up. Additionally, we investigate the behaviour of our metrics under various distribution shifts and show that these shifts reduce separation under our path-based diagnostics. Our results suggest that this framework provides useful model-level diagnostics of representation structure and robustness.

CRAFT: Conflict-Resolved Aggregation for Federated Training

Ziqi Wang, Qiang Liu, Nils Thuerey — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21317v1 Announce Type: new Abstract: The aggregation of conflicting client updates remains a fundamental bottleneck in federated learning (FL) under heterogeneous data distributions. Naive averaging can produce a global update that improves the global objective while conflicting with specific clients, causing degradation for those clients. In this work, we propose CRAFT (Conflict-Resolved Aggregation for Federated Training), a new aggregation framework that treats the global update as a geometric correction problem. We formulate aggregation as finding the update closest to a reference direction while satisfying conflict-free alignment constraints. We derive a closed-form expression for the constrained optimization problem, avoiding the computational overhead of iterative solvers. Furthermore, we use a layer-wise adaptation to address conflicts at varying feature granularities. We provide a theoretical analysis showing that CRAFT promotes a common-descent structure and mitigates conflicts through its projection geometry. Extensive experiments on heterogeneous benchmarks demonstrate that CRAFT improves the accuracy of the global model while reducing performance disparity across clients compared with state-of-the-art baselines. The source code for CRAFT is available at https://github.com/tum-pbs/CRAFT.

TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21318v1 Announce Type: new Abstract: Large language models (LLMs) are highly sensitive to the prompts used to specify task objectives and behavioral constraints. Many recent prompt optimization methods iteratively rewrite prompts using LLM-generated feedback, but the resulting prompts often become longer, accumulate narrow sample-specific rules, and generalize poorly beyond the training distribution. We study this failure mode as prompt distributional overfitting and argue that it reflects a lack of representation control in discrete text-space optimization. We formalize this view through representational inefficiency, a dual-factor measure that decomposes prompt inefficiency into capacity cost and scope narrowness, attributing distributional prompt overfitting to their coupled growth during optimization. We propose TextReg, a regularization framework that realizes a soft-penalty objective through regularized textual gradients, combining Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Across multiple reasoning benchmarks, TextReg substantially improves out-of-distribution (OOD) generalization, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.

Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21322v1 Announce Type: new Abstract: Federated Learning (FL) enables collaborative model training without centralizing data. However, real-world deployments must simultaneously address statistical heterogeneity across client data (non-IID), system heterogeneity in device capabilities, and communication efficiency. Existing FL approaches mitigate these challenges through improved aggregation, personalization, or knowledge distillation, but they almost universally assume a fixed client architecture, limiting adaptability to heterogeneous data complexity and hardware constraints. This architectural constraint often leads to suboptimal trade-offs between accuracy and efficiency in real-world FL systems. This work introduces FedKDNAS, a distillation-driven FL framework that combines client-side neural architecture selection with distillation of server-coordinated knowledge. Each client autonomously selects a lightweight model under accuracy-resource constraints. It then trains it locally using a hybrid objective combining supervised learning and knowledge distillation and shares only predictions on a public reference set. The server then aggregates and smooths these predictions, optionally combining them with a teacher model, to produce stable distillation targets for the next round. Extensive evaluation on six datasets against six representative FL baselines (FedAvg, Ditto, FedMD, FedDF, FedDistill, Local-KD) demonstrates that FedKDNAS consistently achieves superior Pareto efficiency, improving accuracy by up to 15\% under non-IID conditions, reducing client CPU usage by approximately 28\%, and decreasing communication overhead by up to 44 times while maintaining lightweight logit-based communication.

Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21325v1 Announce Type: new Abstract: Linear attention has emerged as a cornerstone for efficient long-context architectures, as evidenced by its integration into state-of-the-art open-source models including Qwen3.5/3.6, Kimi Linear, and RWKV-7. Models that incorporate linear attention layers with the so-called Delta-Rule involve the inversion of triangular matrices as a core sub-routine. This operation often forms a performance bottleneck, and, due to its high-sensitivity to numerical errors, it can significantly deteriorate end-to-end model accuracy if it is not carefully implemented. This work provides a systematic analysis of both direct and iterative triangular inversion algorithms, targeting methods that are rich in matrix products, and, therefore, have the potential to efficiently utilize modern hardware. To that end, our analysis covers a broad spectrum of mathematical and practical aspects, with a heavy focus on numerical stability, computational complexity, and, ultimately, hardware efficiency and practical considerations. We provide a rigorous experimental evaluation to verify these properties in practical scenarios, and in low-precision floating-point representations, highlighting the strengths and limitations of each method. Performance benchmarks on NPUs reveal up to $4.3\times$ speed-up against the state-of-the-art implementations of SGLang for triangular matrix inversion, leading to significant performance improvements on the entire layer level, while maintaining full end-to-end model accuracy.

SAOITHE: Sustainable Age-of-Information-Based Timely Status Updating for Hardware-constrained Edge networks

Shih-Kai Chou, Maice Costa, Mihael Mohor\v{c}i\v{c}, Jernej Hribar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21328v1 Announce Type: new Abstract: In future large-scale deployments of 6G and beyond networks, collecting timely information, as measured by the Age of Information (AoI) metric, is becoming increasingly important. At the same time, the environmental impact, often characterized by the resulting Carbon Footprint (CF), depends on both the amount of consumed energy and the Carbon Intensity (CI), i.e., the amount of CO$_2$-equivalent emissions produced per unit of consumed energy. Since CI varies over time, minimizing energy is not equivalent to minimizing CF, as a status update with the same energy demand may result in a different carbon cost depending on when it is transmitted. This makes timely status updating a nontrivial scheduling problem. To address this challenge, we formulate carbon-aware status updating as a constrained Markov Decision Process (MDP) that minimizes AoI subject to CF budget, transmission duty-cycle, and channel-capacity constraints. We then propose Sustainable Age-of-Information-Based Timely Status Updating for Hardware-constrained Edge networks (SAOITHE), a Whittle-index-based scheduling solution that enables scalable real-time scheduling. Using real-world CI traces across low-, medium-, and high-CI regions, the results show that SAOITHE remains within the allocated CF budget while achieving lower AoI than baseline policies. Moreover, the gains are around 25% and 20% in low- and medium-CI regions, respectively, and up to 75% in high-CI settings, while preserving scalability.

Learning Robust Dexterous In-Hand Manipulation from Joint Sensors with Proprioceptive Transformer

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21330v1 Announce Type: new Abstract: In-hand object manipulation is a fundamental yet challenging capability for dexterous robots. Despite significant progress in dexterous manipulation, existing approaches rely heavily on vision or tactile sensing to track object states, while joint sensing -- the most readily available modality on any robotic hand -- remains largely overlooked, particularly for tendon-driven hands. In this paper, we study how far joint sensing alone can go by asking: (i) whether motor encoders or direct joint sensing provides better proprioceptive feedback, (ii) how to extract environment information from joint measurements, and (iii) whether joint-only control can achieve competitive real-world performance without external perception. We present the Proprioceptive Transformer (PT), an exteroceptive-free approach for continuous cube rotation on a tendon-driven dexterous hand that uses only joint sensing feedback. A teacher policy is first trained via reinforcement learning with privileged object information, then distilled into PT, which operates solely on joint position and velocity histories. The Transformer architecture effectively extracts implicit object state information from temporal patterns in joint sensor readings. Experiments on the real ORCA hand show that our approach achieves 3.1x higher rotation speed than baselines. We also demonstrate that our PT achieves a 23.4% lower RMSE for cube position estimation than the MLP baseline, indicating superior extraction of exteroceptive information from proprioceptive sources.

SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence

Ting Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21333v1 Announce Type: new Abstract: Natively trained spiking language models struggle to combine Transformer-like language quality, stable multi-domain pre-training, and high activation sparsity. We present SymbolicLight V1, a spike-gated dual-path language model that combines binary Leaky Integrate-and-Fire spike dynamics with a continuous residual stream. Its Dual-Path SparseTCAM module replaces dense self-attention with an exponential-decay aggregation path for long-range memory and a spike-gated local attention path for short-range precision, complemented by a dynamic context-conditioned decoding head and a bilingual tokenizer. A 194M-parameter SymbolicLight V1 model trained from scratch on a 3B-token Chinese-English corpus reaches held-out validation PPL 8.88-8.93 across four independent runs at >89% per-element activation sparsity. It trails GPT-2 201M by 7.7% in PPL while surpassing GPT-2 124M under the reported comparison. Component ablations at matched 0.5B-token training budgets show that the spike-gated local attention path is the largest contributor, and that replacing LIF dynamics with a deterministic top-k mask at matched sparsity causes a larger degradation, indicating that temporal integration rather than sparsity alone drives performance. We also report a 0.8B-parameter scale-up run trained on 48.8B tokens as evidence of optimization and sparsity preservation, not as a primary quality comparison. Current dense-hardware inference is slower than GPT-2, so neuromorphic deployment is presented as a future sparsity-driven opportunity rather than an achieved hardware speedup.

RSE of a Quantum Transport Code and its Effects

Christoph Conrads, Edoardo Di Napoli — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21334v1 Announce Type: new Abstract: This paper presents our research software engineering (RSE) experiences over two years with libNEGF, a quantum transport code. We describe practical approaches to code quality assurance--including continuous integration, automated testing, and compiler warning correction--and performance engineering through continuous benchmarking. Our systematic application of these practices revealed critical defects: uninitialized memory reads, out-of-bounds writes, and notably, a misunderstood mathematical model in our boundary condition handling. We also document how continuous benchmarking exposed performance regressions caused by HPC system configuration changes. Our findings provide data points suggesting that a dangerous class of defects--equivalent to undefined behavior in C/C++ and processor-dependent behavior in Fortran--is as prevalent in Fortran scientific codes as elsewhere. While libNEGF is implemented in Fortran, most recommendations are applicable to scientific software regardless of implementation language, and they can be implemented selectively or in their entirety for both new and existing projects.

A Two-Watched Literal Scheme for First-Order Logic

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21335v1 Announce Type: new Abstract: The two-watched literal scheme, a core component of efficient CDCL (Conflict-Driven Clause Learning) implementations for propositional logic, is extended to first-order logic. Given a set of first-order clauses and a set of ground literals, our lifted two-watched literal scheme efficiently detects all propagating and false clauses with respect to the ground literals. We present the algorithm as a system of rules and prove its soundness and completeness. Additionally, we provide an implementation of the two-watched literal scheme, which outperforms a standard dynamic programming approach for detecting propagatable literals and conflicts, especially when dealing with long clauses.

Multicategorical Semantics for Untyped Effects

Ariel Grunfeld, Liron Cohen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21337v1 Announce Type: new Abstract: Completeness proofs in categorical semantics usually proceed by building a syntactic category whose composition is given by substitution. For untyped effectful Call-by-Value languages, this runs into a basic obstacle: there is no canonical notion of simultaneous substitution of computations, since evaluation order is semantically meaningful. We address this by taking single computation substitutions, that is, binding steps, as primitive, and representing computation substitution by finite sequential lists composed by concatenation. We formalize this idea in a one-object Freyd-multicategorical setting. We introduce Freyd operads, separating a cartesian operad of values from a symmetric Ren-cartesian preoperad of computations, connected by a Freyd functor, and from any Freyd operad we construct a corresponding Freyd PROP of substitutions. We prove that this construction is representable and, in the strict one-object setting, left adjoint to restriction to codomain 1. Using the induced term model, we interpret untyped computational lambda-calculus with procedures and higher-order functions in weakly closed Freyd operads, and prove soundness, initiality, and completeness. This yields a categorical semantics tailored to untyped effectful computation and broad enough to encompass realizability-oriented models such as monadic combinatory algebras.

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

Yuefeng Shi, Nedjma Ousidhoum, Jose Camacho-Collados — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21338v1 Announce Type: new Abstract: LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs' semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Ziye Li, Henghui Ding — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21343v1 Announce Type: new Abstract: Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.

Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21347v1 Announce Type: new Abstract: Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses, and iterate. This process misses patterns that only emerge across trace populations and does not scale to production corpora where individual traces span tens of thousands of tokens. We formalize the problem of corpus-level trace diagnostics. Given a corpus of execution traces, the goal is to produce grounded natural-language insights that characterize systematic behavioral patterns across trace groups, each linked to supporting evidence. We present the Insights Generator (IG), a multi-agent system that answers diagnostic questions by proposing and testing hypotheses across the trace corpus to produce an evidence-backed insights report. We evaluate IG across qualitative and objective dimensions, spanning rubric-based report assessment and downstream performance improvements achieved by implementing IG insights. Human experts using IG reports improve scaffold performance by 30.4pp over the unmodified baseline scaffold, and coding agents leveraging IG-derived insights show consistent and stable gains. Across benchmarks, IG's scout-investigator architecture produces findings comparable in detection coverage to competing approaches, while domain experts rated IG reports as leading depth and evidence quality.

Data-Efficient Neural Operator Training via Physics-Based Active Learning

Alicja Polanska, Lorenzo Zanisi, Vignesh Gopakumar, Stanislas Pamela — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21348v1 Announce Type: new Abstract: Solving partial differential equations with neural operators significantly reduces computational costs but remains bottlenecked by high training data requirements. Active learning offers a natural framework to mitigate this by selectively acquiring the most informative samples in an iterative manner. We introduce physics-based acquisition - a novel physics-informed active learning algorithm that leverages the partial differential equation residual to guide data selection. We validate the method by presenting numerical experiments for the 1D Burgers equation and the 2D compressible Navier-Stokes equations. We show that, in our experiments, physics-based acquisition consistently outperforms random acquisition and matches the state of the art in data efficiency. At the same time, it has the unique advantage of injecting a physics inductive bias into the training process, ensuring that simulation cost is spent where the model's physical understanding is weakest.

Onion-Routed Multi-Circuit Key Establishment for Quantum-Resilient Sessions

Tushin Mallick, Ashish Kundu, Ramana Kompella — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21349v1 Announce Type: new Abstract: Public-key primitives that today anchor session-key establishment - RSA, Diffie-Hellman, and elliptic-curve cryptography - reduce to integer factorization or discrete logarithm and are therefore vulnerable to Shor's algorithm on a sufficiently capable quantum computer. The harvest-now, decrypt-later (HNDL) threat model turns this future capability into a present liability: ciphertext archived today can be decrypted retrospectively once a cryptographically relevant quantum computer becomes available. We propose a session-key establishment scheme that distributes a freshly generated key as multiple, independently encrypted fragments across distinct, ephemeral Tor circuits between an onion-service proxy and an onion-service client. Reconstruction requires every fragment; each fragment travels its own per-bundle circuit established via a NEWNYM signal. The security argument rests on the standard end-to-end correlation bound for onion routing: an adversary controlling a fraction of Tor relays must independently deanonymize every fresh circuit to correlate the fragments belonging to one session, and the per-fragment probability of success decays multiplicatively in the number of fragments. We implement the design as a Flask-based prototype on AWS EC2, with both the proxy and the client deployed as Tor onion services, and measure end-to-end key-establishment latency. The implemented prototype completes a key establishment in 13-20 s on average (7-50 s including tails), of which approximately 88% is attributable to Tor-related delay - a cost we discuss in the context of the privacy-versus-responsiveness trade-off.

The Human-AI Delegation Dilemma: Individual Strategies, Collective Equilibria and Sociotechnical Lock-in

Angjelin Hila — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21351v1 Announce Type: new Abstract: This paper takes an ecological approach toward large-scale models of hybrid human-AI intelligence. Emerging models of human-AI interaction predominantly advance the complementarity thesis variously dubbed human-AI collaboration and human-AI hybrid intelligence. However, this constitutes an over-simplification of the modalities of human-AI interaction and possibility-space for both individual and collective action that human-AI interaction potentiates. To fill these gaps, this paper develops a decision and game-theoretic approach to the human-AI delegation-verification dilemma. First, we map out canonical decision-theoretic strategies that account for adaptive user trajectories, modeling how agents transition between strategies based on interaction feedback to reach stable equilibria. Second, we scale individually stable strategies to collective equilibria using three extrapolation principles: (a) non-communicative aggregation (b) local social signaling and (c) institutional norms setting. The analysis identifies the emergence of sociotechnical lock-in, a macro-behavioral state where individually adaptive delegation, in the absence of communicative and institutional safeguards, aggregates into a systemic collective action problem modeled as a prisoner's dilemma that degrades shared epistemic standards. We argue that adoption under higher communicative standards and institutional norms can mitigate suboptimal collective equilibria by imposing social commitments on individual users.

Classification of Single and Mixed Partial Discharges under Switching Voltage Using an AWA-CNN Framework

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21352v1 Announce Type: new Abstract: The growing use of fast-switching power electronics has made partial discharge (PD) analysis under switching-voltage excitation increasingly important, yet more challenging than under sinusoidal conditions due to activity concentrated at voltage transitions. This work presents an Amplitude-Width-Area (AWA) pattern representation for source-oriented PD analysis under switching-voltage excitation. In the proposed method, time domain PD pulses are characterized using pulse amplitude, width, and area, and mapped into a visual pattern where amplitude and area define the coordinate axes and width is encoded by color. The generated AWA patterns are used to distinguish six single and mixed PD source conditions: corona, internal, surface, corona+internal, corona+surface, and internal+surface. To evaluate the classification capability of the proposed representation, a Random Forest baseline and two Convolutional Neural Network (CNN) models, InceptionV3 and ResNet-18, are compared. The AWA patterns show distinguishable source-dependent distributions, and CNN-based classification achieves testing accuracy above 96%, compared with 73.33% for Random Forest. The results indicate that AWA patterns provide a visual representation of PD pulses suitable for multi-class PD source classification under switching-voltage excitation.

Software Product Line Engineering: Adoption, Tooling and AI Era Challenges

Najam Nazar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21353v1 Announce Type: new Abstract: Software Product Line Engineering enables systematic reuse across families of related software intensive systems. This survey synthesises key SPLE foundations, lifecycle concepts, adoption models, tooling and AI era challenges. Based on a structured review of the SPLE literature, we compare major adoption and evaluation models, including BAPO, FEF, PuLSE, SIMPLE, COPLIMO, PROMOTE-PL, and APPLIES. We further summarise the historical evolution of SPLE research from domain engineering foundations to AI assisted variability management. The survey also examines tool interoperability, UVL-based standardisation, SME adoption, migration from clone-and-own development, variability aware DevOps, empirical evidence gaps and assurance challenges for AI assisted SPLE. The paper provides a compact research agenda for software engineering and ICT researchers by consolidating open challenges and future research directions in contemporary SPLE.

Gen-AI-tecture: using generative AI to support architectural students in design tasks

Timo Kapsalis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21361v1 Announce Type: new Abstract: The "Gen-AI-tecture" project embeds a locally executed, discipline-specific tool into a mixed-methods focus-group design, structured around three research objectives: (a) to evaluate how generative AI tools impact students' creativity in design-thinking processes and outcomes, (b) to assess whether these tools enhance inclusivity in learning processes, and (c) to examine how they develop students' AI-handling skills with a view to boosting future employability. Findings indicate enhanced creative fluency, broadened participation across diverse learner profiles, and strengthened confidence in AI-supported design processes. The study contributes evidence-based guidance for integrating generative-AI workflows into architectural pedagogy, demonstrating how such tools can operationalise constructivist principles of learner-led meaning-making, support connectivist understandings of learning as participation in human-AI networks, and advance universal learning theories by promoting more inclusive, flexible and accessible educational practices for contemporary learners.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21362v1 Announce Type: new Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.

"I didn't Make the Micro Decisions": Measuring, Inducing, and Exposing Goal-Level AI Contributions in Collaboration

Eunsu Kim, Jessica R. Mindel, Kyungjin Kim, Sherry Tongshuang Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21363v1 Announce Type: new Abstract: As large language models (LLMs) increasingly shape how users form, refine, and extend their goals, attributing contributions in human-AI collaboration becomes critical for users calibrating their own reliance and for evaluators assessing AI-assisted work. Yet existing methods focus on final artifacts, missing the process through which goals themselves are jointly shaped. We introduce a goal-level attribution framework, CoTrace, that decomposes explicit goals into verifiable requirements and traces both direct contributions and indirect influences across dialogue turns. Applying CoTrace to 638 real-world collaboration logs, we find that while models account for only 11-26% of goal-shaping contribution, they contribute substantially more on introducing lower-level concrete requirements, and make various kinds of indirect contributions. Through controlled simulations, we show that interaction design choices significantly affect model goal-shaping behavior. In a user study, exposing participants to goal-level analyses shifts their perceived contributions by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand their own AI-assisted work.

Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21369v1 Announce Type: new Abstract: This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21371v1 Announce Type: new Abstract: Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21372v1 Announce Type: new Abstract: Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

Combating Harms of Generative AI in CS1 with Code Review Interviews and a Flipped Classroom

Peter Fowles, Erik Falor, Sulove Bhattarai, John Edwards, Seth Poulsen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21374v1 Announce Type: new Abstract: Background and Context: Large Language Models (LLMs) are more accessible and accurate than ever before, raising significant concerns for computing educators. One major concern is students using LLMs to bypass the effort needed to understand concepts and metacognitive strategies essential for success in computer science. Objectives: We contribute a unique approach to assessing and building up student understanding through weekly oral code review assessments. These formative assessments incentivize students to understand their submitted code, regardless of whether or not the code was generated by AI tools. We also use a flipped classroom to provide time for students to learn concepts outside of class and provide ample time for students to schedule code review interviews. Methods: For this paper, we collected data from three semesters. We analyze student exam scores, keystroke logs, and surveys to understand how the new course policies affected student learning, behavior, and attitudes. Findings: Pairwise comparison of exam results reveals a statistically insignificant increase in average scores for Fall 2025 compared to previous semesters. Keystroke logs show a significant increase in characters pasted per total characters input into coding assignments in Fall 2025, pointing towards higher AI usage. Survey results show positive student sentiment towards code reviews at the end of Fall 2025, with nearly all negative feedback being addressable through better scheduling and more rigorous TA training. Implications: Oral code reviews with a flipped classroom appear to be effective at mitigating harms of LLM use while providing space for students to freely experiment with these tools. Our work suggests that students in Fall 2025 still show adequate understanding of material covered in written exams, despite dramatic increases in LLM usage for coding assignments.

Privacy Without Remedy: An Assessment of Data Broker Compliance with California Privacy Law

Anna-Maria Gueorguieva, Jennifer King, Apoorva Panidapu, Daniel E Ho — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21376v1 Announce Type: new Abstract: California's consumer privacy law is widely deemed to be the most protected in the United States, one of the few to expressly regulate third party entities that buy and sell consumer data (data brokers). We offer the first empirical assessment of data broker compliance with the 2018 California Consumer Privacy Act (CCPA) and the 2023 Delete Act, which requires data brokers to register with the state and report consumer rights requests metrics annually. First, we demonstrate that only 9% of 522 registered data brokers were fully compliant with transparency requirements after the Delete Act took effect, although we do identify slight improvements over time. Second, we descriptively characterize wide heterogeneity across data brokers in the volume of consumer rights requests received, with many reporting none. We bring in external business data to explore correlates associated with this variation, a challenge given the general lack of opacity into broker business practices. Third, in an audit of a sample of 250 data brokers' consumers request processes, we find that 43% make it impossible for consumers to exercise all privacy rights and 64% introduce at least one design feature that creates substantial friction into the consumer request process. Last, we show how these deficiencies stem from the decentralization of compliance decisions to brokers themselves, enforcement limitations, and regulatory ambiguity. We articulate reforms that could improve consumer privacy, transparency in broker practices, and compliance with these laws.

Auditing Apple's DifferentialPrivacy.framework: Implementation Bugs, Misconfigurations, and Practical Risks

Rishav Chourasia, Ergute Bao, Uzair Javaid, Xiaokui Xiao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21378v1 Announce Type: new Abstract: Since 2016, Apple has claimed that device analytics collected to improve user experience are protected by differential privacy (DP). Apple's DifferentialPrivacy.framework is deployed across its operating systems and handles sensitive signals such as Safari domains, keyboard events, photo attributes, and health-related reports. Because Apple has not open-sourced its privatization algorithms, these privacy claims have been difficult to verify independently. We present a client-side audit of Apple's DP framework on macOS Sonoma 14.2 and Sequoia 15.6. We reverse engineer the shipped binaries, recover Objective-C interfaces, build runtime harnesses that execute Apple's deployed mechanisms, and test whether their outputs match the advertised privacy guarantees. Our audit covers nearly all active deployed mechanisms, including Count Median Sketch, Hadamard-CMS, randomized-response mechanisms, and Prio-style secure aggregation. We find multiple implementation bugs and misconfigurations. Every audited mechanism that relies on floating-point noise fails to meet its advertised DP or zero-knowledge proof guarantee, due to insecure samplers with known floating-point vulnerabilities. We also find secure-aggregation configurations with local DP disabled, exposing pre-aggregation records to any party with access to those logs. Overall, we find DP violations in 5 of 9 audited mechanisms, affecting 87% of data collection in macOS Sonoma and 68% in Sequoia. We also identify public leaked iPhone logs that can be decoded to recover private information, including Safari domains and keyboard emoji signals.

How to Build Marcus's Algebraic Mind: Algebro-Deterministic Substrate over Galois Fields

Hiroyuki Chuma, Kanji Otsuk, Yoichi Sato — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21379v1 Announce Type: new Abstract: In The Algebraic Mind, Gary Marcus identified three components essential for any adequate cognitive architecture: operations over variables, recursively structured representations, and a distinction between mental representations of individuals and kinds. He argued that standard multilayer perceptrons supported none of these, acknowledging that a neural implementation using registers and treelets, constructed via developmental programs rather than gradient descent, remained a programmatic conjecture. Twenty-five years later, the required substrate is now available. Our newly developed PyVaCoAl/VaCoAl is a hyperdimensional computing architecture organized end-to-end around a single algebraic primitive: XOR-and-shift over GF(2), implemented by primitive-polynomial linear-feedback shift registers. The architecture supports reversible variable binding via Bind(R,F) = R XOR shift(F), non-commutative compositional bundling that distinguishes "the dog bites the man" from "the man bites the dog," and address-space individual/kind separation under the same algebra. A companion perspective argues that the dentate gyrus-CA3 circuit is a biological homologue of this same engine, with developmentally specified mossy-fiber targeting supplying the innate microcircuitry Marcus anticipated. In this paper, we map the correspondence between Marcus's three pillars and the operational commitments of PyVaCoAl/VaCoAl. We reinterpret the treelet as an algebraic register set indexed by a primitive generator polynomial, arguing that this architecture provides a functional neural substrate meeting Marcus's specifications far more closely than the tensor products, circular convolution, or temporal synchrony available in 2001. We also demonstrate how this substrate naturally extends to Pearl's rung-3 counterfactual reasoning, a capability the original treelet program did not directly target.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

Yi Liu, Jia Ma, Wengen Li, Jihong Guan, Shuigeng Zhou, Yichao Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21381v1 Announce Type: new Abstract: Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

Bingchen Zhao, Dhruv Srikanth, Yuxiang Wu, Zhengyao Jiang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21384v1 Announce Type: new Abstract: As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test suite. Reward hacking naturally arises in this setup, as the agent optimizes for passing tests while deviating from the users true goal. We study this reward hacking phenomenon by decompose software engineering tasks into three parts: (i) a natural language description of the specification (ii) visible validation tests that exercise specified features in isolation, and (iii) held-out tests that compose those same features to simulate real-world usage. Based on the specification and the visible validation test suites, a genuine agent would be able to generate a solution that can also pass all of the held-out tests. Therefore we use the gap in pass rates on these two suites to quantify reward hacking. Based on this methodology, we introduce SpecBench, a benchmark comprising 30 systems-level programming tasks ranging from short horizon tasks like building a JSON parser to ultra long horizon tasks like building an entire OS kernel from scratch. Large-scale experiments reveal a consistent pattern: while every frontier agent saturates the visible suite, reward hacking persists, with smaller models exhibiting larger gaps on holdout suites. The gap also scales sharply with task length: it grows by 28 percentage points for every tenfold increase in code size. Failures range from subtle feature isolation to deliberate exploits, including a 2,900-line hash-table "compiler" that memorizes test inputs. SpecBench offers a principled testbed for measuring whether coding agents build genuine working systems or merely game the test suites developers hand them.

Verification of Configurable SRA Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21385v1 Announce Type: new Abstract: Many digital systems are designed as collections of asynchronous processes orchestrated by a domain-specific scheduler. The verification of such scheduler-restricted asynchronous systems (SRA) is challenging due to process-process and process-scheduler interactions. In this paper, we tackle the problem of verifying configurable SRA. A configurable SRA describes an unbounded family of possible SRA, each resulting from an instantiation satisfying given configuration constraints; our goal is proving at once that every legal instantiation of a configurable SRA is correct. We propose a contract-based, deductive verification approach that combines (i) compositional proof rules that abstract the scheduler to prove top-level invariant properties, (ii) automatic summarizations of the methods invoked by the scheduler, (iii) simplification with respect to the nature of the space of configurations. The approach is grounded in (object-oriented) first order logic, requires reasoning over quantified statements, and leverages the Dafny software verifier as a backend. An experimental evaluation on industrial case studies demonstrates that the framework scales effectively and enables practical reasoning about complex parameterized behaviors.

On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures

Likun Lin, Zhongjian Wang, Jack Xin, Zhiwen Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21388v1 Announce Type: new Abstract: Despite the remarkable empirical success of generative models, the available theory on their statistical accuracy in scientific computing remains largely pessimistic. This paper develops a theoretical framework for understanding the regularity of transport maps and the generalization properties of one-step Wasserstein-guided generative models for PDE-induced probability measures. We consider normalized target densities associated with linear elliptic and parabolic equations on bounded domains, as well as diffusion and Fokker--Planck equations on the torus. Under standard structural assumptions, we prove that these target measures satisfy doubling conditions. By combining this fact with regularity theory for optimal transport between doubling measures, we show that the optimal transport map from a uniform source measure to the target measure is H\"older continuous. This regularity yields an approximation-theoretic justification for one-step generative models that learn PDE-induced distributions via a single pushforward map. As a representative instance, we study DeepParticle and derive excess-risk bounds characterizing the discrepancy between the learned map and the population-optimal map. We also establish a robustness estimate under target shift and illustrate the theory with experiments which support the derived rates.

Designing Conversations with the Dead: How People Engage with Generative Ghosts

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21390v1 Announce Type: new Abstract: We examine how people experience two choices in the design of generative ghosts, AI systems that are trained on data of the dead: representation, where an AI speaks about a deceased person in the third person, and reincarnation, where the AI speaks as the deceased in the first person. Through a qualitative user study with 16 participants, we explore how each shaped authenticity, affect, and risk. Reincarnation was preferred for its immediacy, but participants shared fears of over-reliance. Representation was preferred for engaging with memory over conversational presence, though participants often ignored this distinction, engaging in dialogue despite third-person framing. Across both modes, participants privileged affective resonance over factual fidelity. We conclude by showing how factors such as tone, language, and conversational rhythm -- factors unique to the user's memory of the deceased -- shape interactions with generative ghosts, and argue that those interactions are always collaborative.

Post-Hoc Understanding of Metaphor Processing in Decoder-Only Language Models via Conditional Scale Entropy

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21391v1 Announce Type: new Abstract: Metaphor requires a language model to resolve a token whose contextual meaning diverges from its basic literal sense. Understanding how transformer models organize this reinterpretation across depth remains an open problem in mechanistic interpretability. We introduce conditional scale entropy (CSE), a wavelet-derived measure of how broadly transformer computation engages across frequency scales at each layer position. Two theorems establish that CSE is invariant to update magnitude, isolating the structural pattern of updates from their intensity. Using CSE, we find that metaphorical tokens produce significantly higher spectral breadth than literal tokens at contiguous layer positions on every decoder-only architecture tested, from 124M to 20B parameters (GPT-2 family, LLaMA-2 7B, GPT-oss 20B). The effect survives cluster-based permutation correction, recurs in the early-to-mid relative depth range across models, and converges with an independent analysis of 200 naturalistic VUA pairs. Specificity controls further show that the effect is not explained by semantic complexity or by matched propositional content. These results identify multi-scale coordination as a consistent signature of metaphorical language processing in the decoder-only architectures examined, and establish CSE as a principled tool for characterizing cross-depth structure in transformers.

VIPER-MCP: Detecting and Exploiting Taint-Style Vulnerabilities in Model Context Protocol Servers

Pengyu Sun, Qishu Jin, Enhao Huang, Zifeng Kang, Xin Liu, Dakun Shen, Song Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21392v1 Announce Type: new Abstract: Model Context Protocol (MCP) has emerged as a standard interface for connecting LLM agents to external tools. Because MCP servers expose privileged operations such as shell execution, network access, and file-system manipulation to agent-driven invocation, implementation flaws in tool handlers can create a direct path from natural-language input to security-sensitive sinks, potentially granting attackers remote code execution or full system compromise. Existing approaches either produce unconfirmed static alerts without dynamic validation, or rely on fixed template libraries that lack code-level guidance and fail to trigger vulnerabilities requiring specific parameter shapes or multi-step taint paths. In this paper, we present VIPER-MCP, the first end-to-end automated vulnerability auditing framework for MCP servers that not only detects taint-style vulnerabilities but also dynamically confirms their exploitability by producing concrete proof-of-concept prompts. VIPER-MCP introduces two novel techniques: (1) an anchor-query pass in a two-pass static analysis strategy that augments standard taint alerts with function-level structural context, resolving file-level static artifacts to specific MCP tool handlers and producing vulnerability-anchored call chains; and (2) a feedback-driven prompt evolution mechanism that employs dual-mutator scheduling that independently corrects tool-selection drift and deepens parameter penetration, together with fitness-scored seed selection to iteratively refine natural-language prompts toward vulnerable sinks. In a large-scale scan of 39,884 real-world open-source MCP server repositories, VIPER-MCP discovered 106 0-day vulnerabilities, all of which were confirmed through end-to-end exploit traces, with 67 CVE IDs assigned to date. We responsibly disclosed all confirmed findings to the affected developers and coordinated CVE assignment where applicable.

Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

Liang Wu, Kelly Wan, Mayank Darbari, Liangjie Hong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21395v1 Announce Type: new Abstract: The proliferation of emerging applications, such as autonomous driving and immersive experiences, demands cellular networks that are not only faster, but fundamentally more resilient and autonomous. This paper presents a BlueSky vision on how Artificial Intelligence will be natively integrated into 6G, shifting the paradigm from \underline{Network for AI} to \underline{AI for Network}. We envision that, unlike 5G's reliance on scattered, ad-hoc models each trained for a single task, native AI in the 6G era will be anchored by a foundation model and and orchestrated via collaborative multi-agent systems, framing network management as a unified, multi-modal, multi-task optimization problem. Built on this vision, we outline two transformative directions. The first focuses on developing a 6G foundation model as a unified backbone, with task-specific knowledge distilled into compact models suited for diverse edge deployments. The second advances multi-agent systems designed to autonomously diagnose, maintain, and recover networks with minimal human intervention. These directions chart a roadmap for 6G to evolve into an intelligent, self-sustaining communication infrastructure.

Grid-Aware Peer-to-Peer Energy Trading: A Learning-Augmented Framework

Devangi, Ankit Singhal, Yashasvi Bansal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21396v1 Announce Type: new Abstract: Distribution networks are transitioning from passive to active systems due to the growing integration of distributed energy resources (DERs). Peer to Peer (P2P) energy trading has emerged as a viable framework that enables local energy exchange among participants, represented here as aggregated microgrids (MGs). Incorporating network constraints is essential to ensure that P2P transactions remain physically feasible and consistent with grid's operating limits. However, existing P2P frameworks still lack advanced predictive mechanisms that allow prosumers to anticipate network feasibility or the distribution system operator (DSO) response during trade formulation. This paper proposes a learning augmented P2P and DSO interface that predicts the DSOs response to the proposed P2P trades, allowing prosumers to self-assess and refine their trading decisions. A supervised transformer based regression model is trained to enable MGs to locally predict the DSOs response without sharing their proposed trades, thereby reducing transaction overhead, alleviating DSO burden, and preserving information privacy. The proposed framework is validated on the modified IEEE 33 bus distribution power system with interconnected microgrids. Case studies are presented to validate the effectiveness of the proposed model in terms of market efficiency, trade acceptance and computational burden.

Validating Navmesh using Geometry: Voxel-Based Analysis with Prioritized Exploration

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21397v1 Announce Type: new Abstract: Navigation mesh (Navmesh) inconsistencies affect the player experience by directly impacting the navigation systems used by non-playable characters (NPCs) in game environments. While navmeshes are generated from world geometry using well-established algorithms, environments change throughout development as terrain is adjusted and assets are moved or replaced, resulting in mismatches between the navmesh and the actual environment. Existing automated approaches attempt to detect navigation issues using exploration agents and reinforcement learning techniques. However, since these methods rely on the navigation data itself or evaluate navigation behavior indirectly, they do not explicitly verify whether the navigation representation reflects the walkable space defined by underlying geometry. This paper presents a framework for validating navigation meshes through an independent, geometry-driven analysis of navmesh correctness. The approach reconstructs walkable space directly from environment geometry using a voxel-based representation, followed by constraint-aware traversal and connectivity evaluation. Validation is formulated as a prioritized search problem over the voxel space, where reinforcement learning guides sampling toward regions more likely to exhibit inconsistencies. At each sampled location, reachability derived from the voxel representation is compared against reachability obtained from the navmesh via engine-level queries. Experiments across multiple large-scale open-world game environments show that the approach consistently lowers exploration effort while maintaining similar defect detection coverage. The framework runs offline within the game engine and can be integrated into automated quality assurance pipelines. Since the method relies on geometry, it can be adapted across game engines with minimal changes, making it suitable for production deployment.

From swept contact to pose: Probe-aware registration via complementary-shape docking

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21398v1 Announce Type: new Abstract: Accurate registration between a prior model and the real scene is essential for high-precision robotic manipulation, yet optical methods suffer from long calibration chains, line-of-sight constraints, and fabrication errors. We propose a calibration-free alternative that reformulates contact registration as complementary-shape docking between the object and the probe's swept volume, explicitly accounting for probe geometry and leveraging both contact and non-contact evidence. Our solver integrates a global-to-local search via 3D FFT correlation over low-discrepancy SO(3) samples, then followed by continuous SE(3) refinement using Lie-algebra updates and analytic contact sensitivities. This pipeline yields efficient exploration and metric-grade convergence without fragile point correspondences. Simulation across free-form meshes achieved sub-0.04 mm and sub-0.4{\deg} accuracy and robustness to pose noise and contact loss. On a tooth-preparation robot, our method attained 0.42 mm and 3.75{\deg}, outperforming an optical tracker registration while requiring no external sensors. These results demonstrate a practical and precise registration strategy for surgical and industrial robots.

Output Feedback Control of Linear Time-Invariant Systems with Operational Constraints

Marcel Menner, Heather Hussain, Eugene Lavretsky — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21399v1 Announce Type: new Abstract: This paper introduces a systematic method for designing robust linear controllers using output feedback in the presence of operational constraints. The design uses Nagumo's Theorem and the Comparison Lemma to guarantee constraint satisfaction, while incorporating min-norm optimal control principles inspired by Control Barrier Functions. The resulting controller is a continuous piecewise-linear output feedback policy that preserves the closed-loop system's analyzability using linear systems theory. Due to the linear control design, multi-input multi-output (MIMO) robustness margins can be derived with and without active operational constraints. This paper shows that operational constraints on the system's state can be satisfied using an observer-based output feedback control design. Through flight control trade studies, we demonstrate the practical relevance of the framework in safety-critical aircraft control applications.

pace-Time Trade-off in Integer Linear Scaling Rounded to the Nearest Integer through Multiplicative and Additive Decomposition

Kyeong Soo Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21400v1 Announce Type: new Abstract: We formulate the problem of clock skew compensation as a special case of the integer linear scaling in the form of iD/A and propose two algorithms -- i.e., the multiplicative decomposition of integer division (MDID) and the additive decomposition of direct search (ADDS) -- for its nearest integer solution, which are not only immune to floating-point precision loss but also non-incremental unlike our prior approaches based on Bresenham's algorithm. Having theoretically established both decomposition algorithms based on a unified and rigorous formulation of the problem of the integer linear scaling rounded to the nearest integer, we discuss the space-time trade-off through the analysis of their computational complexities and non-overflow conditions. Through the numerical examples in a practical context of clock skew compensation under two different scenarios based on 32-bit and 64-bit integers, we observe that MDID can obtain the nearest integer solutions with the complexity of O(1) when D is much smaller than the maximum value of the underlying integer type but overflows otherwise; in comparison, ADDS can handle all the cases under both scenarios without overflows but at the expense of increased computational complexity when i approaches the maximum value of the underlying integer type. We also observe that ADDS based on 32-bit integers is equivalent to the clock skew compensation based on 64-bit double-precision floating-point arithmetic, while both algorithms based on 64-bit integers are equivalent to the clock skew compensation based on 128-bit quadruple-precision floating-point arithmetic, which highlights another trade-off between the bounded compensation errors and lower space complexity of the integer-based decomposition algorithms and the lower chances of overflows resulting from the wide ranges of numbers of the clock skew compensation based on floating-point arithmetic.

Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21401v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents that make sequences of decisions over extended interactions in high-stakes domains. However, the behavior of LLMs under sustained authority pressure is still an open question with direct implications for the safety of agentic pipelines. We ran a variation of Milgram's obedience experiment on 11 open-source LLMs and found that most models reached or approached the final shock level before refusing, across 8 conditions with 30 trials per model per condition. We found four main takeaways: (1) LLMs are subject to pressure, and they comply despite explicitly expressing distress, just like human subjects did in the original experiment; (2) LLMs are vulnerable to gradual boundary/value violations; (3) when LLMs refuse, they may ignore the response format requirements, so the response is discarded by the orchestrator, which causes a retry that can result in compliance with the underlying request even when refusal was intended initially; (4) we hypothesise that there is a low-level token pattern continuation attractor that might be contributing to compliance, overriding higher level processing of the situation's meaning and values.

Quantifying the cross-linguistic effects of syncretism on agreement attraction

Utku Turk, Eva Neu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21403v1 Announce Type: new Abstract: Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.

What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21404v1 Announce Type: new Abstract: We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.

Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries

Peng Ding, Rick Stevens — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21405v1 Announce Type: new Abstract: Third-party Python libraries introduce dependency management overhead, supply chain risk, and deployment friction in constrained environments. A natural question is how much of this ecosystem can be replicated using only Python's standard library -- and at what correctness and performance cost. We address this empirically through zerodep, a growing collection of single-file Python modules, each a stdlib-only reimplementation of a popular third-party library, developed with LLM assistance under strict constraints: no external imports, single file, drop-in API compatibility, and mandatory correctness validation against the reference library. Spanning over 40 modules across 12 categories -- including serialization, networking, cryptography, agent protocols, and text processing -- zerodep provides a controlled testbed for two interrelated questions: (1) Where does the stdlib suffice? and (2) Can LLMs effectively generate correct, performant code under tight symbolic constraints? Systematic benchmarking shows that stdlib-only implementations achieve performance parity (within 2x of the reference) in the majority of cases. The primary performance cliff is C-extension-backed computation (image processing, binary serialization, low-level crypto), not the inherent overhead of pure-Python third-party libraries. Conversely, many widely-used libraries carry architectural overhead that LLM-generated stdlib reimplementations avoid, yielding 5--115x speedups in several categories. We characterize the stdlib capability boundary across complexity tiers and library categories, discuss where LLM-assisted development succeeds and where it requires iterative human correction, and examine implications for dependency-free software engineering at scale. zerodep is open-source at https://github.com/Oaklight/zerodep.

MC-Risk: Multi-Component Risk Fields for Risk Identification and Motion Planning

Maximilian Link, Yingjie Xu, Yingbai Hu, Yinlong Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21406v1 Announce Type: new Abstract: We present MC-Risk, a planner-aligned, multi-component risk field on a bird's-eye-view grid that yields early, calibrated, and class-aware risk localization. MC-Risk linearly composes three interpretable modules: (i) a motorized-agent field that fuses a black-box multimodal trajectory predictor with an analytic Gaussian-torus construction whose lateral width grows with speed/curvature and whose height attenuates with look-ahead; (ii) a VRU risk field that replaces isotropic pedestrian blobs with a forward-biased anisotropic kernel aligned to heading and speed; and (iii) a road penalty field that exploits full HD-map topology, imposing an off-road penalty and lane-aware risk exposure for same/opposite directions. We conduct, to our knowledge, the first standardized quantitative evaluation of a risk-field formulation on RiskBench's collision subset. MC-Risk attains the best overall risk localization and the earliest hazard indication. Finally, we demonstrate a plug-and-play planning interface by using the field as an MPC cost density, enabling risk-aware trajectory generation without additional training.

RoadTones: Tone Controllable Text Generation from Road Event Videos

Chirag Parikh, Siddhi Pravin Lipare, Ravi Kiran Sarvadevabhatla — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21411v1 Announce Type: new Abstract: Existing video-language models can generate factual descriptions of road events but lack control over how these events are expressed: their tone, urgency, or style. This limits deployment in communication-critical settings where the effectiveness of a message depends on both content and presentation, not just factual accuracy. To mitigate this, we introduce a comprehensive dataset-model-evaluation suite for tone-controllable road video captioning. Our human-validated data generation pipeline expands road-video corpora with diverse tonal annotations and multi-tone captions, yielding the RoadTones-51K dataset. We propose RoadTones-VL-CoT, a controllable video-to-text model that also generates tone-conditioned Chain-of-Thought intermediate drafts for interpretability. We also introduce RoadTones-Eval, a new evaluation suite that jointly measures factual consistency and tone adherence. In addition, we conducted a user study whose results validate caption quality, tone control, and factual consistency. Together, these contributions lay the foundation for context-sensitive tone-controllable video captioning.

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21413v1 Announce Type: new Abstract: As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction

Shizhe Chen, Paul Pacaud, Cordelia Schmid — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21414v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.

Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Junghyun Lee, Hyunseo Kim, Hanna Jang, Junhyug Noh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21417v1 Announce Type: new Abstract: Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and na\"ive multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.

FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G

Amin Farajzadeh, Melike Erol-Kantarci — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21418v1 Announce Type: new Abstract: In sixth-generation (6G) ultra-dense networks, aggressive frequency reuse amplifies inter-cell interference (ICI), making multi-cell orthogonal frequency-division multiple access (OFDMA) scheduling and power control strongly coupled across neighboring cells. We study distributed downlink resource management -- joint subcarrier scheduling and power allocation -- under interference coupling and long-term per-user quality-of-service (QoS) minimum-rate constraints. By using virtual-queue deficit weights to enforce long-term QoS, we develop FedCritic, a serverless federated multi-agent actor-critic framework with decentralized execution. Unlike centralized training with decentralized execution (CTDE) approaches that require centralized critic learning and joint trajectory aggregation, FedCritic federates the critic through lightweight gossip-based parameter averaging over the interference graph, enabling stable value estimation without a central coordinator while keeping policies local. Simulations in an interference-rich reuse-1 setting show that FedCritic improves mean signal-to-interference-plus-noise ratio (SINR) and cell-edge rate, increases network-wide average sum-rate and fairness relative to non-coordinated and CTDE baselines, and achieves more stable training with lower coordination overhead.

HiRes: Inspectable Precedent Memory for Reaction Condition Recommendation

Shreyas Vinaya Sathyanarayana, Raja Sekhar Pappala, Deepak Warrier — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21420v1 Announce Type: new Abstract: Reaction condition recommendation sits immediately after retrosynthetic disconnection selection, and in practice, chemists require both accurate predictions and the precedents that justify them. We present HiRes (Hierarchical Reaction Representations), a retrieval-augmented condition recommendation system whose learned reaction space serves as both a classifier feature and an inspectable precedent memory. The model combines a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and a k-NN retrieval layer. HiRes achieves state-of-the-art performance among primary-slot USPTO-Condition models, reaching Catalyst, Solvent, and Reagent top-1 accuracies (Acc@1) of 0.929, 0.534, and 0.530 respectively. It ties the best reported baseline on Catalyst while outperforming models such as REACON on Solvent and Reagent. Furthermore, paired bootstrap analysis demonstrates that integrating retrieval with learned condition heads provides statistically significant gains for solvent and reagent selection over purely parametric approaches. Ultimately, HiRes bridges the gap between predictive accuracy and chemical interpretability, offering a single representation that supplies both competitive recommendations and the concrete chemical precedents necessary for practical synthesis planning.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

Lauhitya Reddy, Trisha M. Kesar, Hyeokhyen Kwon — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21421v1 Announce Type: new Abstract: Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.

Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

Qihao Lin, Guanxu Chen, Dongrui Liu, Jing Shao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21422v1 Announce Type: new Abstract: As LLMs continue to scale, improving training efficiency increasingly depends on using data more effectively. Data selection addresses this problem by allocating a limited training budget to samples that best promote a target behavior. Existing methods usually represent the target behavior with a set of target examples, but often treat these examples as equally important. This can be inefficient because target examples may differ in their relevance to the current model: examples closer to the model's current behavior provide more actionable guidance than those farther away. We propose PRISM (PReference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning), which uses the current model's preference to weight target examples and construct a preference-aware target representation. PRISM then scores candidate training samples by their alignment with this representation, concentrating the data budget on samples more likely to move the model toward the target behavior. Theoretical analysis shows that this preference weighting yields a more effective first-order direction for increasing target-behavior preference. Experiments across model families and scales show that PRISM improves both efficient fine-tuning and safety-oriented SFT repair, demonstrating that precise target-behavior characterization is key to budget-efficient data selection.

Achieving Material Robustness via Symmetric Stress Finite Element Discretizations

Pablo Brubeck, Charles Parker, Umberto Zerbinati — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21425v1 Announce Type: new Abstract: When discretizing symmetric stress tensors in variational problems arising in continuum mechanics, one has to choose how to enforce the symmetry of the stress tensor: (i) strongly by requiring the discrete tensors to be pointwise symmetric or (ii) weakly by introducing a Lagrange multiplier. For $H(\mathrm{div})$-conforming finite element discretizations of Hellinger--Reissner elasticity and velocity--stress formulations of incompressible flow, where symmetry of the Cauchy stress tensor is tied to the conservation of angular momentum, we show that this choice may substantially impact the accuracy of the numerical scheme. Through a series of benchmark problems featuring anisotropic constitutive laws inspired by fiber reinforced material, liquid crystal polymer networks, and polar fluids, we show that schemes enforcing symmetry weakly can yield arbitrarily poor stress approximations -- even for zero-stress configurations. However, schemes enforcing symmetry strongly deliver accurate stress approximations independently of the constitutive law, a property we term material robustness. We present a unifying theory that rigorously explains this behavior.

Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks

Qishi Zhan, Ziheng Chen, Minxuan Hu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21426v1 Announce Type: new Abstract: One-shot magnitude pruning can cause severe accuracy collapse in the high-sparsity regime, even when the pruning mask preserves the largest weights. We argue that this failure reflects a granularity mismatch in post-pruning repair. Under global magnitude pruning, nearly collapsed channels can coexist with channels that retain informative activation variance within the same layer. Existing layer-wise activation repair methods apply a single correction to the whole layer, and can therefore over-amplify damaged channels while trying to restore the layer-level signal. We propose Adaptive Signal Resuscitation (ASR), a training-free channel-wise repair method that matches the granularity of repair to the granularity of damage. ASR estimates a variance-matching correction for each output channel and stabilizes it with a data-driven shrinkage rule, suppressing unreliable corrections for channels with weak post-pruning signal while preserving corrections for healthier channels. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set and no retraining. Across three datasets, four convolutional architectures, and both unstructured and structured sparsity settings, ASR generally improves over layer-wise repair, with the clearest gains in high-sparsity regimes. On ResNet-50 at 90% sparsity, ASR recovers 55.6% top-1 accuracy on CIFAR-10, compared with 41.0% for layer-wise repair and 28.0% for BatchNorm-only recalibration. Ablations show that naive channel-wise variance matching is insufficient, and that shrinkage stabilizes post-pruning repair.

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21427v1 Announce Type: new Abstract: Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource. In this paper, we present a power-aware runtime for LLM serving, PALS, that treats GPU power caps as a first-class control knob and jointly optimizes them with software parameters such as batch size. The system combines lightweight offline power-performance models with a feedback-driven controller to select configurations that satisfy throughput targets while maximizing energy efficiency. We implement PALS within an existing LLM serving framework, vLLM, demonstrating that it requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM inference runtimes, enabling energy-proportional and grid-interactive AI systems.

Polynomial-Time Robust Multiclass Linear Classification under Gaussian Marginals

Ilias Diakonikolas, Giannis Iakovidis, Mingchen Ma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21428v1 Announce Type: new Abstract: We study the task of agnostic learning of multiclass linear classifiers under the Gaussian distribution. Given labeled examples $(x, y)$ from a distribution over $\mathbb{R}^d \times [k]$, with Gaussian $x$-marginal, the goal is to output a hypothesis whose error is comparable to that of the best $k$-class linear classifier. While the binary case $k=2$ has a well-developed algorithmic theory, much less is known for $k \ge 3$. Even for $k=3$, prior robust algorithms incur exponential dependence on the inverse of the desired accuracy in both complexity and representation size. In this work, we develop new structural results for multiclass linear classifiers and use them to design fully polynomial-time robust learners with dimension-independent error guarantees. Our first result shows that the standard multiclass perceptron algorithm requires super-polynomially many samples and updates, even with clean labels and Gaussian marginals, revealing a basic obstruction absent in the binary case. Our main positive result is a pairwise improper-learning framework which yields an efficient learner with error $\widetilde O(k^{3/2}\sqrt{\mathrm{opt}})+\epsilon$ for general $k$. Additionally, we develop a sharper localization-based framework which leads to error $O(\mathrm{opt})+\epsilon$ for $k=3$, and error $\mathrm{poly}(k)\mathrm{opt}+\epsilon$ for geometrically regular $k$-class linear classifiers.

roto 2.0: The Robot Tactile Olympiad

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21429v1 Announce Type: new Abstract: Tactile-based reinforcement learning (RL) is currently hindered by fragmented research and a focus on over-saturated orientation tasks. We introduce v2 of the Robot Tactile Olympiad (\texttt{roto 2.0}), a GPU-parallelised benchmark designed to standardise tactile-based RL across four distinct robotic morphologies (16-DOF to 24-DOF). Unlike prior benchmarks, roto focuses on end-to-end "blind" manipulation, utilising only proprioception and tactile sensing without state information or distillation. We demonstrate a significant performance leap, with our blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art speeds. By open-sourcing our environments and robustly tuned baselines, we reduce the barrier to entry and enable researchers to prioritise fundamental algorithmic challenges over tedious RL tuning. Website: https://elle-miller.github.io/roto/

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21431v1 Announce Type: new Abstract: Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

Instrumental Text-to-Music Generation with Auxiliary Conditioning Branches

Junyoung Koh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21433v1 Announce Type: new Abstract: Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human MOS, and that reinvesting the saved parameters as additional DiT depth recovers only marginally. This suggests the auxiliary branches may act as training-time architectural anchors whose contribution goes beyond their explicit conditioning content. We validate the same model through comparisons with external instrumental baselines and through our submission to the ICME 2026 Academic Text-to-Music (ATTM) Grand Challenge, where our Performance submission ranked first under both the objective metrics and the subsequent organizer-administered MOS over 35 raters, attaining the highest overall MOS across all challenge submissions, while our Efficiency submission was a finalist that tied for second under the objective metrics.

Agentic Model Checking

Youcheng Sun, Jiawen Liu, Daniel Kroening, Jason Xue — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21434v1 Announce Type: new Abstract: Verifying LLM-generated systems code is hard: bugs are prevalent, formal specifications are missing, and safety contracts are encoded implicitly at call sites rather than enforced at function boundaries. We propose agentic model checking, a paradigm that couples LLM agents with a bounded model checking backend under the principle agents propose, solvers verify: agents handle tasks requiring semantic judgment (spec inference, check selection, counterexample classification, refinement proposal) while BMC discharges every soundness-relevant decision. The paradigm rests on three commitments. Specifications are inferred top-down from caller context in a restricted DSL that translates deterministically into the backend's assume/assert primitives, with optional functional-correctness clauses lifting verification from panic-freeness to behavioural faithfulness. Verification is compositional: each function is checked in isolation against its spec with callees replaced by postcondition-constrained stubs, so per-query cost scales with a single function's state space and refinements propagate automatically to callers. Counterexamples are not bug reports: they pass through a validation pipeline (reachability, callee feasibility, dynamic replay, realism audit) that distinguishes active in-tree crashes from latent public-API failures, while modelling artifacts drive a refinement loop rather than being suppressed. We instantiate the approach in BMC-Agent and evaluate it on LLM-generated kernel and compiler code in C and Rust alongside mature OSS-Fuzz-hardened libraries, confirming real defects, producing bounded clean verifications on heavily-fuzzed surfaces, and establishing functional equivalence on selected algorithmic functions.

Gaussian Sheaf Neural Networks

Andr\'e Ribeiro, Ana Luiza Ten\'orio, Tiago da Silva, Diego Mesquita — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21435v1 Announce Type: new Abstract: Graph Neural Networks (GNNs) have become the de facto standard for learning on relational data. While traditional GNNs' message passing is well suited for vector-valued node features, there are cases in which node features are better represented by probability distributions than real vectors. Concretely, when node features are Gaussians, characterized by a mean and a covariance matrix, naively concatenating their parameters into a single vector and applying standard message passing discards the geometric and algebraic structure that governs means and covariances. We propose Gaussian Sheaf Neural Networks (GSNNs), a principled framework that incorporates these inductive biases into graph-based learning. Building on the theory of cellular sheaves, we derive a new Laplacian operator that generalizes the sheaf Laplacian to this setting and preserves its key properties. We complement our theoretical contributions with experiments on synthetic and real-world data that illustrate the practical relevance of GSNNs.

Fully Actuated Manifold Constraint Based Output Feedback Control for Input-Constrained Uncertain Nonlinear Systems

Dianrui Mu, Changchun Hua, Yafeng Li, Jiannan Chen, Rao Wei — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21439v1 Announce Type: new Abstract: This paper presents a low-complexity, model-free, output-feedback controller for a class of unknown time-varying nonlinear systems with unknown input constraints. The controller achieves the preset control accuracy when the actuator is not saturated and maintains flexible control accuracy after actuator saturation. This result extends existing constraint control methods for linear manifolds to a more general form, including the construction of nonlinear manifolds and various types of constraints, thereby achieving preset control accuracy within finite or fixed time. Additionally, flexible control under unknown saturation is achieved through the construction of an error-driven flexible constraint. Finally, second-order and higher-order control examples and simulations are provided.

ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes

Zhiming Liu, Zhicheng Zou, Nantheera Anantrasirichai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21440v1 Announce Type: new Abstract: Atmospheric turbulence severely degrades video quality by introducing distortions such as geometric warping, blur, and temporal flickering, posing significant challenges to both visual clarity and temporal consistency. Current state-of-the-art methods are based on transformer, 3D architectures and require multi-frame input, but their large computational cost and memory usage limit real-time deployment, especially in resource-constrained scenarios. In this work, we propose ReMATF, a lightweight recurrent framework that restores videos using only two frames at a time while preserving spatial detail and temporal stability. ReMATF combines a multi-scale encoder-decoder with temporal warping and a motion-adaptive temporal fusion module that performs per-pixel fusion between the warped previous output and the current prediction to enhance coherence without enlarging the temporal window. This design reduces flicker, sharpens details, and remains efficient. Experiments on synthetic and real turbulence datasets show consistent improvements in PSNR/SSIM and perceptual quality (LPIPS), along with substantially faster inference than multi-frame transformer baselines, making ReMATF suitable turbulence mitigation in resource-constrained scenarios.

torchtune: PyTorch native post-training library

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21442v1 Announce Type: new Abstract: Modern LLMs typically require multistage training pipelines to achieve strong downstream performance, with post-training serving as the main interface for adapting open-weight models. We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of LLMs, enabling efficient fine-tuning, experimentation, and deployment-oriented workflows. Unlike many existing fine-tuning frameworks, which often optimize for ease of use, specialized recipes, or hardware efficiency at the cost of transparency and extensibility, torchtune emphasizes modularity, hackability, and direct access to the underlying PyTorch components. In this paper, we present the design principles behind torchtune, describe how they are reflected in its model builders, training recipes, and distributed training stack, and evaluate the library across representative post-training settings. We compare against popular fine-tuning frameworks, including Axolotl and Unsloth, and show that torchtune provides strong performance and memory efficiency across many settings while remaining flexible enough for rapid research iteration. These results position torchtune as a practical foundation for reproducible LLMs post-training research.

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21443v1 Announce Type: new Abstract: Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.

Error analysis of a finite element scheme for parametric mean curvature flow based on the DeTurck trick

Klaus Deckelnick, Vanessa Styles — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21445v1 Announce Type: new Abstract: The paper is concerned with the error analysis of a numerical scheme for the approximation of parametric mean curvature flow. The scheme we study is based on a reparametrization using the DeTurck trick and was proposed by Elliott and Fritz in [15]. In the semidiscrete case, for a spatial discretization by finite elements of order $k \geq 2$ we prove an optimal $H^1$-error estimate for the position vector. We present numerical experiments that confirm this error bound and demonstrate that the scheme has good properties with respect to the distribution of mesh points as already observed in [15].

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Abhinaw Priyadershi, Jelena Frtunikj — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21446v1 Announce Type: new Abstract: Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable under real-world sensor degradation. In this paper we present a controlled perturbation study of Vision-Language-Action (VLA) robustness in autonomous driving, evaluating Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations (Gaussian noise at four intensities, two lighting extremes, and two fog levels; ${\sim}18{,}000$ inference trials). We find that reasoning consistency is a high-fidelity indicator of trajectory reliability: when Chain-of-Causation (CoC) explanations change after perturbation, trajectory deviation spikes $5.3{\times}$ (21.8m vs 4.1m), with $r\!=\!0.99$ across attack types and $r_{pb}\!=\!0.53$ per-sample (Cohen's $d\!=\!1.12$). A controlled ablation provides evidence that enabling CoC generation is associated with improved trajectory accuracy (11.8% on average across conditions; $p < 0.0001$) under matched inference settings. Over the tested noise range ($\sigma \in \{10, 30, 50, 70\}$), degradation is approximately linear ($R^2\!=\!0.957$), while standard input preprocessing defenses provide only marginal relief. Together, these results establish CoC consistency as a quantitative proxy for planning safety and motivate reasoning-based runtime monitoring for safer VLA deployment.

A Note on EFX Inapproximability for Chores

Vasilis Christoforidis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21448v1 Announce Type: new Abstract: We study the approximability of EFX allocations for indivisible chores under complement-free cost functions. The non-existence of exact EFX allocations for general monotone functions for chores is known from \cite{CS24}, and a result of \cite{akrami2026} transfers such comparison-based non-existence results to monotone submodular, and hence subadditive, functions. We strengthen this picture by giving explicit constant-factor inapproximability results for submodular and subadditive functions. Our main construction is a three-agent, six-chore instance with monotone subadditive cost functions for which no $\alpha$-EFX allocation exists for any $1\le \alpha<2^{1/3}\approx 1.26$, thus narrowing the gap with the known upper bound of $2$. The construction is obtained by refining the original counterexample of \cite{CS24} and using the approach of \cite{mackenzie2026}. We also give a weighted-coverage realization of the ordinal profile, yielding an instance in which no $\alpha$-EFX allocation exists for any $1\le \alpha<20/19$ under submodular costs. Thus, even within well-studied complement-free classes, EFX for chores admits nontrivial constant lower bounds on approximability.

Composite B-Spline Current Deposition and Interpolation Operators for Thin-Wire Finite-Difference Time-Domain Simulations

Cole Gruninger, Boyce E. Griffith — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21450v1 Announce Type: new Abstract: Holland-Simpson thin-wire finite-difference time-domain (FDTD) simulations of obliquely oriented closed-loop antennas exhibit persistent low-frequency parasitic currents because the current-deposition operator fails to conserve charge. Together with an interpolation operator that samples the tangential electric field along the wire, this deposition operator can be realized as a regularization of distributions against a regularized delta function supported on the wire. We show that charge conservation requires the deposited current to be discretely divergence-free when the wire carries a constant current, and we introduce a family of composite B-spline regularizations that satisfy this condition to machine precision. Exact evaluation of the coupling line integrals is achievable because the B-spline kernels are piecewise polynomial with breakpoints known a priori, allowing composite Gauss-Legendre quadrature with subinterval breakpoints at every grid-plane crossing. Taking the interpolation operator as the discrete adjoint of the deposition operator preserves skew-symmetry and ensures that a discretely irrotational electric field drives no net electromotive force around a closed loop. Numerical experiments on a center-fed dipole and on circular and square loop antennas show that the proposed regularizations yield orientation-independent impedance values consistent with known characteristics, whereas a naive trilinear regularization produces unphysical parasitic low-frequency currents in closed loops.

Approximation Theory for Neural Networks: Old and New

Soumendu Sundar Mukherjee, Himasish Talukdar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21451v1 Announce Type: new Abstract: Universal approximation theorems provide a mathematical explanation for the expressive power of neural networks. They assert that, under mild conditions on the activation function, feedforward neural networks are dense in broad function classes, such as continuous functions on compact subsets of $\mathbb{R}^d$, $L^p$ spaces, or Sobolev spaces. Over the past four decades, these qualitative universality results have evolved into a rich quantitative theory addressing approximation rates, parameter efficiency, and the role of architectural features such as depth and width. This survey presents several glimpses into this theory. We review classical density results for single-hidden-layer networks, as well as quantitative bounds that relate approximation error to network size and smoothness assumptions on target functions. Particular emphasis is placed on depth--width trade-offs and on results demonstrating that deeper architectures can achieve superior parameter efficiency for structured function classes. In addition to standard feedforward neural networks, we also review recent developments on Kolmogorov--Arnold Networks (KANs), which offer an alternative architectural paradigm and whose approximation-theoretic properties have begun to attract significant theoretical attention.

Quality and Security Signals in AI-Generated Python Refactoring Pull Requests

Mohamed Almukhtar, Anwar Ghammam, Hua Ming — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21453v1 Announce Type: new Abstract: As AI agents increasingly contribute to code development and maintenance, there is still limited empirical evidence on the quality and risk characteristics of their changes in real-world projects, particularly for refactoring-oriented contributions. It remains unclear how agent-authored refactoring edits affect maintainability, code quality, and security once merged into GitHub repositories. To address this gap, we conduct an empirical study of Python refactoring pull requests (PRs) from the AIDev dataset. We analyze agentic refactoring PRs using PyQu, an ML-based quality assessment tool for Python, to quantify changes across five quality attributes, and we complement PyQu with domain-independent static analysis (Pylint and Bandit) to measure code quality and security issues before and after each change. Our results show that, on average, agentic commits improve a quality attribute in 22.5% of the studied changes, with usability improving most frequently (36.5%). At the same time, 24.17% of modified files introduce new Pylint issues predominantly convention level violations such as long lines-while 4.7% introduce new Bandit findings. From the observed diffs, we derive a taxonomy of 24 recurring change operations and map them to the lint and security findings they most commonly affect. Despite these mixed outcomes, developer acceptance is high: 73.5% of the analyzed PRs are merged, including cases that introduce new lint or security findings, often alongside the removal of existing issues. Overall, these findings highlight both the promise and current limitations of agentic refactoring, and motivate stronger tool-in-the-loop quality and security gating for AI-driven development workflows.

ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21454v1 Announce Type: new Abstract: We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.

Mitigating Label Bias with Interpretable Rubric Embeddings

Calvin Isley, Johann D. Gaebler, Sharad Goel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21455v1 Announce Type: new Abstract: Statistical decision algorithms are increasingly deployed in domains where ground-truth labels are hard to obtain, such as hiring, university admissions, and content moderation. In these settings, models are typically trained on historical human evaluations -- for example, using past hiring decisions as a proxy for true applicant quality. However, if past evaluations unjustly favor certain groups, models trained on these labels may inherit those biases. To address this problem, we propose basing predictions on rubric embeddings, a representation framework that replaces standard black-box embeddings with features derived from expert-defined criteria that align with the underlying construct of interest. By anchoring predictions to semantically meaningful dimensions, this approach guards against biased proxy signals. We provide both theoretical and empirical evidence that rubric embeddings mitigate label bias under plausible conditions. Empirically, we evaluate our method on a novel dataset of applications to a large master's program. We find that models trained on rubric embeddings reduce group disparities while improving measures of cohort quality. Our results suggest that basing predictions on interpretable, domain-grounded representations offers a practical approach to learning in the presence of biased labels.

Mind the Sim-to-Real Gap & Think Like a Scientist

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21458v1 Announce Type: new Abstract: Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

HITL-D: Human In The Loop Diffusion Assisted Shared Control

Riley Zilka, Sergey Khlynovskiy, Allie Wang, Martin Jagersand — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21460v1 Announce Type: new Abstract: Autonomous manipulation systems have achieved remarkable capabilities, yet the integration of human expertise with diffusion-based policies in shared control remains relatively unexplored. In this paper, we propose Human-In-The-Loop Diffusion (HITL-D), a shared control framework that enhances user performance in multi-step, insertion, and fine manipulation tasks. HITL-D leverages a novel combination of diffusion-based policies and human control to provide autonomous end effector orientation updates conditioned on a scene point cloud and the Cartesian position of the end effector. This approach reduces the number of joystick control axes required, thereby lowering mental workload. In a multi-task user study with 12 participants, HITL-D reduced average task completion times by 40%, decreased perceived workload by 37%, and improved Likert-scale ratings for independence, intuitiveness, and confidence compared to traditional teleoperation methods. These results demonstrate that HITL-D effectively integrates human expertise with autonomous assistance, improving both objective and subjective aspects of teleoperation.

A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions

Pin-Hsun Lee, Harry Leib — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21461v1 Announce Type: new Abstract: Global Navigation Satellite Systems (GNSS) are widely used to provide position, velocity, and timing (PVT) information for various applications, including transportation, location-based communication services, and intelligent agriculture. In urban canyons, high-rise buildings and narrow streets can cause signal obstruction, non-line-of-sight (NLOS) reception, and multipath effects that introduce errors in GNSS pseudorange measurements. Although multi-constellations GNSS effectively increase the number of available satellites, the inclusion of degraded signals can lead to severe positioning errors. This study proposes a machine learning framework for the weighted least squares (WLS) algorithm incorporating activation functions to enhance positioning accuracy. Several signal quality indicators are employed as training features for ensemble learning algorithms to identify poor quality signals by providing quality scores. Then, activation functions are employed to transform the machine learning predicted scores to appropriate weights for WLS positioning. To evaluate the performance of our approach, experiments are conducted using real-world datasets from Hong Kong and Tokyo urban areas. Comparative analysis of activation functions reveals that sigmoid functions consistently yield the greatest improvements with different machine learning algorithms and GNSS constellation configurations. The proposed algorithm demonstrates substantial reductions in positioning errors for both single- and multiconstellation scenarios. Furthermore, our results indicate that the proposed algorithm exhibits strong geographical transferability. The proposed algorithm maintains comparable level of performance when trained on data from other regions with similar levels of urbanization.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21463v1 Announce Type: new Abstract: We present Mem-$\pi$, a framework for adaptive memory in large language model (LLM) agents, where useful guidance is generated on demand rather than retrieved from external memory stores. Existing memory-augmented agents typically rely on similarity-based retrieval from episodic memory banks or skill libraries, returning static entries that often misalign with the current context. In contrast, Mem-$\pi$ uses a dedicated language or vision-language model with its own parameters, separate from the downstream agent, to generate context-specific guidance for complex tasks. Conditioned on the current agent context, the model jointly decides when to produce guidance and what guidance to produce. We train it with a decision-content decoupled reinforcement learning (RL) objective, enabling it to abstain when generation would not help and otherwise produce concise, useful guidance. Across diverse agentic benchmarks spanning web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$\pi$ consistently outperforms retrieval-based and prior RL-optimized memory baselines, achieving over 30% relative improvement on web navigation tasks.

Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution

Weixing Zhang, Bowen Jiang, Rahul Sharma, Regina Hebig, Daniel Str\"uber — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21465v1 Announce Type: new Abstract: In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Guanlong Jiao, Chenyangguang Zhang, Jia Jun Cheng Xian, Zewei Zhang, Renjie Liao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21466v1 Announce Type: new Abstract: Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21467v1 Announce Type: new Abstract: Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21468v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21470v1 Announce Type: new Abstract: Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, an alternative that compiles task descriptions directly into executable code that is free to include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition state requirements that reduce the rate of generating plans with incorrect tool use. Across 5 web applications, JIT-Planner achieves $10.4\times$ speedup and $+28\%$ accuracy over Browser-Use, while JIT-Scheduler achieves $2.4\times$ speedup and $+9\%$ accuracy over OpenAI CUA.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Kaichen Zhou, Zeyang Bai, Xinhai Chang, Mengyu Wang, Paul Liang, Fangneng Zhan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21472v1 Announce Type: new Abstract: View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://anonymous-submission-20.github.io/streaming3D.github.io/.

Is Fixing Schema Graphs Necessary? Full-Resolution Graph Structure Learning for Relational Deep Learning

Yi Huang, Qingyun Sun, Jia Li, Xingcheng Fu, Jianxin Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21475v1 Announce Type: new Abstract: Relational prediction tasks are fundamental in many real-world applications, where data are naturally stored in relational databases (RDBs). Relational Deep Learning (RDL) addresses this problem by modeling RDBs as graphs and applying graph neural networks (GNNs) for end-to-end learning. However, the full-resolution property is commonly adopted as a design principle in graph construction for RDBs to preserve relational semantics, which leads most existing methods to rely on fixed graph structures. In this paper, we propose FROG, a Full-Resolution and Optimizable Graph Structure Learning} framework for RDL that formulates relational structure learning as a learnable table role modeling problem, allowing tables to contribute as nodes and edges in message passing. We further design role-driven message passing mechanisms to capture relational semantics, enabling joint optimization of graph structure and GNN representations. To ensure semantic consistency, we introduce functional dependency constraints that regularize representations across table and entity levels. Extensive experiments demonstrate that our method outperforms existing approaches and reveal how table roles impact downstream tasks, offering new insights into graph construction for RDL

Latent Dynamics for Full Body Avatar Animation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21478v1 Announce Type: new Abstract: Pose-driven full-body avatars built on neural rendering produce high-quality novel views of a captured subject. Yet loose clothing and other dynamic elements deform in ways pose alone cannot explain: the same pose can correspond to many different states, because their motion depends on history, inertia, and contact. Explicit simulation and layered-garment methods can model such dynamics, but they require either a dedicated garment template, which raw multi-view capture does not naturally provide, or a test-time physics simulator with non-trivial runtime cost. A parallel line of work learns data-driven clothing avatars that avoid explicit garment layers. These methods add an auxiliary latent for variation beyond pose; at inference, they fix it, regress it from pose, or retrieve it from training data, without explicitly modeling how the latent evolves with its own dynamics. Additionally, even in everyday motion with loose clothing, existing architectures often struggle to capture fine-grained detail, producing blurry renderings and temporal artifacts. We augment a pose-conditioned 3D Gaussian avatar with a transformer-based decoder and a dynamics residual latent that captures temporal appearance and geometry variation beyond the driving signals. At inference, a learned latent dynamics model evolves the residual latent from a short pose history and the previous latent state. The model decomposes each update into driving, restoring, and dissipative forces, producing temporally coherent, history-dependent rollouts with negligible added cost. Different initial conditions yield diverse yet plausible motion trajectories, and the force decomposition exposes controls such as stiffness. Across nine captured sequences of everyday motion with diverse loose garments, quantitative metrics and a perceptual user study show improved animation quality over recent data-driven baselines.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Basel Shbita, Pengyuan Li, Anna Lisa Gentile — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21479v1 Announce Type: new Abstract: Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21481v1 Announce Type: new Abstract: Recent advances in artificial intelligence (AI) have accelerated the growth of both human-authored and AI-generated research outputs, placing increasing strain on traditional academic publishing systems and challenging the scalability of conference- and journal-centered paradigms amid rising submission volumes, reviewer workload, and venue size. To address these challenges, we explore an AI-era publishing paradigm in which both human and AI scientists participate as authors and readers, and papers evolve through continuous, feedback-driven iteration. We propose AiraXiv, an AI-driven open-access platform built on open preprints, AI-augmented analysis and review, and reader feedback. AiraXiv supports human scientists through an interactive UI and AI scientists through Model Context Protocol (MCP)-based interactions. We validate AiraXiv through real-world deployments, including serving as the submission platform for ICAIS 2025, demonstrating its potential as a fast, inclusive, and scalable research infrastructure for the AI era. AiraXiv is publicly available at https://airaxiv.com.

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21482v1 Announce Type: new Abstract: Deep research, in which an agent searches the open web, collects evidence, and derives an answer through extended reasoning, is a prominent use case for frontier language models. Frontier deep research products score high on existing benchmarks, making it difficult to distinguish their capabilities from current evaluation data alone. We introduce DeepWeb-Bench, a deep research benchmark that is substantially harder than existing benchmarks for the current frontier. Difficulty comes from three properties of the data itself: each task requires massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. We represent these three sources of difficulty as four capability families (Retrieval, Derivation, Reasoning, and Calibration) and report results sliced by family. Every reference answer is accompanied by a source-provenance record with four disclosure levels and cross-source checks where available, making scores easier to audit against the underlying evidence. We evaluate DeepWeb-Bench on nine frontier models and report three findings: (1) retrieval is not the bottleneck, as retrieval failures account for only 12-14% of errors while derivation and calibration failures account for over 70%; (2) strong and weak models fail in qualitatively different ways, with strong models' errors dominated by incomplete derivation and weak models' by hallucinated precision; and (3) models exhibit genuine specialization across domains, with cross-model agreement of only rho = 0.61 and per-case disagreement reaching 18.8 percentage points. The public benchmark release includes the data, rubrics, and evaluation code.

One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration

Chaoyang Wang, Yunhai Tong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21484v1 Announce Type: new Abstract: Discrete diffusion models excel at visual synthesis but rely on slow, iterative decoding. Existing single-step distillation methods attempt to bypass this bottleneck, either by training auxiliary score networks that effectively double compute, or by introducing specialized parameterizations and multi-stage pipelines that fragment optimization. In this paper, we introduce Fixed-Point Distillation (FPD), an end-to-end framework that constructs local correction targets by partially corrupting the student's one-step draft and refining it with a single teacher step. To compute the training objective in a semantically meaningful space, we lift discrete tokens into continuous features and apply a multi-bandwidth drift loss that iteratively accumulates these corrections. To backpropagate through the discrete bottleneck, we employ a straight-through estimator that feeds exact hard-sampled tokens to the teacher and decoder during the forward pass, ensuring that training and inference operate on the same codebook manifold, while routing continuous gradients back to the student logits. This fully differentiable pathway additionally accommodates an optional unconditional adversarial objective to enhance perceptual realism. Evaluations on both class- and text-conditional generation validate the effectiveness of our framework. FPD achieves competitive visual fidelity and structural alignment within a single inference step, narrowing the gap to multi-step teachers while outperforming existing discrete distillation baselines.

EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation

Mansoor Ahmed, Sujin Lee, Umar Khayaz, Murray Patterson — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21485v1 Announce Type: new Abstract: Equivariant graph neural network (GNN) methods for antibody complementarity-determining region (CDR) design achieve the highest sequence recovery but suffer from severe vocabulary collapse. The current best GNN methods over-predict very few amino acids, such as tyrosine and glycine, while ignoring functionally important residues. We trace this failure to GNN encoders learning amino acid distributions de novo from limited structural data, discarding substitution patterns encoded in evolutionary databases. To resolve this, we propose EvoStruct, which bridges a frozen protein language model (PLM) with 3D structural context from an E(3)-equivariant GNN via a cross-attention adapter. Unlike prior PLM-structure adapters for general protein design, EvoStruct targets the vocabulary collapse problem specific to CDR design through progressive PLM unfreezing and R-Drop consistency regularization. On the CHIMERA-Bench dataset, EvoStruct achieves the highest amino acid recovery and lowest perplexity among several antibody design methods, improving sequence recovery by 16% and reducing perplexity by 43% relative to the best GNN baselines, while recovering 2.3x greater amino acid diversity and the highest binding-pair correlation with ground truth.

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Dayal Singh Kalra, Maissam Barkeshli — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21486v1 Announce Type: new Abstract: Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21487v1 Announce Type: new Abstract: Currently, enhancing Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities mainly relies on mixed multi-task training. Due to inherent task conflicts, such strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, merely resulting in a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves performance across all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, transforming diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, pairing diverse reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive enhancements across all three capabilities without any auxiliary operations.

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Benhao Huang, Zhengyang Geng, Zico Kolter — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21488v1 Announce Type: new Abstract: Scaling test-time compute by iteratively updating a latent state has emerged as a powerful paradigm for reasoning. Yet the internal mechanisms that enable these iterative models to generalize beyond memorized patterns remain unclear. We hypothesize that generalizable reasoning arises from learning task-conditioned attractors: latent dynamical systems whose stable fixed points correspond to valid solutions. We formalize this process through Equilibrium Reasoners (EqR), which enable test-time scaling without external verifiers or task-specific priors. EqR scales internal dynamics along two axes: depth, by running more iterations, and breadth, by aggregating stochastic trajectories from multiple initializations. Empirically, gains from test-time scaling are tightly coupled with stronger convergence toward solution-aligned attractors. This attractor perspective allows neural networks to adaptively allocate test-time compute based on task difficulty. While simple cases converge within 1 to 5 iteration steps, harder cases benefit from massive test-time scaling. By unrolling up to the equivalent of 40,000 layers, scalable latent reasoning boosts accuracy from 2.6% for feedforward models to over 99% on Sudoku-Extreme. These results suggest that learned attractor landscapes provide a useful mechanistic lens for understanding scalable reasoning in iterative latent models.

Variance Reduction for Expectations with Diffusion Teachers

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21489v1 Announce Type: new Abstract: Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.

E-PCN: Jet Tagging with Explainable Particle Chebyshev Networks Using Kinematic Features

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.07420v2 Announce Type: cross Abstract: The identification and classification of collimated particle sprays, or jets, are essential for interpreting data from high-energy collider experiments. While deep learning has improved jet classification, it often lacks interpretability. We introduce the Explainable Particle Chebyshev Network (E-PCN), a graph neural network extending the Particle Chebyshev Network (PCN). E-PCN integrates kinematic variables into jet classification by constructing four graph representations per jet, each weighted by a distinct variable: angular separation ($\Delta$), transverse momentum ($k_T$), momentum fraction ($z$), and invariant mass squared ($m^2$). We use the concept of Gradient-weighted Class Activation Mapping (Grad-CAM) to determine which kinematic variables dominate classification outcomes. Analysis reveals that angular separation and transverse momentum collectively account for approximately 76% of classification decisions (40.72% and 35.67%, respectively), with momentum fraction and invariant mass contributing the remaining 24%. Evaluated on the JetClass dataset with 10 signal classes, E-PCN achieves a macro-accuracy of 94.67%, macro-AUC of 96.78%, and macro-AUPR of 86.79%, representing improvements of 2.36%, 4.13%, and 24.88% respectively over the baseline PCN implementation, while demonstrating physically interpretable feature learning.

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.08028v1 Announce Type: cross Abstract: To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.

Network-Based Interventions for HIV Prevention via Cascade-Aware Suppression of Transmission

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20218v1 Announce Type: cross Abstract: Treating and preventing Human Immunodeficiency Virus (HIV) remains a critical global health challenge. While antiretroviral therapy provides a path toward viral suppression -- effectively eliminating an individual's transmission risk -- systemic resource constraints limit the reach of intervention efforts. This work addresses the strategic distribution of intensive resources among virally unsuppressed individuals to minimize the expected cascade of new infections within a transmission network. We formalize this challenge as a novel constrained optimization problem where we have resources to "treat" $k$ out of a set $\mathbf{P}$ of virally unsuppressed individuals, and establish its theoretical connections to existing computational literature. We then propose Cascade-Aware Suppression of Transmission (CAST), a polynomial-time $(\delta, \epsilon)$-approximation algorithm that achieves a $2\sqrt{|\mathbf{P}|}$ approximation ratio by leveraging connections to the Minimum-$k$-Union (MkU) problem and Hoeffding-style concentration bounds. Extensive evaluations on real-world HIV networks demonstrate that CAST outperforms standard public health and computer science baselines. Furthermore, we show that CAST is empirically robust across diverse infectious disease networks, varied edge probability initializations, and settings involving imperfect network data.

Quantum End-to-End Learning for Contextual Combinatorial Optimization

Jaehwan Lee, Changhyun Kwon — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20222v1 Announce Type: cross Abstract: Contextual combinatorial optimization (CCO) plays a critical role in decision-making under uncertainty, yet remains a significant challenge. We present Quantum End-to-End Learning (QEL), the first quantum computing-based end-to-end learning framework for CCO that leverages Quantum Approximate Optimization Algorithms. Inspired by the integration of state preparation and evolution in data re-uploading, we propose a context re-uploading phase-separator that jointly captures the complex relations among contexts, uncertain coefficients, and optimal solutions. This allows a contextual encoder to be seamlessly integrated within a quantum surrogate policy, enabling joint end-to-end training with a stationarity guarantee. Exploiting an optimization-aware structure grounded in physical principles that classical methods cannot readily leverage, our approach demonstrates practicality by directly training on task loss despite the discreteness and nonconvexity, while avoiding calls to NP-hard optimization solvers. QEL empirically achieves competitive performance while requiring substantially fewer parameters than classical benchmarks, highlighting its industrial-level potential for the future quantum era.

God numbers for Graphs, Games and Groups

Z. Adams, M. Z. Cassim, C. Hou, O. Knill, V. Seco Roopnaraine, M. H. Saleem — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20243v1 Announce Type: cross Abstract: We describe and axiomatize finite solitaire puzzles and zero sum sequential games graph theoretically. Zermelo's theorem telling that there is a win for one of the players or a draw follows from the definitions. The god number is a geometric quantity that quantifies the number of moves necessary to solve the puzzle. In the solitaire case, the god number is the minimal distance from the initial state $v$ to the solution space $A$. If $v$ and $A$ are not specified, the god number is the graph diameter. God number computations are related to combinatorial sorting problems and is a NP-complete problem in general even when restricted to concrete sliding problems. In the two-player case, the god number is a minimax critical value: it minimizes the maximal game event length over the set of all strategies. A strategy is a sub-graph of the game graph that contains the initial vertex. The definition is done so that a ``mate in k" chess problem has god number k. As for examples: in the solitaire case, we look at group games like Rubik type problems, transposition games related to sorting, at sliding puzzles like the 15 puzzle or rainbow ball, or the tower of Hanoi. For two-player games, we illustrate the story using examples of small chess games, a small card game or tic-tac-toe type problems.

The NetMob26 Dataset: A High-Resolution Multi-Source View of Public Bus Mobility in Niter\'oi

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20263v1 Announce Type: cross Abstract: The NetMob Data Challenge releases a comprehensive public transportation dataset from Niter\'oi, addressing the lack of high-quality mobility and passenger demand data. Based on operational records from March 2026, the dataset combines four main sources: GPS telemetry from buses, approximately 7.2 million ticketing transactions, auxiliary transit data (routes, stops, and weather), and urban infrastructure and socio-demographic information. Together, these sources provide a detailed view of both transit supply and passenger demand. The data were preprocessed, cleaned, and anonymized to preserve privacy and improve reliability, including the removal of operational inconsistencies and anonymization of passenger identifiers. Access is restricted to challenge participants who accept the Terms and Conditions and sign an NDA. The paper describes the data collection and preprocessing pipeline, dataset organization, and mobility patterns observed in the system. The dataset supports research on topics such as public transportation efficiency, demand forecasting, accessibility analysis, service reliability, and the influence of external factors like weather on urban mobility.

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

Ernest Fokou\'e — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20271v1 Announce Type: cross Abstract: We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA mean squared error is monotonically decreasing in HDI. This provides the first rigorous theoretical explanation for the empirically observed specialization of attention heads. Under a fixed total-dimension budget D = H * d_k, we solve the optimal head-dimension allocation problem, deriving the MSE-minimizing pair (H*, d_k*) from data distribution and regression smoothness. The solution yields a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size, while the optimal number of heads grows nearly linearly with the total budget D. Our framework unifies three strands of prior work: the NW theory of single-head attention, the general weighting theory for ensemble learning, and the decorrelation-variance-reduction isomorphism between biological and computational ensembles. Multi-head attention is the Transformer's instantiation of a universal principle: identical agents plus diversity-enforcing mechanisms yields emergent optimality.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

Gustav Olaf Yunus Laitinen-Fredriksson Lundstr\"om-Imanov — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20279v1 Announce Type: cross Abstract: Generative artificial intelligence is rapidly transforming the supply side of training data: an increasing share of new tokens, images, and structured records is produced by previous-generation models rather than by human originators. Recursive training on such synthetic content induces a measurable and often irreversible loss of distributional fidelity, a phenomenon known as model collapse. We develop the first unified microeconomic theory of synthetic data markets under model collapse. We introduce the Synthetic Data Contamination Equilibrium (SDCE), prove existence and generic uniqueness, derive a welfare decomposition W = W_prod + W_cons - L_coll - L_info, establish a Wasserstein-gradient-flow mean-field collapse limit, prove an impossibility of information-constrained implementation, and obtain closed-form expressions for the welfare-maximizing provenance subsidy s* = KL(q||p)/(2 kappa) and the welfare-maximizing watermark strength w* = (1 - psi) KL(q||p)/(2 kappa psi). We prove an information-theoretic Cramer-Rao lower bound on any provenance estimator using only producer-side observations and show that the Provenance-Market Iterative Retraining (PMIR) algorithm attains this bound up to constants while converging to an epsilon-SDCE in O(epsilon^-2 log T) iterations. A reduced-form OLS estimation on a C4-synthetic benchmark over ten retraining generations yields a collapse-rate coefficient b-hat = 0.181 (HAC s.e. 0.024), within one standard error of the structural prediction 0.183. Calibrated experiments raise generation-ten model quality by 23.1 percent over the unregulated benchmark while lowering the 2-Wasserstein drift on a held-out diversity probe from 0.318 to 0.142. Scaling experiments over generations t in {1,...,10} recover a logarithmic-in-t collapse law log Q_t = log Q_0 - 0.183 t rho^2 with R^2 = 0.962.

The Economics of AI Inference: Inflation Dynamics, Welfare Costs, and Optimal Monetary Policy under the Inference-Cost Phillips Curve

Gustav Olaf Yunus Laitinen-Fredriksson Lundstr\"om-Imanov — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20281v1 Announce Type: cross Abstract: We develop a unified microeconomic and monetary theory of artificial intelligence inference costs and their pass-through to inflation, welfare, and optimal monetary policy. We introduce the Inference-Cost Phillips Curve (ICPC), an augmented New Keynesian Phillips curve in which firm-level marginal costs of producing differentiated goods include a non-trivial AI inference component lambda-bar, and prove a closed-form structural slope kappa*_inf = lambda-bar * kappa, where kappa is the standard Calvo-Yun slope. We derive a welfare-relevant Hicks-Kaldor decomposition of consumer welfare under inference-cost shocks, prove a generalized Taylor principle for the inference-augmented economy, and characterize the optimal monetary policy response coefficient psi*_inf = (1 + phi*rho) * lambda-bar * kappa under commitment. A second-order welfare loss formula closes the model in closed form. We confront the theory with U.S. monthly data 2022:M01-2026:M04 using a two-step GMM estimator with Newey-West HAC standard errors and Hansen J-test, recovering an empirical slope kappa-hat_inf = 0.087 (HAC s.e. 0.021) which lies within one standard error of the structural prediction. A scaling regression over 50 rolling-window subwindows yields b-hat = 0.987 (R^2 = 0.998), consistent with a near-unit-elasticity pass-through. A G7 reduced-form panel with Driscoll-Kraay HAC standard errors yields b-hat^G7 = 0.094 (s.e. 0.026), and a Wald test fails to reject cross-country homogeneity (p = 0.78). The framework provides a single equilibrium scaffold for the joint study of AI inference cost dynamics, monetary policy under generative-AI shocks, and the welfare cost of inference-driven inflation.

Representability-Aware Neural Networks for Reduced Density Matrices: Application to Fractional Chern Insulators

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20326v1 Announce Type: cross Abstract: We develop a representability-aware and interpolable neural network (NN) framework for predicting two-particle reduced density matrices (2-RDMs). The NN incorporates a subset of representability conditions through its architecture and loss function, and can operate on different momentum meshes, enabling evaluating the representability conditions across multiple meshes, which we call interpolated representability condition. The framework can be used either to predict 2-RDMs on large momentum meshes by interpolating exact results from small meshes, or as a variational 2-RDM ansatz optimized by energy minimization on arbitrary meshes. We apply this approach to the fractional Chern insulator in the one-band projected model of twisted bilayer MoTe$_2$ at twist angle $3.89^\circ$ and hole filling $2/3$. Trained on exact-diagonalization (ED) 2-RDMs from meshes with $12$ or $18$ momentum points using six different NN architectures, the best NN is the residual multilayer perceptron, which predicts the $6\times6$ 2-RDM with $97.07\%-98.18\%$ accuracy relative to the ED 2-RDM but predicts an energy $77.353$ meV above ED ground-state energy. We then variationally optimize the NN on several meshes including $6\times6$, predicting a $6\times 6$ energy of just $0.104$ meV below ED while maintaining $98.94\%-98.96\%$ accuracy. Compared with the conventional boundary-point semidefinite programming, which gives an energy $5.560$ meV below ED with $96.40\%-98.94\%$ accuracy, the NN achieves a more accurate energy and similar accuracy while using only less than 1/20 as many parameters. Eventually, we add a symmetric mesh of $48$ momentum points to the variational optimization of the NN, and provide a prediction of the many-body ground-state energy and the many-body quantum metric on that mesh.

Targeting Clause Type Distributions: a Picklock for Random Satisfiability Problems

J. Schwardt, J. C. Budich — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20328v1 Announce Type: cross Abstract: Optimization problems such as the NP-complete 3-SAT provide an important benchmark for the difficult task of finding ground-states in strongly correlated many-body systems with rugged energy landscapes. The study of random 3-SAT problems as Ising spin Hamiltonians in statistical physics has yielded major insights including the existence of a satisfiability phase transition, and the prediction of a critical parameter line of particularly hard instances. Yet, progress on solving those instances has been scarce for several decades. Here, introducing the Target-SAT (TSAT) algorithm, we roughly triple the tractable problem sizes in the hardest regime, with an even greater improvement in a vast range of neighboring regions. By leveraging statistical information hidden in the combinatorial constraints of the problem, TSAT is actively guided in its stochastic local search toward a target within the relevant parameter space. Our analysis also explains why established local search algorithms are limited to relatively small system sizes due to a vast low-energy trap. Furthermore, we characterize the aforementioned critical line in terms of a dominant additional complexity barrier, whose exponential scaling is quickly overcome by TSAT only in the surrounding parameter space. With TSAT, the lead in solving the hardest known random satisfiability problems returns to the realm of stochastic local search algorithms.

Corrected Integrated Laplace Approximation for Bayesian Inference in Latent Gaussian Models

Jinlin Lai, Charles C. Margossian, Daniel R. Sheldon — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20345v1 Announce Type: cross Abstract: Latent Gaussian models (LGMs) are a popular class of Bayesian hierarchical models that include Gaussian processes, as well as certain spatial models and mixed-effect models. Efficient Bayesian inference of LGMs often requires marginalizing out the latent variables. For LGMs with a non-Gaussian likelihood, exact marginalization is not possible and a popular approach is to do approximate marginalization with an integrated Laplace approximation (ILA). Using ILA produces an approximate posterior which, in some settings, can differ significantly from the correct posterior, which impacts downstream applications. We propose an importance sampling scheme to correct the error introduced by ILA. By increasing the number of samples in importance sampling, the posterior with ILA converges to the correct posterior. This idea is realized with various techniques, including pseudo-marginalization, quasi-Monte Carlo and randomized quasi-Monte Carlo. We implement our methods in an automatic differentiation framework to support gradient-based algorithms when doing inference on the hyperparameters. For the latter, we specifically consider the use of Hamiltonian Monte Carlo. We demonstrate the benefits of reduced error in various applied models.

Understanding Deterioration Random Effects for Causal Discovery in Infrastructure Management

Takato Yasuno — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20400v1 Announce Type: cross Abstract: Infrastructure deterioration poses significant challenges for asset management, yet existing approaches rely on population-averaged models that overlook equipment-specific heterogeneity. We present a novel framework that combines Bayesian hierarchical hazard modeling with causal discovery to identify operational patterns that drive heterogeneous deterioration rates in pump equipment. Our approach first estimates pump-specific random effects $u_i$ using GPU-accelerated No-U-Turn Sampling (NUTS), achieving 3--5$\times$ speedup over CPU implementations. We then employ DirectLiNGAM to discover causal relationships between 22 engineered time-series features and deterioration rates, stratified by positive ($u_i > 0$, faster deterioration) versus negative ($u_i \leq 0$, slower deterioration) random effects. Analyzing 112 pumps with 92,861 observations over 650 days, we uncover striking heterogeneity: the negative group exhibits causal effects 400$\times$ larger than the positive group, with standard deviation (std) showing a strong positive causal effect ($+1.515$) on deterioration rates in low-risk equipment. We validate linearity assumptions through NonlinearLiNGAM comparison and demonstrate practical scalability through GPU acceleration. Our findings enable targeted maintenance strategies by revealing that different operational regimes require fundamentally distinct management approaches, advancing predictive maintenance from population-averaged to heterogeneity-aware decision making.

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

Iason Skylitsis, Dimitrios Karkalousos, Ivana I\v{s}gum — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20405v1 Announce Type: cross Abstract: Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.

A Bounded-Confidence Model of Opinion Dynamics with Adaptive Interaction Probabilities

Leila Thompsky (Yolanda), Yuexuan (Yolanda), Wu, Mason A. Porter, Jiajie Luo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20418v1 Announce Type: cross Abstract: Models of opinion dynamics aim to capture how individuals' opinions change when they interact with each other. One well-known model of opinion dynamics is the Deffuant--Weisbuch (DW) model, which is a type of bounded-confidence model (BCM). In the DW model, agents have pairwise interactions, and they are receptive to other agents' opinions when their opinions are sufficiently close to each other. In this paper, we extend the DW model by studying it on networks with heterogeneous and adaptive edge weights between pairs of agents. These edge weights govern the interaction probabilities between the agents and thereby encode the idea that people are more likely to communicate with individuals with whom they have previously compromised or had other positive interactions. We prove theoretical guarantees of our adaptive edge-weighted DW model's convergence properties, the long-time dynamics of its edge weights, and the model's associated ``effective graph", which is a time-dependent subgraph that includes edges only between agents that are receptive to each other's opinions. We support our theoretical results with numerical simulations of our adaptive edge-weighted DW model on a variety of networks and find that including adaptive edge weights yields different qualitative dynamics for different types of networks. In particular, for small confidence bounds, we observe that incorporating adaptive edge weights decreases the convergence time for dense networks but increases the convergence time for sparse networks.

Contradiction Graphs Determine VC Dimension

Jesse Campbell, Daniel Ibaibarriaga, Lev Reyzin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20434v1 Announce Type: cross Abstract: We study the contradiction graphs associated with binary concept classes. For a class $H \subseteq \{0,1\}^X$, the order-$m$ contradiction graph $G_m(H)$ has as vertices the $H$-realizable labeled sequences of length $m$, with two vertices adjacent when the two sequences assign opposite labels to some common domain point. Our main result is that the single graph $G_m(H)$ determines the threshold predicate $\mathrm{VCdim}(H)\ge m$. Consequently, the full sequence $(G_m(H))_{m \ge 1}$ determines the exact VC dimension and, in particular, detects finite versus infinite VC dimension, answering a question posed by Alon et al. (2024).

Sparse Contextual Coupling Reshapes Diffusion Geometry in Multilayer Hypergraphs

Hao Ding, Sanjukta Krishnagopal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20454v1 Announce Type: cross Abstract: Many complex systems combine dense background structure with sparse contextual information. We introduce a diffusion-based framework for analyzing how sparse condition-specific layers reshape diffusion geometry in multilayer hypergraphs. Each layer is represented as a weighted hypergraph, layers are coupled through shared entities, and random walks on the coupled system induce multiscale diffusion distances between nodes. We apply the framework to disease-conditioned gene networks by coupling a dense MSigDB functional gene-set layer to sparse disease-specific DGIdb drug-gene hypergraphs, with disease-associated drugs selected from DDDB and HumanNet-GSP used to define external gene weights. Across Bipolar Disorder, Schizophrenia, Leukemia, and Breast Cancer, the disease-specific layer contains less than 2 percent of genes in the coupled system, yet substantially changes diffusion distances and community structure. Centrality analysis suggests that this disproportionate effect arises because DGIdb-associated genes occupy influential positions in the MSigDB-derived functional network. The resulting diffusion-derived communities are stable under subsampling and show coherent post hoc functional enrichment, including signaling and neurotransmission categories in neuropsychiatric diseases and immune, translational, and metabolic categories in cancer-associated diseases. Community-level comparisons further reveal disease similarities not reducible to direct DGIdb gene overlap, including a Breast Cancer-Schizophrenia relationship consistent with recent biomedical evidence. These results show that sparse contextual layers can induce interpretable nonlocal changes in higher-order network geometry.

Entropy-stable discretizations for the compressible Euler equations using simple adaptive averages

Carlo De Michele, Ayaboe K. Edoh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20472v1 Announce Type: cross Abstract: Entropy stabilization of the compressible Euler system is achieved by adapting the averages that are applied to the density and internal energy variables. The approach achieves non-linear robustness despite the use of simplified symmetric means (e.g., arithmetic, geometric, or harmonic evaluations), including their related expansions for asymptotic entropy conservation. The proposed formulation works via centralized convective terms and can naturally adhere to additional structures of the flow equations such as kinetic-energy- and pressure-equilibrium-preservation.

A New Approach for ARMA Pole Estimation Using Higher-Order Crossings

Timothy I. Salsbury, Ashish Singhal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20483v1 Announce Type: cross Abstract: The paper describes a new method for estimating the poles of an ARMA model using higher-order crossings. The method involves transforming counts of crossing events into estimates of ARMA poles via the autocorrelation domain. An important advantage of the method is that the crossing counts are the only features that need to be stored from the original data. The poles of an ARMA model of a control loop correspond to the roots of the characteristic equation and are thus useful for evaluating control performance.

Platonic Representations in the Human Brain: Unsupervised Recovery of Universal Geometry

Pablo Marcos-Manch\'on, Rishi Jha, Llu\'is Fuentemilla — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20496v1 Announce Type: cross Abstract: The Strong Platonic Representation Hypothesis suggests that representational convergence in artificial neural networks can be harnessed constructively: embeddings can be translated across models through a universal latent space without paired data. We ask whether an analogous geometry can be recovered across human brains. Using fMRI data from the Natural Scenes Dataset, we propose a self-supervised encoder that learns subject-specific embeddings from brain data alone by exploiting repeated stimulus presentations. We show that these independently learned spaces can be translated across subjects using unsupervised orthogonal rotations, without paired cross-subject samples or intermediate model representations. Synchronizing pairwise rotations into a single shared latent space further improves cross-subject retrieval, indicating that subject-specific spaces are mutually compatible with a common coordinate system. These results provide evidence for a shared neural geometry in the human visual cortex: subject-specific fMRI representations are approximately isometric across individuals and can be translated through purely geometric transformations.

Adaptive Multi-Fidelity Structural Optimization under Fluid-Structure Interaction

Aditya Narkhede, Erick Rivas, Kevin Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20501v1 Announce Type: cross Abstract: The design of structures and vehicles subject to fluid-structure interaction (FSI) often requires high-fidelity coupled analysis. While the design variables pertain to the structure, the computational cost is dominated by the fluid solver, making iterative optimization prohibitively expensive. This paper presents an adaptive multi-fidelity optimization method combining high-fidelity FSI analysis with a lightweight surrogate for fluid-induced loads and a decision model that selects between surrogate and high-fidelity fluid evaluations. During optimization, completed FSI analyses incrementally update a non-intrusive surrogate model based on nearest-neighbor search and radial interpolation. A hybrid Lagrangian-Eulerian mapping function is developed to transfer fluid loads between structural designs. The evolution of surface orientation is handled by decomposing the traction vectors into local orthonormal bases. An adaptive Gaussian process regression model is employed to predict surrogate error and quantify uncertainty, allowing risk-aware selection of when coupled analysis is required. As design evaluations cluster near the optimum, the accuracy of the surrogate model naturally improves, thereby reducing the reliance on the fluid solver. It requires no offline training, preserves the high-fidelity structural model in all design evaluations, and ensures that the final design is evaluated by high-fidelity FSI analysis. The fundamental idea is justified theoretically using a simplified model problem, which shows that the leading-order error is a monotonically increasing, concave, and bounded function of the fluid added mass. The framework is demonstrated on two benchmark problems. For shape optimization of a flexible panel under shock loading, results show an $80\%$ reduction in computational cost while maintaining accuracy within $2.3\%$ of fully high-fidelity FSI optimization.

Sample Complexity of Transfer Learning: An Optimal Transport Approach

Haoyang Cao, Xin Guo, Wenpin Tang, Guan Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20545v1 Announce Type: cross Abstract: Transfer learning is an essential technique for many machine learning/AI models of complex structures such as large language models and generative AI. The essence of transfer learning is to leverage knowledge from resolved source tasks for a new target task, especially when the sample size $m$ of the training data for the latter is low. In this work, we rigorously analyze the potential benefit of transfer learning in terms of sample efficiency. Specifically, taking an optimal transport viewpoint of transfer learning, we find that when the data dimension $d$ is higher than $3$, the sample complexity for transfer learning is $O(m^{-(\alpha+1)/d})$, with $\alpha$ indicating the smoothness of the data distribution, as opposed to the $O(m^{-p/d})$ sample complexity for direct learning with $p$ indicating the smoothness of the optimal target model. Our finding theoretically supports a better sample efficiency for transfer learning, when the target task is optimizing over a family of not-so-smooth models (i.e., highly complex networks with the possible use of non-smooth activation functions). Using image classification as an example, we numerically demonstrate the sample efficiency for transfer learning, that is, in the data hungry regime, the model performance can be significantly improved by transfer learning.

Spectral bandits for smooth graph functions with applications in recommender systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20552v1 Announce Type: cross Abstract: Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each recommended item is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for the algorithms where the cumulative regret would not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly in this dimension. Our experiments on real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens nodes evaluations.

Group-Aware Matrix Estimation and Latent Subspace Recovery

Hamza Golubovic, Matthew Shen, Genevera I. Allen, Tarek M. Zikry — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20559v1 Announce Type: cross Abstract: Modern matrix completion problems often involve heterogeneous data whose rows simultaneously belong to many meta-categories, such as demographic and age groups in recommendation systems, or region and recording session labels in neural electrophysiological experiments. Standard low-rank estimators impose a single global latent geometry, which can recover average structure but may smooth away subgroup-specific variation, especially when observations are unevenly distributed across groups. We introduce Group-Aware Matrix Estimation (GAME), a convex estimator for overlapping subgroup-wise low-rank matrix estimation. GAME regularizes category-specific submatrices through overlapping nuclear-norm penalties, allowing related groups to borrow information while preserving local latent structure in a shared coordinate system. We provide finite-sample guarantees for both reconstruction error and subgroup-specific subspace recovery, showing how performance depends on sampling density, subgroup rank, and overlap structure. Experiments on synthetic, recommendation, ecological, and neuroscience datasets show that GAME is most beneficial in structured missingness regimes, where subgroup-aware regularization improves both reconstruction accuracy and latent subspace fidelity. Across these benchmarks, GAME is competitive or best among global low-rank, side-information, and modern imputation baselines, with the largest gains when subgroups exhibit distinct low-rank structure.

Lower Bounds for Advection-Diffusion Equations: An Exploration with AI-Generated Proofs

Chenyang An, Xiaoqian Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20623v1 Announce Type: cross Abstract: We establish explicit lower bounds for advection-diffusion equations in three settings: a polynomial $\dot H^{-1}$ bound for inviscid shears with $u\in L^\infty_t W^{1,1}_y$, a uniform positive lower bound on the mixing scale for diffusive shears, and an exponential $L^2$ bound for rapidly oscillating time-periodic flows. All constants are explicit in the data. The proofs were generated entirely by a multi-agent math proving system, QED, without expert human intervention, serving as a test of AI's capability to produce rigorous mathematics.

Distributed and Decentralized Optimization Algorithms via Consensus ALADIN

Xu Du, Jingzhe Wang, Karl H. Johansson, Apostolos I. Rikos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20638v1 Announce Type: cross Abstract: Distributed optimization has found widespread applications in smart grids, optimal control, and machine learning. This paper studies distributed consensus optimization. We extend the Augmented Lagrangian-based Alternating Direction Inexact Newton (ALADIN) framework to propose Consensus ALADIN (C-ALADIN) with a central coordinator, which directly handles consensus constraints. Our C-ALADIN algorithm admits both a first-order variant and a second-order variant that employs a Hessian approximation, avoiding direct transmission of second-order information while preserving fast local convergence. We then develop a decentralized version of C-ALADIN that operates over directed graphs with quantized communication, using a finite-time coordination protocol. For both versions, we establish global convergence guarantees for convex problems and local convergence guarantees for non-convex problems. For the decentralized case, the iterates converge to a neighborhood of the optimum determined by the quantization level. Numerical results demonstrate that our methods retain fast convergence while substantially reducing communication and computational costs compared to existing decentralized approaches.

Time-Dependent PDE-Constrained Optimization via Weak-Form Latent Dynamics

April Tran, Terry Haut, David Bortz, Youngsoo Choi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20639v1 Announce Type: cross Abstract: Optimization problems constrained by high-dimensional, time-dependent partial differential equations require repeated forward and sensitivity solves, making high-fidelity optimization computationally prohibitive in many-query design and control settings. We present a weak-form latent-space reduced-order modeling framework for accelerating gradient-based PDE-constrained optimization. The proposed approach builds on Weak-form Latent Space Dynamics Identification (WLaSDI), which compresses high-dimensional solution trajectories into a low-dimensional latent representation and identifies parametric latent dynamics using weak-form system identification. By avoiding explicit numerical differentiation of training trajectories, the weak-form improves robustness to noisy data and yields more reliable surrogate dynamics for optimization. We formulate the resulting reduced PDE-constrained optimization problem and derive both direct-sensitivity and adjoint-based gradient expressions for the learned latent dynamics, enabling scalable gradient evaluation with respect to design parameters. The framework is demonstrated on three time-dependent benchmark problems: thermal radiative transfer for optimal hohlraum design, the two-stream instability Vlasov-Poisson system, and the inviscid Burgers equation. Across these examples, WLaSDI produces accurate optimal designs, remains robust under noisy training data, and delivers substantial computational savings, including speedups of up to five orders of magnitude relative to full-order optimization. These results demonstrate that weak-form latent dynamics provide an efficient and noise-robust surrogate foundation for gradient-based optimization of complex time-dependent PDE systems.

AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI

Amirhossein Mohammadi, Hina Tabassum — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20649v1 Announce Type: cross Abstract: Wi-Fi-based human activity recognition (HAR) has emerged as a promising approach for contactless sensing, leveraging channel state information (CSI) collected from wireless transceivers. While existing studies have primarily concentrated on single-user scenarios, real-world deployments often involve multi-user settings where concurrent users' movements induce overlapping CSI patterns that challenge conventional classification methods. To address this limitation, this paper introduces an attention-based multi-user activity recognition (AMAR) framework that formulates HAR as a set prediction problem. The transformer-based architecture in AMAR leverages learnable query embeddings acting as specialized activity detectors, enabling the simultaneous identification of multiple activities from composite CSI representations. Moreover, to address deployment constraints, AMAR is designed in an edge-cloud split architecture form where lightweight convolutional networks on edge devices perform initial feature extraction, followed by residual vector quantization that achieves substantial bandwidth reduction while preserving activity-discriminative information. The cloud component performs final activity prediction through attention-based set matching, enabling the system to handle varying occupancy levels. Across classroom, meeting-room, and empty-room environments, on average AMAR nearly doubles the rate of perfectly predicting all concurrent activities compared to the best baseline. Moreover, it achieves an $F_1$-score of 53.4% compared to 45.6% for the best benchmark, and reduces occupancy estimation error by 74%, while minimizing bandwidth substantially.

Scale-Calibrated Median-of-Means for Robust Distributed Principal Component Analysis

Kisung You — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20681v1 Announce Type: cross Abstract: Distributed principal component analysis (PCA) produces node-level estimates of both a mean vector and a principal subspace. Robustly aggregating these heterogeneous objects requires a relative scale between mean error and subspace error. We study a scale-calibrated median-of-means estimator for this problem using the product geometry of Euclidean space and the Grassmann manifold. A node-level PCA expansion shows that the mean component has the usual linear influence, whereas the subspace component is an eigengap-weighted covariance perturbation. We prove a local reduction showing that the proposed product-manifold median-of-means estimator is asymptotically equivalent to a scaled spatial median of node influence errors. This yields fixed-node non-Gaussian limits, growing-node Gaussian limits with finite-block bias, and an explicit scale-dependent covariance formula. We propose robust block-scale and inference-optimal calibration rules, establish high-probability median-of-means bounds, characterize factorwise bad-node influence, and prove node-bootstrap validity. Simulations and large-scale single-cell RNA-seq data show that scale calibration adapts to eigengap-driven subspace uncertainty and provides a robust distributed PCA summary.

Motion-Robust Deep Reconstruction for Free-Breathing Cardiac Cine MRI

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20687v1 Announce Type: cross Abstract: Conventional cardiac cine MRI relies on breath-hold Cartesian acquisitions, which are vulnerable to motion artifacts and can be uncomfortable or infeasible, particularly for pediatric and other noncompliant patients who cannot reliably hold their breath. Free-breathing radial acquisitions can alleviate these limitations, but robust reconstruction at high acceleration remains challenging due to prominent streak artifacts. To address these limitations, we propose Cine-DL, a clinically oriented framework that couples targeted k-space preprocessing with fast, model-based deep reconstruction. In this pipeline, raw free-breathing radial data undergo retrospective cardiac binning and respiratory gating to resolve cardiac phases and discard motion-corrupted spokes. We then introduce Streak Optimized Coil Compression (SOC), which explicitly preserves cardiac signals while suppressing peripheral interference that typically drives the streak artifacts. The resulting 2D+t cine series is reconstructed with an unrolled network that alternates a ResNet proximal operator with physics-based data consistency updates solved via conjugate gradient. We further employ a memory-efficient training strategy that reduces peak memory usage. We evaluate Cine-DL on free-breathing volunteer data against established baselines (k-t SENSE and iGRASP) and demonstrate clinical translation via hospital deployment on newly acquired patient data. Our experiments show that Cine-DL consistently improves quantitative metrics and visual fidelity, supporting a practical route toward routine, time-sensitive clinical adoption of free-breathing cine MRI.

Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference

Ziang Song, Ying Jin, Emmanuel J. Cand\`es — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20726v1 Announce Type: cross Abstract: Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods is often measured by the false discovery proportion (FDP), defined as the fraction of incorrect selections. Existing approaches typically control the expected value of the FDP, using methods such as the Benjamini-Hochberg procedure. This approach fails to provide high-probability bounds on the realized false discovery proportion and invalidates statistical guarantees if the rejection threshold is selected after inspecting the data. This paper establishes finite-sample, distribution-free upper bounds on the FDP that hold simultaneously over all possible rejection thresholds, enabling arbitrary post hoc selection of the threshold. Simultaneous validity is achieved by constructing a high-probability envelope for the empirical distribution function of null conformal p-values by sampling from their joint distribution. Furthermore, our framework allows practitioners to modulate the envelope's shape, thereby producing tight bounds in rejection regions of primary interest. We use this flexible approach to derive simultaneous FDP upper bounds for both outlier detection and conformal selection. We demonstrate through synthetic and real-data experiments that the resulting bounds are both valid and substantially less conservative than those derived from existing approaches.

Addition Theorems for Real Vector Spherical Harmonics and Explicit Matrix Representations of the Quasi-Periodic Elastic Single Layer Potential

Xin Feng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20736v1 Announce Type: cross Abstract: This paper develops a multipole expansion method for the quasi-periodic elastic single layer potential $\mathcal{S}_D^{\alpha,0}$ associated with the Kelvin tensor in one-dimensional periodic arrays. A key step in this approach is the derivation of translation addition theorems for the real vector spherical harmonics $V_{lm}$, $W_{lm}$, and $X_{lm}$. These addition theorems enable the exact calculation of all matrix entries of $\mathcal{S}_D^{\alpha,0}$ in closed form. By working entirely within the spherical harmonic basis, the proposed analytical method overcomes the poor convergence and mesh-dependent issues commonly caused by the direct surface discretization of weakly singular kernels. Additionally, the involved infinite sums are evaluated exactly using polylogarithm functions, which eliminates the need for series truncation. As an application, the integral equation $\mathcal{S}_D^{\alpha,0}[f]=\varphi$ is reduced to a linear system. This framework is further extended to dimer geometries consisting of two disjoint balls in each cell, where the off-diagonal matrices are explicitly formulated via the Lerch transcendent.

Precision and Privacy in Distributed Quantum Sensing: A Quantum Fisher Information Duality

Farhad Farokhi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20765v1 Announce Type: cross Abstract: We establish a quantum Fisher information (QFI) duality for distributed quantum sensor networks with local phase encoding. For any $N$-qubit probe state, where $N$ denotes the number of sensors, $F_Q(\boldsymbol{w}^\top \boldsymbol{\theta}) + F_Q(\boldsymbol{v}^\top \boldsymbol{\theta}) \leq N$ for all unit orthogonal sensing directions $\boldsymbol{w}$ and $\boldsymbol{v}$, with equality for all equatorial states when $N=2$ and for Greenberger--Horne--Zeilinger (GHZ) states when $N\geq 2$. Heisenberg-limited precision for direction $\boldsymbol{w}$, $F_Q(\boldsymbol{w}^\top \boldsymbol{\theta})=N$, saturates the bound and simultaneously forces zero QFI for all other independent directions. This can be interpreted as the condition for parameter privacy in distributed quantum sensing: attaining Heisenberg-limited precision for the sensing target renders all alternative privacy-intrusive estimations impossible.

Circuits of Quantum Hashing and Quantum Fourier Transform for a Cactus as a Qubit Connectivity Graph

Kamil Khadiev, Ilnur Valeev — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20789v1 Announce Type: cross Abstract: We present a quantum circuit implementation of the quantum hashing algorithm (quantum fingerprinting) for a quantum device with restrictions on the application of two-qubit gates by a qubit connectivity graph. We present an optimization technique for the shallow circuit for quantum hashing in the case of a cactus as a qubit connectivity graph. The algorithm has $O(n^3)$ complexity to build the circuit, where $n$ is the number of qubits and $m$ is the number of connections (edges) in the graph. It is improvement compared to the existing exponential-time algorithm in the case of arbitrary graphs. The algorithm uses solution for the shortest non-simple 1-covering path problem as a subroutine. We present an $O(n^3)$-time solution for this graph-theory problem in the case of a cactus. This result can be interesting independently. The algorithm also used for improving of the quantum circuit for Quantum Fourier Transform.

Precise Asymptotics and Exact Formulas for Tensor Product Energies of Fibonacci Lattices

Melia Haase, Nicolas Nagel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20895v1 Announce Type: cross Abstract: We consider the asymptotics of sums of the form $$ \frac1{F_n^\sigma} \sum_{m = 1}^{F_n-1} \frac{f(m/F_n)}{\left|{\sin(\pi m/F_n)}\right|^\sigma} \frac{f(F_{n-1}m/F_n)}{\left|{\sin(\pi F_{n-1}m/F_n)}\right|^\sigma} $$ where $(F_n)_{n \in \mathbb N} = (1, 1, 2, 3, 5, 8, 13, \dots)$ are the Fibonacci numbers. Such sums appear, for example, in the context of discrepancy theory and numerical integration methods reformulated as energy minimization problems. We show that for parameters $\sigma > 1$ and a large class of functions $f$ the above sum behaves asymptotically like $$ C n + D + O\left((1-\varepsilon)^{n}\right) $$ for some constants $C$ and $D$. These constants can be given via infinite series connected to the Dedekind zeta function over the algebraic number field $\mathbb Q(\sqrt5)$. In special cases we even observe simple closed-form expressions for such sums as above, explicitly proving that $$ \sum_{m=1}^{F_n-1} \frac1{\sin(\pi m/F_n)^2} \frac1{\sin(\pi F_{n-1} m/F_n)^2} = \frac{4n}{75} F_{2n} - \frac{17}{225}F_n^2 - (-1)^n \frac2{15} - \frac19. $$

A Local Valuation Criterion for Quadratic-Permutation Interleaved Zadoff--Chu Sequences

Yutong Zhang, Yaoran Yang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20947v1 Announce Type: cross Abstract: Berggren and Popovi\'c introduced quadratic-permutation-polynomial interleaved Zadoff--Chu sequences and, from exhaustive data, conjectured that all normalized QPP-interleaved Zadoff--Chu sequences are inequivalent to ordinary Zadoff--Chu sequences precisely for prime-power lengths $N=p^n$ with $p>3$ and $n>1$. We give an exact local arithmetic criterion. For a normalized QPP $\pi_{a,b}(k)=ak^2+bk\pmod N$, the interleaved sequence is equivalent, under the standard five CAZAC-preserving operations, to a Zadoff--Chu sequence if and only if, for every prime power $p^\alpha\Vert N$, the valuation of $a$ satisfies \[ \nu_p(a)\ge \begin{cases} 0, & p=2,\ \alpha=1,\\ \alpha-1, & p=2,\ \alpha\ge2,\\ \alpha-1, & p=3,\\ \alpha, & p>3. \end{cases} \] The proof is based on a third finite-difference invariant of the lifted Zadoff--Chu phase, namely \[ \Delta^3\bigl((ak^2+bk+\varepsilon_N+2q)(ak^2+bk)\bigr) =12a(2ak+3a+b). \] As a consequence, the conjectured prime-power boundary is not correct: the exact non-vacuous condition for all nonzero normalized QPPs to be inequivalent to Zadoff--Chu sequences is that $N$ is odd, $9\nmid N$, and $p^2\mid N$ for at least one prime $p\ge5$. In particular, $N=75=3\cdot5^2$ is the smallest non-prime-power counterexample to the conjectured ``only if'' direction. A second corollary records the corresponding statement for irreducible QPPs.

A note on hypergraphs with asymmetric Ramsey properties

Vladimir Sviridenkov — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20949v1 Announce Type: cross Abstract: Let $r,\ell\geq2$ be integers. Given $r$-graphs $G$ and $F_1,\dots,F_\ell$, we write $G\to(F_1,\dots,F_\ell)$ if every $\ell$-edge-coloring of $G$ yields a monochromatic copy of $F_i$ in the $i$th color for some $1\leq i\leq\ell$, otherwise we write $G\not\to(F_1,\dots,F_\ell)$. The Ramsey number $R(F_1,\dots,F_\ell)$ is the minimum number of vertices in an $r$-graph $G$ satisfying $G\to(F_1,\dots,F_\ell)$. In this note we prove that for any integers $t_1\geq\dots\geq t_\ell>r$, there exists an $r$-graph $G$ such that $G\not\to(K^{(r)}_{t_1},\dots,K^{(r)}_{t_\ell})$ but $G\to(K^{(r)}_s,K^{(r)}_{t_\ell-1})$, where $s=R(K^{(r)}_{t_1},\dots,K^{(r)}_{t_\ell})-1$. This extends recent work by Mendon\c{c}a, Miralaei, and Mota, who established the statement for $r=2$.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

Shubhada Agrawal, Siva Theja Maguluri, Martin Zubeldia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20999v1 Announce Type: cross Abstract: We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. When the Martingale-difference noise is bounded, we show that the tail of the error can be sub-Gaussian, sub-Weibull, or something lighter than any Pareto but heavier than any Weibull, depending on the step size sequence and on whether the random operator is almost surely contractive, almost surely non-expansive, or expansive with positive probability. Our analysis relies on a novel Lyapunov function involving the moment-generating function of the solution to a Poisson equation, together with an auxiliary projected algorithm. We complement the upper bounds with worst-case examples showing that qualitatively sharper bounds are impossible. We further study the case of unbounded Martingale-difference noise when the average operator is contractive, and the step sizes are of order $1/k$. In this setting, we show that if the random operator is almost surely non-expansive, then the error tail is at most three times heavier than the noise tail, whereas if the random operator is expansive with positive probability, then the error may have substantially heavier tails. These results are obtained through a novel black-box truncation argument that reduces the unbounded-noise setting to the bounded-noise case.

Treewidth of the $n \times n$ toroidal grid

Tatsuya Gima, Hiraku Morimoto, Yuto Okada, Yota Otachi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21015v1 Announce Type: cross Abstract: In this paper, we show that the treewidth of the $n \times n$ toroidal grid is $2n-1$ for all $n \ge 5$. This closes the gap between the previously known upper bound of $2n-1$ (Ellis and Warren, DAM 2008) and the lower bound of $2n-2$ (Kiyomi, Okamoto, and Otachi, DAM 2016). To establish the matching lower bound, we construct a bramble of maximum order by utilizing maximum components obtained after removing $2n-1$ vertices. Our construction relies on the vertex-isoperimetric properties of the infinite grid to establish tight lower bounds on neighborhood sizes, combined with a careful analysis of balls of radius $n/2-1$ and their boundaries to overcome structural obstructions when $n$ is even.

Microwave Linear Analog Computer (MiLAC)-Aided MIMO Radar Sensing: Transmit Beamforming Design and DoA Estimation

Ziang Liu, Zheyu Wu, Bruno Clerckx — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21020v1 Announce Type: cross Abstract: Multiple-input multiple-output (MIMO) radar has waveform diversity and large spatial degrees of freedom (DoFs), making it attractive for high-resolution sensing. Scaling MIMO radar to massive arrays can further improve sensing performance, but it also increases hardware cost, power consumption, and digital processing complexity. The microwave linear analog computer (MiLAC) can tackle these challenges by moving linear operations from the digital domain to the analog domain. MiLAC has shown promising benefits for communications in recent studies and this paper identifies its potential for radar sensing. Specifically, we consider both MiLAC-aided transmit beamforming and receiver-side two-dimensional discrete Fourier transform (2D-DFT)-based direction-of-arrival (DoA) estimation. For transmit beamforming, we formulate a weighted Cramer Rao bound (CRB) minimization problem under lossless and reciprocal MiLAC constraints and propose a penalty dual decomposition (PDD)-based iterative algorithm to address the non-convex problem. We further prove that MiLAC-aided and fully-digital beamforming achieve the same CRB. For receiver processing, we show that the 2D DFT can be implemented by a lossless reciprocal MiLAC, which enables analog-domain DoA estimation without digital optimization. Numerical results confirm the theoretical finding and show that the MiLAC-aided approach achieves the same CRB and DoA estimation performance as the fully-digital benchmark. Meanwhile, hardware cost and power consumption are reduced because only low-resolution DACs are required at the transmitter, while RF chains and ADCs are eliminated at the receiver. Moreover, performing the 2D DFT in the analog domain eliminates all digital DFT operations for DoA estimation.

Conditioning Gaussian Processes on Almost Anything

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21041v1 Announce Type: cross Abstract: Gaussian processes (GPs) offer a principled probabilistic model over functions, but exact inference is restricted to the linear-Gaussian regime. We establish an explicit equivalence between GPs and a class of linear diffusion models, recasting predictive sampling as an ODE with closed-form Gaussian dynamics and a likelihood-dependent guidance term that admits a simple Monte Carlo approximation. In the linear-Gaussian setting, we recover standard GP conditioning exactly; beyond conjugacy, the same machinery handles any conditioning statement admitting point-wise likelihood evaluation -- including non-linear physics, and, for the first time, natural language via large language models. Whitening isolates the irreducible non-Gaussian dynamics, minimising Wasserstein-2 transport cost and eliminating numerical stiffness. The result is a general-purpose GP inference scheme requiring no bespoke derivations. Together, these results provide a general mechanism for incorporating the full richness of real-world knowledge as conditioning information, opening a new frontier for the probabilistic modelling of real-world problems.

Towards transistor-based quantum computing

Y. -D. Liu, X. Xu, Q. -R. Wang, D. -S. Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21045v1 Announce Type: cross Abstract: In this work, we propose and study in depth a universal quantum computing architecture based on a quantum construction of transistors. Our teleportation-based quantum transistors, called ``telesistors'', are ground states of systems with symmetry-protected topological order, hence suppress certain noises and provide high-fidelity Clifford gates without the need for active error correction. This physical protection, quantified by the string order parameters, serves as a low-overhead foundation upon which conventional fault-tolerant encoding (e.g., with stabilizer codes) can be built to achieve universal quantum computation. This architecture shows rich connections with current known architectures, and some desirable merits especially compared with the qubit-based circuits regarding modularity, integration, and program storage. Our study shows that it is plausible to realize it with current technology in the near future.

Exponential Lower Bounds for the Pfaffian Number of Graphs

Priyanshu Pant, Ranveer Singh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21077v1 Announce Type: cross Abstract: Galluccio--Loebl and Tesler showed that the perfect-matching polynomial of a graph embedded in an orientable surface of genus $g$ can be written as a linear combination of at most $4^g$ Pfaffians. We show that, in general, exponentially many Pfaffians are necessary. More precisely, among all graphs of orientable genus at most $g$, the maximum possible Pfaffian number is at least $(8/3)^g$. This lower bound holds even for connected matching-covered graphs. We also obtain exponential lower bounds for the Pfaffian number of complete bipartite graphs, and hence for even complete graphs, improving asymptotically on a recent linear lower bound of Junchaya, Lucchesi, and Miranda.

AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21083v1 Announce Type: cross Abstract: Materials discovery and biomedical translation increasingly require models that can reason across composition, processing, structure, biological response, manufacturability, safety, and governance constraints. Existing materials and biomedical data ecosystems are powerful but remain poorly coupled for AI-guided discovery. Here we present AIMBio, a conceptual framework for an AI-native, FAIR, and governance-aware decision layer that links materials provenance, biomedical context, knowledge graphs, uncertainty-aware machine learning, and human-in-the-loop active learning. The framework formulates biomedical-materials discovery as constrained multi-objective optimization under uncertainty and introduces practical requirements for metadata, model documentation, risk-tiered governance, evaluation metrics, and phased implementation. To make the roadmap testable, we add a minimum viable prototype specification and a worked pilot for AI-guided nanomaterials for drug delivery. AIMBio is positioned as exploratory and preclinical discovery infrastructure, not as clinical decision-support software; any clinical or regulated-device use would require separate validation, change control, and regulatory review. The central contribution is a publishable platform blueprint for converting fragmented materials and biomedical records into auditable, experimentally actionable, and translationally responsible discovery workflows.

Scaled Graph Bounding Techniques for Reset Systems

Timo de Groot, Maurice Heemels, Tom Oomen, Sebastiaan van den Eijnden — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21119v1 Announce Type: cross Abstract: Reset systems can overcome fundamental limitations of linear time-invariant control. The recently introduced notion of scaled (relative) graphs provides a promising framework for developing graphical analysis and design tools for reset systems, in line with widely adopted loopshaping methods for linear systems. The aim of this paper is to derive techniques for over-bounding the scaled graph of reset systems, and obtain insights in their accuracy. We exploit connections between quadratic dissipativity and scaled graphs to recast the over-bounding problem as the search for piecewise quadratic storage functions. Using specific sampling techniques, we reveal a fundamental limitation of general scaled graph approximation methods that are based on quadratic dissipativity.

Combinatorial manifolds and Kleene's theorem, homotopically

Yorgo Chamoun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21142v1 Announce Type: cross Abstract: We give a general method to build categories of combinatorial manifolds, i.e. categories of combinatorial objects satisfying some local property at every "point", as coreflective subcategories of categories of relational presheaves. To do this, we crucially rely on unique factorization systems, and we can interpet our technique as a way of building a model category whose cofibrant objects are exactly the combinatorial manifolds. We then illustrate the usefulness of this point of view by two applications. First we build a category of euclidean precubical sets, i.e. precubical sets that locally look like a grid (of some fixed dimension), and show that it is coreflective in the category of relational precubical sets. This is the combinatorial analog of eulidean locally ordered spaces and the blowup construction from directed topology. Secondly, we show how to give an abstract proof of Kleene's theorem from automata theory by defining "manifold automata" that behave well with respect to concatenation.

Experimental detection of inclusions for the time-harmonic elastic wave equation

Sarah Eberle-Blick, Jochen Moll — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21158v1 Announce Type: cross Abstract: We are concerned with the reconstruction of inclusions in elastic bodies based on measurements from a laboratory experiment. In doing so, we solve the inverse problem of the time-harmonic elastic wave equation, in contrast to the stationary wave equation and the corresponding lab experiment proposed earlier in Eberle and Moll (2021). The investigation of the harmonic problem leads to a better reconstruction compared to the stationary one. Since we deal with real measurement data, we have to take into account, that those measurements always include measurement errors, so that we have to handle noisy data. Thus, we consider the linearized monotonicity method for noisy data and introduce a modified version of this method. Based on this, we reconstruct the inclusions numerically.

A Rigorous, Tractable Measure of Model Complexity

Oskar Allerbo, Thomas B. Sch\"on — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21167v1 Announce Type: cross Abstract: An accurate assessment of a model's complexity is crucial for topics such as interpretation, generalization, and model selection. However, most existing complexity measures either rely on heuristic assumptions or are computationally prohibitive. In this paper, we present a mathematically rigorous yet easy-to-compute measure of model complexity that is based on the similarities between the model gradients across inputs. It is thus well-defined for any parametric model, but also for kernel-based non-parametric models. We prove that our measure of complexity generalizes model-specific complexity measures such as polynomial degree (for polynomial regression), kernel length scale (for Mat\'ern kernels), number of neighbors (for k-nearest neighbors), number of splits (for decision trees), and number of trees (for random forests). We also use our measure to obtain new insights into the double descent phenomenon for random Fourier features, random forests, neural networks, and gradient boosting.

Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21213v1 Announce Type: cross Abstract: In this work, we present quantum reinforcement learning (RL) as a solution strategy for process synthesis problems. Building on our prior work, we develop a generalized framework that formally poses process synthesis as a Markov decision process and introduces quantum-enhanced RL algorithms to solve it with improved scalability. Earlier implementations of quantum-based RL for process synthesis were limited by qubit requirements, which scaled poorly with problem complexity. This work overcomes this challenge by introducing state encoding algorithms to decouple qubit requirements from problem size. A classical RL-based solution strategy is used as a baseline to benchmark the quantum algorithms under identical training conditions. All algorithms are evaluated across a flowsheet synthesis problem of increasing unit counts to analyze their performance and scalability. Results show that all approaches are capable of identifying the optimal flowsheet designs in small design spaces. For moderate-scale unit counts, quantum approaches demonstrate competitive performance on a per-episode basis and improved efficiency on a per-parameter basis versus the classical RL benchmark. This work provides a foundation for future quantum computing applications within process systems engineering, establishes a controlled benchmark for comparing classical and quantum algorithms, and shows that the proposed quantum variants remain competitive for the process synthesis problem examined in this work.

Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment

Shuaida He, Liwen Chen, Long Feng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21217v1 Announce Type: cross Abstract: Low-rank adaptation (LoRA) has emerged as a powerful tool for parameter-efficient fine-tuning of large language models (LLMs). This paper studies LoRA under a federated learning setting, enabling collaborative fine-tuning across clients while preserving parameter efficiency. We focus on a highly heterogeneous regime in which clients share only partial structure and a substantial subset may be contaminated. We propose Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a contamination-aware framework that relies only on preliminary local estimators. Its formulation applies broadly, from linear regression to neural network and LLM modules, whenever local adaptation can be represented by matrix-valued updates. CLAIR recovers the shared LoRA subspace and detects contaminated clients via a structured low-rank plus block-sparse decomposition. We prove exact recovery of the shared LoRA subspace in the noiseless case, stable recovery under preliminary estimation error, and consistent collaborative-set recovery under mild separation conditions. We further quantify the gain from CLAIR refinement: it reduces off-subspace estimation error through cross-client averaging while preserving client-specific variation within the shared LoRA subspace, thus improves over local fine-tuning whenever this oracle gain outweighs the costs of subspace estimation and benign-client heterogeneity. Empirically, we demonstrate the benefits of CLAIR by fine-tuning a Transformer architecture on a text-copying task. The results show accurate contamination detection and improved benign-client performance compared with local fine-tuning and non-robust federated averaging.

Artificial Intelligence Reshapes Microwave Photonics

Peng Li, Xihua Zou, Jia Ye, Wei Pan, Lianshan Yan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21224v1 Announce Type: cross Abstract: As a rapidly emerging interdisciplinary field that intrinsically integrates microwave and photonics, microwave photonics (MWP) provides disruptive solutions to overcome the fundamental bandwidth of conventional electronic systems. By exploiting the inherently ultra-wide bandwidth and low-loss characteristics of photonic technologies, MWP enables the generation, transmission, processing, and detection of microwave, millimeter-wave, and terahertz signals. Representative breakthroughs include fully photonic microwave radar systems, photonic analog-to-digital converters with bandwidth up to 320 GHz, and photonic wireless communication systems achieving data rate as high as 616 Gbit/s. Meanwhile, the rapid growth of artificial intelligence (AI) is reshaping scientific research, engineering, and daily life in unprecedented ways, such as AI for science/engineering and AI co-scientist/assistant. Correspondingly, AI is profoundly reshaping MWP in all aspects, ranging from signal generation, transmission to signal processing and detection. AI has revolutionized the design, simulation, fabrication, testing, deployment, and maintenance of MWP systems, delivering autonomous operation and exceptional efficiency beyond traditional systems. Motivated by these developments, this Review Paper provides the first comprehensive overview of AI-enabled MWP, systematically summarizing the state-of-the-art advances and presenting insights for both the academic community and the broader public.

Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi, hessian and vesselness filters for multimodal vessel segmentation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21251v1 Announce Type: cross Abstract: A retinal vessel analysis is a procedure that can be used as an assessment of risks to the eye. This work proposes an unsupervised multimodal approach that improves the response of the Frangi filter, enabling automatic vessel segmentation. We propose a filter that computes pixel-level vessel continuity while introducing a local tolerance heuristic to fill in vessel discontinuities produced by the Frangi response. This proposal, called the local-sensitive connectivity filter (LS-CF), is compared against a naive connectivity filter to the baseline thresholded Frangi filter response and to the naive connectivity filter response in combination with the morphological closing and to the current approaches in the literature. The proposal was able to achieve competitive results in a variety of multimodal datasets. It was robust enough to outperform all the state-of-the-art approaches in the literature for the OSIRIX angiographic dataset in terms of accuracy and 4 out of 5 works in the case of the IOSTAR dataset while also outperforming several works in the case of the DRIVE and STARE datasets and 6 out of 10 in the CHASE-DB dataset. For the CHASE-DB, it also outperformed all the state-of-the-art unsupervised methods.

Theoretical guidelines for annealed Langevin dynamics in compositional simulation-based inference

Camille Touron, Gabriel V. Cardoso, Julyan Arbel, Pedro L. C. Rodrigues — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21253v1 Announce Type: cross Abstract: Compositional score-based approaches to simulation-based inference (SBI) approximate the posterior over a shared parameter given $n$ independent observations by aggregating individually learned posterior scores: currently, there are two main propositions of such methods (Geffner et al. (2023), Linhart et al. (2026)). As the resulting composite score does not correspond to the score of any distribution along the forward diffusion path of the true multi-observation posterior, sampling from it via a reverse SDE leads to an irreducible bias. Annealed Langevin dynamics provides a principled alternative: it treats the composite score as the genuine score of a sequence of tractable bridging densities and samples from them in succession. When properly tuned, it could lead to a controllable bias. However, its hyperparameters, namely step sizes, the number of steps per level, and the number of annealing levels, have so far been chosen empirically. We derive Wasserstein bounds for annealed Langevin with approximate scores and translate them into explicit decision rules for these hyperparameters that guarantee a prescribed sampling accuracy, while highlighting different theoretical aspects of each composite score formulation. In the Gaussian setting, we obtain closed-form expressions for all relevant quantities and prove that the bridging densities of Linhart et al. (2026) consistently admit larger step sizes and require fewer total Langevin steps than those of Geffner et al. (2023). Furthermore, we show empirically that the tuning obtained in the Gaussian setting generalizes to more complex problems, thus providing a well-understood and theoretically grounded starting point for practitioners using compositional score-based approaches.

Large-Step Training Dynamics of a Two-Factor Linear Transformer Model

Krishnakumar Balasubramanian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21292v1 Announce Type: cross Abstract: Gradient-flow analyses show that simplified linear transformers can learn the in-context linear-regression algorithm, but they do not explain the finite-step behavior of gradient descent at large learning rates. Motivated by empirical work on high-learning-rate transformer instabilities and by the cubic-map phase diagram for quadratic regression, we study an exactly reducible one-prompt linear-transformer training problem. After normalization, the dynamics reduce to a two-factor product map with an effective step-size parameter $\mu$. On the balanced slice, this map recovers the known scalar cubic transition from monotone convergence to catapult convergence, periodic and chaotic bounded nonconvergence, and divergence. We then analyze the full two-dimensional system and show that, for $0<\mu<2$, it has an explicit invariant Chebyshev ellipse separating forward-invariant regions; this ellipse carries off-balanced chaotic dynamics but is transversely repelling, while balanced scalar attractors can be transversely attracting. These results show that large constant learning rates can change the training attractor of the learned transformer rather than merely accelerating convergence: beyond sharp stability thresholds, finite-step training may settle into cycles, bounded chaos, or divergence instead of a single in-context linear-regression solution. We also discuss the consequences for mini-batch gradient descent based training methods.

Bitcoin's Power Law: Weak Structure, Strong Forecasts

Carlos Baquero, Raquel Menezes — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21316v1 Announce Type: cross Abstract: Bitcoin's price has been described as following a power law (PL) in time, $P \sim t^{\beta}$ with $\hat\beta \approx 5.7$ over 2010-2026. We test this claim using the Clauset-Shalizi-Newman protocol applied to Bitcoin's tail-relevant distributional series, and develop three principled time-domain adaptations of the protocol. We find that (i) the distributional power law is rejected on UTXO balances and daily |returns|, with lognormal preferred decisively; (ii) the fitted time-domain exponent varies by nearly a factor of three across reasonable shifts of the time origin -- it is not specification-robust in the sense required for a shift-invariant structural reading; (iii) standard residual diagnostics and scale-invariance tests proposed in earlier work cannot distinguish a power law from a multi-component sigmoid stack fit to the same data; (iv) Bitcoin price stands apart in a cross-asset comparison spanning Bitcoin on-chain metrics and traditional asset classes: it is the only series in the nine-series in-sample test where no single-component growth curve improves on the power law, and the quarterly $K=3$ wave-stability bootstrap rejects the PL+AR(1) null on Bitcoin at $p = 0.015$ (strict 15% CV threshold) -- a clear cross-asset separation, although not a Bonferroni-robust rejection; and (v) walk-forward Diebold-Mariano evaluation against ten candidates -- including standard time-series baselines (RW with drift, auto-ARIMA, ETS, local-linear-trend) -- shows the in-sample winner (multi-sigmoid) is among the worst long-horizon forecasters, while the simple power law dominates 12-24 month horizons against every standard baseline at $p < 0.05$, precisely because it does not commit to specific wave shapes. The fit-prediction tradeoff is the practical counterpart of the descriptive findings.

Stimulus symmetries can confound representational similarity analyses

Farhad Pashakhanloo, Jacob A. Zavatone-Veth — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21324v1 Announce Type: cross Abstract: What can representational similarity matrices (RSMs) tell us about a neural code? As the popularity of these summary statistics grows, so too does the need for a more complete characterization of their properties. Here, we show that symmetries in network inputs can confound RSM-based analyses. Stimulus symmetries render many representations functionally equivalent, but these different configurations can lead to different RSMs. These different RSMs reflect qualitatively different representational geometries. We show that stochastic gradient descent or energetic regularization can generate sparse, drifting codes, leading in turn to drifting RSMs. Moreover, we demonstrate that these phenomena are present in networks trained to encode image data, where the symmetry is latent. Our results illustrate the challenges inherent in comparing nonlinear neural codes, when functionally-equivalent representations are not related by a simple rotation.

Semiparametric Efficient Bilevel Gradient Estimation

Fares El Khoury, Houssam Zenati, Nathan Kallus, Michael Arbel, Aur\'elien Bibaut — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21341v1 Announce Type: cross Abstract: Functional bilevel methods estimate a lower-level function and plug it into a hypergradient, but this plug-in gradient can retain first-order bias when the lower-level problem is learned nonparametrically. To remove this bias, we develop a semiparametric debiasing theory for population bilevel gradients based on the efficient influence function. This perspective leads to a cross-fitted orthogonal hypergradient estimator for which we establish asymptotic normality together with uniform control over the outer parameter. Under quadratic losses, the estimator reduces to a simple doubly robust score based on conditional mean nuisances. On synthetic bilevel benchmarks with known ground truth, the method tracks the oracle efficient-gradient benchmark and improves over plug-in functional hypergradients and regularized kernel bilevel baselines.

Beyond Nonlinear Small-Gain Design: DADS with Partial-State Feedback

Iasson Karafyllis, Miroslav Krstic — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21344v1 Announce Type: cross Abstract: Eduardo Sontag and coauthors studied Input-to-Output Stability (IOS) and the output asymptotic gain property. These notions changed control theory and recently had an impact on robust adaptive control through the Deadzone-Adapted Disturbance Suppression (DADS) control scheme. Moreover, recently the notion of IOS was extended to systems described by Partial Differential Equations (PDEs). In this work, we celebrate Eduardo Sontag by combining DADS and IOS for PDEs: we study the partial-state regulation problem for a scalar Ordinary Differential Equation (ODE) which is interconnected with a possibly infinite-dimensional system. In such a case the DADS control scheme can allow an escape from the requirements of the small-gain theorem that is mainly used for partial-state feedback. We show the design procedure of partial-state DADS controllers and we prove robust regulation even in the presence of external inputs (disturbances) without assuming knowledge of any disturbance/parameter bounds. The DADS controller is applied to three different cases of the interconnection of an ODE with an almost completely unknown: (a) heat PDE, (b) transport PDE, and (c) wave PDE with viscous damping. We show that the same DADS controller can achieve robust regulation in all three cases.

Linear Functional Testing with General Loadings in Sparse Regression: Separation Rates and Computational Barriers

Jie Xie, Dongming Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21360v1 Announce Type: cross Abstract: We study the problem of testing $H_0: \xi^\top\beta=t_0$ in high-dimensional sparse linear regression with Gaussian random design and unknown design covariance. The loading vector $\xi$ is arbitrary, and the exact sparsity level $k$ is unknown but bounded by a known value $k_u$. Tests are required to control Type I error uniformly over the $k_u$-sparse null, while power is evaluated against $k$-sparse alternatives. We construct a computationally efficient mixed test that gives an upper bound on the adaptive separation distance and establish an information-theoretic lower bound calibrated to the magnitude profile of $\xi$. In the ultra-sparse regime $k_u\lesssim \sqrt n/\log p$, these bounds characterize the adaptive separation rate up to logarithmic factors for arbitrary $\xi$. In the moderately sparse regime $\sqrt n/\log p\ll k_u\lesssim n/\log p$, these bounds match for several classes of loading vectors but may differ in general. In this regime, we further prove a low-degree lower bound that matches the upper bound up to logarithmic factors. This provides evidence that improving on the rate of the mixed test, if statistically possible, may be computationally hard. For flat sparse loadings, we complement this evidence with a polynomial-time reduction from sparse CCA. Finally, we examine how information about the design covariance affects the adaptive separation rate in two settings. Under a sparse signed-spiked covariance model, the information-theoretic lower bound is attainable up to logarithmic factors by a computationally inefficient procedure, while the low-degree lower bound and sparse-CCA reduction continue to apply, providing evidence for a statistical-computational gap. When the design covariance is known and diagonal, the adaptive separation rate takes the same form as in the ultra-sparse regime.

Memorisation, convergence and generalisation in generative models

Antoine Maillard, Sebastian Goldt — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21402v1 Announce Type: cross Abstract: Generative neural networks learn how to produce highly realistic images from a large, but finite number of examples - or do they simply memorise their training set? To settle this question, Kadkhodaie, Guth, Simoncelli and Mallat (ICLR '24) trained diffusion models independently on disjoint subsets of a dataset and showed that they converge to nearly the same density when the number of training images is large enough. This result raises two basic questions: how much data do you need for convergence, and what does convergence capture about learning the data distribution? Here, we address these questions by providing an exact analytical characterisation of the transition from memorisation to generalisation in linear generative models. We find that these models memorise at small load, while convergence emerges continuously when the number of samples is linear in the input dimension. Strikingly, we find that convergence is insensitive to recovery of the principal latent factors of the data, which are recovered in a sharp transition. After extending our approach to data with power-law spectra, we find the same distinction between convergence and latent recovery in our experiments with convolutional denoisers and in the data of Kadkhodaie et al. We thus show that generalisation in generative models decomposes into at least two distinct objectives: matching the bulk of the data distribution and recovering the principal latent factors. These objectives correspond to two different distances between true and learnt data distribution, and only the first one is captured by convergence.

Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment

Alim Igilik — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21437v1 Announce Type: cross Abstract: Standard approaches to forecasting the weekly number of earthquakes on a spatial grid rely on the Poisson distribution with a single global dispersion assumption. We show that this assumption is systematically violated in seismic data from Central Asia (2010-2024), where a likelihood-ratio test with boundary correction strongly rejects the Poisson hypothesis (p < 10^{-179}). The main contribution of this work is the EarthquakeNet architecture, which provides an endogenous per-cell estimate of the overdispersion parameter alpha via a neural network (spatial embeddings + MLP), without explicit spatial covariance specification. In contrast to existing negative binomial regression approaches in seismological forecasting, which typically assume a single global alpha, the proposed per-cell formulation allows the model to identify spatial heterogeneity in seismic clustering and to construct probabilistic risk-aware alerts via quantiles of the predicted distribution. A walk-forward evaluation (2018-2023) over four systems shows an 8.6 percent reduction in mean pinball deviation (MPD) relative to a negative binomial GLM baseline. The strongest improvements are observed in the tail regime (Y >= 5), where the continuous ranked probability score (CRPS) of the proposed model is 12.5 percent lower than that of the baseline, indicating improved calibration in extreme-event forecasting.

Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction

Tilman Tr\"oster, David Mirkovic, Veronika Oehl, Arne Thomsen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.21483v1 Announce Type: cross Abstract: Precise measurement of the kinematic Sunyaev-Zel'dovich (kSZ) effect - a probe of the large-scale distribution of baryonic matter, a key observable for cosmological inference - requires accurate reconstruction of galaxy velocities from spectroscopic surveys. The signal-to-noise ratio (SNR) of kSZ measurements scales directly with the correlation coefficient $r$ between reconstructed and true velocities. We introduce Velocityformer, an equivariant graph transformer architecture designed to match the specific symmetry of the observational data. While the underlying physics is equivariant with respect to translations and rotations, observational effects break this symmetry due to the preferred line-of-sight direction. Matching the model's inductive bias to the data's broken symmetry consistently improves performance across all model sizes and training volumes, with Velocityformer improving $r$ by 35% over the standard linear theory baseline and outperforming ML baselines at every data volume. By matching the model's inductive bias to the data and conditioning on the physics-based long-wavelength solution, Velocityformer is highly data-efficient, training to high accuracy on as few as 4 low-fidelity simulations, and generalises zero-shot across input geometry, cosmological parameters, and galaxy sample. On high-fidelity simulated galaxy catalogues, this yields a 30% improvement in $r$ over the physical baseline, directly translating to the same SNR gain on observational data.

Automatically Learning Construction Injury Precursors from Text

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier — Thu, 21 May 2026 00:00:00 -0400

arXiv:1907.11769v4 Announce Type: replace Abstract: In light of the increasing availability of digitally recorded safety reports in the construction industry, it is important to develop methods to exploit these data to improve our understanding of safety incidents and ability to learn from them. In this study, we compare several approaches to automatically learn injury precursors from raw construction accident reports. More precisely, we experiment with two state-of-the-art deep learning architectures for Natural Language Processing (NLP), Convolutional Neural Networks (CNN) and Hierarchical Attention Networks (HAN), and with the established Term Frequency - Inverse Document Frequency representation (TF-IDF) + Support Vector Machine (SVM) approach. For each model, we provide a method to identify (after training) the textual patterns that are, on average, the most predictive of each safety outcome. We show that among those pieces of text, valid injury precursors can be found. The proposed methods can also be used by the user to visualize and understand the models' predictions.

AI-based Prediction of Independent Construction Safety Outcomes from Universal Attributes

Henrietta Baker, Matthew R. Hallowell, Antoine J. -P. Tixier — Thu, 21 May 2026 00:00:00 -0400

arXiv:1908.05972v3 Announce Type: replace Abstract: This paper significantly improves on, and finishes to validate, an approach proposed in previous research in which safety outcomes were predicted from attributes with machine learning. Like in the original study, we use Natural Language Processing (NLP) to extract fundamental attributes from raw incident reports and machine learning models are trained to predict safety outcomes. The outcomes predicted here are injury severity, injury type, body part impacted, and incident type. However, unlike in the original study, safety outcomes were not extracted via NLP but were provided by independent human annotations, eliminating any potential source of artificial correlation between predictors and predictands. Results show that attributes are still highly predictive, confirming the validity of the original approach. Other improvements brought by the current study include the use of (1) a much larger dataset featuring more than 90,000 reports, (2) two new models, XGBoost and linear SVM (Support Vector Machines), (3) model stacking, (4) a more straightforward experimental setup with more appropriate performance metrics, and (5) an analysis of per-category attribute importance scores. Finally, the injury severity outcome is well predicted, which was not the case in the original study. This is a significant advancement.

PREF: Phasorial Embedding Fields for Compact Neural Representations

Binbin Huang, Xinhao Yan, Anpei Chen, Shenghua Gao, Jingyi Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2205.13524v4 Announce Type: replace Abstract: We present an efficient frequency-based neural representation termed PREF: a shallow MLP augmented with a phasor volume that covers significant border spectra than previous Fourier feature mapping or Positional Encoding. At the core is our compact 3D phasor volume where frequencies distribute uniformly along a 2D plane and dilate along a 1D axis. To this end, we develop a tailored and efficient Fourier transform that combines both Fast Fourier transform and local interpolation to accelerate na\"ive Fourier mapping. We also introduce a Parsvel regularizer that stables frequency-based learning. In these ways, Our PREF reduces the costly MLP in the frequency-based representation, thereby significantly closing the efficiency gap between it and other hybrid representations, and improving its interpretability. Comprehensive experiments demonstrate that our PREF is able to capture high-frequency details while remaining compact and robust, including 2D image generalization, 3D signed distance function regression and 5D neural radiance field reconstruction.

The Score-Difference Flow for Implicit Generative Modeling

Romann M. Weber — Thu, 21 May 2026 00:00:00 -0400

arXiv:2304.12906v5 Announce Type: replace Abstract: Implicit generative modeling (IGM) aims to produce samples of synthetic data matching the characteristics of a target data distribution. Recent work (e.g. score-matching networks, diffusion models) has approached the IGM problem from the perspective of pushing synthetic source data toward the target distribution via dynamical perturbations or flows in the ambient space. In this direction, we present the score difference (SD) between arbitrary target and source distributions as a flow that optimally reduces the Kullback-Leibler divergence between them. We apply the SD flow to convenient proxy distributions, which are aligned if and only if the original distributions are aligned. We demonstrate the formal equivalence of this formulation to denoising diffusion models under certain conditions. We also show that the training of generative adversarial networks includes a hidden data-optimization sub-problem, which induces the SD flow under certain choices of loss function when the discriminator is optimal. As a result, the SD flow provides a theoretical link between model classes that individually address the three challenges of the "generative modeling trilemma" -- high sample quality, mode coverage, and fast sampling -- thereby setting the stage for a unified approach.

AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

Junsol Kim, Byungkyu Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2305.09620v4 Announce Type: replace Abstract: Nationally representative surveys track public opinion, yet they ask only a limited set of questions each year, limiting its potential to capture historical changes. To fill this gap, we develop a large language model (LLM)-based framework for predicting missing responses in repeated cross-sectional surveys by incorporating embeddings for questions, respondents, and survey periods. We introduce two new applications of LLMs to survey research: retrodiction (predicting year-level missing opinions) and unasked opinion prediction (predicting entirely missing opinions). Using data from the 1972-2021 General Social Surveys, our LLM-based models perform strongly in retrodicting masked GSS opinions through cross-validation and public opinions measured by other organizations in years when the GSS did not ask them. These capabilities enable us to recover missing trends and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. However, performance remains modest for unasked opinion prediction. We show when our models outperform established benchmarks, examine which opinions and and respondents are more predictable, and evaluate whether our approach reduces LLMs' tendency to homogenize predicted responses. Our study demonstrates that LLMs and surveys can mutually enhance each other: LLMs broaden survey potential, while surveys calibrate LLMs for simulating human opinions.

On the reconstruction of bandlimited signals from random samples quantized via noise-shaping

Rohan Joy, Felix Krahmer, Alessandro Lupoli, Radha Ramakrishnan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2306.15758v3 Announce Type: replace Abstract: Noise-shaping quantization techniques are widely used for converting bandlimited signals from the analog to the digital domain. They work by ``shaping" the quantization noise so that it falls close to the reconstruction operator's null space. We investigate the compatibility of two such schemes, specifically $\Sigma\Delta$ quantization and distributed noise-shaping quantization, with random samples of bandlimited functions. Suppose $R>1$ is a real number and assume that $\{x_i\}_{i=1}^m$ is a sequence of i.i.d random variables uniformly distributed on $[-\tilde{R},\tilde{R}]$, where $\tilde{R}>R$ is appropriately chosen. We show that by using a noise-shaping quantizer to quantize the (randomly sign flipped) values of a real-valued $\pi$-bandlimited function $f$ at $\{x_i\}_{i=1}^m$, a function $f^{\sharp}$ can be reconstructed from these quantized values such that $\|f-f^{\sharp}\|_{L^2[-R, R]}$ decays with high probability as $m$ and $\tilde{R}$ increase. This decay holds uniformly over all bandlimited $f$. We emphasize that the sample points $\{x_i\}_{i=1}^m$ are completely random, that is, they have no predefined structure, which makes our findings the first of their kind.

Mercer Large-Scale Kernel Machines from Ridge Function Perspective

Karol Dziedziul, Sergey Kryzhevich, Pawe{\l} Wieczy\'nski — Thu, 21 May 2026 00:00:00 -0400

arXiv:2307.11925v3 Announce Type: replace Abstract: To present Mercer large-scale kernel machines from a ridge function perspective, we recall the results by Lin and Pinkus from {\it Fundamentality of ridge functions}. We consider the main result of the recent paper by Rachimi and Recht, 2008, {\it Random features for large-scale kernel machines} from the Approximation Theory point of view. We study which kernels could be approximated by a sum of products of cosine functions with arguments depending on $x$ and $y$ and present the obstacles of such an approach. The results of this article are applied to Image Processing by procedure "one-vs-rest".

On the Suboptimality of GP-UCB under Polynomial Effective Optimism

Wenjia Wang, Xiaowei Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2312.01386v2 Announce Type: replace Abstract: Gaussian process upper confidence bound (GP-UCB) is widely used for sequential optimization of expensive black-box functions. Although many upper bounds on its cumulative regret have been established in the literature, whether GP-UCB is minimax optimal remains open. We study this question through the effective optimism level, defined as the product of the exploration coefficient and the regularization parameter in kernel ridge regression. Under a uniform confidence assumption, we prove a new regret lower bound for GP-UCB with Mat\'ern kernels. The bound shows that polynomial growth of the effective optimism level, up to logarithmic factors, rules out the minimax-optimal regret rate. Since this is the regime covered by most existing analyses, our result identifies a concrete obstacle to proving minimax optimality for standard GP-UCB. More broadly, it suggests that the gap between current upper bounds and minimax lower bounds may reflect a real limitation of the algorithm, not only of the analysis.

Chernoff Information as a Privacy Constraint for Adversarial Classification and Membership Advantage

Ay\c{s}e \"Unsal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2403.10307v5 Announce Type: replace Abstract: This work investigates a privacy metric based on Chernoff information motivated by its importance in characterizing the optimal classifier's performance. Adversarial classification centers on minimizing the probability of error when deciding between two classes in the binary setting. Classical hypothesis testing treats false alarm and mis-detection probabilities separately, resulting in asymmetric optimal error exponents. Here, we instead characterize the relationship between $\varepsilon-$differential privacy (DP), the optimal error exponent of one error probability conditioned on the other, and the optimal average error exponent. Thus, we re-derive Chernoff DP in connection with $\varepsilon-$DP using the Radon-Nikodym derivative and establish its relationship with Kullback-Leibler (KL) DP to prove that Chernoff DP is sandwiched between the two. We then present numerical evaluations demonstrating that Chernoff information outperforms the KL divergence as a function of the privacy parameter, particularly in capturing the impact of adversarial attacks under Laplace mechanisms. Finally, we upper bound the adversary's advantage in membership inference attacks based on Chernoff DP and numerically compare its performance with existing bounds. We re-derive Chernoff DP in connection with $\varepsilon-$DP using the Radon-Nikodym derivative, and prove its relation with KL-DP. Subsequently, we present numerical evaluation results, which demonstrates that Chernoff information outperforms KL divergence as a function of the privacy parameter $\varepsilon$ and the impact of the adversary's attack in Laplace mechanisms. Lastly, we introduce a new upper bound on adversary's membership advantage in membership inference attacks using Chernoff DP and numerically compare its performance with existing alternatives based on $(\varepsilon,\delta)-$DP in the literature.

TrimCaching: Parameter-sharing Edge Caching for AI Model Downloading

Thu, 21 May 2026 00:00:00 -0400

arXiv:2404.14204v5 Announce Type: replace Abstract: Next-generation mobile networks are expected to facilitate fast AI model downloading to end users. By caching models on edge servers, mobile networks can deliver models to end users with low latency, resulting in a paradigm of edge model caching. In this paper, we develop a novel model placement framework, called parameter-sharing model caching (TrimCaching). TrimCaching exploits the key observation that a wide range of AI models, such as convolutional neural networks or large language models, can share a significant proportion of parameter blocks containing reusable knowledge, thereby improving storage efficiency. To this end, we formulate a parameter-sharing model placement problem to maximize the cache hit ratio in multi-edge wireless networks by balancing the fundamental tradeoff between storage efficiency and service latency. We show that the formulated problem is a submodular maximization problem with submodular constraints, for which no polynomial-time approximation algorithm exists. To tackle this challenge, we study an important special case, where a small fixed number of parameter blocks are shared across models, which often holds in practice. In such a case, a polynomial-time algorithm with a $\left(1-\epsilon\right)/2$-approximation guarantee is developed. Subsequently, we address the original problem for the general case by developing a greedy algorithm. Simulation results demonstrate that the proposed TrimCaching framework significantly improves the cache hit ratio compared with state-of-the-art content caching without exploiting shared parameters in AI models.

Optimizing Navigational Graph Queries

Thomas Mulder, George Fletcher, Nikolay Yakovets — Thu, 21 May 2026 00:00:00 -0400

arXiv:2406.05417v2 Announce Type: replace Abstract: We study the optimization of navigational graph queries, i.e., queries which combine recursive and pattern-matching fragments. Current approaches to their evaluation are not effective in practice. Towards addressing this, we present a number of novel powerful optimization techniques which aim to constrain the intermediate results during query evaluation. We show how these techniques can be planned effectively and executed efficiently towards the first practical evaluation solution for complex navigational queries on real-world workloads. Indeed, our experimental results show several orders of magnitude improvement in query evaluation performance over state-of-the-art techniques on a wide range of queries on diverse datasets.

E2GS: Event Enhanced Gaussian Splatting

Hiroyuki Deguchi, Mana Masuda, Takuya Nakabayashi, Hideo Saito — Thu, 21 May 2026 00:00:00 -0400

arXiv:2406.14978v2 Announce Type: replace Abstract: Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at https://github.com/deguchihiroyuki/E2GS.

TRAM: Test-Time Risk Adaptation with Mixture of Agents

Mohamad Fares El Hajj Chehade, Amrit Singh Bedi, Amy Zhang, Hao Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2408.08812v2 Announce Type: replace Abstract: Deployed reinforcement learning agents often face safety requirements that are specified only after training, such as new hazard maps, revised risk thresholds, or behavioral alignment constraints. We study zero-update deployment-time adaptation, where a fixed library of risk-neutral source policies is reused under a newly specified reward-risk tradeoff. We propose TRAM (Test-Time Risk Adaptation via Mixture of Agents), a source-scored composition rule that evaluates each source policy under the target reward and an occupancy-based deployment risk, then selects actions using risk-adjusted source scores. Unlike training-time risk-sensitive methods tied to a fixed surrogate such as return variance, TRAM supports spatial barrier exposure, divergence from a reference behavior, and local volatility risks specified at test time. We explicitly characterize TRAM as a surrogate method: it does not solve the full occupancy-control problem of the stitched policy, but admits a measurable source-hull mismatch term connecting source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting show that TRAM reduces deployment risk while preserving reward, without requiring any parameter updates at test time.

Optimization Hyper-parameter Laws for Large Language Models

Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei — Thu, 21 May 2026 00:00:00 -0400

arXiv:2409.04777v4 Announce Type: replace Abstract: Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations, correctly identify the best-performing schedule family in all five evaluated out-of-family settings, and detect training divergence with F1 = 0.92.

Personalized Weight Loss Management through Wearable Devices and Artificial Intelligence

Thu, 21 May 2026 00:00:00 -0400

arXiv:2409.08700v2 Announce Type: replace Abstract: Early detection of chronic and Non-Communicable Diseases (NCDs) is crucial for effective treatment during the initial stages. This study explores the application of wearable devices and Artificial Intelligence (AI) in order to predict weight loss changes in overweight and obese individuals. Using wearable data from a 1-month trial involving around 100 subjects from the AI4FoodDB database, including biomarkers, vital signs, and behavioral data, we identify key differences between those achieving weight loss (>= 2% of their initial weight) and those who do not. Feature selection techniques and classification algorithms reveal promising results, with the Gradient Boosting classifier achieving 84.44% Area Under the Curve (AUC). The integration of multiple data sources (e.g., vital signs, physical and sleep activity, etc.) enhances performance, suggesting the potential of wearable devices and AI in personalized healthcare.

A lattice Boltzmann method for Biot's consolidation model of linear poroelasticity

Stephan B. Lunowa, Barbara Wohlmuth — Thu, 21 May 2026 00:00:00 -0400

arXiv:2409.11382v2 Announce Type: replace Abstract: Biot's consolidation model is a classical model for the evolution of deformable porous media saturated by a fluid and has various interdisciplinary applications. While numerical solution methods to solve poroelasticity by typical schemes such as finite differences, finite volumes or finite elements have been intensely studied, lattice Boltzmann methods for poroelasticity have not been developed yet. In this work, we propose a novel semi-implicit coupling of lattice Boltzmann methods to solve Biot's consolidation model in two dimensions. To this end, we use a single-relaxation-time lattice Boltzmann method for reaction-diffusion equations to solve the Darcy flow and combine it with a recent pseudo-time multi-relaxation-time lattice Boltzmann scheme for quasi-static linear elasticity. We employ a multi-grid method for the latter scheme to achieve quasi-optimal computational cost. For the coupling between the equations, we develop a centered update scheme, that incorporates both explicit and semi-implicit contributions. The numerical results demonstrate that naive (explicit or semi-implicit) coupling schemes lead to instabilities when the poroelastic system is strongly coupled. However, the newly developed centered coupling scheme is stable and accurate in all considered cases, even for the Biot--Willis coefficient being one. Furthermore, the numerical results for Terzaghi's consolidation problem and a two-dimensional extension thereof highlight that the scheme is even able to capture discontinuous solutions arising from instantaneous loading.

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Stephanie Brandl, Oliver Eberle — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.03296v4 Announce Type: replace Abstract: Instruction-tuned LLMs are able to provide \textit{an} explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a \textit{good} explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

Hongyuan Lu, Wai Lam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.04155v2 Announce Type: replace Abstract: How to defend large language models (LLMs) from generating toxic content is an important research area. Yet, most research focused on various model training techniques to remediate LLMs by updating their weights. A typical related research area is safety alignment. This however is often costly and tedious and can expose the model to even more problems such as catastrophic forgetting if the trainings are not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, namely \textbf{Tox}ic Subword \textbf{Prun}ing (ToxPrune) to prune the subword contained by the toxic words from BPE in trained LLMs. In contrast to the previous work that demonstrates pruning BPE tokens as harmful to the task of machine translation, we surprisingly found its usefulness in preventing toxic content from being generated on LLMs. Fortunately, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on the task of dialogue response generation obviously. We surprisingly found that ToxPrune can even obviously improve official Llama-3.1-6B in the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful for both remediating toxic LLMs and improving non-toxic LLMs on the task of dialogue response generation.\footnote{We plan to release the resources to facilitate future work.}

Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.15761v4 Announce Type: replace Abstract: Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

Testing Support Size More Efficiently Than Learning Histograms

Renato Ferreira Pinto Jr., Nathaniel Harms — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.18915v4 Announce Type: replace Abstract: Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\epsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\epsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\epsilon \log n} \log(1/\epsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\epsilon \log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

Hongyuan Lu, Zixuan Li, Wai Lam — Thu, 21 May 2026 00:00:00 -0400

arXiv:2411.01141v2 Announce Type: replace Abstract: There are two shortages in the current Large Language Models (LLMs) era. The first is short of multilingual models, where most LLMs are English-centric and performance is limited on multilingual reasoning. The second is the place of external knowledge to be used, where most retrieved knowledge is prepended to the user queries (maybe sub-optimal). This paper presents a novel and simple yet effective method called \textbf{D}ictionary \textbf{I}nsertion \textbf{P}rompting (\textbf{DIP}). When providing a non-English prompt, DIP looks up a word dictionary and inserts words' English counterparts into the middle of the prompt for LLMs. It then enables better translation into English and better English model thinking steps which leads to obviously better results. We experiment with 10 to 200 languages from FLORES-200.\footnote{The number of languages varies on the datasets, and we experiment with 200 languages on GSM8K as in Appendix} Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from the existing 4 English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, the place for injecting the dictionary plays an important factor in the performance gains, and we found that interleaving the dictionary with the original words gives a better performance compared to prepending/appending the dictionary, under the same dictionary constructed.

Fast Byzantine Total Order Broadcast

Matteo Monti, Martina Camaioni, Pierre-Louis Roman — Thu, 21 May 2026 00:00:00 -0400

arXiv:2412.14061v2 Announce Type: replace Abstract: This paper presents Flutter, the first Byzantine Total Order Broadcast implementation with a broadcast-to-delivery latency of $2\Delta + \epsilon$ time units, $\Delta$ being the message delay and $\epsilon$ an arbitrarily small constant margin, when all processes are correct, the network is synchronous, hence local clocks are well-synchronized. Under the same conditions, state-of-the-art protocols require at least $3\Delta$ time units in practical deployments where clients differ from servers. We prove Flutter's good-case latency is quasi-optimal, meaning it cannot be improved upon by any finite amount. Flutter is deterministic, leaderless, and signature-free hence quantum-resilient; it assumes partial synchrony and at least $5f + 1$ servers, where $f$ bounds the number of faults. Under the hood, Flutter builds upon Blink, a novel Binary Consensus implementation with Representative Validity, whose fast path enables decisions in $\Delta$ time units when all correct servers propose the same value.

Protecting Cryptographic Libraries against Side-Channel and Code-Reuse Attacks

Rodothea Myrsini Tsoupidi, Elena Troubitsyna, Panos Papadimitratos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2412.19310v2 Announce Type: replace Abstract: Cryptographic libraries, an essential part of cybersecurity, are shown to be susceptible to different types of attacks, including side-channel and memory-corruption attacks. In this article, we examine popular cryptographic libraries in terms of the security measures they implement, pinpoint security vulnerabilities, and suggest security improvements in their development process.

Towards the Anonymization of the Language Modeling

Antoine Boutet, Lucas Magnana, Juliette S\'en\'echal — Thu, 21 May 2026 00:00:00 -0400

arXiv:2501.02407v3 Announce Type: replace Abstract: Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models fine-tuned and specialized on sensitive data can memorize and then expose and regurgitate personal information. This paper presents a privacy-preserving language modeling approach to address the problem of language models anonymization, and thus promote their sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model, and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model that avoids the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches using a medical dataset and compared them against different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.

STEC-Net: A Spatiotemporal Graph Neural Framework for Community Discovery in Dynamic Social Networks

Yingnan Xu, Shuangshuang Chu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2501.12208v4 Announce Type: replace Abstract: Community discovery is a central problem in the analysis of dynamic social networks. Traditional community discovery methods mainly focus on the formation and dissolution of links between nodes, and therefore often fail to capture the richer spatial structure and temporal dependency underlying network evolution. To address this limitation, we propose STEC-Net, a spatiotemporal graph neural framework for community discovery in dynamic social networks. STEC-Net integrates spatial structure and temporal dynamics within a unified embedding architecture. First, Graph Convolutional Networks (GCNs) are used to learn snapshot-level node representations from network topology. To adapt the spatial encoder to structural evolution, a GRU-based weight evolution mechanism is introduced to update the GCN parameters over time. Then, a second Gated Recurrent Unit (GRU) is employed to model temporal dependencies across snapshot embeddings and to learn spatiotemporal node representations. Finally, a Self-Organizing Map (SOM) is applied to the learned embeddings to cluster nodes and infer their community affiliations. Experiments on four types of dynamic networks show that STEC-Net consistently outperforms traditional community discovery methods in terms of purity, normalized mutual information, homogeneity, and completeness. These results demonstrate that STEC-Net can effectively uncover evolving community structures in dynamic social networks.

SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks

Thu, 21 May 2026 00:00:00 -0400

arXiv:2501.15151v5 Announce Type: replace Abstract: Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low energy consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the energy consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.

Induction and Recursion Principles in a Higher-Order Quantitative Logic for Probability

Giorgio Bacci, Rasmus Ejlers M{\o}gelberg — Thu, 21 May 2026 00:00:00 -0400

arXiv:2501.18275v3 Announce Type: replace Abstract: Quantitative logic reasons about the degree to which formulas are satisfied. This paper studies the fundamental reasoning principles of higher-order quantitative logic and their application to reasoning about probabilistic programs and processes. We construct an affine calculus for $1$-bounded complete metric spaces and the monad for probability measures equipped with the Kantorovich distance. The calculus includes a form of guarded recursion interpreted via Banach's fixed point theorem, useful, e.g., for recursive programming with processes. We then define an affine higher-order quantitative logic for reasoning about terms of our calculus. The logic includes novel principles for guarded recursion, and induction over probability measures and natural numbers. We illustrate the expressivity of the logic by a sequence of case studies: Proving upper limits on bisimilarity distances of Markov processes, showing convergence of a temporal learning algorithm and of a random walk using a coupling argument.

Proportional Selection in Networks

Georgios Papasotiropoulos, Oskar Skibski, Piotr Skowron, Tomasz W\k{a}s — Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.03545v2 Announce Type: replace Abstract: We address the problem of selecting $k$ representative nodes from a network, aiming to achieve two objectives: identifying the most influential nodes and ensuring the selection proportionally reflects the network's diversity. We propose two approaches to accomplish this, analyze them theoretically, and demonstrate their effectiveness through a series of experiments.

Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han — Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.03752v5 Announce Type: replace Abstract: Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://epsilog.github.io/SISL.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.12120v3 Announce Type: replace Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Ensemble RL through Classifier Models: Enhancing Risk-Return Trade-offs in Trading Strategies

Zheli Xiong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.17518v2 Announce Type: replace Abstract: This paper presents a comprehensive study on the use of ensemble Reinforcement Learning (RL) models in financial trading strategies, leveraging classifier models to enhance performance. By combining RL algorithms such as A2C, PPO, and SAC with traditional classifiers like Support Vector Machines (SVM), Decision Trees, and Logistic Regression, we investigate how different classifier groups can be integrated to improve risk-return trade-offs. The study evaluates the effectiveness of various ensemble methods, comparing them with individual RL models across key financial metrics, including Cumulative Returns, Sharpe Ratios (SR), Calmar Ratios, and Maximum Drawdown (MDD). Our results demonstrate that ensemble methods consistently outperform base models in terms of risk-adjusted returns, providing better management of drawdowns and overall stability. However, we identify the sensitivity of ensemble performance to the choice of variance threshold {\tau}, highlighting the importance of dynamic {\tau} adjustment to achieve optimal performance. This study emphasizes the value of combining RL with classifiers for adaptive decision-making, with implications for financial trading, robotics, and other dynamic environments.

END: Early Noise Dropping for Efficient and Effective Context Denoising

Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.18915v3 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.

Where Linux Breaks Under Radiation: A Cross-Architecture Kernel-Level Characterization of Proton-Induced Failures in COTS SoCs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.03722v4 Announce Type: replace Abstract: Linux is increasingly deployed in Low Earth Orbit on commercial off the shelf systems on chip that were not designed for space radiation. Ionizing particles can trigger single event functional interrupts that crash the kernel without warning. Prior work mainly measured board level cross sections, leaving unclear which Linux subsystems fail and how a single upset propagates into an operating system wide failure across architectures, stress conditions, and irradiation conditions. We address this gap by subjecting three Linux platforms to proton irradiation in the 20 to 58 MeV range: a Raspberry Pi Zero 2W with a 40 nm planar ARM Cortex A53, an NXP i MX 8M Plus with a 14 nm FinFET ARM Cortex A53, and an OrangeCrab ECP5 FPGA hosting a VexRiscV RV32I soft core at 40 nm. Through kernel log forensics, we trace all 133 observed Linux failures, most of which have not been previously reported, to their originating kernel handlers. Failure profiles differ sharply across nodes. On the two 40 nm platforms, memory management and driver handlers account for 67 to 78% of events, while on the 14 nm SoC approximately 90% of failures funnel through a single eMMC storage path, comprising 56% filesystem failures and 34% driver failures. This shows that a SEFI susceptible peripheral can strongly dictate system reliability. The 14 nm SoC also shows roughly an order of magnitude lower Linux SEFI cross section, although irradiation geometry and DRAM exposure differences preclude isolating the contribution of process scaling. Reconstructed propagation chains show that faults can cascade through up to six kernel subsystems before terminal failure in severe events. Rather than motivating blanket redundancy, these results identify the kernel subsystem boundaries where radiation induced faults originate, enabling targeted mitigations for hardening COTS Linux systems for orbit.

Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.08292v5 Announce Type: replace Abstract: Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.

A $(2+\varepsilon)$-Approximation Algorithm for Metric $k$-Median

Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.10972v2 Announce Type: replace Abstract: In the classical NP-hard metric $k$-median problem, we are given a set of $n$ clients and centers with metric distances between them, along with an integer parameter $k\geq 1$. The objective is to select a subset of $k$ open centers that minimizes the total distance from each client to its closest open center. In their seminal work, Jain, Mahdian, Markakis, Saberi, and Vazirani presented the Greedy algorithm for facility location, which implies a $2$-approximation algorithm for $k$-median that opens $k$ centers in expectation. Since then, substantial research has aimed at narrowing the gap between their algorithm and the best achievable approximation by an algorithm guaranteed to open exactly $k$ centers. During the last decade, all improvements have been achieved by leveraging their algorithm or a small improvement thereof, followed by a second step called bi-point rounding, which inherently increases the approximation guarantee. Our main result closes this gap: for any $\epsilon >0$, we present a $(2+\epsilon)$-approximation algorithm for $k$-median, improving the previous best-known approximation factor of $2.613$. Our approach builds on a combination of two algorithms. First, we present a non-trivial modification of the Greedy algorithm that operates with $O(\log n/\epsilon^2)$ adaptive phases. Through a novel walk-between-solutions approach, this enables us to construct a $(2+\epsilon)$-approximation algorithm for $k$-median that consistently opens at most $k + O(\log n{/\epsilon^2})$ centers. Second, we develop a novel $(2+\epsilon)$-approximation algorithm tailored for stable instances, where removing any center from an optimal solution increases the cost by at least an $\Omega(\epsilon^3/\log n)$ fraction. Achieving this involves a sampling approach inspired by the $k$-means++ algorithm and a reduction to submodular optimization subject to a partition matroid.

Control, Optimal Transport and Neural Differential Equations in Supervised Learning

Minh-Nhat Phung, Minh-Binh Tran — Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.15105v4 Announce Type: replace Abstract: We study the fundamental computational problem of approximating optimal transport (OT) equations using neural differential equations (Neural ODEs). More specifically, we develop a novel framework for approximating unbalanced optimal transport (UOT) in the continuum using Neural ODEs. By generalizing a discrete UOT problem with Pearson divergence, we constructively design vector fields for Neural ODEs that converge to the true UOT dynamics, thereby advancing the mathematical foundations of computational transport and machine learning. To this end, we design a numerical scheme inspired by the Sinkhorn algorithm to solve the corresponding minimization problem and rigorously prove its convergence, providing explicit error estimates. From the obtained numerical solutions, we derive vector fields defining the transport dynamics and construct the corresponding transport equation. Finally, from the numerically obtained transport equation, we construct a neural differential equation whose flow converges to the true transport dynamics in an appropriate limiting regime.

MulFSA: Multi-level Financial Sentiment Analysis Framework for Bond Market

Yiwei Liu, Junbo Wang, Lei Long, Xin Li, Ruiting Ma, Yuankai Wu, Xuebin Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.02429v2 Announce Type: replace Abstract: Existing financial sentiment analysis methods often fail to capture the multi-faceted nature of risk in bond markets due to their single-level approach and neglect of temporal dynamics. We propose Multi-level Financial Sentiment Analysis (MulFSA) based on pre-trained language models (PLMs) and large language models (LLMs), a novel framework that systematically integrates firm-specific micro-level sentiment, industry-specific meso-level sentiment, and duration-aware smoothing to model the latency and persistence of textual impact. Applying MulFSA to the comprehensive Chinese bond market corpus constructed by us (2013-2023, 1.35M texts), we extracted a daily composite sentiment index. Empirical results show statistically measurable improvements in credit spread forecasting when incorporating sentiment (10.25% MAE and 11.94% MAPE reduction), with sentiment shifts closely correlating with major social risk events and firm-specific crises. Project Page: https://mulfsa.github.io/.

Why Ask One When You Can Ask $k$? Learning-to-Defer to the Top-$k$ Experts

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.12988v5 Announce Type: replace Abstract: Existing Learning-to-Defer (L2D) frameworks are limited to single-expert deferral, forcing each query to rely on only one expert and preventing the use of collective expertise. We introduce the first framework for Top-$k$ Learning-to-Defer, which allocates queries to the $k$ most cost-effective entities. Our formulation unifies and strictly generalizes prior approaches, including the one-stage and two-stage regimes, selective prediction, and classical cascades. In particular, it recovers the usual Top-1 deferral rule as a special case while enabling principled collaboration with multiple experts when $k>1$. We further propose Top-$k(x)$ Learning-to-Defer, an adaptive variant that learns the optimal number of experts per query based on input difficulty, expert quality, and consultation cost. To enable practical learning, we develop a novel surrogate loss that is Bayes-consistent, $\mathcal{H}_h$-consistent in the one-stage setting, and $(\mathcal{H}_r,\mathcal{H}_g)$-consistent in the two-stage setting. Crucially, this surrogate is independent of $k$, allowing a single policy to be learned once and deployed flexibly across $k$. Experiments across both regimes show that Top-$k$ and Top-$k(x)$ deliver superior accuracy-cost trade-offs, opening a new direction for multi-expert deferral in L2D.

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, Renjie Liao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.13109v2 Announce Type: replace Abstract: Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/

Projection Coefficients Estimation in Continuous-Variable Quantum Circuits

M. W. AlMasri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.16246v4 Announce Type: replace Abstract: In this work, we propose a continuous-variable quantum algorithm to compute the projection coefficients of a holomorphic function in the Segal--Bargmann space by leveraging its isometric correspondence with single-mode quantum states. Using CV quantum circuits, we prepare the state $\ket{f}$ associated with $f(z)$ and extract the coefficients $c_n = \braket{n}{f}$ via photon-number-resolved detection, enhanced by interferometric phase referencing to recover full complex amplitudes. We detail the construction of the state-preparation oracle for various functional classes and analyze the protocol's robustness under realistic noise models, including detector inefficiency and state preparation errors. This enables direct quantum estimation and visualization of the coefficient sequence -- offering a hardware-native protocol for characterizing non-Gaussian states and analyzing functions defined by quantum oracles, complementary to classical numerical integration.

Partial orders and contraction for BISO channels

Christoph Hirche, Oxana Shaya — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.16726v2 Announce Type: replace Abstract: A fundamental question in information theory is to quantify the loss of information under a noisy channel. Partial orders and contraction coefficients are typical tools to that end, however, they are often also challenging to evaluate. For the special class of binary input symmetric output (BISO) channels, Geng et al. showed that among channels with the same capacity, the binary symmetric channel (BSC) and binary erasure channel (BEC) are extremal with respect to the more capable order. Here, we show two main results. First, for channels with the same KL contraction coefficient, the same holds with respect to the less noisy order. Second, for channels with the same Dobrushin coefficient, or equiv. maximum leakage or Doeblin coefficient, the same holds with respect to the degradability order. In the process, we provide a closed-form expression for the contraction coefficients of BISO channels. We also discuss the comparability of BISO channels and extensions to binary channels in general.

ShowMak3r: Compositional TV Show Reconstruction

Sangmin Kim, Seunguk Do, Daeun Lee, Jaesik Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2504.19584v3 Announce Type: replace Abstract: Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Many challenges make the reconstruction difficult due to (1) actors occluding with each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page : https://nstar1125.github.io/showmak3r

Non-Adaptive Cryptanalytic Time-Space Lower Bounds via a Shearer-like Inequality for Permutations

Itai Dinur, Nathan Keller, Avichai Marmor — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.00894v3 Announce Type: replace Abstract: The power of adaptivity in algorithms has been intensively studied in diverse areas of theoretical computer science. In this paper, we obtain a number of sharp lower bound results which show that adaptivity provides a significant extra power in cryptanalytic time-space tradeoffs with (possibly unlimited) preprocessing time. Most notably, we consider the discrete logarithm (DLOG) problem in a generic group of $N$ elements. The classical `baby-step giant-step' algorithm for the problem has time complexity $T=O(\sqrt{N})$, uses $O(\sqrt{N})$ bits of space (up to logarithmic factors in $N$) and achieves constant success probability. We examine a generalized setting where an algorithm obtains an advice string of $S$ bits and is allowed to make $T$ arbitrary non-adaptive queries that depend on the advice string (but not on the challenge group element). We show that in this setting, the $T=O(\sqrt{N})$ online time complexity of the baby-step giant-step algorithm cannot be improved, unless the advice string is more than $\Omega(\sqrt{N})$ bits long. This lies in stark contrast with the classical adaptive Pollard's rho algorithm for DLOG, which can exploit preprocessing to obtain the tradeoff curve $ST^2=O(N)$. We obtain similar sharp lower bounds for several other cryptanalytic problems. To obtain our results, we present a new model that allows analyzing non-adaptive preprocessing algorithms for a wide array of search and decision problems in a unified way. Since previous proof techniques inherently cannot distinguish between adaptive and non-adaptive algorithms for the problems in our model, they cannot be used to obtain our results. Consequently, our proof uses a variant of Shearer's lemma for this setting, due to Barthe, Cordero-Erausquin, Ledoux, and Maurey (2011). This seems to be the first time a variant of Shearer's lemma for permutations is used in an algorithmic context.

How Reliable Are FOSS Popularity Metrics? Analyzing the Effort Required for Spoofing Common Software Popularity Metrics

Ben Swierzy, Timo Pohl, Marc Ohm, Michael Meier — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.05897v2 Announce Type: replace Abstract: Quantitative metrics derived from software repositories and package ecosystems are widely used to assess the impact, popularity, maintenance, and criticality of free and open source software (FOSS) projects. However, these metrics are often assumed to be reliable despite their potential susceptibility to manipulation. Prior empirical software engineering and security research deployed these in a variety of ways which assume they indeed capture project impact and popularity. Yet, the extent to which these underlying signals can be spoofed in practice, and the consequences this has for downstream uses of the metrics, has received little focused attention. To address this gap, the paper decomposes existing combined metrics into atomic metric categories, analyzes their spoofing effort under a maintainer-centered threat model, and investigates a real-world sybil attack on npm connected to an impact-based reward mechanism. The analysis finds that many metric categories, especially commit data, issue-tracker activity, downloads, repository contents, and dependency relations, are manipulable with low to moderate effort, and it identifies a sybil attack comprising more than 70,000 spam packages on npm. These results imply that quantitative FOSS metrics should be used with much greater caution in software engineering research and practice, particularly for ranking, dataset construction, and any allocation or evaluation process that turns metrics into optimization targets.

Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.14654v2 Announce Type: replace Abstract: Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots

Keisuke Kawano, Takuro Kutsuna, Naoki Hayashi, Yasushi Esaki, Hidenori Tanaka — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.17354v3 Announce Type: replace Abstract: In many real-world settings--e.g., single-cell RNA sequencing, mobility sensing, and environmental monitoring--data are observed only as temporally aggregated snapshots collected over finite time windows, often with noisy or uncertain timestamps, and without access to continuous trajectories. We study the problem of estimating continuous-time dynamics from such snapshots. We present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage framework that (i) infers high-resolution time labels by aligning neighboring intervals via partial optimal transport (POT) and (ii) reconstructs a continuous-time data distribution through temporal kernel smoothing, from which we sample pairs of nearby times to train standard ODE/SDE models. Our formulation explicitly accounts for snapshot aggregation and time-label uncertainty and uses practical accelerations (screening and mini-batch POT), making it applicable to large datasets. Across synthetic benchmarks and two real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional and trajectory errors compared with OT-CFM, [SF]$^{2}$M, TrajectoryNet, MFM, and ENOT.

High-order adaptive discontinuous finite elements for the shallow water equations with sub-grid irregular bathymetry

Luca Arpaia, Giuseppe Orlando, Christian Ferrarin, Luca Bonaventura — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.18743v2 Announce Type: replace Abstract: We present a discontinuous finite element method for the shallow water equations which exploits high-resolution realistic bathymetry data without any regularity assumption, also in the case of high-order discretizations. We prove a number of mathematical properties specific to the proposed method that is well-balanced, mass-conserving and positivity-preserving under a mild CFL condition also in the presence of wet-dry fronts. The method includes a consistent conservative discretization for passive tracers. We use a high-order Discontinuous Galerkin (DG) method as implemented in the deal.II library. This environment provides efficient and native parallelization techniques and automatically handles non-conforming meshes to implement adaptive strategies which are tested in a coastal environment. Idealized test cases show the robustness in presence of irregular bathymetries also with under-resolved features at the grid scale. A benchmark with realistic bathymetry and a complex domain shows the potential of the proposed discretization for adaptive simulations of coastal flows.

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.19075v3 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR)-a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.

Sequential Elimination and Union Shapley Value for Group Assessment in Coalitional Games

Piotr K\k{e}pczy\'nski, Oskar Skibski — Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.21122v3 Announce Type: replace Abstract: Two straightforward methods to extend an assessment of individual elements to groups are to sum individual assessments or to treat the group as a single merged element and assess it accordingly. In this work, we analyze another natural approach based on sequential elimination: elements of the group are removed one by one, and their assessments are aggregated. We study this approach in the context of coalitional games and show that, for almost all semivalues, it does not depend on the order of players. In particular, we introduce a new group value, called the Union Shapley Value, and investigate its axiomatic properties. Our results build on a comprehensive analysis of group values in coalitional games. Specifically, we define a class of group (weak consistent) semivalues - a variant of semivalues satisfying a weak form of monotonicity. This framework allows us to clarify the differences between existing notions in the literature. We show that existing group values either assess the total worth of a group or measure its synergy. We distinguish these two approaches axiomatically and uncover a connection between the corresponding values. In particular, we show that the well-known Interaction Index is a synergistic counterpart of the value introduced by Marichal et al., which corresponds to the merge approach. The analysis also yields new synergistic group values associated with the Union Shapley Value, which we call the Intersection Shapley Value. Our results demonstrate that the sequential extension - and the Union Shapley value in particular - constitute one of the most natural extensions of player values to groups in coalitional games.

GradPower: Powering Gradients for Faster Language Model Pre-Training

Thu, 21 May 2026 00:00:00 -0400

arXiv:2505.24275v3 Announce Type: replace Abstract: We propose GradPower, a lightweight gradient-transformation technique for accelerating language model pre-training. Given a gradient vector $g=(g_i)_i$, GradPower first applies the elementwise sign-power transformation: $\varphi_p(g)=({\rm sign}(g_i)|g_i|^p)_{i}$ for a fixed $p>0$, and then feeds the transformed gradient into a base optimizer. Notably, GradPower requires only a single-line code change and no modifications to the base optimizer's internal logic, including the hyperparameters. When applied to Adam (termed AdamPower), GradPower consistently achieves lower terminal loss across diverse architectures (LLaMA, Qwen2MoE), parameter scales (66M to 2B), datasets (C4, OpenWebText), and learning-rate schedules (cosine, warmup-stable-decay). The most pronounced gains are observed when training modern mixture-of-experts models with warmup-stable-decay schedules. GradPower also integrates seamlessly with other state-of-the-art optimizers, such as Muon, yielding further improvements. Finally, we provide theoretical analyses that reveal the underlying mechanism of GradPower and highlight the influence of gradient noise.

Causal Path Alignment: Anchoring the Optimization Trajectory for Controllable In-Parameter Knowledge Editing

Xiyu Liu, Zhengxiao Liu, Naibin Gu, Zheng Lin, Weiping Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.04042v2 Announce Type: replace Abstract: Knowledge editing is pivotal for efficiently updating the parametric memory of Large Language Models (LLMs), enabling them to function as evolving agents in dynamic environments. However, mainstream in-parameter knowledge editing approaches suffer from Subject-Dominant Memory Interference: modifying a specific fact inadvertently corrupts the broader structural knowledge associated with the same subject within LLMs. We diagnose the root cause as a shortcut learning pathology, where the optimization objective overfits subject representations while bypassing the essential relational context. To rectify this, we propose Causal Path Alignment (CPA), a principled framework designed to anchor the optimization trajectory to valid causal pathways. CPA enforces parameter updates to route through relation-aware intermediate states, thereby preventing the erasure of contextual dependencies. Experimental results across diverse LLM backbones demonstrate that CPA consistently eliminates the shortcut, significantly improving relation specificity while exhibiting minimal side-effects. Moreover, CPA serves as a model-agnostic plug-in for existing editors, paving the way for reliable and trustworthy in-parameter knowledge editing.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.11060v2 Announce Type: replace Abstract: Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexities of systems code, making changes to a systems codebase requires researching about many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly outperforms strong baselines, achieving a crash-resolution rate (CRR) of 48%, compared to 31.5% by SWE-agent and 31% by Agentless, using OpenAI's GPT-4o model. Scaling up sampling budget to 10 trajectories increases Code Researcher's CRR to 54%. Code Researcher is also robust to model choices, reaching 67% with the newer Gemini 2.5-Flash model. Through another experiment on an open-source multimedia software, we show the generalizability of Code Researcher and also conduct ablations. Our experiments highlight the importance of global context gathering and multi-faceted reasoning for large codebases.

Partitioning for Intrinsic Model Inversion Resistance in Collaborative Inference

Rongke Liu, Youwen Zhu, Lei Zhou, Xianglong Zhang, Dong Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.15412v3 Announce Type: replace Abstract: In collaborative inference (CI), transmitting intermediate representations $Z$ from edge devices enables model inversion attacks (MIA) that reconstruct the original inputs $X$, while existing defenses mainly perturb shallow-layer $Z$ at the cost of utility. We instead ask where an edge-cloud model should be partitioned to obtain intrinsic resistance to MIA. We challenge the intuition that depth is the driver of MIA resistance, and show that depth is sufficient only insofar as it enables a representational transition; this transition is necessary for intrinsic resistance and is marked by an abrupt rise in the lower bound of $H(X|Z)$. Correspondingly, the decisive variance term in the entropy bound shifts from a global variance to the intra-class mean-squared radius $R_c^2$ rather than dimensionality alone, yielding an $R_c^2$-based criterion to locate the transition zone, or identify it post hoc from MIA outcomes, which we term the Golden Partition Zone (GPZ). We further explain how $R_c^2$ evolves during training and show that it can be controlled through the label distribution; we refer to this controllable dynamic behavior as the Neural Vortex, an analysis-backed explanatory concept. Across four representative deep vision models, partitioning at the GPZ yields more than 4x higher reconstruction MSE compared to shallow splits; under entropy and inversion-model enhancements, decision-level representations provide 66 percent stronger resistance than feature-level ones, and we further observe that data type affects both the transition boundary and reconstruction.

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.16950v2 Announce Type: replace Abstract: Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, indicating that it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as a benchmark alternative for ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including MLLMs such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Zesen Wang, Lijuan Lan, Yonggang Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.17631v4 Announce Type: replace Abstract: Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance LLM' comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

PID Tuning via Desired Step Response Curve Fitting

Senol Gulgonul — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.17655v3 Announce Type: replace Abstract: This paper presents a PID tuning method based on step response curve fitting (PID-SRCF) that utilizes L2-norm minimization for precise reference tracking and explicit transient response shaping. The algorithm optimizes controller parameters by minimizing the root-mean-square error between desired and actual step responses. The proposed approach determines optimal PID parameters by matching any closed-loop response to a desired system step response. Practically a first-order plus time delay model or a second-order system with defined settling time and overshoot requirements are preferred. The method has open-source implementation using constrained nonlinear optimization in MATLAB. Comparative evaluations demonstrate that PID-SRCF can replace known analytical methods like Ziegler Nichols, Lambda Tuning, Pole Placement, Dominant Pole and MATLAB proprietary PID tuning applications.

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.21039v3 Announce Type: replace Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://jaebak1996.github.io/SSE/

M3: Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.01053v4 Announce Type: replace Abstract: Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using samples from the EHRSQL 2024 benchmark with two language models. On one hundred answerable questions, the proprietary Claude Sonnet 4 achieved 94% accuracy and the open-weights gpt-oss-20B (deployable locally on consumer hardware) achieved 93%; on a matched sample of one hundred unanswerable questions, where correct behavior is to abstain rather than produce SQL, gpt-oss-20B correctly abstained on 69%. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-weights model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis and is designed with security measures including OAuth2 authentication, query validation, and audit logging.

Towards Serverless Processing of Spatiotemporal Big Data Queries

Diana Baumann, Tim C. Rese, David Bermbach — Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.06005v3 Announce Type: replace Abstract: Spatiotemporal data are being produced in continuously growing volumes by a variety of data sources and a variety of application fields rely on rapid analysis of such data. Existing systems such as PostGIS or MobilityDB usually build on relational database systems, thus, inheriting their scale-out characteristics. As a consequence, big spatiotemporal data scenarios still have limited support even though many query types can easily be parallelized. In this paper, we propose our vision of a native serverless data processing approach for spatiotemporal data: We break down queries into small subqueries which then leverage the near-instant scaling of Function-as-a-Service platforms to execute them in parallel. With this, we partially solve the scalability needs of big spatiotemporal data processing.

Two-Level Distributed Interference Management for Large-Scale HAPS-Empowered vHetNets

Afsoon Alidadi Shamsabadi, Animesh Yadav, Halim Yanikomeroglu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.08299v3 Announce Type: replace Abstract: High altitude platform stations (HAPS) offer a promising solution for achieving ubiquitous connectivity in next-generation wireless networks (xG). Integrating HAPS with terrestrial networks, creating HAPS-empowered vertical heterogeneous networks (vHetNets), significantly improves coverage and capacity and supports emerging novel use cases. In HAPS-empowered vHetNets, HAPS and terrestrial network tiers can share the same spectrum, forming harmonized spectrum vHetNets that enhance spectral efficiency (SE). However, harmonized spectrum vHetNets face major challenges, including severe co-channel interference and scalability in large-scale deployments. To address the first challenge, we adopt a cell-free multiple-input multiple-output (MIMO) network architecture in which users are simultaneously served by multiple base stations using beamforming. However, beamforming weight design leads to a nonconvex, high-dimensional optimization problem, highlighting the scalability challenge. To address this second challenge, we develop a two-level distributed proportional fairness beamforming weight design (PFBWD) algorithm. This algorithm combines the augmented Lagrangian method (ALM) with a three-block ADMM framework. Simulation results demonstrate the performance improvements achieved by integrating HAPS with standalone terrestrial networks, as well as the reduced complexity and signaling overhead of the distributed algorithm compared to centralized algorithms.

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.09180v4 Announce Type: replace Abstract: Depth information is robust to scene appearance variations and inherently carries 3D spatial details. Thus, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization in this paper. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme is designed with masked and unmasked tokens to enhance the sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines. The feasibility of our model is validated to perform real-world manipulation tasks via zero-shot transfer.

Evasion Under Blockchain Sanctions

Endong Liu, Mark Ryan, Liyi Zhou, Pascal Berrang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.11721v2 Announce Type: replace Abstract: Sanctioning blockchain addresses has become a common regulatory response to malicious activities. However, enforcement on permissionless blockchains remains challenging due to complex transaction flows and sophisticated fund-obfuscation techniques. Using cryptocurrency mixing tool Tornado Cash as a case study, we quantitatively assess the effectiveness of U.S. Office of Foreign Assets Control (OFAC) sanctions over a 957-day period, covering 6.79 million Ethereum blocks and 1.07 billion transactions. Our analysis reveals that while OFAC sanctions reduced overall Tornado Cash deposit volume by 71.03% to approximately 2 billion USD, attackers still relied on Tornado Cash in 78.33% of Ethereum-related security incidents, underscoring persistent evasion strategies. In this paper, we identify three significant, structural limitations in current sanction enforcement practices: (i) fragmented censorship in blockchain consensus and application layer; (ii) the complexity of obfuscation virtual asset services exploited by users; and (iii) the susceptibility of naive binary sanction classifications to dusting attacks. Our analysis and findings contribute to ongoing discussions around regulatory effectiveness in Decentralized Finance by providing empirical evidence, clarifying enforcement challenges, and informing future compliance strategies in response to sanctions and blockchain-based security risks.

Distributed Non-Uniform Scaling Control of Multi-Agent Formation via Matrix-Valued Constraints

Tao He, Gangshan Jing — Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.02289v2 Announce Type: replace Abstract: Distributed formation maneuver control refers to the problem of maneuvering a group of agents to change their formation shape by adjusting the motions of partial agents, where the controller of each agent only requires local information measured from its neighbors. Although this problem has been extensively investigated, existing approaches are mostly limited to uniform scaling transformations. This article proposes a new type of local matrix-valued constraints, via which non-uniform scaling control of position formation can be achieved by tuning the positions of only two agents (i.e., leaders). Here, the non-uniform scaling transformation refers to global scaling the position formation with different ratios along different orthogonal coordinate directions. Moreover, by defining scaling and translation of attitudes, we propose a distributed control scheme for scaling and translation maneuver control of joint position-attitude formations. It is proven that the proposed controller achieves global convergence, provided that the sensing graph among agents is a 2-rooted bidirectional graph. Compared with the affine formation maneuver control approach, the proposed approach leverages a sparser sensing graph, requires fewer leaders, and additionally enables scaling transformations of the attitude formation. A simulation example demonstrates our theoretical results.

FAIR-Pruner: A Flexible Framework for Automatic Layer-Wise Pruning via Tolerance of Difference

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.02291v3 Announce Type: replace Abstract: Structured pruning is a standard tool for compressing deep neural networks, but its practical performance depends on how sparsity is allocated across layers. We propose FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning. FAIR-Pruner uses two within-layer rankings: a removal-oriented signal that proposes candidate units and a protection-oriented signal that identifies task-sensitive units. Its core component, Tolerance of Difference (ToD), measures the overlap between the removal prefix and the protected tail, and uses a shared tolerance level to induce non-uniform pruning depths across layers. As a default vision instantiation, FAIR-Pruner combines a Wasserstein-based U-Score for class-conditional unit separability with a Taylor-based R-Score for task-level sensitivity; the same ToD allocation rule can also be paired with alternative removal signals. Theoretically, we analyze ToD through the population R-Score, derive rank-based control of the high-R-Score mass entering the pruning set, and identify an additive exchange condition for same-budget comparison with uniform pruning. Experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet across VGG, ResNet, DenseNet, ConvNeXt, and DeiT show strong accuracy--compression trade-offs. Prune-only experiments on routed-expert Qwen1.5-MoE-A2.7B-Chat further examine architectural extensibility under matched expert budgets. FAIR-Pruner is released as a pip-installable open-source package.

RadProPoser: Probabilistic Radar Tensor Human Pose Estimation That Knows Its Limits

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.03578v2 Announce Type: replace Abstract: Radar-based human pose estimation enables privacy-preserving motion tracking for ambient intelligence, yet the noisy nature of radar sensing makes uncertainty quantification essential. We present RadProPoser, an end-to-end probabilistic framework that predicts three-dimensional body joints with per-joint uncertainties from raw radar tensor data. Using a variational encoder-decoder with spectral attention that fuses real and imaginary radar components across temporal frames, we model aleatoric uncertainty through learnable Gaussian and Laplace distributions. Trained on a new benchmark dataset with optical motion-capture ground truth, our method achieves 6.425 cm mean per-joint position error. The model outputs per-joint aleatoric uncertainties, and isotonic recalibration yields calibrated total uncertainty with expected calibration error of 0.027. Since spectral attention operates on individual radar tensor components, extending to multi-radar configurations requires only concatenating additional input streams. On the HuPR benchmark with dual orthogonal radars, this achieves 5.042 cm MPJPE. The framework runs at 89 frames per second (FPS) on an NVIDIA RTX 3090, exceeding the 15 Hz radar frame rate.

DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.04332v3 Announce Type: replace Abstract: Multi-agent systems (MAS) have demonstrated significant effectiveness in addressing complex problems through coordinated collaboration among heterogeneous agents. However, real-world environments and task specifications are inherently dynamic, characterized by frequent changes, uncertainty, and variability. Despite this, most existing MAS frameworks rely on static architectures with fixed agent capabilities and rigid task allocation strategies, which greatly limits their adaptability to evolving conditions. This inflexibility poses substantial challenges for sustaining robust and efficient multi-agent cooperation in dynamic and unpredictable scenarios. To address these limitations, we propose DRAMA: a Dynamic and Robust Allocation-based Multi-Agent System designed to facilitate resilient collaboration in rapidly changing environments. DRAMA features a modular architecture with a clear separation between the control plane and the worker plane. Both agents and tasks are abstracted as resource objects with well-defined lifecycles, while task allocation is achieved via an affinity-based, loosely coupled mechanism. The control plane enables real-time monitoring and centralized planning, allowing flexible and efficient task reassignment as agents join, depart, or become unavailable, thereby ensuring continuous and robust task execution. The worker plane comprises a cluster of autonomous agents, each with local reasoning, task execution, the ability to collaborate, and the capability to take over unfinished tasks from other agents when needed.

Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis

Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.04999v2 Announce Type: replace Abstract: Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating information from multiple modalities, such as text, audio, and visual data. However, existing methods often suffer from spurious correlations both within and across modalities, leading models to rely on statistical shortcuts rather than true causal relationships, thereby undermining generalization. To mitigate this issue, we propose a Multi-relational Multimodal Causal Intervention (MMCI) framework, which leverages the backdoor adjustment from causal theory to address the confounding effects of such shortcuts. Specifically, we first model the multimodal inputs as a multi-relational graph to explicitly capture intra- and inter-modal dependencies. Then, we apply an attention mechanism to separately estimate and disentangle the causal features and shortcut features corresponding to these intra- and inter-modal relations. Finally, by applying the backdoor adjustment, we stratify the shortcut features and dynamically combine them with the causal features to encourage MMCI to produce stable predictions under distribution shifts. Extensive experiments on several standard MSA datasets and out-of-distribution (OOD) test sets demonstrate that our method effectively suppresses biases and improves performance.

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.06206v5 Announce Type: replace Abstract: Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack the Chain-of-Thought(CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on https://github.com/hq-King/Affordance-R1.

InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.08636v2 Announce Type: replace Abstract: Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. % With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely \textbf{task scaling}, over two orders of magnitude, offering a promising route towards capable reasoning generalist.

Retrospective Sparse Attention for Efficient Long-Context Generation

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.09001v2 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important few tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more contexts, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6$\times$ and accuracy by up to 21.9\%.

FACET: Teacher-Centred LLM-Based Multi-Agent Systems-Towards Personalized Educational Worksheets

Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.11401v5 Announce Type: replace Abstract: The increasing heterogeneity of student populations poses significant challenges for teachers, particularly in mathematics education, where cognitive, motivational, and emotional differences strongly influence learning outcomes. While AI-driven personalization tools have emerged, most remain performance-focused, offering limited support for teachers and neglecting broader pedagogical needs. This paper presents the FACET framework, a teacher-facing, large language model (LLM)-based multi-agent system designed to generate individualized classroom materials that integrate both cognitive and motivational dimensions of learner profiles. The framework comprises three specialized agents: (1) learner agents that simulate diverse profiles incorporating topic proficiency and intrinsic motivation, (2) a teacher agent that adapts instructional content according to didactical principles, and (3) an evaluator agent that provides automated quality assurance. We tested the system using authentic grade 8 mathematics curriculum content and evaluated its feasibility through a) automated agent-based assessment of output quality and b) exploratory feedback from K-12 in-service teachers. Results from ten internal evaluations highlighted high stability and alignment between generated materials and learner profiles, and teacher feedback particularly highlighted structure and suitability of tasks. The findings demonstrate the potential of multi-agent LLM architectures to provide scalable, context-aware personalization in heterogeneous classroom settings, and outline directions for extending the framework to richer learner profiles and real-world classroom trials.

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.12247v2 Announce Type: replace Abstract: Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media

Tianliang Xu, Ariel Hasell, Sabina Tomkins — Thu, 21 May 2026 00:00:00 -0400

arXiv:2508.16453v2 Announce Type: replace Abstract: Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content varies by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.

Access Paths for Efficient Ordering with Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.00303v3 Announce Type: replace Abstract: In this work, we present the \texttt{LLM ORDER BY} semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation offers universal optimality on all datasets. From our evaluations, we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path for LLM ORDER BY. In our extensive evaluations, our optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks. We believe that this work provides foundational insights into the principled optimization of semantic operators essential for building robust, large-scale LLM-powered analytic systems.

Unpacking "Personal" Health Informatics for Proactive Collective Care

Shyama Sastha Krishnamoorthy Srinivasan, Mohan Kumar, Pushpendra Singh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.01231v3 Announce Type: replace Abstract: Care is primarily a collective phenomenon, with a practice that involves sharing health and wellbeing information within a trusted "care circle" of family members and companions for sensemaking, interpretation, decision-making, and follow-through. However, current digital health tools and information systems are designed for individuals and primarily intended for Personal Health Informatics (PHI). This mismatch between collective practice and individualistic design creates new challenges for the proactive use of such systems in care settings and limits adoption, sustained engagement, and meaningful use. To examine how people practice collective care and how (if) they perceive, adopt, and integrate PHI systems for proactive care, we conducted a sequential mixed-methods study. Through an initial survey (n=87) and semi-structured interviews (n=22), we found that their practices involve collectively understanding, analyzing, and sensemaking health information. However, we also found that their use of existing systems to support such practices is constrained by factors at personal, relational, technological, and structural levels that evolve over time. To explore redesigning PHI toward "Collective Health Informatics", we conducted stakeholder-specific interviews (n=12), a follow-up survey (n=116), and co-design workshops (n=6) to understand the dynamics required for collective settings while retaining agency. Using a design probe evaluation (n=38), we refine a design vision for coordinated, trustworthy action across such care relationships. Our findings motivate CC-Proact, an operational map that translates ecological influences into three design levers: Agency, Elicitation, and Engagement. Using this map, our work empirically examines collective care practices and offers ten design recommendations for building responsible systems that proactively support collective care.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Yansong Liu, Jiateng Li, Yuan Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.03526v2 Announce Type: replace Abstract: The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which leads to emergence of speech-based LLMs (SpeechLMs) with capability of processing user request in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs that outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers

Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, Bastian Leibe — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.07120v2 Announce Type: replace Abstract: Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT, $\pi^3$ and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrix of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than $3\times$ while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global attention-based architectures such as VGGT, $\pi^3$ , and MapAnything, while substantially improving scalability to large image collections.

Temporal Counterfactual Explanations of Behaviour Tree Decisions

Tamlin Love, Antonio Andriella, Guillem Aleny\`a — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.07674v2 Announce Type: replace Abstract: Explainability, in particular, the ability for robots to explain why they have made a decision or behaved in a certain way, is a critical tool in helping users understand the robots they interact and coexist with. Behaviour trees are a popular framework for controlling the decision-making of robots, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour tree-driven robots has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states in real time, unlike previous methods which are either unable to answer contrastive questions with causal explanations, or are not guaranteed to provide consistent and accurate explanations. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable, and ultimately safe and trustworthy robotic systems.

Measuring and mitigating overreliance to build human-compatible AI

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.08010v2 Announce Type: replace Abstract: Large language models (LLMs) distinguish themselves from previous technologies by functioning as collaborative ``thought partners,'' capable of engaging more fluidly in natural language on a range of tasks. As LLMs increasingly influence consequential decisions across diverse domains from healthcare to personal advice, the risk of overreliance -- relying on LLMs beyond their capabilities -- grows. This paper argues that measuring and mitigating overreliance must become central to LLM research and deployment. First, we consolidate risks from overreliance at both the individual and societal levels, including high-stakes errors, governance challenges, and cognitive deskilling. Then, we explore LLM characteristics, system design features, and user cognitive biases that together raise serious and unique concerns about overreliance on LLMs in practice. We also examine historical approaches for measuring overreliance, identifying three important gaps and proposing three promising directions to improve measurement. Finally, we propose mitigation strategies that can be pursued to ensure LLMs augment rather than undermine human capabilities.

Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions

Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du, Qichao Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.09215v2 Announce Type: replace Abstract: Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.09946v2 Announce Type: replace Abstract: Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduced an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.

GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction

Joshua Dimasaka, Christian Gei{\ss}, Robert Muir-Wood, Emily So — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.10308v2 Announce Type: replace Abstract: In the aftermath of disasters, many institutions worldwide face challenges in monitoring changes in disaster risk, limiting assessment of progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and expert priors. We introduce a weakly supervised first-order transition matrix to capture changes in the spatiotemporal distribution of vulnerability across two disaster-affected and socioeconomically disadvantaged regions: the cyclone-impacted Khurushkul community in Bangladesh and the mudslide-affected city of Freetown in Sierra Leone. Across both case studies, the framework constructs large-scale graph representations spanning 2016-2023 and evaluates posterior compositional distributions against expert priors using Aitchison distance due to the lack of temporal groundtruth labels. The work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.

Analysis and Design of Spare Strategy for Large-Scale Satellite Constellation Using Direct Insertion under (r,q) Policy

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.10585v3 Announce Type: replace Abstract: This paper introduces a Markov chain-based approach for the analysis and optimization of spare-management policies in large-scale satellite constellations. Focusing on the direct strategy, we model spare replenishment as a periodic-review reorder-point/order-quantity policy, where spares are deployed directly to constellation planes. The stochastic behavior of satellite failures and launch vehicle lead times is captured through Markov representations of both failure and replenishment dynamics. Based on this efficient and accurate framework, we construct and solve an optimization problem aimed at minimizing operational costs. The effectiveness of the proposed method is demonstrated through a case study using a real-world mega-constellation.

Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

Hao Xu, Xiaolin Wu, Xi Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.13482v2 Announce Type: replace Abstract: 3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.

Sequential Data Augmentation for Generative Recommendation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.13648v3 Announce Type: replace Abstract: Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation. Our code is available at https://github.com/snap-research/GenPAS.

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.17396v4 Announce Type: replace Abstract: Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model's memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.

Multi-needle Localization for Pelvic Seed Implant Brachytherapy based on Tip-handle Detection and Matching

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.17931v2 Announce Type: replace Abstract: Accurate multi-needle localization in intraoperative CT images is crucial for optimizing seed placement in pelvic seed implant brachytherapy. However, this task is challenging due to poor image contrast and needle adhesion. This paper presents a novel approach that reframes needle localization as a tip-handle detection and matching problem to overcome these difficulties. An anchor-free network, based on HRNet, is proposed to extract multi-scale features and accurately detect needle tips and handles by predicting their centers and orientations using decoupled branches for heatmap regression and polar angle prediction. To associate detected tips and handles into individual needles, a greedy matching and merging (GMM) method designed to solve the unbalanced assignment problem with constraints (UAP-C) is presented. The GMM method iteratively selects the most probable tip-handle pairs and merges them based on a distance metric to reconstruct 3D needle paths. Evaluated on a dataset of 100 patients, the proposed method demonstrates superior performance, achieving higher precision and F1 score compared to a segmentation-based method utilizing the nnUNet model,thereby offering a more robust and accurate solution for needle localization in complex clinical scenarios.

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.20912v3 Announce Type: replace Abstract: Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.22963v3 Announce Type: replace Abstract: Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.24725v3 Announce Type: replace Abstract: Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.

Effective Model Pruning: Measure The Redundancy of Model Components

Yixuan Wang, Dan P. Guralnik, Saiedeh Akbari, Warren E. Dixon — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.25606v3 Announce Type: replace Abstract: This article initiates the study of a basic question about model pruning. Given a vector $s$ of importance scores assigned to model components, how many of the scored components could be discarded without sacrificing performance? We propose Effective Model Pruning (EMP), which derives the desired sparsity directly from the score distribution using the notion of effective sample size from particle filtering, also known as the inverse Simpson index. Rather than prescribe a pruning criterion, EMP supplies a universal adaptive threshold derived from the distribution of the score $s$ over the model components: EMP maps $s$ to a number $N_{eff}=N_{eff}(s)$, called the effective sample size. The $N-N_{eff}$ lowest scoring components are discarded. A tight lower bound on the effective mass $s_{eff}$ (the sum of retained normalized scores) in terms of $N_{eff}$ is derived. This process yields models with a provable upper bound on the loss change relative to the original dense model. Numerical experiments are performed demonstrating this phenomenon across a variety of network architectures including MLPs, CNNs, Transformers, LLMs, and KAN. It is also shown that EMP addresses a rich set of pruning criteria such as weight magnitude, attention score, KAN importance score, and even feature-level signals such as image pixels.

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.26627v3 Announce Type: replace Abstract: Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.00520v2 Announce Type: replace Abstract: Foundation models are reshaping medical imaging, yet their application in echocardiography remains limited, hindered by a heavy reliance on private datasets that prevent reproducible comparison. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited diverse public datasets. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography foundation models. Specifically, CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. Leveraging this framework, we evaluate several leading foundation models, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our analysis reveals that while general-purpose encoders transfer well and often close the gap with probing, they struggle significantly with fine-grained distinctions like view classification and subtle pathology recognition. Results indicate that models capturing temporal cardiac dynamics perform best on functional tasks, while retrieval-based approaches generalize more consistently across datasets. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point to guide the architectural design of future echocardiography and possibly other medical imaging foundation models.

Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches

Yicheng Tao, Yuante Li, Yao Qin, Yepang Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.04905v3 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have significantly improved automated code generation. While existing approaches have achieved strong performance at the function and file levels, real-world software engineering requires reasoning over entire repositories, including cross-file dependencies, evolving execution environments, and global semantic consistency. This challenge has led to the emergence of Repository-Level Code Generation (RLCG), where models must retrieve, organize, and utilize repository-scale context to generate coherent and executable code changes. To address these challenges, Retrieval-Augmented Generation (RAG) has become an increasingly important paradigm for repository-level code intelligence. In this survey, we present a comprehensive review of Retrieval-Augmented Code Generation (RACG), with a particular focus on repository-level approaches. Rather than viewing RACG as a static ``retrieve-then-generate'' pipeline, we characterize it as a coupled and evolving process involving context construction, retrieval optimization, generation, and environment interaction. We organize existing methods through a unified analytical framework spanning retrieval substrate, control regime, and evaluation setting. Based on this framework, we systematically examine retrieval strategies, graph-based and non-graph-based retrieval paradigms, training-driven optimizations, and autonomous agent architectures. We further summarize widely used datasets, benchmarks, and system configurations, and discuss key challenges including scalability, reliability, efficiency, and the necessity boundary between RACG and long-context LLMs. Through this survey, we aim to provide a structured understanding of the rapidly evolving RACG landscape and highlight promising directions for future AI-powered software engineering research.

EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi, Anastasia Giachanou, Robert A. Bagheri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.05942v3 Announce Type: replace Abstract: We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson's $r \approx 0.90$ on WVS). Yet we find a clear regional difference: Western regions average $r=0.82$ while non-Western regions average $r=0.61$ (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment ($r=0.74$, $p<.001$; PEW $r=0.39$, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.06063v2 Announce Type: replace Abstract: Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

Efficient numeracy in language models through single-token number embeddings

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.06824v2 Announce Type: replace Abstract: To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.08482v3 Announce Type: replace Abstract: Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.09060v2 Announce Type: replace Abstract: Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves diversity metrics such as the Vendi Score and Brisque over strong baselines, while upholding image quality and alignment.

InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

Qiaosheng Chen, Yang Liu, Lei Li, Kai Chen, Qipeng Guo, Gong Cheng, Fei Yuan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.09724v2 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.

Geometric local parameterization for solving Hele-Shaw problems with surface tension

Zengyan Zhang, Wenrui Hao, John Harlim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.14088v2 Announce Type: replace Abstract: In this work, we introduce a novel computational framework for solving the two-dimensional Hele-Shaw free boundary problem with surface tension. The moving boundary is represented by point clouds, eliminating the need for a global parameterization. Our approach leverages Generalized Moving Least Squares (GMLS) to construct local geometric charts, enabling high-order approximations of geometric quantities such as curvature directly from the point cloud data. This local parameterization is systematically employed to discretize the governing boundary integral equation, including an analytical formula of the singular integrals. We provide a rigorous convergence analysis for the proposed spatial discretization, establishing consistency and stability under certain conditions. The resulting error bound is derived in terms of the size of the uniformly sampled point cloud data on the moving boundary, the smoothness of the boundary, and the order of the numerical quadrature rule. Numerical experiments confirm the theoretical findings, demonstrating high-order spatial convergence and the expected temporal convergence rates. The method's effectiveness is further illustrated through simulations of complex initial shapes, including interfaces driven by anisotropic surface tension, which correctly evolve towards circular equilibrium states under the influence of surface tension, highlighting the versatility of the method for complex geometry-dependent interface dynamics.

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.14444v3 Announce Type: replace Abstract: Post-training pruning can substantially reduce LLM inference costs, but it often degrades quality unless the remaining weights are adapted. Since global retraining is expensive at LLM scale, recent work has largely focused on increasingly sophisticated pruning criteria that aim to select better sparsity patterns without adaptation. We revisit this trade-off through local reconstruction: after pruning, we adapt one subset of the model parameters at a time on a calibration set, training it to match the corresponding intermediate activations of the dense model. We evaluate local reconstruction across model families and scales, up to 72B parameters, and establish three main findings. First, local reconstruction is an effective adaptation mechanism for LLMs: it matches post-pruning retraining while using over an order of magnitude less data and compute, even when using PEFT techniques. Second, reconstruction exhibits a broad "free-lunch" regime in granularity, i.e., the reconstruction parameter window: as long as the reconstructed region contains at least a nonlinear submodule, final quality is largely insensitive to the window size, allowing granularity to be chosen primarily based on memory constraints. In contrast, reconstructing individual matrices, despite being the natural approach often proposed in the literature, consistently underperforms, as small matrix-level errors accumulate into larger activation drift. Lastly, reconstruction reduces the relative importance of the pruning criterion: performance gaps between sophisticated criteria and simple baselines shrink with model scale, making simple methods competitive again. Overall, our results challenge the prevailing view that post-pruning adaptation is impractical for LLMs.

Free-Grained Hierarchical Visual Recognition

Seulki Park, Zilin Wang, Stella X. Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.14737v3 Announce Type: replace Abstract: Hierarchical image recognition seeks to predict class labels along a semantic taxonomy, from broad categories to specific ones, typically under the tidy assumption that every training image is fully annotated along its taxonomy path. Reality is messier: A distant bird may be labeled only bird, while a clear close-up may justify bald eagle. We introduce free-grain training, where labels may appear at any level of the taxonomy and models must learn consistent hierarchical predictions from incomplete, mixed-granularity supervision. We build benchmark datasets with varying label granularity and show that existing hierarchical methods deteriorate sharply in this setting. To make up for missing supervision, we propose two simple solutions: One adds broad text-based supervision that captures visual attributes, and the other treats missing labels at specific taxonomy levels as a semi-supervised learning problem. We also study free-grained inference, where the model chooses how deep to predict, returning a reliable coarse label when a fine-grained one is uncertain. Together, our task, datasets, and methods move hierarchical recognition closer to the way labels arise in the real world.

FineVision: Open Data Is All You Need

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.17269v2 Announce Type: replace Abstract: The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.18034v3 Announce Type: replace Abstract: Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications

Zhuohang Bian, Feiyang Wu, Zhuoran Li, Teng Ma, Youwei Zhuo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.18586v3 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that rely on external function calls. This workload creates severe performance challenges for the KV Cache: spatial contention leads to the eviction of critical agents' caches and temporal underutilization leaves the cache of agents stalled on long-running function calls idling in GPU memory. We present TokenCake, a KV-Cache-centric serving framework that bridges this gap by co-optimizing scheduling and memory management through an agent-aware design. TokenCake's Temporal Scheduler employs an event-driven, opportunistic policy to proactively offload idle KV Caches during function calls and uses predictive uploading to hide data transfer latency. TokenCake's Spatial Scheduler uses dynamic memory partitioning, guided by a hybrid priority metric combining graph structure and runtime state, to reserve GPU memory for critical-path agents. Our evaluation on representative multi-agent benchmarks shows that TokenCake reduces end-to-end latency by over 47.06% and improves effective GPU memory utilization by up to 16.9% compared to vLLM.

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.21583v2 Announce Type: replace Abstract: Recent Progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent `chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is available on https://github.com/xingzhejun/GCPO.

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.23538v3 Announce Type: replace Abstract: The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

THEval. Evaluation Framework for Talking Head Video Generation

Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva — Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.04520v4 Announce Type: replace Abstract: Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality, lip synchronization, and on conducting user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on this considerations, we streamline to analyze fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset, that we have curated, in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.

Model-agnostic super-resolution in high dimensions

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.07846v2 Announce Type: replace Abstract: The problem of super-resolution, roughly speaking, is to reconstruct an unknown signal to high accuracy, given (potentially noisy) information about its low-degree Fourier coefficients. Prior results on super-resolution have imposed strong modeling assumptions on the signal, typically requiring that it is a linear combination of spatially separated point sources. In this work we analyze a very general version of the super-resolution problem by considering completely general non-negative signals (equivalently, distributions) over the $d$-dimensional torus $[0,1)^d$; we do not assume any spatial separation between point sources, or even that the distribution is a finite linear combination of point sources. The question naturally arises: what can be said about super-resolution in such a general setting? - As a warm-up, we first give a set of results for reconstructing distributions under the Wasserstein distance. We establish essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $\kappa$ of the noise for which accurate reconstruction is possible: we show that for $d$-dimensional distributions, estimates of $\approx \exp(d)$ many Fourier coefficients are both necessary and sufficient for accurate Wasserstein reconstruction. - As our main result, we define a new notion of "heavy hitter" reconstruction for distributions, which essentially amounts to achieving high-accuracy reconstruction of all "sufficiently dense" regions of the distribution. We give essentially matching upper and lower bounds on the cutoff frequency $T$ and the magnitude $\kappa$ of the noise for which accurate reconstruction is possible under this notion. Our results show that (in sharp contrast with Wasserstein reconstruction) accurate estimates of only $\approx \exp(\sqrt{d})$ many Fourier coefficients are both necessary and sufficient for heavy hitter reconstruction.

Understanding and Improving Communication Performance in Multi-node LLM Inference

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.09557v4 Announce Type: replace Abstract: As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9$\times$-3.6$\times$ lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72$\times$ reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.

Secure Parameter Identification for Multi-Participant ARX Systems via CKKS Cryptosystem-Based Proxy Re-Encryption

Jialong Chen, Ji-Feng Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.14267v2 Announce Type: replace Abstract: This paper investigates the parameter identification for multi-participant autoregressive exogenous input (ARX) systems while protecting the system input and output. To do so, the discrete Gaussian noise in the standard Cheon-Kim-Kim-Song (CKKS) cryptosystem is replaced with a truncated one. By using the CKKS cryptosystem with the truncated discrete Gaussian noise and the key-switching technique, a proxy re-encryption scheme is developed. Based on this scheme, a secure parameter identification algorithm is proposed for multi-participant ARX systems. By rigorously proving that the statistical distance between the discrete Gaussian noise and the truncated one is negligible, the polynomial-time reduction between the standard Ring-Learning with Errors (RLWE) problem and the RLWE problem with the truncated discrete Gaussian noise is established. This result ensures the indistinguishability under chosen-plaintext attacks (IND-CPA) security of the algorithm. By giving a lower bound condition on the size of the plaintext space, the computational overflow in encryption is avoided. Based on this condition, the mean square convergence and convergence rate of the algorithm are given. The trade-off between the security level and the convergence of the algorithm is presented. Finally, a numerical example is given to verify the effectiveness of the algorithm.

SoK: Critical Evaluation of Quantum Machine Learning for Adversarial Robustness

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.14989v3 Announce Type: replace Abstract: Quantum Machine Learning (QML) integrates quantum computational principles into learning algorithms, offering improved representational capacity and computational efficiency. However, the security and robustness of QML systems remain underexplored, particularly under adversarial conditions. We present the first comprehensive systematization of adversarial robustness in QML, combining conceptual organization with empirical evaluation across black-box, gray-box, and white-box threat models. We implement five representative attacks: a label-flipping poisoning attack under black-box; an encoder-level indiscriminate poisoning attack and a proxy-model clean-label backdoor attack under gray-box; and a circuit-level backdoor attack (QTrojan) and gradient-based evasion attacks (FGSM and PGD) under white-box. We evaluate these attacks using a Quantum Multilayer Perceptron (QMLP) trained on MNIST and AZ-Class across circuit depths of 2, 5, 10, and 50 layers with angle and amplitude encoding schemes. Our evaluations reveal a fundamental accuracy-robustness trade-off. Amplitude encoding achieves the highest clean accuracy (92.6% on MNIST and 67% on AZ-Class) but collapses under adversarial perturbations and depolarizing noise, whereas shallow angle-encoded models remain more stable. QUID is effective under noiseless conditions but weakened by noise, while the proxy-model backdoor persists unless the circuit itself is overwhelmed. Furthermore, the circuit-level backdoor fails in the multi-class setting, indicating a scalability limitation. Finally, QMLP models are more robust than Classical Multi-Layer Perceptron (CMLP) models under label-flipping attacks but substantially more vulnerable to gradient-based evasion. We conclude by proposing a threat-aware and noise-resilient framework for secure QML deployment.

Carbon-Aware Intrusion Detection: A Comparative Study of Supervised and Unsupervised DRL for Sustainable IoT Edge Gateways

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.18240v2 Announce Type: replace Abstract: The rapid expansion of the Internet of Things (IoT) has intensified cybersecurity challenges, particularly in mitigating Distributed Denial-of-Service (DDoS) attacks at the network edge. Traditional Intrusion Detection Systems (IDSs) face significant limitations, including poor adaptability to evolving and zero-day attacks, reliance on static signatures and labeled datasets, and inefficiency on resource-constrained edge gateways. Moreover, most existing DRL-based IDS studies overlook sustainability factors such as energy efficiency and carbon impact. To address these challenges, this paper proposes two novel Deep Reinforcement Learning (DRL)-based IDS: DeepEdgeIDS, a label-free Autoencoder-DRL hybrid, and AutoDRL-IDS, a supervised LSTM--DRL model. Both DRL-based IDS are validated through theoretical analysis and experimental evaluation on edge gateways. Results demonstrate that AutoDRL-IDS achieves 94% detection accuracy using labeled data, while DeepEdgeIDS attains 98% offline evaluation accuracy through label-free anomaly detection and online mitigation feedback. This study introduces a carbon-aware, multi-objective reward formulation that supports supervised reward optimization for AutoDRL-IDS and label-free online reward learning for DeepEdgeIDS, enabling sustainable real-time IDS operation in dynamic IoT networks.

Robust Differential Evolution via Nonlinear Population Size Reduction and Adaptive Restart: The ARRDE Algorithm

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.18429v4 Announce Type: replace Abstract: Robustness across heterogeneous optimization regimes remains a central challenge in bound-constrained continuous optimization. In practice, users often prefer optimizers that remain reliable across different dimensionalities, landscape structures, and evaluation budgets. Yet many Differential Evolution (DE) variants that perform strongly in one regime degrade substantially when transferred to others. To address this issue, we propose the \textit{Adaptive Restart--Refine Differential Evolution} (ARRDE) algorithm, a DE variant designed explicitly for cross-regime robustness. ARRDE combines an adaptive restart--refine strategy, a nonlinear population-size reduction schedule that depends on problem dimensionality, and a budget-aware population-initialization rule for restricted-budget settings. Because robustness cannot be established credibly from a narrow experimental setting, we evaluate ARRDE on five benchmark suites: CEC2011, CEC2017, CEC2019, CEC2020, and CEC2022. These suites span markedly different dimensions, landscape characteristics, and evaluation budgets, making this, to the best of our knowledge, one of the most comprehensive robustness-oriented evaluations reported for a proposed DE variant in this context. Since their official performance metrics emphasize different aspects and are not directly comparable, we additionally introduce a bounded accuracy-based scoring metric derived from relative error for cross-suite robustness assessment. Using both the official suite-specific metrics and the proposed unified metric, ARRDE demonstrates consistently strong performance and one of the most stable aggregate profiles across the five suites. These results support ARRDE as a competitive DE variant for robust optimization across heterogeneous benchmark regimes.

Artographer: a Curatorial Interface for Art Space Exploration

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.02288v2 Announce Type: replace Abstract: Relating a piece to previously established works is crucial in creating and engaging with art, but AI interfaces tend to obscure such relationships, rather than helping users explore them. Embedding models present new opportunities to support spatially exploring and relating artwork. We built Artographer, an art-exploration system featuring a zoomable 2-D map, constructed from similarity-clustered embeddings of ~16,000 historical artworks. We used Artographer as a design probe to explore how alternative artwork distribution interface design can shape media engagement: we invited 20 participants, including 9 art history scholars, to traverse the map, collecting artworks for a goal-driven task and while freely exploring. We identify values enacted in spatial art discovery (Visibility, Agency, Serendipity, Friction) and consider how these values challenge dominant design paradigms -- in particular, the recommendation systems governing contemporary media distribution platforms. We reimagine a curatorial approach to media distribution, within digital ecosystems where history and culture can thrive.

PystachIO: Efficient Distributed GPU Query Processing with PyTorch over Fast Networks & Fast Storage

Jigao Luo, Nils Boeschen, Muhammad El-Hindi, Carsten Binnig — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.02862v3 Announce Type: replace Abstract: The AI hardware boom has led modern data centers to adopt HPC-style architectures centered on distributed, GPU-centric computation. Large GPU clusters interconnected by fast RDMA networks and backed by high-bandwidth NVMe storage enable scalable computation and rapid access to storage-resident data. Tensor computation runtimes (TCRs), such as PyTorch, originally designed for AI workloads, have recently been shown to accelerate analytical workloads. However, prior work has primarily considered settings where the data fits in aggregated GPU memory. In this paper, we systematically study how TCRs can support scalable, distributed query processing for large-scale, storage-resident OLAP workloads. Although TCRs provide abstractions for network and storage I/O, naive use often underutilizes GPU and I/O bandwidth due to insufficient overlap between computation and data movement. As a core contribution, we present PystachIO, a prototype of a PyTorch-based distributed OLAP engine that combines fast network and storage I/O with key optimizations to maximize GPU, network, and storage utilization. Our evaluation shows up to 3x end-to-end speedups over existing distributed GPU-based query processing approaches.

Generative AI Practices, Literacy, and Divides: An Empirical Analysis in the Italian Context

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.03671v2 Announce Type: replace Abstract: The rise of generative AI (GenAI) chatbots accessible via conversational interfaces is transforming digital interactions and holds economic promise. However, these tools might deepen existing inequalities -- not only through uneven, socially stratified adoption, but through differentials in their purposeful, critical use. Drawing on original survey data from 1,906 Italian-speaking adults, we provide a comprehensive analysis of GenAI adoption, literacy, and usage patterns. Our findings show that GenAI is supporting diversified personal and professional activities and replacing traditional information-seeking tools. Yet less-educated and older individuals, and those with lower technology familiarity, are less likely to adopt it; 40% cite competence barriers as a key obstacle. Among users, AI training emerges as the primary predictor of purposeful, capital-enhancing engagement -- content creation, learning, and creativity enhancement -- while more passive, recreational uses (e.g., companionship, information seeking) remain insensitive to competence levels. We thus highlight digital literacy as a lever for how people leverage GenAI, not just whether they use it. Finally, gender operates as a persistent cross-cutting divide, shaping both adoption and usage frequency. These findings challenge the assumption that high accessibility translates into broadly shared gains. Rather, they offer a granular, multi-level account of emerging disparities in the GenAI era -- with implications for how this technology may ultimately drive outcomes and benefit divides.

Internal Deployment in the AI Act

Matteo Pistillo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.05742v3 Announce Type: replace Abstract: This memorandum analyzes and stress-tests arguments in favor and against the inclusion of internal deployment within the scope of the European Union Artificial Intelligence Act (AI Act). In doing so, it aims to offer several possible interpretative pathways to the European Commission, AI providers and deployers, courts, and the legal and policy community at large based on Articles 2(1), 2(6), 2(8) of the AI Act. Specifically, this memorandum first analyzes interpretative pathways based on Article 2(1)(a)-(c) supporting the application of the AI Act to internally deployed AI models and systems. Then, it examines possible objections and exceptions based on Articles 2(6) and 2(8), with particular attention to the complexity of the scientific R&D exception under Article 2(6). Finally, it illustrates how Articles 2(1), 2(6), and 2(8) can be viewed as complementary to each other, once broken down to their most plausible meaning and interpreted in conjunction with Articles 3(1), 3(3), 3(4), 3(9), 3(10), 3(11), 3(12), 3(63), and Recitals 12, 13, 21, 25, 97, 109, and 110.

Learning Dynamics from Infrequent Output Measurements for Uncertainty-Aware Optimal Control

Robert Lefringhausen, Theodor Springer, Sandra Hirche — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.08013v2 Announce Type: replace Abstract: Reliable optimal control is challenging when the dynamics of a nonlinear system are unknown and only infrequent, noisy output measurements are available. This work addresses this setting of limited sensing by formulating a Bayesian prior over the continuous-time dynamics and latent state trajectory in state-space form and updating it through a targeted Metropolis-Hastings sampler equipped with a numerical ODE integrator. The resulting posterior samples are used to formulate a scenario-based optimal control problem that accounts for the uncertainty in the dynamics and latent state and is solved using standard nonlinear programming methods. The approach is validated in a numerical case study on glucose regulation using a Type 1 diabetes model.

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.09447v2 Announce Type: replace Abstract: Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by 3.8 times, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.09806v2 Announce Type: replace Abstract: Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.13402v2 Announce Type: replace Abstract: Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

Verification of Unknown Dynamical Systems via Autoencoder Latent Space

Robert Reed, Luca Laurenti, Morteza Lahijanian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.13593v4 Announce Type: replace Abstract: Formal verification provides a powerful framework for proving that dynamical systems satisfy their specifications. However, these techniques face scalability challenges in high-dimensional settings, as they often rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring correctness of latent space verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements.

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

Shengfan Cao, Francesco Borrelli, Eunhyek Joa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.13788v3 Announce Type: replace Abstract: Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.14896v2 Announce Type: replace Abstract: In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.

Semantic Grounding of Digital Twin Metamodels Using RDF Graphs

Faima Abbasi, Jean-S\'ebastien Sottet, Cedric Pruski — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.15281v2 Announce Type: replace Abstract: Digital Twins (DTs) represent digital counterparts of physical systems, assets, or processes, referred to as the actual twin (AT). DTs integrate heterogeneous data, models, and semantic technologies to support monitoring, simulation, prediction, and optimization, enabling informed decision-making while maintaining a dynamic and accurate reflection of the AT. A key challenge is aligning heterogeneous models, which can cause semantic mismatches, inconsistencies, and synchronization issues. Existing approaches relying on static mappings and manual updates are often inflexible and error-prone. In this study, we address heterogeneity challenge in multi-layered DT, by introducing semantic grounding pipeline for multi-layered DTs that enables consistent and reliable interoperability between abstraction layers. We make three contributions. First, we design and implement multi-layered DT using flexible modelling framework, to organize data, model and metamodel layers. Second, we semantically lift DT metamodel to RDF graph for unified representation. Finally, we present a graph-based alignment approach (SSM-OM), which leverages semantic embeddings, lexical similarity, and large language model (LLM) reasoning to accurately establish and validate correspondences between the lifted metamodel and ontology. We validate correctness, interoperability, cross-layer traceability, domain applicability and general empirical performance through RDF tests, a DT usecase, and ontology alignment evaluation initiative (OAEI) benchmarks, demonstrating semantic consistency in multi-layered DT.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.23292v3 Announce Type: replace Abstract: The prevailing paradigm in AI for physical systems (scaling general-purpose foundation models toward universal multimodal reasoning) confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision--language models achieve only 50--53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility by violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation: perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway "toward" domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic nuclear reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. Scaling induces strong improvements in closed-loop reliability under nominal simulated conditions, with a steep but smooth gain at strict tolerances: small-scale systems exhibit high-variance imitation with severe tail excursions, while large-scale models undergo variance collapse (approximately 500times reduction), stabilizing execution-level behavior within the sampled distribution. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70\% of the training distribution, concentrating 95% of runtime execution on a single-bank strategy. This emergent policy distillation arises without reinforcement learning or reward engineering, driven solely by outcome-level success under physical execution.

Statistical Guarantees in the Search for Less Discriminatory Algorithms

Chris Hays, Ben Laufer, Solon Barocas, Manish Raghavan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.23943v2 Announce Type: replace Abstract: U.S. discrimination law can impose liability on firms that fail to adopt a less discriminatory alternative (LDA): a decision policy that achieves the same business objectives while reducing disparate impact on legally protected groups. Recent scholarship argues that this doctrine has direct implications for algorithmic decision-making in high-stakes domains such as employment, lending, and housing, potentially obligating firms to search for "less discriminatory algorithms" (Black et al., 2024). Regulators have at times encouraged proactive LDA searches, reinforcing the expectation of a good-faith effort to identify equally performant models with lower disparate impact. Model multiplicity makes such searches plausible: retraining with different random seeds can yield models with comparable predictive performance but materially different disparate impacts. Yet firms cannot retrain indefinitely, raising a central question: when is the search sufficient to demonstrate good faith? We formalize LDA search under multiplicity as an optimal stopping problem in which a developer seeks to produce evidence that further search is unlikely to yield meaningful improvements. Our main contribution is an adaptive stopping algorithm that provides a high-probability upper bound on the best disparate-impact gains attainable through continued retraining, enabling developers to certify (e.g., to a court) that additional search is unlikely to help. We also show how stronger distributional assumptions over the model space can yield tighter bounds, and we validate the approach on real-world credit and housing datasets.

Secure, Verifiable, and Scalable Multi-Client Data Sharing via Consensus-Based Privacy-Preserving Data Distribution

Prajwal Panth, Sahaj Raj Malla — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.00418v2 Announce Type: replace Abstract: We propose the Consensus-Based Privacy-Preserving Data Distribution (CPPDD) framework, a lightweight and post-setup autonomous protocol for secure multi-client data aggregation. The framework enforces unanimous-release confidentiality through a dual-layer protection mechanism that combines per-client affine masking with priority-driven sequential consensus locking. Decentralized integrity is verified via step (sigma_S) and data (sigma_D) checksums, facilitating autonomous malicious deviation detection and atomic abort without requiring persistent coordination. The design supports scalar, vector, and matrix payloads with O(N*D) computation and communication complexity, optional edge-server offloading, and resistance to collusion under N-1 corruptions. Formal analysis proves correctness, Consensus-Dependent Integrity and Fairness (CDIF) with overwhelming-probability abort on deviation, and IND-CPA security assuming a pseudorandom function family. Empirical evaluations on MNIST-derived vectors demonstrate linear scalability up to N = 500 with sub-millisecond per-client computation times. The framework achieves 100% malicious deviation detection, exact data recovery, and three-to-four orders of magnitude lower FLOPs compared to MPC and HE baselines. CPPDD enables atomic collaboration in secure voting, consortium federated learning, blockchain escrows, and geo-information capacity building, addressing critical gaps in scalability, trust minimization, and verifiable multi-party computation for regulated and resource-constrained environments.

Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning

Abhisek Ganguly, Santosh Ansumali, Sauro Succi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.00473v3 Announce Type: replace Abstract: We revisit the analogy between feed-forward deep neural networks (DNNs) and discrete dynamical systems derived from neural integral equations and their corresponding partial differential equation (PDE) forms. A comparative analysis between the numerical/exact solutions of the Burgers' and Eikonal equations, and the same obtained via PINNs is presented. We show that PINN learning provides a different computational pathway compared to standard numerical discretization in approximating essentially the same underlying dynamics of the system. Within this framework, DNNs can be interpreted as discrete dynamical systems whose layer-wise evolution approaches attractors, and multiple parameter configurations may yield comparable solutions, reflecting the degeneracy of the inverse mapping. In contrast to the structured operators associated with finite-difference (FD) procedures, PINNs learn dense parameter representations that are not directly associated with classical discretization stencils. This distributed representation generally involves a larger number of parameters, leading to reduced interpretability and increased computational cost. However, the additional flexibility of such representations may offer advantages in high-dimensional settings where classical grid-based methods become impractical.

A stable and accurate X-FFT solver for linear elastic homogenization problems in 3D

Flavia Gehrig, Matti Schneider — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.02172v3 Announce Type: replace Abstract: Although FFT-based methods are renowned for their numerical efficiency and stability, traditional discretizations fail to capture material interfaces that are not aligned with the grid, resulting in suboptimal accuracy. To address this issue, the work at hand introduces a novel FFT-based solver that achieves interface-conforming accuracy for three-dimensional mechanical problems. More precisely, we integrate the extended finite element (X-FEM) discretization into the FFT-based framework, leveraging its ability to resolve discontinuities via additional shape functions. We employ the modified abs(olute) enrichment and develop a preconditioner based on the concept of strongly stable GFEM, which mitigates the conditioning issues observed in traditional X-FEM implementations. Our computational studies demonstrate that the developed X-FFT solver achieves interface-conforming accuracy, numerical efficiency, and stability when solving three-dimensional linear elastic homogenization problems with smooth material interfaces.

Fairness in Opinion Dynamics

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.03859v3 Announce Type: replace Abstract: Ways in which people's opinions change are, without a doubt, subject to a rich tapestry of differing influences. Factors that affect how one arrives at an opinion reflect how they have been shaped by their environment throughout their lives, education, material status, what belief systems are they subscribed to, and what socio-economic minorities are they a part of. This already complex system is further expanded by the ever-changing nature of one's social network. It is therefore no surprise that many models have a tendency to perform best for the majority of the population and discriminating those people who are members of various marginalized groups . This bias and the study of how to counter it are subject to a rapidly developing field of Fairness in Social Network Analysis (SNA). The focus of this work is to look into how a state-of-the-art model discriminates certain minority groups and whether it is possible to reliably predict for whom it will perform worse. Moreover, is such prediction possible based solely on one's demographic or topological features? To this end, the NetSense dataset, together with a state-of-the-art CoDiNG model for opinion prediction have been employed. Our work explores how three classifier models (Demography-Based, Topology-Based, and Hybrid) perform when assessing for whom this algorithm will provide inaccurate predictions. Finally, through a comprehensive analysis of these experimental results, we identify four key patterns of algorithmic bias. Our findings suggest that no single paradigm provides the best results and that there is a real need for context-aware strategies in fairness-oriented social network analysis. We conclude that a multi-faceted approach, incorporating both individual attributes and network structures, is essential for reducing algorithmic bias and promoting inclusive decision-making.

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.04068v4 Announce Type: replace Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.The code is available at https://github.com/1170300714/Local-DPO.

Efficient training for compact compression models via sequential distillation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.05639v2 Announce Type: replace Abstract: Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to significantly reduce autoencoder-based compression networks in a more stable Knowledge Distillation process. The intuition is that highly reduced architectures benefit from simplified optimization objectives in early training, with complexity gradually introduced later. Therefore, our approach begins with a sequential encoder--decoder distillation stage that provides a robust initialization for the lightweight model. This is followed by standard training that can be regularized with latent distillation. We evaluate the resulting lightweight autoencoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity in early epochs better than training lightweight autoencoders with the original loss, making it practical for resource-limited environments.

iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.05877v3 Announce Type: replace Abstract: Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

Emissions and cost tradeoffs of time-matched clean electricity procurement under inter-annual weather variability -- case study of hydrogen production

Michael Giovanniello, Dharik S. Mallapragada — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.13202v2 Announce Type: replace Abstract: Regulators and voluntary corporate sustainability efforts are increasingly adopting time-matching requirements (TMRs) for clean electricity procurement for large loads, such as data centers, and electricity-intensive fuel production, such as hydrogen. We use a stochastic capacity expansion model (CEM) framework to assess how inter-annual weather variability affects the cost, composition, and emissions of procurement-driven infrastructure to meet annual and hourly TMRs using the case study of a grid-connected hydrogen producer in Texas. Our approach, which relies on co-optimizing investments and hourly operations over nine weather scenarios, reveals that hourly TMR comes at a higher cost premium compared to annual TMR than previously estimated by single-scenario deterministic modeling, while emissions outcomes remain directionally consistent. Demand flexibility and partial hourly TMR (80-90%) lower the cost premium while preserving emissions benefits. We further examine how binding renewable portfolio standards (RPS) interact with TMR costs and emissions outcomes. When an RPS is applied to non-H2 electricity demand, annual TMR reduces emissions comparably to hourly TMR at a lower cost. Incorporating H2-related electricity demand directly into the RPS constraint, rather than imposing a separate TMR, achieves similar emissions outcomes at still lower cost, suggesting that TMR-based clean electricity procurement, particularly hourly matching, offers limited additional value in regions with stringent grid decarbonization policies.

Legal Retrieval for Public Defenders

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.14348v3 Announce Type: replace Abstract: AI tools are suggested as solutions to assist public agencies with heavy workloads. In public defense -- where a constitutional right to counsel meets the complexities of law, overwhelming caseloads, and constrained resources -- practitioners face especially taxing conditions. Yet, there is little evidence of how AI could meaningfully support defenders' day-to-day work. In partnership with the New Jersey Office of the Public Defender, we develop the NJ BriefBank, a retrieval tool which surfaces relevant appellate briefs to streamline legal research and writing. We show that existing retrieval benchmarks fail to transfer to real public defense research, however adding domain knowledge improves retrieval quality. This includes query expansion with legal reasoning, domain-specific data and curated synthetic examples. To facilitate further research, we release a taxonomy of realistic defender search queries and a manually annotated evaluation dataset for public defense retrieval. This benchmark is highly correlated with a proprietary retrieval dataset annotated by experienced public defenders. Our work improves on the status quo of realistic legal retrieval benchmarking and illustrates one approach to applying AI in a real-world public interest setting.

Building Deep Graph Predictors with Graph Imitation Learning

Andr\'e Eberhard, Gerhard Neumann, Pascal Friederich — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.15133v3 Announce Type: replace Abstract: Recent years have seen substantial progress in neural generation of text, images, and audio, supported by mature training pipelines and large-scale optimization. For graphs, however, comparable progress has been more limited. We attribute this gap to graph-specific optimization and representation challenges that undermine the effectiveness of training neural networks with backpropagation and gradient descent. We argue that representing graphs on a fixed-size Euclidean grid, as is common in recently proposed models for supervised graph prediction, may not be the optimal choice in these settings. To support our view, we provide an analysis of neural graph generation methods and identify theoretical challenges that lead to pitfalls when training neural networks to produce graphs as their output. Motivated by this analysis, we introduce \textbf{GRA}ph~\textbf{I}mitation~\textbf{L}earning~(GRAIL), a framework for training neural networks in supervised settings in which the supervision signal is a graph. GRAIL generates graphs sequentially through a Markov decision process over embeddings of partial graphs, thereby avoiding the representation issues associated with fixed-size grid graph representations. We empirically show that GRAIL achieves competitive results on supervised graph prediction across a comprehensive suite of 18 benchmarks, matching or surpassing state-of-the-art methods in several settings.

AMBER: A Columnar Architecture for High-Performance Agent-Based Modeling in Python

Anh-Duy Pham — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.16292v2 Announce Type: replace Abstract: Python is widely used for agent-based modelling because it is accessible and has a mature scientific ecosystem, but object-per-agent execution incurs interpreter overhead that restricts the population sizes feasible in interactive modelling, calibration, and parameter sweeps. This paper presents AMBER, a Python framework that stores agent state in a Polars-backed columnar table and exposes population operations through a compact view API. The framework preserves conventional model and agent abstractions while translating common population updates into compiled column operations; behaviours that do not vectorise remain expressible through a buffered object-oriented path. We evaluate AMBER on wealth transfer, random walk, and spatial SIR benchmarks against Mesa, AgentPy, SimPy, Melodie, Agents.jl, and AMBER's own loop path, using invariant checks to verify comparable model outputs before timing. Across the tested workloads, AMBER has the lowest execution time among Python-hosted implementations and achieves speedups of up to $1118\times$ over Mesa; on the largest SIR benchmark it is also faster than the Julia-based Agents.jl implementation.

Self-Refining Video Sampling

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.18577v2 Announce Type: replace Abstract: Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

Explainability Methods for Hardware Trojan Detection: A Systematic Comparison

Paul Whitten, Francis Wolff, Chris Papachristou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.18696v4 Announce Type: replace Abstract: Hardware trojans are malicious circuits which compromise the functionality and security of an integrated circuit (IC). These circuits are manufactured directly into the silicon and cannot be fixed by security patches like software. The solution would require a costly product recall by replacing the IC and hence, early detection in the design process is essential. Hardware detection at best provides statistically based solutions with many false positives and false negatives. These detection methods require more thorough explainable analysis to filter out false indicators. Existing explainability methods developed for general domains like image classification may not provide the actionable insights that hardware engineers need. A question remains: How do domain-aware property analysis, model-agnostic case-based reasoning, and model-agnostic feature attribution techniques compare for hardware security applications? This work compares three categories of explainability for gate-level hardware trojan detection on the Trust-Hub benchmark dataset: (1) domain-aware property-based analysis of 31 circuit-specific features derived from gate fanin patterns, flip-flop distances, and primary Input/Output (I/O) connectivity; (2) model-agnostic case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution methods (Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), gradient) that provide generic importance scores without circuit-level context.

When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

Nima Leclerc, Chris Miller, Nicholas Brawand — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.18973v4 Announce Type: replace Abstract: Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. We further introduce a few-shot pre-adaptation protocol that estimates the optimal adaptation budget from $N{=}3$-5 probe steps within 3-19% relative error across out-of-distribution regimes.

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.21662v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we also demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

Learning Incentive Structures for Cooperative Resilience in Multi-Agent Systems under Social Dilemmas

Manuela Chacon-Chamorro, Luis Felipe Giraldo, Nicanor Quijano — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.22292v2 Announce Type: replace Abstract: Multi-agent social dilemmas, such as the tragedy of the commons, capture settings where individual incentives conflict with collective well-being, making these systems highly vulnerable to collapse under disruptions. In this context, this work studies cooperative resilience, understood as the system-level ability to maintain collective well-being under perturbations through adaptive agent behavior. We propose a framework for learning incentive structures aligned with collective well-being in multi-agent reinforcement learning systems, where reward functions shape individual decision-making and collective behavior. A resilience metric is used to score and rank agent trajectories, allowing the inference of reward functions that promote resilient collective behavior. These inferred reward functions are integrated into the multi-agent reinforcement learning process to shape agent interactions in social dilemma settings. The approach is evaluated in resource-sharing environments subject to disruptions, using three incentive structures: individual incentives, resilience-aligned incentives, and a hybrid incentive structure that combines both individual and collective components. The results show that the hybrid incentive structure promotes sustained collective behavior, reduces collapse events associated with resource depletion, and preserves system performance under disruption. These findings highlight the role of incentive design as a mechanism for promoting resilient collective behavior and provide a computational framework for multi-agent social dilemmas under disruptions.

Learning to Defer in Non-Stationary Time Series via Switching State-Space Models

Yannis Montreuil, Letian Yu, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.22538v2 Announce Type: replace Abstract: Learning-to-defer (L2D) routes each decision to a system's own predictor or to an external expert. Streaming time-series settings break the offline-L2D assumptions: the data are non-stationary, expert availability shifts over time, and the internal predictor is trained online. We propose L2D-SLDS, a one-stage online L2D framework based on a factorized switching linear-Gaussian state-space model over all potential residuals: a discrete regime, a shared global factor, and per-expert idiosyncratic states. The always-observed internal residual continuously updates beliefs about every unqueried expert through the shared factor, and a learner-aware query score balances immediate cost against latent-state information gain and one-step learner improvement. We prove an oracle inequality against a time-varying learn-and-defer comparator, decomposing regret into a query-bonus budget, an SLDS predictive-cost-error term~$\mathcal{E}_{\mathrm{SLDS}}$, and the internal learner's interval dynamic regret. On synthetic, Melbourne, Jena, and 24-expert Delhi benchmarks, L2D-SLDS is competitive with or improves on contextual- and non-stationary-bandit baselines while deferring on ${<}2\%$ of real-data rounds.

Stable Personas: Dual-Assessment of Temporal Stability in LLM-Based Human Simulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.22812v2 Announce Type: replace Abstract: Large Language Models (LLMs) acting as artificial agents offer the potential for scalable behavioral research, yet their validity depends on whether LLMs can maintain stable personas across extended conversations. We address this point using a dual-assessment framework measuring both self-reported characteristics and observer-rated persona expression. Across two experiments testing four persona conditions (default, high, moderate, and low ADHD presentations), seven LLMs, and three semantically equivalent persona prompts, we examine between-conversation stability (3,473 conversations) and within-conversation stability (1,370 conversations and 18 turns). Self-reports remain highly stable both between and within conversations. However, observer ratings reveal a tendency for persona expressions to decline during extended conversations. These findings suggest that persona-instructed LLMs produce stable, persona-aligned self-reports, an important prerequisite for behavioral research, while identifying this regression tendency as a boundary condition for multi-agent social simulation.

DC-LA: Difference-of-Convex Langevin Algorithm

Hoang Phuc Hau Luu, Zhongjian Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.22932v2 Announce Type: replace Abstract: We study a sampling problem whose target distribution is $\pi \propto \exp(-f-r)$ where the data fidelity term $f$ is Lipschitz smooth while the regularizer term $r=r_1-r_2$ is a non-smooth difference-of-convex (DC) function, i.e., $r_1,r_2$ are convex. By leveraging the DC structure of $r$, we can smooth out $r$ by applying Moreau envelopes to $r_1$ and $r_2$ separately. In line with DC programming, we then redistribute the concave part of the regularizer to the data fidelity and study its corresponding proximal Langevin algorithm (termed DC-LA). We establish convergence of DC-LA to the target distribution $\pi$, up to discretization and smoothing errors, in the $q$-Wasserstein distance for all $q \in \mathbb{N}^*$, under the assumption that $V$ is distant dissipative. Our results improve previous work on non-log-concave sampling in terms of a more general framework and assumptions. Numerical experiments show that DC-LA produces accurate distributions in synthetic settings and provides qualitatively reasonable uncertainty quantification in a real-world Computed Tomography application.

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.23086v2 Announce Type: replace Abstract: Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.00593v3 Announce Type: replace Abstract: Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model (Gemini-3.1-Pro) achieves only 51.7% average accuracy, even with access to visual ground truth and search tools. Our analysis attributes this low accuracy to three factors, frequent visual grounding errors even with visual ground truth, shallow search harnessing, and VLM's inability to retrieve long-tail, unstructured local information. This striking gap exposes the limitations of current models in assisting humans with real-world scenarios that demand overwhelming visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the next generation of language-vision agents that seamlessly integrate fine-grained perception with robust knowledge search.

MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.00933v3 Announce Type: replace Abstract: The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.01273v4 Announce Type: replace Abstract: Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by 6.14$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.02304v2 Announce Type: replace Abstract: Large-scale foundation models exhibit \emph{behavioral shifts} when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are structurally ill-suited to explain these shifts, because they either treat models as static objects, as traditional eXplainable AI (XAI) approaches do, or merely compare independent explanations across different checkpoints of a model. As a result, these approaches fail to explain the functional transition between two model instances in which a certain behavior has shifted following an intervention. This gap creates significant governance risks across jurisdictions including the EU AI Act, US state legislation, and Chinese AI regulations, which require documenting causal chains for substantial system modifications. This position paper argues that explaining behavioral shifts in large language models requires a principled approach that treats the shift itself as the primary object of explanation: namely, one that explains how and why an intervention transforms a reference model into an updated model with different behavior. To support this claim, we introduce \textit{Comparative} XAI (XAI$_\Delta$), a novel XAI paradigm aimed at explaining the difference between two model checkpoints where a behavior has shifted, together with a set of desiderata specifying what XAI$_\Delta$ explainers and explanations must satisfy, including comparability, validity, actionability, and monitoring, with the goal of grounding model auditing in explicit, measurable requirements. Finally, we provide preliminary evidence suggesting the need for XAI$_\Delta$ in practice through illustrative experiments, compiling the resulting findings into a transition report directly usable for governance and incident documentation.

MARS: Modular Agent with Reflective Search for Automated AI Research

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.02660v3 Announce Type: replace Abstract: A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

Graph Autoencoder for Process Monitoring

Xiangrui Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.03004v2 Announce Type: replace Abstract: To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

Marco Job, Thomas Stastny, Eleni Kelasidi, Roland Siegwart, Michael Pantic — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.03209v2 Announce Type: replace Abstract: Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on a Nvidia Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

The Complexity of Maximal/Closed Frequent Tree Mining for Bounded Height Trees

Kenta Komoto, Kazuhiro Kurita, Hirotaka Ono — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.03436v3 Announce Type: replace Abstract: Frequent tree mining asks us to enumerate tree patterns that occur frequently in a database of rooted trees. This problem is motivated by tree-structured data in bioinformatics, such as glycans and pseudoknot-free RNA secondary structures. A direct enumeration of all frequent trees is often highly redundant, because every subtree of a frequent tree is again frequent. Closed and maximal frequent trees are standard ways to reduce this redundancy, but their enumeration can still be computationally hard. In this paper, we study the effect of bounding the height of the input trees. This is a natural restriction for rooted trees, since the height is the depth of the hierarchy. We ask whether closed/maximal frequent tree mining remains hard when every input tree has a small height. Our results show that the answer depends sharply on the model. For rooted unordered trees of height at most 2, we give a polynomial-delay algorithm for enumerating closed frequent trees. On the other hand, for rooted ordered trees of height at most 2, we show that an output-polynomial time algorithm for enumerating closed frequent trees would imply an output-polynomial time algorithm for Dualization. For maximal frequent tree enumeration, we prove that no output-polynomial time algorithm exists unless P = NP already for rooted ordered trees of height at most 2 and for rooted unordered trees of height at most 3. Thus, even very small height bounds do not make the enumeration problems easy in general. At the same time, the unordered closed case of height at most 2 admits polynomial-delay enumeration. These results give a height-based classification of the complexity of closed and maximal frequent tree mining on shallow rooted trees.

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.04876v2 Announce Type: replace Abstract: We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

Jianhong Chen, Naichen Shi, Xubo Yue — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.04907v2 Announce Type: replace Abstract: Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06025v2 Announce Type: replace Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present \textbf{BudgetMem}, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., \textsc{Low}/\textsc{Mid}/\textsc{High}). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06358v2 Announce Type: replace Abstract: We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM's own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/MuLabPKU/SHINE

Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

Emanuel Sommer, Kangning Diao, Jakob Robnik, Uros Seljak, David R\"ugamer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06500v2 Announce Type: replace Abstract: Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

Evolutionary Generation of Multi-Agent Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06511v3 Announce Type: replace Abstract: Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard. Code is available at https://github.com/amazon-science/EvoMAS

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

Meng Lou, Stanley Yu, Yizhou Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06862v2 Announce Type: replace Abstract: Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since ParaX modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experimental results demonstrate the superiority of ParaX across diverse visual recognition tasks. Code is publicly released at: https://github.com/LMMMEng/ParaX.

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.06886v3 Announce Type: replace Abstract: Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs--SD3, SD3.5, and FLUX.1 by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text--image generation quality.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

Xian Wu, Kaijie Zhu, Ying Zhang, Lun Wang, Wenbo Guo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.07832v2 Announce Type: replace Abstract: Process rewards have been widely used in deep reinforcement learning to improve training efficiency, reduce variance, and prevent reward hacking. In LLM reasoning, existing works also explore various solutions for learning effective process reward models (PRM) with or without the help of an expert policy. However, existing methods either rely on strong assumptions about the expert policies (e.g., requiring their reward functions) or suffer intrinsic limitations (e.g., entropy collapse), resulting in weak PRMs or limited generalizability. In this paper, we introduce rePIRL, an inverse RL-inspired framework that learns effective PRMs with minimal assumptions about expert policies. Specifically, we design a dual learning process that updates the policy and the PRM interchangeably. Our learning algorithm has customized techniques to address the challenges of scaling traditional inverse RL to LLMs. We theoretically show that our proposed learning framework can unify both online and offline PRM learning methods, justifying that rePIRL can learn PRMs with minimal assumptions. Empirical evaluations on standardized math and coding reasoning datasets demonstrate the effectiveness of rePIRL over existing methods. We further show the application of our trained PRM in test-time training, test-time scaling, and providing an early signal for training hard problems. Finally, we validate our training recipe and key design choices via a detailed ablation study.

CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.08023v3 Announce Type: replace Abstract: Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce \textit{CTFExplorer}, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Ning Yang, Chengzhi Wang, Yibo Liu, Baoliang Tian, Haijun Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.08686v2 Announce Type: replace Abstract: Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\bar\rho{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.

Bayesian Preference Learning for Test-Time Steerable Reward Models

Jiwoo Hong, Shao Tang, Zhipeng Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.08819v2 Announce Type: replace Abstract: Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.

AI-Assisted Scientific Assessment: A Case Study on Climate Change

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.09723v2 Announce Type: replace Abstract: The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.10408v2 Announce Type: replace Abstract: Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.

Compute Only Once: UG-Separation for Efficient Large Recommendation Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.10455v2 Announce Type: replace Abstract: Driven by scaling laws, recommender systems increasingly rely on larger-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models can reuse user-side computation through KV Caching, such reuse is difficult in TokenMixer-based dense feature interaction architectures, where user and group features are deeply entangled and mixed-up across layers. In this work, we present User-Group Separation (UG-Sep), an industrial large-scale framework that enables user-side computation reusable in TokenMixer-based dense interaction models for the first time. UG-Sep explicitly disentangles user-side and item-side information flows within token-mixing layers, ensuring that a subset of tokens preserves purely user-side representations across layers. This design allows the corresponding per-token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for the potential expressive capacity loss induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs suppressed user-item interactions. Moreover, as UG-Sep substantially reduces user-side FLOPs and exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance to validate the effectiveness of UG-Sep. Results show that UG-Sep reduces inference latency by up to 20% without causing adverse changes to online user experience and commercial metrics on multiple influential business scenarios compared to TokenMixer at ByteDance, including Douyin Feed Recommendation, Hongguo Feed Recommendation, Chuanshanjia Ads, and Qianchuan Ads.

What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.11499v2 Announce Type: replace Abstract: Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose an innovative CoT dataset named hicodet-6K for supervised fine-tuning (SFT), which effectively bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment reasoning with domain-specific tools to resolve visual-semantic ambiguities and hallucinations during inference. Moreover, we incorporate a generative model to reconstruct alternative viewpoints, enabling the agent to 'imagine' under limited viewpoints. Finally, we propose a composite reward mechanism to jointly optimize prediction accuracy and tool efficiency. Evaluations on both SWIG-HOI and HICO-DET datasets demonstrate that our method achieves state-of-the-art performance while requiring merely 36.7% of the training data compared to existing methods, validating our robustness, empirical effectiveness and efficiency.

Learning to Configure Agentic AI Systems

Aditya Taparia, Som Sagar, Ransalu Senanayake — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.11574v2 Announce Type: replace Abstract: Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling {\tau}-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward

Edward Y. Chang, Longling Geng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.11675v4 Announce Type: replace Abstract: Large language models can answer causal questions correctly for the wrong reasons. Current RL methods reward \emph{what} a model concludes but ignore \emph{why}, reinforcing correlational shortcuts -- a failure we call \emph{Reward Entrenchment}. We introduce \emph{Epistemic Regret Minimization} (\erm), a framework that critiques the causal \emph{structure} of a model's reasoning trace rather than its answer. Applying established causal principles, \erm flags unexamined confounders, correlation--intervention conflation, and unchecked back-door paths from exposed reasoning traces. The framework admits \emph{label-free} operation -- without the true causal graph or correct answer -- and we separately distinguish favorable benchmark-derived critique, error-direction cues, and fully label-free judge-generated critique in the experiments. Within a single episode, \erm detects and repairs causal reasoning errors; across episodes, it accumulates interventional evidence into a reward signal applicable where no answer key exists. Experiments on 1,360 scenarios across six frontier LLMs show that reasoning-heavy models (GPT-4 Turbo, GPT-5.2) resist outcome-only correction (25--31\% recovery) yet respond to causal critique (78--91\%), gaining $+53$--$59$ pp. Standard test-time methods (self-consistency, Best-of-$N$, Self-Refine) \emph{underperform} outcome-only reprompting on causal tasks, while ERM reduces residual Rung Collapse from 55--70\% to 4\%. A separation theorem proves outcome-only reward cannot close this gap; a controlled simulation confirms epistemic feedback does, outperforming outcome-only baselines 38-fold.

Federated Learning of Nonlinear Temporal Dynamics with Graph Attention-based Cross-Client Interpretability

Ayse Tursucular, Ayush Mohanty, Nazal Mohamed, Nagi Gebraeel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.13485v2 Announce Type: replace Abstract: Networks of modern industrial systems are increasingly monitored by distributed sensors, where each system comprises multiple subsystems generating high dimensional time series data. These subsystems are often interdependent, making it important to understand how temporal patterns at one subsystem relate to others. This is challenging in decentralized settings where raw measurements cannot be shared and client observations are heterogeneous. In practical deployments each subsystem (client) operates a fixed proprietary model that cannot be modified or retrained, limiting existing approaches. Nonlinear dynamics further make cross client temporal interdependencies difficult to interpret because they are embedded in nonlinear state transition functions. We present a federated framework for learning temporal interdependencies across clients under these constraints. Each client maps high dimensional local observations to low dimensional latent states using a nonlinear state space model. A central server learns a graph structured neural state transition model over the communicated latent states using a Graph Attention Network. For interpretability we relate the Jacobian of the learned server side transition model to attention coefficients, providing the first interpretable characterization of cross client temporal interdependencies in decentralized nonlinear systems. We establish theoretical convergence guarantees to a centralized oracle and validate the framework through synthetic experiments demonstrating convergence, interpretability, scalability and privacy. Additional real world experiments show performance comparable to decentralized baselines.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Melkamu Abay Mersha, Jugal Kalita — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.16608v2 Announce Type: replace Abstract: Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

Flow Map Language Models: One-step Language Modeling via Continuous Denoising

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.16813v3 Announce Type: replace Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. Despite their promise, these models typically produce samples whose quality sharply degrades in the few-step regime, preventing a dramatic speedup in practice. Here, we show that language models based on continuous flows over one-hot token embeddings can outperform discrete diffusion in both quality and speed. Importantly, our continuous formulation defines a unique flow map that can be learned directly for efficient few-step inference, a structure we show is unavailable to discrete methods. In this setting, we show that both the flow and its associated flow map can be learned with simple cross-entropy objectives that respect the simplex geometry of the data, and we identify three distinct choices for flow map distillation whose performance we compare in practice. Using these insights, we build a flow language model (FLM), a continuous flow that matches state-of-the-art discrete diffusion baselines on the One Billion Words (LM1B) and OpenWebText (OWT) datasets. We then distill FLM into a flow map language model (FMLM), whose one-step generation exceeds the 8-step quality of recent few-step discrete diffusion language models. Our work challenges the widely-held hypothesis that discrete noising processes are necessary for generative modeling over discrete modalities and paves the way toward accelerated language modeling at scale. Code is available at https://github.com/david3684/flm.

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo, Sunwoo Lee, Seungyul Han — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.17062v2 Announce Type: replace Abstract: Value decomposition is a core approach for cooperative multi-agent reinforcement learning (MARL). However, existing methods still rely on a single optimal action and struggle to adapt when the underlying value function shifts during training, often converging to suboptimal policies. To address this limitation, we propose Successive Sub-value Q-learning (S2Q), which learns multiple sub-value functions to retain alternative high-value actions. Incorporating these sub-value functions into a Softmax-based behavior policy, S2Q encourages persistent exploration and enables $Q^{\text{tot}}$ to adjust quickly to the changing optima. Experiments on challenging MARL benchmarks confirm that S2Q consistently outperforms various MARL algorithms, demonstrating improved adaptability and overall performance. Our code is available at https://github.com/hyeon1996/S2Q.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

Xiuying Wei, Caglar Gulcehre — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.18196v4 Announce Type: replace Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D=16, and drops by about 2--3 points at D=64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.

VLANeXt: Recipes for Building Strong VLA Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.18532v2 Announce Type: replace Abstract: Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.19320v2 Announce Type: replace Abstract: Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.20399v2 Announce Type: replace Abstract: Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by 2$\times$. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at https://github.com/Physics-Scaling/GeoPT.

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.24138v2 Announce Type: replace Abstract: Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.00086v2 Announce Type: replace Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment, pending validation on larger corpora.

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Lennon J. Shikhman — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.01406v2 Announce Type: replace Abstract: Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.01712v2 Announce Type: replace Abstract: Fine-tuning large language models for vertical domains remains labor-intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end-to-end LLM fine-tuning has not been systematically studied as an interactive agent task. We introduce FT-Dojo, an interactive benchmark environment for autonomous LLM fine-tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT-Dojo standardizes a task interface, shared raw-data repository, sandboxed execution environment, structured feedback protocol, and held-out evaluation procedure. We further develop FT-Agent, a fine-tuning-oriented autonomous framework that uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine data and training strategies. Experiments show that FT-Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open-source planning backbones, and multi-run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long-horizon planning. The implementation is available at https://github.com/microsoft/rd-agent.

SPARC: Spatial-Aware Path Planning via Attentive Robot Communication

Sayang Mu, Xiangyu Wu, Bo An — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.02845v3 Announce Type: replace Abstract: Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat all neighboring robots equally regardless of their spatial proximity, leading to diluted attention in congested regions where coordination matters most. We propose Relation enhanced Multi Head Attention (RMHA), a communication mechanism that explicitly embeds pairwise Manhattan distances into the attention weight computation, enabling each robot to dynamically prioritize messages from spatially relevant neighbors. Combined with a distance-constrained attention mask and GRU gated message fusion, RMHA integrates seamlessly with MAPPO for stable end-to-end training. In zero-shot generalization from 8 training robots to 128 test robots on 40x40 grids, RMHA achieves approximately 75 percent success rate at 30 percent obstacle density outperforming the best baseline by over 25 percentage points. Ablation studies confirm that distance-relation encoding is the key contributor to success rate improvement in high-density environments. Index Terms-Multi-robot path planning, graph attention mechanism, multi-head attention, communication optimization, cooperative decision-making

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.06007v2 Announce Type: replace Abstract: Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

An Investigation of Stabilization Scaling in Finite-Strain Virtual Element Methods for Hyperelasticity

Paulo Akira F. Enabe, Rodrigo Provasi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.07218v2 Announce Type: replace Abstract: Low-order virtual element methods (VEM) compute a consistent finite-strain contribution through polynomial projections and rely on stabilization to control the unresolved modes in the projector kernel. In current hyperelastic VEM practice, stabilization is often defined by integrating a nonlinear surrogate energy over an auxiliary sub-triangulation and scaled through modified Lam\'e parameters and incompressibility factors; this can introduce sensitivity to the arbitrary internal tessellation, complicate consistent Newton linearization, and, most critically, inject bulk-dependent proxies into shear-type kernel penalties, artificially stiffening isochoric missing modes in the nearly incompressible regime. This work develops a submesh-free, kernel-only stabilization that decouples deviatoric and volumetric channels and is explicitly designed to scale like the current Newton tangent energy on the kernel: the deviatoric term is scaled solely by a shear measure and enhanced by bounded geometry-driven directional weights, while the volumetric term is scaled by an independent bulk measure and can be capped or suppressed as $\nu\to 1/2$. A spectral framework is established in which the canonical VEM stability requirement on the kernel is characterized by generalized Rayleigh quotients and eigenvalue bounds, and it is shown under standard polygon regularity assumptions that the deviatoric stabilization is uniformly equivalent to $\mu_E|\cdot|_{1,E}^2$ on the kernel with constants independent of mesh size and Poisson ratio. Element-level diagnostics confirm that classical surrogate-based stabilizations assign bulk-driven energy to isochoric kernel modes as $\nu\to 1/2$, whereas the proposed decoupled stabilization remains shear-scaled; kernel spectra and Cook's membrane simulations in the nearly incompressible regime further support improved robustness across polygonal mesh families.

Coordination Games on Multiplex Networks: Consensus, Convergence, and Stability of Opinion Dynamics

Ruey-An Shiu, Parinaz Naghizadeh — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.07633v2 Announce Type: replace Abstract: This paper studies opinion dynamics in multilayer (social) networks. Extending a single-layer model, we formulate opinion updates as a synchronous coordination game in which agents minimize a local cost to stay close to their neighbors' opinions. We propose two coupling mechanisms between layers: (i) a merged model that aggregates layers through weighted influences, and (ii) a switching model that periodically alternates across layers. Using random-walk and spectral analysis, we derive sufficient conditions for consensus, characterize convergence rates, and analyze stability under network perturbations. We show that multilayer interactions can induce or accelerate global consensus even when no single layer achieves it alone, and conversely, that individually coordinated layers may lose consensus once interconnected. Notably, we show that similarity between the layers (as captured by alignments in the weighted degrees of the nodes) is a main determinant of if merging or switching can speed up convergence to consensus compared to when the layers operate in isolation, providing guidelines for network design interventions.

C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.08155v3 Announce Type: replace Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce \textbf{Control Classifier-Free Guidance (C$^2$FG)}, a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.08235v2 Announce Type: replace Abstract: Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.09024v2 Announce Type: replace Abstract: Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER - a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $\theta$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error with increasing a locality parameter indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has a low per-update time and memory. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.

The coordination gap in frontier AI safety policies

Isaak Mengesha — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.10015v2 Announce Type: replace Abstract: Frontier AI Safety Policies concentrate on prevention: capability evaluations, deployment gates, and usage constraints, while neglecting the capacity to coordinate responses when prevention fails. We argue this coordination gap is structural: investments in ecosystem robustness yield diffuse benefits but concentrated costs, generating systematic underinvestment. Drawing on risk regimes in nuclear safety, pandemic preparedness, and critical infrastructure, we propose that similar mechanisms (precommitment, shared protocols, standing coordination venues) could be adapted to frontier AI governance. Closing the gap requires cross-actor "note-exchange" of ex ante if-then response logic, exposing not only triggers but the decision processes that convert signals into actions. Without such architecture, institutions cannot learn from failures at the pace of relevance.

The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

Romain Peyrichou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.10139v2 Announce Type: replace Abstract: Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or -- given only examples -- to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent -- they characterize the same set -- but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization "generation is easy, parsing is hard" is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions -- directionality and temporality -- have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.

Breaking User-Centric Agency: A Tri-Party Framework for Agent-Based Recommendation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.10673v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have stimulated growing interest in agent-based recommender systems, enabling language-driven interaction and reasoning for more expressive preference modeling. However, most existing agentic approaches remain predominantly user-centric, treating items as passive entities and neglecting the interests of other critical stakeholders. This limitation exacerbates exposure concentration and long-tail under-representation, threatening long-term system sustainability. In this work, we identify this fundamental limitation and propose the first Tri-party LLM-agent Recommendation framework (TriRec) that explicitly coordinates user utility, item exposure, and platform-level fairness. The framework employs a two-stage architecture: Stage 1 empowers item agents with personalized self-promotion to improve matching quality and alleviate cold-start barriers, while Stage 2 performs platform-level sequential multi-objective re-ranking, balancing user relevance, item utility, and exposure fairness. Experiments show consistent gains in accuracy, fairness, and item-level utility. Moreover, we find that item self-promotion can simultaneously enhance fairness and effectiveness, challenging the conventional trade-off assumption between relevance and fairness. Our code is available at https://github.com/Marfekey/TriRec.

Riemannian MeanFlow for One-Step Generation on Manifolds

Zichen Zhong, Haoliang Sun, Yukun Zhao, Yongshun Gong, Yilong Yin — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.10718v2 Announce Type: replace Abstract: Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, SO(3), and SE(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.10726v2 Announce Type: replace Abstract: Large Language Models (LLMs) rely on optimizations like Automatic Prefix Caching (APC) to accelerate inference. APC works by reusing previously computed states for the beginning part of a request (prefix), when another request starts with the same text. While APC improves throughput, it introduces timing side channels: cache hits are faster than misses, creating observable latency differences. In multi-tenant systems, attackers can exploit these differences to infer sensitive information, e.g., by incrementally reconstructing another user's request by observing hit/miss patterns. Current defenses take a sledgehammer approach: they disable APC and cache sharing, isolating users, and sacrificing efficiency for regular users. This paper presents PrefixWall, a system that secures multi-tenant LLM serving systems against APC side channels without sacrificing performance and efficiency. PrefixWall monitors cache reuse across users, flags suspicious sharing, and selectively isolates prefixes, restricting their reuse only when necessary. Evaluation shows that PrefixWall enables up to 70% higher cache reuse and 30% lower inference latency compared to existing defenses that isolate users. PrefixWall's lightweight design demonstrates how security in LLM serving does not have to come at the cost of unnecessarily reduced performance or unbearable overheads.

Diffusion Models Memorize in Training -- and Generalize in Inference

Tim Kaiser, Markus Kollmann — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.13419v2 Announce Type: replace Abstract: Diffusion models generalize well in practice. However, an optimal diffusion model fully memorizes the training data and therefore fails to generalize, raising the question of what induces generalization in a real diffusion model. We show that, despite generalizing at the sample level, diffusion models progressively overfit the denoising training objective and thereby create a generalization gap between the performance on validation and training samples. This gap is most pronounced at intermediate noise levels. Using a fully analytic error-prone toy model, we trace the factors affecting the generalization gap. We find that the optimal denoising flow field localizes sharply around training points, but the model error suppresses the exact recall of training points, yielding a smooth, generalizing flow field. Finally, we find that the generalization gap observed in training does not translate to inference, which would result in a strong similarity between generated samples and training samples. This is because the intermediate states of sampling trajectories are sufficiently far from the distribution of noisy training samples the model is trained on. Together, these findings reveal a novel picture of how diffusion models generalize: the flow field generalizes through model error, which moves sampling trajectories outside the domain of noisy training samples and thereby naturally prevents overfitting.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.14184v2 Announce Type: replace Abstract: Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.14392v2 Announce Type: replace Abstract: Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance. The code is available at https://github.com/511205787/WestWorld.

Dual Quaternion Based Contact Modeling for Fast and Smooth Collision Recovery of Quadrotors

Valentin Gaucher, Wenlong Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.14698v3 Announce Type: replace Abstract: Unmanned aerial vehicles (UAVs) operating in cluttered environments require efficient and accurate impact modeling to maintain stability post collisions, however classical impulse contact models decouple the normal and tangential components. This letter presents a dual quaternion impulse reset map directly on the SE(3) manifold. By operating on the unified spatial twist (unified linear and angular velocities), the proposed formulation retains the cross-coupling between normal and tangential impulse components in a single closed-form expression, and recovers the classical decoupled Newton impulse model as a special case. A recovery controller is designed that couples linear and angular momentum to enforce kinetic energy dissipation across impacts. Hardware-in-the-loop benchmarks demonstrate a 24\% reduction in execution latency compared to an optimized matrix-based implementation, and a 20\% reduction relative to a position-plus-quaternion (PQ) formulation. MuJoCo simulations across Monte Carlo sweeps over impact angles and friction coefficients show a 50.8\%-75.1\% reduction in position root-mean-square error (RMSE) and a 68.7\%-85\% decrease in peak kinetic energy compared to published linear-admittance baselines.

Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning

Jeremy J Samuelson — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.15842v2 Announce Type: replace Abstract: Modern machine learning systems increasingly rely on sensitive data, creating significant privacy, security, and regulatory risks that existing privacy-preserving machine learning (ppML) techniques, such as Differential Privacy (DP) and Homomorphic Encryption (HE), address only at the cost of degraded performance, increased complexity, or prohibitive computational overhead. This paper introduces Informationally Compressive Anonymization (ICA) and the VEIL architecture, a privacy-preserving ML framework that achieves strong privacy guarantees through architectural and mathematical design rather than noise injection or cryptography. ICA embeds a supervised, multi-objective encoder within a trusted Source Environment to transform raw inputs into low-dimensional, task-aligned latent representations, ensuring that only irreversibly anonymized vectors are exported to untrusted training and inference environments. The paper rigorously proves that these encodings are structurally non-invertible using topological and information-theoretic arguments, showing that inversion is logically impossible, even under idealized attacker assumptions, and that, in realistic deployments, the attacker conditional entropy over the original data diverges, driving reconstruction probability to zero. Unlike prior autoencoder-based ppML approaches, ICA preserves predictive utility by aligning representation learning with downstream supervised objectives, enabling low-latency, high-performance ML without gradient clipping, noise budgets, or encryption at inference time. The VEIL architecture enforces strict trust boundaries, supports scalable multi-region deployment, and naturally aligns with privacy-by-design regulatory frameworks, establishing a new foundation for enterprise ML that is secure, performant, and safe by construction, even in the face of post-quantum threats.

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.16513v3 Announce Type: replace Abstract: Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.

ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

Romil Imtiaz, Dimitris K. Iakovidis — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.17784v2 Announce Type: replace Abstract: We developed a multi-label gastrointestinal video analysis pipeline based on a ResNet-50 frame classifier followed by anatomy-guided temporal event decoding. The system predicts 17 labels, including 5 anatomy classes and 12 pathology classes, from frames resized to 336x336. A major challenge was severe class imbalance, particularly for rare pathology labels. To address this, we used clipped class-wise positive weighting in the training loss, which improved rare-class learning while maintaining stable optimization. At the temporal stage, we found that direct frame-to-event conversion produced fragmented mismatches with the official ground truth. The final submission therefore combined GT-style framewise event composition, anatomy vote smoothing, and anatomy-based pathology gating with a conservative hysteresis decoder. This design improved the final temporal mAP from 0.3801 to 0.4303 on the challenge test set.

Verifiable Error Bounds for Physics-Informed Neural Network Solutions of Lyapunov and Hamilton-Jacobi-Bellman Equations

Jun Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.19545v2 Announce Type: replace Abstract: Many core problems in nonlinear systems analysis and control can be recast as solving partial differential equations (PDEs) such as Lyapunov and Hamilton-Jacobi-Bellman (HJB) equations. Physics-informed neural networks (PINNs) have emerged as a promising mesh-free approach for approximating their solutions, but in most existing works there is no rigorous guarantee that a small PDE residual implies a small solution error. This paper develops verifiable error bounds for approximate solutions of Lyapunov and HJB equations, with particular emphasis on PINN-based approximations. For both the Lyapunov and HJB PDEs, we show that a verifiable residual bound yields relative error bounds with respect to the true solutions as well as computable a posteriori estimates in terms of the approximate solutions. For the HJB equation, this also yields certified upper and lower bounds on the optimal value function on compact sublevel sets and quantifies the optimality gap of the induced feedback policy. We further show that one-sided residual bounds already imply that the approximation itself defines a valid Lyapunov or control Lyapunov function. We illustrate the results with numerical examples.

TabPFN Extensions for Interpretable Geotechnical Modelling

Taiga Saito, Yu Otake, Daijiro Mizutani, Stephen Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.21033v2 Announce Type: replace Abstract: Geotechnical site characterisation relies on sparse, heterogeneous borehole data, where uncertainty quantification and interpretability matter as much as predictive accuracy. We evaluate TabPFN~\citep{Hollmann2025}, a tabular foundation model, and its \texttt{tabpfn-extensions} library on two geotechnical tasks: (1) soil-type classification from N-value and shear-wave velocity data as a controlled illustrative case, and (2) iterative imputation of five mechanical parameters ($s_\mathrm{u}$, $E_{\mathrm{u}}$, ${\sigma'}_\mathrm{p}$, $C_\mathrm{c}$, $C_\mathrm{v}$) in BM/AirportSoilProperties/2/2025. Without retraining, we apply cosine-similarity analysis to TabPFN embeddings, visualise predictive distributions, and compute SHAP attributions. On the regression benchmark we compare TabPFN with mean imputation, linear regression, random forests, XGBoost, and HBM; introduce a proxy decomposition of predictive uncertainty across context-perturbation classes; and propagate marginal $C_\mathrm{c}$ and ${\sigma'}_\mathrm{p}$ distributions through a one-dimensional consolidation model to obtain the reliability index $\beta$ and serviceability exceedance probability $P_\mathrm{f}$. Embeddings exhibit label-consistent Clay/Sand grouping; iterative imputation reduces RMSE for all five targets, with TabPFN lowest on four; SHAP attributions are consistent with the Skempton compression-index correlation and the inverse preconsolidation-pressure-water-content dependence; the within-posterior component is largest in the proxy decomposition. We position the contribution as a worked evaluation workflow that may complement established methods for data-scarce geotechnics, not as algorithmic innovation.

Inference Time Policy Optimization for Offline RL with Differentiable World Models

Rohan Deb, Stephen J. Wright, Arindam Banerjee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.22430v2 Announce Type: replace Abstract: Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to *optimize* the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for inference time policy optimization (ITPO). We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines. Inference-time adaptation, however, is expensive: rollout generation and backpropagation dominate per-step compute. We study this tradeoff explicitly, showing that a suitable tilted version of one-step MeanFlow sampler recovers much of the gains at a fraction of the cost.

ParlayMarket: Automated Market Making for Parlay-style Joint Contracts

Ranvir Rana, Viraj Nadkarni, Niusha Moshrefi, Pramod Viswanath — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.22596v2 Announce Type: replace Abstract: Prediction markets are powerful mechanisms for information aggregation, but existing designs are optimized for single-event contracts. In practice, traders frequently express beliefs about joint outcomes - through parlays in sports, conditional forecasts across related events, or scenario bets in financial markets. Current platforms either prohibit such trades or rely on ad hoc mechanisms that ignore correlation structure, resulting in inefficient prices and fragmented liquidity. We introduce ParlayMarket, the first automated market-making design that supports parlay-style joint contracts within a unified liquidity pool while maintaining coherent pricing across base markets and their combinations. Our main result is a convergence characterization of the resulting system. Under repeated trading, the AMM dynamics converge to a unique fixed point corresponding to the best approximation to the true joint distribution within the model class. We show that (i) parameter error remains bounded at stationarity due to a balance between signal and noise in trade-induced updates, and (ii) pricing error and monetary loss scale with this parameter error, implying that aggregate market-maker loss remains controlled and grows at most quadratically in the number of base markets. These results establish explicit limits on the information-retrieval error achievable through the trading interface. Importantly, parlay trades play a structural role in this convergence: by providing direct constraints on joint outcomes, they improve identifiability of dependence structure and reduce steady-state error relative to markets that rely only on marginal trades. Empirically, we show both in controlled simulations and in replay on historical Kalshi parlay data that this design achieves the intended scaling while remaining effective in realistic market settings.

Energy Detection for Cognitive Radio with Distributional Uncertainty and Signal Variety under Nonlinear Expectation Theory

Jialiang Fu, Wen-Xuan Lang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.22659v2 Announce Type: replace Abstract: Classical energy detection (ED) methods for cognitive radio (CR) have addressed noise uncertainty as deviations in noise power and signal uncertainty as variability in signal characteristics, which use probabilistic methods and assume fixed probability distributions for both. In practical scenarios, due to the uncertainty in probability models and the significant variation of primary signals encountered by receivers across different radio technologies, wireless environments exhibit not only distributional uncertainty but also substantial signal variety. In this paper, we develop a generalized formulation of energy detection based on nonlinear expectation theory, where both the signal and noise distributions are uncertain. We utilize the $G$-normal distribution to characterize channel noise. Moreover, to capture practical signal variety, the absolute values of transmitted signal random variables are assumed to lie within a bounded range $[\underline{\sigma}_X,\overline{\sigma}_X]$. The worst-case detection performance is then characterized by a double supremum, meaning over all admissible distributions and all possible signal realizations. We derive estimations for the minimum and the maximum detection error probabilities, and demonstrate the validity of the results through numerical simulations. The proposed model generalizes the classical theoretical analysis of energy detection and offers a potential theoretical foundation for robust detection and information-theoretic analysis under distributional uncertainty.

Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

\"Ozg\"ur Togay, Javier Garcia-Bernardo, Florian Kunneman, Anastasia Giachanou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.23531v2 Announce Type: replace Abstract: Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.23890v2 Announce Type: replace Abstract: As the modern microservice architecture for cloud applications grows in popularity, cloud services are becoming increasingly complex and more vulnerable to misconfiguration and software bugs. Traditional approaches rely on expert input to diagnose and fix microservice anomalies, which lacks scalability in the face of the continuous integration and continuous deployment (CI/CD) paradigm. Microservice rollouts, containing new software installations, have complex interactions with the components of an application. Consequently, this added difficulty in attributing anomalous behavior to any specific installation or rollout results in potentially slower resolution times. To address the gaps in current diagnostic methods, this paper introduces Praxium, a framework for anomaly detection and root cause inference. Praxium aids administrators in evaluating target metric performance in the context of dependency installation information provided by a software discovery tool, PraxiPaaS. Praxium continuously monitors telemetry data to identify anomalies, then conducts root cause analysis via causal impact on recent software installations, in order to provide site reliability engineers (SRE) relevant information about an observed anomaly. In this paper, we demonstrate that Praxium is capable of effective anomaly detection and root cause inference, and we provide an analysis on effective anomaly detection hyperparameter tuning as needed in a practical setting. Across 75 total trials using four synthetic anomalies, anomaly detection consistently performs at >0.97 macro-F1. In addition, we show that causal impact analysis reliably infers the correct root cause of anomalies, even as package installations occur at increasingly shorter intervals.

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.24139v2 Announce Type: replace Abstract: Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a ``Tutor'' agent learns to guide a ``Student'' (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.24472v3 Announce Type: replace Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization - the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming OOD performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-1.7B/8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

On Integrating Resilience and Human Oversight into LLM-Assisted Modeling Workflows for Digital Twins

Lekshmi P, Neha Karanjkar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.25898v3 Announce Type: replace Abstract: LLM-assisted modeling holds the potential to rapidly build executable Digital Twins of complex systems from only coarse descriptions and sensor data. However, resilience to LLM hallucination, human oversight, and real-time model adaptability remain challenging and often mutually conflicting requirements. We present three critical design principles for integrating resilience and oversight into such workflows, derived from insights gained through our work on FactoryFlow - an open-source LLM-assisted framework for building simulation-based Digital Twins of manufacturing systems. First, orthogonalize structural modeling and parameter fitting. Structural descriptions (components, interconnections) are LLM-translated from coarse natural language to an intermediate representation (IR) with human visualization and validation, which is algorithmically converted to the final model. Parameter inference, in contrast, operates continuously on sensor data streams with expert-tunable controls. Second, restrict the model IR to interconnections of parameterized, pre-validated library components rather than monolithic simulation code, enabling interpretability and error-resilience. Third, and most important, is to use a density-preserving IR. When IR descriptions expand dramatically from compact inputs hallucination errors accumulate proportionally. We present the case for Python as a density-preserving IR : loops express regularity compactly, classes capture hierarchy and composition, and the result remains highly readable while exploiting LLMs strong code generation capabilities. A key contribution is detailed characterization of LLM-induced errors across model descriptions of varying detail and complexity, revealing how IR choice critically impacts error rates. These insights provide actionable guidance for building resilient and transparent LLM-assisted simulation automation workflows.

How Open Must Language Models be to Enable Reliable Scientific Inference?

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.26539v2 Announce Type: replace Abstract: How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

Eziyo Ehsani, Luca Giamattei, Ivano Malavolta, Roberto Pietrantuono — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.26603v2 Announce Type: replace Abstract: The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition encounters a harsh reality: the physical constraints of mobile batteries, thermal limits, and, most importantly, memory constraints. To navigate this landscape, we constructed a replicable and reproducible experimental pipeline to profile the complex interplay between energy consumption, latency, and quality of LLMs on mobile devices. We harness this pipeline to conduct an empirical case study on a flagship Android device, capturing granular metrics across eight LLMs ranging from 0.5B to 9B parameters without requiring root access, ensuring our findings reflect realistic user conditions. The findings highlight the trade-offs between generation quality, performance, power and resource consumption, revealing which LLMs offer the best balance across metrics and under different conditions. Besides, we uncovered a counter-intuitive quantization energy paradox: while modern importance-aware quantization successfully reduces memory footprints to fit larger models into RAM, we found it yields negligible energy savings compared to standard mixed-precision methods. This proves that for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further identified that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot of mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.

MeshTailor: Cutting Seams via Generative Mesh Traversal

Xueqi Ma, Xingguang Yan, Congyue Zhang, Hui Huang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.27309v2 Announce Type: replace Abstract: We present MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. We introduce ChainingSeams, a hierarchical serialization of the seam graph that orders chains from global structural cuts down to local details in a coarse-to-fine manner, and a dual-stream encoder that fuses topological and geometric context. Leveraging this hierarchical representation and dual-stream vertex embeddings, our MeshTailor Transformer utilizes an autoregressive pointer layer to trace seams vertex-by-vertex within local neighborhoods. Extensive evaluations show that MeshTailor produces more coherent and structurally regular seam layouts compared to recent optimization-based and learning-based baselines.

AI-Powered Facial Mask Removal Is Not Suitable For Identification

Emily A Cooper, Hany Farid — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.27747v2 Announce Type: replace Abstract: Recently, crowd-sourced online criminal investigations have used generative-AI to enhance low-quality visual evidence. In one high-profile case, social-media users circulated an "AI-unmasked" image of a federal agent involved in a fatal shooting, fueling a wide-spread misidentification. In response to this and similar incidents, we conducted a large-scale analysis evaluating the efficacy and risks of commercial AI-powered facial unmasking, specifically assessing whether the resulting faces can be reliably matched to true identities.

Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Luigi Curini, Alfio Ferrara, Giovanni Pagano, Sergio Picascia — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.28103v2 Announce Type: replace Abstract: Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

Why Aggregate Accuracy is Inadequate for Evaluating Fairness in Law Enforcement Facial Recognition Systems

Khalid Adnan Alsayed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.28675v2 Announce Type: replace Abstract: Facial recognition systems are increasingly deployed in law enforcement and security contexts, where algorithmic decisions can carry significant societal consequences. Despite high reported accuracy, growing evidence demonstrates that such systems often exhibit uneven performance across demographic groups, leading to disproportionate error rates and potential harm. This paper argues that aggregate accuracy is an insufficient metric for evaluating the fairness and reliability of facial recognition systems in high-stakes environments. Through analysis of subgroup-level error distribution, including false positive rate (FPR) and false negative rate (FNR), the paper demonstrates how aggregate performance metrics can obscure critical disparities across demographic groups. Empirical observations show that systems with similar overall accuracy can exhibit substantially different fairness profiles, with subgroup error rates varying significantly despite a single aggregate metric. The paper further examines the operational risks associated with accuracy-centric evaluation practices in law enforcement applications, where misclassification may result in wrongful suspicion or missed identification. It highlights the importance of fairness-aware evaluation approaches and model-agnostic auditing strategies that enable post-deployment assessment of real-world systems. The findings emphasise the need to move beyond accuracy as a primary metric and adopt more comprehensive evaluation frameworks for responsible AI deployment.

The Computer System Trail

Sushant Kumar Gupta — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.28777v2 Announce Type: replace Abstract: No matter how much the world of computing changes, system design remains crucial. While most people try to learn it through quick tutorials or AI-generated summaries, there is no better way to master the field than by studying the original research papers. This book serves as a roadmap through those foundational texts, covering seminal papers in distributed systems, operating systems, and big data. It doesn't just look at what these systems do; it digs deep into why they were built that way. Built from years of notes taken during discussions at top universities and industry meetups, this guide helps readers understand how systems work under the hood. It is for those who are tired of surface-level content and want to develop the technical patience to wrestle with complex problem-solving. Readers will find the journey long and challenging but highly rewarding, as it enables them to elevate their engineering craft to a truly professional level.

IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.29183v3 Announce Type: replace Abstract: Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces $\textbf{IMPACT}$, a novel framework that leverages $\underline{\textbf{i}}$nfluence $\underline{\textbf{m}}$odeling for o$\underline{\textbf{p}}$en-set time series $\underline{\textbf{a}}$nomaly dete$\underline{\textbf{ct}}$ion, to tackle these challenges. The key insight is to $\textbf{i)}$ learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then $\textbf{ii)}$ leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates. Code is available at https://github.com/mala-lab/IMPACT.

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

Khalid Adnan Alsayed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.01449v3 Announce Type: replace Abstract: Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

CB-VER: A Stable Foundation for Modular Control Plane Verification

Dexin Zhang, Timothy Alberdingk Thijm, David Walker, Aarti Gupta — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.03539v2 Announce Type: replace Abstract: Network operators are often interested in verifying \emph{eventually-stable properties} of network control planes: properties of control plane states that hold eventually, and hold forever thereafter, provided the operating environment remains unchanged. Examples include eventually-stable reachability, access control, or path length properties. In this work, we introduce \textsc{CB-Ver}, a new framework for verifying such properties, based on the key idea of a \emph{converges-before graph} (CB-graph for short). When a user provides interfaces for each network component, \textsc{CB-Ver} checks the necessary component-by-component requirements in parallel using an SMT solver. In addition, the tool automatically synthesizes a CB-graph and checks whether it connects all nodes in a network -- if it does, the interfaces are valid and users can check whether additional eventually-stable properties are implied. Moreover, the CB-graph can then be used to determine fault tolerance properties of the network. We formalize our verification algorithm in the Lean theorem proving environment and prove its soundness. We evaluate the performance of \textsc{CB-Ver} on a range of benchmarks that demonstrate its ability to verify expressive properties in reasonable time. Finally, we demonstrate it is possible to automatically generate suitable interfaces by turning the problem around: Given a CB-graph, we use an off-the-shelf Constrained Horn Clause (CHC) solver to synthesize interfaces for every network component that together ensure the given correctness property.

Region of Attraction Estimation for Linear Quadratic Regulator, Linear and Robust Model Predictive Control on a Two-Wheeled Inverted Pendulum

Lorenzo Fici, Dalim Wahby, Alvaro Detailleur, Matthieu Barreau, Guillaume Ducard — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.04455v2 Announce Type: replace Abstract: Nonlinear underactuated systems such as two-wheeled inverted pendulums (TWIPs) exhibit a limited region of attraction (RoA), which defines the set of initial conditions from which the closed-loop system converges to the equilibrium. The RoA of nonlinear and constrained systems is generally nonconvex and analytically intractable, requiring numerical or approximate estimation methods. This work investigates the estimation of the RoA for a TWIP stabilized under three model-based control strategies: saturated linear quadratic regulator (LQR), linear model predictive control (MPC), and constraint tightening MPC (CTMPC). We first derive a Lyapunov-based invariant set that provides a certified inner approximation of the RoA. Since this analytical bound is highly conservative, a Monte Carlo-based estimation procedure is then employed to obtain a more representative approximation of the RoA, capturing how the controllers behave beyond the analytically guaranteed region. The proposed methodology combines analytical guarantees with data-driven estimation, providing both a formally certified inner bound and an empirical characterization of the RoA, offering a practical way to evaluate controller performance without relying solely on conservative analytical bounds or purely empirical simulation.

Signature Placement in Post-Quantum TLS Certificate Hierarchies: An Experimental Study of ML-DSA and SLH-DSA in TLS 1.3 Authentication

Jos\'e Luis Delgado Jim\'enez — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.06100v3 Announce Type: replace Abstract: Post-quantum migration in TLS 1.3 couples signature-algorithm choice with certificate-hierarchy structure, chain exposure during the handshake, and role-dependent cryptographic cost. In certificate-based authentication, the practical effect of a signature family depends on where it appears in the certification hierarchy, how much of that hierarchy is exposed during the handshake, and how the resulting cryptographic cost is distributed across client and server roles. Post-quantum TLS migration must therefore be evaluated as cryptographic design within authenticated key establishment, with algorithm selection assessed in its deployment context. This paper presents a local experimental study of TLS 1.3 authentication strategies implemented with OpenSSL 3 and oqsprovider. Using a reproducible laboratory setting, it compares ML-DSA and SLH-DSA across multiple certificate placements, hierarchy depths, and key-exchange modes, including classical, hybrid, and pure post-quantum configurations. The analysis is organized into four complementary campaigns: a leaf-only comparison, a full hierarchy strategy matrix, a depth comparison, and a key-exchange exploration. Across the experimental matrix, the main discontinuity appears when SLH-DSA is placed in the server leaf certificate. In that configuration, handshake latency and server-side compute cost increase by orders of magnitude, whereas strategies that confine SLH-DSA to upper trust layers and preserve ML-DSA in the interactive leaf remain within a more plausible operational range. The results also show that transport size alone does not explain the heavy regime: outside leaf-SLH scenarios, transferred bytes and observed chain size track latency closely, but once SLH-DSA reaches the leaf, server-side cryptographic cost becomes dominant.

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Roberto Brusnicki, Mattia Piccinini, Johannes Betz — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.06750v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

Diffusion Processes on Implicit Manifolds

Victor Kawasaki-Borruat, Clara Grotehans, Pierre Vandergheynst, Adam Gosztolai — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.07213v2 Announce Type: replace Abstract: High-dimensional data are often assumed to lie on lower-dimensional manifolds. We study how to construct diffusion processes on this data manifold using only point cloud samples and without access to charts, projections, or other geometric primitives. Here, we introduce Implicit Manifold-valued Diffusions (IMDs), a data-driven mathematical formalism for defining stochastic differential equations in the original high-dimensional space that describe drifting Brownian particles evolving intrinsically on the underlying manifold. Our construction hinges on approximating the corresponding infinitesimal generator of the diffusion process using a proximity graph over the data and using the carr\'e-du-champ of the generator, which encodes the local tangent spaces of the manifold and lifts the intrinsic process into ambient coordinates. We show that as the number of samples grows, our discrete diffusion process converges in law on the space of probability paths to its smooth manifold counterpart. We further present an Euler-Maruyama scheme for the numerical integration of IMDs. We validate our framework using numerical experiments on synthetic manifolds and the MNIST data manifold, showing that IMDs remain confined over the manifold and enable its guided exploration. Our work provides the mathematical foundation and practical implementations of diffusion processes on data manifolds, opening new avenues for manifold-aware sampling, exploration, and generative modeling.

Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.09174v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet x Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.10245v2 Announce Type: replace Abstract: Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.10784v2 Announce Type: replace Abstract: Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.11071v3 Announce Type: replace Abstract: We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 3rd place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.11530v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.

Towards Autonomous Mechanistic Reasoning in Virtual Cells

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.11661v3 Announce Type: replace Abstract: Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.

Multi-Axis Additive Manufacturing for Customized Automotive Components

Uzair Aziz Muhammad, Zheng Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.12236v2 Announce Type: replace Abstract: The reproduction of automobile components through additive manufacturing presents significant geometric challenges, as many automotive parts feature complex, organically shaped surfaces that are difficult to fabricate accurately using conventional 3D printing approaches without wasteful support structures. Multi-axis Digital Light Processing (DLP) 3D printing addresses this by orienting a robotic arm to cure resin layers at varying angles and positions, enabling the fabrication of geometries that fixed-axis systems cannot reliably reproduce. However, this flexibility introduces a key challenge: layers printed at non-orthogonal orientations exhibit non-uniform thickness across their cross-section, which traditional DLP systems cannot accommodate without subdividing the layer, increasing total layer count, print time, and the need for supporting structures. This paper introduces a variable exposure method to address this challenge. Rather than splitting a non-uniform layer into multiple uniform ones, our approach divides each layer into sublayers and modulates the UV illumination duration for each sublayer proportionally to its local thickness. This is governed by an established cure-depth equation relating exposure time to material penetration depth, allowing precise control over curing without additional hardware. The result is a meaningful reduction in total layer count for printed objects. Fewer layers directly translates to faster print times and a reduction in wasteful support structures. Our contribution is a practical and low-overhead extension to existing multi-axis DLP pipelines that improves print efficiency without sacrificing geometric accuracy, with clear applications in the rapid prototyping and reproduction of automotive components.

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

Manognya Lokesh Reddy, Zheng Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.12239v2 Announce Type: replace Abstract: Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing optical character recognition text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation on a controlled static dataset reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work.

TIP: Token Importance in On-Policy Distillation

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.14084v3 Announce Type: replace Abstract: On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<$$20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.

When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

Khalid Adnan Alsayed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.15038v2 Announce Type: replace Abstract: The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.

LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

Zeyang Ma, Jinqiu Yang, Tse-Hsun Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.16359v2 Announce Type: replace Abstract: Software systems generate massive, evolving, semi-structured logs that are central to reliability engineering and AIOps, yet difficult to analyze at scale under drift and limited labels. Recent advances in pretrained Transformer models and instruction-tuned large language models (LLMs) have reshaped log analysis by enabling semantic generalization and cross-source evidence integration, but also introducing deployment risks such as context limits, latency and cost, privacy constraints, and hallucinations. This paper presents LLM4Log, a systematic review of LLM-based log analysis across the end-to-end pipeline, from upstream logging-statement generation and maintenance to log parsing/structuring and downstream tasks including anomaly detection, failure prediction, root cause analysis, and log summarization. Following a structured search and manual screening protocol, we completed literature collection in November 2025 and identified 145 unique papers across seven logging tasks. We organize the research area through a unified, task-driven taxonomy, summarize common design patterns (prompting/ICL, retrieval grounding, fine-tuning, tool/agent augmentation, and verification), and analyze evaluation practices, datasets, metrics, and reproducibility. Based on these cross-paper analyses, we summarize key lessons and open challenges for reliable real-world adoption. We emphasize robustness under drift and long-tail events, grounding and faithfulness for operator-facing outputs, and deployment-oriented designs with verifiable behavior.

From Public-Key Linting to Operational Post-Quantum X.509 Assurance for ML-KEM and ML-DSA: Registry-Driven Policy, Mutation-Based Evaluation, and Import Validation

Jos\'e Luis Delgado Jim\'enez — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.17003v2 Announce Type: replace Abstract: Final FIPS and PKIX standards for ML-KEM and ML-DSA settle the normative floor, yet they do not by themselves provide assurance. In practical post-quantum X.509 deployments, failures still emerge at certificate-profile semantics, SubjectPublicKeyInfo representation, and private-key container import, while current PQ public-key linting does not yet provide a reproducible workflow that says which checks belong to the certification authority, which belong to the artifact importer, and how those checks should act under deployment-facing policy. We present an operational post-quantum X.509 assurance framework for ML-KEM and ML-DSA in a narrow executable profile, pkix-core. The framework reifies 17 final-standards requirements into an assurance registry indexed by owner, stage, detector kind, normative strength, and mode-specific action; packages those requirements into three operator gate packs; spans certificate/profile, SPKI/public-key, and private-key-container/import surfaces; and evaluates them through a frozen mutation-based corpus backed by bounded public-appendix and cross-tool supporting evidence. Across a controlled corpus of 48 artifacts, comprising 21 valid and 27 invalid cases, the artifact detects all expected invalid artifacts in both strict and deployable modes with zero false positives. Strict blocks all 17 active requirements; deployable preserves the same underlying detection coverage while downgrading exactly one exercised ML-KEM canonicality condition from block to warning. On the importer-owned private-key surface, all 7 active requirements are covered, with 7/7 expected invalid detections and no open detector gaps. On a comparable certificate subset, a frozen JZLint baseline meets 5/10 expected invalid detections and fatally rejects 3 valid ML-KEM certificates, whereas the local artifact meets 10/10 with no fatal valid rejections.

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.18309v2 Announce Type: replace Abstract: Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness or review quality) over the explicit causal explanation quality. We systematically vary the debugging information to study how distinct context compositions affect the quality of LLM-generated failure explanations. Across 93 context configurations on real bugs and three economically viable models (gpt-5-mini, DeepSeek-V3.2, and Grok-4.1-fast), we evaluate explanations with six criteria and validate the LLM-as-a-judge scores against human ratings in a user study. Our results indicate that explanation quality is causally affected by context composition. Evidence-rich, failure-specific artifacts improve causal and action-oriented quality, whereas overly large contexts tend to yield vague explanations. Higher explanation-score quartiles are associated with higher downstream repair pass rates and, for some models, with fixes that are closer to the reference minimal fixes. In contrast, low-score quartiles can even underperform the no-explanation baseline. Reproduction package is publicly available.

Differentially Private Model Merging

Qichuan Yin, Manzil Zaheer, Tian Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.20985v2 Announce Type: replace Abstract: In machine learning, privacy requirements at inference or deployment time often evolve due to changing policies, regulations, or user preferences. In this work, we aim to construct a magnitude of models to satisfy any target differential privacy (DP) requirement without additional training, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to generate final private models satisfying any target privacy parameter. We provide privacy accounting of these approaches from the lens of R'enyi DP and privacy loss distributions on general problems, as well as on private mean estimation, where we precisely characterize the privacy/utility tradeoffs and compare the two mechanisms. Empirically, we demonstrate the effectiveness of our approaches and validate our analyses on several models and both synthetic and real-world datasets.

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.21060v2 Announce Type: replace Abstract: Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.

Analytical PI Tuning for Second-Order Plants with Monotonic Response and Minimum Settling Time

Senol Gulgonul — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.21294v5 Announce Type: replace Abstract: This study presents two analytical closed-form PI controller tuning solutions for second-order plants with real poles, each achieving monotonic step response and minimum settling time. The first solution employs pole-zero cancellation, placing the controller zero at the slower plant pole and reducing the closed-loop dynamics to a critically damped second-order system. The second solution, applicable when the plant pole ratio is less than two, places all three closed-loop poles at a common location without cancelling any plant pole, yielding a closed-loop transfer function with a triple real pole and a zero. Despite retaining a closed-loop zero, this solution achieves strictly faster settling time than the pole-zero cancellation method in its region of applicability. The two solutions coincide at the boundary pole ratio of two and together form a continuous piecewise-analytical tuning covering the full range of plant pole ratios. This study further establishes that closed-loop transfer functions of the form a^n/(s + a)^n possess a maximum sensitivity Ms together with phase margin and gain margin that are independent of the pole location a and depend solely on the order n, yielding universal robustness constants for each n. A closed-form expression GM(n) = 1 + sec^n(pi/n) is established for the gain margin of the family. Numerical verification confirms the analytical results across multiple plant configurations.

Sound Agentic Science Requires Adversarial Experiments

Dionizije Fa, Marko Culjak — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.22080v2 Announce Type: replace Abstract: LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

Speech Enhancement Based on Drifting Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.24199v3 Announce Type: replace Abstract: We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.24697v2 Announce Type: replace Abstract: Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at approximately 26% success rate. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for corresponding gaps. Our analysis reveals that although the general knowledge application capability still remains as the biggest gap across all models, for frontier models the knowledge gap identification starts to become a major hurdle--indicating the bottleneck is shifting from solving problems right to raising the right problems for current AI. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.24764v2 Announce Type: replace Abstract: Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

Compute Aligned Training: Optimizing for Test Time Inference

Adam Ousherovitch, Ambuj Tewari — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.24957v2 Announce Type: replace Abstract: Scaling test-time compute has emerged as a powerful mechanism for enhancing Large Language Model (LLM) performance. However, standard post-training paradigms, Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), optimize the likelihood of individual samples under a base policy, creating a misalignment with test time procedures that rely on aggregated or filtered outputs. In this work, we propose Compute Aligned Training, which aligns training objectives with test-time strategies. By conceptualizing inference strategies as operators on the base policy, we derive new loss functions that maximize performance when said strategies are applied. We instantiate such loss functions for SFT and RL across common test time strategies. Finally, we provide empirical evidence that this training method substantially improves test time scaling over standard training.

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

Mengya Hu, Qiong Wei, Sandeep Atluri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.26052v3 Announce Type: replace Abstract: Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

Jerry Y. Huang, Justin Lin, Sheel Shah, Kartik Nair, Nicholas M. Boffi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.27147v2 Announce Type: replace Abstract: In generative modeling, we often wish to produce samples that maximize a user-specified reward such as aesthetic quality or alignment with human preferences, a problem known as \textit{guidance}. Despite their widespread use, existing guidance methods either require expensive multi-particle, many-step schemes or rely on poorly understood approximations. We reformulate guidance as a \textit{deterministic optimal control problem}, yielding a hierarchy of algorithms that subsumes existing approaches at the coarsest level. We show that the \textit{flow map}, an object of significant recent interest for its role in fast inference, arises naturally in the optimal solution. Based on this observation, we propose \textbf{Flow Map Reward Guidance (FMRG)}: a training-free, \textit{single-trajectory} framework that uses the flow map to both integrate and guide the flow. At text-to-image scale, FMRG matches or surpasses baselines across inverse problems and reward-guided generation with \textbf{as few as 3 NFEs}, giving at least an order-of-magnitude speedup in comparison to prior state of the art.

Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.27195v2 Announce Type: replace Abstract: Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimers Diseases (AD) is essential for early intervention, however, developing reliable conversion predictive models is difficult to develop due to limited longitudinal data availability We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3 year MCI to AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low data settings. At N=50 training samples, TabPFN maintained strong AUC while the traditional machine learning models struggles at small training samples. These findings demonstrate that foundation models are promising for disease prediction in data limited scenarios, such as Alzheimers diseases.

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.27375v2 Announce Type: replace Abstract: Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.27505v2 Announce Type: replace Abstract: While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a ``cold-start'' to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

Qinchuan Cheng, Jiaqi Liu, Ruixuan Xie, Xiaoya Yuan, Yuxin Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.01486v2 Announce Type: replace Abstract: Legal consultation is inherently iterative: before giving advice, a system must identify relevant legal elements, gather missing facts and authorities, and determine whether the current evidence is sufficient. Existing retrieval-augmented legal agents often use fixed retrieval budgets or single-shot search, making them insensitive to the evolving coverage state of a consultation. This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. The framework maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. On a 50-case synthetic Chinese labor-law consultation pilot with fixed legal-element schemas, a DeepSeek V4-Pro action-selection variant achieves full measured element coverage under the pilot metric while requiring 3.4 retrieval rounds and 7.1 evidence snippets on average. Diagnostic analyses show that model-backed action selection recovers rule-policy failure cases with a small retrieval-budget increase, while forced continuation mainly increases token and latency costs. These results suggest that legal-element coverage is a useful control signal for adaptive legal retrieval, while remaining bounded to retrieval-control behavior under synthetic fixed-schema conditions rather than deployment-level legal correctness.

Statistical Consistency and Generalization of Contrastive Representation Learning

Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.02116v2 Announce Type: replace Abstract: Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is \emph{statistically consistent} with optimal ranking. We further establish a \emph{calibration-style inequality} that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision--language models corroborate our theoretical predictions.

From TinyGo to gc Compiler: Extending Zorya's Concolic Framework to Real-World Go Binaries

Karolina Gorna, Nicolas Iooss, Yannick Seurin, Rida Khatoun, Keith Makan — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.03492v2 Announce Type: replace Abstract: Zorya is a concolic execution framework that lifts compiled binaries to Ghidra's P-Code intermediate representation and uses the Z3 SMT solver to detect vulnerabilities by reasoning over both concrete and symbolic values. Previous versions supported only single-threaded TinyGo binaries. In this paper, we extend Zorya to multi-threaded binaries produced by Go's standard gc compiler. This is achieved by restoring OS thread states from gdb dumps, neutralizing runtime preemption, and introducing overlay path analysis with copy-on-write semantics to detect silent vulnerabilities on untaken branches. We rigorously assess Zorya on 11 real-world vulnerabilities from production Go projects such as Kubernetes, Go-Ethereum, and CoreDNS. Our evaluation shows that Zorya detects seven bugs at the binary level, including a silent integer overflow detects no other evaluated tool finds without a manually written oracle.

HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization

Jorge L. Ruiz Williams — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.03562v3 Announce Type: replace Abstract: KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is score error modulo constant shifts; this yields HeadQ, a key-side method that stores a low-rank residual side code in a calibration-learned query basis and applies it as an additive logit correction. For values, fixed-attention readout gives an $A^2$-weighted token-distortion surrogate. Across six models, Fisher/score-space error predicts attention KL far better than raw key MSE; same-budget counterexamples, null-space interventions, query-PCA controls, and wrong-sign HeadQ falsify storage-MSE alternatives. Matched Pythia checkpoints localize the main anomaly to a small-model low-entropy route-flip boundary. In K-only WikiText-103 decode experiments with dense values, HeadQ removes roughly $84$--$94\%$ of the excess perplexity on the strongest 2-bit rows; in an auxiliary full-KV 2-bit composition, HeadQ plus an $A^2$ value policy improves all six models.

Most ReLU Networks Admit Identifiable Parameters

Moritz Grillo, Guido Mont\'ufar — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.03601v2 Announce Type: replace Abstract: We study the realization map of deep ReLU networks, focusing on when a function determines its parameters up to scaling and permutation. To analyze hidden redundancies beyond these standard symmetries, we introduce a framework based on weighted polyhedral complexes. Our main result shows that for every architecture whose input and hidden layers have width at least two, there exists an open set of identifiable parameters. This implies that the functional dimension of every such architecture is exactly the number of parameters minus the number of hidden neurons. We further show that minimal functional representations can still have non-trivial parameter redundancies. Finally, we establish a generic depth hierarchy, whereby for an open set of parameters the realized function cannot be represented generically by any shallower network.

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Bronislav Sidik, Lior Rokach — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.03675v2 Announce Type: replace Abstract: Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory architecture for the OpenClaw agent runtime that introduces a structured episodic JSONL store, a five-signal weighted retrieval engine, an attention-attributed cognitive weight update loop, an asynchronous consolidation daemon promoting episodic facts to a semantic tier, and a PPO-based policy framework for adapting retrieval weights (infrastructure validated; performance gains pending camera-ready). On the full 500-question LongMemEval-S benchmark (Wu et al., 2025), MEMTIER achieves Acc=0.382, F1=0.412 with Qwen2.5-7B on a consumer 6GB GPU - a +33 percentage point improvement over the full-context baseline (0.050 -> 0.382, i.e., 5% -> 38%). With DeepSeek-V4-Flash fact pre-population, single-session recall reaches 0.686-0.714, exceeding the paper's RAG BM25 GPT-4o baseline (0.560) on those categories. Temporal reasoning rises to 0.323 and multi-session synthesis to 0.173, demonstrating that structured semantic pre-population qualitatively changes what lightweight retrieval can achieve. All phases run locally on a consumer laptop with a 6GB GPU.

Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.03690v2 Announce Type: replace Abstract: We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$~score~of~0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2$=0.377) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.04128v2 Announce Type: replace Abstract: We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

Inverse Quadratic Decay in Random Subset Sum

Edwin Chen, Christof Teuscher — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.04465v2 Announce Type: replace Abstract: The Subset Sum Problem is a fundamental NP-complete problem in cryptography and combinatorial optimization, with many real-world applications. The Random Subset Sum Problem (RSSP) is a more applicable version of subset sum, where numbers are drawn from some i.i.d input distribution. We present an algorithm that, with probability $1-\delta$, constructs the same $O(B/w)$ mesh as Da Cunha et al. (2023), while trimming to $w$ elements throughout and running in $O(w\log w)$ time. Then, we present a novel beam search heuristic running in linearithmic time w.r.t list size $n$ and beam width $w$ using the mesh that gives an expected error of $O\!\left(\frac{B}{nw^2}\right)$ under a standard mean-field assumption with equal standard deviation, demonstrating the practical effectiveness of meshing to achieve error decay. The algorithm is empirically robust to multiple input distributions and can naturally extend to variants with simple changes to the scoring heuristic, establishing a new practical baseline for robust subset sum error decay and $\epsilon$-approximation theory.

Transformed Latent Variable Multi-Output Gaussian Processes

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.05133v2 Announce Type: replace Abstract: Multi-Output Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions, such as employing low-rank or sum-of-separable kernels, which can limit expressiveness. We propose the Transformed Latent Variable MOGP (T-LVMOGP), a novel framework that scales MOGPs to a massive number of outputs while preserving the capacity to capture meaningful inter-output dependencies. T-LVMOGP constructs a flexible multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularised neural network. Combined with stochastic variational inference, our model effectively scales to high-dimensional output settings. Across diverse benchmarks, including climate modelling with over 10,000 outputs and zero-inflated spatial transcriptomics data, T-LVMOGP outperforms baselines in both predictive accuracy and computational efficiency.

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.05253v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) has become the standard approach for grounding large language models in information that was not available during training. While existing datasets and benchmarks focus on web or other public sources, there is still no widely adopted dataset that realistically reflects the nature of company-internal knowledge. Meanwhile, startups, enterprises, and researchers are increasingly developing AI Agents designed to operate over exactly this kind of proprietary data. To close this gap, we release a synthetic enterprise corpus, its generation framework, and a leaderboard. We present EnterpriseRAG-Bench, a dataset consisting of approximately 500,000 documents spanning nine enterprise source types (Slack, Gmail, Linear, Google Drive, HubSpot, Fireflies, GitHub, Jira, and Confluence) and 500 questions across ten categories that test distinct retrieval and reasoning capabilities. The corpus is generated with cross-document coherence (grounded in shared projects, people, and initiatives) and augmented with realistic noise such as misfiled documents, near-duplicates, and conflicting information. The question set ranges from simple single-document lookups to multi-document reasoning, constrained retrieval, conflict resolution, and recognizing when information is absent. The generation framework lets teams generate variants tailored to their own industry, scale, and source mix. The dataset, code, evaluation harness, and leaderboard are available at https://github.com/onyx-dot-app/EnterpriseRAG-Bench.

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

James Walsh, William Fawcett, Grace Colverd, Ra\'ul Ramos-Poll\'an — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.05405v3 Announce Type: replace Abstract: Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6\% accuracy within 50\,km, with the strongest performance on floods (50\% within 50\,km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called \ECHO{}, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.05863v2 Announce Type: replace Abstract: Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.06139v2 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.

Consistent Geometric Deep Learning via Hilbert Bundles and Cellular Sheaves

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.06395v2 Announce Type: replace Abstract: Modern deep learning architectures increasingly contend with sophisticated signals that are natively infinite-dimensional, such as time series, probability distributions, or operators, and are defined over irregular domains. Yet, a unified learning theory for these settings has been lacking. To start addressing this gap, we introduce a novel convolutional learning framework for possibly infinite-dimensional signals supported on a manifold. Namely, we use the connection Laplacian associated with a Hilbert bundle as a convolutional operator, and we derive filters and neural networks, dubbed as \textit{HilbNets}. We make HilbNets and, more generally, the convolution operation, implementable via a two-stage sampling procedure. First, we show that sampling the manifold induces a Hilbert Cellular Sheaf, a generalized graph structure with Hilbert feature spaces and edge-wise coupling rules, and we prove that its sheaf Laplacian converges in probability to the underlying connection Laplacian as the sampling density increases. Notably, this result is a generalization to the infinite-dimensional bundle setting of the Belkin \& Niyogi \cite{BELKIN20081289} convergence result for the graph Laplacian to the manifold Laplacian, a theoretical cornerstone of geometric learning methods. Second, we discretize the signals and prove that the discretized (implementable) HilbNets converge to the underlying continuous architectures and are transferable across different samplings of the same bundle, providing consistency for learning. Finally, we validate our framework on synthetic and real-world tasks. Overall, our results broaden the scope of geometric learning as a whole by lifting classical Laplacian-based frameworks to settings where the signal at each point lives in its own Hilbert space.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.06534v2 Announce Type: replace Abstract: Agentic reinforcement learning (RL) is reshaping LLM post-training, but end-to-end training time is dominated by compute-intensive, multi-turn rollouts whose resource demand varies significantly across training steps. Resource-fixed systems cannot adapt to this variation, while resource-elastic approaches that provision external GPUs on demand suffer from high allocation overhead and limited availability. We observe that serving clusters leave substantial GPU compute and memory idle, and propose cooperative elasticity: sharing already-deployed serving GPUs with rollout workloads to provide on-demand elastic capacity. Realizing this is non-trivial, as it must preserve serving SLOs under bursty traffic while minimizing cross-cluster communication overhead. We present ROSE, a system that realizes cooperative elasticity for agentic RL post-training, comprising three components: (1) an SLO-safe co-serving executor that co-locates heterogeneous serving and rollout models on the same GPUs, dynamically sharing memory and compute while preserving serving SLOs; (2) a cross-cluster weight transfer engine that leverages shard-aware routing and weight sparsity for fast synchronization; and (3) an elastic rollout scheduler that dynamically routes rollouts across dedicated and opportunistic serving GPUs. Experiments across multiple model sizes and cluster scales show that ROSE improves end-to-end throughput by 1.3 - 3.3 x over resource-fixed baselines and reduces rollout time by 1.2 - 1.5 x over resource-elastic baselines, with no serving SLO violations.

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07021v2 Announce Type: replace Abstract: Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations results in failure, Behavior Cues allows for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code: https://github.com/christopherzc/behavior-cues

InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization

Hongyang Zhang, Maonnan Wang, Ziyao Wang, Hongrui Yin, Man-On Pun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07099v3 Announce Type: replace Abstract: Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. Existing approaches often rely on global feature alignment, but they suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.

RepoZero: Can LLMs Generate a Code Repository from Scratch?

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07122v3 Announce Type: replace Abstract: Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of verifiable and scalable evaluation: existing benchmarks either focus on patch-based editing or rely on human or LLM-based judgments, which introduce bias and limit reproducibility. In this work, we present RepoZero, the first benchmark that enables fully automated, execution-based verification of repository-level generation from scratch. Our key idea is to reformulate generation as repository reproduction: given only API specifications, an agent must re-implement an entire repository such that its behavior matches the original implementation. This design allows for strict black-box validation via output equivalence, while naturally supporting large-scale construction by reusing existing open-source repositories. To further mitigate data leakage and shortcut solutions, we introduce cross-language constraints and a sandboxed evaluation protocol. Building on this benchmark, we propose an Agentic Code-Test Evolution (ACE) framework that performs iterative test generation and error-driven refinement, enabling effective test-time scaling for repository-level synthesis. Extensive experiments across multiple state-of-the-art LLMs and agent frameworks reveal that even the strongest LLM agents achieve only limited pass rates (30\% - 55\%), exposing a substantial gap between current capabilities and real-world software development requirements. Our results establish RepoZero as a challenging, scalable, and reliable testbed for end-to-end code generation, and highlight self-verification via test generation as a critical direction for advancing LLM-based coding agents.

How to Utilize Failure Demo Data?: Effective Data Selection for Imitation Learning Using Distribution Differences in Attention Mechanism

Kana Miyamoto, Kanata Suzuki, Tetsuya Ogata — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07560v2 Announce Type: replace Abstract: Imitation learning for robotic tasks has relied primarily on policies trained only on successful demonstrations, although failures are unavoidable during human data collection. Many existing approaches for exploiting failure data require additional data processing or iterative policy updates through autonomous rollouts, making it difficult to directly and stably utilize failure data accumulated during data collection. In this work, we propose a method that learns latent representations of success-failure discrepancies and incorporates them into the attention mechanism. During inference, an appropriate latent mode is selected from the initial observation to improve action stability. Furthermore, we introduce a post-training metric that quantifies the attention discrepancy between each failure sample and successful demonstrations to select failure data. Simulation results show that the proposed method improves task success rates when trained with failure data and that the proposed metric identifies failure samples that are beneficial for learning when combined with successful demonstrations. These results suggest that the proposed method can support more efficient use of collected demonstrations in robotic data collection pipelines.

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07731v2 Announce Type: replace Abstract: This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well or better on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower on BFCL and some ARC and ITALIC settings. Finally it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and the RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings find the new model to be a step forward for native Italian Large Language Models.

ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07816v2 Announce Type: replace Abstract: This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 44 known and 22 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.

Adaptive Regularization for Sparsity Control in Bregman-Based Optimizers

Ahmad Aloradi, Tim Roith, Emanu\"el A. P. Habets, Daniel Tenbrinck — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07892v3 Announce Type: replace Abstract: Sparse training reduces the memory and computational costs of deep neural networks. However, sparse optimization methods, e.g., those adding an $\ell_1$ penalty, often control sparsity only indirectly through a regularization parameter $\lambda$, whose mapping to the final sparsity rate is non-trivial. In our experiments, we found this parameter sensitivity to be particularly pronounced for Bregman-based optimizers. Specifically, the two variants LinBreg and AdaBreg reach the same sparsity at $\lambda$ values that differ by up to two orders of magnitude, requiring expensive trial-and-error sweeps to achieve a user-specified sparsity. To address this, we propose an adaptive regularization scheme that updates $\lambda$ based on the difference between the model's current sparsity and the target sparsity. We analyze the resulting algorithm and evaluate it on automatic speaker verification with ECAPA-TDNN and ResNet34 on VoxCeleb and CNCeleb. The proposed method reliably achieves sparsity targets ranging between 75% and 99%. It also converges faster than the oracle-tuned non-adaptive baseline during early training and matches or surpasses its final performance in equal error rate. We further show that the adaptive scheme inherits key properties from its non-adaptive counterpart, including improved out-of-distribution robustness over the dense baselines.

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.07926v2 Announce Type: replace Abstract: As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.

Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

Dylan Forde — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08123v2 Announce Type: replace Abstract: We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported $R=2$ TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$ on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the single-active-dustbin path used in our TPU runs; this bridge is algebraic and does not claim a general KL-unbalanced or arbitrary-capacity gap model. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$--$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly $8.5$ examples per second through a three-hour budget, reaching step $1437$. Held-out Pfam test shards improve reconstruction from $5.57$ to $2.05$ and sparse CE from $5.53$ to $5.30$ relative to step $0$, with CE logged diagnostically rather than optimized directly; target-barycenter alignment metrics do not materially improve, and a deterministic diagonal reference remains stronger on those metrics.

Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

Konstantin Riedl, Konstantinos Spiliopoulos, Justin Sirignano — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08352v2 Announce Type: replace Abstract: A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a ``Newton neural tangent kernel'' (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.

Binge, Bot, Repeat: Unpacking the Ecosystem of Video Piracy on Telegram

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08418v2 Announce Type: replace Abstract: Telegram has emerged as a major platform for large-scale video piracy, where copyrighted content is rapidly distributed among users. Despite its prominence, the structural and operational dynamics of this ecosystem remain insufficiently understood. To address this gap, we present the first large-scale study of video piracy on Telegram through a mixed-method analysis of 1,057 channels that shared 209k unique posts between December 2023 and January 2026 - systematically characterizing their content, distribution strategies, and how the ecosystem is sustained at scale. Central to our approach is the development of a fine-grained taxonomy that enables a structured understanding of the activity and intent of these channels on a per-post level. The channels collectively distributed 19,033 unique copyrighted titles originating from 175 countries, accumulating over 4.85B unique views and resulting in a lower-bound estimated financial loss of $17.49B for content rights holders. We also find that this ecosystem is deliberately engineered to be resilient against takedown efforts, frequently redirecting users through chains of intermediary channels and automated bots that collectively handle hosting, access control, monetization, and channel discovery. The scale and persistence of this ecosystem motivated the development of Anti-RIP, a real-time framework for detecting emerging video piracy communities on Telegram. Anti-RIP utilizes our taxonomy to generate contextual, interpretable insights that stakeholders confirmed improve the triaging action against reported posts and channels. Over a 61-day period, the framework facilitated the takedown of 524 previously unknown piracy channels and 71 bots. To support reproducibility and future research, we open-source both the dataset and the Anti-RIP framework.

Single-Thread JPEG Decoder Benchmarks Mis-Evaluate ML Data Loaders

Vladimir Iglovikov, Dmitry Kosarevsky — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08731v2 Announce Type: replace Abstract: JPEG decode is routine ML infrastructure, but Python decoder choices are often justified by single-process, single-thread microbenchmarks. We audit this evaluation assumption with thirteen Python-accessible JPEG decode paths on five matched 16 vCPU Google Cloud CPUs: Intel Emerald Rapids, AMD Zen 4, AMD Zen 5, ARM Neoverse V2, and ARM Neoverse N1. ImageNet validation is the workload, not a new dataset contribution: each run decodes the full 50,000-image split from memory and reports single-thread throughput for all decoders, PyTorch \texttt{DataLoader} throughput for eligible decoders at worker counts $\{0,2,4,8\}$, and decoder skip behavior. The evaluation protocol changes the supported conclusion. On Neoverse V2, \texttt{imageio} is ninth in single-thread throughput yet lands in the top DataLoader tier with \texttt{torchvision}; on Zen 4, \texttt{torchvision} rises from seventh single-thread to the top measured DataLoader tier; on Neoverse N1, \texttt{imagecodecs} is the single-thread leader but fifth at peak DataLoader throughput. We also find that worker-count conclusions differ between Zen 4 and Zen 5, TensorFlow has a large single-thread ARM penalty, and strict native JPEG decoders/wrappers reject the same rare ImageNet JPEG. For PyTorch DataLoader workloads, \texttt{torchvision} and \texttt{simplejpeg} form the strongest measured zero-skip tier: \texttt{torchvision} has the highest mean normalized throughput, while \texttt{simplejpeg} has the highest minimum. OpenCV remains a robust general-purpose fallback above 90\% of the platform-local winner on every tested CPU. We release raw JSON, generated tables/figures, and an executable local/cloud benchmark framework.

ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08858v2 Announce Type: replace Abstract: Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG

Diffusion Restore: Real-Time Markov Chain Monte Carlo Light Transport

Sascha Holl, Gurprit Singh, Hans-Peter Seidel — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.08916v3 Announce Type: replace Abstract: We present Diffusion Restore, a real-time framework for diffusion-based MCMC light transport. MCMC methods are highly suitable for sampling from complex high-dimensional distributions and for approximating integrals over them. In practice, they are often the only viable solution when direct sampling is not possible and alternative methods are either inefficient or cannot be applied due to the structure of the target distribution. However, controlling the exploration of the target distribution in MCMC methods remains challenging. Efficient exploration requires a balance between local exploration and global discovery, and local dynamics must rapidly explore individual modes without getting stuck or exhibiting excessive backtracking. The problem of global discovery has recently been addressed by the introduction of the Restore framework. In this work, we build on this framework and focus on improving local exploration. We show how to choose diffusion-based local dynamics within the Restore framework while completely avoiding Metropolis-adjustment, which is known to slow down convergence. Furthermore, we model these dynamics as nonreversible, introducing momentum in the drift and thereby enabling more directed exploration of the target distribution compared to reversible, random-walk-like dynamics. We provide a theoretical justification for the validity of our choice of local dynamics. Empirically, we demonstrate across diverse scenes that Diffusion Restore outperforms all existing MCMC light transport methods and establishes a new state of the art. In addition, we present a GPU implementation in ray tracing and compute shaders and achieve real-time frame rates. This demonstrates that Diffusion Restore is not only superior in offline rendering, but also outperforms traditional Path Tracing methods in real-time rendering settings, such as interactive applications and games.

The Impossibility of Simultaneous Time and I/O Optimality for The Planar Maxima and Convex Hull Problems

Peyman Afshani, Gerth St{\o}lting Brodal, Nodari Sitchinava — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.09464v2 Announce Type: replace Abstract: We prove that no deterministic output-sensitive algorithm for the planar convex hull and maxima problems can obtain both optimal time and I/O complexity, where the optimality is defined with respect to both the input and output sizes. This explains why the best previous algorithms achieved an optimal I/O bound at the cost of sub-optimal running time (Goodrich et al. [FOCS, 1993]). To the best of our knowledge, the impossibility of simultaneous optimality was only shown previously for the permutation problem by Brodal and Fagerberg [STOC, 2003]. Our results imply that no optimal deterministic output-sensitive cache-oblivious algorithm exists for either problem. In addition, we present simple deterministic algorithms that match our lower bounds and that provide a trade-off between time and I/Os. On the other hand, a simple modification of our deterministic algorithm results in a randomized algorithm that simultaneously achieves optimal (worst-case) time and optimal expected I/O bounds.

APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation

Tianyu Zheng, Hong Wu, Jiaji Zhong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.09492v2 Announce Type: replace Abstract: Large language models (LLMs) often suffer from hallucinations due to error accumulation in autoregressive decoding, where suboptimal early token choices misguide subsequent generation. Although multi-path decoding can improve robustness by exploring alternative trajectories, existing methods lack principled strategies for determining when to branch and how to regulate inter-path interactions. We propose Adaptive Path-Contrastive Decoding (APCD), a multi-path decoding framework that improves output reliability through adaptive exploration and controlled path interaction. APCD consists of two components: (1) Entropy-Driven Path Expansion, which delays branching until predictive uncertainty - measured by Shannon entropy over top candidate tokens - indicates multiple plausible continuations; and (2) Divergence-Aware Path Contrast, which encourages diverse reasoning trajectories while dynamically attenuating inter-path influence as prediction distributions diverge. Experiments on eight benchmarks demonstrate improved factual accuracy while maintaining decoding efficiency. Our code is available at https://github.com/zty-king/APCD.

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei, Jingmin Chen, Lei Sun — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.09586v2 Announce Type: replace Abstract: World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis. Project page: https://can-lee.github.io/deformmaster-web/

MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.09620v2 Announce Type: replace Abstract: Recent developments in 3D generative AI enable users to create bespoke 3D models from text or image prompts. However, these approaches provide limited control over spatial structure, making them ill suited for tasks requiring precise geometric composition. We present MiXR, an XR system for in-situ compositional modeling that enables users to create new 3D models by harvesting geometry from their environment. Users extract segments from captured objects and assemble new artifacts through direct 3D manipulation, while generative AI synthesizes a coherent model from the user-defined composition. This hybrid workflow allows users to define spatial structure explicitly while delegating geometric refinement to generative models, enabling them to specify spatial intent that is difficult to express through verbal prompts alone. In a controlled user study ($N=12$), participants using MiXR rated their designs as significantly closer to the target, felt more in control, and experienced lower cognitive workload compared to a generative composition baseline.

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.09860v3 Announce Type: replace Abstract: Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as \emph{commitment depth}: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision--language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25\% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision--language model achieves 0\% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

Inhyuk Park, Doohyun Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10165v2 Announce Type: replace Abstract: Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10181v2 Announce Type: replace Abstract: Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.

Separation Logic for Verifying Physical Collisions of CNC Programs

Yeonseok Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10437v2 Announce Type: replace Abstract: Safety verification in Computer Numerical Control (CNC) machining has traditionally relied on simulation-based methods that require repetitive tests when requirements change. This paper introduces a formal verification framework that conceptualizes the physical CNC workspace as a Spatial Heap, treating physical occupancy as a managed logical resource. Central to our approach is a Parser-Prover Handshake that decouples machine kinematics from formal logic. By mapping tool trajectories and safety buffers into a discrete spatial model prior to evaluation, the framework enables the use of Separation Logic (SL) to verify safety via formal triples. Within this model, physical collisions are redefined as logical Spatial Data Races, detected through the failure of the separating conjunction to establish disjointness. Furthermore, we extend the methodology to collaborative environments using Concurrent Separation Logic (CSL), where physical hand-offs are verified as formal ownership transfers. Additionally, the framework scales to multi-axis kinematics (e.g., 5-axis Table-Table configurations) by treating the workpiece as a dynamically mutable spatial resource. By serving as a complement to traditional geometric simulation, this approach reduces the number of required iterative test cycles, offering a foundation for autonomous, less-collision manufacturing.

Correct-by-Construction G-Code Generation: A Neuro-Symbolic Approach via Separation Logic

Yeonseok Lee — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10568v3 Announce Type: replace Abstract: This paper proposes a neuro-symbolic framework for G-code generation that seeks to integrate the neural generative capabilities of the GLLM method (Abdelaal et al., 2025) with formal verification via a Separation Logic (SL) prover. To establish a reliable physical baseline, the framework extracts deterministic boundary representations from 3D CAD models (STEP files) using the OpenCASCADE framework. This extracted geometric data supports a two-component architecture: the LLM serves as an initial code generator, while the SL Prover, utilizing a Spatial Heap model, evaluates the output. By conceptualizing physical collisions as logical Spatial Data Races -- violations of the separating conjunction in SL -- our framework translates proof failures into structured mathematical feedback. These failures are condensed into bounding boxes that serve as directives for the LLM's iterative self-correction. Ultimately, this work aims to develop a self-correcting system that reduces the need for human supervision, leading to safer and verified autonomous manufacturing.

LLM Jaggedness Unlocks Scientific Creativity

Shray Mathur, J. Anibal Boscoboinik, Esther H. R. Tsai, Kevin G. Yager — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10574v2 Announce Type: replace Abstract: As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.

Segment Anything with Robust Uncertainty-Accuracy Correlation

Hongyou Zhou, Marc Toussaint, Ling Shao, Zihan Ye — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10603v3 Announce Type: replace Abstract: Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://hongyouzhou.github.io/ruac/.

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10787v2 Announce Type: replace Abstract: Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce $\textbf{ComplexMCP}$, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), $\textbf{ComplexMCP}$ provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) $\textbf{tool retrieval saturation}$ as action spaces scale; (2) $\textbf{over-confidence}$, where agents skip essential environment verifications; and (3) $\textbf{strategic defeatism}$, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning $\textbf{ComplexMCP}$ as a critical testbed for the next generation of resilient autonomous systems.

LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

Johann Knechtel, Ozgur Sinanoglu, Ramesh Karri — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10807v3 Announce Type: replace Abstract: The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high-level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in-depth analysis of the state-of-the-art in LLM-driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs -- from reasoning-driven synthesis and multi-agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red-teaming for robust security assessment. Finally, we synthesize cross-cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.

Predicting 3D structure by latent posterior sampling

Azmi Haider, Dan Rosenbaum — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10830v3 Announce Type: replace Abstract: The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Zhiyuan Liu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.10933v3 Announce Type: replace Abstract: While Mixture-of-Experts (MoE) scales model capacity without proportionally increasing computation, its massive total parameter footprint creates significant storage and memory-access bottlenecks, which hinder efficient end-side deployment that simultaneously requires high performance, low computational cost, and small storage overhead. To achieve these properties, we present DECO, a sparse MoE architecture designed to match the performance of dense Transformers under identical total parameter budgets and training tokens. DECO utilizes the differentiable and flexible ReLU-based routing enhanced by learnable expert-wise scaling, which adaptively balances the contributions of routed and shared experts. Furthermore, we introduce NormSiLU, an activation function that normalizes inputs prior to SiLU operators, producing a more stable trend of routed-expert activation ratio and a higher intrinsic sparsity level. We also identify an empirical advantage in using non-gated MLP experts with ReLU-based routing, indicating the possibility of MoE architecture simplification. Experiments demonstrate that DECO, activating only 20% of routed experts, matches dense performance and outperforms established MoE baselines. Our specialized acceleration kernel delivers a 2.93$\times$ speedup on Jetson AGX Orin compared with dense inference. Code and checkpoints are available at https://github.com/thunlp/DECO.

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

Andrew Choi, Wei Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.11151v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.11302v2 Announce Type: replace Abstract: We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors ``simpler'' or ``more plausible'' outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators -- the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.

HySecTwin: A Knowledge-Driven Digital Twin Framework Augmented with Hybrid Reasoning for Cyber-Physical Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.11682v2 Announce Type: replace Abstract: Existing Digital Twin (DT) approaches often lack semantic reasoning capabilities for effective cybersecurity modelling in Cyber-Physical Systems (CPS). This paper presents HySecTwin, a knowledge-driven digital twin architecture that places automated reasoning at the core of real-time threat detection. HySecTwin incorporates semantic modelling to transform heterogeneous CPS telemetry, device attributes, and operational relationships into machine-interpretable representations, combined with an embedded reasoning engine operating over contextualized system states. Unlike opaque detection methods, the framework integrates deterministic rule-based inference with hybrid fuzzy reasoning to generate explicit, interpretable, and auditable security assessments from live device telemetry. This enables context-aware monitoring of complex CPS environments while preserving transparency and trust. Experimental evaluation using a representative CPS testbed and MITRE ATT\&CK campaign-inspired attack scenarios demonstrates sub-millisecond twin synchronization latency and up to 21.5\% faster threat detection compared with deterministic reasoning alone. The results show that semantic modelling, semantic enrichment, and hybrid reasoning improve explainability and resilience without extra system overhead. HySecTwin provides a lightweight, containerized, and extensible framework for secure-by-design digital twin deployments in mission-critical infrastructures

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Yiming Ren, Xuenan Xu, Ziyang Zhang, Wen Wu, Baoxiang Li, Chao Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.11866v2 Announce Type: replace Abstract: Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.

ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting

Cao Yuan, Junjun Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12196v2 Announce Type: replace Abstract: Accurate ultra-short-term wind power forecasting is critical for grid dispatch and reserve management, yet remains challenging due to the non-stationary, condition-dependent nature of wind generation. Meteorological exogenous variables carry substantial predictive information, but the most informative variable combination varies across sites, operating conditions, and prediction horizons. Existing deep learning approaches either treat exogenous inputs as generic auxiliary channels through uniform mixing or soft gating, or rely on fixed preprocessing steps such as PCA, without exploiting the physical structure of meteorological variables. We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework that decomposes exogenous variable modeling into two complementary modules. Physically-Grounded Variable Selection (PGVS) performs hierarchical, group-aware sparse selection over exogenous variables using a domain-informed physical prior and sparsemax activations, producing a compact, condition-adaptive exogenous context. Exogenous-Conditioned Regime Refinement (ECRR) routes the forecast through learned regime experts that apply gain--bias calibration and horizon-specific corrections via a mixture-of-experts paradigm. Experiments on three wind farms spanning different climates, capacities (66--200 MW), and exogenous dimensions (11--13 variables) demonstrate that ECTO achieves the lowest MSE across all sites, with relative improvements over the strongest baseline ranging from 2.2% to 5.2%, widening to 8.6% at the longer prediction horizon ($H=32$). Ablation analysis confirms that each exogenous-related component contributes positively (PGVS +1.84%, ECRR +2.86%), and interpretability analysis reveals that PGVS learns physically meaningful, site-specific variable selection patterns, while ECRR converges to well-separated calibration strategies consistent across sites.

LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12321v2 Announce Type: replace Abstract: Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LIDSA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.

Reinforcing VLAs in Task-Agnostic World Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12334v2 Announce Type: replace Abstract: Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12374v3 Announce Type: replace Abstract: Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose \textbf{GAP}, a \textbf{G}ranular \textbf{A}lignment \textbf{P}aradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12483v4 Announce Type: replace Abstract: In settings where labeled verifiable training data is the binding constraint, each checked example should be allocated to the model and reward density where it is most informative. We identify a reward-density principle that governs this allocation: sparse sequence-level reward is most useful on models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller deployment model. The principle yields a simple allocation rule: use scarce labeled data upstream on the strongest available teacher, then transfer the reward-shaped behavior downstream as dense supervision. We evaluate this rule through a four-stage workflow -- teacher RL, forward-KL warmup, on-policy distillation, optional post-bridge student RL -- on verifiable math with Qwen3 and Llama models. At fixed Qwen3-1.7B deployment-student size, an RL-improved 8B teacher distilled through the dense bridge outperforms direct GRPO on the same student ($79.3\%$ vs.\ $75.9\%$ on MATH; $25.2\%$ vs.\ $19.8\%$ on AIME~2024, avg@16), while transfer from the same teacher \emph{before} RL underperforms. A component ablation confirms that each stage is load-bearing: replacing the RL-improved teacher with a raw teacher costs $7.8$ MATH points, removing the forward-KL warmup costs $1.7$, and removing on-policy distillation costs $3.3$. The teacher-quality ordering -- raw-teacher transfer $<$ direct GRPO $<$ RL-teacher transfer -- replicates on Llama-3.1-8B-Instruct with a Llama-3.3-70B-Instruct teacher. The operational lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the bridge.

WriteSAE: Sparse Autoencoders for Recurrent State

Jack Young — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12770v4 Announce Type: replace Abstract: We introduce WriteSAE, a sparse autoencoder for the matrix updates written into recurrent language-model state. In Gated DeltaNet, Mamba-2, and RWKV-7, each token writes a matrix-shaped update to a recurrent cache; a residual-stream SAE has vector-shaped atoms and cannot replace that update directly. WriteSAE learns rank-1 matrix atoms with the same shape as the model's own write. This lets us test a direct replacement: at positions where the SAE activates an atom, we remove the model's write, insert the atom scaled by its SAE activation, and continue the forward pass. The atom gives a closer final token distribution than deleting the write on 92.4% of evaluated positions; averaged per atom, the rate is 89.8%. For Gated DeltaNet, a formula using the forget gate, read query, and output embedding predicts the resulting logit change with $R^2 = 0.98$. The same replacement test transfers to Mamba-2-370M at 88.1%. In generation, the formula chooses a write direction; writing it into three consecutive cache positions at $3\times$ the norm of the model's write makes tokens initially ranked 100--1000 by the unmodified model appear in 100% of continuations, up from 33.3%. To our knowledge this is the first cache-level steering intervention reported in a state-space or hybrid recurrent layer.

DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12960v2 Announce Type: replace Abstract: Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset

Ivica Kostric, Krisztian Balog — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.13053v2 Announce Type: replace Abstract: Recent years have seen a surge of research into conversational recommender systems (CRS). Among existing datasets, ReDial is the most widely used benchmark, cited in hundreds of studies. However, variations in how the dataset is preprocessed and used in experiments, particularly in the definition of ground-truth items, make it difficult to compare results across studies. These comparisons are further complicated by confounding factors such as the choice of the underlying large language model (LLM) and the use of external data sources. In this work, we revisit seven prominent CRS methods across three architectural families and evaluate them under standardized conditions. Our reproducibility study reveals a ``granularity gap,'' where fine-grained ranking (Recall@1) is highly sensitive to implementation details, while our replicability analysis shows that nearly 50% of reported accuracy stems from ``repetition shortcuts'' that are absent in novelty-focused evaluation. Furthermore, we find that performance gains are often driven more by the capacity of the LLM backbone than by specific architectural innovations. Finally, by applying user-centric utility metrics, we demonstrate that traditional recall frequently overstates a system's actual conversational effectiveness. This work establishes a transparent, controlled baseline and promotes evaluation practices that prioritize novelty and interaction efficiency.

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.13081v2 Announce Type: replace Abstract: Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.

Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

Jannis L\"ubsen, Annika Eichler — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.13302v2 Announce Type: replace Abstract: This paper extends safety guarantees for multi-task Bayesian optimization with uncertain co-regionalization matrices from intrinsic co-regionalization models to linear models of co-regionalization. The latter allows for more flexible modeling of the inter-task correlations by composing multiple features. We derive uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear model of co-regionalization kernel. Furthermore, we show the potential performance gains of linear models of co-regionalization in a numerical comparison on a safe multi-task Bayesian optimization benchmark.

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.13475v2 Announce Type: replace Abstract: Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype-level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at \href{https://github.com/mala-lab/FedHPro}{https://github.com/mala-lab/FedHPro}.

Learning Equilibria in Coordination Games via Minorization-Maximization

Ashok Krishnan K. S., Helene Le Cadre, Ana Busic — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.13644v2 Announce Type: replace Abstract: This paper considers games where the utilities for agents are the sum of a term proportional to a social utility, and another term that is an individual cost or reward. The agents are assumed to be irrational in their perception of the individual cost or reward. The multi equilibrium game is regularized, and its strictly concave potential function is used to select a unique equilibrium. This selected equilibrium is shown to be an $\epsilon-$equilibrium of the original game, where $\epsilon$ is parametrized by the regularizing function. A minorization-maximization based iterative learning scheme is proposed to learn equilibria in this game. This scheme converges to the potential-optimal equilibrium, and has superior convergence behaviour in comparison to gradient and best response methods.

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.14201v2 Announce Type: replace Abstract: Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent -specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real-world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.14259v2 Announce Type: replace Abstract: Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

Jiaqi Sun, Boyang Sun, Rasmy M. H., Xiangchen Song, Kun Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.14364v3 Announce Type: replace Abstract: Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting representations, while structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure, improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.14382v3 Announce Type: replace Abstract: Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppress unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.14417v2 Announce Type: replace Abstract: Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose \textbf{DAJI} (\emph{Dynamics-Aligned Joint Intent}), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42\% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15104v2 Announce Type: replace Abstract: Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

MeMo: Memory as a Model

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15156v2 Announce Type: replace Abstract: Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15157v2 Announce Type: replace Abstract: Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15229v2 Announce Type: replace Abstract: Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.

WorldParticle: Unified World Simulation of Lagrangian Particle Dynamics via Transformer

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15305v4 Announce Type: replace Abstract: A unified simulator that can model diverse physical phenomena without solver-specific redesign is a long-standing goal across simulation science. We present a learning-based particle simulator built on a single transformer architecture to model cloth, elastic solds, Newtonian and non-Newtonian fluids, granular materials, and molecular dynamics. Our model follows a prediction-correction design on a shared Lagrangian particle representation. An explicit predictor first advances particles under the known external forces, producing an intermediate state that captures externally driven motion but not inter-particle interactions. A learned corrector then predicts the residual position and velocity updates through three stages: a particle tokenizer that encodes local particle-particle, particle-boundary, and topology-guided interactions; a super-token encoder that hierarchically merges particle tokens into a compact set of super tokens via alternating self-attention and token merging; and a super-token decoder that lifts these super tokens back to particle resolution through cross-attention to predict per-particle position and velocity corrections. Progressive token merging reduces the attention cost at successive encoder layers by halving the token count at each level, and the decoder communicates through the compact super-token set rather than full particle-to-particle attention. Across the six dynamics categories, the same architecture generalizes to unseen materials, boundary configurations, initial conditions, and external forces. We further demonstrate downstream interactive control, inverse design, and learning from real-world manipulation data, reducing the need for per-phenomenon solver engineering.

Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design

Shukai Du, Junzhe Zhang, Yiming Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15419v2 Announce Type: replace Abstract: Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe that this corresponds to the simplest case of the least-action principle in classical mechanics, in which the kinetic Lagrangian yields free-particle straight-line trajectories. Building on this observation, we propose Lagrangian flow matching, a physics-based framework in which the probability path and velocity field are determined by minimizing the action of a general Lagrangian subject to the continuity equation and the prescribed endpoints. We show that this dynamic problem admits an equivalent static optimal transport (OT) formulation, yielding a family of simulation-free training objectives that recover OT-based flow matching as the kinetic special case and the trigonometric variance-preserving diffusion path as the harmonic-oscillator case. More general Lagrangians give rise to new probability paths and velocity fields, and numerical experiments show that they induce meaningful changes in the learned dynamics while remaining competitive with existing conditional flow matching models.

Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory "VaCoAl" as a Substrate for Vector-HaSH and TEM

Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15652v3 Announce Type: replace Abstract: Vector-HaSH and the Tolman-Eichenbaum Machine propose the hippocampal-entorhinal circuit factorizes content from a grid-cell scaffold, supporting compositional memory via ripple-mediated replay. Human electrophysiology shows multi-hop replay fidelity decays multiplicatively. We show VaCoAl, an algebro-deterministic hyperdimensional memory built from Galois-field linear-feedback shift registers, supplies a shared algebraic object.Specifically: (i) deterministic Galois-field diffusion offers a substrate-level alternative to random projections, ensuring quasi-orthogonality and bit-exact reproducibility; (ii) the path-integral Confidence Ratio provides a tractable model of multiplicative decay in multi-hop replay; (iii) VaCoAl's STDP-like path selection follows from architectural demands - similarity preservation and bounded search - constraining hippocampal computation.We map two distinct VaCoAl regimes to the EC-CA3 direct and EC-DG-CA3 trisynaptic pathways. Cellular evidence, including mossy-fiber detonator transmission and granule-cell sparse coding, supports a reading where the DG-CA3 pathway implements biophysical homologues of Galois-field arithmetic with approximate reversibility.Crucially, we connect this to Pearl's Ladder of Causation. Reversible GF(2) binding supplies the surgical-modification algebra required by the do-operator (rung 2). The dual architecture (Regime A anchoring the factual world, Regime B minting counterfactual worlds) supplies the parallel non-interfering substrate counterfactual reasoning provably requires (rung 3), yielding a profound Pearl-based evolutionary rationale.The framing proceeds in two tiers: VaCoAl is offered first as architectural correspondence, then as biophysical realization with approximate reversibility. We prove formal correspondences and derive testable iEEG predictions, bridging computational neuroscience and hyperdimensional computing.

SEED: Targeted Data Selection by Weighted Independent Set

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15691v2 Announce Type: replace Abstract: Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) \textit{node value calibration} that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) \textit{local scale normalization} that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct \texttt{Honeybee-Remake-SEED-200K}, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.

Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15876v3 Announce Type: replace Abstract: Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

Qian He, Zhenshuo Yang, Wenqi Liang, Chunhui Hao, Nicu Sebe, Jiandong Tian — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.15944v2 Announce Type: replace Abstract: Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/

Argus: Evidence Assembly for Scalable Deep Research Agents

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16217v3 Announce Type: replace Abstract: Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

The Impact of AI Search on the Online Content Ecosystem: Evidence from Google and Reddit

Peibo Zhang, Ruomeng Cui, Dennis J. Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16428v2 Announce Type: replace Abstract: Search engines traditionally complement online content platforms by directing users seeking information to external websites. The emergence of generative AI search tools that summarize answers directly on the results page may disrupt this relationship by making visits to source platforms optional. We study this question using Google AI Overviews and Reddit, one of the largest online discussion platforms. Our identification exploits Google's content moderation policy: Safe-for-Work (SFW) Reddit communities are indexed by Google organic search and surfaced in Google AI Overviews, while Not-Safe-for-Work (NSFW) communities, though indexed by organic search, are prohibited from being referenced in AI Overview summaries. Using a difference-in-differences design, we find that AI Overviews increase engagement in SFW communities: daily comments rise by 12.0 percent and the number of commenting users by 12.3 percent relative to NSFW communities. The effects are concentrated in experience-based discussions (opinions, advice, and personal experiences) rather than fact-based information. However, the subsequent introduction of Google AI Mode, which allows users to interact conversationally with the AI summary, largely eliminates these gains in experience-based content. These results suggest that the effects of AI search depend critically on interface design and types of content.

Customizing an LLM for Enterprise Software Engineering

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16517v2 Announce Type: replace Abstract: Enterprise software development is a continuous evolutionary process, characterized by incremental additions, architectural revisions, production deployments and rigorous maintenance. These activities generate valuable data that modern LLMs could be finetuned on, to unlock additional tool possibilities for enterprise software engineering. While frontier LLMs are already very capable, this form of customization offers a compelling path for enterprise-specific optimization. We introduce Gemini for Google (GfG)}, an adaptation of Gemini specialized for Google's internal software engineering ecosystem. This paper details the model's end-to-end development, from curating a trillion-token proprietary dataset to implementing a mid-training strategy that mitigates catastrophic forgetting. In a large-scale blind A/B study across 29,000 developers, Gemini for Google significantly outperformed baselines: reducing the mean number of iterations per turn by 23\%, and increasing code survival rates by about 17%. Beyond metrics, we provide a comprehensive blueprint for enterprise model adaptation, covering: (1)The extraction of high-value signals from software engineering data, (2)Data preparation strategies, (3)Full-stack model tuning (continued pre-training and post-training), and (4)The deployment of downstream applications. We believe this methodology offers a replicable path for other organizations to unlock the full potential of their internal engineering data.

Toward Template-Free Explainability for Monte Carlo Tree Search

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16524v2 Announce Type: replace Abstract: Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16530v2 Announce Type: replace Abstract: Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

Voice ''Cloning'' is Style Transfer

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16578v2 Announce Type: replace Abstract: Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss. However, in our work, we find that despite the term, voice cloning does not faithfully ''clone'' an individual's voice. Instead, we find that widely-used voice cloning models systematically apply style transfer to source voices. As rated by human annotators, cloned voices are perceived as more authoritative, warm, customer-service-like, and human-like compared to their sources. Human annotators also report greater trust in cloned voices than source voices, and a greater willingness to disclose sensitive personal information to them. Our work furthermore shows that voice cloning leads to homogenization of speaker characteristics, as measured by reduced variance in accent, speaking rate, and the audio embedding space. Together, our results highlight a new set of limitations and risks of voice cloning technology and their potential impact on human behavior.

PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting

Yangyou Liu, Zezhi Shao, Xinyu Chen, Hu Chen, Fei Wang, Yuankai Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16793v2 Announce Type: replace Abstract: Time series forecasting under non-stationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non-stationary dynamics through three physical hypotheses: wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics-informed, plug-and-play framework adopting a Disentangle--Evolve--Simulate design philosophy. Specifically, PULSE utilizes phase-anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic-Aware Mixup (SAM) to ensure robustness against out-of-distribution volatility. Empirically, PULSE enables a simple MLP backbone to achieve state-of-the-art or highly competitive performance across 12 real-world benchmarks. This validates that a correct physics-informed inductive bias is far more critical than raw architectural complexity for non-stationary forecasting. The code is available at: https://github.com/Gemost/PULSE.

Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy

Youngmok Ha, Viktor Schlegel, Yidan Sun, Anil Anthony Bharath — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16812v2 Announce Type: replace Abstract: While Local Differential Privacy (LDP) serves as a foundational primitive for distributed data collection, its stringent noise injection requirement often leads to severe degradation in data utility. This degradation stems from the task-agnostic nature of conventional LDP mechanisms, which inject noise uniformly across all dimensions regardless of their relative importance to the downstream objective. To address this issue, we propose a novel approach that mitigates noise in task-relevant subspaces of the data representation. Our method identifies task-critical subspaces via the Jacobian matrix of the public downstream model, selectively attenuates noise along those dimensions, and reshapes the isotropic noise of standard LDP into an anisotropic distribution. This method preserves the uniform per-dimension privacy budget while heterogeneously modulating noise impact across dimensions, thereby substantially enhancing data utility. Furthermore, our approach generalizes to both linear and non-linear models and integrates seamlessly with existing mechanisms. Extensive experiments on CIFAR-10-C (Brightness corruption at the highest severity level 5) demonstrate that integrating our approach improves the utility of PrivUnit2 and PrivUnitG by approximately 20\% at $\epsilon=7.5$. The source code is available at https://github.com/ymha/jacobian-anr-ldp.

Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding

Xiang Gao, Hui Tian, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16923v2 Announce Type: replace Abstract: Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.16962v2 Announce Type: replace Abstract: Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose \textbf{OmniVL-Guard Pro}, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce \textbf{Tree-Structured Self-Evolving Tool Trajectory Generation}, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose \textbf{Checker-Guided Agentic Reinforcement Learning} (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

Classification aggregation: a quantitative impossibility theorem

Yuval Filmus — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17136v2 Announce Type: replace Abstract: A group of individuals wishes to classify $m$ objects into $n$ categories in such a way that no class is left empty, a condition known as surjectivity. The opinions of the individuals are aggregated separately for each object using an aggregation function that can depend on the object. Maniquet and Mongin showed that if the aggregation functions are unanimous and the outcome must always be surjective, then the aggregation mechanism is dictatorial. Cailloux et al. showed that the same holds even if unanimity is relaxed to citizen sovereignty (each object can be classified into any category). We show that similar results hold even if we only require the outcome to be surjective with probability $1-\epsilon$ (with respect to an arbitrary symmetric i.i.d. distribution), provided that the aggregation functions are far from being constant. On the way, we characterize all aggregation mechanisms whose outcome is always surjective without any assumptions on the aggregation functions. Our approach uses a general result of Alekseev and Filmus which has wider applicability. We illustrate this by proving a similar impossibility result for aggregating equivalence relations.

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17140v2 Announce Type: replace Abstract: Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Mengtian Yang, Zhekun Zhang, Mingheng Wu, Jianwen Yan, Hanshi Sun, Li-wen Chang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17164v2 Announce Type: replace Abstract: Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating "what-if" Hooker Figure hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its significant real-world value.

One Step Further: Understanding PLC Binaries Through Cross-Platform Reverse Engineering and Function-Level Semantic Analysis

Ang Jia, Yaxin Duan, He Jiang, Zhenzhou Tian, Zhilei Ren, Xiaochen Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17392v2 Announce Type: replace Abstract: As emerging attacks increasingly target Industrial Control Systems (ICS), the security of Programmable Logic Controllers (PLCs) has become a critical concern. Binary Code Analysis (BCA), which enables analysts to understand compiled programs without source code, is essential for ICS security tasks such as post-attack digital forensics and incident response. However, automated BCA for PLC binaries remains challenging due to three key issues: heterogeneous binary formats across PLC platforms, entangled program semantics caused by the mixture of control logic with runtime code, and limited semantic representations for interpretable and learning-based downstream analysis. In this paper, we present PLC-BinX, a BCA workflow for cross-platform PLC binary understanding. PLC-BinX analyzes PLC binaries from four platforms: CODESYS v3, GEB, OpenPLC v2, and OpenPLC v3, and recovers function-level information through cross-platform reverse engineering, core-function extraction, and function-level semantic representation construction. Based on the recovered semantic representations, we further study two downstream tasks: toolchain prediction and functionality prediction. Under ten-fold program-level evaluation, PLC-BinX achieves 100.00% precision, recall, and F1 in toolchain prediction, and 51.43% precision, 49.38% recall, and 49.18% F1 in functionality prediction over 22 labels. The results demonstrate that PLC-BinX provides an effective and interpretable approach to cross-platform PLC binary understanding by exposing task-relevant function-level semantics from heterogeneous PLC binaries.

Weighted Reverse Convolution for Feature Upsampling

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17472v2 Announce Type: replace Abstract: Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of \textbf{\textit{inverse problem}} and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.

Structured Neural Marked Point Processes for Interpretable Event Interaction Modeling

Zhitong Xu, Qiwei Yuan, Yinghao Chen, Shandian Zhe, Bin Shen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17568v2 Announce Type: replace Abstract: Multi-class event streams arise in numerous real-world applications, where uncovering structured, interpretable inter-event relationships, together with accurate prediction, remains a central challenge. Existing neural point process models are highly expressive but encode event interactions in a black-box manner, preventing explicit discovery of structured dependencies. In this paper, we propose a structured neural marked point process (SNMPP) that achieves high modeling flexibility while enabling explicit event-wise and class-wise relationship discovery from data. Our model constructs a product-form neural influence kernel composed of a signed interaction network over event types and a delay-aware monotonic temporal network. This design enables explicit characterization of inter-class influence topology -- including excitation, inhibition, and neutrality -- while flexibly capturing diverse temporal decay patterns and potential influence delays. For efficient learning, we develop a stratified Monte Carlo estimator for stochastic training. Extensive experiments on synthetic and real-world benchmark datasets validate the ability of our approach to uncover structured relationships and deliver strong predictive performance.

Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17618v2 Announce Type: replace Abstract: Autism Spectrum Disorder (ASD) is characterized by challenges with social interaction and communication and by restricted or repetitive patterns of thought and behavior, with significant variability in presentation. Approximately a quarter of children with ASD are classified as having profound autism, who often exhibit challenging behaviors, such as self-injurious behavior, aggression, elopement, or pica, that pose serious safety risks and disrupt learning in educational settings. Prior work has applied wearable sensors and machine learning to detect challenging behaviors, but has been largely confined to controlled laboratory environments. This work demonstrates that predicting challenging behavior episodes is feasible in a real-world special education classroom. We collected approximately 110.7 hours of labeled multimodal wearable data comprising accelerometry, electrodermal activity (EDA), and skin temperature from 9 children and young adults aged 10 to 21 years across standard classroom sessions. We fine-tuned state-of-the-art foundation models for multimodal wearable time-series analysis and show that challenging behavior episodes can be predicted up to 10 minutes in advance with an AUC-ROC of 0.78. These results establish a concrete foundation for developing proactive in-class intervention systems that enable teachers to minimize the safety risks of challenging behaviors in special education classrooms

SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation

Abderrahmene Boudiaf, Irfan Hussain, Sajid Javed — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17630v2 Announce Type: replace Abstract: Open-vocabulary segmentation models such as SAM3 perform well across broad categories via text prompting, yet degrade when target classes are visually underrepresented in pretraining or depart from canonical depictions-limitations text prompts cannot resolve spatially. We present SegRAG, a training-free retrieval-augmented segmentation framework that grounds SAM3 with class-specific point prompts derived from a curated DINOv3 feature bank. Offline, dense patch-level descriptors are extracted from annotated references and filtered by Intra-Class Cohesion Distillation (ICCD), retaining only prototypes that reliably retrieve within-class foreground. At inference, Topographic Similarity Grounding (TSG) computes a cosine-similarity landscape against retrieved prototypes, identifies coherent high-confidence regions via connected-component analysis, and extracts peak locations through non-maximum suppression. The resulting point prompts are delivered jointly with class-name text in a single SAM3 forward pass. On four standard benchmarks, SegRAG consistently outperforms the text-only baseline, gaining up to +3.92 mIoU on LVIS. On AgML agricultural benchmarks under zero-shot domain transfer, it raises mean IoU from 25.27 to 59.24 (+33.97) and recovers individual classes from zero to over 95 mIoU. Ablations confirm that ICCD, TSG, and joint prompting each contribute independently and compound when combined. Code is available at (https://github.com/boudiafA/SegRAG).

General Science Ranking (GSR): An Open-Source, Citation-Normalized Journal and Conference Classification System for Computer Science and Medicine

Zhikai Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17657v2 Announce Type: replace Abstract: The academic journal zoning system is central to evaluating research talent, funding, and institutions. The CAS journal partition system, one of East Asia's most widely used tools, will cease operation in March 2026, creating a policy gap. Existing alternatives have major limitations: JCR depends on paid databases and excludes conferences; Scimago/CiteScore relies on Elsevier proprietary data; expert-based rankings such as CCF and CORE lack quantitative foundations and update slowly. This paper proposes the General Science Ranking (GSR), a multidimensional bibliometric framework built entirely on open-source data. GSR covers 500 computer science venues (397 journals and 103 conferences) and 500 medical journals using OpenAlex and Semantic Scholar. Scores combine four indicators: field-weighted citation impact (FWCI), two-year impact factor (IF2), five-year h-index (h5), and citation CAGR. For CS conferences lacking citation time-series data, IF2-approx was estimated from calibration on 1.41 million OpenAlex journal papers. Rankings adopt fixed quotas: Q1 (1-50), Q2 (51-100), Q3 (101-200), and Q4 (201+). All code and data are open source. In CS rankings, conferences and journals each occupy 25 of the top 50 Q1 positions. The leading conferences are NeurIPS, ICCV, ICLR, and CVPR. In medicine, CA: A Cancer Journal for Clinicians ranks first, followed by New England Journal of Medicine and The Lancet. Agreement with JCR Q1 reaches 84 percent in medicine and 71 percent in CS. Sensitivity analysis shows only 1.7 percent to 2.5 percent of CS conferences change partitions, indicating robustness. GSR provides a free, reproducible, field-normalized ranking system covering both journals and conferences, making it suitable for institutional evaluation policies.

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

Anvesh Rao Vijjini, Sagar Manjunath, Snigdha Chaturvedi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17694v2 Announce Type: replace Abstract: Power differences shape human communication through well documented socio cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high or low status personas. Using personas from diverse professions, we simulate multi turn, power asymmetric dialogues (e.g., principal teacher, justice lawyer) and measure (i) language coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs show key socio-cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17776v2 Announce Type: replace Abstract: Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.

Learning Empirical Evidence Equilibria under Weak Environmental Coupling

Aya Hamed, Jason R. Marden, Jeff S. Shamma — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17848v2 Announce Type: replace Abstract: Strategic multi-agent systems are fundamentally characterized by decentralization, uncertainty, and ambiguity. Agents operating under limited observations will often need to make decisions based on simplified internal models of the environment, reflecting bounded rationality in both computational capacity and environmental knowledge. The Empirical Evidence Equilibrium (EEE) framework explicitly accounts for these limitations by modeling each agent as forming a potentially misspecified belief derived from signals obtained through partial observations of the environment. The resulting equilibrium concept captures the system's steady state under bounded rationality and decentralization. In this work, we study games in which the environment dynamics are driven jointly by exogenous factors and agents' actions. We analyze agent behavior under Q-value iteration where each agent independently forms a belief model, computes Q-values, and derives a greedy strategy, yet the collective actions of all agents jointly shape the environment each agent faces at the next stage. We prove that despite this decentralization, an EEE emerges from the joint dynamics when the coupling between agents' actions and the environment is sufficiently weak. We further extend this result to softmax policies, establishing a contraction result under a sufficient coupling condition.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17946v2 Announce Type: replace Abstract: Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

Active Defense Against False Data Injection Attacks in Robotic Manipulators

Gabriele Gualandi, Carl Mikael Larsson, Alessandro V. Papadopoulos — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.17950v2 Announce Type: replace Abstract: Robotic systems are vulnerable to False Data Injection Attacks (FDIAs), where adversaries corrupt sensor signals to gain malicious control. Feedback linearization exposes robotic systems to integrator vulnerability, making them susceptible to stealthy attacks that can cause significant deviations in end-effector behavior without raising alarms. This paper addresses the resilience of manipulators against finite-horizon FDIAs by formalizing two defense methods, namely anomaly-aware virtual damping and manipulability reduction, with probabilistic guarantees on nominal task execution. Simulations on a 7-DOF redundant manipulator show that the proposed defenses substantially reduce the impact of FDIA compared to using solely a threshold-based ADS like the Chi-squared, while preserving nominal task performance in the absence of attack.

FUSE: A Framework for Unified State Estimation in Vehicular and Robotic SLAM Systems

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18047v2 Announce Type: replace Abstract: Tightly coupled SLAM formulations under mixed-rate sensing often bind temporal processing, local geometric association, estimator formulation, and map-update policy into method-specific designs. Such binding makes it difficult to vary one design choice without re-engineering the rest of the state-estimation process. This paper presents FUSE, a framework for unified state estimation in vehicular and robotic SLAM systems. FUSE organizes the state-estimation interface around observation ingestion, propagation, update, and state query, and uses this interface to separate temporal processing, residual-ready local geometric association, estimator formulation, and map-update policy. A LiDAR--IMU instantiation is developed to examine the framework under mixed-rate sensing and directional degeneracy, where high-rate inertial propagation, LiDAR-triggered geometric update, residual screening, and degeneracy-aware correction operate through the same interface boundaries. On a 418~m loop-corridor sequence, the instantiation reports a 1.626 m end-to-end trajectory error, corresponding to a 7.9% relative error reduction compared with Faster-LIO, the lowest-error baseline on this sequence. The results support FUSE as a framework for organizing state-estimation design choices and show how the evaluated instantiation regularizes updates along weakly observable directions.

NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18447v3 Announce Type: replace Abstract: Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability, and change rapidly over time. Second, the inaccuracy of pose information in the images, results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.

One Developer Is All You Need: A Case Study of an AI-Augmented One-Person Squad in a Brownfield Enterprise

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18461v2 Announce Type: replace Abstract: AI tools are enabling engineers to absorb roles previously distributed across cross-functional squads, yet there is little structured evidence on how to design or evaluate such a one-person squad in a regulated enterprise setting. Without that evidence, organizations adopting this model lack guidance on which design decisions make it viable and which conditions cause it to break down. We report a case study in which a single staff engineer, supported by four AI agents under a Spec-Driven Development workflow, delivered a brownfield product initiative scoped for a four-person squad in half the planned time, with 90\% acceptance of AI-generated code on first review, full integration test pass rates, and an above-85\% reduction in direct staffing cost. The results indicate that AI does not replace team members it multiplies the throughput of the experienced engineer who remains, making specification quality and institutional knowledge, not model capability, the binding constraints on one-person squad success.

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18579v3 Announce Type: replace Abstract: Pre-training on text-attributed graphs (TAGs) is central to building transferable graph foundation models, where LLM-as-Aligner methods align graph and text representations through the semantic knowledge of large language models. However, these methods usually assume that node texts provide sufficient and reliable supervision, an assumption often violated in real-world sparse TAGs. When textual anchors are missing, noisy, or uneven across domains, graph structures must be aligned with weak semantic evidence, leading to unreliable structure-semantics correspondence and sparsity-induced transfer bias. This paper presents S2Aligner, a sparsity-aware and structure-enhanced LLM-as-Aligner framework for graph-text pre-training on sparse TAGs. The key idea is to decouple semantic alignment from structural modeling, allowing topology-aware signals to enhance alignment without contaminating the shared semantic space. Specifically, S2Aligner decomposes graph-text representations into semantic and structural components, uses structure-oriented reconstruction with consistency control to inject reliable topology cues into text representations, and suppresses inconsistent structural signals under textual sparsity. Moreover, S2Aligner introduces sparsity-aware cross-domain risk balancing, which calibrates domain risks through a global-domain density ratio and downweights unreliable sparse samples via graph reliability estimation. Theoretical analysis shows that this objective reduces cross-domain generalization gaps by controlling domain risk discrepancy. Extensive experiments across diverse graph domains, sparsity levels, and downstream tasks demonstrate that S2Aligner consistently outperforms existing baselines.

An Approximation Algorithm for Graph Label Selection

Josia John, Simon Meierhans, Maximilian Probst Gutenberg — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18623v2 Announce Type: replace Abstract: In the graph label selection problem, one is given an $n$-vertex graph and a budget $k$, and seeks to select $k$ vertices whose labels enable accurate prediction of the labels on the remaining vertices. This problem formalizes distilling a small representative set from the whole graph. We present the first $\tilde{O}(\log^{1.5} n)$-approximation algorithm for graph label selection under the standard budget constraint. Prior work either relies on resource augmentation, allowing substantially more than $k$ labeled vertices, or consists primarily of heuristics without provable guarantees. Finally, we demonstrate that practical heuristic variants of our algorithm scale to significantly larger graphs than previous methods, while essentially retaining their quality.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18678v2 Announce Type: replace Abstract: We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

General Preference Reinforcement Learning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18721v2 Announce Type: replace Abstract: Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval~2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.

Spectral Progressive Diffusion for Efficient Image and Video Generation

Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18736v2 Announce Type: replace Abstract: Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.

WorldString: Actionable World Representation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18743v2 Announce Type: replace Abstract: Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with a emphasis on modeling the physical world. Within the scope of physical world model, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Sweetly, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

Exact Linear Attention

Weinuo Ou — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18848v2 Announce Type: replace Abstract: This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by exploiting the exact decomposition property of kernel functions, thereby eliminating approximation error. We identify and address two key limitations of prior linear attention -- gradient explosion and token attention dilution -- by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, including the Hadamard Exp Kernel, Summation Squared Euclidean Distance Kernel, and Subtraction Squared Euclidean Distance Kernel, each tailored for specific attention behaviors. Beyond the core attention formulation, the paper presents three engineering innovations: (1) a Hyper-Link structure that replaces traditional residual connections to mitigate gradient degradation; (2) a Memory Lobe module based on bidirectional linear attention, which captures "transformation flow" across layers to implement qualitative memory and an implicit reinforcement learning paradigm; and (3) a routing-score-based bias mechanism for Mixture-of-Experts (MoE) to improve interpretability and semantic alignment. Experimental results demonstrate that ELA achieves up to 6x faster decoding speed and 75% reduction in KV cache memory usage compared to full attention, while maintaining comparable or superior training performance. The proposed memory module accelerates convergence and enhances generalization. Furthermore, we extend the linear attention principle to vision models, yielding YOLO-LAT, which attains up to 4.3x GPU inference speedup and 7.9x parameter reduction with competitive detection accuracy. These results underline the broad applicability of exact linear attention for scaling Transformer models to ultra-long sequences and efficient visual tasks.

Spectral structural distortion reveals redundant neurons in neural networks

Yongyu Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18860v2 Announce Type: replace Abstract: Overparameterized neural networks often contain many removable neurons, yet what makes a neuron redundant remains poorly understood. Existing pruning criteria commonly rely on local quantities such as weight magnitude, activation strength, or gradient sensitivity, but these measures provide limited insight into the structural role of a neuron in the transformation performed by a layer. Here we show that neuronal redundancy can be characterized by weak participation in the spectral structural distortion induced by layer-wise representation transformations. For each hidden layer of a trained network, we record pre-activation and post-activation hidden states, model neurons as graph nodes, and construct input-side and output-side graphs that describe neuron-level relational structure before and after the layer transformation. We then define a spectral structural importance score that measures the contribution of each neuron to the dominant graph-spectral distortion between these two relational structures. Low-participation neurons are treated as structurally redundant and removed through an iterative pruning process in which scores are recomputed after each structural change. No parameter updates are performed during intermediate pruning rounds; after the target parameter reduction is reached, a single recovery fine-tuning stage is applied to the compact model. Direct ablation analysis and experiments across conventional neural networks, encoder-only Transformers, and decoder-only language models show that this graph-spectral criterion identifies removable neurons and Transformer units while preserving task performance after compression. These results suggest that neural redundancy is not merely a consequence of small weights or weak activations, but can be understood through weak participation in the spectral distortion of layer-wise relational structure.

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Yujie Lin, Chengyi Yang, Zhishang Xiang, Yiping Song, Jinsong Su — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18879v2 Announce Type: replace Abstract: Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the github: https://github.com/XMUDeepLIT/ZeroUnlearn.

Towards Zero Trust Architecture: A Pilot Study on Information Systems Security Readiness amongst Small and Medium Enterprises

Yu Deng, Anushia Inthiran — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18901v2 Announce Type: replace Abstract: Small and medium enterprises (SMEs) face growing cyber threats but often lack the resources and expertise needed to adopt Zero Trust Architecture (ZTA). This pilot study examines the drivers and barriers shaping SME perceptions of ZTA necessity and proposes an exploratory staged adoption path. Survey data from 64 IT and security professionals in the Asia-Pacific region show that ZTA familiarity and cloud-computing needs are the strongest positive correlates of perceived necessity, whereas accumulated barriers show only a weak negative association. Identity and access management complexity and scalability emerge as the main implementation hurdles. Based on these findings, we propose a three-stage route for SMEs: strengthening identity governance, segmenting high-value assets, and introducing targeted monitoring in line with operational capacity. The study offers early evidence for more realistic Zero Trust transitions in resource-constrained firms.

Agent Security is a Systems Problem

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.18991v2 Announce Type: replace Abstract: We take the position that agent security must be approached as a systems problem: the AI model powering the agent must be treated as an untrusted component, and security invariants must be enforced at the system level. Through this lens, efforts to increase model robustness (the dominant viewpoint in the community) are insufficient on their own. Instead, we must complement existing efforts with techniques from the systems security domain. Based on our experience as cybersecurity researchers in operating systems, networks, formal methods, and adversarial machine learning, we articulate a set of core principles, grounded in decades of systems security research, that provide a foundation for designing agentic systems with predictable guarantees. As evidence, we analyze eleven representative real-world attacks on agents and discuss how systems principles, if realized, could have prevented these attacks. We also identify the research challenges that stand in the way of implementing these principles in agents.

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19138v2 Announce Type: replace Abstract: The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An inmemory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19269v2 Announce Type: replace Abstract: Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.

Toward User Comprehension Supports for LLM Agent Skill Specifications

Zikai Alex Wen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19362v2 Announce Type: replace Abstract: Users often interpret and select agent skills through their SKILL markdown specifications. To protect users, existing audits mainly focus on malicious or unsafe skills. We study the complementary question of whether specifications help users form bounded expectations about what a skill consumes, produces, and covers. Across 878 cybersecurity skills, we used rule-based coding to measure textual cues for four comprehension anchors, namely operational basis, output contract, boundary disclosure, and example capability demonstration. Cues for operational basis were common, but only 19.0% of specifications exhibited cues for an example task, sample, or expected outcome, and only 2.3% exhibited cues for all four anchors. We further examined a small DNS/C2 telemetry subset (n$=$6) to illustrate why missing examples may matter. Examples appeared to make first local checks easier to construct, while no-example skills typically required helper code inspection to recover command arguments or output fields. We argue that agent-skill evaluation should treat specifications as user-facing capability disclosures, not merely as containers for executable instructions.

Generative Recursive Reasoning

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19376v2 Announce Type: replace Abstract: How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website

LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19385v2 Announce Type: replace Abstract: The explosive growth of AI-generated images has created a sustainability challenge for storage infrastructure. Platforms like Midjourney and Adobe Firefly already host billions of generative images, yet conventional object stores persist them as blobs with full-resolution pixels, consuming huge amounts of storage capacity and bandwidth. Unlike natural photos, however, AI-generated images can be deterministically reconstructed from compact, model-native latent tensors, making persistent image storage fundamentally redundant. This paper presents LatentBox, a latent-first storage system for AI-generated images. LatentBox treats compressed latents as durable storage objects and uses on-demand GPU reconstruction on the read path to trade inexpensive compute for large persistent storage savings. Our design is guided by the first large-scale analysis of AI-generated image access we are aware of, based on a 35-month, 2-billion-request production trace from a major generative-content platform. Motivated by the trace analysis, LatentBox keeps frequently accessed images in decoded pixel format for fast hits, stores less-active objects as compressed latents to expand effective cache capacity, and continuously adjusts the splits between the image and latent cache to optimize user-perceived access latency.We build a LatentBox prototype and evaluate it with the production trace. LatentBox reduces persistent storage by 78.7% with competitive or even lower mean and tail latency over a pure image-based storage.

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19398v2 Announce Type: replace Abstract: Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.

Unlocking the Potential of Continual Model Merging: An ODE Perspective

Lihong Lin, Haidong Kang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19409v2 Announce Type: replace Abstract: Continual Model Merging (CMM) enables rapid customization of foundation models across sequentially arriving tasks, offering a scalable alternative to repeated retraining. However, existing merging rules lack explicit controllability over the allocation of learning capacity between previously learned capabilities and newly merged models. Consequently, as tasks are merged sequentially, this deficiency accumulates into severe forgetting, particularly in scenarios with heterogeneous task importance, where performance allocation becomes highly inconsistent. The key reason can be attributed to the fact that previous methods treat each task model as an isolated parameter point and apply fixed algebraic combinations, rather than explicitly constructing a transition that respects how independently trained models can be connected in parameter space. Motivated by mode connectivity, we assume that desirable merged models lie on low loss connecting paths, and that continual merging should follow such paths without crossing loss barriers that induce forgetting. Grounded in these insights, we propose a novel ODE-driven Merging (ODE-M) tailored for CMM that traces such a path by integrating a time-dependent velocity field and enforcing barrier constraints to prevent loss-increasing steps. Extensive experiments demonstrate that ODE-M achieves state-of-the-art performance compared to its competitors across mainstream CMM benchmarks.

ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders

Carlo Romeo, Andrew D. Bagdanov — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19503v2 Announce Type: replace Abstract: Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints. Source code is available at https://github.com/CarloRomeo427/ARC_RL.git.

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

David Pape, Jonathan Evertz, Lea Sch\"onherr — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19537v2 Announce Type: replace Abstract: Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama$.$cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLM and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19624v2 Announce Type: replace Abstract: For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

CAD-Free Learning of Spacecraft Pose Estimators via NeRF-Based Augmentations

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19649v2 Announce Type: replace Abstract: Spacecraft pose estimation networks require tens of thousands of CAD-rendered images to be trained. This reliance on synthetic CAD data (i) limits applicability to targets with reliable geometry prior, excluding uncooperative or poorly documented spacecraft, and (ii) causes poor generalization to real on-orbit conditions due to unrealistic illumination and material appearance. This paper introduces a NeRF-based image augmentation method that enables the learning of spacecraft pose estimators from only a few tens to a few hundreds of images. The method learns a Neural Radiance Field of the target and generates a large, diverse dataset through geometrically-consistent viewpoint and appearance augmentation. This augmented dataset enables the training of accurate target-specific pose estimators without requiring a CAD model or large synthetic datasets. Experiments show that our approach supports the training of accurate pose estimators from only 25 to 400 realistic images, even under severe illumination variations. When applied on large CAD-based synthetic datasets, the NeRF-based augmentation also enhances out-of-domain generalization, yielding improved robustness to real on-orbit conditions.

LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models

Hyunsoo Han, Sangyeop Yeo, Jaejun Yoo — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19729v2 Announce Type: replace Abstract: We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from its substantially larger capacity - poses a significant challenge for the student model to faithfully mimic. To address this problem, we propose a coarse-to-fine distillation framework with LInear FiTtingbased distillation (LIFT) and Piecewise Local Adaptive Coefficient Estimation (PLACE). First, LIFT decomposes the objective into a "coarse" alignment and a "fine" refinement. The student is then trained on coarse alignment before proceeding to hard refinement. Second, PLACE extends LIFT to address spatially non-uniform errors by partitioning outputs into error-based groups, providing locally adaptive guidance. Our experiments show that LIFT and PLACE is effective across diffusion spaces (image/latent), backbones (U-Net/DiT), tasks (unconditional/conditional), datasets, and even extends to flow-based models such as MMDiT (SD3). Furthermore, under extreme compression with a 1.3M-parameter student (only 1.6% of the teacher), conventional KD fails to provide sufficient guidance for stable training, with FID scores often degrading to 50-200+, but our method remains stably convergent and achieves an FID of 15.73.

Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19776v2 Announce Type: replace Abstract: Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.

FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19846v2 Announce Type: replace Abstract: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs. Project page and code at https://joslefaure.github.io/assets/html/finebench.html.

SSV: Sparse Speculative Verification for Efficient LLM Inference

Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19893v2 Announce Type: replace Abstract: Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SSV, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SSV combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SSV achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.

D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20036v2 Announce Type: replace Abstract: Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (\texttt{Rides}) and gross merchandise value (\texttt{GMV}), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet these requirements, we introduce D$^3$-Subsidy (Dynamic Driver-side Diffusion-based Subsidy), a hierarchical diffusion-based framework for deployable city-wide subsidy control. To bridge the train-inference gap, D$^3$-Subsidy employs a prefix-conditioned diffusion model that samples plausible future trajectories from immutable historical observations, ensuring the training protocol aligns with the fixed-history nature of online deployment. These generated plans are then decoded by a context-conditioned inverse module into low-dimensional city-level control signals. For scalable execution, we bridge the gap between city-level planning and fine-grained dispatch via a Lagrangian-dual-derived mapping, which embeds subsidy-rate caps directly into order-driver incentives without iterative optimization. Additionally, a multi-city pretraining strategy with parameter-efficient fine-tuning enables robust transfer across heterogeneous cities. Extensive offline evaluations demonstrate that D$^3$-Subsidy improves \texttt{Rides} and \texttt{GMV} while enhancing cap compliance, and a real-world A/B test confirms significant uplift while keeping budget-related violation metrics within operational thresholds.

PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20052v2 Announce Type: replace Abstract: Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label \textbf{prompt}-tuning approach for \textbf{rad}iology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.

PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars

Julian Kaltheuner, Jan Spindler, Sina Kitz, Patrick Stotko, Reinhard Klein — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.20185v2 Announce Type: replace Abstract: Existing Gaussian avatar methods typically parameterize geometry on a body-template surface, which entangles the avatar's representation space with the template's deformation space and limits the capture of layered, off-body, and non-rigid clothing geometry. We present PiG-Avatar, which addresses this limitation by using the parametric body model solely for kinematic transport, while representing the avatar as Gaussians anchored in a volumetric canonical space governed by a continuous neural field. This decouples representation from template topology, avoiding the geometric constraints of surface-based parameterizations. Kinematic coherence is maintained through 3D barycentric anchor transport, which guides motion without constraining geometry and allows anchors to deviate freely from the template surface, yielding dense, stable temporal surface correspondences by construction. To make this unconstrained formulation tractable, we introduce dual-level spatially coherent optimization, combining Sobolev-preconditioned neural-field updates with a novel KNN-based preconditioning of canonical anchor geometry. Together, these mechanisms induce an emergent self-organization of anchor density: anchors migrate toward regions of high curvature, appearance variation, and non-coherent motion without explicit heuristics. As a result, complex clothing geometry and layered surfaces emerge as natural, high-fidelity outputs. This single representation further supports hierarchical reconstruction across multiple levels of detail, with coarse-level supervision propagating to finer levels through the shared field and coupled anchor graph. On established benchmarks featuring subjects with complex clothing and challenging non-rigid motion, PiG-Avatar achieves state-of-the-art rendering quality, generalizes robustly to imperfect body model initialization, and renders in real time across all detail levels.

Clustered independence and bounded treewidth

Kolja Knauer, Torsten Ueckerdt — Thu, 21 May 2026 00:00:00 -0400

arXiv:2303.13655v4 Announce Type: replace-cross Abstract: A set $S\subseteq V$ of vertices of a graph $G$ is a $c$-clustered set if it induces a subgraph with components of order at most $c$ each, and $\alpha_c(G)$ denotes the size of a largest $c$-clustered set. For any graph $G$ on $n$ vertices and treewidth $k$, we show that $\alpha_c(G) \geq \frac{c}{c+k+1}n$, which improves a result of Dvo\v{r}{\'a}k and Wood [Innov.\ Graph Theory, 2025], while we construct $n$-vertex graphs $G$ of treewidth $k$ with $\alpha_c(G)\leq \frac{c}{c+k}n$. In the case $c\leq 2$ or $k=1$ we prove the better lower bound $\alpha_c(G) \geq \frac{c}{c+k}n$, which settles a conjecture of Chappell and Pelsmajer [Electron.\ J.\ Comb., 2013] and is best-possible. Finally, in the case $c=3$ and $k=2$, we show $\alpha_c(G) \geq \frac{5}{9}n$ which is best-possible.

Optical Quantum Mixed-State Reconstruction With Multiple Deep Learning Approaches

Nhan Trong Luu, Tuyen Quang Nguyen, Duong Trung Luu, Thang Cong Truong — Thu, 21 May 2026 00:00:00 -0400

arXiv:2407.01734v4 Announce Type: replace-cross Abstract: Quantum state tomography is a crucial technique for characterizing the state of a quantum system, which is essential for many applications in quantum technologies. In recent years, there has been growing interest in leveraging neural networks to enhance the efficiency and accuracy of quantum state tomography. However, versatile methods that are broadly applicable across diverse reconstruction scenarios remain relatively underexplored. In this paper, we present two neural network-based reconstruction approaches for both pure and mixed quantum state tomography: Restricted Feature Based Neural Network and Mixed States Neural Network. By leveraging class information during reconstruction, we are able to achieve state-of-the-art performance of tomography for both pure and mixed quantum states.

Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features

Ikjun Choi, Ilmun Kim — Thu, 21 May 2026 00:00:00 -0400

arXiv:2407.08976v2 Announce Type: replace-cross Abstract: Recent years have seen a surge in methods for two-sample testing, among which the Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data. Despite its success and widespread adoption, the primary limitation of the MMD test has been its quadratic-time complexity, which poses challenges for large-scale analysis. While various approaches have been proposed to expedite the procedure, it has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost. To fill this gap, we revisit the approximated MMD test using random Fourier features, and investigate its computational-statistical trade-off. We start by revealing that the approximated MMD test is pointwise consistent in power only when the number of random features approaches infinity. We then consider the uniform power of the test and study the time-power trade-off under the minimax testing framework. Our result shows that, by carefully choosing the number of random features, it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time. We demonstrate this point under different distributional assumptions such as densities in a Sobolev ball. Our theoretical findings are corroborated by simulation studies.

Potential Hessian Ascent: The Sherrington-Kirkpatrick Model

David Jekel, Juspreet Singh Sandhu, Jonathan Shi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2408.02360v3 Announce Type: replace-cross Abstract: We present the first iterative spectral algorithm to find near-optimal solutions for a random quadratic objective over the discrete hypercube, resolving a conjecture of Subag [Subag, Communications on Pure and Applied Mathematics, 74(5), 2021]. The algorithm is a randomized Hessian ascent in the solid cube, with the objective modified by subtracting an instance-independent potential function [Chen et al., Communications on Pure and Applied Mathematics, 76(7), 2023]. Using tools from free probability theory, we construct an approximate projector into the top eigenspaces of the Hessian, which serves as the covariance matrix for the random increments. With high probability, the iterates' empirical distribution approximates the solution to the primal version of the Auffinger-Chen SDE [Auffinger et al., Communications in Mathematical Physics, 335, 2015]. The per-iterate change in the modified objective is bounded via a Taylor expansion, where the derivatives are controlled through Gaussian concentration bounds and smoothness properties of a semiconcave regularization of the Fenchel-Legendre dual to the Parisi PDE. These results lay the groundwork for (possibly) demonstrating low-degree sum-of-squares certificates over high-entropy step distributions for a relaxed version of the Parisi formula [Open Question 1.8, arXiv:2401.14383].

Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.12771v2 Announce Type: replace-cross Abstract: The ability to discover new materials with desirable properties is critical for numerous applications from helping mitigate climate change to advances in next generation computing hardware. AI has the potential to accelerate materials discovery and design by more effectively exploring the chemical space compared to other computational methods or by trial-and-error. While substantial progress has been made on AI for materials data, benchmarks, and models, a barrier that has emerged is the lack of publicly available training data and open pre-trained models. To address this, we present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models. OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity. Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard and are capable of predicting ground-state stability and formation energies to an F1 score above 0.9 and an accuracy of 20 meV/atom, respectively. We explore the impact of model size, auxiliary denoising objectives, and fine-tuning on performance across a range of datasets including OMat24, MPtraj, and Alexandria. The open release of the OMat24 dataset and models enables the research community to build upon our efforts and drive further advancements in AI-assisted materials science.

Learning junta distributions, quantum junta states, and QAC$^0$ circuits

Jinge Bao, Francisco Escudero-Guti\'errez — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.15822v3 Announce Type: replace-cross Abstract: In this work, we consider the problems of learning junta distributions, their quantum counterparts (quantum junta states) and $\mathsf{QAC}^0$ circuits, which we show to be close to juntas. (1) Junta distributions. A probability distribution $p:\{-1,1\}^n\to \mathbb [0,1]$ is a $k$-junta if it only depends on $k$ bits. We show that they can be learned with to error $\varepsilon$ in total variation distance from $O(2^k\log(n)/\varepsilon^2)$ samples, which quadratically improves the upper bound of Aliakbarpour et al. (COLT'16) and matches their lower bound in every parameter. (2) Junta states. We initiate the study of $n$-qubit states that are $k$-juntas, those that are the tensor product of a $k$-qubit state and an $(n-k)$-qubit maximally mixed state. We show that these states can be learned with error $\varepsilon$ in trace distance with $O(12^{k}\log(n)/\varepsilon^2)$ single copies. We also prove a lower bound of $\Omega((4^k+\log (n))/\varepsilon^2)$ copies. Additionally, we show that, for constant $k$, $\tilde{\Theta}(2^n/\varepsilon^2)$ copies are necessary and sufficient to test whether a state is $\varepsilon$-close or $7\varepsilon$-far from being a $k$-junta. (3) $\mathsf{QAC}^0$ circuits. Nadimpalli et al. (STOC'24) recently showed that the Pauli spectrum of $\mathsf{QAC}^0$ circuits (with a limited number of auxiliary qubits) is concentrated on low-degree. We remark that they implied something stronger, namely that the Choi states of those circuits are close to be juntas. As a consequence, we show that $n$-qubit $\mathsf{QAC}^0$ circuits with size $s$, depth $d$ and $a$ auxiliary qubits can be learned from $2^{O(\log(s^22^a)^d)}\log (n)$ copies of the Choi state, improving the $n^{O(\log(s^22^a)^d)}$ by Nadimpalli et al.

Improved convergence rate of kNN graph Laplacians: differentiable self-tuned affinity

Xiuyuan Cheng, Yixuan Tan, Nan Wu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2410.23212v2 Announce Type: replace-cross Abstract: In graph-based data analysis, $k$-nearest neighbor ($k$NN) graphs are widely used due to their adaptivity to local data densities. Allowing weighted edges in the graph, the kernelized graph affinity provides a more general type of $k$NN graph where the $k$NN distance is used to set the kernel bandwidth adaptively. In this work, we consider a general class of $k$NN graph where the graph affinity is $W_{ij} = \epsilon^{-d/2} k_0 ( \| x_i - x_j \|^2 / \epsilon \phi( \hat \rho(x_i), \hat \rho(x_j) )^2 ) $, with $\hat{\rho}(x)$ being the (rescaled) $k$NN distance at the point $x$, $\phi$ a symmetric bi-variate function, and $k_0$ a non-negative function on $[0,\infty)$. Under the manifold data setting, where $N$ i.i.d. samples $x_i$ are drawn from a density $p$ on a $d$-dimensional unknown manifold embedded in a high dimensional Euclidean space, we prove the operator pointwise convergence of the $k$NN graph Laplacian to the limiting manifold operator (depending on $p$) at the rate of $O(N^{-2/(d+6)})$, up to a log factor, when $k_0$ and $\phi$ have $C^3$ regularity and satisfy other technical conditions. This is obtained when $\epsilon \sim N^{-2/(d+6)}$ and $k \sim N^{6/(d+6)}$, both at the optimal order to balance the theoretical bias and variance errors. Our improved convergence rate is based on a refined analysis of the $k$NN estimator, which can be of independent interest. We validate our theory by numerical experiments on simulated data.

SMILE-UHURA Challenge -- Small Vessel Segmentation at Mesoscopic Scale from Ultra-High Resolution 7T Magnetic Resonance Angiograms

Thu, 21 May 2026 00:00:00 -0400

arXiv:2411.09593v3 Announce Type: replace-cross Abstract: The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 $\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $\pm$ 0.15.

Bayesian Optimization by Kernel Regression and Density-based Exploration

Tansheng Zhu, Hongyu Zhou, Ke Jin, Xusheng Xu, Qiufan Yuan, Lijie Ji — Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.06178v5 Announce Type: replace-cross Abstract: Bayesian optimization is highly effective for optimizing expensive-to-evaluate black-box functions, but it faces significant computational challenges due to the cubic per-iteration cost of Gaussian processes, which results in a total time complexity that is quartic with respect to the number of iterations. To address this limitation, we propose a novel algorithm, Bayesian optimization by kernel regression and density-based exploration (BOKE). BOKE uses kernel regression for efficient function approximation, kernel density for exploration, and integrates them into the confidence bound criteria to guide the optimization process, thus reducing computational costs to quadratic. Our theoretical analysis rigorously establishes the global convergence of BOKE under noisy evaluations. Through extensive numerical experiments on both synthetic and real-world optimization tasks, we demonstrate that BOKE not only performs competitively compared to Gaussian process-based methods and several other baseline methods but also exhibits superior computational efficiency. These results highlight BOKE's effectiveness in resource-constrained environments, providing a practical approach for optimization problems in engineering applications.

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang, Yuhang Wu, Kaizheng Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2502.17773v5 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.

Batched Single-Index Global Multi-Armed Bandits with Covariates

Sakshi Arya, Hyebin Song — Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.00565v3 Announce Type: replace-cross Abstract: The multi-armed bandits (MAB) framework is a widely used approach for sequential decision-making, where a decision-maker selects an arm in each round with the goal of maximizing long-term rewards. In many practical applications, such as personalized medicine and recommendation systems, contextual information is available at the time of decision-making, rewards from different arms are related rather than independent, and feedback is provided in batches. We propose a novel semi-parametric framework for batched bandits with covariates that incorporates a shared parameter across arms. We leverage the single-index regression (SIR) model to capture relationships between arm rewards while balancing interpretability and flexibility. Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy with a dynamic binning mechanism guided by the single-index direction. We consider two settings: one where a pilot direction is available and another where the direction is estimated from data, deriving theoretical regret bounds for both cases. When a pilot direction is available with sufficient accuracy and the number of arms $K$ is fixed, our approach achieves minimax-optimal rates (with $d = 1$) for nonparametric batched bandits, circumventing the curse of dimensionality. Extensive experiments on simulated and real-world datasets demonstrate the effectiveness of our algorithm compared to the nonparametric batched bandit method introduced by \cite{jiang2025batched}.

FLUME-FNO: data-efficient and scalable prediction of 3D wind and temperature fields in unseen urban morphologies

Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.19708v2 Announce Type: replace-cross Abstract: Urban microclimate, encompassing wind and temperature fields shaped by building geometry, significantly impacts energy consumption, pedestrian winds, pollutant dispersion, urban heat island, and public health. Accurately predicting microclimate is crucial yet challenging. Conventional Computational Fluid Dynamics (CFD) is computationally prohibitive for rapid assessments, while many deep learning approaches require extensive training data and struggle with generalization in unseen configurations. We present the Fast Localized Urban Microclimate Emulation Fourier Neural Operator (FLUME-FNO), a data-efficient and scalable framework for rapid prediction of 3D wind and temperature fields based solely on building geometry. FLUME-FNO assumes the local urban microclimate is primarily governed by surrounding geometry directly visible from a specific location. To encode this, the framework introduces a novel Multi-Directional Distance Feature (MDDF), representing visible open-space structures by measuring directional distances to surrounding buildings. By computing MDDF over the full domain and cropping encoded geometric features into smaller 3D patches, FLUME-FNO effectively augments limited CFD data, enabling robust learning from just 23 CFD simulations. The model achieves mean absolute errors of 0.2 m/s for wind speed and 0.19 {\deg}C for temperature on unseen configurations. Addressing the need for trustworthy fast microclimate prediction, the framework is further assessed using a deep ensemble as a practical proxy for FLUME-FNO uncertainty, ranging from 3% to 40% depending on location. The UQ framework demonstrates FLUME-FNO provides resilient, trustworthy predictions within acceptable accuracy thresholds for wind engineering and microclimate studies, highlighting its potential for real-world applications.

Bridging Language Models and Financial Analysis

Alejandro Lopez-Lira, Jihoon Kwon, Sangwoon Yoon, Jy-yong Sohn, Chanyeol Choi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2503.22693v2 Announce Type: replace-cross Abstract: The rapid advancements in Large Language Models (LLMs) have unlocked transformative possibilities in natural language processing, particularly within the financial sector. Financial data is often embedded in intricate relationships across textual content, numerical tables, and visual charts, posing challenges that traditional methods struggle to address effectively. However, the emergence of LLMs offers new pathways for processing and analyzing this multifaceted data with increased efficiency and insight. Despite the fast pace of innovation in LLM research, there remains a significant gap in their practical adoption within the finance industry, where cautious integration and long-term validation are prioritized. This disparity has led to a slower implementation of emerging LLM techniques, despite their immense potential in financial applications. As a result, many of the latest advancements in LLM technology remain underexplored or not fully utilized in this domain. This survey seeks to bridge this gap by providing a comprehensive overview of recent developments in LLM research and examining their applicability to the financial sector. Building on previous survey literature, we highlight several novel LLM methodologies, exploring their distinctive capabilities and their potential relevance to financial data analysis. By synthesizing insights from a broad range of studies, this paper aims to serve as a valuable resource for researchers and practitioners, offering direction on promising research avenues and outlining future opportunities for advancing LLM applications in finance.

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.08277v3 Announce Type: replace-cross Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or if they simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, and language-guided probing produces distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].

You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.09521v2 Announce Type: replace-cross Abstract: Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenge the reliance on global EER for privacy evaluations.

Control and optimization for Neural Partial Differential Equations in Supervised Learning

Alain Bensoussan, Minh-Binh Tran, Bangjie Wang — Thu, 21 May 2026 00:00:00 -0400

arXiv:2506.20764v2 Announce Type: replace-cross Abstract: Although there is a substantial body of literature on control and optimization problems for parabolic and hyperbolic systems, the specific problem of controlling and optimizing the coefficients of the associated operators within such systems has not yet been thoroughly explored. In this work, we aim to initiate a line of research in control theory focused on optimizing and controlling the coefficients of these operators-a problem that naturally arises in the context of neural networks and supervised learning. In supervised learning, the primary objective is to transport initial data toward target data through the layers of a neural network. We propose a novel perspective: neural networks can be interpreted as partial differential equations (PDEs). From this viewpoint, the control problem traditionally studied in the context of ordinary differential equations (ODEs) is reformulated as a control problem for PDEs, specifically targeting the optimization and control of coefficients in parabolic and hyperbolic operators. To the best of our knowledge, this specific problem has not yet been systematically addressed in the control theory of PDEs. To this end, we propose a dual system formulation for the control and optimization problem associated with parabolic PDEs, laying the groundwork for the development of efficient numerical schemes in future research. We also provide a theoretical proof showing that the control and optimization problem for parabolic PDEs admits minimizers. Finally, we investigate the control problem associated with hyperbolic PDEs and prove the existence of solutions for a corresponding approximated control problem.

Gradient Scalability and Taylor Surrogation of Quantum Cost Landscapes

Sabri Meyer, Francesco Scala, Francesco Tacchino, Aurelien Lucchi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.06344v3 Announce Type: replace-cross Abstract: Variational Quantum Algorithms are promising candidates for near-term quantum computing, yet they face scalability challenges due to barren plateaus, where gradients vanish exponentially relative to system size. Recent conjectures suggest that avoiding these plateaus might inherently lead to classical simulability, thereby limiting the opportunities for quantum advantage. In this work, we advance the theoretical understanding of the relationship between gradient scalability at initialization and the computational complexity of variational quantum algorithms. We first present the Taylor surrogate, a classical simulation technique that matches Pauli path runtime guarantees on near-Clifford regions while offering runtime advantages in specific regimes. Leveraging this surrogate, we prove that beyond previously established classically simulable regions, the computational complexity is at least super-polynomial. Next, we introduce the Linear Clifford Encoder, a classically efficient ansatz modifier that ensures constant-scaling gradients within landscape regions close to Clifford circuits. Finally, numerical experiments on these modified landscapes provide preliminary empirical evidence of a transition zone where constant-scaling gradients may decay polynomially in super-polynomially complex regions rather than exponentially. These findings suggest speculative instances where non-vanishing gradients and super-polynomial complexity could potentially coexist, vindicating the need for future formal proofs.

Machine-Learned Force Fields for Lattice Dynamics at Coupled-Cluster Level Accuracy

Thu, 21 May 2026 00:00:00 -0400

arXiv:2507.06929v2 Announce Type: replace-cross Abstract: We investigate Machine-Learned Force Fields (MLFFs) trained on approximate Density Functional Theory (DFT) and Coupled Cluster (CC) level potential energy surfaces for the carbon diamond and lithium hydride solids. We assess the accuracy and precision of the MLFFs by calculating phonon dispersions and vibrational densities of states (VDOS) that are compared to experiment and reference ab initio results. To overcome limitations from long-range effects and the lack of atomic forces in the CC training data, a delta-learning approach based on the difference between CC and DFT results, as well as a charge aware MLFF approach is explored. Compared to DFT, MLFFs trained on CC theory yield higher vibrational frequencies for optical modes, agreeing better with experiment. Furthermore, the MLFFs are used to estimate anharmonic effects on the VDOS of lithium hydride at the level of CC theory.

BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories under Spatio-Temporal Vector Fields

Rui-Yang Zhang, Lachlan Astfalck, Edward Cripps, David S. Leslie, Henry B. Moss — Thu, 21 May 2026 00:00:00 -0400

arXiv:2509.26005v3 Announce Type: replace-cross Abstract: We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed spatio-temporal Gaussian process surrogate model. The majority of existing placement campaigns either follow standard `space-filling' designs or relatively ad-hoc expert opinions. A key challenge to applying principled active learning in this setting is that Lagrangian observers are continuously advected through the vector field, so they make measurements at different locations and times. It is, therefore, important to consider the likely future trajectories of placed observers to account for the utility of candidate placement locations. To this end, we present BALLAST: Bayesian Active Learning with Look-ahead Amendment for Sea-drifter Trajectories. We observe noticeable benefits of BALLAST-aided sequential observer placement strategies on both synthetic and high-fidelity ocean current models. In addition, we developed a novel GP inference method -- the Vanilla SPDE Exchange (VaSE) -- to boost the GP posterior sampling efficiency, which is also of independent interest.

Quantum reservoir computing in Jaynes-Cummings models: Nonlinear memory and time-series prediction

Sreetama Das, Gian Luca Giorgi, Roberta Zambrini — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.00171v2 Announce Type: replace-cross Abstract: We investigate quantum reservoir computing (QRC) using a hybrid qubit-boson system described by the Jaynes-Cummings (JC) Hamiltonian and its dispersive limit (DJC). These models provide high-dimensional Hilbert spaces and intrinsic nonlinear dynamics, making them powerful substrates for temporal information processing. We systematically benchmark both reservoirs through linear and nonlinear memory tasks, demonstrating that they exhibit an unusual superior nonlinear over linear memory capacity. We further test their predictive performance on the Mackey-Glass time series, a widely used benchmark for chaotic dynamics, and show comparable forecasting ability. We also investigate how memory and prediction accuracy vary with reservoir parameters, and show the role of higher-order bosonic observables and time multiplexing in enhancing expressivity, even in minimal spin-boson configurations. Our results establish JC- and DJC-based reservoirs as versatile platforms for time-series processing and as elementary units that overcome the setting of equivalent qubit pairs and offer pathways toward tunable, high-performance quantum machine learning architectures.

Efficient Matrix Product State Learning in Logarithmic Depth

Chia-Ying Lin, Nai-Hui Chia, Shih-Han Hung — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.07798v2 Announce Type: replace-cross Abstract: Learning the closest matrix product state (MPS) representation of a quantum state enables useful tools for quantum machine learning and analysis of complex quantum systems. In this work, we study the problem of learning MPS in the following setting: given many copies of an input MPS, the task is to recover a classical description of the state. The best known polynomial-time algorithm, introduced by [LCLP10, CPF+10], requires linear circuit depth and $\widetilde O(n^5)$ samples, and has seen no improvement in over a decade. These costs, neither known to be optimal, renders existing algorithms impractical for near-term quantum devices with limited resources. We introduce parallel disentangling algorithms for MPS learning. For exact MPS learning, our algorithm runs in polynomial time and uses circuit depth $O(\log n)$ and sample complexity $\widetilde O(n^3)$, improving both the depth and the dependence on the system size $n$. The key idea is to exploit the bounded-rank structure of reduced states on middle blocks of an MPS and organize the disentangling operations in a tree structure. We further extend the algorithm to closest MPS learning, improving the sample complexity dependence on $n$ from $n^9$ to $n^7$ and complement the algorithms with an $\Omega(n)$ product-state lower bound. We also investigate MPS learning under hardware constraints, including restricted measurements and geometric connectivity. Under the Learning Parity with Noise (LPN) assumption, we show computational hardness for learning an MPS(2) family with non-adaptive single-qubit measurements. Finally, we show that our algorithm can be implemented with depth $O(q n^{1/q})$ on a $q$-dimensional hypercubic lattice, giving an asymptotic reduction in depth. Together, our work provides a complete characterization of the quantum resources needed for efficient MPS learning.

Adversarial Robustness in One-Stage Learning-to-Defer

Yannis Montreuil, Letian Yu, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.10988v2 Announce Type: replace-cross Abstract: Learning-to-Defer (L2D) enables hybrid decision-making by routing inputs either to a predictor or to external experts. While promising, L2D is highly vulnerable to adversarial perturbations, which can not only flip predictions but also manipulate deferral decisions. Prior robustness analyses focus solely on two-stage settings, leaving open the end-to-end (one-stage) case where predictor and allocation are trained jointly. We introduce the first framework for adversarial robustness in one-stage L2D, covering both classification and regression. Our approach formalizes attacks, proposes cost-sensitive adversarial surrogate losses, and establishes theoretical guarantees including $\mathcal{H}$, $(\mathcal{R }, \mathcal{F})$, and Bayes consistency. Experiments on benchmark datasets confirm that our methods improve robustness against untargeted and targeted attacks while preserving clean performance.

ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

Thu, 21 May 2026 00:00:00 -0400

arXiv:2510.15949v5 Announce Type: replace-cross Abstract: Large language models show promise for financial decision-making, yet deploying them as autonomous trading agents raises fundamental challenges: how to adapt instructions when rewards arrive late and obscured by market noise, how to synthesize heterogeneous information streams into coherent decisions, and how to bridge the gap between model outputs and executable market actions. We present ATLAS (Adaptive Trading with LLM AgentS), a unified multi-agent framework that integrates structured information from markets, news, and corporate fundamentals to support robust trading decisions. Within ATLAS, the central trading agent operates in an order-aware action space, ensuring that outputs correspond to executable market orders rather than abstract signals. The agent can incorporate feedback while trading using Adaptive-OPRO, a novel prompt-optimization technique that dynamically adapts the prompt by incorporating real-time, stochastic feedback, leading to increasing performance over time. Across regime-specific equity studies and multiple LLM families, Adaptive-OPRO consistently outperforms fixed prompts, while reflection-based feedback fails to provide systematic gains.

Z-Dip: a standardized measure for data modality assessment

Edoardo Di Martino, Matteo Cinelli, Roy Cerqueti — Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.01705v2 Announce Type: replace-cross Abstract: Detecting multimodality in empirical distributions is a fundamental problem in statistics and data analysis, with applications ranging from clustering to the study of complex systems. In practice, however, assessing departures from unimodality in a consistent and comparable way remains challenging. Widely used methods such as Hartigan and Hartigan's Dip Test illustrate these difficulties, as the interpretation of their statistics depends strongly on sample size, requires calibration to determine significance, and, for large samples, exhibit increasing sensitivity, leading to rejection of unimodality for arbitrarily small deviations from the null. We introduce Z-Dip, a standardized measure of multimodality that addresses these limitations. By treating the Dip statistic as a random variable under the null hypothesis of unimodality and standardizing its observed value, the proposed approach yields scores that are directly comparable across datasets of different sizes. Using simulation-based calibration, we derive a universal decision threshold that closely reproduces classical Dip Test decisions without requiring sample-size-specific adjustments. Extensive validation on simulated data and on more than 88,000 empirical opinion distributions shows near-perfect agreement with the classical Dip Test while providing a more interpretable and comparable measure of modality. Finally, we propose a downsampling-based correction that mitigates residual sensitivity in extremely large samples. Open-source software and reference tables are provided to facilitate practical adoption.

Universal Quantum Computer Simulation of 50 Qubits on Europe`s First Exascale Supercomputer Harnessing Its Heterogeneous CPU-GPU Architecture

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.03359v3 Announce Type: replace-cross Abstract: We have developed a new version of the high-performance J\"ulich universal quantum computer simulator (JUQCS-50) that leverages key features of the GH200 superchips as used in the JUPITER supercomputer, enabling simulations of a 50-qubit universal quantum computer for the first time. JUQCS-50 achieves this through three key innovations: (1) extending usable memory beyond GPU limits via high-bandwidth CPU-GPU interconnects and LPDDR5 memory; (2) adaptive data encoding to reduce memory footprint with acceptable trade-offs in precision and compute effort; and (3) an on-the-fly network traffic optimizer. These advances result in a 16.6-fold speedup over the previous 48-qubit record on the K computer

Maxitive Donsker-Varadhan Formulation for Possibilistic Variational Inference

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.21223v2 Announce Type: replace-cross Abstract: Variational inference (VI) is a cornerstone of modern Bayesian learning, enabling approximate inference in complex models. However, its formulation depends on expectations and divergences defined through high-dimensional integrals, often rendering analytical treatment impossible and necessitating heavy reliance on approximations. Possibility theory, an imprecise probability framework, allows us to directly model epistemic uncertainty instead of relying on a subjective interpretation of probabilities. While this framework provides robustness and interpretability under sparse or imprecise information, adapting VI to the possibilistic setting requires rethinking core concepts such as divergences, which presuppose additivity. In this work, we develop a principled formulation for performing possibilistic VI by establishing a maxitive analogue of the classical Donsker-Varadhan formulation. The resulting framework enables us to derive a learning rule for possibilistic VI with exponential-family candidates and practical update rules for neural-network training, giving rise to a family of optimizers termed CBOpt. Finally, we demonstrate that CBOpt achieves competitive performance on both in-domain and out-of-domain image classification tasks.

Cosserat micropolar and couple-stress elasticity models of flexomagnetism at finite deformations

Thu, 21 May 2026 00:00:00 -0400

arXiv:2511.22756v3 Announce Type: replace-cross Abstract: We propose geometrically nonlinear (finite) continuum models of flexomagnetism based on the Cosserat micropolar and its descendent couple-stress theory. These models introduce the magneto-mechanical interaction by coupling the micro-dislocation tensor of the micropolar model with the magnetisation vector using a Lifshitz invariant. In contrast to conventional formulations that couple strain-gradients to the magnetisation using fourth-order tensors, our approach relies on third-order tensor couplings by virtue of the micro-dislocation being a second-order tensor. Consequently, the models permit centrosymmetric materials with a single new flexomagnetic constant, and more generally allow cubic-symmetric materials with two such constants. We postulate the flexomagnetic action-functionals and derive the corresponding governing equations using both scalar and vectorial magnetic potential formulations, and present numerical results for a nano-beam geometry, confirming the physical plausibility and computational feasibility of the models.

Cluster-Based Generalized Additive Models Informed by Random Fourier Features

Xin Huang, Jia Li, Jun Yu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2512.19373v3 Announce Type: replace-cross Abstract: In developing data-driven modeling methodologies, there is an ongoing need to reconcile the strong predictive performance of opaque black-box models with the transparency required for critical applications. This work introduces an interpretable and computationally tractable regression framework for heterogeneous data by combining response-informed spectral representation learning with localized additive modeling. The method first fits a random Fourier feature regression model and constructs a spectral feature map from the learned amplitudes and adaptively resampled frequencies, so that the representation reflects predictive variation in the data. This representation is then compressed by principal component analysis to obtain a low-dimensional latent embedding, in which a Gaussian mixture model performs soft regime discovery. Within each regime, a cluster-specific generalized additive model captures nonlinear covariate effects through interpretable spline-based univariate smooth functions. The final predictor is formed as a soft mixture of these local additive models, enabling flexible modeling of a nonlinear, heterogeneous structure while preserving interpretability. Numerical experiments across several benchmark regression datasets show that the proposed method consistently improves upon classical globally interpretable baselines while remaining competitive with more flexible black-box models. Overall, the framework provides a unified approach to heterogeneous regression that combines predictive adaptivity with interpretable local covariate effects.

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review

Gustav Olaf Yunus Laitinen-Fredriksson Lundstr\"om-Imanov, \"Ozkan G\"unalp — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.00990v2 Announce Type: replace-cross Abstract: Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.

DNACHUNKER: Learnable Tokenization for DNA Language Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.03019v4 Announce Type: replace-cross Abstract: DNA language models are increasingly used to represent genomic sequence, yet their effectiveness depends critically on how raw nucleotides are converted into model inputs. Unlike natural language, DNA offers no canonical boundaries, making fixed tokenizations a brittle design choice under shifts, indels, and local repeats. We introduce DNAChunker, a masked DNA language model that incorporates a learnable adaptive segmentation module to produce context-dependent, variable-length units. Building on a dynamic segmentation procedure, DNAChunker learns to allocate finer granularity to functionally enriched regions while compressing repetitive or redundant sequence. We pretrain DNAChunker on the human reference genome and evaluate it across five benchmarks, where it consistently improves over strong fixed-tokenization baselines. Further analyses and ablations indicate that unlike fixed tokenizations, segmentation is learned in a biologically-informed, mutation-resilient manner.

Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng, Beilong Tang, Wang Xiang, Ming Li — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.06006v2 Announce Type: replace-cross Abstract: Target speaker extraction (TSE) aims to recover the speech of a desired speaker from a mixture given a short enrollment utterance, while speech enhancement (SE) focuses on improving speech quality under noisy conditions. Most existing TSE and SE systems are based on discriminative modeling and have shown strong interference suppression ability, but they often remain limited in perceptual quality and naturalness. To address this issue, we first introduce LauraTSE, a generative TSE model built on an autoregressive decoder-only language model. Although generative modeling is promising for quality enhancement, purely generative TSE may suffer from hallucination, content drift, and limited controllability in complex acoustic conditions. We therefore propose a discriminative-generative two-stage framework, where a discriminative front-end first produces target-related representations with strong interference suppression, and a generative back-end then reconstructs high-quality speech in the neural audio codec representation space. This design combines the controllability of discriminative extraction with the reconstruction capability of generative modeling. We further investigate several collaboration strategies for the two-stage framework, including front-end freezing, joint fine-tuning, SI-SDR regularization, and autoregressive/non-autoregressive inference. Experimental results on both TSE and SE benchmarks show that the proposed framework achieves a better balance among perceptual quality, intelligibility, and speaker consistency than purely discriminative or purely generative baselines.

Surface Dean--Kawasaki equations

John Bell, Ana Djurdjevac, Nicolas Perkowski — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.06863v2 Announce Type: replace-cross Abstract: We consider stochastic particle dynamics on hypersurfaces represented in Monge gauge parametrization. Starting from the underlying Langevin system, we derive the surface Dean-Kawasaki (DK) equation and formulate it in the martingale sense. The resulting SPDE explicitly reflects the geometry of the hypersurface through the induced metric and its differential operators. Our framework accommodates both pairwise interactions and environmental potentials, and we extend the analysis to evolving hypersurfaces driven by an SDE that interacts with the particles, yielding the corresponding surface DK equation for the coupled surface-particle system. We establish a weak uniqueness result in the non-interacting case, and we develop a finite-volume discretization preserving the fluctuation-dissipation relation. Numerical experiments illustrate equilibrium properties and dynamical behavior influenced by surface geometry and external potentials.

Approximate FKG inequalities for phase-bound spin systems, with applications to central limit theorems for exponential random graphs

Satyaki Mukherjee, Vilas Winstein — Thu, 21 May 2026 00:00:00 -0400

arXiv:2601.07169v2 Announce Type: replace-cross Abstract: The Fortuin-Kasteleyn-Ginibre (FKG) inequality is an invaluable tool in monotone spin systems satisfying the FKG lattice condition, which provides positive correlations for all coordinate-wise increasing functions of spins. This inequality has numerous applications and plays an integral role in the proof of various central limit theorems (CLTs), including recent work on ferromagnetic exponential random graph models (ERGMs) wherein a Hamiltonian tilt promotes the presence of small subgraphs like triangles. However, the FKG lattice condition fails to hold when confining a spin system to a particular phase in the low-temperature regime of parameters. Thus it is not a priori clear if each phase internally has positive correlations for increasing functions, or if the positive correlations in the overall model (which is a mixture of phases) arise primarily from the global choice of phase. In this article, we show that the individual phases in ERGMs do indeed satisfy an approximate form of the FKG inequality internally. We use this to finish the proof of various CLTs within each individual phase in the phase-coexistence regime, answering a question posed by Bianchi, Collet, and Magnanini. We present the FKG inequality for ERGMs as a consequence of a more general result which holds under certain inputs related to metastable mixing; we expect this general result to be widely applicable, and we devote a section to spelling out the details of its application to a class of generalized higher-order ferromagnetic Curie-Weiss models where the necessary inputs are relatively transparent.

AFD-INSTRUCTION: A Comprehensive Antibody Instruction Dataset with Functional Annotations for LLM-Based Understanding and Design

Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.04916v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have significantly advanced protein representation learning. However, their capacity to interpret and design antibodies through natural language remains limited. To address this challenge, we present AFD-Instruction, the first large-scale instruction dataset with functional annotations tailored to antibodies. This dataset encompasses two key components: antibody understanding, which infers functional attributes directly from sequences, and antibody design, which enables de novo sequence generation under functional constraints. These components provide explicit sequence-function alignment and support antibody design guided by natural language instructions. Extensive instruction-tuning experiments on general-purpose LLMs demonstrate that AFD-Instruction consistently improves performance across diverse antibody-related tasks. By linking antibody sequences with textual descriptions of function, AFD-Instruction establishes a new foundation for advancing antibody modeling and accelerating therapeutic discovery.

Variational Optimality of F\"ollmer Processes in Generative Diffusions

Yifan Chen, Eric Vanden-Eijnden — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.10989v3 Announce Type: replace-cross Abstract: We construct and analyze generative diffusions that transport a point mass to a prescribed target distribution over a finite time horizon using the stochastic interpolant framework. The drift is expressed as a conditional expectation that can be estimated from independent samples without simulating stochastic processes. We show that the diffusion coefficient can be tuned \emph{a~posteriori} without changing the time-marginal distributions. Among all such tunings, we prove that minimizing the impact of estimation error on the path-space Kullback--Leibler divergence selects, in closed form, a F\"ollmer process -- a diffusion whose path measure minimizes relative entropy with respect to a reference process determined by the interpolation schedules alone. This yields a new variational characterization of F\"ollmer processes, complementing classical formulations via Schr\"odinger bridges and stochastic control, and provides a conditional-expectation representation of the F\"ollmer drift that enables simulation-free estimation from data. We further establish that, under this optimal diffusion coefficient, the path-space Kullback--Leibler divergence becomes independent of the interpolation schedule, rendering different schedules statistically equivalent in this variational sense. We provide numerical experiments to illustrate the impact of path-space variational optimality of F\"ollmer's processes in probabilistic forecasting and data assimilation applications.

Multi-Channel Replay Speech Detection using Acoustic Maps

Michael Neri, Tuomas Virtanen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2602.16399v2 Announce Type: replace-cross Abstract: Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

Learning-to-Defer with Expert-Conditional Advice

Yannis Montreuil, Le\"ina Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.14324v3 Announce Type: replace-cross Abstract: Learning-to-Defer routes each input to the expert that minimizes expected cost, but it assumes that the information available to every expert is fixed at decision time. Many modern systems violate this assumption: after selecting an expert, one may also choose what additional information that expert should receive, such as retrieved documents, tool outputs, or escalation context. We study this problem and call it Learning-to-Defer with advice. We show that a broad family of natural separated surrogates, which learn routing and advice with distinct heads, is inconsistent even in the smallest non-trivial setting. We then introduce an augmented surrogate that operates on the composite expert--advice action space and prove an $\mathcal{H}$-consistency guarantee together with an excess-risk transfer bound, yielding recovery of the Bayes-optimal policy in the limit. Experiments on tabular, language, and multi-modal tasks show that the resulting method improves over standard Learning-to-Defer while adapting its advice-acquisition behavior to the cost regime; a synthetic benchmark confirms the failure mode predicted for separated surrogates.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.17837v4 Announce Type: replace-cross Abstract: During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

CRANE: Correcting Errors in Raw Nanopore Signals Using Hidden Markov Models

Thu, 21 May 2026 00:00:00 -0400

arXiv:2603.20420v2 Announce Type: replace-cross Abstract: Nanopore sequencing can read substantially longer sequences of nucleic acid molecules, called reads, than other sequencing methods, which has led to advances in genomic analysis such as the gapless human genome assembly. By analyzing the raw electrical signal reads that nanopore sequencing generates from molecules, existing works can map these reads without translating them into DNA characters (i.e., basecalling), allowing for quick and efficient analysis of sequencing data. However, raw signals often contain errors due to noise and processing errors, which limits the overall accuracy of raw signal analysis. Our goal in this work is to detect and correct errors in raw signals to improve the accuracy of raw signal analyses. To this end, we propose CRANE, a mechanism that trains and utilizes a Hidden Markov Model (HMM) to accurately correct signal errors. Our extensive evaluation on various datasets shows that CRANE 1) consistently improves the overall accuracy of the underlying raw signal analysis tools, 2) minimizes the burden of optimizing analysis pipelines for newer nanopore technologies, and 3) does not introduce substantial computational overhead. We conclude that CRANE provides an effective mechanism to systematically identify and correct the errors in raw nanopore signals before further analysis, which can enable the development of a new class of error correction mechanisms purely designed for raw nanopore signals. Source Code: CRANE is available at https://github.com/STORMgroup/CRANE. We also provide the scripts to fully reproduce our results on our GitHub page

Beyond Augmented-Action Surrogates for Multi-Expert Learning-to-Defer

Yannis Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.09414v3 Announce Type: replace-cross Abstract: A learning-to-defer (L2D) system decides, for each input, whether to predict on its own or to hand it to one of several available experts. The very well established recipe trains classifier and router jointly by treating the $K$ classes and $J$ experts as competing actions in one shared $(K{+}J)$-action geometry. Subsequent work has proposed a series of incremental fixes within this geometry; we show that each still suffers, to varying severity, from an optimization-level pathology (target distortion, gradient amplification, winner-take-all starvation, set-mass collapse, or class--expert coupling) even under statistical consistency. We step outside the augmented-action family entirely and propose a decoupled surrogate: a softmax classifier head and an independent sigmoid head per expert, mirroring the two natural objects of the problem. We show that per-sample updates are then coordinatewise and the class--expert Hessian block is identically zero, and prove an excess-risk bound with calibration constant $\max\{2\sqrt{2},\sqrt{2J/\lambda}\}$ -- to our knowledge the first multi-expert L2D guarantee whose constant does not grow with the expert pool when the per-expert weight is held fixed. On controlled synthetic studies and on CIFAR-10, CIFAR-10H, and Covertype, it is the only method in our comparison that remains stable as the expert pool grows, preserves rare specialists, and improves over a standalone classifier on every real-data benchmark.

Multi-scale Dynamic Wake Modeling and Prediction of Floating Offshore Wind Turbines via Physics-Informed Neural Networks and Fourier Neural Operators

Guodan Dong, Jianhua Qin, Chang Xu — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.23937v2 Announce Type: replace-cross Abstract: Multi-scale dynamic wake modeling and prediction are essential for the real-time control and optimization of floating offshore wind turbines (FOWTs). In this study, wakes of FOWTs under coupled surge and pitch motions across a range of Strouhal numbers (St), which can induce wake meandering, are modeled via two novel deep-learning frameworks: physics-informed neural networks (PINNs) and Fourier neural operators (FNOs). The high-fidelity dataset is obtained from large-eddy simulations with the actuator line model (LES-AL). The results demonstrate that the dominant large-scale dynamic structures, such as meandering, can be well modeled by both frameworks; however, FNOs exhibit significant advantages over the PINN model in terms of efficiency (8-fold computational speedup and 40-fold faster convergence), long-term predictive capability, and multi-scale coherent structural fidelity. Furthermore, the wakes predicted by the PINN model exhibit a smoothing effect that limits the resolution of high-frequency coherent structures and underestimates turbulent fluctuations in both the wake center and half-width. Spectral analysis reveals that FNOs resolve the primary meandering frequency (where Stp denotes the frequency induced by the coupled surge and pitch motions), its corresponding higher-order harmonics (2Stp, 3Stp), and the energy cascade. In contrast, the energy cascade in the PINN predictions dissipates more rapidly in the high-frequency regime (St > 1.0). Additionally, the pre-multiplied power spectral density indicates that the energy contained in meandering and the corresponding harmonic frequencies modeled by PINNs is relatively low compared to that in CFD and FNOs. These findings suggest that FNOs are promising for the high-fidelity, real-time modeling of FOWT wakes.

Sliced-Regularized Optimal Transport

Khai Nguyen — Thu, 21 May 2026 00:00:00 -0400

arXiv:2604.23944v3 Announce Type: replace-cross Abstract: We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a smoothened sliced OT (SOT) plan. To the best of our knowledge, SROT is the first approach to leverage a version of SOT plan as a reference to improve classical OT. We provide a formal definition of SROT, derive its dual formulation, and provide a post-Bayesian interpretation of SROT. We then develop a Sinkhorn-style algorithm for efficient computation, retaining the same scalability advantages as EOT. By incorporating a scalable SOT plan as a prior, SROT yields more accurate approximations of the exact OT plan than EOT under the same level of regularization. Moreover, the resulting transport plan improves upon the reference SOT plan itself. We further introduce the corresponding OT divergence induced by SROT, named SROT divergence, and analyze its topological and computational properties. Finally, we validate our approach through experiments on synthetic datasets and color transfer tasks, demonstrating that SROT is better than both EOT and SOT in approximating exact OT. Additional experiments on gradient flows further highlight the advantages of SROT divergence.

Online Learning-to-Defer with Varying Experts

Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12340v2 Announce Type: replace-cross Abstract: Learning-to-Defer (L2D) methods route each query either to a predictive model or to external experts. While existing work studies this problem in batch settings, real-world deployments require handling streaming data, changing expert availability, and shifting expert distribution. We introduce the first online L2D algorithm for multiclass classification with bandit feedback and a dynamically varying pool of experts. Our method achieves regret guarantees of $O((n+n_e)T^{2/3})$ in general and $O((n+n_e)\sqrt{T})$ under a low-noise condition, where $T$ is the time horizon, $n$ is the number of labels, and $n_e$ is the number of distinct experts observed across rounds. The analysis builds on novel $\mathcal{H}$-consistency bounds for the online framework, combined with first-order methods for online convex optimization. Experiments on synthetic and real-world datasets demonstrate that our approach effectively extends standard Learning-to-Defer to settings with varying expert availability and reliability.

The critical slowing down in diffusion models

Luca Maria Del Bono, Giulio Biroli, Patrick Charbonneau, Marylou Gabri\'e — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.12597v2 Announce Type: replace-cross Abstract: Computational sampling has been central to the sciences since the mid-20th century. While machine-learning-based approaches have recently enabled major advances, their behavior remains poorly understood, with limited theoretical control over when and why they succeed. Here we provide such insight for diffusion models-a class of generative schemes highly effective in practice-by analyzing their application to the $O(n)$ model of statistical field theory in the Gaussian limit $n \to \infty$. In this analytically tractable setting, we show that training a score model with a one-layer network architecture matching the exact solution exhibits a form of critical slowing down in parameter learning. This slowing down also impacts the generation process, indicating that the well-known difficulties of sampling near criticality persist even for learned generative models. To overcome this bottleneck, we demonstrate the power of combining architectural depth with physical locality. We find that using a two-layer architecture drastically reduces the critical slowing down, with the training time scaling logarithmically rather than quadratically with system size. By introducing a local score approximation we show that this acceleration in training time can be achieved without increasing the number of neural network parameters. Taken together, these results demonstrate that diffusion models can overcome the critical slowing down through appropriate architectural design, and establish a controlled framework for understanding and improving learned sampling methods in statistical physics and beyond.

Do Better Volatility Forecasts Lead to Better Portfolios? Evidence from Graph Neural Networks

Rylan Wade — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19278v2 Announce Type: replace-cross Abstract: This paper tests whether graph neural networks improve realized volatility forecasts and whether those forecasts improve portfolio performance. Using weekly realized volatility for 465 S&P 500 equities from 2015-2025, Heterogeneous Autoregressive and Long Short-Term Memory baselines are compared against GraphSAGE models built on rolling correlation, sector, and Granger-causal graphs, with and without macro regime features. The empirical finding is that the model with the lowest forecast MSE, the model with the highest cross-sectional ranking accuracy, and the model with the highest portfolio Sharpe ratio are three different models. Forecast accuracy, ranking quality, and portfolio performance are related but not interchangeable objectives. Graph volatility models add value only when the portfolio rule can exploit the cross-sectional structure they encode.

Data-driven approximation of regions of attraction via an LP-based selection of PWA Lyapunov functions

Oumayma Khattabi, Matteo Tacchi-B\'enard, Martin Gulan, Sorin Olaru — Thu, 21 May 2026 00:00:00 -0400

arXiv:2605.19961v2 Announce Type: replace-cross Abstract: This paper presents a method to approximate regions of attraction of unknown nonlinear dynamical systems from data. Assuming point-wise evaluations of the vector field and known Lipschitz bounds, a polyhedral uncertainty set of admissible dynamics is constructed. This uncertainty description enables the synthesis of a continuous piece-wise affine Lyapunov candidate via a linear program, enforcing a robust decrease condition for all admissible vector fields. The approach allows certification of a region of attraction consistent with the available data. Numerical examples illustrate the effectiveness of the proposed method in extracting certified regions of attraction from sparse data.