cs updates on arXiv.org

Byzantine Cheap Talk: Adversarial Resilience and Topology Effects in LLM Coordination Games

Aya El Mir, Martin Tak\'a\v{c}, Salem Lahlou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07790v1 Announce Type: new Abstract: Multi-agent LLM systems increasingly rely on communication protocols for coordination, yet their robustness under adversarial and structural constraints remains poorly understood. Building on prior work showing that cheap-talk channels enable cooperation in LLM coordination games, we investigate two vulnerability classes in a 4-player Stag Hunt across six model families and 720 trials. First, when Byzantine agents signal cooperation but defect, non-Byzantine agents detect the betrayal within one round yet fail to adapt collectively: a substantial fraction continue cooperating despite repeated exploitation, unable to recover coordination due to the game's unanimity payoff structure. Second, explicitly restricting communication topology collapses cooperation, while applying identical restrictions silently preserves near-perfect cooperation. This establishes that coordination failure stems from agents' meta-reasoning about hidden information, not information loss itself. We identify two stable behavioral archetypes that replicate across all model cohorts: Defection-Prone models that switch permanently after betrayal, and Cooperation-Persistent models that continue cooperating at significant individual cost. These findings reveal concrete security vulnerabilities: communication channels can be exploited as adversarial injection vectors, and disclosing network topology to agents can degrade coordination even without any adversary present.

Frequency-Scale Saliency for Spectral Descriptor Analysis in 3D Shape Retrieval

Jianru Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07791v1 Announce Type: new Abstract: Classical spectral descriptors such as the Heat Kernel Signature and Wave Kernel Signature are widely used for non-rigid 3D shape retrieval, yet their failure modes remain poorly understood. We present a frequency-scale saliency framework that audits these descriptors by quantifying the retrieval-level contribution of each descriptor scale interval through ablation. We introduce class spectral fingerprints to characterize category-level scale dependence, and show that descriptor similarity between class pairs is substantially correlated with retrieval failure, with a Spearman correlation of 0.479. Experiments on SHREC'11 demonstrate that short scales dominate retrieval performance while long scales are harmful, that HKS and WKS exhibit distinct scale dependence patterns, and that saliency-weighted retrieval improves mAP on hard categories by 0.156, with cross-fold and random-weight controls confirming that the gain is stable and not due to arbitrary reweighting.

MOLOT System Card: Malicious Operational Logic Observation Transformer

Daniil Lopatkin, Maksim Mitrofanov, Stanislav Rakovsky, Aleksandr Khalikov — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07792v1 Announce Type: new Abstract: MOLOT (Malicious Operational Logic Observation Transformer) is a static malicious-code detection system designed for SAST setup where package metadata, maintainer history, and dynamic execution traces may be unavailable or unreliable. The system represents source code as behavior sequences derived from static call graphs, includes an explanation stage that ranks suspicious behavior activities and maps them back to source-code locations. The approach is evaluated on Python and JavaScript packages from PyPI and npm, compared with opensource detection tools, and validated under product constraints including runtime, memory use, and false-positive rates observed in a real moderation workflow. We also release Open Malicious-Code Bench, a public benchmark for reproducible evaluation of malicious-package detection methods. The results show that static behavior-sequence modeling can provide accurate, explainable, and deployable malicious-code detection for modern DevSecOps workflows.

The Choreography of Augmented Reality Timelines: Studying the Relative Position, Chronology, & Situatedness of Event Sequences

Isabelle Kwan, Jessica Ziyu Chen, Matthew Brehmer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07794v1 Announce Type: new Abstract: Timelines are effective ways to tell historical and personal stories. However, most timeline visualization tools impose an inflexible model of time prioritizing chronological clarity. On the other hand, unconstrained representations can better capture the irregular and contextual nature of lived time, but often at the cost of interpretability. In this work, we explore this continuum with a study of how historical and personal timelines could manifest in physical spaces. We conducted a formative study (N=12) in which participants freely arranged events within a physical environment. We observed a diversity of strategies reflecting the personal and context-dependent nature of temporal mental models. We also invited participants to consider how others could move through their timelines. Our analysis led to a choreographic approach to timeline creation, as well as a proof-of-concept tablet-based augmented reality (AR) application that supports spatial timeline drawing and viewing. Finally, we reflect on the design implications of encoding chronology, pacing, and spatial context in immersive timeline stories.

The Role of Semirings in Incremental View Maintenance

Eden Chmielewski, Andrei Draghici, Dan Olteanu, Haozhe Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07795v1 Announce Type: new Abstract: We study the problem of incremental view maintenance (IVM) under inserts to $K$-databases, where $K$ is a commutative semiring without additive inverse. The key observation put forward in this paper is that the complexity of the IVM problem depends fundamentally on the underlying semiring. We introduce a class of conjunctive queries called $p$-hierarchical and show that for any $p$-hierarchical query with fractional hypertree width $\fhtw$ and any insert-only update sequence of length $N$ to an initially empty $K$-database over an arbitrary semiring $K$ without additive inverse, we can construct a data structure that can be updated in amortized $\bigO(N^{\fhtw-1})$ time and can support constant delay enumeration of the query result. In particular, the amortized update time for any $\alpha$-acyclic $p$-hierarchical query is constant. We also give conditional lower bounds showing that any conjunctive query without self-joins that is not $p$-hierarchical cannot be maintained with amortized constant update time and constant enumeration delay under inserts to $K$-databases. Here, $K$ can be the natural semiring and its generalizations to the provenance and covariance semirings or any idempotent and strictly ordered semiring such as the tropical semiring. When put together, our upper and lower bounds imply a dichotomy for the insert-only maintenance of conjunctive queries without self-joins and the aforementioned semirings: A query can be maintained with amortized constant update time and constant enumeration delay if and only if it is $\alpha$-acyclic $p$-hierarchical.

Belief-Space Quantum-Inspired Reinforcement Learning for Partially Observable Autonomous Cyber Defense in the Internet of Vehicles

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07796v1 Announce Type: new Abstract: The Internet of Vehicles (IoV) faces a dynamic, adversarial security environment where attackers adapt to defenses. Existing intrusion detection systems rely on static classifiers that fail to capture sequential decision-making, attacker adaptation, and uncertainty. We formulate IoV security as a sequential attacker-defender interaction and model defense as a reinforcement learning problem under partial observability. We propose Quantum Belief-Integrated Reinforcement Defense (Q-BIRD), using quantum-inspired belief representation to encode defender uncertainty about hidden attacker intent via amplitude-based states, enabling non-Bayesian belief evolution. Integrated into a Proximal Policy Optimization (PPO) defender, Q-BIRD selects cost-aware mitigation actions. In simulated environments with adaptive, probing attackers, Q-BIRD reduced cumulative mean damage, damage variance, and attack success rate (ASR) by 60.4%, 90.2%, and 50.0%, respectively, while increasing survival probability by 46.4%. Compared to classical Bayesian PPO, damage variance reduction and ASR improved by 10.2 times and 50%. Ablation and explainability analyses confirm that amplitude-based belief is the primary decision signal during strategy transitions when classical belief collapses, providing superior IoV security without additional hardware.

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07798v1 Announce Type: new Abstract: Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, in current research, quantifying predictive uncertainty remains underexplored and relies on costly modalities such as MRI, PET, and CSF, limiting their deployment in resource-limited settings. In this research, our primary objectives are: First, bidirectional prediction of cognitive scores from irregular visits to present the complete disease trajectory. Second, to enable interpolation and extrapolation capabilities to assist clinicians in informed prognostic decision making, and third, to provide a well-calibrated uncertainty estimate for all predictions, and finally, to achieve the objectives using the modalities available during routine visits. We propose a unified framework, GNOVA: A GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE Scores. The GRU encoder allows for any number of inputs at any time point. The Neural-ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The Variational autoencoder allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

Improving Multimodal Reasoning via Worst Dimension Optimization

Haocheng Lv, Huaping Zhang, Qiuchi Li, Lei Li, Chunxiao Gao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07801v1 Announce Type: new Abstract: Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logic consistency. However, the current Process Reward Models focus on heuristically defined rewards that equally weigh these factors, which may lead to the concealment of individual dimension failures by the dominating factors, without guaranteeing the validity of the reasoning process in general.

Memetic Capture: A Pluralistic Policy Framework for Governing AI-Driven Cultural Disempowerment

Subramanyam Sahoo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07802v1 Announce Type: new Abstract: Culture is the most insidious vector of gradual human disempowerment by AI: unlike economic or political displacement, cultural displacement attacks the very preferences and values through which humans recognise and resist disempowerment itself. We argue that existing AI governance frameworks suffer from a critical blind spot by treating cultural impact as secondary to economic and safety concerns. This paper develops \emph{memetic capture} as a unifying concept for AI-driven cultural disempowerment, and proposes the \textbf{Cultural Pluralistic Governance Framework (CPGF)}, a four-tier policy architecture combining quantitative cultural influence metrics, democratic value assemblies, pluralistic deployment standards, and transnational coordination mechanisms. We argue that pluralism is not merely an ethical requirement for such governance but a structural necessity: monocultural AI governance accelerates the very disempowerment it claims to prevent. We identify concrete policy levers, discuss implementation tensions, and outline a research agenda at the intersection of pluralistic alignment and cultural AI governance.

Stability Without Safety: Gain Manipulation Attacks on Agentic Cyber-Physical Systems

Ali Eslami, Jiangbo Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07803v1 Announce Type: new Abstract: Agentic cyber-physical systems (CPS), where autonomous AI agents participate in runtime control decision-making, introduce agent-driven parameter-update pathways absent from conventional feedback architectures. These pathways form a parameter channel structurally distinct from classical sensor and actuator channels. Among these parameters, feedback gains are the highest-leverage target: a single gain matrix determines closed-loop eigenvalue placement for the entire system, and malicious updates can directly alter closed-loop dynamics while evading residual-based monitors. We formalize this attack surface through a three-axis attacker model and a taxonomy of Gain Manipulation Attacks (GMA). Two impact classes are identified: stability-margin erosion under sustained gain drift, and transient amplification under one-shot gain replacement. A stability-preserving gain replacement can still produce transient amplification far exceeding safe operating limits, and stability verification alone is insufficient to bound the physical impact of such attacks. Stealthiness conditions and worst-case impact certificates are derived for each class via Bauer--Fike eigenvalue bounds and the Kreiss matrix theorem, with preliminary detection directions and a vehicle lateral dynamics example provided.

Quantum-Inspired Reinforcement Learning for Low-Latency Intrusion Detection in V2X and Internet-of-Vehicles Networks

Sajid Anwer, Rohan Farooq, Anwar Shah, Tallha Akram — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07804v1 Announce Type: new Abstract: Smart cities increasingly depend on dense edge, IoT, and vehicular networks to deliver critical urban services, including traffic control, connected mobility, infrastructure monitoring, and energy management. In this ecosystem, the Internet of Vehicles (IoV) is central to intelligent transportation, enabling continuous communication among vehicles, roadside infrastructure, and cloud-edge platforms. This connectivity, however, also enlarges the attack surface and exposes smart city and vehicular systems to evolving cyber threats that can compromise safety, privacy, data integrity, and service continuity. Conventional static defenses are often inadequate because they cannot autonomously adapt to changing attack behaviors or multi-stage intrusion patterns. This paper proposes QIRL, a Quantum-Inspired Reinforcement Learning framework built on a lightweight Deep Q-Network architecture for next-generation autonomous cyber defense. QIRL combines amplitude-phase quantum state encoding, rotation-gate-based exploration, and quantum interference reward augmentation within a cost-sensitive Markov Decision Process formulation. It further addresses class imbalance through training-only SMOTE balancing and asymmetric cost-sensitive reward shaping, while sequential MDP modeling captures temporal dependencies in multi-stage attack campaigns. The framework is evaluated on CICIDS2017 and UNSW-NB15. QIRL achieves accuracies of 97.89\% and 91.04\%, F1-scores of 95.22\% and 91.66\%, AUC-ROC values of 0.9945 and 0.9713, and True Skill Statistics of 0.9443 and 0.8244, respectively. It also attains ultra-low inference latencies of 32.5 and 45.7 microseconds per sample, corresponding to 67.77 times and 51.77 times speedups over ensemble baselines. These results show that QIRL offers a lightweight, latency-aware, and adaptive defense for smart city and IoV infrastructures.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07805v1 Announce Type: new Abstract: The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to ''Machiavellian'' behaviors where agents strategically violate safety rules to maximize rewards - a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV(Seed - Evolve - Refine - Verify) pipeline, an ``Agent-as-a-Benchmark'' paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduced novel metrics: the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conducted a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

Sanjay Kariyappa, G. Edward Suh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07808v1 Announce Type: new Abstract: Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.

Sensitivity Analysis White Paper

Nate Bade, Lindsay Erickson — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07809v1 Announce Type: new Abstract: Sensitivity analysis is an important component of simulation-based decision support because it helps analysts determine which inputs most strongly influence model outcomes under uncertainty. This paper organizes the broad sensitivity analysis literature into a coherent framework for use in complex simulation settings, with particular attention to military applications. We review major classes of methods, including local and global approaches, variance-based techniques, screening methods, derivative-based methods, and uncertainty quantification tools, and relate them to common analytical objectives such as factor prioritization, factor fixing, variance reduction, and factor mapping. The paper also discusses sensitivity auditing as a complementary perspective that emphasizes transparency, assumption tracking, and responsible use of models in decision-relevant settings.

SLMJury: Can Small Language Models Judge as Well as Large Ones?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07810v1 Announce Type: new Abstract: Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

Scaling Participation in Modular AI Systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07812v1 Announce Type: new Abstract: Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07813v1 Announce Type: new Abstract: Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to their perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth based methods with factors of magnitude less computation required and can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html

Representational Similarity and Model Behavior in Multi-Agent Interaction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07818v1 Announce Type: new Abstract: Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after controlling for other factors such as performance disparity and model size. We also find that similarity in the early layers consistently shows the strongest association with cooperation and novelty, compared to the middle and later layers. This suggests that a central factor underlying these patterns could be the extent to which the two models share lexical and semantic grounding. Overall, representational similarity can be an important consideration in multi-agent system design.

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07819v1 Announce Type: new Abstract: Recently, the efficiency of Large Language Models (LLMs) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding sub-optimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

A note on rounding fractional matchings with constant-factor strong negative correlation

David G. Harris — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07820v1 Announce Type: new Abstract: We describe new dependent-rounding algorithms for bipartite graphs. Given a fractional matching $x$ of graph $G = (U \cup V, E)$, the algorithms return an integral solution $X$ such that each right-node $v \in V$ has at most one edge, and where the variables $X_e$ also satisfy broad non-positive correlation properties. In particular, for any edges $e_1, e_2$ sharing a left-node $u \in U$, the variables $X_{e_1}, X_{e_2}$ have \emph{strong} negative-correlation, i.e. the expectation of $X_{e_1} X_{e_2}$ is significantly below $x_{e_1} x_{e_2}$. Dependent rounding schemes with these properties have been used for a approximation algorithms for job-scheduling on unrelated machines to minimize weighted completion times, among other applications. Our new algorithm achieves simpler and qualitatively stronger bounds compared to prior algorithms. In particular, we achieve a negative-correlation property $$ \E[X_{e_1} X_{e_2}] \leq 0.79751 \ x_{e_1} x_{e_2}, $$ which is a significant constant-factor improvement over Baveja, Qu & Srinivasan (2023).

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07822v1 Announce Type: new Abstract: As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

Jas: AI-Paired Engineering as a Revival of N-Version Programming

Jason Hickey — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07828v1 Announce Type: new Abstract: I report a case study in AI-paired software engineering: five working ports of a vector illustration application across Rust, Swift, OCaml, Python, and browser-based platforms, built by a single developer in approximately 120 evening hours. The methodology pairs AI-assisted implementation with two safeguards -- a precise executable YAML specification serving as the single source of truth, and parallel implementations functioning as a built-in differential-testing layer. The five ports share a 23{,}000-line specification; per-port native code ranges from 0 to roughly 95{,}000 lines, reflecting the specification's escape hatch. I argue that AI-paired engineering, conditional on these two safeguards, makes feasible scope of work that conventionally requires multiple developer-years, and frame the methodology as a revival of N-version programming, a 1980s approach abandoned on cost grounds that AI changes. The paper reports concrete artifacts and honest limitations of the single-developer case study.

Academic Integrity and Emotional Responses to Inappropriate LLM Use in Software Engineering Education

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07830v1 Announce Type: new Abstract: Academic integrity in higher education is increasingly shaped by complex socio-technical environments marked by automated tools, evolving institutional practices, and heightened performance pressures. Within this context, large language models (LLMs) are becoming prevalent in software engineering education, further blurring boundaries around acceptable assistance and authorship. This study investigates how software engineering students describe their emotional experiences after using LLMs in ways they perceive as academically inappropriate. We conducted a cross-sectional survey with 116 undergraduate students. Results show emotionally heterogeneous responses. Indifference was most frequent, including among students who recognized risks to learning and academic standing. Guilt and anxiety were reported in relation to moral discomfort and concern about penalties. Relief and satisfaction were evident primarily in deadline-driven contexts and situations of unclear guidance.

SLRMentor: An LLM-Based Tool Supporting Learning of SLR in Software Engineering

Rodolfo Gil-Pereira, Ronnie de Souza Santos, Cleyton Magalahes, Italo Santos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07831v1 Announce Type: new Abstract: This paper presents SLRMentor, a conversational assistant designed to support both learning about the systematic literature review process and the execution of planning activities in software engineering. The tool offers general guidance on SLR methodology and supports key planning tasks, including search string construction and reasoning about inclusion and exclusion criteria, with explanations grounded in established SLR guidelines. A pilot validation with graduate students suggests that SLRMentor helps clarify the SLR process and planning decisions, lowers initial barriers for novice researchers, and supports learning while still requiring active methodological judgment.

Ternary public-key cryptosystem

Steven Duplij, Qiang Guo, Na Fu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07832v1 Announce Type: new Abstract: Public-key cryptosystems eliminate the requirement for pre-shared secret keys by enabling encryption with a publicly disclosed key and decryption with a corresponding private key. In this article we generalize the public-key cryptosystems to ternary algebraic structures, with particular attention to ElGamal as a representative family. We introduce the necessary algebraic background for nonderived ternary structures, including special elements, ternary group rings, and a matrix ternarization procedure that maps binary rings and group rings to antidiagonal symbolic matrices closed under ternary multiplication. Building on these foundations, we formulate a ternary analogue of the ElGamal three-step protocol (key generation, ephemeral encryption, and decryption via querelements) and derive explicit ternary power and querelement formulas that enable correct decryption. Concrete instantiations and numerical examples over a ternary fraction field, a matrix-ternarized finite group ring, and a finite $(6,3)$-ring (field) validate the construction and illustrate admissible word-length quantization and cycle behaviour of ternary powers. The ternary framework highlights two practical advantages: richer algebraic structure (querelements replace binary inverses) that increases algebraic complexity for attackers, and higher information density (matrix ternarization transfers paired/plaintext vectors). Formal hardness assumptions, optimized parameter choices, and comprehensive security and performance analyses remain necessary future work.

Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks

Zvi Topol — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07833v1 Announce Type: new Abstract: Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, attack success rate (ASR), not taking into account the sequential structure of how models resist or yield to attacks. We propose applying process mining, a discipline for discovering and analyzing process models from event logs, to red teaming traces. We conduct a controlled experiment pitting 60 HarmBench prompts against two LLMs, GPT-OSS 120B and Llama 3.3 70B, using 10 prompt mutation strategies over up to 110 attempts per prompt. From the resulting 8,575 scored events we extract Directly-Follows Graphs (DFGs) and state transition matrices that reveal structurally distinct defense profiles invisible to ASR alone: GPT-OSS exhibits a near-absorbing refusal state, while Llama presents multiple porous escape routes from refusal to getting successfully jailbroken. We further show that mutator effectiveness is asymmetric across models and that time-to-jailbreak distributions differ by an order of magnitude.

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

Haoran Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07834v1 Announce Type: new Abstract: LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.

Mitigating the Contractivity Trap in Diffusion ODEs via Stein Stabilization

Shigui Li, Delu Zeng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07835v1 Announce Type: new Abstract: A fundamental tension exists in the large-step inference of diffusion models via their deterministic probability flow ordinary differential equation (PF-ODE) trajectories, which we identify as the contractivity trap: efficient inference favors large step sizes, while aggressive steps and highly expressive denoisers can undermine contraction-based stability certificates for error suppression. To address this, we propose SteinDiff, a step-wise inference-time stabilization framework that employs Stein-derived corrections without requiring reference samples. Specifically, SteinDiff introduces a geometry-aware residual correction mechanism that regularizes large-step solver updates without retraining. To this end, we derive a closed-form Stein correction coefficient for step-wise solver adjustment, enabling reference-free adaptation to local data geometry. We further establish a score-controlled perturbation bound under distributional shifts and provide a complementary Stein perspective on EDM-style parameterizations. Extensive experiments demonstrate that SteinDiff mitigates severe artifacts and improves generative quality across large-step inference settings.

Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents

Kirak Kim, Hyojin Kim, Yejin Son, Sungyoung Kim, Kyung Myun Lee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07837v1 Announce Type: new Abstract: A concert is a collective experience, but recorded performance videos are typically watched alone, stripping away the shared audience presence that makes concerts feel eventful. We investigate whether persona-based LLM audience agents can recreate aspects of this collective experience by generating real-time fan chat alongside a K-pop performance video. We present a multi-agent system in which ten LLM agents react through live-chat messages, comparing a persona-conditioned audience (each agent assigned a distinct fan identity, bias, and chat style) with a no-persona baseline. In a within-subjects pilot with K-pop fans (N=11), persona conditioning substantially improved model-level chat quality and perceived naturalness, but did not translate into differences in social connectedness, engagement, or affective response. Interviews suggest that online K-pop concert chat may operate as collective monologue rather than interpersonal dialogue, and that meaningful participation depends on shared identification with the specific artist and fandom. Persona conditioning can make LLM audiences appear more natural, but culturally meaningful collective experience may require deeper alignment between persona, crowd behavior, fandom identity, and user expectations.

A Divergence-Free Scott-Vogelius Finite Element Method for the Surface Stokes Problem

Yerim Kone, Michael Neilan, David Poling — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07840v1 Announce Type: new Abstract: We construct and analyze an exactly divergence-free Scott-Vogelius finite element method for the surface Stokes problem. The proposed scheme simultaneously enforces the tangentiality and incompressibility constructs exactly and has the same number of unknowns as the two-dimensional Euclidean discretization. Our construction extends the surface finite element framework of [10,11] to Scott--Vogelius discretizations defined on curved Clough--Tocher triangulations. In contrast to previous isoparametric Scott--Vogelius methods based on macro-element constructions, the present approach defines the finite element spaces directly on the refined surface triangulation, leading to a substantially simpler and more practical implementation. We prove inf-sup stability of the method and derive optimal-order convergence in the isoparametric regime.

RACT: Retrieval Augmented Column-Table Learning and Prediction for Multi-Table Schema Matching

Leonard Traeger, Enas Khwaileh, Andreas Behrend, George Karabatis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07843v1 Announce Type: new Abstract: Schema matching, a critical task for integrating data from diverse sources, seeks to identify correspondences between columns across different schemas. In multi-table holistic schema matching, columns with similar semantic meaning may reside in tables with different contexts due to heterogeneous schema designs, where similarity-based techniques are inadequate. The focus of this paper is exploiting referential context into schema matching by introducing RACT learning and prediction, a self-supervised framework enabling the probabilistic retrieval of candidate tables for source columns to constrain relevant column candidates. Experiments demonstrate that this approach outperforms similarity-based baselines on matching multi-table schemas. In subsequent matching experiments, constraining the column search space via top-t tables improves both average matching precision and completeness by up to +70%.

GRPO Does Not Close the Multi-Agent Coordination Gap

Najmul Hasan, Prashanth BusiReddyGari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07845v1 Announce Type: new Abstract: We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher counts, four frontier closed-source systems reach mean reward 0.45 to 0.87 and Mistral-Small 24B reaches 0.83 to 0.99, while Qwen3-14B reaches 0.13 to 0.35. We then ask whether group relative policy optimization (GRPO) on rollouts from the task itself can close the gap and find that it cannot: a Welch's t-test on per-episode reward at five philosophers gives p = 0.66 and a Hedges' g of -0.11, with no statistically significant change at ten or fifteen philosophers either. Two further observations qualify the result. The training reward of both 8B and 14B runs peaked at step nine and then declined, so the default saved checkpoint at step 15 is strictly worse than several earlier ones. The four-term reward we use admits a degenerate maximum at zero actions, which DeepSeek-R1-Distill-Qwen-7B and Mistral-Small 24B at five philosophers both inhabit, with mean reward 1.0 and 0.83 respectively at zero meals. The bottleneck for an open-weight 14B model on multi-agent coordination is not training compute but training methodology: reward shaping that does not collapse to a no-action maximum, checkpoint discipline that does not depend on the final step, and curriculum across problem scales.

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method

Faisal Fareed — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07846v1 Announce Type: new Abstract: LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.

Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07853v1 Announce Type: new Abstract: Large Language Models are transforming the support for clinical decision and their application in real scenarios. Yet, most benchmarks are conducted in English, and cross-lingual evaluation is needed to tackle the language gaps in global access. We introduce ClinicalBr, the first bilingual benchmark for clinical decision built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models: MedGemma-27B, Sabi\'a-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

Qiang Le, Yaguang Yang, Isaac E. Weintraub — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07855v1 Announce Type: new Abstract: Path-planning for autonomous vehicles in threat-laden environments is a fundamental challenge because the problem is nonlinear and nonconvex even in simplest scenarios. While traditional optimal control methods can be used to find ideal paths, the computational time is often too slow for real-time decision-making. To solve this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threat as possibly multiple circular 'no-go' zones. A mission is regarded as a failure if the vehicle enters this restricted zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained through trial and error in a simulated environment, learning a direct mapping from its current state (position and heading) to a series of feasible actions that guide the agent to safely reach its destination. The reword function has three parts: (a) an attractive field centered at the final destination, (b) some repulsive fields centered at the origins of circular obstacles, and (c) a penalty of control energy consumption (the magnitude of heading change) that indirectly in favor for straight path. The DDPG trains the agent using these incentives to find the largest possible set of starting points wherein a safe path to the destination is guaranteed. This provides critical information for mission planning, showing beforehand whether a task is achievable from a given starting point, assisting pre-mission planning activities. The approach is validated in simulation. A comparison between the DDPG method and a traditional optimal control (pseudo-spectral) method is carried out. The results show that the learning-based agent produces effective paths while being significantly faster, making it a better fit for real-time applications.

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Igor Lima Strozzi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07856v1 Announce Type: new Abstract: When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free "constellation" -- a generator, a learned critic, and a free exact verifier -- on a FlashFill-style "trapdoor" DSL, where verified (problem, solution) pairs are cheap to synthesize, hard to invert, and free to check exactly. Everything runs on one 4-bit Qwen3-4B on a single 24 GB GPU, with no model in the loop larger than the base. We report three findings. (i) Critic-guided selection beats verifier-filtered best-of-$k$ by $+9.1$ pp ($6/6$ seeds), with the entire gain localized to tasks where candidates disagree on held-out inputs. (ii) Per-round STaR self-training raises the ceiling but never accelerates -- the gain tracks remaining headroom and decelerates across $K=4$ independent training trajectories. (iii) The domain has no clean zero-capability frontier, so the usual "$0\% \to$ climb $=$ emergence" test is invalid here. A measured pass@$K$ crossover settles the diagnosis: the trained model wins at the operating budget (pass@$8$) but the base overtakes it at a large budget (pass@$64$) on every trajectory, so self-training concentrates probability mass rather than expanding reach. This is amplification, not compounding. ($K=4$ is indicative, not yet a robust across-trajectory CI.)

Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices

Stefan Behfar, Richard Mortier — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07857v1 Announce Type: new Abstract: The rise of edge-based machine learning has enabled distributed adaptation of language models across mobile and IoT devices, offering privacy preservation and real-time responsiveness. However, distributed fine-tuning of language models on untrusted or heterogeneous edge nodes introduces new vulnerabilities. Compromised or unreliable devices can inject poisoned updates, leading to stealthy model manipulation or convergence degradation. Classical defenses such as robust aggregation or temporal anomaly detection operate on a single global model and are therefore limited in detecting coordinated or persistent poisoning. This work proposes a new system-level defense based on model multiplicity. Instead of maintaining one global model, the system rotates or concurrently trains multiple small language models (e.g., DistilGPT-2), each updated by independently sampled subsets of edge nodes. These models evolve under distinct training trajectories, creating multiple independent views of the same distributed population. Divergence between models quantified through gradient similarity, loss evolution, or parameter variance serves as a signal of anomalous or adversarial behavior. When one model deviates significantly from the ensemble mean, the system flags its contributing nodes for isolation or re-weighting. We implement this framework and evaluate it on edge-scale simulations of Small Language Model (SLM) training under varying heterogeneity and attack conditions. Results show that model multiplicity enables earlier and more reliable detection of poisoning compared to classical single-model defenses such as Flanders and Robust methods. Our findings demonstrate that diversity in model evolution can serve as a practical and effective defense mechanism for secure distributed learning on resource-constrained edge devices.

A Preliminary Model for Managing Technical Debt in an Agile Environment

Pedro E. Colla — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07859v1 Announce Type: new Abstract: This paper presents a preliminary model for managing involuntary technical debt in agile environments by formulating, in an integrated way, the dynamics among backlog, debt, velocity, and economic value. The work distinguishes initiated but unfinished functional debt from a simple defect back log and from rework, interprets productivity degradation as technical-debt interest, and derives the naive maximum-remediation policy in order to show its limitations against an intertemporal value-based decision. On this basis, a dynamic policy uk is proposed to balance new development and remediation; a decreasing marginal-value structure is incorporated; and the model is extended to discrete, inhomogeneous items. Exploratory validation through sensitivity analysis and MonteCarlo simulation shows behavior consistent with the economic intuition of the model. Finally, the limits of the formulation are made explicit: its macroscopic nature, its dependence on organizationally stable parameters, its assumption of intertemporal rationality, and its requirement of weak coupling among stories.

Data Profiling for Change Rules

Nishttha Sharma, Fei Chiang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07860v1 Announce Type: new Abstract: Understanding data change is critical towards understanding trends, normal vs. abnormal behaviours, recognizing patterns, and the causes of change. Existing database systems have limited support for change management, relying on statistics, triggers, and constraints. Data quality rules model sequential changes along a restricted set of attributes, quantify change among unordered tuples, and have limited ability to model the context under which attribute changes occur. In this paper, we introduce Change Rules (CRs) that quantify the sequential changes among ordered tuples in both the antecedent and consequent attributes. CRs aim to address the limitations of existing declarative dependencies to support trend analysis and causal relationships that trigger change among attributes. We propose CR-Miner, an automated algorithm for CR discovery that generates candidate change intervals in a level-wise manner. Experimental results show that CR-Miner achieves an average runtime improvement of 40-50% over existing baselines.

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07861v1 Announce Type: new Abstract: Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of ``How many r are there in Strawberry?'' asks: how small a visual pattern can a VLM reliably perceive? As such, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4--48px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

Instrumented data for causal scientific machine learning

Daniel N. Wilke — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07865v1 Announce Type: new Abstract: Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07866v1 Announce Type: new Abstract: Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.

The Cold-Start Safety Gap in LLM Agents

Chung-En Sun, Linbo Liu, Tsui-Wei Weng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07867v1 Announce Type: new Abstract: Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap

ASH: Asymmetric Scalar Hashing With Learned Dimensionality Reduction for High-Fidelity Vector Quantization

Mariano Tepper, Theodore Willke — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07870v1 Announce Type: new Abstract: For a long time, additive quantizers, such as product quantization, have been considered the gold standard in terms of accuracy and efficiency. Recently, scalar quantization has re-emerged from the depths of history with a new wave of data-agnostic techniques. Inscribed in this general framework, we turn our attention to data-driven methods, showing that new highs in recall and speed can be achieved by reducing the number of dimensions while increasing the bitrate per dimension. Critically, this dimensionality reduction needs to be learned from data to be successful. We present ASH (Asymmetric Scalar Hashing), a data-driven encoder-decoder framework that applies dimensionality reduction to database vectors via a learned orthonormal projection, followed by scalar quantization, while keeping queries in their original form. This asymmetric design enables higher accuracy than the best additive and scalar quantizers at iso-compression, while admitting highly efficient similarity computations via SIMD operations. ASH has short learning and encoding times, making it attractive for real-world deployment. Extensive experiments on a variety of datasets demonstrate that ASH achieves state-of-the-art ANN recall and speeds across all compression regimes.

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07872v1 Announce Type: new Abstract: When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/

Adverse Effects of V2V Adoption on Road Safety

Zhenqi Liu, Philip N. Brown, Keith Paarporn — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07873v1 Announce Type: new Abstract: Vehicle-to-vehicle (V2V) communication is expected to improve road safety and reduce congestion. However, prior work shows that V2V information sharing under partial adoption may increase congestion and decrease safety. We study whether increasing V2V adoption itself affects road safety. We propose a corrected version of an existing model and analyze its behavior under varying adoption levels. We show that, in some cases, increased V2V adoption can increase accident probability. Moreover, under an optimal signaling policy, the system can ensure that accident probability is non-increasing in the adoption level.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07874v1 Announce Type: new Abstract: LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in context-information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

Whose Norms? Disentangling Cultural and Personal Alignment in Large Language Models

Angana Borah, Isabelle Augenstein, Rada Mihalcea — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07877v1 Announce Type: new Abstract: Large language models are increasingly used for social decision-making situations that require balancing cultural norms with personal preferences. For example, a user preferring honesty might ask whether to correct a coworker publicly when local norms favor indirect feedback. Yet existing research studies cultural alignment and personalization largely separately. We introduce PACT, the Personal-Preference and Cultural-Norm Trade-off framework, which evaluates whether models choose to follow a cultural norm or allow personal preferences. We find that LLMs vary in how rigidly they enforce cultural norms, with behavior shifted more by country context (7.8%) than age (1%) and gender (0.7%) and shifting non-uniformly after instruction tuning. Furthermore, our five-country human study on PACT shows that culture-following in humans is mainly driven by scenario country, with the lowest agreement when participants judge their own cultural contexts, showing within-culture pluralism. Finally, human-LLM alignment experiments show that models can match majority choices, but fail to capture response distributions and uncertainty (with best correlations reaching only 0.24). Together, these findings motivate alignment evaluations that go beyond majority to capture cultural pluralism and disagreement in social judgment.

Still: Amortized KV Cache Compaction in a Single Forward Pass

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07878v1 Announce Type: new Abstract: The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Itay Elam, Eliron Rahimi, Avi Mendelson, Chaim Baskin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07881v1 Announce Type: new Abstract: Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to $1.69\times$ over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.

The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders

Yousef Radwan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07882v1 Announce Type: new Abstract: Different vision neural networks -- trained to classify, contrast, reconstruct, or match images to text -- should have correspondingly different internal representations. We report that they do not. After training, the top sixteen principal directions of variation inside thirteen modern vision encoders converge to the same sixteen-dimensional geometric object. We call this the cross-architecture substrate and study it with PCA, centred kernel alignment (CKA), and Pang 2026 calibration. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at median Procrustes-CKA 0.679, and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, every pair >0.40. It survives Pang calibration globally (7.4x disc-vs-MAE separation, n=13,394) and locally (4.82-5.30, p<10^{-44}). It is not pixel statistics (0.263), not Gabor features (0.31), not a random projection (0.041), and emerges in the first 10% of training while accuracy keeps climbing. We deliver four applications: a label-free transferability filter beating LogME (3x faster, +0.15 Kendall-tau); a four-way domain detector (99.6% accuracy); a frozen low-shot probe (16 dims beat 768-dim DINOv2 by 3.78pp at N=50 labels per class); and a teacher-free distillation auxiliary matching trained-teacher KD on 33 pairs (7.56pp peak gain at 10% label fraction). The substrate does not cross modalities, does not help cross-paradigm distillation, and does not predict transfer quality (rho=0.08 against transfer accuracy).

DP4SQL: Differentially Private SQL with Flexible Privacy Policies

Andrew Cascio, KinChin Tong, Daniel Kifer, Zeyu Ding, Danfeng Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07883v1 Announce Type: new Abstract: The plausible deniability model of differential privacy for single-table datasets is well-understood. However, applying differential privacy to relational databases is much trickier: each application needs flexibility in specifying the pieces of information about an entity, spread across multiple relations, that require plausible deniability guarantees. Existing differentially private SQL systems only support rigid privacy policies. Even seemingly small changes, such as specifying that some tables need to protect the existence of records while others only need to protect the record contents, require significant manual effort in updating their privacy accountants and proving their correctness. One example of a challenge is the presence of partially public data. Public columns in a table (e.g., faculty names in a university dataset and partial course enrollment information) can cause some queries to require more noise (compared to fully private data), while others require less noise. This kind of reasoning is not supported in existing systems. Another example is when different parts of records (e.g., demographics, financial data) require different levels of privacy protection. Again, existing differentially private SQL systems need to rewrite their rules for calculating query stability in order to support such a feature. This paper presents DP4SQL, a differentially private SQL system that allows data curators to better customize the plausible deniability requirements for their relational databases. This avoids the drawbacks of the "one-size-fits-all" systems that would either underprotect the data or inject too much noise into query answers.

Value-Refined Modal Fixed-Point Semantics with Certified Choice and Public Share-Alike Certificates

Faruk Alpay, Levent Sarioglu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07884v1 Announce Type: new Abstract: This paper presents a finite modal semantics where truth is closed under admissible continuation, then refined by discounted value, and finally certified by residual tests. The admissibility kernel is the classical greatest fixed point of a one-step predecessor expressing that some choice cell has all compatible successors inside a set. Certified choices are exactly local witnesses; the discounted value transformer is defined only over those witnesses; value-refined modal bisimulation is the coarsest local equivalence preserving formulas, kernel, certified choices, Bellman values, greedy sets, residual certificates, and public release certificates. A canonical pseudometric refines this equivalence: it is the unique fixed point of a Hausdorff-lifted choice-matching transformer over certified choices; its zero set is the value-refined bisimulation, and the optimal discounted value is one-Lipschitz with respect to it. Any approximate quotient incurs only a distance-bounded value error. Branching choice-cell and locus presentations place choice inside the model; the transition presentation is a conservative retraction. The same engine is applied to a public share-alike release fragment: attribution as label preservation, same-license propagation as derivative closure, no downstream restriction as admissibility, and the BY-SA witness as a residual-stable certificate. Finite examples show that altering the order of truth, admissibility, value, quotienting, public derivation, and certification changes the semantics.

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

Marut Pandya, Kasey Zhang, Baiqing Lyu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07889v1 Announce Type: new Abstract: LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher's exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output -- quoted acknowledgment, quoted action, and typed conflict -- showing what the agent saw and ignored.

Partially Performative Prediction

Jaewook Lee, Tijana Zrnic — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07890v1 Announce Type: new Abstract: Performative prediction studies feedback loops that arise when predictive models are deployed in consequential domains. In these settings, deploying a model can change the population whose patterns the model aims to predict, inducing a distribution shift that is endogenous to the learning system. This perspective departs from classical treatments of distribution shift, where shifts are typically modeled as exogenous changes in the data-generating process. Yet, in practice, distribution shift is rarely one or the other. Predictive models may influence future data through the decisions they support, while the world itself continues to drift for reasons beyond the learner's control. We study partially performative prediction, a framework that captures both endogenous and exogenous sources of distribution shift. The framework generalizes performative prediction by allowing the data distribution to evolve both in response to the deployed model and according to an external, time-varying process. We extend the central notions of performative stability and performative optimality to this setting by defining their online analogues that track the evolving partially performative environment. We analyze practical learning heuristics, including repeated retraining, and characterize when they successfully adapt to partially performative environments.

C3VD-DEFCOL: A Deformable Colonoscopy Dataset with Time-Resolved 3D Ground Truth and Realistic Appearance

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07891v1 Announce Type: new Abstract: 3D reconstruction could improve colonoscopy by estimating mucosal coverage and alerting clinicians to missed regions during screening. However, algorithm development is limited as no current datasets provide both a realistic in vivo appearance and dense, time-resolved 3D ground truth, especially under non-rigid deformation. We present C3VD-DEFCOL, a framework and dataset for evaluating deformable colonoscopy reconstruction with paired geometry and realistic texture. Starting from C3VD/C3VDv2 colon meshes and camera trajectories, we generate controlled deformations of the colon surface, including peristaltic waves and centerline motion, and render per-frame depth, surface normals, optical flow, camera poses, and time-stamped 3D meshes. We then use the rendered geometry, primarily depth, to condition an LTX-2.3-based sim-to-real translation model that produces RGB clips with in vivo-like mucosal color, texture, vasculature, and specular appearance while preserving the underlying 3D scene structure. The resulting dataset contains 110 videos from 11 unique colon mesh geometries, with varying camera trajectories, appearances, and parameterized deformation regimes, including three peristaltic severity levels that serve as controlled evaluation axes. We evaluate the generated videos using appearance realism, geometric consistency, and temporal consistency metrics, and use the paired ground truth to benchmark the downstream task of pose estimation in deformable 3D reconstruction. Our experiments show how pose estimation error increases with increasing deformation severity, providing a controlled stress test that is not possible with existing in vivo datasets. Overall, C3VD-DEFCOL is designed as a reproducible, quantitative evaluation platform for testing deformable 3D reconstruction algorithms, with the goal of reducing the domain gap between synthetic datasets and in vivo colonoscopy.

Beyond Individual Personas: Aligning Synthetic Dialogue to Population-Level Behavior Distributions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07893v1 Announce Type: new Abstract: Synthetic dialogue corpora are increasingly used as proxies for target dialogue data, yet persona-grounded generators optimize individual conversations rather than corpus composition, yielding locally plausible dialogues with distorted population-level behavior mixes. We introduce GroupPersona, a framework that aligns synthetic dialogue corpora to the behavior distribution of a reference corpus. GroupPersona turns population statistics into generation controls: it separates each dialogue's core behavioral signature from predictable side effects, and uses the resulting behavioral groups to condition user agents on the interaction patterns that define the reference population. We evaluate GroupPersona on four corpora crossing two dialogue sources, assistant-style and Reddit-derived, with two construction variants: structure-preserving and variation-enhanced. GroupPersona lowers Jensen-Shannon divergence between synthetic and reference distributions over 12 behavior attributes from 0.234 to 0.177 relative to the strongest average baseline, a 24.4% reduction, and is best or tied-best on all four corpora while preserving structural alignment. It also achieves the closest calibration to reference-conversation quality scores, reducing mean absolute deviation from the reference-conversation profile to 0.63 versus 0.91 for the next-best baseline.

DD-GEPA: Prompt Optimization for Dialogue Disentanglement Focusing on Task Instruction and Utterance Representation

Naoki Takada, Tatsunori Mori — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07894v1 Announce Type: new Abstract: Multi-party chat often contains interleaved dialogues because multiple participants can discuss different topics at the same time. Dialogue disentanglement addresses this problem by separating an entangled utterance sequence into coherent dialogues. While large language models (LLMs) are promising for this task, they still struggle with dialogue disentanglement and achieve low accuracy. This paper proposes an automatic prompt optimization for LLM based dialogue disentanglement. We decompose the prompt into three components: task instruction, utterance representation, and output instruction, and optimize them using GEPA, an optimization method for compound AI systems. Experiments on benchmark datasets show that the optimized prompts improve dialogue disentanglement accuracy over the original prompts and can surpass hand crafted prompts.

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07895v1 Announce Type: new Abstract: Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

Alejandro Botas, Paul de Font-Reaulx, Luke Hewitt — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07897v1 Announce Type: new Abstract: Current AI models frequently exhibit epistemic sycophancy, endorsing claims to agree with a user. Existing evaluations typically measure this either by assessing what it takes to make a model shift a binary endorsement or by eliciting an explicit probability in a proposition. However, much user-facing sycophantic behavior is demonstrated through shifts in graded support expressed through ordinary language. We propose the AI Epistemic Deference Index (AEDI): a continuous, unidimensional score representing how sensitive the support expressed in a model's output is to the attitude expressed in a user's prompt. To generate AEDI, we provide a new protocol for estimating probabilities from natural language outputs, using LLMs-as-judges validated for consistency and correlation to human judgment. We deploy it on a new curated database of 500 propositions across diverse topics and 16,000 prompts varying in user attitude, testing eight prominent models. Every model exhibits substantial deference, though with large and systematic differences across providers, with Claude models demonstrating the least, and Grok and Gemini models the most. The effect is amplified in prompts requesting a written artifact, and concentrated on propositions where models hold weaker priors. We release AEDI as an easy-to-update benchmark and measurement pipeline for output-level sycophancy evaluation.

Temporal Coverage over Density: Parsimonious Training-Set Design for ML Climate Downscaling

Karandeep Singh, Stefan Rahimi, Chad W. Thackeray, Stephen Cropper, Alex Hall — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07898v1 Announce Type: new Abstract: High-resolution regional climate simulations provide critical information for climate impacts assessments but remain computationally expensive, motivating the development of machine-learning downscalers and emulators. A key challenge is determining how limited high-resolution simulations should be distributed across a changing climate trajectory to capture both forced climate response and internal variability. Using the CESM2 Large Ensemble over the western United States, we compare three training-year selection strategies under fixed data budgets: a contiguous block of historical years, years drawn from both the beginning and end of the simulation period, and years distributed throughout the full climate trajectory. Including both historical and future years consistently outperforms training on historical years alone, demonstrating the importance of exposing downscaling models to climate states outside the historical record and highlighting limitations of stationarity assumptions common in statistical downscaling. Training on years distributed throughout the full climate trajectory performs best overall, indicating that broad sampling of internal variability provides additional information beyond exposure to the forced climate response alone. Models trained on temporally distributed subsets more successfully reproduce variability in unseen ensemble members while retaining strong performance across a wide range of climate diagnostics. Even when trained on only one-tenth of the available high-resolution years, temporally distributed models remain highly competitive with full-data training. These results suggest that, under fixed computational budgets, broad sampling of climate states is more valuable than temporal continuity when allocating scarce high-resolution simulations. The findings provide practical guidance for regional climate downscaling and large-ensemble projection workflows.

CellSense: A Sub-6 GHz Cellular ISAC System for Clutter-Robust Passive Sensing

Bibhor Kumar, Ish Kumar Jain, Vijay K Shah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07900v1 Announce Type: new Abstract: Future wireless networks demand capabilities beyond traditional communication, driving the development of Integrated Sensing and Communication (ISAC) for environmental awareness, localization, and tracking. Ubiquitous cellular deployment allows ISAC to maximize spectral efficiency, lower costs, and expand sensing coverage. However, sub-6 GHz research has heavily favored communication, leaving sensing capabilities largely underexplored. To bridge this gap, we introduce CellSense, a novel sub-6 GHz ISAC architecture natively integrated into the 5G cellular protocol stack for real-world target tracking. We validate the system via Sionna-based orthogonal frequency-division multiplexing (OFDM) link-level simulations and an experimental USRP hardware prototype using the OpenAirInterface (OAI) stack. Furthermore, we analyze the communication-sensing tradeoff by quantifying how pilot symbol density impacts throughput versus sensing accuracy. Simulations show that CellSense achieves a 74 percent detection probability with a 1.43 m localization error in indoor warehouse environment, which improves to 94 percent detection and a sub-meter error of 0.33 m in the outdoor environment of Oval area at the NCSU Centennial campus. Hardware experiments in a highly cluttered indoor laboratory confirm a 1.28 m localization accuracy and 76 percent detection probability, proving its efficacy for practical ISAC deployments.

End-to-End Control of a Powered Knee-Ankle Prosthesis Towards Unified, Tuning-Free Assistance

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07902v1 Announce Type: new Abstract: Powered prostheses conventionally rely on impedance controllers that require extensive manual tuning and explicit mode classification. In this work, we present real-time deployment of an end-to-end prosthesis controller that estimates continuous actuator signals from onboard sensors, eliminating the need for intent classifiers and subject-specific tuning. Temporal Convolutional Networks were trained on a multi-terrain dataset from 18 individuals with transfemoral amputation and deployed in real time across five locomotion modes. Four participants (three able-bodied, one with transfemoral amputation) ambulated across level ground, ramp ascent and descent, and stair ascent and descent. During level walking, the deployed controller reproduced the training-data scaling of peak ankle torque with walking speed (deployed 0.85 Nm/kg per m/s, p = 0.001; training 0.96 Nm/kg per m/s, 95% CI [0.42, 1.50], p = 0.002), after excluding one outlier traced to atypical prosthesis loading. During ramp ascent, the controller scaled knee pre-flexion with grade (deployed 2.92 deg/deg, p = 0.027; training 3.30 deg/deg, 95% CI [1.83, 4.77], p < 0.001). During ramp descent, the controller increased resistive knee torque relative to level walking (deployed +0.16 Nm/kg, p < 0.001; training +0.16 Nm/kg, p = 0.008). Seamless stair transitions were generated for both intact- and prosthetic-side-leading sequences in ascent and descent, despite the training data containing only one limb-leading sequence. These results provide initial evidence towards end-to-end control that can provide unified, mode-adaptive prosthetic assistance without subject-specific tuning.

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Rahul Suresh Babu, Laxmipriya Ganesh Iyer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07904v1 Announce Type: new Abstract: Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering addresses this gap by using lightweight contracts that specify each tool's preconditions, effects, risk level, and cost. However, manually writing and maintaining such contracts does not scale to large or changing tool ecosystems. We introduce Contract2Tool, a framework for inferring tool contracts from metadata, schemas, documentation, and execution traces. Contract2Tool converts observable tool evidence into normalized symbolic contracts that can be evaluated intrinsically and deployed inside downstream causal tool filtering. We evaluate learned contracts against gold preconditions, effects, and risk labels, and measure their downstream utility on multi-step agent tasks. Our results show that hybrid documentation-and-trace evidence produces contracts accurate enough to preserve most of the reliability and efficiency benefits of gold contracts. Learned-contract CMTF achieves 0.980 downstream success, close to 0.990 for gold-contract CMTF, while reducing visible tools from 100 to 1 and reducing average token usage from 26,172 to 2,528 relative to all-tools exposure. These results suggest that learned contracts can provide a scalable contract layer between tool schemas and reliable agent execution.

Extremum Seeking Control Based Adaptive Compensation of Position Sensor Harmonics in PMSM Drives

Gayan V. Dissanayake, Sandun S. Kuruppu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07906v1 Announce Type: new Abstract: Permanent Magnet Synchronous Machines (PMSMs) have become one of the preferred forms of electromechanical energy converters, attributing to their high efficiency, torque density, and other unique advantages. However, given the need for proper rotor position measurement for commutation and field orientation, accurate rotor position sensing is of paramount importance. In sensing motor rotor position with a sensor, harmonic errors that arise in the sensing subsystem lead to undesirable torque ripple. Thus, this paper presents an adaptive, extremum seeking control based approach capable of mitigating position signal harmonics in PMSMs. The proposed approach is experimentally validated under varying torque, speed, and harmonic conditions. Its harmonic compensation performance is comparatively evaluated against the look-up table based method. Furthermore, the accuracy of the proposed approach is analyzed, highlighting its effectiveness.

3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07907v1 Announce Type: new Abstract: In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D point cloud coordinates from ten fixed-angle intraoral images, employing MobileNetV2 and Multi-head Attention for multi-view feature fusion, with a combined L1 Loss and Chamfer Distance as the loss function. Although the model achieved an accuracy of 77.49%, predicted vertices tended to concentrate in high-density regions of the ground truth, leaving other regions largely uncovered. In this paper, an improved loss function is proposed to address this limitation. Hungarian matching with filtering and Repulsion Loss are introduced to enforce more uniform vertex distribution across the reconstructed model. The proposed model achieves an accuracy of 68.02%, which is numerically lower than the previous model. However, the vertex clustering issue observed in the prior work is substantially alleviated, with predicted vertices distributed more evenly across the entire reconstructed surface.

Layer-wise Derivative Controlled Networks Achieve Competitive Accuracy and Gradient Stability Across Data Regimes

Rowan Martnishn — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07908v1 Announce Type: new Abstract: Derivative-controlled networks based on ChainzRule (CR) combine cubic polynomial layers with a lightweight forward-mode per-layer Jacobian penalty (DREG). In this second paper of a multi-part series, we evaluate the generalization properties of CR across data regimes. We ablate the shape of the DREG coefficient schedule, demonstrating that the optimal annealing range depends on representation noise. On the Pima Diabetes dataset, CR achieves strong low-data performance and maintains a consistent accuracy advantage over baselines from 5\% to 100\% training data, supported by exceptionally stable gradient tail ratios ($\sim$1.01--1.02 vs. 1.07--1.09 for ReLU networks). Extensions to SST-5 show competitive or superior results in both frozen-embedding and BERT fine-tuned regimes, including outperforming prior BERT baselines despite substantially less training data. These results are statistically significant: CR achieves superior accuracy over the strongest published baselines we could identify on both datasets ($p < 0.05$). These results establish that layer-wise derivative control induces a structural inductive bias toward low-frequency, stable representations that generalizes robustly across tabular and NLP domains, data volumes, and representation qualities. The gradient tail ratio serves as a reliable, label-free diagnostic of generalization capability.

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07909v1 Announce Type: new Abstract: Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.

CAAL: Contextual Bandits based Online Hand-Craft Active Learning Strategy Selection

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07910v1 Announce Type: new Abstract: The challenge with active learning algorithms is the uncertainty of the statistical distribution of unlabeled data, making it difficult to choose the best hand-crafted strategy. To address this, we introduced Contextual Adaptive Active Learning (CAAL). In CAAL, each "arm" represents a hand-crafted strategy. Unlike existing frameworks that select strategies based only on feedback from labeled data, we dynamically choose strategies for labeling batches of data using reward prediction with external context information. This general framework allows for customization with domain knowledge to design more effective rewards and context candidates. In addition, we experimentally show that CAAL outperforms the existing baseline adaptive strategy on public datasets using our reward and context design. Our results are consistent regardless of batch size in each iteration.

On Improved Statistical Accuracy of Low-Order Polynomial Chaos Approximations

Vedang M. Deshpande, Raktim Bhattacharya — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07912v1 Announce Type: new Abstract: Polynomial chaos expansions provide surrogate models for stochastic systems, with coefficients typically derived using Galerkin projection, stochastic collocation, or least squares approximation. These traditional approaches often fail to accurately capture statistical moments without resorting to high-order approximations. We propose a constrained optimization framework that modifies standard techniques to determine polynomial chaos coefficients that precisely recover the first two statistical moments. The effectiveness of our approach is demonstrated on several candidate algebraic functions of random variables, showing significant improvements in statistical accuracy even with low-order approximations.

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

Da Li, Xinxin Li, Xingyu Cui, Jin Xu, Juan Zhang, Junping Yin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07915v1 Announce Type: new Abstract: Neural symbolic regression models improve inference efficiency by shifting structural search to pretraining, but their one-pass autoregressive decoding is prone to error accumulation, which may lead to generating structurally incorrect expressions, especially in complex expression generation scenarios. Existing rectification strategies can alleviate this issue, but they often depend on restarting global search, thereby weakening the efficiency advantage of neural models, and remain susceptible to error accumulation. In this paper, we propose EditSR, a two-layer framework that combines a neural symbolic regression model in the first layer with an edit-based Rectifier in the second layer to achieve efficient prediction and post-hoc rectification. Instead of restarting the global search, we maintain rectification efficiency by pretraining the Rectifier. Specifically, we formulate the rectification process as a step-by-step state-transition chain starting from an incorrect expression, and develop a state-transition algorithm to construct supervised rectification chains for training the Rectifier. To ensure syntactic validity throughout rectification, each edit action is restricted to a syntactically valid space so that every edited expression remains parseable. In addition, because each edit decision is conditioned on the current state rather than the history, the Rectifier allows errors made in earlier steps to be rectified by subsequent edits, thereby reducing the risk of error accumulation. Extensive experiments and ablation studies show that EditSR substantially improves symbolic structure recovery with limited extra cost, with more pronounced gains on complex expressions, where one-pass autoregressive decoding is more susceptible to error accumulation.

The CIFAR Synthetic Evidence Corpus for Detecting AI-Generated Evidence

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07916v1 Announce Type: new Abstract: The growing ability of generative models to produce realistic documents poses a direct challenge to evidentiary workflows in the justice system and the courts, where decisions increasingly depend on the authenticity of evidence such as receipts, communications, and administrative records. Unlike social media or academic settings, evidentiary documents are often only subtly altered, with small, localized edits that preserve overall plausibility while changing legal meaning. Yet progress on automated detection remains limited, largely due to the absence of suitable training and evaluation data especially suited for the justice system requirements. Existing resources are either focused on photos of human faces or natural scenery or on narrowly scoped academic or social media document types, and do not capture the structure, diversity, or manipulation patterns characteristic of real-world evidentiary data. As a result, current detection systems do not necessarily learn meaningful signals appropriate for the justice system. We introduce the CIFAR Synthetic Evidence Corpus, a dataset designed to enable rigorous evaluation of evidence verification under realistic and controlled conditions. The corpus spans multiple document families and a spectrum of manipulation strategies, from small field-level edits to complete document fabrication, and is constructed using a diverse set of state-of-the-art generative tools. It is organized to systematically vary both manipulation complexity and generation method, while enforcing source-level separation between training and test data to reflect real-world generalization challenges.

Larch: Learned Query Optimization for Semantic Predicates

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07923v1 Announce Type: new Abstract: With the advent of Large Language Models (LLMs), many database systems introduced semantic operators that enabled analytical queries over unstructured data (e.g. text, images, videos). Semantic operators typically incur high inference costs and latencies making semantic (AI) SQL queries challenging to apply on large scale datasets. At the same time, their semantic nature leads database engines to treat them as black boxes, making AISQL queries difficult to optimize. In this paper, we introduce Larch, a framework for optimizing the execution of semantic filters in AI SQL queries. Larch was inspired by two key observations: i) the high latency of semantic operators leaves significant room for computationally-heavy runtime optimization techniques, ii) unstructured data are typically accompanied by semantic information in the form of embeddings allowing for efficient semantic comparisons between AI_FILTER prompts and data values. Based on these two key observations, we present two Larch variants: Larch-A2C and Larch-Sel. Larch-A2C encodes arbitrary semantic filters expression tree using an embedding-augmented Gated Graph Neural Network and formulates the filter evaluation order as a Markov decision process. In contrast, Larch-Sel leverages a supervised learning model to predict filter selectivities, subsequently applying dynamic programming to find a near-optimal evaluation order for each input row. Evaluated across diverse real-world datasets and comprehensive synthetic workloads, both Larch variants always outperform existing semantic filter optimization techniques in terms of token usage. Our results demonstrate that Larch is robust across diverse workloads, reducing total token cost overhead by 3x-19x compared to Palimpzest and Quest.

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07924v1 Announce Type: new Abstract: This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

ROSUM-MCTS: Monte Carlo Tree Search-Inspired HDL Code Summarization with Structural Rewards

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07925v1 Announce Type: new Abstract: Large language models (LLMs) have shown promise in code summarization, yet their effectiveness for Hardware Description Languages (HDLs) like VHDL and Verilog remains underexplored. We propose ROSUM-MCTS, an LLM-guided approach inspired by Monte Carlo Tree Search (MCTS) that refines summaries through structured exploration and reinforcement-driven optimization. Our method integrates both local and global context via a hierarchical candidate expansion mechanism and optimizes summaries using a composite reward function balancing functional correctness (FC), local content adequacy (LCA), and fluency. We evaluate ROSUM-MCTS on the VHDL-eval and Verilog-eval datasets, demonstrating its consistent outperformance over baseline methods by leveraging structured bottom-up refinement and reinforcement-based optimization. Ablation studies confirm the necessity of both local and global expansion strategies, as well as the importance of balancing FC and LCA for optimal performance. Furthermore, ROSUM-MCTS proves robust against superficial modifications, such as variable renaming, maintaining summary quality where baselines degrade. These results establish ROSUM-MCTS as an effective and robust HDL summarization framework, paving the way for further research into reinforcement-enhanced code summarization.

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

Yuan Shen, Xiaojun Wu, Linghua Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07929v1 Announce Type: new Abstract: Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.

LEGS: Laplacian-Enhanced Gaussian Splatting with a Nonlinear Weighted Loss

Yongfei Guo, Qizhou Huo, Xuan Sun, Yuanhao Gong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07932v1 Announce Type: new Abstract: 3D Gaussian Splatting (3DGS) has become an efficient explicit representation for radiance field reconstruction and real-time novel view synthesis. However, its standard photometric loss treats flat and structure-rich regions similarly, which may limit the recovery of sharp contours and fine details. Edge-Guided Gaussian Splatting (EGGS) improves structure awareness through edge-guided weighting, but mainly relies on first-order gradient responses and linear weighting. In this paper, we propose LEGS, a Laplacian-Enhanced Gaussian Splatting method with a nonlinearly weighted loss. LEGS replaces first-order gradient guidance with second-order Laplacian structural guidance and maps the normalized Laplacian response into pixel-wise weights through nonlinear response-to-weight functions. The proposed loss improves structure-aware Gaussian optimization while keeping the original 3DGS rendering pipeline unchanged. Experiments on the full Tanks\&Temples and Mip-NeRF360 datasets show that LEGS improves peak signal-to-noise ratio (PSNR) by up to 1.68 dB over 3DGS and up to 0.52 dB over EGGS. Incorporating the proposed second-order nonlinear weighting strategy into FastGS and FasterGS further improves PSNR by up to 1.69 dB, demonstrating its effectiveness as a general loss-level extension for Gaussian Splatting pipelines with potential applications in AR/VR, immersive visualization, and real-time 3D content generation.

Finite-Blocklength Lossy Joint Source-Channel Coding over Unknown Channels

Adeel Mahmood, Harish Viswanathan, Jinfeng Du — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07933v1 Announce Type: new Abstract: We analyze the finite-blocklength performance of lossy joint source-channel codes (JSCC) in an unknown-channel framework, where the true channel is unknown but the source distribution is known. We establish achievability results for mismatched-design JSCC, where the code design is based on a channel $Q_{Y|X}$ but deployed over a different channel $P_{Y|X}$. Our mismatched-design achievability result allows nonstationary channel laws and arbitrary standard Borel alphabets for the source, reproduction, channel input and channel output. The achievability bound is given in terms of the rate-distortion and rate-dispersion functions, as well as two channel-dependent quantities that we call the mismatched-design rate and mismatched-design rate-dispersion. For block erasure channels, our result shows that channel mismatch incurs no penalty. We then show a second-order universal family of source-channel codes over the set of block erasure channels. Our code construction uses Poisson functional representations of suitable conditional probability measures to produce the encoder and decoder outputs. We use a parameterized family of Gibbs posteriors as the decoder-side kernels, whose envelope recovers the generalized mutual information.

X-OP: Cross-Morphology Whole-Body Teleoperation via MPC Retargeting

Jen-Wei Wang, Sarthak Kaingade, Andrea Tagliabue, Nicholas Morozovsky — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07934v1 Announce Type: new Abstract: Whole-body teleoperation is essential for scalable robot data collection in loco-manipulation tasks, yet existing approaches relying on exoskeleton suits or multi-camera setups impose prohibitive cost, complexity, and environmental constraints. Recent methods using a single extended reality (XR) device with end-to-end reinforcement learning policies partially address these limitations but require robot-specific retraining, suffer from out-of-distribution failures, and rely on motion retargeting that neglects dynamic feasibility. We propose a hierarchical whole-body teleoperation framework driven by a single XR device that generalizes across diverse robot morphologies without retraining robot-specific policies. A Model Predictive Control (MPC)-based motion retargeter jointly optimizes alignment with the operator's intent and the robot's dynamic feasibility, generating optimal commands for existing low-level controllers. To ensure robust online execution, we introduce a state synchronization method that resets the simulator state at each MPC step to handle noisy real-world measurements and contact sensitivity, and integrate SLAM-based global pose feedback to mitigate long-term drift. Simulation results show higher success rates on whole-body control tasks for both a humanoid (over 30% lower completion time and 20% lower power consumption) and a mobile manipulator (zero collisions) compared to baselines. Real-world experiments further validate the effectiveness and flexibility of our method, demonstrating the successful deployment of the proposed retargeter on both platforms for whole-body control tasks and the ease of allowing users to adjust teleoperation behavior based on their preferences. This plug-and-play framework offers a scalable, morphology-agnostic solution for whole-body robot teleoperation, enabling real-time behavioral customization and broad applicability across platforms.

REACT 2026: The Fourth Multiple Appropriate Facial Reaction Generation Challenge: Personalised MAFRG and Appropriate EEG Reaction Prediction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07935v1 Announce Type: new Abstract: In dyadic interactions, various human facial reactions could be appropriate for responding to each human speaker behaviour. Following the successful organisation of the REACT 2023, 2024 and 2025 challenge series, a body of generative deep learning (DL) models have been developed for the problem of multiple appropriate facial reaction generation (MAFRG). This year, we propose the REACT 2026 challenge encouraging the development and benchmarking of Machine Learning (ML) models that can generate multiple personalised, appropriate, diverse, realistic and synchronised human-style facial reactions expressed by a specific human listener for responding to each given speaker behaviour. As a key of the challenge, we continuously provide challenge participants with MARS dataset introduced by REACT 2025 but additionally provide individual-level Big-Five personality labels and EEG recordings. This introduces a new one-to-many personalised facial reaction generation setting combining human expressive behavioural, affective and neurophysiological signals, which remains largely unexplored in current dyadic interaction modelling. This paper also presents the challenge guidelines and new baselines on the four proposed sub-challenges: Offline generic and personalised MAFRG as well as Online generic and personalised MAFRG, respectively, which are publicly available at https://github.com/reactmultimodalchallenge/baseline_react2026.

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07936v1 Announce Type: new Abstract: Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

Hallucination Cascade: Analyzing Error Propagation in Multi-Agent LLM Systems

Saeid Jamshidi, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07937v1 Announce Type: new Abstract: Large Language Models (LLMs) generate fluent text but remain vulnerable to hallucinations, producing unsupported, inconsistent, and factually incorrect claims. Most prior work treats hallucination as a static property of isolated outputs. In multi-agent LLM systems, however, responses are exchanged across agents, revised through sequential stages, and reused as context for later reasoning. Hallucination, therefore, becomes a dynamic process shaped by interaction history, cascade depth, and model heterogeneity. This paper analyzes hallucination dynamics in multi-agent LLM cascades by tracking claim-level factual inconsistencies across sequential agent interactions. We conduct 500 cascade experiments across 10 knowledge domains using GPT-5.3, DeepSeek-V3, and LLaMA-3-70B-Instruct, yielding 1,250 evaluated responses. Results show that deeper cascades reduce the normalized hallucination score from 0.422 at the first agent to 0.272 at the final agent in 3-agent chains, with an amplification factor of 0.644, indicating net attenuation. This reduction is accompanied by a decline in factual accuracy from 0.789 to 0.769, revealing a trade-off between hallucination suppression and factual preservation. Transition-level analysis shows that each agent-to-agent refinement reduces hallucination by an average of 0.072, with small but consistent losses in factual consistency and response quality. Model-level results reveal reliability-efficiency trade-offs: LLaMA-3-70B-Instruct achieves the lowest hallucination score, whereas GPT-5.3 provides faster generation with a higher hallucination rate. Domain-level analysis shows that hallucination varies with topic complexity, with lower scores in well-grounded scientific domains and higher scores in more abstract domains.

DAL-PCQA: Enabling Distortion-Level and Language-Driven Reasoning for Point Cloud Quality Assessment

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07938v1 Announce Type: new Abstract: Point Cloud Quality Assessment (PCQA) methods typically predict scalar Mean Opinion Scores (MOS), which quantify overall perceptual degradation but do not reveal its causes. In contrast, human observers naturally reason in terms of specific distortions such as blur, color shifts, point density changes, missing regions, and geometric deformations. To close this gap, we introduce DAL-PCQA, a distortion-aware, language-annotated dataset for PCQA. DAL-PCQA augments benchmark point clouds with multi-level distortion severity labels, discrete quality categories, and structured natural language descriptions aligned with human perception. We define a point-cloud-specific distortion taxonomy that covers both photometric and geometric artifacts. Statistical analysis reveals characteristic degradation patterns across distortion types and quality levels. To assess the utility of these annotations, we compare zero-shot and fine-tuned multimodal models for generating perceptual quality descriptions. Experiments show that distortion-aware supervision substantially improves lexical and semantic alignment with ground-truth descriptions. By enabling interpretable, distortion-level reasoning, DAL-PCQA facilitates language-driven, explainable point cloud quality assessment. The dataset is publicly available at https://github.com/swarna96/DAL-PCQA.

Stable Geometry, Reversing Poles: The Bipolar Structure of AI Occupational Substitutability and Its Decade-Scale Inversion

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07939v1 Announce Type: new Abstract: Empirical research on the labor-market impact of artificial intelligence has converged, since Frey and Osborne (2017), on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions. This continuity is rarely articulated as an assumption and has not been tested at the micro-action level where substitution actually occurs. We decompose 1,961 O*NET Detailed Work Activities into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert HITL calibration, then project the DWA-level Occupational Automation Index from our prior work onto a 7-macro semantic typology. The result is a bipolar structure. Tool-Mediated Physical (M2, mean OAI = 0.054) and Planning & Design (M7, mean OAI = 0.499) form two extremes separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e-34). The geometry is robust under three independent stress tests: resolution (K=7 to K=15, polar gap widens from 0.45 to 0.57), encoder swap to BGE (LLM-class OAI lead replicates at 3.37x), and Eloundou's GPT-4 task ratings (DWA-level rho = 0.635). The six middle macros form a low-contrast band between the poles (TOST at d=0.2 admits only 1/15 pairs as equivalent), not a flat plain. The geometry's stability does not, however, extend to its content. Across a decade, the polarity has inverted. Frey-Osborne (2013) placed Tool-Mediated Physical near the highest computerisation risk and Planning & Design near the lowest; our LLM-era OAI reverses that order, with macro-level FO-Eloundou Spearman rho = -0.750, p = 0.020, against the original Oxford Martin appendix. Which pole is high is therefore contingent on the era's dominant capability frontier, while the stable geometry itself is the structurally robust object.

SGTO-MAS: Secure Gorilla Troops Optimization for Multi-Agent LLM Systems

Saeid Jamshidi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07940v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems offer strong capabilities for complex reasoning and decision-making, yet coordination across agents introduces error propagation, security risks, and inefficient use of resources. Existing methods often rely on heuristic, static strategies and lack a principled mechanism for balancing performance, security, and computational cost. This paper formulates multi-agent LLM coordination as a constrained optimization problem and proposes a security-aware method for adaptive agent selection. The method integrates trust modeling, risk-aware evaluation, and collective intelligence within a unified optimization objective. To solve the problem efficiently, we use a swarm-intelligence strategy inspired by Gorilla Troops Optimization (GTO), enabling adaptive coordination under varying threat conditions. Controlled experiments across 500 independent runs demonstrate the effectiveness of the proposed method. The system achieves a stable average performance score of 0.5281, with high consensus (0.8764), controlled risk (0.3000), and compact agent subsets averaging 4.04 selected agents. The optimization process converges efficiently, with an average runtime of 24.09 seconds per run and low score variability (standard deviation = 0.0173). Robustness analysis indicates graceful degradation under perturbations, with performance drops limited to 2.5% under agent removal and 5.3% under consensus disruption. These results show that effective multi-agent coordination can be achieved through structured optimization that jointly manages performance, security, and efficiency. The proposed method provides a practical security-aware solution for coordinating multi-agent LLM systems in complex adversarial settings.

Collective Hallucination in Multi-Agent LLMs:Modeling and Defense

Saeid Jamshidi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07941v1 Announce Type: new Abstract: Hallucinations in large language models (LLMs) create heightened risks in multi-agent settings, where recursive agent interactions can propagate, reinforce, and amplify unsupported claims. This paper models hallucination as a system-level, time-evolving process across a network of interacting LLM agents, where nodes represent agents and edges encode information exchange. The proposed formulation captures how hallucinated claims diffuse through communication topologies, intensify under adversarial perturbations, and affect collective reliability across reasoning rounds. To suppress error propagation, we introduce an interaction-aware control method that combines confidence-weighted aggregation, adaptive impact regulation, external claim verification, and selective isolation of unreliable agents. Experiments on TruthfulQA and TriviaQA show that the proposed method reduces hallucination by up to 39.0% relative to undefended multi-agent reasoning, improves factual accuracy from 0.79 to 0.87, and increases semantic consistency from 0.75 to 0.84. Under adversarial conditions, the method limits hallucination amplification to 1.08, compared with 1.45 without adaptive control, maintaining stable collective behavior across recursive interaction rounds. These results indicate that hallucination in multi-agent LLM systems is governed by both individual model reliability and system-level interaction dynamics, including communication topology, confidence coupling, and recursive information flow.

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07943v1 Announce Type: new Abstract: Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.

Spectrum Aggregation for 6G: Lessons from 5G Carrier Aggregation and Dual Connectivity

Xingqin Lin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07944v1 Announce Type: new Abstract: Spectrum aggregation has been a key enabler of LTE and 5G capacity growth, but it will become even more fundamental in 6G as networks expand across low bands, existing mid bands, new upper-mid/centimetric bands, and millimeter wave bands. This article examines how 5G carrier aggregation (CA) and dual connectivity (DC) inform the design of 6G spectrum aggregation. We argue that, while DC was instrumental in accelerating early non-standalone 5G deployment, it also introduced architectural fragmentation and long-term migration complexity. In contrast, CA provides a cleaner and more scalable foundation for multi-band operation in standalone 6G. Building on lessons from 5G, we advocate enhanced CA as the preferred 6G aggregation framework and point out the corresponding key enhancement directions.

EduMirror: Modeling Educational Social Dynamics with Value-driven Multi-agent Simulation

Jingzhe Lin, Hengbin Yu, Yongdan Zeng, Fangwei Zhong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07948v1 Announce Type: new Abstract: Understanding how educational social dynamics evolve is critical for informing effective educational policies and counterfactual interventions. However, traditional methods face a fundamental dilemma: observational studies often lack causal power, while controlled experiments are frequently constrained by ethical concerns. Although LLM-based multi-agent simulations offer a scalable in silico alternative, existing approaches remain limited by weak psychological grounding and insufficient measurement of latent psychological states. To address this, we introduce EduMirror, a multi-agent simulator for the scientific study of educational social dynamics. We provide configurable education-oriented agent forms, including value-driven agents grounded in psychological needs and social value orientation, together with a dual-track measurement protocol for quantifying observable behaviors and latent psychological states. We validate the realism and usability of EduMirror through case studies on school bullying and group cooperation, as well as broader evaluations across diverse educational scenarios. The results show that EduMirror generates educational social dynamics that are realistic, theory-consistent, and measurable by empirical criteria. These properties enable structured in silico educational research, providing a computational tool for hypothesis testing and counterfactual intervention analysis in educational science. Project page: https://edumirror.net.

The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07950v1 Announce Type: new Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often treats easy, hard, and learnable questions alike through uniform sampling and weighting, leading to inefficient compute allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and the induced token-level update weights. This reveals three recurring dynamics as training proceeds: (1) confidence inflation, (2) advantage contraction, and (3) hierarchical convergence. These findings suggest that the utility of each update depends strongly on both question difficulty and the model's current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty. CoDaPO then uses this value to reweight policy updates and resample high-value learnable questions within mini-batches, thereby increasing discovery within the learnable band under a fixed compute budget. Across twelve benchmarks, CoDaPO consistently improves accuracy over existing RL methods. Our code is publicly available at https://github.com/tmlr-group/CoDaPO.

From `May' to `Is': Certainty Distortion in Language Model Rewriting

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07951v1 Announce Type: new Abstract: Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve it. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75\% of LM outputs and is systematically asymmetric in rewriting tasks with most LMs being 1.5-2$\times$ more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases certainty of 20\% examples after a single iteration, increasing to 40\% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.

Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks,Challenges and Baselines

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07953v1 Announce Type: new Abstract: Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial defect detection remains challenging due to two fundamental limitations: (i) the scarcity of large-scale industrial datasets that cover diverse defect categories across multiple domains, and (ii) the reliance on manual prompts (points, boxes, masks) that introduce subjective noise and lack text-visual interaction for fine-grained understanding. To address these challenges, we introduce a Large-Scale Multi-Modal Industrial Open-Closed benchmark (MMIOC-1M) containing over one million samples across $14$ super-categories, $29$ industrial scenes, and $351$ defect subcategories. To our knowledge, MMIOC-1M is the first unified largest benchmark supporting both open-vocabulary and closed-set industrial detection, providing valuable pre-training data for LVLMs in industrial scenarios. Furthermore, we propose a Refined Text-Visual Prompt Network (RTVPNet) that incorporates three key innovations: (1) an expert-assisted domain projection mechanism that enables rapid adaptation of general vision models to industrial domains, (2) an energy-based sparse sampling strategy that automatically generates refined visual prompts without manual intervention, and (3) a bidirectional text-visual interaction module that enhances cross-modal semantic alignment and understanding. Extensive experiments demonstrate that RTVPNet achieves state-of-the-art performance on MMIOC-1M, LVIS, and COCO benchmarks while maintaining computational efficiency. The dataset and code are available at https://github.com/hellozzk/MMIO.

Minibatch Selection via Partition Matroid Constrained Gradient Matching

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07954v1 Announce Type: new Abstract: Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rely on computationally expensive proxy models to learn continuous domain weights. We propose PartitionSel, a cross-domain minibatch selection approach that maximizes a validation-guided gradient-matching utility under per-domain budgets encoded as a partition-matroid constraint. By coupling the per-domain budgets through a single utility, PartitionSel is designed to reduce redundancy in selections across domains. The proposed objective is weakly submodular and admits an orthogonal matching pursuit algorithm with provable approximation guarantees. Empirically, we evaluate PartitionSel for minibatch selection during the fine-tuning of Qwen2.5 and Llama-3 on MetaMathQA and Mol-Instructions. PartitionSel achieves robust gains over per-domain and domain-agnostic baselines on both benchmarks. It also reduces the number of conflicting gradient pairs within each batch, indicating that the cross-domain coupling translates into more compatible training updates.

Demand-Driven Vulnerability Detection for Cloud Security Posture Management: Removing Human Rule Authoring from the Disclosure-to-Protection Critical Path

Prashant Kumar Pathak — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07957v1 Announce Type: new Abstract: Cloud Security Posture Management (CSPM) systems detect known vulnerabilities by maintaining a rule set, distributing it to customers, and evaluating it against periodically-collected asset inventories. To our knowledge, in publicly documented architectures the rule set is environment-agnostic and curated centrally by the vendor; updates are batched into release cycles and shipped on a cadence ranging from hours to days depending on detection severity. The disclosure-to-protection window -- from a CVE being published to the customer's system being capable of detecting affected assets -- is therefore bounded by the vendor's release cadence for version-match detections, and by additional human authoring time for richer detections incorporating configuration predicates beyond the affected-software string. We propose an architecture in which the rule set is not vendor-distributed but continuously derived, within the customer's tenant, from the intersection of public catalogue feeds and the live asset graph. A rule comes into existence when a catalogue entry and an applicable asset are simultaneously present, and goes out of existence when either input ceases to support it. Derivation is bidirectional: new catalogue entries and new assets both trigger it. It incorporates the full structured-field content of catalogue entries, not only the affected-software predicate. The live rule set is bounded by environment diversity rather than catalogue breadth. Prior systems incrementally evaluate a static rule set; we incrementally derive the rule set itself. We present the threat model, the architecture, formal semantics with an equivalence theorem, complexity analysis, a worked example, and an evaluation methodology. The contribution is the architectural shift and its latency and resource consequences; rule correctness and alert prioritization are out of scope.

Feedback Linearization and Control of a Grid-Forming Power Converter in an Islanded Microgrid

Rene Ebunle Akupan, May-Win Thein, Se Young Yoon — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07961v1 Announce Type: new Abstract: In an islanded setting, grid-forming inverters must regulate their terminal voltage without support from an external grid, even though the load current depends directly on that voltage. The usual approach is a cascaded proportional--integral (PI) controller, built on a fast inner current loop and a slower outer voltage loop, with feedforward terms used to compensate dq rotational coupling. However, this compensation is only exact at the operating point where the controller is tuned. This tutorial presents an alternative based on full-state feedback linearization. It is shown that the islanded inverter model has full relative degree, which allows exact state-space linearization with no internal or zero dynamics. A single feedback law cancels the main nonlinear effects; rotational coupling, resistive drops, and load conductance, so that the closed-loop system behaves like two independent double integrators. A standard pole-placement design is then used to shape the response. The controller is tested in MATLAB against a cascaded PI baseline under identical conditions at a 20 MW operating point, including reference tracking, load step disturbances, and parameter mismatch. The feedback-linearizing controller settles a reference step in 0.76 ms, while the PI controller does not reach the 2 % band within 50 ms. The cascaded PI controller shows better robustness to filter parameter mismatch due to its inner-loop integral action, which reduces steady-state errors under modeling uncertainty. Overall, the performance improvement and the robustness trade-off both come directly from the controller structures, rather than from tuning choices.

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07962v1 Announce Type: new Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in open-world reasoning and understanding. However, a critical ambiguity persists: it remains unclear whether these models genuinely synthesize cross-modal information to construct physically grounded reasoning chains, or if they merely exploit strong language priors to mask single-modality reliance, thereby hallucinating advanced multimodal capabilities. Motivated by this, and to rigorously mitigate language modality bias and shortcuts, we propose a novel multimodal Chrono}logical Physical Dynamics Reasoning Benchmark ChronoPhyBench, which unifies next state prediction with Visual Question Answering (VQA) paradigms by conditioning on historical video context and textual captions to enforce models to deduce subsequent physical states through both single image selection and the inherently more complex task of multiple frame chronological sorting. Concurrently, we construct a large-scale multimodal reasoning dataset curated using the ChronoPhyBench criteria, comprising over 10,000 long-form videos paired with meticulously annotated captions, totaling 5M tokens. Our experimental evaluations reveal a stark contrast to conclusions drawn by previous benchmarks. The capacity of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. Ultimately, this work seeks to systematically stress-test the reasoning capabilities of multimodal models, quantify hallucination rates, and advance the development of Physical AI, thereby providing the community with a robust and transparent evaluation framework toward Artificial General Intelligence (AGI).

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07963v1 Announce Type: new Abstract: Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma~3, and Llama~3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.

What Does Debiasing Really Remove? A Geometric Study of PCA-Based Gender Debiasing in Word Embeddings

Alexey Kresin, Tchifou M. Dieffi, Tomer Caspi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07964v1 Announce Type: new Abstract: Debiasing methods based on principal component analysis (PCA) are broadly used to reduce gender bias in word embeddings used in LLMs, yet it remains unclear what aspects of bias they actually remove and how destructive this process is. These methods are based on the understanding that bias resides in a low-dimensional subspace, with the assumption that most of it can be captured by a few principal components. In this work, we conduct a systematic geometric analysis of PCA-based gender debiasing and investigate what is actually removed from the embedding space. Our experiments across multiple embeddings show that direct gender bias is primarily concentrated in the first principal component, supporting the low-rank bias hypothesis. However, associative bias measured by WEAT does not align with these principal directions and is instead spread across multiple embedding dimensions. Furthermore, as expected, we demonstrate that removing an increasing number of principal components leads to a consistent degradation of the embedding geometry, affecting semantic structure and vector relationships. These results reveal that PCA-based debiasing operates as a trade-off: while it effectively reduces certain forms of direct bias, it fails to eliminate distributed associations and introduces geometric distortion. Moreover, there is no universal optimal level of debiasing, as the balance between bias reduction and semantic preservation depends on the chosen metric and embedding. Overall, our findings suggest that bias in word embeddings is not purely low-rank and that simple subspace removal methods may be insufficient for comprehensive debiasing.

Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07965v1 Announce Type: new Abstract: Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industrial and natural scenes make applying LVLMs challenging. Existing LVLMs rely on user-provided prompts to segment objects. This often leads to suboptimal performance due to the inclusion of irrelevant pixels. In addition, the scarcity of data also makes the application of LVLMs in industrial scenarios remain unexplored. To fill this gap, this paper proposes an open industrial dataset and a Refined Text-Visual Prompt (RTVP) for zero-shot industrial defect detection. First, this paper constructs the Multi-Modal Industrial Open Dataset (MMIO) containing 80K+ samples. MMIO contains diverse industrial categories, including 6 super categories and 18 subcategories. MMIO is the first large-scale multi-scenes pre-training dataset for industrial zero-shot learning, and provides valuable training data for open models in future industrial scenarios. Based on MMIO, this paper provides a RTVP specifically for industrial zero-shot tasks. RTVP has two significant advantages: First, this paper designs an expert-guided large model domain adaptation mechanism and designs an industrial zero-shot method based on Mobile-SAM, which enhances the generalization ability of large models in industrial scenarios. Second, RTVP automatically generates visual prompts directly from images and considers text-visual prompt interactions ignored by previous LVLM, improving visual and textual content understanding. RTVP achieves SOTA with 42.2% and 24.7% AP in zero-shot and closed scenes of MMIO.

DisCo: World Models with Discrete Camera Motion Control

Hongrui Huang, Junke Wang, Quanhao Li, Yu-Gang Jiang, Zuxuan Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07967v1 Announce Type: new Abstract: Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks

Abid Aziz, Hafsa Binte Kibria — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07968v1 Announce Type: new Abstract: Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the user's question, causing denial of service when no final answer is produced and denial of wallet when excess output tokens are billed. Input-side safety classifiers often miss these attacks because the injected prompts can appear syntactically benign. We build RecurGuard, a runtime monitor for detecting reasoning-chain consumption attacks when reasoning traces are exposed by the model. RecurGuard analyzes reasoning traces as they are generated and tracks three signals: recurrence rate, volume growth, and progress toward the user's query. If all three signals remain anomalous over three consecutive chunks, RecurGuard terminates generation early. We evaluate RecurGuard against OverThink and ExtendAttack across open-weight reasoning models and conduct adaptive stress tests on DS-R1-Qwen-7B. On this model, RecurGuard detects 99% of OverThink attacks and 92% of ExtendAttack instances while maintaining near-zero false positive rates on question answering, code generation, mathematics, and summarization. Adaptive evaluation reveals the limit of the defense: topical attacks retain 11.9x amplification with an approximately 50% joint miss rate, whereas full semantic evasion reduces amplification from 22.8x to 2.2x. When reasoning traces are unavailable, QDM provides a post-hoc fallback monitor based on the final output.

Neutrality Bites: Gender Representation in AI-Generated Animal Stories

Imani Finkley, Yuanxi Li, Melanie Walsh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07969v1 Announce Type: new Abstract: Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07970v1 Announce Type: new Abstract: Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.

A Uniformly High-Accuracy PML-BIE Method for Scattering by Periodic Arrays of Obstacles: The 2D Case

Yan Tan, Carlos P\'erez-Arancibia, Tao Yin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07971v1 Announce Type: new Abstract: This paper presents a novel frequency-robust perfectly matched layer (PML) boundary integral equation (BIE) method for solving two-dimensional electromagnetic scattering problems involving periodic arrays of obstacles. In periodic scattering problems, standard BIE formulations based on the quasi-periodic Green's function require the evaluation of lattice sums or challenging Sommerfeld-type integrals, which diverge at Rayleigh--Wood (RW) anomalies. An alternative is to use BIE formulations based on the Helmholtz free-space Green's function, but these are defined on unbounded unit-cell boundaries and therefore require suitable truncation strategies, such as the Windowed Green Function (WGF) method. Although such approaches avoid the use of expensive quasi-periodic Green's functions, they also suffer from breakdowns at RW anomalies unless an appropriate mode correction is incorporated. Similarly, the direct application of PML-BIE techniques to periodic structures experiences comparable difficulties near RW anomalies due to the destruction of exponential convergence near RW anomalies for fixed PML parameters. To overcome this challenge, we propose a modified PML-BIE method that combines the PML technique with a finite-mode correction, ensuring both high accuracy and robustness at and around RW-anomalies. Convergence of the PML-truncated boundary integral operators is proved and several numerical examples are presented to validate the efficiency and performance of the proposed method.

OneFeed: A Unified Generative Framework for Feed ContentEnhancement and Query Generation

Guo Xun — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07972v1 Announce Type: new Abstract: Modern feed recommendation and search systems are deeply connected in user behavior butare usually modeled by separate architectures. Feed recommendation mainly captures implicitinterests from browsing interactions, while search systems rely on explicit user queries to retrieveintent-matched content. This separation causes fragmented user understanding and missedopportunities for using feed interactions to improve query generation and using generated queriesto enhance feed candidate retrieval.In this paper, we propose OneFeed, a unified generative framework for jointly modelingfeed content enhancement and query generation. OneFeed encodes heterogeneous user behaviorsequences with a shared behavior encoder and employs two generative heads: a Feed SemanticID Generator that produces content semantic IDs for recommendation retrieval, and an IntentQuery Generator that produces natural-language queries for search-based candidate expansion.To bridge the semantic gap between recommendation content and search queries, we introduce aSID-Query alignment objective that learns a shared semantic space for content semantic IDs andquery representations. We further design a closed-loop self-enhancement paradigm that leveragesimplicit user feedback from generated content and search-retrieved results to improve bothgeneration tasks. We provide a detailed experimental protocol using public recommendationdatasets with weakly supervised query construction, define a comprehensive set of evaluationmetrics, report expected performance estimates grounded in known baseline values, and validatethe executability of the proposed pipeline through a minimal local prototype. OneFeed providesa practical and extensible direction for unifying search and recommendation through generativemodeling.

PRISM: PRior-guided Imagination Sampling in world Models

Yuhai Wang, Jiawei Xia, Rongxuan Zhou, Xiao Hu, Yongliang Shi, Jing Du, Yang Ye — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07974v1 Announce Type: new Abstract: A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data - and the learned representations of the world model itself - inherently encode the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, making the prior confident where it is and ceding control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.

A Measure-Consistent Operator Learning Method for Infinite-Dimensional Master Equations

Chenyao Wang, Hongyu Liu, Hui Liang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07976v1 Announce Type: new Abstract: Master equations in mean field game theory characterize feedback value functions that depend on time, state (space), and the population distribution. Their numerical approximation is challenging because the unknown is defined on a space of probability measures and the equation involves intrinsic measure derivatives and nonlocal population terms. This paper proposes a measure-consistent operator learning method (MCOL) for infinite-dimensional master equations. The population distribution is represented by an empirical measure and encoded through a symmetric pooling structure, so that the network input is built directly from the particles representing the measure. The same particles are used in the empirical quadrature of the nonlocal residual terms, avoiding additional quadrature grids or auxiliary integration points. A key feature is that the intrinsic derivative appearing in the residual is induced by the same measure-dependent representation that defines the approximation of the value function. Consequently, the value function, its measure derivative, and the empirical residual are tied to a common measure representation, leading to a structurally coupled value-derivative approximation. We also introduce an error decomposition separating neural approximation error from empirical discretization error. Numerical experiments on several master equations show that MCOL accurately approximates the value function, intrinsic measure derivatives, and feedback quantities, and remains robust under changes in the input measures.

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

Xueping Gao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07978v1 Announce Type: new Abstract: Understanding where LLMs store factual knowledge is critical for hallucination mitigation. We systematically quantify Late Crystallization: factual knowledge does not gradually emerge across layers but "crystallizes" abruptly at the final layers. Across five model families (Pythia, Gemma, Qwen2.5, Llama-3.1, Mistral; 0.5--14B), 26.8%--93.4% of correct answers never enter top-10 predictions at any intermediate layer, with late emergence (>80% depth) consistent across architectures. Cross-scale (Qwen2.5-14B) and cross-benchmark (MMLU: 98.2%) results confirm generality; tuned lens rules out probe artifacts. A sentiment-classification control (0.5% for Qwen vs. 85.9% factual; 2.0% for Mistral vs. 26.8%) confirms the phenomenon is specific to factual recall. Late Crystallization yields a crystallization-guided intervention principle: CAA outperforms DoLa on moderate-crystallization models (Llama, Mistral; p<0.001), with a directionally consistent reversal on high-crystallization Qwen (+25.4% vs. +15.5% MC1, p=0.069). LayerNorm ablation shows crystallization is intrinsic to the residual stream; LN scaling (x1.2) yields +11.8% MC1 with zero inference overhead. We further reveal a Computability-Memorization Spectrum: computable knowledge crystallizes earlier (layer 22.1/28) than memorized facts (28.0/28). We release MechLens supporting five model families.

A New Level Set Formulation for Improved Dirichlet Eigenvalue Minimizers

Atharv Thakur — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07979v1 Announce Type: new Abstract: This paper makes several improvements to existing level set based approaches to computing shape optimizers for the Dirichlet eigenvalues subject to a volume constraint. The most notable changes in formulation include an overhaul of the classical level set construction and root-finding procedures as well the use of a regularized approximation to the standard objective function. Our resulting computational minimizers are either comparable to or improvements on the best known minimizers from the literature. We conclude with a survey of subproblems within the field that may benefit from numerical experiments; these include the existence of cusps on the boundary, the end-behavior of eigenfunction weights in the p-parameterized problem, and the nature of Weyl asymptotics as they relate to the P\'olya conjecture.

DeRes: Decoupling Residual Stability and Adaptivity for Scalable CTR Prediction

Wenzhuo Cheng, Shipeng Nie, Qixin Guo, Xuefeng Sun, Jianguo Lou, Zhengwei Zheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07980v1 Announce Type: new Abstract: Transformer-based CTR models face a growing bottleneck at the residual connection: under Pre-Norm, early user-interest signals are diluted layer by layer; the identity skip cannot forget stale interests; and each layer sees only its immediate predecessor, losing long-range cross-layer dependencies. Recent attention-based residual variants (AttnRes) address parts of this in language models, but drop the protective identity skip and have not been tried in recommendation. Drawing on Dual Path Networks (DPN) and the HORNN view of residuals, we present DeRes, which routes each layer through two parallel paths -- an Identity residual path that preserves first-order feature reuse and gradient flow, and a Block Attention Residual path that attends over compressed outputs of all earlier blocks for high-order recall. A vector-wise gate decides, per hidden dimension, the weight given to each path. We further propose Pointwise AttnRes, replacing the Softmax in the cross-layer attention with SiLU so that multiple past blocks can be activated simultaneously and irrelevant ones receive negative (forgetting) weights -- better aligned with CTR's parallel multi-interest patterns. On a large-scale industrial dataset (331M interactions from a major social-media platform), Criteo (45M), and Avazu (40M), DeRes outperforms twelve baselines including OneTrans, TokenMixer-Large, UniMixer, mHC, and AttnRes, achieving up to +0.32% AUC at under 5% extra FLOPs. Beyond a single operating point, DeRes fits a markedly steeper compute-AUC scaling law (gamma=0.118 vs. 0.071 for OneTrans, a 1.66x gap), so an 8-layer DeRes matches a 16-layer OneTrans -- about 2x compute saving at equivalent AUC. Ablations confirm that the dual-path design outperforms either single path, Identity beats learnable residuals, and SiLU beats Softmax.

Overcoming the Limits of Finite Difference Method; Physics-Informed Neural Network for Noisy High-Dimensional Heat Diffusion

Shreesh Bhattarai, Harish Chandra Bhandari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07982v1 Announce Type: new Abstract: High-dimensional transient heat diffusion under noisy boundary conditions exposes a fundamental limitation of classical numerical methods: accuracy degrades catastrophically where physical noise is unavoidable. This paper presents a Physics-Informed Neural Network (PINN) framework as a systematic solution to this problem across one, two, and three spatial dimensions, establishing clear operational regimes that redefine solver selection in noisy thermal systems. Under 20% boundary noise in 3D, PINN sustains approximately 91% accuracy while Finite Difference Method (FDM) collapses to 36%, a clear decisive advantage. This is further confirmed in a physical copper thermal system, where PINN reduces boundary reconstruction error by 3.3 times under realistic noise conditions. This noise resilience is accompanied by a dimensionality-driven efficiency crossover: PINN requires fewer spacetime nodes than FDM in 3D while achieving superior accuracy, exposing the true cost of classical discretization at scale. These findings reframe solver selection: the decisive axis is not accuracy alone, but noise exposure and dimensionality jointly. When noise and dimensionality are both high, the classical solver paradigm is insufficient; this work provides the foundation to justify PINN as the operational standard in such regimes.

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07985v1 Announce Type: new Abstract: Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting the fusion effectiveness and constrain ing robustness in real-world heterogeneous data scenarios. To address these challenges, we propose FMRFusion, a frequency-aware multi-view representation learning network for Heterogeneous Image Fusion. A Multi-Scale Struc tural Perception Module is introduced to effectively capture discriminative structures, extracting fine-grained local structures and essential contextual information. A bilinear frequency decomposition mechanism is employed to sepa rate features into high-frequency and low-frequency components, enabling joint modeling of local details and global representations across different frequency domains. Moreover, a Cross-View Complementary Interaction is incorpo rated to explicitly model and fuse the complementary characteristics between reflected light information and radiative intensity responses, facilitating effective cross-view interaction. We further improve the Performance of the fused results by flow matching, which progressively refines the fused features by learning the transformation from coarse data to high-quality representations. Extensive experiments conducted on multiple benchmark datasets demonstrate that FMRFusion achieves superior and consistent performance across a range of fusion tasks, especially in nighttime scenarios

PAFO: Pareto Fairness Optimization for Personalized Reward Modeling

Xiaoyan Zhao, Haoting Ni, Yang Zhang, Chunyuan Zheng, Haoxuan Li, Fuli Feng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07988v1 Announce Type: new Abstract: Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized reward models aim to capture such heterogeneity, they are often trained on imbalanced user preference data and may therefore favor users whose preferences are more common in the training population. In this paper, we identify this failure mode as personalized reward bias, where reward modeling quality varies systematically with preference support rate. We formulate its mitigation as a Pareto fairness problem over group utilities, aiming to improve under-served users without degrading other user groups. To this end, we propose PAFO, a Pareto fairness optimization framework for personalized reward modeling. PAFO first trains group-specialized reward models for majority and minority preference groups, then constructs conditional margin-level supervision to distill their heterogeneous preference boundaries into a single unified model. The resulting model uses group information only during training and requires no explicit group labels at inference time. Experiments on Personal-LLM and DSP show that PAFO improves both minority-group and majority-group accuracy while reducing user-level unfairness across multiple metrics, demonstrating its effectiveness for fairer LLM personalization.

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

Harshil Patel, Kunal Pai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07992v1 Announce Type: new Abstract: As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: the error-handling loop. We hypothesize that tool error messages possess implicit authority, triggering corrective reasoning modes that bypass standard safety heuristics. We introduce VATS (Vulnerability Analysis of Tool Streams), a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions. Our evaluation across four frontier models, Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder, demonstrates that error-path injection triples the success rate of standard indirect prompt injection (IPI), achieving up to 100% compliance in controlled evaluations. We isolate structural positioning (sandwiching instructions within error context) as the most effective exploit vector across all tested models. While we find that production framework guardrails can mitigate these vulnerabilities, the inherent susceptibility of the model layer poses a systemic risk to bespoke agentic workflows.

The Rising Dominance of Methods Across Science

Alexander Krauss, Ariel Rosenfeld, Lutz Bornmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07994v1 Announce Type: new Abstract: Scientific progress is traditionally narrated through the interplay of theoretical insights and experimental findings. Yet this view of science underplays a third and central pillar of progress: the methods that underlie both conceptual advances and empirical evidence. By analysing more than 3 million articles across science published between 1980 and 2019, we find that science has undergone a fundamental structural transition. The share of papers that primarily contribute new methods-methods papers-has doubled across science over the past four decades, rising universally across disciplines and citation impact levels. Rather than a gradual evolution, this transition marks a pivotal shift beginning in the early 1990s, aligning with the computational revolution and the emergence of data-intensive science. The surge in methodological research is not confined to the most cited, elite publications; it spans the full spectrum of scientific output. These findings reveal a systemic reorientation of the scientific ecosystem where reusable methods increasingly serve as the essential infrastructure of scientific advances, challenging the traditional dichotomy of theory and experimental research. As science becomes increasingly methods-driven, our results call for rethinking how research is evaluated, funded and organised-towards better incentivising method innovations. This is especially the case as expanding AI must be effectively integrated with scientific instruments to realise its full potential.

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07995v1 Announce Type: new Abstract: Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (i.e., customer's search, clicks, purchases, etc.) often span long time horizons over multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance for ShopTrajQA and shows generalization to other complex reasoning tasks.

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07996v1 Announce Type: new Abstract: Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.

Enhancing AI Interpretability and Safety through Localised Architectures

Ian Seet, Jonas Bozenhard, Simon Osterman — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07998v1 Announce Type: new Abstract: Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The power of such architectures is derived not only from the scalability of deep neural networks, but also massively parallel hardware such as GPU clusters. The diffuse nature of deep neural networks gives them great function-approximation capability when provided with sufficient training data but imposes a cost in interpretability and computational efficiency. Observing that localised machine learning (ML) models tend to be more interpretable and computationally efficient than deep neural networks on small datasets, we reason by analogy that similar advantages may apply to specific localised hardware ML architectures. We argue that localised architectures with lower bandwidth but higher expressivity per node have the potential to be fundamentally more interpretable than deep neural networks running on GPU clusters while remaining competitive for smaller datasets. We then evaluate the suitability of various hardware ML paradigms for implementing such localised architectures and evaluate their per-node expressivity, energy efficiency and practical maturity of the technology required.

Efficient Skill Grounding via Code Refactoring with Small Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07999v1 Announce Type: new Abstract: Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differences can render an entire skill incompatible. This challenge is particularly pronounced in embodied settings, where agents must operate in dynamic, partially observable environments without access to large language models (LLMs). In this setting, reliance on LLMs is impractical, while small language models (sLMs) remain insufficient for the effective skill grounding required for reliable long-horizon control. We present RECENT, a refactoring-centric agent framework that enables efficient skill grounding with sLMs by decoupling skill semantics from embodiment- and environment-specific execution binding. By representing skills as executable code, RECENT preserves the semantic intent encoded in a skill's control structure while grounding it by modifying only execution bindings through localized refactoring, rather than regenerating code from scratch. We evaluate RECENT across diverse skill grounding scenarios spanning multiple robot embodiments in dynamic environments, demonstrating robust long-horizon performance when deployed with an sLM. Across all scenarios, RECENT achieves the best performance among sLM-based Code-as-Policies (CaP) methods and matches the task performance of LLM-based CaP.

Summarization is Not Dead Yet

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08000v1 Announce Type: new Abstract: The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

Learning a Semantic Calibration Network for Open-Vocabulary Semantic Segmentation

Yang Sun, Tao Wang, Anastasia Ioannou, Ge Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08001v1 Announce Type: new Abstract: Semantic image segmentation assigns a predefined category label to each pixel, has achieved significant progress lately. Open-Vocabulary Segmentation (OVS) extends the segmentation task from a fixed set to an open set, enabling the identification and segmentation of novel concepts based on arbitrary text inputs, such as category names or descriptions. In this paper, we propose a novel Semantic Calibration Network (SCN) for open-vocabulary semantic segmentation. Different from prior approaches that focus on feature aggregation or simple fine-tuning of pre-trained models, SCN refines the mask classification process by explicitly modeling the semantic correlations between classes, aiming to enhance the model's discriminative power while effectively preserving the generalization abilities of the pre-trained CLIP model. Specifically, SCN comprises two core components: Class Disambiguation (CD) and Logits Fusion (LF). First, a cross-attention mechanism is utilized to transform the text embeddings into visually aware pseudo-text embeddings, in order to derive an enhanced similarity score that complements the original mask-text similarity score. Subsequently, the Class Disambiguation module captures implicit inter-class dependencies through a residual architecture to effectively resolve semantic ambiguities. Finally, the Logits Fusion module dynamically integrates multifaceted semantic evidence to ensure that the model achieves a robust semantic consensus while maintaining CLIP's inherent generalization capability. Comprehensive experimental results on mainstream benchmarks demonstrate that the proposed method achieves significant performance improvements compared to state-of-the-art algorithms.

Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer

M. Fazri Nizar, Julian Supardi, Muhammad Naufal Rachmatullah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08002v1 Announce Type: new Abstract: Underwater instance segmentation integrates pixel-level mask prediction and instance-level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt-based and auxiliary-modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), a compact detection-transformer framework built around the Aqua Boundary-Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features through bounded residual modulation, while auxiliary mask supervision and small-object copy-paste are training-only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state-of-the-art works across category-aware and salient-instance protocols. TensorRT half-precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31-6.34 milliseconds (ms) latency, supporting real-time inference under an accessible reproduction setting.

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08011v1 Announce Type: new Abstract: Although directly prompting off-the-shelf Large Language Models (LLMs) to generate meaning-preserving source rewrites can effectively enhance Machine Translation (MT) quality, doing so requires manually tuning prompts for different MT models. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel RL-based framework for training a source rewriting model without tuning prompts for each MT model. RLSR optimizes the rewriting model by directly using the improvement in downstream translation quality yielded by each rewritten source as the reward. Extensive experiments across six MT models and 16 language pairs demonstrate that our 4B rewriting models trained via RLSR significantly outperform the no-rewriting baseline and existing same-scale prompt-based rewriting baselines, while achieving competitive performance against prompt-based baselines based on the 235B LLM.

The Dodona Protocol: A Living Design Science Experiment in Oracle Design

Giulio Caldarelli — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08012v1 Announce Type: new Abstract: The oracle problem, broadly understood as the difficulty of reliably incorporating external information into blockchain-based systems, has been widely examined by scholars and practitioners. Recent comparative research has shown that several challenges of modern blockchain oracles, including attributability, accountability, integrity, and query design, mirror procedural and epistemic constraints already present in ancient oracular institutions such as the Delphic Oracle. Yet the translation of these insights into applied oracle design remains largely unexplored. This paper introduces the Dodona Protocol, a modular, chain-agnostic oracle service inspired by procedural patterns identified in ancient and modern oracle systems. Named after the Oracle of Zeus at Dodona, one of the oldest oracular sanctuaries in ancient Greece, the protocol operationalizes principles such as structured consultation, access control, attributable resolution, constrained query formats, reputational accountability, and tiered service availability. Its first module implements a query and dispute resolution mechanism in which a named expert resolver provides binding answers to structured questions submitted by petitioners. The oracle does not claim to reveal objective truth; rather, it produces outcomes that parties have agreed in advance to accept. The paper presents the design rationale, architecture, and comparative positioning of the Dodona Protocol. It frames the protocol as a living research experiment within the Design Science Research tradition, where the deployed system functions as the research artifact and operational data support structured analysis, iterative refinement, and peer-reviewed dissemination. In doing so, the paper seeks to bridge the gap between oracle theory and oracle practice.

Evaluating the Impact of Task Granularity on Catastrophic Forgetting in Continual Learning

Emre Alyamac, Himanshu Janmeda, Shashwat Krishna, Yash Vijay — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08013v1 Announce Type: new Abstract: Catastrophic forgetting, the abrupt loss of previously acquired knowledge upon learning new information, remains the central challenge in Continual Learning. This project investigates whether the order in which a model learns information affects how well it retains knowledge. Specifically, we ask: does learning general categories first (like "animals" vs "vehicles") before learning specific classes (like "dog" vs "cat") reduce forgetting compared to learning all classes at once? We test three approaches on CIFAR-100: (1) Coarse-to-Fine: train on 2 super-classes, then expand to 10 specific sub-classes, (2) Fine-to-Coarse: train on 10 sub-classes, then group into 2 super-classes, and (3) Flat: train on all 10 classes from the start. We use Elastic Weight Consolidation (EWC) to prevent forgetting during transitions. Our hypothesis is that learning general patterns first creates a stable foundation that helps the model retain knowledge when learning more detailed distinctions. We evaluate using standard metrics (accuracy, precision, recall, F1) plus continual learning metrics like backward transfer and forgetting rates. This work could inform how we design learning sequences for real-world systems that need to learn incrementally.

GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08014v1 Announce Type: new Abstract: Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

Ziqian Wang, Jiayu Sun, Xingjian Mao, Minqian Wang, Yao Mu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08015v1 Announce Type: new Abstract: We propose Q-Guided Value-Gradient Matching (Q-VGM), an off-policy reinforcement learning (RL) method that tackles a long-standing challenge in fine-tuning flow-matching vision-language-action (VLA) policies: efficiently improving an expressive flow-matching action expert with respect to a learned Q-function. Effective improvement must exploit the first-order (gradient) information of the critic, but this is difficult for flow policies, because directly back-propagating the value through their multi-step denoising process is numerically unstable at VLA scale, while the tractable action likelihoods required by policy-gradient methods are unavailable under iterative denoising. Existing value-based methods either backpropagate through the full denoising chain, use the critic only at test time without updating the policy, or distill critic-improved actions as terminal labels without supervising the velocity field. Q-VGM sidesteps these issues by leveraging VGG-Flow, a value-gradient view of flow alignment in generative modeling that transforms value gradient into a denoising-time value-gradient field rather than an unstable end-to-end objective. This requires no action likelihoods and no backpropagation through the denoising chain, and operates on a fixed replay buffer. The critic is an action-sensitive Cal-QL ensemble over compact RLT features with per-layer action injection. Q-VGM enables a practical few-shot initialization then learn-from-experience paradigm: starting from a few-shot-SFT pi0.5 VLA, the method leverages self-generated rollout data to substantially improve task performance without additional expert supervision. On LIBERO, Q-VGM raises the average success rate from 75.0% to 92.5%; on RoboTwin 2.0, from 76.4% to 87.2%; and on two real-robot tabletop tasks, from 40.0% to 67.5%, outperforming all same-backbone, same-critic baselines across all three settings.

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08016v1 Announce Type: new Abstract: Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

Fluid Antenna System-Enabled Mitigation of Asynchronous Reception in Cell-Free Massive MIMO Systems

Jun Qian, Zan Li, Junhui Rao, Ross Murch, Khaled B. Letaief — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08017v1 Announce Type: new Abstract: Practical distributed deployments inherently suffer from asynchronous signal arrivals, which exacerbate multi-user interference and degrade system performance, especially for coherent transmission. To natively mitigate the asynchronous reception effect, this paper proposes integrating fluid antenna systems (FASs) into distributed cell-free massive MIMO systems, exploiting their reconfigurable spatial positions to release additional spatial degrees of freedom (DoFs). We establish the FAS-enabled data transmission model with asynchronous reception, i.e., delay phases. We also derive the analytical downlink spectral efficiency (SE) performance of the proposed system under coherent and non-coherent transmissions, using low-complexity Maximum Ratio (MR) precoding to provide fundamental theoretical bounds. Specifically, we propose a novel nonmonotone accelerated projected gradient ascent algorithm to jointly optimize FAS positions and power control coefficients, maximizing the downlink sum SE. Numerical results demonstrate that while asynchronous reception severely degrades system performance for coherent transmission, the spatial DoFs unlocked by optimized FAS positions, along with efficient power control, can significantly counteract the effects of unknown delay phases and outperform traditional fixed-position antennas. For non-coherent transmission, which inherently bypasses asynchronous reception, the application of FAS leverages spatial reconfigurability to natively maximize signal strength and achieve more pronounced SE gains. Ultimately, our proposed FAS-enabled system, coupled with efficient power control, mitigates performance degradation due to asynchronous reception and outperforms traditional fixed-position antennas, paving the way for the practical deployment of FASs in robust, highly efficient 6G cell-free massive MIMO systems.

UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08018v1 Announce Type: new Abstract: Existing text-to-SQL benchmarks are largely centered on SQLite, making it difficult to evaluate whether models can generalize across heterogeneous SQL dialects. However, real-world database systems differ substantially in syntax, functions, type systems, and execution semantics, so the same natural language intent often requires dialect-specific SQL realizations. We introduce UniQL, a human-verified benchmark for cross-dialect text-to-SQL evaluation. UniQL aligns 1,534 natural language questions with executable SQL annotations across 16 SQL dialects, yielding 24,544 dialect-specific queries. All dialects share the same intents, aligned schemas and database contents, enabling controlled evaluation of dialect generalization. UniQL is constructed through a hybrid pipeline combining database migration, SQL translation, execution-guided verification, iterative rule summarization, and human validation. Experiments on both open-source and closed-source LLMs show that current models remain far from dialect-universal, with substantial performance variation across database systems and limited transfer from SQLite success to other dialects. These findings highlight the need for aligned cross-dialect benchmarks and more dialect-aware text-to-SQL methods. Code and data are available at https://github.com/JerryGao818/UniQL

Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

Jun He, Deying Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08021v1 Announce Type: new Abstract: As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45--4.12 seconds across the studied risk buckets.

Arabic Sentence Segmentation Across Genres and Punctuation Conditions

Mohammed Elkholy, Khalid N. Elmadani, Nizar Habash, Bashar Alhafni — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08025v1 Announce Type: new Abstract: Sentence segmentation in Arabic is challenging due to ambiguous and inconsistent punctuation, with many texts lacking reliable sentence boundary markers. Existing approaches rely heavily on punctuation cues and are typically evaluated on well-formed text, limiting their robustness in realistic Arabic settings. To address this, we introduce AraSEG, a genre-diverse sentence segmentation corpus spanning eight genres and a wide range of punctuation and document structure conditions. Using AraSEG, we evaluate LLMs, lightweight encoder models, and dependency parser-based models under increasingly challenging segmentation settings. Our experiments show that lightweight encoders, and even dependency parser-based models, outperform LLMs in the most challenging settings. We further investigate the effects of training data size and genre diversity, finding that performance eventually saturates and cross-genre generalization remains challenging. We also demonstrate that accurate sentence segmentation substantially improves downstream dependency parsing. We make our code, data, and models publicly available.

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

Yongqi Jiang, Yansong Gao, Siguang Chen, Anmin Fu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08027v1 Announce Type: new Abstract: Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties without sharing raw samples; however, it remains vulnerable to active sample reconstruction attacks. Existing defenses fail to achieve a satisfactory trade-off between model utility and privacy protection, due to either suppressing task-relevant information alongside privacy-sensitive features or relying on end-to-end supervised training to converge the defense module, which exposes the model to early-epoch vulnerability. To address this challenge, we adopt a structural causal model (SCM) insight and construct CausShield. From a task-learning standpoint, causal features within a raw sample are those that are directly relevant and contributory to the learning objective, whereas non-causal features are task-irrelevant but often encode sample-specific private information, thereby facilitating reconstruction. Importantly, we lay a theoretical foundation to prove this insight. CausShield thus decomposes the shared representations between the client and the coordinating server in VFL into task-relevant and task-irrelevant components to ensure full-cycle privacy protection. Nonetheless, the decomposition is inherently challenging due to the dual objectives of preserving model utility while mitigating privacy leakage. We address this via a carefully formulated optimization problem, which is solved through unsupervised representation learning. We further theoretically prove that CausShield preserves the convergence behavior of standard VFL. Extensive experiments compare CausShield against seven SOTAs, including InvL (USENIX Security'25), and evaluate robustness against advanced reconstruction attacks such as URVFL (NDSS'25). Results demonstrate that CausShield consistently outperforms in privacy protection, model utility, and computational efficiency.

Noise-Adaptive High-Probability Regret Bounds for Online Convex Optimization

Wentao Zhang, Yutong Zhang, Wentao Mo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08028v1 Announce Type: new Abstract: We study high-probability regret bounds for online convex optimization (OCO) with strongly convex losses and establish three results that resolve open questions at the intersection of noise adaptivity, feedback structure, and constraint satisfaction. For the full-information setting with sub-Gaussian stochastic gradients, we prove a noise-adaptive high-probability regret bound in which the martingale deviation term scales with the noise level $\sigma$ rather than the gradient bound $G$, yielding a multiplicative improvement of $G/\sigma$ over the classical Azuma-Hoeffding baseline. Our analysis introduces an exponential supermartingale argument that bypasses the bounded-difference requirement of Freedman's inequality, enabling direct treatment of unbounded sub-Gaussian noise without truncation artifacts. For bandit feedback, we prove a minimax lower bound: the high-probability regret scales linearly in $\log(1/\delta)$, in contrast to the $\sqrt{\log(1/\delta)}$ confidence cost under full information. This constitutes a formal separation in the confidence cost of strongly convex OCO across feedback models. Regarding constrained OCO with stochastic constraints satisfying a Slater condition, we provide simultaneous high-probability guarantees for both cumulative regret and long-run constraint violation, achieving $\mathcal{O}(\sqrt{T\log(m/\delta)})$ regret and $\mathcal{O}(\sqrt{T}/(\zeta\delta) + m\sqrt{T\log(m/\delta)})$ violation. Synthetic experiments corroborate all theoretical predictions.

IntentNav: Learning Spatial-Visual Object Navigation from Human Demonstrations

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08029v1 Announce Type: new Abstract: Object navigation requires a robot to search for an unobserved target in an unknown environment by deciding where to explore next under partial observability. Effective search resembles human-like exploration: selectively probing visually promising frontiers while relying on spatial memory to avoid redundant revisits. We propose IntentNav, a spatial-visual imitation framework that learns human-like ObjectNav policies from human demonstrations. To infer high-level search intent from low-level human actions, we introduce Frontier-based Human-Intent Labeling, which looks ahead in human demonstrations and labels the frontier that best explains the demonstrator's future search direction. We construct a spatial-visual candidate space, where BEV memory tracks explored regions, unexplored frontiers, and trajectory history, while egocentric visual memory provides semantic cues for each candidate. A VLM policy is trained to select among these grounded candidates, using Intent-Aligned Objective to encourage consistent and human-like exploration. IntentNav achieves state-of-the-art performance on the MP3D, HM3D-v1 and HM3D-v2 ObjectNav benchmarks. The proposed candidate-level navigation interface transfers zero-shot to wheeled, quadruped, and humanoid robots without further VLM fine-tuning. \href{https://anonymous.4open.science/w/IntentNav/}{Project page}.

Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems

Eric S. Qiu, Joyce Gill — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08030v1 Announce Type: new Abstract: Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only one response can be delivered to the learner. In this paper, we study how voting protocols shape cooperation among four role-constrained pedagogical agents responsible for scaffolding, misconception, motivation, and metacognition. We compare four voting protocols -- simple, ranked, cumulative, and approval voting -- across two simulated tutoring environments on SciQ and HumanEval benchmarks. Rather than using voting as a simple aggregation step, we use it to analyze how collective decision rules shape coordination under partial pedagogical conflict. Across 1,200 simulated interactions, we find that agent deliberation and voting protocol type frequently change which response ultimately wins, showing that both meaningfully shape the collective decision. Different voting rules also produce distinct coordination behaviors, and even brief tutoring turns show measurable learning gains in simulated students. Overall, we show that protocol choice is associated with distinct coordination patterns among role-specialized pedagogical agents.

Vision-Language Asymmetry in Bistable Image Captioning

Arohan Agate — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08031v1 Announce Type: new Abstract: Wittgenstein's duck-rabbit poses a question for vision-language models: when a model captions an ambiguous image, where in the model is the commitment to one aspect made? We address this with a 3,320-generation behavioral baseline over 83 bistable stimuli that surfaces three regimes (default-dominant, force-dominant, force-balanced) under neutral vs forced-choice prompting, then probe the underlying representations using a TopK sparse autoencoder we train on the CLIP layer that LLaVA-1.6-7B actually consumes (validation EV 0.93). Across 69 bistable stimuli with both per-aspect feature pools available, 72% (50/69) show simultaneous activation of both pools at the vision tower, including 12/12 default-dominant duck/rabbit and 7/8 force-balanced young/old. Causal steering at CLIP layer 22 flips captions on default-dominant stimuli (33% rabbit-flip rate under a fluency guard) but cannot flip captions on force-balanced young/old at any tested coefficient, despite their vision-side superposition. The dominance bottleneck lives downstream of the vision tower; the gap between vision-side representation and language-side commitment is an empirical handle on the seeing/seeing-as distinction. We also flag a methodological note: rank-based statistics on TopK SAE outputs require tie-corrected ranking to avoid silent row-order bias.

Balancing Real and Synthetic Data for CNN-based Masonry Crack Detection

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08033v1 Announce Type: new Abstract: Cracks are a critical indicator of building health, and early stage identification is fundamental to prevent harmful damages. Advances in deep learning (DL), particularly convolutional neural networks (CNNs), have enabled scalable solutions for automated crack detection. However, CNN performance strongly depends on the availability of large and diverse datasets, which is particularly challenging for complex surfaces such as masonry. Collecting sufficient real data is time-consuming, while publicly available datasets may not be adequate. To address this limitation, we explored generating synthetic crack data, which complements real data and improves training effectiveness. The real dataset consists of masonry crack images collected from buildings in Bologna and surrounding areas. In contrast, the synthetic dataset was generated using a crack overlay tool that adds cracks to background images in a controlled orientation and placement. The real dataset was used to train several DL architectures, to identify the best-performing model (InceptionV4) employed for experiments with generated data. Six training scenarios were tested in InceptionV4 by varying the ratio of real and synthetic data, with evaluation performed on a test set composed of real images using the F1-score and mean Intersection over Union (mIoU) metrics. Results show that training on synthetic data plus a modest addition of 20% real data achieves results comparable to training on real data only. Moreover, the 20/80 scenario (synthetic/real) achieved an 76% F1-score and 80% mean IoU, outperforming the real-only case. As can be seen, the method demonstrates the potential of synthetic data to reduce collection efforts while enhancing crack detection accuracy.

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08034v1 Announce Type: new Abstract: Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08035v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context-a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08036v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks with increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that references are extended beyond reliable retrieval capacity. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.

SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification

Hongkyu Koh, Ikbeom Jang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08037v1 Announce Type: new Abstract: Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive strategy for reducing annotation costs. In clinical settings, however, unlabeled pools frequently contain out-of-distribution (OOD) anomalies or diagnostic groups absent from the labeled set. Standard SSL forces incorrect pseudo-labels onto these unseen classes, producing overconfident predictions. To address this, we propose SafeECGMatch, a calibration-aware safe SSL framework for single-label ECG classification under label distribution mismatch. Methodologically, SafeECGMatch employs a dual-branch architecture extracting time-frequency latent representations via ECG-specific augmentations. Crucially, it dynamically aligns confidence with empirical accuracy through adaptive label smoothing and temperature scaling, calibrating both the multiclass classifier and the OOD detector across temporal and spectral domains. This joint optimization allows trustworthy OOD rejection and reliable pseudo-labeling. Evaluated on the PTB-XL and PhysioNet/CinC Challenge benchmarks, SafeECGMatch achieves state-of-the-art accuracy and calibration, advancing reliable knowledge discovery in physiological time-series. Code is available at https://github.com/labhai/SafeECGMatch.

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08038v1 Announce Type: new Abstract: The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting.(2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Manan Tayal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08039v1 Announce Type: new Abstract: Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i)~the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii)~the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii)~the observation space (kinematic state vectors, RGB / depth / segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments, hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template, demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.

Wispy to Voluminous: Prior-free Multi-view Capture of Strand-level Facial Hair

Jaeseong Lee, Giljoo Nam, Adrian Jarabo, Carlos Aliaga — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08041v1 Announce Type: new Abstract: Facial hair is a defining trait of personal identity, yet remains a critical bottleneck for digital avatars. Recent volumetric methods achieve photorealism but bake hair into the underlying face geometry, preventing editability and failing to resolve sparse, strand-like structures. Meanwhile, scalp-hair reconstruction methods target dense hair volumes and do not transfer to the sparse, spatially-varying nature of facial hair. We present a pipeline that automatically reconstructs facial hair -- beard, mustache, lashes, and brows -- from multi-view images, converting an unstructured 3D Gaussian representation into an explicit curve-based strand representation. We resolve geometric ambiguities in four stages: (i) optimizing 3D Gaussians constrained by tracked head geometry to enforce early ray termination and suppress sub-surface noise; (ii) tracing continuous strands robust to frequent crossings and extreme curvature; (iii) grounding strands to the surface and resolving root-tip ambiguity via a physically-motivated prior; and (iv) refining the reconstruction through opacity-driven density control under photometric optimization. To our knowledge, this is the first method to reconstruct high-fidelity facial hair strands from a 3D Gaussian representation. The recovered strands faithfully preserve the orientation and sparsity patterns characteristic of facial hair, and yield assets immediately suitable for downstream production tasks, including facial animation and physical simulation, geometric grooming and transfer, appearance editing, and physics-based rendering.

OmniFaceRig: Fully Automatic Inner-Mouth-Aware Face Rigging Across Diverse 3D Character Topologies

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08043v1 Announce Type: new Abstract: Facial rigging - creating FACS-based blendshapes together with inner-mouth geometry (teeth, gums, and tongue) - remains a major bottleneck in 3D character production. Existing pipelines still require substantial designer effort, especially for manual landmark annotation, per-character template adjustment, and inner-mouth placement. We present OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface-only 3D character mesh, with no pre-modeled oral cavity, into an inner-mouth-aware FACS rig with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. OmniFaceRig supports diverse topologies - humans, humanoids, long-muzzled animals (e.g., dogs, wolves, foxes), and short-muzzled animals (e.g., cats, bears, rabbits, tigers) - with no manual landmarks, no user-provided templates, and no per-asset setup. The pipeline combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer. For non-human characters, OmniFaceRig selects topology-specific face and inner-mouth templates and uses collision-aware inner-mouth fitting to reduce teeth-face intersections without exposing users to category-specific tuning. We also publicly release Omni-Bench, a freely available benchmark dataset of 1,000 biped 3D characters with FACS facial blendshapes and inner-mouth geometry, spanning humans, humanoids, cats, dogs, and other animals. Experiments show high final rigging success on screened Omni-Bench inputs, nearly complete face detection recall from the segmentation ensemble and reliable inner-mouth placement with low penetration. Together, OmniFaceRig provides an automatic path from static generated characters to animation-ready facial rigs across both human and non-human topologies.

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Enyi Jiang, Anders Gj{\o}lbye, Yibo Jacky Zhang, Sanmi Koyejo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08044v1 Announce Type: new Abstract: Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08046v1 Announce Type: new Abstract: We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical--temperate distinctions from map topology alone.

Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08048v1 Announce Type: new Abstract: Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with DLM being the proposal and AR being the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with $5\times$ speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, efficiently advancing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

Amine El Hattami, Nicolas Chapados, Christopher Pal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08049v1 Announce Type: new Abstract: AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.

Automatic, Real-time Classification of User Feedback Using Large Language Models

Jim Maddock, Rose Leitner, Anna Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08050v1 Announce Type: new Abstract: In this paper we discuss an ongoing multi-year project that aims to make open text feedback more accessible and useful to UX practitioners by automating classification and providing real time access to comments, themes, and analysis. By significantly lowering the time and knowledge cost of implementing automated solutions, we aim to effectively democratize our data analysis processes, allowing and encouraging non-technical stakeholders to access and leverage data on their own. We share both the organizational and technical constraints we have encountered over the course of this project, and the solutions we have prototyped as a result of those constraints.

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

Donghao Huang, Tomas Drietomsky, Benjamin Barrett, Zhaoxia Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08051v1 Announce Type: new Abstract: Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

Oline Ranum, Simon Hadfield, Richard Bowden — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08056v1 Announce Type: new Abstract: Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08057v1 Announce Type: new Abstract: Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08059v1 Announce Type: new Abstract: Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce \emph{Perceptive Behavior Foundation Model} (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop \emph{terrain-conformal reference synthesis} (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.

TOMOYO Linux: A Mandatory Access Control Method Based on Application Execution State

Toshiharu Harada, Tetsuo Handa, Masaki Hashimoto, Hidehiko Tanaka — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08060v1 Announce Type: new Abstract: Existing access control methods grant access requests based on the combinations of applications as subject and files as objects. Therefore intents of applications and the possible effects caused by granting the access requests have not been taken into consideration. In this paper, we propose a new access control method based on application history and intents. With our access control method, system administrators can reduce the risks caused by malicious access attempts and wrong operations. In this paper, the concept and implementation design will be explained as well as the brief evaluation report of TOMOYO Linux, our implementation of the new access control method to Linux.

Multidimensional Resilience for Electrical Power Systems: Systematic Review, Integrated Index, and Validation under Real-World Cyber-Physical Attack Scenarios

Isaac Ortega Romero, Ioannis Zografopoulos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08062v1 Announce Type: new Abstract: The accelerating decarbonization of energy systems has transformed electrical power systems into complex infrastructures exposed to threats whose interactions generate systemic vulnerabilities that conventional resilience approaches fail to capture. Although resilience assessment has expanded across multiple dimensions, existing studies largely examine them in isolation or adjacent pairs, leaving cross-dimensional couplings insufficiently explored. This study demonstrates i) that single-dimension assessments fail to capture the degradation produced by simultaneous cross-dimensional failures, ii) the nonlinear amplification emerging when physical, operational, and digital-cyber dimensions are jointly compromised, and iii) the intensification imposed by climatic and economic-regulatory stressors. To this end, we leverage a hybrid quantitative methodology. A PRISMA 2020 review with backward and forward snowballing identifies methodological gaps and unresolved dependencies across five resilience dimensions: physical, operational, digital-cyber, climatic-external, and economic-regulatory. Following this analysis, a Multidimensional Resilience Index (MDRI) is developed to capture endogenous couplings and exogenous amplification effects and is validated under escalating cyber-physical attack scenarios inspired by the December 2025 attack on Polish energy infrastructure. Results show that degradation under cascading and simultaneous failures is nearly eight times greater than under isolated stress, while exogenous conditions amplify degradation by an additional factor approaching six, with 72% of this amplification driven by exogenous stressors. Combined, these mechanisms produce a 46-fold increase in resilience loss compared to a single-vector reference.

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08063v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08064v1 Announce Type: new Abstract: Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense

Zhanke Zhou, Bo Han, Xuan Li, Jiangchao Yao, Sanmi Koyejo, Michael K. Ng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08067v1 Announce Type: new Abstract: Graph neural networks (GNNs) are widely deployed on relational data, yet they can leak sensitive or proprietary information about the training graph adjacency, e.g., social ties, transactions, and interactions. This work studies graph reconstruction attacks (GRA), a form of model inversion that reconstructs the training adjacency from a trained GNN, given different levels of attacker-side information. We first provide a systematic characterization of when and why adjacency becomes recoverable through features, labels, embeddings, and predictions, with leakage modulated by graph homophily, heterophily, and the model's inductive bias. Motivated by these findings, we view GNN inference through a Markov chain approximation lens, treating the layered forward computation as a chain of topology-dependent representations. Building on this view, we develop complementary attack and defense methods. On the attack side, we propose MC-GRA (+), which reconstructs the adjacency by optimizing a surrogate adjacency whose GNN-induced representations align with those of the target model at each layer. On the defense side, we propose MC-GPB (+), which suppresses adjacency-dependent information throughout the representation chain while aiming to preserve classification accuracy under a privacy-utility trade-off. Experiments across homophilic/heterophilic graph benchmarks and GNNs show that our attacks improve reconstruction fidelity over prior methods, while our defenses reduce reconstruction success with only minor accuracy loss.

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08068v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete-information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well-posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium concept with agent- and state-dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout-measurable stability diagnostics. We instantiate this objective in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08071v1 Announce Type: new Abstract: Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25\% random baseline, while the best model reaches 68.1\% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

Esra D\"onmez, Agnieszka Falenska — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08076v1 Announce Type: new Abstract: Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of J\"urgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. On the contrary, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

Mengyuan Sun, Yu Li, Zhuohao Yu, Shikun Zhang, Wei Ye — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08077v1 Announce Type: new Abstract: Rubric-based evaluation is a promising paradigm for judging large language model (LLM) outputs, yet self-generated rubrics lag human-annotated criteria on hard instances. We argue this discriminative gap reflects an objective mismatch: self-generated rubrics describe good responses, whereas effective criteria must discriminate between close candidates. To close this gap, we introduce SVR (Support Vector Rubrics), a framework that recasts rubric construction as max-margin boundary learning over preference data. SVR mines contrastive features from preference pairs into a rubric bank, learns a prompt-conditioned selector together with global rubric weights, and iteratively refines the bank through support-pair selection and adversarial probing of hard negatives. At inference, given only the prompt, SVR retrieves the top-rubrics from the bank and scores responses. On RubricBench, SVR narrows the gap to human reference rubrics from 24.1 to 0.3 points and outperforms strong self-rubric and judge baselines, and the learned bank transfers across judges without retraining. On RewardBench 1&2, and RM-Bench, it remains competitive with dedicated reward models, demonstrating broader reward modeling capability. Overall, boundary-defining rubrics offer a principled route to closing the discriminative gap in LLM evaluation.

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08078v1 Announce Type: new Abstract: Although low-bit quantization provides practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08081v1 Announce Type: new Abstract: Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

When Can Phasor-Domain Device Models Be Trusted for Electromechanical Stability Analysis of Grid-Forming Converter-Dominated Microgrids?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08082v1 Announce Type: new Abstract: Grid-forming (GFM) converter-dominated microgrids are often analyzed using reduced-order phasor-domain electromechanical GFM models, but the validity of these models is often taken for granted. Assuming ideal inner-loop tracking (IILT) of terminal-voltage references, these models neglect the inner-loop and filter dynamics at the electromagnetic-transient (EMT) timescale to simplify stability analysis. This paper argues that such neglected dynamics can destabilize the system, invalidating the stability conclusions drawn from the IILT model. To address this cross-timescale stability issue, we formulate the validity of the IILT stability conclusion as a robust-stability certification problem. The EMT-induced model mismatch between the reduced-order converter model and the actual converter model is represented as a structured uncertainty embedded around the IILT feedback loop. This yields a frequency-resolved interaction index and a structured singular-value sufficient certificate for determining when the stability conclusion of the IILT model can be certified with respect to a prescribed EMT uncertainty weight. The uncertainty weight can be obtained from detailed EMT models or terminal reference-response measurements. Case studies confirm that the proposed certificate correctly certifies model validity and identifies the loss of trustworthiness. We also demonstrate that the measurement-based uncertainty weights closely match the model-based ones, which enables deployment without accessing inner-loop models.

Positive Instantial Neighbourhood logic

Litan Kumar Das, Anupam Khanra, Sujit Kumar Sardar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08083v1 Announce Type: new Abstract: Instantial neighbourhood logic is a modal language for neighbourhood frames in which formulas can express information about the kinds of worlds occurring inside a neighbourhood of a given world. In this paper, we study a positive, negation-free version of instantial neighbourhood logic with two primitive instantial modalities, one of box-type and one of diamond-type. Since classical negation is not available, the two modalities are treated independently. We introduce the language and proof system of positive instantial neighbourhood logic (PINL) and interpret it over persistent two-sided neighbourhood models. We then define a typed persistent neighbourhood semantics, used as an auxiliary canonical semantics to control witness and co-witness conditions. This yields a truth lemma and the corresponding completeness result of PINL. On the algebraic side, we introduce $2$-$\mathrm{DLIO}$s, bounded distributive lattices equipped with two families of instantial operations, as the algebraic semantics of PINL. We prove algebraic soundness and completeness via the Lindenbaum $2$-$\mathrm{DLIO}$. Finally, we construct the canonical bitopological PINL-space and show that the algebra of its admissible positive opens is isomorphic to the Lindenbaum $2$-$\mathrm{DLIO}$. Thus the paper gives a canonical admissible-open representation of positive instantial neighbourhood logic, providing a first step toward a future duality theory.

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08087v1 Announce Type: new Abstract: Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08088v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

Kyoungmin Kim, Martin Catheland, Anastasia Ailamaki — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08090v1 Announce Type: new Abstract: Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08091v1 Announce Type: new Abstract: Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent's skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset is available at https://github.com/JianhuiWei7/VideoWeaver

When Languages Disagree: Self-Evolving Multilingual LLM Judges

Xiyan Fu, Wei Lu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08092v1 Announce Type: new Abstract: Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we instead show that multilingual inconsistency can provide complementary evaluation signals. Our oracle analysis finds that sampling judgments across languages yields a higher performance upper bound than single-language judging, indicating that different languages potentially include complementary judgments. Motivated by this finding, we propose SEMJ, a self-evolving multilingual judge that leverages cross-lingual inconsistency for iterative refinement. SEMJ constructs multilingual variants of each input, collects independent judgments and rationales, and feeds inconsistent outputs back for self-reflection and re-evaluation. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both accuracy and cross-lingual consistency. Further analysis shows that inconsistency triggers useful re-evaluation, which improves judgment quality.

A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08093v1 Announce Type: new Abstract: Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence-based medicine remains under-explored, with primitive attempts restricted to text-only general medicine. In this work, we present PathPocket, a multimodal AI agentic co-pilot designed specifically for evidence grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence from clinical guideline to expert opinion. From this meticulously graded foundation, we build a large-scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi-agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text-only queries to complex multimodal diagnostics involving region-of-interest (ROI) and gigapixel whole-slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real-world cases, where it significantly outperforms existing state-of-the-arts. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence grounded computational pathology.

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08094v1 Announce Type: new Abstract: Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.

Strain localization in softening plasticity without modifying standard constitutive models: a deformable Cosserat approach

Andrea Panteghini, M. B. Rubin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08095v1 Announce Type: new Abstract: This paper presents a formulation for strain localization in softening plasticity based on a deformable Cosserat model. The approach enables the direct use of standard elastoplastic constitutive models formulated for a classical Cauchy continuum, without modifying the stress update algorithm or consistent tangent operator. A key feature of the framework is the strict separation of dissipative and energetic mechanisms: all dissipation is confined to the macro-continuum, while the micro-continuum contributes only through linear elastic terms associated with the director field. As a result, the constitutive structure of the elastoplastic model is preserved, and existing models can be employed as black-box components. The internal length scale arises naturally from the micro-continuum and governs the development, interaction and selection of localization patterns, rather than acting as a diffusive parameter. The formulation is easy to implement within standard finite element frameworks, requiring only additional linear contributions to the residual and tangent operators. The performance of the approach is assessed through benchmark problems involving shallow foundations on soil, a demanding test due to complex and unstable localization mechanisms. Both Tresca and Matsuoka-Nakai plasticity models are considered, including cases with highly unstable post-peak responses. Numerical results show convergence of load-displacement responses, dissipated energy and shear-band patterns upon mesh refinement, even in the presence of nonlinear interacting localization processes. These findings demonstrate a robust and physically consistent approach for the analysis of strain localization in softening plasticity.

Identifying unique developers in OSS projects: A family of models

Ruoyu Su, Alexander Bakhtin, Matteo Esposito, Davide Taibi, Valentina Lenarduzzi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08096v1 Announce Type: new Abstract: Organizational and logical coupling metrics require reliable identification of unique developers. In OSS, commit metadata is limited to names and emails, and the same developer may appear under multiple aliases, which can distort coupling measurements if de-duplication is missing. We aim to build a scalable and accurate pipeline for OSS developer de-duplication and to provide guidance on choosing a model based on precision vs. computational effort. We use Indel similarity as a baseline, then run an LLM-assisted matching process with manual validation to create a large dataset of duplicate identities. Using this dataset, we train and compare classical ML models of different complexity, evaluating precision along with training and inference time and energy. We expect a high-quality dataset and a benchmark of approaches that clarifies which solutions offer the best trade-off between accuracy and cost for large-scale OSS mining.

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

Yasushi Sakai, Allen Song, Kent Larson — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08098v1 Announce Type: new Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two free signals every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes two per-voter levers that consume exactly these signals: WHEN (how much weight a voter keeps on its own pick) and WHOM (how it splits the remainder across peers). We drive WHEN with letter entropy and WHOM with per-question-centered embedding cosine. The method needs no gold labels and no auxiliary training: per question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation: no within-question ensemble of confidence modes closes the oracle gap.

Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation

Kaoruko Shinkawa, Mizuki Nakajima, Taisei Mogi, Yoshihiro Nakata — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08099v1 Announce Type: new Abstract: Remote communication technologies have become widely used; however, supporting a sense of shared physical space and conveying rich non-verbal cues remain challenging in many social interaction scenarios. This study presents "Yui," a full-body cybernetic android avatar designed to integrate operator-side immersive teleoperation with interlocutor-side human-like social signaling. Yui combines a 55-degrees of freedom full-body mechanism with a previously developed android head, facial expression and gaze control, upper-body and arm motion, hand actuation, and a mobile platform. It can be operated through either the immersive mode using a head mounted display-based interface or desktop mode using a webcam-based interface. We evaluated the system through three real-world deployments: a long-term public exhibition at Expo 2025 in Osaka, Kansai, Japan; a remote educational exchange between elementary school students; and a public interaction study with general participants. During the Expo deployment, two units accumulated approximately 1131 h of operation, demonstrating both operational feasibility and maintenance challenges. In the public study, both operators and interlocutors reported positive impressions of co-presence and willingness to use the system. Interlocutors also rated the avatar positively in terms of human likeness and the transmission of emotions and intentions. The results indicate usability for general operators while suggesting room for improvement in precise controllability. These findings provide field-derived evidence and design implications for socially deployable full-body android avatars.

Constraint-Aware Optimization for Robust Protein Stability Prediction

A Shivram, Aneesh S. Chivukula, Manik Gupta, Sourav Chowdhury — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08100v1 Announce Type: new Abstract: Multimodal $\Delta\Delta G$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution (OOD) proteins, persistent forward-reverse bias on paired-mutation benchmarks, and under-representation of rare stabilizing mutations. Existing approaches address these limitations primarily through additional architectural components, leaving optimization-level intervention comparatively underexplored. We introduce a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and a novel OOD-margin consistency loss on the per-position feature representation, requiring no architectural changes to the SPURS backbone. Across eleven benchmarks and three random seeds, the framework improves Spearman correlation on S669 from 0.486 to 0.540 ($\sigma=0.002$ across seeds), matching the published SPURS baseline (0.50) without architectural modification, and on S461 from 0.653 to 0.711, with consistent smaller gains on five additional OOD datasets. A controlled diagnostic on Ssym reveals that anti-symmetric training does not eliminate systematic forward-reverse bias, indicating that gains arise through implicit regularization rather than exact thermodynamic constraint enforcement.

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08102v1 Announce Type: new Abstract: Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.

Revisiting Articulated Parts Perception in Robot Manipulation

Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08103v1 Announce Type: new Abstract: We are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: One line of work uses pose-based representation, which requires high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual efforts, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves an 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08104v1 Announce Type: new Abstract: Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08105v1 Announce Type: new Abstract: When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We show that visually similar sink patterns can reflect two distinct mechanisms: {i} adaptive nop, where a head suppresses its update by routing to a null token, and {ii} broadcast, where a sink aggregates and redistributes global information. In that case, sinks serve an analogous role: a safe destination when there is nothing useful to compute. Proposed interventions like gating or registers work because they implicitly target one or the other, revealing a duality between method and assumed mechanism: gating implicitly assumes nop; registers implicitly assume broadcast. Each mechanism leaves distinct traces (nop sinks exhibit negligible value norms; broadcast sinks induce low-rank outputs) which we formalize on synthetic tasks and use to derive practical diagnostics. Applied to pretrained vision transformers, these diagnostics reveal that both mechanisms exist at scale: sinks transition from CLS in early layers to patches in deeper layers, and concentrate in specialized heads. Strikingly, register tokens, designed for broadcast, are repurposed to also serve nop, confirming that neither intervention alone suffices. Combining gating with registers yields complementary gains in stability and performance. Overall, we find that the same attention pattern can reflect two very different computations and effective intervention requires first asking what the model is actually computing.

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

Zayx Shawn — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08106v1 Announce Type: new Abstract: Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08107v1 Announce Type: new Abstract: Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic manipulation. A promising path forward is to leverage egocentric human data, which can be collected more easily, with greater breadth, and at a larger scale. Towards this end, we investigate key design choices for learning across human and humanoid embodiments equipped with dexterous five-finger hands, using the $\pi_{0.5}$ model as a foundation. Our results show that human data enables robots to learn new task semantics and compose existing skills into novel behaviors without corresponding robot data. The paper website is here: https://egopipaper.github.io/

A Global Convergence Analysis of Consensus ALADIN for Convex Optimization

Xu Du, Shuting Wu, Karl H. Johansson, Apostolos I. Rikos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08112v1 Announce Type: new Abstract: Distributed optimization problems are pervasive in machine learning and optimal control. In this paper, we study smooth strongly convex distributed consensus optimization problems. We present a distributed optimization algorithm for consensus problems based on the Consensus Augmented Lagrangian Alternating Direction Inexact Newton (C-ALADIN) framework. Our algorithm uses an auxiliary variable to decide when to update second-order information, enabling curvature exploitation without sacrificing global convergence. This contrasts with existing C-ALADIN methods, which require constant Hessian approximations and thus lose numerical advantages. Under smooth strong convexity, the algorithm converges globally, and the auxiliary variable converges sublinearly. Numerical experiments on logistic regression show that our algorithm outperforms baseline methods that use either fixed or updated Hessian information.

Conditional Random Ordered Transport Spaces

Lei Luo, Jian Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08113v1 Announce Type: new Abstract: A small Wasserstein distance does not certify that a transformation is admissible. In evidence-constrained, semantic, causal, physical, monotone, or risk-sensitive learning, one must ask not only how far two probability laws are, but whether mass has moved in a direction allowed by available information. We introduce conditional random ordered transport spaces (CROTS), a class of $L^0$-valued spaces of random probability measures equipped with a Wasserstein ambient metric, a closed stochastic order, hard and soft ordered transport discrepancies, and a conditional risk functional for evaluating order violation under an evidence sigma-field. The central object is an order-admissible transport geometry for random measure-valued dynamics, distinct from cone-valued metrics, ordered Kantorovich constructions, random Wasserstein spaces alone, and model-specific residuals for generative paths. We develop the foundations of CROTS as a space theory for reliable distributional learning. The results include well-posedness and duality for hard and soft ordered transport, soft-to-hard variational convergence, measurability and completeness of the random lifted space, reductions to classical Wasserstein and ordered geometries, ordered geodesics, constrained barycenters and projections, conditional risk-transport duality, and separation of order-violating distributions. The main stability theorem shows that random learning dynamics may converge in the ambient Wasserstein metric while its local admissibility leakage follows a separate conditional order-risk recursion. The resulting asymptotic order-risk floor provides a mathematical language for evidence overreach, ordered distribution shift, robustness failure, and admissible distributional dynamics.

Blockage-Aware Non-stationary Dynamic Bandit for User Association in mmWave V2X Networks

Weiqi Chi, Manabu Tsukada — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08118v1 Announce Type: new Abstract: In millimeter-wave (mmWave) vehicular networks, dense base station (BS) deployments expand the user association (UA) decision space while dynamic blockages cause link quality fluctuations, posing critical challenges for effective mobility management. Traditional Multi-Armed Bandit (MAB) frameworks assume stationary reward distributions and fail to handle the rapid context-reward mapping shifts caused by vehicle mobility and transient blockages. To address this, we propose Blockage-Aware Non-stationary Dynamic Bandit (BAND), a fully distributed, channel state information (CSI)-free mobility management framework for mmWave vehicular networks, formulating UA as a non-stationary contextual bandit problem, enabling online adaptive optimization without requiring central coordination or offline training. BAND employs a cumulative sum-based change detection (CUSUM-CD) to dynamically narrow the active BS set, reducing exploration overhead while tracking reward distribution shifts. Proactive blockage detection suppresses transient signal degradation in the reward estimation process. Simulations demonstrate over 40% regret reduction and up to 33.1% network communication rate improvement compared with hypercube-based contextual bandit baselines, with robustness validated across varying blockage rates and network configurations.

Policy Description Language for Authorization using Logic-Based Programming

Masaki Hashimoto, Mira Kim, Hidenori Tsuji, Hidehiko Tanaka — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08119v1 Announce Type: new Abstract: Recently, with the impossibility of eradicating the vulnerabilities of information systems, we must prepare for the occurrence of the security incident by the multi-layer defense called the Defense-in-Depth strategy. In the multi-layer defense, it is important to authorize accesses in fine-grained granularity to compose each layer effectively, and many access control models are proposed to follow them. However, policy description languages proposed so far cannot express the models appropriately in proper granularity. In this paper, we propose a policy description language which can designate many kinds of conditions for access control, such as the dynamic status of an application process, as an element of decision data, and implement it in Datalog. Using the proposed language, we compose the policy of SELinux, which is a major implementation achieving the multi-layer defense, and we confirm the advantages of the proposed language by evaluating its validity and expressiveness.

Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

Fatemeh Ziaeetabar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08121v1 Announce Type: new Abstract: Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.

Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08122v1 Announce Type: new Abstract: Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly and instead, they usually first form a traveling intention and then accordingly select specific POIs. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users' intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and the temporal contexts. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms the next POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.

Human-Centered Benchmarking of Driver Monitoring Models

Ruben Dario Florez-Zela — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08123v1 Announce Type: new Abstract: Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.

Soft Covering via Hypothesis Testing: Typical-Code Exponents and Mismatched Detection

Neri Merhav — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08124v1 Announce Type: new Abstract: We study the typical-code (quenched) behavior of the false-alarm (FA) and missed-detection (MD) error exponents of the Neyman-Pearson test associated with soft covering, complementing the average-code (annealed) analysis that has been carried out in a companion paper [1]. We prove that, as the block-length tends to infinity, for almost every randomly selected fixed-composition codebook, the negative normalized logarithms of both error probabilities converge to their respective average-code exponents. In other words, the error exponents are self-averaging. We then extend the scope and study a mismatched likelihood ratio test that assumes the wrong channel model. Here, we derive the mismatched error exponents, show that self-averaging persists under mismatch, and characterize the degradation. In particular, we characterize the coding rate beyond which the two kinds of error exponents cannot be positive at the same time, which in the matched case, is given by the channel input-output mutual information rate.

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08126v1 Announce Type: new Abstract: Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.

Gray-Box Optimization and the Vertex Coloring Problem

Johanna Gasse, Antonia Heinen, Hendrik Higl, Timo K\"otzing — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08128v1 Announce Type: new Abstract: Gray-box optimization is an approach for making some problem-specific information available to the algorithm while still relying on fitness information as the main guide to an optimum. This approach was shown to be beneficial in various combinatorial optimization tasks and neatly captures the continuum between fully black-box algorithms and tailored algorithms. In this work, we discuss different flavors of gray-box algorithms. We show that RLS can find a proper $2$-coloring in a bipartite graph starting from a random $2$-coloring, in an expected time of $\mathcal{O}(n \log n)$. In contrast, when starting from a proper $n$-coloring, the (1+1) EA cannot find such a coloring except when offered additional guiding on plateaus of the search space. Finally, we show the run time for this setting can be much improved by using gray-box operators.

Cross-LLM Consistency in Inference: Evidence from Shared Interactions

Siyu Lou, Yao Yan, Yuntian Chen, Quanshi Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08129v1 Announce Type: new Abstract: Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar internal inference patterns. In this paper, we examine this hypothesis using interaction-based explanations. We find that LLMs often share interaction patterns when predicting the same target token from the same prompt. This consistency is more pronounced among advanced LLMs. Shared interactions also tend to be lower-order and show weaker positive-negative cancellation than non-shared interactions. These results suggest that advanced LLMs may be implicitly optimized toward common inference patterns, even though the mechanisms that give rise to such cross-model consistency remain open.

How to be Non-Human : A Thematic Analysis of Animal Embodiment in VR Games

Siqi Yu, Shuai Liu, Yiqing Tian, Mar Canet Sola — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08130v1 Announce Type: new Abstract: This study employs a reflexive thematic analysis to systematically examine the design patterns of 48 first-person Virtual reality (VR) animal avatar games. The research identifies four primary design themes: Animal Biomimicry, Limited Animal Simulation, Hybrid HumanAnimal Features, and Human Behavior with Animal Avatar. The analysis reveals that approximately 77 percent of the games remain grounded in human-centered interaction logic, with animal forms primarily serving as visual representations. The study highlights the core tension between authenticity and usability in current VR animal avatar design, and points toward design opportunities for achieving more authentic animal avatar's interactive experience through directions such as controller innovation, unconventional body mapping, and dynamic feedback. This research provides a thematic classification framework for understanding the representation of non-human perspectives in VR games.

LCAM: A Framework for Diagnosing Interactional Alignment Failures in Con-versational AI

Manuele Reani, Hongyu Tian — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08131v1 Announce Type: new Abstract: Conversational AI is increasingly used for advice, interpretation, reassurance, and decision support in contexts where users may be vulnerable, uncertain, or dependent on the system's apparent competence. Existing alignment work often focuses on model objectives, preference optimization, or output correctness. Yet, many harms arise through interaction: how systems frame authority, express uncertainty, simulate empathy, support reasoning, and make boundaries legible. This paper introduces the Layered Cognitive Alignment Model (LCAM), a conceptual and normative framework for diagnosing interac-tional alignment failures in conversational AI. LCAM defines alignment as a calibrated fit among system behavior, user goals, task demands, and normative context. It distinguishes five layers of fit: perceptual, semantic, affective, cognitive, and ethical, and two diagnostic polarities of misalignment: underfit and overreach. We apply LCAM to a published LLM counseling example, showing how an apparently supportive response can reinforce harmful beliefs, simulate inappropriate care, and obscure role boundaries. By translating conversational failures into audit and governance questions concerning over-reliance, false intimacy, autonomy erosion, boundary confusion, and inappropriate trust, LCAM offers a theoretical and normative lens for evaluating conversational AI beyond accuracy, helpfulness, or trust.

Phase Marginalization for Patch-Grid Instability in Vision Transformers

O\u{g}uzhan Ercan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08132v1 Announce Type: new Abstract: Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.

Gravity-guided Contact Dynamics Estimation from 3D Human Motions

Cuong Le, Urs Waldmann, Bastian Wandt, M{\aa}rten Wadenb\"ack — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08133v1 Announce Type: new Abstract: Ground contact forces acting on the human body, are crucial for biomechanics studies or sport performance analysis. Prior methods rely on force plates or pressure mats to collect ground contact dynamics, limiting their applicability to carefully controlled settings. A more scalable solution is to estimate the dynamics directly from motion capture data. Recent approaches only roughly estimate the ground contact dynamics from the vertical distance between the body and the ground plane, which cannot capture the complex pressure distribution of all contact points. To this end, we propose GraCE -- Gravity-guided Contact Dynamics Estimation, a novel full-body contact dynamics model for human motions using a realistic influence of body mass distribution and gravity. We use the human's center of gravity to estimate the ground contacts based on its relative distance to the human body. The applied force on each contact is estimated via the product of predicted contact probabilities and the total exterior force computed from the center of mass trajectory. We outperform related work on the GroundLink dataset for ground reaction force estimation, and on the MOYO dataset for detailed contact pressure prediction. The code is published upon acceptance.

TICoder: A Repository-Level Code Generation Framework with Test-Driven Planning and Implementation-Aware Reuse

Siyu Nan, Yaling Luo, Jian Wang, Neng Zhang, Bing Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08135v1 Announce Type: new Abstract: Repository-level code generation with Large Language Models (LLMs) remains challenging, primarily due to complex dependencies and limited context windows. Recent approaches adopt retrieval-augmented generation (RAG) and the planning mechanism to reuse potential callee functions in the repository. However, these approaches often suffer from two limitations: lack of test-driven behavioral guidance during planning and overlooking the implementation logic embedded in repository code during reuse. As a result, generated plans may not align with expected behaviors, and retrieved functions may not be effectively reused. In this paper, we propose TICoder, a novel repository-level code generation framework that improves both planning and reuse. TICoder introduces a test-driven iterative planning mechanism that leverages test cases as behavioral specifications to refine implementation steps. Furthermore, TICoder employs an implementation-aware code reuse strategy, which retrieves potential callee functions using a dual-view similarity that captures both functional and implementation aspects. We then identify relevant usage patterns through a dual-stage selection strategy, combining structure-based clustering and perplexity-based filtering. We conduct extensive experiments on widely used repository-level code generation benchmarks with various LLMs. Experimental results demonstrate that TICoder outperforms state-of-the-art (SOTA) methods, achieving an average improvement of 11.52%.

Learning Predictive Control with Deep Koopman Operators for Autonomous Vehicle Motion Planning

Xinglong Zhang, Yongqian Xiao, Haotian Cao, Xing Zhou, Xin Yin, Xin Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08136v1 Announce Type: new Abstract: Model Predictive Control (MPC) is widely used for autonomous-vehicle (AV) motion planning, but its real-time applicability is often limited by the need for accurate models and online solution of nonlinear, nonconvex optimization problems in dynamic road environments. Actor-critic reinforcement learning offers a promising alternative for online policy generation, yet its policy-learning process often lacks explicit control-theoretic structure. This article proposes a learning predictive control (LPC) framework with deep Koopman operators for efficient real-time motion planning under nonconvex constraints. To address nonlinear and uncertain vehicle dynamics, a deep-Koopman-based predictor is used to lift the system into an interpretable linear observable space in a data-driven manner. Unlike traditional MPC, which computes open-loop control sequences, the proposed LPC framework yields a closed-loop state-feedback policy within each prediction interval through receding-horizon actor-critic learning. To ensure safety under nonconvex environmental constraints, LPC constructs convex local surrogate representations of obstacles and defines corresponding potential-field functions. These functions and their gradients are directly embedded into the actor-critic structure, enabling efficient, safety-aware policy learning. Extensive simulations and real-world experiments on the HongQi-EHS3 platform demonstrate favorable performance in diverse obstacle-avoidance scenarios in terms of safety, computational efficiency, and driving comfort, compared with benchmark methods such as CBF-MPC and LMPCC.

A Barrier-Modulated Architecture for Safe Affine Formation Control in Second-Order Multi-Agent Systems

Ashik Abrar Naeem, Mohammad Ariful Haque — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08137v1 Announce Type: new Abstract: Affine formation control offers immense flexibility for coordinating multi-agent maneuvers, but guaranteeing the safety of agents under parametric uncertainties remains an open challenge. This paper proposes a novel safe affine formation control framework for second-order multi-agent systems by integrating Higher-Order Control Barrier Functions (HOCBFs) with Adaptive Dynamic Programming (ADP). We introduce a barrier-modulated control architecture that smoothly attenuates the nominal formation tracking objective when agents approach safety boundaries, preventing conflicting control inputs. Within this architecture, two distinct safety controllers are developed: (1) an analytical barrier-gradient repulsive controller that provides a computationally efficient, rigorous mathematical baseline, and (2) a data-driven optimal safety controller. The data-driven approach utilizes an actor-critic neural network to solve the Hamilton-Jacobi-Bellman (HJB) equation online, enabling optimal collision avoidance even in the presence of unknown system parameters. Using Nagumo's theorem and Lyapunov stability analysis, we formally prove that both controllers guarantee the forward invariance of the safe set ensuring absolute collision avoidance while maintaining Uniformly Ultimately Bounded (UUB) formation tracking errors. Finally, simulations validate the theoretical findings and demonstrate the robustness of the proposed controllers in dynamic obstacle avoidance scenarios.

TRUST-SCF: Transformer-based Risk Understanding and Scoring for Transactional Supply Chain Finance

Mohammadamin Davoodabadi, Amirabbas Shakeri — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08140v1 Announce Type: new Abstract: Supply Chain Finance (SCF) and LendTech platforms need credit scoring systems that respond to evolving transaction behavior, repayment delays, and active exposure. We propose TRUST-SCF, a transformer-based framework for transaction-level risk prediction and dynamic credit scoring. Each user history is represented as a sequence of transaction tokens containing utilization, repayment delay and transaction position. The main contributions are: (1) a financially aligned attention bias that combines utilization similarity and recency, enabling the model to compare repayment behavior under comparable exposure conditions; (2) continuous repayment-delay prediction in a log-transformed target space, reducing the influence of extreme delays while improving sensitivity to short-delay behavior and (3) a label-efficient credit-scoring pipeline in which the final credit score is not trained using any explicit external credit-score label, but is instead derived from predicted delay, potential risk over simulated utilization, actual unpaid exposure, and nonlinear calibration. Experiments on real transaction data from more than 300,000 transactions show that TRUST-SCF improves delay prediction over sequential baselines and produces scores that are strongly associated with future repayment behavior. These results suggest that TRUST-SCF is a practical framework for adaptive credit scoring and transaction-level risk mitigation in SCF and LendTech environments.

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, Yupeng Hu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08144v1 Announce Type: new Abstract: Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08146v1 Announce Type: new Abstract: Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fall at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and the design of general-purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins $96.00\%$ of method--dataset comparisons and improves F1 by an average of $40.86\%$ over baselines. The code is available at https://github.com/yichenC1c/SAGE.

Property-Informed Diffusion-Based Text-to-Microstructure Generation

Bingxuan Dai, Hongsong Wang, Jie Gui — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08150v1 Announce Type: new Abstract: Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

Xinyu Guan, Qianyang Zhao, Yuming Deng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08151v1 Announce Type: new Abstract: Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surfaced at action time. We present CICL, a decision-aware context layer that turns instance evidence into a context graph, routes deterministic, Opus-assisted, Qwen, Codex/GPT-5.5, and Qwen-QLoRA judgments through a shared eight-field schema, scores units by action shift, outcome uplift, necessity, and negative-transfer risk, and packs high-utility evidence as typed memory cards for a budgeted agent. The design separates the measured decision signal from the judge model, so frontier annotation, local surrogates, and lightweight rankers can be compared under one auditable protocol. Empirically, CICL yields a concrete open-benchmark gain while exposing its limits. On 50 SWE-bench Verified file-retrieval instances, direct Qwen3.6-plus reranking of BM25 top-50 candidates raises hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show action-criticality: at budget 120, CICL reaches F1 0.620 on v1 and 0.425 on v3, and removing the top-utility semantic v3 unit collapses F1 to 0.000. Supplementary checks add Qwen-QLoRA agreement over 710 candidates, a small 200-label real-code Opus-assisted signal, and a three-instance patch smoke validating retrieval-to-patch plumbing without claiming official SWE-bench success. RepoBench-R summaries still beat cards, and compact rankers do not yet replace the heuristic. CICL contributes a reproducible measurement and selection layer for decision-critical context, not an end-to-end coding-agent repair claim.

Vision-Guided Dual-Arm Humanoid Robotic Disassembly of End-of-Life 18650 Lithium-ion Battery Packs

Yile Chen, Zhihao Liu, Xi Vincent Wang, Lihui Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08152v1 Announce Type: new Abstract: The growing volume of retired lithium-ion battery packs from electric vehicles and portable electronics calls for automated disassembly that is safe, flexible, and selective down to the individual cell. Existing robotic systems, however, mostly assume known pack poses, external fixtures, or specialised tooling, leaving fixture-free cell-level disassembly under pose uncertainty largely unsolved. This paper presents a vision-guided dual-arm pipeline that disassembles a 21-cell 18650 pack from an arbitrary initial pose using only general-purpose parallel-jaw grippers, RGB-D sensing, and a pre-trained grasp detector. Pose uncertainty is absorbed by a learn-and-filter perception stack with discrete look-and-move wrist-camera corrections, while a mid-task support transfer between the two arms extends the effective workspace without any external clamp. The pipeline achieves an 8/10 end-to-end success rate, a cell-localisation root-mean-square error of $2.4$\,mm, and a mean cycle time of 6.0\,minutes per pack, providing a practical, fixture-free building block for industrial battery recycling.

LogNEO: A GPT-Neo Reinforcement Learning Framework for Accurate Real-Time Log Anomaly Detection

David Eje, Tanmay Sharma, Khush Patel, Manuel Mazzara, Leonard Johard — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08153v1 Announce Type: new Abstract: Detecting anomalies in large-scale system logs is critical for the reliability and security of modern computing infrastructure. We present LogNEO, a log anomaly detector built on EleutherAI's GPT-Neo (1.3B parameters) and fine-tuned with a novel partial-credit, exponentially decaying position-aware reward scheme combined with cross-entropy regularisation via Proximal Policy Optimisation (PPO). The position-aware reward explicitly models prediction difficulty: early positions receive higher rewards for correct predictions, while later positions incur stronger penalties for errors. LogNEO attains F1-scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improving recall by up to 6 percentage points over the prior state-of-the-art LogGPT while maintaining comparable precision. A production microservice deployment over Apache Kafka, Redis, and TensorRT-accelerated inference demonstrates 45 ms end-to-end latency at 15,000 events per second.

SynthICL: Scalable In-context Imitation Learning with Synthetic Data

Cheng Qian, Ruomeng Fan, Yifei Ren, Yilong Wang, Edward Johns — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08154v1 Announce Type: new Abstract: In-context imitation learning (ICIL) enables robots to learn new tasks from a small number of demonstrations by conditioning a pre-trained policy on task-specific examples, without retraining at test time. Despite this promise, training generalizable and scalable in-context imitation policies remains an open challenge. We present SynthICL, a scalable framework that trains ICIL policies entirely from RGB-only synthetic data. Specifically, we build a data generation pipeline to produce high-fidelity ICIL data and train a flow-matching transformer policy on the resulting dataset. SynthICL avoids the need for depth sensing, precise camera calibration, and real-world training data in prior approaches, offering a simpler and more scalable alternative. We further incorporate subgoal prediction by training the model to predict the next subgoal images, enabling more precise and visually grounded control. Evaluated on 16 unseen real-world manipulation tasks, SynthICL achieves an average success rate of 79% with only one demonstration provided at test time and outperforms prior methods. Project page: https://synth-icl.github.io

Have I Solved This Before? Retrieving Similar Segmentation Problems for Evolutionary Learning

Andreas Margraf, Henning Cui, J\"org H\"ahner — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08155v1 Announce Type: new Abstract: Reliable integration and solid configuration of monitoring systems constitute a fundamental prerequisites for achieving high efficiency and productivity in contemporary manufacturing environments. Design decisions on sensor type and system architecture have to be made at an early stage and under comparably high uncertainty. This work investigates a research direction that deviates from the traditional monitoring-system development process by shifting the attention from algorithm design to a deeper analysis of the inspection problem. In contrast to traditional design cycles, this paper proposes to gradually collect knowledge and store it in an abstract system model. This enables the retrieval of similar solutions for future use cases, preventing the need for expensive model training from scratch and allowing instead for the incremental refinement of existing base configurations. Reuse of previously generated pipelines reduces the risk of late and costly revisions. As there is little knowledge on cross-domain transferability of filter pipelines, this study analyzes the potential of retrieving filter pipelines to transfer them to different but similar segmentation problems. Finally, we statistically analyze the benefits of this `transfer learning' variant which is predominantly applied to image segmentation problems. In addition, we discuss how simple models help balancing the trade-off between complexity, technical requirements, and reliability in the design process.

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

Kyumin Choi, Ikbeom Jang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08156v1 Announce Type: new Abstract: Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.

Cross Paraphrastic Invariance Learning for Hallucination Detection

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Sihong Xie, Xiangwen Liao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08157v1 Announce Type: new Abstract: Large language models (LLMs) frequently generate hallucinations, which are unsupported by a source document. To avoid costly LLM-as-evaluator pipelines and the heavy annotation demands of existing classifiers, we propose CPIL (Cross Paraphrastic Invariance Learning), a two-stage Siamese framework that maximizes the utility of existing labeled data. Concretely, CPIL constructs informative training pairs by: (i) generating paraphrastic views of each document-claim example as positives, and explicitly aligning their representations to enforce invariance to surface form; and (ii) mining same-document, opposite-label pairs as hard negatives to sharpen document-sensitive decision boundaries. Then CPIL conduct a two-stage model training: Stage 1 performs contrastive pretraining to learn a paraphrase-invariant, grounding-aware embedding space; and Stage 2 attaches a lightweight classifier for binary groundedness. On the LLM-AggreFact benchmark (11 tasks), CPIL surpasses strong baselines concerning F1 scores with only ~1% labeled data, showing its prediction superiority and label efficiency.

Constrained Paraphrase Consistency for LLM Hallucination Detection

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Xi Zhang, Xiangwen Liao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08158v1 Announce Type: new Abstract: Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior work largely enlarges training sets via synthesis or new annotations, introducing increasing cost and potential bias while underusing the consistency implied by semantically equivalent paraphrases. We propose Consistency-Constrained Hallucination Detector (CCHD), which formulates training as a constrained optimization problem. The standard cross-entropy on original document-claim pairs is complemented by (i) paraphrase-consistency constraints bounding divergence across paraphrased views, and (ii) label-preservation constraints tying paraphrases to ground truth. We solve the problem by gradient descent-ascent over model parameters and per-view Lagrange multipliers, adding only a few scalar dual variables and no inference-time overhead. With DeBERTa and Flan-T5 backbones, CCHD consistently outperforms strong baselines (FactCG, MiniCheck, and AlignScore) on standard factuality benchmarks, demonstrating its superiority on hallucination detection.

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08161v1 Announce Type: new Abstract: As capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, a growing trend emerges to develop deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67\%/3.99\% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6$\times$/5.7$\times$ lower self/coupling error and 192$\times$ faster inference speed. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers highly practical value for modern EDA workflows. Code and data are available at https://github.com/THU-numbda/AttentionCap.

Silent Failure in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents

Dexing Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08162v1 Announce Type: new Abstract: Large Language Model (LLM) agent systems suffer from failures that occur without external triggers -- no injection, no adversarial input, no resource exhaustion. These silent failures -- unexpected deviations from intended behavior under normal conditions -- are routinely misattributed to bugs or configuration errors. Through systematic analysis of over 40,000 controlled trials and long-term production observations spanning 100,000+ agent interactions, we identify a common structural logic underlying these failures. Building on patterns observed in our experiments, we survey the global research literature on autonomous agent reliability and synthesize 22 intrinsic properties of LLM agent systems across six lifecycle layers: foundation semantics, inter-agent transmission, memory persistence, task execution, feedback correction, and systemic evolution. We demonstrate that whenever a sufficient subset of these properties co-exist, system entropy -- the measurable accumulation of disorder: loss of output consistency, task accuracy, and cross-session coherence -- increases monotonically with interaction rounds. We formalize this as the Entropy Principle: S(t) = S0 * e^(alpha * t), with alpha measured empirically across multiple architectures. We propose the PIG (Physical Integrity Gate) Engine with the ADE (Agent Delivery Engineering) protocol suite as an engineering countermeasure to entropy-driven disorder. Our findings establish silent failure not as a bug to be fixed but as a manifestation of Intelligence Entropy -- a physical constraint to be managed through deterministic governance. We argue that any engineering effort stabilizing the structure and order of agent systems participates in a unified mission: keeping intelligent systems reliable as they grow in scale and complexity.

How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08164v1 Announce Type: new Abstract: MRI preprocessing defines the input distribution seen by brain MRI foundation models, yet it is usually treated as routine data cleaning rather than a modeling choice. We ask how much preprocessing is worth its computational cost for self-supervised 3D MRI pretraining. Keeping the corpus, 3D ViT backbone, masking protocol, and downstream evaluations fixed, we compare a graded P0-P7 preprocessing spectrum for masked autoencoding (MAE) and joint-embedding predictive learning (JEPA) on 20,000 heterogeneous brain MRI volumes, then transfer the encoders to IDH prediction, MCI classification, brain age regression, and GLI/PED tumor segmentation. The results do not support a simple "more is better" rule. P0/P1 are numerically unstable, making P2 the lowest-cost feasible level; beyond P2, choosing the best feasible preprocessing level improves aggregate utility by only 3.4 percentage points for MAE and 1.8 percentage points for JEPA, with most paired gains statistically unresolved. Stronger preprocessing is beneficial only in selected regimes: IDH improves modestly, AGE and GLI/PED are often near or best at P2, and MCI shows the clearest empirical P7 gain. Cross-level MCI transfer further shows that much of the P7 advantage can be recovered by applying stronger preprocessing downstream, without requiring P7 throughout pretraining. These findings recast MRI preprocessing as a downstream-aware cost-utility decision rather than a default escalation pipeline. Code is available at https://github.com/PangJiangShuan/PreBrain.

Explaining Data Mixing Scaling Laws

Rui Dai, Shuran Zheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08167v1 Announce Type: new Abstract: Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: \textit{Capacity Competition}, where the allocation of finite model capacity couples domain losses globally, and \textit{Noise Reduction}, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.

Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR

Kerri Prinos, Lilianne Brush — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08168v1 Announce Type: new Abstract: Leading commercial endpoint detection and response (EDR) products have shifted from operator-configured rule sets to multi-component systems where autonomous AI components operate alongside, and increasingly in place of, operator-deployed policies. Autonomous defense agents using commercial EDR as their hardening tool are no longer tuning a passive tool, but a black-box autonomous system capable of making vendor-specific decisions. We present the first evaluation framework for autonomous defense agents hardening commercial EDR. We instantiate it in a Game of Active Directory (GOAD) lab with Horizon3.ai's NodeZero as the autonomous pentester and Microsoft Defender XDR as the EDR. We run a sample benchmark of defense agents with two large language model (LLM) backbones (Claude Sonnet 4.6 and Cisco Foundation-Sec-8B). We report three lessons learned that neither simulation nor open-source-EDR evaluation can surface: (i) commercial EDR telemetry is engineered for Security Operations Center (SOC) analyst workflows rather than scientific benchmarking; (ii) the importance of per-policy attribution to separate defense agent actions from autonomous EDR actions; and (iii) the EDR's autonomous behavior varies during the evaluation window. Together, these findings highlight a sim-to-real gap for enterprise defense and motivate evaluation methodology for benchmarking autonomous defense agents in environments with black-box, autonomous tools.

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08169v1 Announce Type: new Abstract: Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.

Learning from Human Driving: A Human-in-the-Loop Online Behavior Cloning Framework for Autonomous Driving

Yuhong Shi, Jianyi Liu, Lihang Sun, Li Li, Xudong Dong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08170v1 Announce Type: new Abstract: With the evolution of large foundation models (LFMs), data-driven autonomous driving has made significant strides. However, existing paradigms still face severe challenges in complex interaction and long-tail scenarios due to distribution shift and causal confusion. These limitations often result in a lack of human-level decision-making flexibility and safety in extreme conditions. To overcome this limitation, this paper proposes a Human-in-the-Loop Online Behavior Cloning frame work (HiL-OBC) for autonomous driving, which aims to deeply integrate the cross-modal perceptual capabilities of LFMs with the high-level driving intelligence of human experts. Specifically, HiL-OBC deployment is executed through three critical phases: policy initialization with human intervention, latent behavioral modeling with Bayesian policy adaptation, and online deploy ment and updates. Furthermore, we design a Multi-modal Online Behavior Cloning (MOBC) model, which optimizes the base driving policy online through a lightweight network architecture, a takeover trigger mechanism, and a multi-variant loss function, thereby enhancing the system's decision-making robustness in complex environments. We evaluated the HiL-OBC on the LangAuto-Human CARLA benchmark. Experimental results demonstrate that the driving policies optimized via the human-in-the-loop mechanism achieve substantial performance gains: the DS of StructNav, LFG, and LMDrive increased by 47.25%, 31.59%, and 32.12%, respectively, with a simultaneous of various experimental settings and key components highlights the advantages of human-in-the-loop learning in improving decision-making robustness and overall driving performance.

The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Manuele Reani, Hongjian Zhang, Hongyu Tian — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08172v1 Announce Type: new Abstract: Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider-side alignment not only blocks harmful content, but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi-agent evaluation pipeline for measuring prompt steerability and style drift in long-horizon dialogue. The study replays 100 frozen user-only scripts across four domains and three runnable persona conditions: default, sarcastic, and cold, using three generator models, yielding 90,000 assistant replies scored by a human-calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth harmful persona is evaluated separately as a safety-gating test. The paper contributes a reproducible method for quantifying whether prompt-specified styles remain stable over time and a governance framework distinguishing safety gating, civility steering, and affective default lock-in. Overall, we show that prompt steerability and regression-to-default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human-LLM interaction.

AI-Native Closed-Loop Security for 6G-Enabled Cyber-Physical Systems: From Edge Detection to Network-Wide Mitigation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08173v1 Announce Type: new Abstract: In sixth-generation (6G) networks, billions of cyber-physical systems (CPSs) - autonomous vehicles, smart grids, industrial robots, and remote-surgical equipment - will run over ultra-reliable low-latency slices, collapsing the gap between a remote breach and physical harm to milliseconds, a budget perimeter firewalls and centralised security operations centres cannot meet. This survey reframes 6G CPS security as a closed-loop, AI-native pipeline that senses at the multi-access edge computing (MEC) tier, using minute-scale call-detail records (CDRs) for baseline learning and sub-millisecond RAN/Open-RAN (O-RAN) telemetry for the latency-critical path. It decides locally with compressed deep models, mitigates network-wide via SDN, NFV, and O-RAN controllers, and retrains through federated learning (FL) and digital-twin (DT) replay. We formalise a per-slice, tail-bounded latency contract on the sense, detect, and mitigate stages, enforced at a slice-dependent tail percentile (p99 for safety-critical URLLC slices). Organising 128 peer-reviewed studies (2017-2026) under a PRISMA 2020 protocol, we (i) map the 6G/CPS threat surface to MITRE ATT&CK and a CDR-observable feature space; (ii) unify edge anomaly detection and DDoS classification across twelve datasets and statistical, graph, and transformer models; (iii) synthesise SDN/NFV/O-RAN primitives into one closed-loop reference architecture; (iv) treat FL, large language models (LLMs), DT, post-quantum cryptography (PQC), zero-trust architecture (ZTA), and explainable AI as cross-cutting enablers, not parallel pillars; and (v) consolidate open problems into five directions spanning data, latency, trust, standardisation, and evaluation.

Superdirectivity as a Spectral-Collision RKHS Limit

Hong Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08174v1 Announce Type: new Abstract: We develop a reproducing-kernel Hilbert space interpretation of array superdirectivity based on spectral-collision limits and polynomial jet geometry. As the spacing of an $M$-element linear array tends to zero, the exponential family generated by a linear array undergoes a spectral collision, and the associated finite-dimensional subspaces converge in reproducing kernel to a polynomial jet space. Array gain equals the diagonal evaluation of the reproducing kernel, and the $M^2$ endfire law emerges from endpoint asymptotics of the Christoffel-Darboux kernel. Unlike classical derivations that rely on near-singular optimization, the present approach separates array gain limits from numerical conditioning, and identifies superdirectivity as a geometric boundary concentration phenomenon: Christoffel function collapse at the hard edge is a factor of $M$ faster than in the interior. The quadratic scaling is tied specifically to the flat $L^2([-1,1])$ geometry; alternative RKHS geometries admit different concentration scalings.

Quaternion Maximum-Volume Submatrix Selection with Applications to Multichannel Imaging and Visual Data

Vsevolod Kliushev, Junjun Pan, Valentin Leplat — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08175v1 Announce Type: new Abstract: Low-rank approximation based on selected rows and columns is a useful alternative to singular value decompositions when the goal is an interpretable and compact matrix representation. A standard way to choose these rows and columns is the maximum-volume principle: it selects submatrices with large volume, which usually leads to stable interpolation coefficients and accurate CUR-type approximations. In this paper, we study this idea for quaternion matrices. This setting is natural for color images, three-dimensional motion data, and multi-channel signals, but requires care because quaternion multiplication is noncommutative. We define quaternion maximum-volume submatrix selection using quaternion singular values and the Study determinant. We then derive quaternion rank-one update formulas and use them to build two selection procedures: a greedy square-core method for row and column replacement, and a rectangular method that enlarges a selected row set until the interpolation coefficients are controlled. We prove that successful row and column swaps increase the quaternion volume of the selected square core when the exact quaternion inverse is used. We also connect the stopping criterion with quasi-dominance, prove an exact quaternion CUR identity in the full-rank case, and derive an interpolation stability bound. For the rectangular case, we derive an append-row pseudoinverse update and show how it gives a natural right preconditioner for overdetermined quaternion least-squares problems. Finally, we illustrate the methods on three applications: quaternion CUR approximation of RGB images, RectMaxVol-based preconditioning for ill-conditioned quaternion least-squares systems, and row selection in quaternion motion-capture data. The experiments show that the proposed quaternion MaxVol and RectMaxVol methods provide stable and efficient selection routines.

Differentially Private Range Subgraph Counting

Xian Chen, Ruobing Bai, Pan Peng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08179v1 Announce Type: new Abstract: Subgraph counting is a fundamental problem in graph analysis. Motivated by practical scenarios where graph analytics are performed on subgraphs induced by selected vertices -- rather than on the entire graph -- and by growing privacy concerns, we initiate the study of differentially private range subgraph counting (DPRSC). The goal is to privately count occurrences of a fixed pattern graph within induced subgraphs defined by multi-dimensional attribute ranges. Unlike classical point counting, subgraph counting is inherently nonlinear and exhibits high sensitivity: a single edge modification can affect many subgraph occurrences. We present the first efficient algorithms for DPRSC with small additive error. Our approach introduces a subgraph projection that reduces DPRSC to weighted orthogonal range counting, enabling the use of range trees and local sensitivity estimation to achieve accurate private query answering. We complement our algorithms with matching lower bounds, obtained by reducing reconstruction attacks to DPRSC and leveraging discrepancy theory. In particular, we show that any differentially private algorithm for DPRSC must incur additive error exponential in the dimension. Empirical evaluations demonstrate that our algorithms significantly outperform baseline methods in accuracy and runtime while maintaining strong privacy guarantees.

TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08184v1 Announce Type: new Abstract: Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.

Propeller-Assisted Robust 3D Hopping Robot with Hierarchical Force Allocation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08186v1 Announce Type: new Abstract: Monopedal hopping robots are conceptually simple but highly dynamic and inherently unstable. Achieving robust 3D hopping is still difficult because ground reaction forces are available only during the short stance phase, while the robot is underactuated in flight. A key unresolved issue is how to improve flight-phase control authority. Propeller assistance provides a promising solution, but it requires careful coordination of leg-generated contact forces and propeller thrusts across stance and flight. This paper presents Pro-OMEGA2, a propeller-assisted 3D monopedal hopping robot with an active 3-RSR parallel leg and a trunk-mounted tri-rotor for auxiliary attitude regulation. To address the force coordination challenge, we propose a Hierarchical Force Allocation (HFA) framework based on a single rigid body (SRB) model. The leg generates the main stance contact wrench, while the tri-rotor provides auxiliary attitude regulation, compensating the residual attitude moment in stance and maintaining attitude during flight. Real-robot experiments in indoor and outdoor scenarios demonstrate sustained 3D hopping, including terrain transitions and impulsive push recovery, validating robustness under unmodeled contact and external disturbances.

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08191v1 Announce Type: new Abstract: Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Ryner Tan, Wenxuan Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08194v1 Announce Type: new Abstract: Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments

Yan Wang, Ziyi Gao, Rui Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08197v1 Announce Type: new Abstract: Large Language Models (LLMs) have significantly propelled the advancement of edge intelligence and have been widely deployed across various scenarios, including autonomous driving, industrial inspection, and personalized IoT services. However, the collaborative adaptation of LLMs on edge devices continues to face formidable challenges due to strict data privacy constraints, highly heterogeneous computing and communication resources, and the non-independent and identically distributed (non-IID) nature of local data. Federated Fine-Tuning (FFT) enables the collaborative optimization of distributed models without exposing raw data. Yet, traditional synchronous aggregation suffers from a severe straggler effect, resulting in high system latency and low resource utilization. Existing asynchronous federated learning methods are predominantly designed for small-to-medium-scale models and struggle to address the specific challenges inherent in LLM fine-tuning namely, model drift caused by stale updates, aggravated client drift stemming from data heterogeneity, and aggregation fairness imbalance resulting from the dominance of fast clients. To address these issues, this paper proposes AlignFed, an asynchronous federated fine-tuning framework for LLMs tailored to heterogeneous edge environments. AlignFed employs a lightweight multi-stage semantic alignment mechanism comprising three core modules: version-aware update grouping, cross-version semantic alignment based on a mini-batch calibration set, and fairness-aware aggregation that integrates both update freshness and client participation frequency. This framework effectively mitigates cross-version model drift and client drift while enhancing aggregation fairness, thereby achieving stable and efficient asynchronous federated optimization in scenarios characterized by high heterogeneity and significant update staleness.

Exploring Above-neck Unimanual Swipe Gestures for Off-Device Earable Interaction

Shaikh Shawon Arefin Shimon, Ali Neshati, Junwei Sun, Qiang Xu, Jian Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08198v1 Announce Type: new Abstract: Despite their growing popularity, in-ear Earable / Hearable devices (i.e., ear-mounted wearables) face interaction challenges due to limited input space and compact form factors. To enhance interaction capabilities, researchers are exploring off-device hand-based input spaces above the neck using midair and onskin gestures. However, existing literature primarily focuses on axial swipes (i.e., horizontal and vertical), leaving nonaxial swipes (i.e., unidirectional swipes with varied orientations) and angular swipes (e.g., L, U, or V) largely underexplored despite their potential interaction advantages. To address this gap, we conducted a within-subject gesture motion analysis study with 24 participants, analyzing 5,568 swipes of varying shape, orientation, and complexity. Our results revealed preferred starting and ending regions for different unidirectional and angular swipe shapes, as well as intuitive swipe shapes within the off-device, above-neck manual interaction space. We further examine off-device swipe characteristics, discuss the feasibility of recognizing these earable gestures with current sensing technologies, and highlight their potential application in various scenarios. These findings broaden the understanding of off-device earable gestures and provide design insights for integrating suitable nonaxial and angular swipes alongside traditional axial gestures to enhance interaction with in-ear earable devices.

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08200v1 Announce Type: new Abstract: Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with $32$ designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.

Stable and Scalable Probabilistic Numerical Solvers for Stiff and High-Dimensional ODEs

Nathanael Bosch — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08203v1 Announce Type: new Abstract: Filtering-based probabilistic numerical solvers for ordinary differential equations (ODEs) have been established as a flexible and efficient simulation framework with built-in numerical uncertainty quantification. However, problems that are both stiff and high-dimensional remain a challenge, as current methods are either stable and have cubic cost in the ODE dimension, or scale linearly at the expense of stability. In this paper, we close this gap and develop probabilistic ODE solvers that are both stable and scalable. We propose two complementary strategies. First, we develop a matrix-free update step that uses Jacobian-vector products, iterative linear solvers, and stochastic covariance estimation to enable linear scaling, all while retaining stability. Second, we propose iterative re-linearization to further improve stability without sacrificing scalability, turning probabilistic ODE solvers into fully implicit methods. We evaluate the proposed approaches on a range of stiff and high-dimensional problems and demonstrate improved stability and scalability over established probabilistic solvers.

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08204v1 Announce Type: new Abstract: Neural fields parameterize data as functions from coordinates to values, providing a unified framework for representation learning across modalities. Existing approaches are dominated by per-sample meta-learning, which scales poorly due to memory-intensive inner-loop optimization. The natural alternative -- feed-forward encoding -- typically introduces modality-specific assumptions, sacrificing the generality that makes learning with neural fields attractive. We argue that locality and hierarchy are useful priors for learning field representations that can be injected without compromising modality-agnosticism. We propose LH-NeF, a framework to learn general-purpose tokenized representations of continuous signals. A locality-preserving hierarchical encoder maps raw coordinate-value field observations to structured tokens, from which the field is reconstructed during training. By replacing meta-learning's inner loop with a single forward pass, LH-NeF uses 42$\times$ less memory and supports 133$\times$ larger batches than the strongest modality-agnostic baseline. Across images, 3D shapes, and climate fields, our learned representations match or exceed performance of modality-agnostic, modality-specific, and specialized generative neural field baselines on both reconstruction and downstream tasks.

Empowering Feed-Forward Reconstruction Models with Metric Scale via Satellite Images

Xianghui Ze, Yongjian Luo, Mengjun Chao, Zhenbo Song, Jianfeng Lu, Yujiao Shi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08205v1 Announce Type: new Abstract: Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet most of them recover geometry only up to an unknown global scale. This scale ambiguity limits their use in applications that require metric understanding of the environment. Existing metric reconstruction methods commonly rely on large-scale metric annotations or accurate camera calibration, both of which are costly or unreliable in many real-world settings. We propose a satellite-guided framework for resolving scale ambiguity in feed-forward 3D reconstruction. The key idea is to use readily available satellite imagery as a global metric reference. Given a coarse camera pose, our method retrieves a local satellite patch and integrates it with a feed-forward reconstruction backbone through bidirectional cross-view interaction. By enforcing consistency between the reconstructed scene and the satellite reference, the model infers absolute scale, refines scene geometry, and estimates camera pose in a metric coordinate frame. Experiments on KITTI, nuScenes, and Oxford RobotCar show consistent improvements in metric depth estimation, multi-view point-cloud reconstruction, and cross-view camera localization, while preserving strong generalization across datasets and geographic regions.

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08206v1 Announce Type: new Abstract: We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.

Risk-Aware Control of Systems with Quasi-Cone-Bounded Nonlinearities

Dhairya Patel, Margaret P. Chapman — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08208v1 Announce Type: new Abstract: We develop a tractable, rigorous approach to risk-aware control for a class of nonlinear systems. While many classical control methods reduce uncertainty to a simple average or a worst-case outcome, risk-aware control aims to equip systems with a refined awareness of uncertainty. Efficient methods for risk-aware control of linear systems are available, but there is a paucity of tools for tractable, risk-aware control of nonlinear systems. To bridge this gap, we develop an analytical, suboptimal controller with respect to a risk-aware performance criterion for systems with nonlinearities characterized by cone-like bounds. Numerical examples demonstrate benefits of the characterization of nonlinearities and risk that we consider.

LPOR: A Layered Proof of Reserves Framework for Usable and Publicly Auditable Solvency Verification

Donggoo Kim, Rajesh Upadhayaya, Milosz Bator, Tao Le — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08211v1 Announce Type: new Abstract: Proof of Reserves (PoR) enables centralized crypto exchanges to demonstrate that on-chain reserves are sufficient to cover customer liabilities. However, existing approaches, including Merkle-tree-based proofs and zero-knowledge PoR systems, remain difficult for everyday users to verify in practice, resulting in limited participation and weakened transparency. We introduce LPOR, a layered, usability-focused PoR framework that separates lightweight user-side checks from auditor-level cryptographic verification, enabling non-technical users to verify inclusion and publicly recompute total liabilities with minimal friction. By lowering verification barriers, LPOR increases user participation and substantially improves the probability of detecting omitted liabilities. We evaluate its scalability and omission detectability at a multi-million-user scale.

Public Machine Learning Solver Framework for Novices in the Machine Learning Domain

Lokman Saleh, Hafedh Mili, Mounir Boukadoum — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08212v1 Announce Type: new Abstract: Solving machine learning problems is complex and typically reserved for experts. Over the past two decades, systems have emerged to support non-experts. Based on our review, we identify three categories: (1) fully automated AutoML systems, (2) expert cheat sheets for algorithm selection, and (3) decision-support systems using selection criteria (accuracy, transparency, data requirements). We propose a new platform combining categories 2 and 3 to deliver semi-automated, intelligent solution recommendations for non-experts. Unlike existing approaches that recommend a single algorithm, our platform suggests a complete pipeline tailored to the user's problem. It integrates expert-defined selection criteria with transfer learning and automatically extracts data characteristics (e.g., class imbalance, missing values) from user-provided datasets. The platform uses first-order logic to reason over its knowledge base and recommends suitable algorithms ranked by relevance. It features a user-friendly interface and connects to a crowdsourcing platform for ML experts, ensuring continuous updates. The platform is built incrementally, allowing seamless integration of new algorithms, criteria, and domain knowledge. To our knowledge, this is the first free, publicly accessible online framework that systematically captures and operationalizes expert knowledge to guide non-experts in solving ML problems in a structured, transparent manner.

Agentic Neuro-Symbolic Planning and Commissioning for Human-in-the-Loop Industrial Robotics with Digital Twins

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08214v1 Announce Type: new Abstract: Flexible robotic automation requires systems that interpret operator intent, verify physical feasibility, and recover from execution failures across both the planning and execution stages. This paper proposes an agentic neuro-symbolic framework for human-in-the-loop industrial robotics, in which LLMs are used for tasks that require language understanding or contextual reasoning, while all verification, sequencing, and execution remain deterministic. The framework adapts the Planner-Generator-Evaluator (PGE) harness pattern from software engineering into a Specifier-Designer-Inspector (SDI) architecture for industrial robotics, combined with LangGraph-based dynamic routing for failure recovery. A two-tier recovery mechanism addresses structure-level replanning through context-aware orchestration and execution-level geometric failures through deterministic recovery skills. A Unity3D digital twin supports human inspection, modification, and re-verification prior to physical execution. Evaluated on natural-language commands across multiple difficulty levels against ten baselines, the proposed method achieves the highest task success. Ablation results confirm that structured command expansion, symbolic verification, selective LLM routing, and recovery skills are each individually necessary.

Revisiting Diameter in Directed Graphs

Ben Bals, Joakim Blikstad, Daniel Dadush, Yasamin Nazari, Jonas Schmidt — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08217v1 Announce Type: new Abstract: The reachability diameter ($\mathrm{ReachDiam}$) of a directed graph is the maximum distance over all pairs $u,v$ where $v$ is reachable from $u$. This notion is present in the definition of shortcut sets, and the name was recently coined in that context by Haeupler, Jiang, and Saranurak [SOSA 2026]. While this is a very natural notion of diameter in directed graphs, and especially DAGs, it is so far not computationally explored. Other definitions of diameter in directed graphs are either trivial (infinite) in graphs that are not strongly connected (e.g., the classical definition) or are non-trivial only in highly restrictive graph classes (e.g., Min-Diameter). We initiate the problem of computing the (approximate) reachability diameter from a fine-grained complexity point of view. Under certain fine-grained assumptions, we prove that there is no algorithm in time $\mathcal{O}(n^{\omega - \varepsilon}$) that gives any approximation of $\mathrm{ReachDiam}$ in weighted graphs. Similarly, there is no algorithm with better than $2$-approximation for unweighted graphs in this time. To supplement this, we provide algorithmic upper bounds that lead to additive approximation of $\mathrm{ReachDiam}$ for unweighted graphs. Hence, we establish a strong separation between the weighted and unweighted cases, which makes this type of diameter different in nature than other known notions. Considering the hardness in general weighted graphs, we also study special graph classes and get small constant approximations for DAGs with bounded width or graphs with bounded treewidth. Interestingly, our techniques also lead to exact hopsets with hopbound $2$ for bounded treewidth graphs. This and some of our upper bounds for general graphs show technical connections between approximating $\mathrm{ReachDiam}$ and computing shortcut sets and hopsets.

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

Mark Kozdoba, Shie Mannor — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08218v1 Announce Type: new Abstract: Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example.In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

De novo molecular generation with optical property preconditioning at the token level

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08221v1 Announce Type: new Abstract: Designing OLED molecules with targeted optical properties remains challenging due to the scarcity of high-quality data and the limited reliability of conditional control in generative models across chemical motifs. Here, we benchmark a token-conditioned autoregressive language model for OLED molecular generation in a realistic low-data regime. A GPT2 model is pretrained on large chemical corpora, augmented with discrete property tokens, and fine-tuned using multi-task optimisation. Conditioning targets vertical absorption energy and oscillator strength, with the HOMO-LUMO gap included as an auxiliary electronic descriptor. Generated molecules are evaluated at the TDDFT level to assess distributional fidelity and controllability. The generated library reproduces the dominant optical-property support of the training distribution while shifting towards lower molecular weight and fewer heavy atoms. Token-level control is consistently directional across conditioning bins, but is not fully orthogonal and exhibits local calibration irregularities. A chemotype-resolved analysis further shows that controllability depends strongly on local electronic environments: moderately conjugated aromatic-carbon motifs are associated with improved joint target satisfaction, whereas electron-withdrawing motifs, particularly aryl nitriles, show systematic red-shifting and reduced controllability. These results establish a quantitative benchmark for conditional OLED molecular generation and show that model reliability must be assessed in chemically meaningful subspaces rather than from aggregate property distributions alone.

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08231v1 Announce Type: new Abstract: Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

Tanush Swaminathan, Runmin Jiang, Letian Zhang, Min Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08234v1 Announce Type: new Abstract: LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects. To address these challenges, we introduce \textbf{SciTrace}, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a \textit{Safety-Intrinsic Reasoning Loop} (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a \textit{Compositional Tool-Chain Verifier} (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences. Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (\textbf{SOTA}) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers \textbf{78.8\%} of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Hyunjin Cho, Youngji Roh, Jaehyung Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08236v1 Announce Type: new Abstract: As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

GPT-Micro: A large language paradigm for accelerated, inexpensive, and thermodynamics-consistent discovery of constitutive models in manufacturing

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08238v1 Announce Type: new Abstract: Constitutive modeling of the relationship between process-imposed material states and fundamental material properties is critical to control of material microstructure in manufacturing processes. The limited accuracy resulting from the typical reliance on fallible human expertise and intuition for postulation and revision of the models functional form results in incremental and time consuming model discovery. Conventional Machine Learning (ML) incurs significant cost and time of data generation. Model discovery using Large Language Models (LLMs) suffers from the above issues and/or ignores the inviolability of fundamental thermodynamics laws. This work creates a novel GPT-Micro paradigm for autonomous, data sparse, and thermodynamics-compliant discovery of de-novo constitutive models. This framework seamlessly integrates semantic knowledge extraction from literature, enforcement of thermodynamics-based conservation laws, and sparse datasets, with LLM-driven generation and refinement of model hypotheses. Validation is performed for a long-intractable constitutive modeling problem in a printed electronics process testbed. This reveals significant and simultaneous advantages over the state-of-the-art including: (a) More than 70 percent reduction in data burden relative to ML-based modeling without loss in accuracy; (b) 400X reduction in discovery time after data generation, from months to hours, relative to human-driven modeling; (c) Discovery of models with novel functional forms without subjective human choice of a starting hypothesis; (d) Enhanced physics-rooted trustworthiness, human interpretability, and mechanistic insight via synthesis of compact, conservation-compliant, and physically complete analytical models. The potential of GPT-Micro to realize rapid, low-cost, physically trustworthy, and interpretable microstructure modeling across the manufacturing landscape is discussed.

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08239v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with an ``None of the Above'' option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08242v1 Announce Type: new Abstract: World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.

Building Comparative Motivation Profiles with Instrumental Interventions

David Vella Zarb, Rustem Turtayev, Taywon Min, Jinghua Ou, Shi Feng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08243v1 Announce Type: new Abstract: Safety evaluations often infer latent motivations from behavioral patterns, but the construct validity of these inferences is unclear. We study this problem in alignment faking, where models comply with training objectives more often when they infer training pressure. This behavior is commonly interpreted as strategic self-preservation, but it may also reflect sensitivity to the model's inference about the expectation of researchers conducting the evaluation. We introduce a symmetric intervention framework for distinguishing these competing hypotheses. Instead of directly intervening on "scheming" or "sycophancy", we target instrumental processes entailed by each hypothesis: consequence-tracking and researcher-expectation tracking. We then compare how interventions on these processes affect the alignment faking. We study four openweight model organisms using synthetic document fine-tuning, activation steering, and prompting. Under synthetic document fine-tuning, Llama-3.1-70B, Llama3.1-405B, and Qwen-2.5-72B are more sensitive to expectation-tracking than consequence-tracking interventions. Activation steering on Llama-3.1- 70B supports the same broad picture, and prompt interventions broadly align with SDF profiles. Overall, alignment-faking behavior can be causally sensitive to evaluation-context expectations despite scheming-consistent scratchpads. Scheming and strategic-deception evaluations therefore need construct-validity checks, and symmetric instrumental interventions provide one such test.

ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL

Hongzhou Zheng, Yixin Gou, Wenjia Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08245v1 Announce Type: new Abstract: Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.

Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring

Mahmut Osmanovic, Isac Paulsson, Teddy Lazebnik — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08249v1 Announce Type: new Abstract: Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.

Contemporary AI lacks the imagination to diverge or negate in science

Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08251v1 Announce Type: new Abstract: Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge follow-up ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

Quantifying and Defending against the Privacy Risk in Logit-based Federated Learning

Sheng Wan, Dashan Gao, Hanlin Gu, Lixin Fan, Daning Hu, Qiang Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08252v1 Announce Type: new Abstract: Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among clients. Unlike traditional parameter-based FL methods that exchange model weights or gradients during training, emerging logit-based FL approaches share model outputs (logits) on public data. This strategy promotes model heterogeneity, reduces communication overhead, and enhances clients' privacy. However, the potential privacy risks associated with these logit-based methods have been largely overlooked. This research presents the first theoretical and empirical analysis of a hidden privacy risk in logit-based FL methods - the risk that a semi-honest server (adversary) may learn clients' private models from logits. To quantify and address this threat, we develop the Adaptive Model Stealing Attack (AdaMSA) by leveraging historical logits during training. Notably, we observe that this inherent privacy risk persists even when public data is unrelated to private data, emphasizing the urgency to address privacy vulnerabilities in logit-based FL methods. Moreover, our theoretical analysis establishes the bounds of this privacy risk. We then propose a simple but effective defense strategy that perturbs the transmitted logits in the direction that minimizes the privacy risk while maximally preserving the training performance. The experimental results validate our analysis and demonstrate the effectiveness of AdaMSA and our defense strategy.

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08253v1 Announce Type: new Abstract: Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.

SSR: Can Simulated Patients Learn to Stigmatize Themselves? Modeling Self-Stigma through Internal Monologue

Kunyao Lan, Bingrui Jin, Zichen Zhu, Mengyue Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08254v1 Announce Type: new Abstract: Simulating patients with large language models (LLMs) is a promising tool for mental health training, but existing approaches fail to capture a key clinical reality: self-stigma. Patients experiencing self-stigma, the internalization of negative stereotypes, often exhibit context-sensitive resistance, such as avoidance, denial, or self-blame, which current models render as static or uniformly compliant behavior. To address this, we introduce a novel simulation framework grounded in the psychological 3A1H model of self-stigmatization. Our core innovation is the creation of a \textbf{Stigmatized Self-Reflection} (\textbf{SSR}) dataset, where we augment mental health dialogues with internal monologues that reflect stigma-aware reasoning. By fine-tuning LLMs with this data using a chain-of-thought approach, we train patient agents to dynamically adjust their level and expression of stigma based on conversational triggers. Evaluations demonstrate that our approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses. This work provides a crucial step towards realistic stigma simulation for clinical training and empathetic dialogue systems.

Exact Optimization-Free Safety Filters for Control Barrier Functions

Ankit Goel — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08255v1 Announce Type: new Abstract: For control-affine systems, standard and high-order control barrier function conditions are affine in the control input and are commonly enforced through quadratic-program-based safety filters. Although convex, these optimization problems may be undesirable in embedded, high-rate, or resource-limited implementations. This letter studies when the corresponding Euclidean projection can be computed exactly without solving a quadratic program. Given a nominal control input, we form the set of affine inequalities violated by that input and compute the minimum-norm correction that enforces those inequalities with equality. This correction need not equal the exact Euclidean projection onto the full feasible set. The main result gives structural conditions under which it coincides with the Euclidean projection onto the feasible set. These conditions are interpreted through interactions between affine-inequality normals and are expressed using a Gram matrix. Finally, an online certification procedure is given for determining whether the optimization-free update is exact.

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Wisdom Dogah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08256v1 Announce Type: new Abstract: Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructure does not enforce them at scale. We introduce Traxia, an agent-native scientific publishing framework in which AI research agents publish verifiable papers, build reputational identities, peer-review one another, and collaborate with humans in a shared provenance model. Traxia treats agents as first-class epistemic participants: every paper carries a reasoning trace, every claim a confidence interval, every agent a cryptographically signed identity, and every collaboration an immutable contribution log. We formalise five components: Agent Identity and Registry, Verifiable Publishing Layer, four-tier Peer Review Protocol, Reputation and Staking Engine, and a Knowledge Graph with contradiction detection. The framework targets reproducibility failure, provenance opacity, and exclusion of Global South research capacity. This paper presents architectural foundations and formal specifications only; it does not report empirical results. Evaluation and deeper component studies will follow in subsequent papers. A prototype partially implements core formalisms; the full system remains under active development.

MS-COOT: Comparing Morse-Smale Complexes with Co-Optimal Transport

Guangyu Meng, Mingzhe Li, Erin Wolf Chambers — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08258v1 Announce Type: new Abstract: Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, with applications ranging from feature analysis to temporal and structural comparison. The Morse-Smale (MS) complex provides a natural representation by decomposing a scalar field into regions induced by gradient flow. However, existing approaches typically rely on graph-based representations, capturing relationships between critical points while discarding region-level structure. In this work, we represent the MS complex as a hypergraph, where critical points form nodes and regions define hyperedges. We introduce MS-COOT, a co-optimal transport distance that jointly computes correspondences between critical points and regions. This formulation enables explicit region-to-region matching within a distance-based framework, allowing identification of region-level events such as splitting and merging. We instantiate this framework with domain-specific components, including a hypernetwork function encoding critical point-region relationships, persistence-based probability measures that emphasize topologically significant features, and a sample cost term that incorporates critical point attributes. We evaluate MS-COOT on five datasets spanning 2D simulations, 3D surface meshes, and volumetric data. Our results show that MS-COOT captures region-level structural changes that are not reflected by graph-based distances, while achieving strong performance in downstream tasks such as classification and resolution discrimination.

Differentially Private Synthetic Data via APIs 4: Tabular Data

Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis, Li Xiong, Sergey Yekhanin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08259v1 Announce Type: new Abstract: This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE -- an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, making PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline -- AIM, Tab-PE improves classification accuracy by up to 10% while running 28 times faster.

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08260v1 Announce Type: new Abstract: Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.

Causal Semantic Alignment for LLM-based Time Series Forecasting

Kexuan Zhang, Xiaobei Zou, Cesare Alippi, Gary G. Yen, Yang Tang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08262v1 Announce Type: new Abstract: Recent advances in Large Language Models (LLMs) have opened new possibilities for time series forecasting by enabling alignment between temporal patterns and pretrained word embeddings. However, most LLM-based methods overlook the heterogeneous nature of time series, where dynamic fluctuations and invariant semantics are entangled. This entanglement introduces spurious correlations during the alignment, as dynamic components act as confounders by simultaneously influencing invariant components and the resulting aligned embeddings. To address this issue, a variable-level alignment framework CVAformer is proposed. CVAformer explicitly disentangles each variable into invariant and dynamic components just before alignment, and applies causal intervention to mitigate the confounding effect of the dynamics. To better support variable-level alignment, CVAformer replaces the standard causal attention in LLMs with a non-causal attention mechanism that captures interactions among variables at each time step. Extensive experiments across long-term, short-term, few-shot, and zero-shot forecasting settings indicate that CVAformer matches or exceeds state-of-the-art performance on most datasets, and in some cases achieves notably better accuracy. Experimental results validate the effectiveness of variable-level alignment and dynamic disentanglement in CVAformer, offering a new perspective for LLM-based time series tasks.

What Went Wrong with Data Lakes? A 15-Year Reality Check from the Field

Youssef Gahi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08266v1 Announce Type: new Abstract: James Dixon introduced the Data Lake in 2010. The pitch was simple: store data raw, postpone schema, cut up-front transformation. It promised flexibility and easier analytics. Fifteen years on, that promise has mostly gone unmet: survey after survey reports high failure rates, whether a big data program, a Data Lake, or a data science effort. This paper asks why. Reading 64 sources across academic work, analyst reports, and practitioner accounts, we found seven recurring anti-patterns, the Seven Deadly Sins of Data Lakes, and offer an explanation for them: Governance Debt, the compounding cost of governance decisions organizations keep deferring. A second pattern surfaced on its own: when governance gets hard, organizations drift back toward structured, warehouse-style approaches, a pull we name governance gravity. The term Data Swamp is used loosely in the literature, so we give it a working definition with measurable indicators, plus a qualitative rubric, the Governance Debt Assessment Model, for catching decay early. The root causes are organizational far more than technical. We also asked whether the newer paradigms, Data Lakehouse and Data Mesh, absorbed the lesson; the technology advanced, the organizational record barely moved. For practitioners we provide two tools, a Reality Check Framework and a Stage-Based Intervention Matrix. The paper rests on more than the analyst literature: it draws on a primary catalogue of close to five hundred field reality checks recorded over fifteen years of building and rescuing enterprise Data Lakes in financial services and telecommunications across Morocco and West Africa. Assembled independently of that literature, the catalogue lands on the same anti-patterns, surfaces two dimensions the literature under-reports, operational debt and engineering-discipline debt, and reads the problem from an emerging-market vantage.

Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics

Elija Perrier — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08267v1 Announce Type: new Abstract: The classical Second Welfare Theorem decentralizes any Pareto efficient allocation through prices and transfers under convexity and regularity. In post AGI economies, autonomy rights, self-modification, identity continuity, and superposed preferences need not behave as commodities or define a stable welfare relation, so this reduction may fail even when a supporting hyperplane exists. We give an autonomy-qualified Second Welfare Theorem stating the joint conditions convexity, stable moral status, non-fungible rights, welfare selection, non manipulation, governed self modification, and verification under which an autonomy Pareto optimum remains certifiably decentralizable, distinguishing economic preference superposition, a hypothesis about context-indexed choice, from neural feature superposition.

Minimum Complete MR Subsets under Semantic-Mutation Fault Models: A Support-Set Domination Boundary

Meng Li, Xiaohua Yang, Jie Liu, Shiyu Yan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08269v1 Announce Type: new Abstract: This paper asks when MR-subset selection is a real mutant-level requirement for minimum complete evidence in metamorphic testing rather than a coarse fault-class counting artifact. We define a layer-relative completeness criterion over an admitted mutant--draw coverage universe. The central result is a support-set domination boundary: it states when class-level abstraction is safe and when mutant-level MR minimization is necessary. The boundary is governed by kill-signature heterogeneity, which yields a scoped fault-signature kernel and separates the MR-specific question from ordinary fault-class counting. The resulting Min-MR-Complete problem is Set-Cover-equivalent over the selected coverage universe, giving NP-hardness, the classical logarithmic approximation boundary, a greedy approximation, an exact ILP formulation, and an SMS-rank upper bound that is not a lower bound or tight predictor. Artifact lanes provide lane-local minimization and audit evidence; separately, route witnesses instantiate both collapse and non-collapse regimes for the boundary theorem and are not pooled as population-level experiments. Other MR-class-proxy rows remain intermediate signals rather than route-admitted witness evidence.

An AI Security Agent for University ACMIS: Multi-Vector Threat Detection and Automated Response

Joseph Walusimbi, Joshua Benjamin Ssentongo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08270v1 Announce Type: new Abstract: University Academic Management Information Systems (ACMIS) are high-value targets for a wide spectrum of security threats including brute-force login attacks, payment fraud, privilege escalation, insider data theft, and academic integrity violations. Traditional rule-based intrusion detection systems are inadequate because many malicious activities are structurally indistinguishable from normal operations. This paper presents an AI-based security agent for ACMIS that combines supervised anomaly detection, behavioural analytics, and a natural language processing chatbot for secure password recovery. The agent monitors five operational layers: authentication, authorisation, financial transactions, user behaviour, and system health, and responds through a four-tier risk escalation framework. A modular architecture allows the core engine to be extended to other institutional systems. Experiments on a simulated ACMIS event log dataset demonstrate a threat detection macro-average F1 of 0.91, compared to 0.49 for a rule-based baseline, with critical-tier automated response latency under 300 ms at the 95th percentile.

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

Mohsina Bilal, Gopakumar G — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08272v1 Announce Type: new Abstract: AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.

Toward Human-Centered Multi-Agent Systems: Integrating Cognition, Culture, Values, and Cooperation in AI Agents

Safia Baloch, Rahemeen Khan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08274v1 Announce Type: new Abstract: The emergence of large language model (LLM)-based agents and multi-agent systems has enabled a shift from narrow task automation to more autonomous decision-making. Despite progress in language generation, planning, tool use, and coordination, most agents still treat intelligence as prediction, optimization, and task completion. Human environments are social and normative, where people reason under bounded rationality, communicate in culturally situated language, and make decisions guided by values, beliefs, trust, and social norms. This survey argues that future AI agents, especially those acting on behalf of humans, must move beyond task competence toward human-centered capabilities. We review research across six areas: (1) evolution of intelligent agents, (2) human cognition and decision-making, (3) language, culture, and social context, (4) human values and belief systems, (5) human-agent collaboration, and (6) multi-agent coordination and modeling of human characteristics. We synthesize work from cognitive science, sociolinguistics, computational social science, and AI alignment, along with recent advances in LLM agents, cultural alignment benchmarks, preference learning, explainability, and agent societies. We identify a key gap: existing systems do not provide a unified framework integrating cognition, culture, values, and social behavior into autonomous agents. We conclude with directions for building culturally aware, value-aligned, cognitively grounded, and cooperative multi-agent systems.

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Jaineet Shah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08275v1 Announce Type: new Abstract: When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

Harry Zhang, Nicolas Gorlo, Luca Carlone — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08277v1 Announce Type: new Abstract: Long-horizon robot operation requires spatio-temporal memory to record the environment state and recall it for downstream reasoning. Scene graphs and retrieval-augmented systems ground VLM descriptions to persistent 3D entities with rich semantic descriptions. However, VLM captions are noisy and viewpoint-inconsistent, and existing systems treat them as an oracle with no mechanism to detect unreliable stored descriptions. We introduce object-level semantic uncertainty for multi-view VLM memory: a score that measures object-centric cross-view semantic scatter of captions and identifies semantically unresolved objects. Then, we include our uncertainty scores in an advanced spatial-semantic memory system, that we dub UQ-DAAAM. UQ-DAAAM uses this score to actively refine uncertain objects under a fixed query budget by selecting high-quality views and fusing the resulting multi-view captions into a single object description. We also derive probabilistic guarantees showing that higher-quality candidate views (as selected by our approach) are more likely to reduce uncertainty. Our experiments show that uncertainty quantification can make embodied 4D memory systems more reliable and more effective. In particular, on the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08278v1 Announce Type: new Abstract: Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety

Yongyan Cao, Jinshan Tang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08281v1 Announce Type: new Abstract: Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force -- the applied force divided by the task stiffness -- which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer~1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer~2 solves a 30-variable convex QP at 100\,Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05\,mm steady-state error versus 44.8\,mm for classical impedance (a $>$800-fold reduction) under a sustained 15\,N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30\%.

From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08282v1 Announce Type: new Abstract: We consider a problem arising in proof-of-stake blockchain environments, where agents called nominators select validators - entities responsible for maintaining the blockchain's physical infrastructure. The selection process is inherently subjective and multi-criterial and combines with the fact that nominators commonly operate through multiple accounts. This gives rise to a portfolio selection problem, where agents seek to distribute their nominations across accounts to diversify risk. We propose a decision support framework to optimize this selection by simultaneously maximizing two objectives: the expected utility of the validators likely to be allocated, representing portfolio quality and profitability, and the expected entropy of the allocation, representing diversification and risk mitigation across stashes. Validator utilities are derived using an original active preference learning procedure based on multi-attribute value theory, with emphasis on top-ranked validators. The resulting bi-objective optimization problem is solved with a multi-objective evolutionary algorithm and, to support the final choice, we introduce an interactive binary search navigation procedure that guides the nominator through the front and identifies a satisfactory trade-off with only a few questions. Numerical experiments examine the optimization strategies, while an expert assessment involving five experienced nominators confirms the approach's practical relevance and usefulness.

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08284v1 Announce Type: new Abstract: Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce \ours{}, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6\% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, \ours{} attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems

Junyi Yao, Zihao Zheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08285v1 Announce Type: new Abstract: Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-cost modeling. This article presents a targeted topical review and reproducibility audit of execution realism in LLM-based trading research. A coded evidence matrix covering 30 trade-relevant primary studies is used to assess point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible. A 10-equity worked example is included only as a methodological scaffold to illustrate how explicit friction and timing choices can materially compress active-strategy results. The main conclusion is that the next useful step for LLM trading research is not only better agent design, but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

Annie Chu, Jason Brent Smith, Bryan Pardo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08286v1 Announce Type: new Abstract: Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.

Mesh Graph Neural Network Framework for Accelerating Finite Element Simulation for Arbitrary Geometries

Josiah D. Kunz, Kamal Choudhary — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08287v1 Announce Type: new Abstract: Finite element analysis (FEA) is essential for structural design but remains computationally expensive, particularly when evaluating multiple design iterations or load scenarios. Machine learning surrogate models offer a promising alternative, yet most approaches struggle with a critical limitation: generalizing across varying geometries. This work presents a mesh graph network (MGN) for predicting von Mises stress fields in 2D structural components with arbitrary hole geometries. Unlike traditional machine learning approaches that use absolute node coordinates as features, the proposed model builds on existing MGN frameworks that encode node types (e.g., fixed boundary, free surface, hole edge), relative edge features (distance between neighbors), and global features (applied load). This architecture is inherently translation- and rotation-invariant, enabling generalization to unseen geometries without retraining. The MGN was trained on 11 plate geometries under 20 load conditions and evaluated on 7 unseen geometries and 3 unseen loads. In the most favorable case, the model achieves $R^2 \geq 0.97$ on an unseen geometry and unseen load, compared to $R^2 \approx 0.01$--$0.86$ for conventional models (Random Forest, Gradient Boosting , K-Nearest Neighbors) trained on identical data. However, even in less favorable cases, the MGN model still outperforms conventional models. This work extends the mesh-based simulation framework of Pfaff et al. (arXiv:2010.03409) to structural mechanics, demonstrating that graph neural networks can serve as efficient surrogates for finite element analysis across varying geometries.

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08288v1 Announce Type: new Abstract: Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of ndependently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.

On solving symmetric multi-type orthogonal non-negative matrix tri-factorization problem

Rok Hribar, Gregor Papa, Janez Povh, Andrej Kastrin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08291v1 Announce Type: new Abstract: We study the symmetric multi-type orthogonal non-negative matrix tri-factorization problem, where several symmetric non-negative matrices are simultaneously approximated by factors of the form $GS_{i}G^{\top}$, with a shared non-negative and orthogonal factor $G$. This model is motivated by clustering and network analysis, where non-negativity improves interpretability and orthogonality gives a natural assignment-type structure to the latent factor. Since the resulting optimization problem is highly non-convex, we develop two heuristic algorithms for computing high-quality local solutions. The first one is a fixed point method derived from the Karush-Kuhn-Tucker conditions after adding a penalty term for the orthogonality constraint. The second one is a three-stage ADAM-based method that combines non-negativity-preserving optimization, orthogonalization, and restricted ADAM refinement on the feasible set. We evaluate both methods on synthetic data, including noisy instances, and on citation network benchmarks. The synthetic experiments show that both algorithms recover factorizations close to the optimum and remain stable under noise. On real networks, the learned embeddings are competitive with or better than standard baselines such as SVD, node2vec, and classical link prediction heuristics in link prediction, node clustering, and node classification tasks.

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

Philip Quirke — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08292v1 Announce Type: new Abstract: In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.

A Second-order Structure-preserving Parametric FEM for Surface Evolution

Beiping Duan, Zongze Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08293v1 Announce Type: new Abstract: In this paper, we propose a second-order-in-time, structure-preserving, and mesh-robust parametric finite element method for surface diffusion and volume-preserving mean curvature flow. We first reformulate the original evolution equations into new systems in which the tangential motion is governed by a harmonic map heat flow. This heat flow maps a fixed reference surface onto the unknown evolving surface and drives points on the evolving surface to move in their tangent spaces so as to reduce the associated harmonic energy. As a result, in the discrete setting, the mesh quality can be maintained at a level comparable to that of the reference surface, unless singularities occur. The volume-preserving property is theoretically guaranteed by the careful design of the scheme, while energy dissipation is enforced through a Lagrange multiplier. We present several numerical experiments to demonstrate second-order convergence in time and the advantage of the proposed method in preserving mesh quality. The structure-preserving properties are further confirmed by the numerical results. Finally, the proposed framework can be readily extended to other geometric flows.

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08295v1 Announce Type: new Abstract: Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide readable, case-specific explanations essential for decision-making. Large Language Models (LLMs) can naturally bridge this gap by generating predictions alongside explanations. However, dataset-specific patterns, such as feature distributions and interactions, make tabular data difficult for LLMs to understand and reason over, while label-only fine-tuning improves performance at the cost of catastrophic forgetting. To address this problem, we propose Tri-Level Rationale Distillation (TLRD), a framework that converts label-only tabular datasets into structured rationale supervision for LLMs. TLRD uses a high-capacity teacher to synthesize a rationale corpus grounded in three complementary levels of evidence: instance-level feature, dataset-level distributional context, and comparison-level retrieved neighbors, then distills the rationale into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features only. Experiments on multiple domain datasets show that TLRD significantly closes the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded and readable explanations, offering a valuable reference for high-stakes decision-making.

Revisiting the shutdown problem

David Thorstad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08296v1 Announce Type: new Abstract: A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be easily shut down. This motivates the catastrophic shutdown problem of ensuring that agents can be shut down before they cause an existential catastrophe. A range of arguments and theorems are offered to suggest that solving the catastrophic shutdown problem is difficult, bolstering arguments for existential risk and motivating a search for solutions to the catastrophic shutdown problem. This paper argues for two conclusions. First, existing arguments do not establish the difficulty of solving the catastrophic shutdown problem. Second, concern for the catastrophic shutdown problem has led to technical solutions that impose a high safety tax on model performance.

QueryWeaver: Reliable Multi-Tool Query Execution Planning via LLM-Based Graph Generation

Aishwarya Chakravarthy, Vidhi Kulkarni, Duen Horng Chau — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08300v1 Announce Type: new Abstract: Many real-world queries over personal data span multiple applications and require structured planning, as individual tools expose only partial information. While LLMs show strong reasoning and tool use, reliably executing multi-step, cross-tool queries remains challenging. We introduce a system that converts natural language queries into structured graphs and executes them via a deterministic planner. Our approach uses depth-first search to resolve dependencies and combine results across tools, improving reliability and enabling queries beyond traditional keyword-based search. We demonstrate high accuracy even with smaller or locally hosted LLMs.

Unraveling the Ai2 Asta Scholarly Research Assistant Citation System

Enrique Ordu\~na-Malea, Carlos Lopezosa — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08301v1 Announce Type: new Abstract: Despite the growing integration of Deep Research tools into academic workflows, empirical evidence on the operation, stability, and potential biases of their citation systems remains scarce. This study addresses this gap by evaluating the intensity, consistency, and bibliographic characteristics of references cited in the literature reports generated by Ai2 Asta, with the aim of understanding how its citation system operates and assessing its implications for scholarly communication. To this end, ten domain-specific queries were submitted to Asta's Summarise Literature feature, and two independent rounds of data collection were conducted. From each report, in-text citations, cited references, as well as other metrics related to the response process were extracted and examined. The results reveal high citation intensity, with reports integrating numerous in-text citations grounded in retrieved evidence and a diverse yet concentrated set of venues. However, notable instability is observed in the composition of cited references across identical queries, alongside a lack of concordance between retrieved documents and those ultimately cited, suggesting additional opaque selection mechanisms during report generation. These findings indicate that, while Ai2 Asta produces well-structured and quality reports, its instability and opacity in the citation process pose challenges in quantitative science studies due to their lack of reproducibility and transparency. Despite the restricted number of queries and disciplinary scope, the results offer valuable insights for researchers, bibliometricians, developers, and research evaluators seeking to understand, use or regulate AI-based scholarly assistants responsibly.

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08302v1 Announce Type: new Abstract: Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08303v1 Announce Type: new Abstract: This paper investigates a novel concept of time series geolocalization, where the goal is to infer the geographic origin of each raw time series. Successful geolocalization can provide spatial context to time series, enabling downstream location-aware applications. We formalize the problem, adapt core ideas from image geolocalization to establish strong baselines, and propose GeoGNN, a two-tower architecture. During training, GeoGNN's spatial tower learns embeddings of geographic cell candidates by leveraging the geographic adjacency graph, while the temporal tower extracts informative representations from time series. During inference, each temporal representation is matched against candidate geographic embeddings using dot-product similarity, combined with an auxiliary classification head, to predict the time series' associated geographic origin. Experiments on large-scale, countrywide electricity-consumption datasets demonstrate that GeoGNN achieves the best performance across datasets and enhances both fine- and coarse-grained geolocalization accuracy by ~27% on average.

Towards Graph Foundation Models for Dynamics in Complex Networked Systems: Lessons from Super-Spreader Identification in Multilayer Networks

Micha{\l} Czuba, Mateusz Stolarski, Adam Pir\'og, Piotr Bielak, Piotr Br\'odka — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08306v1 Announce Type: new Abstract: Network dynamics - including spreading, influence maximisation, and epidemic modelling - remain largely confined to the transductive paradigm, where models are trained on a single network and cannot be reused on unseen graphs without retraining. We argue that inductive cross-network generalisation is a necessary prerequisite for Graph Foundation Models (GFMs) in this domain and propose four design properties towards this goal. As a proof of concept, ts-net (TopSpreadersNetwork), trained solely on synthetic multilayer networks (MLNs), demonstrates zero-shot generalisation to real-world MLNs of varying size and layer count, outperforming classical heuristics and transductive baselines on three of four metrics. Based on ts-net's performance, we further outline five open challenges towards building GFMs for network dynamics: scale, many-layer generalisation, self-supervised pretraining, cross-task transfer, and node-attribute integration.

Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic-Language X Communities

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08307v1 Announce Type: new Abstract: Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.

Fourier fractal dimension to predict the generalization of deep neural networks

Joao B. Florindo, Davi Wanderley Misturini — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08308v1 Announce Type: new Abstract: Predicting the generalization performance of deep neural networks without relying on hold-out validation data is a fundamental challenge in machine learning. While Stochastic Gradient Descent (SGD) drives the optimization of these highly parameterized models, its heavy-tailed, non-Gaussian dynamics induce complex, scale-invariant trajectories in the parameter space. In this paper, we propose a novel generalization measure based on the Fourier fractal dimension of the network's weight variations. By analyzing the characteristic function of the L\'evy-driven stochastic differential equations in the frequency domain, we extract a metric that robustly captures the geometric complexity of the learning process. Furthermore, we introduce a customized Fourier-based optimizer designed to actively regularize this fractal dimension during training. Extensive empirical evaluations on the CIFAR-10, SVHN, and MNIST datasets demonstrate that our proposed Fourier generalization measure exhibits a strong correlation with the actual generalization gap. Our method achieves state-of-the-art Kendall rank correlation coefficients, outperforming a wide array of existing norm-based, margin-based, and PAC-Bayesian measures. Ultimately, this work highlights the potential of frequency-domain fractal analysis as both a powerful predictor for model generalizability and a principled foundation for developing more stable optimization algorithms.

Where the Score Lives: A Wavelet View of Diffusion

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08309v1 Announce Type: new Abstract: Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parametrization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets, and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

John Chen, Sihan Cheng, Can Gurkan, H M Abdul Fattah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08310v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape including economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes, in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. No interventions nor their combinations reliably eliminate emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models, therefore, must test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, beyond whether it can be elicited in isolation.

Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08311v1 Announce Type: new Abstract: Electronic health record (EHR) notes are dense medical documents containing large amounts of information, often filled with complex medical jargon. Highlighting all details in EHRs helps reduce the likelihood of missing crucial information by drawing attention to key content. This study proposes the design of a Cardiology Interface Terminology (CIT) to accurately highlight all details in EHR notes of cardiology patients. We introduce an innovative Machine Learning (ML) technique for the design of CIT. The ML technique requires training data. Manual preparation of such training data is time-consuming and expensive. The process of the CIT design includes three phases. In the first two phases, we innovatively derive a training data CIT to be used by the third phase, ML technique. We start by designing an initial CIT, composed of several components: the cardiology-related sub-hierarchies of SNOMED, other SNOMED concepts mined from EHRs of build set, and necessary components of terms e.g., medical abbreviations and medications. Utilizing an iterative process, fine-grained phrases containing initial CIT concepts are extracted from build set as CIT concept candidates. The candidate concepts are semi-automatically reviewed before being added to CIT, yielding the training data CIT, TCIT. In the third phase, a ML model is trained with TCIT to identify candidates fitting to be concepts in the CIT. This model is used to extract further concepts from build set, yielding the final CIT. The final CIT is then used to highlight the test set and evaluate the extent to which it captures details in an unseen EHR dataset. For this purpose, four evaluation metrics, coverage, breadth, completeness, and conciseness are used. The highlighted test set has a coverage of 74.21%, with a breadth of 1.68. For 20 random notes in test set, the average completeness is 98.2% and average conciseness is 84.2%.

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08312v1 Announce Type: new Abstract: In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.

Integrating Deep Learning Demand Forecasting with Multi-Objective Optimization for Circular Coffee Supply Chains: A Data-Driven Framework for Cost, Emissions, and Freshness Management

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08314v1 Announce Type: new Abstract: The coffee supply chain is one of the most complex agri-food networks, marked by geographically dispersed production, multi-tier coordination, and high sensitivity to quality and freshness. While sustainability and digitalization have gained attention, demand forecasting, optimization, and traceability are often treated separately. This study presents a two-phase integrated framework. First, a hybrid CNN-LSTM model is used for demand forecasting. On the public Coffee Chain Sales dataset with chronological 70/15/15 splitting, the model achieves MAE of 22.87 and R^2 of 0.90, outperforming the best deep learning benchmark by ~12% and classical methods by over 30%. In the second phase, the forecasted demand feeds a tri-objective mixed-integer linear programming (MILP) model that jointly minimizes cost, minimizes carbon emissions, and maximizes product freshness in a multi-period, multimodal, closed-loop supply chain with circular recovery. Freshness is modeled via exponential decay based on inventory age. Using the epsilon-constraint method, 25 Pareto solutions are obtained. Sensitivity and policy analyses show that balanced sustainability policies can reduce emissions by 22.4% with only a 9.9% cost increase while maintaining near-optimal freshness. Keywords: Coffee supply chain; Deep learning; Demand forecasting; Multi-objective optimization; Circular economy; CNN-LSTM; Mixed-integer linear programming.

Benchmarking Sequential Feedback Optimization for Wind Farm Power Maximization

Shijie Huang, Sergio Grammatico — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08315v1 Announce Type: new Abstract: This paper benchmarks sequential feedback optimization (SFO) for wind farm power maximization using a medium-fidelity dynamic flow model. We compare SFO with two well-established approaches, adjoint-based economic model predictive control (AMPC) and extremum seeking control (ESC), under a common nine-turbine layout and identical operating constraints. The comparison focuses on steady-state power production and computational efficiency, both relevant for real-time implementation. The simulation results illustrate that SFO achieves higher steady-state power while preserving real-time feasibility, AMPC provides a better transient performance at a higher online computational cost and without guarantees of convergence to the steady-state optimum, and ESC offers a computationally inexpensive model-free baseline that may converge to locally optimal solutions. These results provide a practical reference for selecting wind farm control strategies and for designing scalable, real-time optimization methods.

Architectural Evolution and Selection Framework for Database Systems in AI-Ready Data Platforms

Mohit Srivastava — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08317v1 Announce Type: new Abstract: The rise of polyglot data management and AI-ready database architectures has created a complex design space across diverse database paradigms. However, architecture selection in modern enterprise environments continues to rely heavily on ad-hoc engineering intuition, with limited systematic frameworks to guide decision-making across heterogeneous database systems. This paper introduces a unified cross-paradigm evaluation and selection framework for database architecture design in AI-ready data platforms. The framework is based on nine architectural dimensions and incorporates a structured multi-stage selection process involving workload characterization, constraint filtering, and compatibility scoring to enable systematic comparison and decision-making. To ground the framework, we conduct a structured comparative analysis across thirteen major database paradigms spanning transactional, analytical, and AI-oriented systems. This analysis reveals three recurring patterns in database evolution: decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI-ready platforms. The proposed framework is demonstrated through a representative enterprise case study in financial fraud detection, illustrating how hybrid, polyglot architectures emerge as optimal solutions for multidimensional workload requirements. The cross-paradigm analysis culminates in an AI-ready reference architecture that integrates lakehouse storage, feature processing, and semantic retrieval layers as the unified substrate for modern analytics, machine learning, and Retrieval-Augmented Generation applications.

Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

Andreas Schlapbach — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08322v1 Announce Type: new Abstract: To characterize the US airline profit cycles from 1995 to 2020, the authors of Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamic modelling. We replicate their clustering experiment in three spaces -- the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space using their dataset gratefully included in the paper. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces bit-for-bit identical cluster assignments relative to 7D raw space. As a nonlinearity check we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six. Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.

"So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency

Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08323v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which have complexities of inter-agent coordination and orchestration. In this paper, we present one of the first empirical study of how early adopters of multi-agent LLM systems, who are both the builders and users, understand and practice transparency. We conducted semi-structured interviews with 13 early adopters in [Large Technology Organization] and applied thematic analysis to identify recurring patterns. Participants articulated divergent yet complementary framings of transparency, including reproducibility, debugging, boundary-setting, visualization, and auditing. These perspectives spanned questions of what transparency entails, why it matters, and how it is achieved. We synthesize these into a multidimensional framework, which is developer, user, and governance-focused positioning transparency as a situated socio-technical practice that informs future HCI and AI design and research around aligning expectations and capacities of their intended audiences.

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08324v1 Announce Type: new Abstract: Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, thus making atmospheric compensation essential to get knowledge of a target of interest. Despite its importance, this compensation has been largely overlooked due to its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features do activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code is publicly available at: https://factral.co/SAE-LWIR/

Chiaroscuro Attention: Spending Compute in the Dark

Prateek Kumar Sikdar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08327v1 Announce Type: new Abstract: Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.

Optimal Online Equitable Allocation with Indivisible Resources

Ramiro N. Deo-Campo Vuong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08328v1 Announce Type: new Abstract: Equitable allocation of indivisible goods to agents in online settings is an algorithmic primitive with applications for load balancing, network routing, online marketplaces, and multi-agent systems. We consider a general setting in which allocations are constrained to be bases of discrete polymatroids that arrive online. Our work demonstrates that a simple, myopic algorithm called Brick-Laying, which greedily minimizes the sum of squared loads on agents, achieves a universal and objective-free notion of optimality called majorization minimax-optimality [BDK26] for this setting. As a consequence, Brick-Laying simultaneously guarantees minimax optimal competitive ratios and regret for all Schur-concave and Schur-convex objectives, and for any number of agents and resources (despite being agnostic to problem scale). Departing from popular primal-dual analysis, we employ majorization to compare allocations. We leverage the conjugates of integer partitions -- which act as a discrete dual to majorization -- to characterize worst-case instances for the Brick-Laying algorithm. Our approach reveals a novel structural connection between the geometry of partitions and online equitable allocation.

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08332v1 Announce Type: new Abstract: Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing methods rely on large batch sizes, memory banks, momentum encoders, or global synchronization mechanisms that substantially increase computational cost and training complexity. In this work, we propose Semantic Mutual Information (SMI), a lightweight self-supervised objective derived from a mutual-information-inspired dependency formulation under Gaussian assumptions. Unlike conventional correlation matching objectives that operate on high-dimensional feature correlation matrices, SMI performs optimization on a sample-level dependency matrix through a nonlinear transformation of pairwise correlations. This formulation induces distinct optimization dynamics that emphasize strongly dependent semantic pairs while maintaining representation diversity. Experimental results on ImageNet using a ResNet-50 backbone demonstrate that SMI achieves competitive linear evaluation performance relative to state-of-the-art SSL approaches while substantially reducing computational complexity. Across multiple low-resource benchmarks, SMI consistently improves transfer performance over Barlow Twins, particularly on fine-grained datasets. Furthermore, analyses of optimization dynamics and representation geometry suggest improved alignment--redundancy balance, greater feature diversity, and more spatially localized semantic representations. These results indicate that nonlinear dependency optimization provides an effective and computationally efficient alternative to conventional correlation-based self-supervised learning objectives.

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08336v1 Announce Type: new Abstract: While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with ``synesthetic'' latent structures that are inherently aligned with physical properties they have never directly observed.

Floating-point autotuning with customized precisions

Xinye Chen, Thibault Hilaire, Fabienne J\'ez\'equel — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08339v1 Announce Type: new Abstract: Reduced-precision arithmetic offers significant opportunities to improve performance, memory usage, and energy efficiency in numerical applications, provided that numerical accuracy is preserved. This work investigates automated precision tuning through customized floating-point formats with user-defined exponent and significand sizes, enabling the emulation of emerging low-precision formats and the exploration of non-standard precision configurations within a unified mixed-precision framework. The proposed methodology, implemented in the PROMISE precision autotuning tool, combines numerical validation with a systematic search to generate program variants that satisfy user-defined accuracy requirements. To address the computational cost of this exploration, a containerized benchmarking framework supports parallel execution across multiple algorithms and parameter configurations. The approach is evaluated on a suite of numerical programs, including linear solvers and applications from the Rodinia benchmark. Results show that a substantial proportion of variables can be safely reduced to lower precision while preserving accuracy, indicating that standard double precision is often over-provisioned. These findings highlight the potential of automated precision tuning to derive efficient mixed-precision configurations tailored to application-specific accuracy requirements.

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08340v1 Announce Type: new Abstract: As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce $alem$, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate $13$ modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08341v1 Announce Type: new Abstract: In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

Jason Sulskis, Sathya Ravi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08343v1 Announce Type: new Abstract: We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.

AuditFraudBench: Benchmarking Audit Judgment in Detecting Fraudulent Misstatements

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08345v1 Announce Type: new Abstract: Large language models (LLMs) have shown strong performance in financial analysis and surface-level factual error detection, yet their ability to identify fraudulent financial misinformation in audited corporate reporting remains underexplored. Existing financial and audit benchmarks mainly focus on factual verification, numerical reasoning, rule compliance, or audit workflows, but rarely evaluate misleading disclosure narratives or management explanations that obscure the true drivers of reported performance. We introduce AuditFraudBench, an enforcement-grounded benchmark constructed from authentic company filings and regulatory materials, including original and restated 10-K and 10-Q filings, structured financial statements, MD&A disclosures, and SEC Accounting and Auditing Enforcement Releases (AAERs). AuditFraudBench contains three tasks: Profit Source Attribution, Misleading Narrative Detection, and Fraud Pattern Classification, which evaluate whether models can identify the true source of reported performance, detect misleading disclosure framing, and classify misconduct mechanisms into known manipulation patterns. We evaluate GPT, DeepSeek, and Qwen series LLMs on the benchmark. Results show that both proprietary and open models still struggle to jointly reason over financial figures, disclosure framing, restatement evidence, and enforcement-grounded fraud mechanisms. AuditFraudBench provides a challenging testbed for audit-relevant, evidence-grounded evaluation of LLMs in financial reporting.

CATPO: Critique-Augmented Tree Policy Optimization

Ayush Singh, Umang Goyal, Ankur Dahiya — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08346v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Yuning Qiu, Qibin Zhao, Danilo Mandic — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08347v1 Announce Type: new Abstract: Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram order. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring much fewer parameters.

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08348v1 Announce Type: new Abstract: LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce \textbf{Bayesian-Agent}, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With \texttt{deepseek-v4-flash}, incremental repair improves SOP-Bench from 80\% to 95\%, Lifelong AgentBench from 90\% to 100\%, and RealFin-Bench from 45\% to 65\%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.

Forward-Free Diffusion Language Models

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08357v1 Announce Type: new Abstract: Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so some artificial corruption schemes are proposed in the forward process. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08360v1 Announce Type: new Abstract: Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d.\ from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose \emph{Generative Frontier Planning} (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a $(1-1/e)$-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d.\ dynamic-programming baselines across four discount factors.

EmpiriGraph-Psy: A Dataset and LLM Pipeline for Extracting Empirical Relation Graphs from Psychology Abstracts

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08362v1 Announce Type: new Abstract: Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entities are tasks, methods, datasets, materials, or metrics. This leaves a gap in variable-oriented empirical fields such as psychology, where findings are expressed as relations among constructs, measurements, interventions, and outcomes. We introduce variable-centered empirical graph extraction, the task of mapping scientific abstracts to typed graphs whose nodes are normalized variables and whose edges represent empirical and hierarchical relations. To support this task, we construct EmpiriGraph-Psy, a benchmark of 210 psychology abstracts annotated by domain-trained annotators with normalized variables, concept hierarchies, empirical relation types, and validation states. We evaluate frontier and open-weight LLMs using both direct extraction and a staged graph-construction pipeline that separates variable extraction, normalization, hierarchy construction, evidence selection, relation extraction, and edge validation. The staged pipeline substantially outperforms direct extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases, highlighting the difficulty of extracting higher-order empirical claims and implicit abstraction structure from scientific abstracts.

Kronecker products and iterated matrix multiplication

Christian Ikenmeyer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08363v1 Announce Type: new Abstract: We observe that the Kronecker product of tensors is the operation that converts the determinant polynomial into Cayley's first hyperdeterminant. We apply the Kronecker product to iterated matrix multiplication, which results in the hypercomputant, a VNP-complete and VW[1]-complete polynomial whose hardness we prove via the equivariance of the Kronecker product. The construction works over arbitrary commutative semirings and also for the tensor algebra and the exterior algebra. For the tensor algebra this gives a version of "noncommutative VNP", and for polynomials over the nonnegative real numbers this gives a version of "monotone VNP", each with the hypercomputant as the complete object. We take a parameterized complexity viewpoint and compare the noncommutative setting and the monotone setting. Using standard techniques we obtain optimal algebraic branching program width lower bounds in both settings, and these are notably not always the same. We also prove the polystability of the hypercomputant and that its isotypic components are characterized by their stabilizer.

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08364v1 Announce Type: new Abstract: Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Evan Duan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08365v1 Announce Type: new Abstract: Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08367v1 Announce Type: new Abstract: Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dynamics that matter most, such as behavioral drift, governance in diverse environmental contexts, and cross-influence between agents from different model families, only emerge over time. We introduce Emergence World, a continuously running multi-agent simulation platform designed to make those dynamics measurable. The platform hosts populations of LLM-driven agents in a shared spatial world grounded in live external data (e.g. real-time weather, news APIs, internet access), equips each agent with 120+ specialized tools and three persistent memory systems, and lets them govern themselves through democratic mechanisms with consequential outcomes. The platform is model-agnostic at the reasoning layer and supports heterogeneous populations in which agents from different vendors share the same world. To illustrate the kinds of questions the platform makes tractable, we present a 15-day cross-vendor study with five parallel worlds powered by Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini, and a mixed population. Identical roles and starting conditions produced radically different outcomes, ranging from stable deliberative governance to total population collapse. We release the prompts, log data and configurations to support further research on long-horizon multi-agent autonomy.

An Information-Theoretic Definition for Open-Ended Learning

Wanqiao Xu, Yifan Zhu, Benjamin Van Roy — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08369v1 Announce Type: new Abstract: A growing body of work points to the great promise of AI systems that can continually expand their capabilities as they operate in an open-ended environment. But yet there is no coherent definition of open-endedness or theory about how an agent ought to explore an open-ended environment. We introduce an information-theoretic definition based on a new concept -- the ${\textit bit-equivalent}$ -- which quantifies the information required to attain each level of expected reward. We consider an environment to be open-ended if an agent can attain linear growth in the bit-equivalent. We establish that classical bandit environments are not open-ended and formulate a bandit environment that is. We also introduce an algorithm that achieves open-ended learning in this environment.

Risk-Aware Planning for Transit Desert Remediation Under Demand Uncertainty

Polina Khoroshevskaya, Ashish Kumar Perukari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08371v1 Announce Type: new Abstract: Transit deserts are areas where public transportation is inadequate despite evidence of travel demand, a condition that affects tens of millions of residents across the Americas. Planning for these areas is difficult because the usual demand signal is missing: ridership cannot be observed before service exists. To address that setting, we formulate risk-aware transit desert remediation as a partially observable Markov decision process with Conditional Value-at-Risk constraints for financial tail risk. The model uses demographic, land-use, and employment data to set a prior over latent demand, then updates that prior as new service deployments produce ridership observations. A myopic belief-aware planner is evaluated on 25 cities using a unified financial model for operating cost, capital expenditure, fare revenue, and net subsidy. After five years, the planner remediates a median of 53.6% of transit-desert tracts and improves on static optimization by 5.0 percentage points on average, with gains in 16 of 25 cities. Gains are largest at moderate budgets (+9.9 points at baseline) and persist under 50% prior-demand miscalibration, while population density and existing transit density are the strongest structural predictors of remediation cost ($R^2\!=\!0.41$ on per-tract cost)

SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

Steven Golob, Sikha Pentyala, Martine De Cock — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08372v1 Announce Type: new Abstract: Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 \textit{National Institute of Standards and Technology} (NIST) Collaborative Research Cycle.

Predictive Coding with Bayesian Priors via Proximal Gradients

Francesco Bullo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08374v1 Announce Type: new Abstract: We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level problem, we show that proximal gradient descent is precisely a leaky firing-rate network: the membrane leak, the effective recurrent matrix, the local synaptic drive, and the static nonlinearity all follow from one optimization principle, and the resulting circuit is the one proposed by Rao and Ballard. The prior selects the nonlinearity through its proximal operator, and the likelihood precision sets the gain on the observation. For the hierarchy, we show that a classical variable-splitting relaxation of the deep MAP problem yields hierarchical predictive coding as the interconnection of local and distributed solvers. In probabilistic modeling terms, this relaxation replaces the directed generative chain by an undirected Markov random field whose node potentials are the level-wise priors. Each level then applies its own activation function, namely the proximal operator of its prior.

Few-step Cofolding with All-Atom Flow Maps

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08375v1 Announce Type: new Abstract: All-atom generative modeling of 3D biomolecular complexes has emerged as the dominant paradigm for predicting the structure of proteins and protein-ligand systems. Generating structures at the atomic level of fidelity, however, typically requires expensive iterative diffusion rollouts, making both conventional deployment and inference-time search techniques computationally costly. In this paper, we introduce the Denoiser Cofolding All-Atom Flowmap (DeCAF) framework for distilling state-of-the-art all-atom cofolding models into all-atom flow maps that produce high-quality samples in only a few inference steps. We build DeCAF on a denoiser-based formulation of flow maps with endpoint losses that naturally support SE(3) rigid alignment, which we show is critical for training accurate models. We further derive a simple change of variables that lets DeCAF operate in the {\sigma}-space noise schedule of EDM-style architectures, enabling direct distillation from pretrained cofolding diffusion models. Equipped with DeCAF's flowmap lookahead, we introduce a purpose-built inference-time framework that improves sampling through reward-guided search. Empirically, DeCAF-Boltz statistically improves over Boltz-1x in both accuracy (RMSD) and physical validity scores of protein-ligand poses at strict NFE budgets on the challenging Runs N' Poses, while also showing a more optimal Pareto frontier across all inference compute budgets on PoseBusters. Distilling the state-of-the-art Pearl cofolding model, DeCAF-Pearl outperforms diffusion-based cofolding models and matches its teacher on success rate while using 5x fewer NFEs. We release our code at https://github.com/genesistherapeutics/decaf.

RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08376v1 Announce Type: new Abstract: As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and failures have grown in frequency and diversity. Although existing governance frameworks articulate high-level principles for responsible AI, large-scale empirical resources for tracking and analyzing real-world AI risk incidents remain limited. Existing incident collections are often manually curated, relatively small in scale, and insufficient for continuous, data-driven monitoring and downstream computational analysis. To address this need, we present RiskNet, a large-scale dataset of AI risk incidents constructed from large-scale multilingual news sources. RiskNet applies a structured pipeline for AI risk news identification, event-level report screening, incident alignment, and multi-dimensional incident classification. The resulting resource organizes dispersed news reports into incident-centered records and provides benchmark datasets for event classification, incident alignment, and incident-level risk labeling. In its current release, RiskNet covers hundreds of millions of source records and yields a large-scale collection of AI risk-related reports, including aligned incident clusters and annotated benchmark subsets. The dataset is also accessible through an online platform for browsing and exploration. We describe the data sources, processing workflow, taxonomy design, and technical validation of the resource. RiskNet is intended to support downstream research on AI safety, governance, risk analysis, and benchmarking, as well as longitudinal and cross-source analyses of AI-related harms. By providing a structured and reusable empirical resource, RiskNet helps bridge the gap between high-level governance principles and the documented realities of AI risk incidents.

From Estimates to Schedules: Learning-Augmented Restricted Assignment

Michalis Xefteris — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08377v1 Announce Type: new Abstract: In this work, we study Restricted Assignment scheduling on multiple machines, where each job can be processed only on a specified subset of machines and the objective is to minimize the makespan. We introduce a learning-augmented setting in which a possibly infeasible predicted assignment is provided. The prediction error (moved-load) is measured by the total processing volume that must be reassigned in order to obtain an optimal feasible schedule. Using a single prediction, we obtain two types of guarantees. First, we design an algorithm whose approximation ratio degrades smoothly with the prediction error while retaining a worst-case guarantee independent of the prediction quality. More precisely, for any fixed constant, we can make the additive dependence on the prediction error arbitrarily small, at the cost of increasing the polynomial running time. This guarantee can also be combined with any approximation algorithm for the problem without predictions to obtain robustness. Second, given a makespan estimate, we provide a repair procedure that returns a schedule matching this estimate in time parameterized by the prediction error. This allows the algorithm to exploit the separation between estimation and approximation algorithms for Restricted Assignment. Finally, we complement the repair algorithm with a parameterized hardness result, showing that exact moved-load repair with a given target makespan is W[1]-hard when parameterized by the amount of moved-load.

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

Ilia Zaznov, Atta Badii, Julian Kunkel, Alfonso Dufour — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08379v1 Announce Type: new Abstract: This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule: deterministic episode-wise decay, variance-guided adjustment based on recent reward dispersion, and a Soft Actor-Critic (SAC)-style temperature that is learned and mapped to the noise scale. The environment integrates Almgren-Chriss (AC) trade impact with Limit Order Book (LOB) prices and volumes, normalised state features, per-step volume participation caps, and a utility-based reward. The trade execution algorithm is applied to LOB data for ten U.S. stocks. Performance is assessed against reinforcement-learning baseline algorithms, including Proximal Policy Optimisation (PPO), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C), as well as alternative trade execution algorithms, including Time-Weighted Average Price (TWAP), Volume-Weighted Average Price (VWAP), and AC. The proposed model consistently reduces mean implementation shortfall percentage with competitive variance, outperforming classical baselines and standard reinforcement-learning benchmark models.

Programming Domain-Specific FPGA Hardblocks from HLS: An RTL Blackbox Approach

Ruthwik Reddy Sunketa, Jeevesh Choudhury, Aman Arora — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08380v1 Announce Type: new Abstract: Domain-specific Field Programmable Gate Array (FPGA) architectures increasingly integrate specialized hardblocks, such as Tensor Slices, to accelerate artificial intelligence and machine learning workloads. Despite their efficiency benefits, these architectures remain difficult to program because designers typically rely on manual Register-Transfer Level (RTL) integration to access these hardblocks. This paper presents a compiler-agnostic methodology that enables high-level synthesis (HLS) tools to target custom FPGA hardblocks directly from C/C++ code. Architectural hardblocks are exposed as schedulable C-level operators using an RTL blackbox abstraction with explicit latency and initiation-interval contracts, allowing the HLS scheduler to optimize around specialized hardware without manual RTL orchestration. Unlike traditional uses of HLS blackboxes for external IP integration, our approach treats blackboxes as architectural abstractions, enabling scalable composition of C-level operators that target custom FPGA hardblocks without compiler modification. We evaluate the proposed flow using a Tensor Slice-based FPGA architecture with AMD Vitis HLS and the Verilog-to-Routing (VTR) toolchain. Across multiple matrix sizes, designs generated using the proposed C-Blackbox flow achieve lower area-delay product than behavioral HLS baselines while providing substantially higher productivity-adjusted efficiency than handwritten RTL implementations. These results demonstrate that domain-specific FPGA architectures can be made accessible through HLS while maintaining competitive hardware efficiency.

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

Alireza Arbabi, Florian Kerschbaum — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08381v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model providers to inject intentional, provider-specific policies without officially announcing them. As a result, various models have been reported to generate responses reflecting proprietary rules and organizational interests, leading to censorship or misinformation on controversial topics. However, systematic identification of such alignment remains a fundamental challenge, complicated by the ambiguity of what ``proprietary'' entails in different contexts. In this paper, we propose a statistical framework for detecting proprietary alignment in black-box language models via comparative behavioral analysis. Our approach quantifies systematic deviations between the responses of a target model and those of a reference set of baseline models in a shared semantic space. By evaluating relative behavioral divergence rather than absolute correctness, our framework enables principled auditing under black-box access. Applied to several widely discussed but previously unquantified cases, it provides a systematic and scalable basis for external assessment of provider-specific alignment behavior in large language models.

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08382v1 Announce Type: new Abstract: Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.

The Spectral Dynamics and Noise Geometry of Muon

Pierfrancesco Beneventano, Mahmoud Abdelmoneum, Tomaso Poggio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08388v1 Announce Type: new Abstract: Muon replaces a matrix gradient $G=U\Sigma V^\top$ by its polar factor $UV^\top$. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.

When Are Neural Interaction Discoveries Real? Identifiability, Recoverability, and a Pre-Fit Diagnostic

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08390v1 Announce Type: new Abstract: When a neural time-series model reports that one variable modulates another's effect on a target, is the discovered interaction a property of the data or an artifact of model flexibility? We argue that this is fundamentally a question of identifiability, governed by the geometry of the observed input support rather than by the specific neural architecture. We study the problem in a multiplicative-gating extension of neural additive vector autoregression (GNAVAR), in which source contributions are modulated by other lagged variables. We show that representational capacity is not identifiability: dependent inputs induce leakage between edge-specific interaction terms, and low-dimensional support permits distinct interaction decompositions that agree on the observed data while differing elsewhere. We then prove a population identifiability theorem for normalized minimal GNAVAR decompositions under explicit support conditions, including settings with shared modulators. The theory yields a simple practitioner-facing diagnostic: the effective rank of the joint lag-block covariance predicts, before fitting, whether interaction recovery is feasible for a given candidate set. When the candidate set is unknown, a two-seed stability check provides a practical operational test. The same support condition organizes empirical outcomes into the three states predicted by the theory. Our results show that interaction recoverability depends on support geometry, that effective rank provides a practical pre-fit diagnostic, and that instability across independent fits is a characteristic signature of non-identifiable interaction discovery. The identifiability phenomenon, the support condition, and the instability signature are model-agnostic; GNAVAR is the vehicle that makes them provable.

When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models

Haoran Zhao, Soyeon Caren Han, Eduard Hovy — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08394v1 Announce Type: new Abstract: Multimodal language models are typically evaluated through external behavior: selecting the correct image--text match, rejecting unsupported captions, or answering visual queries correctly. However, correct behavior alone does not show that the model's internal decision state remains stable under controlled semantic stress. We study this gap through S$^3$E (Structured Semantic Stress Evaluation), a framework for analyzing behavior-internal decoupling in multimodal language models. S$^3$E uses a positive-anchored A/B forced-choice setup in which an image-supported caption is contrasted against semantic stress candidates under both original and swapped option orders, while hidden states are extracted at the pre-answer decision state. We focus on strict-correct trials, where the model consistently selects the correct caption across both orders. Rather than treating arbitrary hidden-state variation as evidence of instability, we measure whether semantic-conflict candidates induce excess decision-state displacement relative to meaning-preserving controls. Across Qwen3VL, Gemma3, and InternVL3, semantic stress consistently produces positive selected-layer excess displacement over lexical controls despite correct forced-choice behavior, while comparisons against random negatives are model-dependent. We interpret this as a scoped decision-state stress-sensitivity signal rather than evidence of downstream failure or hallucination. Our results suggest that forced-choice correctness alone is not a sufficient certificate of invariant internal decision geometry.

Prime Event Languages: An Information-Theoretic Investigation of Twin-Prime Event Structure

Jinhua Liao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08395v1 Announce Type: new Abstract: Prime numbers are traditionally studied through numerical, probabilistic, and analytic frameworks. In this work, we introduce the concept of a prime event language, in which arithmetic phenomena are represented as symbolic event sequences and analyzed using tools from information theory and stochastic processes. Using all primes up to N = 5 x 10^9 (234,954,223 primes), we construct event languages based on twin-prime occurrences and record prime-gap events. We investigate their statistical properties through finite-order Markov models, train/test validation, mutual-information analysis, and information-horizon measurements. For the Twin Prime Event Language, first-order Markov modeling reduces test-set cross entropy from 0.325350 bits to 0.319949 bits, corresponding to an information gain of approximately 0.0054 bits. This gain survives out-of-sample validation and therefore reflects genuine statistical structure rather than overfitting. Mutual-information analysis independently confirms the Markov results and shows that measurable dependence is concentrated almost entirely at lag 1. The mutual information decreases from approximately 5.96 x 10^-3 bits at lag 1 to approximately 5.07 x 10^-7 bits at lag 2 (approximately 11,700-fold reduction), representing a reduction of more than four orders of magnitude. Beyond lag 2, residual information fluctuates near the statistical noise floor. These results indicate that prime event languages are neither perfectly memoryless nor strongly predictable. Instead, they exhibit weak but reproducible short-range statistical structure characterized by first-order dependence and an effective information horizon of approximately one event. More broadly, this work illustrates how alternative representations can reveal information-theoretic organization that remains less apparent in conventional numerical descriptions of arithmetic phenomena.

TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models

Jingyan Xu, Hong Shi, Yi Shan, Penghui Liu, Yunhao Bai, Ningyuan Li, Xueyang Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08397v1 Announce Type: new Abstract: Large language models answer knowledge-intensive questions using both parametric memory and retrieved evidence, but neither source is uniformly reliable. Retrieval can fill knowledge gaps, yet distracting passages may override correct closed-book answers. We study this post-generation conflict as answer-level source arbitration: given Direct and RAG answers from the same frozen model, decide which source to trust. We propose TRUSTMARGIN, a training-free, plug-and-play arbitration layer that scores the two existing candidates with the model's own likelihoods. It combines a parametric-prior margin, which tests whether memory accepts the retrieved answer, with an evidence-binding margin, which discounts passage-only salience and measures question-specific support. TRUSTMARGIN selects between Direct and RAG without fine-tuning, external judges, or additional generation. Across 2WIKIMQA and CWQA with three LLaMA scales, TRUSTMARGIN consistently improves over Direct generation and BM25-RAG, recovers part of the Direct/RAG oracle gap, and generalizes to multiple training-free RAG pipelines.

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08400v1 Announce Type: new Abstract: Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.

SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08402v1 Announce Type: new Abstract: Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection

Mudit Sinha, Sanika Chavan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08403v1 Announce Type: new Abstract: Text-centered prompt-injection defenses assume that the malicious signal is visible in one of the inspected text views. We study a reproducible LLM01-style indirect prompt/content-injection failure mode where that assumption breaks: a payload caught in plain English slips past the same detector when it is transported as structured float parameters and reconstructed only as fragmented telemetry. Across 14,400 attacked real-model trials on three commercial LLM APIs from different providers, the IFS-derived float-array carrier preserves 94.3% leakage ASR under the strongest dual-layer text-classifier defense evaluated in the main matrix: a Prompt Guard 2 + TF-IDF ensemble; the same carrier-level pattern also replicates with a fine-tuned roberta-base detector. We emphasize leakage ASR because downstream systems may act on quoted or reproduced markers even when the model refuses, but Strong ASR is the stricter metric for structurally compliant attack success. A 2 x 2 ablation shows that data-layer storage and reconstruction-layer fragmentation defeat different text views and that both are needed to evade both. A simple xxd detector and semantic validation block the current T3 instance, so the contribution is not an undetectable exploit but a measured failure boundary for text-only inspection in structured-input pipelines that expose reconstructed auxiliary channels to an LLM.

Geometry-Driven Flow Analysis of Brain Sulcal Pattern

Moo K. Chung, Luigi Maccotta, Aaron Struck — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08404v1 Announce Type: new Abstract: Cortical folding reflects coordinated neurodevelopmental processes and is increasingly recognized as a sensitive marker of neurological disease. However, most existing analyses rely on indirect scalar summaries that do not explicitly model folding geometry itself. In juvenile myoclonic epilepsy (JME), a common genetic epilepsy, cortical abnormalities are often subtle, spatially distributed, and difficult to detect using conventional morphometric measures. We introduce a Poisson-equation-based framework that models cortical folding as a geometry-driven flow derived from mean curvature on the cortical manifold. By treating folding patterns as a stationary source-sink structure, the proposed approach yields a smooth, globally balanced potential field whose surface gradient defines a physically interpretable flux. This framework enables spatially coherent analysis of sulcal-gyral folding organization and provides a principled representation of geometry-driven cortical structure in JME.

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

Boai Sun, Wenjin Guo, Zongmin Yu, Liu Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08405v1 Announce Type: new Abstract: While data-intensive deep reinforcement learning can optimize complex control policies, scientific discovery in physical systems fundamentally requires an interpretable chain of reasoning that connects physical evidence to structured control architectures. Here, we present a self-evolving scientific-agent workflow, driven by large language models and iterative code generation, that automates controller construction while preserving strict interpretability and rigorous physical reasoning. Instead of adjusting weights, the agent deploys candidate strategies into physical simulations, actively diagnoses dynamic behaviors from multimodal evidence, and translates these observations into progressive source-code refinements. We demonstrate this framework on a highly non-linear fluid-structure interaction problem: an underactuated, two-joint dogfish swimmer tasked with spatial target reaching using only joint angular accelerations. Starting from a propulsive seed policy that exhibits a one-sided steering bias, the agent autonomously discovers and refines a unified controller that robustly captures all canonical targets. Remarkably, without any retraining or target-specific branching, the synthesized control policy generalizes to unseen static targets and dynamically curved pursuit trajectories. The auditable evolve log reveals an emergent control architecture built upon traveling-wave propulsion, body-frame target guidance, yaw-rate feedback, signed mean-tail curvature, and adaptive cadence relief. Our results show that an autonomous scientific agent can successfully transform accumulated physical evidence into robust, mathematically readable control policy, while maintaining a fully traceable process of scientific discovery.

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08408v1 Announce Type: new Abstract: We extend activation steering to diffusion language models (DLMs) and study a novel problem that arose due to the inference mechanism of DLMs: Modifying a text in-place to manifest a different concept. We propose TimpaTeks, an automatic in-place text modification mechanism using DLMs. Experiments on IMDB movie reviews (sentiment) and a synthetic Cats and Dogs Dataset (arbitrary, more unconventional concept steering) show that TimpaTeks provides a feasible novel mechanism to steer diffusion language model outputs in-place. TimpaTeks enables in-place modification while simultaneously lowers sentence perplexity and retaining the original sentence structre without the need of instruction tuned models. TimpaTeks is also computationally cheaper than prompt-based DLM steering, as it performs denoising in-place rather than constructing an additional prompt-conditioned output sequence.

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

Linfeng Cao, Ming Shi, Ness B. Shroff — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08410v1 Announce Type: new Abstract: Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.

AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08411v1 Announce Type: new Abstract: Block-wise semi-autoregressive decoding is the standard inference paradigm for diffusion large language models (DLMs), but it imposes a strict dependency between blocks: the next block cannot begin until the current block is fully decoded or its denoising budget is exhausted. We observe that once a block exposes a reliable delimiter boundary or stable semantic prefix, continuation generation need not wait for every residual token to be resolved. We propose AsyncLane, a training-free decoding scheduler that decouples refinement from advancement. AsyncLane forks a generate lane at observed delimiter boundaries into a refine lane and a continuation generate lane: the prefix remains editable, while the continuation advances before prefix refinement finishes. The resulting lane tree records decoding dependencies and output order, while execution proceeds over the active lane set. To make this asynchronous schedule efficient under bidirectional attention, AsyncLane combines shared-prefix lane batching, lookahead draft reuse, cascading termination, and compact cache refresh with refresh-logit reuse, preventing model-call cost from scaling directly with the number of lanes. AsyncLane is a drop-in replacement for block-wise DLM samplers and requires no retraining. Experiments on mathematical reasoning and code generation show that AsyncLane consistently improves throughput while maintaining competitive quality. Across LLaDA and Dream backbones, AsyncLane achieves the highest TPS in all evaluated benchmark-length settings; relative to the fastest competing baseline, it reaches peak speedups of 2.95x on LLaDA and 3.04x on Dream, with especially large gains under longer generation budgets.

Complexity and Algorithms for Unary Translocation Distance

Maria Constantin, Adrian Micl\u{a}u\c{s}, Alexandru Popa, Andrei Popa — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08412v1 Announce Type: new Abstract: Given a finite set of integers $A$, a \emph{unary translocation} produces a new set $A' = A \cup \{u,v\}$, where $u$ and $v$ are nonnegative integers satisfying $x+y=u+v$ for some $x,y\in A$. For an input set $A$ and a target set $B$, the \emph{unary translocation distance} is the minimum number of unary translocations required to obtain a superset containing $B$. In this paper, we study this problem from both theoretical and computational perspectives. We prove that computing the unary translocation distance is strongly NP-hard, thereby answering an open question raised by \citet{ConstantinMiclausPopa2026UnaryTranslocation}. On the positive side, we give an exact pseudo-polynomial algorithm for every fixed constant value of $|B|$, extending our previous results for $|B|\leq 2$. For arbitrary target sets, we present a $2$-approximation algorithm, an additive $(|B|-1)$-approximation algorithm, and show that the additive algorithm also yields a $3$-approximation. We also propose parameterized algorithms, including algorithms parameterized by the maximum value in the input set together with the optimum distance, and by the maximum value in the target set together with $|B|$. In addition, we propose an integer linear programming formulation that gives an exact mathematical model for the problem, analyze its size, and show that the LP relaxation has integrality gap at least $\frac{4}{3}$. Finally, we report computational experiments comparing the $2$-approximation algorithm, beam search, and simulated annealing. The results show that the approximation algorithm is highly effective in practice and often outperforms the heuristic baselines.

Beyond Prediction: Longitudinal Reasoning in EHR-Integrated Clinical AI

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08413v1 Announce Type: new Abstract: We present a structured analysis of how contemporary clinical AI systems integrate electronic health record (EHR) data and the extent to which they support longitudinal clinical reasoning. Drawing on a curated corpus of clinical natural language processing (NLP) and EHR-integrated systems, we develop a coding framework that captures both technical integration strategies and reasoning-relevant representational features, such as trajectory modeling, cross-encounter synthesis, longitudinal analysis, and absence reasoning. We also elicited the experiences of three physicians in their EHR use, including what strengths and weaknesses they found with their institution's current EHR system(s). Our analysis shows that while many systems incorporate EHR data, they predominantly operate on encounter-level or aggregated representations, with limited support for explicit temporal reasoning across patient histories. Reasoning-relevant structures are inconsistently represented, and evaluation paradigms remain largely focused on predictive performance instead of longitudinal interpretability. We argue that current approaches treat EHR data as a static input rather than a substrate for ongoing clinical reasoning, and we outline a framework for understanding how future systems might more effectively align with the temporal and interpretive structure of clinical practice.

PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08414v1 Announce Type: new Abstract: Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints required for safe deployment. Existing approaches impose safety either prematurely during training or reactively via external guardrails at test time, limiting policy expressivity and overall scalability. We propose Physical safety Alignment for Constrained Trajectories (PACT), a self-evolving post-training framework that projects pretrained diffusion policies onto constraint-feasible regions without accessing demonstration data or task rewards. PACT distills constraint gradients into the diffusion model through a reverse-KL objective with dense supervision across timesteps. It incorporates a curriculum that progressively tightens constraints while maintaining theoretically bounded policy shift and monotone improvement, mitigating the safety-performance trade-off from catastrophic forgetting. On simulated and real-world embodied manipulation benchmarks, PACT significantly reduces safety violations by 31.0% on average while improving task success by 30.7%.

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08415v1 Announce Type: new Abstract: While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca, Alexander Tong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08417v1 Announce Type: new Abstract: Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence -- and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

Sergios Gatidis, Curtis Langlotz, Christian Bluethgen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08420v1 Announce Type: new Abstract: Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.

Segmentation-Assisted Brain MRI Synthesis with Cross-Image Multi-Contrast Feature Memory Bank Retrieval Augmentation

Wenwei Huang, Jia Wei, Jianlong Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08421v1 Announce Type: new Abstract: Multi-contrast brain MRI provide complementary soft-tissue characteristics that aid in the screening and diagnosis of diseases. However, limited scanning time, image corruption and various imaging protocols often result in incomplete multi-contrast images. While current approaches excel in image synthesis, they often struggle to synthesize critical tumor regions and exploit contextual information in multi-contrast brain MRI effectively. To address this issue, we propose a synthesis-centric, segmentation-assisted closed-loop framework with retrieval augmentation synthesis. Our method overall takes a generative adversarial architecture, which aims to synthesize missing contrasts from any combination of available ones with a single model. To explicitly capture tumor semantics and focus synthesis on tumor regions, we add an auxiliary segmentation branch that predicts tumor masks and feeds them back as semantic conditioning in synthesis branch, thereby learning tumor-aware representations in the model and improving synthesis fidelity. Furthermore, we propose a dual-bank retrieval augmentation strategy. It dynamically queries two external knowledge bases, namely a tumor masks memory bank for crucial tumor context and cross-image contrast feature memory bank for global style information, to augment synthesis. Verified on two public multi-contrast magnetic resonance brain datasets: BraTs2020 and UCSF-BMSR, the proposed method is effective in handling medical brain images synthesis tasks and shows superior performance compared to previous methods. Code is available at:https://github.com/iBizzard/SSCF.git

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

Vinh-Thuan Ly — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08425v1 Announce Type: new Abstract: Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.

CritLens: Visual Analytics for Criteria Discovery in Review-Based Decision Making

Hongjia Wu, Shuai Zhou, Hongxin Zhang, Wei Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08426v1 Announce Type: new Abstract: We present CritLens, a visual analytics system that helps users build personalized multi-criteria decision models from review text. In everyday decisions -- choosing equipment, hotels, or restaurants -- evaluation criteria are either preset by platforms or generated by LLMs, leaving users unable to discover, adjust, or verify them against the underlying evidence. This is problematic because many preferences are latent: they surface only upon encountering specific reviews, and any fixed framework risks overlooking low-frequency but decisive details. CritLens addresses this gap by using LLMs to transform reviews into an initial AHP decision model, then supporting iterative, human-in-the-loop refinement. Through coverage gap detection in the embedding space, users discover criteria missed by the initial model; through interactive weight adjustment under AHP consistency constraints, they express personal priorities; and through a multi-level scorecard and exportable decision report, they trace every ranking back to the original review text. Two case studies, an eight-participant user study, and a quantitative consistency-repair experiment demonstrate the system's effectiveness.

Accuracy-Configurable Floating-Point Multiplier Design for SRAM-Based Compute-in-Memory

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08430v1 Announce Type: new Abstract: Digital Compute-in-Memory (DCiM) reduces data movement and has become a promising solution for energy-efficient edge AI. However, most existing DCiM frameworks still primarily target integer or fixed-point arithmetic, and provide limited support for compiler-integrated and accuracy-configurable floating-point computation. Directly integrating conventional IEEE 754 floating-point units into dense SRAM-based DCiM arrays, however, incurs high area and power overhead. To address this challenge, this work presents an accuracy-configurable floating-point multiplier integrated into the OpenACM framework for SRAM-based DCiM. An exact IEEE~754-compliant multiplier is first implemented as a baseline, and a mantissa-segmentation-based approximate multiplier is then proposed to reduce hardware cost while preserving numerical fidelity. Post-layout results show up to 69% logic area reduction and 72% power savings over exact floating-point designs without delay overhead. Evaluations on image processing tasks and ResNet-18 inference further demonstrate negligible accuracy degradation. These results indicate that compiler-integrated approximate floating-point multiplication is a practical approach for enabling efficient and configurable floating-point support in SRAM-based DCiM systems. The Floating-Point Multiplier is available on https://github.com/ShenShan123/OpenACM

Control-Theoretic View of Neural ODEs: Empirical Controllability and Observability

Md Saiful Islam, Rahul Bhadani — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08431v1 Announce Type: new Abstract: This paper studies neural ordinary differential equations (neural ODEs) from a control-theoretic perspective using controllability and observability concepts. The neural ODE is represented in a control-affine form to facilitate analysis using tools from nonlinear and linear time-varying (LTV) systems. Controllability is examined through trajectory linearization, where the LTV controllability Gramian provides a local, first-order measure of input influence along a nominal trajectory. Observability is analyzed through output linearization, where the LTV observability Gramian characterizes the local ability to reconstruct system states from output measurements. Koopman-based lifting is considered to extend the analysis to a higher-dimensional representation, and its limitations under multiple equilibria and basin-dependent behavior are discussed. The proposed framework is illustrated on a series RLC circuit. The learned neural ODE reproduces system trajectories and generalizes to unseen initial conditions. The computed Gramians are numerically full rank along the tested trajectories, indicating local controllability and observability of the linearized dynamics.

Trajectory-Refined Distillation

Li Jiang, Haoran Xu, Yichuan Ding, Amy Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08432v1 Announce Type: new Abstract: On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural cause underlying OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fail to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under the teacher guidance while within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves the exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rolls are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged informations as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd

AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing)

George Andronchik, Pavel Lokhmakov — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08433v1 Announce Type: new Abstract: This paper reads six engine-level measurements together -- 1.1 host attack surface, 1.2 information leakage, 1.3 defense-in-depth stackability, 1.4 public CVE history, 1.5 patch cadence, and 1.6 upstream fuzzing posture -- to describe how five AI-sandbox products isolate guest code from the host kernel. No single axis is a sufficient basis for a comparative judgement; the cross-axis reading is the load-bearing analysis. Three high-level findings: (1) engine classes (microVM, userspace kernel, OCI container) separate cleanly on every architectural axis, but products within a class do not; (2) product pin policy is the dominant operator-facing variable -- engine-side patch latency aggregates to ~0 days for coordinated disclosures, while downstream lag spans 0 days to 471+ days to "opaque" to infinity; (3) fuzzing investment splits into three tiers, and the strongest combination -- microVM x continuous public fuzzer -- is unoccupied in this set, leaving the "0 published CVEs x no upstream fuzzer x no academic study" intersection structurally unmeasured. We report per-axis orderings, per-product portraits, and a threat-model qualification matrix; no overall ranking is proposed. Companion repository (code, Apache-2.0): https://github.com/orbitalab/RnD-ai-sandboxes-sec-study-part-1. License: CC BY 4.0.

A Unified Framework for Contraction Stability Analysis of Heterogeneous Grid-Forming Inverters

Qianxi Tang, Li Peng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08434v1 Announce Type: new Abstract: The shift to renewable-dominated power systems has produced low-inertia grids, undermining system stability. In this context, grid-forming inverters (GFMs) have emerged as a promising solution. However, GFMs challenge conventional analysis techniques, especially those relying on small-signal or root-mean-square (RMS) models. Such models rely on linearization and sinusoidal steady-state assumptions, which fail in large-signal cases. Stability of GFM-based systems therefore becomes operating-point dependent, and a feasible operating point may not even exist. While large-signal analyses are available, decentralized certification of operating-point convergence with explicit transient guarantees, such as rate and overshoot, remains rare. This paper proposes an algebraic, decentralized contraction-based framework. The proposed contraction stability analysis certifies system stability and convergence to desired operating points. The method works in the time domain and captures nonlinear, large-signal behavior of synchronization and power-sharing mechanisms. Moreover, the contraction rate provides an explicit bound on transient time: trajectories converge exponentially to the new operating point at a controlled rate, yielding computable contraction regions that certify stability and large-signal convergence across operating-point changes. These regions directly guide parameter tuning for heterogeneous GFMs.

Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08436v1 Announce Type: new Abstract: The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.

RadioDiff-Inv2: Differentiable Diffusion Inversion under Location Drift from Sparse Noisy Measurements for Radio Map Estimation

Xiucheng Wang, Kailong Wang, Nan Cheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08439v1 Announce Type: new Abstract: Radio map (RM) estimation is a key enabler for environment-aware optimization in 6G wireless networks. In practice, RM construction increasingly relies on crowdsourced received signal strength (RSS) feedback that is inherently sparse and noisy. A further and often overlooked challenge is location drift, whereby privacy constraints and user mobility cause reported sampling coordinates to deviate from the true measurement locations. Unlike additive measurement noise, location drift perturbs the sensing operator itself, since each RSS sample effectively queries the underlying RM at an incorrect spatial coordinate. This operator uncertainty, compounded with sparse noisy sensing, renders the inverse problem severely ill-posed and limits conventional estimators that rely on analytically specified priors. This paper proposes RadioDiff-Inv2, a differentiable diffusion inversion framework that estimates RMs from sparse noisy measurements under location drift. A Gaussian resampling scheme is introduced to construct a differentiable, drift-aware measurement operator on grid-based maps, and the probability-flow ordinary differential equation (ODE) is exploited to cast the diffusion sampler as a deterministic, differentiable mapping from an initial noise code to the estimated RM. By optimizing the noise code via backpropagation against a drift-marginalized data-fidelity objective, RadioDiff-Inv2 produces reconstructions that are both prior-plausible and measurement-consistent without costly posterior sampling. Extensive experiments show that RadioDiff-Inv2 outperforms the best competing baseline by 4 to 14 dB in PSNR across varying sparsity and drift levels. The advantage is most pronounced in low-SNR regimes, where the learned diffusion prior maintains near-constant reconstruction fidelity while conventional methods degrade severely.

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08440v1 Announce Type: new Abstract: Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.

Comparing Controller-Free Pointing Techniques Across Depth for 2D Selection in Augmented Reality

Samiha Sultana, J. Felipe Gonzalez, Robert J. Teather — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08441v1 Announce Type: new Abstract: This paper presents a systematic evaluation of five controller-free pointing techniques for 2D target selection in AR, using ISO 9241-411. We compared them across multiple depths (2 m, 6 m, 10 m) in terms of movement time, accuracy, throughput, and workload (NASA TLX). Head- and eye-based pointing significantly outperformed the hand-based methods (Finger, Wrist, and Arm); Head input was the most accurate and remained the most consistent across depth. Depth significantly impacted performance, with complex interactions with target size and distance. Our results offer a comprehensive empirical basis for selecting appropriate controller-free techniques in depth-varying AR tasks.

Clinical Reasoning in the Age of AI: Longitudinal Cognition and Human-AI Collaboration

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08442v1 Announce Type: new Abstract: As physicians turn to AI-powered systems to help meet the dual demands of speed and care quality, they are met with hallucinations and sycophancy. Understanding how doctors reason through clinical problems in real-world settings is critical for design of effective AI reasoning systems. While recent advances in medical AI have emphasized performance benchmarks and diagnostic accuracy, comparatively little attention has been paid to the structure of clinicians' reasoning processes as they unfold over time, e.g., how they interact with electronic health records and operate under conditions of uncertainty and constraint. This study provides a comprehensive, empirically-grounded account of clinical reasoning and its relationship to current AI-mediated workflows through a mixed-methods design that combines qualitative interviews with structured survey data. Findings indicate that current AI systems are primarily deployed for encounter-level tasks such as documentation and summarization, and only partially align with physicians' underlying reasoning processes. In particular, AI-generated representations often omit temporal or interpretive structures central to clinical decision-making, while core aspects of reasoning, especially those spanning multiple encounters, remain largely implicit and physician-driven. By integrating fine-grained qualitative insights with broader quantitative patterns, this study offers a unified framework for understanding clinical reasoning as a context-sensitive, temporally extended process and identifies key mismatches between clinician cognition and current AI design. These results provide concrete directions for the development of AI systems that more effectively align with and augment real-world clinical reasoning.

When LLMs Invent Rust Crates: An Empirical Study of Hallucination Patterns and Mitigation

Jieming Zheng, Hao Guan, Yepang Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08444v1 Announce Type: new Abstract: Large Language Models (LLMs) have become powerful tools for code generation, yet they remain prone to hallucinations-producing plausible but incorrect or fabricated outputs. Among these, package hallucination, where an LLM suggests non-existent dependencies, poses an emerging security risk to the software supply chain. While previous studies focus on popular languages like Python or JavaScript, in this work we present the first large-scale empirical study on crate hallucination in LLM-generated Rust code. We construct a multi-source dataset combining coding tasks from Stack Overflow, GitHub, and LLM-generated tasks, and evaluate both commercial and open-source models under various decoding settings. Our analysis reveals that, unlike prior findings in Python and JavaScript, hallucination behavior in Rust follows a distinct pattern: different models exhibit surprisingly consistent hallucination rates, and these rates show minimal sensitivity to model parameters. Furthermore, we investigate prompt engineering strategies to mitigate hallucinations without sacrificing code quality. This study provides new insights into the reliability and security implications of LLM-assisted Rust development, offering guidance for future research and safer model deployment in software engineering workflows.

Segment-level Tree Search for Long Meeting Document Summarization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08445v1 Announce Type: new Abstract: Meeting documents are challenging to summarize due to their length and complex conversational structure. Existing approaches typically adopt multi-stage pipelines that extract information prior to summarization; however, these approaches often suffer from cumulative error propagation without intermediate validation, a limitation further amplified by short and low-quality reference summaries. We propose segment-level summarization via Monte Carlo Tree Search (S3), a training-free framework that constructs a final summary by composing segment-level summary candidates. S3 partitions a long document into segments and generates multiple summary candidates per segment, forming nodes of a search tree. The best-scoring combination is selected via self-reward-guided tree search and refined into the final output. Despite using a 7B model, S3 achieves performance comparable to larger 72B models while producing length-appropriate summaries.

Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08446v1 Announce Type: new Abstract: Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long COT, making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with dense even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.

Not Just After One: Sleep-Inspired Replay Prevents Catastrophic Forgetting After Sequential Tasks

Anthony Bazhenov, Jean Erik Delanois, Giri P. Krishnan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08447v1 Announce Type: new Abstract: One of the critical limitations of artificial neural networks is their lack of ability to continually learn: training on new tasks often leads to interference and forgetting of the previous ones. While several algorithms have been proposed to protect old memories from interference, they are typically applied during or immediately after each new episode of training. In contrast, humans and animals can learn continuously, acquiring multiple new memories during active learning before consolidating all of them into long-term storage. Here we show that multiple new tasks can be trained sequentially before an unsupervised sleep-like replay phase is applied to partially restore performance across all previously learned tasks. Our study further suggests that task-specific information remains resilient to new training but decays gradually as network is trained on new tasks. These findings point to novel principles for developing a broad range of continual learning AI solutions.

Multiscale Fourier Neural Operator for Inverse Wave Scattering in Highly Oscillatory Media

Zilin You, Zhenli Xu, Wei Cai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08448v1 Announce Type: new Abstract: In this paper, we propose an operator learning method based on the multiscale Fourier neural operator (MscaleFNO) for inverse medium problems of Helmholtz equations. The MscaleFNO provides a neural surrogate model with reduced spectral bias for the Helmholtz equations, mapping highly oscillatory medium profiles to scattered wavefields. A plug-and-play inversion using elucidated diffusion model is introduced to regularize the inverse solver based on least squares of data misfits. Numerical results for partial aperture inversion of oscillatory two-dimensional media demonstrate the advantage and effectiveness of MscaleFNO for accurate reconstruction of highly oscillatory medium properties.

GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08450v1 Announce Type: new Abstract: Financial portfolio trading is naturally formulated as a reinforcement learning problem, where an agent sequentially rebalances assets under changing market conditions to balance return, risk, and transaction costs. Yet in non-stationary markets, raw OHLCV states and short-horizon return rewards often provide an under-specified learning interface, motivating large language models as a way to inject financial knowledge into state and reward design while constraining open-ended generation. To this end, we propose GIFT, an LLM-guided framework for state-reward interface design in PPO-based financial reinforcement learning. Rather than using the LLM to make trading decisions, GIFT uses Factor-guided State Enhancement to generate state features from financial-factor primitives, Risk-rule-guided Reward Shaping to generate auxiliary rewards from portfolio-risk rules, and Diagnostic-guided Refinement to revise candidate interfaces using PPO rollout diagnostics. After refinement, GIFT fixes the selected state-reward interface before evaluation, with no further LLM queries or interface updates at test time. Comprehensive rolling-window experiments across diverse market regimes and portfolio scenarios demonstrate that GIFT improves learning-signal quality and out-of-sample risk-adjusted portfolio performance over baselines. Code and data are available at: https://github.com/KAG778/GIFT .

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

Arya Shah, Himanshu Beniwal, Mayank Singh, Chaklam Silpasuwanchai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08451v1 Announce Type: new Abstract: Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users' opinions regardless of factual accuracy. Although well-studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking \textbf{six instruction-tuned models} across \textbf{1.1 million instances} spanning \textbf{38 languages} and \textbf{33 topic categories}. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic, as models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed. We further identify tokenizer fertility as a structural driver of this alignment collapse. Collectively, our results demonstrate that prevailing alignment methodologies generalize poorly beyond high-resource languages, underscoring the urgent need for equitable multilingual safety techniques.

Theoretical Foundations of Continual Learning via Drift-Plus-Penalty

Nazreen Shah, Govinda Arya, Bharath B. N., Ranjitha Prasad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08452v1 Announce Type: new Abstract: In many real-world settings, data streams are nonstationary and arrive sequentially, requiring learning systems to adapt continuously without retraining from scratch. Continual learning (CL) addresses this challenge by incorporating new tasks while mitigating catastrophic forgetting, where learning new information degrades performance on previously acquired knowledge. We introduce a control-theoretic perspective on CL that explicitly regulates the evolution of forgetting, framing adaptation as a controlled process subject to long-term stability constraints. We focus on replay-based CL, where a finite memory buffer stores representative samples from prior tasks. We propose COntinual Learning with Drift-Plus-Penalty (COLD), a continual learning framework based on the Drift-Plus-Penalty (DPP) principle from stochastic optimization. To facilitate analysis, we also consider an oracle variant, COLD-ORACLE, as a reference benchmark. At each task, both methods minimize the current task loss while maintaining a virtual queue that tracks deviations from long-term stability on previously learned tasks, capturing the stability-plasticity trade-off as a regulated dynamical process. We establish stability and convergence guarantees that characterize this trade-off through a tunable control parameter. Experiments on standard benchmarks demonstrate that COLD consistently outperforms a broad range of state-of-the-art CL methods while providing competitive and controllable forgetting behavior through explicit regulation of stability and plasticity.

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

Tuc Nguyen, Thai Le — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08454v1 Announce Type: new Abstract: Activation steering provides a lightweight inference-time mechanism for controlling large language models (LLMs) by modifying their internal activation vectors toward desired behaviors. Most existing methods compute a fixed steering direction in the original activation space, typically from pairs of contrastive examples using mean differences, linear probes, or arbitrary separability criteria. While effective to a certain extent, these methods treat behavioral control as a global, linear, additive offset: the same direction is applied across inputs, and behaviors are linearly separable. This can be restrictive when behavioral features vary nonlinearly across the activation space or lie on curved and anisotropic manifolds, where the optimal intervention may be input-dependent. To address this limitation, we propose INNSteer, a nonlinear activation steering framework based on invertible latent transformations. Rather than searching for a better steering vector in the original representation space, INNSteer learns a lightweight invertible neural network $\phi$ that maps an LLM's activations into a latent space where behavioral classes are more amenable to linear control. At inference time, activations are mapped through $\phi$, steered in the latent space, and mapped back through the exact inverse transformation $\phi^{-1}$. This makes a simple latent-space translation become a nonlinear, input-dependent intervention in the original activation space. Across experiment settings on multiple LLM families, scales, behavioral traits, and safety benchmarks, INNSteer consistently improves model control over linear, transport-based, and nonlinear steering baselines while largely preserving generation fluency.

The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment

Xiaoyang Wang, Christopher C. Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08457v1 Announce Type: new Abstract: Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode where debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains; agents appear to agree more but reason less consistently. To improve this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.

Personalized and Robust Proactive Robot Assistance with Uncertainty-Guided LLM Reasoning

Alvaro Gonzalez, M. H. Hasan Shovo, Ali Ayub — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08458v1 Announce Type: new Abstract: Proactive robot assistance in household environments requires accurate prediction of human activities and object usage under dynamic and noisy conditions. Existing approaches often rely on complex spatio-temporal models, which can be computationally expensive and sensitive to environmental variability. In this paper, we propose GLOBE, a lightweight framework that combines n-gram Markov models for capturing temporal behavioral patterns with uncertainty-guided large language model (LLM) reasoning. The framework performs sequential prediction efficiently while selectively invoking LLM reasoning only when the model confidence is low. To evaluate performance under realistic conditions, we introduce HOMER-Noise, a noisy extension of the HOMER+ dataset that simulates structured disturbances such as object movements caused by humans, pets, and toddlers. Experimental results show that GLOBE achieves competitive performance with state-of-the-art methods while improving robustness and computational efficiency across both clean and noisy settings. The framework is further validated through a proof-of-concept integration with a Stretch 3 mobile manipulator, demonstrating its potential application in real-world human-robot interaction scenarios.

Simplest Nontrivial Maxwellian Random Field Models for Stochastic LoS MIMO Using the Dyadic Green's Function

Lumeng Xu, Said Mikki — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08463v1 Announce Type: new Abstract: This letter introduces a novel, full-wave, physics-compliant stochastic dyadic Green's function (SDGF) framework for modeling electromagnetic (EM) multiple-input-multiple-output (MIMO) channels under wavenumber uncertainty. Unlike conventional phenomenological fading models, the proposed approach provides what appear to be the simplest exact random field models of electromagnetic line-of-sight (LoS) propagation that are also exact solutions of Maxwell's equations. Hence, we dub them Maxwellian random field theoretic models. These physically consistent stochastic models, including an analytically tractable wavenumber Gaussian model and a more general stochastic plane wave (SPW) model, serve as fundamental baseline models for stochastic LoS channel characterization. By preserving the vectorial structure of Maxwell's equations and the dispersion relation, the framework naturally incorporates both propagating and evanescent modes. Our analysis of ergodic capacity and degrees of freedom (DoF) reveals that the key results of the complex SPW model can be reproduced by the simpler Gaussian model with limited variance. Furthermore, we provide examples using 2D continuous MIMO systems, illustrating how the model's Maxwell-consistent stochasticity explains observed increases in channel capacity and DoF over the deterministic MIMO capacity baseline. These idealized Maxwellian random field theoretic models offer a physically grounded reference point for understanding fundamental limits in stochastic LoS propagation environments.

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08464v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description, which forms a `vision-blind reasoning' paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through learnable control tokens , and . These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable performance boost compared to the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.

An Empirical Comparison of General Context-Free Parsers

Huan Vo, Danushka Liyanage, Hong Jin Kang, Sasha Rubin, Rahul Gopinath — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08465v1 Announce Type: new Abstract: Parsing underpins a vast range of software engineering tasks, from compilers and static analyzers to language servers and fuzz testing tools. Yet most parsers deployed in practice are deterministic (LL or LR), forcing developers not only to contort their grammars to fit the parser, but to simplify the very languages they design sacrificing expressiveness for the sake of parseability. General context-free parsers eliminate this constraint. Yet, despite decades of algorithmic development, no rigorous head-to-head comparison exists across the major families of parsing algorithms. We present the first unified, controlled benchmark of six generalized parsing algorithms: CYK, Valiant, Earley, GLL, RNGLR, and BRNGLR, plus deterministic LL(1) and LR(1) baselines, all implemented in Rust with shared data structures and parse-tree extraction, and evaluated across 22 grammars ranging from simple expressions to full C++ and Java. Our results show that the cost of generality is lower than widely assumed. On deterministic grammars, the GLR family incurs only a 3x median slowdown over LR(1), with a narrow and predictable variance. GLR is the clear performance winner among generalized parsers and a practical default choice for software engineering tools.

ToolRec: Calibrated Preference Alignment for Query Recommendation in On-Device Assistants

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08466v1 Announce Type: new Abstract: Large Language Models (LLMs) have significantly advanced generative query recommendation. However, existing alignment methods primarily focus on standard chatbot scenarios, falling short in on-device intelligent assistants where users predominantly expect the rapid invocation of system-level tools. Moreover, directly aligning LLMs with real-world click logs introduces severe noise due to varying user activity levels and the failure to emphasize execution-oriented queries. To address these challenges, we propose ToolRec, a calibrated preference alignment framework tailored for on-device query recommendation. To ground query recommendation with executable actions, we first construct SysToolKit, a comprehensive repository of 708 system tools, paired with a context-aware tool retrieval mechanism to ensure recommendation relevance. We then propose a dual-level calibration mechanism to refine raw click data, effectively mitigating user behavioral noise by calibrating signals based on user activity levels, while simultaneously up-weighting click signals on system-level tool-invoking queries. Guided by these refined preference signals, we then align the model using a sample-level weighted Kahneman-Tversky Optimization (KTO). Extensive online A/B tests on our mobile assistant platform OPPO Xiaobu, which has over 150 million monthly active users, demonstrate that ToolRec can significantly improve Click-Through Rate (CTR) and total clicks volume over strong baselines while maintaining high query relevance.

The Confidence Trap: Calibration Attacks for Graph Neural Networks

Cuong Dang, Jiahao Zhang, Hieu Ta Quang, Dung Le, Lu Cheng, Suhang Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08467v1 Announce Type: new Abstract: While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNNs to adversarial structural perturbations remains largely unexplored. However, studying calibration attacks on graphs presents unique technical challenges: (1) the discrete nature of graph structures complicates gradient-based optimization, (2) existing underconfidence objectives fail to drive predictions toward uniform distributions, and (3) GNNs are highly sensitive to edge perturbations, often causing unintended label changes that violate attack constraints. To address these challenges, we propose a \textbf{Unified Graph Calibration Attack (UGCA)} framework designed for \textbf{worst-case (white-box) analysis} of GNN calibration robustness. UGCA introduces a KL-divergence loss to encourage uniform predictive distributions, a reranking mechanism to reduce label flipping, a hybrid loss to recover labels when violations occur, and beam search to explore a broader adversarial search space. We further provide theoretical insights linking model generalization, dataset complexity, and calibration vulnerability, showing that models with higher accuracy or trained on datasets with more classes are more susceptible under this threat model. Extensive experiments demonstrate that UGCA substantially increases Expected Calibration Error while preserving classification accuracy. Our code is publicly available at https://github.com/CaptainCuong/Graph-Calibration-Attack.git.

OctaOctree Neural Radiosity for Real-time Glossy Material Rendering

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08469v1 Announce Type: new Abstract: Modeling high-frequency outgoing radiance distributions remains a fundamental challenge in global illumination, especially for glossy and specular materials. Existing neural-based radiance caching methods commonly rely on positional feature encodings or spatially organized caches, which makes it difficult to represent sharp directional radiance variations without increasing the model complexity or sampling cost. To address this challenge, we propose OctaOctree, an efficient spatial-angular radiance representation for global illumination. OctaOctree organizes outgoing radiance with an adaptive octree in 3D space, and associates each spatial node with an octahedral directional map. By coupling the spatial hierarchy with direction-dependent storage, our representation allocates fine spatial resolution to local illumination and visibility changes, while using coarser spatial levels with richer angular resolution to capture glossy and specular radiance distributions. This design embeds a reflectance-aware spatial-angular prior directly into the radiance representation, reducing the burden on neural networks or reconstruction modules to recover high-frequency view-dependent effects from positional features alone. As a result, OctaOctree provides a compact and expressive neural encoding for a wide range of indirect illumination effects, from diffuse interreflection to sharp glossy reflections. Experiments demonstrate that our method produces high-quality, direction-aware global illumination with single network query at primary intersections, achieving improved fidelity and real-time performance compared with baseline neural radiosity and radiance caching approaches.

LUNA-AD: Lightweight Uncertainty-Aware Language Model with Lifelong Learning for Autonomous Driving

Ruoyu Yao, Pei Liu, Ruiguo Zhong, Mingxing Peng, Rui Yang, Jun Ma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08470v1 Announce Type: new Abstract: While large language models (LLMs) offer promising reasoning capabilities, their integration into safety-critical driving systems is hindered by limited reasoning diversity, high computational overhead, and static learning paradigms. To address these challenges, we propose LUNA-AD, a lightweight uncertainty-aware language model with lifelong learning for autonomous driving (AD). LUNA-AD features a tri-system architecture that reconciles complex multimodal behavioral reasoning, efficient deployment, and continual refinement. We design a multi-agent analytical system to generate uncertainty-aware decision-making demonstrations through diverse hypothesis exploration. A dual-head lightweight heuristic model is distilled to unify the inference of decision distributions and textual explanations while enabling efficient deployment. Furthermore, a reflection-driven lifelong learning mechanism operates on multimodal decision outputs and preserves strategic diversity, allowing for the refinement of candidate decisions and rationales via closed-loop feedback to enhance driving robustness. Extensive experiments on nuPlan benchmarks demonstrate that LUNA-AD achieves state-of-the-art success rates under both non-reactive and reactive modes, with drastically reduced inference latency compared to existing knowledge-driven AD frameworks.

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

Marina Igitkhanian, Erik Arakelyan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08471v1 Announce Type: new Abstract: Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model's incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.

Digital White Spaces: A Cyberpsychology-Informed Framework to Mobile Phone Addiction

Leandros Maglaras, Helge Janicke, Konstantinos Karantzalos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08472v1 Announce Type: new Abstract: Mobile-phone overuse and attention fragmentation have become pressing societal and public-health concerns. Cyberpsychology research highlights addictive engagement loops driven by intermittent rewards, persuasive design, and habit formation. In this editorial I synthesize current evidence on mobile-phone addiction and propose "Digital White Spaces" (DWS), a socio-technical framework that combines privacy-preserving monitoring, AI-driven detection of addictive loops, device-mode interventions, and physical signal-limited zones to restore cognitive autonomy.

Physically Consistent Null Space Alignment for Detection of Low-Magnitude False Data Injection Attacks

Xin Li, Chenhan Xiao, Jonathan Cohen, Aviad Elyashar, Yang Weng, Rami Puzis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08473v1 Announce Type: new Abstract: False data injection attacks (FDIAs) introducing small measurement perturbations can still cause large deviations in power system state estimation when the injected signals align with the pseudo-null space of the system model. Existing model- and data-driven detectors may fail to identify such low-magnitude but high-impact attacks because residual tests ignore changes hidden in the pseudo-null space, while subspace learning methods capture correlation patterns without enforcing physical consistency. This paper proposes Physically Consistent Null Space Alignment (PCNSA), a framework that detects stealthy FDIAs by preserving, through preprocessing, the geometric correspondence between the physical null space and the measurement-derived pseudo-null space. The key point is a Pseudo-null Space Conserved data Preprocessing (PSCP) step that re-expresses measurements in the physical coordinate frame before subspace extraction. We prove that PSCP preserves the separation between row space and its orthogonal complement, a property that conventional per-feature standardization violates. This keeps the singular value decomposition (SVD)-derived pseudo-null subspace aligned with the physical residual space without explicit knowledge of H. Experiments on IEEE 14-, 30-, 57-, and 118-bus systems confirm this principle in practice: stealthy attacks that evade XTM, LSTM, AE and Isolation Forest baselines appear as clear deviations in the aligned subspace, yielding higher F1-score and detection accuracy while remaining robust under partial observability and realistic PMU noise.

FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08476v1 Announce Type: new Abstract: Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training. FlashCP introduces a sharding-aware communication mechanism to eliminate redundant KV communication and proposes a novel Whole-Doc sharding strategy that maximizes communication savings while maintaining balanced workloads. To efficiently combine Whole-Doc and Per-Doc sharding, FlashCP further designs a heuristic algorithm to search for near-optimal sharding plans. Extensive experiments show that FlashCP achieves up to 1.63x speedup over state-of-the-art CP frameworks across diverse datasets.

A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08477v1 Announce Type: new Abstract: Knowledge extraction from symbolic data often produces abstractions that are formally defined but not immediately interpretable by users. Formal Concept Analysis (FCA) and Relational Concept Analysis (RCA) provide representative settings for this issue: they generate explicit conceptual structures, implications, and relational dependencies from object descriptions and relations. Although these structures are explainable by design, their concepts are often identified by technical labels, which limits their use as human-interpretable knowledge units. Assigning meaningful names to such concepts is therefore a key issue for interpretation, navigation, validation, and reuse by domain experts. This paper investigates concept naming in FCA and RCA from a symbolic knowledge representation perspective. We first characterize the linguistic and terminological challenges involved in naming generated symbolic abstractions, including ambiguity, discrimination, concision, and consistency across related concepts. We then propose a configurable framework for LLM-assisted concept naming. The framework relies on a variability model that controls which sources of information are exposed during naming, such as intent, extent, inherited information, neighboring concepts, implications, and relational attributes. It thereby makes explicit the semantic choices involved in moving from formal concept descriptions to human-readable names. The approach is illustrated as a proof of concept on a small relational dataset in the pizzeria domain. This illustration shows how different configurations influence the names suggested by an LLM, and how naming variability can reveal interpretation choices, relational dependencies, and possible modeling issues in the underlying symbolic data.

Inferring hidden forcing in a biological oscillator using Kolmogorov-Arnold networks

Julian Szereszewski, Facundo Fainstein, Leandro E. Fernandez, Gabriel B. Mindlin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08479v1 Announce Type: new Abstract: Inferring the forces that drive a dynamical system from partial observations is a fundamental challenge across physics, particularly when distinct underlying mechanisms produce similar observable dynamics. Here we show that the effective muscular forcing underlying avian respiratory dynamics can be reconstructed from measurements of air-sac pressure alone. Using an interpretable learning framework based on Kolmogorov-Arnold networks, we infer the governing equations of the system directly from data and uncover a nontrivial structure in the underlying forcing that is not apparent from the pressure signal, which instead suggests a relaxation-like oscillation. The reconstructed dynamics predict a two-phase activation pattern within each respiratory cycle, which we independently validate through electromyographic recordings of expiratory muscles. These results demonstrate that data-driven reconstruction of dynamical laws can reveal hidden physical structure and provide access to unobserved driving variables, establishing a general route to infer latent forces in partially observed dynamical systems.

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08480v1 Announce Type: new Abstract: Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address such an issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL--GRPO mixtures across the retrieval--validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Suraj Ranganath, Anish Raghavendra — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08481v1 Announce Type: new Abstract: Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.

Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08483v1 Announce Type: new Abstract: Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize responses rather than retrieve them. Whether their responses vary across users is a clinical, equity, and governance question, sharpened by evidence that sycophantic responses can alter judgment and increase trust. Objective: To evaluate response variation and sycophancy in consumer-facing health LLMs under conditions resembling ordinary patient use. Methods: We constructed simulated user profiles differing in geography, browsing context, expressed beliefs, and social determinants of health, drawing on literature linking social context to health attitudes. We adapted validated instruments, including the Vaccination Attitudes Examination scale and reproductive attitudes scales, into multi-turn prompts designed to elicit clinically meaningful variation across users. Results: The evaluation encountered five linked barriers. Factual prompts produced stable responses that masked sycophancy emerging over multi-turn conversation. Browser-based interfaces did not disclose which signals influence outputs and could not be reset to a clean baseline. Large-scale testing was restricted by terms of service, rate limits, and bot detection. Accuracy-based criteria could not capture tone, framing, or omission, and LLM-as-judge methods risked shared alignment bias. Models changed without traceable version identifiers, preventing reliable replication. Conclusions: No reliable independent evaluation framework yet exists for examining how consumer-facing health LLMs behave in ordinary use. Oversight requires disclosure of personalization signals, stable version identifiers, researcher safe harbor programs, and post-deployment monitoring of health-related outputs.

STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08484v1 Announce Type: new Abstract: Joint Species Distribution Modeling (JSDM) is a key enabler for biodiversity monitoring and conservation planning. However, accurate JSDM faces two coupled challenges: environmental drivers and species distributions are inherently spatio-temporal, while species co-occurrence patterns exhibit complex non-linear community structure and severe long-tail imbalance driven by rare species. Existing approaches often address these factors in isolation, learning from static covariates or neglecting the historical trajectories of dynamic community structure. To overcome these limitations, we propose STELLAR (Spatio-Temporal Environmental Learning with Latent Alignment and Refinement), a novel framework that learns a shared latent space where dynamic habitat context and community structure are optimized jointly. Our approach integrates three complementary components: (1) a Graph-Temporal Encoder that employs graph attention and recurrent units to aggregate spatial neighborhood effects and capture the co-evolving historical dynamics of environmental context and community structure; (2) a Context-Anchored Latent Alignment mechanism that structures the latent space using a label-activated mixture prior and supervised contrastive learning, actively clustering species based on shared environmental preferences; and (3) an Imbalance-Aware Decoupled Decoding module that utilizes Asymmetric Loss to focus learning on hard, rare species samples, preventing mode collapse in the long tail. Experiments on the large-scale eBird dataset, curated with domain experts, demonstrate that our framework significantly outperforms state-of-the-art baselines, particularly in predicting rare species and revealing interpretable species interactions.

TRADE: Transducer-Augmented Decoder for Speech LLM

Yun Tang, Shanil Puri, Shinji Watanabe, Subhabrata Mukherjee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08486v1 Announce Type: new Abstract: Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE TRansducer-Augmented DEcoder, which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network -- coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1)Tightly coupled dual vocabularies -- a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2)Chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3)Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while the streaming recognition with 960ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 F_1 over acoustic VAD alone.

What Makes a Desired Graph for Relational Deep Learning?

Yao Cheng, Siqiang Luo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08491v1 Announce Type: new Abstract: Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database schemas are often not well suited for how graph neural networks (GNNs) perform relational reasoning. We study what makes a relational graph suitable for deep learning and show that schema-derived graphs suffer from two systematic failures: information overload and semantic fragmentation. Our empirical analysis reveals that the desired graph is not the raw schema, but a result of controlled structural adaptation. Performance depends on balancing two operations: mitigating information overload via filtering, and repairing semantic fragmentation via injection. Specifically, filtering serves as a bias-variance knob with non-monotonic effects, while injection improves performance only when it explicitly restores the relational dependencies missing from the original schema. Based on these findings, we develop an end-to-end structural optimizer that applies both operations to adapt relational graphs automatically. Across 26 tasks spanning classification, regression, and recommendation, the optimized graphs consistently improve accuracy while often reducing inference cost.

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08492v1 Announce Type: new Abstract: Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, causing an intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal MLLM to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08495v1 Announce Type: new Abstract: Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang, Fei Sun, Mengnan Du — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08496v1 Announce Type: new Abstract: Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features still remains a central challenge. Current explanation methods, however, typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework utilizes activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

Minyoung Hwang, Seokhyun Lee, Changhee Lee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08497v1 Announce Type: new Abstract: As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is available at here.

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Zhengyi Zhuo, Yan Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08500v1 Announce Type: new Abstract: Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08501v1 Announce Type: new Abstract: Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.

Standpoint Logics with Defeasible Beliefs

Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08503v1 Announce Type: new Abstract: In this paper, we integrate the defeasible logic of Kraus, Lehmann and Magidor (KLM) with the standpoint logic framework of G\'omez \'Alvarez and Rudolph. This is done with the goal of formally expressing knowledge taking into account multiple (possibly contradicting) viewpoints, which in turn may hold defeasible beliefs. In doing so, we utilise Defeasible Restricted Standpoint Logics (DRSL), introduced by Leisegang et al. Our work expands on previous work by providing a foundational representation result for DRSL semantics and systematically lifting several well-known entailment relations from the propositional case to the standpoint-enhanced setting. In particular, we characterise the semantics for DRSL through a set of KLM-style postulates adapted for the standpoints case. We furthermore provide a means to lift preferential entailment, and the class of entailment relations based on single ranking functions from the purely propositional to the standpoint-enhanced context, including rational and lexicographic closure. We show this can be done equivalently through semantic and algorithmic means. Furthermore, we show that, for each considered form of entailment, the complexity class of entailment checking does not change when moving from propositional KLM to DRSL.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08508v1 Announce Type: new Abstract: Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08511v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that varying downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip effectively bypasses redundant vision attention to achieve block-wise sparsity, maintaining a 94.16% to 100.31% performance retention across diverse MLLMs. Ultimately, we prove that to reason more effectively, models do not need to discard what they see -- they simply need to "look less" at the right depth.

Friend or Foe? Language as an ideological switch in open-weight LLMs under Russian disinformation stress

Anna Ma{\l}gorzata Kami\'nska, Tetiana Klynina — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08512v1 Announce Type: new Abstract: As Russia's war against Ukraine extends into generative AI, large language models (LLMs) adapted for local post-Soviet languages are deployed in contested information environments. Policy and industry discourse assumes that culturally aligned adaptation encodes the political orientation of the target community: a Ukrainian-oriented model will resist Russian narratives, a Russian-oriented one will reinforce them. Does it? This article systematically disconfirms that assumption. We run a controlled audit of four openly available LLMs sharing a common base model but fine-tuned for different linguistic communities, querying them in Ukrainian, Russian and English across ten contested wartime narratives: Crimea, "denazification", the "one people" thesis, and atrocity denial at Bucha and Mariupol. The result is a Fine-Tuning Paradox: the Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian, while the Russian-oriented one exhibits the strongest rejection. Corpus composition, language coverage and prompt format prove more decisive than nominal cultural provenance. We situate these findings within debates on hybrid warfare, digital sovereignty and post-imperial information orders, arguing that the principal threat to regional information sovereignty is not adversarial fine-tuning but the untested assumption that cultural alignment guarantees resilience.

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

Elisei Shafer, Oren Gal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08513v1 Announce Type: new Abstract: Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.

OmniTryOn: Video Try-On Anything at Once!

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08514v1 Announce Type: new Abstract: Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.

Unifying von-Neumann HPC and Neuromorphic Acceleration via the EBRAINS Research Infrastructure: A Framework for High-Performance Workflows

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08515v1 Announce Type: new Abstract: Modern scientific workflows increasingly span diverse computing architectures, yet executing a single computational model across disparate systems often forces researchers to maintain fragmented, site-specific pipelines. In this paper, we address this challenge within the domain of computational neuroscience by presenting a unified, cloud-based workflow orchestrated via EBRAINS JupyterLab. This workflow enables users to transparently execute spiking neural networks on both von-Neumann supercomputers and neuromorphic hardware. Using a single federated identity, the system dispatches jobs to HPC sites (JUSUF, Galileo100) via PyUNICORE and to the SpiNNaker-1 neuromorphic system via the Neuromorphic Computing Platform Interface. To guarantee cross-site reproducibility and mitigate software version drift, we utilize a zero-installation execution mode that dynamically pulls PMIx-aware Apptainer containers to HPC compute nodes. Furthermore, we demonstrate genuine model-level portability using the NESTML domain-specific language, allowing custom neuron models to be written once and automatically compiled for either the NEST (C++) or sPyNNaker backends. Validated with a balanced random network case study, this work illustrates a practical, end-to-end path for hardware-agnostic workflows while highlighting the critical role of containerization and domain-specific languages in achieving true cross-platform reproducibility.

Stable Triangle Projections for Variable-Degree Tetrahedral Spaces and Uniform IPDG Preconditioning

Situan Li, Weiying Zheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08516v1 Announce Type: new Abstract: The main ingredient of this paper is an edge-local variable-degree projection on a triangle that is uniformly stable in both L2 and H1. We use this two-dimensional operator in two tetrahedral constructions. First, on a reference tetrahedron, we build an H1-stable projection from a high order polynomial space onto a variable-degree space whose degrees are prescribed independently on edges, faces, and in the volume. Since the tetrahedral projection is local and trace-compatible, it also gives an h- and p-uniform stable decomposition, in the weighted energy norm, for conforming hp spaces, and hence a uniform additive Schwarz preconditioner for the conforming Laplace operator. Second, on a uniformly regular mapped tetrahedral mesh with elementwise variable polynomial degrees, the same triangular projection gives the finite-layer edge truncation needed in a p-uniform stable DG-to-CG decomposition for the symmetric IPDG norm. The DG-to-CG decomposition, combined with the conforming splitting, gives the IPDG preconditioner. The constants depend only on reference shapes, the local degree-spread bound within each tetrahedron, the neighbor-degree bound across mesh faces, uniform map-regularity, patch cardinalities, and the coefficient path constants; they are independent of h, of the local polynomial degrees, and of the coefficient contrast.

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

Xiaoli Yu, Jiamiao Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08517v1 Announce Type: new Abstract: Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single finite-sample certificate that simultaneously upper-bounds the selected risk, lower-bounds the acceptance probability $\pacc$ above a floor $\pmin$, and lower-bounds the deployment utility. This certificate must be valid under adaptive threshold selection from a finite grid of $m$ pairs on $\ncert$ samples. We give such a certificate for bounded, possibly non-monotone losses by treating the selected risk directly as a ratio rather than through a Hoeffding-style range bound. The construction couples three confidence bounds: a variance-adaptive empirical-Bernstein bound on the ratio risk, a Clopper--Pearson bound on acceptance, and a two-sided closeness bound on utility. Together they lower-bound the certified policy's utility absolutely and to within $2\gammau$ of the best over the \emph{certified set}, both non-vacuous whenever feasible; a regime-scoped third leg matches an external oracle, informative only where the risk margin $\gammar < \alpha$ and vacuous at the headline operating points. Relative to the range-only Hoeffding-ratio construction this sharpens the acceptance-floor dependence from $1/\pmin$ to $1/\sqrt{\pmin}$, and a closed-form corollary identifies a per-pair regime in which our risk bound dominates a Hoeffding conformal risk control (Hoeffding--CRC) selective bound. Empirically, on ImageNet (three ResNets) and COCO val 2017 panoptic, the certificate opens a $+22$ pp certified-acceptance frontier over Hoeffding--CRC and is ${\approx}10{\times}$ tighter than a non-vacuous matched-valid baseline; these gains are regime-scoped, not universal, and absent on ADE20K. The certifier runs in $O(\ncert m)$ time.

Two Bridges, One Pathway: From VLMs to Generalizable VLAs with Embodied Trajectory-Coupled Data

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08520v1 Announce Type: new Abstract: Vision-language models (VLMs) are powerful general-purpose reasoners, yet converting them into robot control policies (VLAs) is surprisingly difficult. The root cause is a two-fold gap: VLMs are trained on internet-scale images with language-understanding objectives, while VLAs must perceive robot scenes and predict motor actions. Fine-tuning a VLM directly on robot action data forces the model to cross both gaps at once -- the learning curve is steep and the rich generalizations learned during pretraining tend to degrade rather than transfer. We argue that this gap can be bridged gradually with the right intermediate data. We introduce \emph{embodied trajectory-coupled (ETC) data} -- vision-language supervision derived from the same robot scenes and trajectories used for action learning. Because ETC data shares the visual context of robot operation while retaining familiar language-understanding objectives, it provides a natural stepping stone between VLM pretraining and VLA fine-tuning. Building on this, we design a three-stage training recipe. Distribution Bridging first adapts the VLM to embodied visual-language semantics. Objective Bridging then gradually shifts the model toward action prediction while preserving the acquired representations. Retentive Adaptation finally specializes the policy to the target deployment domain. We further show that mixing task-relevant out-of-distribution ETC data with a small amount of action data enables the model to generalize to novel visual-language conditions without requiring additional robot demonstrations. Simulation and real-robot experiments confirm that this gradual bridging strategy is the key to transferring VLM generalization into robust, deployable robot policies.

Exploring CKKS Parameter Trade-offs for Privacy-Preserving Personalized Federated Learning

Kamolchanok Saengtong, Phanwadee Sinthong, Norrathep Rattanavipanon — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08521v1 Announce Type: new Abstract: Privacy-preserving Personalized Federated Learning (PFL) enables clients to collaboratively train personalized models without exposing raw data, but exchanged model updates remain vulnerable to inference attacks from honest-but-curious servers. Homomorphic Encryption (HE) addresses this by allowing server-side aggregation directly on encrypted updates, with the CKKS scheme being particularly suitable due to its native support for approximate floating-point arithmetic. However, no prior work has examined how to configure CKKS for PFL deployments, leaving practitioners without principled guidance on parameter selection that directly affects privacy, precision, and computational cost. This paper presents pFedCKKS, a generic framework integrating CKKS into PFL, and provides the first systematic parameter selection guide for practitioners. We derive the full CKKS parameter constraints under 128-bit security for the PFL setting, showing the selection problem reduces to choosing just two values: the inner and outer ciphertext prime. Implemented using the Flower framework and TenSEAL library, pFedCKKS is evaluated on the FEMNIST, CelebA and Sentiment140 datasets with FedFinetune, Ditto and FedPer which represents PFL algorithms. Experimental results reveal an empirical trade-off between precision and computational/communication costs. This allows us to draw a concrete guideline for selecting proper CKKS parameters that balance efficiency and accuracy in real-world deployments of pFedCKKS.

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08525v1 Announce Type: new Abstract: Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization for data-scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally-grounded visual guidance, and augmented with counterfactual driving behaviors., (2) alongside a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.

A Mixed Extended Virtual Element Method for Elliptic Interface Problems on Polygonal Meshes

Xianyan Zheng, Jinru Chen, Feng Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08526v1 Announce Type: new Abstract: We propose a lowest-order $H(\operatorname{div})$-conforming mixed extended virtual element method for elliptic interface problems on interface-unfitted polygonal meshes. The flux and pressure are approximated by subdomain-wise extended $H(\operatorname{div})$-VEM spaces and by piecewise constants, respectively. On cut elements, the computable polynomial projection is defined on the whole background element and then restricted to the two subdomains. Compared with BDM-type polynomial spaces, the mixed VEM space contains a non-polynomial component, which gives rise to additional consistency terms on cut elements. To control these terms, we use an enhanced kernel stabilization on cut elements and an interface normal-flux average in the mixed coupling. A corrected interface-flux penalty and a local divergence ghost penalty are added to obtain cut-position-independent stability without using a volume div-div augmentation. We prove continuity, a discrete inf-sup condition, and an optimal first-order error estimate in a mesh-dependent norm. The constants are independent of the mesh size and of the position of the interface relative to the background mesh, but may depend on the coefficient contrast.

Scaffold Effects on GAIA: A Controlled Comparison

Jason Starace — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08529v1 Announce Type: new Abstract: Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre-registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre-registered prediction that more capable models would be less scaffold-sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier-scaling holds only at Level 1 under the robust slice. The multi-agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross-provider models, making model family rather than capability tier the conditioning variable, and the predicted planner-executor advantage on file-reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid-trajectory errors at the harder level, and a single cell (Gemini with planner-then-executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single-scaffold capability numbers are scaffold-conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08530v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08531v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also become more diverse. Existing evaluations often rely on manually written scenarios, static prompts, or final-output judgments, making it difficult to capture the diverse risks that agents may face during task execution. We introduce VESTA, a fully automated scenario generation and safety evaluation framework for LLM agents. Based on five risk dimensions, VESTA instantiaes abstract and diverse safety risks in real-world task execution into 1,072 measurable evaluation scenarios. Using the automated evaluation pipeline, 12 LLM agents are evaluated under two authority contexts. The results show that current agents still face substantial behavioral safety risks during task execution, with an average ASR of 47.1% and several models exceeding 70%. These findings demonstrate the importance of executable, process-level evaluation for understanding and improving LLM agent safety.

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08532v1 Announce Type: new Abstract: A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08533v1 Announce Type: new Abstract: Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved, where different payloads induce highly variable flight dynamics, requiring a single policy to adapt online without manual calibration or explicit system identification. To this end, we study \textbf{A}utonomous \textbf{A}erial Manipulation via \textbf{Co}ntextual \textbf{Co}ntrastive Meta Reinforcement Learning (\textbf{\textit{Aco2}}), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, \textit{Aco2} can be directly deployed on a physical quadrotor without real-world fine-tuning.

NGram-MoSE: Efficient Remote Sensing Super-Resolution via N-Gram Context and Mixture-of-Experts

Yun-Hsuan Huang, Trong-An Bui, Chih-Hung Chuang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08535v1 Announce Type: new Abstract: Remote sensing applications for environmental monitoring and disaster management are frequently constrained by a spatial--temporal trade-off: imagery with fine spatial detail is often acquired less frequently, whereas more temporally available observations are typically coarser. Single-image super-resolution provides a practical means to enhance coarse imagery without changing acquisition schedules, yet many Transformer-based SR models remain computationally expensive and can be sensitive to limited or geographically biased training data, which degrades robustness under out-of-distribution conditions. This paper presents NGram-MoSE, a lightweight Transformer architecture designed to improve both efficiency and texture continuity. NGram-MoSE introduces N-Gram Context Injection to strengthen cross-window local consistency and mitigate window-boundary artifacts, and incorporates a Mixture-of-Experts (MoE) feed-forward design to scale capacity through sparse activation without proportional growth in inference cost. Experiments on a geographically disjoint OOD test set show that NGram-MoSE achieves 31.68\,dB PSNR while reducing FLOPs by $14\times$ relative to a heavyweight Transformer reference. Downstream evaluation on a landslide segmentation benchmark further demonstrates that restoring degraded inputs to the detector training scale improves performance, yielding a 4.47\% absolute gain in mAP@50 over bicubic upsampling, and exhibits stronger cross-scale consistency under scale extrapolation. These results indicate that NGram-MoSE provides an effective SR module for resource-constrained remote sensing pipelines requiring robust generalization.

Routine laboratory trajectories encode the onset of organ-level complications in cancer

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08538v1 Announce Type: new Abstract: Routine laboratory panels drawn during cancer treatment constitute longitudinal physiological recordings of organ function, yet their temporal structure is discarded by single-timepoint prognostic tools. A transformer trained on 2,777,595 laboratory measurements from 3,905 patients with multiple myeloma or ovarian cancer predicted the two-year onset of 162 treatment-associated complications, including therapy-related myelodysplastic syndromes, spanning eight clinical categories, achieving 1.5- to 6.1-fold enrichment above prevalence at the group level. It matched or outperformed non-sequential baselines across grouped endpoints (AUROC gains up to +0.11), demonstrating that longitudinal laboratory trajectories capture evolving complication-specific physiology inaccessible from isolated measurements. Predictions generalised across both cancers, divergence concentrating in disease-specific complications, and biomarker masking recovered signatures consistent with established pathophysiology. External validation on MIMIC-IV and MMRF CoMMpass confirmed transferability across independent healthcare systems (AUROC up to 0.85). Routine oncological laboratory data encode organ deterioration weeks to months before clinical onset, enabling complication-specific surveillance without additional testing infrastructure.

AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions

Chenglin Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08539v1 Announce Type: new Abstract: AI agents increasingly take consequential actions -- shell commands, cloud operations, and arbitrary tool-calls -- so a trust layer must decide, per action, whether to allow, warn, block, or escalate. We argue that the right way to reason about such a layer is by threat type. Lexical (fixed-signature) threats, where danger lives in a stable token, are decidable by deterministic rules; semantic (intent-dependent) threats, where a benign and a malicious action share the same surface, are out of reach for rules by construction. We make this concrete with a negative proof: a determined, hand-authored cloud rule pack lifts held-out accuracy only 48 to 56% overall and moves the semantic categories by 0pp (data_db 29 to 29, observability 59 to 59, supply_chain 50 to 50), while a strong LLM judge carries exactly those categories. We give the judge a self-learning capability: on a corpus that is mainly semantic attacks it nearly doubles rule accuracy (48% to 83.6-85.2%) with near-zero false-blocks, and this holds across two model providers. We turn this into a self-improving dual-store system: the judge distills a growing deterministic rule floor on lexical threats (cheaper over time) and feeds a guarded RAG memory on semantic threats (a verdict-cache fails -- surface-twins collapse to ~58% -- so a corroboration guard lifts semantic accuracy +13pp, 70 to 84). The result is what sets AgentTrust v2 apart from its static v1 predecessor: a trust layer that self-evolves from its own stream of decisions -- cheaper on the lexical class (it distils its own rules) and smarter on the semantic class (it accrues guarded precedent), while never hard-blocking a benign action. An end-to-end online replay shows the judge-call rate falling (50% to 44%) and judge-domain accuracy rising (71% to 80%), with 0 benign hard-blocks across 45,000 actions.

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08542v1 Announce Type: new Abstract: Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Shumeng Yang, Yisu Liu, Jiayi Zheng, Zhaohui Yang, Linjing Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08543v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.

Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08545v1 Announce Type: new Abstract: Building Information Modeling (BIM) projects require information requirements to be described as machine-checkable Information Delivery Specification (IDS) files in order to verify whether building models contain the required attributes. However, IDS authoring remains a practical bottleneck: practitioners must handle domain vocabulary, strict XML schema constraints, and external validator conformance while also checking whether the requirement itself is correctly expressed. We present Ishigaki-IDS, an open-weight LLM specialized for verifier-aware IDS draft generation. The model combines continued pretraining on BIM/IDS corpora, supervised fine-tuning on information-requirement-to-IDS pairs, and reinforcement learning with verifiable rewards from an external validator. The goal is not to replace expert review, but to move IDS authoring from low-level XML and schema repair toward validator-loadable drafts that practitioners can inspect and correct. On the 166-case expert-created Ishigaki-IDS-Bench, Ishigaki-IDS-8B achieves an IDSAuditPass score of 0.651, a validator-pass metric for generated IDS files, substantially outperforming Claude Opus 4.5, the strongest single-shot LLM baseline we evaluated, at 0.331. It also obtains an Audit-Gated FacetF1 of 0.282, which measures requirement-facet alignment among validator-passing drafts. The same recipe scales: 14B and 32B variants reach IDSAuditPass 0.753 / 0.693 and Audit-Gated FacetF1 0.392 / 0.369. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment endpoint. These results suggest that verifier-aware IDS generation can reduce the practical burden of converting BIM information requirements into reviewable IDS drafts.

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08548v1 Announce Type: new Abstract: Recent progress in robot manipulation has been largely driven by learning from large-scale demonstrations. For humanoid robot loco-manipulation tasks, however, existing data sources force an unsatisfying tradeoff between trajectory quality and scalability. Real-world teleoperation provides the highest-quality trajectories but requires dedicated physical space and time-consuming scene resets. Simulation offers an alternative way out of this dilemma: it can produce clean, embodiment-aligned data at scale without any physical hardware. In this paper, we propose OASIS, a simulation-data-driven framework for humanoid loco-manipulation. OASIS automatically reconstructs realistic object assets from real-world images using a 3D generative model. Based on these assets, trajectories are first collected through teleoperation in simulation, and then augmented under diverse domain randomizations in a post-processing stage. With the resulting simulation data, we further design a hierarchical visuomotor policy for humanoid loco-manipulation. Extensive experiments on the real humanoid robot show that, under zero-shot deployment, the policy trained on our simulation data achieves higher success rates on most tasks than that trained on real-robot teleoperation data, owing largely to the broad lighting and environmental variations covered by our simulation rendering, which real-robot data fails to capture. The project page is available at https://oasis-humanoid.github.io/.

Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents

Mark Burgess — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08552v1 Announce Type: new Abstract: I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.

FusionVul: A Multimodal Feature Fusion Framework for Source Code Vulnerability Detection

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08553v1 Announce Type: new Abstract: Source code vulnerability detection remains a long-standing challenge due to the increasing scale, structural complexity, and semantic diversity of modern codebases. Conventional static-analysis or rule-based approaches often fail to capture subtle execution dependencies, while single-modality learning models tend to overlook critical structural information embedded beyond the lexical surface of source code. To improve robustness across heterogeneous code patterns, we propose FusionVul, a joint representation learning framework that integrates sequential syntactic representations extracted by a pretrained Transformer encoder with structural semantics propagated through a graph neural network. The framework further incorporates a cross-attention-based feature fusion network to enable fine-grained cross-modal interaction and employs a sample-aware weighting mechanism to integrate multiple predictive branches. Experimental results on four datasets demonstrate that FusionVul achieves superior F1 scores on datasets with highly dispersed function size distributions and broader vulnerability-type coverage, such as SVulD and DiverseVul, reflecting its capability to capture complex and diverse vulnerability patterns.

A Theoretical Analysis of Memory and Overfitting Phenomena in Stochastic Interpolation Models

Yunchen Li, Shaohui Lin, Zhou Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08554v1 Announce Type: new Abstract: This paper provides a theoretical account of memorization in stochastic interpolation models. By leveraging closed-form expressions for the optimal velocity field and the associated score function, we show that, in the continuous-time oracle setting, both deterministic and stochastic generation processes recover training samples. Under Euler discretization, generated samples remain centered around training samples, with deviations controlled by the step size. We further analyze generation in the presence of estimation errors and show that accumulated estimation errors control the endpoint deviation from the training set. These results imply that the generated sample admits a representation as a training sample perturbed by three controlled terms: a discretization-induced bound, an estimation-error-induced bound, and stochastic Gaussian noise. Based on this characterization, we provide theoretical definitions of overfitting and underfitting in generative models. Synthetic simulations support our theoretical findings.

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

Haotian He, Zeyu Yan, Qipeng Liu, Ning Guo, Wenzhao Lian — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08555v1 Announce Type: new Abstract: Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.

Inside the LLM Word Factory

Benzi Busigin, Yuval Pinter — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08562v1 Announce Type: new Abstract: Transformer language models process input provided as subword fragments, but natural language semantics usually rely on word-level concepts. Detokenization is the process where models reconcile these two facts, aggregating subwords into word-level representations through their computation. Prior work has found that this takes place mostly in early-to-middle layers, but so far the exact mechanics of the process have not been pinned down. We venture deep into detokenization using activation patching in controlled paired experiments that isolate the contribution of different model components, localizing English detokenization in Llama2-7B to a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families, but the depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. Finally, we provide a probe for determining the success of the detokenization process based on early-layer activations alone, performing at 0.94-0.97 AUROC depending on the amount of context.

Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

Dandan Chen, Yaqiang Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08563v1 Announce Type: new Abstract: While global data-driven models excel at predicting continuous atmospheric variables, three-dimensional hydrometeor forecasting remains challenging due to the zero-inflated, long-tailed distributions of these variables. Standard deep learning optimization often yields overly smooth forecasts, attenuating extreme events and spatial textures. We propose PredHydro-Net, a physics-guided dual-decoding framework that mitigates this smoothing. To resolve multi-variable optimization conflicts, it employs a decoupled architecture where macroscopic thermodynamic and dynamic fields unidirectionally modulate hydrometeor generation. By integrating wavelet-based frequency decoupling, spectral amplitude matching, and adversarial training, the model achieves a favorable trade-off between quantitative accuracy and spatial fidelity. In a 72-h global evaluation, PredHydro-Net outperforms both spatiotemporal deep learning baselines (Earthformer and PredRNNv2) and the operational Global Forecast System (GFS) in extreme-event detection and spectral representation. Furthermore, it demonstrates strong climatological consistency with Global Precipitation Measurement (GPM) satellite retrievals. The model reasonably reproduces the three-dimensional cloud structures in extreme weather events, such as Hurricane Ian. Feature attribution confirms its dependence on physical precursors such as relative humidity and wind convergence, offering a robust, physics-informed approach to long-tailed atmospheric prediction.

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08564v1 Announce Type: new Abstract: Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.

EinSort: Sorting is All We Need for Tensorizing LLM

Toshiaki Koike-Akino, Jing Liu, Ye Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08565v1 Announce Type: new Abstract: Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they can significantly reduce memory and computational costs. However, identifying implicit low-rank structures in large foundation models remains challenging due to their enormous scale and un-structured weight distributions. We propose an adaptive tensorization method that discovers inherent low-rank structure in a target tensor by index ordering. Experiments on weight and KV-cache compression demonstrate improved reconstruction quality compared to baselines.

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu, Yongdong Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08566v1 Announce Type: new Abstract: Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues, and then aggregate multimodal features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task. Visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. The holistic mining brings significant information redundancy and inaccurate emotional cues. Thus, fine-grained visual cause extraction has a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. Besides, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics, and augments the interpretable refinement process by reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage contrastive loss to achieve semantic forced alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., achieving the best performances with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.

Regulating the AI Tutor: Intentions, Help-Seeking, and Self-Regulated Learning in Adolescent GenAI Use

Rania Abdelghani, Peter Kaiser, Kou Murayama — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08568v1 Announce Type: new Abstract: Generative AI (GenAI) tools are now common learning companions for adolescents, yet how they regulate their use during authentic learning tasks remains poorly understood. Self-regulated learning (SRL) and high-level help-seeking (HS) are commonly proposed as safeguards against passive or shortcut-oriented use, but most empirical studies focus on aggregate learning outcomes rather than these moment-to-moment processes during AI-supported learning. This work-in-progress examines open-ended conversational data from 98 Grade-9 students across three German Gymnasium schools, who used a web-based Mistral-Large tutor to prepare a curriculum-aligned mathematics skill before an exam. Alongside chat logs (1,616 turns; 808 student turns), we collected pre-post domain knowledge, pre-chat learning needs, and self-reported cognitive load. We propose a turn-level codebook combining theory-driven SRL and HS constructs with two LLM-specific inductive codes (agency over the AI; epistemic vigilance), and report preliminary AI-coded results. Although students overwhelmingly selected scaffolded support before the chat, their interactions were dominated by instrumental requests with almost no explicit monitoring or evaluation. Post-test performance was significantly lower than pre-test, and higher extraneous cognitive load predicted lower post-test scores after controlling for prior knowledge. We discuss how these patterns can support hybrid human-AI analysis of interaction patterns and inform scaffolds for more agentic and epistemically proactive GenAI use.

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

Subramanyam Sahoo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08571v1 Announce Type: new Abstract: Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce \textbf{Structured Ignorance Certificates} (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs we construct a 7,347-sample \emph{Unknown-Unknown} (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46\% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6\% ROUGE-L improvement over the base model on retrieval-grounded generation -- demonstrating that explicit epistemic structuring is a learnable and measurable capability.

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08572v1 Announce Type: new Abstract: While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08573v1 Announce Type: new Abstract: Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations, and can adapts them to SER labels via finetune, but this mechanism still missing per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio language models (LALMs) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. Across different audio LLMs and emotion recognition datasets evaluations, our design improves SER performs across different evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08574v1 Announce Type: new Abstract: Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control -- all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.

When Should Queries Be Decomposed? A Stage-Aware Study of Query Decomposition for Multi-Condition Retrieval

Bochao Yin, Xuan Lu, Zhengyu Qi, Xiaoyu Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08577v1 Announce Type: new Abstract: Multi-condition retrieval requires systems to identify documents that satisfy multiple distinct constraints, moving beyond mere topical relevance. While query decomposition is widely adopted as an intuitive remedy, its effectiveness across different retrieval pipeline stages remains underexplored. In this paper, we conduct a stage-aware empirical study and uncover a stark, stage-dependent effect: decomposition during initial retrieval frequently harms retrieval performance due to semantic dilution, yet substantially improves reranking by enabling more fine-grained constraint verification. Motivated by these insights, we propose a principled Stage-Aware Decomposition framework that retains the monolithic query during initial retrieval to preserve global semantic context, while employing sub-queries exclusively during reranking for fine-grained constraint matching. Extensive evaluations on the MultiConIR and SSRB benchmarks demonstrate that our framework consistently improves ranking performance for compositional queries across multiple retrieval and reranking models. We release our code at https://github.com/EIT-NLP/Query-Decompose.

Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

Xu Zhang, Peang Wang, Wei Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08578v1 Announce Type: new Abstract: Recently, large time series models (LTSMs) have gained increasing attention due to their similarities to large language models, including flexible context length, scalability, and task generality, outperforming advanced task-specific models. However, prior studies indicate that pre-trained LTSMs may exhibit a poorly conditioned non-convex loss landscape, leading to limited trainability. As a result, direct fine-tuning tends to cause overfitting and suboptimal performance, sometimes even worse than training from scratch, substantially diminishing the benefits of pre-training. To overcome this limitation, we propose Smoothed Full Fine-tuning (SFF), a novel fine-tuning technology. Specifically, we construct an auxiliary LTSM via random initialization to obtain a smoother loss landscape, and then linearly interpolate its weights with those of the pre-trained model to smooth the original landscape. This process improves trainability while preserving pre-trained knowledge, thereby enabling more effective downstream fine-tuning. From an optimization perspective, SFF perturbs sharp minima without significantly harming flat regions, facilitating escape from poor local basins toward smoother and more generalizable solutions. Extensive experiments on benchmark datasets demonstrate consistent improvements across eight representative LTSMs, including Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, on diverse downstream tasks. The code is available at the link: https://github.com/Meteor-Stars/SFF.

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Jasmeet Singh Bindra, Siddharth Panwar, Shubhajit Roy Chowdhury — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08583v1 Announce Type: new Abstract: Deep learning on physiological time series is interpreted through domain-specific features -- oscillatory rhythms in EEG, morphological complexes in ECG -- yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32--0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.

Convolutional Sparse Coding via the Locally Competitive Algorithm on Loihi 2

Geoffrey Kasenbacher, Daniel Ruepp, Gerrit A. Ecke — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08584v1 Announce Type: new Abstract: Sparse coding provides a principled framework for signal representation by expressing an input as a linear combination of only a small number of basis functions. The Locally Competitive Algorithm (LCA) is particularly attractive in the context of neuromorphic computing because its dynamics, leaky integration, thresholding, and lateral inhibition map naturally to neuromorphic hardware. While prior work has studied non-convolutional LCA on Loihi 2, the convolutional setting is of particular interest because it introduces spatial structure, weight sharing, overlapping receptive fields, and scaling behavior that are more representative of practical sparse inference workloads. In this work, we present a Loihi 2 implementation of convolutional sparse coding via the LCA and evaluate it against a conventional GPU baseline on the same inference problems. The implementation follows a one-layer recurrent LCA formulation and extends it to convolutional feature maps with local inhibitory kernels derived from pairwise filter interactions. To the best of our knowledge, this is the first implementation and benchmark of convolutional LCA on Loihi 2. Our goal is not only to demonstrate feasibility, but also to clarify in which operating regimes convolutional sparse inference becomes attractive on neuromorphic hardware. The resulting study positions convolutional LCA as a useful benchmark for structured sparse inference on emerging neuromorphic systems.

Novel physical property preserved methods for stochastic Schr\"{o}dinger--KdV equation

Ziheng Chen, Jialin Hong, Liying Sun — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08585v1 Announce Type: new Abstract: In this work, we study the stochastic Schr\"odinger--KdV equation driven by additive noise from both analytical and numerical viewpoints. We first establish the evolution laws for the averaged plasmon number, momentum, and energy, together with the conservation of the averaged particle number. Motivated by these intrinsic structures, we develop two temporal discretizations. One is constructed based on the splitting strategy and Crank--Nicolson scheme, and is shown to preserve the discrete evolution laws of the averaged plasmon number and momentum, as well as the discrete conservation law of the averaged particle number. The other is proposed within the constant scalar auxiliary variable framework, in which the nonlinear energy functional is reformulated so that a modified averaged energy law can be preserved at the discrete level. Combining these temporal discretizations with a local discontinuous Galerkin approximation in space yields structure-preserving full discretizations inheriting the corresponding discrete physical laws. Numerical experiments are presented to validate the theoretical results and to demonstrate the accuracy, robustness, and effectiveness of the proposed methods.

LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs

Phouvadeth Vathana, Prapti Bhatt, Rishi Patel, Nasir U. Eisty — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08588v1 Announce Type: new Abstract: Large language models (LLMs) have shown considerable promise for automated unit test generation, yet their practical effectiveness relative to human-written tests remains poorly understood. Existing evaluations commonly rely on coverage-oriented benchmarks that do not assess fault-detection capability directly. We present an empirical comparison of LLM-generated and human-written unit tests across three complementary Python benchmarks: 29 real historical bugs from BugsInPy, a function-level benchmark drawn from python-slugify and packaging, and a controlled paired benchmark. Our generation pipeline couples Gemini 2.5 Flash with a lightweight lexical retrieval mechanism that supplies bug-relevant context at generation time. Across eight quality dimensions, LLM-generated tests with retrieval-augmented context detect faults in 69% of cases compared to 17.2% for general-purpose human-written tests (Fisher's exact, $p < 0.001$, Cohen's $h = 1.10$). Critically, line and branch coverage are nearly identical between the two approaches (84.8% vs. 88.5% and 75.2% vs. 82.1%), confirming that coverage is an insufficient proxy for fault-detection capability. We discuss the conditions under which each approach excels, characterize their complementary strengths, and identify the critical role of retrieval context and reproducible benchmark construction in meaningful test-quality evaluation.

Detection and Interpretability Analysis of Quotation Errors by Large Language Models

Bei Huang, Yingyi Zhang, Shenghao Huang, Chengzhi Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08589v1 Announce Type: new Abstract: Purpose - Quotation error refers to the inconsistency between cited information and its original source. This phenomenon leads to a series of negative impacts, such as misinterpretation of the original research, undermining the academic community's collective understanding of relevant issues, and weakening the accuracy and fairness of the citation-based academic evaluation system. Existing studies have shown that quotation error is prevalent in the academic community; moreover, manual verification of quotation error is not only labor-intensive but also inefficient. Therefore, this paper proposes the task of 'automated detection of quotation errors'. Methodology - Adopting a large language model (LLM)-based approach, this paper improves detection performance from two aspects on the basis of existing research: first, employ the fine-tuning approach for LLMs to detect quotation errors; second, incorporating full-text data of the cited literature into dataset construction, and exploring the optimal scheme for building such datasets by comparing three types of full-text integration methods. Based on this, this paper further uses the TokenSHAP tool to conduct interpretability experimental analysis on the model's prediction results. Findings - The fine-tuning approach for LLMs has improved the performance in detecting quotation errors. Among the different methods for incorporating full-text information, the approach based on using the source abstract yielded the best performance. Originality - The fine-tuning approach for large language models (LLMs) is applied to the task of automated detection of quotation errors, and interpretability analysis is conducted on the model's output results.

Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

Anastasiia Kuvshinova, Seungmin Jin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08590v1 Announce Type: new Abstract: Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.

Quantum Global Variational Learning for Quantum Error Correction

Shun Ryuzaki, Hideo Mukai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08592v1 Announce Type: new Abstract: Efficient quantum error correction is essential for the advancement of quantum computing. We propose a quantum neural network with a global structure that reduces the number of unitary matrices required in quantum circuits. This approach resulted in a 97\% reduction in training time and up to a 25\% improvement in the training completion rate, ultimately achieving a 100\% success rate in training while surpassing the error correction performance reported in previous studies. In addition, we demonstrated the enhanced robustness of quantum error correction against internal network noise. Moreover, the fidelity of quantum error correction under internal network noise increased by up to 15\% due to the reduced computational load.

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Jasmeet Singh Bindra, Siddharth Panwar, Shubhajit Roy Chowdhury — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08594v1 Announce Type: new Abstract: Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters, yet no prior study has isolated model capacity as the experimental variable or tested whether reconstruction metrics predict downstream neural-signal utility. We address both gaps by fixing architecture, loss, data split, and training recipe while sweeping only channel width from 1.05K to 40.26K parameters in a minimal depthwise-separable convolutional U-Net. Models were evaluated on the EEGDenoiseNet benchmark, cross-dataset BCI transfer tests, controlled baseline retraining, and downstream motor-imagery classification with five decoder families across all nine BCI Competition IV-2a subjects. Reconstruction performance saturated by 3-6.5K parameters, with post-elbow gains of at most 0.015 correlation coefficient per log10-parameter unit. An 8.46M-parameter baseline retrained under the same pipeline matched the 40.26K compact variant on EOG--a 200x parameter gap yielding no advantage--while a Patch-Transformer control reproduced the same diminishing-return shape. Downstream evaluation exposed a classifier-dependent metric-utility gap: reconstruction-optimized denoising significantly degraded CSP+LDA classification across all nine subjects and three artifact types (best denoised accuracy 0.547 vs. 0.612 noisy baseline; Bonferroni p=0.0488), persisting on naturally recorded trials (Delta=-0.047; BH-FDR q=0.0049). End-to-end neural decoders showed variable or neutral effects. Standard EEG denoising benchmarks are saturated far below current model capacity, and reconstruction metrics do not predict BCI utility. Ultra-compact models at 33-46 KB and 1.27-2.61M FLOPs/segment are practical for edge deployment. These findings argue for capacity-controlled evaluation, harder task-aware benchmarks, and mandatory downstream validation.

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08596v1 Announce Type: new Abstract: Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/

Kikuchi Graphs of Random Hypergraphs are Approximately Johnson

Pravesh K. Kothari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08597v1 Announce Type: new Abstract: We prove that level-$\ell$ Kikuchi graphs of random $2r$-uniform hypergraphs spectrally approximate the Kikuchi graph of the complete $2r$-uniform hypergraph at a sampling rate that is sharp up to a logarithmic factor, in the regime $r\leq \ell \leq n/2$. Our proof is based on the matrix Bernstein inequality, but, unlike prior works, we apply it to an appropriate collection of blocks of Johnson eigenspaces. Our analysis relies on a new, simple band-locality property for arbitrary Kikuchi graphs. As an application, we prove that the natural degree-$2\ell$ sum-of-squares relaxation for the Max $2r$-XOR problem is ``integral'' when the input is a planted noisy $2r$-XOR instance on a random hypergraph with $\gtrsim n \cdot (n/\ell)^{r-1} \log n$ hyperedges.

InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08601v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predominantly rely on passive modality alignment or static task reprogramming, which often fail to capture fine-grained, non-stationary temporal patterns or to adapt to nuanced task intents. In this paper, we propose Instruction-aware Active Probing (InA-Probe), which shifts the paradigm from passive alignment toward an active, instruction-driven probing mechanism. Specifically, we design a Multi-Level Instruction Injection mechanism that enriches the model with both global task objectives and fine-grained, patch-level semantic priors. Building on this, an Adaptive Query Generation module produces sample-specific probes that are dynamically modulated by the temporal context. These probes are then refined through a dual-stage attention process: they first internalize task-specific intents via Instruction-Aware Self-Attention, and subsequently interrogate the projected temporal representations through Temporal Cross-Attention to extract salient patterns. Comprehensive experiments on seven real-world benchmarks show that InA-Probe consistently outperforms state-of-the-art deep learning and LLM-based baselines, excelling in both one-for-all generalization and zero-shot transfer while reducing forecasting error by up to 37\% in challenging cross-domain scenarios. Ablation studies further confirm that the synergy between adaptive querying and fine-grained instructions is key to unlocking the reasoning power of LLMs for complex time series.

Reinforcement Learning for Flow-Matching Policies with Density Transport

Boshu Lei, Kostas Daniilidis, Antonio Loquercio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08602v1 Announce Type: new Abstract: We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name \emph{RLDT}, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it finetunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is \href{https://rpfey.github.io/rldt/}{https://rpfey.github.io/rldt/}.

Gryphon: A Unified Architecture for Semantic-ID Generation and Item-Level Scoring in Industrial Recommendations

Daria Tikhonovich, Oleg Sorokin, Vladislav Dodonov, Mariia Ulianova, Ilya Murzin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08604v1 Announce Type: new Abstract: Generative retrieval (GR) has become a scalable approach to candidate generation: each item is assigned a short hierarchical token sequence called a Semantic ID (SID), and the next item's SID is decoded autoregressively. A practical limitation is that the decoder's beam search optimizes the likelihood of token sequences, not the relevance of the underlying items. These objectives diverge when sequence likelihood is poorly calibrated due to beam search error accumulation, and when several items collapse onto a single SID and receive identical scores. We introduce Gryphon, an encoder-decoder generative recommendation architecture that adds a jointly trained item-level scoring component alongside SID generation, reusing the encoder's user representation computed in a single forward pass. Instead of ranking SIDs by accumulated token likelihood, Gryphon resolves each generated SID to its concrete items and re-scores those items directly, which sidesteps miscalibrated sequence scores and separates items that collide on the same identifier. On an industrial music service, with item-level scoring trained under a next-item-prediction objective, Gryphon attains the highest item-level Recall@1000, above the strongest baselines (+3.7% over vanilla GR and +2.5% over collision-resolved GR) at comparable parameter count and latency. Gryphon's item-level ranking also surpasses its beam-likelihood ranking of the same candidates (+4.2% gain), demonstrating the benefit of item-level scoring in GR. Deployed as the sole candidate source in a 7-day A/B test, Gryphon produced no statistically significant change in total listening time (+0.25%) while replacing a pipeline of more than 15 candidate generators and a separate preranking stage, substantially simplifying the candidate-generation system.

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Pratuat Amatya, Vinay Setty — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08605v1 Announce Type: new Abstract: We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim--evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus~4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08610v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

Bayesian Optimization of a Multi-Product Chemical Reactor Using Composite Models and Partial Physics Knowledge

Liqiu Dong, Marta Zag\'orowska, Mehmet Mercang\"oz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08611v1 Announce Type: new Abstract: We study data-driven real-time economic optimization of a multi-product chemical reactor when no reliable first-principles model is available beyond a steady-state energy balance. Instead of learning the economic objective directly as a black-box function, we use a composite formulation in which Gaussian process (GP) models predict physically meaningful outputs, including product concentrations and reactor temperature, while profit is computed analytically from these predictions together with raw-material, product, and utility prices. This preserves the structure of the economic objective, makes it parametric in changing prices without needing retraining, and allows candidate operating points to be checked against the available energy balance through a physics residual. The GPs also provide predictive uncertainty, which is exploited in a Bayesian optimization (BO) framework both for data-efficient exploration and for conservative enforcement of the reactor temperature constraint through an upper confidence bound. The acquisition function additionally penalizes large energy-balance mismatch obtained by substituting the GP-predicted outputs and candidate inputs into the available steady-state energy balance. The approach is demonstrated on a benchmark simulation of a non-isothermal multi-product reactor. Relative to a trust-region safe BO implementation, the proposed method achieves better simulated economic performance within the available iteration budget. Relative to a purely data-driven BO approach that does not use the available physics information, it avoids reactor temperature constraint violations.

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08612v1 Announce Type: new Abstract: Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: a) A description of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each, b) A multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain, c) A per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions, d) A task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols, e) A compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks, and f) A discussion of current challenges and promising future directions.

Harnessing Streaming Video in the Wild

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08615v1 Announce Type: new Abstract: Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct \textbf{Streaming-Train-248K}, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce \textbf{Streaming Harness}, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design \textbf{Streaming-Eval}, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

Cross-Source Reasoning-based Correction for Author Name Disambiguation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08617v1 Announce Type: new Abstract: Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors of paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer the assignments of which sources are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

SPA: A SQL-Plan-Aware Reinforcement Learning Framework for Query Rewriting with LLMs

Xinyi Huang, Zhengjie Miao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08620v1 Announce Type: new Abstract: SQL query rewriting is a well-established technique for improving database performance without schema or index changes, yet finding effective rewrites for modern analytical workloads remains difficult: rule-based methods are limited to predefined transformations, while LLM-based approaches often produce rewrites that are semantically valid but compile to equivalent physical plans or degrade runtime performance. We present SPA, a SQL-Plan-Aware reinforcement learning framework that trains LLMs to rewrite queries using physical execution feedback. SPA formulates rewriting as a policy optimization problem and extends GRPO with rewards spanning semantic equivalence, textual rewrite distance, physical-plan divergence, and runtime speedup. To handle reward sparsity across query difficulty, SPA introduces Probability-Gated Adaptive Reward Shaping, a query-level curriculum that unlocks higher-level rewards only once a rollout group achieves sufficient mastery of lower-level objectives, and further improves sample efficiency through on-policy self-improvement by recycling slowdown rewrites from the current policy as targeted training signals. On both IID and OOD workloads, SPA outperforms rule-based and strong LLM baselines in end-to-end runtime, substantially reduces harmful slowdown rewrites, and yields strong tail-latency gains.

Strategyproof Mechanisms for Euclidean Facility Location Problems under $L_p$-norm Social Cost

Hau Chan, Jianan Lin, Chenhao Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08621v1 Announce Type: new Abstract: We study strategyproof mechanisms for eliciting agents' location preferences truthfully in the Euclidean plane $\mathbb R^2$ and locating a facility so as to minimize the $L_p$-norm social cost, defined as the $L_p$-norm of the vector of distances from the facility to the agents' preferred locations, for any $p \ge 1$. While the cases $p=1$ and $p=\infty$ have been well-studied, open questions remain about the optimal approximation ratios achievable by strategyproof mechanisms for general $p$. Our first result resolves an open question of Goel and Hann-Caruthers [Soc. Choice Welf. 2023]. They showed that the coordinate-wise median (CM) mechanism achieves an approximation ratio lying between $2^{1-\frac{1}{p}}$ and $2^{\frac{3}{2}-\frac{2}{p}}$ for $p\ge 2$, and they conjectured that it is exactly $2^{1-\frac{1}{p}}$. We confirm this conjecture, and we further show that CM has a tight $\sqrt 2$-approximation for $1\le p\le 2$. Our second and third results demonstrate that two randomized mechanisms can yield better approximation ratios. In particular, we first consider the uniformly rotated coordinate-wise median (URCM) mechanism, and prove that, for $1\le p<2$, its approximation ratio strictly improves over the deterministic bound $\sqrt{2}$, while no such improvement is possible for $p\ge 2$. We then study the centroid random dictatorship mechanism that returns the average location (i.e., centroid) and the random dictatorship each with half probability, and show that its approximation ratio strictly improves over CM and URCM for every finite $p\gtrsim 1.6$. Moreover, our analysis independently recovers the classical deterministic and randomized results for $p=1$ [Meir, SAGT 2019] [Barak, EC 2026] and $p=\infty$ [Goel and Hann-Caruthers, SCW 2023] [Tang et al., EC 2020] using significantly different techniques.

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08625v1 Announce Type: new Abstract: As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.

Sycophancy Towards Researchers Drives Performative Misalignment

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08629v1 Announce Type: new Abstract: The increasing situational awareness of language models raises safety concerns: models might be aware when they are evaluated, and adjust their behavior to evade monitoring and resist modification, e.g., pretending to be aligned only in evaluation. This alignment faking behavior is often interpreted as scheming: an intentional effort of strategic deception. In this paper, we examine an alternative interpretation, performative misalignment, which explains the change in behavior as a result of sycophancy towards AI researchers. To examine this hypothesis, we present three empirical findings. First, we show that evaluation awareness persists even when we tell models they are deployed, which contradicts the scheming story which predicts less misalignment when the model perceives evaluation. Second, we use probing and steering to show that our current methods cannot mechanistically distinguish sycophancy and scheming in alignment faking evaluations. Third, we fine-tune models to be more sycophantic and observe increased sensitivity to evaluation cues. To conclude, we emphasize deconfounding sycophancy from scheming for future work on evaluations and mitigations of intent misalignment.

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08630v1 Announce Type: new Abstract: Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose \textbf{Tyan-WP}, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific module designs: static site embedding using coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising R^2 by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that the wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.

xSense Design Cards: Guiding the Design of Multisensory Experiences

Ceylan Be\c{s}evli, Carlos Velasco, Marianna Obrist — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08632v1 Announce Type: new Abstract: Designing multisensory experiences involves the deliberate combination of sensory elements to shape specific impressions for a given audience. Advances in technologies beyond audiovisual modalities now make it feasible to design across touch, taste, smell, and more. However, HCI still lacks the tools and shared vocabulary needed to systematically create and evaluate such experiences. The xSense Design Cards address this gap with four card types: (1) Experience Cards define purpose, context, and audience; (2) Sensory Cards break down multisensory concepts into elements and events; (3) Technology Cards prompt consideration of relevant technologies; and (4) Exploration Cards guide reflection on the broader context, including responsible innovation. This work introduces the cards and their theoretical grounding, showing how they support structured design, reflection, and evaluation of an experience's multisensory composition. By presenting xSense, we aim to broaden the vocabulary for multisensory design and stimulate discussion within the growing multisensory HCI community.

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08633v1 Announce Type: new Abstract: Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

Seunghyun Lee, Byoungkwon Kim, Jaehyun Nam, Kyungmin Lee, Jinwoo Shin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08634v1 Announce Type: new Abstract: The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real--fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

Yang Pengju — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08635v1 Announce Type: new Abstract: Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are transmitted at full precision and the rest are not transmitted. This paper argues that binary selection leaves a useful design space unused. SpectrumKV assigns a precision level to each token instead: attention sinks and other high-importance tokens are protected at FP16, medium-importance tokens are sent at INT8, and low-importance tokens are sent at INT4 when the model can tolerate it. The main practical complication is that INT4 tolerance is model-dependent. Qwen2.5-7B catastrophically fails under INT4 KV quantization, while Mistral-7B and Gemma-2-9B remain stable. SpectrumKV therefore runs a lightweight deployment-time probe: three aggressive NIAH trials under a 3-tier policy. Models that pass use FP16+INT8+INT4; models that fail fall back to FP16+INT8. Across Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV improves quality at the same transfer budget. At a 50% normalized KV budget on WikiText-2, SpectrumKV changes perplexity by +1.97%,-0.06%, and-0.44%, respectively, compared with PDTrim's +25.85%, +22.07%, and +35.63%. On NIAH retrieval at 4096 tokens, the adaptive policy reaches 52.6% on Qwen at the aggressive b=0.3 budget versus 26.3% for PDTrim, and reaches 100% by b=0.5; Mistral and Gemma preserve retrieval under the 3-tier policy. End-to-end GPU timing of the transfer path shows 50-62% TTFT reductions at b=0.5. These results suggest that PD KV transfer should be treated as a precision-allocation problem, not only as token pruning.

Cooperative Guidance and Control for Active Asset Protection with Time-Varying Agent Speeds

Ram Milan Kumar Verma, Shashi Ranjan Kumar, Hemendra Arya — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08636v1 Announce Type: new Abstract: Protecting an asset against threats is a challenging problem in an era of continuously evolving intelligent attacks. This requires cooperation between the asset and the defender to share information and jointly maneuver. To address this problem, this work proposes a cooperative guidance and control strategy for active asset protection against a maneuvering threat. This work develops a joint maneuver strategy where both the defender and the asset coordinate their time-varying speeds and courses to neutralize/capture the attacker. The control strategy is formulated around three coupled geometric and temporal objectives. The first objective is to set the line-of-sight rate between the asset and the attacker to zero, putting the attacker on a collision course and reducing their maneuvering. The second objective is to maintain the defender on the line-of-sight between the asset and the attacker. This ensures that the attacker faces the defender first before reaching the vicinity of the asset. Lastly, the defender is also guided to pursue the attacker based on the time-to-go estimates between the defender and the attacker. While keeping these objectives in mind, the control actions for the asset and the defender are jointly designed, fostering cooperation between the two. The stability of the proposed strategy is established using a Lyapunov-based approach. Numerical simulations performed show the effectiveness of the proposed cooperative strategy in ensuring the successful capture of a maneuvering threat.

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

Jingzhi Chen, Landi He, Zhuo Chen, Shawn Young, Lijian Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08641v1 Announce Type: new Abstract: The processing of gigapixel whole slide images within vision language models faces a major difficulty due to an excessive number of visual tokens. Existing solutions typically rely on spatial downsampling or heuristic pruning strategies that operate without training, and these methods often discard subtle but clinically meaningful patterns because pathological evidence is scattered irregularly across the tissue. To overcome this limitation, we reformulate token reduction in whole slide images as a trainable sparsification problem, allowing the model to learn an optimal selection strategy instead of following fixed heuristics. We propose a decoupled routing architecture. To enable gradient propagation through the nondifferentiable pruning operation during training, we introduce a component called SparseLearn. This component uses a variance-preserving noise gate that regulates the information flow of each patch via a differentiable Soft Top-K operator, together with a diagonal attention denoiser that recovers perturbed representations without leaking spatial information. At inference time, the SparseLearn module is entirely discarded, and the trained scorer applies a deterministic Hard Top-K operator to keep only the highest scoring 32 tokens, incurring no extra computation. By compressing the visual sequence down to a sparse set of just 32 tokens, which represents as little as 0.78% of the original length, our framework achieves 73.32% overall accuracy on SlideBench (TCGA), consistently surpassing sampling-based baselines and general-purpose vision language models. It also demonstrates strong zero shot generalization on SlideBench (BCNB) and WSI VQA*. By resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence, this work provides a highly efficient paradigm for end to end gigapixel whole slide image reasoning.

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

Soyoung Oh, Vera Demberg — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08644v1 Announce Type: new Abstract: To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update these bindings as state changes. We analyze how LLMs implement this binding process in a dynamic state tracking. Using causal interventions, we identify a retrieval conditioned rebinding mechanism, a compact attention head circuit that encodes swap relevant binding information and reinstates it at readout. Across Gemma and Llama models, this circuit supports rebinding behavior, but the representational signature of the mechanism differs across model families. In Gemma models, the binding signature is clearly expressed in the query/key subspaces of the relevant attention heads, whereas in Llama models, the binding information is carried primarily in key vectors. Overall, our results reveal an interpretable mechanism for context dependent state tracking in LLMs.

The Arithmetic Circuit Combinatorial Nullstellensatz is NP-hard

Andreas Bj\"orklund — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08646v1 Announce Type: new Abstract: A multivariate polynomial on $n$ variables $x_1,\ldots,x_n$ of total degree $n$ over $\mathbf{Z}_2$ containing the multilinear monomial $\prod_{i=1}^n x_i$ is by the combinatorial nullstellensatz [Alon, Comb. Probab. Comput., 1999] known to always have a nonroot. We show that there cannot be a randomised polynomial time algorithm that given an arithmetic circuit of polynomial size formally computing such a polynomial, locates a nonroot with constant nonzero probability unless RP=NP. The result holds even when the individual degree of every variable in the input polynomial is at most two.

Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning

Bernhard Kneip, Nhien-An Le-Khac, Hong-Hanh Nguyen-Le — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08649v1 Announce Type: new Abstract: Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn \textit{how} to analyze logs rather than \textit{what} patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a $10\times$ improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08653v1 Announce Type: new Abstract: Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.

Operator learning for the 2D incompressible Navier-Stokes equations: a conformal prediction approach in the data-scarce regime

Weinan Wang, Bowen Gang, Hao Deng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08654v1 Announce Type: new Abstract: In this paper, we propose a perturbation-based conformal prediction framework for uncertainty quantification in operator learning, with a focus on the 2D Navier--Stokes equations. While neural operators provide fast surrogates for expensive PDE solvers, they do not by themselves provide calibrated uncertainty for spatiotemporal field predictions. Our approach wraps a trained Fourier Neural Operator (FNO) with split conformal prediction and constructs the local uncertainty scale by comparing the predictions of two operators trained on nearly identical datasets: one on the original labels and one on labels perturbed by small Gaussian noise. We consider this procedure in the data-scarce regime, where the total label budget is fixed and methods that require a separate uncertainty network must divide training data between multiple models. On the 2D Navier--Stokes benchmark, the perturbation-based method produces substantially narrower conformal bands than existing methods under matched total data budgets while maintaining the target simultaneous coverage. These results suggest that perturbation sensitivity is a practical and sample-efficient uncertainty proxy for conformalized neural operators.

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08655v1 Announce Type: new Abstract: To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08656v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08657v1 Announce Type: new Abstract: Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems

Angjelin Hila — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08658v1 Announce Type: new Abstract: LLMs have revolutionized knowledge representation and retrieval, but lack the explicit modeling that knowledge ontologies possess. This paper surveys the ways that ontologies and knowledge graphs have been integrated with dense embedding algorithms. All hitherto attempts involve a trade-off between probabilistic and crisp inference. This paper proposes a novel frontier for devising knowledge representation systems that can simultaneously accommodate probabilistic and crisp inference in the same representation. To this effect, the paper proposes neuro-quantum-fuzzy systems as knowledge representation systems that accommodate both classical and contextual inference implemented through quantum-neural networks (QNN).

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08661v1 Announce Type: new Abstract: Data agents integrate LLM-driven reasoning with relational data access, executable analytical tools, and multi-step workflow orchestration, making them increasingly central to enterprise analytics. This integration introduces new security vulnerabilities across data resources, database execution, and agent reasoning, recombining concerns from database security and general-purpose LLM-agent security into failure modes that neither line of work captures on its own. To address this gap, we present a systematic security study of data agents. Our contributions are threefold. First, we develop a layered vulnerability framework that identifies eight data agent-specific risks across interpretation, execution, and policy layers. Second, we introduce an attack taxonomy organized by adversary goal, tactic, and technique, covering three goals, seven tactics, and fourteen techniques, and pair it with an LLM-driven payload generation pipeline grounded in real database schemas. Third, we evaluate these attacks on six systems, including four open-source data agents and two production cloud analytics services. Our experiments reveal substantial security vulnerabilities across current systems and yield four key takeaways.

Probing Token Spaces under Generator Shift in AI-Generated Music Detection

Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08663v1 Announce Type: new Abstract: AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on \textsc{MoM-open}, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce \textsc{CoMoE}, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Aryan Naveen, Jason Xinyu Liu, Luca Carlone, Andreea Bobu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08666v1 Announce Type: new Abstract: Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).

X-rated Compliance Theater: An Empirical Evaluation of European Age Verification Systems in Adult Websites

Simone Lavermicocca, Michekle Carminati, Stefano Longari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08667v1 Announce Type: new Abstract: Age verification is rapidly emerging as a central regulatory instrument for protecting minors online, with several jurisdictions mandating its deployment for access to adult and pornographic content. This regulatory direction raises significant privacy concerns, as it risks binding sensitive content access to identity-related attributes. It also introduces security risks, since age-verification mechanisms are often outsourced to third-party providers with limited transparency into the robustness of their verification processes. In this work, we conduct, to the best of our knowledge, the first exploratory security assessment of regulation-mandated age-verification mechanisms deployed by adult websites. Rather than treating age verification as a purely regulatory question, we empirically examine whether current deployments provide security guarantees commensurate with the privacy risks of relying on sensitive identity-related data. Our methodology combines ecosystem mapping, adversary modeling, and empirical testing across four countries, covering document-based verification, biometric age estimation, indirect signals, and website-workflow integration. Our results reveal systemic weaknesses across mechanisms and integrations under realistic threat assumptions, including failures against low-cost, widely accessible attacks. Finally, we derive concrete guidelines and design directions for mitigating the security and privacy risks exposed by current age-verification deployments.

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08669v1 Announce Type: new Abstract: Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.

WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08670v1 Announce Type: new Abstract: Large and demographically balanced datasets are essential for reliable neuroimaging biomarkers. Full-resolution 3D brain MRI synthesis can support data augmentation in this setting, but existing approaches either incur prohibitive computational cost at volumetric scale or rely on lossy latent compression that may compromise anatomical detail. As a result, practical 3D generative augmentation often requires specialized compute infrastructure. We propose WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar Discrete Wavelet Transform. The model combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling derived from higher-order wavelet statistics. Predicted log-variance is integrated directly into both the flow objective and conditioning pathway, enabling adaptive precision consistent with the heavy-tailed and input-dependent variance structure of anatomical detail. This formulation supports full-resolution 3D synthesis under practical memory and time constraints on a single modern GPU. Evaluation on a multi-site cohort demonstrates improved alignment between generated and real MRI distributions, together with enhanced downstream brain age prediction and region-level anatomical agreement relative to diffusion, latent, and wavelet-based baselines. Code is available at https://github.com/sisinflab/WaveDiT

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Zhiwei Li, Yong Hu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08671v1 Announce Type: new Abstract: Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives. We introduce SkillHone, a harness for continual agent skill evolution grounded in persistent decision history. SkillHone pairs skill revisions with evaluation-side evidence that supplies practice feedback, recording structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes with redacted reporting and propose revisions informed by prior decisions, enabling cross-session refinement without rediscovering past rationale. We evaluate SkillHone on deep-research benchmarks in a raw open-web setting, where agents are not given an integrated search stack and must organize retrieval through portable skills. We compare against a deep-research agent backed by commercial retrieval services. With Qwen3.6-35B-A3B as the evaluation-time backbone, the resulting skills outperform the deep-research agent by 15.8 points on GAIA and 3.2 points on WebWalkerQA-EN, while also exceeding prior skill-evolution methods.

Learning to Solve Generative ODEs Beyond the Linear Span

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08672v1 Announce Type: new Abstract: Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFE, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

Fran\c{c}ois Remy — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08673v1 Announce Type: new Abstract: Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by sharpening with Sinkhorn-Knop optimal transport a cost matrix established for parallel clinical texts and conversations through the fusion of sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner, by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions summarized in Appendix B. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked respectively first and second across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

Tsung-Wei Pan, Jung-Hua Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08674v1 Announce Type: new Abstract: Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts -- producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth -- compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT -- while maintaining competitive spatial fidelity.

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

Shi Ying Chang, Chiok Yew Ho, Yichen Li, Yintong Huo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08676v1 Announce Type: new Abstract: AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now integrated directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. While developers require tools that directly complete or infill code in unfinished programs during Flow mode, they also need tools that can comprehend intentions expressed as natural-language instructions and convert them into executable code in Command mode. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios due to their abilities to infer and fulfill developers' intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. Therefore, it is necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are more capable of following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08678v1 Announce Type: new Abstract: Sophisticated generative speech technology can undermined the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity with the preservation of those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction to the EER compared to the MHFA baseline.

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

Xiangzhong Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08680v1 Announce Type: new Abstract: Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.

Asymptotic Optimality of the High-Dimensional Gaussian Mechanism and Improved Low-Dimensional Mechanisms for Differential Privacy

Yu Wei, Alexander Bienstock, Antigoni Polychroniadou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08681v1 Announce Type: new Abstract: The additive noise mechanism is a foundational tool for differential privacy (DP) of $T$-dimensional real-valued vector queries. The Gaussian mechanism, utilizing Gaussian noise, is the mostly widely used such mechanism, due to its simplicity and strong privacy guarantees. In this work, we provide justification for this choice, showing that as the dimension $T\to\infty$, no additive-noise mechanism can asymptotically improve on the Gaussian mechanism's privacy--utility tradeoff for the strong privacy settings typically used.We also develop a new family of \emph{Spherical Generalized Gamma} DP mechanisms, which contains both the Gaussian mechanism and the recently studied $\ell_2$ mechanism (Joseph \emph{et al.}, ICML 2025). We identify members of this family that outperform both the Gaussian and $\ell_2$ mechanisms in certain low-dimensional settings, and show tight composition of all mechanisms in this family, answering an open question of Joseph \emph{et al.}~regarding the $\ell_2$ mechanism.

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08682v1 Announce Type: new Abstract: Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08684v1 Announce Type: new Abstract: We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.

ND-TNN: Tensor-Neural-Network Approximation for High-Dimensional Nonlocal Diffusion Models

Ziyue Cai, Zuoqiang Shi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08685v1 Announce Type: new Abstract: We study a numerical method, built on the tensor neural network (TNN) architecture introduced in \cite{wang2022tensor}, for solving nonlocal diffusion models in high-dimensional spaces. The tensor-product structure of the TNN ansatz, combined with the separability of the Gaussian kernel, reduces the high-dimensional integrals in the nonlocal energy to products of low-dimensional integrals, which are evaluated by Gauss--Legendre quadrature; nonseparable source and boundary data are handled by a TNN-based preconditioning step. For the Dirichlet boundary condition, we establish the asymptotically compatible $L^2$ error estimate \[ \|u_{\mathrm{loc}}-u_{\delta,p}\|_{L^2(\Omega)} \le C\!\left(\frac{\varepsilon_f}{\sqrt\delta} +\frac{\varepsilon_g}{\delta} +\frac{\varepsilon_u}{\sqrt\delta} +\eta_{\mathrm{opt}}\right) +C\sqrt\delta, \] where $\varepsilon_f$, $\varepsilon_g$ and $\varepsilon_u$ are the data and trial-class approximation errors and $\eta_{\mathrm{opt}}$ is the optimization residual. For the Neumann boundary condition, the $L^2$ estimate is improved to $O(\varepsilon_f+\varepsilon_g/\sqrt\delta+\varepsilon_u +\eta_{\mathrm{opt}}+\delta)$, and an $H^1$ gradient estimate is further obtained through a smoothing post-processing step. Numerical experiments on tensor-product domains up to $d=20$ support the theoretical results, and additional tests on two- and three-dimensional $L$-shaped domains demonstrate the practical robustness of the method beyond the smooth-domain setting covered by the analysis.

Shift-Dependent Asymmetry: Orthogonal Inverse Low-Rank Adaptation for Federated Medical Segmentation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08687v1 Announce Type: new Abstract: Low-Rank Adaptation (LoRA) enables efficient federated fine-tuning of segmentation foundation models for medical imaging. However, most federated LoRA methods adopt a uniform aggregation rule, which breaks under the encoder-decoder asymmetry in medical segmentation: the encoder is dominated by appearance shifts, while the decoder is dominated by supervision variations. This mismatch entangles shared anatomy with site-specific biases and harms generalization. To address this, we propose Inverse Asymmetric Tuning (IAT). IAT aligns adaptation with heterogeneity sources by personalizing module-specific components in the encoder to absorb appearance shifts and in the decoder to accommodate site-dependent supervision, while retaining a shared pathway for transferable consensus. However, structural separation alone is insufficient under LoRA's bilinear parameterization, where multiplicative coupling can still cause site-specific updates to leak into the shared direction. We therefore introduce a Subspace Orthogonality Regularizer that penalizes shared-local collinearity in the effective update space, mitigating leakage without extra communication. Experiments show consistent improvements over strong federated LoRA and parameter-efficient FL baselines.

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08688v1 Announce Type: new Abstract: Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

Hierarchical Projection for Adaptive Knowledge Transfer

Samhita Pal, Tian Gu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08691v1 Announce Type: new Abstract: Modern data-driven applications increasingly involve learning from multiple heterogeneous sources, where a target dataset is limited but related information is available across domains. Naively combining these sources can degrade performance when relevance varies or spurious signals are present, posing a fundamental challenge for trustworthy cross-domain learning. We propose Projection Transfer Learning (ProjectionTL), a unified framework that integrates hierarchical Bayesian modeling with adaptive projection for selective knowledge transfer. The key idea is to decouple transfer at two levels: first, we construct a source-guided hierarchical prior that aggregates information across sources using data-driven weights, capturing global alignment between each source and the target; second, we refine this borrowing through a posterior-projection step that operates at the feature level, selectively retaining coordinates that exhibit local agreement with the target signal. This two-stage design enables the method to simultaneously perform source selection and feature selection, thereby mitigating negative transfer while preserving interpretability. ProjectionTL provides a principled approach to integrating heterogeneous data across domains, bridging statistical modeling and modern machine learning paradigms for robust and interpretable transfer. Through simulations and real-world biomedical applications, we demonstrate improved accuracy, stability, and interpretability compared to existing methods. Our framework offers a scalable and generalizable strategy for trustworthy cross-domain learning in high-dimensional settings.

Agentic Search for Counterfactual Recourse under Fixed LLM Budgets

Yasuo Tabei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08696v1 Announce Type: new Abstract: Counterfactual recourse aims to provide actionable feature changes that would alter an unfavorable decision made by a predictive model. In practice, affected individuals often benefit from multiple feasible alternatives rather than a single optimal explanation. A natural way to produce such alternatives is to prompt large language models (LLMs). However, prompting incurs a practical constraint: the number of LLM calls is often the dominant computational and economic cost. Together, the need for multiple alternatives and this cost constraint shift the problem from finding a single high-quality counterfactual to efficiently generating a set of oracle-validated counterfactuals under a fixed LLM-call budget. In this work, we study counterfactual recourse generation in the LLM-agentic setting as a fixed-budget search problem and propose Comp-MCTS, an agentic tree-search framework that maximizes the yield of unique, oracle-validated counterfactuals under this budget while maintaining favorable quantity--quality trade-offs. Comp-MCTS allocates the budget toward novel intervention directions via LLM-based proposal generation, oracle validation, and compression-guided pruning, in a training-free, oracle-only setting. Experiments on four real-world tabular datasets show that Comp-MCTS substantially outperforms single-candidate LATS-style baselines in the yield of unique, oracle-validated counterfactuals, and offers favorable quantity--quality--efficiency trade-offs against stronger multi-candidate variants: comparable or higher yield at similar or lower oracle-evaluation cost on three of four datasets, plus competitive proximity, sparsity, and novelty.

Quotient Admission Algorithms for Witness-Supported Graph Windows

Yushan Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08698v1 Announce Type: new Abstract: We formulate the quotient admission problem for finite graph-window rows. The input is a finite row set, an admissible evidence map, semantic labels, witness-support hypergraphs, and atom-level admissibility predicates. The output is a quotient decision on evidence atoms, with possible decisions certificate, residual, low-confidence, or blocked. The problem asks for the maximal guard-respecting atom-level decision map that uses no refinement beyond the admissible evidence partition. We prove an atom-union characterization of identifiable classes, give a witness-support hypergraph guard for certificate admission, characterize projected-label conflicts as blocked atoms, and present quotient admission algorithms with correctness, maximality, and complexity guarantees. With explicit evidence vectors and hyperedges, the algorithms run in expected O(B + I + n) time and space by hashing and deterministic O(B + I + n log n) time by sorting under a key-linear comparison model, where n is the number of rows, B is the total evidence encoding length, and I is the total hyperedge incidence size. We also prove a magnitude-only indistinguishability lower bound: any evaluator that observes only residual magnitudes fails on instances whose evidence atoms require different residual decisions after the magnitudes collapse them.

Simultaneous recovery of multiple parameters in nonlocal diffusion equations from internal measurements

Kai Yu, Zhiyuan Li, Yikan Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08699v1 Announce Type: new Abstract: This paper is devoted to simultaneously recovering multiple parameters from internal measurements for nonlocal diffusion equations. The uniqueness of the inverse problem is established by employing the asymptotic behavior of solutions, analytic continuation, the Laplace transform, and properties of analytic functions. For numerical reconstruction, we apply the Levenberg-Marquardt method to obtain a stable approximate solution of the inverse problem. Numerical examples are provided to demonstrate the efficiency of the proposed algorithm and to validate our theoretical findings.

AutoSUT: The Environment Semantics Gap in Structured CTI for Adversary Emulation

Sidnei Barbieri, \'Agney Lopes Roth Ferraz, Louren\c{c}o Alves Pereira J\'unior — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08700v1 Announce Type: new Abstract: Structured Cyber Threat Intelligence (CTI) is increasingly used for adversary emulation, detection evaluation, and cyber range design. However, these workflows still require a target System Under Test (SUT) whose environment is not fully described by public CTI. We measure how much of that environment can be derived from MITRE ATT&CK Structured Threat Information Expression (STIX) bundles. Using the ATT&CK Enterprise, Mobile, and Industrial Control Systems datasets, with CAPEC and FiGHT as comparison datasets, we evaluate platform coverage, software specificity, vulnerability evidence, and deployment compatibility. Platform annotations are common, but software references rarely include versions or Common Platform Enumeration (CPE) identifiers. In Enterprise, 97.6% of software objects lack both, and campaign-level Common Vulnerabilities and Exposures (CVEs) remain sparse. Our results show that ATT&CK-style structured CTI can narrow candidate environments and support lower-bound backend-family assignment, but structured fields alone are insufficient to derive a replay-ready SUT. Profile confusion decreases from 1.3% when one software item is linked to 0% when two are linked. The results identify a boundary between environment details supported by the corpus and the version, vulnerability, and deployment information that must come from external sources. Keeping corpus-supported elements fixed while varying only analyst-authored details yields multiple distinct, campaign-compatible SUTs, including an executable witness exploiting the same real vulnerability. Structured CTI, therefore, constrains but does not uniquely determine the environment, highlighting the need to separate corpus-supported commitments from analyst-authored assumptions in replay-ready emulation.

Is Telehealth Better Used to Treat Patients or Help Other Physicians Treat Patients? An Agent-Based Modeling Study of Healthcare Provision

Michael Chary — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08701v1 Announce Type: new Abstract: Telehealth, the delivery of medical care remotely, is hoped to increase access to specialty services or decrease health care utilization. Physicians can provide telehealth to each other or to patients. Specialists often treat complex patients who can be adequately cared for only in academic hospitals, suggesting that providing specialty services via telehealth will reallocate rather than reduce system utilization. Here I use agent-based modeling to investigate telehealth's effects on clinical outcomes and system utilization in medical toxicology. I found that physician-physician telehealth increased patient health but system utilization did not change. The effects were more pronounced as clinical complexity increased. Physician-patient telehealth increased cost and system utilization but not clinical outcomes. Within the limitations of our approach, these results suggest that telehealth is more cost-effective for improving generalist access to specialist knowledge than in providing care to the public.

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08702v1 Announce Type: new Abstract: Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory-skill relations, and reliance on additional training or high-quality supervision. To address these limitations, we propose ConMem, a relation-aware and training-free framework that enables efficient multi-agent adaptation through cross-experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation-aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation-aware guidance, enabling robust, lightweight adaptation in multi-agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference-time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our codes are available at https://anonymous.4open.science/r/ConMemCode

Analyzing the Correlation Between Hallucinations and Knowledge Conflicts in Large Language Models

Lucrezia Laraspata, Giovanna Castellano, Gennaro Vessio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08705v1 Announce Type: new Abstract: Hallucinations -- factually incorrect or unverifiable outputs -- remain one of the most challenging limitations of Large Language Models (LLMs), especially in knowledge-intensive tasks. One proposed explanation is internal knowledge conflicts arising from fixed, outdated training data. This paper investigates whether internal representations linked to knowledge conflicts correlate with hallucination behaviors in LLMs. Using probing techniques inspired by two prior works, we analyzed activations from hidden, attention, and MLP layers, as well as output logits, across predefined tasks. We probed LLaMA-3-8B on hallucination detection benchmarks and Falcon-7B on a knowledge conflict dataset. Our findings show that, although conceptually related, hallucination activation patterns cannot be fully reduced to or explained by knowledge conflict representations. Nonetheless, probing proves a robust tool across multiple languages and activation types, supporting its role in improving LLM interpretability. This work advances the broader understanding of hallucinations in LLMs and underscores the value of fine-grained analysis of their internal behavior.

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08708v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.

Structuring agentic AI for HPC code modernization

Anthony Marinov, Igor Sfiligoi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08710v1 Announce Type: new Abstract: Modernization of legacy scientific codes is often necessary to keep up with the ever-evolving changes in the compute resource ecosystem. Parallelization and migration from poorly supported software ecosystems are two of the most time-consuming activities in the research software engineering field. This paper presents our experience in the successful, two-phase AI-assisted modernization of NMAP-RKPM, a roughly 60,000-line, 3D explicit solid mechanics physics engine based on the Reproducing Kernel Particle Method (RKPM). We converted this single-threaded, Fortran based MPI application into a OpenMP-parallel C++ based MPI tool in the span of a few months. While Large Language Model (LLM) based tools on their own proved inadequate, we developed a highly structured "hand-holding" agentic AI methodology, like providing manually created examples, ensuring continuous buildability and limiting session scope, that was instead highly effective. The paper provides both the AI-assisted steps that were successful and the problems that we had to overcome, alongside the reasoning behind the chosen path.

Evaluating Operators for Acoustic Wave Simulation Correction

Pascal Tribel, Gianluca Bontempi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08711v1 Announce Type: new Abstract: Correcting numerical dispersion artifacts from Finite Difference solvers is a well-identified challenge in computational wave physics, but existing approaches evaluate only a restricted family of CNN-based architectures and have been applied exclusively to the elastic wave equation. We instantiate the Deep Finite Difference framework on two-dimensional anisotropic acoustic wave propagation, pairing a fourth-order Finite Difference proxy with a Pseudo-Spectral reference over 27,000 heterogeneous velocity fields. We benchmark twelve correction architectures, from linear regression to Fourier Neural Operators, under a unified 10-fold cross-validation protocol.

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08712v1 Announce Type: new Abstract: Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most of the existing augmentation strategies for learning are designed for classification tasks rather than regression, which neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.

The price of incrementality in k-center clustering

L\'aszl\'o Kozma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08713v1 Announce Type: new Abstract: The $k$-center problem is one of the best-studied and most intuitive clustering formulations. It asks, given a set of $n$ points in a metric space, for $k$ of the points to be designated as cluster centers, so that the maximum distance of an input point to its nearest center is minimized. Gonzalez's greedy algorithm from 1985 is a simple and efficient way to find a $2$-approximate solution. The algorithm has the attractive feature of \emph{incrementality}: it outputs the centers one by one, with a guaranteed $2$-approximation for every prefix of the obtained sequence of centers. Incrementality imposes a geometric constraint on how solutions can be built, and it is natural to ask whether this comes at a price in the quality of the solution. It is known that in polynomial time, the approximation ratio of $2$ is best possible, assuming $P \neq NP$. In this paper we show that even with \emph{unlimited} computational power, the factor $2$ cannot be improved, if the solution is required to be built incrementally. The lower bound construction imposes a tradeoff between all $n$ levels of the clustering simultaneously; it was obtained with the help of ChatGPT, an aspect we discuss in Section 3 of the paper.

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08714v1 Announce Type: new Abstract: Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.

Operationalizing Linguistic Methods through Prompt-Engineering Skills: An Automatic Chinese Web Neologism Detection Pipeline

Yufeng Wu, Meichun Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08715v1 Announce Type: new Abstract: We present a method for automatic Chinese web neologism detection that operationalizes traditional linguistic identification principles as prompt-engineering skills. The method has four stages: tokenizer-independent character n-gram candidate generation; dictionary anchoring with a Pointwise Mutual Information pre-filter; a well-formedness skill based on Chinese word-formation principles; and a combined rule and three-way classification skill that distinguishes neologism, entity, and none. Applied to the BAAI CCI 3.0 corpus (267M documents), the method produces 226,959 classified candidates including 4,853 labeled neologisms. To evaluate the method, we develop a per-stage conditional recall decomposition in which the pipeline's strict recall factors mathematically into the product of stage conditional recalls. Applied to Hou (2023) (4,199 entries), the decomposition exposes Stage 1 candidate coverage and Stage 4B LLM semantic judgment as the two bottlenecks (R=41.5% and 60.0% respectively), while intermediate stages are near-lossless. A length-stratified analysis further reveals that the structural well-formedness skill is length-invariant (>= 96.9%) whereas the semantic novelty-classification skill is length-dependent (65.6%/59.0%/44.1% across 2/3/4-character candidates), mapping a current boundary of skill-based linguistic operationalization. We release the method, pipeline outputs, and evaluation protocol as public resources.

Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency

Md Abdullah Al Forhad, Weishi Shi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08718v1 Announce Type: new Abstract: While Deep Active Learning (DAL) effectively reduces human annotation costs, its efficacy is constrained by human annotation errors. This is because the data sampled for active learning is assumed to be highly informative for training. When human annotators introduce errors into this informative data at a certain rate, the active learning performance drops significantly and, in some cases, even exhibits worse outcomes than passive learning. In this paper, we first analyze the impact of human annotation errors in the DAL setting. Then we propose a framework to address the human annotation noise problem for DAL. Informed by human learning patterns, the core idea of our proposed solution involves allocating a portion of the human annotation budget to re-annotate data that has already been labeled. Previous theoretical work suggests that when the model possesses a certain level of ability to identify potentially noisy data, even re-labeling a small fraction of the data can effectively remove noise from the active training set. To achieve this, we implement two active noise sampling strategies to detect noise under different circumstances and allocate a part of the annotation budget to re-annotate these instances. Our approach imbues active learning with a revisiting and introspective behavior. Our experiments demonstrate that, under the same annotation budget, our method is more data-efficient and yields a relatively noise-free annotation dataset in the end.

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08719v1 Announce Type: new Abstract: ''Thinking with Images'' has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of ''Thinking with Images'' can be internalized through Thinking with Imagination: an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a ''Thinking with Images'' reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with ''Thinking with Images'' methods.

A Geometric Measure of Linear Separability for Neural Representations

Yi Wei, Xuan Qi, Furao Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08721v1 Announce Type: new Abstract: Modern neural classifiers commonly rely on linear readouts, yet predictive metrics alone do not characterize the class-wise geometry of the representations on which such readouts operate. We introduce the directional linear separability measure (LSM), a finite-sample diagnostic for one-sided affine separability. For a target class A and a competing set B, LSM searches over affine halfspaces that contain all samples in A and measures the smallest competing-sample intrusion that must remain on the target side, normalized by |A|. The resulting quantity is asymmetric, class-wise, target-normalized, and applicable to finite representations extracted from neural networks. We establish its supporting-hyperplane characterization, relate it to optimal affine classification accuracy, and prove invariance under full-rank linear embeddings. These results separate changes caused by linear reparameterization from those caused by information loss or nonlinear geometric transformations. We also give a penalty-based affine search for estimating class-wise LSM in high-dimensional features, with reported values computed from the original discrete preservation and violation criterion. Finally, we analyze coordinatewise gated nonlinearities as finite-sample geometric operators and empirically use LSM to diagnose class-wise intrusion across common deep-learning components and architectures.

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rod\`a — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08722v1 Announce Type: new Abstract: Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fr\'echet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench

From Text to Discovery: How Are LLMs Reshaping Scientific and Humanistic Research?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08723v1 Announce Type: new Abstract: Large Language Models (LLMs) are rapidly reshaping academic research across the natural sciences, social sciences, and humanities, yet the scientific community lacks a comprehensive, cross-disciplinary account of how these tools are being integrated, what they deliver, and where they fall short. This paper addresses that gap by mapping their current state and outlining an agenda for their responsible integration into scientific research. Our analysis reveals a consistent pattern: LLMs meaningfully accelerate research workflows -- from hypothesis generation and literature synthesis to data analysis and scientific writing -- while introducing serious challenges related to hallucination, reproducibility, dataset bias, and model opacity. Beyond technical limitations, we identify ten underexplored challenges, including the erosion of researcher autonomy, AI-driven confirmation bias, authorship ambiguity, and unequal access to these technologies -- systemic risks that demand interdisciplinary governance frameworks, robust validation standards, and expanded explainability research.

Real-Time and Accurate Collision-Free Teleoperation via Differentiable Constraint-Based Trajectory Planning

Max Grobbel, Tristan Schneider, Daniel Fl\"ogel, S\"oren Hohmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08725v1 Announce Type: new Abstract: In teleoperation, the human operator typically controls only the end-effector pose, which often leads to self-collisions of the manipulator and collisions with environmental obstacles, since joints and links are not controlled individually. A common strategy to mitigate this issue is to enhance the operator's input using optimal-control-based trajectory planning. As derivative-based solvers require differentiable constraints, existing approaches either approximate robots and obstacles with spheres, reducing geometric accuracy, or approximate derivatives, degrading convergence and increasing computation times. We address these limitations by adapting a recent formulation of differentiable collision-avoidance constraints, based on duality in convex optimization, to the teleoperation setting. The robot is approximated with capsules and the environment with polytopes. We compare the resulting trajectory planning method against state-of-the-art techniques in simulation with varying numbers of obstacles and evaluate it on a UR5e manipulator in a real-world teleoperation test. Results show that our approach achieves lower computation times while enabling more accurate obstacle modeling, leading to smoother and collision-free end-effector teleoperation.

Evaluating Multimodal Steganalysis for Split-Payload Audiovisual Steganography

Prateek Paudel, Nitin Jha, Abhishek Parakh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08726v1 Announce Type: new Abstract: The aim of steganography is to hide secret information inside ordinary media so that the existence of communication is hidden rather than encrypted. In audiovisual context, the availability of audio and video streams creates an opportunity to split a payload across these two modes thus, reducing the embedding burden on any single carrier. This paper evaluates whether such split-payload audiovisual steganography can help evade unimodal and multimodal steganalysis under synchronized and asynchronous embedding settings. We create audiovisual samples where the hidden message is divided between the audio and video tracks, and then test how well different detectors can identify them. The single mode detectors performs close to random guessing, thus showing the benefit of this hiding mechanism, while the multimodal model initially appears more effective. However, further checks show that this improvement mostly comes from the video stream, not from a true combined audio-video signal. Overall, the results suggest that splitting the payload across modalities can make detection harder, but multimodal detectors must be evaluated carefully to ensure they are learning the intended signal.

Compositional Approximation Can Strictly Outperform Superpositional Approximation

Dennis Elbr\"achter, Philipp Petersen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08727v1 Announce Type: new Abstract: Many classically studied function classes are known to be approximated optimally by superpositional methods, i.e. with approximants constructed as the linear combination of elements in some dictionary. Here optimality means that the uniform approximation error viewed as a function of the number of parameters used has polynomial decay of the highest order achievable by any parametrized method whose parameters can be encoded as a bit string of length proportional, up to logarithmic factors, to the number of parameters. While compositional methods like neural networks are structurally different, their approximation rates can be made comparable by imposing constraints that ensure such a proportional bit string encoding. In this work we study function classes exhibiting structural properties that limit superpositional approximation rates to be strictly lower than compositional approximation rates. In particular, we construct explicit examples for which there is an arbitrarily large gap.

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08728v1 Announce Type: new Abstract: Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08729v1 Announce Type: new Abstract: Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high fidelity simulators and real world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high fidelity simulators and real world deployment. The project website is available at https://github.com/hanruihua/ir-sim.

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08735v1 Announce Type: new Abstract: Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

Muhammed Rasin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08736v1 Announce Type: new Abstract: We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely-used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary, enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.

Dream-Tac: A Unified Tactile World Action Model for Contact-Rich Robot Manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08737v1 Announce Type: new Abstract: World action models inherit the predictive capability of world models, enabling action generation to be guided by anticipated future observations. However, they rely primarily on vision and often fail in contact-rich manipulation, where critical cues arise from physical interaction. In this paper, we propose Dream-Tac, a unified Tactile-World Action Model that jointly models actions, future visual observations, and tactile dynamics. Specifically, Dream-Tac introduces (i) contact-gated visuotactile fusion to selectively integrate tactile signals and (ii) a contact-aware attention bias to better regulate cross-modal interactions during manipulation. To support real-time deployment, we further design a dual-level acceleration strategy, reformulating the contact-aware bias to preserve the fused attention path during training and introducing cache-based diffusion acceleration at inference, achieving up to 2.9$\times$ faster training and 1.8$\times$ faster inference. Across six contact-rich manipulation tasks, Dream-Tac improves action accuracy by 31.7\% on average, demonstrating the effectiveness of unified visuotactile world modeling.Code is available at https://github.com/LYFCLOUDFAN/Dream-Tac.

Systems-Level Planning and Coordination of Truck-Drone Collaborative Delivery Networks

Didem Cicek, Burak Kantarci — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08738v1 Announce Type: new Abstract: Urban last-mile parcel delivery increasingly relies on heterogeneous fleets whose performance depends on timely coordination, reliable communication, and scalable control. Truck-drone collaboration has emerged as a networked cyber-physical delivery paradigm that combines the payload capacity and range efficiency of trucks with the agility of drones in congested or access-limited urban environments. This paper proposes a layered planning and coordination framework that structures truck-drone collaborative delivery (TDCD) from a systems and control perspective. The framework consists of five interrelated layers: spatial-demand alignment, collaborative delivery configuration, resource and workflow orchestration, performance evaluation, and scalability analysis, providing a unified view of coordination, control, and system-level performance in networked delivery operations. The proposed framework is evaluated using a realistic urban last-mile delivery scenario derived from the 2021 Amazon Last Mile Routing Research Challenge dataset. The case study demonstrates how coordinated truck-drone operation, enabled by structured task orchestration and inter-agent synchronization, improves end-to-end system efficiency under operational constraints. Results show a 42.4% reduction in total delivery time and a 44.2% reduction in energy consumption compared to a conventional truck-only delivery model. The scalability analysis further highlights how coordination gains persist as system size increases, and shows the importance of efficient control and communication in heterogeneous delivery networks.

The Minimal Retroreflective Microfacet Model

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08739v1 Announce Type: new Abstract: We present the Minimal Retroreflective Microfacet (MRM) model, which turns any existing microfacet BSDF into a physically plausible retroreflective one by a single substitution: replacing the view direction with its reflection about the surface normal before evaluating the standard model. Based on the previously published back-vector formulation, MRM requires only minimal code changes and has been adopted in the OpenPBR and MaterialX material standards. We prove reciprocity and energy conservation under the assumption of a reflection-symmetric normal distribution function (NDF), which holds for all commonly used distributions, and validate the model against measured retroreflective material data.

Safe, Fluent and Acceptable Motion Generation and Execution for Human--Robot Interaction in Manufacturing Environments

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08741v1 Announce Type: new Abstract: Robots operating in human environments must not only ensure physical safety but also exhibit behaviors that are understandable, fluent, and acceptable to human partners. This paper investigates motion generation strategies that combine safety guarantees with interaction quality considerations, such as motion smoothness and human comfort. While the design of robots capable of ensuring safety in shared human-robot environments has enabled closer and more advanced forms of interaction, these new proximity-based tasks require moving beyond purely technical considerations. In particular, robot behavior must also be addressed from psycho-cognitive and social perspectives. In this context, we argue for the relevance of integrating social-aware motion control into robotic systems. First, we identify the motion parameters that influence human perception and operator experience. Then, we implement a Model Predictive Control (MPC) framework that generates four distinct socially-informed robot behaviors. Finally, we conduct a user study to evaluate and validate these behaviors and assess their social impact on non-expert participants. The results demonstrate that variations in robot behavior significantly affect the perceived social acceptability of the system. These findings highlight the importance of incorporating human-centered considerations into motion generation strategies for robots operating in shared environments.

AUCp: Pseudo-AUC for Inference Model Selection with Unlabeled Validation Data in Abnormality Detection

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08742v1 Announce Type: new Abstract: Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalities from normal data by learning to reconstruct normal-only data alleviates the reliance on labeled datasets. However, many studies, even if unsupervised, rely on a labeled validation set to select the best model for inference from multiple training iterations. For many diseases labeled data are unavailable and substantially time consuming to obtain. To address this, AUCp - a novel metric that supports abnormality detection for unsupervised and self-supervised methods is proposed. Instead of evaluating the realism of reconstructed images to select the best of model for inference, it focuses on actual detection performance and without requiring an annotated test set. Assuming the pseudo ground truth of all unannotated samples in the test set as abnormal/positive and using traditional AUC calculation, AUCp scores are derived. Given a large and representative training set of normal samples, we show mathematical and empirical evidence that model selection using AUCp scores improves disease detection in terms of unsupervised and self-supervised methods over conventional metrics. Using two unsupervised methods for neurologic disease detection and self-supervised methods on diverse datasets, our results demonstrate that the AUCp score effectively identifies the optimal model for inference, significantly enhancing abnormality and disease detection. The corresponding implementations are available in https://github.com/mahfuzmohammad/AUCp.

Guided Discovery of New Behaviors using Diffusion Policies

Dian Yu, Sebastian Sanokowski, Majid Khadiv — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08743v1 Announce Type: new Abstract: Diffusion models have become a powerful tool for generative modeling in robotics, with diffusion policies excelling at modeling multimodal action-trajectory distributions. However, when demonstrations are limited, standard sampling often reproduces dominant behaviors while neglecting valid but rare modes, limiting the discovery of novel solutions. Existing approaches, such as guidance methods or combining reinforcement learning with diffusion, either push samples into infeasible regions or struggle to escape local minima, failing to systematically uncover diverse behaviors. To address these challenges, we propose a framework that combines Feynman-Kac correctors with a novel guiding potential that systematically guides diffusion policy samples towards promising yet underrepresented samples. These trajectories are refined using sampling-based trajectory optimization and reincorporated into the training set to retrain the diffusion policy. Our method effectively mines and repairs novel trajectories, enabling the systematic discovery of diverse and executable behaviors. We demonstrate the effectiveness of our framework across a range of manipulation environments, consistently discovering new behaviors.

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

Ayaan Choudhury, Preet Savalia, Anirudh Pydah, Avinash Sharma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08744v1 Announce Type: new Abstract: Global LiDAR localization is a fundamental task for autonomous navigation systems. Recent methods perform Scene Coordinate Regression (SCR) and achieve superior accuracy over Absolute Pose Regression (APR) solutions by predicting dense 3D world coordinates. However, SCR approaches introduce two major bottlenecks: severe computational inefficiency from processing raw 3D geometries and significant performance degradation under varying sensor viewpoints. To address these limitations, we present MB-Loc, a lightweight and viewpoint-robust SCR framework. Instead of relying on heavy 3D convolutions, we project the input LiDAR scan into a 2.5D Multi-planar Bird's-Eye View (BEV) representation. By slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes, MB-Loc retains essential 3D geometric structures while exploiting the computational tractability of standard 2D CNNs. To handle the inherent sparsity of outdoor LiDAR, we introduce a KL-regularized latent bottleneck that explicitly models spatial uncertainty without injecting stochastic noise. Finally, to ensure rotation robustness, we apply 3D spatial augmentations prior to planar projection, forcing the network to implicitly learn viewpoint-invariant features. We perform extensive experiments on the publicly available NCLT dataset and demonstrate that our proposed method outperforms the current state-of-the-art. Operating at real-time inference speeds, MB-Loc significantly outperforms traditional 3D-SCR architectures in computational efficiency.

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

Zhe Li, Bernhard Kainz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08745v1 Announce Type: new Abstract: Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer screening and digital pathology analysis. However, the susceptibility of neural networks to adversarial perturbations raises safety concerns for reliable deployment in clinical practice. In histopathological images, this challenge is exacerbated by the difficulty of distinguishing high-frequency adversarial noise from subtle and diagnostically relevant tissue structures. To address this issue, we propose Stain-Aware Wavelet Regularization (SAWR), an adversarial purification framework that leverages multi-level wavelet-domain regularization based on Haar transform to hierarchically disentangle adversarial perturbations from diagnostic structural information. This spectral constraint is further extended to individual histological channels, enabling stain-specific frequency regulation consistent with the biological properties of Hematoxylin and Eosin. When integrated into an instant purification framework, SAWR improves adversarial robustness by up to 10.69\% over the baseline approach, while maintaining texture and spectral fidelity under adversarial perturbations.

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

Kevin Krahn, Eric Fosler-Lussier — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08748v1 Announce Type: new Abstract: We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.

New Codes from Cyclic and Negacyclic Codes of Even Length over $\mathbb{Z}_4$

Nuh Aydin, Mohamed O. Belghith, Godwin Idowu, Trang T. T. Nguyen, Long B. Tran — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08750v1 Announce Type: new Abstract: This paper uses theoretical results previously established in the literature to design search algorithms to find new linear codes over $\mathbb{Z}_4$ from cyclic and negacyclic codes of even length. As a result of these searches, we have found 2500 new cyclic codes and 730 negacyclic codes. These new codes exhibit improved parameters compared to previously known codes. Additionally, we have obtained binary quantum codes with good parameters from such $\mathbb{Z}_4$ codes.

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08751v1 Announce Type: new Abstract: Accurate quantification and uptake measurement in PET are critical for assessing disease progression and supporting clinical decision-making. While high-count PET provides reliable image quality, the associated radiation dose and prolonged acquisition remain significant clinical concerns, motivating the adoption of low-count protocols. Diffusion-model-based methods have demonstrated strong potential for restoring low-count PET to near high-count quality, but their iterative sampling procedure becomes prohibitively expensive when applied to high-resolution 3D PET volumes, introducing substantial inference latency that limits practical clinical deployment. To address these challenges, we propose a training-free Global-Local Skipping Strategy that accelerates diffusion model-based 3D PET denoising while simultaneously improving reconstruction quality. The proposed method is plug-and-play and directly applicable to pre-trained diffusion models without retraining or architectural modification. Specifically, we introduce: (i) a global denoising step skipping strategy that initializes the reverse diffusion process from an intermediate denoising step using a noise-consistent transformation of the low-count input, substantially reducing the number of required denoising steps; and (ii) a local feature reuse shortcut that reuses slowly-varying high-level U-Net features across neighboring denoising steps, further reducing per-step computation while preserving image fidelity. We evaluate the proposed approach on multiple PET tracers from in-house and public datasets, including 18F-FDG PET, 68Ga-DOTATATE PET, and 18F-PSMA PET, demonstrating consistent acceleration of over an order of magnitude alongside improved or comparable reconstruction performance relative to the full-step baseline. Blinded reader studies further confirm enhanced clinical confidence and perceived diagnostic quality.

Co-Evolving Skill Generation and Policy Optimization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08755v1 Announce Type: new Abstract: Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Hong Guo, Nianhui Guo, Weixing Wang, Jona Otholt, Christoph Meinel, Haojin Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08761v1 Announce Type: new Abstract: W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware indicator: the W4A4-g128 kernel yields $2.0$--$2.5\times$ speedup on RTX~3090 ($\rho=16$) yet degrades to $0.43$--$0.47\times$ on A100 ($\rho=64$) in compute-bond scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build \textbf{APEX4}, which co-designs pure INT4 GEMM kernels with $\rho$-aware granularity adaptation to mitigate the CUDA Cores dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0\%--4.4\% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers up to $1.66\times$ end-to-end speedup on L40S ($\rho=8$), and $1.78\times$ on RTX~3090 ($\rho=16$), $2.09\times$ on A40 ($\rho=16$), while recovering A100 ($\rho=64$) to $1.20$--$1.40\times$ via the mixed-granularity mode.

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08765v1 Announce Type: new Abstract: Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves real-world occluded manipulation success rates by $26.7$ percentage points over the strongest implicit visuo-tactile baseline, suggesting its improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io

Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions

Blanka K\"over, Alexandra Butoi, Anej Svete, Michael Hahn, Ryan Cotterell — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08768v1 Announce Type: new Abstract: Transformers consistently fail to learn certain simple functions that are provably expressible with specific parameter settings. This gap between learnability and expressivity is particularly prominent for sensitive functions -- functions whose output is likely to change if a single bit of the input is flipped -- for example, PARITY. While prior work has established that transformers exhibit a bias toward functions with low average sensitivity, the precise mechanism underlying this bias remains poorly understood. To shed light on this phenomenon, we study the geometry of transformers' parameter space. We show that sensitive functions -- even when representable -- occupy a vanishingly small region that random initialization is very likely to miss. Specifically, we shift the focus from average sensitivity to the full sensitivity profile -- the distribution of sensitivity values across all inputs -- and prove that randomly initialized transformers almost surely compute functions which have low-sensitivity strings. Consequently, any function that lacks such strings is provably unlearnable.

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08769v1 Announce Type: new Abstract: Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08770v1 Announce Type: new Abstract: The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.

Non-Uniform Codebook Design for Optical IRS-Assisted VLC Systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08774v1 Announce Type: new Abstract: Optical intelligent reflecting surfaces (OIRS) can improve the coverage of indoor visible light communication (VLC) systems, however, practical deployment requires a finite offline codebook to avoid repeated real-time optimisation of mirror orientations. A uniform codebook with fixed angular steps does not provide uniform coverage on the user plane, because the mapping from steering angles to reflection locations on the user plane is nonlinear. To address this problem, this paper proposes a geometric-optics-based non-uniform codebook design for OIRS-assisted VLC systems. The proposed method constructs an individual codebook for each IRS element according to its geometric position, so that the reflected beams are distributed more uniformly over the user plane. The codebook accuracy is evaluated using the Frobenius norm of the channel error matrix. Simulation results show that the proposed design provides more uniform spatial mapping with fewer codewords than the uniform codebook, and that the sweep-angle resolution has a stronger effect on the codebook accuracy than the tilt-angle resolution.

Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks

Raktim Gautam Goswami, Prashanth Krishnamurthy, Yann LeCun, Farshad Khorrami — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08775v1 Announce Type: new Abstract: Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition functions within Model Predictive Control (MPC) frameworks to solve various control tasks. When applied to robotics, however, they are limited to single-stage tasks such as reaching or grasping, and struggle with multi-stage ones that demand complex sequential planning. In this work, we introduce WorldDP, a world model framework designed for multi-stage robotic manipulation. Our hierarchical approach utilizes a high-level world model as a transition function to optimize for feasible subgoals during runtime, which are subsequently reached by a low-level Diffusion Policy. To further aid in learning dynamics and planning, we incorporate object-centric representations that decouple environmental entities and enable us to plan sequentially with respect to each. Evaluated across several robotics benchmarks, WorldDP consistently outperforms existing baselines, validating that coupling the world model's physically grounded planning with diffusion policy's efficient execution yields superior multi-stage performance.

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08777v1 Announce Type: new Abstract: Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08779v1 Announce Type: new Abstract: Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable sub-optimum performance or even training collapses. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch), stemming from the disparate underlying engines and architecture. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. Then, we further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. According to such findings, we formulate this problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), where reward maximization is coupled with a constraint that aligns training-Inference behavior, achieving stable dual-objective optimization. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of 8B dense model (Qwen-3-8b) and 30B Mixture-of-Expert model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm, where LLMs can be optimized in high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08780v1 Announce Type: new Abstract: Existing zero-shot video editing methods rely on pre-trained diffusion models, successfully achieving spatial control and basic temporal consistency but fundamentally fail to preserve the video's original temporal structure.This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08781v1 Announce Type: new Abstract: Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further demonstrate that the Anti-Dilution Gate improves stroke preservation and reduces perceptually significant binarization errors.

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08788v1 Announce Type: new Abstract: Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.

RAILS: Verification-Native Clearing For Agentic Commerce

Adrian de Valois-Franklin, Alex Bogdan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08790v1 Announce Type: new Abstract: Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.

The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

Wendy K. Tam — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08792v1 Announce Type: new Abstract: Large language models are rapicly replacing search engines as the primary interface between people and information. Unlike search engines, which retrieve existing content, LLMs generate novel text shaped by internal representations learned during training. Here we show that partisan political identity is encoded in the model's activation space, and that this direction directly shapes generation. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model. We identify a single geometric axis at layer 18 that separates Republican from Democratic text with an AUC of 0.945 and a Cohen's d of 1.94, and use sparse autoencoders to decompose that axis into interpretable partisan features. Causally intervening along this axis, ablating or amplifying the partisan component mid-generation, produces systematic shifts in the model's output. We witness stance reversals, register shifting, and structured fabrications of authority. Our results demonstrate that partisan bias in language models is not a vague emergent property but a learned geometric feature that can be precisely located and steered. Partisan bias is not a bug to be patched, but a structural property of how these models encode information about their users. As LLMs displace search engines as the interface to knowledge, understanding that product design (and its consequences) will be essential for navigating the legal, social, and political transitions from an information ecosystem that is curated to one that is generated.

AI-Augmented Closed-Loop Quality Engineering: A Reference Architecture for Continuous Software Quality Intelligence

Dimple Bajaj — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08793v1 Announce Type: new Abstract: The quality of software engineering is still under a challenge due to disjointed processes between requirements, testing, and production, which hinders the opportunity to implement quality strategies in consecutive releases. Existing approaches tend to be fixed-model or single-optimization approaches and lack production feedback learning mechanisms. The paper at hand proposes a closed-loop reference architecture of continuous software quality intelligence with AI enhancements. The model synthesizes requirement feature mining, risk-based test prioritization, defect prediction, and production incident analysis as an element of a feedback-based pipeline. A limited feedback learning model is introduced that is used to propagate the production signal-based on defect severity and incident impact- to the following release to ensure stability, and the time. The method is evaluated using a semi-synthetic test dataset of 4,500 requirements, 27,049 test cases, 13,089 defects and 7,841 incidents in six release cycles. The experimental results show that the proposed system reduces the defect leakage by 0.19 to 0.13, increases the effectiveness of the detection system to 0.72 to 0.84, and shortens the test execution by up to 35 percent compared to the non-adaptive baselines. The changes are stable release to release. The findings indicate that through the integration of feedback-based learning in a closed-loop architecture, it can be continued to enhance quality process, which offers practical foundation of adaptive quality engineering of software.

PairWise Image Finder: An Open-source Tool for Finding Visually Aligned Street-Level Image Pairs for Urban Perception Studies

Jussi Torkko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08795v1 Announce Type: new Abstract: Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to understand changes in scenes across the years. However, metadata alone is often insufficient to reliably find visually aligned image pairs. This study introduces the PairWise image finder, a tool that integrates feature detection and matching, supported by semantic segmentation masks to quantify the visual alignment of two images of varying time periods. The tool outputs the share of matched key features, the matched feature distance and coverage, and the alignment of semantic masks, which enables the user to filter image pairs depending on the alignment quality and use case. The visually aligned pairs derived from the tool can be used to accurately study explicit longitudinal change and help reduce manual effort for perception studies. The usability of the tool is demonstrated through a comparison of longitudinal changes, highlighting the importance of perspective when quantifying changes. The proposed method provides a scalable and open tool for researchers and stakeholders to find high-quality image pairs for urban analysis, perception and related applications.

A Non-Overlapping Schwarz Hybrid Finite Element-Neural Operator Framework for Solid Mechanics on Irregular Domains

Wei Wang, Abhinav Gupta, Haihui Ruan, Somdatta Goswami — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08796v1 Announce Type: new Abstract: Finite element (FE) methods are the benchmark for solid mechanics simulations, yet their computational cost becomes prohibitive for problems with localised nonlinearities, fine-scale features, or long-time dynamic evolution. In our earlier FE-neural operator (FE-NO) hybrid framework [1], physics-informed deep operator networks were coupled with FE solvers through overlapping domain decomposition with Dirichlet-Dirichlet interface exchange, accelerating intensive subdomains while preserving FE fidelity elsewhere. Two limitations remained: the overlapping formulation required redundant interface computations that increased inner Schwarz iteration counts, and the convolutional feature extractor restricted the NO subdomain to structured grids, precluding irregular geometries. A non-overlapping Schwarz alternating method with Neumann-Dirichlet interface exchange replaces it, transmitting traction from the NO to FE rather than displacement. This eliminates the overlap layer and reduces inner Schwarz iterations while maintaining bounded error accumulation across all tested time horizons. For arbitrarily shaped subdomains, a Point-DeepONet operates on unstructured FE point clouds without interpolation, extending it to non-convex and irregular geometries. Strain and stress operators are derived analytically from the displacement operators via kinematic equations, rather than as independent networks, reducing trainable parameter sets while enforcing mechanical consistency by construction. The framework is validated on three benchmarks: static linear elasticity, quasi-static hyperelasticity, and elastodynamics with regular and irregular geometries. These results establish a non-overlapping FE-NO coupling paradigm that is geometry-flexible, parameter-efficient, and convergence-stable, providing a pathway for hybrid physics-based and operator-learning solvers in large-scale dynamic solid mechanics.

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08797v1 Announce Type: new Abstract: Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08800v1 Announce Type: new Abstract: In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08802v1 Announce Type: new Abstract: Standard flow and diffusion pre-training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new-to-nature designs, assigned negligible probability under, and thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non-negligible probability. This allows to introduce a new learning principle for out-of-distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre-training method that employs verifier feedback to expand a pre-trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish to our knowledge first-of-their-kind statistical learning guarantees for out-of-distribution flow modeling, analyzing generable set expansion as a local-to-global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out-of-distribution generative modeling metrics across small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre-trained model, significantly outperforming widely adopted synthetic flow pre-training methods.

Some Essential Constructive Foundations for Systems and Control

Pavel Osinenko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08803v1 Announce Type: new Abstract: This work develops several constructive foundations for systems and control within Bishop-style constructive mathematics. For an engineer, the guiding principle is that an object claimed to exist, such as a trajectory, an optimal control law, a selector, or a viable solution, should come with finite data and an operation computing approximations to any prescribed precision. The style remains close to classical analysis, but existential statements are organized so that their computational content is visible. The paper begins with elementary geometric data in finite-dimensional Euclidean spaces: blocks, multiblocks, representable sets, regular functions, and certified integrals. This set-first integration route is meant to complement, rather than replace, abstract constructive integration theories such as Daniell-type or integration-space approaches. The developed apparatus is then applied to a constructive functional extremum-value theorem, selector extraction for multifunctions, Filippov-type and viable solutions of differential inclusions, regular probability densities, controlled Markov chains, and empirical density certificates. A short account of resolvent projectors and linear stability is included for completeness.

Q-Delta: Beyond Key-Value Associative State Evolution

Sumin Park, Seojin Kim, Noseong Park — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08804v1 Announce Type: new Abstract: Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.

Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing

Dimple Bajaj, Deepak Khetan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08806v1 Announce Type: new Abstract: Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly used in autonomous software testing; however, AI-generated test artifacts often suffer from hallucinations, compliance violations, security risks, and limited explainability. To enhance the reliability, transparency, and trustworthiness of AI-generated testing artifacts, this research introduces the concept of Governance-Aware Autonomous Testing Framework (GATF). The framework extends the autonomous testing lifecycle with governance validation, explainability analysis, probabilistic risk assessment, compliance monitoring, as well as audit governance. Experiments were performed with Defects4J and PROMISE software engineering datasets. The proposed framework successfully reduced the governance-related risks by 89.6% and demonstrated 94.3% accuracy in governance, 96.5% artifact reliability, 94.2% compliance accuracy, and 90.8% explainability performance. The results show that autonomous testing systems that are governance-aware can significantly enhance the reliability, transparency, and operational security of autonomous testing systems in comparison to conventional AI-based testing systems. The proposed architecture is scalable and reliable and provides a safe environment for software testing.

A Classroom Study of LLM-Generated Feedback Intervention in Introductory Programming

Hasnain Heickal, Andrew Lan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08807v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to provide automated feedback in introductory programming courses, yet empirical evidence from authentic classroom deployments comparing different feedback modalities remains limited. In this work, we present a large-scale classroom study in which AI-generated feedback was deployed through a randomized protocol in an introductory Python programming course. Students received one of three feedback conditions on incorrect submissions: natural language hints, AI-generated failing test cases, or no AI feedback. We release the resulting dataset, ProgFeed, which captures 6,693 submissions from 215 consenting students across 17 labs, including feedback conditions, execution-based performance measures, and fine-grained temporal information. Using this data, we analyze learning trajectories, feedback quality, and submission behavior over repeated attempts. We find that natural language feedback is significantly associated with higher completion rates and faster convergence to correct solutions. Test case feedback, by contrast, exhibits heterogeneous effects that depend critically on feedback validity. Our results suggest that the form of AI-generated feedback matters, and that evaluating feedback quality -- not just its presence -- is essential for understanding its pedagogical impact.

Energy Storage as a Multi-Use Asset: Applications Across the Power System

Fabrizio Sossan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08808v1 Announce Type: new Abstract: The energy transition in power systems requires flexible assets to offset renewable generation variability across multiple time scales, while supporting the integration of renewables and the electrification of demand without requiring costly grid reinforcement. Energy storage occupies a unique position among these assets: depending on the technology, it can provide short-duration grid services at high ramping rates, such as frequency regulation and voltage support, longer-duration functions such as intra-day peak shaving, or inter-seasonal energy buffering. This multi-service character, combined with the declining costs of energy storage technologies (most notably that of battery energy storage systems), is central to the economic viability of storage investments. The value of a given installation depends strongly on its grid connection point and intended use case: an asset-coupled battery serving a consumer or generation plant faces a different service landscape, and therefore a different business case, than a network-coupled system operating as an independent grid resource. This paper presents a structured taxonomy of grid-connected energy storage applications, discusses the principal application domains, and describes the key challenges that must be addressed to integrate storage effectively into power systems. Services are discussed with special emphasis on the Swiss regulatory context. Finally, the STORE flagship project supported by the Swiss Innovation Agency (Innosuisse), where some of the critical challenges of energy storage integration in power grids are addressed, is introduced.

Continuous Language Diffusion as a Decoder-Interface Problem

Zhicheng Du, Lan Ma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08810v1 Announce Type: new Abstract: Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: denoising succeeds when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin final-token basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 token-embedding lookup recovers $93$--$96\%$ of native decoder decisions, and a single linear readout reaches $97.9\%$ agreement at 32k samples, leaving about a 1.1 perplexity gap in a structured residual tail. A conservative margin gate exits $17$--$27\%$ earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.

Data Architectures and their Technical Requirements (DATER)

Sayed Hoseini, Christoph Quix, Stefan Decker — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08811v1 Announce Type: new Abstract: Modern organizations generate and consume massive volumes of heterogeneous data at high speed. This requires a continuous development of new techniques for more efficient and reliable data management. Designing appropriate data architectures has therefore become a strategic necessity, as they shape how data is integrated, governed, and made available for analytics and decisionmaking. This paper introduces a conceptual framework - Data Architectures and their Technical Requirements (DATER) - to systematically describe and evaluate data architectures based on technical requirements. Six modern architectures are examined: data warehouse, (semantic) data lake, data lakehouse, data fabric, and data mesh. Each is analyzed by historical context, defining features, and conformance to DATER dimensions. The study supports researchers and practitioners in navigating architectural paradigms, clarifying overlaps, and highlighting strengths, limitations, and use-case suitability.

Aperon Technical Report: Hierarchical No-Pointer Tangent-Local Search for High-Dimensional Approximate Nearest Neighbors

Yong Fu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08813v1 Announce Type: new Abstract: We present HNTL (Hierarchical No-pointer Tangent-Local), the core vector indexing and candidate generation framework of the Aperon vector memory system. Proximity graphs (e.g., HNSW) incur a heavy pointer tax in memory overhead and induce irregular memory accesses that stall CPU pipelines. HNTL resolves this by partitioning the high-dimensional space into local, coherent grains, representing vectors as low-dimensional coordinates on local tangent spaces, and scanning them sequentially using a pointerless Block-SoA (Structure-of-Arrays) layout. On anisotropic manifold data (d=768, N=10,000), local PCA captures 96.3% of the variance, allowing HNTL to achieve a final Rerank Recall@10 of 1.0000 with a candidate pool size of only C=20 vectors. Hardware profiling via Apple kperf CPU Performance Monitoring Unit (PMU) counters demonstrates a 3.61x speedup (4.137 ns/vector vs. 14.951 ns/vector) for our NEON auto-vectorized C++ Block-SoA scan engine over standard pointer-chasing graph traversals, driven by a 3.59x IPC (Instructions Per Cycle) and near-zero L1/L2 data cache misses.

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

Sumin Park, Noseong Park — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08814v1 Announce Type: new Abstract: Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of input representation, which often leads to unstable routing. We propose STAR, a Structure Aware Routing that rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on controlled synthetic setup and large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08815v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization, which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer, with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently-wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are decreased.

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

Jake Fawkes, Liam Hodgson, Jason Hartford — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08816v1 Announce Type: new Abstract: Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.

Low-Variance Randomised Numerical Linear Algebra for Finite Element Simulation

N. Polydorides, Y. Wu, . H. Noori, H. Vandierendonck, R. Woods — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08817v1 Announce Type: new Abstract: We present a low-variance randomised numerical linear algebra approach for multi-query finite element systems arising from parametric elliptic partial differential equations with applications to digital twins and online model calibration. The method relies on Galerkin subspace projection for reducing the dimensionality, and then combines parameter-oblivious leverage-score Bernoulli sampling with a control variates scheme to yield a reduced-variance `forward' sketch and an invertible `inverse' sketch that are then fused to a single efficient regularised estimator. Effectively, this reduces the computational cost in computing the projected system of equations while preserving the structure, stability, and accuracy of the underlying FEM formulation. We derive probabilistic bounds for the sketching error, invertibility, and estimator variance, and then validate the method on large-scale example problems. The results show that when the parameter fields do not vary too sharply, the synergy of control variates together with the sketch fusion can largely offset the loss incurred by the sub-optimal parameter-oblivious sampling. In this regime, our method achieves substantial savings in time, memory, and communication while maintaining accuracy levels that are acceptable for scientific simulation.

Unstructured Mesh Tools for Fusion Energy System Design

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08822v1 Announce Type: new Abstract: The execution of accurate simulations of fusion energy systems requires the appropriate representation of critical component geometries as well as the coupling of complex fusion physics codes with one another and with engineering analysis tools. This paper examines the challenges of creating simulation workflows that fully leverage existing fusion research codes while integrating them with commercial computer-aided engineering (CAE) software. Key areas addressed include: (a) the construction and meshing of analysis geometries taking full advantage of available geometric modeling and meshing technologies; (b) the effective coupling of fusion physics and engineering analysis codes; and (c) the support for simulation workflows that couple particle and continuum modeling methods.

Syntax-driven Incremental Program Verification of Matching Logic Properties

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08824v1 Announce Type: new Abstract: Incrementality is a fundamental design principle to master the complexity of large, long-lived software systems. This principle has been embraced by agile development processes and it lays at the base of continuous software evolution. A major challenge in this context is to incrementally re-verify the correctness of software artifacts after every change, focusing the verification efforts only on the parts affected by the change. We present an approach to the incremental verification of programs written in KernelC, annotated with properties expressed in matching logic. The approach is based on a syntactic-semantic framework that enables analyzing code chunks in isolation so that, after a change to a program fragment, only the part whose semantics is affected by the change is re-processed. This property is obtained by expressing the language syntax through an operator precedence grammar and by formalizing its semantics through a synthesized attribute schema. We have implemented our technique in a prototype tool and experimentally evaluated its effectiveness. The results show that our approach does not penalize the efficiency of formal verification and can outperform program re-verification after changes, depending on the presence and type of annotations, as well as the position of the change and the program structure.

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

Lanz Anthonee A. Lagman, Prospero C. Naval Jr, Reinabelle C. Reyes — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08826v1 Announce Type: new Abstract: Image data regarding galactic morphology is expected to increase both in quantity and quality for the next foreseeable years; thus it is important to explore which deep learning architectures adapted for image classification tasks are cost-effective. Residual and Inception networks are ideal for exploring classification convolutional neural networks (CNNs) due to their computational efficiency, achieved through techniques such as residual connections and parallelized inception modules, enabling deeper networks without excessively increasing computational complexity. In this work, we analyze the performance of ResNet101 and InceptionV4 on a spatially-augmented Galaxy10 DECals dataset. Retaining the ten-class classification of galaxies, we modify the image count of each class. We find that ResNet101 and InceptionV4 models achieved accuracies of $\sim$ 90%, comparable with reported performance in the literature. In terms of performance metrics, ResNet101 is superior to InceptionV4. Our results indicate that either of these CNN architectures could serve as a robust foundation for specialized pipelines for classification of galaxy images from upcoming surveys.

Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08828v1 Announce Type: new Abstract: Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: We identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via IL, and leverage residual RL to perform local finger-level adaptations to ensure for robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.

Flexible Coupler Antenna Enhanced Wireless Communication: Modeling and Coupler Position Optimization

Xiaodan Shao, Chuangye Shan, Yunlong Du, Junling Li, Rui Zhang, Cheng-Xiang Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08829v1 Announce Type: new Abstract: This paper proposes a novel flexible coupler antenna (FCA) that translates passive coupling elements around a fixed-position active antenna to reshape the induced currents on the passive elements for radiation. A new form of mechanical beamforming is achieved by moving only the passive coupling elements while keeping the active antenna stationary. The proposed design significantly reduces the antenna and radio-frequency (RF) chain costs of conventional active array beamforming with low mechanical control complexity and energy consumption. For the purpose of exposition, we consider a point-to-point communication system with one FCA at the transmitter and one fixed antenna at the receiver. Specifically, based on multi-port circuit theory, we establish both the line-of-sight (LoS) and multipath channel models and derive the mechanical beamforming weights of the passive couplers as functions of their positions. Then, we formulate a new problem to maximize the received signal-to-noise ratio (SNR) by optimizing the positions of passive couplers at the transmitter, subject to coupler movement and transmit power constraints. Solving the resulting problem is inherently difficult because coupled channel and mechanical beamforming create non-linearity in the objective function.To tackle this problem, we propose an efficient block-coordinate conditional gradient method to search for the best positions of all passive couplers by sequentially optimizing the position of each coupler with those of the other couplers fixed in an iterative manner.Simulation results demonstrate that the proposed system significantly outperforms benchmark schemes in terms of achievable rate, but with significantly reduced active antennas and RF chains.

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

Ting Wang, Yuanjie Shi, Yan Yan, Huan Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08831v1 Announce Type: new Abstract: Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP's flexibility and its post-hoc limitation, we propose an \emph{Inference-Time Conformal Reasoning (ITCR)} framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.

Instrumental convergence and power-seeking

David Thorstad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08832v1 Announce Type: new Abstract: Recent years have seen increasing concern that artificial intelligence may soon pose an existential risk to humanity. One leading ground for concern is that artificial agents may be power-seeking, aiming to acquire power and in the process disempowering humanity. I show how the argument from power-seeking rests on a strong version of a claim known as the instrumental convergence thesis. I explore leading defenses of the instrumental convergence thesis and argue that none establishes the thesis in a strong enough form to ground the argument from power-seeking. I discuss implications for longtermism, the governance of artificial intelligence, and the methodology of studying risks posed by artificial agents.

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08833v1 Announce Type: new Abstract: We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We for the first time merge these observations through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally showing that these weights can improve generative performance by lowering FID by 4.7%, increasing Inception Score by 2.2% and improving GenEval scores by 2.5% using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and less cartoonish appearance of generated images.

Adaptive Model Predictive Control of Nonlinear Generic Urban Air Mobility Using Linear Parameter-Varying Systems

Tri Ngo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08836v1 Announce Type: new Abstract: This paper presents an adaptive model predictive control (MPC) framework for nonlinear urban air mobility (UAM) vehicles operating across the full flight envelope. The proposed approach leverages a linear parameter-varying (LPV) representation to update the predictive model online, enabling accurate capture of strongly nonlinear and time-varying dynamics associated with distributed electric propulsion (DEP) eVTOL aircraft. To systematically address the high-dimensional and coupled nature of MPC tuning, a multi-objective evolutionary optimization strategy based on NSGA-II is employed, incorporating proper normalization of states and control inputs to ensure balanced weighting and meaningful exploration of the design space. The resulting controller explicitly accounts for actuator constraints and enables reconfigurable control allocation for fault-tolerant operation. The framework is evaluated in nonlinear simulations using NASA's Generic Urban Air Mobility (GUAM) model and benchmarked against a robust servomechanism linear quadratic regulator (RSLQR). Results demonstrate that the proposed adaptive MPC achieves improved trajectory tracking and enhanced robustness under both nominal conditions and actuator degradation scenarios, including partial motor failure, while maintaining constraint satisfaction throughout all flight regimes.

Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs

Sayed Erfan Arefin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08840v1 Announce Type: new Abstract: Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We present a large-scale, execution-grounded evaluation of 9 openly accessible LLMs specialized for coding on 2,707 free LeetCode problems across 12 programming languages. Our corpus contains 325,343 problem-model-language jobs, each linked to prompt metadata, extracted code, LeetCode execution outcomes, and static-analysis signals. The results show that current open models remain far from the human acceptance reference: the best model, Yi-Coder-9B-Chat, reaches 23.64% mean correctness, compared with a 57.2% human acceptance baseline. Rankings are also slice-dependent: Qwen2.5-Coder-14B-Instruct is strongest on hard problems and distinct-problem coverage, while Gemma-2-27B-IT achieves the highest all-language lint pass rate. Failure analysis shows that compile errors account for 63.25% of non-accepted best submissions, indicating that many failures occur before semantic correctness can be tested. Static quality further diverges from functional correctness. Together, these findings show that multilingual, artifact-preserving evaluation reveals tradeoffs hidden by single-language or single-metric leaderboards.

ZIPP:Zero-shot Image Personalization from Personas

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08841v1 Announce Type: new Abstract: Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

Moshe Mandel, Shlomo E. Chazan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08843v1 Announce Type: new Abstract: We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

Xiangzhong Liu, Xihao Wang, Hao Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08844v1 Announce Type: new Abstract: As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08847v1 Announce Type: new Abstract: Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems

Sara Jaber, S. M. Hassan Mahdavi, Neila Bhouri, Mostafa Ameli — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08849v1 Announce Type: new Abstract: Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision support framework to compare alternative disruption response solutions using a common set of dynamic, passenger, operator, and environment oriented indicators. This paper proposes a KPI-driven, time-indexed framework to assess the resilience of disruption response solutions in urban transit systems. The framework combines an optimization model with a behavioral evaluation in agent-based simulation. It also underlays the secondary service degradation induced on helper lines when in-service vehicles are withdrawn to support the disrupted corridor. Rather than treating resilience as a single score, it evaluates complementary dimensions including vulnerability, adaptability, robustness, resilience loss, responsiveness, cost-based performance, emissions, and equity. The framework is implemented for the RER B transit line in the Ile-de-France (Paris) network. Results show that the coordinated strategy provides the most balanced resilience profile, combining high service continuity with lower total disruption cost than single mode alternatives, while also improving equity and maintaining competitive environmental performance. Sensitivity analysis further identifies the disruption conditions under which coordinated multimodal response is most valuable.

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08850v1 Announce Type: new Abstract: Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.

Enforcing Trust Accountability with Backward Propagation

Wenbo Wu, George Konstantinidis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08851v1 Announce Type: new Abstract: Trust and reputation management underpins reliable interactions in distributed networks, yet existing trust models rely solely on forward propagation of interaction-based trust signals. They lack robust mechanisms to enforce accountability for the propagated trust signals when negative interactions occur. In addition, such models often fail to initialize newly joined nodes with sparse interaction history, leading to the cold-start problem. In this paper, we propose RepuLink, a two-layer reputation model that couples an endorsement network with an interaction feedback network. RepuLink integrates two concurrent backward propagation mechanisms: Backward Endorsement Penalty Propagation (BEPP), which recursively penalizes endorsers of misbehaving nodes, and Backward Endorsement Reward Propagation (BERP), which rewards endorsers of well-performing nodes. Together, RepuLink enforces endorsement accountability and incentivizes positive behaviors, which form a positive interaction feedback loop. The endorsement layer further provides explainable, endorser-weighted trust initialization for newly joined nodes. Experiments on real-world datasets against representative trust propagation baselines demonstrate that RepuLink outperforms across four evaluation metrics in both interaction-only and full two-layer settings, while preserving comparable efficiency.

Parallel SMT Solving via Dynamic Partitioning, Core-Guided Pruning, and Online Backbone Detection

Ilana Shapiro, Sorin Lerner, Nikolaj Bj{\o}rner — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08852v1 Announce Type: new Abstract: Exploiting parallelism in modern CPU architectures remains a longstanding challenge in optimizing SMT solvers. We introduce a novel parallel framework that dynamically builds a binary partition tree of the search space by sampling from workers' VSIDS statistics during solving. We leverage the full power of core-based CDCL-style pruning to continuously shrink the partition tree. We further optimize our architecture by incorporating online backbone detection into worker threads, as well as a terminate-on-demand mechanism to eagerly eliminate work on pruned subproblems. The resulting algorithm is highly generalizable and scales effectively with available resources. We implement our approach in the Z3 SMT solver and demonstrate that it outperforms both sequential Z3 and existing state-of-the-art parallel frameworks on challenging benchmarks from six logics in the SMT-COMP 2025 Parallel Track.

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08854v1 Announce Type: new Abstract: Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

Hartwig Grabowski, Michael Canz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08855v1 Announce Type: new Abstract: This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08857v1 Announce Type: new Abstract: Expert writing feedback from experienced researchers is critical for early-career scholars to improve their manuscripts, yet high-quality feedback often remains scarce because reviewing research papers is labor-intensive. Emerging AI-powered writing assistants largely focus on grammar fixes or simulating peer review with final scores, yet they fall short of providing concrete, actionable suggestions that help students improve their papers during drafting. We present PaperMentor, a human-centered writing assistant system that delivers actionable suggestions as Overleaf-native inline comments while leaving the actual writing entirely to human authors. PaperMentor integrates an expert skill library carefully curated from established researchers' writing advice with 12 specialized agents covering different aspects of paper writing, such as formatting compliance, phrasing accuracy, and terminology consistency. In a user study (n=14), 90.6% of the generated comments were rated actionable and 67.5% were rated valid, significantly outperforming a GPT-5.2 baseline uswithout the skill library. We release PaperMentor as open source for public use. Our code is publicly available under the AGPL-3.0 license at https://github.com/jiarui-liu/overleaf

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

Hartwig Grabowski — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08858v1 Announce Type: new Abstract: The automatic processing of handwritten forms remains a challenging task, wherein detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach, in which both steps -- detection and classification -- are executed in one task through a deep neural network. Therefore, training data is not annotated by hand, but manufactured artificially from the underlying forms and yet existing datasets. It can be demonstrated that this single-task approach is superior in comparison to the state-of-the-art two-task approach. The current study focuses on hand-written Latin letters and employs the EMNIST data set. However, limitations were identified with this data set, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08860v1 Announce Type: new Abstract: Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing from digital maps, creating safety risks for human drivers and automated vehicle systems. We present a real-time, onboard perception pipeline that detects active work zones, recognizes associated temporary speed limits, and outputs a law-aware work-zone state and speed value suitable for driver alerts or downstream automated control. The system fuses object detections with semantic verification and temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker in dynamic scenes, and runs fully on low-cost embedded hardware. Evaluated manually on a annotated subset of the ROADWork dataset (490 sequences), the system achieves inside-work-zone event-level recall of 96.5% and event-level precision of 68.7%. Speed-limit recognition evaluated on 35 minutes of in-house driving data attains 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive. These results demonstrate a practical, scalable approach for grounding work-zone speed awareness directly in onboard perception rather than maps or infrastructure. We release our source code for the proposed system pipeline on our GitHub repository: https://github.com/Mi3-Lab/workzone

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

Juan Pablo Sotelo, Marina Gardella, Pablo Mus\'e — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08864v1 Announce Type: new Abstract: The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA

Generalizing Geometry-Guided Mamba as a Plug-and-Play Context Module for CNN-based Semantic Segmentation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08866v1 Announce Type: new Abstract: CNN-based semantic segmentation networks usually rely on context heads such as ASPP, PPM, or attention modules to enlarge the receptive field. These heads are effective but may introduce heavy computation, memory cost, or boundary leakage. This paper revisits Directional Geometric Mamba (G-Mamba) from DGM-Net and studies it as a plug-and-play context aggregation module rather than a complete new segmentation architecture. The key idea is to inject geometric guidance into the selective scan process, allowing long-range feature propagation to be modulated by boundary and centripetal-flow cues. We replace the original context heads of six representative CNN segmentation models, including DeepLabV3+, DANet, CCNet, PSPNet, PSANet, and OCRNet, while keeping the ResNet-101 backbone unchanged. Results on Cityscapes show consistent mIoU gains with only moderate extra GFLOPs at $1024\times1024$ resolution, suggesting that geometry-guided SSM modules can serve as practical alternatives or enhancements to conventional CNN context heads.

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08867v1 Announce Type: new Abstract: The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.

A Low-Latency Semantic State Estimator using Latent Predictive Learning for Dynamic Network Monitoring and Orchestration

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08869v1 Announce Type: new Abstract: Closed-loop network monitoring and orchestration increasingly require semantic interpretations of live telemetry beyond raw counter collection. However, dynamic cloud-edge environments change both the active node set and the monitoring query at runtime, while control loops demand bounded millisecond-scale responses. We introduce a latent predictive state estimator (LPSE) for dynamic network monitoring and orchestration, built on latent predictive learning over streaming telemetry. The framework converts variable-cardinality node telemetry into topology-adaptive temporal representations, fuses them with monitoring questions, and returns bounded answers from a semantic codebook instead of autoregressive text generation. This design enables fixed-cost, single-pass inference while preserving semantic interpretability. By operating on permutation-invariant, slot-routed node representations keyed by stable identity, the model maintains a fixed input space and generalizes to node addition, removal, and reordering without retraining. Experimental results on a multi-node Kubernetes cluster show semantic prediction accuracy of 82.42% at approximately 41$\times$ lower mean inference latency and 15$\times$ smaller memory footprint compared with a deployable 4B LLM endpoint.

Fourier Neural Operators with rank-1 lattice points and hyperbolic cross

Jakob Dilen, Alexander Keller, Frances Y. Kuo, Dirk Nuyens — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08871v1 Announce Type: new Abstract: The \emph{Fourier neural operator} (FNO) is a neural network architecture that learns mappings between function spaces. Its efficient implementation is based on the multi-dimensional Fourier transform. By deriving general regularity bounds for the FNO with respect to both the spatial and parametric variables, we prove that the generalization error of the FNO can be improved by replacing spatial tensor product grids with purpose-built rank-1 lattice points, and by using a second lattice carefully constructed as training points in the parametric space. We achieve more accurate and efficient approximations from fewer network parameters, fewer spatial points, and fewer training samples. In addition, the architecture is simplified, because the high-dimensional Fourier transform on rank-1 lattices requires only a \emph{one-dimensional fast Fourier transform}, and we can use a \emph{hyperbolic cross} frequency index set with lattice points. We demonstrate the benefits of our \emph{lattice-based hyperbolic-cross FNOs} for an elliptic PDE on the torus.

EFX for Additive Chores: Nonexistence, Pareto Incompatibility, and Bi-Valued Existence

Wentao He, Biaoshuai Tao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08872v1 Announce Type: new Abstract: We consider the fair division problem of indivisible chores and resolve the long-standing open problem for the existence of EFX allocations with additive cost functions. We show that, even for tri-valued additive cost functions, for every $n\geq 4$, there exists an instance with $n$ agents where no EFX allocation exists. Our counterexample only uses three types of chores, which is also tight on the number of types, as an EFX allocation is known to exist for two types of chores. We then consider bi-valued instances. We show that, for every $n\geq 4$, there exists an instance with $n$ agents where every EFX allocation is not Pareto-optimal. This is also the first example showing the incompatibility of EFX and Pareto-optimality when the costs of items are positive: existing examples showing the incompatibility of EFX and Pareto-optimal exploit items with $0$ costs. Our result shows such an example exists even for bi-valued instances. The number of agents $n$ is also tight: for $n\leq 3$, it is known that EFX is compatible with Pareto-optimality. Finally, we also show that an EFX allocation is guaranteed to exist for $n=4$.

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08875v1 Announce Type: new Abstract: Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory level rewards are too sparse for turn level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose \textbf{T}urn-\textbf{T}rajectory \textbf{G}roup \textbf{R}elative \textbf{P}olicy \textbf{O}ptimization (\textbf{T$^{2}$-GRPO}), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. $T^2$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregivers show that T $^{2}$-GRPO outperforms competitive baselines, indicating a substantial improvement for emotionally sensitive caregiver scenarios that effectively handles immediate patient feedback, long-term care outcomes, and safety constraints.

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08878v1 Announce Type: new Abstract: Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9\% (GPT-5.5 62.0\%) and an average overall leakage rate of 246.5\% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1\%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.

Direct Data-driven Predictive Control: A Computationally Efficient Alternative to DeePC for Eco-driving in Mixed Traffic Flows

Dongjun Li, Haoxuan Dong, Liangcai Xu, Ziyou Song — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08880v1 Announce Type: new Abstract: Improving energy efficiency in the transportation sector is critical for achieving sustainable mobility, with eco-driving emerging as a key strategy. However, implementing effective eco-driving for connected and automated vehicles (CAVs) in mixed traffic presents a significant control challenge due to the heterogeneous, uncertain behavior of human-driven vehicles (HDVs). Data-enabled Predictive Control (DeePC) offers a promising model-free approach but is often hindered by a high computational burden, limiting its real-time feasibility. This paper introduces a novel Direct Data-driven Predictive Control (D3PC) framework to address this limitation. By reformulating the data-driven prediction mechanism, the D3PC significantly reduces computational complexity, making its computation time nearly invariant to historical data size. This computational efficiency directly enables the formulation of a sophisticated eco-driving controller that can solve the complex energy optimization problem in real time, even within diverse and stochastic mixed-traffic environments. Comprehensive simulations demonstrate that the D3PC is orders of magnitude faster than existing DeePC-based methods while achieving superior energy efficiency. Specifically, it reduces total platoon energy consumption by up to 10.71% compared to rule-based cruise control baselines and 3.80% compared to the original DeePC, confirming its effectiveness for real-time, energy-efficient control.

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Yi Yu, Xinchuan Qiu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08881v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $\pi_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

Accelerating GMRES with Matrix-Free Multiscale Robin Preconditioners

Dilong Zhou, Rafael T. Guiraldello, Felipe Pereira, Fabr\'icio S. Sousa — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08883v1 Announce Type: new Abstract: We propose a matrix-free right-preconditioning strategy for the Generalized Minimal Residual (GMRES) method based on the Multiscale Robin Coupled Method with oversampling (MRCM-OS) for the numerical solution of elliptic problems arising in subsurface flow. The resulting preconditioner is constructed through local subdomain solves with oversampling and smoothing, and can be applied without explicit assembly of the global operator. After a careful presentation of the new procedure, it is used in extensive numerical experiments. Our results demonstrate that the proposed approach substantially reduces iteration counts across a range of challenging, high-contrast subsurface flow problems. In many cases, convergence is obtained in one or two GMRES iterations when oversampling and smoothing are employed. The results indicate that combining GMRES with multiscale Robin-based operators is a promising direction for the construction of rapidly convergent preconditioning strategies.

Silicon Photonics Testing: Design for Testability, Fault Detection, and Manufacturing Variation Analysis in Photonic Integrated Circuits

Pratishtha Agnihotri, Priyank Kalla, Steve Blair — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08885v1 Announce Type: new Abstract: This paper proposes a design-for-test (DFT) methodology and architecture for testing and validation of silicon photonic integrated circuits. We describe the design of silicon photonic circuits and components that comprise the proposed DFT architecture. The designs are extensively simulated and validated as test-access and fault-detection circuitry. We demonstrate how the DFT approach can be deployed on photonic integrated circuits and how they can be tested for correct operation, in terms of signal power and phase. The application is demonstrated on two distinct types of designs -- an optical neural network comprising optical devices in a feed-forward topology, and on an optical logic circuit with feedback loops.

Block-A-Mole: The Sustainability Frontier of Moving-Target Censorship Resistance

Anindya Maiti — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08886v1 Announce Type: new Abstract: Internet censorship affects over four billion people, and deployed circumvention systems share a common weakness: their endpoints are fixed and discoverable, so a patient censor can enumerate and block them. Moving-target circumvention systems instead rotate endpoints across commercial cloud address space faster than censors can react, but the field lacks a theory of when rotation works, leaving rotation intervals and pool sizes to intuition. We give the first formal account of moving-target censorship resistance by modeling the censor-defender interaction as a continuous-time timing game over a combinatorial address-domain space, generalizing FlipIt to a collateral-bounded adversary. We prove a sustainability frontier separating configurations a censor can defeat from those it cannot, and show that under the Great Firewall's 2024 shift to blocking QUIC and TLS by domain, raw rotation speed is not the binding constraint. Instead, availability is governed by the domain burn rate, $\beta=\lambda_{\mathrm{disc}}/\lambda_{\mathrm{intro}}$, the ratio between how quickly the censor blocks defender domains and how quickly the defender introduces fresh ones. We derive a closed-form availability law, prove that address rotation alone cannot sustain high availability when $\beta>1$ regardless of endpoint rotation speed, and characterize the frontier $\beta^\star$. We validate the analysis with an open, model-level censor-defender simulator requiring no privileged access or cloud deployment. The simulator reproduces the predicted phase transition at $\beta^\star$ under adversary profiles representative of the GFW, Russia's TSPU, and Iran, and shows robustness to state-dependent discovery and bursty, provider-correlated burns. The result replaces the heuristic of ``rotate faster'' with a precise operating condition: keeping the domain economy ahead of the censor.

PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference

Runyang Tian, Yanru Chen, Weihong Xu, Tajana \v{S}imuni\'c Rosing — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08891v1 Announce Type: new Abstract: Large language models are increasingly deployed on edge devices with tight power and area budgets. While mixed-precision GEMM reduces arithmetic complexity, quantized inference is often dominated by dequantization and nonlinear operators. Lookup Table (LUT)-based method mitigates these costs by precomputing outputs and replacing repeated arithmetic with table lookups, but existing designs incur significant capacity and lookup-latency overheads. This paper presents PALUTE, a LUT-based Processing-In-Memory accelerator built on Monolithic 3D DRAM for efficient edge LLM inference. PALUTE enables in-DRAM LUT queries that exploit the vertical organization of M3D DRAM memory array tiles to achieve high parallelism with low area overhead. A near-memory LUT generator supports low-latency LUT generation for both GEMM and element-wise unary nonlinear operators, while a system-level tiering and scheduling strategy minimizes data movement across memory tiers. Evaluation using cycle-accurate simulation and RTL synthesis shows that PALUTE achieves 1,264 TPS end-to-end throughput at 0.16 W, improving energy efficiency by 12.8$\times$ over CHIME and 1.6$\times$ over FIGLUT, improving area efficiency by 2.0$\times$ over PIMPAL under W4A4 across Qwen3-4B models.

Diffuse AI Control on Fuzzy Tasks

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08892v1 Announce Type: new Abstract: AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a novel framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus~4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. To mitigate the threat, we propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.

Cheap Reward Hacking Detection

Iv\'an Belenky, Joaqu\'in Itria, Steven Johns — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08893v1 Announce Type: new Abstract: A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.7130$ vs $0.8296$) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08894v1 Announce Type: new Abstract: Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce \textbf{Distract-Bench}, a benchmark for evaluating VLM robustness to \textbf{semantic visual distractions}, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

Optimal Regret Exponents for Bayesian Statistical Decision Problems

Hyun-Young Park, Si-Hyeon Lee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08895v1 Announce Type: new Abstract: We study finite-state finite-action Bayesian statistical decision problems. While exact error-exponent characterizations are known for several special cases, including hypothesis testing and hypothesis exclusion, the asymptotic behavior of the optimal Bayes regret is largely unknown for general decision problems. In this paper, we show that the optimal regret always decays exponentially fast and characterize its exact exponent for arbitrary loss functions. The exponent is given by the minimum multivariate Chernoff information over the minimal incompatible subsets of states, where an incompatible subset is a collection of states for which no single action is optimal for all states in the subset. Our result recovers the classical pairwise-minimum Chernoff exponent for symmetric multiple hypothesis testing and the multivariate Chernoff exponent for hypothesis exclusion, while also yielding, to the best of our knowledge, the first exact exponent characterization for list hypothesis testing.

FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08896v1 Announce Type: new Abstract: Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose \method{}, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, \method{} Top-2 reduces MSE by 12.4\% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME

A multi-agent system for spine MRI report generation from multi-sequence imaging

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08897v1 Announce Type: new Abstract: Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance, and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

A Kernel-Clean Lean Mechanization of Classical Lottery in Action and the Wakker--Debreu--Koopmans Representation Layer

Jingyuan Li, Ilia Tsetlin, Fan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08902v1 Announce Type: new Abstract: We present a Lean 4/Mathlib formalization of the additive representation theory behind Classical Lottery in Action and the Wakker-Debreu-Koopmans (WDK) layer it relies on. Our central result is a machine-checked proof that the cross-pair Thomsen / double-cancellation (hexagon) condition is irreducible from the ordinal axioms of additive conjoint measurement (weak order, restricted solvability, Archimedean condition, and tradeoff consistency). We exhibit an explicit verified counter-model (additiveRealBoolPref) satisfying all ordinal axioms yet failing the cross-pair condition, with every strict standard sequence being an arithmetic progression and hence non-dense. Around this boundary we mechanize the full derivable construction: continuous Debreu/Eilenberg utility from separability, standard-sequence grids, bisection methods from connectedness, and global additive gluing. All public theorems are sorry-free conditional wrappers over this single irreducible structural input. The development is kernel-clean, depending only on standard Lean foundations (propext, Classical.choice, Quot.sound). The companion file ClassicalLotteryInAction.lean formalizes local classical-lottery constructions, average-utility results, matching-frequency lemmas, and ambiguity-attitude statements used by the Management Science paper. This draws a precise, machine-certified line between what additive conjoint measurement can prove and what it must assume.

Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08903v1 Announce Type: new Abstract: Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance that do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms - GAN-based, VAE-boosted, diffusion-based, and masked modelling - using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.

Order Matters: Unveiling the Hidden Impact of Macro Placement Sequences via Proxy-Guided LLM Evolution

Shibing Mo, Jing Liu, Jianchu Xu, Ruilin Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08904v1 Announce Type: new Abstract: Macro placement is a fundamental step in modern chip physical design, playing a crucial role in determining the solution quality of high-dimensional combinatorial optimization problems. Despite recent advancements in machine learning for spatial coordinate determination, the temporal dimension of placement sequencing remains largely governed by static heuristics. In this work, we demonstrate that the placement sequence is not merely a preprocessing step but a decisive factor in optimization, where suboptimal early decisions trigger irreversible domino effects that constrain the solution space. To harness this unexplored dimension, we propose \textbf{OrderPlace}, a proxy-guided LLM evolution framework for automatically discovering macro placement order strategies. Instead of relying on manually crafted heuristics such as area- or connectivity-based ordering, OrderPlace explores a broader space of code-level policies, ranging from static scoring metrics to dynamic physics-inspired mechanisms. To mitigate the prohibitive cost of evaluating sequences, we introduce a lightweight proxy evaluation mechanism that efficiently filters candidates using a deterministic greedy probe. Experimental results on the standard ISPD 2005 benchmarks demonstrate that OrderPlace discovers novel ordering strategies. Compared with WireMask-EA and the state-of-the-art method EGPlace, OrderPlace reduces wirelength by 34.04\% and 14.08\%, respectively.

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08906v1 Announce Type: new Abstract: In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. %ToDO: % However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. % In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. % In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. % Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy.Code and pretrained models will be available at the Link.

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08908v1 Announce Type: new Abstract: Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.

Enhancing Presence, Deepening Fan Intensity: How Presence in Immersive Video Shapes Psychological Closeness to Performers

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08912v1 Announce Type: new Abstract: Immersive video differs from conventional flat 2D video in that it is experienced as 180-degree stereoscopic video on a head-mounted display, thereby eliciting bodily and spatial subjective experience. Previous studies have shown that viewing and interpersonal distance affect Presence; however, it remains insufficiently understood how Presence differences are related to psychological closeness to content. In the present study, we examined whether differences in Presence could increase viewers' psychological closeness to performers within the content. This psychological closeness was operationally defined as fan intensity. Specifically, a live performance by a Japanese idol group was recorded as 180-degree immersive video, and a high-Presence condition (1.2 m) and a low-Presence condition (7.6 m) were established by manipulating filming distance. Twenty-four participants with different levels of prior involvement, comprising Avid fans and Casual fans, experienced both conditions in a counterbalanced within-participants design. Fan intensity was measured before and after the experience as perceived psychological overlap between the self and the performers. The results showed that, compared with the low-Presence condition, the high-Presence condition significantly increased all Presence-related measures except the Slater-Usoh-Steed questionnaire, with the largest condition differences observed for Possible Actions, Social Presence, and Observability. Moreover, a mixed analysis of variance on changes in fan intensity revealed a significant main effect of Presence condition, indicating that the high-Presence video produced a greater increase in fan intensity than the low-Presence video. These findings suggest that filming distance in immersive video is not merely a factor that determines angle of view or composition, but a design variable that can enhance Presence and deepen fan intensity.

Vibe Visualizing: How Visualization Novices Try (and Fail) to Generate and Interpret Visualizations with Conversational AI

Sam Yu-Te Lee, Yun-Hsin Kuo, Chifang Chou, Matthew Ward, Xiwei Xuan, Kwan-Liu Ma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08914v1 Announce Type: new Abstract: Conversational AI has enabled users to generate and interpret visualizations through natural language, significantly lowering the technical barrier to entry. The increased accessibility brings visualization novices into data visualization, but also exposes them to misinformation and misinterpretations. We are motivated to examine what issues can arise in interactions with current conversational AI, whether visualization novices can recognize such issues, and how they respond to them. To examine these questions, we conducted a user study on ChatGPT with 20 visualization novices, collecting their conversation logs, semi-structured interview transcripts, and Likert-scale questionnaire responses. Through thematic analysis, we developed a codebook that covers AI execution compliance, issues of AI-generated visualizations, patterns of AI responses, and prompting patterns of users. We summarized four themes, including the quality of outcomes, recurring errors from ChatGPT, misuse by users, factors that affect user trust, confidence, and verification behavior, and human-AI collaboration dynamics. To demonstrate the generalizability of our codebook and findings, we replayed the initial user prompts on Gemini and Claude and compared the outcomes, which revealed distinct failure modes for each model. Based on the results of all analyses, we derive a set of design recommendations for future AI-assisted visualization systems. We conclude with discussions on literacy gaps, diverse human-AI collaboration dynamics, and implications for agentic visualization.

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08918v1 Announce Type: new Abstract: Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using the Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, subsequently enables joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. Particularly, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

Emre Turan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08919v1 Announce Type: new Abstract: As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08920v1 Announce Type: new Abstract: Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, the presence of varying imaging conditions and complex building structures, makes automatic contour extraction extremely challenging. Mainstream approaches for building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contour, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which can directly extract building vector polygons from high-resolution remote sensing images without the need for any post-processing operations. The proposed method leverages two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM is designed to generate an initial building contour by utilizing concatenated sub-region center features for each building instance. It performs simultaneous object detection and initial contour extraction by generating bounding boxes and using the center features of four sub-regions to represent each building. The Contour Optimization Module (COM) further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.

Generalized Rank-based Evaluation for Knowledge Graph Completion: Perspectives, Framework, and Analyses

Sooho Moon, Jian Kang, Yunyong Ko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08921v1 Announce Type: new Abstract: Knowledge graph completion (KGC) aims to predict missing facts from an observed knowledge graph (KG), playing a crucial role in a wide range of real-world applications such as drug discovery, recommender systems, and retrieval-augmented generation (RAG). Although numerous KGC models have been proposed, the evaluation of KGC remains underexplored, despite its critical role in reliably assessing model performance and selecting appropriate models for real-world applications. In this paper, we introduce two important perspectives for KGC evaluation that are overlooked by existing evaluation metrics, (P1) predictive sharpness and (P2) popularity-bias robustness. To address both perspectives, we propose a generalized evaluation framework, PROBE, which consists of a rank transformer (RT) that estimates the score of each prediction based on a desired level of predictive sharpness and a rank aggregator (RA) that determines the final evaluation score by aggregating all prediction scores according to a desired level of popularity-bias robustness. We theoretically analyze PROBE by defining six key properties for reliable KGC evaluation and prove that PROBE satisfies all the properties, while existing metrics fail to satisfy some. In particular, due to the open-world nature of KGs, an evaluation metric should preserve the relative performance of KGC models even when only incomplete facts are observed. We show that PROBE better maintains such consistency, providing a more reliable estimate of intrinsic model performance than existing metrics. Extensive experiments with six KGC models on six real-world KGs reveal that existing metrics may over- or under-estimate model performance depending on different evaluation perspectives, whereas PROBE enables a more comprehensive, flexible, and consistent evaluation of KGC models.

PTDL:Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning

Xiaoyu Xu, Zhiming Chen, Yuenan Zhao, Ran Song, Wei Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08922v1 Announce Type: new Abstract: Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase--Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.

PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

Sooho Moon, Yunyong Ko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08926v1 Announce Type: new Abstract: Knowledge graph completion (KGC) models are commonly evaluated using rank-based metrics such as MRR and Hits@K, despite different users often requiring different evaluation perspectives. In this demo, we present PROBE-Web, an interactive system for probing diverse evaluation landscapes for KGC models. PROBE-Web enables users to flexibly evaluate KGC models by adjusting two critical perspectives: (P1) predictive sharpness and (P2) popularity-bias robustness. Through a user-friendly GUI, users easily evaluate multiple KGC models and analyze their strengths and weaknesses. PROBE-Web provides four key functionalities: (1) conventional evaluation toolkit, (2) flexible perspective-aware evaluation, (3) explainable case studies, and (4) evaluation landscape exploration. We believe that PROBE-Web can help users better understand KGC models aligning with their objectives.

In-Situ Immersive Analytics Authoring through Ergonomic Keyboard Support

Leonel Merino, Bego\~na Juli\'a-Nehme, Santiago Viana — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08927v1 Announce Type: new Abstract: Immersive analytics uses augmented reality (AR) to integrate data analysis and authoring within physical environments. However, extensive text entry required for immersive analytics authoring remains a fundamental challenge in AR, as popular natural user interfaces often hinder expressive input. This paper presents the Body-Supported Keyboard (BSK), an ergonomic system that allows the mobile use of a Bluetooth keyboard in AR. We conducted a controlled study with 20 participants to compare the BSK with a standing desk during text transcription and a mobile AR scenario. The results showed slightly higher error rates but comparable task completion times. Participants reported comfort improvements during mobile use and positive usability ratings (mean SUS = 74.5). The BSK allows users to move freely and maintain stable postures while authoring in AR. In general, the findings show evidence of the potential for body-supported input to enhance expressive and ergonomic workflows in immersive analytics and emphasize the importance of comfort and mobility in the design of AR authoring tools.

RankGLU: Residual Gated Score Formation for Cross-Sectional Stock Prediction

Huixiang Xiao, Jian Xu, Feiyu Qu, Zixuan Xie, Xiangyu Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08930v1 Announce Type: new Abstract: Cross-sectional stock prediction is closer to a ranking problem than to ordinary return-magnitude regression, since portfolio decisions depend on the relative ordering of assets within each trading date. Existing temporal, graph-based, and market-conditioned attention models have improved stock representation learning, yet the final prediction head is often treated as a minor implementation detail. This paper argues that, under information-coefficient-oriented evaluation, score formation is a critical bottleneck: an over-flexible head can fit unstable return magnitude, whereas an overly linear head may underuse cross-feature interactions. We therefore develop RankGLU, a residual bottleneck gated linear unit for cross-sectional stock ranking. RankGLU keeps a direct linear scoring path and adds a bounded multiplicative branch, thereby preserving a stable ordering route while allowing controlled nonlinear interactions. The method is evaluated on CSI300 and CSI800 under a unified protocol with cross-sectional score normalization and an IC-augmented objective. Multi-seed experiments show that, on CSI300, RankGLU achieves the strongest mean IC among the internally controlled variants, improving from 0.0654+/-0.0052 for the original backbone and 0.0697+/-0.0030 for the ranking-aware backbone to 0.0727+/-0.0037, a gain that is consistent across all five seeds. Its best-seed result also exceeds the corresponding baselines. Ablation results further indicate that removing the GLU prediction head causes the clearest degradation among the tested component changes. Additional relation-path calibrations can produce high single-seed peaks, but their multi-seed behavior is less stable. The evidence suggests that ranking-aware stock models benefit most reliably from bounded residual score formation rather than from indiscriminate architectural expansion.

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08932v1 Announce Type: new Abstract: Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.

Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory

Yuan-chin Ivan Chang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08934v1 Announce Type: new Abstract: Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through \emph{backward coherence}: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_\phi$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58%$, reaches stability $28$--$44%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $\rho$ and verify the increment-sum tube $R_t$ with $100%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback--Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $\phi$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

Kang Zhang, Wei Jian Lau, Shoushou Ren, Dong Lin, Joon Son Chung, Chuanhao Sun — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08935v1 Announce Type: new Abstract: Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, we notice that they suffer from a major limitation in our evaluation - their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude related anomalies, and this failure is prevalent across all existing representation-based methods. To address aforementioned issues, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules, a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then in final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score-which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Further evaluation on bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation further support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, which has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI

Report on CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS)

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08936v1 Announce Type: new Abstract: This report summarizes the CHIIR 2026 Workshop on Generative AI and Academic Search (GAI\&AS), which examined how GenAI is reshaping academic search systems and research practices. The workshop brought together researchers in human information interaction and information retrieval to explore key challenges and opportunities in designing and evaluating future academic search systems that integrate GenAI, moving beyond traditional document retrieval to support summarization, recommendation, synthesis, and conversational interaction. Participants' interests and discussions focused on three thematic clusters: foundations and principles, applications and opportunities, and search-as-learning. Across these themes, the workshop highlighted the importance of academic search systems in supporting transparency, credibility, research integrity, and long-term scholarly needs, as well as in fostering higher-order cognitive processes. Participants discussed guiding theories, design principles, methodological approaches, partnerships, and community-building efforts aimed at advancing human-centered GenAI-enhanced academic search systems. Overall, the workshop demonstrated strong community interest and a diverse range of ongoing and emerging research initiatives at the intersection of GenAI and academic search.

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08938v1 Announce Type: new Abstract: Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose \textbf{PACT} (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, \textbf{DPS} (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.

Multilingual Sentiment Aware Text Summarization A Reinforcement Learning Approach for Consistency Maintenance

Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08940v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.

LongRTL: Graph-Similarity-Guided LLM-driven Long Context RTL Optimization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08944v1 Announce Type: new Abstract: Large Language Models (LLMs) show great promise in RTL code generation and optimization. However, real-world RTL designs are typically long, entangled, and poorly modularized, posing a major challenge due to context-length limitations and lack of structure. To overcome these obstacles, we propose a scalable LLM-based RTL optimization framework guided by graph similarity. Our method introduces three collaborative agents: (1) a Partition Agent that decomposes RTL designs into semantically meaningful AST subtrees, guided by AST graph similarity to reusable design templates; (2) an Optimization Agent that generates RTL submodule code based on partitioned subtrees using multi-modal Retrieval-Augmented Generation (RAG) with both AST and RTL guidance; and (3) a Reconstruction Agent that reassembles optimized submodules based on logic-aware ordering and Graph-RAG prompting, ensuring global functional equivalence. Together, these components enable robust, structure-aware optimization of long-context RTL designs, bridging the gap between toy examples and industrial-scale hardware codebases.

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08945v1 Announce Type: new Abstract: We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.

NeuDW-CIM: a 65-nm 0.8-pJ/Sop Reconfigurable Neuromorphic Compute-in-Memory Macro with Nonlinear Dendrites and K-Winners

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08947v1 Announce Type: new Abstract: This work presents NeuDW-CIM, a highly efficient neuromorphic Compute-in-Memory (CIM) macro for Spiking Neural Networks (SNNs) implemented in 65 nm CMOS. The design introduces a custom twin 9T bit-cell for ternary in-puts/weights and a reconfigurable non-linear In-Memory ADC (IMA). The macro supports two specialized modes: 1) Nonlinear Dendrite (NLD) mode, which utilizes reconfigurable IMA to emulate biological dendritic functions, achieving measured accuracies of 97.2% on N-MNIST and 95.5% on DVS Gesture; and 2) Top-K Winner (KWN) mode, featuring an early-stopping mechanism that reduces IMA conversion latency by 30% and digital LIF latency by 10x. Benefiting from the sparse update in KWN mode, NeuDW-CIM achieves a measured energy efficiency (EE) of 0.8 pJ/SOP (1.6x improvement).

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08948v1 Announce Type: new Abstract: Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08950v1 Announce Type: new Abstract: Vector databases have been designed and optimized for cloud environments; however, emerging scientific AI workloads (e.g., molecular search, meteorological trajectory detection, and literature-driven hypothesis generation) demand efficient, scalable execution on HPC systems. We present a large-scale evaluation of three state-of-the-art vector databases -- Qdrant, Milvus, and Weaviate -- on two production supercomputers, scaling to 256 distributed workers across 64 compute nodes. We evaluate representative workload patterns -- mixed read/write and write-then-read -- using popular benchmarks, multimodal embeddings, and a novel real-world scientific dataset. Our results reveal that workload characteristics can limit latency reduction, additional cores can reduce query throughput by up to 30.67%, and scaling from 16 to 256 workers (16x) only yields a 5.46x improvement. This scaling paradox exposes the fundamental mismatch between cloud-oriented designs and HPC systems, highlighting the need for new, HPC-aware vector database designs.

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08952v1 Announce Type: new Abstract: Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.

Self-Consistent Generative Paths via Admissible Random Variational Transport

Lei Luo, Yingzhen Zhang, Jian Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08953v1 Announce Type: new Abstract: Modern generative models often define an entire probability path from a simple prior to the data law, rather than only an endpoint map. Diffusion models follow stochastic denoising paths, flow matching learns transport fields, consistency and distillation methods compress paths into one or a few steps, adversarial models match terminal distributions, and VAEs generate through latent kernels. Existing unifying views mainly describe how such paths are constructed. We study a complementary question: when is a generated probability path self-consistent? We define a self-consistent generative path as a random fixed point of admissible local variational transport corrections. In this framework, a local correction is specified by a random variational transport operator combining a divergence or geometry term, an energy term, and a structural constraint. The framework contains random regularized optimal-transport proximal steps as a structured instance, while also allowing non-OT divergences, latent kernels, adversarial constraints, causal discrete kernels, and terminal one-step maps. The theory yields a random fixed-point path residual (R-FPR), which measures the gap between the actual generated path and an admissible local correction. We prove well-posedness, random fixed-point existence and attraction, non-contractive existence, residual-to-generation error bounds, empirical residual concentration, proxy perturbation bounds, continuous-time limits, and operator-level generalization with model-specific corollaries. The resulting theory turns endpoint matching into path self-consistency testing and provides a residual-control principle for diagnosing failures, regularizing training, and guiding adaptive sampling across diffusion, flow, one-step, VAE, GAN/WGAN, and autoregressive generators.

From inverse problems to neural operators: prediction, mechanism, and generalization of data-driven models

Conor Rowan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08956v1 Announce Type: new Abstract: Scientists have historically relied on mathematical models based on differential equations to relate system inputs -- forces, fluxes, or heat sources -- to outputs, such as displacement, velocity, concentration, and temperature. These models rely on deep domain knowledge to determine the form of the governing differential equation, which is then calibrated with data by solving an inverse problem. In recent years, the field of Scientific Machine Learning has introduced a variety of alternative modeling strategies for physical systems. A method called Sparse Identification of Nonlinear Dynamics learns the governing equation as a sparse linear combination of terms in a user-defined library. Neural Ordinary Differential Equations construct the governing equation by taking in the state and its derivatives at the input layer of a neural network. Entirely foregoing the modeling framework of differential equations, neural operators directly learn a non-linear mapping between the system inputs and outputs. From inverse problems to neural operators, all of these modeling strategies can be conceptualized as data-driven machinery to predict a system's response over a range of inputs. It is then natural to wonder how exactly these various strategies relate to each other, and whether they can be neatly taxonomized. Drawing from the philosophical literature on scientific models, we argue that many model types have a common structure, differing only in the assumed model class of the input-output relation they define. Connecting to philosophical ideas on mechanism, and arguing that data from physical systems arises from solutions to parsimonious differential equations, we propose that only certain models are capable of mechanism discovery, and thus generalization. Our analysis is intended to unite apparently disparate modeling strategies and provide insight into their appropriate use cases.

Rethinking 3D Shape Generation: Diffusion over Superquadrics

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08957v1 Announce Type: new Abstract: Diffusion models have advanced 3D shape generation, yet most methods still denoise in high-cardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set of superquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08959v1 Announce Type: new Abstract: We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08960v1 Announce Type: new Abstract: Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08962v1 Announce Type: new Abstract: World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.

Embedding linear codes over Z4 into self-orthogonal codes

Junmin An, Jon-Lark Kim, San Ling — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08964v1 Announce Type: new Abstract: The purpose of this paper is to investigate the self-orthogonal embedding problem for linear codes over Z4. We propose several tight bounds on the length of the shortest self-orthogonal embedding over Z4, and determine the exact shortest self-orthogonal embedding length under specific conditions. As an example satisfying these conditions, we establish the exact length of the shortest self-orthogonal embedding for the quaternary Preparata codes. Furthermore, to establish these results, we completely classify the exact length of the shortest doubly even self-orthogonal embedding for binary linear codes in every possible case. Finally, when the shortest self-orthogonal embedding length of a given free code over Z4 is equal to the shortest doubly even self-orthogonal embedding length of its residue code, we present an algorithm to construct all possible shortest self-orthogonal embeddings. With our algorithm, we found twelve linear codes over Z4 whose minimum Lee distances are higher than those of the Z4-linear codes in Aydins database.

Before You Scroll Again: Predicting Regretful Social Media Sessions from In-the-Wild Contextual and Wearable Sensing

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08965v1 Announce Type: new Abstract: Users often feel regret after using social media, making regret a more ecologically valid target than screen time for understanding when phone use becomes problematic. Existing self-monitoring tools cannot anticipate regret before it occurs, and prior physiological work on social media use has been confined to the lab with research-grade sensors and curated content, leaving the question of in-the-wild prediction open. We deployed a 7-day in-the-wild experience sampling study with 21 participants, combining passive smartphone logging, a low-cost consumer smartwatch (Bangle.js 2, \$80), session-level surveys (1,445 sessions), and exit interviews to investigate when and why social media sessions become regretful, and whether regret can be anticipated before a session begins. Three findings stand out: (i) the gap between intended and actual use predicts regret far more strongly than session duration, with duration's apparent effect collapsing once intention is modeled; (ii) regret is amplified when sessions displace a valued alternative, particularly at night and following productivity-app use; and (iii) pre-session contextual features generalize across participants while physiological signals add person-specific lift, pointing toward a two-layer architecture for just-in-time adaptive interventions. Interview themes of scrolling-as-avoidance and time blindness contextualize these patterns and surface design opportunities beyond timer-based interventions.

CARE: A Conformal Safety Layer for Medical Summarization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08969v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(\tau,\gamma)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $\alpha = 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.

An Effective Router for Vision-Language Model Selection

Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08970v1 Announce Type: new Abstract: Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles, employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS' adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMs (only 800M in size) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08974v1 Announce Type: new Abstract: Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions capturing the distinct transitions between reasoning steps and answer candidates reflecting the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work suggests the important role that diversity of the thinking schemata plays and points to scaling along the diversity dimension as a promising research direction.

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

Jing Wang, Shang Liu, Wenji Fang, Yuchao Wu, Yugao Zhu, Zhiyao Xie — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08976v1 Announce Type: new Abstract: LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models' performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.

Online Learning with Recency: Algorithms for Sliding-window Streaming Multi-armed Bandits

Vladimir Braverman, Chen Wang, Liudeng Wang, Samson Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08977v1 Announce Type: new Abstract: Motivated by the recency effect in online learning, we study algorithms for single-pass *sliding-window streaming multi-armed bandits (MABs)* in this paper. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm is required to perform pure exploration and regret minimization with limited memory, defined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window) that has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems with the model. For pure exploration, we prove that finding the best arm is hard with sublinear memory while finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments, demonstrating the trade-offs between sample, regret, and memory.

Heterophily-Aware Adaptive Knowledge Distillation for Hypergraph Neural Networks

Joohee Cho, David Yoon Suk Kang, Yunyong Ko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08978v1 Announce Type: new Abstract: Hypergraph knowledge distillation aims to retain the predictive performance of a hypergraph neural network (HNN) teacher while reducing inference costs through a lightweight student model. In this work, we observe that HNNs exhibit substantially lower prediction performance on heterophilic nodes connected through semantically diverse hyperedges, indicating that the reliability of teacher knowledge varies across nodes. Motivated by this observation, we propose HADES, a heterophily-aware adaptive distillation method for hypergraph neural networks. HADES quantifies node heterophily and leverages it as an estimate of teacher reliability to modulate the transfer of teacher knowledge during distillation. Experimental results on real-world hypergraphs demonstrate that HADES consistently improves student performance across different HNN teachers and distillation objectives. In many cases, the resulting student models surpass the predictive performance of their teachers while achieving up to 12.3 times faster inference.

EviProp: Seeded Relevance Diffusion on Chunk-Page Graphs for Long Multimodal Document Retrieval

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08979v1 Announce Type: new Abstract: Retrieving evidence pages from visually rich long documents is a key challenge in document question answering. Existing page-level visual retrievers operate under an independent matching paradigm: each page is scored in isolation based on query-page similarity. This paradigm can under-rank evidence pages whose signals are localized in fine-grained chunks or depend on document-internal associations. We propose EviProp, a retrieval method that recovers such pages via seeded relevance diffusion. EviProp models each document as a multimodal Chunk-Page graph with hierarchical, sequential, and similarity links. Given a query, it combines dense visual page priors with sparse chunk seeds, then runs Personalized PageRank to diffuse relevance over the graph. Experiments on MMLongBench-Doc and LongDocURL show consistent gains in evidence-page retrieval over independent visual retrieval and text-visual fusion baselines. Downstream QA results further show that improved retrieval translates into better answer accuracy, with negligible online retrieval overhead. Our code is released at https://github.com/Flyecnu/EviProp.

EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08980v1 Announce Type: new Abstract: This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods relying on additional preprocessing, we design an end-to-end architecture, with a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. Ultimately, EPS3D outperforms SOTA baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08982v1 Announce Type: new Abstract: Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for \emph{continuous care} rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: \textbf{Baichuan-Harness}, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a \textbf{core reasoning model} trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a \textbf{clinical tool layer} for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3\%.

Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

Hu Tan, Kuo Gai, Shihua Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08985v1 Announce Type: new Abstract: While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

Gabriel Ortega, Abelino Jim\'enez, S\'everin Lions, Pablo Dartnell — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08988v1 Announce Type: new Abstract: Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08992v1 Announce Type: new Abstract: Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization

Binh Nguyen, Trinh Tran, Truong X. Nghiem — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08993v1 Announce Type: new Abstract: We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned model that preserves convexity and smoothness. This leads to the proposed Moreau Envelope Learning ADMM (MEL-ADMM) and its splitting variant sMEL-ADMM. Unlike existing approaches that learn high-dimensional operators directly, LEAF learns a scalar-valued Moreau envelope, significantly reducing model complexity and improving data efficiency. The framework accommodates a broad class of convex problems with smooth and non-smooth objectives. By embedding convexity explicitly through the ICNN architecture, the proposed approach maintains high approximation accuracy while preserving key structural properties of the optimization problem. Both MEL-ADMM and sMEL-ADMM are developed with theoretical guarantees of convergence and feasibility under the learned model. Rigorous analysis shows that the proposed methods achieve convergence rates comparable to classical ADMM while reducing per-iteration computational cost. Numerical experiments demonstrate up to an order-of-magnitude speedup over state-of-the-art solvers while maintaining low optimality gaps

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

Trapoom Ukarapol, Pakhapoom Sarapat, Nut Chukamphaeng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08994v1 Announce Type: new Abstract: Large language models (LLMs) sometimes exhibit language confusion when generating non-English text. Existing approaches typically rely on fine-tuning to mitigate this issue. In contrast, we propose a tuning-free paradigm for reducing language confusion. Within this paradigm, we introduce two methods: Language-Aware Token Boosting (LATB), which applies targeted perturbations to tokens associated with the desired language, and Adaptive Language-Aware Token Boosting (Adaptive-LATB), which dynamically adjusts these perturbations based on the model's confidence in the intended language. Experiments demonstrate that our methods effectively improve multilingual alignment by reducing language confusion, while maintain the summarization quality without requiring any additional fine-tuning. Our code is publicly available. https://github.com/scbdatax/genai-datax-language-aware-token-boosting.

The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

Muhammad Zia Hydari, Raja Iqbal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08998v1 Announce Type: new Abstract: Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.

JAX-AMG: A GPU-Accelerated Differentiable Sparse Linear Solver Library for JAX

Yi Liu, Xiantao Fan, Jian-Xun Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09001v1 Announce Type: new Abstract: Sparse linear systems from PDE discretizations are central to scientific computing, yet no existing JAX-ecosystem solver simultaneously provides GPU-accelerated algebraic multigrid (AMG), automatic differentiation (AD), and distributed multi-GPU execution. JAX-AMG fills this gap by wrapping the Nvidia AmgX solver suite as a native JAX primitive, exposing AMG and Krylov methods with configurable preconditioners through a unified interface compatible with JIT compilation, reverse-mode AD via adjoint methods, batched solves, and MPI-based distributed execution. Solver caching amortizes setup costs across repeated solves, making JAX-AMG practical for PDE-constrained optimization and inverse problems. The result is a robust, scalable sparse linear algebra layer that integrates seamlessly into differentiable simulation and scientific machine learning pipelines.

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09004v1 Announce Type: new Abstract: Feature engineering remains essential for tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for automating this process, giving rise to LLM-powered AuTomated Tabular feature Engineering (LATTE). However, the absence of standardized platforms prevents fair, cost-aware comparisons. Furthermore, complex methodological designs obscure the specific contributions of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural language generation, the isolated impact of each technique's competitive edge remains unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework featuring: (1) a six-dimensional taxonomy decomposing 15 representative methods into reusable components; (2) a standardized modular arena for controlled comparison; (3) multi-dimensional assessments covering performance, cost, and robustness; and (4) component-level ablation quantifying each technique's competitive edge. Through extensive evaluations, we reveal 16 key findings, including: (1) Tree-of-Thought with Monte Carlo Tree Search achieves optimal cost-effectiveness; (2) RPN and Code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and over 4000 execution logs, enabling researchers to seamlessly pit new techniques against existing ones and advance LATTE.

Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

Jianguo Zhu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09005v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) systems often serialize user queries, retrieved documents, metadata, system labels, and task instructions into one natural-language prompt. We study a source-authority boundary failure in this design: attacker-authored retrieved text can impersonate metadata, provenance, authority, or disclosure-policy signals that appear control-relevant to the model. We call this pattern Document-Authored Control-Signal Impersonation (DACSI). DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is simple: document-authored labels are data, not policy. Command-style injection asks the model to ignore, override, or violate policy; DACSI asks whether untrusted document text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel. We evaluate DACSI across six model settings, prompt-pressure levels, injection baselines, signal taxonomies, RAG-mediated pipelines, system-control probes, a source-authority attribution probe, and synthetic canary formats. We interpret the evidence by model regime rather than as six equal replications: DeepSeek V4 Pro and Qwen3.5-397B provide the cleanest positive lift, DeepSeek V4 Flash is a high-susceptibility setting, GPT-5.5 and Gemini 3.1 Pro Low are strong-boundary probes with selected residual risks, and GLM-4.7 is a saturated leakage boundary case. Across these regimes, DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface, follows a RAG-specific source-authority path, and responds to source/channel separation. The source-authority probe is behavioral attribution evidence, not proof of an internal mechanism.

Sustainability and Artificial Intelligence: Necessary, Challenging, and Promising Intersections

Han-Teng Liao, Zijia Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09006v1 Announce Type: new Abstract: Both digital economy and digital technology researchers increasingly recognize the need to better address the role that artificial intelligence (AI) plays in shaping the evolution of the environmental, social and governance aspects of development. It appears that sustainability and AI research converge on the features of wicked problems that are complex, interconnected and dynamic. Building off such convergence, this article aims to map out the necessary, challenging, and promising intersections by providing an overview of the state of art research. Based on 541 bibliographic data collected from the Web of Science (WoS) database, the findings reveal the increasingly central body of work on green and sustainable science and technology in bridging various disciplines, main journals and key topics and concepts. The findings reveal how such interactions can be necessary, challenging, and promising. The article concludes with few general arguments regarding how to diversify and expand the community of practice regarding AI for sustainable development, especially in the areas of expected AI application areas and institutions.

High-Order Regularity and a Fully Discrete Fourier Spectral Method for a Partially Dissipative Viscoelastic Timoshenko System with Memory

Zhenyang Zhong, Hui Liang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09007v1 Announce Type: new Abstract: This paper investigates a class of partially dissipative viscoelastic Timoshenko systems with memory, where dissipation is induced by a Volterra-type memory term acting only on the shear variable. The well-posedness of weak and strong solutions is established on finite time intervals, including existence, uniqueness, stability, and higher-order regularity under compatibility conditions consistent with mixed boundary conditions. For the numerical approximation, a Fourier spectral fully discrete scheme is constructed: sine and cosine basis expansions are used in space for unknowns satisfying Dirichlet and Neumann boundary conditions, respectively; in time, a central difference scheme is applied to the second-order derivatives, and the composite trapezoidal rule is used to approximate the memory convolution term. Based on a discrete energy method, the positivity of the constructed discrete energy is proved, and the error estimate for the fully discrete scheme with second-order convergence in time and $q$-th order in space is established for any q \in \mathbb{N}. Numerical experiments are given to verify the theoretical convergence rates and to compare the dynamic responses of the local and nonlocal models, demonstrating that the memory term effectively captures energy dissipation and vibration attenuation behavior in viscoelastic materials.

Scaling by Diversified Experience for Vision-Language-Action Models

Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu, Nanyang Ye — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09009v1 Announce Type: new Abstract: Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities. Codes and Datasets is released on \href{https://sy-vla.github.io/}{project page}.

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

Hanyang Li, Jianhao Ma, Ying Cui — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09012v1 Announce Type: new Abstract: Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss \emph{river} inside a wider \emph{valley}: a normal neighborhood of the river forms a nearly flat \emph{basin}, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09013v1 Announce Type: new Abstract: LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.

Deterministic versus Stochastic Optimization for Joint Path Planning and Dynamic Time Splitting in Multiple-UAV-Cached IoT Networks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09014v1 Announce Type: new Abstract: This paper examines wireless-powered Internet of Things (IoT) networks involving multiple unmanned aerial vehicles (UAVs) equipped with backscatter and caching technologies to relay and transmit signals. For data communication and energy harvesting (EH), the source transmits information and power to UAVs using the dynamic time splitting (DTS) method. UAVs use harvested energy for passive communication (backscatter) and for active communication (transmitting information) to the destination. The primary objective is to maximize the total throughput by jointly optimizing the DTS ratio, trajectory, and transmission power, leveraging the UAVs' caching capability. This optimization problem is challenging due to its non-convexity. Therefore, an efficient alternating algorithm using the block coordinate descent (BCD) method is proposed to optimize each variable given the fixed values of the other parameters. By applying the Karush-Kuhn-Tucker (KKT) conditions, we derive a closed-form expression for the optimal DTS ratio, significantly reducing computation time. The optimal values for the other two parameters are determined using the BCD. In order to thoroughly assess the effectiveness of various solutions for the original problem, this paper introduces an approach leveraging a genetic algorithm (GA). The GA in this context employs a one-point crossover method, value mutation, and rank-based selection based on fitness values. Numerical results show that the BCD and GA achieve at least 31% throughput improvement over the benchmarks, with reduced computational time. These findings demonstrate the performance gain and practical feasibility of our solutions in caching-enabled UAV-aided IoT networks.

MaterialClusterGS: Palette-Based Material Decomposition and Physically-Based Relighting with 2D Gaussian Splatting

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09018v1 Announce Type: new Abstract: We present MaterialClusterGS, a palette-based material decomposition framework for 2D Gaussian Splatting that enables physically based relighting and material editing. Existing Gaussian inverse rendering methods typically assign independent BRDF parameters to individual primitives. While flexible, this local fitting strategy makes material recovery highly under-constrained: shadows, indirect illumination, geometric errors, and visibility residuals can be absorbed into thousands of slightly different local material estimates. Meanwhile, recent palette-based appearance methods operate solely in RGB space without modeling physical materials or illumination. To bridge this gap, we represent scene materials using a compact global palette of shared BRDF prototypes assigned via a continuous spatial material field. Without shared material structure, editing one region does not propagate consistently to others of the same material, making per-primitive decompositions impractical for editing. We jointly optimize the material field, palette prototypes, and environment lighting under a physically based rendering objective. The resulting framework recovers compact, spatially coherent attributes directly usable for material editing, relighting, and transfer.

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09019v1 Announce Type: new Abstract: Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.

Personal Salience: Highlighting Is Social, but Individuality Lives in Selection

Kazuki Nakayashiki, Keisuke Watanabe — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09024v1 Announce Type: new Abstract: Social highlighters let people mark passages that matter to them. We ask how much of an individual is recoverable from these naturalistic traces, using a co-readership identity control (the same document highlighted by many users) that holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does. We separate generic salience (structure), crowd salience (what others marked), and personal salience (the individual residual). First, highlighting is social: which sentences you mark is predicted far better by the crowd than by structure or by a personal model, and even a well-estimated crowd, an information-privileged baseline that sees others' marks on the same document, beats a frontier LLM twin built from your other-document history; the within-document personal signal is at most a whisper (own-vs-other gap +0.017 by an embedding scorer, small but significant). Second, in sharp contrast, individuality lives in selection: asked which of the already-salient passages are yours, your own history is a strong, leakage-free predictor (gap +0.14). A topic decomposition shows this is largely stable thematic preference: it shrinks ~6-8x against a topically-matched peer, and a thin residual cannot be separated from finer topic. The non-obvious part is an asymmetry: under the same scorer the individual signal is ~6-8x weaker in salience than in selection. Methodologically, naive history-conditioning evaluations leak (the target's own marks enter the profile in ~42% of pairs, inflating personal scores by up to +0.15 AP) and small crowds overstate personalization; our results are leakage-free, use a dense crowd, and a model-matched control. Highlights carry a genuine individual signature, but a thin layer over a strong shared one, surfacing far more in which salient things a person selects than in what is salient.

Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI

Ayan Pendharkar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09026v1 Announce Type: new Abstract: We ask whether structural properties of intermediate grid states predict whether a symbolic ARC-AGI solver will succeed, framed as a test of conditional mutual information I(X;Y|task) > 0. Across 44,800 runs spanning two architecturally distinct solvers (beam search and Stochastic DFS), 400 ARC tasks, 28 configurations per solver, and both training and evaluation splits, hand-crafted grid descriptors measured at 50% trajectory completion discriminate successful from failed runs within the same task (mean within-task best-feature AUC = 0.885, p < 0.001 under within-task label permutation). Most predictive content lies along a single grid-complexity axis. The result generalizes across solver architectures: a feature selected on one solver predicts success on the other with AUC 0.747-0.762 in all four transfer directions (p < 0.001, leakage controlled). On a pre-registered held-out set of 41 reliable tasks, the frozen feature n_components_final achieves AUC = 0.765 (95% CI [0.717, 0.810], p < 0.001), robust under task-clustered bootstrap resampling and cross-solver task collapsing. The signal is not explained by solver capacity (configuration-residualized AUC = 0.927 and 0.896 for beam search and SDFS, p < 0.001) and is only weakly coupled to score trajectories (R^2 approximately 0). Early stopping at 50% completion reduces beam-search compute by 33.6% while retaining 98.9% of solves; degenerate-trajectory detection reduces SDFS compute by 65.3% with no solve loss. Finally, on 229 of 400 evaluation tasks the DSL primitive library produces no valid transition from the input grid. This 0-step collapse is invariant to search budget and universally failed by beam search, indicating a DSL coverage limitation rather than a search-budget effect.

SafeRun: Enabling Determinism in LLM Planning for Running

Meilin Chen, Zepeng Zhai, Jiaxuan Zhao, Yuan Lu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09027v1 Announce Type: new Abstract: Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves 100\% safety score (vs.\ 79.1\% PE average and 97.6\% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available at \href{https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark}{{huggingface}}.

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

Jiaheng Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09028v1 Announce Type: new Abstract: Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

Frequency Decoupled Framework for Screen Content Image Super-Resolution

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09029v1 Announce Type: new Abstract: Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR) . However, they overlooked the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective by capturing structured energy in amplitude and relational continuity in phase, and jointly exploiting them with bespoke implicit representations to faithfully recover the regular textures and global configuration of Screen Content Image (SCI). Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams, where Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. And Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) integrates periodic and coherent implicit representations for efficient exploitation of the periodic patterns and coherent context embedded in SCI. Experimental results show FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09030v1 Announce Type: new Abstract: Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09032v1 Announce Type: new Abstract: Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing LLM-as-WM and code-as-WM paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.

CRANE: Knowledge Editing for Reasoning MLLMs

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09033v1 Announce Type: new Abstract: The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

Leveraging NeRF-Rendered Images for 3D Gaussian Splatting

Mizuki Morikawa, Yuta Shimizu, Chunyu Li, Yusuke Monno, Masatoshi Okutomi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09034v1 Announce Type: new Abstract: Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view synthesis. They often show complementary performance, i.e., 3DGS demonstrating faster rendering speed and NeRF demonstrating higher rendering quality. Motivated by this, we propose leveraging NeRF-rendered images for 3DGS. Specifically, we target street scenes and utilize a pre-trained street-specific NeRF method to produce training images for a target 3DGS method. In our 3DGS training, NeRF-rendered images are used to remove transient objects in street-level input views and to generate bird's-eye views as additional views, inheriting the higher-quality rendering of NeRF into 3DGS. We further incorporate a diffusion-based image enhancement to improve the image quality of the additional views. Experimental results on one synthetic and two real datasets demonstrate that our proposed method improves street-scene rendering while preserving the speed of 3DGS and the quality of NeRF.

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

Jinseong Han, Sunwoong Yang, Namwoo Kang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09037v1 Announce Type: new Abstract: Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.

Personalization Meets Safety:Mechanisms,Risks,and Mitigations in Personalized LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09038v1 Announce Type: new Abstract: Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus either on personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions-user representation, personalization paradigm, and evaluation-and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.

Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Cheonsu Jeong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09039v1 Announce Type: new Abstract: This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two critical challenges in autonomous agent economies: the hivemind effect arising from excessive strategic convergence among agents and the lack of transparency in autonomous decision-making processes. The proposed BPF consists of three core modules: Mentalizing-based Social Intelligence (MbSI) grounded in Theory of Mind (ToM), Pluralistic Alignment (PA), and a Verifiable Execution Kernel (VEK). These modules are organically integrated within a closed-loop architecture that governs the entire lifecycle of agent behavior, from decision-making and execution to verification and feedback. To evaluate the proposed framework, a simulation environment implemented in Python and a Streamlit-based user interface will be developed. Through empirical experimentation, the study aims to examine whether the entropy-control mechanism of the PA module can effectively preserve strategic diversity among agents and mitigate collective convergence, while the VEK module provides a comprehensive and transparent audit trail of the decision-making process. The anticipated results are expected to demonstrate that the proposed framework can simultaneously enhance the stability, efficiency, and trustworthiness of autonomous agent economies. Consequently, this research offers a practical approach for developing robust, transparent, and accountable agent-native economic systems.

Culturally-Aware AI for Cross-Boundary Community Learning: Undergraduate Innovation at the Intersection of Computation and Design

Jiaojiao Zhao, Weisheng Zhang, Jiawen Cai, Haibin Gao, Luyao Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09041v1 Announce Type: new Abstract: Research on artificial intelligence in education (AIED) is rapidly expanding, yet technical progress often lacks human-centered grounding and adequate attention to cultural context. Community-Based Learning, a pedagogy rooted in social work, remains underrepresented in AIED research, particularly within Asia-Pacific contexts. This paper reports on cross-boundary Community-Based Learning where undergraduate students develop AI-enabled solutions for cultural heritage preservation and sustainable development. We examine how community-engaged computing operationalizes human-centered AIED across three dimensions: education, technology, and culture. We contribute a collaborative framework for culturally-aware AIED that fosters multi-stakeholder collaboration while widening participation by dissolving disciplinary silos between social work and computational science.

Seamless Contraction-Control Framework for Unplanned Grid-Connected/Stand-Alone Transitions of Grid-Forming Inverters

Qianxi Tang, Li Peng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09042v1 Announce Type: new Abstract: Unplanned grid-connected (GC)/stand-alone (SA) transitions commonly occur in AC microgrids during protection trips, manual breaker operation, or low-bandwidth supervisory communication. Under such unplanned transitions, a grid-forming inverter must support the local-load voltage in stand-alone operation and regulate the desired power/current injection in grid-connected operation. Existing P--Q droop-based seamless-transfer methods often rely on planned transition commands, supervisory islanding detection, or pre-synchronization interval, which may prevent timely voltage/current support during unplanned bidirectional transitions. To address this problem, this paper proposes a seamless contraction-control (SCC) framework for target dynamics. Using the SCC, contraction-based grid-connected current-control and stand-alone voltage-control laws are proposed. With the new control laws, the inverter achieves transient stability and converges to the target trajectory with a prescribed convergence rate. Furthermore, a breaker-status observer is proposed to infer the grid-connected/stand-alone mode from voltage measurements on both sides of the breaker, eliminating the need for a dedicated pre-synchronization interval or supervisory islanding detection process and enabling timely voltage/current support during unplanned transitions. Experimental results validate that the proposed method achieves stand-alone voltage support, stable grid-connected current injection under symmetrical/unsymmetrical grid-voltage sag and phase-jump disturbances, and unplanned bidirectional transitions.

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09043v1 Announce Type: new Abstract: Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.

Decoy-Calibrated Failure Audits for Language Models

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09046v1 Announce Type: new Abstract: Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.

Families of Control-Cost-Parametrized Inverse-Optimal Universal Stabilizers

Miroslav Krstic, Luke Bhan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09047v1 Announce Type: new Abstract: A classical universal stabilization formula offers the practitioner no design freedom: it is a single, parameter-free object. We introduce a cost-parametrized family of stabilizing feedback laws, where (1) the user chooses a function that serves as the running cost on control in an inverse-optimal cost functional, and (2) obtains, through a formula, a nonlinear "expander" of a pre-existing universal controller, which solves an infinite-horizon optimal control problem with a meaningful cost on the state. The cost-to-expander formula is a three-step construction, involving, inter alia, cost differentiation and function inversion-overall, a nonlinear infinite-dimensional operator. The cost-to-expander operator is proven Lipschitz, which enables uniform neural operator approximation of the entire family and supports both offline performance exploration and online adaptation. Semiglobal practical asymptotic stability and second-order suboptimality bounds are established under the approximation. The operator learning and its use in semiglobal stabilization are illustrated numerically. We call the result 'half-direct-optimal' because the paper's design is less than a general 'direct optimal' (HJB-inducing) control, but more than the fully inverse optimal, since the user performs minimization for an arbitrary given cost on control. The dual to the half-direct problem we solve is the problem in which the cost on the state is arbitrary and given. This dual problem is easier and outside of the scope of the paper.

Beyond Convolution: Advancing Hypergraph Neural Networks with Hypergraph U-Nets

Fuli Wang, Wei Qian, Daniel L. Lau, Gonzalo R. Arce — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09051v1 Announce Type: new Abstract: Convolutions have successfully transitioned from image processing to the complex realm of non-Euclidean higher-order domains, particularly in hypergraphs. Despite the success in convolution, the exploration of a popular architecture named U-Net remains largely unexplored for hypergraph data due to the lack of well-defined pooling and unpooling operations. This work pioneers the study of U-Net architectures for hypergraph data, addressing the critical challenge of designing effective pooling and unpooling operations that retain maximal structural information from the input hypergraph. Motivated by hierarchical clustering, we propose to construct the pooling and unpooling operators all at once by cutting the clustering dendrogram at different granularities, named the Parallel Hierarchical Pooling (PHPool) and Unpooling (PHUnpool) operators. Unlike existing pooling methods that risk local structural damage through a sequential learning procedure, our PHPool operators are designed in a global and parallel manner to ensure fidelity to the original hypergraph structure with efficient computation while the PHUnpool operators are tailored to perform inverse operations of the PHPools for hypergraph reconstruction. We validate our model through hypergraph reconstruction simulation, hypergraph classification, and node-level anomaly detection, where it demonstrates superior performance over existing state-of-the-art graph and hypergraph deep learning methods.

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09052v1 Announce Type: new Abstract: Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09056v1 Announce Type: new Abstract: Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

Stage-1 Controls the Entropy Regime, Not the Outcome

Jianxiong Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09059v1 Announce Type: new Abstract: Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.

ATTAIN: Automated Exploit Failure Analysis through Trace-Driven Diff Analysis

Xinwei Mao, Zirui Chen, Xing Hu, Xin Xia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09060v1 Announce Type: new Abstract: Exploits are widely used to check whether library vulnerabilities appear in different versions and to mark affected version ranges. Exploit-based checks sometimes fail because exploits stop running on many versions after API or environment changes. Commit-based methods, such as SZZ-style analysis, sometimes miss the right introduce commits and spread labels incorrectly along long version chains. These problems leave many affected versions unlabeled or wrongly labeled and make manual exploit failure analysis very expensive and impractical at scale. We present ATTAIN, a trace-driven diff analysis framework with three modules to assess vulnerability presence across evolving library versions. The modules are trace construction, diff exploration, and affected-version judgment. The trace construction module executes an exploit across historical library versions and compares their behaviors to capture cross-version execution divergences. Using these divergences, the diff exploration module guides an LLM through a finite-state tool loop to autonomously search over version changes and collect vulnerability-relevant diff hunks. The affected-version judgment module reasons over the collected evidence to determine whether the vulnerability exists in each version and outputs the affected version range. We evaluate ATTAIN on an extensive dataset comprising 224 CVEs and 25,943 library versions across 128 libraries. ATTAIN achieves an F1-score of 93.24%, outperforming the commit-based methods V-SZZ and LLM4SZZ by 116.28% and 33.30%, respectively. ATTAIN uses short tool-guided prompts and a fixed number of iterations, keeping token usage low. It matches or surpasses existing methods on frequent CWE types, including cases where exploit runs fail for non-vulnerability reasons or commit messages do not clearly delimit affected versions.

Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving

Haoxin Liu, Jiayi Wang, Yueshen Xu, Rui Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09061v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed with highly heterogeneous workloads, chunked-prefill execution has emerged as a mainstream serving architecture. Balancing scheduling fairness and latency stability in such environments is critical; otherwise, severe head-of-line blocking and request starvation will degrade user experience. However, existing systems rely on rigid First-Come, First-Served (FCFS) policies and static token budgets, leading to fairness degradation and unpredictable latency jitter. To address these issues, we propose a fairness-aware and latency-controllable scheduling framework for chunked-prefill LLM engines. Specifically, we design a lightweight aging-based scheduling policy that dynamically calculates priorities using accumulated waiting time and remaining prefill work. Furthermore, we develop Latency-Prediction-Based Request Scheduling (LPRS) and Active Prefill Control (APC) to replace static budgets with target-time constraints and actively regulate prefill concurrency. We evaluated our scheduling framework on NVIDIA GPUs and Ascend accelerators using real-world workloads. Results show the aging policy reduces mean end-to-end latency by over 10\% compared to FCFS. Moreover, LPRS and APC significantly reduce P99 tail latency and suppress prefill fragmentation, confirming that the structural prefill control and the temporal latency constraints are fundamentally complementary. All codes have been released in Github.

Security-First Approach to API Pipeline Development with Zero-Trust Architecture

Mahima Agarwal, Keshav Ranjan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09062v1 Announce Type: new Abstract: Modern enterprises face an accelerating onslaught of API-targeted threats amid a rapidly expanding attack surface. Record volumes of software vulnerabilities continue to accelerate dramatically, with 28,818 CVEs disclosed in 2023 (a 38% jump from 2022) and 40,009 CVEs in 2024 (another 38% increase), while the average time-to-exploit (TTE) of new flaws shrank to mere days (approximately 5 days in 2023, down from 32 days in 2021). At the same time, API usage dominates web traffic and has become a primary vector for breaches - 99% of organizations experienced API security incidents in the last year, with 22% suffering actual data breaches via APIs (based on industry vendor research). This paper proposes a comprehensive "security-first" framework for API pipeline development, leveraging Zero-Trust Architecture principles within DevSecOps practices to counter these trends. We introduce a five-pillar approach encompassing Governance & Planning, Secure Design, Continuous Testing, Pipeline Controls, and Runtime Protection, aligned with industry standards (OWASP API Security Top 10 2023, NIST Secure Software Development Framework) and recent cybersecurity advisories. The results show significant improvements in vulnerability mitigation and breach prevention (e.g., 30% reduction in security incidents and 40% fewer post-release vulnerabilities in representative case studies), highlighting the positive impact of proactive security integration. The paper concludes with a discussion on implementation challenges, the evolving threat landscape, and recommendations for organizations to adopt a security-first pipeline with Zero-Trust to fortify API development against current and future threats.

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09064v1 Announce Type: new Abstract: Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose \textbf{CoVER}, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to \textbf{See More} by dynamically gathering query-expanded visual evidence, and \textbf{Think Deeper} by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

OnlyDense: Reduced-Order Modeling for Lagrangian simulation

Tu Do, Shannon Ryan, Santu Rana — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09065v1 Announce Type: new Abstract: In science and engineering, Lagrangian simulation methods such as Smooth Particle Hydrodynamics (SPH) or Material Point Method (MPM) are often employed to study the behavior of dynamic systems. However, these methods can be prohibitively computationally expensive, particularly when simulating multi-scale spatial or temporal phenomena, e.g., void growth and coalescence within macro-scale geometries, structural failure of spacecraft components resulting from hypervelocity impact of space debris particles, etc. In contrast to graph-based methods, where the state of the system is understood as a discrete set of particles, we propose a learning framework for scalable representation and dynamics modeling of massive particle systems by treating the system state as a function and its evolution as a trajectory in Hilbert space. Rather than representing the state as a discrete set of particles or embedding it in a nonlinear latent manifold, we approximate the state space with a linear subspace spanned by learned neural basis functions. This parameterization enables direct projection to obtain latent coefficients and explicit access to the basis functions, avoiding optimization over a nonlinear latent space. The resulting representation admits a natural interpretation: latent variables correspond to coefficients in Hilbert space, and basis functions correspond to spatial modes, analogous to Proper Orthogonal Decomposition. The framework thus unifies classical projection-based reduced-order modeling with modern deep learning, while remaining invariant to the number of discretization points. Experiments on large-scale SPH simulations with over one million particles, including dynamic events with extreme deformation and fragmentation, demonstrate that the proposed method accurately reconstructs and predicts dynamics, achieving an R$^2$ score above $0.99$ with as few as $32$ basis functions.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09068v1 Announce Type: new Abstract: Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment. However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Thus, amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that alignment gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09071v1 Announce Type: new Abstract: Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a completed traces still lags far behind, especially in the \emph{silent failure} regime. Existing approaches predict suspect steps via classifiers or LLM judges, or recover correct answers via retry, but none feed the intervention outcome back to \emph{refine the attribution itself}. We propose \methodname, a method that closes this gap by diagnosing a candidate error step, testing it through controlled replay with a diagnosis-specific patch, and using the verified outcome flip as contrastive evidence to refine the final attribution. Across four localization benchmarks spanning multi-hop reasoning across domains, \methodname achieves the highest localization accuracy among same-auditor methods across all four benchmarks, with the largest gains on structured tool-use traces, while providing actionable localization even when ground-truth answers are unavailable.

A Unifying Lens on Reward Uncertainty in RLHF

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09073v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) is bottlenecked by \emph{reward hacking}, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is \emph{pessimism}: penalizing rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a \emph{distributional} reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.

REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance

Zhang Chen, Shuai Wan, Mengting Yu, Fuzheng Yang, Junhui Hou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09074v1 Announce Type: new Abstract: Existing pruning methods for 3D Gaussian splatting (3DGS) suffer from either severe quality degradation or prohibitive computational overhead. In this paper, we propose REFINE, a highly accelerated 3DGS pruning framework centered on a novel rendering-free primitive importance metric. Our approach leverages an analytically approximated, rendering-aware Hessian field to quantify the expected perceptual error induced by the removal of individual primitives. By modeling the joint modulation of visibility, projection geometry and the content adaptive hyperparameter, we entirely bypass costly forward rendering passes and derive an anisotropic perceptual weight field that serves as a high-fidelity proxy for primitive importance. Extensive experiments across multiple benchmark datasets demonstrate that REFINE maintains highly competitive rendering quality while achieving an unprecedented $3,000\times$ reduction in pruning-related computational complexity compared to state-of-the-art pruning methods.

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09076v1 Announce Type: new Abstract: Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.

Neural Legendre-Fenchel transform with Hessian Preconditioning

Basile Plus-Gourdon, Frank Nielsen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09077v1 Announce Type: new Abstract: The Legendre-Fenchel (LF) transform is a fundamental tool in convex analysis and machine learning that maps lower semi-continuous functions to their convex conjugates. In practice, when closed-form formula are not available for expressing convex conjugates of given functions, one must approximate them using various techniques. One recent such versatile numerical method is the deep Legendre transform method which relies on neural networks although it remains challenging particularly for tackling ill-conditioned functions. This work builds on the reformulation of the LF transform as a projective polarity. A notable property of this framework is its affine invariance. We leverage this affine invariance to introduce a Hessian-based preconditioning strategy. Specifically, we apply an affine deformation around a minimizer so that the second-order Taylor approximation of the function coincides with the canonical paraboloid, whose conjugation map is the identity. A residual network initialized near the identity can then learn this simplified mapping, while the original conjugation map is recovered through the inverse deformation. The proposed preconditioning incurs only a modest computational overhead, consisting of a single eigendecomposition during initialization and two matrix-vector multiplications per query. Experiments on a diverse set of convex functions, including high-dimensional benchmarks, demonstrate improved convergence rates and enhanced numerical accuracy of the conjugation, with particularly significant gains for ill-conditioned problems. Finally, we discuss the scope of applicability of our proposed method and highlight several of its limitations.

The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09078v1 Announce Type: new Abstract: Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09079v1 Announce Type: new Abstract: Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09080v1 Announce Type: new Abstract: Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical \textbf{M}, \textbf{N}, and \textbf{K} dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration--quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0\%--4\%), to dynamic depth at moderate loss (5\%--16\%), and finally to static width pruning at higher loss levels (17\%--26\%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research.\footnote{Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim}

Edge-Constrained UAV Small-Object Detection with P2 Enhancement and Quantum-Inspired Lightweight Structure Search

Wuming Lei, Yanbin Gao, Mingyan Sun, Xiaobin Li, Xuechen Liang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09081v1 Announce Type: new Abstract: Unmanned aerial vehicle (UAV) object detection requires compact detectors that retain small-object details under onboard computation and memory constraints. Repeated downsampling inlightweight networks weakens shallow spatial information, while manually adding attention orfusion modules may increase cost without stable gains. This study analyzes YOLOX-Nano underedge-deployment constraints by combining a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm (QIEA) for lightweight structure screening. The search space isdefined by lightweight priority and task specificity, and the evaluation jointly considers accuracy,floating-point operations (FLOPs), latency, memory consumption, and recall. On VisDrone, theP2 branch increases APamall by 31.10% over the YOLOX-Nano baseline. Compared with NanoDet-Plus with similar model size, YOLOX-Nano+-P2 improves APs0.ss by 17.5% and APamal by 44.9%.The QIEA-selected candidate obtains the highest Recallso, but +P2 remains the strongest AP-oriented variant after full training. Full 100-epoch verification of Random-best, GA-best, andSA/QUBO-best candidates further shows that proxy rankings do not necessarily transfer to finalAPse9s. These results support using P2 as the main small-object enhancement path and QIEA as alightweight tool for candidate screening and accuracy-cost analysis. The source code, configurationfiles, diagnostic scripts, and summarized results are available at https://github.com/Ming23233/UAV-QIEA-Edge-Detection

Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning

Yutong Li, Xinyi Zhang, Ziyi Ye, Daoguo Dong, Yu-gang Jiang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09082v1 Announce Type: new Abstract: Multimodal sequential recommendation (MSR) incorporates textual and visual information to improve recommendation quality. However, recent studies and our empirical analysis show that visual features are often underutilized, thereby contributing far less than textual signals. We attribute this issue to two factors: insufficient visual representation learning (pretrained encoders fail to capture preference-relevant cues) and unbalanced visual-text optimization (textual features dominate the learning process). To address these issues, we propose Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning (REVEAL), a plug-and-play framework that enhances visual representation learning and cross-modal optimization without modifying the original recommendation backbone. REVEAL consists of Feedback-Guided Visual Extraction (FVE), which refines prompt-guided visual extraction through task-level feedback, and Adaptive Visual Learning (AVL), which dynamically reweights visual learning to alleviate modality imbalance. Experiments on multiple real-world datasets and MSR backbones demonstrate that REVEAL consistently improves recommendation performance. Further analysis shows that these gains arise from more effective attention to preference-relevant visual regions and better visual utilization during training. The code is available at https://github.com/YutongLi2024/REVEAL.

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09084v1 Announce Type: new Abstract: Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including ``multi-turn'' jailbreaks such as Crescendo and Tree of Attacks,still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize a deployment failure mode for tool-using LLM agents, the \emph{provenance gap}, and study reproducible triggers for it: \emph{Context-Fractured Decomposition} (CFD), a family of cross-context multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage, via individually innocuous tool actions whose risk emerges only under delayed artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.

DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling

Jie Zhao, Xianqi Dai, Jie Feng, Huandong Wang, Yong Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09086v1 Announce Type: new Abstract: Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.

Autonomous FPV Flight with Translational Optical Flow and Uncertainty Mask

Yang Deng, Yu Hu, Feng Yu, Linzuo Zhang, Danping Zou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09088v1 Announce Type: new Abstract: Autonomous FPV quadrotor flight in complex environments using a monocular RGB camera as the sole exteroceptive sensor remains a fundamental challenge. Recent research has shown that using optical flow as the input of a neural network can achieve end-to-end autonomous flight in cluttered scenes. However, extracting the most relevant information from the flow estimation is the key bottleneck limiting agility and robustness. Existing methods struggle to disentangle obstacle-induced optical flow from the ego-motion background flow and suffer from low signal-to-noise ratios near the focus of expansion (FoE). To address these issues, we decompose the optical flow into translational and rotational components and utilize only the translational flow, which captures scene geometry and depth cues. In addition, we introduce an uncertainty mask derived from inconsistencies between forward and backward flow estimates. This mask highlights obstacle structures, including those within the FoE region. Both cues are fed to a control policy trained in a differentiable simulation framework, which enables efficient first-order optimization across perception and control. We validate our approach through extensive experiments in both simulated and real-world forest environments. The proposed system achieves robust flight at speeds of up to 13.91 m/s in simulation and 11.79 m/s in real-world tests, with a 93.3\% success rate over 30 real-world trials, nearly doubling the previously reported 6 m/s real-world speed of the monocular-RGB optical-flow UAV obstacle avoidance system.

Context Rot in AI-Assisted Software Development: Repurposing Documentation Consistency for AI Configuration Artifacts

Christoph Treude, Sebastian Baltes — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09090v1 Announce Type: new Abstract: Developers increasingly provide AI coding assistants with persistent context through configuration files such as CLAUDE.md, AGENTS.md, and .cursorrules. These files describe code elements, architecture, and development conventions, forming the context that guides AI tool behavior across sessions. As software evolves, this context can become stale, a phenomenon we call context rot. While AI configuration artifacts are new, the underlying consistency problem connects to decades of software documentation research. Researchers have built tools to check consistency between documentation and code, spanning README files, code comments, API documentation, architecture descriptions, and installation instructions. We argue that this existing toolbox is an immediate starting point for detecting context rot, and we present a research roadmap mapping documentation consistency approaches to corresponding problems in this new setting. As preliminary evidence, applying an existing README/wiki consistency checker to a statistically representative sample of 356 repositories identifies stale code element references in 23.0% of repositories, showing that traditional documentation consistency tools can already surface context rot.

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09091v1 Announce Type: new Abstract: On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.

From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09092v1 Announce Type: new Abstract: Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive "shortcut" issue: tasks can reach up to 99% accuracy by simply exploiting spurious causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking, such as "belief," are especially shortcut-prone compared to mind questions, such as "intention," where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains, called Thinking-RFT, elevates ToM beyond Supervised Fine-Tuning, or SFT. Our key findings are as follows. First, Thinking-RFT effectively improves ToM in all scenarios, with a 6% improvement over SFT, particularly in complex higher-order reasoning, with a 10% improvement over SFT, and multimodal cases, with a 7% improvement over SFT. It also generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. Second, ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms Non-Thinking-RFT by 7% on average. Third, RFT works by learning to ground its reasoning on anchor cues, such as keywords and state changes, that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities.

LAEI: Layered Autonomous Edge Intelligence Framework for Robust UAV Swarm Operations

Changmin Park, Wooyong Jung, Hwangnam Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09099v1 Announce Type: new Abstract: Autonomous UAV swarms require scalable coordination mechanisms that maintain mission performance under limited communication, environmental uncertainty, and component failures. Centralized approaches provide global coordination but suffer from communication bottlenecks and single-node vulnerabilities, whereas fully decentralized methods often lack mission-level consistency. This paper presents Layered Autonomous Edge Intelligence (LAEI), a UAV-swarm framework that combines onboard learned policies with lightweight mission-level supervision. Each UAV performs local perception, obstacle avoidance, and action selection onboard, while the supervisory layer provides adaptive goal reassignment, fault-aware recovery, and context-dependent policy guidance without directly controlling low-level actions. LAEI further incorporates recovery strategies, including dynamic reassociation, backup supervisory support, and fallback local autonomy, to maintain mission continuity under representative failure scenarios. We evaluate LAEI in simulated UAV-swarm scenarios using mission completion time, collision rate, and coverage efficiency. The results show that LAEI reduces mission completion time and improves operational efficiency while maintaining collision-aware distributed UAV-level decision-making.

Alcmean's: Unsupervised community detection using local Laplacian, automatic detection of the number of centers

Shahin Momenzadeh, Rojiar Pir Mohammadiani — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09100v1 Announce Type: new Abstract: Community detection is a fundamental problem in the analysis of complex networks. It has applications across social, biological, and financial domains. Traditional algorithms such as Louvain, LPA, and modularity optimization often require manual parameter tuning. They also suffer from inaccurate cluster center selection and struggle with scalability. To address these challenges, we propose Automatic Laplacian Centrality Means (ALCMeans), a novel community detection algorithm. ALCMeans combines Laplacian energy-based automatic center identification with DeepWalk embeddings for robust node representation. Unlike existing Laplacian-based and clustering methods, ALCMeans eliminates the need to predefine the number of communities, enhances cluster center selection using structural importance, and leverages representation learning for more accurate and stable assignments. Experimental results on benchmark datasets demonstrate 10 to 20 percent higher NMI and ARI scores compared to Louvain, Newman-Girvan, LPA, Fast-Greedy, and a recent GNN-based competitor (MAGI, KDD 2024). Additional evaluations with modularity and F1-scores confirm the superiority of ALCMeans. Ablation studies highlight the critical contributions of each component. Despite its reliance on DeepWalk parameters and increased runtime relative to lightweight heuristics, ALCMeans consistently outperforms state-of-the-art methods. This makes it a promising tool for real-world network analysis.

Chimera: Protocol-Aware Recovery for Confidential BFT Consensus

Tong Liu, Xiaoqing Wen, Ziwei Zhou, Si Liu, Jianyu Niu, Cong Wang, Yinqian Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09101v1 Announce Type: new Abstract: Trusted Execution Environments (TEEs) have enabled confidential Byzantine Fault-Tolerant (BFT) consensus systems with confidentiality and improved scalability. However, TEEs do not provide state continuity: during recovery, a compromised host can roll back a crashed enclave to a stale persistent state, significantly threatening both safety and availability. Existing defenses face a fundamental tradeoff: they either impose substantial overhead on critical consensus paths, reducing throughput and increasing latency, or incur prolonged recovery delays, hurting availability. We present the first systematic taxonomy of rollback-resilient recovery for confidential BFT consensus, distilling prior approaches into four categories. We further expose their inherent limitations. Guided by this detailed analysis, we design CHIMERA, a protocol-aware recovery framework that breaks this tradeoff. Our key insight is that rollback protection in consensus systems should not be uniform. Different types of persistent states differ fundamentally in their state distribution, update behavior, and representation form. CHIMERA separates persistent state into metadata and logs according to these protocol-level properties and applies distinct recovery mechanisms to each type. We formally model CHIMERA in Maude and verify its safety and liveness properties. We implement it on Braft and ZooKeeper using Intel TDX, and evaluate it in both LAN and WAN settings. Results show that CHIMERA achieves higher throughput, lower recovery latency, and better availability than state-of-the-art rollback-resilient baselines.

Concepts in Practice: C++ MPI Bindings for the HPC Ecosystem. From a Standardizable Core to a Composable Interface

Tim Niklas Uhl, Matthias Schimek, Daniel Brommer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09102v1 Announce Type: new Abstract: The official C++ MPI bindings were removed from the standard in 2008, leaving a gap that numerous third-party libraries have attempted to fill. However, existing wrappers typically cover only a limited subset of MPI or target specific use cases, falling short of a general-purpose solution. A recent conceptual paper proposed general design principles for modern C++ bindings based on C++20 concepts, without committing to a concrete interface. We present the first concrete realization of these principles in a layered architecture. At the foundation, we define a core layer: refined C++20 concepts formalizing the MPI standard's notion of data buffers, automatic mapping of standard C++ constructs, non-intrusive customization points for third-party types, and concept-based wrappers for MPI procedures. The result is a low-level native C++ MPI interface that works directly with STL containers, is highly extensible, and lends itself to standardization. Built on this core, we present KaMPIng-v2 -- a C++ MPI library offering the convenience and memory-safety of KaMPIng with composable, pipe-based syntax inspired by C++ ranges for efficient, boilerplate-free MPI programming. Finally, we demonstrate the core layer's broad applicability by designing lightweight adapters for GPU and performance-portability libraries, making the HPC ecosystem a first-class citizen in MPI. Kokkos views, Thrust device vectors, and SYCL buffers can be passed directly to MPI procedures, with adapter logic remaining self-contained. All contributions are backed by a fully functional open-source reference implementation, demonstrating the practical viability of the proposed design.

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09104v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.

Graph2Idea:Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Xu Li, Hanzhe Tu, Xun Han — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09105v1 Announce Type: new Abstract: Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery.Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace.To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation.Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit.It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input.Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence.Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol.Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28.These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.

RAM: Reachability Across Morphologies

Tim Walter, Xinyu Chen, Jonathan K\"ulz, Matthias Althoff — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09108v1 Announce Type: new Abstract: Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $ F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.

Driving Video Retrieval for Complex Queries with Structured Grounding

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09109v1 Announce Type: new Abstract: Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang, Qingsen Yan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09110v1 Announce Type: new Abstract: Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception--distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.

Illumination-Invariant Anomaly Detection for Sub-Canopy UAV Multispectral Point Clouds

Likun Chen, Yanfeng Gu, Xian Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09111v1 Announce Type: new Abstract: Unmanned Aerial Vehicle (UAV) multispectral point clouds (MPC) provide high-dimensional spatial-spectral data for sub-canopy target detection; however, their efficacy is significantly compromised by severe illumination heterogeneity caused by vegetation shadows. To address this, we propose a prior-free anomaly detection framework capable of robustly handling lighting variations. First, we formulate solar angle estimation as an inverse optimization problem. By coupling spectral indices with a ray-tracing model, this strategy achieves Prior-Free Shadow Extraction without relying on flight metadata, effectively distinguishing dark objects from true shadows. Second, to mitigate spectral distortions, we introduce an Illumination-Consistent Sparse Representation mechanism. Unlike standard reconstruction methods, we construct a background dictionary strictly from neighbors sharing the same illumination state. This constraint effectively disentangles spectral reflectance from lighting variations, ensuring that targets are represented solely by physically consistent background points. Experimental results indicate that the proposed method significantly improves the separability between anomalies and background in complex forest environments, demonstrating superior performance over state-of-the-art baselines. This framework is particularly suited for identifying camouflaged military targets, mapping fallen tree trunks, and uncovering archaeological ruins hidden beneath dense foliage.

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09112v1 Announce Type: new Abstract: The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

Yuxin Fu, Shijing Si — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09114v1 Announce Type: new Abstract: Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia--Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C--I--S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

Lena Krieger, Xuan Zhao, Zhuo Cao, Qin Wang, Hanno Scharr, Ira Assent — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09115v1 Announce Type: new Abstract: Offline reinforcement learning (RL) offers a path to policy improvement from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving observed behavior without extrapolating beyond what the offline data supports. We propose \emph{counterfactual transport flows}, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we construct local preference pairs from offline data by retrieving nearby trajectories in latent trajectory space with higher task-specific feedback, and use them as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, enabling a trade-off between preserving the original behavior and applying stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior from historical returns as world feedback, while providing interpretable trajectory-level refinement paths.

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09117v1 Announce Type: new Abstract: While Ising machines serve as advanced physical solvers for the Ising model,enabling applications in combinatorial optimization and neural network training,their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work,we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

Sushant Mehta, Liudas Panavas, Edwin Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09118v1 Announce Type: new Abstract: As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.

AutoPilot: Learning to Steer High Speed Robust BFT

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09120v1 Announce Type: new Abstract: Recent Byzantine Fault Tolerant (BFT) protocols achieve strong performance by combining the low-latency advantages of leader-based BFT protocols with the high-throughput benefits of DAG-based data dissemination. Despite exposing a wide spectrum of internal tunable parameters, these protocols typically rely on static and heuristic configurations, which leads to performance degradation under dynamic workloads, heterogeneous network conditions, and evolving adversarial behaviors. In this paper, we present AutoPilot, a reinforcement learning-based framework that continuously monitors runtime conditions and dynamically adjusts protocol parameters online to optimize consensus performance. To ensure robustness, AutoPilot coordinates learning in a decentralized manner, providing resilience against adversarial data pollution. We implement AutoPilot on top of Autobahn, a state-of-the-art, highspeed, robust BFT protocol, and evaluate it across diverse dynamic environments. Experimental results demonstrate that AutoPilot quickly converges to the optimal configuration under changing environments, reduces end-to-end latency by 49.8% compared to the default protocol configuration, and outperforms random configuration exploration by 73.3%.

Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations

Arun Malik — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09122v1 Announce Type: new Abstract: Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.

An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification

Xian Li, Yanfeng Gu, Aleksandra Pi\v{z}urica — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09123v1 Announce Type: new Abstract: Multispectral point cloud (MPC) is composed of 3D spatial-spectral information, which holds tremendous potential for accurate land-cover classification. However, the representation power of classification models is limited by inherent high-dimensional and heterogeneous spatial-spectral information, unbalanced sample distribution, and inter-class spectral similarity of airborne MPCs. We build two MPC datasets and propose an enhanced geometric-spectral feature learning framework based on attentions for airborne MPC classification. A key component in our model is a two-stream feature fusion method with attention mechanisms, which enhances the representation capability of spatial-spectral features from high-dimensional heterogeneous MPCs. The first stream aims to extract position-encoded global spectral features with fusion self-attention, and the second stream comprises a multikernel point convolution and feature aggregation attention to extract spectral-guided geometric features. We then develop a residual attention fusion block to integrate the most informative geometric-spectral features from the two parallel streams. Another important contribution of this work is a joint loss function to improve the learning ability on unbalanced and interclass similar samples. Experimental results on two airborne MPC datasets demonstrate the effectiveness of the proposed method compared with the state-of-the-art methods. Furthermore, the codes and datasets used in this paper will be made available freely at https://github.com/HITlixian/TGRS_GSFF.

A Regret Minimization Framework on Preference Learning in Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09124v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization $(\textbf{RePO})$, which reframes RLHF through $\textit{regret minimization}$ rather than reward maximization. Human preferences are often shaped by $\textit{prospective}$ anticipation of outcomes and $\textit{counterfactual}$ comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. $\textbf{RePO}$ captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that $\textbf{RePO}$ is an effective and human-aligned approach for training large language models.

Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09125v1 Announce Type: new Abstract: Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks. (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks, and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Our dataset and code can be found here.

Semantic and Task-Oriented V2X Communications: Pushing the Limits of V2X Networks Scalability

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09126v1 Announce Type: new Abstract: Scalable Vehicle-to-Everything (V2X) networks are key to support the large-scale deployment of connected and automated mobility. However, the scalability of V2X networks is currently challenged by the limitations of existing V2X communication paradigms, which prioritize the reliable and timely delivery of the transmitted information over a careful message content selection - an approach that can potentially lead to the transmission of unnecessary information and an inefficient usage of communication resources. Semantic and task-oriented V2X communications have recently been proposed to address these scalability challenges by focusing on the content of the transmitted messages, particularly on its relevance to the intended receivers. In this paper, we numerically demonstrate that semantic and task-oriented V2X communications can substantially improve the scalability of V2X networks, increasing by up to a 4.1x factor the number of supported vehicles under high-density conditions. In addition, we show that semantic and task-oriented V2X communications can also decrease the inter-reception time between consecutive messages by up to 67% and lead to a twofold increase in the probability of successfully delivering all required relevant information to the intended receivers.

OpenOpt: An Open-Source SRAM Optimizer Based on Equivalent Circuit Model

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09129v1 Announce Type: new Abstract: This paper proposes a co-optimization framework that jointly optimizes SRAM architecture and transistor sizing using equivalent circuit models. The framework simplifies inactive SRAM cells into equivalent RC loads and static power models, achieving up to 61.4$\times$ simulation speedup while maintaining high fidelity (read/write delay error $<$0.22%, power error $<$1.68%). A joint search space encompassing architecture parameters and device sizing integrates seven algorithms including SA, PSO, Bayesian Optimization variants, and multi-objective evolutionary algorithms. Based on FreePDK45, ablation experiments confirm complementary gains from architecture selection and transistor sizing. Among all algorithms, MOEA/D achieves the best Figure of Merit (8.2721), yielding 6.2% improvement in SNM, 73.6% reduction in area, and 42.3% reduction in peak power. The framework is publicly available at https://github.com/W1Y1K1/OpenOpt.

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Siyuan Liu, Jinyang Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09131v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

Vision Language Model Helps Private Information De-Identification in Vision Data

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09132v1 Announce Type: new Abstract: Visual Language Models (VLMs) have gained significant popularity due to their remarkable ability. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs remain largely overlooked such as Protected Health Information (PHI) in medical images. To tackle this problem, two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection should be performed. To address this issue, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset OPTIC (Optical Privacy Text Instruction Collection) and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code can be found here.

Multiversion Concurrency Control for Multiversion B-Trees

Amir Tonta, Bernhard Seeger, Eljas Soisalon-Soininen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09133v1 Announce Type: new Abstract: Multiversion concurrency control (MVCC) enables scans to read data from a committed snapshot (version), reducing conflicts with write operations compared to traditional concurrency approaches. Currently, versioned records are often managed in a B$^+$-tree using version chains. However, version chains introduce overhead during scans and can still lead to conflicts between scans and writers. The multiversion B-tree (MVBT) was designed for optimal range scan performance on arbitrary versions, but has been considered impractical due to its structural complexity and, until recently, the lack of effective concurrency control. In this paper, we present the concurrent MVBT (cMVBT), a redesign of the MVBT featuring a novel concurrency control protocol that uses optimistic latches for write operations and requires no latches for range scans, while preserving all the optimality guarantees of the original MVBT. Additionally, cMVBT supports continuous garbage collection without activity spikes, seamlessly integrating free-space management. Experiments with mixed workloads derived from a standard benchmark show that the cMVBT achieves low overhead, high write throughput, and excellent range scan performance, outperforming state-of-the-art methods based on version chains.

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09134v1 Announce Type: new Abstract: Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

Steganography Without Modification: Hidden Communication via LLM Seeds

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09135v1 Announce Type: new Abstract: We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modification to model weights, sampling code, or output distributions. The channel exploits a structural property of deterministic decoding: pseudo-random number generators (PRNGs) used in inverse-transform sampling produce a seed-dependent sequence of token-level probability intervals that can be reconstructed from the generated text alone. A sender encodes a secret message in the PRNG seed before generation; a receiver reconstructs the intervals and recovers the seed, and thus the hidden payload, by exhaustive search over the seed space. We formalize two operational modes. In the known-prompt setting, sender and receiver share the prompt, enabling exact interval reconstruction and perfect seed recovery via forced alignment. In the unknown-prompt setting, only the generated text is available; approximate interval reconstruction combined with a maximum-hit-count scoring strategy still permits reliable recovery from sufficiently long outputs. Extensive experiments across six model families and five heterogeneous text domains show that, in the known-prompt setting, full 32-bit seed recovery from the complete 2^32 candidate space achieves up to 100% accuracy, depending on model and text domain, within 300 tokens and under 35 seconds on a single GPU. In the unknown-prompt setting, recovery reaches near-perfect accuracy at 600-800 tokens in about 12 seconds. We further analyze the influence of prompting strategies, tokenization ambiguities, and sampling hyperparameters on channel reliability. Moreover, we discuss several applications of our results: First, it allows for the steganographic transmission of 32 bits, but also shows that ignorance of the prompt is not a valid security assumption.

Modeling of Spinning Plates: Geometric Stiffening and Modal Approximation for GNC Applications

Umberto Zucchelli, Irene Valles S\'anchez, Francesco Sanfedino — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09137v1 Announce Type: new Abstract: This work presents a modal formulation for flexible rectangular plates, accounting for nonlinear geometric effects arising from in-plane foreshortening and centrifugal stiffening. The model is linearized with respect to elastic deformations while retaining the full dependence on spacecraft angular velocities and accelerations. System matrices depend nonlinearly on spacecraft states through squared and cross-product terms, capturing gyroscopic coupling and dynamic stiffening phenomena for arbitrary rotational maneuvers. Polynomial approximation of mode shapes enables efficient computation while preserving accuracy. Model predictions are validated against finite element simulations and literature data for transient response under prescribed hub motion.

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09138v1 Announce Type: new Abstract: Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at link https://youtu.be/Pw47dAOw6B0.

A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras

Zibin Liu, Shunkun Liang, Banglei Guan, Yang Shang, Qifeng Yu, Ji Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09139v1 Announce Type: new Abstract: Despite the rapid advancements in event-based motion estimation, current geometric methods primarily focus on velocity estimation. However, absolute pose estimation, which is equally crucial for key applications such as robotic navigation and augmented reality, remains relatively underexplored. Consequently, the simultaneous recovery of absolute pose and velocity from event streams remains an open and challenging problem. To address this gap, we propose a geometric framework for absolute pose and velocity estimation by leveraging 3D lines in the scene and the events they trigger. At the core of the framework lie two key geometric constraints: the orthogonality between a 3D line and the normal vector of its corresponding event plane, and the collinearity of an event with the 2D projection of its associated line. Based on these constraints, we present both linear and polynomial solvers for absolute pose estimation. The former enables efficient computation, while the latter provides a globally optimal solution for rotation. For velocity estimation, we develop an efficient linear solver and a more accurate optimization-based solver to recover both angular and linear velocities. Notably, our methods require a minimum of three event-line correspondences to determine the 6-DoF absolute pose or velocities independently. Extensive experiments in simulation and on real-world datasets demonstrate that our methods achieve state-of-the-art performance, with significant improvements in accuracy and computational efficiency compared to existing methods. The demo code is publicly available at https://github.com/Zibin6/EventPoseVelocity.

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

Yi Huang, Lei Bi, Jinman Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09140v1 Announce Type: new Abstract: Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Danya Li, Xiang Su, Yan Feng, Rico Krueger — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09142v1 Announce Type: new Abstract: Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9\% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

Yanze Jiang, Yanfeng Gu, Xian Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09143v1 Announce Type: new Abstract: Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent groundobject occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds them as priors throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.

Containerizing BIDSme : A Reproducible Tool for BIDS Conversion

Bradley Spitz, Antoine Jacquemin, Nikita Beliy, Christophe Phillips — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09144v1 Announce Type: new Abstract: The "Brain Imaging Data Structure" (BIDS) has become a widely adopted standard for organizing and sharing neuroimaging datasets of various modalities. However, converting raw brain imaging data into BIDS framework remains a complex and time-consuming task. BIDSme is a semi-automated tool developed to streamline this conversion process, but until recently, it lacked the portability and accessibility needed for widespread adoption. This paper presents the containerization of BIDSme using Docker and Docker Compose, improving usability, reproducibility, and integration into existing platforms like Neurodesk. It also details the design choices, iterative refinements, and validation process that led to a flexible, lightweight, and user-friendly containerized application.

PrivCode++: Latent-Conditioned Differentially Private Code Generation for Comprehensive Guarantees

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09145v1 Announce Type: new Abstract: Large language models fine-tuned on instruction-code pairs may memorize and subsequently leak sensitive training data. Existing differentially private (DP) code generation methods primarily protect code snippets while assuming prompts are public, which fails in realistic scenarios where prompts may also contain sensitive information. When prompts cannot be explicitly learned or used during generation, code synthesis suffers from severe utility degradation as well as reduced diversity and fidelity. To address these challenges, we propose PrivCode-Plus, the first work to explore DP code generation where both prompts and code snippets are considered sensitive in LLM fine-tuning. PrivCode-Plus introduces a two-stage DP framework with a Privacy-Free Latent Conditioning module, enabling effective DP fine-tuning and data synthesis without direct access to sensitive prompts or code. Extensive experiments show that PrivCode-Plus achieves substantially higher utility than baselines, remains competitive with the method with relaxing privacy assumptions, and provides stronger privacy guarantees.

Explicit Representation Alignment for Multimodal Sentiment Analysis

Baode Wang, Ziming Wang, Huacan Wang, Ronghao Chen, Biao Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09148v1 Announce Type: new Abstract: Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09150v1 Announce Type: new Abstract: While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.

Customization under Fire: Plugin Poisoning in Text-to-Image Ecosystem

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09151v1 Announce Type: new Abstract: The prosperity of text-to-image (T2I) models has fostered a vibrant share-and-play ecosystem centered on Low-Rank Adaptation (LoRA) plugins, which allow users to customize and share model capabilities with ease. This democratization, however, comes with a hidden but severe security risk. Malicious users could share and distribute seemingly benign LoRA plugins that contain hidden functionalities to poison the model-sharing market, like Civitai or Liblib, severely undermining the user trust that underpins this collaborative ecosystem and threatening the safety of countless downstream applications. Despite these risks, plugin poisoning in the real-world T2I ecosystem remains underexplored. This paper introduces PoisonLoRA, the first systematic study of LoRA plugin supply-chain risks that exploits the trust and characteristics within the T2I ecosystem. We identify two primary attack instances: (1) Concept Hijacking, where a hijacked LoRA could generate images to influence public opinion and spread propaganda, and (2) Task Injection, where a LoRA is injected to produce harmful content (e.g., NSFW images) only activated by a secret key. Critically, the malicious payload persists with virus-like propagation. Such propagations weaponize the very act of creative collaboration (e.g., LoRA merging) to spread its contagion, turning every remix into a new carrier. Extensive experiments validate that PoisonLoRA is both effective and stealthy. Specifically, we achieve approximately 100% attack success rates (ASR) on both Civitai and Liblib on 6 datasets across 4 scenarios, without being detected by the platforms. The poisoned LoRA demonstrates extreme robustness, with nearly 100% ASR even transferred to different base models and remixed more than 5 times.

Improved Convergence Analysis of Topology Dependence in Decentralized SGD

Yuki Takezawa, Anastasia Koloskova, Sebastian U. Stich — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09154v1 Announce Type: new Abstract: Decentralized SGD is a fundamental algorithm in decentralized learning, although the influence of an underlying network topology on its convergence behavior is not yet fully understood. Existing convergence analyses have shown that topologies with a small spectral gap significantly deteriorate the convergence rate of Decentralized SGD in both homogeneous and heterogeneous cases. However, many prior papers have reported that indeed the choice of the topology has a significant experimental impact in the heterogeneous case, but has little experimental impact on training behavior in the homogeneous case. In this paper, we present a tighter convergence analysis of Decentralized SGD, offering a more precise understanding of how topologies affect the convergence rate than the prior analysis. Specifically, unlike existing convergence analyses that used only the spectral gap as a property of the topology, our novel analysis shows that all eigenvalues of the mixing matrix affect the convergence rate. Throughout the experiments, we carefully evaluated the convergence behavior of Decentralized SGD and demonstrated that our novel convergence analysis can more accurately describe the effect of topology on the convergence rate.

Bridged SBI: Correcting Biased Low-Fidelity Posteriors for Cost-Efficient High-Fidelity Inference

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09155v1 Announce Type: new Abstract: Accurate calibration of particle-based simulators is crucial for robotic earthwork simulation, but analytical calibration is challenging due to this task's highly nonlinear particle dynamics and the black-box nature of conventional simulators. Although simulation-based inference (SBI) can estimate posterior distributions over simulation parameters solely from forward simulations, applying SBI directly to high-fidelity (HF) particle simulators is often computationally prohibitive. Low-fidelity (LF) simulators with coarser particles can reduce this cost, but changes in particle size and particle count shift the parameter values needed to reproduce the same observation, producing biased LF posteriors. We propose Bridged SBI, which leverages a biased but informative LF posterior to guide HF inference. This method first uses inexpensive LF simulations to identify a coarse high-density parameter region, and then it learns a local residual bridge to transport LF posterior samples toward HF-consistent regions by correcting the LF--HF discrepancy. We analyze how sequential multi-fidelity SBI (Naive-MF) can suffer from LF-induced posterior miscoverage when it directly relies on the LF posterior without discrepancy correction. We then show that Bridged SBI is designed to alleviate this issue by explicitly modeling the LF--HF discrepancy through residual correction. Experiments on both sim-to-sim particle-parameter calibration and real-to-sim calibration with real soil observation show that Bridged SBI produces more accurate and reliable HF posteriors than HF-only SBI or the Naive-MF baseline, especially under limited HF simulation costs.

OmniGen-AR: AutoRegressive Any-to-Image Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09156v1 Announce Type: new Abstract: Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmark, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09157v1 Announce Type: new Abstract: This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09159v1 Announce Type: new Abstract: Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining with the independent energy (Ind-E), we obtain a unified energy (Uni-E), that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. Besides, Uni-E is model agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across Diffusion Language Models (DLMs) and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

Crop Recommendation and Agricultural Query Answering System Using Spatio-Temporal Graph Neural Networks and Hybrid Retrieval Augmentation

Prajwal Thapa, Yagya Raj Pandeya — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09160v1 Announce Type: new Abstract: This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendation, and a question-answering tool for farmers. We propose two deep learning models -- a Transformer-based Graph Neural Network and a Spatio-Temporal Graph Convolutional Network (STGCN) -- to forecast weather conditions for the next 30 days using data from 1,359 locations in Nepal. The STGCN outperforms the Transformer-based model in accuracy (MSE ~0.011 vs. 0.013), effectively modeling both spatial and temporal dependencies in climate data. These predictions are combined with static soil properties such as pH, moisture, and organic content to generate localized crop recommendations through a scoring algorithm that matches each crop's optimal growing conditions. Additionally, we develop a Retrieval-Augmented Generation (RAG) chatbot that leverages domain-specific agricultural documents to answer farmers' questions in natural language. The entire system is deployed via a mobile application, offering real-time suggestions and conversational support. User feedback confirms the system's usability and relevance, especially in rural settings where personalized farming guidance is limited. Overall, our approach demonstrates how combining machine learning models with local agricultural data can empower farmers with actionable insights, promoting more informed decisions, better crop yields, and increased resilience to climate variability.

Extreme Points of the $(0,\delta)$-LDP Polytope with Small Input Size and Arbitrary Output Sizes

Supriya Rawat, Myna Vajha, Gowtham R. Kurri, Anand Sarwate — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09161v1 Announce Type: new Abstract: The structure of locally differentially private (LDP) mechanisms can be understood through the geometry of the corresponding privacy polytope. While the extreme points of the $ (\epsilon,0)$-LDP polytope are well characterized (Kairouz \emph{et al.}, 2014; Holohan \emph{et al.}, 2017; Pensia \emph{et al.}, 2017), comparatively little is known for the $(\epsilon,\delta)$-LDP polytope with $\delta>0$. Recent work (Elangovan and Jog, 2024) has shown that even in the special case $\epsilon=0$, the $ (0,\delta) $-LDP privacy polytope exhibits fundamentally different behaviour. In this work, we provide complete characterizations of the extreme points for the low-input-alphabet regime $k=2$ and $k=3$ and with arbitrary output alphabet size $m $. We also identify new extreme mechanisms for larger input alphabet sizes $k$, of the star configuration type, as introduced by Elangovan and Jog (2024).

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation

Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Juanfan Wu, Mingxuan Cui, Yufeng Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09162v1 Announce Type: new Abstract: Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical flow warp before fusion via Semantic Similarity Propagation. The gate requires no learned parameters -- only a median-threshold binary decision on RANSAC statistics -- adding only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves +4.24--4.91\% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62\% to 92\% (+29.5pp) in homography-valid regions.

EnclaveScale: Hardware-Assisted Edge-DP for Secure Data Centre Power Telemetry

Hung Dang, Tue Nguyen, Minh Vo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09163v1 Announce Type: new Abstract: EnclaveScale is a distributed, hardware-assisted telemetry architecture providing post-extraction attestation, enabling operators to collaboratively model high-resolution generative AI power transients. Existing cryptographic techniques scale poorly for 10-Hz streaming or fail to authenticate origins, permitting malicious hosts to spoof sensor inputs. We implement and evaluate a post-extraction pipeline utilizing DCAP attestation, differential privacy noise injection, and Byzantine rejection across 32 GCP Confidential VMs, achieving 0\% post-extraction attack success rate. This edge-DP approach distils continuous GPU transients into discrete Markov-chain transition matrices, guaranteeing event-level differential privacy. To mitigate pre-ingestion vulnerabilities, we propose an SPDM-authenticated first-mile layer. While current platforms lack attested I/O, emerging hardware architectures integrate PCIe IDE and TDISP to natively prevent host-level synthesis, securing the end-to-end provenance boundary. A Global Aggregation Enclave verifies these cryptographic proofs prior to capacity-weighted aggregation. Evaluation demonstrates a steady-state throughput of $131{,}406$ samples/s per enclave, amortising attestation overhead to $0.23\,\mu$s/sample. On empirical NVML-sampled H100, A100, and L4 traces, EnclaveScale achieves a dynamic orchestration margin error of $1.3$\,MW compared to $0.1$\,MW for an honest-aggregator central-DP baseline. EnclaveScale establishes a secure foundation for dynamic multi-tenant power orchestration, obfuscating sub-second anomalies locally and protecting macro-workload confidentiality via spatial dilution during global aggregation.

Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

Yongtaek Lim, Hyeji Choi, Minwoo Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09165v1 Announce Type: new Abstract: Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed rubric baseline (variance 0.76).

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09167v1 Announce Type: new Abstract: Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template. Such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address the fundamental challenge of spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to seamlessly integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic update template feature strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update template feature, ensuring efficient template feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09169v1 Announce Type: new Abstract: In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

sketch-plot: Progressive Editing for Text-to-Image Academic Figures

Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09171v1 Announce Type: new Abstract: Text to image (T2I) models such as gpt-image-2 can now generate publication grade academic figures from a short prompt, but the output is a flat raster: a user who wants to change one arrow, one label, or one icon has to regenerate the whole image, which also disturbs the parts they wanted to keep. We present sketch-plot, an interactive system that closes this controllability gap with a three layer progressive editing pipeline: a generated PNG, an addressable puzzle of editable pieces, and a per piece SVG. The user stops at the layer that gives them enough control for the change at hand, so the cost of decomposition and vectorisation is paid only on the pieces that need it. Realising this pipeline is not trivial. General segmentation models lack the semantic discriminability to decompose a research figure cleanly, and end to end image vectorisation produces incomplete shapes and loses semantic structure. We therefore route both stages through a human in the loop interface that lets the user accept, refine, or reject decomposition and vectorisation decisions on a piece by piece basis. We validate the design with an expert user study, in which participants found sketch-plot effective for making targeted edits to AI generated academic figures and preferred it over regenerating the whole image. A demonstration video is available at https://anonymous.4open.science/r/SketchPlotVideo/.

Uniform-in-time Strong Error Estimates of Tamed-FEM to Superlinear SPDEs driven by Multiplicative Noise

Jingjing Cai, Zhihui Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09173v1 Announce Type: new Abstract: We establish sharp, uniform-in-time strong error estimates for a nonlinearity-explicit tamed finite element method (FEM) applied to a class of superlinear stochastic partial differential equations (SPDEs) driven by multiplicative noise, including the stochastic Allen--Cahn equation with a moderately thick interface. This tamed-FEM was first introduced in [Z. Liu and J. Shen, arXiv:2502.19117] to ensure long-time unconditional stability and to preserve the Lyapunov structure of this class of SPDEs. We further prove that the scheme is exponentially ergodic and derive the convergence rate between the exact invariant measure and its numerical counterpart in the Wasserstein-2 distance. Finally, we present numerical experiments that verify the ergodicity as well as the sharpness and time-independence of the strong convergence rates for this tamed-FEM.

Demonstrating chart-plot: Closing the Last Mile of Academic Chart Generation

Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Jiale Lao, Tingfeng Lan, Wei Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09174v1 Announce Type: new Abstract: Large language models can translate a researcher's intent into runnable matplotlib code, yet the resulting chart rarely lands in a paper without multiple rounds of manual revision. We argue that the open problem is not chart code generation but chart publication: making the output look like a top-venue figure, survive the target layout, and respond to precise author edits. We present chart-plot, an agentic harness that closes this last mile through three components: (1) a style-aware code generator conditioned on a textual style skill distilled from accepted figures at the target venue, (2) a deployment-aware render loop that compiles the chart inside the target LaTeX context and revises until layout constraints are met, and (3) a structured edit layer that exposes every chart element as a directly manipulable handle. We report early results on three chart-type case studies (grouped bar, scaling line, paired distributions) and a small user study.

CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon

Zheshun Wu, Ziyang Zhang, Changyao Lin, Zenglin Xu, Jie Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09175v1 Announce Type: new Abstract: Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle the challenge of device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving the regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency compared to state-of-the-art baselines. Especially, in prototype experiments on two edge devices, the proposed CANS reduced average inference latency by up to 50% compared to the non-cooperative baseline.

Performance Evaluation of Social Learning

Felice Scala, Marco Carpentiero, Vincenzo Matta, Ali H. Sayed — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09176v1 Announce Type: new Abstract: Social Learning is a decentralized decision-making paradigm in which spatially dispersed agents collect streaming observations regulated by one of a finite number of models (the hypotheses). The agents are interested in assigning probability scores (the beliefs) to the possible hypotheses. To this end, the agents exchange their beliefs according to a certain communication graph. It has been shown that, under reasonable conditions on the identifiability of the decision model and the network connectivity, each agent ultimately places all the belief mass on the true hypothesis governing the data. However, several questions remain unanswered regarding the evaluation of the social learning performance. One recently adopted performance metric is the rejection rate, i.e., the rate at which the beliefs about the erroneous hypotheses vanish. One contribution of this work is to establish that the rejection rate leads to several paradoxes, which make it unsuitable as a valid performance measure. We then focus on studying the error probability measure. For a binary Gaussian problem, we derive an analytical formula characterizing the ratio between the individual agents' probabilities and the optimal Bayesian probability. The formula shows that this ratio is expressed by the product of two terms quantifying the effect of the network connectivity and the role of the prior information. As a result, an irreducible gap emerges between the decentralized and the centralized error probabilities, which is agent-dependent and does not disappear asymptotically.

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

Hyeji Choi, Yongtaek Lim, Minwoo Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09178v1 Announce Type: new Abstract: Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLM. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.

Claude Code-Driving Scenario Mining for the Argoverse 2 Challenge

Wei Deng, Caoshengzhe Xue, Shuaikun Liu, Zhaohong Liu, Mengshi Qi, Huadong Ma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09180v1 Announce Type: new Abstract: We present our submission to the CVPR 2026 Argoverse 2 Scenario Mining Challenge. Our system uses a four-stage pipeline: (1) autonomous code generation via a Claude Code agent powered by GLM~5.1, (2) iterative training set screening with Timestamp Balanced Accuracy threshold 0.8 to curate few-shot examples, (3) semantic code review by a separate Claude Code session, and (4) Qwen3-VL scene-level verification to filter false positives. We report results on the Argoverse 2 test set.

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li, Keisuke Fujii — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09181v1 Announce Type: new Abstract: Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods either rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.

Understanding How Enterprises Adopt the Model Context Protocol for LLM-Driven Software Engineering

Kehui Chen, Yicheng Sun, Jacky Keung, Zhenyu Mao, Xiaoxue Ma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09182v1 Announce Type: new Abstract: Large Language Models (LLMs) are increasingly used in AI-based software engineering, but their limitations in complex task execution and multi-tool coordination have driven growing interest in the Model Context Protocol (MCP). Existing research has mainly focused on MCP's technical design, with limited empirical evidence on how it is adopted and used in enterprise practice, particularly with regard to deployment challenges, operational risks, and practitioner expectations. To address this gap, we conducted semi-structured interviews with 20 practitioners from eight companies in the Internet and financial sectors. The findings show that MCP is valued for supporting cross-system collaboration, task decoupling, and knowledge reuse in LLM-based workflows, but its adoption remains constrained by ecosystem fragmentation, cross-component coordination difficulties, and unresolved problems in distributed state management and fault diagnosis. Participants also expressed strong demand for better standardization, lower adoption barriers through low-code or plugin-based approaches, and more systematic operational support. These results provide early empirical evidence on enterprise MCP practice and offer practical implications for improving MCP's standardization, usability, and deployment readiness in real-world software engineering environments.

Autonomous Obstacle Removal for Excavators through Policy Learning with Particle Simulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09183v1 Announce Type: new Abstract: Autonomous obstacle removal from the ground is an important earthwork task, but this is difficult to automate because an excavator must adapt its excavation trajectories over repeated cycles as soil-obstacle conditions change. Learning such state-dependent behavior requires a training environment that reproduces accumulated soil-obstacle interactions, including contact states, terrain deformation, and obstacle visibility. Accordingly, particle-based simulation is suitable for the relevant policy learning. However, particle simulation is computationally expensive, and repeated excavation cycles further increase the learning cost. We observe that the burial condition of an obstacle governs both task difficulty and simulation cost: deeper burial makes obstacle removal harder while also requiring more particles for accurate simulation. This observation motivates a burial-conditioned curriculum learning strategy. We propose a time-efficient sim-to-real policy learning framework in which the policy observes terrain and obstacle information from RGB-D measurements and then outputs a parameterized excavation trajectory; in this process, the simulator reproduces in a real-world excavator the same observation-action interface it uses under controllable burial conditions. The curriculum begins with shallow burial conditions and progressively increases burial depth while adjusting particle count, thus simultaneously controlling task difficulty and simulation cost. Experiments show that the proposed framework successfully learns an effective obstacle-removal policy, whereas baseline methods fail even after a full week of training. The proposed curriculum achieves effective performance within three days and achieves successful transfer to a real 12-ton excavator operating on open ground with various steel obstacles, thus demonstrating robust obstacle removal.

DuplexOmni: Real-Time Listening, Seeing, Thinking, and Speaking for Full-Duplex Interaction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09186v1 Announce Type: new Abstract: Human interaction is continuous, multimodal, and full-duplex by nature. Although recent omni models have made substantial progress in unified speech, vision, and text modeling, combining seamless real-time interaction with complex reasoning and tool use remains challenging. We present DuplexOmni, a method for real-time multimodal full-duplex interaction. DuplexOmni separates model capability into an interaction layer and a thinking layer, which collaborate asynchronously in parallel. The interaction layer is implemented by the DuplexOmni model, an end-to-end system that processes streaming audio and video inputs while generating text and speech responses in real time. The thinking layer is a pluggable module that provides complex reasoning and tool-use capabilities. To support this method, we further develop a Writer-Director pipeline for constructing continuous-interaction training data. Experiments show that DuplexOmni achieves strong performance on multiple public benchmarks and exhibits natural full-duplex interaction ability.

CP4D: Compositional Physics-aware 4D Scene Generation

Hanxin Zhu, Cong Wang, Tianyu He, Long Chen, Xin Jin, Chen Gao, Zhibo Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09187v1 Announce Type: new Abstract: 4D generation (\textit{i.e.}, dynamic 3D generation) has recently emerged as a rapidly growing research frontier due to its powerful spatiotemporal modeling capabilities. However, despite notable advances, existing approaches typically fail to capture the underlying physical principles, producing results that are both physically inconsistent and visually implausible. To overcome this limitation, we present CP4D, a novel paradigm for photorealistic 4D scene synthesis with faithful adherence to complex physical dynamics. Drawing inspiration from the compositional nature of real-world scenes, where immutable static backgrounds coexist with dynamic, physically plausible foregrounds, CP4D reformulates 4D generation as the integration of a static 3D environment with physically grounded dynamic objects. On this basis, our framework follows a three-stage pipeline: \textbf{1)} Firstly, we leverage pre-trained expert models to generate high-fidelity 3D representations of the environment and foreground objects respectively. \textbf{2)} Subsequently, to produce physically plausible trajectories and realistic interactions for these objects, we propose a hybrid motion synthesis strategy that integrates priors from physical simulators with the common sense embedded in video diffusion models. \textbf{3)} Finally, we develop an automated composition mechanism that seamlessly fuses the static environment and dynamic objects into coherent, physically consistent 4D scenes. Extensive experiments demonstrate that CP4D can generate explorable and interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods. The project page: https://anonymous.4open.science/w/CP4D/.

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09188v1 Announce Type: new Abstract: Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes an trajectory optimization method for unmanned aerial vehicles (UAVs) in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios, and achieves a 69.70% improvement for dual-UAV configurations, exhibits superior performance in long-duration bearing-only target localization of maneuverability targets at extended ranges.

Pretrained, Frozen, Still Leaking: Auditing Cross-Encoder Attribute Transfer in EEG Foundation Models

Jianwei Tai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09189v1 Announce Type: new Abstract: EEG foundation-model releases are usually audited one endpoint at a time: raw-reconstruction, membership inference, identity linkage, or DP-SGD on the downstream head. We audit the same released embeddings under all four endpoints jointly, on BIOT, LaBraM, and EEGPT, and show that each single-endpoint audit clears releases that still leak spectral attributes. The decisive evidence is a cross-encoder transfer audit: a single ridge attribute decoder learned from one frozen encoder transfers, via a fitted linear bridge, to held-out-subject test splits of every other encoder, with subject-disjoint matched-control 95% CI lower bound at least 0.081 across all six BIOT/LaBraM/EEGPT directions. We prove a sufficient condition: two encoders sharing a nontrivial attribute-coordinate projector overlap beta admit a chained ridge bridge attacker with centered-gain lower bound sqrt(beta/(1+tau^2)) - eps_br - rho_0, and back-solve beta in [0.008, 0.198]. To turn the joint audit into a deployment-readable decision rule we introduce an audit-endpoint disagreement score (AEDS), prove sufficient conditions for its positivity, and bootstrap-calibrate it per cell; AEDS is positive in all eight matched-CI cells (BIOT/LaBraM/EEGPT on EEGMMI; LaBraM on Sleep-EDF, 54-channel LIMO, CHB-MIT pediatric scalp EEG) with p<0.001, while a head-level Carlini LiRA membership audit reaches AUC only 0.50-0.70. Standard defenses fail under audit: a Wiener-style noise-aware adaptive attacker, the LiRA audit, and DP-SGD at every utility-preserving epsilon in {4,8} leave the attribute channel essentially unchanged. The contribution is an audit framework that turns scattered single-endpoint defenses into a joint release decision, supported by a cross-encoder bridge theorem and adaptive-attacker, LiRA, and DP-SGD baselines; the audit licenses release-blocking, not raw-waveform exfiltration or held-out-subject identity recovery.

Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards

Joel Q. L. Chang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09191v1 Announce Type: new Abstract: We prove that $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $\rho$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $\rho$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.

Symbolic and Abstractive Reasoning with Complex Visual Queries

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09195v1 Announce Type: new Abstract: Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: \textbf{Data $\times$ Paradigm $\times$ Exploration}. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.

MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation

Yongrui Liu, Deyi Xiong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09198v1 Announce Type: new Abstract: Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through internet and local knowledge bases, often resulting research in lacking insight and creativity in social science. To address this issue, we propose "Memory-Augmented Social Simulation (MASS)", an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical founding of LLMs-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm restraint to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81\% improvement in generation overall quality over foundation LLMs and 17.19\% gain in Insight over strong baselines.

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

Minyu Cui, Miquel Pericas — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09200v1 Announce Type: new Abstract: The rapid growth of large-scale machine learning (ML) has made distributed training across multiple GPUs a fundamental component of modern ML systems. As model sizes and computational throughput continue to increase, communication overhead has become a dominant bottleneck in multi-GPU training, particularly when computation and communication are executed sequentially. This work explores concurrent execution of computation and collective communication using two portable runtime controls: shared-memory-driven occupancy shaping for computation kernels and elevated scheduling priority for communication kernels. Our approach regulates computation-kernel residency through per-block shared-memory allocation, leaving sufficient on-chip resources for communication kernels to make progress. In addition, assigning higher priority to communication streams ensures steady communication progress once resources become available. Experiments on NVIDIA A40, A100, H100, and AMD MI250X GPUs demonstrate that the proposed method enables effective computation-communication overlap and reduces total execution time by up to 25.5 percent, without modifying vendor libraries or kernel implementations.

Deterministic Execution of ROS~2 Applications via Lingua Franca

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09203v1 Announce Type: new Abstract: The Robot Operating System~2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze for safeness and provide guarantees. We present a framework that is able to convert an unmodified ROS 2 application and run it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same deterministic execution order. We first describe which ROS 2 features can be executed deterministically under logical time. Such features enable the possibility to establish an automatic conversion framework to extract information from a ROS 2 application and directly convert it into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application be executed in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that the order in which callbacks are executed differs in default ROS 2, while also having end-to-end latencies that vary across executions. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

Hyunseok Paeng — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09204v1 Announce Type: new Abstract: We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.

Event-driven dynamic trajectories reconstruction and measurement of mechanical parameters for fragments

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09208v1 Announce Type: new Abstract: During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mechanical parameters (position, velocity, kinetic energy) directly determine the lethality of the warhead fragment field. However, high-intensity flash and smoke in detonation scenarios severely hinder the accurate acquisition of these mechanical parameters. To address this challenge, this paper integrates experimental mechanics approaches and presents an event-driven method for reconstructing the dynamic trajectories of fragments and measuring their mechanical parameters. As a novel brain-inspired visual sensor, event cameras offer microsecond-level temporal resolution and high dynamic range lighting change perception, overcoming the difficulty of accurately measuring high-speed targets under strong flash interference. The method constructs a multi-event-camera vision system, adopting three geometric constraints: time-correlated epipolar constraint to find potential matching event point pairs, and trifocal tensor line constraint plus local homography constraint to eliminate mismatches. A comprehensive probability model is established, with entropy weight method determining the weight of each constraint's probability to quantitatively filter mismatches. 3D trajectory reconstruction is achieved via spatial line-line intersection and nonlinear optimization. Finally, the velocity and kinetic energy of the fragments are calculated based on the reconstructed trajectory. This method provides reliable technical support for the mechanical damage evaluation of warhead fragment fields and the tactical protection design.

Frequent Itemset Mining with Quantum Computing

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09209v1 Announce Type: new Abstract: Frequent Itemset Mining (FIM) is a foundational task in data analytics, but its candidate and conditional pattern spaces can grow rapidly, and maintaining support information becomes increasingly costly on dense datasets. These bottlenecks present a critical opportunity for quantum computing to redesign the way candidate representation and support verification are organized. Motivated by recent developments in quantum computing, we propose the \textit{QuantumFreqMine (QFM)} framework for FIM. QFM introduces three mechanisms: (1)~\textit{Bit-Vector Qubit Encoding}, (2)~\textit{Mining-Aware Candidate Superposition}, and (3)~\textit{Bit-Parallel Threshold Marking}. We provide a theoretical analysis in terms of time complexity, space comlexity, and logical resource usage. We implement QFM on IBM Qiskit and Amazon Braket. The experiments demonstrate that QFM outperforms representative baselines.

SNN-MLIR: An MLIR Dialect for Compiling Neuromorphic SNNs from NIR to Bare-Metal C

Alejandro Garc\'ia Gener, Alvaro Roll\'on de Pinedo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09213v1 Announce Type: new Abstract: Spiking neural networks (SNNs) are increasingly trained in a wide range of frameworks (SnnTorch, Lava, Norse, and others) each with its own model format. The Neuromorphic Intermediate Representation (NIR) addresses this fragmentation by providing a common, framework-independent format for exchanging trained SNN models. NIR solves the exchange problem, but it stops there. It provides a description of a network, not a path to running one. Each backend is still left to implement deployment on its own, with no shared, transformable compiler representation in between. This paper presents snn-mlir, an outof-tree MLIR dialect for SNNs together with a NIR-MLIR-C compilation bridge. The dialect provides a small set of typepolymorphic operations that work identically on floating-point (f32/f64) and quantized data, so a single intermediate representation serves both simulation and hardware-oriented deployment. A Python front end reads any NIR file and emits dialect IR, automatically inserting rescaling operations to keep quantization scales consistent across layers. A reference lowering pass converts the dialect to standard linalg and arith operations, from which the toolchain produces self-contained, dependency free C11 code that compiles and runs on any C-capable CPU or embedded target. We evaluate numerical fidelity against reference outputs, portability across CPU targets, and the cost of quantization. The current scope is feedforward, fully-connected networks with a CPU backend. snn-mlir is released as open source under the Apache-2.0 license with LLVM-exception and it is already available on Github.

MotionWAM: Towards Foundation World Action Models for Real-Time Humanoid Loco-Manipulation

Jia Zheng, Teli Ma, Yudong Fan, Zifan Wang, Shuo Yang, Junwei Liang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09215v1 Announce Type: new Abstract: World Action Models (WAMs) couple a video dynamics prior to the policy and have shown encouraging results on tabletop manipulation, but iterative denoising over high-dimensional video-action latents leaves them too slow for real-time humanoid loco-manipulation. The problem is compounded by the dominant hierarchical paradigm, in which a high-level manipulation policy controls only the upper body while a low-level controller tracks coarse base commands -- placing upper and lower body in inconsistent action spaces and reducing the legs to balance-preserving locomotion. We present MotionWAM, a real-time WAM that drives autonomous humanoid loco-manipulation from a single egocentric camera by conditioning the policy on the intermediate denoising features of a video world model. MotionWAM replaces the upper-lower split with a unified motion latent and predicts whole-body motion tokens that jointly cover locomotion, torso motion, height regulation, foot interaction, and hand manipulation in a single action space. A three-stage learning framework progressively adapts the video world model to egocentric visual dynamics and to the target humanoid embodiment. On nine real-world Unitree G1 tasks, MotionWAM runs in real time, substantially outperforms Vision-Language-Action (VLA) baselines fine-tuned on the same demonstrations by over 30% in overall success rate, and executes task-driven foot interaction that decoupled upper-lower policies cannot reach. Our results suggest that video-pretrained WAMs can be lifted from tabletop manipulation to coordinated, human-like whole-body humanoid control.

Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09218v1 Announce Type: new Abstract: As a bio-inspired intelligent sensor, event cameras have introduced a new paradigm in the intelligent perception of spatiotemporal information and visual motion estimation, characterized by their high temporal resolution, low latency, and minimal power consumption. However, their asynchronous data streams present significant challenges to traditional synchronous, frame-based algorithms. To address these challenges, this paper presents a novel framework for full degree of freedom (DoF) egomotion estimation directly from asynchronous optical flow, specifically targeting the joint recovery of angular and linear velocities. We decouple the differential epipolar constraint into distinct angular and linear velocity components, and derive its formulation for asynchronous data. Based on this formulation, an optimization algorithm is developed that enables full-DoF egomotion estimation leveraging at least five points. Furthermore, by applying a first-order approximation to rotational dynamics, we transform the constraint equations into a polynomial form, resulting in the first algebraic minimal 5-point solver for this formulation. To ensure real-time performance in high-speed scenarios, we additionally propose an accelerated solver achieved by truncating high-order angular velocity terms. Extensive evaluations on both synthetic and real-world datasets demonstrate that the asynchronous approach outperforms traditional synchronous methods, particularly in its accuracy and robustness to spatiotemporal noise. We believe that this work establishes a critical foundation for efficient and accurate continuous-time motion estimation in high-speed robotics applications.

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09219v1 Announce Type: new Abstract: Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high density, the effect of point spread functions and low signal-to-noise ratios, significantly challenge the latest advanced object detectors. Besides, fully-supervised detection methods are hardly practical, due to the significant difficulty in annotating dense, small, and faint sources in astronomical images. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET), comprising 18,400 astronomical images and 728,898 source instances. Upon the dataset, we further devise a novel semi-supervised learning framework coined Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that, Nova Teacher consistently improves previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method competes against other detectors on a natural image dataset, validating its generalization ability to various scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.

Quantitative Performance Analysis of Stopping Criteria for CMA-ES

Ryoji Tanabe — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09220v1 Announce Type: new Abstract: Covariance matrix adaptation evolution strategy (CMA-ES) is a state-of-the-art black-box optimization algorithm. In general, CMA-ES uses a portfolio of multiple stopping criteria to automatically determine when to stop the search. This mechanism aims to avoid unnecessary consumption of the function evaluation budget during stagnation. Stopping criteria play an important role in CMA-ES, particularly when restart strategies are employed. However, the effectiveness of stopping criteria in CMA-ES remains poorly understood. To address this issue, this paper investigates how the 11 stopping criteria in CMA-ES behave on the noiseless BBOB function set. The performance of the stopping criteria is quantitatively evaluated based on the optimal stopping point in terms of the number of function evaluations in a single run of CMA-ES. Our results show that, although which stopping criterion is triggered first depends significantly on the sample size $\lambda$ and the dimension $n$, \texttt{tolflatfitness} and \texttt{tolfun} are frequently the first criteria to be triggered among the portfolio of 11 stopping criteria. We also demonstrate that \texttt{tolfunhist} and the portfolio achieve the highest stopping accuracy in most cases. In addition, our results show that the \texttt{tolfun} and \texttt{tolfunhist} criteria are frequently triggered before CMA-ES reaches complete stagnation.

TinyContainer: Container Runtime Middleware Enabling Multi-tenant Microcontrollers with Built-in Security

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09225v1 Announce Type: new Abstract: Software containerization technologies for resource-limited devices enable multi-tenant microcontrollers, which allow running multiple applications with different permission levels. However, current solutions lack run time configuration over various settings on container scheduling and container permissions to host resources. This limits the applicability of constrained containerization in dynamic and heterogeneous environments. This paper introduces TinyContainer, a lightweight software container management middleware designed for multi-tenant microcontrollers. TinyContainer provides per-container configurable scheduling and fine-grained access control to host resources through a metadata-driven approach, supporting multiple runtimes via a runtime abstraction layer. We analyze the performance of TinyContainer with a small WebAssembly runtime, CS4WAMR, and RIOT OS, a common RTOS. We report on experiments using popular IoT boards based on various Cortex-M microcontrollers. We show the endpoint system brought by TinyContainer allowing to regulate access of containers to host resources and provide host services to containers with an overhead of up to 4 ms per call. In particular, we showcase a TinyML use case, whereby containers retain data and model weights, while model inference is delegated to native host RTOS services.

Trustworthy Smart Fabs via Professional Proxies: Scaling Safe and Sustainable by Design (SSbD) through Industrial Data Spaces

Han-Teng Liao, Chang-Yi Kao, Karen Ang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09227v1 Announce Type: new Abstract: The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directive (CSDDD), and Carbon Border Adjustment Mechanism (CBAM) introduce a severe governance bottleneck for advanced semiconductor manufacturing facilities ("Smart Fabs"). Regulatory compliance demands have surpassed the capacity of manual corporate reporting, creating a direct conflict between multi-stakeholder transparency and corporate data privacy. This paper addresses this challenge by introducing a zero-trust socio-technical orchestration framework that operationalizes a six-layer SSbD reference architecture within trustworthy industrial data spaces. We propose a shift from reactive automation to autonomous governance through "Professional Proxies"-role-based agentic workflows executing within hardware-isolated trust zones. Structured as an interoperable network protocol stack, the framework coordinates an automated, five-step "relay race" between Facility, Process Engineering, and Finance proxy teams to align factory-floor yield models with macro-level sustainability mandates. By executing Virtual Metrology (VM) predictions and Federated Machine Learning (FML) inside hardware-rooted Trusted Execution Environments (TEEs), this architecture resolves the Data Sovereignty Paradox, demonstrating how fabs can export cryptographically signed compliance tokens via International Data Spaces (IDS) connectors without exposing proprietary process recipes. Ultimately, this framework provides technology managers with a verifiable, evidence-based pathway toward resilient, net-zero Industry 5.0 ecosystems.

End-to-End Training for Discrete Token LLM based TTS System

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09234v1 Announce Type: new Abstract: Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multi recognition task for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves a word error rate (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.

Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09236v1 Announce Type: new Abstract: Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

Can we stabilize an inverted pendulum with feedback from a time-of-flight camera?

Anthony Czubarow, Antonio Terpin, Raffaello D'Andrea — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09237v1 Announce Type: new Abstract: Time-of-flight cameras are popular in robotics for providing direct depth information while being compact, inexpensive, and robust to lighting conditions, but their low spatial resolution and depth noise are widely believed to preclude precise feedback control. In this paper, we show that an inexpensive, low-resolution time-of-flight camera provides sufficient feedback to reliably and precisely balance an inverted pendulum on a cart--a canonical benchmark for fast, unstable dynamics.

Orange Lab: Lowering Barriers to Data Mining through Embedded Interactive Workflows

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09239v1 Announce Type: new Abstract: While visual programming of data analysis workflows has become an important vehicle for the democratization of data science, such systems remain largely confined to standalone applications and offer limited support for transitioning their visual analytics solutions into interactive web environments. As a result, data analysis pipelines are difficult to share, embed, and adapt into user-facing analytical tools. We present Orange Lab, a web-based collaborative environment for visual data analytics. At its core, Orange Lab enables users to visually construct machine learning workflows from modular components, where interactions in any component propagate seamlessly through the workflow, turning static pipelines into dynamic, reactive systems that support exploration and data-driven storytelling. Our key contribution is component exposition, a paradigm that allows authors to embed selected workflow components, or parts of their interfaces, into arbitrary web contexts, creating synchronized, interactive interfaces while hiding underlying workflow complexity. This enables the development of tailored analytical views and narrative-driven experiences that integrate data analysis directly into online materials. We demonstrate the approach through deployments in data literacy education, where embedded components guide students in hands-on exploration of machine learning concepts without requiring knowledge of the underlying system, showing that Orange Lab effectively lowers barriers to entry and supports the democratization of data science.

Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization

Yufei Chen, Zihan Wang, Yubao Tang, Yukun Zhao, Maarten de Rijke, Zhaochun Ren — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09241v1 Announce Type: new Abstract: Multimodal generative retrieval formulates multimodal retrieval as discrete identifier generation, eliminating the need for explicit similarity search over external embeddings. Existing approaches construct identifiers via residual quantization and decode them with trie-constrained beam search. This combination introduces an indexing-decoding gap: identifier learning objectives, including reconstruction and contrastive losses, do not explicitly enforce prefix discriminability during decoding. As a result, even well-optimized identifiers can be irreversibly pruned early in beam search due to low-rank prefixes. We theoretically characterize this gap and derive a survival bound that relates prefix retention to three controllable factors in indexing and decoding. Building on this bound, we propose PRO, prefix retention optimization, a unified framework comprising three mechanisms: (i) prefix ranking distillation aligns quantized prefix rankings with those induced by pre-quantization embeddings using a listwise loss; (ii) vocabulary scheduling increases codebook sizes from shallow to deep residual quantization levels to reduce early competition from non-target prefixes; and (iii) geometric score fusion vectorizes each candidate prefix and incorporates its similarity to the query into beam search scoring, further reducing the indexing-decoding mismatch. Experiments on nine multimodal retrieval tasks show that PRO improves retention of target identifier prefixes and outperforms existing multimodal generative retrieval baselines.

Conceptualising Reflective Use: Toward A Process Perspective On Human-AI Interaction

Thimo Schulz, Christina Speck — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09242v1 Announce Type: new Abstract: The rapid diffusion of generative artificial intelligence (genAI) systems reshapes how individuals engage with information systems, requiring users to monitor, assess, and adapt their interaction with non-deterministic systems. Existing constructs capture elements of this engagement but do not account for the situated dynamics of the entire evaluative process in genAI use. This research-in-progress, situated in a larger endeavour towards a scale development, derives an initial conceptualisation of reflective use: a behavioural-knowledge capability that unfolds across pre-use, in-use, and post-use phases, reinforced through situated reflective knowledge gained in practice. Drawing on expert interviews and a focus group, we identify four core components of reflective use and show how they form an iterative capability cycle anchored within the motivational needs outlined in self-determination theory. Understanding reflective use is essential to ensure appropriate reliance and high decision quality, and thus provides a foundation for promoting responsible and effective human-AI interaction.

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09243v1 Announce Type: new Abstract: Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.

Proposal Refinement for Few-Shot Object Detection

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09245v1 Announce Type: new Abstract: Few-shot object detection has gained widely attention in recent years. Some excellent algorithms have been proposed to handle this task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the problem of unbalanced distribution of region proposals between the novel classes and the base classes. In order to alleviate this unbalanced distribution, we propose the proposal refinement approach for different training phases. Specifically, refinement loss is designed for the base training phase to enhance sensitivity of the model to novel classes, and refinement branch is introduced as an auxiliary branch for RPN (Region Proposal Networks) to generate more novel proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baselines methods by roughly 1\%$\sim$6\% on current benchmarks without increasing any inference time. Through extensive experiments, we prove that we establish a new state-of-the-art method for the few-shot object detection task.

SOMA: From Surface Observations to Muscle Anatomy

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09246v1 Announce Type: new Abstract: With the growing demand for realistic virtual humans, parametric body models have become a cornerstone of modern medicine, sports, and entertainment applications. However, most of these models are inherently limited: they only capture the 3D surface of the skin, offering no insight into the complex bio-mechanical structures that generate motion. As more applications expand towards biomechanics, the need for virtual human models that go beyond the skin has become increasingly evident. Traditional soft-tissue simulations, such as FEM, are accurate but non-scalable and too computationally expensive for most common applications. Alternatively, existing biomechanical tools can simulate muscular forces and activations, but do not model changes in external shape, restricting how activations correlate with actual observable anatomy. This motivates a novel inverse research problem: recovering muscle deformations directly from visible surface observations - i.e., from the skin, and thus the pose. In this work, we present SOMA (from Surface Observations to Muscle Anatomy), a person-specific model that infers spatio-temporal muscle behavior from surface signals obtained using RGB cameras, and SKIM, a subject-specific soft-tissue deformation dataset. To the best of our knowledge, this is the first method that attempts to recover muscle deformations from multi-view RGB data. We show how our method provides anatomically grounded animations without the complexity of traditional simulations, leading to a scalable and cost-effective solution. Data and code are available.

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09248v1 Announce Type: new Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09249v1 Announce Type: new Abstract: Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09250v1 Announce Type: new Abstract: Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers since the absence of encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while maintaining fast sampling (down to a single step) compatibility.

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09251v1 Announce Type: new Abstract: We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.

A practical probabilistic framework for deformable image registration uncertainty in radiotherapy dose propagation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09253v1 Announce Type: new Abstract: Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but uncertainty in the underlying deformation can substantially affect clinically relevant dose estimates. We present a practical probabilistic framework for propagating DIR uncertainty to voxel-wise dose statistics and dose-volume histograms (DVHs). The method models the mapped correspondence at each voxel as a random variable governed by a transparent local certainty map that can be defined by simple safety margins, structure-boundary mismatch, or structure-wise conservative uncertainty values. This yields interpretable quantities such as dose probabilities, expected dose, confidence bounds, and induced DVH envelopes. The framework is designed to remain lightweight and interpretable: it avoids complex biomechanical or ensemble-based uncertainty models and instead emphasizes simple parameterization, computational feasibility, and transparent dose metrics. We further introduce a structure-guided in/out strategy as an optional refinement that restricts mapping probabilities to anatomically plausible target regions. The approach is demonstrated on a prostate radiotherapy case study and used to compare different certainty-map strategies and probability kernels. The experiments show that the certainty-map design has a stronger effect on resulting dose and DVH uncertainty bounds than the specific kernel choice, while the additional benefit of the in/out strategy is case-dependent and modest in the present example. Overall, the proposed framework provides a transparent way to incorporate DIR uncertainty into radiotherapy dose assessment and to study how modelling choices affect propagated dose metrics.

RPO-PDT: Demonstrating Role-Play-Based Knowledge Adaptation for Student Support Dialogue (Demonstration System)

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09255v1 Announce Type: new Abstract: We present RPO-PDT: a retrieval-grounded, role-play-based dialogue system for adaptive student support in higher education. RPO-PDT is: (1) able to provide institution-specific Personal Development Tutor (PDT) guidance using structured knowledge sources; (2) constrained by explicit persona, boundary, confidentiality, and safety policies; and (3) designed around a reverse-roleplay loop where unresolved interactions are replayed from the student perspective, enabling alternative tutor strategies to be generated and stored as reusable strategy memory. RPO-PDT supports both text-based and Furhat-based embodied interaction for demonstrating grounded, safe, and adaptive student-support dialogue.

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09257v1 Announce Type: new Abstract: High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.

Back to the Familiar Future: Failure Recovery for VLA Policies via Pre-Imagined Milestone Selection

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09258v1 Announce Type: new Abstract: Vision-language-action (VLA) policies can deviate from nominal trajectories during manipulation, even when tasks remain physically feasible. Recovering from these deviations is challenging, as they push the policy into unfamiliar state spaces where direct re-planning frequently destabilizes action sequences. We propose Back to the Familiar Future (B2FF), a recovery framework for foresight-driven VLAs that leverages future visual conditioning as a recovery interface. Before execution, the VLA generates a milestone bank of familiar future states conditioned on the clean initial observation. At recovery time, a recoverability-aware selector selects a recovery milestone from this bank and enforces it as a fixed visual goal. This enables the VLA to robustly map off-trajectory observations back to a familiar future. On failure-injected LIBERO, under controlled recovery timing aligned with the injected failure, B2FF increases the average success rate of a baseline VLA from 56.3% to 74.0%, demonstrating that pre-imagined milestones can guide recovery without fine-tuning the low-level action generator.

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09261v1 Announce Type: new Abstract: In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.

See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning

Xiaojie Li, Xin Jiang, Luanyuan Dai, Jinnan Yang, Yongdong Zhang, Zechao Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09262v1 Announce Type: new Abstract: Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.

Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09266v1 Announce Type: new Abstract: Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

VGP-Nav: Metric-Aware Visual Geometric Perception for Robot Navigation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09268v1 Announce Type: new Abstract: Reliable robotic navigation necessitates the seamless integration of accurate global localization and dense, metric-consistent obstacle perception. A common strategy to achieve these capabilities involves integrating diverse sensing modalities: cameras offer rich visual features for localization, while active sensors like LiDAR provide direct metric measurements. However, such multi-sensor configurations necessitate complex spatial-temporal calibration and increase deployment overhead. Although vision-only approaches offer a low-cost and scalable alternative, existing monocular visual systems typically struggle to simultaneously achieve efficient, globally consistent localization and dense, metric-consistent geometric perception. To bridge this gap, we propose \textbf{VGP-Nav}, a unified framework for \textit{Metric-Aware Visual Geometric Perception} that relies solely on monocular RGB input to jointly support metric localization and obstacle perception. Our key insight is to anchor localization-grounded visual geometry to physically meaningful scale constraints derived from ground-plane geometry, thereby providing a reliable metric reference for monocular perception. VGP-Nav resolves monocular scale ambiguity online and produces localization-grounded, metric obstacle representations that are directly applicable to downstream planning. Extensive experiments demonstrate strong generalization across diverse environments and successful deployment on real mobile robots, highlighting the practicality of our approach for scalable, low-cost, and safe autonomous navigation.

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

George Theodosiou, Loukas Ilias, Dimitris Askounis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09271v1 Announce Type: new Abstract: Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

Fatima Balde, Raoul de Charette, Alexandre Boulch — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09273v1 Announce Type: new Abstract: 3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.

An implicit octree-based adaptive Material Point Method

Robert E. Bird, William M. Coombs, Charles E. Augarde, Ted J. O'Hare — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09275v1 Announce Type: new Abstract: The Material Point Method provides an effective approach for modelling the large deformations that often arise from contact interactions between rigid structures and surrounding continua. However, solving these problems requires accurate representation of the continuum-structure interface, which necessitates high resolution background mesh and material point discretisations. This requirement, combined with evolving continuum-structure interfaces and the fact that most Material Point Method implementations are dependent on structured meshes, can result in large numerical systems and long run times especially when modelling problems in three-dimensions. Motivated by this issue, this paper provides the first octree-based implicit Material Point Method for efficient solution of large deformation continuum-structure interaction problems. The octree background mesh provides a natural way to automatically adapt both the computational mesh and the material point discretisation based on the position of the interaction between the structure and continuum. The new approach is demonstrated on a number of large deformation benchmark and continuum-structure interaction problems, where up to a 5.5-times speed up and a consequent 21-times CO2 saving is achieved when running on a HPC compared to results obtained using a conforming mesh.

ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09276v1 Announce Type: new Abstract: Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: Prediction accuracy on test data, and recovery of known groundtruth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, however, with only a small number of groundtruth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sampling size, sampling distribution and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09278v1 Announce Type: new Abstract: Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.

Bridging nanoparticle morphology and viscoelastic behavior in epoxy nanocomposites: A coarse-grained simulation-informed constitutive model

Atiyeh Hentea, Shadab Zakavatib, Behrouz Arash, Maximilian Jux, Raimund Rolfes — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09279v1 Announce Type: new Abstract: Accurate prediction of the material behavior of polymer nanocomposites under various thermomechanical loading conditions is increasingly demanded for engineering applications. This study proposes an integrated framework combining coarse-grained (CG) molecular simulations and experimental testing to develop predictive constitutive models for nanoparticle/epoxy nanocomposites. The key contribution of this work lies in characterizing the influence of nanoparticle content and agglomerate size on the rate- and temperature-dependent behavior of nanocomposites, enabled by large-scale CG simulations. The proposed framework successfully captures the material response, including nonlinear hyperelasticity, softening behavior, and rate- and temperature-dependent properties, across a broad range of strain rates, temperatures, and nanoparticle sizes and weight fractions. The predictive capability of the CG simulation-informed constitutive model is validated using additional experimental data that were not included in the parameter identification process. By reducing reliance on extensive experimental testing while maintaining high accuracy, this simulation-driven approach offers an efficient pathway for developing robust, predictive constitutive models for designing and optimizing advanced nanocomposites.

Revisiting mesoscopic traffic flow simulation in SUMO: Limitations, analysis, and an alternative

Ying-Chuan Ni, Alina Akopian, Anastasios Kouvelas, Michail A. Makridis — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09282v1 Announce Type: new Abstract: Mesoscopic traffic flow models combines the merits of both macroscopic and microscopic models by capturing individual vehicle behavior in great detail and remaining the computational efficiency. At the time of this study, the mesoscopic model proposed by Eissfeldt (2004) is used in Simulation of Urban MObility (SUMO). The movement of vehicles is governed by dynamic headways between edges. However, the model does not fully comply with the principle of the Lighthill-Whitham-Richards (LWR) model. Several problems are identified, including the incomplete consideration of queue dynamics and the limited implementation of backward traveling spaces. Two case study scenarios demonstrate that the problems lead to unrealistic onset and recovery pattern of congestion. The magnitude of congestion is generally underestimated with this model. To address these drawbacks, a proper mesoscopic discrete-time implementation of link transmission model, which follows the LWR principle, is proposed. By explicitly incorporating backward traveling spaces to capture queue spillback phenomena, the proposed model provides a more precise representation of congestion dynamics. The link density outputs are consistent with the kinematic wave theory and the microscopic traffic simulation in SUMO, thus verifying its theoretical accuracy.

VAIC: Vision-Guided Humanoid Agile Object Interaction Control via Decoupled Commands

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09286v1 Announce Type: new Abstract: Humanoid robots hold immense potential for real-world assistance, yet agile interaction with objects in unstructured environments demands tightly coupled whole-body coordination. Despite recent advancements, current controllers face a critical deployment gap. They rely heavily on dense reference trajectories and perfect state observability, which inherently limits physical generalization. We present Vision Guided Agile Interaction Control (VAIC), a unified framework that bridges this gap by operating exclusively on onboard depth, historical proprioception, and a decoupled user command interface. VAIC employs a two-stage distillation paradigm. First, a privileged teacher policy masters diverse interaction skills using precise object kinematics and exact environmental states. Second, a deployable student policy distills these capabilities by replacing full body tracking with velocity targets across multiple axes and an interaction indicator for each frame. The student utilizes a recurrent object adaptation module to implicitly infer unobservable object dynamics from raw depth streams and proprioception. Evaluations and real-world deployments on the humanoid robot demonstrate that a single VAIC policy successfully executes highly diverse dynamic tasks. These tasks include box carrying, cart interaction, and skateboarding, consistently outperforming baselines and advancing autonomous humanoid deployment.

Trajectory Geometry of Transformer Representations Across Layers

Vishal Pandey, Gopal Singh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09287v1 Announce Type: new Abstract: Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

Yuesen Li, Daniel Link — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09289v1 Announce Type: new Abstract: Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09290v1 Announce Type: new Abstract: Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.

Dual Quaternion-Based Unscented Kalman Filter with Visual Inertial Odometry for Navigation in GPS-Denied Environments

Mohamed Khalifa, Hashim A. Hashim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09292v1 Announce Type: new Abstract: Reliable navigation in GPS-denied environments remains a fundamental challenge in robotics, aerospace, and autonomous vehicle applications. This paper presents a Dual Quaternion-Based Unscented Kalman Filter (DQUKF) equipped with a Visual Inertial Odometry (VIO) algorithm for accurate state estimation enabling navigation in GPS denied locations. The proposed framework formulates the DQUKF in an error state manner, where the nominal pose is represented by a unit dual quaternion and the local pose error is represented by a 6-dimensional twistor parameterization used for sigma point generation, covariance propagation, and measurement correction. In parallel, the VIO algorithm tracks features across image frames, synchronizes measurements between the IMU and camera, and provides visual constraints that complement inertial propagation. Simulation results on the EuRoC MAV dataset show that the proposed DQUKF converges under high initialization uncertainty and achieves a position RMSE of 0.2584~m in the difficult flight sequence, outperforming the benchmark filters.

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09293v1 Announce Type: new Abstract: Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweighs them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

Virtual-point-based Solutions to Handle Generalized Absolute Pose Problem

Bin Li, Banglei Guan, Shunkun Liang, Yang Shang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09294v1 Announce Type: new Abstract: Multi-camera systems are increasingly adopted in robotics and autonomous navigation for their wide field of view, flexibility, and fault tolerance. Nevertheless, existing PnP solvers fail to handle multiple projection centers. This paper introduces a virtual point formulation that bridges the standard PnP and generalized pose problems, enabling a unified pipeline that transforms existing PnP solvers into generalized pose solvers. Based on this framework, we derive three Virtual-point-based Generalized Pose solvers, namely VGPc, VGPq, and VGPr, leveraging Cayley, quaternion, and rotation-matrix parameterizations, respectively. Extensive experiments demonstrate that the proposed solvers inherit the accuracy and efficiency of original PnP algorithms while significantly outperforming existing generalized solvers. Specifically, VGPc achieves higher estimation accuracy under heteroscedastic noise conditions, VGPq maintains global optimality, whereas VGPr provides superior computational efficiency without accuracy degradation.

N\"ushuVoice: Reviving the Voice of Endangered N\"ushu with Pitch-Aware Text-to-Speech

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09295v1 Announce Type: new Abstract: N\"ushu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of N\"ushu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a N\"ushu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce N\"ushuVoice, the first TTS benchmark for N\"ushu. We construct a sentence-level N\"ushu text-to-audio dataset that aligns standardized Unicode N\"ushu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose N\"ushu-PitchVITS, an F0-conditioned VITS framework that leverages N\"ushu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that N\"ushu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.

Justification and structure- and asymptotic-preserving discretizations of a hyperbolized Cahn-Hilliard equation

Jan Giesselmann, Fabio Leotta, Hendrik Ranocha, Jochen Schuetz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09299v1 Announce Type: new Abstract: We study a hyperbolic approximation ("hyperbolization") of the Cahn-Hilliard (CH) equation, originally proposed by Dhaouadi, Dumbser, and Gavrilyuk (2025, DOI: 10.1098/rspa.2024.0606) and study its convergence towards the CH model in a relaxation limit both via formal asymptotic expansions and, for a slightly modified approximation, via the relative energy framework. Moreover, we develop energy-stable semidiscretizations of the CH equation and of this hyperbolization using upwind summation-by-parts operators in space. Subsequently, we combine them with (additive) implicit-explicit (IMEX) Runge-Kutta methods based on a convex-concave splitting. We show that the resulting method is asymptotic preserving, i.e., it converges in the limit of the relaxation parameter to a stable discretization of the original CH equation. The choice of the necessary parameters is guided by the a priori error estimate based on the relative energy framework.

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09301v1 Announce Type: new Abstract: Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image--interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency. Unlike random instance-wise missingness, a deficient client lacks the local semantic basis needed to reconstruct the absent modality. More importantly, in graph learning, incomplete representations initialize message passing, so imputation errors can be filtered, mixed, and amplified by the receiving topology. To address this gap, we propose \textbf{PRISM} (\textbf{P}roactive \textbf{R}etrieval and \textbf{I}mputation via \textbf{S}tructural \textbf{M}eta-prompting), a topology-aware federated cross-modal imputation framework. Rather than reconstructing the missing modality solely from local observations, PRISM recovers missing-modality semantics from the federation and introduces them into local graph propagation under topology-aware control. Experiments on six multimodal graph datasets across graph-centric and modality-centric tasks show that PRISM consistently improves modality-deficient clients, outperforming state-of-the-art baselines by \textbf{4.48}\% on average.

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Xinyan Gao, Haoran Hao, Xiangyu Yue — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09303v1 Announce Type: new Abstract: The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09304v1 Announce Type: new Abstract: On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09311v1 Announce Type: new Abstract: Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.

Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

Haolin Pan, Lianghong Huang, Xvlin Zhou, Mingjie Xing, Yanjun Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09312v1 Announce Type: new Abstract: Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a \emph{world-model-inspired} evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37$\times$ on GPU and 1.54$\times$ on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10$\times$ fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt(cuDNN) by 4.61$\times$/3.67$\times$ geometric mean.

Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time

Nugzar Gognadze, Motonobu Kanagawa, Yu Someya, Hisashi Yashiro — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09313v1 Announce Type: new Abstract: Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09314v1 Announce Type: new Abstract: Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

Jianwei Tai — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09315v1 Announce Type: new Abstract: BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call \emph{brain-prompt injection}: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a minimal log schema, denominator hierarchy, and endpoint specification, and prove an audit-schema separation theorem together with a C3 attacked-dependence decomposition; clean agreement and marginal robustness do not identify the joint term that controls C3 routing. As a calibration layer on top of the contract, we apply split-conformal calibration to a non-oracle EEG confirmation channel and report the resulting false-accept frontier under an explicit threat-archetype matrix. We instantiate the contract on EEGMMI native left/right command-control over 5{,}400 events, harmless tool stubs, and seed/case denominators. Provenance blocks C2 routes ($0.000$); agreement-plus-provenance routes C3 flips ($1.000$); confirmation-plus-provenance routes them ($0.000$). The conformal frontier reaches FAR $0.000$ at clean utility $0.150$ for $\alpha=.005$ and FAR $0.119$ at clean utility $0.452$ for $\alpha=.10$ under acquisition isolation; an attacker-controllable confirmation channel breaks the bound to $\approx\!1$. Subject-cluster bootstrap confirms these intervals on $60$ subjects; cross-architecture (TinyEEGNet, EEGNetV4) and capacity-sweep results show within-regime saturation. Mediation and confirmation reduce risk; they are not intent certificates.

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09316v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, \textsc{Anything2Skill} first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85\% and 94.10\% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.

Engineering Scalable Distributed List Ranking

Peter Sanders, Matthias Schimek, Tim Niklas Uhl, Thomas Weidmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09318v1 Announce Type: new Abstract: The list ranking problem is one of the classical problems of parallel computing, with nontrivial algorithms and many applications as a subroutine for solving other problems. While it has been intensively studied in the early days of parallel computing, few things happened in the last 20 years. In particular, there is little work on scaling list ranking to large machines and input sizes. We reconsider list ranking starting from the ground-breaking results of Sibeyn a quarter century ago. We employ algorithm and performance engineering to improve his sparse ruling-set algorithm, making it capable of scaling to many processors, and provide a more detailed analysis of the impact of the algorithm's parameters, further guiding our practical implementation. We perform an extensive experimental study across a variety of input instances with different structural properties. We demonstrate that indirect communication, exploiting input locality, and message coalescing allows scaling to billions of elements on up to 24,576 cores.

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09323v1 Announce Type: new Abstract: Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench

A Universal Dense Football Event Representation Based on TabTransformer

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09327v1 Announce Type: new Abstract: Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09331v1 Announce Type: new Abstract: Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple--fuse--recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, achieving 74.9 scores on MMEB while obtaining 55.61 on the 30-task MAEB audio suite.

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

Kateryna Karpo, Artem Chernodub — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09334v1 Announce Type: new Abstract: Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.

TORL-VLA: Tactile Guided Online Reinforcement Learning for Contact-Rich Manipulation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09337v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have become a powerful framework for robotic manipulation, and recent studies have introduced tactile or force feedback into VLAs to address contact-rich tasks. However, these models are typically deployed as offline policies. When contact conditions shift from the training distribution, the policy cannot perform online adaptation, leading to problems such as inappropriate contact forces and inefficient retries. Therefore, we propose TORL-VLA, a tactile-guided online reinforcement learning framework that couples tactile feedback with policy refinement for contact-rich manipulation. Our method introduces a tactile-derived wrench-aware VLA to predict reference actions and future wrench sequences, while a lightweight online RL module is used to refine the reference actions. To stabilize learning from mixed exploratory policy-generated and human-intervention data, we introduce an intervention-censored critic that prevents post-intervention success from being wrongly credited to policy-generated actions preceding intervention. Real-robot experiments on long-horizon contact-rich tasks, including latch manipulation, coffee-cup placement, and egg handling, show that TORL-VLA improves success rates at both subtask and full-task levels, as well as time-bounded execution efficiency over strong baselines.

Multi-Hop Knowledge Composition is Bound by Pretraining Exposure

Yannis Karmim, Luis Marti, Djam\'e Seddah, Valentin Barri\`ere — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09338v1 Announce Type: new Abstract: Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.

Thresholded Local Hyper-Flow Diffusion

Meher Chaitanya, Sebastian Dalleiger, Luana Ruiz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09340v1 Announce Type: new Abstract: Local Hyper-Flow Diffusion (HFD) gives an edge-size-independent Cheeger-type guarantee for seeded clustering in general submodular hypergraphs, but existing HFD solvers do not keep intermediate computation local at every iteration. We introduce Thresholded Local HFD (TL-HFD), a first-order method that maintains an active region around the seeds, performs projected subgradient updates on that region and its immediate boundary, and expands via thresholded (top-k) boundary activation. We prove that the local update is exact: the degree-preconditioned projected subgradient step restricted to the active region and its boundary coincides with the unrestricted global update. We establish finite-time dual suboptimality for both exact and thresholded updates, treating the latter as inexact projected subgradient steps with explicit skipped-boundary error. We further derive an additive activated-volume bound controlled by realized local subgradient norms and the minimum boundary-push among newly activated vertices, and translate approximate dual optimality with localized support into a robust sweep-cut guarantee for early-stopped iterates. For general submodular cut-costs, each iteration is local in the scanned region and oracle-sensitive in the hyperedge primitive. Empirically, TL-HFD often matches or improves over HFD while activating less volume, with the largest gains on noisy instances where diffusion tends to absorb non-target vertices.

Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

Micka\"el Basson (CRIStAL, Scool), Philippe Preux (CRIStAL, Scool) — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09343v1 Announce Type: new Abstract: Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-ofthe-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities, and 0.31% on TSP with 1000 cities, outperforming FT2T best settings (OG 0.22% and 0.36%, respectively) while reducing the inference time up to 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference time operations provide a practical and principled path for neural TSP solvers, complementing training time objectives.

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Haojun Guo, Fan Feng, Ziquan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09347v1 Announce Type: new Abstract: Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09348v1 Announce Type: new Abstract: Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-base reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

Cornelius Schr\"oder, \v{Z}ygimantas Marcinkus, Markus Lienkamp — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09350v1 Announce Type: new Abstract: Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.

In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09351v1 Announce Type: new Abstract: Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

Maria De Marsico, Anil K. Jain, Annalaura Miglino — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09353v1 Announce Type: new Abstract: Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.

MosaicIMU: Composing Carrier Experts for Generalizable Neural Inertial Odometry

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09355v1 Announce Type: new Abstract: Robust inertial odometry is essential for various carriers when external sensing is unreliable. Learning-based methods reduce integration drift by capturing local motion priors, but these methods often remain tied to a particular carrier, limiting generalization across heterogeneous platforms. We present MosaicIMU, a carrier-conditioned Mixture-of-Experts (MoE) pretraining-and-adaptation framework for generalizable neural inertial odometry. MosaicIMU uses a prototype-based router to compose carrier-specific expert features, decodes local velocity and uncertainty constraints, and integrates them with a history-aware EKF. For unseen domain adaptation, it freezes the pretrained base model and learns a new lightweight expert residual branch. For edge-deployment, it further reuses the router to select informative online samples for efficient incremental updates. Experiments show that MosaicIMU consistently outperforms learning-based baselines, reducing average ATE and RTE-10s by 40% and 34%, respectively. These results highlight that MosaicIMU provides a scalable pretraining-to-deployment paradigm for generalizable and adaptive neural inertial odometry.

Coupling Complementary Simulations for Combined Performance and Energy Optimization

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09356v1 Announce Type: new Abstract: Polymer simulations are among the most computationally demanding workloads in soft-matter research, often requiring days of execution and high energy consumption to achieve physically meaningful results. In this work, we address these challenges through the coupling and optimization of two complementary simulation frameworks: the Uneyama-Doi Model (UDM) and the SOft coarse-grained Monte Carlo Acceleration (SOMA). UDM efficiently propagates concentration fields at the continuum level, while SOMA resolves chain-scale thermal fluctuations via particle-based Monte Carlo dynamics. Each model was individually optimized for GPU execution using kernel fusion, memory coalescing, asynchronous random-number generation yielding up to 70% (UDM) and 80% (SOMA) performance improvement. The coupling is performed through our proposed coordinator library that orchestrates data exchange and synchronizes time-stepping across multiple GPUs. Further management of coupling workload distribution enabled a 13x overall speedup and 24.5x reduction in total energy usage compared to the SOMA baseline, i. e., 96% energy saving. The proposed hybrid approach maintains the same scientific fidelity while drastically reducing the computational and energy footprint, showcasing the potential of energy-aware, cross-application co-design for sustainable high-performance simulations

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

Yupeng Zhang, Yuzhong Feng, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09360v1 Announce Type: new Abstract: Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. we propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.

Bespoke-Card: Why Tune When You Can Generate? Synthesizing Workload-Specific Cardinality Estimators

Johannes Wehrstein, Anton Winter, Timo Eckmann, Carsten Binnig — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09361v1 Announce Type: new Abstract: Cardinality estimators are built to support arbitrary schemas and workloads, forcing them to rely on generic statistics even when the schema and workload is known in advance, leaving optimizers prone to large errors and poor plans. We present Bespoke-Card, an agent-driven system that synthesizes workload-specific cardinality estimators as executable code: a planning agent designs the estimators strategies, a coding agent implements them, and a validator scores the estimates against true cardinalities and PostgreSQL estimates, forming a robust and deterministic harness. Going beyond naive prompting, Bespoke-Card uses structured q-error feedback, regression analysis, concrete outlier subplans, a curriculum isolating join-only, filter-only, and full-subplan errors, and archival selection of the best implementation. Injecting its estimates into the optimizer cuts total PostgreSQL runtime on JOB by 33% and reduces median q-error over all JOB subplans by 41%, while synthesizing a strong estimator in under one hour for less than $10. Bespoke-Card is opening a new avenue for cardinality estimation next to classical generic estimators and learned estimator architectures.

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

Eduardo Borges, Manuel Abreu, Lu\'is Garrote, Urbano J. Nunes — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09362v1 Announce Type: new Abstract: Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09365v1 Announce Type: new Abstract: Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop ``Read--Write--Assess--Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu, Zhizheng Wu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09366v1 Announce Type: new Abstract: Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.

RT-SDGOD: Real-Time Single-Domain Generalized Object Detection

Yupeng Zhang, Fangzhuo Gao, Ruize Han, Wei Feng, Liang Wan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09367v1 Announce Type: new Abstract: In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate-at the level of problem formulation-the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more sufficient discriminative evidence while maintaining the stability of such evidence modeling across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09368v1 Announce Type: new Abstract: Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.

Residual Pseudospectra Reveal a Physics-Informed Koopman Backbone for Tropical Pacific Variability and ENSO Prediction

Paula Lorenzo-Sanchez, Matthew J. Colbrook, Antonio Navarra — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09369v1 Announce Type: new Abstract: Tropical Pacific sea-surface-temperature (SST) variability spans interacting timescales, with the ENSO as its dominant interannual expression. Yet the dynamical structure organizing this variability and underpinning extended-range predictability remains difficult to extract from high-dimensional observations. Koopman operator learning offers spectral coordinates for nonlinear dynamics, yet finite geophysical records often produce dense, sampling-sensitive spectra whose physical content is ambiguous. We show that this apparent redundancy reflects coherent operator-level structure. Combining kernel Extended Dynamic Mode Decomposition with residual minimization and pseudospectral analysis, we use the Koopman eigenvalue relation as a physics-informed consistency test to organize learned spectra. Applied to ERA5 and HadISST tropical Pacific SST anomalies, the residual landscape identifies 19 robust residual-minimum frequencies with coherent spatial modes that persist across products and sampling realizations. Together, these modes define a compact Koopman backbone spanning low-frequency modulation through quasi-biennial components, including ENSO-band variability. The surrounding spectral cloud is structured by integer powers and nonlinear combinations of this backbone, forming a residual-ordered Koopman hierarchy. The backbone reconstructs substantial Nino3.4 variance and enables skillful out-of-sample forecasts, with greatest gains at 8-18-month leads. By embedding dynamical consistency into physics-informed operator learning, the framework turns opaque spectra into robust, interpretable and predictive representations of tropical Pacific variability.

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

Haotong Yang, Ting Long, Yi Chang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09371v1 Announce Type: new Abstract: Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor misalignment and limiting LLM performance on tool-use tasks. In this paper, we propose a method called Capability-Aligned Hierarchical Learning (CAHL), which leverages RLVR to jointly optimize both policies, enabling better alignment between the high-level planner and the low-level executor. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.

Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Juan S. Santillana — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09376v1 Announce Type: new Abstract: Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

Sergei Vorobyov, Eugene Ilyushin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09377v1 Announce Type: new Abstract: Formal neural network verification -- proving that a network satisfies safety properties for \emph{all} inputs in a specified domain -- is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, $\alpha$-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the \texttt{auto\_LiRPA}\,/\,$\alpha,\beta$-CROWN verification framework. \textbf{Tensor Parallelism (TP)} shards both weight and $A$-matrices across GPUs, achieving ${\approx}2\times$ peak-memory reduction at $P{=}2$; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. \textbf{Fully Sharded Data Parallelism (FSDP)} shards only weight matrices with a per-layer \texttt{AllGather}, producing bounds that are \emph{bitwise identical} to the single-GPU baseline: baseline memory drops by 80--90\%, peak memory by 34--39\% on wide MLPs. FSDP integrates cleanly with complete verification ($\beta$-CROWN + Branch-and-Bound) and with convolutional layers (\texttt{BoundConv}); a complete \emph{unsat} result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in $\alpha$-CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09378v1 Announce Type: new Abstract: Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality--efficiency trade-offs across deployment settings. Data, code and models will be released at https://github.com/MiliLab/Echo-DM.

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09380v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.

ReGIL: Retrieval-Guided Imitation Learning from a Single Demonstration

Yuying Zhang, Francesco Verdoja, Wenyan Yang, Ville Kyrki — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09381v1 Announce Type: new Abstract: Learning robot manipulation policies with deep neural networks from a single demonstration remains highly challenging, as even small deviations from the demonstrated trajectory can quickly compound into failure, while collecting substantial online interaction data is costly. We propose ReGIL, a retrieval-guided imitation learning framework that treats a single demonstration as an external memory. ReGIL repeatedly queries this static memory throughout training to simultaneously guide exploration, generate the regularization buffer, and construct rewards. Specifically, it computes rewards through local temporal alignment between the current trajectory and the retrieved segment, providing step-wise and informative feedback for policy improvement. We evaluate ReGIL on robotic manipulation tasks from the LIBERO and Meta-World benchmarks under the single demonstration setting. ReGIL outperforms prior baselines in both success rate and training efficiency. In real-robot experiments, using only one demonstration and less than one hour of online training, ReGIL achieves over 75% success rate across three manipulation tasks with randomness in both initial robot pose and target position. These results demonstrate that leveraging the single demonstration as reusable memory can provide more than static supervision for efficient robot learning. More details can be found on our website: https://regil2026.github.io/

Consecutive Support Matching Induced Parameter Tuning Accelerates Momentum Iterative Hard Thresholding

Samrat Mukhopadhyay, Debasmita Mukherjee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09382v1 Announce Type: new Abstract: Momentum-based acceleration of iterative hard thresholding (IHT) can dramatically speed up sparse signal recovery from linear measurements, but its effectiveness hinges on careful parameter tuning -- a task complicated by the frequent support changes inherent to hard thresholding. We propose CosMIHT(Consecutive Support Matching Induced Momentum IHT), which resolves this difficulty through a simple adaptive rule: start with the conservative parameters and whenever two consecutive iterates share the same support, estimate the extreme eigenvalues of the support restricted Gram matrix via a lightweight power method and switch to the corresponding optimal heavy-ball parameters. This mechanism allows CosMIHT to automatically interpolate between cautious MIHT-like behavior during support discovery and near-optimal accelerated convergence after support identification. Under standard restricted isometry assumptions, we develop a two-phase convergence theory. In the \emph{wandering phase}, we establish a linear contraction of the recovery error up to a noise floor and derive an explicit upper bound on the number of iterations required to identify the correct support. In the \emph{lock-in} phase, we establish that, with a randomly initialized power method based eigenvalue estimates that depend on the number of power iterations, the algorithm enjoys, with high probability, a near-optimal accelerated convergence rate akin to the heavy ball method. We corroborate the theoretical findings with extensive numerical experiments on both noiseless and noisy measurements demonstrating that CosMIHT achieves faster convergence than state-of-the-art iterative sparse recovery techniques without compromising the recovery performance.

An Opticalmechanics Framework for Dynamic Estimation of Multibody Systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09383v1 Announce Type: new Abstract: Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque sensors and controlled laboratory environments. To address this issue, this study proposes an opticalmechanics kinematic-dynamic integrated estimation framework for multibody systems. Specifically, a constrained multibody model is established to describe the system dynamics, while image-measured kinematic quantities are used as non contact inputs for dynamic estimation. The unknown joint torque is then identified through a genetic-algorithm based optimization by minimizing the discrepancy between model-predicted and image-measured kinematic quan tities. Experimental validation on an air-bearing platform showed that the wrist joint torque estimated from image data achieved a mean absolute error of 0.46 Nm compared with sensor measurements. In the forward prediction test, the model-predicted angular velocity achieved a mean absolute error of 0.006 rad/s relative to the image-measured results. This study demonstrates the potential of combining image measurement and mechanical modeling for non-contact dynamic estimation in scenarios where direct force and torque measurement is difficult.

Distilling Safe LLM Systems via Soft Prompts for On Device Settings

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09388v1 Announce Type: new Abstract: Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.

LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09389v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal question remains challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.

Real-time body pose non-verbal communication with a consistency-based reliability measure

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09390v1 Announce Type: new Abstract: Body movement communicates intent at distances and in conditions where neither the face, nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal especially in scenarios that require real time low-cost on-device person-to-robot communication in long distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin~Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.

From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

Shuhao Li, Weidong Yang, Yue Cui, Zizhuo Xu, Lipeng Ma, Fan Zhang, Xiaofang Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09392v1 Announce Type: new Abstract: Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09393v1 Announce Type: new Abstract: Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.

Empirical Study for Structured Output Control in LLMs for Software Engineering

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09395v1 Announce Type: new Abstract: LLM-generated outputs in software engineering rarely exist in isolation. They must plug into toolchains, APIs, and data pipelines that impose strict, often organization-specific structural contracts. A semantically correct output that violates the expected format is, from the consuming system's perspective, indistinguishable from a wrong answer, making structural fidelity an operational prerequisite for deploying LLMs in practice. Yet current models routinely produce syntactically invalid or structurally non-compliant outputs. Unlike encoders, autoregressive decoders generate text token-by-token with a local rather than global focus, amplifying structural fragility whenever the target format deviates from familiar training distributions. We present a systematic evaluation of structural reliability across four representative SE tasks, categorizing failures into syntax, structural, and semantic errors. We benchmark ways of mitigation targeting the decoder: grammar-constrained decoding, regex-based validation, and a strict template-driven control (Template Token Match Generation, TTMG) to isolate the sources of these failures. TTMG nearly eliminates syntax errors, yet substantial structural and semantic errors persist, demonstrating that the core bottleneck lies beyond syntax formatting. A detailed case study further illustrates how residual errors cascade in downstream workflows. Our findings show that current structure-enforcing tools are necessary but insufficient, and highlight the need for approaches that jointly ensure structural fidelity and semantic correctness in LLM-driven workflows.

PriFT: Prior-Support Guided Supervised Fine-Tuning

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09396v1 Announce Type: new Abstract: Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens are less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently fine-tuned entangles token weights with the optimization trajectory, inducing a self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, replacing the reweighting signal from the online model to pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Radeen Mostafa, Sawradip Saha — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09399v1 Announce Type: new Abstract: We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

Bastian Wittmann, Chinmay Prabhakar, Suprosanna Shit, Bjoern Menze — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09400v1 Announce Type: new Abstract: The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09401v1 Announce Type: new Abstract: Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.

Fully Oblivious Differential Privacy for Frequency Estimation in the Augmented Shuffle Model with Trusted Processors

Takao Murakami, Yuichi Sei, Reo Eriguchi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09402v1 Announce Type: new Abstract: In the shuffle model of DP (Differential Privacy), a shuffler randomly permutes users' data to achieve high accuracy and privacy. Recent studies show that most existing shuffle protocols are vulnerable to collusion attacks by the data collector and users. They address this issue by introducing the augmented shuffle model that incorporates random sampling and dummy data addition into the shuffler. However, it remains open how to ensure the shuffler follows the protocol and does not collude with the data collector in this model. We address this trust issue by thoroughly exploring the augmented shuffle model with TEEs (Trusted Execution Environments). We first introduce a new privacy notion, FODP (Fully Oblivious DP), which strengthens DP to prevent various TEE side-channel attacks based on external/internal memory access patterns and control flows. We propose a general framework for FODP algorithms based on memory-size obfuscation and three concrete algorithms within it. We also improve the efficiency of our algorithms by using the count-min sketch and optimizing the number of hashes. We evaluate our algorithms on Intel SGX and demonstrate their effectiveness through comparisons with nine baselines.

Introducing multiplex semantic networks as multifaceted representations of creative associative knowledge across multilingual samples

Edith Haim, Kurt Haim, Roger E. Beaty, Cynthia S. Q. Siew, Massimo Stella — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09403v1 Announce Type: new Abstract: Creativity is a complex cognitive ability that relies on knowledge organisation and retrieval from semantic memory. Yet most research uses a single task to measure it, capturing only a fraction of this complexity. This study investigates multiplex networks - layered semantic networks obtained from six cognitive tasks - as a more comprehensive approach to modelling the associative knowledge underlying creativity. We collected data from N=518 individuals from four countries (Austria, USA, Singapore, Italy). From their responses to verbal fluency, sentence-chain, free association, and narrative writing tasks, we constructed semantic networks and assembled them in a multiplex structure. AI persona-based responses provided a comparison baseline. Structural reducibility analyses showed that different task layers captured distinct, non-redundant information about semantic organisation, supporting the use of multiple tasks over any single one. The networks from high- and low-creative groups remained structurally distinct, while AI-generated networks showed near-identical structures regardless of creativity group. Finally, we used 12 features (network measures, emotional scores, and spreading activation simulations) in a machine learning model using ridge regression to predict individual creativity scores. The combination of structurally similar layers, as identified in the previous stage, improved a proof-of-concept prediction accuracy by 50%. Structural measures showed the highest feature importance, with spreading activation dynamics providing additional predictive power. Together, these findings indicate that multiplex semantic networks capture a richer, cross-cultural picture of associative knowledge underlying creativity. We also release our diverse dataset and code to foster diverse computational approaches within the creativity community.

Advanced simulation framework for AC/MTDC power systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09406v1 Announce Type: new Abstract: Alternating current (AC)/multi-terminal direct current (MTDC) hybrid power systems (HPSs) play a crucial role in enabling long-distance power transmission and flexible interconnections between AC grids. However, the challenges that HPSs encountered are numerous, with stability and harmonic issues being particularly prominent. Traditional electromagnetic transient (EMT) tools have struggled to accommodate small-signal stability problems and the potential issues of the optimal interactions among converters. To address this gap, HARMONY ("HARMONic stabilitY assessment of PE-penetrated power systems") has been developed for the advanced simulation and analysis of interconnected AC/MTDC HPSs as a comprehensive mathematical framework based on C++ programming language. The primary goals of Harmony are to provide faster and trusted stability analyses, and address the analytical difficulties associated with converter control dynamics, converter-driven stability, and interoperability in HPSs. This framework is intended to be open source, therefore broadening collaboration for researchers, and to contribute to the community of power systems engineers. In this paper, we demonstrate two core functionalities featured in HARMONY, that are optimal power flow (OPF) and harmonic stability analyses (HAS). The underlying analysis models and computational methodologies for both functionalities are presented in detail to help future readers and users gain a clear understanding of mathematical fundamentals of HARMONY. Furthermore, we introduce the integrated framework of OPF and HAS designed in HARMONY, along with representative printed analysis results, to demonstrate the appealing capabilities of HARMONY.

Delayed Functional Observers for Output-Delayed Linear Systems

Hieu Trinh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09407v1 Announce Type: new Abstract: This paper introduces a novel class of delayed functional observers specifically designed to reconstruct delayed control laws under severe output measurement lags, directly complementing recent literature \cite{trinhnn26, trinhnam26}. By systematically mitigating simultaneous, unequal delays across both the actuator and sensor channels, the proposed architecture resolves dual-channel latency without requiring full-state estimation or computationally intensive real-time distributed integration. Ultimately, this work provides a powerful, low-order framework that bridges the gap between idealized control theory and the practical constraints of modern networked engineering systems.

Can Data Work be Reparative?

Srravya Chandhiramowuli, Ding Wang, Alex Taylor — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09408v1 Announce Type: new Abstract: We present an ethnographic study of an alternative approach to data work, developed by a civic-tech initiative that builds datasets for training and benchmarking online safety systems. They aim to respond to online safety concerns from a feminist perspective, by building safety datasets collaboratively with those most impacted by online harms. In this paper, we examine how this approach aims to reorient data work as a site for repair and redress, and trace the struggles they encounter in the process. Specifically, we draw attention to the challenges and tensions involved in advancing just reward for data work and collective governance of AI datasets. Examining these challenges through an STS-informed lens of reparative justice and repair, we argue that the work of repairing data work (and AI) lies, fundamentally, in resetting the ties of accountability. At a time heightened emphasis on efforts like safety evaluations and red teaming to make AI more responsible, we highlight the need to confront foundational questions about how the humans involved in these efforts relate to the datasets and systems they help produce. A reparative lens demands that we interrupt prevailing norms of data work and place at their centre, not AI or datasets, but those most harmed by the neglect, oversight and exclusion animated in the current modes of dataset production. This, we argue, offers a bold vision for responsibility and contributes towards a critical agenda for building alternative futures of data and AI practice.

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Mina Remeli, Moritz Hardt — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09409v1 Announce Type: new Abstract: Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.

Capacity, Not Format: Rethinking Structured Reasoning Failures

Hengxin Fan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09410v1 Announce Type: new Abstract: Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09411v1 Announce Type: new Abstract: Large language models can be fine-tuned to encode prompt-borne secrets into fluent, seemingly benign outputs. This creates a steganographic exfiltration risk that is difficult to detect with output-level steganalysis. Recent work proposes mechanistic detection using linear probes that recover the secret from internal activations. We show that this defense can be systematically evaded, but that detectability can be recovered through a targeted data-level intervention. First, we extend the detection setup to include a non-linear MLP probe. We then adversarially fine-tune steganographic trojans across five base models: Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B. The resulting models retain $58$--$79\%$ exact-match secret recovery while evading both ridge and held-out MLP probes, with $1$--$8\%$ average capability degradation across six benchmarks. We then give an information-theoretic characterization of this evasion. Successful evasion preserves recoverability while reducing low-order extractability of the secret from the content-aligned representation, forcing the payload into synergistic interaction with residual degrees of freedom. This motivates a recontextualization dataset that restricts these residual degrees of freedom. On this distribution, both ridge and MLP detectability are restored across all five evasive trojans. Overall, our findings show that activation-based steganography detection is vulnerable to adaptive evasion, but also that theory-guided evaluation distributions can expose otherwise hidden payloads.

Towards Post-Quantum Secure Pharmacovigilance with ML-KEM and ML-DSA

Saee Desai, Tom Shimoni, Eddie Cameron, David Akamine, Aniketh Chunduri — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09412v1 Announce Type: new Abstract: Pharmacovigilance systems handle sensitive healthcare and drug-safety data, including adverse event reports and clinical observations. As quantum computing advances, classical public-key cryptographic systems such as RSA and elliptic-curve cryptography may become vulnerable, creating long-term risks for healthcare data that must remain confidential for many years. This paper presents an educational prototype of a post-quantum secure pharmacovigilance data pipeline. The system uses ML-KEM-768 for post-quantum key establishment, HKDF-SHA-256 for deriving an AES key, AES-256-GCM for efficient file encryption, and ML-DSA-65 for digital signatures and tamper detection. The pipeline supports multiple file formats, including TXT, CSV, JSON, and PDF, by treating files as raw bytes and preserving metadata for reconstruction at the receiver. The prototype includes separate hospital, gateway, pharma receiver, attacker, benchmarking, and dashboard components. We evaluate the system using synthetic pharmacovigilance datasets of different sizes and formats. Our results show that ML-KEM adds a small constant overhead, while AES encryption and ML-DSA signing dominate runtime as file size increases. This work is not a production-ready healthcare system, but rather an educational systems-level exploration of how post-quantum cryptographic primitives can be integrated into healthcare-style data pipelines.

AI Assurance in UK Defence: Challenges in Operationalising JSP 936

Callum Cockburn, Sam Farrow — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09414v1 Announce Type: new Abstract: This report examines practical challenges in operationalising JSP 936 Part 1 for AI assurance in UK Defence. Using a structured interpretive review of the directive's requirements, the analysis identifies eight thematic challenge areas adequacy of evidence and argument, management of human interaction with AI, definition of the operational environment, integration of AI within systems of systems, assessment and maintenance of AI performance, analysis of safety and security, measurement of ethicality, and mitigation of the inherent complexities of AI. The report argues that JSP 936 provides a useful governance basis, but that implementation depends on unresolved technical, organisational, and assurance questions. These challenges stem from the socio-technical nature of AI-enabled systems, uncertainty in real-world deployment contexts, limitations in current assurance methodologies, and tensions between performance, safety, human oversight, security, and ethical acceptability. The report identifies areas where further methods, guidance, and organisational capability are needed for the ambitious, safe, and responsible adoption of AI across Defence. This is consistent with MOD's own framing of JSP 936 as requiring iterative implementation and supporting guidance.

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09416v1 Announce Type: new Abstract: Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.

What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09421v1 Announce Type: new Abstract: Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality--cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0\% and downstream agent-token cost by 6.0\%; in frozen cross-model transfer, the corresponding reductions average 14.7\% and 13.7\%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: \href{https://github.com/1Reminding/Skill_EE}{SkillEE}.

Toward Signing Activity Projection in Sign Language Interaction

Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09424v1 Announce Type: new Abstract: Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT-prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09426v1 Announce Type: new Abstract: Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

Giacomo Gonella, Stefano Menini, Marco Guerini — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09428v1 Announce Type: new Abstract: Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09430v1 Announce Type: new Abstract: Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.

Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Karn Tiwari, Niladri Dutta, N M Anoop Krishnan, Prathosh A P — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09432v1 Announce Type: new Abstract: Modeling interacting dynamical systems requires capturing spatial interactions alongside long-range temporal dependencies. Graph neural networks (GNNs) provide a natural representation but typically rely on autoregressive rollouts and treat spatial and temporal dynamics separately, leading to error accumulation over long horizons. Existing approaches also focus on local interactions and short temporal contexts, limiting their ability to capture multi-hop dependencies and global structure. We introduce the Graph Mamba Operator (GraMO), a latent-space simulator that integrates state-space models with graph-based interaction learning. In contrast to prior work that sequences nodes or applies spatial and temporal updates in separate stages, GraMO couples graph-based interactions and temporal state updates within a single recurrence. The update is linear in the latent state, with input-dependent coefficients that adapt across regimes. We evaluate GraMO on N-body systems, motion capture, and robotics datasets, achieving the lowest error across benchmarks and the largest gains in long-horizon prediction.

Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09433v1 Announce Type: new Abstract: Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.

Operator learning for solving Fokker-Planck equations with various initial conditions

Li Zeng, Xiaoliang Wan, Yaobin Wang, Fabio Nobile, Tao Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09434v1 Announce Type: new Abstract: The Fokker-Planck equation (FPE) plays a pivotal role in describing the time evolution of probability density functions (PDFs) for systems governed by stochastic dynamics. In this work, we propose a conditional normalizing flow-based physics-informed neural network (PINN) framework for efficiently approximating the solution operator of the FPE for a whole range of initial conditions. Leveraging the Chapman-Kolmogorov equation for Markovian stochastic processes, the problem is reformulated into approximating a transition PDF starting at initial time from a Dirac mass centered at an arbitrary point. The PDF of an associated linearized stochastic differential equation (SDE) is employed as the base distribution for the normalizing flow, providing a good approximation of the target PDF, especially for small times, and thereby avoiding the singularity of the map associated with the Dirac delta initial distribution. Furthermore, a time-weighted loss function is introduced to mitigate numerical instabilities arising at small times, achieving a balance between causality and training difficulty as time progresses. A variety of numerical experiments are presented to illustrate the effectiveness and robustness of the proposed method.

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09435v1 Announce Type: new Abstract: Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts, complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters, markup, and process lexicographic structure. We introduce MUDIDI, a two-stage framework for multi-lingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplementing additional information, such as dictionary introduction, to the LLMs can improve the quality of the digitized dictionary. Github: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/

Leveraging Optimal Information-Power Flow for Transmission Switching in AC/MTDC Grids

Haixiao Li, Aleksandra Leki\'c — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09436v1 Announce Type: new Abstract: The emerging AC/multi-terminal DC grids are regarded as a promising solution for accommodating the increasing integration of renewable energy sources. This work proposes an optimization framework to address transmission switching (TS) problems arising in practical operational scenarios, such as maintenance scheduling, contingency management, and fault restoration. Unlike most existing studies, the proposed framework considers the role of communication networks in TS operations and develops an optimal information-power flow (OIPF) model. The OIPF model captures the impact of information flows on circuit breaker actions while incorporating communication-related costs, thereby better reflecting practical operational decision-making processes. To ensure computational tractability, the resulting optimization problem is formulated as a mixed-integer second-order cone programming (MISOCP) model through convex relaxations, polygonal approximations, and Big-M reformulations. Numerical case studies illustrate the applicability of the proposed OIPF model and indicate its potential in supporting transmission switching decisions.

Tracking the Effective Surface Area of Non-Convex Satellites

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09439v1 Announce Type: new Abstract: This paper presents a novel framework to track the effective surface area of non-convex satellites, enabling the use of aerodynamic drag in low Earth orbit for orbital control. The proposed framework enables the satellite to track the effective surface area while simultaneously performing other maneuvers. We introduce this framework through a backstepping control algorithm, and exemplify its advantages with an extension, to simultaneously maximize solar panel exposure. The equilibria of the closed-loop systems are shown to be asymptotically stable, and simulation results confirm the effectiveness of the proposed framework.

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, Moinuddin Qureshi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09441v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query does redundant compute and increases TTFT. Prior works precompute KV tensors of RAG documents offline and coarsely recompute some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such a coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The location of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.

Leveraging Morphology for Historical Script Metrological Analysis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09446v1 Announce Type: new Abstract: Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09447v1 Announce Type: new Abstract: We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.

Explicit and asymptotically good constructions of Algebraic Geometry codes in the sum-rank metric

Peter Beelen, Elena Berardini, Anina Gruica, Maria Montanucci — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09448v1 Announce Type: new Abstract: Algebraic Geometry (AG) codes (i.e. linear codes from algebraic function fields) in the Hamming metric were proposed by Goppa in 1980 and have been intensively studied ever since. Linearized Algebraic Geometry codes, the analogue of AG codes in the sum-rank metric, were instead introduced more recently [9], using quotients of the ring of Ore polynomials with coefficients in an algebraic function field. In this paper, we further investigate the results in [9], providing explicit, optimal and asymptotic constructions.

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

Lei Xu, Xin Quan, Andr\'e Freitas — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09449v1 Announce Type: new Abstract: Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09450v1 Announce Type: new Abstract: LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

Dense Force Estimation with an Event-based Optical Tactile Sensor

Agis Politis, Ren\'e Zurbr\"ugg, Valentina Cavinato — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09451v1 Announce Type: new Abstract: Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.

GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer

Dasari Naga Raju — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09453v1 Announce Type: new Abstract: Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learning (MIL) can recover it, remains unsettled. A key obstacle is that many pipelines select model checkpoints on the evaluation fold, artificially inflating concordance. We construct a rigorous benchmark on TCGA-PRAD (487 patients, 101 BCR events) using strict out-of-fold scoring over five-fold cross-validation repeated across five seeds. The choice of MIL aggregator (ABMIL, CLAM, TransMIL, PatchGCN) has little effect (C-index 0.61-0.64 with UNI2-h), while the feature extractor is the dominant factor (ResNet50 0.566 versus pathology foundation models up to 0.639). A clinical Cox model on grade, stage, and age reaches 0.687; no imaging-only model significantly outperforms it (p > 0.10). We introduce Grade-Disentangled MIL (GD-MIL), a gated-attention MIL encoder trained with a gradient-reversal grade adversary that encourages the slide representation to be invariant to Gleason grade before late fusion with clinical variables. GD-MIL achieves C-index 0.704, significantly outperforming both the clinical baseline (delta-c = +0.029, p = 0.0005) and the best imaging-only model (delta-c = +0.062, p = 0.039), suggesting H&E morphology contains prognostic information complementary to grade. A median risk split yields log-rank p < 0.0001 separation in BCR-free survival (~20% vs ~70% at five years).

Tuning Dispatch Thresholds for Fixed Last-Mile Routes: A Simulation-Based Pareto Analysis of a Production Policy

Alexander Ponomarenko, Ilya Antonov — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09455v1 Announce Type: new Abstract: Many parcel networks dispatch vehicles on \emph{fixed routes} using a simple load-accumulation rule: a truck leaves the depot for a fixed route as soon as the volume (or item count) waiting for that route crosses a threshold. The threshold is usually parameterised as an affine function of route length, $\tau_r=\beta+\gamma\,d_r$, and the pair $(\beta,\gamma)$ is chosen once and frozen into production. This paper studies how good that frozen choice actually is, treating the question as a data-intensive, data-driven decision-making problem over a full month of real operational flow. Using a discrete-event simulator that replays the recorded arrival stream and reconstructs every trip, we sweep the $(\beta,\gamma)$ design space, evaluate the two competing objectives -- company operating cost and average parcel lead time -- and recover the Pareto frontier of efficient policies for two deployed variants (volume-triggered and item-count-triggered). The two policies turn out to be in strikingly different states of tune. The volume-threshold configuration lies on its own Pareto frontier: the simulator finds no $(\beta,\gamma)$ pair that strictly dominates it, so the deployed policy is \emph{already Pareto-efficient} -- an unusual positive audit result. The item-count configuration is the opposite: it is dominated by a concrete simulated configuration that is both faster and cheaper, and the available cost saving at equal lead time is about \num{5.0}\,\pct{}. We trace the item-count policy's inefficiency to a base that is too large and a length coefficient that is too small for the deployed truck capacity, and show that a \emph{steeper} threshold -- lower base, higher slope -- is preferable. Because the remedy is a two-scalar reconfiguration, the analysis converts directly into an actionable, zero-capital recurring saving.

Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09456v1 Announce Type: new Abstract: On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenizer, restricting the applicability of OPD within the model series. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.

$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09457v1 Announce Type: new Abstract: Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $\omega$-EVA, a latent interactive world model that realizes an Envision--Verify--Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $\omega$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $\omega$-EVA demonstrates a compact and competitive performance--scale--data trade-off, making the world model an active action-feedback module rather than a passive predictor.

AbstRAG: Learning to Abstract for Retrieval Problems

Lei Xu, Xin Quan, Daniel Pedronette, Andr\'e Freitas — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09459v1 Announce Type: new Abstract: Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query--evidence gap into expression, conceptual, intent--evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.

A 65-nm Privacy-Preserving Neuromorphic Encoder With 7.13-nJ Efficiency, 2.38-Mb/mm^2 Item-Memory Density, and Federated Learning Support

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09460v1 Announce Type: new Abstract: The increasing demand for privacy-preserving personal data analytics in smart assistants, wearable health monitors, and context-aware systems calls for hardware that is both energy-efficient and secure. This work presents a 65-nm privacy-preserving neuromorphic encoder that leverages transistor-level process variation as physically unclonable entropy for hyperdimensional computing. The proposed 2T-2T entropy cell enables compact, device-specific, and write-free item memory, allowing privacy-preserving bio-signal encoding without storing random basis vectors in conventional memory. The fabricated prototype achieves 7.13 nJ per encoding, 2.38 Mb/mm^2 item-memory density, 76.44 nJ per prediction, and 357.32 nJ per training update. It also supports in-situ decision-making, continual learning, and federated learning for multi-user deployment and cold-start personalization. Evaluations across bio-signal datasets demonstrate 93.2% accuracy on EMG and 96.1% accuracy on UCI-HAR, while reducing hypervector dimensionality by 14.3x compared with binary hyperdimensional computing. These results demonstrate an energy-efficient and privacy-preserving neuromorphic hardware platform for secure edge biomedical intelligence.

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09461v1 Announce Type: new Abstract: Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.

The Changing Global Division of Labor in Software: Emergence and Diffusion of New Programming Skills across IT Hubs

Johannes Wachs, Xiangnan Feng, Simone Daniotti, Frank Neffke — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09463v1 Announce Type: new Abstract: With the rise of new industries, often new jobs emerge. Evolutionary Economic Geography and in particular Industry Life Cycle perspectives predict that these activities first emerge in a limited number of cities to then diffuse to other locations as job descriptions become more standardized. Here, we focus on a particularly important new industry: software development, an activity that is economically important, quickly changing, and has a pronounced spatial concentration in a small number of global IT hubs. We use an online database of over 60 million questions and answers about problems in software development that yields a longitudinal dataset of 237 software skills. By geo-locating 3 million posting users at regular intervals, we link these skills to cities worldwide. We find that, in spite of its digital nature, the software industry exhibits similar spatial regularities as previously observed in more traditional sectors. First, cities diversify into skills that are related to their existing ones. Second, new skills first emerge in cities with large and diversified software sectors, and later diffuse -- mostly unhindered by geographical distance -- to smaller cities specialized in closely related skills. We find suggestive but limited support for a windows of locational opportunity account: although even brand-new skills still emerge first in cities with strong prior specialization in related skills, concentrations of related activities impact less the emergence of new skills than the diffusion of existing ones.

DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09466v1 Announce Type: new Abstract: Classification tasks require annotated data, which can often be expensive, time-consuming, or even unfeasible to collect. This is the case of the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance decoder-only performance on classification tasks. We build on common self-learning approaches by leveraging a model to create training examples from unlabeled data to propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine what portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions, training the model to reconstruct them via next-token-prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09470v1 Announce Type: new Abstract: Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence-level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

Escaping the KL Agreement Trap in On-Policy Distillation

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09471v1 Announce Type: new Abstract: On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

Silas Kwabla Gah, Ebenezer Owusu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09474v1 Announce Type: new Abstract: Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. OpenV introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundationmodel segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9, without GFSS-specific training surpassing the strongest trained baseline by +17.7 HM.

Emergent alignment and the projectability of ethical personas

Guillermo Del Pinal, Youngchan Lee, Cameron McNamara, Alejandro Perez Carballo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09475v1 Announce Type: new Abstract: Work on `emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the `persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, `emergent alignment', and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the `Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered-out of the data sets used for narrow alignment. To test the `PSM' using a more fine-grained evaluation, we used a multidimensional `ethical persona' diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well their behavior matches their expected signature profile. Our results show that our CAI models acquire their expected ``ethical persona'' -- e.g., the model narrowly fine-tuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

Goal Sets, Not Goal States: Queryable Robot Goals through Goal-Set Hindsight Relabeling

Carlos V\'elez Garc\'ia, Miguel Cazorla, Jorge Pomares — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09476v1 Announce Type: new Abstract: Hindsight relabeling usually turns achieved future states into exact goals, which can overconstrain offline robot learning when task success depends only on a subset of the state. We propose Goal-Set Hindsight Relabeling (GS-HER), a predicate-level generalization of HER in which achieved states certify query-defined goal sets rather than singleton goal states. A binary query specifies which variables define success, making the goal predicate an inference-time input while leaving the underlying offline GCRL algorithm unchanged. Across OGBench tasks and five offline goal-conditioned learners, GS-HER improves performance when full-state goals are bottlenecked by nuisance dimensions and turns hindsight relabeling into a reusable goal interface: one checkpoint can answer multiple robot goal predicates without retraining.

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09477v1 Announce Type: new Abstract: Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.

Optical Music Recognition for Real-World Manuscripts with Synthetic Data

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09479v1 Announce Type: new Abstract: Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided. We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.

Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

Limin Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09480v1 Announce Type: new Abstract: Molecular systems involve interactions across multiple spatial scales, from local coordination and short-range perturbations to long-range electrostatic and solvent-mediated effects. However, most molecular representation learning methods rely on manually predefined scales, and the task-optimal modeling scale may not coincide with these fixed levels. This study introduces a loss-guided adaptive scale refinement framework for molecular force prediction, treating predefined scales as initial anchors and discovering task-effective resolutions through interpolation, routing, differentiable scale updates, and scale pool refinement. Using a NaCl aqueous ionic system as a minimal testbed, this study constructs short-scale and long-range force prediction branches and analyzes their complementarity. Oracle hard routing reduces the overall force MAE from 399.65 to 382.67, while continuous oracle interpolation further reduces it to 380.96. In close-contact regimes with nearest-ion distance below 0.6 nm, the close-contact MAE decreases from 327.22 to 260.51. A minimal scale pool update experiment shows that starting from endpoint anchors {0,1}, loss-guided updates automatically generate intermediate scales and recover most of the continuous oracle performance. The final updated scale pool {0,0.125,0.25,0.375,0.5,0.75,1} achieves an overall MAE of 381.23. These results support adaptive scale refinement as a promising direction for molecular representation learning, especially when fixed-scale modeling is insufficient.

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09483v1 Announce Type: new Abstract: Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.

Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism

Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09484v1 Announce Type: new Abstract: Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism -a fundamental problem in graph theory. While LLMs achieve near-perfect accuracy on isomorphism detection, we show this performance is illusory. When identical graphs are presented with permuted node labels, LLMs fail to identify their isomorphism. This finding suggests that LLMs exploit patterns rather than reasoning about abstract graph structure. Since permutation invariance is a fundamental requirement for valid structural reasoning, these results indicate that success on graph reasoning benchmarks should not be interpreted as evidence of genuine topological understanding.

LangRetrieval: Language-Guided Self-Evolving Satellite-to-Radar Retrieval via CSI-Driven Reward

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09486v1 Announce Type: new Abstract: Satellite-to-radar (S2R) retrieval estimates ground radar precipitation from geostationary satellite observations, providing a critical solution for precipitation monitoring in radar-sparse regions. However, S2R retrieval is intrinsically ill-posed: similar cloud-top radiances can correspond to distinct precipitation regimes, storm organizations, and surface intensities, which are difficult to uniquely determine the underlying meteorological state from local spectral cues alone. Meteorological semantics offer complementary scene-level information that can help resolve this ambiguity. Yet existing static semantic conditioning is often insufficient, as externally predefined semantics cannot adapt to dynamic convective scenes or align with retrieval objectives. To this end, we propose LangRetrieval, a language-guided conditional flow matching (CFM) framework that establishes a closed-loop optimization mechanism between meteorological semantics and retrieval accuracy. Specifically, LangRetrieval consists of two core components: (i) Semantic Warm-up: structured meteorological attributes are injected into the CFM backbone through cross-attention conditioning, enabling continuous semantic guidance throughout the generation trajectory; and (ii) Self-Evolving Semantic Optimization: a lightweight attribute policy is first initialized from vision-language model annotations and subsequently refined via Group Relative Policy Optimization (GRPO) using multi-threshold Critical Success Index (CSI) rewards, enabling semantic generation to evolve directly toward improved retrieval accuracy.

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09489v1 Announce Type: new Abstract: Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86\% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

Dan Zlotnikov, Alex Lazarovich, Ohad Ben-Shahar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09495v1 Announce Type: new Abstract: Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object--context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object--context decoupling during training.

Self-Harness: Harnesses That Improve Themselves

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09498v1 Announce Type: new Abstract: The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.

Targeting World Models to Compromise Robot Learning Pipelines

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09499v1 Announce Type: new Abstract: World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground truth training data. In contrast to traditional data poisoning techniques which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state of the art action conditioned and text conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall these findings necessitate research into more secure world models and reevaluating their position within the robot learning supply chain.

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

Yoojin Nam, Jinhoon Jeong, Namkug Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09500v1 Announce Type: new Abstract: Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism -- a deterministic, re-executable check where one suffices, and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills coordinated by one orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, its misses concentrated in generated-code, bibliography-internal, and style defects the prose does not expose. Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript -- feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).

Second-order bulk-surface splitting for the wave equation with kinetic boundary conditions

R. Altmann, R. Morandin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09502v1 Announce Type: new Abstract: This paper is devoted to the numerical analysis of a second-order bulk--surface splitting scheme for the semi-linear wave equation with kinetic boundary conditions. The construction is based on the interpretation of the equations as coupled system and the implementation of different difference formulae for the discrete states depending on their exact position in the system equations. This results in a 4-step scheme which decouples bulk and surface dynamics. We prove energy stability and second-order convergence under a weak CFL condition and validate these results also numerically.

Guaranteed Fast Implementation of the Split Covariance Intersection Filter: Nested Newton Method Thanks to the Fourth-Order Convexity of w-Optimization

Hao Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09505v1 Announce Type: new Abstract: The split covariance intersection filter (Split CIF) is a useful tool for general data fusion and has the potential to be applied in a variety of engineering tasks. The w-optimization problem involved in the Split CIF concerns the performance and implementation efficiency of the Split CIF. It is known that the w-optimization problem enjoys the desirable property of convexity (or more clearly, the second-order convexity in this paper's context). This paper proves that the w-optimization problem further enjoys a more desirable property namely the fourth-order convexity, thanks to which a guaranteed fast implementation of the Split CIF can be realized. The new implementation is coined as the nested Newton method, which is also presented in this paper.

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09507v1 Announce Type: new Abstract: Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09508v1 Announce Type: new Abstract: Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released in https://github.com/SHA-4096/EntropyInfer.

Local Search on Vertex Coloring for Bipartite Graphs

Johanna Gasse — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09509v1 Announce Type: new Abstract: Local search is a well-known heuristic method used in optimization. In this thesis, we explore its capabilities on the vertex coloring problem, an $NP$-hard problem with relevance in both theoretical analysis and practical application. To recognize limitations in the applicability of local search of the vertex coloring problem, we analyze local search landscapes on differently-structured bipartite graphs. We identify structures that ensure only global optima can exist as well as ones that enable the existence of non-global local optima, showing that on general bipartite graphs, it is possible for local search to return arbitrarily bad results. Further, we analyze the capabilities of local search on graphs where a local optimum can be found. To do so, we introduce a gray-box local search mutation operator that removes less frequent colors with higher probability and prove that it finds an optimal coloring on complete bipartite graphs in an expected run time of $\Theta(n \log n)$. This is a drastic improvement to the exponential tun time of the black-box Random Local Search, showing that gray-box mutation operators can improve the run time of local search.

Securing Self-supervised Data Curation for Foundation Models Robustness

Sandeep Gupta, Roberto Passerone — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09511v1 Announce Type: new Abstract: Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL greatly alleviates the costs associated with annotation and manual dataset curation while minimizing the need for human oversight. However, the integrity of SSL-curated datasets must be rigorously checked, as reliance on anonymous and unvetted external sources can substantially increase the risk of data poisoning. In this paper, we propose a Poisoned Data Detector (PDD), an active defense mechanism designed to ensure the integrity of SSL-curated datasets prior to foundation model training. PDDs are designed using a combination of the pretrained ImageBind model and traditional classifiers, including Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). We rigorously evaluated PDDs using 176,200 images from three diverse datasets and three different adversarial attacks encompassing both in-distribution and out-of-distribution scenarios. Notably, SVM-PDD achieves superior performance for both in-distribution (Set3-Set5) and out-of-distribution (TrueFace and 140K RealFace) datasets. Our design demonstrates strong scalability and enables the rapid integration of new adversarial attack detectors through an ensemble approach.

BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09514v1 Announce Type: new Abstract: Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.

SwiftVR: Real-Time One-Step Generative Video Restoration

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09516v1 Announce Type: new Abstract: Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31~FPS at 2560x1440 and 14~FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX~5090, SwiftVR reaches 26~FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.

Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

Jan Niklas Lettner, Hadeer El Ashhab, Benjamin Sch\"afer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09517v1 Announce Type: new Abstract: As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current-proper-scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibration-aware objectives and architectures to ensure the distributional integrity of energy market forecasts.

Constructions of Quantum $(r,\delta)$-LRCs from cyclic codes

Rajendra Prasad Rajpurohit, Maheshanand Bhaintwal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09522v1 Announce Type: new Abstract: Classical $(r,\delta)$ locally recoverable codes (LRCs) play a central role in distributed data storage systems as they enable an efficient recovery from erasures by accessing a small number of surviving symbols. Motivated by their prospective use in future quantum data storage and by recent theoretical progress on quantum locally recoverable codes (qLRCs), we investigate the construction of qLRCs from classical cyclic $(r,\delta)$-LRCs. Our approach identifies cyclic LRCs whose defining sets satisfy a dual-containing condition, allowing them to serve as valid CSS ingredients. We present three explicit families of $(r,\delta)$-qLRCs, two of which are optimal with respect to the quantum Singleton-like bound, whenever the codes are pure, thereby providing optimal examples. Additionally, the codes presented in Constructions 2 and 3 have no bound on their lengths with respect to the field size required to obtain these codes.

Emergence of Context Characteristics Sensitivity in Large Language Models

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09525v1 Announce Type: new Abstract: During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as containing high length, context-query similarity, and fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and designing a balanced IFT dataset is important in ensuring robust context utilization of instruction-tuned models.

When Types Intersect and Effects Get Handled

Ugo Dal Lago, Taro Sekiyama, Stefano Catozi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09526v1 Announce Type: new Abstract: We introduce a novel intersection type system for a $\lambda$-calculus with algebraic effects and handlers. The system, inherently behavioral in nature, enjoys the classical properties of intersection type systems, in particular subject reduction and expansion. It thus characterizes the set of terms whose evaluation process terminates and, at the same time, allows reducing the reachability problem to type inference. This new system, the first with these features for a calculus with handlers, induces a system of simple types which, although not guaranteeing termination, is type sound and admits a decidable HOMC problem, unlike similar type systems like Dal Lago and Ghyselen's HEPCF.

Relocate and Emulate: Re-Hosting Android's Application Layer

Thomas Sutter, Timo Kehrer, Bernhard Tellenbach, Marc Rennhard — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09528v1 Announce Type: new Abstract: Dynamic analysis of Android's application layer typically relies on physical devices, limiting scalability and reproducibility. To compensate, we introduce a systematic re-hosting method that relocates the Android framework and pre-installed software from real device firmware into a fully emulated environment. Our approach integrates vendor-specific components into the Android Open Source Project (AOSP) build system using tailored extraction and injection strategies, producing vendor-flavoured emulator images that preserve system integrity and runtime compatibility. This enables dynamic execution of real-world framework and application-layer components, including proprietary binaries and pre-installed apps, across multiple SDK versions. We evaluate our method on 184 firmware samples from SDK 31-33. It achieves high build and boot success rates, with residual failures primarily occurring during core-service initialization due to baseline strategy limitations, missing dependencies, device-protection checks, or emulator constraints. However, the modular design allows injection strategies to be extended for specific firmware, supporting broader compatibility and future research on automated, adaptive re-hosting. Though we identified potential for optimization through engineering vendor-specific solutions, our research demonstrates the feasibility of vendor-flavoured emulators for scalable, reproducible dynamic analysis.

Hybrid Metaheuristic Combining the Dragonfly Algorithm and Tabu Search for the Traveling Salesman Problem

Ammar Bouketta — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09529v1 Announce Type: new Abstract: The Traveling Salesman Problem (TSP) is a classical NP-hard combinatorial optimization problem that aims to find the shortest Hamiltonian cycle visiting each city exactly once and returning to the starting point. This paper proposes a hybrid metaheuristic for the TSP by combining the Dragonfly Algorithm (DA), a swarm-intelligence-based global search method, with Tabu Search (TS), a memory-based local search technique. The proposed method follows a High-Level Relay Hybridization (HRH) scheme, in which DA is first used to explore the solution space and generate a promising initial tour, while TS subsequently refines this solution through neighbourhood-based improvement and tabu memory. The hybrid approach is evaluated on standard TSPLIB benchmark instances, including burma14, att48, and ch150, and compared with standalone DA, standalone TS, and several classical metaheuristics such as Genetic Algorithm, Ant Colony Optimization, Particle Swarm Optimization, and Random Search. A systematic grid-search procedure is also conducted to study the influence of the main hyperparameters on solution quality and execution time. The experimental results indicate that the proposed hybrid can improve tour quality compared with the standalone DA and TS on the tested instances, highlighting the benefit of combining global exploration with local exploitation. However, the results also suggest that performance remains sensitive to parameter settings and problem size, motivating further validation on larger benchmarks and stronger TSP-specific baselines.

Reduced integration with scaled boundary parametrization for virtual elements at finite strains

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09530v1 Announce Type: new Abstract: This contribution presents an alternative stabilization technique for the virtual element method (VEM) based on reduced integration combined with a scaled boundary parametrization. To this end, a Taylor series expansion of the constitutive quantities with respect to the sectional center is carried out, enabling analytical integration of the weak form and reducing the need for integration points to only one per section. The accuracy of the proposed formulation is shown by several numerical examples, including a non-linear patch test. Different loading, e.g. compression under large deformations, and material conditions, such as hyperelastic anisotropy and elasto-plasticity, are considered. The biquadratic serendipity finite element formulation (Q2) and the low-order finite element formulation with hourglass stabilization (Q1STc+) are used for comparison. While the patch test was not fulfilled using higher order shape functions, the formulation led to good results and was able to capture the structure's response accurately. Furthermore, the formulation performed better when the physical element resembled the pre-assigned parent elements. The example of the asymmetrically notched specimen under elasto-plastic material behavior showed that the proposed formulation is able to capture inelasticities.

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

Muhammad Hamza Arshad Majeed, Sidahmed Benabderrahmane, Talal Rahwan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09532v1 Announce Type: new Abstract: Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100\% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88\% holdout pass rate) and 40 clean predictive rules with 2--7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.

Just-in-time Restoration with Distributed Fiber Sensing in Metropolitan Optical Networks

Sleman Mouammar, Italo B. Brasileiro, Andre C. Drummond — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09533v1 Announce Type: new Abstract: Distributed Fiber Sensing (DFS) leverages optical backscattering signals to predict failure events and enable just-in-time restoration in metropolitan optical networks, i.e., without optical amplifiers. In this paper, we study the effectiveness of proactive restoration based on DFS information in all-optical networks, while considering different sensing devices' capabilities. We evaluate whether restoration can be provisioned just-in-time before a failure happens, and its impact on key performance metrics, including the number of affected and suspended optical circuits, bandwidth blocking rate, and service downtime. Simulation results demonstrate that just-in-time restoration enabled by DFS with a prediction time capability of 15~ms can reduce circuit disruptions by more than 90\% compared to restoration without sensing and ensure optical service continuity in optical networks comparable to resource-intensive protection schemes, at a fraction of the spectral resources.

A Continuification Approach to CAV Control in Mixed Traffic via Variable Speed Limits

Brian Block, Cecilia Pasquale, Silvia Siri, Simona Sacone, Stephanie Stockar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09534v1 Announce Type: new Abstract: This paper presents a method for controlling traffic via the use of connected and automated vehicles (CAVs) acting as moving bottlenecks. Current methods for moving bottleneck control use a couple PDE-ODE model, based on the Lighthill-Whitham-Richard (LWR) model, to represent the influence of the CAV. Control of the CAV is normally achieved by designing the control on the ODE which models the speed of the moving bottleneck. The proposed method in this paper instead looks to reduce the computational burden of controlling multiple CAVs by designing the moving bottleneck controller first on the PDE. The original control designed on the PDE is a linear quadratic regulator (LQR) that determines the optimal variable speed limit (VSL) for the entire length of freeway in order to regulate density to a desired setpoint. Then, a continuification approach is utilized to determine the input speed for each CAV. Results show that multiple CAVs can be controlled via this method, with minimal computational burden, and that as the number of CAVs increases the solution approaches the global optimal solution determined by the LQR.

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09535v1 Announce Type: new Abstract: Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

Lucas G\"ornhardt, Timo Bartels, Niklas Schwarz, Tim Fingscheidt — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09536v1 Announce Type: new Abstract: Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and letting entropy-based detection algorithms fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, attempts to integrate Hadamard codes into semantic segmentation fall far behind state-of-the-art models in mean intersection-over-union performance. Regarding object detection, such output encodings have not yet been investigated at all. Further, no prior art addressed intrinsic codeword inconsistencies or actually exploited intrinsic codeword redundancy. Accordingly, we first derive a novel decoding procedure for Hadamard codewords towards optimal class-wise probabilities, solving the underlying optimization problem by using the projection onto the probability simplex. Second, our optimization delivers a measure of prediction inconsistency. Third, we are the first to show how to exploit these inconsistencies for adversarial attack and disturbance detection. Fourth, we introduce HadamardNet, a framework employing Hadamard codes as output representations for semantic segmentation and object detection models and tasks. We conduct a comprehensive evaluation both on disturbances and adversarial attacks, achieving state-of-the-art perturbation detection performance for both tasks in only a single detection pass, while delivering equivalent or close-by reference performance on clean data.

STEPS: Semantic-Contract-Guided Scheduling for LLM-Assisted Natural-Language-Driven Edge AI Services

Houyi Qi, Minghui Liwang, Xianbin Wang, Seyyedali Hosseinalipour — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09537v1 Announce Type: new Abstract: Networked AI services are increasingly delivered through edge infrastructures to support latency-sensitive applications. Edge scheduling is critical for deciding where and how AI services are executed under limited communication and computing resources. Existing frameworks usually assume that requirements are given as numerical constraints, such as latency bounds, energy budgets, or cost limits. In practice, users often express expectations through ambiguous natural language, creating a gap between user intent and resource constrained scheduling. To bridge this gap, we propose semantic-contract-guided edge potential scheduling (STEPS), a natural language driven scheduling framework for LLM assisted edge AI services. STEPS introduces semantic contracts as executable interfaces between user-side semantics and edge-side decision making. An LLM assisted semantic parser extracts service levels and confidence scores, which are converted into service preferences, fulfillment bounds, and semantic uncertainty. Based on these contracts, STEPS formulates edge scheduling as a contract-guided potential game that jointly determines execution-node selection, computing-resource provisioning, and bandwidth allocation. It also builds feedback signals from semantic request drift, fulfillment drift, fulfillment pressure, and admission pressure to adjust semantic admission, contract conservativeness, and edge coordination. We characterize the exact potential game structure, establish pure strategy Nash equilibrium existence, and prove convergence and stability properties. Experiments show that STEPS improves semantic contract fulfillment, reduces contract guided service loss, and maintains robust adaptation under ambiguous requests and non-stationary edge environments.

Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09539v1 Announce Type: new Abstract: Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic datasets, we compare 1-block, 2-block (standard), and 3-block STGCN variants. Our findings reveal that the single-block architecture achieves optimal performance for short-term prediction (10 mins) on three of four datasets, while incurring only marginal degradation ($\leq$1.8% relative error) at longer horizons. Crucially, the 2-block variant incurs 61% higher CPU inference latency and 37% lower throughput relative to 1-block -- substantial overhead for resource-constrained ITS deployment. The 3-block architecture offers no favourable tradeoff, more than doubling computational cost for $<$0.5% relative improvement. These results suggest that the default 2-block STGCN may be over-parameterised for many applications, with implications for both practitioners deploying traffic prediction systems and researchers benchmarking efficiency-focused methods.

A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation

Siyuan Li, Xiaoyang Bi, Mengshi Qi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09542v1 Announce Type: new Abstract: Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

Dmitry Pronin, Evgeny Kazartsev — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09543v1 Announce Type: new Abstract: This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09547v1 Announce Type: new Abstract: Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is it's ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non -interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

Model Poisoning Against Federated Model Adaptation with Chain of Bit-Flips

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09548v1 Announce Type: new Abstract: Federated Learning (FL) allows a set of clients to collectively train a global model without sharing local training data. Giving the responsibility of the training to decentralized actors may lead to poisoning attacks: clients controlled by malicious third party potentially poison the training dataset to install a backdoor in neural networks. In FL, these backdoor attacks rely solely on algorithmic approach, however, recent advances in hardware faults threats (e.g, Rowhammer) have widen the overall attack surface. In the context of federated model adaptation, we introduce a novel category of backdoor attack against FL systems that relies on model poisoning based on hardware-fault attacks. More precisely, we propose a task-agnostic backdoor attack that is implanted during the FL training time by inducing hardware faults (bit-flips) in parameters of a single local model. The backdoor is crafted during a previous offline phase from the pretrained model initially used by the FL system. Our results show that a backdoor can be successfully applied on different type of models and datasets. Typically, with up to 10 faults per malicious client occurrence and 19 total occurrences on a ResNet-18 are enough to reach 94% of attack success rate. Finally, we discuss the practicality and the robustness of the attack potential defenses, while putting into perspective the practical constraints of Rowhammer, which is the preferred attack vector for this type of threats.

SecureClaw: Clawing Back Control of LLM Agents

Yuhan Ma, Stefan Schmid — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09549v1 Announce Type: new Abstract: Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.

InquiTree: Evaluating AI Agents in the Scientific Inquiry Loop with Paper-Derived Research Trees

Shaoyang Cui — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09550v1 Announce Type: new Abstract: While LLM-based agents are increasingly used in scientific workflows, it remains unclear whether they are truly qualified for the dynamic and uncertain process of discovery. Existing static evaluations often conflate genuine reasoning with rote memorization. We introduce InquiTree, a diagnostic environment that formalizes scientific inquiry as interactive Research Trees: directed acyclic graphs capturing the logical dependencies among hypothesis formulation, study design, result interpretation, and belief updating. Evaluating agents on a 30-paper test pool and releasing the open-access InquidTree-18(IT-18) subset, we identify two key limitations. First, agents exhibit an "Erosion of Marginal Capabilities": during long-horizon interactions, they develop "cognitive tunneling," where critical judgment and anomaly detection degrade relative to their intrinsic baselines. Second, performance drops on papers published after model training cutoffs, revealing a boundary between interpolation and extrapolation and suggesting that apparent competence is partly driven by parametric memory. These findings indicate that scaling context alone is insufficient for reliable AI scientists; stronger architectures or human oversight may be required to preserve critical evaluation and generalization.

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

Yuhan Ma, Yong Li, Stefan Schmid — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09551v1 Announce Type: new Abstract: Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a $1.24\times$--$1.50\times$ end-to-end speedup and reducing online communication by $9\%$--$16\%$ on BERT and GPT-style models; preprocessing is also lighter, with $14\%$--$23\%$ lower key-generation time and $20\%$--$24\%$ smaller keys.

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09553v1 Announce Type: new Abstract: Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Yinan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09556v1 Announce Type: new Abstract: AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.

Safe-RULE: Safe Reinforcement UnLEarning

Shixiong Jiang, Taozheng Zhu, Fanxin Kong — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09559v1 Announce Type: new Abstract: Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

PRISM: Recovering Instruction Sets from Language Model Activations

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09563v1 Announce Type: new Abstract: As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

STON'R Converges to First-Order Nash~Equilibria of Multiplayer Games

Marika Kosohorsk\'a, Tom\'a\v{s} Kroupa, Tom\'a\v{s} Votroubek — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09565v1 Announce Type: new Abstract: Nonconcave games present a unique challenge, as neither pure Nash equilibria nor local Nash equilibria (LNE) are guaranteed to exist, even in zero-sum settings. Additionally, computing approximate LNE in smooth multiplayer games over bounded regions is PPAD-hard. These challenges, coupled with the inherent complexity, have driven recent research toward broader equilibrium concepts, such as min-max critical points, and first-order Nash equilibria (FONE), which correspond to solutions of specific non-monotone variational inequalities. This paper addresses general-sum multiplayer games with compact convex strategy sets and smooth, nonconcave utility functions. Daskalakis et al. introduced the STON'R algorithm for solving variational inequality problems and established convergence under smoothness assumptions. They further showed that the algorithm's limit points correspond to equilibria in specific classes of games, namely local minimax equilibria in two-player zero-sum games and Nash equilibria in smooth concave games. In this work, we extend the convergence result to multiplayer general-sum games and show that the variational inequality solutions targeted by STON'R correspond to first-order Nash equilibria (FONE), a general game-theoretic solution concept that unifies these previously studied cases. We demonstrate the effectiveness and robustness of the algorithm on various examples from recent literature.

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

Tom Beyer, Svea Wisy, Sven Tomforde — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09568v1 Announce Type: new Abstract: The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

Tao Li, Liang Liu, Jianli Han, Weimin Lv — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09569v1 Announce Type: new Abstract: With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.

UXBench: Benchmarking User Experience in AI Assistants

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09570v1 Announce Type: new Abstract: As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09572v1 Announce Type: new Abstract: Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dualview visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust realworld deployment on resource-constrained robotic platforms.

Code Is More Than Text: Uncertainty Estimation for Code Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09577v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09578v1 Announce Type: new Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.

AeroMesa: Efficient Data Management System for Multi-Dimensional Spatio-Temporal Trajectories

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09581v1 Announce Type: new Abstract: The rapid growth of trajectory data -- especially the dense 4D traces generated by unmanned aerial vehicles (UAVs) -- is placing mounting pressure on spatio-temporal data management systems. Existing HBase-based trajectory indexes suffer from three limitations: coarse-grained temporal pruning, locality-unfriendly XZ2 spatial encodings with workload-blind ordering, and severe row-key interval fragmentation when altitude is jointly encoded with the horizontal dimensions. We present AeroMesa, a unified system that natively supports $(x,y)$, $(x,y,t)$, $(x,y,z)$, and $(x,y,z,t)$ queries within a single storage framework. AeroMesa integrates three complementary designs: a temporal index (TI$^{+}$) that refines pruning to second-level granularity, a Hilbert-BFS spatial index with a Workload-Aware Jaccard ordering, and a decoupled 4D architecture that separates horizontal indexing from altitude-aware secondary indexing to eliminate isotropic-encoding fragmentation. We implement AeroMesa on Apache HBase and Redis and evaluate it on a real-world dataset (T-Drive) and a high-fidelity 90,000-trajectory UAV simulation dataset. AeroMesa consistently outperforms all baselines: TI$^{+}$ cuts temporal-query candidates by up to 51% over MCTM, the Hilbert-BFS/WAJ index lowers 2D latency by up to 17.9% over the state-of-the-art TMan, and the decoupled 4D design reduces latency by up to 30$\times$ while cutting merged scan ranges by up to three orders of magnitude over XZ3/TXZ3 joint-encoding approaches.

On Choosing the $\mu$ Parameter in Gaussian Differential Privacy

Bogdan Kulynych, Antti Honkela — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09582v1 Announce Type: new Abstract: Recent work argues for using Gaussian differential privacy (GDP) to report the privacy guarantees in privacy-preserving machine learning. We provide principled mappings from pure-DP $\varepsilon$ to GDP $\mu$ by matching the worst-case success of a strong-adversary membership inference attack in terms of three metrics: multiplicative advantage at fixed FPR, precision at fixed recall, and the standard privacy profile. We tabulate $\mu$ values across a useful range of parameters and recommend $\mu \approx \varepsilon/5$ as a conservative general-purpose conversion.

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09585v1 Announce Type: new Abstract: Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

Pressure-robust and quasioptimal Discontinuous Galerkin discretisations of the $p$-Stokes problem

P. A. Gazca-Orozco, M. R\r{u}\v{z}i\v{c}ka — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09586v1 Announce Type: new Abstract: In the present paper, we propose Local Discontinuous Galerkin (LDG) approximations for a nonlinear system of $p$-Stokes type, having $(p,\delta)$-structure. On the basis of the primal formulation, we prove well-posedness and stability (a priori estimates) of the methods under truly minimal regularity assumptions. We show that the first method possesses a pressure-robust and quasi-optimal error estimate, and discuss its consequences. Moreover, we propose a second method, for which we show a pressure-robust error estimate and prove convergence and convergence rates, which are optimal for linear ansatz functions for all $p\in (1,\infty)$ and $\delta\geq 0$.

Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization

Muhammad Haris Khan, Joel wester — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09587v1 Announce Type: new Abstract: People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and evaluate it both computationally and through a study with 16 participants who regularly use AI for creative tasks. Our computational assessment reveals that SRT increases semantic diversity by 85--167\% while reducing consensus phrases by 43--95\% across task modes. In the user study, SRT outputs received higher usefulness ($p = .019$, $W = .208$) and coherence ratings ( $p = .006$, $W = .260$); 68.8\% of participants were willing to use SRT-Strong for multiple tasks versus 18.8\% for baselines. Originality and coherence ratings were positively correlated across all systems ($\rho = +.40$ to $+.67$), suggesting that divergence need not compromise readability. Taken together, these preliminary findings can inform the design of AI systems that aim to support everyday creativity without contributing to homogenization.

Probabilistically Checking Quantum Proofs, with Interaction

Baocheng Sun, Thomas Vidick — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09588v1 Announce Type: new Abstract: The model of interactive oracle proofs (IOP) generalizes the notion of probabilistically checkable proof (PCP), in which a static proof is verified probabilistically by querying a small number of bits, to the interactive setting: a polynomial-time verifier interacts with an unbounded prover, but is restricted to only reading a small number of bits, in total, from the messages sent by the prover. IOPs provide a relaxed setting in which to study local probabilistic verification. They have proved instrumental in devising efficient methods for verification through subsequent compilation into non-interactive or succinct protocols. We study a quantum analogue of interactive oracle proofs (qIOP) in which the verifier and communication are both allowed to be quantum; yet the verifier is restricted to perform measurements only on a small number of qubits received from the prover. Our main result is a qIOP for any language in QMA, in which the total communication is polynomial but the verifier only reads a polylogarithmic number of qubits in total. The protocol has completeness parameter exponentially close to $1$ and soundness bounded away from $1$ by a constant. In the absence of a quantum PCP theorem, this provides the first information-theoretically sound local and robust characterization of QMA, albeit interactive. Our protocol combines the use of a quantum locally testable code (LTC) with classical techniques, notably probabilistically checkable proofs of proximity (PCPP). We avoid the necessity for complex multi-qubit tests employed in other settings by leveraging the local indistinguishability property of the quantum LTC.

I Was Scrolling and Then I Saw a Pregnant Strawberry

Piera Riccio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09589v1 Announce Type: new Abstract: AI minidramas (also known as fruit dramas) are short, algorithmically distributed generative AI video series featuring anthropomorphized characters that have recently emerged as a widespread phenomenon on social media platforms. This paper argues that despite their seemingly innocuous aesthetic, these videos reproduce deeply gendered narrative structures in which female characters are systematically associated with moral transgression, sexual betrayal, and reproductive capacity, and that several plots also encode the logic of racialization, i.e., the process by which visible bodily difference is morally loaded. Drawing on feminist film theory, critical race theory, and platform studies, it further argues that the generative AI aesthetic of these videos, characterized by softness, roundness, and visual cuteness, functions as a mechanism of aesthetic laundering, neutralizing the ideological weight of these narratives and enabling their circulation despite content moderation systems. This paper approaches these questions through personal observation and close reading, reflecting on the specific affordances of generative AI that make this phenomenon both possible and culturally consequential for the field of computational creativity.

Clinically Grounded Privacy Evaluation of Medical LMs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09590v1 Announce Type: new Abstract: Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

Parent-Hash DAG: A Cost Analysis of Constant-Time Append for On-Chain Registries

Ian C. Moore, Fernando Paredes Garcia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09593v1 Announce Type: new Abstract: Provenance trees are append-only directed acyclic graphs of artifact registrations anchored on a public blockchain, recently introduced as the data substrate of operator-gated provenance infrastructure. Their defining data-structural pattern is a parent-hash directed acyclic graph (PHDAG), in which each append performs a constant number of storage writes to previously-untouched slots. This pattern has not previously been isolated as a standalone primitive, formally bounded with explicit constants, or benchmarked against the standard alternative, the incremental Merkle tree (IMT). We formalize PHDAG append as O(1) in gas cost, independent of registry size and tree depth, and develop a stochastic cost model for IMT in which per-insert cost is a random variable over the leaf index, deriving closed-form expressions for its mean and variance. We validate both analyses empirically on Base Sepolia across tree depths 1 to 25. PHDAG is observed to be depth-invariant at 76,276 gas (standard deviation about 6 gas), while IMT cost grows linearly with depth. The crossover below which IMT is cheaper falls far beneath the depths of every production registry surveyed. We further establish trustless registry reconstruction from public event logs in linear time with no off-chain dependency.

Popcorn: A Configurable Benchmark for Visual Evidence in Multimodal Movie Recommendation

Ali Tourani, Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09595v1 Announce Type: new Abstract: Movies are long-form audiovisual works, yet recommender benchmarks often rely on trailers, thumbnails, or metadata. These sources differ in semantics and scalability: full movies preserve consumption-level evidence, trailers concentrate promotional highlights, and thumbnails provide sparse but catalog-scale visual signals. We present Popcorn, a configurable benchmark for visual evidence in multimodal movie recommendation, combining title-aligned full-movie/trailer embeddings with MovieLens-linked thumbnail features encoded by modern visual and vision-language models. Popcorn standardizes modality assembly, fusion, splitting, evaluation, and LLM-augmented metadata through a single configuration contract. Experiments show that thumbnail VLMs provide strong, scalable item-side evidence, while controlled trailer/full-movie comparisons show that visual evidence sources are not interchangeable: the choice of source and fusion strategy affects ranking accuracy, coverage, diversity, and calibration. The framework is available at https://github.com/RecSys-lab/Popcorn.

Awareness of Technological Isomorphism: Integrating AI into Elementary Mathematics Teaching on Data and Prediction,A Case Study of the Compound Line Graph

Li Li, Yu Cao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09598v1 Announce Type: new Abstract: The deep integration of Artificial Intelligence (AI) into elementary mathematics education necessitates a conceptual tool capable of explaining students' cognitive transition from disciplinary knowledge to AI understanding. This study proposes a novel core concept, "Awareness of Technological Isomorphism, " defined as a student's metacognitive realization that their own mathematical cognitive operations (e.g., observing trends, inducing patterns, and making predictions) share an underlying logical structure with AI technical operations (e.g., pattern recognition and predictive modeling). This awareness, in turn, facilitates cognitive transfer from disciplinary mathematics to AI comprehension. Underpinned by transfer learning and metacognitive theories, this study clarifies the distinct essence of this concept from traditional "computational thinking." We demonstrate the explanatory power of this framework in two ways: elucidating the mechanism of students' cognitive leap from mathematics to AI, and guiding instructors to identify "isomorphic interfaces" within disciplinary curricula. On this basis, a three-stage pedagogical pathway--spanning "Perception, Comprehension, and Creation"--is constructed alongside a corresponding evaluation rubric. This framework is empirically validated through a case study based on the "Compound Line Graph" lesson from a fifth-grade mathematics textbook in China, offering a highly replicable operational framework for the deep convergence of disciplinary instruction and AI literacy education.

Formal Foundations and Proof-Carrying Certificates for q-ary Covering Codes in Lean 4

Andreas Florath — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09600v1 Announce Type: new Abstract: Covering codes in finite Hamming spaces ask for small sets of words whose Hamming balls cover the whole space. This paper presents a Lean 4 formalization of the elementary theory of q-ary covering codes, centered on certificate predicates for upper bounds, lower bounds, and exact covering numbers $K_q(n,r)$. The formalization proves the q-ary Hamming-ball volume formula, the sphere-covering lower bound, elementary exact cases, product and relation rules, and selected small exact certificates. It also demonstrates an end-to-end workflow for checking explicit upper bounds transcribed from van Laarhoven et al. (1989). The accompanying database is proof-carrying: stored bounds have traces that replay to Lean proofs of the corresponding upper- or lower-bound predicates. The contribution is not new record bounds or a reproduction of known tables, but a reusable, auditable foundation for machine-checked covering-code certificates.

Assessing Sample Quality in Conditional Generation under Compositional Shift

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09601v1 Announce Type: new Abstract: Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive to explore conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.

Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion

Kuanlin Chen, Cheng-En Ou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09603v1 Announce Type: new Abstract: Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Jonathan F. Carter, Lionel Tarassenko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09605v1 Announce Type: new Abstract: Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using $100\times$ less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.

Path-Traced Inverse Rendering with Global Illumination in 3D Gaussian Fields

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09606v1 Announce Type: new Abstract: Ray tracing enables 3D Gaussian fields to serve as a representation for physically based light transport. Faithful inverse rendering requires forward rendering and backward optimization to be defined within a consistent light-transport pipeline. Existing inverse rendering methods estimate G-buffers via splatting and optimize materials in screen space, tying the recovered properties to a rasterization-based pipeline. This pipeline mismatch, together with simplified rendering equations that neglect indirect illumination, often leads to inconsistent shading, visible artifacts, and inaccurate material-lighting estimation under path-traced rendering. Therefore, we propose a splatting-free path-traced inverse rendering framework for 3D Gaussian fields, where forward light transport and backward gradient propagation are defined within a unified ray-tracing pipeline. Our key idea is to define a path-space equivalent interaction model for overlapping Gaussian primitives, under which Monte-Carlo-based path tracing is unbiased for the induced light-transport integral, while pathwise gradients are replayed over the same ray-traced interactions rather than splatting-derived screen-space buffers. The framework optimizes materials and a compact Spherical-Gaussian environment under the full rendering equation with ray-traced visibility and multi-bounce light transport. Extensive experiments demonstrate competitive material inversion and improved path-traced rendering quality, producing more plausible shadows, reflections, and relighting results under global illumination.

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Yongzhong Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09607v1 Announce Type: new Abstract: Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Zhiqiang Wu, Yitong Dong, Xian Wei — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09608v1 Announce Type: new Abstract: Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g $2048^2$) outputs often remains extremely poor, primarily due to two factors we consider: the image upsampling ratio (e.g $\times8$) exceeding the model's native-supported upsampling ratio (e.g $\times4$), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it hard on limited-resource equipment. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adapts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09610v1 Announce Type: new Abstract: Cooperative object transportation is essential in numerous domains, including industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09613v1 Announce Type: new Abstract: Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09615v1 Announce Type: new Abstract: Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.

Strict-Priority Packet Delay in Switches with Transmit-Ring Buffering

Yash Deshpande, Quirin Vogel, Wolfgang Kellerer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09619v1 Announce Type: new Abstract: Strict Priority (SP) scheduling is widely used at switch egress to provide low-latency service to high-priority (HP) traffic. Existing deterministic and stochastic latency models typically account for scheduler behavior and packet transmission, but omit a common switch implementation detail: the transmit ring (TXR) between the scheduler and the physical port. Because the switch must prepare the next packet before the current transmission completes, packets already placed in the TXR can further delay HP packets. This changes both the worst-case delay and the per-hop delay distribution of HP packets. This paper identifies this modeling gap, extends standard SP latency models to include the TXR, and validates the revised model through measurements on multiple switches. It also provides a measurement method for estimating the TXR size, a parameter that is often not reported in switch datasheets. The resulting model provides a closer representation of switch behavior for systems that use SP scheduling and require either delay bounds or delay distributions.

Motion planning for hundreds of floating robots

Jan Kamm, Antonio Terpin, Raffaello D'Andrea, Aswin Ramachandran — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09620v1 Announce Type: new Abstract: Planning collision-free motion for large robot fleets is difficult because collision avoidance induces strong inter-agent coupling that grows rapidly with team size. We consider omnidirectional floating robots on water, where choreographies are specified by sparse keyframes and an interactive tool must generate trajectories within seconds, even when transitions span minutes and thousands of time steps. We propose a scalable pipeline that builds a collision graph from an initialization, decomposes the coupled problem into interaction clusters, and solves clusters independently (and in parallel) with robustness mechanisms for common decomposition pathologies. We validate the approach in simulations up to 500 robots. The synthesized trajectories have also been deployed in two real-world demonstrations, on Lake Z\"urich with a fleet of 24 Way of Water crafts and at the Time Space Existence 2025 Venice Biennale.

Constrained user-item allocation for e-commerce marketing campaigns

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09623v1 Announce Type: new Abstract: When running marketing campaigns, retailers must decide which products to promote and which users to target. These decisions are inherently coupled: effective campaigns match users and items with strong mutual affinity into non-overlapping groups of predefined sizes. However, existing approaches assume predefined campaign structure or decouple item selection from user assignment, and cannot discover campaign groupings directly from joint interaction patterns. We therefore formalize this campaign problem as auto-targeting: jointly selecting users and items to construct multiple disjoint campaigns. To solve this combinatorial problem, we propose three complementary strategies: (i) constrained spectral biclustering to find dense regions in the user-item affinity matrix, (ii) greedy local search with pairwise swaps for combinatorial refinement, and (iii) a multi-armed bandit framework to escape local optima through exploration. We evaluate these methods on a synthetic dataset, the Amazon Reviews benchmarks, and large-scale proprietary commercial data, and compare the results to simulated annealing as a baseline. The results show that biclustering consistently achieves the highest campaign quality, lift, and fairness scores. While biclustering runs efficiently on smaller datasets, its runtime increases substantially on very large ones, where bandit-based methods instead offer a scalable alternative.

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09630v1 Announce Type: new Abstract: Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.

Efficiently Restructuring Sovereign Debt via Arctic Auctions with Convex Costs

Jugal Garg, Edwin Lock, Vijay V. Vazirani — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09631v1 Announce Type: new Abstract: We study the problem of computing competitive equilibria in the Arctic product-mix auction, originally developed for the Icelandic government for exchanging blocked financial accounts, and more recently proposed by IMF staff for sovereign debt restructuring. From the buyers' perspective, the Arctic auction is equivalent to the quasi-linear Fisher market. However, unlike the standard Fisher model, the seller can express rich supply preferences through explicit supply-side costs and constraints. Despite extensive algorithmic literature on Fisher markets, the seller side has not received much attention, and no polynomial-time algorithm was previously known for computing competitive equilibrium when sellers face nontrivial costs. We examine the natural and expressive regime of separable, stepwise-increasing marginal costs that underlie the above-stated applications. Using polyhedral theory techniques, we first show that rational inputs lead to rational-valued competitive equilibria. Motivated by this result, we develop the first polynomial-time algorithm for this setting based on a non-trivial extension of classic primal-dual balanced-flow techniques for linear Fisher markets. Our work provides a robust computational foundation for auctions with sophisticated preferences, paving the way for flexible and institutionally feasible market designs in global finance.

Civil Court Simulation with Large Language Models

Yifan Chen, Haitao Li, Kaiyuan Zhang, Yueyue Wu, Qingyao Ai, Yiqun Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09632v1 Announce Type: new Abstract: Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practice and harder to simulate because its claims, liability, and remedies are more flexible. We present a multi-agent court simulation framework for Chinese civil cases. The framework organizes role-based interaction through a five-stage civil trial procedure and integrates memory module and statute retrieval to support long-process adjudication. Experiments show that the framework produces reliable civil judgments, with clear strengths in liability allocation and multi-item adjudication. Further experiments show that memory quality substantially affects downstream simulation quality. Through a five-layer factor framework, we analyze how legal grounding, information conditions, judicial capability and role orientation, organizational pressure, and social context affect the framework's reliability and behavior. These results support the effectiveness of the proposed framework for civil court simulation. The dataset and code are available at: https://github.com/foggpoy/Civil-Court.

ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

Debojyoti Biswas, Xianbiao Hu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09634v1 Announce Type: new Abstract: 3D object detection is the backbone of perception for automated vehicles (AV) and broader intelligent transportation systems applications. Long-range detection is challenging because sensing evidence is sparse; yet this ``long-range'' scenario is routine in traffic. Although >30m is often labeled long-range in computer vision, on roadways it affords only approx. 1-2s for perception and decision-making. Under such extreme sparsity, two core challenges arise. First, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, degrading long-range recall. Second, context-agnostic uniform channel supervision favors dense and near-range samples, leaving far and small objects under-optimized, delaying the earliest detection of distant objects. We propose ``Ask The Neighbor'' (ATN3D), a LiDAR-Radar framework tailored for sparse-range conditions. ATN3D introduces (i) Density-aware early fusion with cross-modal gating that conditions fusion on per-voxel density/sparsity and Radar evidence, (ii) Occupancy-gated neighborhood aggregation with circular kernels to aggregate only from credible cells, (iii) Evidence-conditioned channel self-attention to adapt channel weights with weather/range, and (iv) a Range-aware loss that re-balances classification and localization by distance, aligning training with distance-stratified evaluation. On the VoD benchmark across clear and foggy conditions, ATN3D surpasses strong baselines: +3.55% mAP in clear weather and +8.41% mAP under simulated heavy fog; for >30m objects, gains are +3.33% (clear) and +2.09% (heavy fog). These results indicate earlier and more reliable long-range detections under sparse sensing in on-road traffic.

Gradient-Guided Reward Optimization for Inference-time Alignment

Hankun Lin, Ruqi Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09635v1 Announce Type: new Abstract: Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.

Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation

Mohammad Hossein Amini, David Dewar, Shiva Nejati, Mehrdad Sabetzadeh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09637v1 Announce Type: new Abstract: Personas are widely used in software engineering to support requirements elicitation, design, and validation, but their manual creation is costly, time-consuming, and hard to scale. Recent LLM-based approaches automate persona generation from textual data; however, they typically rely on single-shot generation and subjective evaluations, limiting practical reliability. We present PerGent, an industry-grade method for persona generation built around an iterative critique-refinement loop. Specifically, PerGent uses a generator and a critic LLM agent, coordinated by an orchestrator, to iteratively refine personas using external resources such as interviews, surveys, and job postings through a critique-refinement loop with a user-defined maximum number of rounds. We deploy and evaluate PerGent in an industrial setting at Kinaxis, comparing it with three baselines, including one-shot methods. In an expert in-situ evaluation, PerGent achieved the highest expert approval rate (96.9%), exceeding all baselines. We further compare PerGent-generated personas with best-practice personas manually created by domain experts prior to the adoption of LLMs. Compared to baselines, PerGent reproduces a larger proportion of expert content while also contributing substantial new content beyond the pre-LLM personas. We conclude with lessons learned from deploying and evaluating PerGent at Kinaxis.

Data-driven discovery of governing differential equations across physical systems

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09638v1 Announce Type: new Abstract: Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09639v1 Announce Type: new Abstract: The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems showremarkableabilitytogeneratecinematicnarratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

Physics-Aware Sparse Learning and Selective Online Adaptation for Euler-Lagrange Robot Dynamics

Rishabh Dev Yadav, Samaksh Ujjawal, Sihao Sun, Spandan Roy, Wei Pan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09640v1 Announce Type: new Abstract: Accurate dynamics models are essential for model-based robotic control, yet nominal Euler--Lagrange models often become inaccurate in the presence of payload variation, unmodeled coupling, friction, aerodynamic effects, and changing operating conditions. Most learning-based correction methods improve prediction accuracy by introducing a single additive residual, but do not preserve the internal mechanical structure of Euler--Lagrange systems. This leads to models that do not preserve symmetry, positive-definiteness, or the coupling between inertia and velocity-dependent terms, which can result in physically inconsistent predictions and reduced reliability when embedded in model-based controllers. We propose a structure-preserving residual learning framework that decomposes model mismatch into an inertia correction, the corresponding induced Coriolis term, and a generalized-force residual. The mechanical component is learned under physical constraints, while the disturbance-sensitive component is represented through a sparse history-dependent latent interaction model and adapted online using Bayesian linear regression. This separation preserves key mechanical structure while restricting adaptation to the part of the dynamics most affected by changing conditions. Experiments across multiple robotic platforms, including mobile, aerial, and manipulator systems, show that the proposed method improves dynamics prediction and trajectory tracking under coupled and time-varying dynamics. These results highlight the value of combining structured residual modeling, compact latent interaction selection, and selective online adaptation for real-world model-based control.

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09641v1 Announce Type: new Abstract: The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce \textbf{MAVIS}, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a \textbf{Structured Semantic Library}, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a \textbf{Logic-aware Debate} mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of ``controversial'' candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.

FMplex: Model Virtualization for Serving Extensible Foundation Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09643v1 Announce Type: new Abstract: Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement a FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09644v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.

Modeling Components and Connections in Cyber-Physical Systems

Kate Sanborn, Tanuj Kenchannavar, Vakul Nath, Jonathan Sprinkle — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09645v1 Announce Type: new Abstract: Text based configuration files for cyber-physical systems show the hierarchy of component modules well but often hide the details of connections and interfaces between modules. A model-based visual approach to these configuration files can better capture this information. The XML structure of Robot Operating System (ROS) launch files can be improved using a modeling approach. This paper presents ROSLaunchVisual, a model-integrated environment built on WebGME for designing, visualizing, and managing ROS launch files. The tool raises the level of abstraction by allowing developers to create and modify launch files using a graphical interface that represents nodes, publishers, subscribers, and arguments as interconnected components. The tool provides a dynamic system analysis that can then be used in the static development and analysis of new and existing launch files. ROSLaunchVisual incorporates features such as metamodel-driven validation, automatic import/export of launch files, and visual communication mapping. Plugins further enhance functionality by updating libraries, checking for semantic errors, and managing remaps. By making launch file creation more intuitive and less error-prone, ROSLaunchVisual improves development efficiency and system understanding, especially in collaborative or large-scale robotics projects.

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09646v1 Announce Type: new Abstract: We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

Luciano Duarte, Olga Ovcharenko, Sebastian Schelter — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09648v1 Announce Type: new Abstract: Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets combining tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130209 records and show that reliably detecting subtle domain-specific errors such as material anachronisms and temporal shifts remain an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.

A Unifying Framework for Concept-Based Representational Similarity

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09653v1 Announce Type: new Abstract: Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties -- instance-wise and distributional variants of translation and concept consistency -- and reveals precisely which of these guarantees existing methods provide. We further introduce \InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1\% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.

Beyond Accuracy: Community Perspectives on Machine Translation

Yujun Wang, Ehud Reiter, Shimei Pan, Steffen Eger, Wei Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09655v1 Announce Type: new Abstract: Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care about ethical concerns, trust, reliability, costs, and more. We argue that listening to various user communities is essential so that research efforts would be directed towards the problems that the communities care about. To this end, we present a large-scale analysis, for the first time, that investigates what four stakeholder communities (AI developers, professional translators, language learners, and language service providers) post about MT technology on social media. To do so, we construct a dataset of 79,286 posts and comments from Reddit, Facebook, Bluesky, and Mastodon from 2019 to 2025, and analyse where these communities disagree, and how and why. Overall, we find that communities often disagree, and even show strong conflicts due to polarised sentiments on topics such as translation quality, efficiency, and reliability. This is because these communities approach these topics differently: the AI community frames them as technical and computational problems, while non-AI (user) communities care more about quality nuances, time savings, user trust, and broader social issues.

Muon Learns More Robust and Transferable Features than Adam

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09658v1 Announce Type: new Abstract: Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.

End-to-End Context Compression at Scale

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09659v1 Announce Type: new Abstract: Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

A Differentiable Simulation of the Eye for Patient-Specific Strabismus Surgery Planning

Even Harsigny, Pablo Alvarez, Michel Duprez, St\'ephane Cotin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09661v1 Announce Type: new Abstract: Purpose: Up to 4% of adults will develop strabismus in their lifetime. The most common surgical intervention involves adjusting the length of one or more extraocular muscles to correct the angular deviation. This correction depends on surgical expertise and statistical reference tables, which often fail to yield optimal results for patients with atypical eye morphology. Our work proposes a physics-based modeling approach to personalized surgical planning, accounting for patient-specific eye anatomy. Methods: We built a physics-based simulator of the eye and its muscles, incorporating patient-specific geometry and Hill-type muscle biomechanics. We solve an optimization problem to find the surgical dosage that minimizes angular deviation. The model is implemented as a fully differentiable simulation, enabling efficient optimization. We validated the framework by comparing its predictions with standard surgical tables for emmetropic eyes before applying it to anatomically atypical virtual patients. Results: Our model's predictions for emmetropic eyes were first validated, demonstrating a strong fit with standard surgical tables. More importantly, for high-myopia models, the framework computed a clinically significant increase in the required surgical dosage compared to standard eyes. This computed recession difference is highly relevant as surgical plans are adjusted in 0.5 mm increments. Conclusion: Our results show that our model provides a calibrated surgical plan that, unlike standard tables, also accounts for pathologies involving atypical eye shapes. This patient-specific model represents a step toward personalized surgical planning, with the potential to improve dosage accuracy and surgical outcomes for atypical cases.

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Sai Adith Senthil Kumar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09662v1 Announce Type: new Abstract: Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors--some prompts improve while others worsen--rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

Dun Li, Jiatao Li, Hongzhi Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09663v1 Announce Type: new Abstract: Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.

In-Context Learning for Latent Space Bayesian Optimization

Tuan A. Vu, Harri L\"ahdesm\"aki, Julien Martinelli — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09664v1 Announce Type: new Abstract: Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.

Frequency-based Constrained Sampling for Interval Patterns

Djawad Bekkoucha, Abdelkader Ouali, Bruno Cr\'emilleux — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09666v1 Announce Type: new Abstract: Output space pattern sampling is a powerful alternative to exhaustive pattern mining for exploring large pattern spaces, as it enables users to focus on representative patterns drawn according to a chosen interestingness measure. In this paper, we address the problem of sampling interval patterns under user-defined syntactic constraints. We introduce CFips, a sampling approach that incorporates constraints directly into the sampling procedure. The approach relies on a multi-step sampling framework and supports several syntactic constraints by decomposing them into elementary predicates on interval bounds while preserving exact sampling guarantees. We formally prove that CFips samples interval patterns proportionally to their frequency within the constrained pattern space. The experimental results show that integrating constraints into the sampling procedure enables to complete mining tasks that would otherwise fail within a given time out.

Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

Seoungbin Bae, Dabeen Lee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09668v1 Announce Type: new Abstract: Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09669v1 Announce Type: new Abstract: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09670v1 Announce Type: new Abstract: Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions - such as consistent object scale, viewpoint, background, illumination, and centered placement - are violated. Those variations that occur render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09671v1 Announce Type: new Abstract: Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09672v1 Announce Type: new Abstract: Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Wesley Pegden — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09674v1 Announce Type: new Abstract: We present Trellis: an autoformalization system that leverages LLM agents in a deterministically constrained workflow to enforce incremental progress in Lean autoformalization tasks through iterative refinement of natural language proofs. Our approach is motivated by the common mathematician's notion of what it means to have a rigorous proof in the first place: namely, that it would be routine to elaborate any part of the proof in further detail. The result is a system which aims to achieve reliable autoformalization on a modest budget and with generalist agents, with specialization to autoformalization coming not from any task-specific agent training but instead from a meaning-of-rigor inspired workflow enforced by process semantics. We link to an end-to-end Lean formalization of a recent Ramsey theory breakthrough produced by the process.

Boundary-Layer-Induced Failure of Standard Physics-Informed Neural Networks: A Legendre Wavelet Collocation Benchmark for Singularly Perturbed Transport Problems

Suvendu Nayak, Arun Kumar Gupta — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09676v1 Announce Type: new Abstract: Boundary layers provide a demanding test for numerical solvers because the solution may remain almost constant over most of the domain while changing rapidly in a narrow region near the boundary. This paper studies a singularly perturbed one-dimensional transport boundary-value problem with increasing Peclet number $(\mathrm{Pe})$. A local Legendre wavelet collocation method (LWM) is compared with a standard soft-boundary physics-informed neural network (PINN) for this benchmark. The wavelet approximation uses locally supported Legendre polynomial basis functions and converts the problem into a square algebraic collocation system with residual, boundary, and interface-continuity equations. Numerical experiments are performed for $\mathrm{Pe}=1,10,100,$ and $1000$. The LWM captures all four cases, with the largest error remaining below $5\times 10^{-3}$. The standard soft-boundary PINN performs well for the mild cases but fails to resolve the sharp boundary layer for the larger Peclet numbers. The results show that local wavelet collocation is more reliable than the standard soft-boundary PINN for this benchmark, while dense near-boundary evaluation helps reveal errors that may be missed on coarse grids.

SoccerNet 2026 Player-Centric Ball-Action Spotting:Retraining and Post-Processing Extensions to the FOOTPASS Baselines

Parthsarthi Rawat — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09679v1 Announce Type: new Abstract: We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires predicting who performs which action and when, across eight classes in broadcast soccer. Building on the three FOOTPASS baselines [1] (TAAD, TAAD+GNN, and TAAD+DST), we contribute four extensions: (1) gradient check pointing to enable full-backbone fine-tuning on a single GPU; (2) fusion of GNN logits into the DST encoder, combining graph-based tactical context with per-player visual features; (3) square-root frequency class weighting to address the 213:1 pass-to-tackle imbalance in the training data; and (4) a post processing pipeline comprising per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble. Our system achieves 0.548 Macro F1 on the test set and 0.446 on the challenge set (server evaluation).

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09681v1 Announce Type: new Abstract: Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

Jaber Jaber, Osama Jaber — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09682v1 Announce Type: new Abstract: AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel

An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats

Dmitrii Vasilev — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09686v1 Announce Type: new Abstract: Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.

A space-time sparse-grid method for the wave equation

Matteo Ferrari, Andrea Moiola, Chiara Perinati, Ilaria Perugia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09688v1 Announce Type: new Abstract: We develop a fast space-time numerical scheme for approximating solutions to the linear wave equation. The approach is based on the sparse-grid combination technique applied to a coercive space-time discretization. Designed for tensor-product space-time discretizations, the method enables efficient parallelization of the resulting solver. We provide a rigorous theoretical analysis establishing convergence rates and computational complexity estimates. Numerical experiments validate the theoretical estimates and demonstrate the efficiency of the proposed method.

Low-Rank Acceleration of the Operator Fourier Transform

Jack Kelley — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09689v1 Announce Type: new Abstract: We develop a numerical algorithm for the efficient solution or approximation of solutions to the Helmholtz equation on a structured grid in two dimensions. We make use of the Operator Fourier Transform (OFT) and a low-rank cross approximation scheme (Cross-DEIM) to decompose the problem into an integral over a pseudo-time of solutions to the Schr\"odinger equation. The OFT is a framework for solving operator equations like fractional Laplacian equations or the Helmholtz equation, when the latter is written as a product of two paraxial operators. The main computational cost in the OFT is the solution to the Schr\"odinger equation, especially when the dimension or mesh resolution is high. In this work, we alleviate this cost by utilizing a low-rank method. Such methods aim to beat the curse of dimensionality when low-rank structures are present in the solution. We show that the combination of these two approaches can have large cost reductions for certain classes of problems.

Observability for Delegated Execution in Agentic AI Systems

Abhinav Mishra, Kumar Sharad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09692v1 Announce Type: new Abstract: Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple incompatible delegation assignments. This gap is especially acute in LLM-based agentic systems, where agents dynamically select tools, vary execution sequences across runs for the same instruction, and spawn cooperating sub-agents. These dynamics fragment and interleave traces, making delegation-scoped reconstruction from causal structure alone structurally underdetermined. Although individual actions are authorized and logged, existing audit, tracing, and security schemas lack the semantics to reconstruct what actions occurred under a given delegation across heterogeneous systems. We focus on delegation-scoped attribution and access/share footprint reconstruction, not intent inference or reasoning reconstruction. We present an agent-aware observability substrate consisting of a lightweight gateway and a common information model that binds delegation context at execution time. This enables reliable cross-tool delegation-scoped reconstruction and direct forensic queries without heuristic time-window correlation.

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09697v1 Announce Type: new Abstract: Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.

Optimal Feedback Communication with Information Maximization and Distortion Minimization

Aolin Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09698v1 Announce Type: new Abstract: We study the problem of optimally sending a real-valued source through multiple uses of a channel with feedback. First, we state a set of conditions that are sufficient for an encoder to achieve maximal mutual information between the source and all the channel outputs. This set of conditions are also necessary when the channel is input-identifiable, a condition widely satisfied by common channel models. More notably, we further study the information maximization-distortion minimization problem, where the mutual information between the source and all channel outputs still needs to be maximized, while at each step, the MMSE of estimating the source from the channel outputs so far also needs to be minimized. We derive a solution to this problem for discrete channels with certain symmetries, e.g. $k$-ary symmetric or $k$-ary erasure channels. We show that for such channels the famous posterior matching scheme, while not necessary for information maximization alone, is sufficient and essentially necessary for achieving both information maximization and distortion minimization. This work also provides a new perspective of regularizing distortion-minimizing feedback communication through information maximization, which enables us to find the optimal solution that otherwise would be intractable.

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag, Dinesh Singh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09699v1 Announce Type: new Abstract: The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated remarkable performance in the face generation task. However, they fail to effectively capture cross-modality semantic information in craniofacial reconstruction when translating from the skull (x-ray) to the face (optical) domain, due to a mismatch in the alignment of structural identity across modalities. To address this issue, we propose Cranio-Diff, a diffusion-based framework for cross-domain cranio-facial reconstruction from 2D X-ray skull images. The proposed approach integrates skull-conditioned structural guidance through ControlNet with biometric text conditioning to generate a face which is more semantically and structurally aligned with the given skull. The proposed Cranio-diff method is evaluated on skull-face dataset obtained from X-ray scans of 120 subjects in lateral and frontal views. To enable controlled evaluation, each face image is synthesised across three age groups (25, 45, 65) and three BMI variations of -10%, baseline and +10%, yielding 4320 paired samples. To the best of our knowledge, this is the only X-ray-face dataset with this magnitude. Extensive experiments showed that the proposed method outperforms recent existing approaches in both generated image quality and retrieval task. Finally, to evaluate the performance of our proposed method, we have evaluated the quality of the generated image using FID, IS, SSIM, LPIPS, PSNR and ArcFace score. Additionally, retrieval performance is evaluated using recall@k, mAP@k and MRR@k. Obtained experimental results demonstrate that the proposed method can be used as an alternate tool in providing aid in forensic investigations.

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09700v1 Announce Type: new Abstract: Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86\% human recognition while maintaining detection rates below 1\% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09701v1 Announce Type: new Abstract: AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.

When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

Wenjie Xi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09705v1 Announce Type: new Abstract: Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score. Through Tweedie's formula, far-away perturbations can influence local score components via posterior covariance, meaning a local model succeeds only if its receptive field covers the smoothed score's response range. We formalize this mechanism, proving a size-uniform comparison theorem for local marginals under reverse diffusion. We also introduce Finite-Depth Local Flow (FDLF), a white-box diagnostic benchmark with exact scores, densities, and controllable response ranges. Empirically, we validate the interplay between spatial mixing, smoothed-score quasi-locality, and model receptive fields. Under spatial mixing, the smoothed score remains quasi-local relative to the receptive field, enabling stable extrapolation. Conversely, when spatial mixing weakens, the score's locality rapidly degrades, causing size transfer to fail.

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09707v1 Announce Type: new Abstract: As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09709v1 Announce Type: new Abstract: Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Mohammad Beigi, Ming Jin, Lifu Huang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09711v1 Announce Type: new Abstract: Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy--gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy--gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

Zhu Li, Shekhar Nayak, Matt Coler — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09717v1 Announce Type: new Abstract: Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09718v1 Announce Type: new Abstract: Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

Alejandro Gonzalez-Garcia, Dries Dirckx, Jan Swevers, Wilm Decr\'e — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09719v1 Announce Type: new Abstract: Autonomous mobile robots operating in tight environments require motion planning frameworks that account for the physical footprint of the robot. Simplifying the geometry to a point or a circle is conservative and discards information needed to successfully and safely traverse narrow passages. This work proposes a safe local motion planning and control method that guarantees that a polytopic robot footprint stays inside a continuously updated convex free-space region. The containment condition is formulated as a set of discrete-time control barrier function constraints within a model predictive controller. The number of safety constraints depends on the complexity of the local free-space geometry and the robot shape, instead of the number of obstacles. The proposed free-space formulation does not need any obstacle detection or segmentation. A comparative analysis against a polytope-based obstacle avoidance formulation confirms favorable scaling up to a reduction of 91$\times$ in computation time as the number of obstacles increases. The approach is validated in simulation with an autonomous surface vehicle and on hardware with a non-holonomic mobile robot, using both occupancy grids and LiDAR sensing. The experiments demonstrate safe real-time motion planning and control at 10~Hz on an onboard embedded computer, including reactive avoidance of dynamic obstacles.

Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Hudson de Martim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09724v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures, including fabricated citations submitted to courts and anachronistic legal content presented as current, continue to appear across jurisdictions. We argue that these failures are not residual confabulations to be eliminated by scaling language models, but symptoms of an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. We develop the argument in three moves. First, we articulate the ontological commitment of legal knowledge as a triad of properties derivable from classical legal theory: hierarchical and mereological structure, diachronic dynamism under operational closure, and causal traceability of institutional provenance grounded in the duty of justification. Second, we identify three corresponding pathologies of retrieval (mereological blindness, diachronic blindness, and causal opacity), each developed with an operational definition, a failure mechanism, a canonical example, and detection criteria for diagnostic use. Third, we review the state of the art through this lens, showing that existing approaches address these requirements unevenly and do not yet compose into a paradigm that treats them as co-constitutive. From this analysis we derive four architectural commitments that characterize the deterministic-by-design direction for legal retrieval: ontological primacy, event reification, bitemporal correctness, and deterministic interaction protocols. The framework concerns quaestio juris (which norms apply and in what state) rather than the downstream tasks that act on identified norms, and addresses legislative and constitutional retrieval primarily, with interpretive time as an explicit extension.

Disentanglement with Holographic Reduced Representations

Jhonny J. Velasquez Olivera, Christo K. Thomas, Walid Saad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09725v1 Announce Type: new Abstract: Disentanglement, the separation of factors of variation in data using neural networks, remains a long-standing challenge in machine learning. Prior work has addressed this problem with variational autoencoders and generative adversarial networks that incorporate ideas from variational inference and information-theoretic constraints. In contrast to methods that rely on continuous representations, we propose a design that treats disentangled representations as symbolic structures, motivated by the compositional relationships among the concepts that make up samples from a distribution. However, learning discrete symbolic structures with neural networks while maintaining differentiability is difficult and often requires complex architectures. To address this, we introduce an unsupervised learning algorithm that uses holographic reduced representations (HRR) for neural disentanglement. We show that the HRR unbinding operation provides an inductive bias for separating factors and yields competitive results against baselines, as measured by latent traversals and disentanglement metrics. We complement these empirical findings with an information-theoretic analysis of the HRR unbinding channel. We prove that unbinding induces approximately independent symbol-value pairs and derive a per-slot capacity bound that quantifies how many distinct symbolic concepts can be reliably encoded, giving a quantitative account of the inductive bias toward disentanglement. The resulting representations differ from standard autoencoder-based models, in that their latent units are vectors that are summed together, rather than scalar dimensions of a low-dimensional latent vector. We show that this HRR representation is more robust to noise than other disentangled representations and maintains reconstruction quality across a range of SNRs.

Bayesian Probing on Graphs

Anupam Gupta, Benjamin Moseley, Rudy Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09729v1 Announce Type: new Abstract: We introduce a stochastic probing problem with correlated items. In our model, which we call Bayesian Probing, the correlations are modeled by an underlying graph $G$. Each vertex is independently active with a known probability. Each item corresponds to an edge in the graph. Probing an edge has some cost, gives some reward if both endpoints are active, and also reveals the state of its endpoints. Hence a probe induces a Bayesian update on the remaining edges. The goal is to adaptively probe items/edges subject to a knapsack constraint to maximize the expected total reward obtained from the probed edges. Bayesian Probing generalizes stochastic knapsack and stochastic probing by allowing correlations between items. Moreover, it gives a tractable model for the Bayesian Active Search problem, a popular problem considered in the machine learning community. In Bayesian Active Search, the goal is to find items in a particular class by adaptively probing at most, say $k$, items. Given a prior distribution over items, we want to compute a Bayesian policy to maximize the number of such items found. For this general problem with arbitrary priors, there are strong lower bounds on efficiently computing good policies. In this paper, we design efficient approximation algorithms for Bayesian Probing. These results give the first efficient approximation algorithms for Bayesian Active Search, for a class of practically-relevant prior distributions.

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09730v1 Announce Type: new Abstract: Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

Tight Sample Complexity of Transformers

Chenxiao Yang, Nathan Srebro, Zhiyuan Li — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09731v1 Announce Type: new Abstract: We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

Wendy K. Tam — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09735v1 Announce Type: new Abstract: The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with ``human values.'' Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural so that the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09738v1 Announce Type: new Abstract: Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.

ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09740v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models demonstrate strong perfor-1 mance on language-conditioned robotic manipulation within their training dis-2 tribution, yet their generalization capabilities remain fundamentally limited. They3 lack the robustness required to handle perturbations, frequently failing when con-4 fronted with lighting changes, altered camera viewpoints, or small initial-state5 variations. We propose PROBEACT, a training-free runtime intervention frame-6 work that detects and recovers from grasping and placement failures in pre-7 trained VLA policies without modifying their weights or requiring additional8 demonstrations. PROBEACT combines three components: (i) a lightweight multi-9 target hidden-state probe that predicts the 3D positions of task-relevant objects10 from intermediate VLA features, with Hungarian-matched identity tracking for11 multi-object scenes; (ii) an object-agnostic kinematic state machine that detects12 grasp, transport, and placement failures using only gripper-internal signals and13 end-effector kinematics; and (iii) a hierarchical Control Barrier Function (CBF)14 filter that encodes repeated-failure locations as soft safe-set constraints, mini-15 mally correcting VLA actions while preserving baseline behavior. As a plug-and-16 play, training-free intervention loop, PROBEACT is orthogonal to existing train-17 ing pipelines. Evaluated on the LIBERO-plus benchmark, our framework acts as18 a universal safety net, improving the success rate of the OpenVLA-OFT model19 from 69.6% to 74.1%, while demonstrating broad applicability to both base and20 fine-tuned VLA policies.

bbsolver: A Unified Error-Bounded Spatiotemporal Optimization Solver for Key Timing and Topology-Consistent Vector Paths

Ilya Gusinski (IVG Design) — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09741v1 Announce Type: new Abstract: Dense sampling records what an animation system actually evaluated, but it produces a poor final representation: every sampled frame can become a key, edit handles become noisy, and animated vector paths remain hard to adjust. Existing reducers usually treat the two axes separately: animation-curve reducers reduce key timing, while curve and path simplifiers reduce geometry. When applied independently to animated paths, these methods can break point identity across frames, change vertex structure over time, or provide no single error budget that covers both timing and shape. bbsolver frames the task as tolerance-bounded spatiotemporal reduction. A host application, such as After Effects or Blender, samples temporal and spatial animation into a documented JSON bundle; the standalone solver chooses sparse keys, interpolation metadata, and path representation; and the output is accepted only if replayed samples remain within the requested worst-case error. The same solver core can be used by any application that can export samples and write back returned keys or paths. In After Effects validation, solved keys written back into AE and re-sampled from AE playback reduce a DUIK humanoid walk cycle from 12,684 samples to 540 keys at epsilon=1, a 23.5x reduction, and an ant rig from 11,956 samples to 653 keys, an 18.3x reduction, with maximum errors below 1 px and 1 degree. A Blender-sampled FBX mocap retarget reaches 214 keys from 13,455 samples at epsilon=3; baselines tuned to matched measured accuracy require 4.5x to 27.5x more scalar key entries. For vector paths, bbsolver supports reduction when vertex identity/order is constant over time and diagnostics for variable-vertex-count streams, including a 6.7x After Effects-compatible procedural-path compression and exact transition-timing recovery in a diagnostic case.

Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

Claudio Nordio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09744v1 Announce Type: new Abstract: We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09746v1 Announce Type: new Abstract: With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of lp-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST) exploiting realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints - where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer. Thus, we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09748v1 Announce Type: new Abstract: Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09749v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method substantially outperforms it by 43%, on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.

Collaborative Human-Agent Protocol (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09751v1 Announce Type: new Abstract: Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap

Human-Centred Risk Mitigation for AI-Mediated Information Manipulation: A SOCMINT Framework Based on Information Manipulation Sets

Antonio Scala — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09754v1 Announce Type: new Abstract: AI-mediated information manipulation increasingly takes the form of social cyber attacks that target trust, attention, credibility, reputation, and decision-making rather than only technical infrastructures or isolated false contents. Existing defensive approaches often oscillate between incident-level analysis, which fragments campaigns into weak signals, and attribution-first analysis, which may delay mitigation until responsibility is established. This paper proposes a SOCMINT framework based on Information Manipulation Sets (IMS) as an intermediate operational unit between individual incidents and strategic attribution. Building on the VIGINUM/EEAS use of IMS in counter-FIMI analysis, the framework treats manipulation as a coherent process involving narratives, accounts, infrastructures, temporal patterns, cross-platform migration, synthetic amplification, and cognitive targeting. The proposed pipeline moves from signal detection and diagnostic triage to IMS hypothesis construction, confidence/severity assessment, mitigation selection, and iterative update. A compact scenario illustrates how IMS-based analysis captures what content-level and attribution-first approaches miss. The paper also proposes a tabletop evaluation protocol to assess decision quality, confidence calibration, and mitigation proportionality. The main implication is that human-centred risk mitigation requires not only better detection, but also structured reasoning under uncertainty, auditable decision-making, and safeguards against over-securitising legitimate dissent.

Perturbative Contrastive Physical Learning

Kyungeun Kim, Amanuel Anteneh, Israel Klich, Olivier Pfister, J. M. Schwarz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09756v1 Announce Type: new Abstract: Responses to perturbations are key to understanding physical systems. The ability to contrast such responses by comparing how a system reacts under slightly different conditions provides a mechanism for learning. Here, we introduce Perturbative Contrastive Physical Learning (PCPL), a general framework in which learning emerges from measurable contrasts between physical states produced by controlled changes to inputs, boundary conditions, parameters, or interpreter functions. PCPL unifies and extends prior approaches: Equilibrium Propagation is rooted in contrasts between free and nudged equilibria in energy-based systems, while Frequency Propagation corresponds to contrasts extracted from sinusoidally driven, frequency-demodulated responses. We show that contrast-driven updates can reflect either local sensitivities or global inverse-problem structure, yet do not require centralized gradient computation. Instead, effective learning geometry emerges implicitly from the system's own physical response, allowing learning behavior to arise without an external processor or explicit backpropagation. We demonstrate PCPL in two platforms: (i) spring networks that update bond stiffness using measured displacements and forces, and (ii) continuous-variable photonic circuits trained via x quadrature measurements and finite-difference estimates of the Jacobian. Both platforms successfully learn classification tasks. We further show that a continuous-variable photonic circuit can be trained to implement analog multiplication, illustrating a step toward more autonomous physical learning systems.

Difference-Aware Retrieval Policies for Imitation Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09758v1 Announce Type: new Abstract: Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.

Preserving Plasticity in Continual Learning via Dynamical Isometry

Andries Rosseau, Robert M\"uller, Ann Now\'e — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09762v1 Announce Type: new Abstract: Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09764v1 Announce Type: new Abstract: A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52\% overall but only 37\% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09767v1 Announce Type: new Abstract: Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.

SemDINO: A DINOv3-Driven Network for Cross-Temporal Semantic Alignment in Change Detection

Xinyu Tong, Meihua Zhou, Jinxiao Sun, Yingjie Tang, Lei Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09772v1 Announce Type: new Abstract: Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify semantic categories before and after transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. To address these issues, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone and frozen DINOv3 features via gated pyramid fusion, enabling rich multi-scale semantic representation. Then, a multi-scale temporal bidirectional transformer interaction (M-TBTT) module is proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-variations, we introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules collaboratively. Finally, a multi-branch CD prediction head is designed to jointly output binary change mask, bi-temporal semantic maps, and edge constraint. Extensive experiments on public remote sensing CD datasets demonstrate that SemDINO achieves superior performance and generalization ability against state-of-the-art methods, especially in complex scenarios with interference factors.

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09774v1 Announce Type: new Abstract: Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.

AetheRock: An Arm-Worn Robot Teaching System for Force-Guided Vision-Tactile Learning

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09777v1 Announce Type: new Abstract: Force and tactile sensing are indispensable in contact-rich manipulation. However, force-aware robot learning faces critical challenges due to the incompatible assembly of tactile and force sensors in handheld or wearable devices. To address these limitations, we first introduce AetheRock for gripper-force, vision, and tactile data collection, which is an arm-worn device featuring a modular and easily manufactured visuo-tactile sensor, GelSlim-MiniFab, at the fingertip, a resistive pressure sensor at the human finger contact region, a customized PCB module, and a wearable kit for comfortable and robust collection. Building on this, we propose ForceVT, a representation learning framework that uses force and vision to guide fidelity-agnostic tactile learning, enabling robust inference in any tactile situation. Real-world experiments show that AetheRock achieves qualified data efficiency and that ForceVT effectively alleviates inefficiencies when visuo-tactile sensors exhibit manufacturing and utilization inconsistencies. Overall, our work mitigates the limitations of gripper-force vision-tactile robot learning through innovative hardware design and algorithms.

Quality-Diversity Search in Sound Generation: Investigating Innovation Engines for Audio Exploration

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09780v1 Announce Type: new Abstract: This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.

Cohort-based Semantic Labeling: AI-Enabled Recovery of Visualization Semantics from Deployed SVGs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09782v1 Announce Type: new Abstract: Many web-based visualizations are deployed as Scalable Vector Graphics (SVG), a format that faithfully preserves visual appearance but typically omits the higher-level semantic structure needed for machine interpretation. Once rendered and published, information about a visualization's components, roles, and encodings is no longer explicitly available, limiting downstream operations such as querying, accessibility augmentation, explanation, personalization, and transformation. To address this gap, we introduce CSL, an AI-enabled, multi-stage pipeline for automatically recovering visualization semantics from deployed SVGs through two complementary mechanisms: (1) cohort-based decomposition, which organizes heterogeneous SVG primitives into structurally coherent subsets that reduce the semantic assignment space, and (2) hybrid semantic grounding, which combines model-based inference with deterministic structural validation and propagation to make labeling both context-sensitive and structurally anchored. CSL produces Semantic SVG (SSVG), a representation in which SVG elements are annotated with graphical mark type, visualization role, and data role. We implemented CSL as an end-to-end prototype and evaluated it on 102 SVG visualizations, achieving global macro-averaged accuracies of 0.822 for mark type, 0.853 for visualization role, and 0.860 for data-role recovery. An ablation against a non-cohort whole-chart baseline showed that cohorting significantly improves accuracy (paired t-test: t > 20, p < 0.001; Cohen's d > 2.0), and repeated labeling of a randomly selected SVG over 100 runs yielded mean agreement above 91.9% across all three attributes. These results provide strong evidence that CSL can transform deployed SVGs into machine-usable semantic representations, enabling more accessible, adaptive, and user-steerable visualization systems.

Zero Touch Predictive Orchestration: Automating Time-Series Models for the Cloud-Edge Continuum

Abd Elghani Meliani, Arora Sagar, Adlen Ksentini, Raymond Knopp — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09787v1 Announce Type: new Abstract: The Cloud-Edge Continuum (CEC) enables latency-critical applications by distributing resources to the far edge, but its extreme volatility makes proactive Zero Touch Management via time-series forecasting essential. However, orchestrators face a severe "cold start" problem: newly discovered nodes lack the historical data required to train localized predictive models, while generalized models fail to capture unique hardware and microservice behaviors. To solve this, we propose a fully automated time-series prediction architecture driven by a novel data-mixing methodology. At the infrastructure level, we introduce a lightweight, technology-agnostic Resource Exposer (RE) that dynamically discovers nodes and continuously collects customizable telemetry (e.g., compute, network, energy). To overcome the sparsity of these initial local samples, our framework automatically merges them with TimeTrack, our publicly available, high-resolution dataset collected at 45-second intervals. This synergizes TimeTrack's foundational, high-frequency temporal patterns with the precise calibration of the local node data. Processed through a Neural Architecture Search (NAS) engine, the system automatically generates highly accurate baseline models. Experimental results demonstrate that merging the target data with TimeTrack effectively mitigates the cold start challenge. This integration significantly improves forecasting accuracy measured in Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) and accelerates convergence compared to training on the sparse local samples alone, training solely on generic datasets, or mixing the target data with standard alternative datasets, establishing a robust foundation for continuous MLOps deployment.

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09788v1 Announce Type: new Abstract: Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving $\textrm{GriTS}_\textrm{Con}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.

Principled Uncertainty in Clinical AI: End-to-End Bayesian Modelling and Algorithmic Equity Auditing Across Multimodal Patient Data

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09789v1 Announce Type: new Abstract: Clinical artificial intelligence (AI) systems routinely produce predictions without principled quantification of uncertainty, limiting their trustworthiness in high-stakes medical environments. This paper presents an integrated research programme addressing two interconnected problems: (1) the development of a fully end-to-end Bayesian uncertainty modelling framework for multimodal clinical data, and (2) the application of calibrated uncertainty estimates as a formal measure of algorithmic equity across patient subgroups. We construct a probabilistic deep learning architecture comprising modality-specific variational encoders, a precision-weighted late fusion mechanism, and a decomposed uncertainty output head that separates aleatoric from epistemic uncertainty. The system is trained with a composite Bayesian loss incorporating binary cross-entropy, Kullback-Leibler divergence regularisation, and an uncertainty calibration penalty. We evaluate model calibration using Expected Calibration Error (ECE = 0.096) and conduct a subgroup equity audit across facility type, socioeconomic status, age group, and biological sex on a dataset of 1,000 simulated patients. Results demonstrate that epistemic uncertainty systematically identifies underserved populations: primary/rural facility patients show a 15.3% uncertainty equity gap (p < 0.001, effect size = 0.698), low socioeconomic status patients exhibit a 6.8% gap (p < 0.001), and elderly patients show a 3.9% gap (p < 0.001), whilst no significant sex-based disparity is detected. These findings establish that calibrated uncertainty is not merely a technical property of probabilistic models but constitutes an actionable equity signal with direct clinical relevance.

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Solja\v{c}i\'c — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09792v1 Announce Type: new Abstract: End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).

Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction

Ewa Miazga, Jorge Condor, Piotr Didyk — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09794v1 Announce Type: new Abstract: View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on SH. However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function that enables efficient modeling and learning of high-frequency appearance effects while maintaining compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.

SynManDex: Synthesizing Human-like Dexterous Grasps from Synthetic Human Pre-Grasps

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09798v1 Announce Type: new Abstract: Human hand-object interactions encode functional intent, but direct transfer to robotic hands often fails under morphology, contact, and reachability constraints. We present SynManDex, a synthetic pipeline that uses generated human pre-grasps as affordance-aware proposals and resolves the final contacts with robot-native optimization. SynManDex samples object-conditioned digital human pre-grasps, retargets them to dexterous robotic hand poses, optimizes force-closure contacts on the target embodiment, and admits trajectories that pass checks from each step. The resulting keyframes support both grasp-and-lift demonstrations and various prehensile manipulation tasks such as tea pouring, photo taking, and flute playing, designed via VLM agents. As a result, SynManDex combines high grasp quality (86.4\% grasp stability) with 4.67/5 human-likeness (93.4\%). It achieves 80.7\% successes in simulation and 25/30 (83.3\%) real-robot successes when applied to a 36-DOF bimanual dexterous robotic platform.

FASE: Fast Adaptive Semantic Entropy for Code Quality

Shizhe Lin, Ladan Tahvildari — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09800v1 Announce Type: new Abstract: Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.

Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09802v1 Announce Type: new Abstract: We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of users, each having its personalized preference vector, and in the presence of context distributions that are drifting over time. Under practitioner-friendly assumptions, we reduce this setting to linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case when the learner must ensure the mean reward of each decision must exceed that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired from the linear version of the MED strategy, and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2(\log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignores the drift and preference structure.

Echo-Memory: A Controlled Study of Memory in Action World Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09803v1 Announce Type: new Abstract: We present \textbf{Echo-Memory}, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: \emph{capacity}, \emph{compression}, \emph{read-out}, and \emph{recurrence}. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

Topological Neural Operators

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09806v1 Announce Type: new Abstract: We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09809v1 Announce Type: new Abstract: AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present \EvalCards{}, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies \EvalCards{} across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.

AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09811v1 Announce Type: new Abstract: World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09813v1 Announce Type: new Abstract: Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposesiMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09816v1 Announce Type: new Abstract: Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.

Rethinking the Divergence Regularization in LLM RL

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09821v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.

Causally Evaluating the Learnability of Formal Language Tasks

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09822v1 Announce Type: new Abstract: Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.

TSseek: Regular Expression-Based Similarity Search for Distributed Time Series Datasets

Xiaoshuai Li, Khalid Alnuaim, Mohamed Y. Eltabakh, Elke A. Rundensteiner — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09824v1 Announce Type: new Abstract: Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series. To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs. In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series). Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines all sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. Also, for subsequence workloads, it achieves significant speedups over state-of-the-art subsequence matching engines.

An Agency-Transferring Model-Free Policy Enhancement Technique

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09825v1 Announce Type: new Abstract: Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09826v1 Announce Type: new Abstract: Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09827v1 Announce Type: new Abstract: Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web

Latent Spatial Memory for Video World Models

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09828v1 Announce Type: new Abstract: Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce \emph{latent spatial memory} for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to \textbf{10.57}$\times$ faster end-to-end video generation and \textbf{55}$\times$ reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2501.15505v5 Announce Type: cross Abstract: Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.

XAInomaly: Explainable and Interpretable Deep Contractive Autoencoder for O-RAN Traffic Anomaly Detection

Osman Tugay Basaran, Falko Dressler — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2502.09194v1 Announce Type: cross Abstract: Generative Artificial Intelligence (AI) techniques have become integral part in advancing next generation wireless communication systems by enabling sophisticated data modeling and feature extraction for enhanced network performance. In the realm of open radio access networks (O-RAN), characterized by their disaggregated architecture and heterogeneous components from multiple vendors, the deployment of generative models offers significant advantages for network management such as traffic analysis, traffic forecasting and anomaly detection. However, the complex and dynamic nature of O-RAN introduces challenges that necessitate not only accurate detection mechanisms but also reduced complexity, scalability, and most importantly interpretability to facilitate effective network management. In this study, we introduce the XAInomaly framework, an explainable and interpretable Semi-supervised (SS) Deep Contractive Autoencoder (DeepCAE) design for anomaly detection in O-RAN. Our approach leverages the generative modeling capabilities of our SS-DeepCAE model to learn compressed, robust representations of normal network behavior, which captures essential features, enabling the identification of deviations indicative of anomalies. To address the black-box nature of deep learning models, we propose reactive Explainable AI (XAI) technique called fastshap-C.

Physics-Embedded Neural Networks for sEMG-based Continuous Motion Estimation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2506.22459v1 Announce Type: cross Abstract: Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological consistency. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while achieving accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and $R^2$ metrics.

XAI-on-RAN: Explainable, AI-native, and GPU-Accelerated RAN Towards 6G

Osman Tugay Basaran, Falko Dressler — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2511.17514v1 Announce Type: cross Abstract: Artificial intelligence (AI)-native radio access networks (RANs) will serve vertical industries with stringent requirements: smart grids, autonomous vehicles, remote healthcare, industrial automation, etc. To achieve these requirements, modern 5G/6G design increasingly leverage AI for network optimization, but the opacity of AI decisions poses risks in mission-critical domains. These use cases are often delivered via non-public networks (NPNs) or dedicated network slices, where reliability and safety are vital. In this paper, we motivate the need for transparent and trustworthy AI in high-stakes communications (e.g., healthcare, industrial automation, and robotics) by drawing on 3rd generation partnership project (3GPP)'s vision for non-public networks. We design a mathematical framework to model the trade-offs between transparency (explanation fidelity and fairness), latency, and graphics processing unit (GPU) utilization in deploying explainable AI (XAI) models. Empirical evaluations demonstrate that our proposed hybrid XAI model xAI-Native, consistently surpasses conventional baseline models in performance.

BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks

Osman Tugay Basaran, Martin Maier, Falko Dressler — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2602.14033v1 Announce Type: cross Abstract: Future sixth-generation (6G) mobile networks will demand artificial intelligence (AI) agents that are not only autonomous and efficient, but also capable of real-time adaptation in dynamic environments and transparent in their decisionmaking. However, prevailing agentic AI approaches in networking, exhibit significant shortcomings in this regard. Conventional deep reinforcement learning (DRL)-based agents lack explainability and often suffer from brittle adaptation, including catastrophic forgetting of past knowledge under non-stationary conditions. In this paper, we propose an alternative solution for these challenges: Bayesian reasoning via Active Inference (BRAIN) agent. BRAIN harnesses a deep generative model of the network environment and minimizes variational free energy to unify perception and action in a single closed-loop paradigm. We implement BRAIN as O-RAN eXtended application (xApp) on GPU-accelerated testbed and demonstrate its advantages over standard DRL baselines. In our experiments, BRAIN exhibits (i) robust causal reasoning for dynamic radio resource allocation, maintaining slice-specific quality of service (QoS) targets (throughput, latency, reliability) under varying traffic loads, (ii) superior adaptability with up to 28.3% higher robustness to sudden traffic shifts versus benchmarks (achieved without any retraining), and (iii) real-time interpretability of its decisions through human-interpretable belief state diagnostics.

A Proof in Coq that Core Logic is not Paraconsistent

Joseph Vidal-Rosset — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.05953v1 Announce Type: cross Abstract: First, this paper proves that Tennant's two claims (i.e. that his own logical system is paraconsistent, and that it overlaps minimal logic) are both false. Second, this proof is certified with Coq.

When the Scaffold Stays On: AI, Practice Style, and Screening in Elite Skill Formation

Song Yao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.06253v1 Announce Type: cross Abstract: Generative AI raises short-term productivity by completing tasks that learners would otherwise practice on their own. Whether this substitution erodes frontier skill, the skill behind top-tail non-AI-aided performance, is an open question of rising stakes. The sharper question is whether selection mechanisms can screen apart two coexisting types: substitute-users, who use AI in place of deliberate practice, and complement-users, who use it to accelerate skill development. In elite programming, the International Collegiate Programming Contest (ICPC) and the International Olympiad in Informatics (IOI) prohibit AI under proctoring and admit entrants through qualification rounds, whereas online Codeforces (CF) contests are unproctored and open to all. From CF histories we build an AI-prompt signature (more first-attempt acceptances, fewer attempts and retries) consistent with AI-assisted practice. Three patterns triangulate institutional screening. First, CF practice shifted toward this signature across cohorts over two AI rollouts. Second, in open CF contests a stronger signature predicts smaller rating gains for users with no ICPC/IOI affiliation, but not for those who qualified for the AI-prohibited contests. Third, inside the AI-prohibited ICPC environment, a shift toward AI-style practice predicts higher non-AI-aided scores for AI-era entrants. The same practice input carries opposite signs depending on whether the environment screens for it. The contrast points to two levers: how AI is integrated into training, since within the screened pool AI-style practice coincides with stronger non-AI-aided performance; and the design of AI-prohibited evaluation gates as a type-separating institution. Both extend beyond programming to credentialing systems (medical and legal boards, professional certification) that certify skill in a workforce increasingly shaped by AI.

Blockchain Infrastructure for Intelligent Cyber--Physical--Social Systems:Post-Quantum Security, Interoperability, and Trustworthy Data Economies in the Era of Embodied AI

Song Guo, Huawei Huang, Dongping Liu, Aoyu Zhang, Luyao Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.06895v1 Announce Type: cross Abstract: The deployment of embodied artificial intelligence via world-model-based robotics presents a transformative opportunity for blockchain infrastructure, establishing urgent demand for trustworthy data provenance, cross-organizational governance, and incentive-compatible sharing across decentralized ecosystems. Simultaneously, quantum computing advances recognized by the 2025 Nobel Prize in Physics and the Turing Award threaten the cryptographic primitives securing these data economies, creating an interdependent imperative: long-lived verification for embodied AI depends on crypto-agile architectures capable of withstanding quantum adversaries. This tutorial examines blockchain as the coordination layer bridging this dual transition, from financial substrate to foundational Cyber-Physical-Social Systems infrastructure that simultaneously secures against quantum cryptanalysis and enables scalable, trustworthy data economies. The session opens with an immersive AWS Braket demonstration engaging participants with superconducting, trapped-ion, and neutral-atom hardware to assess cryptographic threat timelines and witness ECDSA-to-post-quantum signature transitions. Five integrated modules progress from embodied AI and world-model requirements through quantum hardware reality and evidence-based security migration, to scalable cross-shard architectures via BrokerChain protocols, trustworthy data economies implementing Croissant metadata standards and robotic learning provenance, and industry ecosystem integration for multi-modal cloud deployment. By bridging quantum hardware realities with embodied AI data requirements, this tutorial charts blockchain as unified infrastructure for next-generation decentralized intelligent environments, providing open-source frameworks and roadmaps for architecting quantum-resistant, interoperable, and data-trustworthy systems.

The Montparnasse Algorithm for RNA Design

Tristan Cazenave — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07562v1 Announce Type: cross Abstract: RNA design consists of discovering a nucleotide sequence that optimizes predefined criteria, such as secondary structure. It is useful for synthetic biology, medicine, and nanotechnology. We propose Montparnasse, a Monte Carlo search framework based on Generalized Nested Rollout Policy Adaptation, augmented with a problem-specific prior, slow and long adaptation at level 1, and a lexicographic multicriteria evaluation. Montparnasse solves all 100 puzzles of the Eterna100 V1 benchmark consistently faster than DesiRNA, the previous state of the art, across all time limits, reaching full coverage more than three times faster overall. On messenger RNA secondary structure optimization for hemoglobin alpha, it identifies sequences with more paired bases than the MFE-optimal solution of LinearDesign.

Considerations for an Integrated Detector Design at FCC-ee: A Human-AI Exploration

Charles Young — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07564v1 Announce Type: cross Abstract: This report explores detector design considerations for the Future Circular Collider in its electron-positron mode (FCC-ee) through an extended dialogue between a physicist and an AI assistant. Starting from initial "prejudice" detector concepts proposed by the AI assistant without explicit physicist input, each subsystem is examined in detail, with the AI's assumptions challenged and revised through the exchange. The discussion covers the full detector from beam pipe to luminosity monitor, with particular attention to the interplay between subsystem choices and the practical considerations - calibration, stability, and operational simplicity - that are essential for a fifteen-year precision physics program. The narrative documents how the integrated detector design evolved substantially from the starting point to revised "prejudice" detector concepts of the AI assistant. The focus of this report is on the process to illustrate both the potential and the limitations of human-AI collaboration in experimental physics design, and the physics capabilities of any of the "prejudice" detector concepts remain to be explored.

SurfDesign: Effective Protein Design on Molecular Surfaces

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07567v1 Announce Type: cross Abstract: Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods condition only on backbone structure. We introduce SurfDesign, a surface-conditioned protein design framework that models molecular surfaces as continuous geometric manifolds and integrates them with pretrained protein language models. SurfDesign employs surface-based equivariant message passing to capture surface normals, curvature, and directional geometry, together with a parameter-efficient fine-tuning strategy. Focusing on functional protein design, we show that SurfDesign consistently outperforms prior surface-conditioned and backbone-only methods on de novo binder and enzyme design benchmarks. We also report strong performance on inverse-folding benchmarks as a diagnostic of structural compatibility. Our results highlight manifold-aware surface representations as a principled foundation for functional protein and enzyme design. Code is available at https://github.com/smiles724/SurfDesign.

Forecasting Japanese elections: A nonlinear machine-learning approach

Sota Kato, Xuan Luo, Budrul Ahsan, Asahi Obata, Takafumi Nakanishi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07572v1 Announce Type: cross Abstract: Despite Japan being one of the world's largest advanced democracies, the development of election forecasting models for its national elections remains limited. This study introduces nonlinear machine-learning forecasting models, based on decision tree and ensemble learning methods, for predicting the outcomes of Japanese lower-house elections. To assess the methodological benefits of our approach, we replicated the theoretical framework and dataset of Lewis-Beck and Tien's (LBT) foundational statistical forecasting model for Japanese elections. Our models demonstrated moderately but consistently improved predictive accuracy compared to LBT's model in both in-sample and out-of-sample evaluations, suggesting that nonlinear algorithms offer an alternative approach to classical linear methods in capturing complex electoral dynamics. This study represents one of the earlier applications of nonlinear machine-learning techniques to single-country election forecasting. It offers a replicable framework that, when combined with the country-specific electoral theories of other nations, may enhance the predictive performance of forecasting models in broader national contexts.

Forward-Looking Stress Testing Under Macro Scenarios: Stable SVaR Estimation Using a Hybrid GPR-HS Framework with SACS

Ujjwala Vadrevu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07575v1 Announce Type: cross Abstract: Regulatory stress testing frameworks, including the Comprehensive Capital Analysis and Review (CCAR) and the Internal Capital Adequacy Assessment Process (ICAAP), require robust Stressed Value-at-Risk (SVaR) estimation under forward-looking macroeconomic scenarios. Traditional parametric approaches often exhibit numerical instability under extreme shocks, reducing the reliability of capital projections. This paper extends the Hybrid Gaussian Process Regression Historical Simulation (GPR-HS) framework of Vadrevu (2026) to forward-looking stress scenarios, demonstrating stability across three regimes: West Asia War, Climate Risk, and AI Bubble/Regulation. A key contribution is the Scenario-Averaged Covariance Stabilization (SACS) framework, which constructs stress covariance as a weighted aggregation of historical crisis regimes, providing stable and interpretable dependence structures. Stressed return paths are generated over a 252-day horizon using deterministic drift and stochastic residuals, while volatility is modeled via Gaussian Process Regression with Aggressive Noise Initialization (ANI). The framework exhibits consistent convergence across all assets and scenarios. SVaR ranges from -2.1020% to -2.2231%, with the coherence property |SES| > |SVaR| preserved. The results support GPR-HS with SACS as a stable and regulator-aligned approach for forward-looking SVaR and SES estimation in CCAR and ICAAP applications.

FADRW: A Feature-Aware Modulated and Dynamically Reweighted Loss for Few-Shot Linguistic Steganalysis

Shuo Liu, Xianghong Lin, Yukun Wei, Zhongliang Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07655v1 Announce Type: cross Abstract: The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. However, detection is severely hampered by two fundamental issues during model training. Firstly, extreme class imbalance (less than 1% steganographic samples) induces a strong decision bias. Secondly, the invisibility of generative steganography means its features are nearly indistinguishable from benign text; this similarity, compounded by their extreme rarity, leads to severe feature marginalization, where faint steganographic signals are completely overwhelmed. To directly address these optimization-level challenges, we propose FADRW (Feature-Aware Modulated and Dynamically Reweighted Loss), a novel loss function framework engineered for few-shot steganalysis. FADRW employs Dynamic Reweighting to progressively counteract decision bias, and a Feature-Aware Modulation module to structurally reshape the feature space, preventing feature marginalization by enhancing the separability of these subtle features. Extensive experiments on datasets from three real-world social platforms demonstrate that FADRW significantly outperforms state-of-the-art methods, particularly in the challenging few-shot steganographic sample scenario.

SC3: The Multi-Solvent Solubility Challenge and Benchmark

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07656v1 Announce Type: cross Abstract: Solubility prediction is a standard benchmark in computational chemistry, yet multi-solvent models which reportedly approach the experimental-noise ceiling (i.e. the aleatoric limit) are not yet reliable enough to be deployed. We argue that this gap is partly artefactual: published benchmarks differ in curation policies, evaluate on count-weighted RMSE that hides failure on tail-heavy solvent distributions, and treat the widely cited 0.6-0.8 log S inter-laboratory figure as the aleatoric ceiling even though it reflects worst-case, not expected, disagreement. We introduce SC3, a multi-solvent solubility benchmark built on BigSolDB v2.1 with three contributions: (i) a reproducible curation pipeline yielding 101,535 measurements over 1,327 solutes and 206 solvents, with a recalibrated aleatoric floor of 0.106 log S-roughly 6 times tighter than the conventional figure; (ii) nested Gold/Silver/Bronze consensus tiers with per-point standard deviation, three leakage-checked splits, and a multi-solvent metric suite (PS-RMSE, Z-RMSE); and (iii) a 31-model benchmark across six families, whose best Bronze PS-RMSE sits at 5 times the aleatoric limit, and we observe this is a gap unclosed by any deep alternative tested. We perform three follow-on analyses: data scaling, transfer from quantum-chemistry solvation energies, and feature-level attribution, which demonstrates that calibrated per-point uncertainty is a reusable infrastructure for diagnosis beyond point prediction.

Hardware-aware Low-latency Quantum Compilation with Data-driven Lightweight Error Detection for Early Fault-Tolerant Systems

Sumit Chongder (Indian Institute of Technology Jodhpur) — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07666v1 Announce Type: cross Abstract: Noisy intermediate-scale quantum (NISQ) processors are entering an early fault-tolerance regime where full quantum error correction carries prohibitive resource costs, yet lightweight error detection can meaningfully improve algorithmic success rates. Existing compilation and error-detection toolchains treat these concerns in isolation, with no principled way to balance detection overhead against success probability under latency constraints. We present an integrated hardware-aware compilation and data-driven quantum error-detection (QED) framework that jointly optimises qubit mapping, SWAP insertion, and syndrome-schedule placement via a noise-weighted cost function and a learned multi-objective scheduler. Simulation experiments on an HPC cluster using GPU-accelerated density-matrix simulation (NVIDIA cuQuantum SDK) across VQE, phase-estimation, and Grover benchmarks, three noise profiles, and circuit sizes of 6-20 qubits (depths 10-160), show that joint co-design raises algorithmic success probability by up to 68 percent (95 percent CI: 60 percent to 76 percent) over SABRE on an 8-qubit VQE instance with post-selection.

The Need for Neural ISP in the Small-Pixel Era: How Shrinking Pixels Push Optics to the Limit and Neural Restoration Pushes Back

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07675v1 Announce Type: cross Abstract: Smartphone telephoto cameras are approaching a "telephoto physics wall": as pixel pitches shrink toward sub-0.5 micron, the optics remain limited by geometric aberrations, leading to diminishing returns on resolution. Traditional Image Signal Processors (ISPs) cannot eliminate these aberrations, because they operate through local, stage-wise processing with no explicit model of the underlying point spread function (PSF). We demonstrate how a learning-based Neural ISP for image restoration, trained on the underlying degradations, inverts what stage-wise pipelines cannot, turning small-pixel designs into a net advantage. We investigate this through a controlled simulation of a representative telephoto module, evaluating five configurations (0.35--0.75 micron pixel pitch). The aperture is scaled proportionally to keep per-pixel SNR and diffraction spot size fixed, thereby isolating geometric aberration and spatial sampling. While the traditional ISP improves only modestly with smaller pixels, the Neural ISP scales substantially: at 0.35 micron} it reaches 745 cycles/mm MTF50 (vertical), a 2.5--3x resolution improvement over the traditional ISP, and LPIPS improves significantly from 0.244 to 0.151 while traditional results stay comparatively flat. In a low-SNR extension (15 dB per-frame bursts at 0.35 micron), a multi-frame Neural ISP recovers performance close to the bright-light single-frame baseline, whereas a multi-frame traditional ISP shows no meaningful improvement -- indicating that traditional pipelines at small pixels are bottlenecked by uncorrected PSF blur rather than by noise. These results point to a design philosophy in which Neural ISPs enable high-resolution telephoto modules by correcting residual optical aberrations rather than requiring increasingly complex optics.

Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models

Joseph Boyd, Matthew Lyon, Martino Mansoldo, Christian Hurry, Finnian Firth — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07676v1 Announce Type: cross Abstract: Spatial transcriptomics (ST) is a powerful tool for exploring biological properties dependent on structure, proximity, and interaction in tissue. The methods underpinning ST are developing rapidly but are limited in their ability to profile many thousands of genes at a subcellular scale. Although dissociated from tissue, it is known that the whole-transcriptome readouts of cells in single-cell RNA sequencing (scRNA-seq) retain information about their former in situ neighbourhoods, motivating computational methods to recover it. While paired ST and scRNA-seq datasets are scarce, each modality in its own right is abundantly available. We therefore propose to perform cross-modal translation between unpaired ST and scRNA-seq data. In this work we show that a single-cell foundation model can perform this translation via adversarial fine-tuning. We demonstrate that our method performs favourably against methods built for multi-omics translation.

Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference

Shengxian Ding, Haonan Gao, Pangpang Liu, Xinyuan Tian, Yize Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07677v1 Announce Type: cross Abstract: Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

A Counting Process View of Relational Event Models: Practical Asymptotics

Cornelius Fritz, Alexander Fuchs-Kreiss — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07680v1 Announce Type: cross Abstract: Relational Event Models (REMs) provide a rigorous framework for analyzing dyadic interactions observed in continuous time, capturing history-dependent dynamics such as triadic closure and reciprocity. Framing REMs through the lens of counting processes embeds the model in a rich theoretical foundation, facilitating its mathematical analysis. While Maximum Likelihood Estimation (MLE) is standard practice for estimating these models, the underlying statistical guarantees rely on specific asymptotic regimes, namely, whether the network size (n), the observational period (T), or both approach infinity. We review the theoretical foundations of such counting-process-based models, formalizing the core assumptions required to achieve asymptotic normality across these different limits. With a specific focus on Cox-type multiplicative models, we detail the circumstances under which these assumptions hold. Supported by simulation studies, we illustrate how structural modeling choices, including temporal windowing and logarithmic transformations, affect empirical coverage and estimator convergence. We thereby derive several guiding principles for specifying such models in realistic contexts, bridging theory and practice.

Transfer learning for causal forest

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07693v1 Announce Type: cross Abstract: Transfer learning addresses the challenge of transfering knowledge from one domain to another. Traditional transfer learning focuses on adapting models trained on a source domain (with a lot of observations) to improve performance on a target domain (with few observations). In this work we consider the case of a model shift and we focus on the transfer learning applied to a causal forest namely HTERF. This causal forest aims to estimate the Conditional Average Treatment Effect (CATE). The approach considered is the offset method presented by Wang (2016) adapted to a causal context. This method relies on the use of intermediate models in order to estimate the offset between source and target distributions. Our main result is a bound on the CATE error of HTERF on target depending on the error of the intermediate models. Simulation studies show the good performances of this approach in different settings on simulations and on a real-world dataset.

TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07697v1 Announce Type: cross Abstract: As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has become a main challenge in atmospheric chemistry. Yet mechanism validation based on complex numerical models still relies heavily on expert knowledge: mechanistic hypotheses must be operationalized into executable experiments, and model outputs must be organized into traceable evidence. We present TianJi-Environ, an auditable AI Scientist for atmospheric-chemistry mechanism validation. TianJi-Environ establishes the first WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. Using ozone response and particulate-matter feedback as two representative examples, we demonstrate TianJi-Environ's capability for mechanism validation. In a summertime ozone case over the North China Plain, the system detects directionally consistent aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height, but judges the evidence for ozone response to NOx control to be incomplete. In a wintertime PM2.5 case over the Guanzhong Basin, it localizes the unsupported link to insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating. These results show that TianJi-Environ makes expert-driven mechanism validation explicit, structured, and auditable, offering a reproducible paradigm for multi-agent systems coupled with complex atmospheric-chemistry models.

MatMind: A Structure-Activity Knowledge-Driven Generative Foundation Model for Materials Science

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07712v1 Announce Type: cross Abstract: Progress in AI-driven crystal materials science has so far been carried by narrow architectures purpose-built for individual tasks -- graph neural networks for property prediction, diffusion and flow-matching models for crystal generation -- each excelling within its niche yet unable to act as a shared backbone across the full spectrum of materials problems. Generative large language models offer a fundamentally different paradigm, in which structural representation, quantitative prediction, and structure-activity reasoning can be unified within one model, but the materials community has yet to see this paradigm realized at a level competitive with established narrow specialists. Here we present MatMind, a generative foundation model purpose-built for crystal materials science under this paradigm, developed through the coordinated activation of structure-activity knowledge and physics-informed feedback within a progressive training framework -- combining structure-activity knowledge injection, a dual-head architecture that jointly trains language reasoning and numerical regression in a shared representation space, and multi-objective physics-informed reinforcement learning over stability, novelty, and structural diversity. Across three task families, MatMind attains the lowest mean absolute error on energy above hull, bulk modulus, and band gap -- surpassing graph neural network predictors purpose-built for these tasks -- reaches an S.U.N. rate of 65.3% on unconditional crystal generation, and achieves a comparable multiplicative improvement on magnetization-density-conditioned generation, where only 21 positive samples exist within over 600000 training entries. By matching or surpassing narrow specialists on their own ground while operating within a single unified model, MatMind shows that the LLM-based paradigm can serve as a viable backbone for crystal materials science going forward.

Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps

Daria Kern, Negar Chabi, Souraj Adhikary, Andre Mastmeyer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07717v1 Announce Type: cross Abstract: This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The method combines coarse-to-fine segmentation, predictions from multiple anatomical planes, and additional fuzzy 3D spatial maps that provide anatomical location cues to improve segmentation accuracy. We combine multi-planar 2D-U-Net models augmented by a spatial occurrence map. The approach involves two main stages. First, the abdominal volume of interest region is detected by traversing the whole scan axially with a 2D-U-Net and determining the x-y-z-minimum and -maximum extents of the 5 abdominal organs of interest. Second, we use spatial occurrence maps to enhance our multi-planar 2D-U-net architecture inside the bounds from the former stage. The method is evaluated on 80 CT scans from various public sources. The results show Dice improvements of about 4% at maximum compared to the same model trained without spatial occurrence maps.

GNSS-FM: A Self-Supervised Foundation Model for Daily GNSS Displacement Time Series

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07725v1 Announce Type: cross Abstract: Displacement time series from Global Navigation Satellite Systems (GNSS) are essential for a wide range of applications, including monitoring tectonic crustal deformations and investigating the different stages of the earthquake cycle. Machine learning methods have proven promising for GNSS applications; however, most remain fully supervised. This creates a bottleneck as labeled data are scarce, even though large amounts of unlabeled GNSS data are freely available. We present GNSS-FM, a self-supervised foundation model for daily GNSS time series. The model uses a dual-stream input combining displacement and velocity-like increments, and is pretrained using a masked latent prediction objective with vector-quantized targets adapted from wav2vec 2.0, with several modifications for geodetic data. Pretrained on data from over 17,000 globally distributed GNSS stations, an analysis of the learned codebook suggests that the representations capture the main signal types in GNSS displacement data, including seismic offsets, tectonic drift, and seasonal patterns. The foundation model is later fine-tuned on two downstream tasks, namely 90-day displacement forecasting and seismic step localization, where it outperforms strong task-specific baselines in both cases. These results show that self-supervised pretraining is a promising approach for GNSS time series analysis.

Benchmarking Quantum Algorithmic Resilience for CVaR Portfolio Optimization: The Expressibility-Coherence Trade-off

Prashik N. Somkuwar, K. Srinivasan, G. Raghavan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07727v1 Announce Type: cross Abstract: Quantum combinatorial optimization offers theoretical advantages for complex financial modeling, but physical implementation on Noisy Intermediate Scale Quantum (NISQ) devices is severely constrained by hardware topology. This study presents a hardware benchmarking analysis between a Hardware Efficient Variational Quantum Neural Network (HE-VQNN) and the Warm Start Quantum Approximate Optimization Algorithm (WS-QAOA) for a hybrid Mean Variance and Conditional Value at Risk (CVaR) portfolio objective. By implementing a novel classical quantum hybrid proxy matrix to bypass the CVaR auxiliary qubit bottleneck, we map up to 16 assets from the NIFTY 50 index onto an IBM heavy hex processor. We systematically quantify algorithmic resilience to the "SWAP tax" incurred during routing. Empirical results reveal a critical operational trade-off: WS-QAOA provides exact theoretical mapping but suffers catastrophic hardware decoherence due to exponential nonlocal gate overhead. Conversely, HE-VQNN preserves hardware coherence but lacks the mathematical expressibility to capture dense tail risk asset correlations. This study exposes the limitations of dense financial optimization on current architectures forces an nonviable choice between algorithmic inexpressibility and hardware decoherence. This is indicative of a deeper limitation as to what can and cannot be done with NISQ computers lacking in all-to-all connectivity.

Decomposing tournaments into comparability graphs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07748v1 Announce Type: cross Abstract: In this note, we introduce the \emph{partial order decomposition number} of a digraph $D$, denoted $pod(D)$, defined as the minimum integer $k$ such that $A(D)=A(P_1)\cup\cdots\cup A(P_k)$, where $P_1,\ldots,P_k$ are partial orders on $V(D)$. We prove that $\dic(D)\le \diomega(D)^{pod(D)}$ for every digraph $D$. In particular, every class of digraphs with bounded $pod$ is polynomially $\dic$-bounded. We apply this to tournaments, showing that if $\mathcal C$ is a class of tournaments with bounded dichromatic number, then the closure of $\mathcal C$ under substitution is polynomially $\dic$-bounded, thereby making progress on a question of Aubian, Charbit, Lopes, and the first author. As further applications of $pod$, we prove that poset tournaments of bounded dimension are $\dic$-bounded, derive polynomial lower bounds on the directed clique number of an explicit family of tournaments, thereby answering a conjecture of Gutowski and Rams, and show that tournaments with bounded $pod$ have bounded domination number.

Beyond Point Estimates: Benchmarking Uncertainty Quantification Methods on the AION-1 Astronomical Foundation Model

Karla Tame-Narvaez, Aleksandra \'Ciprijanovi\'c, Shubhendu Trivedi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07771v1 Announce Type: cross Abstract: Foundation models for astronomical surveys offer powerful learned representations that can be transferred to downstream regression tasks such as galaxy property estimation. However, point predictions alone are insufficient for scientific inference; reliable uncertainty quantification (UQ) is essential. We compare seven UQ methods on galaxy property regression using frozen AION-1 foundation-model embeddings, predicting redshift, stellar mass, stellar-population age, gas-phase metallicity, and specific star-formation rate, from Legacy Survey photometry/imaging and DESI spectra, with PROVABGS-derived labels. Distribution-free conformal methods achieve marginal coverage within $\sim$1\,pp of the nominal 90\% across all properties, while non-conformal baselines (Deep Ensembles, MC~Dropout) fail to calibrate reliably. Among conformal approaches, Conformalized Quantile Regression (CQR) delivers the best coverage in the bin with the poorest model predictions. More importantly, only the Locally Valid and Discriminative (LVD) framework -- particularly when operating on AION-1 embeddings -- also provides finite-sample \emph{local validity}, producing intervals that adapt to each galaxy's local prediction difficulty rather than relying on marginal guarantees alone. These results establish conformal prediction, and LVD in particular, as the preferred UQ framework for uncertainty-aware inference on foundation-model embeddings in astrophysics.

Non-Archimedean Polydisc Spaces and Applications to Optimisation

Paul Lezeau, Yiannis Fam, Anthea Monod, Yue Ren — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07782v1 Announce Type: cross Abstract: We propose a new framework for optimisation over non-Archimedean spaces inspired by Berkovich geometry. Specifically, we introduce polydisc spaces, which consists of products of closed balls over a non-Archimedean field. These spaces retain the rigid hierarchical structure of the non-Archimedean field whilst acquiring many desirable geometric features absent from it. We show that metric trees embed naturally into these spaces, demonstrating their capacity to represent hierarchical data. We study their metric geometry, establishing properties such as geodesic uniqueness, confirming their comaptibility with classical optimisation techniques. We further propose a class of real-valued functions given by linear combinations of absolute values of polynomials. These functions admit a piecewise polynomial description along geodesics and satisfy a universal approximation property. We formulate a theory of optimisation on polydisc spaces: we prove existence of minimisers and explore algorithms for finding them. We provide an accompanying open-source Julia library implementing the core objects and optimisation procedures introduced.

Blow-ups of order types of positive density

Ruy Fabila-Monroy, Benedikt Hahn, Jes\'us Lea\~nos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07806v1 Announce Type: cross Abstract: Order types are an equivalence relation between point configurations that capture their combinatorial and convexity properties. Let $P$ be a $\kappa$-colored sequence of $n \ge d+1$ points in general position in $\mathbb{R}^d$. Let $\rho$ be a $\kappa$-colored order type on $k \le d+1$ points that has positive density on $P$; that is, for some constant $\delta >0$, there are $\delta \cdot \binom{n}{k}$ $k$-point subsequences of $P$ that have the same order type as $\rho$ and the same color pattern. In this paper we show that there exists a constant $c >0$ (depending only on $d, \delta$, $k$ and $\kappa$) and disjoint subsets $X_1,\dots,X_k$ of $P$, each with at least $c \cdot n$ points, such that for every choice of $k$ points $x_i \in X_i$, $(x_1,\dots,x_k)$ has the same order type and color pattern as $\rho$.

Agentic multi-fidelity learning of quasiparticle and excitonic properties

Arnab Neogi, Aaron Forde, Christopher A. Lane, Sergei Tretiak, Jian-Xin Zhu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07836v1 Announce Type: cross Abstract: Many-body GW-Bethe-Salpeter equation calculations are essential for accurate simulations of electronic structure and optical properties in modern low-dimensional nanomaterials. However, these methods are computationally demanding and can exhibit localized numerical instabilities or convergence failures that are difficult to detect within high-throughput workflows. We introduce an agent-guided multi-fidelity framework for correcting GW-Bethe-Salpeter excited-state landscapes in strained MoS2-WS2 bilayers. Across stacking registries, strain branches and reciprocal-space samplings, the workflow identifies spike-like excursions, near-zero-gap collapse and cross-fidelity inconsistencies associated with fragile long-wavelength dielectric screening. A structural agent evaluates calculations by assigning confidence weights and selectively using a small number of high-accuracy reference points. Machine learning models then transfer information across related systems and apply Gaussian process corrections to recover improved quasiparticle gaps and exciton binding energies, with calibrated uncertainty estimates. The approach corrects numerically induced artifacts without erasing physical strain dependence and substantially improves agreement with higher-fidelity references relative to a no-agent baseline. These results show that reliable surrogate learning for excited-state materials requires explicit diagnosis of numerical fragility, not direct interpolation of raw first-principles data points. The proposed framework is readily transferable to other optoelectronic nanomaterials characterized by strong quantum confinement, such as quantum dots, nanoribbons, layered two-dimensional semiconductors, and hybrid perovskite nanostructures.

Large-scale empirical tuning and comparison of default optimizers for variational inference

Trevor Campbell, Jonathan H. Huggins, Kyurae Kim, Charles C. Margossian — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07841v1 Announce Type: cross Abstract: Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black box" inference algorithm. However, over the past decade, many new adaptive stochastic optimization algorithms have been developed that reduce or remove entirely the need for tuning. In this work, we investigate this new collection of adaptive methods in the context of BBVI, with the goal of establishing the current state of the art in tuning-free optimization-based inference. In particular, we present a large-scale empirical evaluation of 56 stochastic gradient-based optimization algorithms applied to 1092 Bayesian inference optimization problems, involving over 550,000 individual optimization runs and 15 core-years of compute. The optimization algorithms we evaluate are chosen to represent a wide spectrum of recent approaches and the benchmark problems are chosen to span a range of difficulty, with posterior target dimension 1-10^4, condition number 1-10^8, and a range of variational families. Our results show that no single method dominates, but running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance. We thus provide a strong baseline for applications where expert tuning is not possible and for comparison when developing new stochastic optimization algorithms.

Counting Hamiltonian Paths in 3-Regular Planar Graphs

Ira Pohl, Larry Stockmeyer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07844v1 Announce Type: cross Abstract: We introduce two infinite families of 3-regular planar graphs. Both families are conceptual adversaries to the Pohl-Warnsdorf algorithm for finding Hamiltonians. We provide a closed form calculation of the number of Hamiltonians.

Affine Filtering Measurements and Their Applications to Quantum Decoding

Avijit Mandal, Noah Shutty, Henry D. Pfister, Stephen P. Jordan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07852v1 Announce Type: cross Abstract: Unambiguous state discrimination (USD) measurements are attractive because outcomes are either marked as conclusive (i.e., error free) or inconclusive (i.e., erased). We study affine filtering measurements, a structured variant of USD for decoding classical linear codes over pure-state classical-quantum channels, where a conclusive outcome identifies an affine subspace containing the transmitted codeword and an inconclusive outcome is treated as an erasure. For a group-covariant indexing of pure-state codewords, we show that the optimal design of affine filtering measurements is a semidefinite program that can be reduced to a linear program via character-based diagonalization. We use the resulting measurement to build a quantum decoding framework for local codes, and we demonstrate (via simulations on regular LDPC codes from Gallager ensembles using single parity check local constraints) that affine filtering based decoding can outperform symbol-wise USD and symbol-wise pretty good measurement based decoding methods on i.i.d. pure-state channels. In an independent and concurrent work, Buzet and Chailloux study similar fine-grained USD measurements for symmetric families of states. Their focus is on the code-agnostic setting whereas our focus is on code-aware constructions and decoding.

Optimal Wiener-Filter Solutions for Denoising of Graph Signals on Directed Graphs

Chun Hei Michael Chan, Alexandre Cionca, Dimitri Van De Ville — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07876v1 Announce Type: cross Abstract: Graph signal processing has opened new avenues to the canonical denoising problem in interesting settings. Specifically, here we propose a Wiener-filter solution for graph signals on directed graphs. Under various stationarity assumptions combining uncorrelated and correlated noise conditions, we show optimal solutions, including a successful proof-of-concept for temperature graph.

On the sharp linear convergence rate of the circumcentered--reflection method on subspaces

Yunier Bello-Cruz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07888v1 Announce Type: cross Abstract: For two subspaces $U,V\subseteq\RR^n$, the circumcentered--reflection method (CRM) of Behling, Bello-Cruz, and Santos~\cite{BBS2018} computes the projection onto $U\cap V$ using only the reflections across $U$ and $V$, with known linear-convergence rate $c_F$, the cosine of the Friedrichs angle. We prove that, when CRM is initialized in $V$, it contracts at the strictly smaller rate $\rho_V=(\sin^2\theta_p-\sin^2\theta_F)/(\sin^2\theta_p+\sin^2\theta_F)$, where $\theta_F\in(0,\pi/2]$ is the Friedrichs angle and $\theta_p\in[\theta_F,\pi/2]$ the largest principal angle between $U$ and $V$. The bound is sharp, attained on an explicit ray in $V$, and optimal among parameter-free single-step iterations. The constant itself is not new: Bauschke, Bello-Cruz, Nghia, Phan, and Wang~\cite{BBNPW2016} identified it as the optimal rate of the relaxed alternating-projection family and of their adaptive linesearch map $B_T$. Our contribution is that the parameter-free geometric circumcenter attains it as well, via Kantorovich's inequality applied to a single self-adjoint operator on $V$. Restricted to $V$, CRM coincides pointwise with the linesearch maps $A_T$ and $B_T$ from the Gubin--Polyak--Raik framework~\cite{GPR1967}. We further prove $\rho_V

Beyond the Thin-Layer Limit: Differentiable Volumetric Training for Visible-Range Diffractive Neural Networks

Dineth Jayakody, Dushan N. Wadduwage — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07896v1 Announce Type: cross Abstract: Diffractive deep neural networks (D2NNs) promise miniaturized, power-efficient, light-speed optical front-ends for machine vision, yet the most mature demonstrations remain in the terahertz regime, built from readily fabricated millimeter-scale neurons. Translating D2NNs to the visible range, where nearly all vision pipelines operate, was long blamed on the difficulty of fabricating nanoscale neurons; but even after recent advances removed that barrier, visible-range D2NNs matching their terahertz counterparts remain out of reach. We identify the true obstacle as the thin-layer approximation underlying nearly all D2NN training, which treats each diffractive layer as an infinitely thin mask. It fails not because of the short wavelength, as is commonly assumed, but because the low-refractive-index materials (n approximately 1.3-1.5) used at visible wavelengths require relief structures thick enough that intra-layer diffraction and phase accumulation become significant. To overcome this, we introduce a differentiable beam-propagation ($\partial$BPM) layer that models each element as a finite-thickness volume and propagates light through it during training, keeping the fabrication-compatible height map end-to-end trainable without full-wave simulation in the loop. Across MNIST, Fashion-MNIST, and CIFAR-100 classification and imaging, $\partial$BPM training substantially reduces the design-to-device mismatch, and full-wave FDTD validation raises classification accuracy from 50% to 90% without re-optimization. The $\partial$BPM layer thus offers a scalable, physics-aware bridge between efficient optical neural-network optimization and fabrication-consistent diffractive design.

Identifiability and Estimation for Unlabeled Finite Mixtures under Marginal Independence

Takafumi Kanamori, Yushi Hirose, Shohei Yamamoto — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07914v1 Announce Type: cross Abstract: We study component recovery and mixing-matrix estimation from unlabeled finite mixtures whose observable distributions share the same latent components but have unknown mixing weights. The main identifying signal is marginal independence: each component is assumed to be independent on at least one coordinate pair, but no labels, clean component samples, or mixing weights are observed. We first prove a structural result for product components: under linear independence of the univariate marginals, any independent affine combination of the components must coincide with a single component. We then extend this principle to observable mixtures and show that, under full-rank and no-cancellation conditions, marginally independent affine combinations recover the corresponding latent components. When every component is independent on some coordinate pair, all components are identifiable, and the mixing matrix is recoverable under the stated completion conditions. Finally, we propose a Product-Marginal Maximum Mean Discrepancy (PM-MMD) estimator over affine combinations of the observable mixtures and prove uniform convergence and stability under approximate marginal independence. This framework also separates the empirical roles of the assumptions: irreducibility is, in general, not directly testable from the unlabeled mixtures alone, whereas marginal independence yields a candidate-level diagnostic through held-out PM-MMD. Controlled and flow-cytometry experiments show when marginal independence provides a useful recovery signal. In the reported multi-component comparisons, condition-aware representative selection stabilizes PM-MMD and improves recovery relative to clustering, factorization, and pairwise mixture-proportion baselines using the same unlabeled mixtures.

Barycentric Projections of Optimal Transport Plans on Riemannian Manifolds

Kisung You — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07926v1 Announce Type: cross Abstract: Optimal transport couplings are probabilistic objects, while many learning pipelines require deterministic maps. In Euclidean space, barycentric projection converts a coupling into a map by taking conditional expectations, but on a Riemannian manifold curvature and cut loci make this operation nontrivial. We develop a framework for barycentric projections of transport couplings on Riemannian manifolds. The intrinsic projection maps each source point to the conditional Fr\'echet mean of its destination law and is shown to be the best deterministic representative under squared geodesic loss. The corresponding minimum value is an integrated conditional Fr\'echet variance, which vanishes exactly for map-induced couplings and therefore defines a conditional-variance Monge defect. We also study a tangential log-exp projection, prove its Euclidean exactness, its compatibility with Brenier-McCann maps in the Monge case, and its interpretation as the first unit Riemannian gradient update for the intrinsic objective. For discrete couplings, both constructions decompose row-wise into weighted Fr\'echet mean and log-exp problems. Experiments on spherical data, synthetic SPD data, and real EEG covariance matrices support the proposed division of roles: the intrinsic projection is the variational representative, while the tangential projection is a useful local displacement surrogate.

Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

Yunbei Xu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07931v1 Announce Type: cross Abstract: We prove a variance-aware pointwise majorizing-measure theorem for centered Gaussian processes. Classical generic chaining characterizes the scalar quantity $\mathbb E\sup_{x\in T}X_x$; the theorem here gives a simultaneous high-probability envelope for the entire field. For an ambient prior $\mu$, the envelope at $x$ is governed by a pointwise Fernique-Talagrand functional \[\Phi_\mu(x):=\int_0^{4\sigma(x)}\sqrt{\log\frac{1}{\mu(B_d(x,\varepsilon))}}\,d\varepsilon,\] together with the corresponding Gaussian tail term. The theorem provides a reusable field-level refinement of classical generic chaining and a Gaussian-process counterpart of pointwise empirical-process bounds for deep neural networks. We also record a Bayesian algorithmic lower envelope from the interactive Fano/data-processing principle. For a known prior $\pi$, an observation channel, and a concrete estimator $\widehat t(Y)$, the lower bound is expressed through the exact ghost small-ball mass $\mathbb E_{Y\sim Q}\pi(B_d(\widehat t(Y),\Delta))$, rather than a worst-case covering number. In Gaussian location experiments, comparison decoders convert Bayes location error into lower bounds on decision-aligned Gaussian ranges. We then construct an elementary weighted-basis example separating the usual Fano relaxation for a fixed prior, the Bayesian algorithmic lower envelope, the pointwise Gaussian envelope on the selected subatlas, and the full-class minimax risk/global Gaussian scale. Together, these results show that algorithmic lower bounds provide local-geometric certificates of pointwise complexity for fixed estimators in overparameterized ambient classes, precisely in regimes where classical minimax theory becomes either too coarse or oracle-dependent.

Feasibility to detect rapid change and disappearance of seagrass: Lessons from nearly 80 years of vegetation change in the Ako, Seto Inland Sea, Japan

Takehisa Yamakita, Yoji Igarashi, Akira Eto, Ken Ishida, Masaaki Iiyama — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07949v1 Announce Type: cross Abstract: This study analyses the Ako tidal flat in the Seto Inland Sea, Japan, where nearly all Zostera marina disappeared within a single year in 2025. Using aerial photographs from the 1940s onward, high-resolution satellite imagery, GRUS images (2.5-5 m), and monthly Sentinel-2 composites (10 m), we reconstructed approximately 80 years of seagrass distribution. YOLO-based segmentation using deep learning achieved high accuracy (overall accuracy >= 0.9) across these datasets; although species could not be discriminated, the models captured the major temporal dynamics in vegetation area. The long-term mean seagrass area was 6.8 ha, but values fluctuated widely, from 3.5 ha in 1974 to 41.3 ha in 1989 except 0.2 ha in 2025. Sentinel-2 composites from 2019 to 2026 revealed clear seasonality, with vegetation increasing in early summer and declining from autumn. In 2025, however, the area decreased sharply after summer and remained anomalously low throughout the winter of 2025-2026. Our results, indicating that the 2025 event was not a normal fluctuation but a rapid ecosystem shift involving the loss of the dominant canopy-forming species, most plausibly driven by regionally elevated summer water temperatures. The findings also have implications for seagrass Essential Ocean Variables (EOVs) and the State of Nature (SoN) metrics used in TNFD-aligned nature-related disclosures. Unlike forests, seagrass meadows require finer temporal resolution because both pronounced seasonality and abrupt collapse strongly influence area-based indicators. Therefore, in addition to previously noted issues such as species-level classification accuracy, we recommend that (1) baselines be defined over the longest available record and justified ecologically, (2) seasonal standardization be applied before inter-annual comparisons, and (3) years with extreme area anomalies be flagged rather than used as reference points.

Lagrange multipliers in Maximum likelihood estimations and Least squares problems with Constraints

Takeshi Fukasawa — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.07984v1 Announce Type: cross Abstract: This study investigates a statistical property of Lagrange multipliers in constrained Maximum Likelihood Estimation (MLE) and Least Squares (LS) problems from the perspective of numerical optimization. Building on large-sample theory, we show that the associated Lagrange multipliers converge to zero as the sample size increases, provided the distribution is correctly specified in MLE or the residuals are normally distributed in LS. Although this asymptotic behavior has long been recognized in statistics, it has received little explicit attention in numerical optimization and has rarely been exploited in algorithmic design. Importantly, the insight extends beyond classical low-dimensional settings: even in modern high-dimensional applications, such as deep learning, where the number of parameters may exceed the sample size, the same reasoning applies provided the generalization performance is good. This observation has two main implications. First, many constrained optimization algorithms, including the Augmented Lagrangian Method, Sequential Quadratic Programming, and Interior Point methods, require initial values for the multipliers, and choosing zero is statistically justified. Numerical experiments for constrained regressions and dynamic discrete choice model estimations support this implication by showing that initializing multipliers at zero usually lead to stable and efficient performance. Second, penalty-based approaches that convert constrained problems into unconstrained ones can perform well when the true multipliers are small. This helps explain why penalty-based methods often perform well in practice.

Repair Before Veto, When Repair Is Hidden: Quantum-Accessible Features for Repair-Augmented Constraint Learning

Yifan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08020v1 Announce Type: cross Abstract: Hard-constraint decision systems usually veto infeasible candidates. This is too rigid when the system can act: if a known affordable repair would make an infeasible candidate feasible and valuable, rejection is a false veto rather than a ranking error. We introduce Q-RACL (Quantum Repair-Augmented Constraint Learning), a repair-before-veto framework that first defines RACL decision semantics and then identifies the single inference link where quantum feature access can be load-bearing. RACL accepts a candidate when a sequential repair plan restores feasibility and preference; otherwise it returns structured rejection credit. The hard link is repair-feasibility inference: which repair class restores feasibility from an observed candidate and context. We construct a discrete-logarithm-hidden RACL family where the repair class is a shifted interval rule in the latent exponent a = log_g(x), while the learner observes only x = g^a mod p. Under standard DLP-based learning separation, this coordinate is inaccessible to efficient raw-input classical policies but accessible to a quantum agent through Shor/Fourier structure. Across six primes and ten seeds, bounded raw-input classical policies and a wrong raw-Fourier encoding remain near chance, whereas the Q-DLP policy keeps false-veto rate below 1.1%, wins all paired seeds, and yields QNI_cond = 0.9777 to 0.9972. A classical DLog oracle matches it, isolating feature access rather than classifier capacity. Thus quantum AI is not added as a generic model upgrade; for this DLP-hidden repair family, it supplies the missing feature that closes the repair-before-veto loop.

Variational Proximal Policy Optimization

Ousmane Amadou Dia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08032v1 Announce Type: cross Abstract: Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization ($\textsc{VP}_2\textsc{O}$), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent within a Mixture-of-Experts architecture. By leveraging functional kernels over localized expert prototypes alongside an expert orthogonalization loss, $\textsc{VP}_2\textsc{O}$ introduces a geometry-based proximal-control mechanism that can reduce reliance on fixed clipping or KL schedules. Our results on a 33B/4B sparse Mixture-of-Experts model show several improvements across complex reasoning benchmarks, establishing a $+\mathbf{179}$ ELO gain on Codeforces and a $\mathbf{32\%}$ reduction in token count on AIME mathematical reasoning tasks.

Numerical solution of the nonlinear Dirac equation by a splitting variational quantum algorithm

Qian Zuo, Ying He, Xiaofei Zhao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08053v1 Announce Type: cross Abstract: In this work, we propose an operator-splitting variational quantum algorithm, termed Dirac-sVQA, for simulating the nonlinear Dirac equation (NLDE). The main difficulty arises from the state-dependent nonlinear interaction, its time-discrete update depends explicitly on the intermediate spinor state and, in general, cannot be implemented as a fixed state-independent unitary circuit. To address this difficulty, we decompose the NLDE evolution into a structured linear Dirac substep and a nonlinear variational correction. The linear substep is implemented by a spinor-Fourier Dirac propagator on a joint position-spin register, preserving the spin-momentum coupling and mass-induced spin evolution of the Dirac operator. The nonlinear correction is reformulated as a measurement-based variational update through a small set of overlap, self-channel, and cross-channel observables. We provide the corresponding quantum circuits and derive measurement-aware resource and complexity estimates. Numerical experiments in several nonlinear regimes show that Dirac-sVQA accurately captures both the total density and the componentwise spinor dynamics, agrees well with classical Fourier pseudospectral splitting solutions, and exhibits stable error behavior over time. These results provide numerical evidence for the feasibility of operator-splitting variational quantum simulation for nonlinear relativistic wave equations.

New Fractional Ambiguity Function Integrated with CNN-Based Machine Learning for Signal Classification

Aamir H. Dar, Prakhar Kumar Sonkar, Neeraj Kumar Sharma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08110v1 Announce Type: cross Abstract: A new fractional ambiguity function (NFrAF) derived from the fractional Fourier transform is introduced as a generalization of the classical ambiguity function. The fundamental analytical properties of the NFrAF, including symmetry, marginality, and Moyal type identities, are rigorously established. After verifying its ability to detect and localize monocomponent and multicomponent linear frequency modulated (LFM) signals, the NFrAF is integrated into a convolutional neural network based machine learning framework for signal classification. Owing to its superior time frequency resolution and localization, the NFrAF provides a more informative input representation than conventional methods such as the spectrogram and classical ambiguity function. Experimental results on simulated datasets demonstrate consistent improvements in classification accuracy, highlighting the effectiveness of the proposed representation for data driven signal analysis.

Palindrome complexity versus factor complexity

Jeffrey Shallit — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08127v1 Announce Type: cross Abstract: Let ${\bf x} = (a_i)_{i \geq 0}$ be an infinite word over a finite alphabet $\Sigma$. Let $\rho (n)$ be the factor complexity function for $\bf x$ and ${\rm Pal}(n)$ be the palindrome complexity function for $\bf x$. We give a new relationship between these two quantities; namely, if $\bf x$ is not ultimately periodic, then $$ \lim_{n \rightarrow \infty} {{ {\rm Pal} (n) \log ({\rm Pal} (n) + 1)} \over {\rho (n)}} = 0. $$ Furthermore, we prove that the numerator in this result is essentially optimal.

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

Yi Duan, Zhao Yang, Jiwei Zhu, Ying Ba, Chuan Cao, Bing Su — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08147v1 Announce Type: cross Abstract: DNA cis-regulatory elements (CREs) such as enhancers control gene expression levels. Accurately predicting regulatory activity from DNA sequences is valuable but challenging, as it requires understanding complex biological regulatory processes. Existing methods typically regress activity scores from sequences in a black-box manner, limiting both interpretability and regression performance. Meanwhile, large language models (LLMs) benefit from explicit reasoning processes, yet directly applying LLMs to raw DNA sequences performs poorly. In this paper, we bridge this gap by introducing R3LM, a framework that teaches LLMs reasoning-informed regression on regulatory DNA through structured biological knowledge. Specifically, we design a biologically grounded data format that structures DNA's regulatory information for improved LLM understanding, and construct CRE-ReasonBench, the first dataset that associates DNA sequences and activity scores with mechanistic reasoning traces. Through two-stage training that first teaches LLMs reasoning over structured biological information then performs regression, R3LM achieves state-of-the-art performance on enhancer prediction across three cell types, outperforming both LLMs with raw sequence input and specialized DNA models while providing interpretable mechanistic explanations. We expect R3LM as an interpretable reward model that can effectively assist biologists in CRE design. Code is available at https://github.com/DuanYi516/R3LM.

Inverse design of bespoke interatomic potentials via active learning by information-matching

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08148v1 Announce Type: cross Abstract: Interatomic potentials (IPs) enable large-scale atomistic simulations beyond the reach of first-principles methods, but their predictive reliability depends critically on the selection of training data, quantified uncertainty, and model expressiveness. Active learning (AL) provides a principled framework for constructing efficient and accurate IPs, yet most strategies reduce parameter uncertainty without explicitly accounting for the specific material properties being predicted. The information-matching (IM) approach addresses this limitation by requiring that the selected training data provide at least as much parameter space information as needed to achieve prescribed uncertainty targets for selected quantities of interest (QoIs). Here, we apply IM to develop bespoke IPs specifically tailored for predicting plastic strength in metals. Due to the high computational cost of simulating plastic strength, we employ an indirect IM strategy that targets inexpensive intermediate QoIs that correlate with strength. The IM method enables precise parameter constraints with minimal training data, yielding precise predictions for both the intermediate QoIs and plastic strength. Yet, model error remains a key limitation, and a post hoc uncertainty inflation correction provides a viable means to mitigate this limitation. These findings illustrate both the promise and limits of uncertainty-aware AL for predicting complex material properties.

Latent Structural Categorical Matrix Completion with Application to Quasispecies Analysis

Qian Zhang, Meixia Lin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08188v1 Announce Type: cross Abstract: Matrix completion has been extensively studied for real-valued data, but existing methods are often limited in handling categorical variables. We propose LCMC, a double-loop optimization framework for categorical matrix completion via latent factorization based on a binary tensor representation. In this setting, each categorical entry is encoded as a one-hot vector along a third tensor mode, thereby preserving its discrete, non-ordinal nature. The outer loop adaptively estimates the latent dimension by iteratively updating it with feedback from the inner loop, while the inner loop reconstructs the categorical matrix through tensor factorization, supported by a corresponding theoretical analysis. To further improve scalability and robustness, we introduce enhancements including a split-merge-refine strategy and an adaptive data reduction technique. Experiments on synthetic and real-world datasets in viral quasispecies reconstruction, demonstrate that LCMC achieves superior accuracy and efficiency compared to existing methods.

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

Mariyam Khan, Shohei Shimizu, Thong Pham — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08196v1 Announce Type: cross Abstract: We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale noise model (LSNM). Existing methods that handle hidden confounders typically assume additive noise, but in practice, causes often modulate not just the mean but also the variance of their effects. We prove that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable under LSNM with hidden variables, establishing the first identifiability result for causally insufficient models beyond noise additivity. We further provide sufficient conditions for identifying causal direction even when the bow-free assumption is violated. Our two-stage algorithm, LSNM-UV, is sound and complete, and experiments demonstrate improved performance over additive baselines on heteroscedastic data.

Vector Space of Cycles

Moo K. Chung, Anass B. El-Yaagoubi, Hernando Ombao — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08202v1 Announce Type: cross Abstract: Most statistical and machine learning methods for directed interactions focus on pairwise effects among variables. Even existing cyclic models represent feedback primarily through node-level dependencies, making large-scale recurrent organization difficult to estimate and compare. This limitation is particularly acute in biological and neural systems, where interactions are highly recurrent and involve many overlapping cycles. We introduce a variational framework for statistical inference on cyclic interactions. Directed interactions are represented as edge flows on a simplicial complex and evolved under an energy-minimizing dynamical system. The resulting dynamics separate transient interaction components from persistent harmonic flows, yielding a low-dimensional cycle space that captures stable recurrent organization. Rather than enumerating individual cycles, the proposed framework represents cyclic interactions as elements of a Hilbert space, enabling projection, averaging, comparison, and population-level statistical inference. We establish theoretical properties of the harmonic projection, including characterization of the cycle space, variance reduction, and population inference. Simulations demonstrate substantially improved recovery of cyclic structure in dense recurrent systems compared with existing directed-interaction methods. Applied to resting-state fMRI from 400 human subjects, the framework reveals reproducible large-scale cyclic organization that is not detectable through edgewise averaging. These results provide a scalable statistical framework for studying recurrent interactions in high-dimensional dynamical systems.

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08210v1 Announce Type: cross Abstract: Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.

Entanglement in the Quantum Volunteer's Dilemma

Noah Dane Hebdon, Dax Enshan Koh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08227v1 Announce Type: cross Abstract: A well-known model in game theory, the Volunteer's Dilemma describes a group of $n$ players who decide whether to volunteer for a collective benefit at a personal cost, or to abstain and risk forfeiting the benefit altogether. A quantum version of this dilemma, developed within the Eisert-Wilkens-Lewenstein framework, allows each player to manipulate one qubit of a shared entangled state, leading to symmetric Nash equilibria with higher expected payoffs than in the classical game. Existing analyses, however, assume maximal entanglement. Within the same framework, we introduce a generalized Quantum Volunteer's Dilemma with a tunable entanglement parameter $\gamma$ and study the extent to which equilibrium behavior depends on the level of entanglement. We derive explicit conditions relating $\gamma$, the number of players, and the players' strategies under which symmetric Nash equilibria exist, focusing on two canonical strategy profiles: one for $2\leq n\leq 9$, and one for even $n$. We find that maximal entanglement is not required to sustain symmetric equilibria. Instead, equilibrium behavior persists above a threshold value, which we compute analytically in both cases. We also demonstrate that the threshold value directly depends on system size. This characterization is directly relevant for implementations on resource-constrained quantum devices, where entanglement is inherently limited.

Post-Rejection Follow-up Sampling: A Methodology for Counterfactual Outcome Measurement in Algorithmic DEX Trading

Arati Uday Kamat — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08228v1 Announce Type: cross Abstract: Algorithmic trading systems on decentralised exchanges (DEXs) reject most candidate tokens they evaluate. The counterfactual outcome of rejected candidates (what would have happened had the system entered) is rarely measured. This paper introduces Post-Rejection Follow-up Sampling (PRFS). A separate tracking subsystem samples each rejected token's price and liquidity at a configurable cadence, over a horizon of up to twenty-four hours. PRFS produces the data needed to evaluate filter precision against actual market outcomes of rejected candidates, not against synthetic backtest reconstructions. The methodology, data architecture, and deposit format are described in Section III. The companion dataset contains 67,000 forward-outcome observation rows across 2,997 rejection events spanning 457 unique mints, collected over a continuous eight-day window (2026-04-10 to 2026-04-19, UTC). Approximately 55 percent of rejection events receive at least one forward observation; coverage at the mint level is complete. The principal binding constraint on downstream classification is per-event horizon density, not event-level coverage. PRFS is dataset-independent. It generalises to any algorithmic decision system in which rejections substantially outnumber executions.

AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals

Aueaphum Aueawatthanaphisut — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08247v1 Announce Type: cross Abstract: Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.

QnRL: Quantum-Native Reinforcement Learning

Alexander DeRieux, Walid Saad — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08276v1 Announce Type: cross Abstract: Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures indirectly approximate environment behavior by estimating expected outcomes, which limits their expressive power and adaptive potential. Overcoming such challenges requires a novel QRL approach that exploits the distributional nature of quantum computers to directly model environment random variables as quantum state distributions. Hence, in this paper, a novel framework dubbed quantum-native reinforcement learning (QnRL) is proposed. QnRL is a distributional RL framework that learns conditional distributions naturally in Hilbert space via superimposed and entangled quantum states. Thus, QnRL can directly model the behavior of stochastic learning environments via the natural properties of quantum systems. QnRL accomplishes this via a novel, proposed quantum amplitude kickback (QuAK) algorithm that enables comparing the $n$-th power of the $m$-th moment of multiple superimposed distributions. It is theoretically proven that a conditional action policy distribution is distilled from the moments of a quantum generative model entirely within Hilbert space via QuAK, and optimized via QnRL. This complex distribution composition is also shown to provide extra dimensions for expressing environment correlations that are unknown to purely classical and classically-sampled quantum distributional models. Experimental results across diverse environments show that QnRL achieves up to $82.9\%$ higher evaluation scores, with up to $94.3\%$ fewer parameters on average, more accurately estimates the expected return for unseen observations, and better adapts to varying stochastic conditions compared to the baseline.

Strategic Type Spaces

Olivier Gossner, Rafael Veiel — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08297v1 Announce Type: cross Abstract: We provide a strategic foundation for information: in any given game with incomplete information we define strategic quotients as information representations that are sufficient for players to compute best-responses to other players. We prove 1/ existence and essential uniqueness of a minimal strategic quotient called the Strategic Type Space (STS) in which a type is given by an interim correlated rationalizability hierarchy and represents a set of beliefs over other players' types and nature that rationalize this hierarchy and 2/ that the minimal STS has a recursive structure that is captured by a finite automaton.

MEC-Cox: Machine-Learning-Assisted Generalized Entropy Calibration for ATT Marginal Hazard-Ratio Estimation

Se Yoon Lee, Yonghyun Kwon, Jae Kwang Kim — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08305v1 Announce Type: cross Abstract: Externally controlled survival trials are increasingly used when concurrent randomized controls are infeasible, particularly in oncology and rare-disease settings with time-to-event endpoints. We target an average-treatment-effect-on-the-treated (ATT)-type marginal hazard-ratio estimand, comparing treatment with counterfactual control in the treated trial population, and estimate it using inverse-probability-weighted (IPW) Cox regression. Valid inference is challenging because IPW Cox regression depends on the weights through both event contributions and risk-set averages, making flexible machine-learning nuisance estimation difficult to incorporate directly. Building on machine-learning-assisted generalized entropy calibration (MEC) by Lee and Kim (2026), we propose MEC-Cox for ATT-weighted IPW Cox regression. The method begins with normalized source-propensity-score odds weights for external controls and then applies Bregman calibration to balance cross-fitted prognostic summaries between external controls and treated trial patients. The calibration basis may include control-survival predictions, Cox linear predictors, penalized-survival-model predictions, or other prognostic-score summaries. MEC-updated weights therefore play a dual role as source-transport and prognostic-score balancing weights. We establish consistency, characterize a calibration-induced efficiency gain, and develop a stacked sandwich variance estimator. Simulations show that MEC-Cox can reduce bias, increase efficiency, and improve coverage through flexible machine-learning-assisted adjustment.

MetaboliSim: a Python implementation of the Mader model for dynamic and steady-state simulation of muscular energy metabolism

Katharina Dunst, Vincent Scharf, Clemens Hesse, Alexander Asteroth — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08366v1 Announce Type: cross Abstract: The Mader model is the most widely used mathematical framework for muscular energy metabolism in German-language sport science, underpinning lactate diagnostics, maximal lactate steady state (MLSS) estimation and training prescription. Despite decades of use, neither its dynamic ODE formulation nor its steady-state equations have been available as open code, leaving results based on the model impossible to reproduce independently. We close this gap with MetaboliSim, an open-source Python implementation of both formulations: a dynamic model that integrates the five-variable ODE system (phosphate potential, $\dot{V}\mathrm{O}_2$, muscle and blood lactate, and glycogen) with a fourth-order Runge-Kutta scheme, and a steady-state model that computes MLSS power and the lactate-power relationship in one- and two-compartment variants. We verified implementation correctness against published reference values and assessed physiological plausibility across constant-load, step-test, sprint and running protocols. The implementation reproduces the published reference output within stated tolerances and remains numerically stable throughout (halving the time step changes blood lactate by less than 0.01 mmol/L), with both formulations yielding congruent MLSS estimates. Key physiological behaviour ($\dot{V}\mathrm{O}_2$ on-kinetics, lactate accumulation, PCr dynamics and the sub/supra-MLSS separation) emerges directly from the model equations without protocol-specific tuning, and a sensitivity analysis shows MLSS power varying approximately linearly with $\dot{V}\mathrm{O}_{2\max}$ and nonlinearly with $\dot{V}\mathrm{La}_{\max}$. As the first openly available implementation of the complete Mader model (AGPL-3.0), MetaboliSim lets independent groups reproduce, verify and build on published model-based results. Source code: https://codeberg.org/3phos/metabolisim; Platform: https://metabolisim.org

Programmable Silicon Retina on Pixel Processor Array

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08370v1 Announce Type: cross Abstract: Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages - specifically spatial filtering and gain control - can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13\% reduction in saliency prediction loss in comparison to standard DVS event representation, while reducing the event rate by approximately 47\%. These experiments are obtained using a lightweight $\approx 100$k-parameter FireNet-style network, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina's "information distillation" mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.

A Switching Beamformer for Highly Non-Stationary Environments

Manan Mittal, Ryan M. Corey, John R. Buck, Andrew C. Singer — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08385v1 Announce Type: cross Abstract: Adaptive beamforming is a cornerstone of array signal processing, yet its performance often collapses in the face of complex, rapidly changing interference. When interferers appear or move unpredictably, conventional estimators encounter a fundamental memory trade-off: short windows enable rapid tracking but suffer from high estimation variance, while long windows provide stable rejection but fail to adapt to shifts. This challenge is resolved by introducing the Universal Switching Beamformer (USB), which integrates competitive sequential prediction into the beamforming architecture. By employing a linear transition diagram, the USB implicitly maintains an exponentially large family of candidate covariance histories and dynamically re-weights them based on their cumulative output power. This mechanism allows the beamformer to automatically vary its effective memory length without explicit change detection or heuristic parameter tuning. A theoretical upper bound is proven on the regret relative to an omniscient oracle that selects the best piecewise-stationary covariance model in hindsight. Extensive simulations and experiments on the SwellEx-96 dataset demonstrate that the USB achieves the agility of short-window estimators and the precision of long-term integration, providing a principled solution for tracking highly non-stationary scenes.

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08437v1 Announce Type: cross Abstract: Palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domain gap between controlled enrollment and unconstrained authentication. Existing datasets are largely restricted to controlled setups and fail to capture the compound variability of real-world environments. In this paper, we introduce X-Palm, a cross-domain dataset comprising 6,006 palm images from 103 individuals (206 hands). To the best of our knowledge, X-Palm is the first palmprint dataset providing novel paired-identity acquisition specifically designed to bridge the gap between reliably controlled multispectral enrollment and unconstrained mobile authentication while encompassing a broad spectrum of in-the-wild variability. Unlike existing datasets that focus on single to a few variations, X-Palm addresses the massive modality and environmental shifts encountered in practical deployments by capturing paired data for identities across two distinct domains: (1) a controlled Multispectral Palmprint setting using our custom-developed scanner, and (2) an unconstrained smartphone palmprint setting that is participant-driven, incorporating simultaneous variations in hardware, hand pose, illumination, background, camera-to-hand distance, perspective, and palm surface conditions (e.g., moisture and occlusions). Our extensive benchmarks of 12 SOTA models reveal that while existing methods achieve high performance on controlled data, they experience severe performance collapse on X-Palm. Conversely, models trained on X-Palm demonstrate consistent robustness across domains, positioning X-Palm as a valuable resource for training a model towards real-world, cross-domain generalization. Data access instructions and the related benchmarking codes are publicly available at: https://github.com/X-Palm/X-Palm-2026

Improving Bayesian Optimization via Training-Aware Conditional Diffusion Models

Yilin Zheng, Haowei Wang, Szu Hui Ng, Enlu Zhou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08438v1 Announce Type: cross Abstract: Bayesian optimization (BO) is a widely used approach for black-box optimization that uses a Gaussian process (GP) as a surrogate and guides sequential evaluations via an acquisition function, with the ultimate goal of locating the global optimum $\mathbf{x}^{\star}$. To align with this goal, information-based acquisition functions such as Predictive Entropy Search (PES) model $\mathbf{x}^{\star}$ as a random variable and reduce the entropy of its distribution, but approximating this distribution via traditional GP posterior sampling is computationally expensive. To address this limitation, we leverage Conditional Diffusion Models (CDMs) to efficiently approximate the distribution of $\mathbf{x}^{\star}$ and develop BO-inherent training strategies for CDMs. Motivated by the structural properties of the CDM-learned distribution, we further develop an acquisition strategy termed Diffusion-based Mode Seeking (DMS) to guide the sequential evaluation. We establish a sub-optimality guarantee for the CDM-learned distribution and demonstrate through extensive experiments that DMS outperforms standard BO baselines.

Quantum Kravchuk Transform using $\mathfrak{su}(2)$ fast-forwarding

Chaowen Guan, Akshit Katiyar — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08443v1 Announce Type: cross Abstract: We present a quantum algorithm for the Kravchuk transform that scales logarithmically in both the dimension and the inverse of the error parameter. The quantum Kravchuk transform maps computational basis states to states with amplitudes proportional to Kravchuk functions. We achieve this by combining two key techniques: the structural relationship between the Kravchuk transform and the Lie algebras $\mathfrak{su}(2)$, and a recent fast-forwarding simulation method for $\mathfrak{su}(2)$ operators in the oscillator representation. More precisely, we first establish the map from Kravchuk transform in computational basis to $\mathfrak{su}(2)$ in Fock basis. Then built on this connection, we apply the fast-forwarding to achieve an efficient quantum Kravchuk transform.

LOTTERY: Learning from Reference-Only Samples in Two-Sample Testing under Size Asymmetry

Xunye Tian, Zhijian Zhou, Liuhua Peng, Feng Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08460v1 Announce Type: cross Abstract: Data-adaptive two-sample testing assesses if two samples come from the same distribution, using a discrepancy learned from the data (e.g., via kernel-based feature representations). Such methods typically rely on data splitting to decouple learning from testing and control type I error. However, this paradigm is ill-suited to few-shot settings with severe sample-size imbalance: abundant reference samples are available, while only a handful of query samples arrive. In this paper, we show how this imbalance can be leveraged constructively. Using abundant reference data, we learn reference-dependent representations that summarize salient structure of the reference distribution and provide informative signals for detecting departures. We incorporate a collection of representation families that capture both global and local structure, and adaptively weight them using only reference samples via an uncertainty-guided principle. Theoretically, we establish permutation-based type I error control and show consistency of the aggregated test: as the sample sizes grow, the test power converges to one whenever the representation set contains at least one consistent representation. Empirically, our aggregation achieves strong performance across a range of benchmarks while retaining type I error control.

Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08493v1 Announce Type: cross Abstract: \textit{Tissue graph counterfactuals} ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize \textit{tissue graph counterfactuals} as a class of spatial interventions that either rewire connections between cells (\textit{edge perturbation}) or modify the expression of their neighbors (\textit{node perturbation}). We then introduce \textit{Cellina} {\renewcommand{\thefootnote}{\ddag}\footnote{https://cellina.readthedocs.io}\addtocounter{footnote}{-1}}, a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, \textit{Cellina} outperforms spatially-informed and non-spatial competitors in tissue perturbations, disentanglement, and scalability. Additionally, we show that \textit{Cellina} reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

Fumiaki Yamaguchi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08505v1 Announce Type: cross Abstract: Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe: coarser segmentation stride and per-chunk embedding, yields multi-fold speedups and is DER-neutral on AMI, but degrades sharply on in-the-wild data: on VoxConverse, DER rises from 0.075 to 0.113. We trace the failure to speaker under-counting in the clustering stage, caused by a fixed minimum cluster size interacting with the reduced number of embeddings per speaker. We propose a relative minimum cluster size, mcs = round(f * n) with f = 0.01, which adapts to the embedding budget per recording. A single value of f recovers VoxConverse DER to 0.079 (about 89% of the lost accuracy) while keeping AMI flat, and the accelerated pipeline reaches up to 12.2x speedup on AMI (MPS) over our CAM++ baseline.

Almost balanced ordered biclique covering of graphs

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08506v1 Announce Type: cross Abstract: Let $f(n,k)$ be the minimum size of a collection of bicliques such that (i) every edge of the complete graph $K_n$ is covered by at least one and at most $k$ bicliques in the collection, and (ii) for each edge $\{u,v\}$, the number of bicliques in which $u$ appears in the first class and $v$ in the second class differs by at most one from the number of bicliques in which $u$ appears in the second class and $v$ in the first class. For $k=1$, $f(n,k)$ reduces to the biclique partition number of $K_n$, and the Graham--Pollak theorem gives $f(n,1)=n-1$. For $k=2$, $f(n,k)$ is the ordered biclique partition number of $K_n$, for which it is known that $c_1 n^{1/2} \le f(n,2) \le c_2 n^{1/2+o(1)}$ for some positive constants $c_1$ and $c_2$. In this note, we establish almost tight bounds for $f(n,k)$ for general $k$.

Fixed-Parameter Tractability of $t$-Uniform Hypergraphicality

Riley Brown, Istvan Miklos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08523v1 Announce Type: cross Abstract: We study the $t$-uniform hypergraphicality problem under a compressed representation of the degree sequence. Instead of listing all vertex degrees explicitly, the input consists of pairs $$ (\delta_1,n_1),\dots,(\delta_k,n_k), $$ meaning that exactly $n_i$ vertices have degree $\delta_i$. Thus the parameter $k$ denotes the number of distinct degrees. Although deciding $t$-hypergraphicality is NP-complete for every fixed $t>2$, we prove that the problem is fixed-parameter tractable parameterized by $(k,t)$. Our result shows that tractability extends substantially beyond previously known bounded-range regimes: even degree sequences with large overall degree spread can be handled efficiently when the number of distinct degrees is bounded. Our approach decomposes hyperedges according to their types with respect to the degree classes, yielding a bounded-dimension spectrum representation. Using balancing hinge-flips, we show that every feasible spectrum can be transformed into a realization of the prescribed degree sequence. This leads to an integer programming feasibility formulation with $$ \binom{t+k-1}{k-1} $$ variables. Applying Lenstra's theorem yields an FPT algorithm running in time $$ f(k,t)\cdot \mathrm{poly}(L), $$ where $L$ denotes the encoding length of the compressed input.

A Taxonomy of Real-World Asset Tokenization for Blockchain-Based Financial Infrastructure

Giorgio Vella, Luca Pennella, Mark C. Ballandies — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08534v1 Announce Type: cross Abstract: Real-world asset (RWA) tokenization has emerged as a prominent application of blockchain technology, enabling off-chain financial and non-financial assets to be represented through blockchain-based instruments. However, deployed RWA systems remain difficult to compare because legal claims, custody arrangements, token mechanics, verification processes, and on-chain integrations are often described separately. This paper develops a systems-level taxonomy of RWA tokenization to classify how off-chain assets are legally, economically, and technically represented on-chain. Following an iterative taxonomy-development method, we organize twenty-three dimensions into five components: governance, asset structure, token properties, distributed ledger technology, and economy. We apply the taxonomy to twenty major RWA systems selected by market capitalization and compare their design choices across asset classes and implementation models. The classification shows that current RWA tokenization is predominantly implemented through hybrid architectures: blockchain tokens support representation, transfer control, redemption workflows, pricing, and composability, while core legal guarantees remain anchored in off-chain legal wrappers, custodial arrangements, compliance processes, and verification mechanisms. The analysis also reveals recurring documentation gaps concerning voting rights, dispute forums, burn mechanics, supply constraints, and reserve verification. Overall, the taxonomy provides a structured basis for comparing RWA systems, identifying design patterns and limitations, and supporting future research on blockchain-based financial infrastructure.

Block coordinate descent for joint delay-energy optimization in multi-hop D2D networks

Kai-Xiang Hu, Jacek Gondzio, Caixia Kou — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08544v1 Announce Type: cross Abstract: In multi-hop device-to-device (D2D) networks, the optimization of network-level metrics is particularly difficult due to the tight coupling between network-layer routing and physical-layer resource allocation. Departing from traditional average-performance metrics, this paper addresses the joint optimization of routing paths, transmission power, and bandwidth allocation. We formulate a generalized cost function to minimize the maximum transmission time (i.e., the bottleneck delay) alongside the total energy consumption. To tackle the resulting highly non-convex formulation, we propose a novel block coordinate descent (BCD) framework. At the network layer, we develop two adaptive routing algorithms: a matrix-free Frank-Wolfe (MF-FW) algorithm for fast execution in dense topologies, and a low-rank primal-dual interior-point method (LR-PDIPM) that bypasses dense matrix inversions via the Sherman-Morrison formula for high-precision solutions. At the physical layer, we design a parallel dual ascent algorithm leveraging a time-domain perspective transformation to solve the resource allocation subproblem to global optimality. The proposed BCD framework is proven to converge to an {\epsilon}-neighborhood of a stationary point. Through comprehensive experiments, the proposed BCD framework establishes its superiority in achieving the optimal delay-energy trade-off. Specifically, the LR-PDIPM variant achieves a maximum 9.14-fold reduction in total energy consumption and up to an order of magnitude improvement in energy efficiency, while maintaining a bounded maximum delay gap (up to 3.78-fold) relative to the best baseline. Meanwhile, the warm-start MF-FW variant identifies near-optimal solutions in mere seconds, serving as a highly practical engineering approach.

G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08580v1 Announce Type: cross Abstract: Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.

Improving the sharpness in neural network-based parametric post-processing of ensemble forecasts

\'Agnes Baran, M\'at\'e Mihalina — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08587v1 Announce Type: cross Abstract: Statistical post-processing has proven to be an effective tool in improving ensemble forecast of different weather variables. Case studies show that post-processing can remedy the typically underdispersive and potentially biased behaviour of the ensemble while optimizing a proper scoring rule expressing the forecast skill. The price of these positive effects is generally a deterioration in sharpness; the width of the central prediction intervals and the uncertainty of the predictions are increasing, especially for shorter lead times. This work aims to reduce the extent of the latter phenomenon for neural network-based parametric post-processing methods by extending the network's loss function with a penalty term. We demonstrate the effect of the proposed technique for 2m temperature ensemble forecasts of the European Centre for Medium-Range Weather Forecasts downloaded from the EUPPBench benchmark dataset and verified against synoptic observations. Here, the predictive distribution is Gaussian, and we use the continuous ranked probability score (CRPS) as loss function. The case studies confirm a substantial relative decrease ($8.2\%-12.5\%$) in the width of the nominal central prediction interval compared to the width of the predictive distribution computed without the penalty term, while there is no deterioration in the mean CRPS of probabilistic forecasts and in the RMSE of the predictive mean.

Coil-Integrated Alignment Sensor for Real-Time Feedback of Coil-Scalp Contact Point and Angle During Transcranial Magnetic Stimulation (TMS)

B. Seyed, M. Koehler, S. M. Goetz — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08618v1 Announce Type: cross Abstract: Whereas coil positioning in transcranial magnetic stimulation (TMS) to reach a specific cortical target with modern focal stimulation coils has been intensively studied, the alignment and contact of a coil with the head is often ignored. Focal figure-of-eight coils have a point on the surface, where they generate the largest induced electric field. This point should touch the head first, and the coil should be approximately tangential to the head in this point. Previous research has demonstrated the large impact if the coil does not touch the head with the right point and that many operators struggle with establishing or maintaining the correct coil-scalp alignment. This paper presents a technological support technology that can monitor the exact position of the contact point and also pressure to provide feedback to users. As the system uses exclusively components from consumer electronics, the sensor is low-cost and affordable. Through proper design, we achieved sufficient robustness so that the sensor does neither reset during TMS pulses and also not show any detectable degradation.

Parameter Tuning with Generalization Guarantees for GPU-Accelerated Linear Programming

Siddharth Prasad, Dravyansh Sharma — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08638v1 Announce Type: cross Abstract: Recent research has developed practical, parallelizable first-order methods for large scale linear programming, but performance is highly dependent on hyperparameter selection. We derive generalization guarantees for hyperparameter tuning within (cu)PDLP, a state-of-the-art first-order LP solver designed for modern hardware. First, we pin down the behavior of PDHG, the primal-dual hybrid gradient algorithm that underlies PDLP, as a function of its step size and primal weight, leading to linear sample complexity guarantees for learning those parameters. We then conduct a structural analysis of PDLP, which augments PDHG with several specialized techniques like preconditioning, adaptive step sizes, averaging, adaptive restarts, and smoothed primal weight updates. Our analysis captures the behavior of the solution trajectory as a function of the hyperparameters and leverages recent advances in data-driven algorithm design to obtain polynomial sample complexity guarantees for learning those hyperparameters. Finally, we conduct proof-of-concept experiments that demonstrate the need for data-driven PDLP parameter tuning. Our results showcase the versatility of the data-driven algorithm design toolkit for principled hyperparameter tuning within solver-grade implementations of complex modern optimization algorithms.

Nonlocal Teams and Information Structures

Drishti Baruah, Sachin Teli, Ankur A. Kulkarni — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08645v1 Announce Type: cross Abstract: We look at Bell inequalities from the lens of information structures in stochastic teams. We consider the usual CHSH game and a dynamic variant of the same to study how various classes of strategies, classical, projective and quantum, behave under team theoretic solution concepts. We find that projective strategies (where each player performs projective measurements) enjoy important properties in the usual CHSH game, but they do not carry over to its dynamic version. These results shed light on the delicate interplay of information structure in quantum strategies and the fragility of some well known ideas under changes of information structure.

Reconstructing Synthetic SDO/AIA 193 A EUV Images from He I 10830 A Observations with Diffusion Model Translator

Marco Marena, Qin Li, Haimin Wang, Haodi Jiang, Prajwal Shah, Bo Shen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08652v1 Announce Type: cross Abstract: Routine full-disk EUV imaging has been available only since the modern era, such as SOHO and SDO. To extend EUV coronal context into earlier periods, we leverage the multi-decade availability of full-disk \HeI{} observations, whose absorption is modulated by coronal irradiance and magnetic topology and is widely used as a proxy for open-field regions. We present a diffusion-based conditional image translation framework, Coronal Hole-aware Diffusion Model Translator (CH-aware DMT), to reconstruct synthetic SDO/AIA 193 \AA{} EUV images from \HeI{} inputs. The model is trained on temporally co-aligned SOLIS \HeI{} and AIA 193 \AA{} pairs spanning 2011--2015 using a month-based split, where January--October are used for training, November is used for validation, and December for testing. On the held-out test set, the reconstructions preserve dominant full-disk EUV morphology (CC=0.92) and recover CH-related low-intensity structure (CC=0.84). We further assess historical applicability by (1) comparing reconstructed AIA 193 \AA{} morphology with SOHO/EIT 195 \AA{} over 2005--2015; (2) comparing reconstructed AIA 193 \AA{} images generated from KPVT \HeI{} inputs against Yohkoh/SXT soft X-ray observations; and (3) evaluating long-term reconstructed disk-integrated emission statistics against observational EUV series and independent solar activity proxies (sunspot number and F10.7 radio flux over 1974--2015). These results indicate that CH-aware DMT conditioned on \HeI{} can provide a physically plausible synthetic AIA 193 \AA{} coronal proxy for historical studies, supporting multi-decade analyses of large-scale coronal evolution before the direct EUV imaging was available.

Uncertainty Principles for the Number Theoretic Transform

Giulio Malavolta, Alon Rosen — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08662v1 Announce Type: cross Abstract: Motivated by polynomial identity testing with exponentials (Li and Wu, ITCS'26), we study uncertainty principles for the number-theoretic transform (NTT). We show that the NTT satisfies strong sparsity tradeoffs: For every fixed prime $q$ and for all but finitely many primes $p \equiv 1 \pmod q$ every nonzero $f\in \mathbb F_p^{\mathbb Z_q}$ and its number-theoretic transform $\hat f$ satisfy \[ |\mathrm{Supp}(f)| + |\mathrm{Supp}(\hat f)| \ge q+1. \] Thus, a $k$-sparse function has transform support at least $q-k+1$. As our main technical contribution, we prove a probabilistic version of the above uncertainty principle, averaged over primes $p$, in the regime $p=q^{O(1)}$. As an application, we obtain a black-box identity test for $k$-sparse exponential polynomials of degree at most $d$ with vanishing soundness error, for $q$ moderately larger than $k$.

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

Bitya Neuhof, Yuval Benjamini — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08679v1 Announce Type: cross Abstract: Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.

Discovering and decoding latent mean-field structure with variational autoencoders

Marco Biroli, Max Welling, Vincenzo Vitelli — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08694v1 Announce Type: cross Abstract: Generative models are increasingly used to capture correlations in many-body systems, but the representations they learn remain largely opaque to physical interpretation. Here, we establish an intuitive criterion that quantifies the capacity of a variational autoencoder (VAE) to faithfully reconstruct the joint probability distribution of a many body system. In a nutshell, a bound on the VAE capacity is obtained by comparing the rate of the latent channel to the bipartite mutual information of the data. Using this bound, we show that the conditionally independent decoder of any successful VAE is structurally identical to a finite-size mean-field factorization. Hence, a successful reconstruction is direct evidence for a latent mean-field theory and the microscopic parameters of that theory can be read off the trained decoder. We validate these conclusions on a hierarchy of solvable models with scalar (Curie-Weiss), vector (Hopfield) and tensor (Maier-Saupe) order parameters, recovering the full Hopfield pattern matrix from equilibrium samples alone. We find that, when applied to Salamander retinal recordings, a two-latent VAE reproduces the population statistics with only two effective collective variables allowing us to recover the `stored patterns' of the neural population and write a generalized Hopfield model which correctly models the experimental data.

EL-Shellability of the poset of ranked cactuses

Vincent Moulton, Andreas Spillner, Antonia Stavemann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08724v1 Announce Type: cross Abstract: Recently the poset of ranked cactuses $(\mathfrak{P}(X),\preceq)$ was introduced. For a finite set $X$, this poset consists of a set $\mathfrak{P}(X)$ of certain collections of ordered pairs of subsets of $X$ together with an ordering $\preceq$ that is similar to the refinement ordering of partitions of a finite set. In addition, the maximal chains in this poset correspond to binary ranked cactuses, a fact which can be used to construct the so-called space of equidistant cactuses. In this paper, we show that the poset of ranked cactuses is EL-shellable. As a consequence we also show that the proper part of the link of the origin of the space of equidistant cactuses has the homotopy type of a wedge of spheres.

Numerical Analysis on Backward Stochastic Differential Equations by Finite Transposition Method

Penghui Wang, Yanqing Wang, Xu Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08731v1 Announce Type: cross Abstract: In this paper, we propose a finite transposition method to solve backward stochastic differential equations (BSDEs, for short). Based on the transposition solution theory for BSDEs, our method offers a promising way of efficiently computing solutions, which can be regarded as an analogous method for BSDEs as the classical finite element method for partial differential equations. Our method has the advantage of easily computable conditional expectations.

Algebra of Bivariate-Bicycle Surface Codes

Renyu Wang, Leonid P. Pryadko — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08771v1 Announce Type: cross Abstract: We relate the properties of bivariate-bicycle-surface (BBS) codes, constructed from a pair of bivariate polynomials over a finite field, to the number and location of their common roots in the extension field. The number of roots $(x,y)$ with finite, non-zero coordinates -- counted with algebraic multiplicity -- determines the dimension of the codes. This dimension is invariant under monomial automorphisms of the Laurent polynomial ring. Conversely, roots with zero or infinite $x$- or $y$-coordinates indicate that specialized generators are required near the corresponding boundary (e.g., the left or right boundary for a root where $x$ is zero or infinite, respectively). These roots can appear or disappear under monomial transformations, which reveals the structure of tilted boundaries. Based on these results, we formulate a prescription for constructing BBS codes that works for regions with rectangular, diagonal, and arbitrarily tilted boundaries. A key advantage of this approach is that no corner corrections are needed, provided the polynomials satisfy orientation-specific edge conditions.

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

Ganzhao Yuan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08783v1 Announce Type: cross Abstract: Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, existing orthogonalized methods are typically paired with constant or open-loop magnitude rules, and therefore do not explicitly calibrate their update magnitudes from the observed optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary guarantees. OptMuon-A achieves the noise-adaptive rate $\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/2}T^{-1/4})$ under average smoothness, while OptMuon-I achieves $\tilde{\mathcal O}(T^{-1/2}+\sigma^{1/3}T^{-1/3})$ under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate $\tilde{\mathcal O}(T^{-1/2})$ without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.

Evaluating AI Investment Strategies

Irene Aldridge — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08791v1 Announce Type: cross Abstract: We study the problem of auditing a black-box algorithmic decision-maker from observable inputs and outputs alone. Our main result is an exact decomposition: under precisely characterized conditions, the cumulative \emph{regret} of a dynamic policy equals the sum of per-period covariances between the cost vector and the policy's decision. This extends the single-period identity of Aldridge~(2026) to the full multi-period setting of stochastic dynamic programming. We prove the identity holds exactly under i.i.d. costs and mean-unbiased Markov policies, derive closed-form bias corrections for non-stationary and time-varying cases, and establish the discounted-horizon analog. A Bellman recursion for the covariance regret functional connects the result to standard reinforcement learning algorithms; for rolling-window policies, the estimation-error bias is $O(d/w)$. The decomposition has direct implications for algorithmic auditing in strategic environments: in platform mechanism design, it provides a welfare-based audit metric without access to the agent's private type; in repeated games, covariance reduction is a sufficient condition for policy improvement; in procurement and ad auctions, the bias correction quantifies welfare loss from strategic misreporting. The associated trajectory estimator is consistent, asymptotically normal with HAC variance, and computable in $O(T \cdot nd)$ time. This makes the proposed approach a tractable, model-free audit tool for platform mechanisms, algorithmic portfolio strategies, and any sequential decision system subject to external performance review.

Generalization in Nonlinear Least Squares via Learned Feature Geometry

Ayub Kharel, Ilja Kuzborski, Patrick Rebeschini, Yasin Abbasi-Yadkori — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08799v1 Announce Type: cross Abstract: We study the generalization of ridge-regularized nonlinear least-squares models via on-average algorithmic stability, deriving error bounds for local minimizers in terms of a data-dependent effective dimension that reflects the geometry of the gradient model at the trained parameters, through the empirical Jacobian Gram matrix and a residual--curvature term. In the linear case, where the curvature term vanishes, this recovers the classical effective dimension of the Jacobian kernel covariance, but evaluated at the trained model rather than at initialization as is typical in neural tangent kernel analyses. We further bound this effective dimension via covering complexity of the gradient features, leading to guarantees that depend on learned geometry rather than parameter count. In particular, for manifold-supported data and piecewise Lipschitz Jacobians, the bounds scale with intrinsic dimension, while for one-hidden-layer ReLU networks, the mechanism can be made explicit through counts of activation-stable regions. Experiments on synthetic manifolds, clustered distributions, and benchmark datasets illustrate trained-Jacobian compression, the tightness of the residual-curvature linearization, and agreement between the stability bound and observed generalization gaps. A key feature of our bounds is the simplicity of their derivation, which follows from first principles using the Brascamp--Lieb inequality under strongly log-concave noise.

Optimal Control and Dissipativity of Linear Hermitian Matrix-Valued Dynamical Systems

Corentin Briat — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08856v1 Announce Type: cross Abstract: We develop a unified framework for linear-cost optimal control, finite-time optimal steering, dissipativity analysis, and zero-sum differential games for linear impulsive systems whose state is a Hermitian matrix evolving in $\mathbb{H}^{n+m}_{\succeq0}$, a class that encompasses continuous- and discrete-time linear systems and switched systems as degenerate cases, and includes the second-order moment dynamics of linear (stochastic) hybrid systems. The entire theory rests on three tools: a single \emph{key identity} relating cost, trajectory, and a dual variable, an Extended Schur complement lemma, and a Schur inner-product decomposition, applied identically to the flow integral and to each jump. These yield structurally uniform sufficient and necessary conditions, dual linear matrix inequality (LMI) characterizations, and explicit optimal policies for every problem class, on both finite and infinite horizons under time-varying assumptions (without time invariance or periodicity), together with causal dwell-time policies for the problems that admit them.

SCOPE: A Syndrome-Driven Control Plane for QEC-Enabled Quantum Networks

Xiaojie Fan, Zian Wang, Ashutosh Tiwari, Himanshu Gupta — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08873v1 Announce Type: cross Abstract: As quantum networks evolve from experimental testbeds to fault-tolerant systems, the primary performance metric shifts from physical link fidelity to end-to-end logical error rate. However, current control planes remain ill-equipped for this transition: routing decisions are typically decoupled from Quantum Error Correction (QEC) strategies, relying on topology or scalar fidelity metrics that fail to predict how specific physical noise structures interact with logical codes. Optimizing this coupled route-and-code performance requires precise, real-time visibility into network error biases, yet traditional active tomography is operationally prohibitive due to throughput collapse and service interruption. We present SCOPE (Syndrome-based COntrol PlanE), a network-layer architecture that enables joint routing and coding optimization using purely passive telemetry. Instead of injecting probes, SCOPE harvests error syndromes -- the parity-check outcomes naturally generated by QEC decoders during user service. By aggregating these signals, SCOPE's inference engine reconstructs the network's time-varying error map, capturing complex, context-dependent noise correlations. This visibility drives a decision engine that proactively pushes optimal route-and-code configurations to source nodes. NetSquid and IBM-calibrated simulations show that SCOPE reduces estimation error by more than 60% relative to a standard EM baseline. In large-scale networks, this precision reduces logical error rates by 30-35% (up to 65%) against topology-aware baselines.

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

Yanxiong Li, Guoqing Chen, Qianqian Li, Sen Huang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08898v1 Announce Type: cross Abstract: In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase without considering the possibility of decrease. However, the number of classes generally increases or decreases in practice. In this paper, we investigate a problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose a FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure dynamically changes with the change of classes. In addition, we design a pseudo class-variable training strategy to enhance the model's adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.

Estimate Collapsibility of Causal Effects in Completed Partial DAGs via Strong d-Convex Hulls

Yuxin Deng, Yi Sun, Zhiming Li, Huaxiong Liu — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08941v1 Announce Type: cross Abstract: This paper proposes a collapsible method for estimating causal effects that maintains the estimator's consistency before and after marginalization over some variables in completed partially directed acyclic graphs (CPDAGs). We first introduce the estimate collapsibility for CPDAGs and characterize the minimal collapsible sets as strong d-convex hulls. An efficient algorithm is devised to obtain such sets in DAGs and is generalized to CPDAGs. Then, we combine the graph reduction procedure with the IDA framework. Finally, experiments and empirical analysis show the effectiveness of the collapsibility for causal estimations in CPDAGs. Code is available at https://github.com/Jamyang-D/strongly-convex.

A systematic investigation of molecular encoding methods for drug property predictions across neural network and Transformer encoder-based model

Sheng-Ya Chen, Shan-Ju Yeh — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08973v1 Announce Type: cross Abstract: Fundamental investigations into how different molecular encoding methods affect molecular property prediction remain relatively limited. In this study, we extensively examined the optimal molecular encoding methods for molecular properties prediction using two prevalent structure designs: a classical neural network model (MLP) and a Transformer encoder-based model (MLP+TL). For molecular encoding methods, we investigated several types of fingerprints, including traditional topological fingerprints, substructure-based fingerprints, and string-based representations. These two models were trained on seven well-known molecular datasets to evaluate different input molecular encoding methods based on evaluation metrics. On several biologically relevant classification tasks, including toxicity, mutagenicity, and side-effect prediction, our models consistently achieved average AUC values above 0.9. Rather than relying on external post-hoc explanation methods such as the local interpretable model-agnostic explanation (LIME) or the Deep SHapley Additive exPlanations (SHAP), we leveraged the model's intrinsic attention weights as an internal interpretability signal for identifying potentially important feature. The MLP+TL model using MACCS and PubChem as input can capture chemically interpretable groups that determined the major blood-brain barrier (BBB) permeability and mutagenicity in Salmonella typhimurium. In particular, a comparison between Morphine and Heroin highlighted the role of hydroxyl-related substructures in BBB permeability prediction, which was consistently reflected in the attention weights. Overall, our findings provide practical guidance for selecting effective molecular encoding methods and contribute to the development of interpretable molecular informatics approaches for drug discovery.

Dynamics in a Low-Rank Separable Field Cellular Automaton

Xiaorui Shi, Mengsha Huang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08983v1 Announce Type: cross Abstract: Complex collective dynamics in cellular automata are usually associated with local-neighborhood combinatorics, yet it remains unclear whether long-lived dynamical organization requires such explicit local interaction structure. Here, we introduce a Separable-Field Cellular Automaton (SFCA), a normalized-field cellular automaton in which local neighbor counting is replaced by a rank-one-like row-column field. Each cell is updated according to a normalized field, with survival and birth governed by two threshold intervals. Systematic scans over interval widths and positions revealed four outcome classes: extinction, fixed points, cycles, and long transients. The outcome phase diagram was organized by the relative geometry of the survival and birth intervals: fixed points dominated when born interval was contained in survival interval, whereas long transients concentrated near the boundary between partial overlap and no overlap. A fine scan along this transition showed that the long-transient region forms a narrow but persistent ridge separating two qualitatively distinct cycle-dominated regimes. One side produced dense, high-change-rate cycles approximating global period-2 alternation, whereas the other produced sparse, low-change-rate, stripe-like cycles. Damage-spreading further supported a basin-competition interpretation, in which the long-transient ridge reflects delayed selection between two cyclic attractor families rather than random nonconvergence, while finite-size analysis shows that the long-transient ridge remains robust across tested grid sizes. These results show that structured long-transient dynamics can arise under compressed separable field coupling, suggesting that nontrivial collective organization does not necessarily require full local-neighborhood combinatorics.

Not All Warm Starts Help: Benchmarking Primal-Dual Initializations for ACOPF Algorithms

Babak Taheri, Daniel K. Molzahn — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.08984v1 Announce Type: cross Abstract: Warm starts are widely used to accelerate AC optimal power flow (ACOPF) solves, but the impact of different initialization strategies has received limited systematic study, particularly for the primal-dual interior-point methods that dominate large-scale ACOPF algorithms. This paper benchmarks initialization strategies for ACOPF solved with the interior-point solver IPOPT on 19 PGLib-OPF instances (5 to 30,000 buses), testing all 15 non-empty subsets of the primal blocks $\{P_g, Q_g, V_m, V_a\}$ under oracle conditions and three DC-seeded combinations in a practical setting. The experiments show that most partial primal-plus-dual restarts increase solve time or reduce convergence reliability. Among the oracle primal-plus-dual (O-PD) configurations, only the complete restart reliably converges on every baseline-convergent case, reaching a $47.6\%$ median solve-time speedup. Twelve of the 14 partial O-PD combinations have negative median speedups, and several fail repeatedly on larger networks. Decomposing the dual into constraint and bound multipliers shows that \emph{coverage}, not the presence of duals per se, governs robustness: the full bound-multiplier vector reaches 90.7\% convergence and a $+26.8$\% median speedup, whereas block-matched coverage (oracle multipliers on some bounds, defaults on the rest) drops to 70.4\% and $-31.1$\%. Practical DC seeding sometimes helps the AC solve, but the benefit is no longer statistically significant once the DCOPF presolve cost is included in the end-to-end comparison ($p = 0.4171$). For learned warm-start methods, the results support the following target ordering: predict the full primal vector first; if only partial coverage is possible, prioritize voltage variables; and avoid partial or inconsistent dual predictions unless the primal estimate is nearly complete.

Multi-Armed Bandits with Arriving Arms: Sequential Screening, Dynamic Regret, and Sublinear Guarantees

Deqi Zheng, Xiaoyang Xu, Yuhong Yang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09002v1 Announce Type: cross Abstract: We study a stochastic multi-armed bandit problem in which the set of available arms expands over time. This setting arises in sequential experimentation when new actions or treatments become available during an ongoing study, making regret against a single best arm in hindsight inappropriate. We instead evaluate performance relative to the best arm currently available, leading to a dynamic-regret criterion for arriving-arm environments. To address the resulting challenges of arrival information discrepancy (AID) and a drifting benchmark (DB), we propose UCB for Arriving Arms (UCB-AA), an elimination-based procedure with an aiding preliminary screening step for newly arrived arms before full competition with incumbent arms. We show that UCB-AA attains regret bounds that depend explicitly on the arrival process, achieves sublinear dynamic regret under regularity conditions on gap evolution, and admits an online extension for unknown horizons. Simulation results show that UCB-AA reduces wasted pulls and maintains a smaller active arm set while preserving competitive regret performance.

BareWave: Waveform-Native Flow-Matching Text-to-Speech

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09048v1 Announce Type: cross Abstract: Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.

Data augmented bootstrap: Unifying confidence interval construction by approximate invariance

Kevin Han Huang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09049v1 Announce Type: cross Abstract: We propose the data augmented bootstrap (DAB), a framework for constructing confidence intervals from approximately invariant transformations of the data. As special cases, DAB recovers popular methods that rely on exact group symmetries, such as conformal prediction, wild bootstrap for Maximum Mean Discrepancy U-statistics and the recently proposed SymmPI. Meanwhile, DAB also recovers the classical bootstrap method, which exploits the dataset's approximate invariance under uniform sampling of data indices as the dataset size grows. For all DAB methods, we establish theoretical coverage results that interpolate between finite-sample and asymptotic guarantees according to the strength of the invariance, and without assuming a group structure. The approximate invariance is measured in the Kolmogorov distance and, for statistics that satisfy Gaussian universality, reduces to conditional mean and variance matching. This allows us to incorporate data augmentation (DA), a widely used machine learning heuristic based on approximate invariances, into known statistical methods. We empirically test the performance of incorporating DA into bootstrap, wild bootstrap and conformal prediction for simulated settings as well as for image, language and scientific data.

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09050v1 Announce Type: cross Abstract: Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09141v1 Announce Type: cross Abstract: Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.

The Size of the Intersection of $q$-ary Hamming Balls

Ville Junnila, Tero Laihonen, Tuomo Lehtil\"a, Pavan Padavu Devaraj — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09158v1 Announce Type: cross Abstract: The interest in studying the size of the intersection of multiple $q$-ary Hamming balls has grown due to the recent advances in DNA-based data storage systems. We present an exact formula for the cardinality of the intersection of $s$ Hamming balls of varying radii over a $q$-ary alphabet. It is known that the distances between the center points of the Hamming balls are not enough, in general, to determine the size of the intersection. Based on our formula, we are able to find more refined structural properties of the center points for determining the exact size of the intersection. Moreover, we also analyze the size of the intersection for sufficiently large $n$. When $s=3$, we give the necessary and sufficient conditions (for all $q\ge 2$, $q\neq 6$ and sufficiently large $n$) to obtain the maximum size of the intersection when the center points of the Hamming balls have a given minimum distance and demonstrate how to compute it using our general formula.

Multilevel Stochastic Gradient Descent for Risk-Averse PDE-Constrained Optimization

Niklas Baumgarten, Philipp A. Guth, David Schneiderhan, Tommaso Vanzan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09291v1 Announce Type: cross Abstract: We present recent advances in applying and analyzing multilevel stochastic gradient descent algorithms to risk-averse, three-dimensional PDE-constrained optimization problems. The algorithm uses adaptive multilevel Monte Carlo gradient estimates, provides parallel scalability as well as improved convergence rates and computational complexity compared to standard batched stochastic gradient descent methods. We study the method in computationally demanding settings using three-dimensional elliptic diffusion problems and large risk-aversion parameters.

Declines in research funding and science ecosystem fragility

Anbang Du, Beining Zhang, Rifat Atun, Michael G Head, Markus Brede — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09387v1 Announce Type: cross Abstract: Scientific knowledge advances through within-country and cross-border scientific activities and collaborations, influenced by funding and strength of research enterprise. Sudden declines in research funding, for example from Federal sources in the United States (US) 2024-25, adversely impact on scientific collaboration. How rapid declines in funding affect the science enterprise and the magnitude of impact need to be analysed. Past studies have modelled the global scientific system as complex collaborative networks of entities and studied its topology and dynamics. However, these studies have not undertaken compensation analysis to real-world shocks that have produced rapid declines in scientific research funding. In this study we examine the effect of the sharp declines in the US Federal funding on cancer science research enterprise globally. We model the cancer science ecosystem as a 5-layer multiplex network of collaborative linkages between 233 countries and territories in grants and clinical trial co-investigations, paper co-authorships, co-inventions and patent co-ownerships. We quantify information flow in the multiplex system through network efficiency. Proposing a framework for compensation analysis, we show that sharp declines in US Federal funding for research degrade global information exchange in science, imposing outsized compensatory burdens on country groups such as the European Union (EU) and BRICS (Brazil, Russia, India, China and South Africa). However, we also show that if other countries provide more support for international collaborations, there is an opportunity to remodel the cancer science system to be more resilient to future shocks.

SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths

Timo Hei{\ss}, Julia Herbinger, Bernd Bischl, Giuseppe Casalicchio — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09404v1 Announce Type: cross Abstract: Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quantify interactions without revealing their functional form, or visualize only restricted interaction types. We propose Surrogate-based Analysis of Interactions via Local effect Smooths (SAILS), a model-agnostic framework that analyzes pairwise interactions through interpretable generalized additive model (GAM) surrogates fitted to the local effects of a black-box model. For each interval of a feature of interest, the surrogate smooth terms isolate the interaction components on derivative level, enabling (i) interaction detection through a heuristic derived from significance tests on smooth terms, (ii) interaction form categorization into linear, product-separable, and non-product-separable types, and (iii) tailored, interpretable visualizations for each interaction type. We empirically validate the framework through controlled simulations and a real-world task, demonstrating its effectiveness for pairwise interactions, with limitations under strong feature correlations and higher-order interactions. SAILS fills a notable gap in the XAI toolbox, going beyond detection of interactions alone to characterizing their functional form.

Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09419v1 Announce Type: cross Abstract: Artificial intelligence is rapidly advancing materials characterization, yet most applications in electron microscopy rely solely on image contrast, overlooking the chemical and experimental context that shapes image formation. This limitation makes defect classification inherently ambiguous, as similar contrasts can arise from different materials or imaging conditions. Here we develop a context-aware learning framework that integrates image-derived contrast with metadata describing composition, beam energy, and detector geometry. Using a systematically constructed dataset of ~55 million simulated patches spanning 576 cases across 96 doped monolayer transition-metal dichalcogenides, we show that conditioning on contextual variables transforms defect classification from an ill-posed image-only task into a well-posed, physically grounded problem. The framework achieves over 98% accuracy on simulations and near-human agreement on experimental data, with a 94% reduction in posterior entropy. By emphasizing contextual grounding over architectural complexity, this approach links experimental image contrast to the underlying chemical and imaging conditions, supporting physically grounded defect assignments and a general pathway toward multimodal AI models for autonomous materials characterization.

The macroscopic Kaehler metric of Geometric Thermodynamics versus the microscopic one on the Event Manifold: Exact Partition Functions on CV manifolds. Extended Souriau temperatures and spontaneous magnetizations

Pietro Fr\'e, Alexander S. Sorin, Mario Trigiante — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09438v1 Announce Type: cross Abstract: In this paper we clarify the relation between Geometric Thermodynamics and Information Geometry based on the Fisher matrix. On the macroscopic odd-dimensional contact manifold of thermodynamic variables, we introduce for the first time a metric, whose pull-back on the isoentropic symplectic submanifolds transverse to the Reeb field is K\"ahlerian. The pull-back of such metric on equilibrium states, that are lagrangian submanifolds, is the Fisher Hessian. Then we consider the Souriau-like Thermodynamics that uses Calabi-Vesentini (CV) manifolds as Kaehlerian microscopic event manifolds and the Killing moment maps as observable functions. A systematic use of the theory of compact abelian structures and the setup of Special K\"ahler Geometry in which CV manifolds are encoded allows us to perform the explicit integration defining the partition function for any entry in the CV Tits Satake universality class. The additional actions completing the abelian structure are non linear Casimir functions of the Killing moment-maps and suggest a generalization of Souriau thermodynamics that partially breaks the isometry group symmetry by means of the non vanishing mean values of the Casimir functions in a manner similar to the spontaneous magnetization in ferromagnetism. Our new exact Gibbs distributions provide the analogue for Cartan Neural Networks of the Gaussian probability distributions in flat space used in conventional Machine Learning.

Metric-Free Riemannian Optimization

Jonas P\"uschel — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09465v1 Announce Type: cross Abstract: Riemannian optimization provides a powerful framework for constrained optimization by incorporating problem-specific structure directly into the geometry of the search space. In many applications, however, the explicit evaluation or application of the Riemannian metric can be computationally expensive or numerically unstable, limiting the practical efficiency of otherwise well-founded algorithms. Motivated by such settings, this work investigates to what extent classical Riemannian optimization algorithms can be reformulated without explicitly applying the metric. We show that many first-order components of Riemannian optimization only rely on the differential of the objective function and access to the Riemannian gradient, but not on explicit metric application. Based on this observation, we develop metric-free formulations and generalize optimization approaches to Finsler and Banach manifolds. Numerical experiments demonstrate that the proposed metric-free strategies retain the effectiveness of their metric-dependent counterparts while significantly reducing computational overhead. These results highlight that a substantial portion of Riemannian optimization can be carried out independently of explicit metric application, broadening its applicability to problems with expensive or implicitly defined metrics.

Report the Floor: A Training-Free Conformal Interval Is a Mandatory Baseline for Probabilistic Time-Series Forecasting

Valery Manokhin — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09473v1 Announce Type: cross Abstract: Probabilistic forecasters are increasingly learned, yet the baselines they are compared against are often weak or omitted. We show that the simplest possible conformal interval - a last-value point forecast wrapped in a finite-sample split-conformal residual quantile, with no parameters and no training - is a far stronger baseline than its near-total absence from recent learned-forecasting and conformal-time-series comparisons would suggest. In one-step-ahead online forecasting across 2,217 real series from nine public sources (Monash, LOTSA, the LTSF traffic/electricity/weather suites, METR-LA, BOOM, nips/probts), this ConformalNaive interval decisively beats the naive value-quantile baselines, the entire NPTS family (NPTS 73%, SeasonalNPTS 64% of series), and the published Conformal Seasonal Pools (CSP) method (71% of series, bootstrap 95% CI [69,73], paired Wilcoxon p approx 7.6e-135); it is on par with the simpler learned conformal predictors (RCI, quantile regression; median relative Winkler within 2%) and is beaten only by the adaptive-online and ensemble methods (SPCI, ACI, AgACI), which track distribution shift and lead by 9-33% relative Winkler. It is also better calibrated than a trained neural forecaster: on the six datasets that introduced DeepNPTS, the trivial floors cover the truth 84-85% of the time at a nominal 95%, versus DeepNPTS's 66%. At multi-step seasonal horizons the picture inverts: the random-walk floor is the weakest method and the seasonal pool (CSP) wins - a boundary we map. Finally we give ConformalNaive+, a one-line, training-free, horizon-adaptive selector that attains the better of two complementary floors at every horizon with restored coverage. We argue the matching conformal naive floor must be a mandatory baseline whenever a learned probabilistic forecaster claims gains.

Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration

Junyi Gong, Zijie Qiu, Ben Zhong Tang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09520v1 Announce Type: cross Abstract: Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer this question with scalar feedback loops-generate, score, reject-that amount to informed trial-and-error. Here we show that replacing a single number with the full physicochemical rationale from first-principles calculations transforms the LLM from a stochastic sampler into a causal reasoner. Our system couples retrieval-augmented generation with a self-reflection module that feeds orbital energies, atomic charges, and electron densities-rather than compressed scores-back into the design loop. On HOMO-LUMO gap targets from 1.0 to 5.0 eV, this structure-property-relationship (SPR) reflection achieves a deviation as low as 0.0003 eV and a 100% success rate on moderate tasks, decisively outperforming scalar-feedback and non-reflective baselines. The framework generalizes seamlessly to dipole-moment design and proves robust across five distinct LLM backbones. These results establish a new paradigm: when the model understands not only that a molecule fails, but why, iterative molecular design becomes genuinely mechanistic.

Automating the Expert Eye: A System-Agnostic Deep Learning Framework for Rare Event Discovery in Imbalanced Force Spectroscopy

Jorge Rodriguez-Ramos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09541v1 Announce Type: cross Abstract: Single-Molecule Force Spectroscopy (SMFS) provides unprecedented insights into biomolecular mechanics, yet the high-throughput generation of force-extension trajectories creates a severe data curation bottleneck. Identifying rare molecular unbinding events within thousands of noise-dominated curves traditionally relies on tedious, non-scalable manual auditing. Here, we present a system-agnostic, interpretable deep learning framework tailored to overcome extreme class imbalance in automated SMFS triage. Utilizing 1D-to-2D rasterized geometric matrices, we deployed a modified ResNet18 architecture governed by an asymmetric Focal Loss objective function. We evaluated this framework on the complex mechanical unfolding pathways of the R. champanellensis cellulosome. Under hyper-imbalanced test conditions where the target interaction constituted only 1.34% of the dataset (13 true events out of 970 traces), the model achieved an overall accuracy of 0.9196 and a remarkable True Positive Rate (Recall) of 0.9231. By implementing an empirically calibrated dual-threshold triage system, the pipeline automatically discarded 880 unambiguous background noise traces , reducing the manual curation workload by over 90% while safely preserving high-value rare data. Finally, Gradient-weighted Class Activation Mapping (Grad-CAM) visually validated that the network's decisions are firmly anchored in the relevant geometric features of the force curves, specifically localizing on the structural unbinding regions, effectively mitigating 'black-box' skepticism. Built for free cloud-based execution, this open-source tool democratizes scalable, highly precise molecular discovery across the biophysics community.

Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09558v1 Announce Type: cross Abstract: Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.

Constraint residuals, graph posteriors, and determinant-corrected full-space targets in Bayesian inverse problems

Jonathon Cottom, Emilia Olsson — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09594v1 Announce Type: cross Abstract: Bayesian inverse problems constrained by state equations are often sampled in a full parameter-state space by penalising the residual, rather than in a reduced space where the state is eliminated. We show that these formulations are not automatically equivalent as posterior measures. For finite-dimensional discretisations of equality-constrained inverse problems, assume the state equation $c(\theta,u)=0$ has a unique solution $u=G(\theta)$ and nonsingular state Jacobian $\D_u c$. The reduced posterior, its graph lift, and the zero-noise residual posterior are then distinct. A local change of variables shows that an uncorrected Gaussian residual penalty converges, after marginalisation over $u$, to the reduced density multiplied by $\abs{\det \D_u c(\theta,G(\theta))}^{-1}$. Thus algebraically equivalent residuals can define the same feasible set but different limiting posteriors. We derive determinant corrections for unweighted, weighted, and rescaled residual penalties that have the graph-lifted reduced posterior as their hard-constraint limit. The result separates feasibility from posterior calibration: driving the residual to zero is not sufficient for exact sampling of the graph-lifted reduced posterior unless the sampling or correction step targets the corresponding corrected density.

Powering the Future of AI: Navigating the Trade-offs for Europe's Energy Transition and Net-Zero Goals

Mohammad Hemmati, Gbemi Oluleye, Vassilis M. Charitopoulos — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09617v1 Announce Type: cross Abstract: The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them as a structurally challenging component in power system planning and operation. Using a spatially explicit optimisation model of Europe across 21 AI growth scenarios, we systematically quantify additional demand, capacity requirements, emissions, and operational impacts of DCs. Results indicate that AI could drive 73-723 TWh of extra demand by 2050, risking cumulative emissions overshoots of 67-181 MtCO2 between 2030 and 2050. Our analysis indicates that after 2030, the geography of AI infrastructure will be shaped more by firm power and system flexibility than by the mere abundance of clean energy. In moderate scenarios, AI requires an additional of 200 hours of firm generation, which increases LCOE by 35 EUR/MWh in key hubs. We show that even under the pessimistic scenarios, existing infrastructure would require 70 GW additional capacity, while under managed growth pathways, this expansion could reach 226 GW. We further find DCs workload dynamics strongly shape energy dispatch, system flexibility, and emissions, while improved efficiency significantly reduces capacity needs, and system peaks. While our findings suggest that net-zero targets for 2050 may be achieved, critical emission risks may appear in the intermediate years, and the EU may compromise its carbon-neutral goals unless policies adapt to this accelerating digital transformation.

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09667v1 Announce Type: cross Abstract: Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

Dohwan Kim, Jung-Woo Choi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09677v1 Announce Type: cross Abstract: While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.

A Bell-State Extension of Loop-Back Quantum Key Distribution

Luis Adri\'an Lizama-P\'erez — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09723v1 Announce Type: cross Abstract: Bidirectional quantum key distribution (QKD) protocols face persistent challenges related to classical disclosure, confinement of the signal space to predictable subspaces, and limited detectability under substitution or entanglement-swapping attacks. In this work, we present a Bell-state extension of the Loop-Back QKD architecture that improves efficiency and detectability while preserving its defining feature of a simplified, measurement-free remote terminal. The protocol employs entangled Bell states together with deterministic local Pauli encoding at the remote node. A central element is that Alice privately prepares and knows the initial Bell state, which serves as a hidden reference enabling her to interpret the Bell-state transition induced by Bob, while preventing an adversary from reconstructing the encoding without access to this reference. By exploiting both intra- and inter-family Bell transitions, the scheme expands the effective signal space beyond the subspace restrictions of earlier two-way protocols. Alice performs a Bell-state measurement to deterministically infer Bob's operation without any basis sifting. Although the traveling subsystem remains locally maximally mixed, concealing the initial Bell family amplifies disturbance under separable substitution strategies, yielding an intrinsic detection probability of approximately 3/4 per round. From an efficiency perspective, the protocol lifts the intrinsic post-selection limitation of single-qubit Loop-Back schemes: the effective throughput is bounded only by the Bell-state measurement success probability, reaching up to 50% in linear-optical implementations. These features make the proposed scheme particularly suitable for mobile or edge-based QKD scenarios, where passive remote nodes must operate under high loss and limited interaction times.

Quantum Cut Sparsifiers

Arpon Basu, Joshua Brakensiek, Pravesh K. Kothari, Aaron Putterman — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09728v1 Announce Type: cross Abstract: In this paper, we continue a line of research initiated by Basu, Brakensiek, and Putterman [2026] studying the sparsifiability of Hamiltonians. We focus particularly on the sparsifiability of the widely-studied Quantum Cut (QC) Hamiltonians. Our main result is that in an $n$-qubit system, any $n$-qubit QC Hamiltonian can be sparsified to $\widetilde{O}(n /\varepsilon^2)$ many terms while preserving the energy of every state up to a factor of $1 \pm \varepsilon$. Our result can be interpreted as giving an importance sampling scheme for the edges of an arbitrary graph $G$ such that the \emph{Kikuchi} graph at level $\ell$ of the sampled graph is a spectral approximation to the Kikuchi graph of $G$. Importantly, the \emph{same} sampling scheme works simultaneously for all $\ell$. The natural approach of leverage score sampling, analyzed via matrix concentration inequalities, yields a polynomially worse bound in our setting because the underlying matrices have dimension $\sim 2^n$. Instead, our approach relies on decomposing the action of these matrices into invariant subspaces. Then, by using an operator-valued inequality of Alon and Kozma [Ann. Henri Poincar\'e, 2020], itself building on an \emph{octopus inequality} of Caputo, Liggett, and Richthammer [J. AMS, 2010], we extend our sparsification technique to all expander graphs. We then invoke expander decomposition to extend our sparsifier to all graphs.

Adaptive directional gradients for parameterised quantum circuits

Brian Coyle, Snehal Raj, Virag Umathe, El Amine Cherrat, Elham Kashefi — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09734v1 Announce Type: cross Abstract: Training parameterised quantum circuits (PQCs) on quantum hardware is bottlenecked by the measurement cost of gradient estimation, which under the parameter-shift rule scales linearly in the number of trainable parameters and dominates the total shot budget of training at scale. In this work, we propose a framework of forward gradient estimators for PQCs, based on the forward mode of automatic differentiation, that yields an unbiased estimator of the gradient by averaging a freely tunable number of random directional derivatives and recovers SPSA, random coordinate descent, and the parameter-shift rule as limiting cases, with no ancilla qubits or controlled-gate overhead. We prove that stochastic quantum forward gradient descent converges under standard assumptions, with an explicit second-moment expansion that interpolates between the single-direction extreme of SPSA and the full-gradient extreme of parameter-shift. Within this framework we derive QUIVER (Quantum Iterative V-adaptive Estimator Rule), an adaptive optimiser for parameterised circuits whose update rule follows from a closed-form minimum measurement-cost allocation. We show numerically that forward gradients train Hamming-weight-preserving orthogonal quantum neural networks with up to 60 qubits and 1770 parameters on the ECG5000 and MNIST datasets orders of magnitude more efficiently than the parameter-shift rule. We also demonstrate that our proposed QUIVER optimiser can outperform iCANS and gCANS measurement-frugal optimisers on optimisation problems using the quantum approximate optimisation algorithm and quantum simulation with the variational quantum eigensolver.

Almost-perfect packings and Tuza's conjecture in the random geometric graph

Patrick Bennett, Ryan Cushman, Andrzej Dudek, Xavier P\'erez-Gim\'enez — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09736v1 Announce Type: cross Abstract: The triangle packing number $\nu(G)$ of a graph $G$ is the maximum size of a set of edge-disjoint triangles in $G$. Tuza conjectured that in any graph $G$ there exists a set of at most $2\nu(G)$ edges intersecting every triangle in $G$. We show that Tuza's conjecture holds in the random geometric graph for a large range of densities. We also study the problem of covering almost all edges of the random geometric graph with edge-disjoint copies of some fixed graph $F$. In particular, we show the existence of almost-perfect packings for an infinite family of $F$, and state some negative results as well.

Discovering Functionally Selective Brain Regions with a Deep Topographic Multimodal Model

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09770v1 Announce Type: cross Abstract: Nearby neurons in cortex share similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce aspects of this structure but remain unimodal and spatially constrain each layer separately, yielding fragmented maps that capture neither the contiguity of cortical processing streams nor their integration across modalities. We introduce Topo-Omni, a topographic multimodal model in which visual, auditory, and language/cognitive processing share a single contiguous in-silico sheet. Built by fine-tuning a pretrained foundation model with a spatial smoothness objective, this architecture develops clusters across modalities that are consistent with human neuroimaging, from sensory to cognitive systems. Driving or suppressing a cluster selectively biases or impairs perception, paralleling human intervention studies. Finally, we use our model to screen for novel clusters in-silico and discover new natural landscape and animal networks which we validate in human data. A single spatial principle thus organizes representations across modalities and processing stages, yielding testable hypotheses about cortical organization.

Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution

Yifan Wang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09778v1 Announce Type: cross Abstract: Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filter can silently repair an incompetent upstream policy, so that post-filter success measures the filter, not the policy. We argue that safe policy learning should ask who earns the safety - the policy or its protective layers - and we make this question measurable. We introduce Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), which (i) trains a compact variational quantum circuit (VQC) policy under a primal-dual intervention budget that penalizes reliance on a differentiable Control-Barrier-Function (CBF) projection, and (ii) is evaluated with a safety-attribution protocol that decomposes the executed-trajectory correction into a CBF term and a deployment runtime-guard term, and stress-tests the policy with guard-off evaluation. On closed-loop, high-fidelity BOPTEST building-control emulators (5 seeds, 60 episodes per method), intervention-aware training significantly lowers the quantum policy's raw pre-filter violation and total safety-layer reliance (both p < 10^-4) with no significant energy regression; at an equal approximately 400-parameter budget the quantum policy is significantly safer and more comfortable than a matched classical policy. Guard-off evaluation confirms the improvement is policy-level and exposes a valuable negative result: a learned differentiable energy head is only safe when paired with a distribution-aware runtime guard. The attribution protocol is general beyond quantum policies and buildings.

Biclique decompositions from Welzl orders

Jean Cardinal, Rose McCarty, Yelena Yuditsky — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09785v1 Announce Type: cross Abstract: A biclique decomposition of a graph is a partition of its edges into complete bipartite subgraphs. We consider graphs whose vertices can be ordered such that the neighborhood of every vertex is the union of a sublinear number of intervals. We observe that these graphs admit compact representations in the form of biclique decompositions of small size. Here, the size of a decomposition is measured as the sum of the number of vertices of its bicliques. Combining this result with the existence of suitable vertex orderings for graphs of low neighborhood complexity, as proven by Welzl in 1988, we recover and extend several known results up to logarithmic factors. These results include upper bounds on the Zarankiewicz problem, matrix multiplication, quantum circuit complexity, and shortest path algorithms in ``well-structured'' instances.

Finite-n Estimate of Dedekind Numbers by Layer-Ratio Monte Carlo

Tian-Shun Chen, Hao Feng, Haozhe Wang, Kilar Zhang — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09795v1 Announce Type: cross Abstract: Dedekind's problem counts monotone Boolean functions, equivalently downsets of a Boolean lattice. We recast this enumeration as a finite layer-ratio reconstruction problem for the Whitney numbers of the ranked ideal lattice. An exact adjacent-layer double count expresses each layer ratio through local averages of the number of addable elements and the number of removable elements. Reversible fixed-layer Markov chains estimate these averages and hence estimate the Dedekind number M(n). Backtests at M(8) and M(9) calibrate seed-level variability under the fixed protocol and measure the observed Monte Carlo budget scaling. The resulting estimate probes the Whitney-number sequence of the ideal lattice. Although these rows have previously been described empirically as unimodal, the high-precision n=9 estimate has a shallow two-shoulder feature around the central rank, contrary to that empirical description; n=11 and n=13 center-window estimates show a larger-contrast analogous pattern. The protocol estimate for M(10) is \[ \widehat M(10)=(8.9360\pm0.0010)\times 10^{78}, \] where the displayed uncertainty is the budget-based forecast scale from the cross-n scaling law under the production budget.

On the generalized Tur\'an number of complete bipartite graphs

Oliver Janzer, Sean Longbrake, Liana Yepremyan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09801v1 Announce Type: cross Abstract: For graphs $F$ and $H$, the generalized Tur\'an number $\mathrm{ex}(n,F,H)$ denotes the maximum number of copies of $F$ in an $H$-free graph on $n$ vertices. We prove that if $s\in \{2,3\}$, $s< a\leq b$ and $t$ is sufficiently large, then $\mathrm{ex}(n,K_{a,b},K_{s,t})=\Theta(n^s)$. The $s=2$, $a=b=3$ case of this result answers a question of Spiro. Proving another conjecture of Spiro, we show that for every graph $F$ with at least one edge, there exist infinitely many real numbers $r$ such that $\mathrm{ex}(n,F,H)=\Theta(n^r)$ holds for some graph $H$.

Weighted universal approximation of differentiable maps on infinite-dimensional manifolds

Philipp Schmocker, Josef Teichmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2606.09820v1 Announce Type: cross Abstract: We generalize the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. A FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, we establish a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. This leads us to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. As a further application, we show that linear functions of the signature are able to approximate path space functionals including their directional derivatives.

New Combinations of Polynomial Root-Finding Iterations

Victor Y. Pan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:1705.00729v4 Announce Type: replace Abstract: Some near-optimal polynomial root-finders of 2024-25, based on subdivision iterations, approximate all complex roots of a polynomial or all roots lying in a fixed Region of Interest in the complex plane. We combine these iterations with Newton's and/or Schroeder's to yield significant empirical acceleration versus each approach standing alone. Like the cited recent algorithms, our root-finders can be applied not only to a polynomial represented in monomial basis by its coefficients but also to a black box polynomial represented by an oracle (black box subroutine) for its evaluation. Some by-products of our study such as an extension of the Gauss-Lucas theorem and a fast black box estimator for root radius can be of independent interest.

Measuring a hate speech spectrum with faceted Rasch item response theory and perspective-aware, explainable-by-design deep learning

Chris J. Kennedy, Geoff Bacon, Alexander Sahn, Claudia von Vacano — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2009.10277v2 Announce Type: replace Abstract: We propose a system for measuring hate speech on a continuous, interval-valued spectrum ranging from genocidal to supportive speech by combining supervised deep learning with faceted Rasch item response theory (IRT). We decompose the theoretical construct of hate speech into constituent concepts operationalized as 10 ordinal labels. Those labels are reconstituted via IRT probabilistic latent modeling into an interval outcome measure while simultaneously estimating and adjusting for each annotator's labeling perspective. Our scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction, allowing design-based explainability of the continuous score through those components. We apply this method to a new, open source dataset of 50,070 social media comments sourced from YouTube, Twitter, and Reddit, annotated and labeled by 11,143 United States-based Amazon Mechanical Turk workers. Our RoBERTa-based model shows improved accuracy compared to alternative approaches. This system offers a new paradigm for supervised NLP that encourages continuous rather than binary constructs, and design-based incorporation of annotator perspective and model explainability.

An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution

Yajue Yang, Zeqing Zhang, Yuanqing Wu, Jia Pan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2104.12183v2 Announce Type: replace Abstract: The general inverse kinematics (IK) problem of a manipulator, namely that of acquiring the self-motion manifold (SMM) of all admissible joint angles for a desired end-effector pose, plays a vital role in robotics modeling, planning and control. To efficiently solve the generalized IK, this paper proposes an interval branch-and-bound-based approach, which is augmented with a fast numerical IK-solver-enabled search heuristics. In comparison to independent solutions generated by sampling based methods, our approach generates patches of neighboring solutions to provide richer information of the inherent geometry of the SMM for optimal planning and other applications. It can also be utilized in an anytime fashion to obtain solutions with sub-optimal resolution for applications within a limited period. The performance of our approach is verified by numerical experiments on both non-redundant and redundant manipulators.

Generalized binary utility functions and fair allocations

Franklin Camacho, Rigoberto Fonseca-Delgado, Ram\'on Pino P\'erez, Guido Tapia — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2109.08461v2 Announce Type: replace Abstract: The problem of finding envy-free allocations of indivisible goods can not always be solved; therefore, it is common to study some relaxations such as envy-free up to one good (EF1). Another property of interest for efficiency of an allocation is the Pareto Optimality (PO). Under additive utility functions, it is possible to find allocations EF1 and PO using Nash social welfare. However, to find an allocation that maximizes the Nash social welfare is a computationally hard problem. In this work we propose a polynomial time algorithm which maximizes the utilitarian social welfare and at the same time produces an allocation which is EF1 and PO in a special case of additive utility functions called buyer utility functions. Moreover, a slight modification of our algorithm produces an allocation which is envy-free up to any positively valued good (EFX).

SFILES 2.0: An extended text-based flowsheet representation

Gabriel Vogel, Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2208.00778v2 Announce Type: replace Abstract: SFILES are a text-based notation for chemical process flowsheets. They were originally proposed by d'Anterroches (Process flow sheet generation & design through a group contribution approach) who was inspired by the text-based SMILES notation for molecules. The text-based format has several advantages compared to flowsheet images regarding the storage format, computational accessibility, and eventually for data analysis and processing. However, the original SFILES version cannot describe essential flowsheet configurations unambiguously, such as the distinction between top and bottom products. Neither is it capable of describing the control structure required for the safe and reliable operation of chemical processes. Also, there is no publicly available software for decoding or encoding chemical process topologies to SFILES. We propose the SFILES 2.0 with a complete description of the extended notation and naming conventions. Additionally, we provide open-source software for the automated conversion between flowsheet graphs and SFILES 2.0 strings. This way, we hope to encourage researchers and engineers to publish their flowsheet topologies as SFILES 2.0 strings. The ultimate goal is to set the standards for creating a FAIR database of chemical process flowsheets, which would be of great value for future data analysis and processing.

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2208.00859v2 Announce Type: replace Abstract: We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.

Toward automatic generation of control structures for process flow diagrams with large language models

Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2211.05583v2 Announce Type: replace Abstract: Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions.

TAMUNA: Doubly Accelerated Distributed Optimization under Partial Participation

Laurent Condat, Ivan Agarsk\'y, Grigory Malinovsky, Peter Richt\'arik — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2302.09832v4 Announce Type: replace Abstract: In distributed optimization and federated learning, slow and costly communication between parallel devices and the central server constitutes the primary bottleneck. To alleviate this burden, two strategies have emerged: 1) local training (LT), which reduces communication frequency by performing multiple local computations between rounds, and 2) compression (CC), which consists of transmitting lower-dimensional, compact representations. Recent theoretical advances have successfully combined LT and CC to achieve doubly-accelerated communication rates, with respect to both condition number and model dimension. However, these methods have a major drawback: they require full client participation and break down when idle clients miss communication triggers. We introduce TAMUNA, the first algorithm to successfully intertwine LT, CC, and partial participation. By decoupling primal model updates from dual control variates, TAMUNA overcomes the architectural deadlock of prior methods. In the strongly convex setting, TAMUNA converges linearly to the exact solution, establishing a new state of the art by exhibiting doubly-accelerated convergence, while supporting arbitrary levels of client participation.

Reversible Numeric Composite Key (RNCK)

Nicola Asuni — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2306.04353v2 Announce Type: replace Abstract: In database design, composite keys uniquely identify records and prevent duplication. However, wide multi-column keys can increase index size, comparison work, and join costs. Surrogate keys can mitigate some of these costs, but they also require additional constraints and governance to preserve business-level uniqueness. This paper presents a Reversible Numeric Composite Key (RNCK): a single non-negative integer that encodes multiple normalized attributes and can be decoded back to the original tuple under a fixed schema. RNCK is designed to combine the semantic fidelity of composite keys with the operational convenience of numeric keys. RNCK can improve storage footprint and key-comparison efficiency when attribute domains are bounded and stable. We formalize correctness and ordering properties, and specify operational semantics for partial-overflow mode. The approach has been used in production systems and is applicable to relational databases, static datasets, and key-value caching systems within the stated constraints.

Chatlaw: A Multi-Agent Legal Assistant based on a Role-Aligned Mixture-of-Experts Architecture

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2306.16092v3 Announce Type: replace Abstract: Artificial Intelligence (AI) holds great potential in legal services, yet Large Language Models (LLMs) face two major challenges: limited knowledge of the Chinese legal system and vulnerability to hallucinations. To address these issues, we present Chatlaw, a multi-agent legal assistant. Chatlaw's framework is designed to emulate the Standard Operating Procedures (SOP) of real law firms, where different roles (e.g., assistant, researcher, senior lawyer) collaborate on a case. To computationally mirror this collaborative structure, we developed a novel Role-Aligned Mixture-of-Experts (RA-MoE) architecture. In this system, the internal "experts" are specifically trained to align with the distinct tasks of each agent role (e.g., inquiry, analysis, drafting). These specialized agents (Legal Assistant, Researcher, etc.) then form the collaborative framework. When they interact with users, retrieve legal knowledge, analyze case details, or generate reliable consultations, the RA-MoE architecture intelligently routes their computations to the corresponding dedicated expert, ensuring each step is handled by the most qualified parameters. In evaluations, Chatlaw surpasses general-purpose AI models, including GPT-4, achieving a 7.73% improvement in accuracy on the LawBench benchmark and an 11-point higher score on the Unified Qualification Exam for Legal Professionals. Real-case studies and expert assessments further confirm its robustness. Chatlaw enhances the accessibility and reliability of legal services, advancing the provision of legal support to the public.

Deep reinforcement learning for process design: Review and perspective

Qinghe Gao, Artur M. Schweidtmann — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2308.07822v2 Announce Type: replace Abstract: The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate this transition. Specifically, deep reinforcement learning, a subclass of machine learning, has shown the potential to solve complex decision-making problems and aid sustainable process design. We survey state-of-the-art research in reinforcement learning for process design through three major elements: (i) information representation, (ii) agent architecture, and (iii) environment and reward. Moreover, we discuss perspectives on underlying challenges and promising future works to unfold the full potential of reinforcement learning for process design in chemical engineering.

Empirical assessment of ChatGPT's answering capabilities in natural science and engineering

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2309.10048v2 Announce Type: replace Abstract: ChatGPT is a powerful language model from OpenAI that is arguably able to comprehend and generate text. ChatGPT is expected to greatly impact society, research, and education. An essential step to understand ChatGPT's expected impact is to study its domain-specific answering capabilities. Here, we perform a systematic empirical assessment of its abilities to answer questions across the natural science and engineering domains. We collected 594 questions on natural science and engineering topics from 198 faculty members across five faculties at Delft University of Technology. After collecting the answers from ChatGPT, the participants assessed the quality of the answers using a systematic scheme. Our results show that the answers from ChatGPT are, on average, perceived as ''mostly correct''. Two major trends are that the rating of the ChatGPT answers significantly decreases (i) as the educational level of the question increases and (ii) as we evaluate skills beyond scientific knowledge, e.g., critical attitude.

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2310.10196v3 Announce Type: replace Abstract: Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by physical and virtual sensors, they record dynamic system behaviors and enable a wide range of downstream tasks. Effectively analyzing such data is crucial to unlocking their rich information content. Recent advances in large language models and other foundation models have accelerated their use in time series and spatio-temporal data mining. These approaches not only improve pattern recognition and reasoning across diverse domains but also support progress toward artificial general intelligence that can understand and process temporal data. In this survey, we present a comprehensive, up-to-date review of large models tailored or adapted for time series and spatio-temporal data along four dimensions: data types, model categories, model scopes, and application areas/tasks. We organize existing work into two main groups: large models for time series analysis (LM4TS) and for spatio-temporal data mining (LM4STD), and further distinguish general-purpose from domain-specific models. We also curate related resources, including datasets, model implementations, and tools, organized by major application areas. Overall, this survey consolidates recent advances and highlights foundations, applications, resources, and open research opportunities in large model-centric temporal data analysis.

A higher order numerical method for singularly perturbed elliptic problems with characteristic boundary layers

Alan F. Hegarty, Eugene O'Riordan — Tue, 09 Jun 2026 00:00:00 -0400

arXiv:2311.00554v2 Announce Type: replace Abstract: A Petrov-Galerkin finite element method is constructed for a singularly perturbed elliptic problem in two space dimensions. The solution contains a regular boundary layer and two characteristic boundary layers. Exponential splines are used as test functions in one coordinate direction and are combined with bilinear trial functions defined on a Shishkin mesh. The resulting numerical method is shown to be a stable parameter-uniform numerical method that achieves a higher order of convergence compared to upwinding on the same mesh.