Wiley: Computer Animation and Virtual Worlds: Table of Contents

Evaluating the Viability of Foundation Models in Virtual Agent Perception

Fri, 05 Jun 2026 18:50:12 -0700

Overview of our pipeline. The agent's vision consists of a single central view (60° field of view, colored). In green, we highlight the regions used by Moondream to list objects in the field of view via a query, while in blue we show the inputs used by DinoV2. The black arrows and modules are shared among the three models. Visual attention is divided into a query module that provides contextual information to the model and a detection module that identifies objects specified in the object list. Since multiple object classes can be identified by the detection module (in this work, exit signs or people), the Drift Diffusion Model (DDM) determines the agent's walking direction, while a stochastic method defines the duration for which the agent follows that direction.
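The caption's use of a Drift Diffusion Model to pick between competing cues can be illustrated with a minimal sketch. It assumes a standard two-boundary DDM; the function name `ddm_choice`, its parameters, and the mapping of the two boundaries to "follow an exit sign" versus "follow a person" are hypothetical and are not taken from the paper.

```python
import random


def ddm_choice(drift, threshold=1.0, noise=1.0, dt=0.01, max_steps=10_000, rng=None):
    """Simulate one two-boundary drift diffusion decision.

    Evidence starts at 0 and accumulates with a constant drift plus Gaussian
    noise until it crosses +threshold ("upper") or -threshold ("lower").
    Returns (choice, decision_time_in_seconds).
    Assumption: "upper" could stand for one cue (e.g. an exit sign) and
    "lower" for the other (e.g. a person); this mapping is illustrative only.
    """
    rng = rng or random.Random()
    evidence = 0.0
    for step in range(1, max_steps + 1):
        # Euler-Maruyama update: deterministic drift plus scaled noise.
        evidence += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        if evidence >= threshold:
            return "upper", step * dt
        if evidence <= -threshold:
            return "lower", step * dt
    # No boundary reached in time: fall back to the sign of the evidence.
    return ("upper" if evidence >= 0 else "lower"), max_steps * dt


if __name__ == "__main__":
    rng = random.Random(0)
    choice, seconds = ddm_choice(drift=1.5, rng=rng)
    print(choice, round(seconds, 2))
```

A positive drift biases the outcome toward the upper boundary without making it certain, which is what lets a model of this kind produce varied but cue-driven choices.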

ABSTRACT

This work presents a comparative study evaluating how recent foundation models, capable of handling various computer vision tasks, perform at simulating a virtual agent's perception and using that perception to navigate the environment and achieve its goals. To achieve this, we selected two recent foundation models well known in the literature and compared them with a recent procedural approach by replicating these models' evaluation methods. We employed these models for evacuating a burning building, aiming to identify either exit signs or people to follow in the environment to find an escape route. Our results show that out-of-the-box foundation models were unable to outperform a specialized procedural method and performed significantly worse on the task, although they still exhibit strong adaptability to previously unknown contexts. However, among the two foundation models tested, we found that a VLM capable of image reasoning and detection performed significantly better than a self-supervised generalized computer vision model.

A Real‐Time VIMS Prediction Model Based on Short Long‐Term Visual Motion Features

Wei Quan, Zekai Yin, Yang Yang, Hui Zhang, Cheng Han — Fri, 05 Jun 2026 02:40:43 -0700

We present a real-time method to predict VIMS in VR. By capturing visual motion features with simplified inputs, it outperforms existing visual information-based prediction methods and can be deployed in immersive VR environments for real-time computation.

ABSTRACT

Visually induced motion sickness (VIMS) is one of the major obstacles to the broader adoption of virtual reality (VR) technology. As visual-vestibular conflict is considered a major factor contributing to VIMS, real-time prediction from visual cues remains crucial for timely intervention and for improving user experience. In this paper, we present a real-time VIMS prediction model that utilizes a hybrid architecture of a 3D Convolutional Neural Network and a Long Short-Term Memory network to capture both short and long-term features. We used low-resolution frames and replaced optical flow with frame difference maps and lateral/vertical displacement components to reduce data complexity and accelerate model inference. Experimental results on two datasets demonstrate that our model outperforms existing vision-based approaches, such as VR-SP and VR-SA, with an RMSE of 3.77 and a PLCC of 0.935. Importantly, the average processing time for each 108-frame video is less than 0.83 s. In addition, the method's real-time VIMS prediction capability has been verified in a Unity3D-based VR environment, with an end-to-end latency of 192 ms. These results highlight the model's advantages in terms of both efficiency and accuracy, making it a promising solution for VIMS-aware VR applications.
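As a rough illustration of two ingredients named in the abstract, the sketch below computes a frame difference map between two grayscale frames and the two reported evaluation metrics, RMSE and PLCC (Pearson linear correlation coefficient). The function names and the list-of-lists frame layout are assumptions made for this example; the paper's actual preprocessing and network are not reproduced here.

```python
import math


def frame_difference(prev_frame, curr_frame):
    """Absolute per-pixel difference of two equally sized grayscale frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def rmse(predicted, target):
    """Root mean squared error between two equally long score lists."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def plcc(predicted, target):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (std_p * std_t)


if __name__ == "__main__":
    print(frame_difference([[0, 10], [20, 30]], [[5, 7], [20, 40]]))
    print(rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
    print(plcc([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```

Frame differencing needs only one subtraction per pixel, which is consistent with the abstract's stated motivation of reducing data complexity compared with optical flow.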

Effects of Embodied Agent Body Type on Perceptions and Dietary Intentions in Digital Health Counseling

Fri, 05 Jun 2026 01:51:27 -0700

This graphical abstract shows the Virtual Nurse in one of the conditions, Obese.

ABSTRACT

Embodied Conversational Agents (ECAs) are increasingly used for health education, yet the role of an agent's physical embodiment remains relatively underexplored. This study examines how body mass index (BMI) cues in an embodied agent relate to user perceptions, trust in medical content, and health-related intentions during a brief dietary counseling interaction focused on diabetes prevention. In a between-subjects experiment, participants interacted with one of three visually distinct versions of the same healthcare coach, representing underweight, normal-weight, or obese body types. Across conditions, participants demonstrated improvements in dietary intentions, while significant increases in diabetes memory outcomes were observed only in the underweight and normal-weight conditions. Increases in overall behavior change intention were observed only in the underweight condition. Subjective evaluations indicated that the normal-weight agent was rated higher on realism, likability, and interaction quality, while post-interaction perception ratings suggested that more extreme body representations were sometimes perceived as less comfortable or less appropriate for a health advisory role. Together, these results suggest that visual body type cues may be associated with differences in users' subjective experiences of embodied health counseling agents, while evidence for broader differences in learning and motivation-related outcomes should be interpreted cautiously.

CameraVQ: Vector‐Quantized Representations for Monocular Camera Calibration

DaEun Cheong, JungHyun Han — Thu, 04 Jun 2026 04:02:37 -0700

Given an image, a frozen DINO_v3 extracts features for a classifier to predict a distribution over an intrinsic codebook. The highest-probability code is decoded into the camera matrix K by a frozen decoder.

ABSTRACT

We present a monocular camera calibration method, dubbed CameraVQ. It reformulates monocular camera calibration as classification over vector-quantized camera intrinsics. Existing methods based on geometric cues or direct regression often suffer from unstable optimization and poor generalization. In contrast, CameraVQ learns a discrete codebook of camera intrinsics and predicts the latent code from a single image. This discrete formulation constrains predictions to a statistically learned manifold of valid configurations, enabling robust calibration and strong generalization. Through extensive evaluation on diverse calibration benchmarks, CameraVQ achieves state-of-the-art performance on the majority of datasets.
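The classification-over-a-codebook idea can be sketched as follows. This is an illustration under strong assumptions: the codebook entries are made-up (fx, fy, cx, cy) tuples, and `decode_intrinsics` is a hypothetical stand-in that simply arranges them into a 3x3 pinhole matrix, whereas the paper describes a learned, frozen decoder.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of classifier logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_intrinsics(code):
    """Hypothetical decoder: turn (fx, fy, cx, cy) into a 3x3 matrix K."""
    fx, fy, cx, cy = code
    return [
        [fx, 0.0, cx],
        [0.0, fy, cy],
        [0.0, 0.0, 1.0],
    ]


def predict_camera_matrix(logits, codebook):
    """Pick the highest-probability code and decode it into K."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, decode_intrinsics(codebook[best])


if __name__ == "__main__":
    # Illustrative codebook with two entries; values are not from the paper.
    codebook = [(500.0, 500.0, 320.0, 240.0), (1000.0, 1000.0, 640.0, 360.0)]
    index, K = predict_camera_matrix([0.1, 2.0], codebook)
    print(index, K)
```

Because every prediction is one of the codebook entries, the output can never be an arbitrary matrix, which mirrors the abstract's point about constraining predictions to valid configurations.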

Effects of Immersion Fidelity and Multimodal Support on Cognitive Load, User Experience, and Behavioral Intentions in Embodied Conversational Agent Interactions

Thu, 04 Jun 2026 00:48:35 -0700

The graphical abstract submitted shows the comparison of the 2D and VR Environment setups.

ABSTRACT

This study examines how immersion fidelity (2D vs. VR) and multimodal support shape user experience during 20+ min diabetes-related conversations with an embodied conversational agent (ECA). Using a between-subjects design, participants engaged in conversation-only interactions or conversations augmented with subtitles, visualizations, or both. We assessed memory retention, diet attitude change, behavioral commitment intentions to adopt healthier eating habits, perceived workload, working alliance, and subjective impressions of the agent. In conversation-only interactions, 2D supported stronger working alliance, while VR elicited stronger behavioral commitment intentions. When subtitles and/or visualizations were present, working alliance and behavioral commitment intentions converged across modalities, although subjective impressions remained condition-specific. Across conditions, VR was associated with higher perceived workload, including consistently higher physical demand and additional increases in mental and overall workload under multimodal support. Memory retention and diet attitude change did not differ between 2D and VR. These findings highlight tradeoffs between relational alignment, motivational impact, and interaction effort when designing scalable ECA-mediated health conversations across 2D and VR platforms.

The Effects of Perspective‐Based Out‐of‐Body Experience in Virtual Reality Public Speaking

Dixuan Cui, Xinmeng Ren, Christos Mousas — Wed, 03 Jun 2026 22:26:20 -0700

Virtual reality (VR) offers an immersive platform for practicing public speaking in controlled environments that simulate various audience settings and performance contexts. Our findings highlight how perspective-based out-of-body experience (OBE) influences user behavior, with practical implications for the design of VR experiences for public speaking.

ABSTRACT

Virtual reality (VR) offers an immersive platform for practicing public speaking in controlled environments that simulate various audience settings and performance contexts. Prior research has examined how manipulating visual perspectives, a common method for inducing out-of-body experiences (OBEs), can influence users across various contexts. However, the effects of visual perspectives in a VR public speaking setting remain underexplored. To address this gap, we conducted a VR public speaking study in which we compared three perspective-based OBE conditions (i.e., first-person perspective vs. third-person perspective back vs. third-person perspective front) and explored their effects on objective measurements (i.e., acoustic features of speech and gaze behaviors) and self-reported ratings (i.e., sense of embodiment, co-presence, intrinsic motivation, and self-statements). Moreover, we explored the correlations between objective and self-reported data. We found that perspective-based OBE conditions had a significant effect on spectral slope, gaze behaviors (i.e., fixation count, average fixation duration, and total fixation duration), sense of embodiment (i.e., body ownership, agency and motor control, and external appearance), and co-presence. Results also showed several correlations between objective measurements and self-reported ratings. Our findings highlight how perspective-based OBE conditions influence human behavior, with practical implications for the design of VR experiences for public speaking.

A Multimodal Neurophysiological Framework for Real‐Time Evaluation of User Experience in Virtual Architectural Environments

Hui Liang, Longfei Yang — Wed, 03 Jun 2026 06:30:28 -0700

This study proposes a multimodal framework integrating VR and EEG to objectively quantify real-time architectural experiences. Findings demonstrate that neurophysiological metrics like Frontal Alpha Asymmetry (FAA) accurately decode emotional valence, capturing transient fluctuations that remain invisible to traditional retrospective subjective scales.

ABSTRACT

This study addresses the long-standing challenge of objectively and continuously quantifying User Experience (UX) in immersive architectural environments. We propose a real-time computational framework that integrates EEG-based affective computing with virtual reality (VR) spatial design. A hierarchical edge architecture combined with the Lab Streaming Layer enables millisecond-level synchronization between neural signals, behavioral trajectories, and semantic spatial events. To ensure robustness under free navigation, an adaptive preprocessing pipeline is implemented that incorporates zero-phase filtering and artifact subspace reconstruction. At the feature level, Frontal Alpha Asymmetry (FAA) and the Engagement Index (EI) are employed to jointly model emotional valence and attentional investment. Multimodal calibration is achieved via regression against subjective benchmarks, including SAM and NASA-TLX. A within-subject experiment comparing stress-inducing and restorative architectural scenarios demonstrates significant neurophysiological differentiation. FAA strongly correlates with reported valence, while EI shows a moderate association with perceived workload. Importantly, EEG measures capture transient fluctuations that remain invisible to retrospective scales. The findings establish a quantitative foundation for evidence-based design and pave the way toward closed-loop adaptive environments driven by real-time affective feedback.
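The abstract does not spell out how FAA and EI are computed. The sketch below uses the definitions commonly found in the EEG literature: FAA as the difference of log alpha power between a right and a left frontal site, and EI as beta power divided by the sum of alpha and theta power. Whether the paper uses exactly these forms, and which electrodes it uses, is an assumption; the band powers are taken as given inputs.

```python
import math


def frontal_alpha_asymmetry(alpha_left, alpha_right):
    """FAA = ln(right alpha power) - ln(left alpha power).

    Assumption: the inputs are alpha-band powers from a homologous frontal
    electrode pair (F3 and F4 are a common choice).
    """
    return math.log(alpha_right) - math.log(alpha_left)


def engagement_index(theta, alpha, beta):
    """EI = beta / (alpha + theta), a common engagement ratio."""
    return beta / (alpha + theta)


if __name__ == "__main__":
    # Illustrative band powers in arbitrary units; not data from the study.
    print(frontal_alpha_asymmetry(alpha_left=4.0, alpha_right=5.0))
    print(engagement_index(theta=3.0, alpha=4.0, beta=6.0))
```

Under this convention a positive FAA means relatively more alpha power over the right frontal site than over the left.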

Realistic Trail Following Behavior in a Dynamically Deformable Terrain Using Terrain‐Based Penalty Fields

M. Hamza Noor, Muhammad Usman — Mon, 01 Jun 2026 22:42:18 -0700

Opposite direction crowds navigating while following the trails of their respective directions on a deformable terrain.

ABSTRACT

Realistic simulation of pedestrian movement in natural environments remains challenging due to the limited interaction between agents and dynamically changing terrain. In real-world settings, pedestrian trails emerge through repeated footstep interaction with deformable terrain such as snow, grass, or soil, and these trails subsequently influence navigation behavior and crowd flow. However, existing crowd simulation approaches typically treat terrain as static or rely on predefined paths, limiting both visual realism and behavior consistency. We present an environment-aware crowd navigation model that integrates footstep-driven trail formation, directional trail encoding, and trail-aware pathfinding to produce emergent, visually grounded pedestrian behavior. Our approach models terrain as a deformable surface represented by a dynamic height and trail-intensity field updated at each footstep. As agents traverse the environment, footsteps increment local trail intensity and modify terrain appearance, producing persistent visual trails. These trails are simultaneously encoded into a 2D grid used for navigation. Pathfinding costs are dynamically adjusted using accumulated foot traffic, encouraging agents to follow existing trails while still allowing divergence when beneficial. Results show improved visual plausibility, natural trail reuse, and coherent crowd flow compared to static-terrain baselines.
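The interplay of footstep-driven trail intensity and trail-aware pathfinding can be sketched on a small grid. Everything specific here is an assumption made for illustration: the cost formula `1 + penalty * (1 - intensity)`, the deposit amount, the use of Dijkstra's algorithm on a 4-connected grid, and all function names. The paper's actual penalty field and planner may differ.

```python
import heapq


def deposit_footstep(trail, cell, amount=0.25):
    """Increase trail intensity at a cell, saturating at 1.0."""
    r, c = cell
    trail[r][c] = min(1.0, trail[r][c] + amount)


def step_cost(trail, cell, penalty=3.0):
    """Cost of entering a cell: untrodden ground is penalized the most."""
    r, c = cell
    return 1.0 + penalty * (1.0 - trail[r][c])


def trail_aware_path(trail, start, goal, penalty=3.0):
    """Dijkstra search on a 4-connected grid; returns (path, total_cost).

    Assumes the goal is reachable, which always holds on an obstacle-free grid.
    """
    rows, cols = len(trail), len(trail[0])
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                new_cost = cost + step_cost(trail, nxt, penalty)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    parent[nxt] = cell
                    heapq.heappush(heap, (new_cost, nxt))
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]


if __name__ == "__main__":
    trail = [[0.0] * 5 for _ in range(3)]
    for col in range(5):  # earlier walkers wore in the top row
        deposit_footstep(trail, (0, col), amount=1.0)
    print(trail_aware_path(trail, (1, 0), (1, 4)))
```

In this toy setup an agent starting beside a worn trail detours onto it because the trodden cells are cheaper, while on a fresh grid the same agent walks straight to the goal. Lowering `penalty` weakens that attraction.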

A Gait Recognition Method via Graph Convolutional Networks With Local Mask

Jing Jiang, Hongyu Shi, Yueping Kong, Yingxuan Huang, Jie Yang — Mon, 01 Jun 2026 22:37:34 -0700

To capture correlated local and global motion patterns in joint and skeleton data, MA-Gait is proposed. Built upon raw joint sequences from input videos, it constructs centripetal distance and relative skeletal length sequences as multi-semantic inputs. Three parallel MFEN branches, each combining a masked graph convolutional network (MA-GCN) and a multi-scale temporal convolution network (MS-TCN) in a residual cascade, extract spatiotemporal gait features. Optimization uses triplet and cross-entropy losses.

ABSTRACT

To overcome the limitation of uniform graph convolution in skeleton-based gait recognition—specifically its inability to adaptively model coordinated limb swings—this paper proposes a novel framework named MA-Gait. The core innovation is a local masking mechanism integrated with a global context mask, applied to key body parts including arms, legs, head, and trunk. This guides a multi-head attention graph convolutional network to enhance feature extraction from these locally masked regions during coordinated movements. Additionally, multi-semantic gait data, comprising centripetal relative joint coordinates and skeletal lengths, are constructed to enrich gait representation through a multibranch architecture for parallel feature learning. Evaluated on CASIA-B and OUMVLP-Pose datasets, MA-Gait achieves average recognition accuracies of 91.4% and 62.2%, respectively, significantly outperforming existing model-based methods. The results validate the effectiveness of the local masking mechanism, multibranch design, and multi-semantic learning paradigm in capturing discriminative gait features under complex conditions such as viewpoint changes and occlusion.

Controlling Dendritic Ice and Frost Crystal Growth Along User‐Drawn Trajectories

Yudai Ichimura, Syuhei Sato — Mon, 01 Jun 2026 03:57:28 -0700

This paper proposes a method to control dendritic crystal growth, such as ice and frost, along user-specified trajectories. By applying external guiding and driving forces, users can generate realistic growth animations following complex shapes like letters or branching patterns.

ABSTRACT

This paper proposes a method to control the growth of dendritic crystals, such as ice and frost, along trajectories specified by the user. The growth of dendritic crystals involves numerous parameters, requiring extensive trial and error to achieve the shapes intended by artists. To address this problem, we generate target shapes and external forces that follow the user-specified shape and use these to control the behavior of the temperature field and particles representing the water molecules. The external forces consist of a guiding force that generates a reverse flow along the user-drawn trajectory to facilitate particle arrival at the reaction front and a driving force that attracts particles toward the target shape. Furthermore, to prevent the crystal from growing beyond the target shape, a field representing the freezing temperature of ice and frost is set according to the target shape. The experimental results demonstrated that it is possible to generate ice and frost animations that grow along user-specified shapes, such as character-like forms.

Dynamic Fatigue Feature Injection for Simulating Fatigue Effects in Animation

Iliana Loi, Konstantinos Moustakas — Mon, 01 Jun 2026 03:54:33 -0700

In this work, we introduce a framework for fatigue-driven motion synthesis that accounts for both fatigue-induced characteristic features and fatigue scalings to model the human fatigue state. A siamese model is employed to extract fatigue features from experimental fatigued gait motion sequences; these features are then used to optimize a transformer-based fatigue model that learns to synthesize fatigued movements relying solely on non-fatigued gait data and subject embeddings. This architecture further leverages the 3CC-λ PINN model from the Fatigue-PINN pipeline to infuse fatigue scalings onto the resulting motions. Our framework offers a new research direction for fatigue-driven motion synthesis, whose outcomes can be exploited by biomechanical, animation, and XR applications, where the temporally evolved fatigue effects are realistically rendered in human motion.

ABSTRACT

Simulating fatigue effects on human motion is essential both for conducting biomechanical analyses to develop overexertion and fatigue prevention and mitigation techniques and for realistically rendering temporally evolved fatigued movements in animation and Extended Reality (XR) pipelines. In this work, we propose a framework to account for the dynamic injection of fatigue and fatigue-induced characteristic features in non-fatigued movements for data-driven fatigue-driven motion synthesis, an under-explored scientific field in recent literature. To do so, we leverage an interplay between a siamese neural network and a Transformer-based Fatigue Model to account for the encoding and sampling of fatigue features, while fatigue scalings are incorporated into motion via the state-of-the-art Fatigue-PINN. Our quantitative evaluation findings confirm the effectiveness and validity of our framework, while qualitative analysis shows that the fatigued motion sequences produced by our model are comparable with the observations of real-world experimental studies investigating the impact of externally perceived fatigue in human motion. Moreover, we developed a demonstrator to showcase and assess the capability of our model to dynamically integrate fatigue scalings and fatigue features into motion sequences, while evaluating its ability to be seamlessly integrated into XR and motion synthesis environments.

Virtual Human for Police Training in Virtual Reality: A Dynamic Model of a Suspect With Modulated Resistance and Means

Yvain Tisserand, Séolane Bouchoucha, Ben Meuleman, David Sander — Mon, 01 Jun 2026 03:33:43 -0700

Graphical abstract illustrating the dynamic virtual human model for VR police training, combining suspect means, modulated resistance, behavior-control strategies, and multimodal control.

ABSTRACT

Effective simulation-based police training requires exposure to realistic human behavior under uncertainty, particularly in situations involving escalation and use-of-force decisions. Virtual Reality (VR) enables immersive and repeatable training scenarios, where behaviorally credible virtual humans (VHs) are important to produce realistic and operationally relevant simulated scenarios. This paper presents a VH acting as a suspect during police interventions, whose behavior is governed by two parameters: resistance and means. Resistance models compliance or aggression, while means represents the potential to inflict harm using a weapon or object. These parameters drive a multimodal behavioral system integrating facial expression, posture and gesture, gaze, locomotion, and voice interaction. Behavior selection is controlled by a state-machine architecture enabling escalation and de-escalation in response to police actions, grounded in law-enforcement literature on proportional use of force. We evaluated the model in immersive VR with police officers across scenarios combining resistance and means. Resistance primarily affected perceived danger and intent, whereas means more strongly affected perceived capacity and opportunity. These findings validate the framework and show that interpretable rule-based VHs can elicit differentiated threat appraisals in controlled and repeatable VR scenarios, supporting their use for training and behavioral research in police contexts.

Neural Fluid Simulator With Hybrid Physical‐Visual Constraints

Feilong Du, Xiaojuan Ban, Yuhang Xu, Angelos Chatzimparmpas, Yalan Zhang — Mon, 01 Jun 2026 00:46:22 -0700

Overview of the proposed neural fluid simulator. The process begins with reconstructing and refining point clouds from a monocular 2D image sequence. Next, a velocity field is inferred from the point cloud sequence to provide particles with kinetic properties. Finally, an energy-based physical constraint dynamics solver predicts future particle states to achieve the fluid simulation.

ABSTRACT

Traditional physics-based fluid simulations typically rely on manual modeling and incremental adjustments to achieve desired effects, which can limit objectivity and generalizability to new scenarios. To address these challenges, we propose a novel neural fluid simulator that integrates visual priors from 2D image sequences with physically constrained continuous convolution. Specifically, we extract and refine point clouds from image sequences, then infer the kinetic properties of the fluid. We introduce an energy-based physical constraint and incorporate it into a continuous convolution solver. By iteratively optimizing these inputs to enforce physical laws—particularly incompressibility—the solver produces accurate fluid motion predictions. Our approach uniquely combines visual data and physical constraints, enhancing the realism and accuracy while providing stronger generalization of fluid simulations.

A Semantic‐to‐Motion Digital Twin Framework for Expressive Industrial Avatars in Telepresence

Damien Mazeas, Anthony Foulonneau, Jérémy Lacoche — Sun, 31 May 2026 20:51:15 -0700

This framework introduces a semantic interpretation stack that translates natural language into expressive robotic motion. By mapping expert intent to the SAEH taxonomy (show, alert, encourage, hesitate) the system enables intuitive, non-verbal communication in industrial telepresence through a validated digital-twin pipeline.

ABSTRACT

Remote industrial assistance is increasingly mediated through robotic avatars, yet existing telepresence systems provide limited support for non-verbal behaviors such as hesitation, urgency, or emphasis. Manually controlling such expressive motion can increase operator workload. We present a feasibility and system-integration study of a semantic-to-motion telepresence pipeline that interprets expert utterances and scene context into parameterized expressive robot behaviors. The system combines visual grounding with language-based intent inference to produce structured motion descriptors, which are mapped to procedural motion primitives executed in a Unity-based digital twin and streamed to a physical robot via ROS. As the semantic control vocabulary for this prototype, we define SAEH (show, alert, encourage, hesitate), a four-class operational vocabulary derived from a thematic review of 104 HRI papers. We implement the approach on a Niryo Ned 1 manipulator and show that it can generate and execute kinematically distinct expressive motion profiles, including deictic, warning, supportive, and hesitant behaviors. The proposed framework partially separates communicative intent from robot-specific execution through parameterized procedural primitives, providing a technical foundation for context-aware expressive telepresence; however, a measured mean end-to-end latency of ≈10 s currently limits its use to supervisory and asynchronous task loops rather than continuous live interaction.

Making Faces: Evaluating Facial Control Methods in VR for Live Conversations

J. K. Sangeeth Chandran, Marisa Llorens Salvador, Cathy Ennis — Fri, 29 May 2026 08:04:51 -0700

This paper investigates avatar facial expression control during live dyadic conversations in virtual reality. It compares face tracking with two controller-based thumbstick methods designed for VR devices without facial tracking, focusing on how these approaches support co-presence, usability, and expressive communication.

ABSTRACT

Facial expressions are a crucial part of affective communication in our daily lives, and the same applies to virtual humans. However, achieving a full range of believable and naturalistic emotional expression via VR devices remains a challenge, particularly when constrained by the capabilities of midrange consumer VR devices without face tracking as opposed to the more expensive versions with facial tracking capabilities. In this study, we evaluated and compared three methods to control avatar facial expression in VR: continuous face tracking (FT); thumbstick label (TL), which is a VR controller's thumbstick-based selection method with visually labeled expression; and thumbstick label with confirmation button (TB), which is a variation of TL with an additional step of pressing a button. Thirty participants took part in the experiment. Participants found FT to be significantly more usable than TL and TB. However, TL was comparable to FT on co-presence and several items of expressiveness, with significant gaps emerging mainly for naturalness and smoothness in expression change. These findings suggest that a well-designed control method with low latency and low motor and cognitive load can serve as an alternative method of avatar control in VR devices without facial tracking technology.

Musculoskeletal Motion Control and Generation Based on Muscle Synergies

Libo Sun, Jiwei Wen, Wenhu Qin — Thu, 28 May 2026 07:36:39 -0700

We present a two-stage framework for musculoskeletal motion control and generation based on muscle synergies. It enables accurate imitation, diverse motion generation, and effective downstream task control.

ABSTRACT

This paper presents a unified two-stage framework for physics-based musculoskeletal motion control and generation. To tackle the challenges posed by high dimensionality and redundancy, we first employ an autoencoder to learn a low-dimensional muscle synergy space from activation data. Policies trained in this space make the character faithfully reproduce motions while generating physiologically plausible muscle activations. We then leverage these expert trajectories to train a Conditional VAE, encoding skills into a continuous latent space for task-agnostic motion synthesis and downstream control. Experiments show our method achieves high motion imitation accuracy and generation diversity, ensures control stability, and maintains physiological realism, offering an effective solution for generalizing control of complex musculoskeletal characters.

Eye‐Tracking in Virtual Reality: Usability Assessment of a Regional Heritage Museum

Thu, 28 May 2026 07:25:52 -0700

Eye-tracking-based evaluation of Universal Design principles in a virtual reality heritage museum application.

ABSTRACT

This paper presents a comparative study of two variants of the user interface in a virtual reality (VR) museum application: an interface designed in accordance with the principles of universal design (UD) and an interface that does not meet these principles (non-UD). The application was developed in the Unity engine and run on Meta Quest Pro headsets with built-in eye tracking. During the completion of a set of tasks, eye-tracking data (including heatmaps and gaze scanpaths) and task completion times were recorded, and subjective ratings were then collected using the standardized system usability scale (SUS). The results indicate an advantage of the UD variant over the non-UD variant: the mean SUS score was higher for UD (73.75) than for non-UD (65.31). A Wilcoxon signed-rank test confirmed the statistical significance of this difference (Z = 4.87, p < 0.001, r = 0.49). Moreover, tasks were completed faster in the UD variant (total time 40.41 s) than in the non-UD variant (57.85 s). This difference was also statistically significant (Z = 5.61, p < 0.001, r = 0.56). Eye-tracking observations supported these differences, suggesting more efficient interface exploration in the UD variant.
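For readers unfamiliar with the two statistics reported above, the sketch below shows the standard SUS scoring rule and one common convention for a Wilcoxon effect size, r = Z / sqrt(N). The abstract does not state which effect-size convention or which N was used, so the second function and the N in the example are assumptions for illustration only.

```python
import math


def sus_score(responses):
    """Standard System Usability Scale score from ten 1-5 item responses.

    Odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to a 0-100 range.
    """
    if len(responses) != 10:
        raise ValueError("SUS needs exactly ten item responses")
    total = 0
    for position, response in enumerate(responses, start=1):
        total += (response - 1) if position % 2 == 1 else (5 - response)
    return total * 2.5


def effect_size_r(z, n):
    """Effect size r = |Z| / sqrt(N), a common convention (assumed here)."""
    return abs(z) / math.sqrt(n)


if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
    # N = 100 is a hypothetical count, not a figure stated in the abstract.
    print(round(effect_size_r(4.87, 100), 3))
```

Averaging `sus_score` over all participants in a condition gives a mean of the kind reported for the UD and non-UD variants.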

MADECM: A Curiosity‐Augmented Evolutionary Algorithm for Multi‐Agent Policy Diversity Optimization

Jianyang Wu, Yv Fu, Xinning Wang, Xin Yang — Thu, 28 May 2026 01:37:39 -0700

Overview of MADECM: RND estimates the novelty of agents' local observations to prioritize exploration-relevant experience via additional updates with dual Critic networks, while a staged evolutionary procedure combines reward optimization and determinant-based diversity optimization within a MAP-Elites archive to jointly improve task return and policy heterogeneity.

ABSTRACT

Multi-agent reinforcement learning (MARL) often suffers from low sample efficiency and limited behavioral diversity, leading to policy homogenization, insufficient exploration, and reduced robustness. To address these challenges, we propose MADECM, a curiosity-augmented evolutionary framework built upon MADDPG that integrates curiosity-driven updates with evolutionary quality-diversity optimization. MADECM employs random network distillation (RND) to estimate the novelty of each agent's local observations and uses the resulting novelty signal to dynamically allocate additional update frequencies, thereby emphasizing exploration-relevant experience during training. In addition, MADECM combines population-based diversification with a quality-diversity (QD) archive through a staged optimization procedure, enabling the joint improvement of task return and policy diversity. We evaluate MADECM on the multi-agent particle environment (MPE), including Spread and Reference, which capture cooperative and partially observable dynamics, and on Google Research Football (GRF), which emphasizes long-horizon sequential decision-making. Results show that MADECM consistently outperforms strong MADDPG-based baselines. The modular design of MADECM, consisting of RND-based novelty estimation and staged QD optimization, further supports consistent generalization across these structurally distinct environments without task-specific hyperparameter tuning.
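The quality-diversity archive mentioned above can be illustrated with a bare-bones MAP-Elites-style insertion rule: each cell of a discretized behavior space keeps only the best-scoring policy seen so far. The descriptor range, bin count, and function names are assumptions for this sketch, and MADECM's staged procedure and determinant-based diversity term are not modeled.

```python
def descriptor_to_cell(descriptor, bins=10, low=0.0, high=1.0):
    """Map a continuous behavior descriptor to a tuple of grid indices."""
    cell = []
    for value in descriptor:
        fraction = (value - low) / (high - low)
        index = min(bins - 1, max(0, int(fraction * bins)))
        cell.append(index)
    return tuple(cell)


def try_insert(archive, policy, fitness, descriptor, bins=10):
    """Insert a policy if its cell is empty or it beats the incumbent.

    `archive` is a dict mapping cell -> (fitness, policy).
    Returns True when the archive was updated.
    """
    cell = descriptor_to_cell(descriptor, bins)
    incumbent = archive.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        archive[cell] = (fitness, policy)
        return True
    return False


if __name__ == "__main__":
    archive = {}
    # Policies are placeholder strings; descriptors and returns are made up.
    try_insert(archive, "policy_a", fitness=1.0, descriptor=(0.05, 0.95))
    try_insert(archive, "policy_b", fitness=2.0, descriptor=(0.55, 0.50))
    print(archive)
```

Keeping one elite per cell is what lets an archive of this kind retain behaviorally different policies even when some score lower than the global best.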

Understanding User Perceptions Toward Queue Cutting in Immersive VR Environments

Elza Ibragimov, Natasha Kholgade Banerjee, Sean Banerjee, Ashutosh Shivakumar — Tue, 26 May 2026 07:16:55 -0700

We conduct a study on understanding how users perceive a queue-cutting NPC in a virtual doctor's office reception area. During the wait, a queue-cutting NPC requests an in-queue NPC to cut in ahead of them and is either allowed or denied entry into the queue. Using data on subjective responses from 45 participants, we find significant differences in perception of time, frustration, and likelihood of exit when the queue-cutting NPC is allowed in.

ABSTRACT

Queue cutting is a frustrating experience in the real world for a person waiting to receive service, as it violates the normative first-in-first-out ordering of queues. As social experiences move to virtual worlds, users may encounter humans and non-playable characters (NPCs) attempting to cut the queue, adding to the negative experience of waiting. While normative behavior for activities such as joining an existing conversational group or walking between agents engaged in conversation has been studied in virtual reality, perceptions toward queue cutters have not. We study how users perceive a queue-cutting NPC in a virtual doctor's office reception area. During the wait, a queue-cutting NPC asks an in-queue NPC for permission to cut in ahead of them and is either allowed or denied entry into the queue. Using subjective responses from 45 participants, we find significant differences in perception of time, frustration, and likelihood of exit when the queue-cutting NPC is allowed in. Our work enables future research on detecting situations likely to generate user frustration and on providing appropriate intervention via the VR environment to mitigate negative experiences and ensure timely service.

LLM‐Based Modeling of 3D Joint Kinematics of the Polonaise Folk Dance for Motion‐to‐Text Generation for User‐Feedback Capabilities

Tue, 26 May 2026 07:03:48 -0700

The study investigates the use of LLMs for generating natural-language feedback based on kinematic features extracted from motion capture data. In motion-to-text generation, kinematic motion descriptors are translated into structured natural-language corrective guidance. The proposed approach focuses on enabling interpretable and adaptive feedback that can support learning in immersive VR environments.

ABSTRACT

The analysis and learning of complex human movements require precise motion capture, data processing, and clear feedback. This paper presents a comprehensive pipeline for real-time movement analysis and personalized feedback generation based on motion capture data and large language models. Three-dimensional motion data are captured using an 8-camera Vicon motion capture system and processed using both a custom joint angle computation method designed to support real-time streaming and Vicon Plug-in Gait angles. The extracted kinematic features are normalized and structured into movement patterns that represent key elements of dance choreography. Based on these patterns, synthetic performance samples are generated and used to fine-tune large language models capable of interpreting movement performance and producing natural-language feedback. Models fine-tuned in this study include Mistral 7B, H2O-Danube 3 4B, and Qwen 2.5 3B. The proposed approach enables the generation of context-aware, descriptive guidance that goes beyond numerical scores and supports motor learning. Experimental results demonstrate that H2O-Danube 3 4B can provide accurate and efficient feedback while remaining suitable for interactive and real-time applications. The presented framework offers a scalable foundation for intelligent movement training systems and can be extended to other movement disciplines and immersive virtual reality environments.
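As a rough illustration of the kind of kinematic feature such a pipeline extracts, the sketch below computes the angle at a joint from three 3D marker positions. It is a generic three-point angle, not the authors' custom streaming method or the Vicon Plug-in Gait model; the function name `joint_angle_deg` and the marker coordinates are hypothetical.

```python
import math

def joint_angle_deg(proximal, joint, distal):
    """Angle at `joint`, in degrees, between the segments joint->proximal and joint->distal."""
    a = [p - j for p, j in zip(proximal, joint)]
    b = [d - j for d, j in zip(distal, joint)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("degenerate segment: two markers coincide")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

# Hypothetical marker positions in metres: hip above the knee, ankle in front of it.
hip, knee, ankle = (0.0, 0.0, 1.0), (0.0, 0.0, 0.5), (0.0, 0.3, 0.5)
knee_angle = joint_angle_deg(hip, knee, ankle)
print(f"knee angle: {knee_angle:.1f} degrees")
```

Because the function needs only the current frame's marker positions, it can be evaluated per frame on a live stream, which is one reason angle features of this kind suit real-time use.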

Obstacle‐Aware Fluid Control via Vector Potential Editing

Yizhang Chen, Takashi Kanai — Tue, 26 May 2026 07:01:14 -0700

This paper introduces a near real-time framework for inserting obstacles into pre-computed Eulerian fluid simulations. By operating in the vector potential space with projection-free boundary handling, our method strictly preserves divergence-free velocity fields without expensive global pressure solves. Additionally, we integrate a Vortex Primitive Method to inject physically-informed wake turbulence at separation points, achieving high-fidelity fluid-solid interactions with significantly reduced computational overhead.

ABSTRACT

Realistic fluid simulations in interactive applications require both high visual fidelity and the ability to incorporate new obstacles at run time. In practice, high-quality Eulerian fluid data are often precomputed offline, but inserting obstacles into such baked fields typically requires solving a global pressure Poisson equation, negating the efficiency gains of precomputation. We present a framework for obstacle-aware fluid control based on a vector potential formulation, which maintains divergence-free velocity fields by construction. Our method introduces a projection-free boundary handling technique inspired by Curl-Noise, where a one-sided potential field decomposition enforces approximate free-slip conditions around arbitrary obstacles without global re-solves. To restore turbulent details suppressed by numerical dissipation, we further incorporate a Vortex Primitive Method (VPM) that injects physically-informed vortex particles at separation points identified through surface curvature criteria and a Bernoulli-inspired pressure heuristic. The VPM operates in vector potential space via the Biot-Savart law, preserving the divergence-free property of the reconstructed velocity field. Experiments on three scenarios with comprehensive quantitative metrics demonstrate that our per-frame editing cost is two orders of magnitude lower than a standard pressure Poisson solve, enabling near real-time one-way fluid-solid interaction with near-exact mass conservation.
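The claim that a vector potential keeps the velocity divergence-free by construction can be checked numerically. The sketch below is a 2D toy under stated assumptions, not the paper's one-sided decomposition: it uses a scalar stream function on a periodic grid, a made-up base flow, and the simpler Curl-Noise-style trick of ramping the potential to zero near a circular obstacle. All names (`edited_potential`, `smooth_ramp`, the grid size) are hypothetical.

```python
import math

N = 32          # grid cells per side of a periodic unit square
H = 1.0 / N     # cell size

def central_dx(f):
    # Periodic central difference along x (column index j).
    return [[(f[i][(j + 1) % N] - f[i][(j - 1) % N]) / (2 * H) for j in range(N)]
            for i in range(N)]

def central_dy(f):
    # Periodic central difference along y (row index i).
    return [[(f[(i + 1) % N][j] - f[(i - 1) % N][j]) / (2 * H) for j in range(N)]
            for i in range(N)]

def smooth_ramp(distance, width):
    # Smoothstep from 0 at the obstacle surface to 1 at `width` away from it.
    t = max(0.0, min(1.0, distance / width))
    return t * t * (3.0 - 2.0 * t)

def edited_potential(cx, cy, radius, width):
    # Base potential (an assumed analytic flow) attenuated near a circular obstacle.
    psi = []
    for i in range(N):
        row = []
        for j in range(N):
            x, y = (j + 0.5) * H, (i + 0.5) * H
            base = math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)
            dist = math.hypot(x - cx, y - cy) - radius
            row.append(smooth_ramp(dist, width) * base)
        psi.append(row)
    return psi

psi = edited_potential(0.5, 0.5, 0.2, 0.1)
# 2D curl of the potential: u = d(psi)/dy, v = -d(psi)/dx.
u = central_dy(psi)
v = [[-value for value in row] for row in central_dx(psi)]
du_dx, dv_dy = central_dx(u), central_dy(v)
max_div = max(abs(du_dx[i][j] + dv_dy[i][j]) for i in range(N) for j in range(N))
print("max |divergence| after editing:", max_div)
```

Editing the potential rather than the velocity is what removes the need for a pressure projection here: the two difference operators commute, so the discrete divergence vanishes up to rounding error however the potential is modified.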

Feeling the Flow: A Real‐Time Multimodal Visuo‐Haptic Interface for Interactive Fluid Dynamics in XR

Guoqing Chen, Min Xiong — Tue, 26 May 2026 06:49:44 -0700

A real-time multimodal visuo-haptic XR interface couples GPU-accelerated fluid simulation with physics-aware spectral haptic rendering, enabling intuitive perception of laminar, transitional, and turbulent flow behaviors through synchronized visual and tactile feedback.

ABSTRACT

In Extended Reality (XR), fluid dynamics are predominantly conveyed through visual cues alone. The absence of tangible feedback limits immersion and the perception of intrinsic physical properties such as viscosity and flow regime transitions. We present “Feeling the Flow”, a real-time multimodal interface integrating GPU-accelerated Lattice Boltzmann Method (LBM) simulation with high-fidelity haptic feedback. To address the frequency mismatch between visual and haptic channels, we introduce a multi-rate asynchronous architecture that enables 1000 Hz physics updates on consumer hardware. A key contribution is our Reynolds-number-based dynamic spectral modulation mechanism: unlike conventional linear force amplification, this method modulates the haptic signal's spectral content according to flow regime, producing smooth forces in laminar flow and progressively introducing multi-scale turbulent textures as Reynolds number increases. Technical validation demonstrates real-time performance, high numerical accuracy, and correct capture of vortex-shedding dynamics. A controlled user study (n = 24) demonstrates that the multimodal interface significantly enhances perception of complex flow dynamics by 47.1% (p = 0.023) without increasing cognitive load, highlighting the potential of physics-aware haptics in immersive XR experiences.
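To make the idea of Reynolds-number-based modulation concrete, the sketch below computes Re = ρUL/μ and uses it to blend a multi-octave texture into an otherwise smooth force. It is only an illustration of the general principle: the regime thresholds (2000 and 4000), the octave frequencies, the 0.5 texture gain, and all function names are assumptions, not values or APIs from the paper.

```python
import math

def reynolds_number(density, speed, length, viscosity):
    """Re = rho * U * L / mu, with SI units for all inputs."""
    return density * speed * length / viscosity

def turbulence_weight(re, re_laminar=2000.0, re_turbulent=4000.0):
    # 0 in the assumed laminar range, 1 in the assumed turbulent range,
    # with a smoothstep transition in between.
    t = max(0.0, min(1.0, (re - re_laminar) / (re_turbulent - re_laminar)))
    return t * t * (3.0 - 2.0 * t)

def haptic_force(base_force, re, time_s, octaves=3, base_freq_hz=40.0):
    # Sum of sinusoidal octaves; each octave doubles in frequency and halves in amplitude.
    texture = sum(
        math.sin(2.0 * math.pi * base_freq_hz * (2 ** k) * time_s) / (2 ** k)
        for k in range(octaves)
    )
    return base_force * (1.0 + 0.5 * turbulence_weight(re) * texture)

# Water-like fluid (1000 kg/m^3, 0.001 Pa*s) past a 0.1 m object at two speeds.
for speed in (0.01, 0.05):
    re = reynolds_number(1000.0, speed, 0.1, 0.001)
    print(f"U={speed} m/s  Re={re:.0f}  weight={turbulence_weight(re):.2f}  "
          f"force={haptic_force(1.0, re, 0.003):.3f} N")
```

The design point this illustrates is that the flow regime changes the frequency content of the signal rather than only scaling its magnitude, so a laminar flow feels smooth at any force level.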

Analysis of Robustness in Biped Locomotion Controllers

Gangrae Park, Seung‐Wook Ko, Taesoo Kwon, Yejin Kim — Sun, 24 May 2026 21:54:46 -0700

We propose a benchmark framework for evaluating and analyzing the robustness of bipedal locomotion controllers under external forces with varying conditions. Furthermore, based on our experimental results, we introduce a new locomotion controller called MMIPM-RL.

ABSTRACT

Character locomotion analysis in physics-based simulation remains a challenging problem in the field of computer animation. Locomotion is a fundamental skill, yet generating robust and natural motion is challenging. This has led to the development of various locomotion controllers. Robustness, defined as responsiveness to external perturbations and environmental changes, is a fundamental requirement in locomotion control. However, a lack of objective and systematic comparisons is evident in existing controllers. In this article, we propose a benchmark framework for evaluating and comparing the robustness of locomotion controllers using multiple criteria. In order to assess the responsiveness of the controller, external perturbations are applied to a simulated biped character at different gait phases. This analysis offers insights into the behavior of controllers, highlights the limitations of existing approaches, and motivates a novel controller called MMIPM-RL.

CPSN: Collision‐Overlap and Physics‐Based Self‐Supervised Neural Cloth Simulation

Tao Peng, Zhi Gong, Xianfang Tang, Zili Zhang, Li Li, Xinrong Hu — Sun, 24 May 2026 19:40:13 -0700

CPSN is a self-supervised neural cloth simulation framework that combines Gaussian mixture skinning (GMS) with a differentiable collision-overlap loss to improve physical plausibility and deformation stability. Given human motion sequences, the framework predicts garment dynamics using GRU-based deformation modeling and physics-based self-supervision. By enforcing continuous body-cloth influence and explicit collision suppression, CPSN generates realistic garment animations with fewer interpenetration artifacts, smoother deformation transitions, and fine-scale wrinkle details across diverse garment types.

ABSTRACT

Self-collision handling remains a fundamental and long-standing challenge in neural cloth simulation, particularly for loose garments with complex topologies. We propose a self-supervised neural cloth simulation framework that integrates Gaussian mixture skinning (GMS) with a differentiable collision-overlap loss to significantly enhance physical plausibility and visual realism. We employ continuous and spatially smooth GMS weights to model vertex-skeleton coupling, enabling stable deformations under large body motions. To explicitly address cloth self-collisions, we introduce a differentiable spatial repulsion constraint that suppresses interpenetration and layer-overlap artifacts. The proposed objective is jointly optimized with physics-inspired losses, enabling the network to learn consistent cloth dynamics without relying on ground-truth physical simulations. Experimental results demonstrate improved temporal stability, reduced collision artifacts, and stronger generalization compared to existing self-supervised methods.

Gabor‐Augmented Two‐Stream Framework With GoogLeNet‐Based Temporal Attention for Semantic Dynamic Hand Gesture Recognition

A. D. Harale, Kailash Jagannath Karande — Sun, 24 May 2026 19:11:47 -0700

The proposed framework presents an end-to-end deep learning pipeline for dynamic hand gesture recognition with built-in explainability. Hand gesture recognition plays a vital role in enabling realistic interaction between humans and computers, with applications in assistive technologies, virtual reality, and healthcare. Although conventional static gesture recognition has achieved significant progress, dynamic gestures remain more challenging due to occlusions, complex spatiotemporal dependencies, and varying environmental conditions. While recent deep learning and transformer-based models have improved recognition accuracy, challenges related to interpretability, computational efficiency, and robustness still persist. To address these issues, this research introduces a Gabor-augmented with Temporal Attention and User-interface Framework (GTAU-F) for recognizing semantic dynamic hand gestures. The framework begins with video-based hand gesture data as input, which is decomposed into individual frames for detailed analysis. Each frame undergoes a comprehensive preprocessing stage, including resizing, noise reduction, grayscale conversion, binarization, morphological dilation, and grayscale mapping with overlay, ensuring enhanced quality and consistency of gesture representations. Following preprocessing, discriminative spatiotemporal features are extracted using a Gabor-enhanced Inflated 3D Two-Stream Network, which effectively captures both spatial characteristics and temporal motion dynamics of gestures. These features are then passed to a classification module based on an attention-guided GoogLeNet architecture integrated with temporal encoding and a builder optimization network, enabling improved classification robustness, convergence, and temporal understanding of dynamic gestures. 
To enhance interpretability, the framework incorporates an explainability module using Gradient-weighted Class Activation Mapping++ (Grad-CAM++), which highlights salient regions of hand movements and generates heatmaps to visually explain model predictions. In addition, a human-centric adaptive gesture interface facilitates real-time and user-friendly interaction. The proposed approach is evaluated on multiple benchmark datasets, including ChaLearn LAP IsoGD, NVGesture, EgoGesture, and the Hand Gesture Recognition (HGR) dataset. Performance is assessed using standard metrics such as accuracy, precision, recall, and specificity. Experimental results demonstrate that GTAU-F achieves over 99% performance across all evaluation metrics, indicating strong robustness, high accuracy, and excellent generalization capability across diverse datasets. Overall, GTAU-F emerges as an efficient, interpretable, and robust solution for enhancing dynamic gesture recognition in human–computer interaction systems.

ABSTRACT

Hand gesture recognition plays a vital role in enabling realistic interaction between humans and computers. It has potential across diverse domains, such as assistive technologies, virtual reality, and healthcare. Even though conventional static gesture recognition has made significant advances, dynamic gestures are more challenging to recognize because of occlusions, complex spatiotemporal dependencies, and environmental changes. While current deep learning and transformer-based models have improved accuracy, issues related to interpretability, computational efficiency, and robustness remain. This research presents a Gabor-augmented with Temporal Attention and User-interface Framework (GTAU-F) for recognizing semantic dynamic hand gestures. GTAU-F uses a Gabor-enhanced inflated 3D two-stream network to extract spatiotemporal features, together with an attention-guided GoogLeNet that integrates temporal encoding and builder optimization for coherent convergence. To support interpretability, Gradient-weighted Class Activation Mapping++ (Grad-CAM++) visualizations are employed, and a human-centric adaptive gesture interface facilitates real-time, user-friendly interactions. Evaluation on four datasets (ChaLearn LAP IsoGD, NVGesture, EgoGesture, and the Hand Gesture Recognition database) shows that GTAU-F attains over 99% in accuracy, precision, recall, and specificity. Overall, GTAU-F emerges as an efficient, robust, and interpretable solution for enhancing dynamic gesture recognition in human–computer interaction.

Diverse Locomotion Styles From Linear‐ and Angular‐Velocity Phase Manifolds

Seungmoo Jung, Takashi Kanai — Fri, 22 May 2026 18:36:04 -0700

For real-time character animation, the proposed end-effector-aware phase manifold learning framework preserves fine-grained, gesture-rich walking styles while maintaining coherent whole-body motion. Its two-stage training strategy and angular-velocity-based phase representation better capture stylistic periodicity than conventional global phase representations.

ABSTRACT

Real-time character animation requires generating natural motions while preserving diverse walking styles under user control. Phase-based representations are commonly employed in motion generation frameworks to control periodic motions such as walking; however, existing approaches face a trade-off between preserving fine-grained local periodic details and maintaining coherent whole-body motion. Methods focusing on local periodic features often insufficiently represent other joints, while global phase representations tend to smooth out stylistic details, leading to style homogenization. This paper proposes an end-effector-aware phase manifold learning framework that balances local stylistic features and global motion consistency. The proposed method employs a two-stage training strategy that first learns local periodic characteristics of end-effectors and then integrates full-body periodicity while fixing the learned local representations. In addition, we introduce an angular velocity-based phase representation, which more directly captures the rotational characteristics of walking motions than linear velocity. Experimental results demonstrate improved style preservation for gesture-dominant motions while generating stable whole-body walking motions.

3D Hair Reconstruction From Sketches Using Strand and Depth Maps

Ritsuki Ishiwata, Syuhei Sato, Shun Tatsukawa — Fri, 22 May 2026 06:20:26 -0700

Overview of the proposed sketch-based 3D hair reconstruction pipeline. Given a user-drawn line drawing and a hair-region mask, our method generates two intermediate representations: a strand map that encodes hair flow, computed by diffusing stroke directions using diffusion curves, and a depth map estimated using ControlNet. We then reconstruct a 3D hair model from these maps using the pretrained HairStep network.

ABSTRACT

Creating realistic 3D human hair models typically requires substantial manual effort and expertise. To reduce this burden, many methods have been proposed for automatic hair modeling from images. In particular, multi-view reconstruction approaches based on deep learning or optimization can produce high-quality results, but they often require specialized capture setups, making them difficult for general users. Methods that use only a single image have also been explored for convenience; however, sketch-based hair modeling, despite the popularity of sketch input for other 3D objects, remains relatively under-studied. In this paper, we propose a method for reconstructing diverse hairstyles from sketches. Building on a high-performance single-image hair reconstruction pipeline, we extend its input modality to sketches. Given a sketch image and a hair-region image, our method generates two intermediate representations: a strand map that encodes hair flow, computed by solving a diffusion equation, and a depth map estimated using ControlNet. We then reconstruct a 3D hair model from these maps using the existing reconstruction procedure. Experiments with multiple sketch types demonstrate that our approach can reproduce hair geometry consistent with the input sketches.

Person‐In‐Situ: Scene‐Consistent Human Image Insertion With Occlusion‐Aware Pose Control

Shun Masuda, Yuki Endo, Yoshihiro Kanamori — Thu, 21 May 2026 08:11:53 -0700

We tackle a novel problem of occlusion-aware human image insertion with explicit pose control, which cannot be handled by the state-of-the-art method. Our method can insert a person in a specified pose at an appropriate depth within a scene, without altering the scene's appearance.

ABSTRACT

Compositing human figures into scene images has broad applications in areas such as entertainment and advertising. However, existing methods often cannot handle occlusion of the inserted person by foreground objects and unnaturally place the person in the frontmost layer. Moreover, they offer limited control over the inserted person's pose. To address these challenges, we propose two methods. Both allow explicit pose control via a 3D body model and leverage latent diffusion models to synthesize the person at a contextually appropriate depth, naturally handling occlusions without requiring occlusion masks. The first is a two-stage approach: the model first learns a depth map of the scene with the person through supervised learning, and then synthesizes the person accordingly. The second method learns occlusion implicitly and synthesizes the person directly from input data without explicit depth supervision. Quantitative and qualitative evaluations show that both methods outperform existing approaches by better preserving scene consistency while accurately reflecting occlusions and user-specified poses.

Text‐to‐3D City: Plan‐then‐Execute Urban Generation With LLM Planners and Procedural Synthesis

Xiaohang Dong, Hualong Yu, Xu Zhang, Jianye Wang, Qicheng Li — Thu, 21 May 2026 01:31:12 -0700

Text-to-3D City presents a plan-then-execute framework for large-scale urban generation. An LLM-based City Planner translates natural-language descriptions into structured procedural parameters, which are then executed by a validity-aware PCG pipeline to synthesize controllable, scalable, and engine-ready 3D cities.

ABSTRACT

City-scale 3D urban generation requires planning-level semantic grounding from user intent and scalable geometric synthesis with structural validity and editability. Procedural content generation (PCG) offers controllability and scalability, but is hard to author due to high-dimensional parameters and nonintuitive workflows. Meanwhile, directly generating city geometry or scripts from text with LLMs can suffer from weak large-scale consistency and limited geometric validity, hindering downstream editing and engine deployment. We present Text-to-3D City, a plan-then-execute framework that couples an LLM-based City Planner with a PCG-based Implementer. Given a natural language description, the Planner grounds textual intent into a structured city plan by composing PCG parameters via a schema and in-context exemplars. The Implementer deterministically executes road generation, block extraction, lot subdivision, and asset placement with validity checks and reproducible seeding to synthesize an engine-ready 3D city. Experiments on multi-view renderings evaluate text-scene alignment, diversity, realism, and runtime, demonstrating rapid generation and scalability to large urban scenes.

Virtual Reality or Laptops? A Comparison of Two Formats for Developing Visualization Tools for Environmental Planning and Management

Bin Xu, Robert Newell, Brian White — Tue, 05 May 2026 06:53:58 -0700

A comparative evaluation of computer-based and virtual reality visualization tools in environmental planning demonstrates that VR provides stronger experiential qualities, while both tools effectively communicate information, improve understanding of complex issues, and support informed decision-making in watershed planning.

ABSTRACT

Interactive and realistic visualization tools show strong potential for supporting participatory planning and stakeholder engagement in environmental management. These tools can be delivered as conventional computer applications or through virtual reality (VR), yet their comparative advantages remain unclear. Using the Millstream Creek Watershed (British Columbia, Canada) as a case study, this study compares computer-based and VR versions of an interactive visualization tool to assess usability, experiential qualities, and perceived value in environmental planning and decision-making. A mixed-methods approach was employed through an open house and a series of workshops. Quantitative data from Likert-scale responses were analyzed using single-sample and dependent t-tests, while qualitative data from written responses and group discussion were examined using thematic coding. Results indicate that the VR tool performed more strongly in experiential qualities, including immersion, engagement, and realism, whereas the computer-based version was rated more favorably for physical comfort. Both tools were positively evaluated in terms of perceived value, with VR showing stronger potential for community engagement. Qualitative findings further highlight the importance of soundscapes in enhancing experiential quality and suggest educational applications of interactive visualization tools. Overall, the study provides guidance for selecting visualization formats to support stakeholder and community engagement in environmental planning.

Digital Pasts, Tangible Futures: A Thematic Study of Virtual Heritage (VH) Research

Shu Ma, Amir Karimi — Fri, 01 May 2026 20:52:14 -0700

This study traces the evolution of Virtual Heritage (VH) from digitization to immersive and participatory experiences, highlighting key themes of technology, engagement, authenticity, and sustainability, and showing how VH bridges digital pasts with future heritage practices.

ABSTRACT

This research conducts a bibliometric investigation into the scholarly discourse surrounding virtual heritage from 1997 to 2025, using data extracted from the Scopus database. Analyzing a curated selection of 306 peer-reviewed articles and conference papers, the study uncovers the structural, thematic, and collaborative dimensions of this interdisciplinary field. Employing analytical tools such as CiteSpace, VOSviewer, and the Bibliometrix R package, the research maps influential contributors, institutional networks, and evolving research clusters. The results demonstrate increasing convergence between technology and heritage studies, particularly in virtual reality, user interaction, and artificial intelligence. Despite this growth, notable research voids remain, especially regarding global diversity, conceptual engagement with authenticity, and the long-term preservation of digital reconstructions. The study concludes with methodological observations and proposes future research pathways to foster a more inclusive, sustainable, and critically grounded framework for developing and evaluating Virtual Heritage practices.

Issue Information

Wed, 29 Apr 2026 20:39:34 -0700

Computer Animation and Virtual Worlds, Volume 37, Issue 3, May/June 2026.

Evaluating Idle Animation Believability: A User Perspective

Wed, 29 Apr 2026 20:36:34 -0700

This work demonstrates that there is no perceptual difference between genuine and acted idle motion. The first dataset containing these idle animations is also provided.

ABSTRACT

Animating realistic avatars requires high-quality animations for every possible state the avatar can be in. This includes actions like walking or running, but also subtle movements that convey emotions and personality. Idle animations, such as standing, breathing, or looking around, are crucial for realism and believability. In virtual applications, these are often handcrafted or recorded with actors, but this is costly. Furthermore, recording realistic idle animations may be complex, since an actor's awareness of being recorded could interfere with the genuineness of the movements. Currently, there are no large-scale idle animation datasets for deep learning, and this recording challenge may partly explain why. Nevertheless, this paper concludes that both acted and genuine idle animations are perceived as real, and that users are not able to distinguish between them. It also finds that raw recorded idle animations and artist-retouched ones are perceived differently. These conclusions mean that recording idle animations should be easier than expected: actors can be instructed to act the movements, which significantly simplifies the recording process. This should help future efforts to record idle animation datasets. Finally, we publish ReActIdle, the first three-dimensional idle animation dataset containing long sequences of real and acted idle motions.