Visuo-Tactile World Models

1University of Washington, 2FAIR, Meta

Equal advising

Abstract

We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile images, VT-WM better understands robot-object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better performance at maintaining object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Moreover, experiments show that grounding in contact dynamics also translates to planning. In zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM demonstrates significant downstream versatility, effectively adapting its learned contact dynamics to a novel task and achieving reliable planning success with only a limited set of demonstrations.

Architecture

Our visuo-tactile world model is designed to address a key challenge in multimodal robot world models: how to combine exocentric vision with tactile sensing in order to generate consistent imagined futures. As shown in the figure, the architecture consists of three main components: a vision encoder, a tactile encoder, and an autoregressive predictor.

VT-WM block diagram
Visuo-Tactile World Model. Vision (s_k) and tactile (t_k) latents, obtained from Cosmos and Sparsh encoders, are processed by a transformer predictor conditioned on control actions a_k to generate next-step states (s_{k+1}, t_{k+1}).
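The predictor's interface can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the latent and action dimensions are toy values, and a single random linear map stands in for the transformer that maps the concatenated vision latent, tactile latent, and action to the next-step latents.

```python
import numpy as np

rng = np.random.default_rng(0)
D_S, D_T, D_A = 8, 4, 3  # toy vision-latent, tactile-latent, action sizes

# Stand-in for the transformer predictor: one linear map over the
# concatenated [vision latent, tactile latent, action] input.
W = rng.standard_normal((D_S + D_T, D_S + D_T + D_A)) * 0.1

def predict_next(s_k, t_k, a_k):
    """Predict (s_{k+1}, t_{k+1}) from current latents and action a_k."""
    x = np.concatenate([s_k, t_k, a_k])
    y = W @ x
    return y[:D_S], y[D_S:]

s_k, t_k = rng.standard_normal(D_S), rng.standard_normal(D_T)
a_k = rng.standard_normal(D_A)
s_next, t_next = predict_next(s_k, t_k, a_k)
print(s_next.shape, t_next.shape)  # (8,) (4,)
```

The key structural point is that both modalities and the action share one input to the predictor, so the next visual latent can depend on tactile state and vice versa.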

Vision and Touch in Imagination

To illustrate the predictive capability of the visuo-tactile world model, we evaluate rollouts conditioned on real robot action sequences. Specifically, we use held-out demonstrations from two tasks in our dataset: press button and scribble with marker. For each task, the VT-WM is queried autoregressively using the ground-truth sequence of control deltas.

Since the model produces latent representations of future visual and tactile observations, we employ pretrained decoders to reconstruct these latents for visualization. Across both tasks, the predicted visual states closely resemble the final RGB images of the real trajectories. The predicted tactile states also capture the key interaction events: although slight differences appear in the precise location of per-finger contacts, the rollouts consistently indicate whether contact occurs.
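The rollout procedure above reduces to a simple loop: each predicted latent pair is fed back in with the next ground-truth action delta. The sketch below uses the same toy linear stand-in for the predictor (hypothetical names and sizes, not the released model); the decoders used for visualization are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
D_S, D_T, D_A, H = 8, 4, 3, 5  # toy latent/action sizes, horizon

# Toy stand-in for the learned predictor (see architecture section).
W = rng.standard_normal((D_S + D_T, D_S + D_T + D_A)) * 0.1

def predict_next(s, t, a):
    y = W @ np.concatenate([s, t, a])
    return y[:D_S], y[D_S:]

def rollout(s0, t0, actions):
    """Autoregressive rollout: feed each predicted latent pair back in,
    conditioned on the ground-truth sequence of control deltas."""
    s, t = s0, t0
    traj = []
    for a in actions:
        s, t = predict_next(s, t, a)
        traj.append((s, t))
    return traj

actions = rng.standard_normal((H, D_A))  # ground-truth action deltas
traj = rollout(rng.standard_normal(D_S), rng.standard_normal(D_T), actions)
print(len(traj))  # 5 predicted (s, t) pairs, one per action
```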

Ground Truth Sequence

VT-WM Sequence

Planning in Imagination

The action-conditioned nature of our predictor enables the use of the visuo-tactile world model as a simulator within the Cross-Entropy Method (CEM). At each step, the planner samples a population of action sequences {a^i_{k:k+H}} for i = 1,...,N over a horizon H. For each sequence, the predictor autoregressively generates future latents (s_{k+1:k+H}, t_{k+1:k+H}). A cost function, defined by energy minimization with respect to a goal image, assigns a score to each trajectory. In practice, this cost can be as simple as an L2 distance between the final predicted visual latent s_{k+H} and the latent of the goal image s_goal. CEM then selects the top-performing fraction of sequences, updates the sampling distribution toward them, and iterates until convergence. The best sequence is then executed on the real robot in an open-loop manner.
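The CEM loop described above can be sketched as follows. This is an illustrative toy, assuming a Gaussian sampling distribution with diagonal variance: the world model is replaced by a trivial function whose "final latent" is the cumulative sum of the actions, so the planner should steer that sum toward the goal latent; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
D_A, H, N, K, ITERS = 3, 4, 64, 8, 10  # action dim, horizon, population, elites, iterations

def rollout_final_latent(a_seq):
    # Toy stand-in for the world model: the final "visual latent"
    # is just the cumulative effect (sum) of the action sequence.
    return a_seq.sum(axis=0)

s_goal = np.array([1.0, -0.5, 0.25])  # latent of the goal image (toy)

mu, sigma = np.zeros((H, D_A)), np.ones((H, D_A))
for _ in range(ITERS):
    # Sample N action sequences from the current Gaussian.
    pop = mu + sigma * rng.standard_normal((N, H, D_A))
    # Score each trajectory: L2 distance of final latent to the goal latent.
    costs = np.array([np.sum((rollout_final_latent(a) - s_goal) ** 2)
                      for a in pop])
    # Refit the sampling distribution to the top-K (elite) sequences.
    elites = pop[np.argsort(costs)[:K]]
    mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6

best = mu  # best action sequence: executed open-loop on the robot
print(best.sum(axis=0))
```

After a few iterations the summed actions of `best` land close to `s_goal`, illustrating how the elite-refitting loop concentrates the distribution on low-cost plans.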

V-WM (imagination 💭)

VT-WM (imagination 💭)

V-WM (zero-shot 🤖)

VT-WM (zero-shot 🤖)

Planning through VT-WM

The bar plot shows success rates, averaged over five trials per task from distinct initial conditions. The results confirm the stronger planning capability of VT-WM across all tasks, supporting our hypothesis that a contact-aware model generates more effective plans. The benefits of tactile input become increasingly evident in contact-rich tasks: VT-WM improves success rates by 10% on push fruits, 35% on reach & push, 31% on wipe cloth, and 11% on stack cubes. These gains are most pronounced in multistep tasks involving sustained contact, where vision alone is insufficient to inform the planner about the object state.

Success rate of zero-shot transfer of plans via CEM with VT-WM and V-WM on a real robot.

Object-wise success rate bar graph

Do VT-WMs better capture object permanence and plausible physics than vision-only WMs, and generate futures consistent with action conditioning?

In the figure below, we show snapshots of a trajectory where the robot performs a wiping motion just above a cloth, without making contact. In the ground-truth sequence (top row), keypoints on the cloth remain stationary. In contrast, the V-WM's rollout (bottom row), conditioned on the real actions, shows significant displacement of keypoints and deformation of the cloth. This highlights the V-WM's difficulty distinguishing between contact and non-contact states from visual input alone. The VT-WM's rollouts (middle row), however, exhibit fewer artifacts and less variation, demonstrating the advantage of tactile sensing in providing the world model with critical physical grounding.

VT-WM prevents spurious motion of objects not subject to forces, whereas V-WM often hallucinates unintended displacements

Trajectory rollout snapshots (ground truth, VT-WM, V-WM)

BibTeX

If you find VT-WM useful for your work, please cite:

@misc{higuera2026visuotactileworldmodels,
      title={Visuo-Tactile World Models}, 
      author={Carolina Higuera and Sergio Arnaud and Byron Boots and Mustafa Mukadam and Francois Robert Hogan and Franziska Meier},
      year={2026},
      eprint={2602.06001},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2602.06001}, 
}