Latent Policy Barrier: Learning Robust Visuomotor Policies by Staying In-Distribution

Zhanyi Sun, Shuran Song

Stanford University

Abstract

Teaser

Visuomotor policies trained via behavior cloning are vulnerable to covariate shift, where small deviations from expert trajectories can compound into failure. Common strategies to mitigate this issue involve expanding the training distribution through human-in-the-loop corrections or synthetic data augmentation. However, these approaches are often labor-intensive, rely on strong task assumptions, or compromise the quality of imitation. We introduce Latent Policy Barrier (LPB), a framework for robust visuomotor policy learning. Inspired by Control Barrier Functions, LPB treats the latent embeddings of expert demonstrations as an implicit barrier separating safe, in-distribution states from unsafe, out-of-distribution (OOD) ones. Our approach decouples precise expert imitation and OOD recovery into two separate modules: a base diffusion policy trained solely on expert data, and a dynamics model trained on both expert and suboptimal policy rollout data. At inference time, the dynamics model predicts future latent states and optimizes them to stay within the expert distribution. Simulated and real-world experiments show that LPB improves both policy robustness and data efficiency, enabling reliable manipulation from limited expert data without additional human correction or annotation.

Technical Summary Video

Method Overview


Latent Policy Barrier (LPB) decouples precise expert imitation from OOD recovery by leveraging two complementary components: (a) a base diffusion policy trained exclusively on consistent, high-quality expert demonstrations, ensuring precise imitation and high task performance; and (b) an action-conditioned visual latent dynamics model trained on a broader, mixed-quality dataset combining expert demonstrations and automatically generated rollout data. At inference time, if the Euclidean distance between the current latent state and the nearest expert state is below a threshold, LPB defaults to standard action denoising. Otherwise, LPB refines the action denoising process by steering the policy in latent space: it uses the dynamics model to predict future latent states conditioned on candidate actions output by the base policy, and minimizes the distance between those predicted latents and their nearest neighbors among the expert demonstrations in the same latent space, effectively keeping the agent within the expert distribution. In this way, LPB achieves both high task performance and robustness, recovering from deviations without compromising imitation precision.
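To make the inference-time procedure concrete, the sketch below illustrates the two steps described above: the nearest-neighbor distance gate, and the latent-space steering of a candidate action chunk. This is a minimal, illustrative PyTorch sketch under our own assumptions, not the released implementation: the interfaces (`dynamics`, the expert latent bank, the distance threshold, and the choice to refine a fully denoised action chunk with a few gradient steps rather than intervening inside the denoising loop) are hypothetical simplifications for exposition.

```python
# Minimal sketch of LPB-style inference-time gating and latent steering.
# All names and hyperparameters are illustrative assumptions.
import torch

def nearest_expert_distance(z, expert_latents):
    """Euclidean distance from each latent in z to its nearest expert latent.

    z:              (..., D) latent state(s)
    expert_latents: (N, D) bank of latents encoded from expert demonstrations
    """
    dists = torch.cdist(z.reshape(-1, z.shape[-1]), expert_latents)  # (M, N)
    return dists.min(dim=-1).values  # (M,)

@torch.no_grad()
def in_distribution(z_t, expert_latents, threshold):
    """Gate: if the current latent is close enough to the expert manifold,
    fall back to standard action denoising with no steering."""
    return nearest_expert_distance(z_t, expert_latents).item() < threshold

def steer_actions(actions, z_t, dynamics, expert_latents, n_steps=10, lr=1e-2):
    """Refine a candidate action chunk so predicted future latents stay
    near the expert distribution.

    actions:  (H, A) action chunk sampled from the base diffusion policy
    z_t:      (D,)   current latent state from the visual encoder
    dynamics: action-conditioned latent dynamics model, dynamics(z, a) -> z'
    """
    dynamics.requires_grad_(False)  # frozen; gradients flow only into actions
    actions = actions.clone().requires_grad_(True)
    optim = torch.optim.Adam([actions], lr=lr)
    for _ in range(n_steps):
        # Roll the dynamics model forward through the action chunk.
        z, future = z_t, []
        for a in actions:
            z = dynamics(z, a)
            future.append(z)
        future = torch.stack(future)  # (H, D)
        # Penalize distance to the nearest expert latent at every future step.
        loss = nearest_expert_distance(future, expert_latents).sum()
        optim.zero_grad()
        loss.backward()
        optim.step()
    return actions.detach()
```

In this simplified view, each control step would sample an action chunk from the base policy, execute it directly when `in_distribution` holds, and otherwise execute the chunk returned by `steer_actions`.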

Simulation Benchmark

Tasks

Comparison to Baselines

As shown in the table, under this limited-demonstration setting (20% of the demonstrations), LPB matches or exceeds every baseline on all four simulated tasks, showing strong sample efficiency.

Real Robot Experiment - Cup Arrangement

Robot setup

We evaluate LPB’s ability to improve the robustness of an off-the-shelf pretrained policy for the cup arrangement task. We perform two groups of experiments: in-distribution initial poses, where the wrist camera initially observes both the cup and the saucer (left), and out-of-distribution initial poses, where the camera sees neither object, and sometimes not even the table (right).

Data Collection

If you are interested in seeing the data collection, please click here.

In-distribution Evaluation

All videos are 4× real-time.

Base Policy (Baseline)

LPB (Ours)

LPB matches the base policy on in-distribution initial poses.

Out-of-distribution Evaluation

All videos are 4× real-time.

Base Policy (Baseline)

LPB (Ours)

LPB substantially outperforms the base policy on out-of-distribution initial poses.

Real Robot Experiment - Belt Assembly

Robot setup

We further evaluate on the Belt Assembly task from the NIST board tasks. We collect 200 expert trajectories to train the base policy and 400 rollout trajectories to train the dynamics model. We evaluate both LPB and the base policy on out-of-distribution initial poses.

All videos are 2× real-time.

Base Policy (Baseline)

LPB (Ours)

Wrist View Rollout

If you are interested in seeing the rollout videos in wrist camera view, which better illustrate the initial task variations, please click here.