Overview of the FSP policy learning framework. An oracle FSP policy π* is first trained with
privileged knowledge (a), which is then distilled into a deployable policy π† that operates
on domain-invariant observations (b). Given the real-world setup (c), the policy trained in the
aligned simulated environment can be directly deployed on real robots (d).
To actively complete missing geometric details, we incorporate multi-view observations and pretrain a masked autoencoder
(MAE) on multi-view depth maps, enhancing the policy’s spatial reasoning. Fig. (a) details the architecture.
The MAE module must reconstruct the masked portion of the image from the visible patches, forcing
the policy to infer unobserved geometry. For the FSP policy, we additionally segment the object region,
which may be partially occluded during placement; the pretrained MAE encoder leverages these partial observations
to implicitly imagine the complete geometry, as shown in Fig. (b). For efficiency, we adopt a lightweight
U-Net to perform the segmentation, trained on mask labels provided by SAM2.
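To make the masking step concrete, the following is a minimal sketch of the MAE input pipeline on a single depth map: patchify into non-overlapping patches, then randomly hide a high fraction of them so the decoder must reconstruct the hidden geometry. The function names, patch size, and mask ratio are illustrative assumptions, not details from the paper.

```python
import numpy as np

def patchify(depth, patch=16):
    # Split an (H, W) depth map into non-overlapping, flattened patches.
    H, W = depth.shape
    h, w = H // patch, W // patch
    x = depth.reshape(h, patch, w, patch).transpose(0, 2, 1, 3)
    return x.reshape(h * w, patch * patch)

def random_masking(patches, mask_ratio=0.75, rng=None):
    # Keep a random subset of patches as the encoder input; the MAE
    # decoder is trained to reconstruct the masked remainder.
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)   # True = masked (to be reconstructed)
    mask[keep_idx] = False          # False = visible to the encoder
    return patches[keep_idx], keep_idx, mask

depth = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patches = patchify(depth)                      # (16, 256) patches
visible, keep_idx, mask = random_masking(patches)
```

With a 75% mask ratio only 4 of the 16 patches reach the encoder, which is what enforces completion of unobserved geometry during pretraining.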
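A lightweight U-Net for this kind of per-pixel object mask can be sketched as below; this is a hypothetical single-scale variant in PyTorch, not the paper's exact architecture, and the SAM2-provided masks would serve as the binary supervision targets.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # Illustrative lightweight U-Net: one encoder stage, one decoder stage,
    # with a skip connection; outputs per-pixel mask logits.
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(
            nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, 1, 1))  # 1-channel object-mask logits

    def forward(self, x):
        s1 = self.enc1(x)                     # full-resolution features
        s2 = self.enc2(self.down(s1))         # half-resolution features
        u = self.up(s2)                       # back to full resolution
        return self.dec(torch.cat([u, s1], 1))  # skip connection + head

net = TinyUNet()
logits = net(torch.zeros(1, 1, 64, 64))  # (1, 1, 64, 64) mask logits
```

Training would minimize a binary cross-entropy loss between `torch.sigmoid(logits)` and the SAM2 mask labels; the small channel count keeps the segmenter cheap relative to the policy.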