Overview of the FSP policy learning framework. An oracle FSP policy π* is first trained with
privileged knowledge (a), which is then distilled into a deployable policy π† that operates
on domain-invariant observations (b). Given the real-world setup (c), the policy trained in the
aligned simulated environment can be directly deployed on real robots (d).
To actively complete missing geometric details, we incorporate multi-view observations and pretrain a masked autoencoder
(MAE) on multi-view depth maps, enhancing the policy’s spatial reasoning. Fig. (a) details the architecture.
The MAE module must reconstruct the masked portion of the image from the visible patches, forcing
the policy to infer unobserved geometry. For the FSP policy, we additionally segment the object region,
which may be partially occluded during placement; the pretrained MAE encoder leverages these partial observations
to implicitly imagine the complete geometry, as shown in Fig. (b). For efficiency, we adopt a lightweight
U-Net to perform the segmentation, trained on mask labels provided by SAM2.
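To make the masking step concrete, the following is a minimal sketch of the MAE input pipeline on a single depth map: patchify into non-overlapping patches, then randomly hide a high fraction of them so the decoder must reconstruct the hidden geometry. The function names, patch size, and mask ratio are illustrative assumptions, not details from the paper.

```python
import numpy as np

def patchify(depth, patch=16):
    # Split an (H, W) depth map into non-overlapping, flattened patches.
    H, W = depth.shape
    h, w = H // patch, W // patch
    x = depth.reshape(h, patch, w, patch).transpose(0, 2, 1, 3)
    return x.reshape(h * w, patch * patch)

def random_masking(patches, mask_ratio=0.75, rng=None):
    # Keep a random subset of patches as the encoder input; the MAE
    # decoder is trained to reconstruct the masked remainder.
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)   # True = masked (to be reconstructed)
    mask[keep_idx] = False          # False = visible to the encoder
    return patches[keep_idx], keep_idx, mask

depth = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patches = patchify(depth)                      # (16, 256) patches
visible, keep_idx, mask = random_masking(patches)
```

With a 75% mask ratio only 4 of the 16 patches reach the encoder, which is what enforces completion of unobserved geometry during pretraining.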
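A lightweight U-Net for this kind of per-pixel object mask can be sketched as below; this is a hypothetical single-scale variant in PyTorch, not the paper's exact architecture, and the SAM2-provided masks would serve as the binary supervision targets.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    # Illustrative lightweight U-Net: one encoder stage, one decoder stage,
    # with a skip connection; outputs per-pixel mask logits.
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(
            nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(
            nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = nn.Sequential(
            nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU(),
            nn.Conv2d(base, 1, 1))  # 1-channel object-mask logits

    def forward(self, x):
        s1 = self.enc1(x)                     # full-resolution features
        s2 = self.enc2(self.down(s1))         # half-resolution features
        u = self.up(s2)                       # back to full resolution
        return self.dec(torch.cat([u, s1], 1))  # skip connection + head

net = TinyUNet()
logits = net(torch.zeros(1, 1, 64, 64))  # (1, 1, 64, 64) mask logits
```

Training would minimize a binary cross-entropy loss between `torch.sigmoid(logits)` and the SAM2 mask labels; the small channel count keeps the segmenter cheap relative to the policy.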