CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation

Haonan Chen*¹,², Yuxiang Ma*³, Stephen Tian², Xiaoshen Han¹, Wenlong Huang², Feiyang Wu¹

Yunzhu Li⁴, Jiajun Wu², Edward H. Adelson³, Yilun Du¹


*Equal contribution.



CoStream demonstrates robust contact-rich manipulation with real-time recovery from external perturbations during CPU, GPU, and RAM component assembly.

Abstract

Long-horizon, contact-rich manipulation — such as seating a GPU into a PCIe slot — demands both millimeter precision and generalization to new tasks. Classical pipelines achieve precise control through brittle, task-specific interfaces that need costly redesigns, while monolithic policies generalize better but lose precision on out-of-distribution tasks unless retrained. Both assume a capability, once acquired, must ship as a rigid whole rather than being freely decomposed and recomposed.

We show that complex manipulation can emerge from composing simple, independent behaviors. CoStream orchestrates foundation models and diverse sensors into composable core behaviors: a semantic behavior extracting spatial constraints, a predictive behavior forecasting trajectories from imagined videos, and a reactive behavior providing high-frequency tactile corrections. On a shared SE(3) interface, these compose by right-multiplication into one pose command per control step.

We demonstrate CoStream on 8 real-world tasks spanning everyday manipulation and precision assembly — with the strongest gains in contact-rich assembly and object transfer — and show robust recovery from manual perturbations during execution.

CoStream: Composing Simple Behaviors for Complex Contact-Rich Manipulation. CoStream composes multiple sensor-grounded behaviors into a single end-effector command. A semantic behavior parses instructions with an LLM and a VLM into geometric constraints. A predictive behavior extracts a 3D reference trajectory from a video world model. A reactive behavior closes a high-rate loop from tactile and force feedback. The behaviors share an SE(3) interface and compose by right-multiplication into one pose command at every control step, which a compliant controller executes while regulating contact force.
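As a minimal sketch of what composing by right-multiplication on a shared SE(3) interface can look like, the snippet below represents each behavior's output as a 4x4 homogeneous transform and multiplies them in order. The helper names (`se3`, `rot_z`, `compose`), the ordering of the three behaviors, and all numeric values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def se3(rotation: np.ndarray, translation) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def rot_z(angle_rad: float) -> np.ndarray:
    """Rotation about the z-axis by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def compose(base_pose: np.ndarray, deltas) -> np.ndarray:
    """Right-multiply each delta onto the running pose, in order.

    With right-multiplication, each delta is expressed in the local frame
    produced by the transforms that precede it.
    """
    pose = base_pose.copy()
    for delta in deltas:
        pose = pose @ delta
    return pose


# Hypothetical outputs for one control step (values are illustrative only).
semantic = se3(rot_z(np.pi / 2), [0.4, 0.0, 0.2])   # coarse target pose
predictive = se3(np.eye(3), [0.01, 0.0, 0.0])       # 10 mm along local x
reactive = se3(np.eye(3), [0.0, -0.0005, 0.0])      # 0.5 mm tactile nudge

command = compose(semantic, [predictive, reactive])
print(np.round(command[:3, 3], 4))
```

Because the target pose is rotated 90 degrees about z, the local x and y offsets of the later behaviors land on the world y and x axes respectively, which is the practical consequence of composing on the right rather than the left.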

Why CoStream?

The same set of behaviors solves 8 real-world tasks, from sub-millimeter assembly to everyday manipulation, with no per-task demonstrations, retraining, or pipeline redesign.

Classical Pipelines

New task: × Re-engineer perception & planning modules
Contact precision: × Brittle, open-loop interfaces
Add a sensor: × Rewrite hand-specified interfaces

Monolithic VLAs

New task: × Collect fresh demos & retrain
Contact precision: × Destabilizes on contact
Add a sensor: × Retrain the whole network

CoStream (Ours)

New task: ✓ Drop-in, no retraining
Contact precision: ✓ Sub-mm via 25 Hz reactive behavior
Add a sensor: ✓ Add one behavior; SE(3) composes it

Task Performance

We evaluate CoStream on 8 real-world tasks (15 trials each) across two groups: high-precision contact-rich assembly and everyday manipulation.

Contact-Rich Assembly

CoStream reaches a 96.7% mean success rate, while VoxPoser and π_0.5 complete none (0.0%). Drill insertion has a 0.5 mm clearance.

Everyday Manipulation

CoStream (66.7% mean) improves over π_0.5 (11.7% mean) across everyday tasks, with the largest gains on object transfer.

Ablation: Reactive Behavior

The reactive behavior is essential for sub-millimeter precision. On drill insertion (0.5 mm clearance), removing it drops success from 100% to 20%, which indicates that tactile and force feedback is critical at such tight clearances.

Long-Horizon Motherboard Assembly

CoStream completes the full motherboard assembly sequence — transitioning seamlessly between CPU, GPU, and RAM insertion — demonstrating compositional generalization across contact-rich subtasks without task-specific retraining.

Same behaviors, zero retraining between subtasks. Full motherboard assembly — CPU, GPU, and RAM installation in sequence (4× speed).

Everyday Manipulation

CoStream transfers to everyday manipulation without per-task retraining. We compare CoStream (left) against the monolithic VLA baseline π_0.5 (right) on the same tasks.

Cup → Plate — CoStream (Ours)

Cup → Plate — π_0.5

Clothes → Box — CoStream (Ours)

Clothes → Box — π_0.5

Turn on the Lamp — CoStream (Ours)

Turn on the Lamp — π_0.5

Robustness to Human Perturbations

CoStream's reactive behavior provides real-time corrections that preserve the objectives set by the semantic and predictive behaviors, even under external perturbations during execution.

CPU Perturbation

GPU Perturbation

RAM Perturbation

Predictive Behavior

The predictive behavior uses a video world model to "imagine" candidate trajectories from the current image and step instruction. A VLM critic selects the rollout that best satisfies the semantic and safety constraints, and a 3D keypoint tracker lifts it into an object-centric motion prior.

Crucially, this is just one behavior. Every behavior — foundation model, learned network, or classical control — is simply a function f(observation) → SE(3). Because they share this interface, extending CoStream means adding one behavior, not redesigning a pipeline or retraining a policy.
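The shared interface described above can be sketched as a plain callable type. In this hedged example every name (`Behavior`, `semantic_behavior`, `reactive_behavior`, `proximity_behavior`, `pose_command`), every observation key, and every gain or threshold is hypothetical; the point is only that adding a sensor amounts to appending one more function that returns an SE(3) transform.

```python
from typing import Callable, Dict, List

import numpy as np

Observation = Dict[str, np.ndarray]
# Assumed interface: a behavior maps an observation to a 4x4 SE(3) matrix.
Behavior = Callable[[Observation], np.ndarray]


def translate(x: float, y: float, z: float) -> np.ndarray:
    """Pure-translation SE(3) transform as a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


def semantic_behavior(obs: Observation) -> np.ndarray:
    # Hypothetical: move to a goal position supplied in the observation.
    return translate(*obs["goal_xyz"])


def reactive_behavior(obs: Observation) -> np.ndarray:
    # Hypothetical: small lateral nudge opposing a measured contact force.
    gain = 1e-4  # metres per newton, illustrative value only
    fx, fy = obs["force_xy"]
    return translate(-gain * fx, -gain * fy, 0.0)


def proximity_behavior(obs: Observation) -> np.ndarray:
    # Hypothetical new sensor: lift by the amount the clearance falls short.
    shortfall = max(0.0, 0.02 - float(obs["clearance"][0]))
    return translate(0.0, 0.0, shortfall)


def pose_command(behaviors: List[Behavior], obs: Observation) -> np.ndarray:
    """Compose behavior outputs by right-multiplication into one command."""
    pose = np.eye(4)
    for behavior in behaviors:
        pose = pose @ behavior(obs)
    return pose


obs = {
    "goal_xyz": np.array([0.4, 0.0, 0.2]),
    "force_xy": np.array([2.0, 0.0]),
    "clearance": np.array([0.015]),
}
behaviors = [semantic_behavior, reactive_behavior]
before = pose_command(behaviors, obs)

behaviors.append(proximity_behavior)  # the only change for the new sensor
after = pose_command(behaviors, obs)
```

Neither existing behavior is modified when `proximity_behavior` is appended, and `pose_command` is unchanged, which is the property the shared SE(3) interface is meant to provide.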

Selected Trajectory

The video world model generates a valid trajectory for GPU insertion

Rejected Trajectory

The VLM critic filters out physically implausible rollouts
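The select-or-reject step illustrated by the two rollouts above can be sketched as scoring each candidate and keeping the best one that clears a threshold. The `Rollout` fields, the `toy_critic` function standing in for a VLM critic, the `min_score` threshold, and the candidate names are all assumptions for illustration; the page does not specify how the critic scores rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rollout:
    """Stand-in for one imagined video rollout (all fields hypothetical)."""
    name: str
    constraints_ok: bool   # would come from the critic's constraint check
    plausibility: float    # would come from the critic's physical judgment


def toy_critic(rollout: Rollout) -> float:
    """Placeholder critic: zero if constraints fail, else the plausibility."""
    return rollout.plausibility if rollout.constraints_ok else 0.0


def select_rollout(
    rollouts: List[Rollout],
    critic: Callable[[Rollout], float],
    min_score: float = 0.5,
) -> Optional[Rollout]:
    """Return the highest-scoring rollout at or above min_score, else None."""
    scored = [(critic(r), r) for r in rollouts]
    accepted = [pair for pair in scored if pair[0] >= min_score]
    if not accepted:
        return None
    return max(accepted, key=lambda pair: pair[0])[1]


candidates = [
    Rollout("gpu_passes_through_board", constraints_ok=False, plausibility=0.9),
    Rollout("gpu_tilted_approach", constraints_ok=True, plausibility=0.6),
    Rollout("gpu_aligned_descent", constraints_ok=True, plausibility=0.8),
]
best = select_rollout(candidates, toy_critic)
print(best.name if best else "no acceptable rollout")
```

Returning `None` when every candidate is rejected leaves the caller free to request fresh rollouts instead of executing an implausible one.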

Failure Cases

Despite a 96.7% mean success rate on contact-rich assembly, failures remain. They come from two main sources: perturbations that exceed the reactive behavior's correction capacity, and semantic reasoning errors during multi-stage planning.

GPU Perturbation Failure

A large-magnitude perturbation displaces the GPU beyond the reactive behavior's recovery range, resulting in a failed insertion attempt.

RAM LLM Stage Failure

An incorrect stage transition predicted by the LLM semantic behavior causes the robot to proceed prematurely, resulting in a misaligned RAM insertion.

BibTeX

@misc{costream2026,
  title  = {CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation},
  author = {Chen, Haonan and Ma, Yuxiang and Tian, Stephen and Han, Xiaoshen and Huang, Wenlong and Wu, Feiyang and Li, Yunzhu and Wu, Jiajun and Adelson, Edward H. and Du, Yilun},
  year   = {2026}
}


