Tianao (Owen) Zeng

Control & Learning Group at Carnegie Mellon University | Remote / Pittsburgh, PA | May 2026 - Present

Robot Manipulation Data Pipelines for VLA / Robot Learning

Building and validating robot manipulation data pipelines for VLA and robot-learning research, with an emphasis on clean dataset formats, reproducible collection, and failure-aware debugging.

Research role

Research Assistant, Department of Electrical and Computer Engineering

This work sits between embodied AI research and systems engineering. I am adapting a LIBERO/OpenPI-style data collection workflow to Unitree G1 humanoid manipulation with Dex1 grippers, then packaging successful demonstrations into a LeRobot-compatible structure for downstream policy and VLA model training.

Research systems

What I built

LeRobot / OpenPI-style demonstration datasets

Collected successful manipulation episodes for three LIBERO-style kitchen tasks and packaged each task as a structured dataset with state-action trajectories, task metadata, success indicators, and synchronized rollout videos.

  • Collected 500 successful episodes per task across 3 tasks, for 1,500 successful demonstrations total.
  • Packaged 1,500 parquet trajectory files and 4,500 synchronized videos across left-high, left-wrist, and right-wrist camera views.
  • Recorded episode-level metadata including task labels, success flags, frame indices, rewards, and done signals.
  • Aligned the data layout with LeRobot/OpenPI expectations so downstream training code can consume the dataset directly.
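
The counts above follow from simple arithmetic, which can be written down as a small consistency check. This is only an illustrative sketch, not the project's code: the `DatasetPlan` class and the camera key names are hypothetical, and it assumes one parquet file per episode and one video per episode per camera view, which is what the numbers above imply.

```python
from dataclasses import dataclass

# Hypothetical key names for the three camera views listed above.
CAMERA_VIEWS = ("left_high", "left_wrist", "right_wrist")


@dataclass(frozen=True)
class DatasetPlan:
    """Expected artifact counts for a multi-task demonstration dataset."""

    num_tasks: int
    episodes_per_task: int
    camera_views: tuple

    @property
    def num_episodes(self) -> int:
        return self.num_tasks * self.episodes_per_task

    @property
    def num_parquet_files(self) -> int:
        # Assumption: one parquet trajectory file per episode.
        return self.num_episodes

    @property
    def num_videos(self) -> int:
        # Assumption: one video per episode per camera view.
        return self.num_episodes * len(self.camera_views)


plan = DatasetPlan(num_tasks=3, episodes_per_task=500, camera_views=CAMERA_VIEWS)
print(plan.num_episodes, plan.num_parquet_files, plan.num_videos)
```

A check like this is useful mainly as a guard: comparing the expected counts against what is actually on disk catches missing videos or trajectory files before training starts.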

Simulation debugging and collection stability

Improved the reliability of long-running collection jobs by diagnosing failed rollouts from simulator logs and videos, then hardening the collection process around memory and task-specific environment issues.

  • Fixed an end-effector site mismatch by dynamically resolving the gripper site from observed robot state instead of relying on a hardcoded MuJoCo site index.
  • Moved from single large 500-episode runs to chunked 50-episode collection to reduce memory pressure and make long jobs recoverable.
  • Merged chunks into clean task-level datasets with reindexed episodes, frames, global indices, videos, and metadata.
  • Validated final archives after extraction to confirm data files, videos, metadata, and success labels were complete.
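
The chunk-merge step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline itself: frames are modeled as dictionaries, the helper names `merge_chunks` and `indices_are_contiguous` are hypothetical, and the field names `episode_index`, `frame_index`, and `index` are assumed here for illustration rather than taken from the actual dataset schema.

```python
def merge_chunks(chunks):
    """Merge collection chunks into one flat list of frame rows.

    `chunks` is a list of chunks; each chunk is a list of episodes;
    each episode is a list of frame dicts whose indices are chunk-local.
    Episode, frame, and global indices are reassigned so they are
    contiguous across the merged dataset.
    """
    merged = []
    next_episode = 0
    next_global = 0
    for chunk in chunks:
        for episode in chunk:
            for frame_idx, frame in enumerate(episode):
                row = dict(frame)  # copy so the source chunk is not mutated
                row["episode_index"] = next_episode
                row["frame_index"] = frame_idx
                row["index"] = next_global
                merged.append(row)
                next_global += 1
            next_episode += 1
    return merged


def indices_are_contiguous(rows, expected_episodes):
    """Return True if episode and global indices have no gaps or repeats."""
    episodes = sorted({row["episode_index"] for row in rows})
    global_ok = [row["index"] for row in rows] == list(range(len(rows)))
    return episodes == list(range(expected_episodes)) and global_ok


# Toy example: 2 chunks, 2 episodes each, 3 frames per episode.
# Every chunk starts its local episode numbering at 0 again.
toy_chunks = [
    [[{"episode_index": e, "frame_index": f, "success": True} for f in range(3)]
     for e in range(2)]
    for _ in range(2)
]
merged_rows = merge_chunks(toy_chunks)
print(len(merged_rows), indices_are_contiguous(merged_rows, expected_episodes=4))
```

Reassigning indices during the merge, rather than trusting chunk-local values, is what lets a failed 50-episode chunk be re-collected without disturbing the rest of the task-level dataset.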

Outcomes

Current impact

  • Delivered 1,500 successful robot manipulation episodes across 3 tasks for VLA / robot-learning training workflows.
  • Converted fragile simulator rollouts into a reproducible dataset pipeline with explicit validation and success metadata.
  • Strengthened the bridge between AI research goals and production-style data engineering: clean schemas, stable jobs, and debuggable artifacts.

Stack

Tools and methods

Python, MuJoCo, Robosuite, LIBERO, Unitree G1, Dex1, LeRobot, OpenPI, PyArrow, Parquet, HDF5, Linux, tmux, SSH, GPU workflows