Control & Learning Group at Carnegie Mellon University | Remote / Pittsburgh, PA | May 2026 - Present

Robot Manipulation Data Pipelines for VLA / Robot Learning

Building and validating robot manipulation data pipelines for VLA and robot-learning research, with an emphasis on clean dataset formats, reproducible collection, and failure-aware debugging.

Control & Learning Group Research

Research role

Research Assistant, Department of Electrical and Computer Engineering

This work sits between embodied AI research and systems engineering. I am adapting a LIBERO/OpenPI-style data collection workflow to Unitree G1 humanoid manipulation with Dex1 grippers, then packaging successful demonstrations into a LeRobot-compatible structure for downstream policy and VLA model training.

Research systems

What I built

LeRobot / OpenPI-style demonstration datasets

Collected successful manipulation episodes for three LIBERO-style kitchen tasks and packaged each task as a structured dataset with state-action trajectories, task metadata, success indicators, and synchronized rollout videos.

Collected 500 successful episodes per task across 3 tasks, for 1,500 successful demonstrations total.
Packaged 1,500 parquet trajectory files and 4,500 synchronized videos across left-high, left-wrist, and right-wrist camera views.
Recorded episode-level metadata including task labels, success flags, frame indices, rewards, and done signals.
Aligned the data layout with LeRobot/OpenPI expectations so downstream training code can consume the dataset directly.
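
The counts above follow from 3 tasks × 500 episodes and 3 camera views per episode. A minimal sketch of a completeness check is shown below; the directory layout, file naming, camera key spelling, and task names are hypothetical placeholders, not the actual dataset paths.

```python
from itertools import product

# Camera views named in the text; the exact key spelling is an assumption.
CAMERAS = ("left_high", "left_wrist", "right_wrist")


def expected_files(task, n_episodes, cameras=CAMERAS):
    """List the files one task should contain under an assumed layout."""
    parquet = [f"{task}/data/episode_{i:06d}.parquet" for i in range(n_episodes)]
    videos = [
        f"{task}/videos/{cam}/episode_{i:06d}.mp4"
        for i, cam in product(range(n_episodes), cameras)
    ]
    return parquet, videos


def missing_files(expected, present):
    """Return the expected paths that are absent from the present set."""
    return sorted(set(expected) - set(present))


tasks = ["task_a", "task_b", "task_c"]  # placeholder task names
all_parquet, all_videos = [], []
for task in tasks:
    parquet, videos = expected_files(task, 500)
    all_parquet += parquet
    all_videos += videos

print(len(all_parquet), len(all_videos))  # 1500 4500
```

In practice the `present` argument to `missing_files` would come from walking the extracted dataset directory, so any gap shows up as an explicit list of paths rather than a count mismatch.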

Simulation debugging and collection stability

Improved the reliability of long-running collection jobs by diagnosing failed rollouts from simulator logs and videos, then hardening the collection process against memory pressure and task-specific environment issues.

Fixed an end-effector site mismatch by dynamically resolving the gripper site from observed robot state instead of relying on a hardcoded MuJoCo site index.
Moved from single large 500-episode runs to chunked 50-episode collection to reduce memory pressure and make long jobs recoverable.
Merged chunks into clean task-level datasets with reindexed episodes, frames, global indices, videos, and metadata.
Validated final archives after extraction to confirm data files, videos, metadata, and success labels were complete.
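
The chunk-merge step can be sketched as follows. This is a simplified illustration that assumes each chunk is an in-memory list of episodes and each episode is a list of per-frame dicts; the field names `episode_index`, `frame_index`, and `index` are assumed here to mirror LeRobot-style columns, and the real pipeline operates on parquet files and videos rather than Python lists.

```python
def merge_chunks(chunks):
    """Concatenate chunks and rewrite episode, frame, and global indices.

    chunks: list of chunks; each chunk is a list of episodes;
    each episode is a list of frame dicts with chunk-local indices.
    """
    merged = []
    episode_index = 0
    global_index = 0
    for chunk in chunks:
        for episode in chunk:
            for frame_index, frame in enumerate(episode):
                row = dict(frame)  # copy so the source chunk is not mutated
                row["episode_index"] = episode_index
                row["frame_index"] = frame_index
                row["index"] = global_index
                merged.append(row)
                global_index += 1
            episode_index += 1
    return merged


def make_episode(local_episode, n_frames):
    """Build a toy episode whose indices are local to its chunk."""
    return [
        {"episode_index": local_episode, "frame_index": t, "index": t}
        for t in range(n_frames)
    ]


# Two chunks whose episode numbering restarts at 0 in each chunk.
chunks = [
    [make_episode(0, 3), make_episode(1, 2)],
    [make_episode(0, 4)],
]
merged = merge_chunks(chunks)
print(len(merged), merged[-1])
```

Because indices are regenerated from position rather than copied, a failed chunk can be re-collected and merged again without leaving gaps or duplicate episode numbers.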

Outcomes

Current impact

Delivered 1,500 successful robot manipulation episodes across 3 tasks for VLA / robot-learning training workflows.
Converted fragile simulator rollouts into a reproducible dataset pipeline with explicit validation and success metadata.
Strengthened the bridge between AI research goals and production-style data engineering: clean schemas, stable jobs, and debuggable artifacts.

Stack

Tools and methods

Python, MuJoCo, Robosuite, LIBERO, Unitree G1, Dex1, LeRobot, OpenPI, PyArrow, Parquet, HDF5, Linux, tmux, SSH, GPU workflows