The data infrastructure
for physical AI.

Hyphenbox turns raw egocentric/multi-modal data into training-ready datasets for general-purpose robotics.

Stack 01 · Three primitives, one pipeline
01 / Dense Action Labels

Dense action labels

Frame-accurate action segmentation over long-horizon manipulation. Scene context, object state, and contact sequences, annotated as the model will consume them.

Frame-accurate · scene-grounded
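As a rough illustration of what a dense action segment might look like once it reaches a model, here is a minimal sketch in Python. The schema and field names are illustrative only, not our production format.

```python
from dataclasses import dataclass, field

@dataclass
class ActionSegment:
    """One frame-accurate action span in a long-horizon episode (illustrative only)."""
    start_frame: int   # first frame of the action, inclusive
    end_frame: int     # last frame of the action, inclusive
    action: str        # e.g. "open drawer", "grasp mug"
    objects: list[str] = field(default_factory=list)            # scene objects involved
    object_state: dict[str, str] = field(default_factory=dict)  # e.g. {"drawer_01": "open"}
    contacts: list[tuple[int, str]] = field(default_factory=list)  # (frame, "right_hand->mug_03")

# An episode is an ordered list of segments covering every frame,
# so downstream code can ask "what was happening at frame t" directly.
episode = [
    ActionSegment(0, 142, "reach for mug", objects=["mug_03"]),
    ActionSegment(143, 301, "grasp mug", objects=["mug_03"],
                  contacts=[(150, "right_hand->mug_03")]),
]
```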
02 / 2D → 3D Hand Tracking

2D → 3D hand tracking

Millimeter-level 3D hand reconstruction from 2D egocentric video. Stable under self-occlusion and close-range object interaction, the regime where off-the-shelf pose fails.

mm-level · occlusion-stable
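The geometric core of that lift, stripped of the hard parts (self-occlusion, learned depth, temporal fusion), is unprojecting 2D keypoints into camera space. A minimal NumPy sketch, assuming known camera intrinsics and a per-keypoint depth estimate; this illustrates the geometry, not the production method.

```python
import numpy as np

def unproject_keypoints(uv: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift 2D keypoints (N, 2), in pixels, to 3D camera coordinates (N, 3).

    uv    : pixel coordinates of detected keypoints, e.g. 21 hand joints
    depth : estimated depth per keypoint, in meters, shape (N,)
    K     : 3x3 pinhole intrinsics matrix
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Example: 21 hand keypoints at a made-up constant depth of 0.4 m
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = np.random.rand(21, 2) * np.array([640.0, 480.0])
xyz = unproject_keypoints(uv, np.full(21, 0.4), K)  # (21, 3) joints in the camera frame
```

Everything that makes this hard, estimating that depth reliably under occlusion and close-range interaction, happens before this step.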
03 / 2D → 3D Body Pose

2D → 3D body pose

Full-body pose reconstruction from a single egocentric camera. Physics-aware and temporally smooth, with root trajectory: the rig a policy can actually consume.

physics-aware · temporally grounded
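One reason "temporally smooth" matters: per-frame pose estimates jitter, and a policy consuming a root trajectory inherits that jitter. A toy illustration of the idea using an exponential moving average over the root translation; the actual pipeline is not described here and would rely on stronger temporal and physics models.

```python
import numpy as np

def smooth_root_trajectory(root_xyz: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    """Exponentially smooth a per-frame root translation (T, 3) to damp jitter.

    alpha : smoothing factor; lower values trust history more than the new frame.
    """
    smoothed = np.empty_like(root_xyz)
    smoothed[0] = root_xyz[0]
    for t in range(1, len(root_xyz)):
        smoothed[t] = alpha * root_xyz[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

# Example: a noisy 300-frame root trajectory
noisy = np.cumsum(np.random.randn(300, 3) * 0.01, axis=0)
clean = smooth_root_trajectory(noisy)
```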
02 · Thesis
General-purpose robotics is a data problem, not a model problem.

Capture is scaling exponentially. Training-ready data isn't. Models train on whatever reaches them as usable signal, not on hours of raw footage.

A model can't learn from raw pixels alone. It needs to know what action occurred, in what order, with what contact, and where hands and bodies are in 3D, frame by frame.
Annotation is the bottleneck. Generic labeling tools and off-the-shelf pose estimators weren't built for embodied, first-person data.
We build the stack between capture and training: dense action labels, 2D→3D hand, 2D→3D body, human-verified.
Shift 03 · Where the bottleneck moved
Chart · Capture vs. usable data gap, 2022–2026: volume over time, comparing capture against usable data.
03.1 · The mismatch

Collection is scaling. Annotation isn't.

Across real homes, factories, and retail floors, partners are capturing 0+ hours of egocentric video daily.

What ships to a model is a different story. Raw RGB isn't a training-ready dataset. It becomes one only after someone decides, frame by frame, what happened, in what order, with what body, with what contact.

That transformation is the bottleneck. We built the stack that closes it.

The 0 hours of egocentric video used for pretraining NVIDIA Isaac GR00T N1.7 were explicitly action-labeled with dense 3D hand tracking and camera motion, derived directly from raw footage. The data that trained the model didn't exist in raw form; it was constructed. Source: EgoScale / NVIDIA Research · arxiv.org
For Data-Collection Partners 05 · The multiplier
Between capture and training sits the annotation layer.

You already do the hard part. Fielding head-mounted rigs, recruiting operators, clearing consent, capturing the long tail of real homes, factories, and retail floors. None of that is trivial.

Labs don't train on hours of raw footage. They train on hours of enriched data. Turning one into the other is a research problem: dense action labels, 2D→3D hand, 2D→3D body, human-verified, frame by frame.

We sit on the research side of the pipeline. You keep scaling collection. We turn it into the dataset labs train on.

Per hour of footage, indicative
Raw capture $x
Training-ready dataset 6–7× $x
Prevailing rates for embodied annotation work (action labels, 3D hand & body, human-verified) vs. raw egocentric capture.
Pipeline 06 · Raw → training-ready
One stack. Days, not months.
I / Input

Raw ego video

Video, depth, and other signal from real homes, factories, and retail.

01 / Pre-label

VLM pre-labels

Our proprietary VLM proposes dense action segments and scene context.

02 / Reconstruct

2D → 3D hand & body

Millimeter-level 3D hand and full-body pose, lifted from 2D, temporally grounded.

03 / Verify

Human-in-the-loop QA

Every label passes through an expert reviewer. No crowdsourced floor.

O / Output

Training-ready dataset

Dense labels · 3D hand · 3D pose · human-verified. Model-consumable.
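Read end to end, the pipeline is a sequence of transforms over one episode. A hedged sketch of that flow in Python; every function below is a hypothetical stub standing in for the corresponding stage, not a real API.

```python
from typing import Any

def vlm_propose_segments(frames: list[Any]) -> list[dict]:
    # 01 / Pre-label: a VLM proposes dense action segments and scene context (stub).
    return [{"start": 0, "end": len(frames) - 1, "action": "unlabeled"}]

def lift_hands_to_3d(frames: list[Any]) -> list[dict]:
    # 02 / Reconstruct: per-frame 2D -> 3D hand keypoints (stub).
    return [{"frame": i, "hand_xyz": None} for i in range(len(frames))]

def lift_body_to_3d(frames: list[Any]) -> list[dict]:
    # 02 / Reconstruct: per-frame 2D -> 3D body pose with root trajectory (stub).
    return [{"frame": i, "body_xyz": None} for i in range(len(frames))]

def human_review(sample: dict) -> dict:
    # 03 / Verify: an expert reviewer approves or corrects every label (stub).
    sample["verified"] = True
    return sample

def build_training_ready_sample(frames: list[Any]) -> dict:
    """Raw egocentric frames in, one model-consumable training record out."""
    sample = {
        "actions": vlm_propose_segments(frames),
        "hand_pose_3d": lift_hands_to_3d(frames),
        "body_pose_3d": lift_body_to_3d(frames),
    }
    return human_review(sample)

sample = build_training_ready_sample(frames=list(range(300)))  # 300 placeholder frames
```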

0× throughput / annotator · mm-level 3D accuracy · 100% human-verified
07 · Contact

If you're collecting egocentric video, we should talk.

Frontier robotics labs. Data-collection partners. Anyone whose roadmap is bottlenecked on training-ready data.