May 2026

SABER.

A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation — the first high-fidelity retail robotics action dataset built from natural human behavior, not teleoperation.

The Core Claim

Domain-specific robot deployment is fundamentally a data problem. High-fidelity naturalistic human behavior — systematically captured and retargeted — is a scalable foundation for robot adaptation. No robot in the loop required.

44.8K

Training Samples

100+

Hours Captured

2.19×

Improvement

Resources

Watch Videos

In-store capture demos

arXiv

Research Paper

Download PDF

Paper (local copy)

Dataset

SABER-10K on Hugging Face

Benchmark Results

RoboBenchMart evaluation

Need the full 44.8K corpus or custom capture?

Contact Sales

Full Pipeline

Complete SABER Capture Pipeline

The complete SABER pipeline, shown on synchronized dual-stream video: egocentric video, 360° exocentric view, hand landmarks, body skeleton, and SMPL mesh, all derived from the same real in-store human actions.

The Challenge

Why Retail Demands Its Own Data

Modern vision-language-action models (VLAs) such as GR00T N1.6 achieve near-zero success on retail tasks out of the box, not because the model is weak, but because the retail domain is entirely absent from their training data.

Distinct Skill Distribution

Articulated object interaction, multi-height shelf reaching, basket loading, floor retrieval, and context-dependent placement — all repeated across hundreds of SKUs in layouts no lab can replicate.

Long-Tail Scene Variation

Dense shelves, active restocking, occlusions, varied lighting, reflective packaging, and product deformability create real-world complexity that generic datasets cannot approximate.

Repetition Matters

A model must see skill families repeatedly across contexts — grasping bottles from different shelf heights, opening fridges from varied approach angles — to achieve reliable deployment.

Performance

Key Results at a Glance

2.19×

Improvement in mean success rate over the RoboBenchMart fine-tuning-only baseline

29.3%

Mean success rate across all 10 retail manipulation tasks

91%

Average fridge task success — up from 43% baseline

100%

Non-robot data — entire dataset captured from human video alone

44.8K

Total Samples

100+

Capture Hours

3

Action Streams

10

Eval Tasks

Dataset Architecture

Three Complementary Action Streams

From the same dual-camera in-store captures, three distinct supervision signals are derived — each encoding a different level of kinematic abstraction.

Stream 1

LAPA Latent Actions

25K

Embodiment-agnostic motion tokens derived via inverse-dynamics encoding from egocentric video. Captures whole-arm motion, reach trajectories, and grasping dynamics without robot joint labels.

Egocentric GoPro

Stream 2

Dexterous Hand Retargets

18.6K

21-point hand landmarks are estimated, human-corrected frame by frame, then retargeted to robot joint space via Dex-Retargeting. Provides explicit finger-level supervision.

Egocentric GoPro

Stream 3

Whole-Body Retargets

1.2K

SMPL body parameters estimated from the 360° ALIA view, human-corrected, and retargeted to the Unitree G1 humanoid. Provides torso-arm-leg coordination for floor retrieval and extended reach.

Exocentric ALIA 360°
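The relationship between the three streams and their source cameras can be written down as a small schema. This is a minimal sketch only: the `Stream` enum, the `Sample` record, and all field names are hypothetical and are not taken from the released dataset format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Stream(Enum):
    """The three SABER supervision streams described above."""
    LAPA_LATENT = "lapa_latent"      # embodiment-agnostic motion tokens
    HAND_RETARGET = "hand_retarget"  # robot hand joint targets
    BODY_RETARGET = "body_retarget"  # Unitree G1 whole-body targets


# Camera each stream is derived from, as stated in the stream descriptions.
SOURCE_CAMERA = {
    Stream.LAPA_LATENT: "egocentric GoPro",
    Stream.HAND_RETARGET: "egocentric GoPro",
    Stream.BODY_RETARGET: "exocentric ALIA 360°",
}


@dataclass(frozen=True)
class Sample:
    """Hypothetical record layout; field names are illustrative only."""
    clip_id: str
    stream: Stream
    action: Tuple[float, ...]  # latent token values or joint targets

    @property
    def camera(self) -> str:
        # Look up which capture device this sample's stream came from.
        return SOURCE_CAMERA[self.stream]


example = Sample(clip_id="clip-0", stream=Stream.BODY_RETARGET, action=(0.0, 0.1))
print(example.camera)
```

Keeping the stream as an explicit tag on each record is one way a shared training loop could route samples to the right supervision target.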

Methodology

From Store Footage to Robot Training

SABER is constructed from a dual-stream capture architecture — egocentric GoPro + exocentric ALIA 360° — across multiple real grocery stores.

In-Store Capture

100+ hours across multiple real grocery stores with head-mounted GoPro + DreamVu ALIA 360°

Action Extraction

LAPA encoding, hand pose estimation, and SMPL body estimation with human QC annotation

Robot Retargeting

Dex-Retargeting to robot hand joint space + SMPL-to-Unitree G1 whole-body retargeting

VLA Post-Training

Shared-backbone multi-task training on GR00T N1.6 with flow-matching objective
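For readers unfamiliar with the flow-matching objective named in the last step, the sketch below shows one common form of it for a flat action vector: sample noise, pick a time `t`, move along the straight line from noise to the true action, and regress the constant velocity of that line. The linear path, the noise-at-`t=0` convention, and the function names are assumptions for illustration; the page does not specify how GR00T N1.6 or SABER-MM parameterizes the objective.

```python
import random


def flow_matching_loss(predict_velocity, action, rng):
    """One-sample flow-matching loss for a flat action vector (illustrative)."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    # Point on the straight line from noise (t=0) to the action (t=1).
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # The velocity along that line is constant: action minus noise.
    target = [a - n for n, a in zip(noise, action)]
    predicted = predict_velocity(x_t, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(action)


# Toy check with a made-up 3-dimensional action, not SABER data.
action = [0.2, -0.5, 1.0]


def oracle(x_t, t):
    # Knows the true action, so it can recover the exact velocity.
    return [(a - x) / (1.0 - t) for a, x in zip(action, x_t)]


def zero_model(x_t, t):
    # Always predicts zero velocity, so its loss is strictly positive.
    return [0.0 for _ in x_t]


rng = random.Random(0)
print(flow_matching_loss(oracle, action, rng))
print(flow_matching_loss(zero_model, action, rng))
```

The oracle's loss is zero up to floating-point error, which is a quick sanity check that the interpolation and the target velocity are consistent with each other.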

Demo Videos

Capture Sessions & Task Annotations

Annotated in-store capture footage from the SABER dataset — showing retail manipulation tasks with action labels and multi-scene diversity.

Annotated

Retail Task Cycles

Pushing trolleys, packing goods, arranging goods, opening doors, inspecting labels, and handling baskets.

Annotated

Retail Task Cycles

Placing and moving foods, scooping loose goods, inspecting deformable packets, carrying multiple goods, inspecting fruits, closing doors, and placing goods.

Evaluation

RoboBenchMart Results

SABER-MM post-training on GR00T N1.6 evaluated across 10 retail manipulation tasks spanning fridge, board-to-board, floor pick, and basket pick categories.

Chart: success rate, SABER-MM vs. baseline (RoboBenchMart fine-tuning only)

Mean success, all tasks: 29.3% (SABER-MM) vs. 13.4% (baseline)

Fridge tasks (avg open + close): 91% (SABER-MM) vs. 43% (baseline)

Floor pick tasks (avg): 17% (SABER-MM)

Mean improvement over baseline: 2.19×

Task	Category	Baseline success rate (RoboBenchMart fine-tuning only)	SABER-MM success rate	Relative change
fridge (avg open + close)	Fridge	0.43	0.91	+112%
board_to_board_duff	Board	0.10	0.10	—
board_to_board_nestle	Board	0.02	0.02	—
board_to_board_vanish	Board	0.02	0.11	+450%
pick_from_floor_beans	Floor	0.04	0.17	+325%
pick_from_floor_slam	Floor	0.02	0.17	+750%
pick_to_basket_fanta	Basket	0.08	0.19	+138%
pick_to_basket_nivea	Basket	0.08	0.21	+163%
pick_to_basket_stars	Basket	0.12	0.14	+17%
Mean (all tasks)		0.134	0.293	+119%
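The headline numbers can be reproduced from the table. The sketch below assumes the fridge row, which averages the open and close tasks, counts as two of the 10 tasks; with that weighting the means and the 2.19× ratio come out as reported. The variable and function names are illustrative.

```python
# (baseline, SABER-MM, number of tasks the row represents), copied from the table.
# Assumption: the fridge row stands for two tasks (open and close).
rows = {
    "fridge (avg open + close)": (0.43, 0.91, 2),
    "board_to_board_duff": (0.10, 0.10, 1),
    "board_to_board_nestle": (0.02, 0.02, 1),
    "board_to_board_vanish": (0.02, 0.11, 1),
    "pick_from_floor_beans": (0.04, 0.17, 1),
    "pick_from_floor_slam": (0.02, 0.17, 1),
    "pick_to_basket_fanta": (0.08, 0.19, 1),
    "pick_to_basket_nivea": (0.08, 0.21, 1),
    "pick_to_basket_stars": (0.12, 0.14, 1),
}


def weighted_mean(column):
    """Mean success rate over all tasks for column 0 (baseline) or 1 (SABER-MM)."""
    total_tasks = sum(row[2] for row in rows.values())
    return sum(row[column] * row[2] for row in rows.values()) / total_tasks


def relative_change_percent(before, after):
    """Relative change as a rounded percentage, as in the last table column."""
    return round(100 * (after - before) / before)


baseline_mean = weighted_mean(0)
saber_mean = weighted_mean(1)
print(round(baseline_mean, 3))               # 0.134
print(round(saber_mean, 3))                  # 0.293
print(round(saber_mean / baseline_mean, 2))  # 2.19
print(relative_change_percent(0.43, 0.91))   # 112
```

The 2.19× figure is therefore the ratio of mean success rates, which is the same quantity as the +119% in the final row expressed as a multiplier.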

Training Corpus

SABER-MM Data Composition

The post-training corpus combines SABER's three streams with robot-native anchor data and task-aligned demonstrations — totaling ~52.1K samples.

52.1K

Total Samples

SABER — LAPA Latent Actions

25K samples · Egocentric video

48.0%

SABER — Hand Retargets

18.6K samples · Dex-Retargeting

35.7%

SABER — Body Retargets

1.2K samples · Unitree G1

2.3%

NVIDIA Robot Data

4.8K samples · Anchor signal

9.2%

RoboBenchMart

2.5K samples · Task-aligned

4.8%
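The percentages follow directly from the sample counts. A short check, using the rounded counts quoted in this section (names are illustrative):

```python
# Sample counts in thousands, as quoted in the composition breakdown.
corpus_k = {
    "SABER LAPA latent actions": 25.0,
    "SABER hand retargets": 18.6,
    "SABER body retargets": 1.2,
    "NVIDIA robot data": 4.8,
    "RoboBenchMart": 2.5,
}

total_k = sum(corpus_k.values())
shares = {name: round(100 * k / total_k, 1) for name, k in corpus_k.items()}

# Share of the corpus that comes from human video (the three SABER streams).
human_k = sum(k for name, k in corpus_k.items() if name.startswith("SABER"))
human_share = round(100 * human_k / total_k, 1)

print(round(total_k, 1))  # 52.1
for name, share in shares.items():
    print(f"{name}: {share}%")
print(round(human_k, 1), human_share)  # 44.8 86.0
```

By these counts roughly 86% of the post-training corpus is human-derived, with the remaining 14% split between robot-native anchor data and task-aligned demonstrations.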

Key Insights

What SABER Demonstrates

Finding 01

Human Video Scales Where Teleoperation Can't

SABER demonstrates that high-fidelity naturalistic human behavior, systematically captured and retargeted, is a viable and scalable foundation for domain-specific robot adaptation — without a robot in the loop.

Finding 02

Three Streams Are Complementary

LAPA tokens capture whole-arm trajectory, Dex-Retargeting provides finger-level precision, and body retargets supply torso-arm-leg coordination. Together they provide non-overlapping kinematic information.

Finding 03

Robot-Native Anchor Stabilizes Training

The 4,800 samples of robot-native anchor data proved necessary to stabilize early training even at SABER's scale, suggesting that a general manipulation signal matters for robust convergence.

Finding 04

Task Progress Beyond Binary Success

SABER-MM teaches models to progress further through each task sequence — mean P≥2/3 of 0.445 vs 0.278 baseline — indicating reaching and grasping are well-learned while placement remains the frontier.
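The page does not define P≥2/3. One plausible reading, offered here strictly as an assumption, is the fraction of episodes that complete at least two of three task stages (for example reach, grasp, place). Under that reading the metric could be computed as follows; the function name and the episode data are made up for illustration and are not SABER results.

```python
def fraction_at_least(stages_completed, threshold=2):
    """Share of episodes that completed at least `threshold` stages."""
    if not stages_completed:
        raise ValueError("need at least one episode")
    hits = sum(1 for stages in stages_completed if stages >= threshold)
    return hits / len(stages_completed)


# Made-up outcomes: number of stages (0 to 3) completed in each episode.
episodes = [3, 2, 1, 0, 2, 1, 3, 0]

print(fraction_at_least(episodes))     # 0.5  (at least 2 of 3 stages)
print(fraction_at_least(episodes, 3))  # 0.25 (all 3 stages, i.e. full success)
```

A gap between the two numbers is what the finding describes: episodes that get through the early stages but fail at the last one.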

Citation

Cite This Work

@article{dreamvu2026saber,
  title   = {SABER: A Scalable Action-Based Embodied Dataset
             for Real-World VLA Adaptation},
  author  = {Menga, Narsimha and Sakurikar, Parikshit and Rouhi, Amirreza
             and Reddy, Satya Sai and Govil, Anirudh and Chittajallu, Sri Harsha
             and Aggarwal, Rajat and Namboodiri, Anoop and Reddi, Sashi},
  year    = {2026},
  month   = {May},
  note    = {DreamVu Inc.},
  url     = {https://dreamvu.ai/saber}
}
