Key Modeling Advances

In recent years, several key model categories have emerged that significantly enhance the capabilities of robotic systems:

Large Vision Models (LVMs)

Large Vision Models (LVMs) are neural networks (CNNs or Transformers) trained on vision-only tasks such as classification, segmentation, detection, or representation learning, using image data (and sometimes video or 3D data). Like large language models, LVMs are pretrained on massive datasets and carry large parameter counts (into the billions), allowing them to capture complex visual patterns and relationships.

| Model | Parameters | Training Data Size | GPUs / Total Compute | Notes / Source |
|---|---|---|---|---|
| SAM (Meta) | 636M | 1.1B masks, 11M images | 256 A100s, 4.5k GPU-days | Segmentation (Meta 2023) |
| ViT-G/14 (Google) | 1.8B | 4B images (JFT-4B) | 2048 TPUv4s, weeks | Classification (Zhai et al. 2022) |
| DINOv2 (Meta) | 1B | 142M images | 128 A100s, weeks | Self-supervised (Meta 2023) |
| SwinV2-G (MSRA) | 3B | 70M images | 1536 V100s, 40 days | Classification (Liu et al. 2021) |
| ConvNeXtV2-giant (Meta) | 1.6B | 1B images | 1024 A100s, weeks | Classification (Meta 2023) |
| Florence (Microsoft) | 893M | 900M images | 512 V100s, 10 days | Classification (Yuan et al. 2021) |
| YOLOv9-E | 300M | 100M images | 32 A100s, days | Detection (YOLOv9) |

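In practice, an LVM is often used in a robotics stack as a frozen backbone that turns camera frames into feature vectors for downstream perception or policy modules. The sketch below shows that pattern with DINOv2; it is a minimal example that assumes PyTorch, torchvision, and network access to the facebookresearch/dinov2 torch.hub entry point, and the image path is hypothetical.

```python
# Minimal sketch: extracting image features with a pretrained LVM (DINOv2).
# Assumes PyTorch, torchvision, PIL, and internet access to the torch.hub repo.
import torch
from PIL import Image
from torchvision import transforms

# Load a pretrained DINOv2 ViT-B/14 backbone (no task-specific head).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 is divisible by the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("robot_camera_frame.jpg").convert("RGB")  # hypothetical path
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    features = model(batch)  # global image embedding, (1, 768) for ViT-B/14

print(features.shape)
```

The resulting embedding can feed a lightweight task head (a grasp classifier, an object detector, a policy network) without retraining the backbone.
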
Vision Language Models (VLMs)

VLMs are models that jointly learn from images (or videos) and language. They are trained to align, generate, or reason across both modalities, supporting tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal reasoning.

| Model | Parameters | Training Dataset Size | GPUs / Total Compute | Notes / Source |
|---|---|---|---|---|
| CLIP ViT-L/14 (OpenAI) | 430M | 400M image-text pairs | 256 V100s, 12 days | Alignment (OpenAI 2021) |
| BLIP-2 (Salesforce) | ~340M | 129M image-text pairs | 32 A100s, 14 days (est.) | Captioning (Salesforce 2023) |
| LLaVA-1.5 13B | 13B | 158K conv. images + 400M pretraining | 8 A100s (finetune) | Multimodal chat (LLaVA 2024) |
| PaLI-3 (Google) | 14B | 17B images, 1.4T tokens | 2048 TPUv4s, months | Multimodal (PaLI-3) |
| Kosmos-2 (Microsoft) | 1.6B | 1B image-text pairs + 20M captions | 512 A100s (est.) | Multimodal (Microsoft 2023) |
| IDEFICS 80B (HF) | 80B | 1B image-text pairs | 256 A100s, 1 month | Multimodal (HF 2023) |
| Flamingo 80B (DeepMind) | 80B (frozen ViT-G/14) | 1.8B image-text pairs | 256 A100s, weeks | Multimodal (DeepMind 2022) |
| GPT-4V (OpenAI) | ~1T (est.) | Multi-billion images (closed) | 10,000+ A100s (est.) | Closed (OpenAI 2024) |

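A typical entry point to these models is zero-shot image-text matching: score a camera frame against a handful of natural-language hypotheses and pick the best fit. The sketch below shows that with CLIP via the Hugging Face transformers library; it is a minimal example using the public openai/clip-vit-large-patch14 checkpoint, and the image path and candidate labels are hypothetical.

```python
# Minimal sketch: zero-shot image-text alignment with a pretrained VLM (CLIP).
# Assumes the Hugging Face `transformers` library and access to the model hub.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("gripper_view.jpg").convert("RGB")  # hypothetical camera frame
candidate_labels = [
    "a mug on a table",
    "an empty table",
    "a robot arm holding a mug",
]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{p:.2f}  {label}")
```
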
Large 3D Models (L3DMs)

Large 3D Models (L3DMs) extend the same pretraining recipe to 3D data, learning from large collections of point clouds, voxel grids, and multi-sensor scans (including LIDAR) for tasks such as 3D segmentation, detection, and scene understanding.

| Model | Parameters | Data Scale | Key Domain(s) | Training Approach | Notes / Source |
|---|---|---|---|---|---|
| Point-BERT | 100M+ | 10M+ point clouds | General 3D, LIDAR | Self-supervised (BERT-style) | Point-BERT (NeurIPS 2021) |
| PointNeXt | 150M+ | 10M+ 3D scenes (ScanNet, Waymo) | Scene understanding, LIDAR | Supervised/self-supervised | PointNeXt (CVPR 2023) |
| Segment Anything 3D (SA-3D) | 600M+ | 5M+ 3D scans (multi-sensor, incl. LIDAR) | 3D segmentation, autonomous driving | Masked 3D pretraining | Meta 2024 |
| OneFormer3D | 700M | 2M 3D scans + millions of 2D images | Joint 2D/3D segmentation | Unified transformer | 2024 |
| LidarCLIP | 200M+ | 50M paired LIDAR scans & images | 3D-2D alignment, multi-modal | Contrastive (CLIP-style) | 2024 |
| PointLLM | 1.2B+ | 10M+ 3D point clouds + language | 3D vision-language model | Multi-modal pretraining | 2024 |
| VoxFormer | 1B | 5M+ voxel grids (Waymo, nuScenes) | 3D detection, segmentation | Transformer | 2024 |
| MinkowskiNet-Large | 200M+ | 10M+ 3D scans | Large-scale 3D | Sparse convolution, self-supervised | OpenMMLab 2023/24 |

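Architectures in the table vary, but most share the same contract: consume an unordered set of 3D points and produce a permutation-invariant embedding. The sketch below is a minimal PointNet-style encoder in PyTorch that illustrates this contract; it is not a reimplementation of any listed model, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of a PointNet-style point cloud encoder: a shared per-point MLP
# followed by symmetric (max) pooling, so the output is invariant to point order.
# Illustrative only; not a reimplementation of any model listed above.
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    def __init__(self, feature_dim: int = 256):
        super().__init__()
        # Shared MLP applied independently to every (x, y, z) point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3), e.g. a LIDAR or depth-camera point cloud
        per_point = self.point_mlp(points)        # (batch, num_points, feature_dim)
        global_feature, _ = per_point.max(dim=1)  # symmetric pooling over points
        return global_feature                     # (batch, feature_dim)

# Example: encode a batch of two point clouds with 1024 points each.
encoder = PointCloudEncoder()
clouds = torch.randn(2, 1024, 3)
print(encoder(clouds).shape)  # torch.Size([2, 256])
```
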
Vision Language Action (VLA) Models

VLA models jointly learn or reason over vision, language, and action/decision spaces, mapping visual observations and language instructions to robot actions or control policies.

| Model / Platform | Parameters | Training Data | Compute / GPUs | Main Task / Setting | Source / Year |
|---|---|---|---|---|---|
| NVIDIA Foundation Robot Policy (FRP-1) | 12B | 1M+ sim episodes (Isaac Gym/Sim), 200K real robot demos, 10M web video frames | 512 H100s, weeks | Manipulation, navigation, multi-robot, simulation-to-reality | NVIDIA GTC 2024 |
| Isaac-Action LLM | 7B | 500K robot language-action episodes | 256 H100s, weeks | Language-to-action, household tasks | NVIDIA Isaac Lab 2024 |
| Omniverse GPT | ~20B (est.) | 10M multimodal scene-action pairs (simulated and video) | 1024 H100s, weeks | Simulation, scene understanding, 3D asset generation, instruction following | NVIDIA Omniverse 2025 |
| Kaolin-Embodied | 3B | 100K 3D manipulation episodes + 1M images | 128 A100s, days | 3D vision-action (manipulation, grasping) | NVIDIA Kaolin 2024 |
| RT-X (Google DeepMind) | 55B (w/ frozen LLMs) | 700K robot episodes + 20M web images + language | 256 TPUv4s, weeks | Real-world & simulated robotics, multitask | RT-X 2024 |
| Open X-Embodiment (multi-org) | 1.5B–7B | 500K+ robot demos (across 22 robot types) | 128 A100s, weeks | Multi-robot, multi-task learning | Open X-Embodiment 2024 |
| GPT-4R (OpenAI, rumored) | ~1T (closed) | Rumored: 1M+ real/sim demos, 100M+ images | 10,000+ A100s (est.) | Generalist agent (rumored, closed) | 2025 (rumor) |
| OctoGait (Meta) | 6B | 1M humanoid motion episodes + 10M video frames | 64 A100s, weeks | Locomotion, manipulation, embodied tasks | Meta 2025 |
| RoboCLIP (OpenAI) | 1.2B | 1M RL trajectories, 50M image-text-action pairs | 32 A100s, weeks | Video game / robotic control from vision + language | OpenAI 2024 |
| TouchStone (Google) | 5B | 2M tactile + image + language episodes | 32 TPUv5s, weeks | Tactile robotics, manipulation | Google 2025 |
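
The systems above differ widely in scale and training data, but a common interface pattern is: encode the camera image, encode the language instruction, fuse the two embeddings, and decode an action (for example, an end-effector delta plus a gripper command). The sketch below is a toy illustration of that interface in PyTorch; it is not the architecture of any listed model, and the embedding sizes and the 7-dimensional action are assumptions.

```python
# Toy sketch of a vision-language-action (VLA) policy interface:
# fuse an image embedding with an instruction embedding, then regress an action.
# Illustrative only; not the architecture of any model listed above.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, image_dim: int = 768, text_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # action_dim=7 assumes a 6-DoF end-effector delta plus a gripper command.
        self.fuse = nn.Sequential(
            nn.Linear(image_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.action_head = nn.Linear(256, action_dim)

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_emb: (batch, image_dim) from a frozen vision backbone
        # text_emb:  (batch, text_dim) from a frozen language encoder
        fused = self.fuse(torch.cat([image_emb, text_emb], dim=-1))
        return self.action_head(fused)  # (batch, action_dim) continuous action

# Example: one observation/instruction pair -> one action vector.
policy = ToyVLAPolicy()
action = policy(torch.randn(1, 768), torch.randn(1, 512))
print(action.shape)  # torch.Size([1, 7])
```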