Key Modeling Advances
In recent years, several key model categories have emerged that significantly enhance the capabilities of robotic systems:
- Large Vision Models (LVMs) that scale up visual perception and support zero-shot inference on a wide range of visual tasks.
- Vision Language Models (VLMs) that extend LLMs to achieve multimodal reasoning and instruction following.
- Large 3D Models (L3DMs) that apply large-scale pretraining to 3D data such as point clouds and LIDAR scans.
- Vision Language Action (VLA) models that generate robot actions from visual input and natural language instructions.
Large Vision Models (LVMs)
Large Vision Models (LVMs) are neural networks (CNNs or Transformers) trained on vision-only tasks such as classification, segmentation, detection, or representation learning, using image data (and sometimes video or 3D data). Like large language models, they are pretrained on massive datasets and carry large parameter counts (into the billions), which lets them capture complex visual patterns and relationships. A minimal inference sketch follows the table below.
Model | Parameters | Training Data Size | GPUs / Total Compute | Notes / Source |
---|---|---|---|---|
SAM (Meta) | 636M | 1.1B masks, 11M images | 256 A100s, 4.5k GPU-days | Segmentation (Meta 2023) |
ViT-G/14 (Google) | 1.8B | 4B images (JFT-4B) | 2048 TPUv4s, weeks | Classification (Zhai et al. 2022) |
DINOv2 (Meta) | 1B | 142M images | 128 A100s, weeks | Self-supervised (Meta 2023) |
SwinV2-G (MSRA) | 3B | 70M images | 1536 V100s, 40 days | Classification (Liu et al. 2021) |
ConvNeXtV2-giant (Meta) | 1.6B | 1B images | 1024 A100s, weeks | Classification (Meta 2023) |
Florence (Microsoft) | 893M | 900M images | 512 V100s, 10 days | Classification (Yuan et al. 2021) |
YOLOv9-E | 300M | 100M images | 32 A100s, days | Detection (YOLOv9) |
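To make the LVM idea concrete, the sketch below loads a pretrained DINOv2 backbone (one of the models in the table) through its public torch.hub entry point and extracts a global image embedding. The image path `scene.jpg` is a placeholder, and the model variant and preprocessing are illustrative assumptions rather than a prescribed pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a pretrained DINOv2 ViT-B/14 backbone from torch.hub
# (weights are downloaded on first use).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

# Standard ImageNet-style preprocessing; crop size must be divisible by 14.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("scene.jpg").convert("RGB")  # placeholder image path
batch = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)

with torch.no_grad():
    features = model(batch)                     # (1, 768) global image embedding

print(features.shape)
```

The resulting embedding can be reused downstream (retrieval, clustering, or as a frozen visual feature for a robot policy) without task-specific fine-tuning.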
Vision Language Models (VLMs)
VLMs are models that jointly learn from images (or videos) and language. They are trained to align, generate, or reason across both modalities, e.g., image captioning, visual question answering (VQA), image-text retrieval, and multimodal reasoning. A zero-shot classification sketch follows the table below.
Model | Parameters | Training Dataset Size | GPUs / Total Compute | Notes / Source |
---|---|---|---|---|
CLIP ViT-L/14 (OpenAI) | 430M | 400M image-text pairs | 256 V100s, 12 days | Alignment (OpenAI 2021) |
BLIP-2 (Salesforce) | ~340M | 129M image-text pairs | 32 A100s, 14 days (est.) | Captioning (Salesforce 2023) |
LLaVA-1.5 13B | 13B | 158K conv images + pretrain 400M | 8 A100s (finetune) | Multimodal chat (LLaVA 2024) |
PaLI-3 (Google) | 14B | 17B images, 1.4T tokens | 2048 TPUv4s, months | Multimodal (PaLI-3) |
Kosmos-2 (Microsoft) | 1.6B | 1B image-text pairs + 20M captions | 512 A100s (est.) | Multimodal (Microsoft 2023) |
IDEFICS 80B (HF) | 80B | 1B image-text pairs | 256 A100s, 1 month | Multimodal (HF 2023) |
Flamingo 80B (DM) | 80B (frozen LM + vision encoder) | 1.8B image-text pairs | 256 A100s, weeks | Multimodal (DeepMind 2022) |
GPT-4V | ~1T? (est.) | Multi-billion images (closed) | 10,000+ A100s (est.) | Closed (OpenAI 2024) |
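As a concrete example of image-text alignment, the sketch below runs zero-shot classification with the public CLIP ViT-L/14 checkpoint via the Hugging Face `transformers` API. The image file and candidate labels are made-up placeholders; this is a minimal illustration, not a production recipe.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP ViT-L/14 checkpoint (same model family as in the table).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

image = Image.open("workbench.jpg").convert("RGB")  # placeholder camera frame
candidate_labels = ["a screwdriver", "a coffee mug", "an empty table"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(candidate_labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because the labels are supplied as free-form text prompts, the same model covers new categories without retraining, which is what makes this style of alignment attractive for open-vocabulary robot perception.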
Large 3D Models (L3DM)
Large 3D models apply the same large-scale pretraining recipe to 3D data such as point clouds, voxel grids, and LIDAR scans, supporting tasks like 3D segmentation, detection, and 3D-language alignment.
Model | Parameters | Data Scale | Key Domain(s) | Training Approach | Notes / Source |
---|---|---|---|---|---|
Point-BERT | 100M+ | 10M+ point clouds | General 3D, LIDAR | Self-supervised (BERT-style) | Point-BERT (CVPR 2022) |
PointNeXt | 150M+ | 10M+ 3D scenes (ScanNet, Waymo) | Scene understanding, LIDAR | Supervised/self-supervised | PointNeXt (NeurIPS 2022) |
Segment Anything 3D (SA-3D) | 600M+ | 5M+ 3D scans (multi-sensor, incl. LIDAR) | 3D segmentation, autonomous driving | Masked 3D pretraining | Meta 2024 |
One-Former 3D | 700M | 2M 3D scans + millions of 2D images | Joint 2D/3D segmentation | Unified transformer | 2024 |
LidarCLIP | 200M+ | 50M paired LIDAR scans & images | 3D-2D alignment, multi-modal | Contrastive (CLIP-style) | 2024 |
PointLLM | 1.2B+ | 10M+ 3D point clouds + language | 3D VLM (vision-lang-model) | Multi-modal pretrain | 2024 |
VoxFormer | 1B | 5M+ voxel grids (Waymo, NuScenes) | 3D detection, segmentation | Transformer | 2024 |
MinkowskiNet-Large | 200M+ | 10M+ 3D scans | Large-scale 3D | Sparse conv, self-supervised | OpenMMLab 2023/24 |
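The models above rely on much deeper 3D backbones (sparse convolutions, point transformers), but the core pattern of embedding an unordered point cloud into a fixed-size vector can be shown with a toy PointNet-style encoder. The sketch below is a self-contained simplification, not the architecture of any model in the table.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """Toy PointNet-style encoder: shared per-point MLP + symmetric max-pooling.

    Large 3D models use far deeper backbones (sparse convolutions, point
    transformers), but the core idea of mapping an unordered point set to a
    fixed-size embedding is the same.
    """

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) xyz coordinates.
        per_point = self.point_mlp(points)      # (batch, num_points, embed_dim)
        # Max-pooling over points gives an order-invariant embedding.
        return per_point.max(dim=1).values      # (batch, embed_dim)

# Embed a batch of two random 1,024-point clouds.
encoder = TinyPointEncoder()
clouds = torch.randn(2, 1024, 3)
print(encoder(clouds).shape)  # torch.Size([2, 256])
```

The symmetric max-pooling is what makes the encoder insensitive to point ordering, a requirement shared by all the point-cloud backbones listed above.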
Vision Language Action (VLA) Models
VLA models jointly learn or reason over vision, language, and action/decision spaces, mapping visual observations and natural language instructions to robot actions. A toy policy sketch follows the table below.
Model / Platform | Parameters | Training Data | Compute / GPUs | Main Task / Setting | Source/Year |
---|---|---|---|---|---|
NVIDIA Foundation Robot Policy (FRP-1) | 12B | 1M+ sim episodes (Isaac Gym/Sim), 200K real robot demos, 10M web video frames | 512 H100, weeks | Manipulation, navigation, multi-robot, simulation-to-reality | NVIDIA GTC 2024 |
Isaac-Action LLM | 7B | 500K robot language-action episodes | 256 H100, weeks | Language-to-action, household tasks | NVIDIA Isaac Lab 2024 |
Omniverse GPT | ~20B (est.) | 10M multimodal scene-action pairs (simulated and video) | 1024 H100, weeks | Simulation, scene understanding, 3D asset generation, instruction following | NVIDIA Omniverse 2025 |
Kaolin-Embodied | 3B | 100K 3D manipulation episodes + 1M images | 128 A100, days | 3D vision-action (manipulation, grasping) | NVIDIA Kaolin 2024 |
RT-X (Google DeepMind, 2024) | 55B (w/ frozen LLMs) | 700K robot episodes + 20M web images + language | 256 TPUv4, weeks | Real-world & simulated robotics, multitask | RT-X 2024 |
Open X-Embodiment (multi-org, 2024) | 1.5B–7B | 500K+ robot demos (across 22 robot types) | 128 A100s, weeks | Multi-robot, multi-task learning | Open X-Embodiment |
GPT-4R (OpenAI, 2025 rumored) | ~1T? (closed) | ? (rumored: 1M+ real/sim demos, 100M+ images) | 10,000+ A100s? | Generalist agent (rumored, closed) | 2025 (rumor) |
OctoGait (Meta, 2025) | 6B | 1M humanoid motion episodes + 10M video frames | 64 A100s, weeks | Locomotion, manipulation, embodied tasks | Meta 2025 |
RoboCLIP (OpenAI, 2024) | 1.2B | 1M RL trajectories, 50M image-text-action pairs | 32 A100s, weeks | Video game/robotic control from vision+lang | OpenAI 2024 |
TouchStone (Google, 2025) | 5B | 2M tactile+image+lang episodes | 32 TPUv5, weeks | Tactile robotics, manipulation | Google 2025 |
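The sketch below illustrates the generic VLA interface: fuse a visual embedding and an instruction embedding, then decode an action. Both encoders are stand-in linear layers, and the 7-DoF action head is an assumed convention (6-DoF end-effector delta plus gripper); real VLAs such as RT-X use large pretrained backbones and more elaborate action tokenization.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Toy vision-language-action policy head.

    Real VLAs use large pretrained vision and language backbones; here both
    are stand-in linear encoders so the sketch stays self-contained. The
    policy fuses the two embeddings and regresses a 7-DoF action
    (assumed convention: 6-DoF end-effector delta + 1 gripper command).
    """

    def __init__(self, img_dim=768, txt_dim=512, hidden=256, action_dim=7):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # stand-in for a vision encoder
        self.txt_proj = nn.Linear(txt_dim, hidden)  # stand-in for a language encoder
        self.policy = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.policy(fused)                   # (batch, action_dim)

policy = ToyVLAPolicy()
img_emb = torch.randn(1, 768)   # e.g., a frozen LVM image feature
txt_emb = torch.randn(1, 512)   # e.g., an instruction embedding
print(policy(img_emb, txt_emb).shape)  # torch.Size([1, 7])
```

In practice the image and instruction embeddings would come from the LVM and VLM components discussed earlier, which is why VLAs are often built by attaching an action head to a pretrained vision-language backbone.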