Key Modeling Advances
In recent years, several key model categories have emerged that significantly enhance the capabilities of robotic systems:
- Large Vision Models (LVMs) that scale up visual perception and support zero-shot inference on a wide range of visual tasks.
- Vision Language Models (VLMs) that extend LLMs to achieve multimodal reasoning and instruction following.
- Large 3D Models (L3DMs) that apply large-scale pretraining to 3D data such as point clouds and LIDAR scans.
- Vision Language Action (VLA) models that generate robot actions from visual input and natural language instructions.
Large Vision Models (LVMs)
Large Vision Models (LVMs) are neural networks (CNNs or Transformers) trained on vision-only tasks such as classification, segmentation, detection, or representation learning, using image data (and sometimes video or 3D data). Like large language models, they are pretrained on massive datasets and carry large parameter counts (into the billions), which lets them capture complex visual patterns and relationships. A minimal inference sketch follows the table below.
Model | Parameters | Training Data Size | GPUs / Total Compute | Notes / Source |
---|---|---|---|---|
SAM (Meta) | 636M | 1.1B masks, 11M images | 256 A100s, 4.5k GPU-days | Segmentation (Meta 2023) |
ViT-G/14 (Google) | 1.8B | 4B images (JFT-4B) | 2048 TPUv4s, weeks | Classification (Zhai et al. 2022) |
DINOv2 (Meta) | 1B | 142M images | 128 A100s, weeks | Self-supervised (Meta 2023) |
SwinV2-G (MSRA) | 3B | 70M images | 1536 V100s, 40 days | Classification (Liu et al. 2021) |
ConvNeXtV2-giant (Meta) | 1.6B | 1B images | 1024 A100s, weeks | Classification (Meta 2023) |
Florence (Microsoft) | 893M | 900M images | 512 V100s, 10 days | Classification (Yuan et al. 2021) |
YOLOv9-E | 300M | 100M images | 32 A100s, days | Detection (YOLOv9) |
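To make the LVM idea concrete, the sketch below loads a pretrained DINOv2 backbone (one of the models in the table) through its public torch.hub entry point and extracts a global image embedding. The image path `scene.jpg` is a placeholder, and the model variant and preprocessing are illustrative assumptions rather than a prescribed pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a pretrained DINOv2 ViT-B/14 backbone from torch.hub
# (weights are downloaded on first use).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

# Standard ImageNet-style preprocessing; crop size must be divisible by 14.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("scene.jpg").convert("RGB")  # placeholder image path
batch = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)

with torch.no_grad():
    features = model(batch)                     # (1, 768) global image embedding

print(features.shape)
```

The resulting embedding can be reused downstream (retrieval, clustering, or as a frozen visual feature for a robot policy) without task-specific fine-tuning.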
Vision Language Models (VLMs)
VLMs are models that jointly learn from images (or videos) and language. They are trained to align, generate, or reason across both modalities, e.g., image captioning, visual question answering (VQA), image-text retrieval, and multimodal reasoning. A zero-shot classification sketch follows the table below.
Model | Parameters | Training Dataset Size | GPUs / Total Compute | Notes / Source |
---|---|---|---|---|
CLIP ViT-L/14 (OpenAI) | 430M | 400M image-text pairs | 256 V100s, 12 days | Alignment (OpenAI 2021) |
BLIP-2 (Salesforce) | ~340M | 129M image-text pairs | 32 A100s, 14 days (est.) | Captioning (Salesforce 2023) |
LLaVA-1.5 13B | 13B | 158K conv images + pretrain 400M | 8 A100s (finetune) | Multimodal chat (LLaVA 2024) |
PaLI-3 (Google) | 14B | 17B images, 1.4T tokens | 2048 TPUv4s, months | Multimodal (PaLI-3) |
Kosmos-2 (Microsoft) | 1.6B | 1B image-text pairs + 20M captions | 512 A100s (est.) | Multimodal (Microsoft 2023) |
IDEFICS 80B (HF) | 80B | 1B image-text pairs | 256 A100s, 1 month | Multimodal (HF 2023) |
Flamingo 80B (DM) | 80B (frozen LM + vision encoder) | 1.8B image-text pairs | 256 A100s, weeks | Multimodal (DeepMind 2022) |
GPT-4V | ~1T? (est.) | Multi-billion images (closed) | 10,000+ A100s (est.) | Closed (OpenAI 2024) |
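As a concrete example of image-text alignment, the sketch below runs zero-shot classification with the public CLIP ViT-L/14 checkpoint via the Hugging Face `transformers` API. The image file and candidate labels are made-up placeholders; this is a minimal illustration, not a production recipe.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP ViT-L/14 checkpoint (same model family as in the table).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

image = Image.open("workbench.jpg").convert("RGB")  # placeholder camera frame
candidate_labels = ["a screwdriver", "a coffee mug", "an empty table"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-text similarity scores.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for label, p in zip(candidate_labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because the labels are supplied as free-form text prompts, the same model covers new categories without retraining, which is what makes this style of alignment attractive for open-vocabulary robot perception.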
Large 3D Models (L3DM)
Large 3D models apply the same large-scale pretraining recipe to 3D data such as point clouds, voxel grids, and LIDAR scans, supporting tasks like 3D segmentation, detection, and 3D-language alignment.
Model | Parameters | Data Scale | Key Domain(s) | Training Approach | Notes / Source |
---|---|---|---|---|---|
Point-BERT | 100M+ | 10M+ point clouds | General 3D, LIDAR | Self-supervised (BERT-style) | Point-BERT (CVPR 2022) |
PointNeXt | 150M+ | 10M+ 3D scenes (ScanNet, Waymo) | Scene understanding, LIDAR | Supervised/self-supervised | PointNeXt (NeurIPS 2022) |
Segment Anything 3D (SA-3D) | 600M+ | 5M+ 3D scans (multi-sensor, incl. LIDAR) | 3D segmentation, autonomous driving | Masked 3D pretraining | Meta 2024 |
One-Former 3D | 700M | 2M 3D scans + millions of 2D images | Joint 2D/3D segmentation | Unified transformer | 2024 |
LidarCLIP | 200M+ | 50M paired LIDAR scans & images | 3D-2D alignment, multi-modal | Contrastive (CLIP-style) | 2024 |
PointLLM | 1.2B+ | 10M+ 3D point clouds + language | 3D VLM (vision-lang-model) | Multi-modal pretrain | 2024 |
VoxFormer | 1B | 5M+ voxel grids (Waymo, NuScenes) | 3D detection, segmentation | Transformer | 2024 |
MinkowskiNet-Large | 200M+ | 10M+ 3D scans | Large-scale 3D | Sparse conv, self-supervised | OpenMMLab 2023/24 |
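The models above rely on much deeper 3D backbones (sparse convolutions, point transformers), but the core pattern of embedding an unordered point cloud into a fixed-size vector can be shown with a toy PointNet-style encoder. The sketch below is a self-contained simplification, not the architecture of any model in the table.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """Toy PointNet-style encoder: shared per-point MLP + symmetric max-pooling.

    Large 3D models use far deeper backbones (sparse convolutions, point
    transformers), but the core idea of mapping an unordered point set to a
    fixed-size embedding is the same.
    """

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) xyz coordinates.
        per_point = self.point_mlp(points)      # (batch, num_points, embed_dim)
        # Max-pooling over points gives an order-invariant embedding.
        return per_point.max(dim=1).values      # (batch, embed_dim)

# Embed a batch of two random 1,024-point clouds.
encoder = TinyPointEncoder()
clouds = torch.randn(2, 1024, 3)
print(encoder(clouds).shape)  # torch.Size([2, 256])
```

The symmetric max-pooling is what makes the encoder insensitive to point ordering, a requirement shared by all the point-cloud backbones listed above.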
Vision Language Action (VLA) Models
VLA models jointly learn or reason over vision, language, and action/decision spaces, mapping visual observations and natural language instructions to robot actions. A toy policy sketch follows the table below.
Model / Platform | Parameters | Training Data | Compute / GPUs | Main Task / Setting | Source/Year |
---|---|---|---|---|---|
NVIDIA Foundation Robot Policy (FRP-1) | 12B | 1M+ sim episodes (Isaac Gym/Sim), 200K real robot demos, 10M web video frames | 512 H100, weeks | Manipulation, navigation, multi-robot, simulation-to-reality | NVIDIA GTC 2024 |
Isaac-Action LLM | 7B | 500K robot language-action episodes | 256 H100, weeks | Language-to-action, household tasks | NVIDIA Isaac Lab 2024 |
Omniverse GPT | ~20B (est.) | 10M multimodal scene-action pairs (simulated and video) | 1024 H100, weeks | Simulation, scene understanding, 3D asset generation, instruction following | NVIDIA Omniverse 2025 |
Kaolin-Embodied | 3B | 100K 3D manipulation episodes + 1M images | 128 A100, days | 3D vision-action (manipulation, grasping) | NVIDIA Kaolin 2024 |
RT-X (Google DeepMind, 2024) | 55B (w/ frozen LLMs) | 700K robot episodes + 20M web images + language | 256 TPUv4, weeks | Real-world & simulated robotics, multitask | RT-X 2024 |
Open X-Embodiment (multi-org, 2024) | 1.5B–7B | 500K+ robot demos (across 22 robot types) | 128 A100s, weeks | Multi-robot, multi-task learning | Open X-Embodiment |
GPT-4R (OpenAI, 2025 rumored) | ~1T? (closed) | ? (rumored: 1M+ real/sim demos, 100M+ images) | 10,000+ A100s? | Generalist agent (rumored, closed) | 2025 (rumor) |
OctoGait (Meta, 2025) | 6B | 1M humanoid motion episodes + 10M video frames | 64 A100s, weeks | Locomotion, manipulation, embodied tasks | Meta 2025 |
RoboCLIP (OpenAI, 2024) | 1.2B | 1M RL trajectories, 50M image-text-action pairs | 32 A100s, weeks | Video game/robotic control from vision+lang | OpenAI 2024 |
TouchStone (Google, 2025) | 5B | 2M tactile+image+lang episodes | 32 TPUv5, weeks | Tactile robotics, manipulation | Google 2025 |
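The sketch below illustrates the generic VLA interface: fuse a visual embedding and an instruction embedding, then decode an action. Both encoders are stand-in linear layers, and the 7-DoF action head is an assumed convention (6-DoF end-effector delta plus gripper); real VLAs such as RT-X use large pretrained backbones and more elaborate action tokenization.

```python
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    """Toy vision-language-action policy head.

    Real VLAs use large pretrained vision and language backbones; here both
    are stand-in linear encoders so the sketch stays self-contained. The
    policy fuses the two embeddings and regresses a 7-DoF action
    (assumed convention: 6-DoF end-effector delta + 1 gripper command).
    """

    def __init__(self, img_dim=768, txt_dim=512, hidden=256, action_dim=7):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)  # stand-in for a vision encoder
        self.txt_proj = nn.Linear(txt_dim, hidden)  # stand-in for a language encoder
        self.policy = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, img_emb, txt_emb):
        fused = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.policy(fused)                   # (batch, action_dim)

policy = ToyVLAPolicy()
img_emb = torch.randn(1, 768)   # e.g., a frozen LVM image feature
txt_emb = torch.randn(1, 512)   # e.g., an instruction embedding
print(policy(img_emb, txt_emb).shape)  # torch.Size([1, 7])
```

In practice the image and instruction embeddings would come from the LVM and VLM components discussed earlier, which is why VLAs are often built by attaching an action head to a pretrained vision-language backbone.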