Detectron2 Beginner’s Tutorial

Welcome to detectron2! This is the official Colab tutorial of detectron2. Here, we will go through some basic usage of detectron2, including the following:

* Run inference on images or videos with an existing detectron2 model
* Train a detectron2 model on a new dataset

You can make a copy of this tutorial by "File -> Open in playground mode" and make changes there.

Install detectron2


import torch, torchvision, detectron2
!nvcc --version
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
TORCHVISION_VERSION = ".".join(torchvision.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
print("detectron2:", detectron2.__version__)
print("torchvision: ", TORCHVISION_VERSION)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
torch:  1.12 ; cuda:  cu113
detectron2: 0.6
torchvision:  0.13
!nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-da6de55e-ad78-924d-8cf3-4be0595eef77)
# Some basic setup:
# Setup detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np
import os, json, cv2, random
from google.colab.patches import cv2_imshow

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog

Run a pre-trained detectron2 model

We first download an image from the COCO dataset:

!wget -q -O input.jpg
im = cv2.imread("./input.jpg")

Then, we create a detectron2 config and a detectron2 DefaultPredictor to run inference on this image.

cfg = get_cfg()
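# add project-specific config (e.g., TensorMask) here if you're not running a model in detectron2's core library
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))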
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
# Find a model from detectron2's model zoo. You can use the https://dl.fbaipublicfiles... url as well
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(im)
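
The effect of `SCORE_THRESH_TEST` can be illustrated without a model. The following is a minimal sketch, assuming detections are plain dicts with a `score` field; the `filter_by_score` helper, the labels, and the scores are hypothetical and only mimic the idea of dropping low-confidence detections at test time.

```python
def filter_by_score(detections, score_thresh):
    """Keep only detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d["score"] > score_thresh]


# Hypothetical detections, for illustration only.
detections = [
    {"label": "person", "score": 0.98},
    {"label": "horse", "score": 0.71},
    {"label": "umbrella", "score": 0.42},
]

kept = filter_by_score(detections, score_thresh=0.5)
print([d["label"] for d in kept])
```

Raising the threshold trades recall for precision: fewer boxes are drawn, but the ones that remain are more likely to be correct.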

# look at the outputs. See the detectron2 model output format documentation for the specification
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)
tensor([17,  0,  0,  0,  0,  0,  0,  0, 25,  0, 25, 25,  0,  0, 24],
       device='cuda:0')
Boxes(tensor([[126.5927, 244.9072, 459.8221, 480.0000],
        [251.1046, 157.8087, 338.9760, 413.6155],
        [114.8537, 268.6926, 148.2408, 398.8159],
        [  0.8249, 281.0315,  78.6042, 478.4268],
        [ 49.3939, 274.1228,  80.1528, 342.9875],
        [561.2266, 271.5830, 596.2780, 385.2542],
        [385.9034, 270.3119, 413.7115, 304.0397],
        [515.9216, 278.3663, 562.2773, 389.3731],
        [335.2385, 251.9169, 414.7485, 275.9340],
        [350.9470, 269.2095, 386.0932, 297.9067],
        [331.6270, 230.9990, 393.2777, 257.2000],
        [510.7307, 263.2674, 570.9891, 295.9456],
        [409.0903, 271.8640, 460.5584, 356.8694],
        [506.8879, 283.3292, 529.9476, 324.0202],
        [594.5665, 283.4850, 609.0558, 311.4114]], device='cuda:0'))
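
The first tensor in the output above holds one class index per detected instance. As a rough illustration, the sketch below tallies those indices by name. The `COCO_NAMES` table is a hand-written assumption that follows the usual 80-class COCO ordering and lists only the four indices that appear here; in the notebook, the names would come from the dataset metadata instead.

```python
from collections import Counter

# Class indices copied from the tensor printed above.
pred_classes = [17, 0, 0, 0, 0, 0, 0, 0, 25, 0, 25, 25, 0, 0, 24]

# Assumption: partial index -> name table in the usual COCO ordering.
COCO_NAMES = {0: "person", 17: "horse", 24: "backpack", 25: "umbrella"}

counts = Counter(COCO_NAMES[i] for i in pred_classes)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```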
# We can use `Visualizer` to draw the predictions on the image.
v = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(out.get_image()[:, :, ::-1])

Train on a custom dataset

In this section, we show how to train an existing detectron2 model on a custom dataset in a new format.

We use the balloon segmentation dataset, which has only one class: balloon. We’ll train a balloon segmentation model starting from a model pre-trained on the COCO dataset, available in detectron2’s model zoo.

Note that the COCO dataset does not have the “balloon” category. After a few minutes of training, the model will be able to recognize this new class.

Prepare the dataset

# download, decompress the data
!unzip > /dev/null

Register the balloon dataset to detectron2, following the detectron2 custom dataset tutorial. Here, the dataset is in its own custom format, so we write a function that parses it and converts it into detectron2’s standard format. Users should write such a function when using a dataset in a custom format. See the tutorial for more details.
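
Before the full parser, the per-region conversion can be sketched on a toy polygon. This is a simplified illustration, not the tutorial's function: `region_to_annotation` is a hypothetical helper, the coordinates are made up, and the `bbox_mode` field is left out so the snippet runs without detectron2 installed.

```python
def region_to_annotation(all_points_x, all_points_y):
    """Convert one polygon (separate x and y lists) into an annotation dict."""
    # Shift each vertex by 0.5 so it refers to the pixel center, then
    # flatten [(x0, y0), (x1, y1), ...] into [x0, y0, x1, y1, ...].
    poly = [c + 0.5 for xy in zip(all_points_x, all_points_y) for c in xy]
    return {
        # Tight box around the polygon as [x_min, y_min, x_max, y_max].
        "bbox": [min(all_points_x), min(all_points_y),
                 max(all_points_x), max(all_points_y)],
        "segmentation": [poly],  # list of polygons; here a single one
        "category_id": 0,        # the only class in this dataset
    }


# Made-up triangle, for illustration only.
toy = region_to_annotation([10, 50, 30], [20, 25, 60])
print(toy["bbox"])
print(toy["segmentation"][0])
```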

# if your dataset is in COCO format, this cell can be replaced by the following three lines:
# from detectron2.data.datasets import register_coco_instances
# register_coco_instances("my_dataset_train", {}, "json_annotation_train.json", "path/to/image/dir")
# register_coco_instances("my_dataset_val", {}, "json_annotation_val.json", "path/to/image/dir")

from detectron2.structures import BoxMode

def get_balloon_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_region_data.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    dataset_dicts = []
    for idx, v in enumerate(imgs_anns.values()):
        record = {}
        filename = os.path.join(img_dir, v["filename"])
        height, width = cv2.imread(filename).shape[:2]
        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width
        annos = v["regions"]
        objs = []
        for _, anno in annos.items():
            assert not anno["region_attributes"]
            anno = anno["shape_attributes"]
            px = anno["all_points_x"]
            py = anno["all_points_y"]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]

            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts

for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])
balloon_metadata = MetadataCatalog.get("balloon_train")
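
The `lambda d=d:` in the registration loop is deliberate. A plain `lambda:` would look up `d` only when called, after the loop has finished, so both datasets would load the last split. The sketch below shows the difference in plain Python; `load_split` and `registry` are hypothetical stand-ins for `get_balloon_dicts` and the dataset catalog.

```python
registry = {}


def load_split(split):
    # Stand-in for get_balloon_dicts("balloon/" + split).
    return f"records for balloon/{split}"


# Late binding: both lambdas read `d` at call time, after the loop ended.
for d in ["train", "val"]:
    registry["late_" + d] = lambda: load_split(d)

# Default argument: `d=d` captures the current value at definition time.
for d in ["train", "val"]:
    registry["bound_" + d] = lambda d=d: load_split(d)

print(registry["late_train"]())   # loads the wrong split
print(registry["bound_train"]())  # loads the intended split
```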

To verify that the dataset is in the correct format, let’s visualize the annotations of randomly selected samples in the training set:

dataset_dicts = get_balloon_dicts("balloon/train")
for d in random.sample(dataset_dicts, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=balloon_metadata, scale=0.5)
    out = visualizer.draw_dataset_dict(d)
    cv2_imshow(out.get_image()[:, :, ::-1])


Now, let’s fine-tune a COCO-pretrained R50-FPN Mask R-CNN model on the balloon dataset. It takes ~2 minutes to train 300 iterations on a P100 GPU.

from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("balloon_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2  # This is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 300    # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = []        # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # The "RoIHead batch size". 128 is faster, and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (balloon); see the detectron2 documentation on updating the config for new datasets
# NOTE: this config means the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
[11/08 16:58:10]: Removed 0 images with no usable annotations. 61 images left.
[11/08 16:58:10]: Distribution of instances among all 1 categories:
|  category  | #instances   |
|  balloon   | 255          |
|            |              |
[11/08 16:58:10]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[11/08 16:58:10]: Using training sampler TrainingSampler
[11/08 16:58:10]: Serializing 61 elements to byte tensors and concatenating them all ...
[11/08 16:58:10]: Serialized dataset takes 0.17 MiB
[11/08 16:58:13 d2.engine.train_loop]: Starting training from iteration 0
[11/08 16:58:52 d2.engine.hooks]: Overall training speed: 298 iterations in 0:00:35 (0.1201 s / it)
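
To put the 300 iterations in perspective, the numbers in the config and the log can be combined into a rough estimate of how often each image is seen. This is back-of-the-envelope arithmetic only; the variable names are illustrative, and the estimate ignores data augmentation and start-up overhead.

```python
num_images = 61        # "61 images left" in the training log
ims_per_batch = 2      # cfg.SOLVER.IMS_PER_BATCH
max_iter = 300         # cfg.SOLVER.MAX_ITER
sec_per_iter = 0.1201  # "0.1201 s / it" in the training log

images_seen = ims_per_batch * max_iter
approx_epochs = images_seen / num_images
approx_seconds = sec_per_iter * max_iter

print(f"images seen: {images_seen}")
print(f"approx. passes over the dataset: {approx_epochs:.1f}")
print(f"approx. training time: {approx_seconds:.0f} s")
```

So the model sees each training image roughly ten times, which is why such a short run is enough for this toy dataset.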
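Now, let’s run inference with the trained model on the balloon validation dataset. First, we create a predictor from the model we just trained:

# Inference should use the config with the parameters that were used in training.
# cfg already contains everything we set previously; we only change a few values for inference:
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")  # path to the model we just trained
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # set a custom testing threshold
predictor = DefaultPredictor(cfg)

Then, we randomly select several samples to visualize the prediction results.

from detectron2.utils.visualizer import ColorMode
dataset_dicts = get_balloon_dicts("balloon/val")
for d in random.sample(dataset_dicts, 3):
    im = cv2.imread(d["file_name"])
    outputs = predictor(im)  # format is documented in the detectron2 model output format documentation
    v = Visualizer(im[:, :, ::-1],
                   metadata=balloon_metadata,
                   scale=0.5,
                   instance_mode=ColorMode.IMAGE_BW   # remove the colors of unsegmented pixels. This option is only available for segmentation models
    )
    out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(out.get_image()[:, :, ::-1])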

We can also evaluate its performance using the AP metric implemented in the COCO API. In the run shown below, this gives a box AP of about 73.5 and a mask AP of about 75.5. Not bad!

from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader
evaluator = COCOEvaluator("balloon_val", output_dir="./output")
val_loader = build_detection_test_loader(cfg, "balloon_val")
print(inference_on_dataset(predictor.model, val_loader, evaluator))
# another equivalent way to evaluate the model is to use `trainer.test`
[11/08 16:59:00 d2.evaluation.fast_eval_api]: Evaluate annotation type *bbox*
[11/08 16:59:00 d2.evaluation.fast_eval_api]: COCOeval_opt.evaluate() finished in 0.00 seconds.
[11/08 16:59:00 d2.evaluation.fast_eval_api]: Accumulating evaluation results...
[11/08 16:59:00 d2.evaluation.fast_eval_api]: COCOeval_opt.accumulate() finished in 0.00 seconds.
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.735
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.838
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.801
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.917
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.248
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.754
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.754
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.940
[11/08 16:59:00 d2.evaluation.coco_evaluation]: Evaluation results for bbox: 
|   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
| 73.543 | 83.773 | 80.054 | 0.000 | 53.092 | 91.719 |
[11/08 16:59:00 d2.evaluation.fast_eval_api]: Evaluate annotation type *segm*
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.755
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.810
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.810
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.520
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.965
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.254
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.770
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.770
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.541
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.977
[11/08 16:59:00 d2.evaluation.coco_evaluation]: Evaluation results for segm: 
|   AP   |  AP50  |  AP75  |  APs  |  APm   |  APl   |
| 75.484 | 81.000 | 81.000 | 0.000 | 51.973 | 96.516 |
OrderedDict([('bbox', {'AP': 73.54345467878944, 'AP50': 83.77273773889017, 'AP75': 80.05421970768505, 'APs': 0.0, 'APm': 53.092234498175095, 'APl': 91.71886987843824}), ('segm', {'AP': 75.48403884002163, 'AP50': 81.00046051116738, 'AP75': 81.00046051116738, 'APs': 0.0, 'APm': 51.97258187357197, 'APl': 96.51637047762746})])
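
The headline AP in the tables above is averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only the IoU ingredient for two boxes in `[x1, y1, x2, y2]` form and counts how many of those thresholds a single prediction would pass. The `box_iou` helper and the boxes are illustrative assumptions; the actual COCO evaluation additionally matches predictions to ground truth by score and computes precision-recall curves.

```python
def box_iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used for the averaged AP.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Made-up ground-truth box and a prediction shifted 2 pixels to the right.
gt_box = [0, 0, 10, 10]
pred_box = [2, 0, 12, 10]

iou = box_iou(gt_box, pred_box)
passed = [t for t in thresholds if iou >= t]
print(f"IoU = {iou:.3f}, counts as a match at {len(passed)} of {len(thresholds)} thresholds")
```

A small localization error is forgiven at IoU=0.50 but not at IoU=0.75 or higher, which is why AP50 is typically higher than the averaged AP.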

Other types of builtin models

We showcase simple demos of other types of models below:

# Inference with a keypoint detection model
cfg = get_cfg()   # get a fresh new config
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(im)
v = Visualizer(im[:,:,::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(out.get_image()[:, :, ::-1])

# Inference with a panoptic segmentation model
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml")
predictor = DefaultPredictor(cfg)
panoptic_seg, segments_info = predictor(im)["panoptic_seg"]
v = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_panoptic_seg_predictions(panoptic_seg.to("cpu"), segments_info)
cv2_imshow(out.get_image()[:, :, ::-1])

Run panoptic segmentation on a video

# This is the video we're going to process
from IPython.display import YouTubeVideo, display
video = YouTubeVideo("ll8TgCZ0plk", width=500)
display(video)
# Install dependencies, download the video, and crop 5 seconds for processing
# Run frame-by-frame inference demo on this video (takes 3-4 minutes) with the demo tool provided in the repo.
!git clone
# Note: this is currently BROKEN due to missing codec. See for workaround.
%run detectron2/demo/ --config-file detectron2/configs/COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml --video-input video-clip.mp4 --confidence-threshold 0.6 --output video-output.mkv \
  --opts MODEL.WEIGHTS detectron2://COCO-PanopticSegmentation/panoptic_fpn_R_101_3x/139514519/model_final_cafdb1.pkl
# Download the results
from google.colab import files
files.download('video-output.mkv')
