Real-Time Object Tracking with YOLOX and ByteTrack

Learn how to track objects across video frames with YOLOX and ByteTrack.

Christian Mills


October 27, 2023

This post is part of the following series:


Welcome back to this series on real-time object detection with YOLOX! Previously, we fine-tuned a YOLOX model in PyTorch to detect hand signs and exported it to ONNX. This tutorial combines our YOLOX model with the ByteTrack object tracker to track objects continuously across video frames.

Tracking objects over time unlocks a wide range of potential applications. With our hand-sign detector, we could implement gesture-based controls for devices and create interactive gaming and multimedia experiences. Beyond our specific model, object tracking has applications in everything from sports analysis to wildlife monitoring.

By the end of this tutorial, you will understand how to combine a YOLOX object detection model with ByteTrack, enabling you to effectively track hand signs or other objects across consecutive video frames.

This post assumes the reader has completed the previous tutorial linked below:

Getting Started with the Code

As with the previous tutorial, the code is available as a Jupyter Notebook.

Jupyter Notebook: GitHub Repository
Google Colab: Open In Colab

Setting Up Your Python Environment

We need to add a couple of new libraries to our Python environment. We will use OpenCV to read and write video files. I also made a package with a standalone implementation of ByteTrack. Make sure to install onnx and onnxruntime if you did not follow the previous tutorial.

Package          Description
onnx             This package provides a Python API for working with ONNX models.
onnxruntime      ONNX Runtime is a runtime accelerator for machine learning models.
opencv-python    Wrapper package for OpenCV python bindings.
cjm-byte-track   A standalone Python implementation of the ByteTrack multi-object tracker based on the official implementation.

Run the following command to install these additional libraries:

# Install packages
pip install onnx onnxruntime opencv-python cjm_byte_track

Importing the Required Dependencies

With our environment updated, we can dive into the code. First, we will import the necessary Python dependencies into our Jupyter Notebook.

# Import Python Standard Library dependencies
from dataclasses import dataclass
import json
from pathlib import Path
import random
import time
from typing import List

# Import ByteTrack package
from cjm_byte_track.core import BYTETracker
from cjm_byte_track.matching import match_detections_with_tracks

# Import utility functions
from cjm_psl_utils.core import download_file
from cjm_pil_utils.core import resize_img

# Import OpenCV
import cv2

# Import numpy
import numpy as np

# Import the pandas package
import pandas as pd

# Import PIL for image manipulation
from PIL import Image, ImageDraw, ImageFont

# Import ONNX dependencies
import onnx # Import the onnx module
import onnxruntime as ort # Import the ONNX Runtime

# Import tqdm for progress bar
from tqdm.auto import tqdm

Setting Up the Project

In this section, we will set the folder locations for our project and the directory with the ONNX model and JSON colormap file. We should also ensure we have a font file for annotating images.

Set the Directory Paths

# The name for the project
project_name = f"pytorch-yolox-object-detector"

# The path for the project folder
project_dir = Path(f"./{project_name}/")

# Create the project directory if it does not already exist
project_dir.mkdir(parents=True, exist_ok=True)

# The path to the checkpoint folder
checkpoint_dir = Path(project_dir/f"2023-08-17_16-14-43")
# checkpoint_dir = Path(project_dir/f"pretrained-coco")

# Display the project and checkpoint directories
pd.Series({
    "Project Directory:": project_dir,
    "Checkpoint Directory:": checkpoint_dir,
}).to_frame().style.hide(axis='columns')

Project Directory: pytorch-yolox-object-detector
Checkpoint Directory: pytorch-yolox-object-detector/2023-08-17_16-14-43

I made an ONNX model available on Hugging Face Hub with a colormap file in the repository linked below:

Those following along on Google Colab can drag the contents of their checkpoint folder into Colab’s file browser.

Download a Font File

# Set the name of the font file
font_file = 'KFOlCnqEu92Fr1MmEU9vAw.ttf'

# Download the font file
download_file(f"{font_file}", "./")

Loading the Checkpoint Data

Now, we can load the colormap and set the max stride value and input dimension slice.

Load the Colormap

# The colormap path
colormap_path = list(checkpoint_dir.glob('*colormap.json'))[0]

# Load the JSON colormap data
with open(colormap_path, 'r') as file:
    colormap_json = json.load(file)

# Convert the JSON data to a dictionary        
colormap_dict = {item['label']: item['color'] for item in colormap_json['items']}

# Extract the class names from the colormap
class_names = list(colormap_dict.keys())

# Make a copy of the colormap in integer format
int_colors = [tuple(int(c*255) for c in color) for color in colormap_dict.values()]
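
The loading code above implies a particular layout for the colormap file: a top-level "items" list whose entries each hold a "label" and a float RGB "color". The sketch below walks through that conversion on a small in-memory example. The two labels and their colors are hypothetical placeholders, not the contents of the actual checkpoint file.

```python
import json

# Hypothetical colormap contents; the labels and colors are illustrative only
colormap_text = """
{
  "items": [
    {"label": "call", "color": [0.0, 0.5, 1.0]},
    {"label": "peace", "color": [1.0, 0.25, 0.0]}
  ]
}
"""

colormap_json = json.loads(colormap_text)

# Map each label to its float RGB color
colormap_dict = {item["label"]: item["color"] for item in colormap_json["items"]}

# The class names keep the order used in the file
class_names = list(colormap_dict.keys())

# Scale float channels in [0, 1] to integers in [0, 255] (int() truncates)
int_colors = [tuple(int(c * 255) for c in color) for color in colormap_dict.values()]

print(class_names)  # ['call', 'peace']
print(int_colors)   # [(0, 127, 255), (255, 63, 0)]
```

Because the tracking loop later looks up class names by the model's predicted index, the order of the entries in the file presumably needs to match the class order used during training.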

Set the Preprocessing and Post-Processing Parameters

max_stride = 32
input_dim_slice = slice(2, 4, None)
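
As a quick illustration of what the slice does, the sketch below applies it to the shape of a hypothetical input tensor in (batch, channels, height, width) order. The specific shape is an assumption chosen to resemble a portrait video frame.

```python
# Shape of a hypothetical input tensor: (batch, channels, height, width)
input_shape = (1, 3, 512, 288)

max_stride = 32
input_dim_slice = slice(2, 4, None)

# Indexing the shape with the slice keeps only the spatial dimensions
input_dims = input_shape[input_dim_slice]
print(input_dims)  # (512, 288)

# Check whether both spatial dimensions are multiples of the max stride
dims_are_aligned = all(dim % max_stride == 0 for dim in input_dims)
print(dims_are_aligned)  # True
```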

Defining Utility Functions

Next, we will define some utility functions for preparing the input data and processing the model output.

Define a Function to Prepare Images for Inference

OpenCV uses the BGR (Blue, Green, Red) color format for images, so we must change the current video frame to RGB before performing the standard preprocessing steps for the YOLOX model.

def prepare_image_for_inference(frame:np.ndarray, target_sz:int, max_stride:int):
    """
    Prepares an image for inference by performing a series of preprocessing steps.

    Steps:
    1. Converts a BGR image to RGB.
    2. Resizes the image to a target size without cropping, considering a given divisor.
    3. Calculates input dimensions as multiples of the max stride.
    4. Calculates offsets based on the resized image dimensions and input dimensions.
    5. Computes the scale between the original and resized image.
    6. Crops the resized image based on calculated input dimensions.

    Parameters:
    - frame (numpy.ndarray): The input image in BGR format.
    - target_sz (int): The target minimum size for resizing the image.
    - max_stride (int): The maximum stride to be considered for calculating input dimensions.

    Returns:
    - rgb_img (PIL.Image): The converted RGB image.
    - input_dims (list of int): Dimensions of the image that are multiples of max_stride.
    - offsets (numpy.ndarray): Offsets from the resized image dimensions to the input dimensions.
    - min_img_scale (float): Scale factor between the original and resized image.
    - input_img (PIL.Image): Cropped image based on the calculated input dimensions.
    """

    # Convert the BGR image to RGB
    rgb_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    # Resize the image without cropping
    resized_img = resize_img(rgb_img, target_sz=target_sz, divisor=1)
    # Calculate the input dimensions as multiples of the max stride
    input_dims = [dim - dim % max_stride for dim in resized_img.size]
    # Calculate the offsets from the resized image dimensions to the input dimensions
    offsets = (np.array(resized_img.size) - input_dims) / 2
    # Calculate the scale between the source image and the resized image
    min_img_scale = min(rgb_img.size) / min(resized_img.size)
    # Crop the resized image to the input dimensions
    input_img = resized_img.crop(box=[*offsets, *resized_img.size - offsets])
    return rgb_img, input_dims, offsets, min_img_scale, input_img
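
To make the stride arithmetic concrete, here is a minimal pure-Python sketch of steps 3 and 4 on their own. The helper name stride_aligned_dims and the example sizes are hypothetical; they only mirror the calculations inside the function above.

```python
def stride_aligned_dims(resized_size, max_stride=32):
    """Return (input_dims, offsets) for a resized (width, height) pair."""
    # Round each dimension down to the nearest multiple of the max stride
    input_dims = [dim - dim % max_stride for dim in resized_size]
    # Split the trimmed pixels evenly between both sides (a center crop)
    offsets = [(dim - input_dim) / 2 for dim, input_dim in zip(resized_size, input_dims)]
    return input_dims, offsets

# A size that is already aligned needs no cropping
print(stride_aligned_dims((288, 512)))  # ([288, 512], [0.0, 0.0])

# A size that is not aligned loses a few pixels on each side
print(stride_aligned_dims((300, 533)))  # ([288, 512], [6.0, 10.5])

# Scale between a 720x1280 source frame and its 288x512 resized version
min_img_scale = min((720, 1280)) / min((288, 512))
print(min_img_scale)  # 2.5
```

The offsets and the scale factor are what later let us map predicted boxes from the cropped model input back onto the original frame.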

Define Functions to Process YOLOX Output

We can use the same utility functions defined in the previous tutorial on exporting the model to ONNX.

Define a function to generate the output grids

def generate_output_grids_np(height, width, strides=[8,16,32]):
    """
    Generate a numpy array containing grid coordinates and strides for a given height and width.

    Args:
        height (int): The height of the image.
        width (int): The width of the image.
        strides (list of int): The strides used to divide the image into grids.

    Returns:
        np.ndarray: A numpy array containing grid coordinates and strides.
    """

    all_coordinates = []

    for stride in strides:
        # Calculate the grid height and width
        grid_height = height // stride
        grid_width = width // stride

        # Generate grid coordinates
        g1, g0 = np.meshgrid(np.arange(grid_height), np.arange(grid_width), indexing='ij')

        # Create an array of strides
        s = np.full((grid_height, grid_width), stride)

        # Stack the coordinates along with the stride
        coordinates = np.stack((g0.flatten(), g1.flatten(), s.flatten()), axis=-1)

        # Append to the list
        all_coordinates.append(coordinates)

    # Concatenate all arrays in the list along the first dimension
    output_grids = np.concatenate(all_coordinates, axis=0)

    return output_grids
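
Each row of the grid array describes one grid cell, so the number of rows grows quickly with the input size. The sketch below counts the cells for a given input size, assuming the default strides of 8, 16, and 32. The helper name count_grid_cells is hypothetical.

```python
def count_grid_cells(height, width, strides=(8, 16, 32)):
    """Count the grid cells across all strides for a given input size."""
    return sum((height // stride) * (width // stride) for stride in strides)

# 64*36 + 32*18 + 16*9 cells for a 512x288 input
print(count_grid_cells(512, 288))  # 3024
```

Since the box decoding step pairs the model output with these grid rows, this count should also be the number of candidate boxes the model produces per image before filtering.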

Define a function to calculate bounding boxes and probabilities

def calculate_boxes_and_probs(model_output:np.ndarray, output_grids:np.ndarray) -> np.ndarray:
    """
    Calculate the bounding boxes and their probabilities.

    Parameters:
    model_output (numpy.ndarray): The output of the model.
    output_grids (numpy.ndarray): The output grids.

    Returns:
    numpy.ndarray: The array containing the bounding box coordinates, class labels, and maximum probabilities.
    """
    # Calculate the bounding box coordinates
    box_centroids = (model_output[..., :2] + output_grids[..., :2]) * output_grids[..., 2:]
    box_sizes = np.exp(model_output[..., 2:4]) * output_grids[..., 2:]

    x0, y0 = [t.squeeze(axis=2) for t in np.split(box_centroids - box_sizes / 2, 2, axis=2)]
    w, h = [t.squeeze(axis=2) for t in np.split(box_sizes, 2, axis=2)]

    # Calculate the probabilities for each class
    box_objectness = model_output[..., 4]
    box_cls_scores = model_output[..., 5:]
    box_probs = np.expand_dims(box_objectness, -1) * box_cls_scores

    # Get the maximum probability and corresponding class for each proposal
    max_probs = np.max(box_probs, axis=-1)
    labels = np.argmax(box_probs, axis=-1)

    return np.array([x0, y0, w, h, labels, max_probs]).transpose((1, 2, 0))
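
The decoding math is easier to follow for a single prediction. The following sketch applies the same formulas as the function above to one raw box and one grid cell in plain Python. The helper name decode_box and the example values are hypothetical.

```python
import math

def decode_box(raw, cell):
    """Decode one raw prediction (tx, ty, tw, th) for a grid cell (gx, gy, stride)."""
    tx, ty, tw, th = raw
    gx, gy, stride = cell
    # The center is the cell position plus the predicted offset, scaled by the stride
    cx, cy = (tx + gx) * stride, (ty + gy) * stride
    # The size is the exponential of the raw value, scaled by the stride
    w, h = math.exp(tw) * stride, math.exp(th) * stride
    # Convert from center format to top-left format
    return cx - w / 2, cy - h / 2, w, h

# A prediction centered in cell (3, 2) of the stride-8 grid
print(decode_box((0.5, 0.5, 0.0, 0.0), (3, 2, 8)))  # (24.0, 16.0, 8.0, 8.0)
```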

Define a function to extract object proposals from the raw model output

def process_outputs(outputs:np.ndarray, input_dims:tuple, bbox_conf_thresh:float):
    """
    Process the model outputs to generate bounding box proposals filtered by confidence threshold.

    Parameters:
    - outputs (numpy.ndarray): The raw output from the model, which will be processed to calculate boxes and probabilities.
    - input_dims (tuple of int): Dimensions (height, width) of the input image to the model.
    - bbox_conf_thresh (float): Threshold for the bounding box confidence/probability. Bounding boxes with a confidence
                                score below this threshold will be discarded.

    Returns:
    - numpy.array: An array of proposals where each proposal is an array containing bounding box coordinates
                   and its associated probability, sorted in descending order by probability.
    """

    # Process the model output
    outputs = calculate_boxes_and_probs(outputs, generate_output_grids_np(*input_dims))
    # Filter the proposals based on the confidence threshold
    max_probs = outputs[:, :, -1]
    mask = max_probs > bbox_conf_thresh
    proposals = outputs[mask]
    # Sort the proposals by probability in descending order
    proposals = proposals[proposals[..., -1].argsort()][::-1]
    return proposals
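
The filtering and sorting steps are independent of NumPy and can be sketched with plain lists. The helper name filter_and_sort and the proposal values below are hypothetical.

```python
def filter_and_sort(proposals, conf_thresh):
    """Keep proposals scoring above the threshold, highest score first."""
    kept = [p for p in proposals if p[-1] > conf_thresh]
    return sorted(kept, key=lambda p: p[-1], reverse=True)

# Hypothetical proposals in (x, y, w, h, label, prob) form
proposals = [
    (10, 10, 50, 50, 0, 0.05),
    (12, 11, 48, 52, 0, 0.90),
    (200, 80, 40, 60, 1, 0.35),
]

filtered = filter_and_sort(proposals, conf_thresh=0.1)
print([p[-1] for p in filtered])  # [0.9, 0.35]
```

Sorting by descending probability matters because the non-maximum suppression step that follows assumes earlier boxes have higher scores.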

Define a function to calculate the intersection-over-union

def calc_iou(proposals:np.ndarray) -> np.ndarray:
    """
    Calculates the Intersection over Union (IoU) for all pairs of bounding boxes (x,y,w,h) in 'proposals'.

    The IoU is a measure of overlap between two bounding boxes. It is calculated as the area of
    intersection divided by the area of union of the two boxes.

    Parameters:
    proposals (2D np.array): A NumPy array of bounding boxes, where each box is an array [x, y, width, height].

    Returns:
    iou (2D np.array): The IoU matrix where each element i,j represents the IoU of boxes i and j.
    """

    # Calculate coordinates for the intersection rectangles
    x1 = np.maximum(proposals[:, 0], proposals[:, 0][:, None])
    y1 = np.maximum(proposals[:, 1], proposals[:, 1][:, None])
    x2 = np.minimum(proposals[:, 0] + proposals[:, 2], (proposals[:, 0] + proposals[:, 2])[:, None])
    y2 = np.minimum(proposals[:, 1] + proposals[:, 3], (proposals[:, 1] + proposals[:, 3])[:, None])
    # Calculate intersection areas
    intersections = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)

    # Calculate union areas
    areas = proposals[:, 2] * proposals[:, 3]
    unions = areas[:, None] + areas - intersections

    # Calculate IoUs
    iou = intersections / unions

    # Return the iou matrix
    return iou
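
For intuition, the same calculation for a single pair of boxes looks like this. It is a minimal sketch with a hypothetical helper name, iou_xywh, rather than a replacement for the vectorized function above.

```python
def iou_xywh(box_a, box_b):
    """Compute the IoU of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (zero when the boxes do not overlap)
    inter_w = max(min(ax + aw, bx + bw) - max(ax, bx), 0)
    inter_h = max(min(ay + ah, by + bh) - max(ay, by), 0)
    intersection = inter_w * inter_h
    # The union counts the overlapping region only once
    union = aw * ah + bw * bh - intersection
    return intersection / union

# Two 10x10 boxes that overlap by half: 50 / (100 + 100 - 50)
print(round(iou_xywh((0, 0, 10, 10), (5, 0, 10, 10)), 4))  # 0.3333

# Boxes that do not overlap
print(iou_xywh((0, 0, 10, 10), (20, 20, 5, 5)))  # 0.0
```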

Define a function to filter bounding box proposals using Non-Maximum Suppression

def nms_sorted_boxes(iou:np.ndarray, iou_thresh:float=0.45) -> np.ndarray:
    """
    Applies non-maximum suppression (NMS) to sorted bounding boxes.

    It suppresses boxes that have high overlap (as defined by the IoU threshold) with a box that 
    has a higher score.

    Parameters:
    iou (np.ndarray): An IoU matrix where each element i,j represents the IoU of boxes i and j.
    iou_thresh (float): The IoU threshold for suppression. Boxes with IoU > iou_thresh are suppressed.

    Returns:
    keep (np.ndarray): The indices of the boxes to keep after applying NMS.
    """

    # Create a boolean mask to keep track of boxes
    mask = np.ones(iou.shape[0], dtype=bool)

    # Apply non-max suppression
    for i in range(iou.shape[0]):
        if mask[i]:
            # Suppress boxes with higher index and IoU > threshold
            mask[(iou[i] > iou_thresh) & (np.arange(iou.shape[0]) > i)] = False

    # Return the indices of the boxes to keep
    return np.arange(iou.shape[0])[mask]
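
The suppression logic can be traced by hand on a tiny IoU matrix. The sketch below reimplements the same greedy loop with nested lists, assuming the boxes are already sorted by descending score. The helper name nms_from_iou and the matrix values are hypothetical.

```python
def nms_from_iou(iou, iou_thresh=0.45):
    """Greedy NMS over an IoU matrix whose rows are sorted by descending score."""
    keep = [True] * len(iou)
    for i in range(len(iou)):
        # A suppressed box cannot suppress other boxes
        if not keep[i]:
            continue
        # Suppress lower-scoring boxes that overlap box i too much
        for j in range(i + 1, len(iou)):
            if iou[i][j] > iou_thresh:
                keep[j] = False
    return [i for i, kept in enumerate(keep) if kept]

# Box 1 overlaps heavily with box 0; box 2 is mostly separate
iou = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(nms_from_iou(iou))  # [0, 2]
```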

Define a Function to Annotate Images with Bounding Boxes

Likewise, we can use the same function for annotating images with bounding boxes with PIL.

def draw_bboxes_pil(image, boxes, labels, colors, font, width=2, font_size=18, probs=None):
    """
    Annotates an image with bounding boxes, labels, and optional probability scores.

    Parameters:
    - image (PIL.Image): The input image on which annotations will be drawn.
    - boxes (list of tuples): A list of bounding box coordinates where each tuple is (x, y, w, h).
    - labels (list of str): A list of labels corresponding to each bounding box.
    - colors (list of str): A list of colors for each bounding box and its corresponding label.
    - font (str): Path to the font file to be used for displaying the labels.
    - width (int, optional): Width of the bounding box lines. Defaults to 2.
    - font_size (int, optional): Size of the font for the labels. Defaults to 18.
    - probs (list of float, optional): A list of probability scores corresponding to each label. Defaults to None.

    Returns:
    - annotated_image (PIL.Image): The image annotated with bounding boxes, labels, and optional probability scores.
    """
    # Define a reference diagonal (assumed value; adjust to change the label scaling)
    REFERENCE_DIAGONAL = 1000

    # Scale the font size using the hypotenuse of the image
    font_size = int(font_size * (np.hypot(*image.size) / REFERENCE_DIAGONAL))

    # Add probability scores to labels if provided
    if probs is not None:
        labels = [f"{label}: {prob*100:.2f}%" for label, prob in zip(labels, probs)]

    # Create an ImageDraw object for drawing on the image
    draw = ImageDraw.Draw(image)

    # Load the font file (outside the loop)
    fnt = ImageFont.truetype(font, font_size)
    # Compute the mean color value for each color
    mean_colors = [np.mean(np.array(color)) for color in colors]

    # Loop through the bounding boxes, labels, and colors
    for box, label, color, mean_color in zip(boxes, labels, colors, mean_colors):
        # Get the bounding box coordinates
        x, y, w, h = box

        # Draw the bounding box on the image
        draw.rectangle([x, y, x+w, y+h], outline=color, width=width)
        # Get the size of the label text box
        label_w, label_h = draw.textbbox(xy=(0,0), text=label, font=fnt)[2:]
        # Draw the label rectangle on the image
        draw.rectangle([x, y-label_h, x+label_w, y], outline=color, fill=color)

        # Draw the label text on the image
        font_color = 'black' if mean_color > 127.5 else 'white'
        draw.multiline_text((x, y-label_h), label, font=fnt, fill=font_color)
    return image
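
Two small calculations in this function are worth isolating: scaling the font size with the image diagonal and picking a readable text color for each label background. The sketch below shows both in plain Python. The helper names are hypothetical, and the reference diagonal of 1000 is an assumed value for illustration.

```python
import math

REFERENCE_DIAGONAL = 1000  # Assumed reference value for this sketch

def scaled_font_size(font_size, image_size, reference_diagonal=REFERENCE_DIAGONAL):
    """Scale a base font size by the image diagonal relative to a reference diagonal."""
    return int(font_size * (math.hypot(*image_size) / reference_diagonal))

def label_text_color(color):
    """Pick black text for light backgrounds and white text for dark ones."""
    mean_color = sum(color) / len(color)
    return 'black' if mean_color > 127.5 else 'white'

# An image whose diagonal equals the reference keeps the base font size
print(scaled_font_size(18, (600, 800)))    # 18
# Doubling the diagonal doubles the font size
print(scaled_font_size(18, (1200, 1600)))  # 36

print(label_text_color((255, 255, 0)))  # black
print(label_text_color((0, 0, 128)))    # white
```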

That takes care of the required utility functions. In the next section, we will use our ONNX model with ByteTrack to track objects in a video.

Tracking Objects in Videos

We will first initialize an inference session with our ONNX model.

Create an Inference Session

# Get a filename for the ONNX model
onnx_file_path = list(checkpoint_dir.glob('*.onnx'))[0]
# Load the model and create an InferenceSession
providers = [
    'CPUExecutionProvider',
    # "CUDAExecutionProvider",
]
sess_options = ort.SessionOptions()
session = ort.InferenceSession(onnx_file_path, sess_options=sess_options, providers=providers)

Select a Test Video

Next, we need a video to test the object tracking performance. We can use this one from Pexels, a free stock photo & video site.

# Specify the directory where videos are or will be stored.
video_dir = "./videos/"

# Name of the test video to be used.
test_video_name = "pexels-rodnae-productions-10373924.mp4"

# Construct the full path for the video using the directory and video name.
video_path = f"{video_dir}{test_video_name}"

# Define the URL for the test video stored on Huggingface's server.
test_video_url = f"{test_video_name}"

# Download the video file from the specified URL to the local video directory.
download_file(test_video_url, video_dir, False)

# Display the video in the notebook using IPython's Video class
from IPython.display import Video
Video(video_path)

Initialize a VideoCapture Object

Now that we have a test video, we can use OpenCV’s VideoCapture class to iterate through it and access relevant metadata.

# Open the video file located at 'video_path' using OpenCV
video_capture = cv2.VideoCapture(video_path)

# Retrieve the frame width of the video
frame_width = int(video_capture.get(cv2.CAP_PROP_FRAME_WIDTH))
# Retrieve the frame height of the video
frame_height = int(video_capture.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Retrieve the frames per second (FPS) of the video
frame_fps = int(video_capture.get(cv2.CAP_PROP_FPS))
# Retrieve the total number of frames in the video
frames = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))

# Create a pandas Series containing video metadata and convert it to a DataFrame
pd.Series({
    "Frame Width:": frame_width,
    "Frame Height:": frame_height,
    "Frame FPS:": frame_fps,
    "Frames:": frames
}).to_frame().style.hide(axis='columns')

Frame Width: 720
Frame Height: 1280
Frame FPS: 29
Frames: 226

Initialize a VideoWriter Object

We will use OpenCV’s VideoWriter class to save the annotated version of our test video.

# Construct the output video path 
video_out_path = f"{(video_dir)}{Path(video_path).stem}-byte-track.mp4"

# Initialize a VideoWriter object for video writing.
# 1. video_out_path: Specifies the name of the output video file.
# 2. cv2.VideoWriter_fourcc(*'mp4v'): Specifies the codec for the output video. 'mp4v' is used for .mp4 format.
# 3. frame_fps: Specifies the frames per second for the output video.
# 4. (frame_width, frame_height): Specifies the width and height of the frames in the output video.
video_writer = cv2.VideoWriter(video_out_path, cv2.VideoWriter_fourcc(*'mp4v'), frame_fps, (frame_width, frame_height))

Define Inference Parameters

test_sz = 288
bbox_conf_thresh = 0.1
iou_thresh = 0.45

Detect, Track, and Annotate Objects in Video Frames

In this section, we’ll iterate through each frame of our test video, detect objects, track those objects across video frames, and then annotate the video frames with the corresponding bounding boxes and tracking IDs.

We start by initializing a ByteTracker object. Then, we can iterate over the video frames using our VideoCapture object. We pass each frame through the prepare_image_for_inference function to apply the preprocessing steps, then convert the image to a NumPy array and scale the values to the range [0,1].

Next, we pass the input to our YOLOX model to get the raw prediction data. After that, we can process the raw output to extract the bounding box predictions. We then pass the bounding box predictions to the ByteTracker object so it can update its current object tracks.

Once we match the updated track data with the current bounding box predictions, we can annotate the current frame with bounding boxes and associated track IDs.

# Initialize a ByteTracker object
tracker = BYTETracker(track_thresh=0.25, track_buffer=30, match_thresh=0.8, frame_rate=frame_fps)

with tqdm(total=frames, desc="Processing frames") as pbar:
    while video_capture.isOpened():
        ret, frame = video_capture.read()
        if ret:
            # Prepare an input image for inference
            rgb_img, input_dims, offsets, min_img_scale, input_img = prepare_image_for_inference(frame, test_sz, max_stride)
            # Convert the existing input image to NumPy format
            input_tensor_np = np.array(input_img, dtype=np.float32).transpose((2, 0, 1))[None]/255

            # Start performance counter
            start_time = time.perf_counter()
            # Run inference
            outputs = session.run(None, {"input": input_tensor_np})[0]

            # Process the model output
            proposals = process_outputs(outputs, input_tensor_np.shape[input_dim_slice], bbox_conf_thresh)
            # Apply non-max suppression to the proposals with the specified threshold
            proposal_indices = nms_sorted_boxes(calc_iou(proposals[:, :-2]), iou_thresh)
            proposals = proposals[proposal_indices]
            bbox_list = (proposals[:,:4]+[*offsets, 0, 0])*min_img_scale
            label_list = [class_names[int(idx)] for idx in proposals[:,4]]
            probs_list = proposals[:,5]

            # Initialize the track IDs for the current detections (-1 means untracked)
            track_ids = [-1]*len(bbox_list)

            # Convert to tlbr format
            tlbr_boxes = bbox_list.copy()
            tlbr_boxes[:, 2:4] += tlbr_boxes[:, :2]

            # Update tracker with detections
            tracks = tracker.update(
                output_results=np.concatenate([tlbr_boxes, probs_list[:, np.newaxis]], axis=1),
                img_info=rgb_img.size,
                img_size=rgb_img.size)
            track_ids = match_detections_with_tracks(tlbr_boxes=tlbr_boxes, track_ids=track_ids, tracks=tracks)

            # End performance counter
            end_time = time.perf_counter()
            # Calculate the combined FPS for object detection and tracking
            fps = 1 / (end_time - start_time)
            # Display the frame rate in the progress bar
            pbar.set_postfix(fps=fps)

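            # Filter object detections based on tracking results
            tracked = [(bbox, label, prob, track_id)
                       for bbox, label, prob, track_id
                       in zip(bbox_list, label_list, probs_list, track_ids) if track_id != -1]

            # Annotate the current frame with bounding boxes and tracking IDs
            annotated_img = rgb_img
            if tracked:
                bbox_list, label_list, probs_list, track_ids = zip(*tracked)
                annotated_img = draw_bboxes_pil(
                    image=rgb_img,
                    boxes=bbox_list,
                    labels=[f"{track_id}-{label}" for track_id, label in zip(track_ids, label_list)],
                    colors=[int_colors[class_names.index(i)] for i in label_list],
                    font=font_file,
                )
            annotated_frame = cv2.cvtColor(np.array(annotated_img), cv2.COLOR_RGB2BGR)

            # Write the annotated frame to the output video
            video_writer.write(annotated_frame)
            pbar.update(1)
        else:
            break

# Release the video resources
video_capture.release()
video_writer.release()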
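
The loop above applies two coordinate conversions to each predicted box: it maps the box from the cropped model input back to the source frame, then converts it from (x, y, w, h) to the top-left/bottom-right (tlbr) format passed to the tracker. Here is a minimal single-box sketch of both steps. The helper name to_source_tlbr and the example values are hypothetical.

```python
def to_source_tlbr(box, offsets, scale):
    """Map an (x, y, w, h) box from model-input space to source-image (x1, y1, x2, y2)."""
    x, y, w, h = box
    # Undo the center crop, then undo the resize
    x, y = (x + offsets[0]) * scale, (y + offsets[1]) * scale
    w, h = w * scale, h * scale
    # Convert from top-left plus size to top-left plus bottom-right
    return x, y, x + w, y + h

print(to_source_tlbr((10, 20, 30, 40), offsets=(6.0, 10.5), scale=2.5))
# (40.0, 76.25, 115.0, 176.25)
```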

Finally, we can check the annotated video to see how the object tracker performed.

The ByteTracker had no issue tracking the two hands throughout the video, as the track IDs remained the same for each hand.

Google Colab Users
  1. Don’t forget to download the annotated video from the Colab Environment’s file browser.


Congratulations on reaching the end of this tutorial on object tracking with YOLOX and ByteTrack! With this knowledge, we have unlocked a new realm of potential applications for our YOLOX model.

Combining YOLOX’s robust detection capabilities with ByteTrack’s tracking efficiency gives you a powerful toolset to work on myriad projects, from video analysis to immersive augmented reality experiences.

As a follow-up project, consider integrating our hand sign detector with ByteTrack in an application for gesture-based controls or training a new YOLOX model for other domains. The potential applications of this powerful combination are vast, limited only by your imagination.

If you found this guide helpful, consider sharing it with others.