Real-Time Object Tracking with YOLOX and ByteTrack
- Introduction
- Getting Started with the Code
- Setting Up Your Python Environment
- Importing the Required Dependencies
- Setting Up the Project
- Loading the Checkpoint Data
- Defining Utility Functions
- Tracking Objects in Videos
- Conclusion
Introduction
Welcome back to this series on real-time object detection with YOLOX! Previously, we fine-tuned a YOLOX model in PyTorch to detect hand signs and exported it to ONNX. This tutorial combines our YOLOX model with the ByteTrack object tracker to track objects continuously across video frames.
Tracking objects over time unlocks a wide range of potential applications. With our hand-sign detector, we could implement gesture-based device controls and create interactive gaming and multimedia experiences. Beyond our specific model, object tracking has applications in everything from sports analysis to wildlife monitoring.
By the end of this tutorial, you will understand how to combine a YOLOX object detection model with ByteTrack, enabling you to effectively track hand signs or other objects across consecutive video frames.
Getting Started with the Code
As with the previous tutorial, the code is available as a Jupyter Notebook.
Jupyter Notebook | Google Colab |
---|---|
GitHub Repository | Open In Colab |
Setting Up Your Python Environment
We need to add a couple of new libraries to our Python environment. We will use OpenCV to read and write video files. I also made a package with a standalone implementation of ByteTrack. Make sure to install onnx and onnxruntime if you did not follow the previous tutorial.
Package | Description |
---|---|
onnx | This package provides a Python API for working with ONNX models. (link) |
onnxruntime | ONNX Runtime is a runtime accelerator for machine learning models. (link) |
opencv-python | Wrapper package for OpenCV python bindings. (link) |
cjm-byte-track | A standalone Python implementation of the ByteTrack multi-object tracker based on the official implementation. (link) |
Run the following command to install these additional libraries:
# Install packages
pip install onnx onnxruntime opencv-python cjm_byte_track
Importing the Required Dependencies
With our environment updated, we can dive into the code. First, we will import the necessary Python dependencies into our Jupyter Notebook.
# Import Python Standard Library dependencies
from dataclasses import dataclass
import json
from pathlib import Path
import random
import time
from typing import List
# Import ByteTrack package
from cjm_byte_track.core import BYTETracker
from cjm_byte_track.matching import match_detections_with_tracks
# Import utility functions
from cjm_psl_utils.core import download_file
from cjm_pil_utils.core import resize_img
# Import OpenCV
import cv2
# Import numpy
import numpy as np
# Import the pandas package
import pandas as pd
# Import PIL for image manipulation
from PIL import Image, ImageDraw, ImageFont
# Import ONNX dependencies
import onnx # Import the onnx module
import onnxruntime as ort # Import the ONNX Runtime
# Import tqdm for progress bar
from tqdm.auto import tqdm
Setting Up the Project
In this section, we will set the folder locations for our project and the directory with the ONNX model and JSON colormap file. We should also ensure we have a font file for annotating images.
Set the Directory Paths
# The name for the project
project_name = f"pytorch-yolox-object-detector"

# The path for the project folder
project_dir = Path(f"./{project_name}/")

# Create the project directory if it does not already exist
project_dir.mkdir(parents=True, exist_ok=True)

# The path to the checkpoint folder
checkpoint_dir = Path(project_dir/f"2023-08-17_16-14-43")
# checkpoint_dir = Path(project_dir/f"pretrained-coco")

pd.Series({
    "Project Directory:": project_dir,
    "Checkpoint Directory:": checkpoint_dir
}).to_frame().style.hide(axis='columns')
Project Directory: | pytorch-yolox-object-detector |
---|---|
Checkpoint Directory: | pytorch-yolox-object-detector/2023-08-17_16-14-43 |
Download a Font File
# Set the name of the font file
font_file = 'KFOlCnqEu92Fr1MmEU9vAw.ttf'

# Download the font file
download_file(f"https://fonts.gstatic.com/s/roboto/v30/{font_file}", "./")
Loading the Checkpoint Data
Now, we can load the colormap and set the max stride value and input dimension slice.
Load the Colormap
# The colormap path
colormap_path = list(checkpoint_dir.glob('*colormap.json'))[0]

# Load the JSON colormap data
with open(colormap_path, 'r') as file:
    colormap_json = json.load(file)

# Convert the JSON data to a dictionary
colormap_dict = {item['label']: item['color'] for item in colormap_json['items']}

# Extract the class names from the colormap
class_names = list(colormap_dict.keys())

# Make a copy of the colormap in integer format
int_colors = [tuple(int(c*255) for c in color) for color in colormap_dict.values()]
Set the Preprocessing and Post-Processing Parameters
max_stride = 32
input_dim_slice = slice(2, 4, None)
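As a quick illustration of how these two values get used later, `max_stride` constrains the model input dimensions to multiples of 32, while `input_dim_slice` pulls the height and width out of the NCHW input tensor shape. Here is a minimal sketch with a placeholder shape (the shape values are only for illustration):

# Hypothetical NCHW input shape: (batch, channels, height, width)
example_shape = (1, 3, 288, 512)

# input_dim_slice selects the (height, width) portion of the shape
print(example_shape[input_dim_slice])  # (288, 512)

# Both spatial dimensions are multiples of max_stride
print(all(dim % max_stride == 0 for dim in example_shape[input_dim_slice]))  # True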
Defining Utility Functions
Next, we will define some utility functions for preparing the input data and processing the model output.
Define a Function to Prepare Images for Inference
OpenCV uses the BGR (Blue, Green, Red) color format for images, so we must change the current video frame to RGB before performing the standard preprocessing steps for the YOLOX model.
def prepare_image_for_inference(frame:np.ndarray, target_sz:int, max_stride:int):
    """
    Prepares an image for inference by performing a series of preprocessing steps.
    Steps:
    1. Converts a BGR image to RGB.
    2. Resizes the image to a target size without cropping, considering a given divisor.
    3. Calculates input dimensions as multiples of the max stride.
    4. Calculates offsets based on the resized image dimensions and input dimensions.
    5. Computes the scale between the original and resized image.
    6. Crops the resized image based on calculated input dimensions.
    Parameters:
    - frame (numpy.ndarray): The input image in BGR format.
    - target_sz (int): The target minimum size for resizing the image.
    - max_stride (int): The maximum stride to be considered for calculating input dimensions.
    Returns:
    tuple:
    - rgb_img (PIL.Image): The converted RGB image.
    - input_dims (list of int): Dimensions of the image that are multiples of max_stride.
    - offsets (numpy.ndarray): Offsets from the resized image dimensions to the input dimensions.
    - min_img_scale (float): Scale factor between the original and resized image.
    - input_img (PIL.Image): Cropped image based on the calculated input dimensions.
    """
    # Convert the BGR image to RGB
    rgb_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # Resize image without cropping to multiple of the max stride
    resized_img = resize_img(rgb_img, target_sz=target_sz, divisor=1)

    # Calculate the input dimensions that are multiples of the max stride
    input_dims = [dim - dim % max_stride for dim in resized_img.size]

    # Calculate the offsets from the resized image dimensions to the input dimensions
    offsets = (np.array(resized_img.size) - input_dims) / 2

    # Calculate the scale between the source image and the resized image
    min_img_scale = min(rgb_img.size) / min(resized_img.size)

    # Crop the resized image to the input dimensions
    input_img = resized_img.crop(box=[*offsets, *resized_img.size - offsets])

    return rgb_img, input_dims, offsets, min_img_scale, input_img
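As a quick, self-contained check of the return values, we could run the function on a synthetic frame. The frame contents and size below are arbitrary placeholders, so treat this as a sketch rather than part of the pipeline:

# Create a synthetic 1280x720 BGR frame (pixel values are arbitrary)
dummy_frame = np.random.randint(0, 255, size=(1280, 720, 3), dtype=np.uint8)

# Run the preprocessing steps
rgb_img, input_dims, offsets, min_img_scale, input_img = prepare_image_for_inference(dummy_frame, target_sz=288, max_stride=32)

# The cropped input dimensions should be multiples of the max stride
print(input_dims, input_img.size, min_img_scale)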
Define Functions to Process YOLOX Output
We can use the same utility functions defined in the previous tutorial on exporting the model to ONNX.
Define a function to generate the output grids
def generate_output_grids_np(height, width, strides=[8,16,32]):
    """
    Generate a numpy array containing grid coordinates and strides for a given height and width.
    Args:
        height (int): The height of the image.
        width (int): The width of the image.
        strides (list of int): The stride values for the output feature maps.
    Returns:
        np.ndarray: A numpy array containing grid coordinates and strides.
    """
    all_coordinates = []

    for stride in strides:
        # Calculate the grid height and width
        grid_height = height // stride
        grid_width = width // stride

        # Generate grid coordinates
        g1, g0 = np.meshgrid(np.arange(grid_height), np.arange(grid_width), indexing='ij')

        # Create an array of strides
        s = np.full((grid_height, grid_width), stride)

        # Stack the coordinates along with the stride
        coordinates = np.stack((g0.flatten(), g1.flatten(), s.flatten()), axis=-1)

        # Append to the list
        all_coordinates.append(coordinates)

    # Concatenate all arrays in the list along the first dimension
    output_grids = np.concatenate(all_coordinates, axis=0)

    return output_grids
Define a function to calculate bounding boxes and probabilities
def calculate_boxes_and_probs(model_output:np.ndarray, output_grids:np.ndarray) -> np.ndarray:
    """
    Calculate the bounding boxes and their probabilities.
    Parameters:
        model_output (numpy.ndarray): The output of the model.
        output_grids (numpy.ndarray): The output grids.
    Returns:
        numpy.ndarray: The array containing the bounding box coordinates, class labels, and maximum probabilities.
    """
    # Calculate the bounding box coordinates
    box_centroids = (model_output[..., :2] + output_grids[..., :2]) * output_grids[..., 2:]
    box_sizes = np.exp(model_output[..., 2:4]) * output_grids[..., 2:]

    x0, y0 = [t.squeeze(axis=2) for t in np.split(box_centroids - box_sizes / 2, 2, axis=2)]
    w, h = [t.squeeze(axis=2) for t in np.split(box_sizes, 2, axis=2)]

    # Calculate the probabilities for each class
    box_objectness = model_output[..., 4]
    box_cls_scores = model_output[..., 5:]
    box_probs = np.expand_dims(box_objectness, -1) * box_cls_scores

    # Get the maximum probability and corresponding class for each proposal
    max_probs = np.max(box_probs, axis=-1)
    labels = np.argmax(box_probs, axis=-1)

    return np.array([x0, y0, w, h, labels, max_probs]).transpose((1, 2, 0))
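To make the decoding math concrete, here is a tiny worked example with made-up raw values for a single proposal on a stride-32 grid cell. The numbers are illustrative only and do not come from the model:

# One fabricated proposal: [x_offset, y_offset, log_w, log_h, objectness, class_0 score, class_1 score]
toy_output = np.array([[[0.5, 0.25, np.log(3.0), np.log(2.0), 0.9, 0.2, 0.8]]])

# Matching grid entry: cell (2, 3) on the stride-32 feature map
toy_grid = np.array([[2, 3, 32]])

# Centroid: (0.5+2)*32=80, (0.25+3)*32=104; size: 3*32=96, 2*32=64
# Top-left corner: (80-48, 104-32) = (32, 72); class 1 wins with 0.9*0.8=0.72
print(calculate_boxes_and_probs(toy_output, toy_grid))  # approx. [[[32. 72. 96. 64. 1. 0.72]]]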
Define a function to extract object proposals from the raw model output
def process_outputs(outputs:np.ndarray, input_dims:tuple, bbox_conf_thresh:float):
    """
    Process the model outputs to generate bounding box proposals filtered by confidence threshold.
    Parameters:
    - outputs (numpy.ndarray): The raw output from the model, which will be processed to calculate boxes and probabilities.
    - input_dims (tuple of int): Dimensions (height, width) of the input image to the model.
    - bbox_conf_thresh (float): Threshold for the bounding box confidence/probability. Bounding boxes with a confidence
      score below this threshold will be discarded.
    Returns:
    - numpy.array: An array of proposals where each proposal is an array containing bounding box coordinates
      and its associated probability, sorted in descending order by probability.
    """
    # Process the model output
    outputs = calculate_boxes_and_probs(outputs, generate_output_grids_np(*input_dims))
    # Filter the proposals based on the confidence threshold
    max_probs = outputs[:, :, -1]
    mask = max_probs > bbox_conf_thresh
    proposals = outputs[mask]
    # Sort the proposals by probability in descending order
    proposals = proposals[proposals[..., -1].argsort()][::-1]
    return proposals
Define a function to calculate the intersection-over-union
def calc_iou(proposals:np.ndarray) -> np.ndarray:
    """
    Calculates the Intersection over Union (IoU) for all pairs of bounding boxes (x,y,w,h) in 'proposals'.
    The IoU is a measure of overlap between two bounding boxes. It is calculated as the area of
    intersection divided by the area of union of the two boxes.
    Parameters:
    proposals (2D np.array): A NumPy array of bounding boxes, where each box is an array [x, y, width, height].
    Returns:
    iou (2D np.array): The IoU matrix where each element i,j represents the IoU of boxes i and j.
    """
    # Calculate coordinates for the intersection rectangles
    x1 = np.maximum(proposals[:, 0], proposals[:, 0][:, None])
    y1 = np.maximum(proposals[:, 1], proposals[:, 1][:, None])
    x2 = np.minimum(proposals[:, 0] + proposals[:, 2], (proposals[:, 0] + proposals[:, 2])[:, None])
    y2 = np.minimum(proposals[:, 1] + proposals[:, 3], (proposals[:, 1] + proposals[:, 3])[:, None])

    # Calculate intersection areas
    intersections = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)

    # Calculate union areas
    areas = proposals[:, 2] * proposals[:, 3]
    unions = areas[:, None] + areas - intersections

    # Calculate IoUs
    iou = intersections / unions

    # Return the IoU matrix
    return iou
Define a function to filter bounding box proposals using Non-Maximum Suppression
def nms_sorted_boxes(iou:np.ndarray, iou_thresh:float=0.45) -> np.ndarray:
    """
    Applies non-maximum suppression (NMS) to sorted bounding boxes.
    It suppresses boxes that have high overlap (as defined by the IoU threshold) with a box that
    has a higher score.
    Parameters:
    iou (np.ndarray): An IoU matrix where each element i,j represents the IoU of boxes i and j.
    iou_thresh (float): The IoU threshold for suppression. Boxes with IoU > iou_thresh are suppressed.
    Returns:
    keep (np.ndarray): The indices of the boxes to keep after applying NMS.
    """
    # Create a boolean mask to keep track of boxes
    mask = np.ones(iou.shape[0], dtype=bool)

    # Apply non-max suppression
    for i in range(iou.shape[0]):
        if mask[i]:
            # Suppress boxes with higher index and IoU > threshold
            mask[(iou[i] > iou_thresh) & (np.arange(iou.shape[0]) > i)] = False

    # Return the indices of the boxes to keep
    return np.arange(iou.shape[0])[mask]
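To see how calc_iou and nms_sorted_boxes work together, here is a small, self-contained example with made-up boxes. The coordinates and threshold are illustrative only:

# Three made-up proposals in (x, y, w, h) format, already sorted by score
toy_boxes = np.array([
    [10.0, 10.0, 100.0, 100.0],   # highest-scoring box
    [12.0, 12.0, 100.0, 100.0],   # heavily overlaps the first box
    [300.0, 300.0, 80.0, 80.0],   # far away from the others
])

# The second box has an IoU of roughly 0.92 with the first, so it gets suppressed
keep_indices = nms_sorted_boxes(calc_iou(toy_boxes), iou_thresh=0.45)
print(keep_indices)  # [0 2]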
Define a Function to Annotate Images with Bounding Boxes
Likewise, we can reuse the same PIL-based function for annotating images with bounding boxes.
def draw_bboxes_pil(image, boxes, labels, colors, font, width=2, font_size=18, probs=None):
    """
    Annotates an image with bounding boxes, labels, and optional probability scores.
    Parameters:
    - image (PIL.Image): The input image on which annotations will be drawn.
    - boxes (list of tuples): A list of bounding box coordinates where each tuple is (x, y, w, h).
    - labels (list of str): A list of labels corresponding to each bounding box.
    - colors (list of str): A list of colors for each bounding box and its corresponding label.
    - font (str): Path to the font file to be used for displaying the labels.
    - width (int, optional): Width of the bounding box lines. Defaults to 2.
    - font_size (int, optional): Size of the font for the labels. Defaults to 18.
    - probs (list of float, optional): A list of probability scores corresponding to each label. Defaults to None.
    Returns:
    - annotated_image (PIL.Image): The image annotated with bounding boxes, labels, and optional probability scores.
    """
    # Define a reference diagonal
    REFERENCE_DIAGONAL = 1000

    # Scale the font size using the hypotenuse of the image
    font_size = int(font_size * (np.hypot(*image.size) / REFERENCE_DIAGONAL))

    # Add probability scores to labels if provided
    if probs is not None:
        labels = [f"{label}: {prob*100:.2f}%" for label, prob in zip(labels, probs)]

    # Create an ImageDraw object for drawing on the image
    draw = ImageDraw.Draw(image)

    # Load the font file (outside the loop)
    fnt = ImageFont.truetype(font, font_size)

    # Compute the mean color value for each color
    mean_colors = [np.mean(np.array(color)) for color in colors]

    # Loop through the bounding boxes, labels, and colors
    for box, label, color, mean_color in zip(boxes, labels, colors, mean_colors):
        # Get the bounding box coordinates
        x, y, w, h = box

        # Draw the bounding box on the image
        draw.rectangle([x, y, x+w, y+h], outline=color, width=width)

        # Get the size of the label text box
        label_w, label_h = draw.textbbox(xy=(0,0), text=label, font=fnt)[2:]

        # Draw the label rectangle on the image
        draw.rectangle([x, y-label_h, x+label_w, y], outline=color, fill=color)

        # Draw the label text on the image
        font_color = 'black' if mean_color > 127.5 else 'white'
        draw.multiline_text((x, y-label_h), label, font=fnt, fill=font_color)

    return image
That takes care of the required utility functions. In the next section, we will use our ONNX model with ByteTrack to track objects in a video.
Tracking Objects in Videos
We will first initialize an inference session with our ONNX model.
Create an Inference Session
# Get a filename for the ONNX model
onnx_file_path = list(checkpoint_dir.glob('*.onnx'))[0]

# Load the model and create an InferenceSession
providers = [
    'CPUExecutionProvider',
    # "CUDAExecutionProvider",
]
sess_options = ort.SessionOptions()
session = ort.InferenceSession(onnx_file_path, sess_options=sess_options, providers=providers)
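If you want to verify the session loaded the model you expect, ONNX Runtime lets you inspect the input metadata. A quick, optional sanity check (the printed shape depends on how the model was exported):

# Inspect the model's input name and expected shape
model_input = session.get_inputs()[0]
print(model_input.name)   # should be "input", matching the key used during inference below
print(model_input.shape)  # e.g., dynamic height/width axes if exported that way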
Select a Test Video
Next, we need a video to test the object tracking performance. We can use this one from Pexels, a free stock photo & video site.
# Specify the directory where videos are or will be stored.
video_dir = "./videos/"

# Name of the test video to be used.
test_video_name = "pexels-rodnae-productions-10373924.mp4"

# Construct the full path for the video using the directory and video name.
video_path = f"{video_dir}{test_video_name}"

# Define the URL for the test video stored on Huggingface's server.
test_video_url = f"https://huggingface.co/datasets/cj-mills/pexels-object-tracking-test-videos/resolve/main/{test_video_name}"

# Download the video file from the specified URL to the local video directory.
download_file(test_video_url, video_dir, False)

# Display the video using the Video function (assuming an appropriate library/module is imported).
Video(video_path)
Initialize a VideoCapture Object
Now that we have a test video, we can use OpenCV’s VideoCapture class to iterate through it and access relevant metadata.
# Open the video file located at 'video_path' using OpenCV
video_capture = cv2.VideoCapture(video_path)

# Retrieve the frame width of the video
frame_width = int(video_capture.get(3))
# Retrieve the frame height of the video
frame_height = int(video_capture.get(4))
# Retrieve the frames per second (FPS) of the video
frame_fps = int(video_capture.get(5))
# Retrieve the total number of frames in the video
frames = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))

# Create a pandas Series containing video metadata and convert it to a DataFrame
pd.Series({
    "Frame Width:": frame_width,
    "Frame Height:": frame_height,
    "Frame FPS:": frame_fps,
    "Frames:": frames
}).to_frame().style.hide(axis='columns')
Frame Width: | 720 |
---|---|
Frame Height: | 1280 |
Frame FPS: | 29 |
Frames: | 226 |
Initialize a VideoWriter Object
We will use OpenCV’s VideoWriter class to save the annotated version of our test video.
# Construct the output video path
video_out_path = f"{(video_dir)}{Path(video_path).stem}-byte-track.mp4"

# Initialize a VideoWriter object for video writing.
# 1. video_out_path: Specifies the name of the output video file.
# 2. cv2.VideoWriter_fourcc(*'mp4v'): Specifies the codec for the output video. 'mp4v' is used for .mp4 format.
# 3. frame_fps: Specifies the frames per second for the output video.
# 4. (frame_width, frame_height): Specifies the width and height of the frames in the output video.
video_writer = cv2.VideoWriter(video_out_path, cv2.VideoWriter_fourcc(*'mp4v'), frame_fps, (frame_width, frame_height))
Define Inference Parameters
test_sz = 288
bbox_conf_thresh = 0.1
iou_thresh = 0.45
Detect, Track, and Annotate Objects in Video Frames
In this section, we’ll iterate through each frame of our test video, detect objects, track those objects across video frames, and then annotate the video frames with the corresponding bounding boxes and tracking IDs.
We start by initializing a ByteTracker object. Then, we can iterate over the video frames using our VideoCapture object. For each frame, we pass it through the prepare_image_for_inference function to apply the preprocessing steps. We then convert the image to a NumPy array and scale the values to the range [0,1].
Next, we pass the input to our YOLOX model to get the raw prediction data. After that, we can process the raw output to extract the bounding box predictions. We then pass the bounding box predictions to the ByteTracker object so it can update its current object tracks.
Once we match the updated track data with the current bounding box predictions, we can annotate the current frame with bounding boxes and associated track IDs.
# Initialize a ByteTracker object
tracker = BYTETracker(track_thresh=0.25, track_buffer=30, match_thresh=0.8, frame_rate=frame_fps)

with tqdm(total=frames, desc="Processing frames") as pbar:
    # Iterate through each frame in the video
    while video_capture.isOpened():
        ret, frame = video_capture.read()
        if ret:
            start_time = time.perf_counter()

            # Prepare the input image for inference
            rgb_img, input_dims, offsets, min_img_scale, input_img = prepare_image_for_inference(frame, test_sz, max_stride)

            # Convert the input image to NumPy format for the model
            input_tensor_np = np.array(input_img, dtype=np.float32).transpose((2, 0, 1))[None]/255

            # Run inference using the ONNX session
            outputs = session.run(None, {"input": input_tensor_np})[0]

            # Process the model output to get object proposals
            proposals = process_outputs(outputs, input_tensor_np.shape[input_dim_slice], bbox_conf_thresh)

            # Apply non-max suppression to filter overlapping proposals
            proposal_indices = nms_sorted_boxes(calc_iou(proposals[:, :-2]), iou_thresh)
            proposals = proposals[proposal_indices]

            # Extract bounding boxes, labels, and probabilities from proposals
            bbox_list = (proposals[:,:4]+[*offsets, 0, 0])*min_img_scale
            label_list = [class_names[int(idx)] for idx in proposals[:,4]]
            probs_list = proposals[:,5]

            # Initialize track IDs for detected objects
            track_ids = [-1]*len(bbox_list)

            # Convert bounding boxes to top-left bottom-right (tlbr) format
            tlbr_boxes = bbox_list.copy()
            tlbr_boxes[:, 2:4] += tlbr_boxes[:, :2]

            # Update tracker with detections
            tracks = tracker.update(
                output_results=np.concatenate([tlbr_boxes, probs_list[:, np.newaxis]], axis=1),
                img_info=rgb_img.size,
                img_size=rgb_img.size)

            if len(tlbr_boxes) > 0 and len(tracks) > 0:
                # Match detections with tracks
                track_ids = match_detections_with_tracks(tlbr_boxes=tlbr_boxes, track_ids=track_ids, tracks=tracks)

                # Filter object detections based on tracking results
                bbox_list, label_list, probs_list, track_ids = zip(*[(bbox, label, prob, track_id)
                                                                     for bbox, label, prob, track_id
                                                                     in zip(bbox_list, label_list, probs_list, track_ids) if track_id != -1])

                if len(bbox_list) > 0:
                    # Annotate the current frame with bounding boxes and tracking IDs
                    annotated_img = draw_bboxes_pil(
                        image=rgb_img,
                        boxes=bbox_list,
                        labels=[f"{track_id}-{label}" for track_id, label in zip(track_ids, label_list)],
                        probs=probs_list,
                        colors=[int_colors[class_names.index(i)] for i in label_list],
                        font=font_file,
                    )
                    annotated_frame = cv2.cvtColor(np.array(annotated_img), cv2.COLOR_RGB2BGR)
            else:
                # If no detections, use the original frame
                annotated_frame = frame

            video_writer.write(annotated_frame)
            pbar.update(1)
        else:
            break

video_capture.release()
video_writer.release()
Finally, we can check the annotated video to see how the object tracker performed.
The ByteTracker had no issue tracking the two hands throughout the video, as the track IDs remained the same for each hand.
- Don’t forget to download the annotated video from the Colab Environment’s file browser. (tutorial link)
Conclusion
Congratulations on reaching the end of this tutorial on object tracking with YOLOX and ByteTrack! With this knowledge, we have unlocked a new realm of potential applications for our YOLOX model.
Combining YOLOX’s robust detection capabilities with ByteTrack’s tracking efficiency gives you a powerful toolset to work on myriad projects, from video analysis to immersive augmented reality experiences.
As a follow-up project, consider integrating our hand sign detector with ByteTrack in an application for gesture-based controls or training a new YOLOX model for other domains. The potential applications of this powerful combination are vast, limited only by your imagination.
- Feel free to post questions or problems related to this tutorial in the comments below. I try to make time to address them on Thursdays and Fridays.
I’m Christian Mills, a deep learning consultant specializing in practical AI implementations. I help clients leverage cutting-edge AI technologies to solve real-world problems.
Interested in working together? Fill out my Quick AI Project Assessment form or learn more about me.