Quantizing timm Image Classifiers with ONNX Runtime and TensorRT in Ubuntu
- Introduction
- Quantization Process
- Getting Started with the Code
- Setting Up Your Python Environment
- Importing the Required Dependencies
- Setting Up the Project
- Loading the Checkpoint Data
- Loading the Dataset
- Collecting Calibration Data
- Performing Inference with TensorRT
- Conclusion
Introduction
Welcome back to this series on image classification with the timm library. Previously, we fine-tuned a ResNet 18-D model in PyTorch to classify hand signs and exported it to ONNX. This tutorial covers quantizing our ONNX model and performing int8 inference using ONNX Runtime and TensorRT.
Quantization aims to make inference more computationally and memory efficient using a lower precision data type (e.g., 8-bit integer (int8)) for the model weights and activations. Modern devices increasingly have specialized hardware for running models at these lower precisions for improved performance.
ONNX Runtime includes tools to assist with quantizing our model from its original float32 precision to int8. ONNX Runtime’s execution providers also make it easier to leverage the hardware-specific inference libraries used to run models on the specialized hardware. In this tutorial, we will use the TensorRT Execution Provider to perform int8-precision inference.
TensorRT is a high-performance inference library for NVIDIA hardware. For our purposes it allows us to run our image classification model at 16-bit and 8-bit precision, while leveraging the specialized tensor cores in modern NVIDIA devices.
TensorRT requires NVIDIA hardware with CUDA Compute Capability 7.0 or higher (e.g., RTX 20-series or newer). Check the Compute Capability tables on NVIDIA's CUDA GPUs page to verify support for your hardware.
You can follow along using the free GPU-tier of Google Colab if you do not have any supported hardware.
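If you are unsure which Compute Capability your GPU supports, you can also query it directly from Python. The snippet below is a quick check using PyTorch (installed in the previous tutorials); the device index `0` assumes a single-GPU machine:

```python
# Query the CUDA Compute Capability of the first visible GPU (requires PyTorch)
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: Compute Capability {major}.{minor}")
else:
    print("No CUDA-capable GPU detected.")
```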
Quantization Process
Quantizing our model involves converting the original 32-bit floating point values to 8-bit integers. float32 precision allows for a significantly greater range of possible values versus int8. To find the best way to map the float32 values to int8, we must compute the range of float32 values in the model.
The float32 values for the model weights are static, while the activation values depend on the input fed to the model. We can calculate a suitable range of activation values by feeding sample inputs through the model and recording the activations. TensorRT can then use this information when quantizing the model. We will use a subset of images from the original training dataset to generate this calibration data.
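To make the mapping concrete, here is a minimal sketch of the arithmetic behind one common symmetric min-max scheme. The values and variable names are purely illustrative and not part of any library:

```python
import numpy as np

# Hypothetical activation range recorded during calibration
float_min, float_max = -0.8, 3.2

# Symmetric mapping: choose a scale so the largest magnitude maps to 127
scale = max(abs(float_min), abs(float_max)) / 127

# Quantize a float32 value to int8, then dequantize to see the rounding error
value_fp32 = 1.57
value_int8 = np.int8(np.clip(np.round(value_fp32 / scale), -127, 127))
value_dequantized = float(value_int8) * scale

print(value_int8, value_dequantized)  # roughly 62 and 1.56
```

Values outside the calibrated range get clipped, which is why the calibration samples should be representative of the data the model will see in practice.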
Getting Started with the Code
As with the previous tutorial, the code is available as a Jupyter Notebook.
Jupyter Notebook | Google Colab |
---|---|
GitHub Repository | Open In Colab |
Setting Up Your Python Environment
First, we must add a few new libraries to our Python environment.
Install CUDA Package
Both ONNX Runtime and TensorRT require CUDA for use with NVIDIA GPUs and support CUDA 12.x.
We can view the available CUDA package versions using the following command:
conda search cuda -c nvidia/label/cuda-*
Loading channels: done
# Name Version Build Channel
cuda 12.0.0 h7428d3b_0 conda-forge
cuda 12.0.0 h7428d3b_1 conda-forge
cuda 12.0.0 ha770c72_0 conda-forge
cuda 12.0.0 ha804496_0 conda-forge
cuda 12.0.0 ha804496_1 conda-forge
cuda 12.1.1 h7428d3b_0 conda-forge
cuda 12.1.1 ha804496_0 conda-forge
cuda 12.2.2 h7428d3b_0 conda-forge
cuda 12.2.2 ha804496_0 conda-forge
cuda 12.3.2 h7428d3b_0 conda-forge
cuda 12.3.2 ha804496_0 conda-forge
cuda 12.4.0 h7428d3b_0 conda-forge
cuda 12.4.0 ha804496_0 conda-forge
cuda 12.4.1 h7428d3b_0 conda-forge
cuda 12.4.1 ha804496_0 conda-forge
cuda 12.5.0 h7428d3b_0 conda-forge
cuda 12.5.0 ha804496_0 conda-forge
cuda 12.5.1 h7428d3b_0 conda-forge
cuda 12.5.1 ha804496_0 conda-forge
cuda 12.6.0 h7428d3b_0 conda-forge
cuda 12.6.0 ha804496_0 conda-forge
cuda 12.6.1 h7428d3b_0 conda-forge
cuda 12.6.1 ha804496_0 conda-forge
cuda 12.6.2 h7428d3b_0 conda-forge
cuda 12.6.2 ha804496_0 conda-forge
Run one of the following commands (depending on whether you use Conda or Mamba) to install CUDA in our Python environment.
conda install cuda -c nvidia/label/cuda-12.4.0 -y
mamba install cuda -c nvidia/label/cuda-12.4.0 -y
Install ONNX Runtime and TensorRT
The only additional libraries we need are ONNX Runtime with GPU support and TensorRT, assuming the packages used in the previous two tutorials are already in the Python environment. At the time of writing, ONNX Runtime supports TensorRT 10.x.
Run the following commands to install the libraries:
# Install TensorRT packages
pip install -U tensorrt
# Install ONNX Runtime for CUDA 12
pip install -U 'onnxruntime-gpu==1.20.0'
With our environment updated, we can dive into the code.
Importing the Required Dependencies
First, we will import the necessary Python dependencies into our Jupyter Notebook. The ONNX Runtime package does not know where to look for the cuDNN libraries included with the cuda
package, so we load those first using the following approach adapted from the tensorrt package.
# Load cuDNN libraries
import ctypes
import glob
import os
import sys
from nvidia import cudnn
def try_load(library):
    try:
        ctypes.CDLL(library, mode=ctypes.RTLD_GLOBAL)  # Use RTLD_GLOBAL to make symbols available
    except OSError:
        pass

def try_load_libs_from_dir(path):
    # Load all .so files (Linux)
    for lib in glob.iglob(os.path.join(path, "*.so*")):
        try_load(lib)
    # Load all .dll files (Windows)
    for lib in glob.iglob(os.path.join(path, "*.dll*")):
        try_load(lib)

# Get the cudnn library path
CUDNN_LIB_DIR = os.path.join(cudnn.__path__[0], "lib")

# Try loading all libraries in the cudnn lib directory
try_load_libs_from_dir(CUDNN_LIB_DIR)
# Import Python Standard Library dependencies
import json
import os
from pathlib import Path
import random
# Import utility functions
from cjm_psl_utils.core import download_file, file_extract
from cjm_pil_utils.core import resize_img, get_img_files
# Import numpy
import numpy as np
# Import the pandas package
import pandas as pd
# Do not truncate the contents of cells and display all rows and columns
pd.set_option('max_colwidth', None, 'display.max_rows', None, 'display.max_columns', None)
# Import PIL for image manipulation
from PIL import Image
# Import ONNX dependencies
import onnxruntime as ort # Import the ONNX Runtime
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference
from onnxruntime.quantization import CalibrationDataReader, CalibrationMethod, create_calibrator, write_calibration_table
# Import tensorrt_libs
import tensorrt_libs
Make sure to import the tensorrt_libs
module that is part of the tensorrt
pip package. Otherwise, you will need to update the LD_LIBRARY_PATH
environment variable with the path to the TensorRT library files.
Setting Up the Project
Next, we will set the folder locations for our project, the calibration dataset, and the directory with the ONNX model and JSON class labels file.
Setting the Directory Paths
Readers following the tutorial on their local machine should select locations with read and write access to store the archived and extracted dataset. For a cloud service like Google Colab, you can set it to the current directory.
# The name for the project
project_name = f"pytorch-timm-image-classifier"

# The path for the project folder
project_dir = Path(f"./{project_name}/")

# Create the project directory if it does not already exist
project_dir.mkdir(parents=True, exist_ok=True)

# Define path to store datasets
dataset_dir = Path("/mnt/Storage/Datasets/")
# Create the dataset directory if it does not exist
dataset_dir.mkdir(parents=True, exist_ok=True)

# Define path to store archive files
archive_dir = dataset_dir/'../Archive'
# Create the archive directory if it does not exist
archive_dir.mkdir(parents=True, exist_ok=True)

# The path to the checkpoint folder
checkpoint_dir = Path(project_dir/f"2024-02-02_15-41-23")

pd.Series({"Project Directory:": project_dir,
           "Dataset Directory:": dataset_dir,
           "Archive Directory:": archive_dir,
           "Checkpoint Directory:": checkpoint_dir
          }).to_frame().style.hide(axis='columns')
Project Directory: | pytorch-timm-image-classifier |
---|---|
Dataset Directory: | /mnt/Storage/Datasets |
Archive Directory: | /mnt/Storage/Datasets/../Archive |
Checkpoint Directory: | pytorch-timm-image-classifier/2024-02-02_15-41-23 |
Loading the Checkpoint Data
Now, we can load the class labels and set the path for the ONNX model.
Load the Class Labels
# The class labels path
class_labels_path = list(checkpoint_dir.glob('*classes.json'))[0]

# Load the JSON class labels data
with open(class_labels_path, 'r') as file:
    class_labels_json = json.load(file)

# Get the list of classes
class_names = class_labels_json['classes']

# Print the list of classes
pd.DataFrame(class_names)
0 | |
---|---|
0 | call |
1 | dislike |
2 | fist |
3 | four |
4 | like |
5 | mute |
6 | no_gesture |
7 | ok |
8 | one |
9 | palm |
10 | peace |
11 | peace_inverted |
12 | rock |
13 | stop |
14 | stop_inverted |
15 | three |
16 | three2 |
17 | two_up |
18 | two_up_inverted |
Set Model Checkpoint Information
# The onnx model path
onnx_file_path = list(checkpoint_dir.glob('*.onnx'))[0]
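Since both the calibration and inference steps need the model's input name, it can be worth confirming the input name and shape up front. One quick way is a temporary CPU-only session (the CPU execution provider ships with the onnxruntime-gpu package):

```python
# Inspect the ONNX model's input name and shape with a temporary CPU session
temp_session = ort.InferenceSession(str(onnx_file_path), providers=['CPUExecutionProvider'])
model_input = temp_session.get_inputs()[0]
print(f"Input name: {model_input.name}, shape: {model_input.shape}")
del temp_session
```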
Loading the Dataset
Now that we have set up the project, we can download our dataset and select a subset to use for calibration.
Setting the Dataset Path
We first need to construct the name for the Hugging Face Hub dataset and define where to download and extract the dataset.
# Set the name of the dataset
dataset_name = 'hagrid-classification-512p-no-gesture-150k-zip'

# Construct the HuggingFace Hub dataset name by combining the username and dataset name
hf_dataset = f'cj-mills/{dataset_name}'

# Create the path to the zip file that contains the dataset
archive_path = Path(f'{archive_dir}/{dataset_name.removesuffix("-zip")}.zip')

# Create the path to the directory where the dataset will be extracted
dataset_path = Path(f'{dataset_dir}/{dataset_name.removesuffix("-zip")}')

# Create a Series with the dataset name and paths and convert it to a DataFrame for display
pd.Series({"HuggingFace Dataset:": hf_dataset,
           "Archive Path:": archive_path,
           "Dataset Path:": dataset_path
          }).to_frame().style.hide(axis='columns')
HuggingFace Dataset: | cj-mills/hagrid-classification-512p-no-gesture-150k-zip |
---|---|
Archive Path: | /mnt/Storage/Datasets/../Archive/hagrid-classification-512p-no-gesture-150k.zip |
Dataset Path: | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k |
Downloading the Dataset
We can now download the dataset archive file and extract the dataset. We can delete the archive afterward to save space.
# Construct the HuggingFace Hub dataset URL
dataset_url = f"https://huggingface.co/datasets/{hf_dataset}/resolve/main/{dataset_name.removesuffix('-zip')}.zip"
print(f"HuggingFace Dataset URL: {dataset_url}")

# Set whether to delete the archive file after extracting the dataset
delete_archive = True

# Download the dataset if not present
if dataset_path.is_dir():
    print("Dataset folder already exists")
else:
    print("Downloading dataset...")
    download_file(dataset_url, archive_dir)

    print("Extracting dataset...")
    file_extract(fname=archive_path, dest=dataset_dir)

    # Delete the archive if specified
    if delete_archive: archive_path.unlink()
Get Image File Paths
Once downloaded, we can get the paths to the images in the dataset.
# Get a list of all JPEG image files in the dataset
img_file_paths = list(dataset_path.glob("./**/*.jpeg"))

# Print the number of image files
print(f"Number of Images: {len(img_file_paths)}")

# Display the first five entries from the list using a Pandas DataFrame
pd.DataFrame(img_file_paths).head()
Number of Images: 153735
0 | |
---|---|
0 | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k/call/3ffbf0a0-1837-42cd-8f13-33977a2b47aa.jpeg |
1 | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k/call/7f4d415e-f570-42c3-aa5a-7c907d2d461e.jpeg |
2 | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k/call/0003d6d1-3489-4f57-ab7a-44744dba93fd.jpeg |
3 | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k/call/00084dfa-60a2-4c8e-9bd9-25658382b8b7.jpeg |
4 | /mnt/Storage/Datasets/hagrid-classification-512p-no-gesture-150k/call/0010543c-be59-49e7-8f6d-fbea8f5fdc6b.jpeg |
Select Sample Images
Using every image in the dataset for the calibration process would be unnecessary and time-consuming, so we’ll select a random subset.
random.seed(1234)  # Set random seed for consistency
sample_percentage = 0.05
random.shuffle(img_file_paths)
sample_img_paths = random.sample(img_file_paths, int(len(img_file_paths)*sample_percentage))
Try to have at least 200
samples for the calibration set if adapting this tutorial to another dataset.
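As a quick sanity check, we can confirm how many images ended up in the calibration subset:

```python
# Verify the size of the calibration subset
print(f"Number of calibration samples: {len(sample_img_paths)}")
```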
Collecting Calibration Data
With the dataset samples selected, we can feed them through the model and collect the calibration data.
Implement a CalibrationDataReader
First, we will implement a CalibrationDataReader
class to load and prepare samples to feed through the model.
class CalibrationDataReaderCV(CalibrationDataReader):
    """
    A subclass of CalibrationDataReader specifically designed for handling
    image data for calibration in computer vision tasks. This reader loads,
    preprocesses, and provides images for model calibration.
    """

    def __init__(self, img_file_paths, target_sz, input_name='input'):
        """
        Initializes a new instance of the CalibrationDataReaderCV class.

        Args:
            img_file_paths (list): A list of image file paths.
            target_sz (tuple): The target size (width, height) to resize images to.
            input_name (str, optional): The name of the input node in the ONNX model. Default is 'input'.
        """
        super().__init__()  # Initialize the base class

        # Initialization of instance variables
        self._img_file_paths = img_file_paths
        self.input_name = input_name
        self.enum = iter(img_file_paths)  # Create an iterator over the image paths
        self.target_sz = target_sz

    def get_next(self):
        """
        Retrieves, processes, and returns the next image in the sequence as a NumPy array suitable for model input.

        Returns:
            dict: A dictionary with a single key-value pair where the key is `input_name` and the value is the
                  preprocessed image as a NumPy array, or None if there are no more images.
        """
        img_path = next(self.enum, None)  # Get the next image path
        if not img_path:
            return None  # If there are no more paths, return None

        # Load the image from the filepath and convert to RGB
        image = Image.open(img_path).convert('RGB')

        # Resize the image to the target size
        input_img = resize_img(image, target_sz=self.target_sz, divisor=1)

        # Convert the image to a NumPy array, scale to [0, 1], and add a batch dimension
        input_tensor_np = np.array(input_img, dtype=np.float32).transpose((2, 0, 1))[None] / 255

        # Return the image in a dictionary under the specified input name
        return {self.input_name: input_tensor_np}
This CalibrationDataReader
class does not normalize the input as our ONNX model performs that step internally. Be sure to include any required input normalization if adapting this tutorial to another model that does not include it internally.
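For reference, if you adapt this to a model that expects externally normalized input, the preprocessing might look something like the sketch below. It uses the common ImageNet mean and standard deviation values as an example; substitute whatever statistics your model was trained with:

```python
import numpy as np
from PIL import Image

# Example ImageNet statistics (adjust to match your model)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def preprocess_with_normalization(image: Image.Image) -> np.ndarray:
    """Convert a PIL image to a normalized NCHW float32 array."""
    array = np.array(image, dtype=np.float32).transpose((2, 0, 1)) / 255  # Scale to [0, 1]
    array = (array - IMAGENET_MEAN) / IMAGENET_STD                        # Normalize each channel
    return array[None]                                                    # Add a batch dimension
```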
Specify a Cache Folder
Next, we will create a folder to store the collected calibration data and any cache files generated by TensorRT.
trt_cache_dir = checkpoint_dir/'trt_engine_cache'
trt_cache_dir.mkdir(parents=True, exist_ok=True)
trt_cache_dir
PosixPath('pytorch-timm-image-classifier/2024-02-02_15-41-23/trt_engine_cache')
Collect Calibration Data
Now, we can create a calibrator object and an instance of our custom CalibrationDataReader
object to collect the activation values and compute the range of values. The calibrator object creates a temporary ONNX model for the calibration process that we can delete afterward.
After feeding the data samples through the model, we will save the generated calibration file for TensorRT to use later.
%%time

target_sz = 288

# Save path for temporary ONNX model used during calibration process
augmented_model_path = onnx_file_path.parent/f"{onnx_file_path.stem}-augmented.onnx"

try:
    # Create a calibrator object for the ONNX model.
    calibrator = create_calibrator(
        model=onnx_file_path,
        op_types_to_calibrate=None,
        augmented_model_path=augmented_model_path,
        calibrate_method=CalibrationMethod.MinMax
    )

    # Set the execution providers for the calibrator.
    calibrator.set_execution_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])

    # Initialize the custom CalibrationDataReader object
    calibration_data_reader = CalibrationDataReaderCV(img_file_paths=sample_img_paths,
                                                      target_sz=target_sz,
                                                      input_name=calibrator.model.graph.input[0].name)

    # Collect calibration data using the specified data reader.
    calibrator.collect_data(data_reader=calibration_data_reader)

    # Write the computed calibration table to the specified directory.
    write_calibration_table(calibrator.compute_data().data, dir=str(trt_cache_dir))

except Exception as e:
    # Catch any exceptions that occur during the calibration process.
    print("An error occurred:", e)

finally:
    # Remove temporary ONNX file created during the calibration process
    if augmented_model_path.exists():
        augmented_model_path.unlink()
CPU times: user 51.7 s, sys: 679 ms, total: 52.3 s
Wall time: 54.1 s
Inspect TensorRT Cache Folder
Looking in the cache folder, we should see three new files.
# Print the content of the cache folder as a Pandas DataFrame
pd.DataFrame([path.name for path in trt_cache_dir.iterdir()])
0 | |
---|---|
0 | calibration.cache |
1 | calibration.flatbuffers |
2 | calibration.json |
That takes care of the calibration process. In the next section, we will create an ONNX Runtime inference session and perform inference with TensorRT.
Performing Inference with TensorRT
To have TensorRT quantize the model for int8 inference, we need to specify the path to the cache folder and the calibration table file name and enable int8 precision when initializing the inference session.
Create an Inference Session
ort.get_available_providers()
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
providers = [
    ('TensorrtExecutionProvider', {
        'device_id': 0, # The device ID
        'trt_max_workspace_size': 4e9, # Maximum workspace size for TensorRT engine (1e9 ≈ 1GB)
        'trt_engine_cache_enable': True, # Enable TensorRT engine caching
        'trt_engine_cache_path': str(trt_cache_dir), # Path for TensorRT engine, profile files, and int8 calibration table
        'trt_int8_enable': True, # Enable int8 mode in TensorRT
        'trt_int8_calibration_table_name': 'calibration.flatbuffers', # int8 calibration table file for non-QDQ models in int8 mode
    })
]

sess_opt = ort.SessionOptions()

# Load the model and create an InferenceSession
session = ort.InferenceSession(onnx_file_path, sess_options=sess_opt, providers=providers)
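ONNX Runtime falls back to the next provider in the list if one fails to initialize, so it is worth confirming that the session actually picked up the TensorRT execution provider:

```python
# Confirm which execution providers the session ended up using
print(session.get_providers())
```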
Select a Test Image
We can use the same test image and input size from the previous tutorial.
test_img_name = 'pexels-elina-volkova-16191659.jpg'
test_img_url = f"https://huggingface.co/datasets/cj-mills/pexel-hand-gesture-test-images/resolve/main/{test_img_name}"

download_file(test_img_url, './', False)

test_img = Image.open(test_img_name)

display(test_img)

pd.Series({"Test Image Size:": test_img.size
          }).to_frame().style.hide(axis='columns')
Test Image Size: | (637, 960) |
---|
Prepare the Test Image
# Set the input image size
test_sz = 288

# Resize image without cropping
input_img = resize_img(test_img, target_sz=test_sz)

display(input_img)

pd.Series({"Input Image Size:": input_img.size
          }).to_frame().style.hide(axis='columns')
Input Image Size: | (288, 416) |
---|
Prepare the Input Tensor
# Convert the existing input image to NumPy format
input_tensor_np = np.array(input_img, dtype=np.float32).transpose((2, 0, 1))[None]/255
Build TensorRT Engine
TensorRT will build an optimized and quantized representation of our model called an engine when we first pass input to the inference session. It will save a copy of this engine object to the cache folder we specified earlier. The build process can take a bit, so caching the engine will save time for future use.
%%time
# Perform a single inference run to build the TensorRT engine for the current input dimensions
None, {"input": input_tensor_np}); session.run(
2024-11-11 18:00:16.983451530 [W:onnxruntime:Default, tensorrt_execution_provider.h:90 log] [2024-11-12 02:00:16 WARNING] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
2024-11-11 18:00:16.983548953 [W:onnxruntime:Default, tensorrt_execution_provider.h:90 log] [2024-11-12 02:00:16 WARNING] Missing scale and zero-point for tensor /model/fc/Gemm_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
2024-11-11 18:00:16.983552890 [W:onnxruntime:Default, tensorrt_execution_provider.h:90 log] [2024-11-12 02:00:16 WARNING] Missing scale and zero-point for tensor ONNXTRT_Broadcast_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
2024-11-11 18:00:16.983555175 [W:onnxruntime:Default, tensorrt_execution_provider.h:90 log] [2024-11-12 02:00:16 WARNING] Missing scale and zero-point for tensor /softmax/Softmax_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
CPU times: user 13.8 s, sys: 1.85 s, total: 15.6 s
Wall time: 25.4 s
TensorRT needs to build separate engine files for different input dimensions.
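If you want to see this behavior for yourself, you can optionally feed the model an input with different dimensions and watch a second (equally slow) engine build kick off. The sketch below reuses the test image at a smaller size; running it will add another engine and profile file pair to the cache folder beyond the ones listed in the next step:

```python
# Optional: a different input size triggers a separate TensorRT engine build
alt_img = resize_img(test_img, target_sz=256)
alt_tensor_np = np.array(alt_img, dtype=np.float32).transpose((2, 0, 1))[None]/255
session.run(None, {"input": alt_tensor_np});
```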
Inspect TensorRT Cache Folder
If we look in the cache folder again, we can see a new .engine
file and a new .profile
file.
# Print the content of the cache folder as a Pandas DataFrame
pd.DataFrame([path.name for path in trt_cache_dir.iterdir()])
0 | |
---|---|
0 | calibration.cache |
1 | calibration.flatbuffers |
2 | calibration.json |
3 | TensorrtExecutionProvider_TRTKernel_graph_main_graph_2087346457130887064_0_0_int8_sm89.engine |
4 | TensorrtExecutionProvider_TRTKernel_graph_main_graph_2087346457130887064_0_0_int8_sm89.profile |
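As a side note, we can compare the on-disk size of the serialized int8 engine against the original float32 ONNX model to get a rough sense of the storage savings. A quick sketch, assuming a single .engine file in the cache folder:

```python
# Compare the on-disk size of the original ONNX model and the int8 TensorRT engine
engine_path = list(trt_cache_dir.glob('*.engine'))[0]
print(f"ONNX model:      {onnx_file_path.stat().st_size / 1e6:.2f} MB")
print(f"TensorRT engine: {engine_path.stat().st_size / 1e6:.2f} MB")
```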
Benchmark Quantized Model
With the TensorRT engine built, we can benchmark our quantized model to gauge the raw inference speeds.
%%timeit
None, {"input": input_tensor_np}) session.run(
361 μs ± 3.09 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In my testing for this model, TensorRT int8 inference tends to be about 3x faster than the CUDA execution provider with the original float32 model.
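If you want to reproduce that comparison on your own hardware, one rough approach is to create a second session that skips TensorRT and time it against the same input. A sketch, assuming the CUDA execution provider is available:

```python
# Benchmark the original float32 model with the CUDA execution provider for comparison
cuda_session = ort.InferenceSession(onnx_file_path,
                                    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
cuda_session.run(None, {"input": input_tensor_np})  # Warm-up run
%timeit cuda_session.run(None, {"input": input_tensor_np})
```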
Of course, it does not matter how much faster the quantized model is if there is a significant drop in accuracy, so let’s verify the prediction results.
Compute the Predictions
# Run inference
outputs = session.run(None, {"input": input_tensor_np})[0]

# Get the highest confidence score
confidence_score = outputs.max()

# Get the class index with the highest confidence score and convert it to the class name
pred_class = class_names[outputs.argmax()]

# Display the image
display(test_img)

# Store the prediction data in a Pandas Series for easy formatting
pd.Series({"Input Size:": input_img.size,
           "Predicted Class:": pred_class,
           "Confidence Score:": f"{confidence_score*100:.2f}%"
          }).to_frame().style.hide(axis='columns')
Input Size: | (288, 416) |
---|---|
Predicted Class: | mute |
Confidence Score: | 100.00% |
The probability scores will likely differ slightly from the full-precision ONNX model, but the predicted class should be the same.
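If you want more assurance than a single test image, one option is to spot-check the quantized session against a float32 CPU session on a random handful of dataset images and count how often the predicted classes agree. A minimal sketch, reusing the variables defined earlier; the fixed square resize keeps every input the same shape so TensorRT only builds one additional engine:

```python
# Spot-check prediction agreement between the int8 TensorRT session and a float32 CPU session
cpu_session = ort.InferenceSession(onnx_file_path, providers=["CPUExecutionProvider"])

check_img_paths = random.sample(img_file_paths, 25)
matches = 0
for img_path in check_img_paths:
    img = Image.open(img_path).convert('RGB').resize((test_sz, test_sz))
    tensor_np = np.array(img, dtype=np.float32).transpose((2, 0, 1))[None]/255
    int8_pred = session.run(None, {"input": tensor_np})[0].argmax()
    fp32_pred = cpu_session.run(None, {"input": tensor_np})[0].argmax()
    matches += int(int8_pred == fp32_pred)

print(f"Prediction agreement: {matches}/{len(check_img_paths)}")
```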
Don’t forget to download the content of the trt_engine_cache
folder from the Colab Environment’s file browser. (tutorial link)
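On Colab, it can be easier to zip the cache folder first and download a single file. A small sketch using the standard library:

```python
# Zip the TensorRT cache folder to make it easier to download from Colab
import shutil
shutil.make_archive('trt_engine_cache', 'zip', root_dir=trt_cache_dir)
```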
Conclusion
Congratulations on reaching the end of this tutorial. We previously trained an image classification model in PyTorch for hand gesture recognition, and now we’ve quantized that model for optimized inference on NVIDIA hardware. Our model is now smaller, faster, and better suited for real-time applications and edge devices like the Jetson Orin Nano.
- Feel free to post questions or problems related to this tutorial in the comments below. I try to make time to address them on Thursdays and Fridays.
I’m Christian Mills, a deep learning consultant specializing in practical AI implementations. I help clients leverage cutting-edge AI technologies to solve real-world problems.
Interested in working together? Fill out my Quick AI Project Assessment form or learn more about me.