import transformers
import datasets
import accelerate

# Only print error messages
transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

transformers.__version__, datasets.__version__, accelerate.__version__

    ('4.18.0', '2.0.0', '0.5.1')


import ast
import astor
import inspect
import textwrap
def print_source(obj, exclude_doc=True):
    # Get source code
    source = inspect.getsource(obj)
    # Remove any common leading whitespace from every line
    cleaned_source = textwrap.dedent(source)
    # Parse the source into an AST node
    parsed = ast.parse(cleaned_source)

    for node in ast.walk(parsed):
        # Skip any nodes that are not class or function definitions
        if not isinstance(node, (ast.FunctionDef, ast.ClassDef, ast.AsyncFunctionDef)):
            continue
        # Drop the docstring, which (if present) is a bare string constant
        # appearing as the first statement in the body
        if (exclude_doc and len(node.body) > 1
                and isinstance(node.body[0], ast.Expr)
                and isinstance(node.body[0].value, ast.Constant)
                and isinstance(node.body[0].value.value, str)):
            node.body = node.body[1:]

    print(astor.to_source(parsed))
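The same docstring-stripping idea can be sketched with the standard library alone, using `ast.unparse` (Python 3.9+) in place of `astor`. The helper name `strip_docstrings` is illustrative, not part of the book's code:

```python
import ast


def strip_docstrings(source: str) -> str:
    """Return source code with function/class docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        first = node.body[0]
        # A docstring is a bare string constant as the first statement
        if (len(node.body) > 1
                and isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            node.body = node.body[1:]
    return ast.unparse(tree)


source = '''
def greet(name):
    """Say hello to name."""
    return f"Hello, {name}!"
'''
print(strip_docstrings(source))
```

`ast.unparse` regenerates source from the modified tree, so the printed function keeps its logic but loses its docstring.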


## Making Transformers Efficient in Production

• A state-of-the-art model is not very useful if it is too slow or too large to meet an application’s business requirements.
• However, simply starting with a faster, more compact model often comes at the cost of degraded performance.
• Knowledge distillation, quantization, pruning, and graph optimization are complementary techniques that can speed up predictions and reduce the memory footprint of models.
• We can combine some of these techniques to produce significant performance gains.
• Roblox: How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUs
• Roblox improved the latency and throughput of their BERT classifier by over 30x by combining knowledge distillation and quantization.

## Project: Optimize an Intent Detection Model

• The goal is to create a text-based assistant for a call center so customers can request their account balance and make bookings.
• The assistant must be able to classify a wide variety of natural language text into a set of predefined intents.
• The classifier must also handle out-of-scope queries, yielding a fallback response when a query does not belong to any of the predefined intents.

### The Model

• The baseline model is a fine-tuned BERT-base model that achieves 94% accuracy on the CLINC150 dataset.
• Hugging Face Dataset Card

### CLINC150 Dataset

• Homepage
• Hugging Face Dataset Card
• The CLINC150 dataset includes 22,500 in-scope queries across 150 intents and ten domains.
• The dataset contains 1,200 out-of-scope queries that belong to an oos intent class.

from transformers import pipeline


Instantiate a text classification pipeline with the baseline model

bert_ckpt = "transformersbook/bert-base-uncased-finetuned-clinc"
pipe = pipeline("text-classification", model=bert_ckpt)


Classify a sample query

query = """Hey, I'd like to rent a vehicle from Nov 1st to Nov 15th in
Paris and I need a 15 passenger van"""
pipe(query)

    [{'label': 'car_rental', 'score': 0.549003541469574}]


Note: The model correctly detects that the user wants to rent a vehicle.

## Creating a Performance Benchmark

• Deploying transformers in production involves a tradeoff between several constraints.
• Business and product metrics are the most important to consider.
• Model performance refers to how the model performs on a well-crafted test set representing production data.
• Model performance is especially crucial when the cost of making errors is high, or when performing inference on millions of examples, where even minor model improvements translate into significant gains.
• Latency refers to how fast the model delivers predictions. Latency is most important for real-time environments with lots of traffic.
• Memory constraints play an important role in mobile and edge devices where we need to perform inference without access to a cloud server.
• Failing to address these constraints can negatively impact the user experience.
• Running expensive cloud servers that may only need to handle a few requests can lead to ballooning costs.
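As a first taste of measuring memory footprint, here is a stdlib-only sketch of the pattern: serialize an object to disk and measure the resulting file. The helper name `serialized_size_bytes` is hypothetical; for a real transformer you would save the model's `state_dict` with `torch.save` rather than pickling a toy list:

```python
import os
import pickle
import tempfile


def serialized_size_bytes(obj) -> int:
    """Serialize obj to a temporary file and return the file size in bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        pickle.dump(obj, f)
        path = f.name
    size = os.path.getsize(path)
    os.remove(path)
    return size


# Toy stand-in for model weights: 10,000 distinct floats
weights = [float(i) for i in range(10_000)]
print(f"Size: {serialized_size_bytes(weights) / 1024:.1f} KB")
```

The same measure-on-disk trick is what makes compression techniques like quantization directly comparable: smaller serialized checkpoints mean a smaller deployment footprint.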

Define a benchmark that measures model performance, latency, and memory usage

class PerformanceBenchmark:
    def __init__(self, pipeline, dataset, optim_type="BERT baseline"):
        self.pipeline = pipeline
        self.dataset = dataset
        self.optim_type = optim_type

    def compute_accuracy(self):
        # We'll define this later
        pass

    def compute_size(self):
        # We'll define this later
        pass

    def time_pipeline(self):
        # We'll define this later
        pass

    def run_benchmark(self):
        metrics = {}
        metrics[self.optim_type] = self.compute_size()
        metrics[self.optim_type].update(self.time_pipeline())
        metrics[self.optim_type].update(self.compute_accuracy())
        return metrics
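The stub `time_pipeline` will eventually measure latency. A minimal stdlib sketch of the usual pattern (warm up first so one-time setup costs are excluded, then average repeated timed calls on a fixed query) might look like this; the helper name `time_callable` and its parameters are illustrative, not the book's implementation:

```python
import time
from statistics import mean, stdev


def time_callable(fn, query, warmup=3, runs=20):
    """Warm up fn, then return average and std latency in milliseconds."""
    for _ in range(warmup):
        fn(query)  # warmup runs are not timed
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(query)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {"time_avg_ms": mean(latencies_ms), "time_std_ms": stdev(latencies_ms)}


# Dummy "pipeline" standing in for pipe(query)
metrics = time_callable(lambda q: q.lower(), "Hey, I'd like to rent a vehicle")
print(metrics)
```

`time.perf_counter` is the right clock here: it is monotonic and has the highest available resolution for measuring short intervals.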


from datasets import load_dataset


clinc = load_dataset("clinc_oos", "plus")

clinc

    DatasetDict({
        train: Dataset({
            features: ['text', 'intent'],
            num_rows: 15250
        })
        validation: Dataset({
            features: ['text', 'intent'],
            num_rows: 3100
        })
        test: Dataset({
            features: ['text', 'intent'],
            num_rows: 5500
        })
    })


Note:

• The plus configuration refers to the subset that contains the out-of-scope training examples.
• Each example consists of a query in the text column and its corresponding intent.

View an example

sample = clinc["test"][42]
sample