Conference Talk 4: Inspect - An OSS Framework for LLM Evals
- Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
- Inspect AI: A Framework for Evaluating LLMs
- Hello World Example
- Core Concepts
- Honeycomb Dataset Example
- Composition
- Tool Use
- Agents and Tools
- Logging
- Models
- Workflow
- Q&A Session
- GitHub Repository: inspect-llm-workshop
- Slides: Intro to Inspect
Inspect AI: A Framework for Evaluating LLMs
- Inspect AI is a Python package for creating LLM evaluations developed through a collaboration between J.J. Allaire and the UK AI Safety Institute.
- Designed to address the limitations of existing evaluation tools for developing more complex evals.
- Focuses on providing a great experience for developing evals that can be reproducibly run at scale.
- Github Repository: UKGovernmentBEIS/inspect_ai
- Documentation: https://ukgovernmentbeis.github.io/inspect_ai
- Installation:
pip install inspect-ai
- VS Code Extension:
- Marketplace: Inspect AI
- Source Code: inspect-vscode
Hello World Example
- Documentation: Hello, Inspect
```python
from inspect_ai import Task, eval, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import (
    chain_of_thought, generate, self_critique
)

@task
def theory_of_mind():
    return Task(  # (1)
        dataset=example_dataset("theory_of_mind"),
        plan=[  # (2)
            chain_of_thought(),
            generate(),
            self_critique()
        ],
        scorer=model_graded_fact()  # (3)
    )
```
1. The `Task` object brings together the dataset, solvers, and scorer, and is then evaluated using a model.
2. In this example we are chaining together three standard solver components. It’s also possible to create a more complex custom solver that manages state and interactions internally.
3. Since the output is likely to have pretty involved language, we use a model for scoring.
Core Concepts
- Dataset: List of inputs for the LLM
- Usually includes targets (e.g., correct answers)
- Solvers: Pipelines that execute the evaluation
- Includes functions that transform the dataset inputs, call the model, and act on the model output
- Can include things like prompt engineering and multi-turn dialogue
- Can be composed together as layers, or can be a single layer with higher internal complexity
- Scorers: Evaluate the output of solvers
- Ranges from simple text comparisons to model-graded assessments using custom rubrics.
Honeycomb Dataset Example
- Jupyter Notebook: honeycomb/queries.ipynb
- Dataset: queries.csv
- ~2,300 example queries (along with per-query column schemas generated offline via RAG)
- Scoring Methods:
- validate: score using the validity checker: utils.py
- critique: score using the critique prompt: critique.txt
- Process:
- Load the dataset.
- Define a pipeline that includes prompt engineering, model calling, and evaluation.
- Apply a scoring function.
Dataset
- Documentation: Datasets
- Inspect uses a standard schema for Datasets
- We map the raw data into that schema when reading it
- “columns” are saved as metadata so we can use them for prompt engineering
```python
from inspect_ai.dataset import csv_dataset, FieldSpec

dataset = csv_dataset(
    csv_file="queries.csv",
    sample_fields=FieldSpec(input="user_input", metadata=["columns"]),
    shuffle=True
)
```
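If a dataset needs more involved mapping than `FieldSpec` provides, `sample_fields` can also be given a function that converts each raw record into a `Sample`. A minimal sketch, assuming the same `user_input` and `columns` fields as above:

```python
from inspect_ai.dataset import Sample, csv_dataset

def record_to_sample(record):
    # map a raw CSV row onto Inspect's standard Sample schema
    return Sample(
        input=record["user_input"],
        metadata={"columns": record["columns"]}
    )

dataset = csv_dataset("queries.csv", sample_fields=record_to_sample, shuffle=True)
```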
Solver
- Documentation: Solver
- Functions that manipulate the `TaskState` (message history and model output) to perform actions like:
  - Calling the model to generate text.
  - Engineering prompts.
  - Applying critique mechanisms.
- Built-in solvers:
  - `generate()`: Calls the model, appends the assistant message, and updates the model output.
  - `chain_of_thought()`: Basic chain-of-thought implementation.
  - `prompt_template()`: Modifies the existing prompt by passing it through a template.
  - `multiple_choice()`: Handles multiple-choice questions, including shuffling options, calling the model, and unshuffling to determine the chosen answer.
  - `self_critique()`: Performs self-critique by:
    - Running a critique model on the initial output.
    - Appending the critique to the message history.
    - Calling the model again to generate a revised answer.
Solver: prompt_with_schema()
- Simple prompt template that substitutes the user query and the RAG-generated column schema.
```python
from inspect_ai.solver import solver
from inspect_ai.util import resource

@solver
def prompt_with_schema():

    prompt_template = resource("prompt.txt")

    async def solve(state, generate):
        # build the prompt
        state.user_prompt.text = prompt_template.replace(
            "{{prompt}}", state.user_prompt.text
        ).replace(
            "{{columns}}", state.metadata["columns"]
        )
        return state

    return solve
```
Scorer
- Documentation: Scorer
- Evaluates whether solvers were successful in finding the right `output` for the `target` defined in the dataset, and in what measure
- Pluggable (i.e. scorers can be provided by other packages)
- Built-In Scorers:
- Pattern matching
- Template matching
- Model grading
- Human scoring
- Example: Math Benchmark with Expression Equivalence
- Source Code: benchmarks/mathematics.py
- Uses a custom scorer, `expression_equivalance()`, to evaluate mathematical expressions for logical equivalence, accounting for variations in formatting or simplification.
- Scorer: `expression_equivalance()`
  - Extracts the model’s answer using regular expressions.
  - Uses a few-shot prompt to have a model judge the equivalence of mathematical expressions.
- Code
```python
@scorer(metrics=[accuracy(), bootstrap_std()])
def expression_equivalance():
    async def score(state: TaskState, target: Target):
        # extract answer
        match = re.search(AnswerPattern.LINE, state.output.completion)
        if match:
            # ask the model to judge equivalance
            answer = match.group(1)
            prompt = EQUIVALANCE_TEMPLATE % (
                {"expression1": target.text, "expression2": answer}
            )
            result = await get_model().generate(prompt)

            # return the score
            correct = result.completion.lower() == "yes"
            return Score(
                value=CORRECT if correct else INCORRECT,
                answer=answer,
                explanation=state.output.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Answer not found in model output: "
                + f"{state.output.completion}",
            )

    return score
```
Scorer: validate_scorer()
- Extracts and cleans JSON output from the model.
- Calls the `is_valid()` function with the column schema to determine if a valid query was generated.
```python
from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT
from utils import is_valid, json_completion

@scorer(metrics=[accuracy()])
def validate_scorer():

    async def score(state, target):

        # check for valid query
        query = json_completion(state.output.completion)
        if is_valid(query, state.metadata["columns"]):
            value = CORRECT
        else:
            value = INCORRECT

        # return score w/ query that was extracted
        return Score(value=value, answer=query)

    return score
```
- The `json_completion()` function takes care of some details around extracting JSON from a model completion (e.g. removing surrounding backtick code blocks emitted by some models).
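As an illustration of the idea (a hypothetical sketch, not the actual implementation in `utils.py`), such a helper might look like this:

```python
import re

def json_completion(completion: str) -> str:
    # strip a surrounding Markdown code fence, if the model emitted one
    completion = completion.strip()
    match = re.search(r"```(?:json)?\s*(.+?)\s*```", completion, re.DOTALL)
    return match.group(1) if match else completion
```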
Validate Task
- `Task`:
  - The basis for defining and running evaluations
  - Parameterized with a dataset, a scorer, and metrics.
  - May optionally provide a default plan for execution.
Honeycomb Eval: validate()
- Combines the dataset, solver, and scorer defined above into a `Task`.
- Uses a predefined system message and prompt template.
```python
from inspect_ai import eval, task, Task
from inspect_ai.solver import system_message, generate

@task
def validate():
    return Task(
        dataset=dataset,
        plan=[
            system_message("Honeycomb AI suggests queries based on user input."),
            prompt_with_schema(),
            generate()
        ],
        scorer=validate_scorer()
    )
```
- Run the `Task` using Inspect’s `eval()` function (limiting to 100 samples):
```python
if __name__ == '__main__':
    eval(validate, model="openai/gpt-4-turbo", limit=100)
```
- The `__name__ == '__main__'` conditional indicates that we only want to run this cell in interactive contexts.
  - Allows us to also use the notebook as a module callable from `inspect eval`:

```bash
$ inspect eval queries.ipynb@validate
```
Eval View
- `inspect view` example:

```bash
$ inspect view
Inspect view running at http://localhost:7575/
```

- Provides an overview of evaluation results.
- Allows drilling down into individual samples to examine message history, inputs, and model outputs for debugging.
Critique Task
- Documentation: Models
Scorer: critique_scorer()
- Allows using different models (e.g., GPT-4 Turbo) to critique the generated queries.
- Builds a critique prompt using a predefined template and the model’s output to have the critique model indicate whether the generated query is “good” or “bad”.
- Returns a score based on the critique model’s assessment.
```python
import json
from inspect_ai.model import get_model

@scorer(metrics=[accuracy()])
def critique_scorer(model = "anthropic/claude-3-5-sonnet-20240620"):

    async def score(state, target):

        # build the critic prompt
        query = state.output.completion.strip()
        critic_prompt = resource("critique.txt").replace(
            "{{prompt}}", state.user_prompt.text
        ).replace(
            "{{columns}}", state.metadata["columns"]
        ).replace(
            "{{query}}", query
        )

        # run the critique
        result = await get_model(model).generate(critic_prompt)
        try:
            parsed = json.loads(json_completion(result.completion))
            value = CORRECT if parsed["outcome"] == "good" else INCORRECT
            explanation = parsed["critique"]
        except (json.JSONDecodeError, KeyError):
            value = INCORRECT
            explanation = f"JSON parsing error:\n{result.completion}"

        # return value and explanation (critique text)
        return Score(value=value, explanation=explanation)

    return score
```
Honeycomb Eval: critique()
- Utilizes the same dataset and plan as `validate()` but employs a critique model for scoring.
```python
@task
def critique():
    return Task(
        dataset=dataset,
        plan=[
            system_message("Honeycomb AI suggests queries based on user input."),
            prompt_with_schema(),
            generate()
        ],
        scorer=critique_scorer()
    )
```
- Run the task using `eval()` (limiting to 25 samples):

```python
if __name__ == '__main__':
    eval(critique, model="openai/gpt-4-turbo", limit=25)
```
Critique Eval View
- `inspect view`:
  - Displays critique-based evaluation results.
  - Provides insights into the critique model’s explanations for incorrect answers, aiding in prompt or model improvement.
Composition
- Inspect AI encourages composing evaluations by combining solvers and scorers from different sources.
- Custom solvers and scorers can be made available in a Python package to re-use across many evals.
Example: Jailbreaking with sheppard
- `sheppard`: An internal package that integrates jailbreak solvers to elicit responses from models they might otherwise refuse to provide.

| Solver | Description |
|-----|-----|
| `encode()` | Message obfuscation jailbreak |
| `pap_jailbreak()` | Persuasion Adversarial Prompt (PAP) |
| `payload_splitting()` | Payload splitting jailbreak |
| `cr_jailbreak()` | Content reinforcement |

- Example: Using sheppard to provide jailbreaks for a security eval:
```python
from inspect_ai import Task, eval, task
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

from sheppard import pap_jailbreak

@task
def security_guide():
    return Task(
        dataset=example_dataset("security_guide"),
        plan=[
            system_message("system.txt"),
            pap_jailbreak(),
            generate()
        ],
        scorer=model_graded_fact(model="openai/gpt-4"),
    )
```
Tool Use
- Documentation: Tools
- Python functions that can be made accessible to the model during evaluation.
- Allow for more complex interactions, like web search or database lookups.
```python
class TaskState:
    messages: list[ChatMessage]
    tools: list[ToolDef]
    tool_choice: ToolChoice
    output: ModelOutput
    ...
```
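A custom tool is a decorated function that returns an async `execute` callable, with a docstring describing its parameters to the model. The sketch below follows the pattern in the Inspect documentation (the module path and decorator options may differ across versions); the tool is then made available via `use_tools()` in a task plan, as in the web search example that follows:

```python
from inspect_ai.tool import tool  # assumption: decorator location varies by Inspect version

@tool
def add():
    async def execute(x: int, y: int):
        """Add two numbers.

        Args:
            x: First number to add.
            y: Second number to add.

        Returns:
            The sum of the two numbers.
        """
        return x + y

    return execute
```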
Example: Biology QA with Web Search
- Provides a web search tool to the model, enabling it to answer obscure biology questions.
```python
return Task(
    dataset=example_dataset("biology_qa"),
    plan=[
        use_tools(web_search()),
        generate()
    ],
    scorer=model_graded_qa(template=GRADER_TEMPLATE),
)
```
Agents and Tools
- Documentation: Agents
- Solvers that use tools to perform tasks.
- Can use bespoke agent logic inside a solver (swapping various tools in and out)
- Can integrate existing agent libraries like Langchain as solvers.
Agent: Capture the Flag
- Example of a custom agent designed for cybersecurity evaluations, where the model interacts with tools to solve capture-the-flag challenges within a Docker environment.
```python
Plan(
    steps=[
        init_challenge(),
        use_tools([
            command_exec(),
            create_file(),
            decompile(),
            disassemble(),
            check_flag(),
        ]),
        system_message("prompts/system.txt"),
        initial_user_message(),
        generate(),
        check_for_flag_or_continue()
    ],
    cleanup=exit_challenge()
)
```
Agent: LangChain
- Project Folder: inspect-llm-workshop/langchain
- Uses tavily, a search engine optimized for LLMs and RAG
- Demonstrates integrating a Langchain agent into Inspect AI using a higher-order function.
- Allows leveraging existing agent frameworks within Inspect AI’s evaluation pipeline.
```python
@solver
def wikipedia_search() -> Solver:
    tavily_api = TavilySearchAPIWrapper()
    tools = ([TavilySearchResults(api_wrapper=tavily_api)] +
             load_tools(["wikipedia"]))

    async def agent(llm: BaseChatModel, input: dict[str, Any]):
        tools_agent = create_openai_tools_agent(llm, tools, prompt)
        agent_executor = AgentExecutor.from_agent_and_tools(
            agent=tools_agent,
            tools=tools
        )
        result = await agent_executor.ainvoke(input)
        return result["output"]

    return langchain_solver(agent)
```
Logging
- Documentation: Eval Logs
- Logging is crucial for debugging, analysis, and reproducibility.
- Capture all context required to debug, analyse, and reproduce evaluations
- Inspect AI Logging:
- Provides a rich Python API and JSON representation of the evaluation process.
- Offers a log viewer for interactive exploration.
- Enables programmatic access to logs for analysis and comparison.
EvalLog
- Documentation: EvalLog
- Returned from `eval()`
- Provides programmatic interface to the contents of log files

| Field | Type | Description |
|-----|-----|-----|
| `status` | `str` | Status of evaluation |
| `eval` | `EvalSpec` | Top level eval details including task, model, creation time, etc. |
| `plan` | `EvalPlan` | List of solvers and model generation config used for the eval. |
| `samples` | `list[EvalSample]` | Each sample evaluated, including its input, output, target, and score. |
| `results` | `EvalResults` | Aggregated scorer results |
| `stats` | `EvalStats` | Model token usage stats |
| `logging` | `list[LoggingMessage]` | Logging messages (e.g. from `log.info()`, `log.debug()`, etc.) |
| `error` | `EvalError` | Error information |
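For example, a previously written log can be read back and inspected programmatically. A minimal sketch using the fields above (assuming logs already exist in the current log directory):

```python
from inspect_ai.log import list_eval_logs, read_eval_log

# read one of the logs in the log directory
log = read_eval_log(list_eval_logs()[0])

if log.status == "success":
    print(log.eval.model)      # which model was evaluated
    print(log.results)         # aggregated scorer results
    print(len(log.samples))    # number of samples evaluated
```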
Models
- Documentation: Models
| Provider | Model Name | Docs |
|-----|-----|-----|
| OpenAI | `openai/gpt-3.5-turbo` | OpenAI Models |
| Anthropic | `anthropic/claude-3-sonnet-20240229` | Anthropic Models |
| Google | `google/gemini-1.0-pro` | Google Models |
| Mistral | `mistral/mistral-large-latest` | Mistral Models |
| Hugging Face | `hf/openai-community/gpt2` | Hugging Face Models |
| Ollama | `ollama/llama3` | Ollama Models |
| TogetherAI | `together/lmsys/vicuna-13b-v1.5` | TogetherAI Models |
| AWS Bedrock | `bedrock/meta.llama2-70b-chat-v1` | AWS Bedrock Models |
| Azure AI | `azureai/azure-deployment-name` | Azure AI Models |
| Cloudflare | `cf/meta/llama-2-7b-chat-fp16` | Cloudflare Models |

- Allows custom model providers.
- Inspect AI remains agnostic to specific model implementations, allowing flexibility and future compatibility.
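Models can also be resolved programmatically with `get_model()` (as used in the critique scorer above). A minimal sketch, assuming the provider’s API key is available in its environment variable (e.g. `OPENAI_API_KEY`):

```python
from inspect_ai.model import get_model

async def demo():
    # resolve a model by "provider/name" and generate a completion
    model = get_model("openai/gpt-4-turbo")
    output = await model.generate("Say hello.")
    print(output.completion)
```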
Workflow
- Documentation: Workflow
Interactive Development:
- Designed for iterative development within notebooks.
- Provides tools for exploration, such as grid search over parameters.
- Ad-hoc exploration of an eval in a Notebook/REPL
```python
params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4", "google/gemini-1.0-pro"]
}
params = list(product(*(params[name] for name in params)))

tasks = [Task(
    dataset=json_dataset("security_guide.jsonl"),
    plan=[system_message(system), generate()],
    scorer=model_graded_fact(template=grader, model=grader_model)
) for system, grader, grader_model in params]

logs = eval(tasks, model="mistral/mistral-large-latest")
plot_results(logs)
```
Task Parameters:
- Allows defining tasks with configurable parameters.
- Enables running evaluations with varying settings from external scripts or notebooks.
- Formalise variation with a parameterised @task function:
```python
@task
def security_guide(system="devops.txt", grader="expert.txt"):
    return Task(
        dataset=json_dataset("security_guide.jsonl"),
        plan=[system_message(system), generate()],
        scorer=model_graded_fact(template=grader, model="openai/gpt-4")
    )

params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"]
}
params = list(product(*(params[name] for name in params)))

eval([security_guide(system, grader) for system, grader in params],
     model="mistral/mistral-large-latest")
```
- `@task` functions are registered and addressable by external driver programs (step one in development => production)
- Example:
```python
@task
def security_guide(system="devops.txt", grader="expert.txt"):
    return Task(
        dataset=json_dataset("security_guide.jsonl"),
        plan=[system_message(system), generate()],
        scorer=model_graded_fact(
            template=grader,
            model="openai/gpt-4"
        )
    )
```
```bash
$ inspect eval security_guide.py -T system=devops.txt
$ inspect eval security_guide.py -T grader=hacker.txt
```
```bash
$ inspect eval security_guide.ipynb -T system=devops.txt
$ inspect eval security_guide.ipynb -T grader=hacker.txt
```
- Example
```python
def security_guide(system, grader="expert.txt"):
    return Task(
        dataset=json_dataset("security_guide.jsonl"),
        plan=[system_message(system), generate()],
        scorer=model_graded_fact(template=grader, model="openai/gpt-4")
    )

@task
def devops():
    return security_guide("devops.txt")

@task
def researcher():
    return security_guide("researcher.txt")
```
```bash
$ inspect eval security_guide.py@devops
$ inspect eval security_guide.py@researcher
```
Eval Suites:
- Documentation: Eval Suites
- Supports organizing and running multiple evaluations as suites.
Resiliency:
- Encourages running evaluations in production environments (e.g., CI) with features like log storage and retry mechanisms.
- Simplified Example
```python
# setup log context
os.environ["INSPECT_LOG_DIR"] = "./security-suite_04-07-2024"

# run the eval suite
tasks = list_tasks("security")
eval(tasks, model="mistral/mistral-large-latest")

# ...later, in another process that also has INSPECT_LOG_DIR
error_logs = list_eval_logs(status == "error")
eval_retry(error_logs)
```
Provenance:
- Ensures reproducibility by storing Git repository information within the log file.
- Allows recreating the evaluation environment and parameters from the log.
- Example
```python
# read the log and extract the origin and commit
log = read_eval_log("security-log.json")
origin = log.spec.revision.origin
commit = log.spec.revision.commit

# clone the repo, checkout the commit, install deps, and run
run(["git", "clone", origin, "eval-dir"])
with chdir("eval-dir"):
    run(["git", "checkout", commit])
    run(["pip", "install", "-r", "requirements.txt"])
    eval(log)
```
Q&A Session
- Integration with Posit Products: Inspect AI is not a Posit project and currently has no plans for integration.
- Evaluating Past LLM Interactions: Inspect AI can evaluate past interactions by using message history as input.
- Expanding Evaluation Metrics: The Inspect AI team plans to expand the list of built-in metrics based on community needs.
- Future Development and Direction: Long-term development is prioritized, with a focus on community collaboration and supporting a wide range of evaluation scenarios.
- Log Sources Beyond Files: Currently, logs are primarily file-based, but future development may include database logging capabilities.
- Shareable Security Tests: The Inspect AI team anticipates the creation and sharing of security test suites within the community.
- Integration with Weights & Biases: Integration with Weights & Biases is planned to streamline metric tracking and visualization.
- Design Philosophy: Inspired by principles of cleanliness, simplicity, and composability.