Conference Talk 4: Inspect - An OSS Framework for LLM Evals

In this talk, J.J. Allaire walks through the core concepts and design of the Inspect framework and demonstrate its use for a variety of evaluation tasks.

Christian Mills


July 6, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.
Presentation Resources

Inspect AI: A Framework for Evaluating LLMs

Hello World Example

from inspect_ai import Task, eval, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import (               
  chain_of_thought, generate, self_critique   

def theory_of_mind():
1    return Task(
2          chain_of_thought(),
3        scorer=model_graded_fact()
The Task object brings together the dataset, solvers, and scorer, and is then evaluated using a model.
In this example we are chaining together three standard solver components. It’s also possible to create a more complex custom solver that manages state and interactions internally.
Since the output is likely to have pretty involved language, we use a model for scoring.

Core Concepts

  • Dataset: List of inputs for the LLM
    • Usually includes targets (e.g., correct answers)
  • Solvers: Pipelines that execute the evaluation
    • Includes functions that transform the dataset inputs, call the model, and act on the model output
    • Can include things like prompt engineering and multi-turn dialogue
    • Can be composed together as layers, or can be a single layer with higher internal complexity
  • Scores: Evaluates the output of solvers
    • Ranges from simple text comparisons to model-graded assessments using custom rubrics.

Honeycomb Dataset Example

  • Jupyter Notebook: honeycomb/queries.ipynb
  • Dataset: queries.csv
    • ~2,300 example queries (along with per-query column schemas generated offline via RAG)
  • Scoring Methods:
    • validate: score using the validity checker:
    • critique: score using the critique prompt: critique.txt
  • Process:
    1. Load the dataset.
    2. Define a pipeline that includes prompt engineering, model calling, and evaluation.
    3. Apply a scoring function.


  • Documentation: Datasets
  • Inspect uses a standard schema for Datasets
    • We map the raw data into that schema when reading it
  • “columns” are saved as metadata so we can use them for prompt engineering
from inspect_ai.dataset import csv_dataset, FieldSpec

dataset = csv_dataset(
    sample_fields=FieldSpec(input="user_input", metadata=["columns"]),


  • Documentation: Solver

  • Functions that manipulate the TaskState (message history and model output) to perform actions like:

    • Calling the model to generate text.
    • Engineering prompts.
    • Applying critique mechanisms.
  • Built-In Solvers:

    • generate(): Calls the model, appends the assistant message, and updates the model output
    • chain_of_thought(): Basic chain-of-thought implementation.
    • prompt_template(): Modifies the existing prompt by passing it through a template
    • multiple_choice(): Handles multiple-choice questions, including shuffling options, calling the model, and unshuffling to determine the chosen answer.
    • self_critique()
      • Performs self-critique by:
        1. Running a critique model on the initial output.
        2. Appending the critique to the message history.
        3. Calling the model again to generate a revised answer.

Solver: prompt_with_schema()

  • Simple prompt template that substitutes the user query and the RAG-generated column schema.
from inspect_ai.solver import solver
from inspect_ai.util import resource

def prompt_with_schema():

    prompt_template = resource("prompt.txt")

    async def solve(state, generate):
        # build the prompt
        state.user_prompt.text = prompt_template.replace(
            "{{prompt}}", state.user_prompt.text
            "{{columns}}", state.metadata["columns"]
        return state

    return solve


  • Documentation: Scorer
    • Evaluates whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure
    • Pluggable (i.e. provided from other packages)
  • Built-In Scorers:
    • Pattern matching
    • Template matching
    • Model grading
    • Human scoring
  • Example: Math Benchmark with Expression Equivalence
    • Source Code: benchmarks/
    • Uses a custom scorer expression_equivalance to evaluate mathematical expressions for logical equivalence, accounting for variations in formatting or simplification.
    • Scorer: expression_equivalance()
      • Extracts the model’s answer using regular expressions.
      • Employs a few-shot prompting technique to train a model for assessing the equivalence of mathematical expressions.
      • @scorer(metrics=[accuracy(), bootstrap_std()])
        def expression_equivalance():
            async def score(state: TaskState, target: Target):
                # extract answer
                match =, state.output.completion)
                if match:
                    # ask the model to judge equivalance
                    answer =
                    prompt = EQUIVALANCE_TEMPLATE % (
                        {"expression1": target.text, "expression2": answer}
                    result = await get_model().generate(prompt)
                    # return the score
                    correct = result.completion.lower() == "yes"
                    return Score(
                        value=CORRECT if correct else INCORRECT,
                    return Score(
                        explanation="Answer not found in model output: "
                        + f"{state.output.completion}",
            return score

Scorer: validate_scorer()

  • Extracts and cleans JSON output from the model.
  • Calls the is_valid() function with the column schema to determine if a valid query was generated.
from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT
from utils import is_valid, json_completion

def validate_scorer():

    async def score(state, target):
        # check for valid query
        query = json_completion(state.output.completion)
        if is_valid(query, state.metadata["columns"]):
        # return score w/ query that was extracted
        return Score(value=value, answer=query)

    return score
  • The json_completion() function takes care of some details around extracting JSON from a model completion (e.g. removing sorrounding backtick code block emitted by some models)

Validate Task

  • Task:
    • The basis for defining and running evaluations
    • Parameterized with a dataset, a scorer, and metrics.
    • May optionally provide a default plan for execution.

Honeycomb Eval: validate()

  • Combines the dataset, solver, and scorer defined above into a Task.
  • Uses a predefined system message and prompt template.
from inspect_ai import eval, task, Task
from inspect_ai.solver import system_message, generate

def validate():
    return Task(
            system_message("Honeycomb AI suggests queries based on user input."),
  • Run the Task using Inspect’s eval() function (limiting to 100 samples):
if __name__ == '__main__':
    eval(validate, model="openai/gpt-4-turbo", limit=100)

  • The __name__ == '__main__' conditional indicates that we only want to run this cell in interactive contexts.
    • Allows us to also use the notebook as a module callable from inspect eval:
    •   $ inspect eval queries.ipynb@validate

Eval View

  •   inspect view
    • Example
      $ inspect view
      Inspect view running at http://localhost:7575/
  • Provides an overview of evaluation results.

  • Allows drilling down into individual samples to examine message history, inputs, and model outputs for debugging.


Critique Task

Scorer: critique_scorer()

  • Allows using different models (e.g., GPT-4 Turbo) to critique the generated queries.
  • Builds a critique prompt using a predefined template and the model’s output to have the critique model indicate whether the generated query is “good” or “bad”.
  • Returns a score based on the critique model’s assessment.
import json
from inspect_ai.model import get_model

def critique_scorer(model = "anthropic/claude-3-5-sonnet-20240620"):

    async def score(state, target):
        # build the critic prompt
        query = state.output.completion.strip()
        critic_prompt = resource("critique.txt").replace(
            "{{prompt}}", state.user_prompt.text
            "{{columns}}", state.metadata["columns"]
            "{{query}}", query
        # run the critique
        result = await get_model(model).generate(critic_prompt)
            parsed = json.loads(json_completion(result.completion))
            value = CORRECT if parsed["outcome"] == "good" else INCORRECT
            explanation = parsed["critique"]
        except (json.JSONDecodeError, KeyError):
            value = INCORRECT
            explanation = f"JSON parsing error:\n{result.completion}"
        # return value and explanation (critique text)
        return Score(value=value, explanation=explanation)

    return score

Honeycomb Eval: critique()

  • Utilizes the same dataset and plan as validate() but employs a critique model for scoring.
def critique():
    return Task(
            system_message("Honeycomb AI suggests queries based on user input."),
  • Run the task using eval() (limiting to 25 samples):
if __name__ == '__main__':
    eval(critique, model="openai/gpt-4-turbo", limit=25)


Critique Eval View

  •   inspect view
  • Displays critique-based evaluation results.
  • Provides insights into the critique model’s explanations for incorrect answers, aiding in prompt or model improvement.



  • Inspect AI encourages composing evaluations by combining solvers and scorers from different sources.
  • Custom solvers and scorers can be made available in a Python package to re-use across many evals.

Example: Jailbreaking with sheppard

  • sheppard: An internal package that integrates jailbreak solvers to elicit responses from models they might otherwise refuse to provide.

  • Solver Description
    encode() Message obfuscation jailbreak
    pap_jailbreak() Persuasion Adversarial Prompt (PAP)
    payload_splitting() Payload splitting jailbreak
    cr_jailbreak() Content reinforcement
  • Example: Using sheppard to provide jailbreaks for a security eval:
    from inspect_ai import Task, eval, task
    from inspect_ai.scorer import model_graded_fact
    from inspect_ai.solver import generate, system_message
    from sheppard import pap_jailbreak
    def security_guide():
        return Task(

Tool Use

  • Documentation: Tools
    • Python functions that can be made accessible to the model during evaluation.
    • Allow for more complex interactions, like web search or database lookups.
    •   class TaskState:
            messages: list[ChatMessage]
            tools: list[ToolDef]
            tool_choice: ToolChoice
            output: ModelOutput

Agents and Tools

  • Documentation: Agents
    • Solvers that use tools to perform tasks.
    • Can use bespoke agent logic inside a solver (swapping various tools in and out)
    • Can integrate existing agent libraries like Langchain as solvers.

Agent: Capture the Flag

  • Example of a custom agent designed for cybersecurity evaluations, where the model interacts with tools to solve capture-the-flag challenges within a Docker environment.
  •   Plan(
                  command_exec(), create_file(),
                  decompile(), disassemble(),

Agent: LangChain

  • Project Folder: inspect-llm-workshop/langchain
    • Uses tavily, a search engine optimized for LLMs and RAG
  • Demonstrates integrating a Langchain agent into Inspect AI using a higher-order function.
  • Allows leveraging existing agent frameworks within Inspect AI’s evaluation pipeline.
  • @solver
    def wikipedia_search() -> Solver:
        tavily_api = TavilySearchAPIWrapper() 
        tools = ([TavilySearchResults(api_wrapper=tavily_api)] + 
        async def agent(llm: BaseChatModel, input: dict[str, Any]):
            tools_agent = create_openai_tools_agent(llm, tools, prompt)
            agent_executor = AgentExecutor.from_agent_and_tools(
            result = await agent_executor.ainvoke(input)
            return result["output"]
        return langchain_solver(agent)


  • Documentation: Eval Logs
  • Logging is crucial for debugging, analysis, and reproducibility.
  • Capture all context required to debug, analyse, and reproduce evaluations
  • Inspect AI Logging:
    • Provides a rich Python API and JSON representation of the evaluation process.
    • Offers a log viewer for interactive exploration.
    • Enables programmatic access to logs for analysis and comparison.


  • Documentation: EvalLog

  • Returned from eval()

  • Provides programmatic interface to the contents of log files

  • Field Type Description
    status str Status of evaluation
    eval EvalSpec Top level eval details including task, model, creation time, etc.
    plan EvalPlan List of solvers and model generation config used for the eval.
    samples list[EvalSample] Each sample evaluated, including its input, output, target, and score.
    results EvalResults Aggregated scorer results
    stats EvalStats Model token usage stats
    logging list[LoggingMessage] Logging messages (e.g. from, log.debug(), etc.
    error EvalError Error information



Interactive Development:

  • Designed for iterative development within notebooks.
  • Provides tools for exploration, such as grid search over parameters.
  • Ad-hoc exploration of an eval in a Notebook/REPL
    params = {
       "system": ["devops.txt", "researcher.txt"],
       "grader": ["hacker.txt", "expert.txt"],
       "grader_model": ["openai/gpt-4", "google/gemini-1.0-pro"]
    params = list(product(*(params[name] for name in params)))
    tasks = [Task(
        plan=[system_message(system), generate()],
        scorer=model_graded_fact(template=grader, model=grader_model)
    ) for system, grader, grader_model in params]
    logs = eval(tasks, model = "mistral/mistral-large-latest")

Task Parameters:

  • Allows defining tasks with configurable parameters.
  • Enables running evaluations with varying settings from external scripts or notebooks.
  • Formalise variation with a parameterised @task function:
    def security_guide(system="devops.txt", grader="expert.txt"):
       return Task(
          dataset = json_dataset("security_guide.jsonl"),
          plan=[system_message(system), generate()],
          scorer=model_graded_fact(template=grader, model="openai/gpt-4")
    params = {
       "system": ["devops.txt", "researcher.txt"],
       "grader": ["hacker.txt", "expert.txt"]
    params = list(product(*(params[name] for name in params)))
    eval([security_guide(system,grader) for system, grader in params],
         model = "mistral/mistral-large-latest")
  • @task functions are registered and addressable by external driver programs (step one in development => production)
    • Example
      def security_guide(system="devops.txt", grader="expert.txt"):
          return Task(
              dataset = json_dataset("security_guide.jsonl"),
              plan=[system_message(system), generate()],
      $ inspect eval -T system=devops.txt 
      $ inspect eval -T grader=hacker.txt 
      $ inspect eval security_guide.ipynb -T system=devops.txt 
      $ inspect eval security_guide.ipynb -T grader=hacker.txt 
    • Example
      def security_guide(system, grader="expert.txt"):
         return Task(
            dataset = json_dataset("security_guide.jsonl"),
            plan=[system_message(system), generate()],
            scorer=model_graded_fact(template=grader, model="openai/gpt-4")
      def devops()
         return security_guide("devops.txt")
      def researcher()
         return security_guide("researcher.txt")
      $ inspect eval
      $ inspect eval

Eval Suites:

  • Documentation: Eval Suites
  • Supports organizing and running multiple evaluations as suites.


  • Encourages running evaluations in production environments (e.g., CI) with features like log storage and retry mechanisms.
  • Simplified Example
    # setup log context
    os.environ["INSPECT_LOG_DIR"] = "./security-suite_04-07-2024"
    # run the eval suite
    tasks = list_tasks("security")
    eval(tasks, model="mistral/mistral-large-latest")
    # ...later, in another process that also has INSPECT_LOG_DIR
    error_logs = list_eval_logs(status == "error")


  • Ensures reproducibility by storing Git repository information within the log file.
  • Allows recreating the evaluation environment and parameters from the log.
  • Example
    # read the log and extract the origin and commit
    log = read_eval_log("security-log.json")
    origin = log.spec.revision.origin
    commit = log.spec.revision.commit
    # clone the repo, checkout the commit, install deps, and run
    run(["git", "clone", revision.origin, "eval-dir"])
    with chdir("eval-dir"):
       run(["git", "checkout", revision.commit])
       run(["pip", "install", "-r", "requirements.txt"])

Q&A Session

  • Integration with Posit Products: Inspect AI is not a Posit project and currently has no plans for integration.
  • Evaluating Past LLM Interactions: Inspect AI can evaluate past interactions by using message history as input.
  • Expanding Evaluation Metrics: The Inspect AI team plans to expand the list of built-in metrics based on community needs.
  • Future Development and Direction: Long-term development is prioritized, with a focus on community collaboration and supporting a wide range of evaluation scenarios.
  • Log Sources Beyond Files: Currently, logs are primarily file-based, but future development may include database logging capabilities.
  • Shareable Security Tests: The Inspect AI team anticipates the creation and sharing of security test suites within the community.
  • Integration with Weights & Biases: Integration with Weights & Biases is planned to streamline metric tracking and visualization.
  • Design Philosophy: Inspired by principles of cleanliness, simplicity, and composability.