Livestream: Lessons from a Year of Building with LLMs

Categories: notes, llms

This live discussion between six AI experts and practitioners centers on the practical lessons learned from a year of building real-world applications with LLMs, emphasizing the critical importance of data literacy, rigorous evaluation, and iterative development processes.

Author: Christian Mills

Published: August 30, 2024

This post is part of the following series:
  • Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Introduction and Background

  • Hugo Bowne-Anderson: Host of the Vanishing Gradients podcast and live stream. Has a background in data science, machine learning, and education.
  • Live Stream Focus: Discussion of a report co-authored by six AI professionals (Eugene Yan, Shreya Shankar, Hamel Husain, Bryan Bischof, Charles Frye, and Jason Liu), focusing on practical lessons learned from building real-world applications with Large Language Models (LLMs) over the past year.
  • Report Link: https://applied-llms.org/
  • Recording Link: https://www.youtube.com/live/c0gcsprsFig
  • Key Themes: The report covers tactical, operational, and strategic aspects of building LLM systems, including:
    • Tactical: Prompt engineering, data management, evaluation strategies.
    • Operational: Building development pipelines, integrating LLMs into existing workflows.
    • Strategic: Understanding business use cases, defining success metrics, and building trust with stakeholders.

Panelist Introductions and Motivations

  • Eugene Yan:
    • Works at Amazon Books, focusing on recommendation systems and search.
    • Interested in using LLMs to enhance recommendations and search by improving customer understanding.
    • Co-authored several key resources on prompt engineering and LLM evaluation.
    • Blog: https://eugeneyan.com
  • Shreya Shankar:
    • Researcher and ML engineer pursuing a PhD focused on data management, UX, and HCI for machine learning.
    • Interested in building intelligent software with LLMs, particularly for small teams and early-stage products.
    • Emphasis on developing evaluation methods that are simple and accessible.
    • Conducts research on evaluating LLM output quality and validating evaluation methods.
  • Hamel Husain:
    • 25 years of experience in machine learning, including work on developer tools and ML infrastructure at GitHub.
    • Led research at GitHub that contributed to Copilot.
    • Passionate about using LLMs to accelerate software development and launch applications faster.
    • Strong advocate for prioritizing evaluation (evals) as a core part of the AI development process.
    • Blog post: https://hamel.dev/blog/posts/evals/
  • Bryan Bischof:
    • Head of AI at Hex, leads the team developing Magic (an AI-powered data science tool).
    • Extensive experience building data teams at Blue Bottle Coffee, Stitch Fix, and Weights & Biases.
    • Passionate about applying LLMs to answer questions with data, particularly in the context of data science workflows.
    • Strong advocate for using notebooks for exploring and understanding data.
  • Charles Frye:
    • Works at Modal, focused on teaching people to build AI applications.
    • Background in psychopharmacology, neurobiology, and neural networks.
    • Interested in developing intelligent software and using LLMs to democratize cognition.
    • Key contributor to Modal’s infrastructure and developer tools, including work with NVIDIA’s TensorRT-LLM library.

The Genesis of the Report

  • Origin: The report originated from discussions within a group chat among the co-authors.
  • Initial Motivation:
    • Bryan Bischof: Was considering writing about a year of LLMs.
    • Eugene Yan: Had already started drafting similar content.
    • Charles Frye: Suggested a collaborative effort.
    • Hamel Husain and Jason Liu: Enthusiastically joined the project.
    • Shreya Shankar: Was invited for her expertise in evaluation and editing.
  • Organization:
    • Bryan Bischof: Proposed structuring the report into tactical, operational, and strategic levels.
  • Collaborative Workflow:
    • Authors contributed their ideas and experiences.
    • Content was synthesized and organized collaboratively.
    • Charles Frye played a key role in editing and consolidating the material.

The Importance of Evaluation

  • Traditional Evals vs. LLM Evals:
    • Shreya Shankar:
      • Traditional evaluation methods often fail to uncover issues arising from real-world data idiosyncrasies (e.g., typos, inconsistent casing).
      • Canonical benchmarks and datasets often rely on clean, pre-processed data, which does not reflect real-world data challenges.
      • Example: Mistral struggles to retrieve names from documents if the name is in lowercase.
      • Traditional evals often lack methods for validating the evaluation methods themselves.
      • LLM-based evals can make evaluation more accessible to smaller teams who lack resources for traditional methods.
      • EvalGen Paper: Discusses a flow-based approach to validating evaluation methods and constructing appropriate evals for deployment.
  • Overcoming Barriers to Evaluation:
    • Developers often struggle with the concept of evals and how to get started.
    • Shreya Shankar’s research tools provide a visual, intuitive approach to building and understanding evals, similar to Scratch for programming.
    • This “Scratch for evals” approach makes evals more accessible and understandable.
  • Evaluation as an Integral Part of AI Development:
    • Evaluation is not optional; it is an essential part of the AI development process.
    • Evals are not separate from building AI; they are how you build AI.
    • Measuring progress and having a systematic approach to improvement are critical.
    • Focus on making evals frictionless and integrated into the workflow to enable continuous improvement.
  • Shifting the Focus from Tools to Process:
    • Developers often focus on tools (e.g., vector databases, embeddings) rather than understanding the underlying process of building and improving AI applications.
    • It’s essential to shift the focus from tools to the process of evaluation and data understanding.
  • Evaluation as a Proxy for Loss Functions:
    • Traditional machine learning relies on explicit loss functions for optimization.
    • In generative AI, the loss functions are less defined, and evals act as a proxy for measuring progress and alignment with desired outcomes.
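
To make the “evals as a proxy for loss functions” point concrete, here is a minimal sketch in Python: a few binary checks (conciseness, required phrases) are aggregated into a single pass rate that can be tracked across iterations. The run_pipeline, check_concise, and check_mentions names are illustrative placeholders, not code from the report.

```python
# Minimal sketch: aggregate a handful of binary checks into one scalar score
# that stands in for a loss function. `run_pipeline` is a hypothetical
# placeholder for the LLM application under test.

def run_pipeline(question: str) -> str:
    # Placeholder for the real pipeline (prompting, retrieval, etc.).
    return "We offer refunds within 30 days of purchase."

def check_concise(output: str, max_words: int = 120) -> bool:
    return len(output.split()) <= max_words

def check_mentions(output: str, required: list[str]) -> bool:
    return all(term.lower() in output.lower() for term in required)

def evaluate(test_cases: list[dict]) -> float:
    """Return the fraction of binary checks that pass across all cases."""
    passed = total = 0
    for case in test_cases:
        output = run_pipeline(case["input"])
        checks = [check_concise(output), check_mentions(output, case["must_mention"])]
        passed += sum(checks)
        total += len(checks)
    return passed / total if total else 0.0

test_cases = [
    {"input": "Summarize the return policy.", "must_mention": ["30 days", "refund"]},
]
print(evaluate(test_cases))  # watch the trend of this score, not its absolute value
```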

Underappreciated Aspects of LLM Development

  • Equipping Engineers with LLM Skills:
    • Organizations need to focus on training existing software engineers to understand and effectively use LLMs.
    • Key areas for training:
      • Basic evaluation methods (e.g., using synthetic data, Kaggle datasets).
      • Understanding the autoregressive nature of LLM generation and its impact on latency.
      • Understanding context as conditioning.
  • Moving from Prototype to Production:
    • The industry has focused heavily on demos and prototypes, and the bar for “production” has been lowered with generative AI.
    • Prototype++: Products are being launched without rigorous evaluation or clear methods for quantifying improvements.
    • The definition of “production” should include systematic improvement processes and clear evidence of progress for stakeholders.
    • Challenge: Moving from prototype++ to true production-ready LLM applications with robust evals and improvement processes.
  • The Importance of Data Literacy:
    • Data literacy is essential for working with LLMs, even if you are not training models from scratch.
    • Data literacy enables developers to:
      • Analyze and debug systems: By examining data, developers can identify issues and understand system behavior.
      • Develop domain-specific evals: Understanding the data is crucial for creating meaningful and effective evaluation methods.
    • Challenge: Overcoming the misconception that AI automates everything and realizing the ongoing need to look at and understand data.
  • Specific Data Literacy Skills:
    • Assessing Output Quality:
      • Develop methods for determining whether an output is good or bad.
      • Use a combination of binary indicators (e.g., conciseness, tone, presence of specific phrases) to simplify evaluation.
    • Pairwise Comparisons: Compare two models or pipelines directly instead of relying on individual ratings (see the sketch after this list).
    • Understanding Implicit Constraints: Define criteria for “good” that align with user expectations.
    • Simplifying Evaluation: Break down complex evaluation tasks into smaller, more manageable binary assessments.
  • The Role of Human-in-the-Loop Evaluation:
    • Importance of manual data labeling and review, especially in the early stages of development.
    • “Homework” approach at Hex: Team members use interactive Hex applications to provide feedback on model outputs and assist with data labeling.
    • This process helps to bootstrap data sets and align LLM evaluations with human judgment.
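
As a concrete illustration of the pairwise-comparison approach mentioned above, the sketch below compares two pipelines with an LLM judge. The call_llm helper and the judge prompt are hypothetical placeholders for whatever provider and wording you use; randomizing the presentation order of the two answers is a common way to control for the judge’s position bias.

```python
# Sketch of pairwise comparison between two pipelines with an LLM judge.
# `call_llm` is a hypothetical wrapper around your provider's API.

import random

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your LLM provider, return the text."""
    raise NotImplementedError

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer 1: {first}
Answer 2: {second}
Reply with only "1", "2", or "tie"."""

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'a', 'b', or 'tie', randomizing order to reduce position bias."""
    a_first = random.random() < 0.5
    first, second = (answer_a, answer_b) if a_first else (answer_b, answer_a)
    verdict = call_llm(JUDGE_PROMPT.format(question=question, first=first, second=second)).strip()
    if verdict == "1":
        return "a" if a_first else "b"
    if verdict == "2":
        return "b" if a_first else "a"
    return "tie"

def win_rate(questions: list[str], pipeline_a, pipeline_b) -> float:
    """Fraction of questions where pipeline_a's answer is judged better."""
    wins = sum(judge_pair(q, pipeline_a(q), pipeline_b(q)) == "a" for q in questions)
    return wins / len(questions)
```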

Misconceptions and Knowledge Gaps in Organizations

  • The “AI Engineer” Narrative and Its Limitations:
    • Popular Characterization:
      • AI engineers are portrayed as primarily focused on tools, infrastructure, chains, and agents, with limited involvement in training, evals, inference, and data.
    • Skills Gaps:
      • Neglecting data literacy and evaluation skills creates significant challenges beyond the MVP stage.
      • AI engineers may find themselves unable to systematically improve their applications due to a lack of understanding of data and evaluation.
    • Title Issues:
      • The “AI engineer” title sets unrealistic expectations and places undue pressure on individuals when projects face challenges.
  • Addressing the Talent Gap:
    • The most significant impact on AI product development is the talent and skills of the team.
    • Organizations need to hire for data literacy and evaluation expertise, not just tooling and infrastructure skills.
    • Consulting Work: A significant portion of Hamel’s consulting work arises from addressing this talent gap and helping organizations build effective AI teams.
  • Real-World Example of Successful “AI Engineer” Hiring:
    • One panelist hired “AI engineers” successfully by focusing on core data science skills.
    • The take-home exam focused on data cleaning, underscoring the importance of data literacy in that definition of the role.
  • The Importance of Data Literacy in AI Engineering:
    • Data literacy is essential for AI engineers to analyze outputs, understand failure modes, and develop effective evaluation strategies.
    • It’s crucial for AI engineers to be able to examine data and draw conclusions about system performance without relying solely on AI tools.

Promising Opportunities and Future Challenges

  • Focusing on Unsexy, Expensive, and Slow Tasks:
    • LLMs offer the potential to automate tasks that are currently time-consuming and costly for humans.
    • Examples:
      • Classification.
      • Information Extraction.
      • Quiz Generation from Textbooks.
    • Focus: Identify tasks that are currently unsexy but can be effectively delegated to LLMs, leading to cost savings and increased efficiency (see the extraction sketch after this list).
  • Developing More Thoughtful UX:
    • LLMs should not be seen as one-shot wonders with perfect UX.
    • Graceful failure modes and user-friendly interfaces are essential.
    • Co-pilot Mentality: Design systems that allow users to edit and interact with LLM outputs, turning failures into learning opportunities.
  • Empowering End Users as Programmers:
    • AI will not be able to read minds, so empowering users to interact with and refine LLM outputs is crucial.
    • ChatGPT Example: Users act as programmers by refining their prompts and editing previous messages to achieve their desired outcomes.
    • Notebook Interfaces: Offer a flexible workspace for both technical and non-technical users to interact with and programmatically control LLMs.
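
As one illustration of the kind of unsexy-but-valuable task listed above, the sketch below uses an LLM to extract structured fields from free-text support emails. The call_llm helper and the prompt wording are hypothetical; a real system would also validate the parsed JSON against a schema.

```python
# Sketch of LLM-based information extraction from free-text emails.
# `call_llm` is a hypothetical wrapper around your LLM provider.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your LLM provider, return the text."""
    raise NotImplementedError

EXTRACTION_PROMPT = """Extract the following fields from the email below.
Return only JSON with keys: customer_name, product, issue_category.
Use null for any field that is missing.

Email:
{email}"""

def extract_fields(email: str) -> dict:
    raw = call_llm(EXTRACTION_PROMPT.format(email=email))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Graceful failure mode: flag for human review instead of crashing.
        return {"error": "unparseable", "raw": raw}
```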

Key Insights from Collaboration

  • Value of Community and Alignment:
    • The most valuable aspect of the collaboration was the opportunity to connect with other experts, discuss ideas, and learn from each other’s experiences.
    • Having a network of trusted peers to consult with and debate challenging questions is essential.
  • Surprising Alignment Despite Diverse Perspectives:
    • The authors, despite working in different parts of the LLM stack (research, infrastructure, product development, etc.), found a remarkable degree of alignment in their key insights.
    • They had “grabbed different parts of the elephant” but had all come to similar conclusions about the core principles of building with LLMs.
  • The Collaborative Process as a “Raid Boss”:
    • The collaborative effort was akin to battling a raid boss in an MMORPG, requiring a skilled and coordinated team effort.
    • Charles Frye played a critical role in editing and consolidating the vast amount of material.
    • Hamel Husain demonstrated high agency by quickly setting up a website for the report.
    • The collaborative process was enjoyable, inspiring, and ultimately led to a highly impactful resource for the community.
  • Impact Beyond Industry:
    • The report’s impact has extended beyond industry, reaching academics and researchers who traditionally do not engage with industry blogs.
    • The report’s timing was critical, capturing the widespread interest in LLMs and their potential to transform computing.

The Importance of Data-Centric AI Development

  • Data Literacy as a Core Skill:
    • The ability to understand and work with data is essential for success with LLMs.
    • Traditional software engineering often focuses on the syntactic correctness of code, while data science emphasizes the semantic meaning of data.
    • Challenge: Bridging this gap in perspective and helping software engineers develop data literacy skills.
  • Developing Intuition for Data Distributions:
    • Experience with data analysis and visualization leads to an intuitive understanding of data distributions.
    • This intuition allows for quickly identifying issues in data or model outputs, even without specific theoretical knowledge.
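
A small example of what looking at the data can mean in practice: summarizing output lengths and a simple refusal flag with pandas to spot drift or failure modes at a glance. The column names and the refusal heuristic below are assumptions made for the sake of the example.

```python
# Quick look at the distribution of LLM outputs: length stats and a crude
# refusal-rate flag. The data and heuristics are illustrative only.

import pandas as pd

outputs = pd.DataFrame({
    "output": [
        "Refunds are available within 30 days.",
        "I'm sorry, I can't help with that.",
        "Our policy covers returns, exchanges, and refunds within 30 days.",
    ]
})

outputs["n_words"] = outputs["output"].str.split().str.len()
outputs["is_refusal"] = outputs["output"].str.contains("can't help", case=False)

print(outputs["n_words"].describe())  # length distribution: spot truncation or rambling
print(outputs["is_refusal"].mean())   # refusal rate: a quick proxy for one failure mode
```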

Building Trust with Stakeholders and Users

  • Collaborative Design:
    • Involve designers, UX professionals, and domain experts early in the process to understand user needs and build trust.
    • Co-designing with stakeholders helps identify potential pitfalls and ensure the application aligns with user expectations.
  • Slow Rollout and Iterative Feedback:
    • Slowly roll out LLM applications to small groups of users to gather feedback and identify issues before wider deployment.
    • Repeated interactions with small groups provide more valuable insights than initial interactions with larger groups.
  • User Feedback as the Antidote to Demoitis:
    • Gathering user feedback through beta programs and interactions at meetups is crucial for moving beyond the hype of demos.
    • Directly observing user reactions provides valuable insights into usability and helps identify areas for improvement.

Systems Thinking vs. Model Focus

  • The Importance of Systems-Level Design:
    • Building successful LLM applications requires a systems-level perspective, not just a focus on the model itself.
    • Example: At Hex, the team started by designing the overall system architecture, including prompt templating, evaluation frameworks, and context construction (RAG).
    • These architectural decisions have proven durable and have guided the project’s development.
  • Key System Design Considerations:
    • Evals: Prioritize building robust evaluation frameworks from the outset.
    • Composability: Design systems with composability in mind, especially for context construction and prompt generation.
    • Meta-Programming: Think of prompt construction as a form of meta-programming, allowing for flexible and adaptable system behavior (sketched after this list).
  • Leveraging Prior Experience from ML:
    • Much of the knowledge and best practices from traditional machine learning apply to building LLM systems.
    • Example: The principles outlined in the book Machine Learning Design Patterns are directly relevant to LLM application development.
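
The sketch below illustrates the composability and meta-programming points above: small functions build each section of context (instructions, retrieved snippets, the user request), and the final prompt is assembled from whichever pieces a given request needs. The names are illustrative and not taken from Hex’s implementation.

```python
# Sketch of treating prompt construction as composable meta-programming:
# each section of context is built by a small function, and the final prompt
# is assembled from whichever pieces a request needs.

from dataclasses import dataclass

@dataclass
class PromptSection:
    title: str
    body: str

    def render(self) -> str:
        return f"## {self.title}\n{self.body}"

def system_instructions(task: str) -> PromptSection:
    return PromptSection("Instructions", f"You are assisting with: {task}.")

def retrieved_context(snippets: list[str]) -> PromptSection:
    return PromptSection("Context", "\n---\n".join(snippets))

def user_request(text: str) -> PromptSection:
    return PromptSection("Request", text)

def build_prompt(sections: list[PromptSection]) -> str:
    return "\n\n".join(s.render() for s in sections)

prompt = build_prompt([
    system_instructions("writing SQL for a sales dashboard"),
    retrieved_context(["Table orders(id, total, created_at)"]),  # e.g., from a RAG step
    user_request("Total revenue per month for 2024"),
])
print(prompt)
```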

Final Advice and Future Directions

  • Prioritize Evaluation:
    • Start by creating sample inputs and ideal outputs, then build evaluation methods to measure progress and identify issues.
    • Focus on creating a gold standard dataset, even if it’s small, to guide evaluation and refinement.
  • Read, Build, and Share:
    • Read: Engage with high-quality resources and articles that distill key concepts and best practices.
    • Build: Get hands-on experience by building demos and experimenting with LLMs to understand their capabilities and limitations.
    • Share: Contribute to the community by sharing your experiences, insights, and code to accelerate learning and adoption.
  • Focus on Iterative Improvement:
    • Embrace a process of validated iterative improvement, similar to gradient descent in optimization (a minimal sketch follows this list).
    • LLMs introduce complexity and uncertainty, but the fundamental principles of building complex systems remain the same: make small, validated steps toward improvement.
    • Key Elements: Data, experimentation, operationalization, and rapid deployment to production.
    • Focus on iterative experimentation and building evaluation frameworks to guide development toward user-centric solutions.
    • Zero-to-One Mentality: Break down the problem into smaller, manageable chunks and iteratively improve each component.
  • The Future of Knowledge Access:
    • LLMs have the potential to make knowledge more accessible, personalized, and useful.
    • Example: Memory extenders and personal Memex systems.
    • LLMs will transform how we interact with information, socialize, and carry out our daily lives.
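
To make the gradient-descent analogy for iterative improvement concrete, here is a minimal sketch of validated iteration: propose a small change (a prompt variant), score it on the same eval set, and keep it only if the score improves. The evaluate function is assumed to return a scalar score, along the lines of the eval-harness sketch earlier in these notes.

```python
# Sketch of validated, greedy iteration: accept a change only when it improves
# the eval score. `evaluate` is a placeholder for your own eval harness.

def evaluate(prompt_template: str, test_cases: list[dict]) -> float:
    """Placeholder: run the pipeline with this template and return a 0-1 score."""
    raise NotImplementedError

def improve(base_template: str, candidates: list[str], test_cases: list[dict]) -> str:
    """Try each candidate template; keep the one with the best validated score."""
    best_template = base_template
    best_score = evaluate(best_template, test_cases)
    for candidate in candidates:
        score = evaluate(candidate, test_cases)
        if score > best_score:
            best_template, best_score = candidate, score
    return best_template
```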

Conclusion

  • The report and this discussion highlight the essential lessons learned from a year of building with LLMs.
  • Key Takeaways:
    • Data Literacy: Prioritize understanding and working with data as a foundational skill.
    • Evaluation: Make evaluation a core part of the development process, focusing on human-in-the-loop methods and iterative refinement.
    • Systems Thinking: Build end-to-end systems that go beyond model focus, considering aspects like prompt design, context construction, and evaluation frameworks.
    • Process over Tools: Focus on developing robust processes and methodologies rather than relying solely on tools.
    • User-Centricity: Design applications with user needs in mind, prioritizing trust, feedback, and iterative improvement.
  • The future of LLMs is bright, with the potential to transform how we interact with information and build intelligent software.

About Me:
  • I’m Christian Mills, a deep learning consultant specializing in computer vision and practical AI implementations.
  • I help clients leverage cutting-edge AI technologies to solve real-world problems.
  • Learn more about me or reach out via email at [email protected] to discuss your project.