Office Hours 4: Modal with Charles Frye

notes

llms

This Q&A session covers a wide array of topics related to Modal, a platform designed to simplify the execution of Python code in the cloud.

Author

Christian Mills

Published

July 6, 2024

This post is part of the following series:

Mastering LLMs Course Notes: My notes from the course Mastering LLMs: A Conference For Developers & Data Scientists by Hamel Husain and Dan Becker.

Startup Times and Optimization

Concern: Modal’s startup time compared to traditional server-based solutions.
Modal’s Performance:
- Container startup (Docker run): 1-2 seconds (50th percentile startup time).
- Slowdown occurs when environment setup requires loading large elements into memory (e.g., language model weights).
Solutions for Faster Startup:
- Keep Warm: Leave applications running for minimal latency, especially crucial for GPU-bound tasks. (Trade-off: potentially higher cost for idle resources).
  - Keep containers warm for longer with container_idle_timeout
- CUDA Checkpointing: New feature under integration, expected to accelerate subsequent invocations.
  - GitHub Repository: cuda-checkpoint
- CPU Tasks: Easily sliced and diced, making them cost-effective in keep-warm mode due to minimal resource consumption during idle periods.
- Optimization Potential: The LLM fine-tuning repo hasn’t been fully optimized for boot times; improvements are possible.
  - GitHub Repository: llm-finetuning

Iterative Development Workflow

Challenge: Fine-tuning models locally on a small scale with debugging and then scaling up on Modal with a full dataset and larger models.
Recommendations:
- modal.Image Class:
  - Base class for container images to run functions in.
  - Utilize for environment definition, ensuring consistency between local and remote setups.
  - Guide: Custom containers
- Dependency Management: Leverage tools like pip freeze and poetry for tighter control over environments.
- Hardware Considerations: Be mindful of potential discrepancies between local and Modal GPUs.

Cost Comparison and Value Proposition

Concern: Fine-tuning on Modal appears more expensive than platforms like Jarvis Labs.
Modal’s Pricing:
- Transparent, based on underlying cloud provider costs.
- No hidden fees or inflated pricing strategies.
When Modal Wins:
- High Operational Overhead: Modal excels when the effort of managing servers (spinning up, down, utilization tracking) outweighs the raw compute cost.
- Unpredictable Workloads: Serverless nature shines when demand fluctuates, and predicting utilization is challenging.
- Scalability Needs: Modal simplifies scaling to thousands of GPUs, surpassing the limitations of individual users or smaller organizations.
- GPU Accessibility: Modal offers readily available GPUs, circumventing the challenges of procurement and allocation.
- Developer Experience: Streamlined workflow and reduced operational burden can justify a potential price premium for some users.

Understanding Modal’s Cost Structure

Question: How can a keep-warm FastAPI app on Modal cost only 30 cents per month when CPU core pricing suggests a much higher cost?
Explanation:
- Time-Slicing: CPUs are shared efficiently, and Modal only charges for actual usage, not idle time.
- Low Utilization: Web apps typically have low average CPU utilization, further reducing costs.
- RAM-Based Pricing: During idle periods, charges are primarily determined by RAM usage, which is often minimal for lightweight apps.

Streaming Output from LLMs

Question: Availability of examples showcasing streamed output from LLMs in FastAPI apps.
Answer:
- Examples for streaming and FastAPI integration are available in the documentation:
  - Fast inference with vLLM (Mixtral 8x7B)
  - QuiLLMan: Voice Chat with LLMs
Modal’s Async Support: Modal simplifies asynchronous programming, making streaming implementations easier.
- Guide: Asynchronous API usage

Data Privacy

Question: Modal’s policy on data privacy and potential use of user data for model training.
Answer:
- Commitment to Security: Modal is SOC 2 compliant and working towards SOC 2 Type 2 certification, demonstrating a high standard of data security.
- User Data Protection: Modal treats user application data as confidential. Permission is sought before reviewing data, even for support purposes.
- No User Data Training: Modal, as an infrastructure company, doesn’t use customer data for training internal models.

Balancing Cost and Uptime for GPU Inference

Question: Finding the sweet spot between cost and uptime for GPU inference when needing varying levels of availability.
Rule of Thumb:
- Modal tends to be more cost-effective when utilization is 60% or lower.
- Consider factors like acceptable latency and workload characteristics (batch jobs vs. real-time requests).

Local vs. Cloud Workload Distribution

Question: Deciding when to utilize a local GPU (e.g., RTX 4090) versus offloading to Modal, considering cost and time efficiency.
Workload Breakdown:
- Inference: Local GPUs are well-suited due to typically small batch sizes, making VRAM less of a constraint.
- Evaluations: Larger eval sets might benefit from cloud GPUs for faster throughput, especially when running multiple evaluations concurrently.
- Fine-tuning: Often memory-intensive due to gradients and optimizer states. Cloud GPUs provide ample VRAM and simplify the use of techniques like sharding or larger batch sizes.
Don’t undervalue your time: Spending a little more on faster cloud compute can save a significant amount of time versus trying to run everything locally on a single GPU.

Quick Q&A

Autoscaling: Modal supports autoscaling with configurable parameters.
- Explanation: How does autoscaling work on Modal?
- Auto-scaling LLM inference endpoints: Hosting any LLaMA 3 model with Text Generation Inference (TGI)
Docker Image Access: Downloading built Docker images is not currently supported. Users can build and provide their own images.
Inference Serving:
- vLLM for its ease of use and rapid development
- TensorRT-LLM is a potentially faster but more involved alternative.
Demo Preparation: “Hello World” and “TRT LLM” examples are good starting points.
- Example: Hello, world!
- Example: Serverless TensorRT-LLM (LLaMA 3 8B)

About Me:

I’m Christian Mills, an Applied AI Consultant and Educator.

Whether I’m writing an in-depth tutorial or sharing detailed notes, my goal is the same: to bring clarity to complex topics and find practical, valuable insights.

If you need a strategic partner who brings this level of depth and systematic thinking to your AI project, I’m here to help. Let’s talk about de-risking your roadmap and building a real-world solution.

Start the conversation with my Quick AI Project Assessment or learn more about my approach.

Understanding Modal

Startup Times and Optimization

Local Development and Modal Integration

Iterative Development Workflow

Modal for CPU-Intensive Workloads

Cost Comparison and Value Proposition

Understanding Modal’s Cost Structure

Streaming Output from LLMs

Code Portability and Modal Dependency

Data Privacy

Running Databases on Modal

Balancing Cost and Uptime for GPU Inference

Local vs. Cloud Workload Distribution

Quick Q&A