Notes on ICML 2024 Tutorial: Physics of Language Models
- YouTube Recording: ICML 2024 Tutorial: Physics of Language Models
- Project Page: Physics of Language Models
- Speaker: Zeyuan Allen-Zhu
Introduction
Spectrum of “Theory” in Language Models
- The term “theory” in the context of language models encompasses a broad spectrum, ranging from rigorous mathematical proofs to empirical observations (ethology).
- Mathematical Theory: Involves proving theorems about learnability, often with idealistic assumptions and limited applicability to real-world, deep networks. Progress is slow.
- Ethology (Animal Behavior Science): Involves experimenting with large language models (LLMs) like GPT-4 through APIs, leading to discoveries like “chain of thought”. Progress is very rapid.
- Pros and Cons:
- Mathematical Theory:
- Pros: Rigorous theorems.
- Cons: Idealistic assumptions, shallow networks, slow progress, limited practical relevance.
- Ethology:
- Pros: Accessible to everyone, potential for significant discoveries (e.g., chain of thought).
- Cons: Concerns about scientific rigor (data contamination, lack of control, model specificity, limited internal understanding).
- Historical Context: The slow, patient progress of scientific discovery in the past (e.g., Newton’s laws building upon Kepler’s laws, which in turn built upon Tycho Brahe’s observations) contrasts sharply with the rapid pace of current AI development.
- Analogy: The analogy of Newton’s laws to mathematical theory and Tycho’s observations to ethology is not entirely accurate. There’s a gap between simply observing LLM behavior and developing a true “physics” of language models.
Concerns with Purely Ethological Approaches
- Data Concerns: Studying models trained on internet data may lack scientific rigor due to biases, bugs (e.g., parity check failures in GPT-4), and the need for controlled studies.
- Model Specificity: Observations might be specific to a particular model version (e.g., a bug in a specific GPT-4 release) and not generalizable.
- Data Contamination: Benchmarks like GSM8K can be compromised by unintentional data leakage (e.g., translating problems into other languages and posting them online).
- Lack of Internal Understanding: Observing external behavior reveals little about the internal workings and failure modes of LLMs. Geocentrism analogy: Observing the sun and moon’s movement doesn’t reveal the true heliocentric model.
The Physics of Language Models: A Proposed Approach
- Decomposition: Break down “intelligence” into building blocks (language structures, knowledge, reasoning) and study them individually.
- Synthetic Data: Use controlled, idealized synthetic datasets to manipulate variables (difficulty, type, amount, format) and understand their impact.
- Repeatability: Focus on smaller models (e.g., 100 million parameters) to enable repeated, controlled experiments, which are infeasible with multi-billion parameter models. Universal laws can still be derived.
- Probing: Investigate the inner workings of language models to understand how they function.
Presentation Structure
- The presentation covers three main parts, presented in reverse order:
- Language Structures: How LLMs learn language structures, focusing on context-free grammars (CFGs) (joint work with Professor Yuanzhi Li).
- Reasoning: How LLMs perform reasoning, specifically at the level of grade-school math (joint work with Tian Ye, Zicheng Xu, and Yuanzhi Li ).
- Knowledge: How LLMs acquire and manipulate knowledge (joint work with Professor Yuanzhi Li).
Part 3: Knowledge
- YouTube Recording: Physics of Language Models: Part 3.1 + 3.2, Knowledge Storage, Extraction and Manipulation
- Paper: Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
- Paper: Physics of Language Models: Part 3.2, Knowledge Manipulation
- Paper: Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
3.1 Knowledge Extraction
Introduction
- Problem: LLMs often fail simple knowledge manipulation tasks (e.g., parity checks on birth years, comparing celebrity birth dates).
- Prerequisite to Studying Manipulation: Before assessing manipulation, it’s crucial to determine if the model can even extract the relevant knowledge from its pre-training data. Can the model retrieve a celebrity’s birth year?
- Controlled Experiments: It is essential to conduct controlled experiments to determine the model’s ability to:
- A. Extract knowledge.
- B. Avoid data contamination (e.g., the question being revealed through publication).
- C. Manipulate that knowledge.
- D. Understand the concepts needed for manipulation (e.g., even/odd).
Synthetic Biography Dataset
Data Generation: Create synthetic biography data for N individuals, using sentence templates or LLMs.
Example Biography:
Anya Briar Forger was born on October 2, 1996. She spent her early years in Princeton, NJ. She received mentorship and guidance from faculty members at MIT. She completed her education with a focus on Communications. She had a professional role at Meta Platforms. She was employed in Menlo Park, CA.
Attributes: Each person has six attributes:
- birth date
- birth city
- university
- major
- employer
- work city
Question-Answer (QA) Data: Generate six QA pairs per person, one for each attribute. This acts as instruction fine-tuning data.
Example QA:
What is the birth date of Anya Briar Forger? Answer: October 2, 1996.
Which university did Anya Briar Forger study? Answer: MIT.
Which company did Anya Briar Forger work for? Answer: Meta Platforms.
What is the birth city of Anya Briar Forger? Answer: Princeton, NJ.
What major did Anya Briar Forger study? Answer: Communications.
Where did Anya Briar Forger work? Answer: Menlo Park, CA.
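Below is a minimal sketch of how such template-based biography and QA data could be generated. The attribute pools, names, and the single biography template are illustrative stand-ins, not the paper's exact bioS configuration:

```python
import random

# Illustrative attribute pools (the real dataset uses much larger pools).
FIRST = ["Anya", "Liam", "Mona"]
MIDDLE = ["Briar", "Quinn", "Lee"]
LAST = ["Forger", "Stone", "Alvarez"]
MONTHS = ["January", "April", "July", "October"]
CITIES = ["Princeton, NJ", "Austin, TX", "Boise, ID"]
UNIVERSITIES = ["MIT", "UCLA", "Duke"]
MAJORS = ["Communications", "Biology", "Physics"]
EMPLOYERS = {"Meta Platforms": "Menlo Park, CA", "IBM": "Armonk, NY"}

# One fixed biography template (a single template per person, i.e., no augmentation yet).
BIO_TEMPLATE = (
    "{name} was born on {bdate}. She spent her early years in {bcity}. "
    "She received mentorship and guidance from faculty members at {univ}. "
    "She completed her education with a focus on {major}. "
    "She had a professional role at {employer}. She was employed in {wcity}."
)

QA_TEMPLATES = {
    "birth date": "What is the birth date of {name}?",
    "birth city": "What is the birth city of {name}?",
    "university": "Which university did {name} study?",
    "major": "What major did {name} study?",
    "employer": "Which company did {name} work for?",
    "work city": "Where did {name} work?",
}

def sample_person(rng):
    """Draw one synthetic individual with the six attributes."""
    employer, work_city = rng.choice(list(EMPLOYERS.items()))
    return {
        "name": f"{rng.choice(FIRST)} {rng.choice(MIDDLE)} {rng.choice(LAST)}",
        "birth date": f"{rng.choice(MONTHS)} {rng.randint(1, 28)}, {rng.randint(1900, 2099)}",
        "birth city": rng.choice(CITIES),
        "university": rng.choice(UNIVERSITIES),
        "major": rng.choice(MAJORS),
        "employer": employer,
        "work city": work_city,
    }

def biography(p):
    """One biography paragraph rendered from the fixed template."""
    return BIO_TEMPLATE.format(name=p["name"], bdate=p["birth date"],
                               bcity=p["birth city"], univ=p["university"],
                               major=p["major"], employer=p["employer"],
                               wcity=p["work city"])

def qa_pairs(p):
    """Six QA pairs, one per attribute (the instruction fine-tuning data)."""
    return [(QA_TEMPLATES[a].format(name=p["name"]), p[a]) for a in QA_TEMPLATES]

rng = random.Random(0)
person = sample_person(rng)
print(biography(person))
for question, answer in qa_pairs(person):
    print(f"{question} Answer: {answer}.")
```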
Experiment Setup
- Training/Test Split: Reveal only half of the QA data during training.
- Out-of-Distribution Evaluation: Evaluate the model on the remaining half of the individuals.
- Knowledge Extraction: If the model performs well on the test set, it demonstrates knowledge extraction – generalizing the ability to answer questions to new individuals based on their biographies. Performance on the training set only demonstrates memorization.
Result: Mixed Training
- Mixed Training: If biography data and QA data are mixed during pre-training, the model achieves high accuracy (86.6%) on out-of-distribution knowledge extraction.
- Practical Scenario (Not Mixed): In practice, pre-training (e.g., on Wikipedia) and instruction fine-tuning are separate. This leads to very poor knowledge extraction.
- Universality: This failure is independent of model size, architecture (GPT, GPT-2, LLaMA), data size, and training parameters. Over 500 experiments consistently showed near-0% accuracy.
Result: Knowledge Augmentation
- Catch: The initial experiments used only one biography per person.
- Knowledge Augmentation: Generate multiple biography entries per person, using different writing styles, permutations, or translations.
- Impact: With knowledge augmentation (e.g., five biographies per person, generated as in the sketch below), accuracy dramatically increases (96%).
- Conclusion: Unless mixed training is used, knowledge augmentation is absolutely necessary for knowledge extraction.
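A minimal sketch of the augmentation idea: produce several rewrites of the same person's biography by permuting sentence order and swapping sentence templates (translations or LLM paraphrases would work analogously). The templates below are my own, not the paper's:

```python
import random

# Two alternative sentence templates per attribute (illustrative only).
SENTENCE_TEMPLATES = {
    "birth date": ["{name} was born on {v}.", "The birth date of {name} is {v}."],
    "birth city": ["{name} spent her early years in {v}.", "{name} is originally from {v}."],
    "university": ["{name} studied at {v}.",
                   "{name} received mentorship from faculty members at {v}."],
    "employer":   ["{name} had a professional role at {v}.", "{name} worked for {v}."],
}

def augmented_biographies(person, n_variants=5, seed=0):
    """Return several rewrites of one biography: different sentence templates
    and a different ordering of the attribute sentences each time."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        attrs = list(SENTENCE_TEMPLATES)
        rng.shuffle(attrs)  # permute the order in which attributes appear
        sentences = [rng.choice(SENTENCE_TEMPLATES[a]).format(name=person["name"], v=person[a])
                     for a in attrs]
        variants.append(" ".join(sentences))
    return variants

person = {"name": "Anya Briar Forger", "birth date": "October 2, 1996",
          "birth city": "Princeton, NJ", "university": "MIT",
          "employer": "Meta Platforms"}
for bio in augmented_biographies(person):
    print(bio)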
Probing: Where and How is Knowledge Stored?
- Probing Technique: Feed a pre-trained model (e.g., GPT-2) with a biography entry and examine the hidden states of the last layer.
- Focus: Probe for specific knowledge (e.g., employer name) at different token positions.
- Observation (No Augmentation):
- Without knowledge augmentation, previous token positions (before the employer name) show near-zero probing accuracy.
- The model learns the “wrong logic”, storing information jointly with preceding values, rather than associating it directly with the key (person’s name).
- Example: The model may store that someone born on October 2nd, 1996, in Princeton, who studied communications at MIT works for Meta.
- Mathematical Form (No Augmentation): [value 5] (employer) is stored in a tuple defined by the key and all preceding values.
- Observation (With Augmentation): With knowledge augmentation, the model stores knowledge differently. The hidden state immediately after the person’s name already encodes the employer name.
- Mathematical Form (With Augmentation): [value 5] is directly stored with the key (person’s name).
- Conclusion: Knowledge augmentation changes how knowledge is stored, which in turn affects its extractability via instruction fine-tuning (a probing sketch follows below).
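A minimal sketch of the probing setup: fit a linear classifier from a chosen hidden state to the attribute value. The hidden states below are random placeholders; in practice they would be the pre-trained model's last-layer states at the relevant token position of each biography (e.g., right after the person's name, or right before the employer name):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in a real probe, extract last-layer hidden states from the
# pre-trained transformer at one fixed token position per biography.
n_people, hidden_dim, n_employers = 2000, 768, 20
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(n_people, hidden_dim))     # placeholder features
employer_ids = rng.integers(0, n_employers, size=n_people)  # placeholder labels

# Train a linear probe on half the people, test on the other half.
split = n_people // 2
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:split], employer_ids[:split])
accuracy = probe.score(hidden_states[split:], employer_ids[split:])
print(f"Probing accuracy at this token position: {accuracy:.2%}")
# With real hidden states, the talk reports near-zero accuracy at positions
# before the employer name without augmentation, and high accuracy right after
# the person's name with augmentation.
```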
Result: Celebrity Helps Minorities
- Controlled Experiment: Consider a dataset with celebrities (multiple biographies per person) and minorities (one biography per person).
- Training: Pre-train on both groups, but fine-tune only on the celebrities’ QA data.
- Observation: Knowledge extraction accuracy for the minorities is high, even though they had no knowledge augmentation and weren’t part of the fine-tuning data.
- Explanation (Probing): The inclusion of celebrity data teaches the model to store knowledge in the correct format, benefiting even the minorities.
- “Donald Trump Effect”: The existence of multiple Donald Trump biographies improves LLMs’ ability to extract knowledge about minorities.
- Conclusion: Augmenting only part of the data (e.g., celebrities) can lead to knowledge extraction for all individuals.
Summary of 3.1
- Distinction: There’s a crucial difference between knowledge storage and knowledge extraction. Memorization doesn’t guarantee extractability.
- Extractability Requirements:
- Mixed training.
- Knowledge augmentation.
- Bidirectional Models (BERT, DeBERTa): Fail at knowledge extraction even with mixed training and augmentation. (paper)
3.2 Knowledge Manipulation
Introduction
- Assumption: Assume knowledge is fully extractable (based on the findings of 3.1).
- Focus: Study LLMs’ ability to manipulate knowledge.
- Simplest Task: Knowledge classification (e.g., classifying months into even/odd categories).
Knowledge Classification Experiment
Setup: Pre-train on biographies, fine-tune to extract birth dates.
Classification Task: Classify the 12 months into two categories (even/odd).
With and Without Chain of Thought (COT):
Without COT: Direct answer (yes/no).
Was Anya Briar Forger born in an even month? Answer (without CoT): Yes
With COT: Explicitly state the birth month, then answer yes/no.
Was Anya Briar Forger born in an even month? Answer (with CoT): October; so it is Yes
Fine-tuning: Fine-tune sufficiently to achieve perfect accuracy on the training set.
Out-of-Distribution Evaluation: Evaluate on the remaining half of the individuals.
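For concreteness, here is a minimal sketch of how the two fine-tuning targets could be serialized; the exact wording is illustrative, not the paper's:

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def classification_example(name, birth_month, use_cot):
    """Serialize one even/odd-month classification example, with or without CoT."""
    question = f"Was {name} born in an even month?"
    label = "Yes" if (MONTHS.index(birth_month) + 1) % 2 == 0 else "No"
    if use_cot:
        answer = f"{birth_month}; so it is {label}"   # state the knowledge first
    else:
        answer = label                                # direct answer only
    return f"{question} Answer: {answer}"

print(classification_example("Anya Briar Forger", "October", use_cot=False))
print(classification_example("Anya Briar Forger", "October", use_cot=True))
```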
Result: COT is Crucial for Knowledge Manipulation
- Observation (Without COT): Out-of-distribution accuracy is extremely low (near random guessing).
- Observation (With COT in Training): Including COT in training does not improve accuracy during evaluation without COT.
- Conclusion: Knowledge manipulation (even the simplest form) requires COT both during training and inference. The model must explicitly state the knowledge before manipulating it.
- Contrast with Reasoning: This is different from reasoning tasks (e.g., adding small numbers), where LLMs can skip steps.
- This is a statement only discoverable via controlled experiments.
Result: Knowledge Inverse Search is Impossible
Inverse Search Task: Fine-tune the model to answer questions like “Who was born on [date] in [city] and works for [employer]?”
Question: Who was born on October 2, 1996, in Princeton, NJ, studied Communications at MIT, and worked for Meta Platforms at Menlo Park, CA?
Answer: Anya Briar Forger
Out-of-Distribution Evaluation: Evaluate on the remaining half of the individuals.
Observation:
- Zero accuracy, regardless of model size, data size, training method (mixed training, fine-tuning, knowledge augmentation), or fine-tuning parameters.
- Hundreds of pre-training regimes were tested.
Exception: Inverse search is only possible if knowledge is already reversed in the pre-training data (e.g., person’s name at the end of the biography).
Paper on Knowledge Reversal:
- A separate paper with Meta colleagues explores how to practically reverse knowledge.
- Reversal must happen in the pre-training phase.
- Changing to a bi-directional model (like BERT) does not solve this.
- Paper: Reverse Training to Nurse the Reversal Curse
Conclusion: Knowledge inverse search is generally impossible without pre-training data modification.
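A minimal sketch of the exception noted above: write the pre-training text with the attributes first and the person's name at the end, so that the attributes-to-name direction is what autoregressive training actually sees. The phrasing is my own; the Reverse Training paper studies more general reversal schemes:

```python
def reversed_biography(person):
    """Biography variant with the person's name at the end (attributes -> name)."""
    facts = (f"Born on {person['birth date']} in {person['birth city']}, "
             f"studied {person['major']} at {person['university']}, "
             f"worked for {person['employer']} in {person['work city']}.")
    return f"{facts} This person is {person['name']}."

person = {"name": "Anya Briar Forger", "birth date": "October 2, 1996",
          "birth city": "Princeton, NJ", "university": "MIT",
          "major": "Communications", "employer": "Meta Platforms",
          "work city": "Menlo Park, CA"}
print(reversed_biography(person))
```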
Connections to Practice
- Parity Checks and Ranking: GPT-4 and LLaMA also fail at parity checks and ranking tasks (comparing celebrity birth dates) without COT. With COT, accuracy improves significantly.
- Chinese Idiom Task: GPT-4 fails at filling in missing characters in Chinese idioms (a form of inverse search), demonstrating the practical limitations.
- Turing Test: These failures can distinguish current AI models from humans, who can perform these tasks mentally without explicit statements.
Result (Skipped): Knowledge Partial Search
- Language models might be able to fully extract a piece of knowledge (e.g., the complete birth date) yet fail to extract one of its component parts (e.g., just the birth year).
- Related to “multi-token prediction” work from Meta colleagues: Predicting multiple future tokens can change knowledge storage and improve capabilities.
Summary of 3.2
- The model must state knowledge explicitly before manipulating it.
- Knowledge inverse search is impossible unless the knowledge is reversed in the pre-trained data.
- A concurrent work refers to this as the “reversal curse”: If a model learns “A is B”, it doesn’t learn “B is A.”
3.3 Scaling Laws for Knowledge Capacity
Introduction
- Goal: Determine the relationship between model size and knowledge storage capacity.
- “Bit” Definition: Information-theoretic bits in the dataset.
Measuring Information Bits in Synthetic Data
Random Generation:
If birth dates are uniformly drawn from \(12 \text{ (months)} \times 28 \text{ (days)} \times 200 \text{ (years)}\) possibilities, each one carries \(\log_2(12 \times 28 \times 200) \approx 16.04\) bits.
If birth cities are uniformly drawn from \(300\) US cities, each one carries \(\log_2(300) \approx 8.23\) bits.
General Formula: A formula can be created to calculate the information content of any synthetic knowledge dataset, regardless of writing style variations.
bioD: a synthetic dataset with hyperparameters:
- \(N\) — number of distinct names, drawn from \(N_0\) possible names
- \(K\) — number of knowledge attributes
- \(T\) — vocabulary size
- \(C, L\) — each attribute value consists of \(C\) chunks, each of length \(L\)
- \(D\) — each attribute value is drawn from a set of diversity \(D\)
Total information content (evaluated in the sketch below):
\(\log_2 \binom{N_0}{N} + NKC \log_2 D + K \log_2 \binom{T^L}{D}\)
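A minimal sketch that simply evaluates this formula as transcribed above; the hyperparameter values are made up for illustration, not the paper's:

```python
from math import lgamma, log, log2

def log2_comb(n, k):
    """log2 of the binomial coefficient C(n, k), via lgamma to avoid overflow."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def bioD_bits(N, N0, K, C, L, T, D):
    # log2 C(N0, N) + N*K*C*log2(D) + K*log2 C(T^L, D), as written in the notes.
    return log2_comb(N0, N) + N * K * C * log2(D) + K * log2_comb(T ** L, D)

# Illustrative hyperparameters (not from the paper):
print(f"{bioD_bits(N=100_000, N0=400**3, K=6, C=2, L=1, T=100, D=40):.3e} bits")
```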
Scaling Law Experiment
- Pre-training: Pre-train a language model on synthetically generated knowledge data.
- Knowledge Measurement: Calculate the amount of knowledge learned by the model (accounting for partial correctness).
- Major Discovery: LLMs consistently achieve two bits per parameter in knowledge storage, if sufficiently trained.
Universality of the Two Bits Per Parameter Scaling Law
- Model Size, Depth, Width: Holds for a wide range of model sizes, depths, and widths (as long as the transformer has at least two layers).
- Data Types: Regardless of the specific parameters of the synthetic knowledge data.
- Rewriting: Independent of how the data is rewritten.
- Training Parameters: Holds for a wide range of training parameters.
Conjecture: 7 Billion Parameters for Human Knowledge
- Based on an estimate of the information content of English Wikipedia and textbooks, a 7-billion parameter model should be sufficient to store all such knowledge.
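A rough back-of-the-envelope reading of this conjecture, using the two bits per parameter law above (my arithmetic, not a figure from the talk):
\(7 \times 10^{9} \text{ parameters} \times 2 \text{ bits/parameter} = 1.4 \times 10^{10} \text{ bits} \approx 1.75 \text{ GB}\) of knowledge, to be compared against an estimate of the information content of English Wikipedia plus textbooks.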
Sufficient Training: 1000 Exposures
- Definition: Each piece of knowledge needs to be exposed approximately 1000 times during pre-training to reach the two bits per parameter capacity.
- Exposure: Doesn’t mean 1000 training passes; it means the same knowledge, possibly in different writing styles, is seen 1000 times.
- Controlled Experiment: If each piece of knowledge is exposed the same number of times (e.g., 1000), the two bits per parameter scaling law holds.
- With the data size fixed, increasing the model size doesn’t increase the knowledge learned beyond the data’s inherent information content.
- Below that saturation point, the model’s knowledge capacity closely follows two bits per parameter.
Insufficient Training: Rare Knowledge
- 100 Exposures: If knowledge is exposed only 100 times (rare knowledge), the capacity decreases to approximately one bit per parameter.
- Architecture Differences: With rare knowledge, differences between model architectures emerge.
- GPT-2: Performs better.
- LLaMA, Mistral: Perform worse (by a factor of 1.3).
- MLP Layers: Reducing the size of GPT-2’s MLP layers doesn’t significantly affect capacity, but removing them does.
- Disclaimers: This comparison is only for knowledge capacity and only for rare knowledge.
Gated MLP is the Culprit
- Controlled Experiment: By systematically comparing GPT-2 (rotary version) and LLaMA (which have several architectural differences), it’s found that the gated MLP in LLaMA is responsible for the reduced knowledge capacity.
- Fix: Replacing LLaMA’s gated MLP with a standard MLP restores the one bit per parameter capacity (a 30% improvement).
Result: Mixed Quality Data
- Controlled Experiment: Compare training on:
- Scenario 1: Only “good” data (rich in knowledge, 100 exposures per piece).
- Scenario 2: “Good” data (100 exposures) and “bad” data (junk data).
- Observation: A 20-fold difference in the amount of “good” knowledge stored. The mere presence of junk data significantly harms the LLM’s ability to learn from the good data.
- Increasing training time on the “good” data in Scenario 2 does not fully compensate for the harm caused by the junk data.
Solution: Domain Tokens
- Technique: Prepend each piece of pre-training data with a domain token (e.g., the domain name or URL).
- Impact: Significantly mitigates the negative impact of junk data.
- 20x worse becomes 10x worse.
- 3x worse becomes fully restored.
- Mechanism: LLMs automatically learn to prioritize high-quality domains without explicit instruction.
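A minimal sketch of the technique; the special-token format below is my own, as the talk only says to prepend the domain name or URL to each piece of pre-training data:

```python
def add_domain_token(doc_text, domain):
    """Prepend a domain marker so the model can condition on the data source."""
    return f"<domain:{domain}> {doc_text}"

corpus = [
    ("en.wikipedia.org", "Anya Briar Forger was born on October 2, 1996. ..."),
    ("random-junk.example", "lorem ipsum spam spam spam ..."),
]
training_docs = [add_domain_token(text, domain) for domain, text in corpus]
for doc in training_docs:
    print(doc)
```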
Summary of 3.3
- Sufficient Training: Two bits per parameter capacity, regardless of architecture.
- Insufficient Training (Rare Knowledge): Architecture matters; GPT-2’s standard MLP outperforms LLaMA’s gated MLP.
- Quantization and MOEs (Skipped): Int8 quantization maintains the two bits per parameter capacity (4:1 compression ratio).
- Mixed Quality Data: Domain tokens are crucial for mitigating the negative impact of junk data.
Reflection on the “Physics of Language Models” Approach (End of Part 3)
- Knowledge Focus: Part 3 focused solely on knowledge, using synthetic data.
- Small Models:
- Most results are replicable with 100-million parameter models, enabling extensive controlled experiments (data variations, training process tweaks, architecture modifications).
- A single H100 (or eight V100s) can pre-train such a model in a day.
- Even scaling down the synthetic data by 5x maintains the validity of the results.
- Probing: All statements are supported by probing, revealing the internal workings of the models.
Part 2: Reasoning
- YouTube Recording: Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- YouTube Recording: Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
- Paper: Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process
- Paper: Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
- GitHub Repository: iGSM
2.2 Learning from Mistakes
Introduction
- Discovery: LLMs often know they have made mistakes.
Regretful Behavior
- Mistake Type: The model starts to compute a parameter but then realizes it’s not ready.
- Probing: Probing at the point of the mistake reveals the model’s internal state shows “regret” – it wants to go back.
Experiment: Allowing the Model to Go Back
- Error Detector: A model pre-trained on correct data can act as an error detector (through probing or fine-tuning).
- Assisted Generation: Use the error detector to trigger backtracking during generation.
- Result: Only a small improvement (2%).
- Drawbacks:
- Requires two models (generator and detector).
- Limited improvement because it relies on randomness for correction (regeneration), similar to beam search (which gives zero improvement).
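A toy sketch of detector-assisted generation, with both the generator and the error detector stubbed out; in the talk, the detector comes from probing or fine-tuning a model trained on correct data:

```python
import random

def propose_step(state, rng):
    """Placeholder generator: randomly pick a parameter to compute next."""
    return rng.choice(state["remaining"])

def detector_flags_error(state, step):
    """Placeholder detector: a step is 'not ready' if a dependency is missing."""
    return any(dep not in state["computed"] for dep in state["deps"][step])

def assisted_generation(deps, rng, max_retries=10):
    """Generate a computation order; when the detector flags a proposed step,
    back up and resample, relying on randomness to eventually get it right."""
    state = {"deps": deps, "computed": set(), "remaining": sorted(deps)}
    order = []
    while state["remaining"]:
        for _ in range(max_retries):
            step = propose_step(state, rng)
            if not detector_flags_error(state, step):
                break  # detector accepts this step
        order.append(step)
        state["computed"].add(step)
        state["remaining"].remove(step)
    return order

deps = {"A": [], "B": ["A"], "C": ["A", "B"]}  # C depends on A and B, etc.
print(assisted_generation(deps, random.Random(0)))  # prints an order such as ['A', 'B', 'C']
```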
Pre-training with Mistakes and Corrections
Data Modification: Introduce mistakes (with probability \(p\)) and corrections (a “[BACK]” token) into the synthetic math dataset; see the sketch below.
Autoregressive Training: The model still uses autoregressive language modeling; it sees its previous mistakes.
Result: Significant accuracy gain.
78% → 95% (med, op=23)
84% → 96% (hard, op=32)
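A minimal sketch of the data modification, using the mistake type described earlier (prematurely stating a step that is not ready yet) followed by a [BACK] correction token. The step strings and serialization are illustrative:

```python
import random

def inject_mistakes(solution_steps, p=0.2, seed=0):
    """With probability p before each step, insert a premature step (one copied
    from later in the solution, i.e., not ready yet) followed by [BACK]."""
    rng = random.Random(seed)
    out = []
    for i, step in enumerate(solution_steps):
        if rng.random() < p and i + 1 < len(solution_steps):
            premature = rng.choice(solution_steps[i + 1:])
            out.extend([premature, "[BACK]"])
        out.append(step)
    return out

steps = ["x = 3 + 4 = 7", "y = 2 * x = 14", "z = y - x = 7", "answer = z = 7"]
print(inject_mistakes(steps, p=0.5))
```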
Properties of Training with Mistakes
Higher \(p\) is Better: More mistakes during training lead to better performance.
| \(p\) | 0.05 | 0.1 | 0.2 | 0.5 |
| --- | --- | --- | --- | --- |
| Medium | 78% | 84% | 91% | 92% |
| Hard | 84% | 89% | 88% | 93% |

No Inference-Time Mistakes: Even with high \(p\), the model doesn’t make more mistakes during inference (due to temperature 0 or beam search).
No Label Masking Needed: Label masking (preventing the model from learning from mistakes) is unnecessary.
Shortest Solutions: The model still generates the shortest solutions (level 1 and 2 reasoning).
Pre-training is Crucial
- Fine-tuning Fails: Fine-tuning a model (pre-trained on correct data) with mistake/correction data does not improve performance. Error correction is a much harder skill than error detection and must be learned during pre-training.
Generating Fake Mistakes in Practice
Dumber Idea (Works): Create fake mistakes by inserting a future sentence from the solution into an earlier position.
78% → 91% (med, op=23); 84% → 92% (hard, op=32)
Smarter Idea (Doesn’t Work): Create fake mistakes by inserting a random unused problem parameter.
78% → 87% (med, op=23); 84% → 87% (hard, op=32)
Conclusion: The dumber, cheaper method is more effective.
Slogan: “Pre-train with fake mistakes and no more regret.”
Summary of 2.2
- LLMs exhibit regret when making mistakes.
- Pre-training with mistakes and corrections is crucial for learning error correction.
- Fine-tuning and beam search are insufficient.
- Fake mistakes can be easily generated and are effective.
Reflection on the “Physics of Language Models” Approach (End of Part 2)
- Reasoning Focus: Part 2 focused solely on reasoning, using synthetic data.
- Small Models: 100-million parameter models were sufficient.
- Controlled Experiments: Manipulated data difficulty, mistake types, and training processes.
- Probing: Used probing to understand reasoning, mistakes, and the relationship between model depth and reasoning length.
Part 1: Language Structures
Introduction
- Two Goals:
- Interpretation Beyond Tokens: Provide precise interpretations of how LLMs learn non-trivial, hierarchical algorithms, going beyond simple token-level interpretations (like induction heads).
- Learning Language Structures: Understand how LLMs learn complex language structures, addressing the question of “format learning” (hallucination).
- Hallucination arises because models learn the “format” faster than the underlying “task”.
Context-Free Grammars (CFGs)
- Approach: Study how LLMs learn CFGs, using synthetic CFGs that are intentionally difficult.
- CFG Generation: Generate sentences from a CFG tree by recursively applying rules, starting from the root.
- Synthetic CFG Design:
- Small terminal vocabulary (just the symbols 1, 2, 3) to make local, greedy parsing difficult.
- Large number of possible sentences (e.g., \(10^{80}\)) to prevent memorization.
- CFGs vs. English Grammar: Synthetic CFGs are much harder than English grammar, requiring dynamic programming for parsing (not just greedy approaches).
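A minimal sketch of the data-generation procedure on a toy grammar: sample sentences by recursively expanding rules from the root. The real grammars used in the talk are far deeper and produce much longer sentences; only the tiny terminal vocabulary is mirrored here:

```python
import random

# Toy CFG: nonterminals map to lists of possible right-hand sides;
# anything not in the dict (here "1", "2", "3") is a terminal.
GRAMMAR = {
    "ROOT": [["A", "B"], ["B", "A", "A"]],
    "A":    [["C", "C"], ["C", "B"]],
    "B":    [["C", "3"], ["3", "C"]],
    "C":    [["1", "2"], ["2", "1"], ["1", "1"]],
}

def sample(symbol, rng):
    """Recursively expand `symbol`; terminals are returned as-is."""
    if symbol not in GRAMMAR:
        return [symbol]
    rule = rng.choice(GRAMMAR[symbol])
    tokens = []
    for s in rule:
        tokens.extend(sample(s, rng))
    return tokens

rng = random.Random(0)
for _ in range(3):
    print(" ".join(sample("ROOT", rng)))
```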
Experiment: Pre-training on CFG Data
- Models:
- GPT (vanilla, absolute positional embedding).
- GPT (rotary embedding).
- GPT (relative attention).
- “GPT Stupid” (uniform attention with exponentially increasing spans).
- Metrics:
- Accuracy (generating valid sentences from a valid prefix)
- Diversity
- Distribution difference (KL divergence)
- Observation:
- Relative attention and rotary embedding GPTs perform well.
- Vanilla GPT performs poorly.
- “GPT Stupid” performs surprisingly well.
Conclusion: Importance of Relative Attention
- Strong connection between rotary embedding/relative attention and the ability to learn language structures.
- Rotary embedding is preferred in practice (LLaMA, Mistral) for efficiency, but relative attention is slightly better.
- “GPT Stupid” demonstrates that even uniform attention with varying spans is beneficial, suggesting that future attention-free models should incorporate this concept.
How LLMs Learn CFGs: Probing
- Hidden CFG Trees: The model doesn’t see the underlying CFG tree, only the generated sentences.
- Probing: Does the model secretly learn to parse the CFG trees? Are the parsing trees encoded in the hidden embeddings?
- Answer: Yes, the model learns the CFG trees, and the information is stored locally in the hidden states. The information about each subtree is linearly encoded around its ending position.
- BERT Doesn’t: Encoder-based models like BERT do not learn the CFGs in this way. Masked language modeling is an easier task than autoregressive language modeling and does not require full parsing.
Dynamic Programming (DP)
- Human Parsing: Humans use dynamic programming to parse CFGs.
- DP States: `DP(i, j, a)` represents whether symbol `a` can generate the subsequence from `i` to `j`.
- DP Transition Functions: Connect DP states to determine larger subtrees (see the CYK-style sketch after this list).
- Observation (Probing):
- DP states are locally stored in the hidden states.
- Attention patterns in the transformer precisely serve as DP transition functions.
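A minimal sketch of this parsing DP, i.e., the classic CYK algorithm on a toy grammar in Chomsky normal form. The grammar is illustrative; the point is the shape of the DP states and transitions that probing finds inside the transformer:

```python
# Toy grammar in Chomsky normal form, reusing the terminal vocabulary {1, 2, 3}.
BINARY = {            # a -> b c
    "ROOT": [("A", "B")],
    "A": [("C", "C")],
    "B": [("C", "T3")],
}
UNARY = {             # a -> terminal
    "C": ["1", "2"],
    "T3": ["3"],
}

def cyk(tokens):
    """dp[i][j] holds every nonterminal that can generate tokens i..j (inclusive)."""
    n = len(tokens)
    dp = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for a, terminals in UNARY.items():
            if tok in terminals:
                dp[i][i].add(a)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                 # split point: i..k and k+1..j
                for a, rules in BINARY.items():
                    for b, c in rules:
                        if b in dp[i][k] and c in dp[k + 1][j]:
                            dp[i][j].add(a)       # DP transition: combine two subtrees
    return dp

tokens = ["1", "2", "2", "3"]
dp = cyk(tokens)
print("ROOT" in dp[0][len(tokens) - 1])  # can ROOT generate the whole sequence?
```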
Two Levels of Dynamic Programming
- Parsing DP: Determining if a symbol can generate a subsequence.
- Generation DP: Determining the next token and its probability given a prefix. This requires another, less-known level of dynamic programming.
- Observation (Probing): Both levels of DP (states and transition functions) are present in the trained transformer.
Summary of Part 1
- GPTs can learn long, synthetic CFGs, requiring non-trivial planning and dynamic programming (harder than topological sort).
- Probing reveals DP states in hidden states and DP transition functions in attention mechanisms.
- BERT doesn’t learn CFGs in the same way; language modeling is a harder task.
- GPTs can learn implicit/corrupted CFGs (details in the paper).
- The dynamic programming used is non-trivial, unknown to many software engineers and interview candidates. This surpasses the speaker’s abilities at age 17. GPT-4 is likely to perform well, but it has seen dynamic programming in training materials. However, GPT-2 has learned this without seeing any definitions of dynamic programming.
Final Thoughts: Future Science
- Synthetic Data: Synthetic data is becoming increasingly important as real-world data becomes exhausted.
- GPT-5/GPT-6: To surpass current limitations (e.g., GPT-4’s reasoning limit), synthetic data will be necessary.
- Research Questions: What are the optimal formats for synthetic data to maximize knowledge acquisition and reasoning abilities?
- AGI: This research is crucial for building language models that approach AGI.