Llama 3 Secrets Every Engineer Must Know
Data, training procedures, and compute are usually the closely guarded secrets of big labs, available only to the GPU-rich. The Llama 3 paper is remarkable because it divulges an unprecedented amount of detail about the nitty-gritty of training one of these big beasts. In terms of FLOPs, the model is close to the threshold mentioned in the US executive order, which may mean next-generation models will fall under its reporting requirements.
Data Mix and Recipe:
Llama 3 was trained on approximately 15 trillion multilingual tokens, a significant increase from previous versions.
The data mix includes roughly 50% general-knowledge tokens, 25% mathematical and reasoning tokens, 17% code tokens, and 8% multilingual tokens (a minimal sampling sketch follows this list).
They used extensive data cleaning and filtering techniques, including HTML boilerplate removal, deduplication, and quality classifiers.
The team employed a novel "annealing" phase where small amounts of high-quality data (especially for math and code) were introduced near the end of pre-training. The annealing phase is particularly interesting, as it allows the model to adapt to high-quality data without catastrophic forgetting of general knowledge.
Synthetic data generation played a major role, with models used to create and filter high-quality examples across various domains.
Implementation of Monte Carlo Tree Search as a technique to improve the quality of step-by-step reasoning traces, which likely helps explore different reasoning paths and select the most promising ones based on learned reward models.
Feedback loops involving Direct Preference Optimization (DPO), Supervised Fine-Tuning (SFT) and Rejection Sampling
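To make the recipe concrete, here is a minimal Python sketch of how one might sample pre-training documents according to the rough 50/25/17/8 mix above and blend in extra high-quality math and code data during a short annealing window at the end of training. The source names, the annealing mix, and the 5% window are illustrative assumptions, not Meta's actual pipeline.

```python
import random

# Rough token-budget shares quoted above; illustrative only, not Meta's exact recipe.
BASE_MIX = {
    "general_knowledge": 0.50,
    "math_reasoning":    0.25,
    "code":              0.17,
    "multilingual":      0.08,
}

# Hypothetical mix for the final "annealing" window: up-weights small,
# high-quality math and code sets at the expense of general web data.
ANNEAL_MIX = {
    "general_knowledge": 0.30,
    "math_reasoning":    0.35,
    "code":              0.27,
    "multilingual":      0.08,
}

def mix_for_step(step, total_steps, anneal_fraction=0.05):
    """Return the sampling mix for a given training step.

    For the first (1 - anneal_fraction) of training we use BASE_MIX; in the
    final window we linearly blend toward ANNEAL_MIX.
    """
    anneal_start = int(total_steps * (1 - anneal_fraction))
    if step < anneal_start:
        return dict(BASE_MIX)
    t = (step - anneal_start) / max(1, total_steps - anneal_start)
    return {k: (1 - t) * BASE_MIX[k] + t * ANNEAL_MIX[k] for k in BASE_MIX}

def sample_source(step, total_steps):
    """Pick which corpus the next document is drawn from at this step."""
    mix = mix_for_step(step, total_steps)
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

# The mix shifts toward math and code only near the end of training.
print(mix_for_step(step=500_000, total_steps=1_000_000))   # base mix
print(mix_for_step(step=990_000, total_steps=1_000_000))   # mostly annealed
```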
Takeaway for engineers:
Invest time in data preparation. Clean, high-quality data can often lead to better results than simply increasing model size.
Consider multi-stage training approaches in your own projects. Gradually introducing specialized data can lead to better overall performance.
This paper validates that positive feedback loops (models generating and filtering their own training data) can work in a real-world industrial setting; previously they had mostly been demonstrated in limited experiments or in the game of Go. A toy sketch of such a loop follows below.
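For intuition, here is a toy sketch of one round of such a loop: generate several candidates per prompt, score them with a reward model, keep the best as new SFT data, and collect best-versus-worst pairs that could feed DPO. The generate and score callables are stand-ins, not Meta's code or any real library API.

```python
import random

def self_improvement_round(generate, score, prompts, n_samples=8, keep_top=1):
    """One toy round of rejection sampling.

    generate(prompt) -> candidate answer (string)
    score(prompt, answer) -> scalar reward from a learned reward model
    Returns new SFT examples and (chosen, rejected) pairs usable for DPO.
    """
    sft_examples, preference_pairs = [], []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda a: score(prompt, a), reverse=True)
        sft_examples += [(prompt, a) for a in ranked[:keep_top]]   # keep the best
        preference_pairs.append((prompt, ranked[0], ranked[-1]))   # best vs worst
    return sft_examples, preference_pairs

# Tiny demo with stand-ins: a "model" that guesses sums and a reward that
# prefers answers closer to the truth. A real pipeline would plug in an LLM
# and a learned reward model, run SFT/DPO on the returned data, and repeat.
prompts = ["2+2", "3+5"]
generate = lambda p: str(eval(p) + random.choice([-1, 0, 1]))
score = lambda p, a: -abs(eval(p) - int(a))
sft, pairs = self_improvement_round(generate, score, prompts)
print(sft)    # e.g. [('2+2', '4'), ('3+5', '8')]
print(pairs)
```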
Architectural Differences and Innovations
The flagship Llama 3 model has an impressive 405 billion parameters, making it one of the largest openly released models to date.
Llama 3 uses grouped-query attention (GQA), a middle ground between full multi-head and multi-query attention that preserves most of the output quality while shrinking the key/value cache for faster inference (a minimal sketch follows this list).
The context window has been extended to 128K tokens, a significant increase from previous versions.
Llama 3 incorporates multimodal capabilities through a compositional approach similar to DeepMind's Flamingo model, attaching vision encoders to the language model to handle interleaved visual and textual data, and extends the concept to video and speech recognition.
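Here is a minimal numpy sketch of grouped-query attention, assuming 8 query heads sharing 2 key/value heads; the head counts and dimensions are illustrative, not Llama 3's actual configuration.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=2):
    """Minimal grouped-query attention over one sequence (causal).

    Each of the n_kv_heads key/value heads is shared by n_heads // n_kv_heads
    query heads, which shrinks the KV cache versus full multi-head attention.
    """
    seq, d_model = x.shape
    head_dim = d_model // n_heads
    group = n_heads // n_kv_heads

    q = (x @ wq).reshape(seq, n_heads, head_dim)      # one projection per query head
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)   # fewer key heads
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)   # fewer value heads

    causal = np.tril(np.ones((seq, seq), dtype=bool))
    out = np.zeros_like(q)
    for h in range(n_heads):
        kv = h // group                               # KV head shared by this query head
        scores = q[:, h] @ k[:, kv].T / np.sqrt(head_dim)
        scores = np.where(causal, scores, -np.inf)    # mask out future positions
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[:, h] = probs @ v[:, kv]
    return out.reshape(seq, d_model)

# Smoke test: 10 tokens, model dim 64, 8 query heads sharing 2 KV heads.
rng = np.random.default_rng(0)
d, kv_dim = 64, 2 * (64 // 8)                         # kv_dim = n_kv_heads * head_dim
x = rng.normal(size=(10, d))
wq = rng.normal(size=(d, d))
wk = rng.normal(size=(d, kv_dim))
wv = rng.normal(size=(d, kv_dim))
print(grouped_query_attention(x, wq, wk, wv).shape)   # (10, 64)
```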
The training infrastructure is equally impressive:
16,000 H100 GPUs used over 54 days
Achieved roughly 41% model FLOPs utilization (MFU), considered good at this scale (a back-of-envelope compute estimate follows this list)
Custom networking solutions to handle the massive data flow
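A rough back-of-envelope check of what those numbers imply, assuming roughly 989 TFLOP/s of dense BF16 peak per H100 (all figures approximate):

```python
# Back-of-envelope estimate of total pre-training compute from the figures above.
# The H100 peak throughput and the 41% utilisation are approximations.
gpus        = 16_000
days        = 54
peak_flops  = 989e12      # H100 dense BF16, FLOP/s (approx.)
utilisation = 0.41        # model FLOPs utilisation

total_flops = gpus * days * 86_400 * peak_flops * utilisation
print(f"~{total_flops:.1e} FLOPs")   # ~3e25: the same order of magnitude as the
                                     # ~3.8e25 FLOPs reported in the paper, and
                                     # within 10x of the executive order's 1e26 threshold
```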
Takeaway for engineers: Model improvement isn't just about throwing more hardware at the problem; you have to co-design the model and the infrastructure. See "The Hardware Lottery" paper, which is very insightful.
Evaluating output fidelity and quality as the model scales
The Llama 3 researchers developed novel techniques to evaluate how well their large language models perform and maintain accuracy, especially when scaled up to massive sizes like the 405B-parameter model:
Scaling laws: The team developed new scaling laws that predict downstream-task performance from pre-training metrics in two stages: first predict the model's negative log-likelihood on a task from training compute, then map that log-likelihood to task accuracy. This helps them estimate how well larger models will perform before fully training them (a toy fit is sketched after this list).
Downstream task evaluation: Rather than relying solely on perplexity or next-token prediction, they evaluate model performance on actual downstream tasks like reasoning, math, and coding to get a more holistic view of capabilities.
Extensive benchmarking: The paper describes rigorous testing across a wide range of benchmarks and comparisons to other leading models like GPT-4 to assess relative performance.
Multimodal integration: They developed methods to evaluate performance when integrating other modalities like vision and speech into the language model.
Long-context evaluation: With the 128K token context window, they had to develop new ways to assess performance on very long inputs.
Factuality assessment: They created techniques to evaluate the model's ability to refuse to answer when it doesn't know something, improving factual accuracy.
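To illustrate the two-stage scaling-law idea, here is a toy fit on made-up numbers; it only shows the shape of the method (a log-linear fit from compute to negative log-likelihood, then a sigmoid from log-likelihood to accuracy), not the paper's data or code.

```python
import numpy as np

# All numbers below are invented, purely to demonstrate the two-stage method.
compute = np.array([1e21, 3e21, 1e22, 3e22, 1e23])   # training FLOPs of small runs
nll     = np.array([1.30, 1.18, 1.05, 0.95, 0.86])   # NLL of correct answers on a task
acc     = np.array([0.22, 0.28, 0.37, 0.45, 0.53])   # benchmark accuracy

# Stage 1: NLL ~ a * log10(compute) + b
a, b = np.polyfit(np.log10(compute), nll, 1)

# Stage 2: accuracy ~ sigmoid(c * NLL + d), fitted in logit space
logit = np.log(acc / (1 - acc))
c, d = np.polyfit(nll, logit, 1)

def predict_accuracy(flops):
    predicted_nll = a * np.log10(flops) + b
    return 1 / (1 + np.exp(-(c * predicted_nll + d)))

# Chain the two fits to extrapolate to a much larger (hypothetical) budget.
print(f"predicted accuracy at 4e25 FLOPs: {predict_accuracy(4e25):.0%}")
```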
Takeaway for engineers: these techniques can be used to rigorously assess model quality and capabilities at scale, and maintain and improve performance across a broad range of tasks.
What the New Model Enables:
Improved performance on various benchmarks, especially in math and reasoning tasks. This is important for workflow and business use cases: an agent can better break down a request and logically work through the steps needed to implement it.
Enhanced multilingual and long-context understanding.
Better factuality and "knowing what it doesn't know" through specific training techniques.
Potential for more advanced tool use and multi-step reasoning.
"Secret Sauce":
The extensive use of synthetic data generation and self-improvement techniques appears to be a key innovation.
The data mix recipe, especially the annealing phase and the focus on high-quality data for specific domains, seems to be crucial.
The use of Monte Carlo tree search for certain tasks is noted as a novel approach in large language models.
Open Questions
What are the long-term implications of the architectural choices made for Llama 3?
How do the data cleaning and filtering techniques impact model bias and performance across different domains?
What advancements in tokenization and multilingual support can we expect in future iterations?
How will the scaling laws and fidelity prediction methods influence the development of even larger models?
From a coding perspective, while specific implementation details aren't provided, the model's improved performance on coding tasks and the ability to generate, understand, and debug code more effectively could be particularly useful for developers. However, practical application may still require careful prompt engineering or fine-tuning for specific use cases.
Overall, while Llama 3 represents a significant advancement in open-source language models, it's important to note that its practical applications and limitations in real-world scenarios are still being explored. One should also note the sheer scale of the effort: roughly 200 core contributors and 600 other partial collaborators, plus the cost of building the data center and the power needed to run it, since each new generation of models often requires about ten times more compute than the previous one to deliver significantly better quality. The Llama 3 paper does not have the same visuals as a successful rocket launch, but for our community it is at least as impressive.
Thanks to Cosmin Negruseri and Seth Stafford for reading the draft, and to the other buddies from the DeepLearning Study Group for the questions and the debate on the Llama 3 paper.
Further interesting reads and Twitter threads:
https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/
https://www.mofo.com/resources/insights/231107-the-ai-executive-order-presidential-authority
https://arxiv.org/pdf/2405.16455
https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/
https://x.com/eugeneyan/status/1816135035629756614