Llama (LLM)

Three generations of Llama. Source: Devopedia 2024.

Llama is a Large Language Model (LLM) released by Meta. It's an open-source Foundation Model (FM) that researchers can fine-tune for their specific tasks. Meta released Llama-1 and Llama-2 in 2023, and Llama-3 in 2024. When it was first released, the case-sensitive acronym LLaMA (Large Language Model Meta AI) was common.

Meta released these models at different sizes, all below 100B parameters. Meta's approach has been to release small models pre-trained on lots of high-quality data. Smaller models are easier to fine-tune and deploy, which matters since independent researchers may not have access to large-scale infrastructure.

Since its release, the research community has fine-tuned numerous models using Llama as the base. Meta itself released some fine-tuned models: Llama-2-Chat, Code Llama (based on Llama-2), and Llama-3-Instruct.


  • What downstream models has the community created from Llama?
    Llama and its fine-tuned derivatives. Source: Zhao et al. 2023, fig. 5.

    The developer community has fine-tuned Llama for many specific tasks. By being open source, Llama has advanced LLM research and democratized access to LLMs.

    Often Llama is fine-tuned with instructions or conversations, leading to downstream models such as Alpaca, Alpaca-LoRA, BELLE, Vicuna, Koala, WizardLM, OpenAssistant Llama, and many more. This is similar to how InstructGPT was fine-tuned from GPT-3.

    Though CodeLlama was trained for coding, it was further fine-tuned to yield Llemma, which is better at mathematics.

    Compared to GPT-4 Turbo's 128K tokens, Llama-3's context window is only 8192 tokens. LongLlama (fine-tuned from OpenLlama) extends the context window to 256K. There's also LongLLaMA Code (fine-tuned from CodeLlama). Giraffe (fine-tuned from Llama-2-13B) achieves a context window of 32K.

    For multimodal apps, Vicuna is the preferred starting point for further fine-tuning. This has led to LLaVA, MiniGPT4, InstructBLIP, and PandaGPT. For the Chinese language, Colossal-Llama-2 (7B and 13B) has shown good performance. ColossalAI also accelerates Llama-3 pretraining by 18% compared to Megatron-DeepSpeed. This example shows how third-party libraries enhance the Llama ecosystem.

  • Is Meta's Llama really open source?

    Meta claims that its Llama models are open source. Llama-1 was initially accessible only by invitation. Llama-2 and Llama-3 were open source from the outset. Following Meta's lead, Microsoft, Mistral, Snowflake and Databricks started offering their own open-source models.

    However, there's some debate about whether Llama is truly open. The pre-training and fine-tuning datasets are not published. The license has some limitations. While commercial use is permitted, the models and their outputs can be used to improve only Llama and its fine-tuned derivatives, not other LLMs. A special license is needed if monthly active users exceed 700M.

    To address this, OpenLM Research trained an LLM similar to Llama and open sourced it. Called OpenLlama, it comes in three sizes: 3B, 7B, 13B. These were trained on 1T tokens from various open datasets. Weights are available in EasyLM and PyTorch formats.

  • What's the architecture of Llama?
    Llama architecture compared with the original transformer. Source: Ibe 2024.

    Unlike the original transformer's encoder-decoder architecture, Llama uses a causal decoder-only architecture. Multiple layers are used (Nx in figure), each layer consisting of multi-head attention and a feedforward neural network. Rotary Position Embedding (RoPE), SwiGLU activation function and RMSNorm pre-normalization are used.

    RoPE is a balance between absolute and relative positional embedding. It works better with longer sequences.
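To make the relative-position property concrete, here is a minimal pure-Python sketch of rotary embedding. It is not Meta's implementation: the function names, the toy vectors and the base of 10000 are assumptions for illustration, and real implementations work on batched tensors per attention head. Each consecutive pair of dimensions is rotated by an angle proportional to the token position, so the dot product of a rotated query and key depends only on the distance between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`.

    `base` is an assumed hyperparameter; pair i uses frequency base**(-i/dim).
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional query and key vectors.
q = [0.1, -0.4, 0.7, 0.2]
k = [0.5, 0.3, -0.2, 0.9]

# Both pairs of positions are 2 tokens apart, at different absolute positions.
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 103), rope(k, 101))
print(near, far)
```

The two printed scores agree up to floating-point error, which is the sense in which RoPE encodes absolute positions yet yields relative-position attention scores.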

    SwiGLU requires more compute compared to the traditional ReLU activation but it gives better results.
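The extra compute comes from a third weight matrix. The sketch below, with hypothetical toy weights and helper names, contrasts a SwiGLU feedforward block (gate, up and down projections) with a classic two-matrix ReLU block. It assumes the gate uses SiLU (Swish with beta of 1) and omits biases. The parameter comparison uses the Llama-2-13B dimensions quoted in this article (5120 and 13824).

```python
import math

def silu(x):
    """SiLU, i.e. Swish with beta=1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), with elementwise product."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def relu_ffn(x, w_in, w_out):
    """Classic transformer FFN: out( relu(in(x)) )."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)

# Toy shapes: model dimension 2, hidden dimension 3 (made-up weights).
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
w_up = [[0.2, -0.1], [0.7, 0.3], [-0.5, 0.6]]
w_down = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]
y = swiglu_ffn(x, w_gate, w_up, w_down)

# Weights per block at the same hidden size: three matrices versus two.
dim, hidden_dim = 5120, 13824
swiglu_params = 3 * dim * hidden_dim
relu_params = 2 * dim * hidden_dim
print(y, swiglu_params, relu_params)
```

At equal hidden size, the gated block carries 1.5 times the weights of the ReLU block.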

    Normalization happens before attention and FFNN. Unlike traditional LayerNorm, RMSNorm skips mean-centering and retains only re-scaling invariance.
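The difference between the two normalizations can be checked numerically. This is a minimal sketch with assumed function names and an assumed epsilon of 1e-6. LayerNorm is shown without its learned gain and bias to keep the comparison short.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Divide by the root mean square; no mean subtraction, no bias."""
    gain = gain or [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

x = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0 * v for v in x]   # same direction, larger magnitude
shifted = [v + 5.0 for v in x]   # same shape, different mean

print(rms_norm(x))
print(rms_norm(scaled))    # nearly identical: invariant to rescaling
print(rms_norm(shifted))   # different: not invariant to shifting
print(layer_norm(shifted)) # matches layer_norm(x): invariant to shifting
```

RMSNorm has one learned gain per dimension, which is why each normalization layer contributes a parameter count equal to the embedding dimension.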

    Memory bandwidth is a bottleneck during inference. To alleviate this without much loss of accuracy, Llama-3 replaces Multi-Head Attention (MHA) with Grouped-Query Attention (GQA). A few query heads share the same K and V matrices. GQA was first introduced into the larger Llama-2 models.
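A rough sketch of why GQA helps: fewer K and V heads mean a smaller key-value cache to read at every decoding step. The helper names are hypothetical, the contiguous grouping of query heads is an assumption, and 2 bytes per value assumes 16-bit storage. The head counts, head dimension and layer count are the Llama-3-70B figures quoted in this article.

```python
def kv_head_for_query(q_head, n_q_heads, n_kv_heads):
    """Index of the shared K/V head used by a query head (contiguous groups assumed)."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """K and V vectors stored for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama-3-70B: 80 layers, 64 query heads, 8 K/V heads, head dimension 128.
mapping = [kv_head_for_query(h, 64, 8) for h in range(64)]
mha = kv_cache_bytes_per_token(80, 64, 128)  # if every query head had its own K/V
gqa = kv_cache_bytes_per_token(80, 8, 128)   # 8 query heads share each K/V head

print(mapping[:10])
print(mha, gqa, mha // gqa)
```

Under these assumptions the cache per token shrinks by the ratio of query heads to K/V heads, which is 8 here.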

    Llama uses Byte-level Byte Pair Encoding (BBPE). Llama-3's larger vocabulary of 128,256 tokens yields 15% fewer tokens compared to Llama-2, which had a vocabulary of 32,000.

  • How do I calculate the number of model parameters in Llama?
    Llama-2-13B model details. Source: Bhargava 2023.

    Here's an example of Llama-2-13B parameter calculation. Vocabulary is 32000. Embedding dimension is 5120. There are 40 attention heads, each of dimension 128. Each attention layer therefore has four \((40\times128)\times5120 = 5120\times5120\) matrices, one each for Q, K, V and output. There are 40 attention layers. Each MLP block has a projection layer dimension of 13824 with three projections: gate, up, and down. Each normalization layer has 5120 parameters, and there are two per layer plus a final one. Thus we have,

    $$\begin{align}&32000\times5120 & \text{embeddings}\\&+(128\times40\times5120\times4\,+ & \text{attention: QKVO}\\&\quad13824\times5120\times3\,+ & \text{MLP: projections}\\&\quad5120\times2) & \text{RMSNorm}\\&\quad\quad\quad\quad\times40 & \text{layers}\\&+32000\times5120 & \text{output}\\&+5120 & \text{RMSNorm}\\&=13{,}015{,}864{,}320\end{align}$$

    Here's a similar calculation for Llama-3-70B. Since this model uses GQA, there are 64 Q and O heads but only 8 K and V heads. The Q and O matrices are full size while the K and V matrices are one-eighth the size, which yields the factor \(2+2/8=2.25\). Thus we have,

    $$\begin{align}&128256\times8192 & \text{embeddings}\\&+(128\times64\times8192\times2.25\,+ & \text{attention: QKVO}\\&\quad28672\times8192\times3\,+ & \text{MLP: projections}\\&\quad8192\times2) & \text{RMSNorm}\\&\quad\quad\quad\quad\times80 & \text{layers}\\&+128256\times8192 & \text{output}\\&+8192 & \text{RMSNorm}\\&=70{,}553{,}706{,}496\end{align}$$
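The two calculations above can be folded into one small function. This is a sketch that mirrors the arithmetic in this section. The function and argument names are made up for illustration and do not correspond to any library's configuration keys.

```python
def llama_param_count(vocab, dim, n_layers, n_q_heads, n_kv_heads, head_dim, ffn_dim):
    """Count parameters following the breakdown used in this section."""
    q_and_o = 2 * (n_q_heads * head_dim) * dim   # query and output projections
    k_and_v = 2 * (n_kv_heads * head_dim) * dim  # smaller when GQA shares K/V heads
    mlp = 3 * ffn_dim * dim                      # gate, up and down projections
    norms = 2 * dim                              # two RMSNorm gains per layer
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = vocab * dim
    output = vocab * dim
    final_norm = dim
    return embeddings + n_layers * per_layer + output + final_norm

# Llama-2-13B uses MHA, so K/V heads equal query heads.
llama2_13b = llama_param_count(32000, 5120, 40, 40, 40, 128, 13824)
# Llama-3-70B uses GQA with 64 query heads and 8 K/V heads.
llama3_70b = llama_param_count(128256, 8192, 80, 64, 8, 128, 28672)

print(f"{llama2_13b:,}")  # 13,015,864,320
print(f"{llama3_70b:,}")  # 70,553,706,496
```

Setting the number of K/V heads equal to the number of query heads recovers the MHA case, so the same function covers both models.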

  • Could you share details of Llama pre-training?
    Perplexity (PPL) improves with more training. Source: Touvron et al. 2023b, fig. 5.

    Llama-1-65B was pre-trained with 1.4T tokens on 2048 A100 GPUs (80GB each) over 21 days. Data was mostly from CommonCrawl and C4. Llama-2-70B saw 2T tokens whereas Llama-3-70B saw 15T tokens. Llama-3-8B and Llama-3-70B took 1.3M and 6.4M GPU hours respectively for pre-training.
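As a back-of-the-envelope check, the Llama-1-65B figures above can be converted to GPU hours and per-GPU throughput. This sketch assumes all 2048 GPUs ran for the full 21 days without interruption, so it is an estimate rather than a reported measurement. The function names are illustrative.

```python
def gpu_hours(n_gpus, days):
    """Total GPU hours, assuming every GPU is busy for the whole period."""
    return n_gpus * days * 24

def tokens_per_gpu_second(tokens, n_gpus, days):
    """Average training throughput per GPU."""
    return tokens / (gpu_hours(n_gpus, days) * 3600)

hours = gpu_hours(2048, 21)
rate = tokens_per_gpu_second(1.4e12, 2048, 21)

print(f"{hours:,} GPU hours")
print(f"{rate:.0f} tokens per second per GPU")
```

The result is roughly one million GPU hours, the same order of magnitude as the 1.3M GPU hours quoted for Llama-3-8B, although the GPU generations differ.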

    Llama-3 knowledge cutoff was March 2023 (8B) and December 2023 (70B). Training used data, model, and pipeline parallelization on two 24K GPU clusters. Meta achieved an effective training time exceeding 95%. Both pre-trained and fine-tuned models were evaluated on a variety of benchmarks.

    Llama-3 used high-volume high-quality data from various sources. About 5% of the data was non-English from 30 languages. Data was preprocessed with heuristic and safety filters. Personal information was removed. Semantic deduplication was done. Llama-2 helped classify data quality. Models continued to improve log-linearly when trained on far more data than the Chinchilla Scaling Law predicts as compute-optimal. The law suggests that for a 7B model about 200B tokens are needed but Llama-3 models were trained with 15T tokens. In contrast to much larger models such as GPT-4 (reportedly 1.7T parameters), Meta prefers to train smaller models on lots of pre-training data.
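One way to see how far Llama-3 goes beyond the Chinchilla recommendation is to compare tokens seen per parameter. The sketch below uses only token and parameter counts quoted in this article. The dictionary layout and function name are illustrative.

```python
# (training tokens, parameters) as quoted in this article
runs = {
    "Chinchilla 7B": (200e9, 7e9),   # compute-optimal estimate quoted above
    "GPT-3-175B": (300e9, 175e9),
    "Llama-1-13B": (1e12, 13e9),
    "Llama-1-65B": (1.4e12, 65e9),
    "Llama-2-70B": (2e12, 70e9),
    "Llama-3-8B": (15e12, 8e9),
    "Llama-3-70B": (15e12, 70e9),
}

def tokens_per_parameter(tokens, params):
    return tokens / params

ratios = {name: tokens_per_parameter(*tp) for name, tp in runs.items()}

for name, ratio in ratios.items():
    print(f"{name:14s} {ratio:8.1f} tokens per parameter")
```

By this measure Llama-3-8B sees well over a thousand tokens per parameter, compared to a few tens for the quoted Chinchilla estimate.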

  • How's the performance of Llama?
    Performance of Llama-3 pre-trained model. Source: Meta 2024.

    Across various benchmarks, Llama-3 and Llama-3-Instruct outperformed other models of similar size. This is significant since Llama-2 couldn't compete against proprietary models such as GPT and PaLM. Llama-3-8B fared slightly worse than Gemma-7B on the ARC-Challenge. Llama-3-Instruct outperformed Gemini-1.5-Pro on MMLU, HumanEval and GSM-8K while the latter did better on the GPQA and MATH benchmarks.

    Llama-1-13B gave similar performance to GPT-3-175B despite being 10x smaller. This can be attributed to the larger training dataset: 1T tokens for Llama-1-13B versus 300B tokens for GPT-3. On the Natural Questions dataset, smaller models using a 5-shot approach performed similarly to Llama-1-65B 0-shot. However, smaller models aren't good at quantitative reasoning. While Llama isn't tuned to follow instructions, Llama-1-65B is able to follow basic instructions.

    On the LMSYS Chatbot Arena Leaderboard (May 2024), Llama-3-70B-Instruct was the top-performing open-source model. Overall, it was in the 11th position. The next best open-source model was Cohere's Command R+. The top 10 models were all proprietary, including GPT-4o, Gemini-1.5-Pro, Claude-Opus and Yi-Large. One researcher commented that the gap between proprietary and open-source models is narrowing.

    Memory requirement of Llama-2 with 4-bit quantization is 37.6GB (70B), 8.9GB (13B) and 5.5GB (7B).
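These figures can be sanity-checked against the memory needed for the weights alone. The sketch below assumes nominal parameter counts (7B, 13B, 70B), exactly 4 bits per weight and 1GB = 10^9 bytes. It gives a lower bound only, since it ignores any overhead beyond the quantized weights. The names are illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Memory figures quoted above for 4-bit Llama-2, and nominal parameter counts.
quoted_gb = {"7B": 5.5, "13B": 8.9, "70B": 37.6}
nominal_params = {"7B": 7e9, "13B": 13e9, "70B": 70e9}

for name, n_params in nominal_params.items():
    weights = weight_memory_gb(n_params, 4)
    overhead = quoted_gb[name] - weights
    print(f"{name}: weights {weights:.1f}GB, quoted {quoted_gb[name]}GB, "
          f"difference {overhead:.1f}GB")

# For comparison, the same weights at 16 bits each.
fp16_70b = weight_memory_gb(70e9, 16)
print(f"70B at 16-bit: {fp16_70b:.0f}GB")
```

The quoted figures sit about 2GB to 2.6GB above the weights-only bound, and 4-bit quantization cuts the 70B model to a quarter of its 16-bit footprint.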


LLaMA model in its different sizes. Source: Touvron et al. 2023a.

Meta releases an LLM by the name LLaMA (Large Language Model Meta AI). This is released in four sizes: 7B, 13B, 33B, 65B. Meta open sources the model but the model weights are shared only upon request. However, the weights get leaked online in March.

Supervised fine-tuning of Alpaca-7B using synthetic data. Source: Taori et al. 2023.

A Stanford University research team releases the open-source Alpaca-7B model. It's fine-tuned from Llama-7B with 52k instruction-following demonstrations. These demonstrations were themselves generated by OpenAI's text-davinci-003. While the model shows good results, it also suffers from hallucinations, toxicity and stereotypes. Another research group releases Vicuna-13B, fine-tuned from Llama-13B with 70k user-shared ChatGPT conversations at a cost of only about $300.


To address some licensing limitations of Meta's Llama, OpenLM Research releases OpenLlama under a more permissive license. Importantly, pre-training datasets are open. In July, they release v2 and v3 versions of the model. Since its architecture is the same as Llama's, OpenLlama's weights can serve as a drop-in replacement for Llama weights in existing implementations.

Training of Llama-2-Chat. Source: Touvron et al. 2023b, fig. 4.

Meta releases Llama-2 in three sizes: 7B, 13B, 70B. Although Meta developed a 34B model, it chooses not to release it. Compared to Llama-1, context window is increased from 2048 to 4096. 34B and 70B models use GQA rather than MHA. Meta also releases a fine-tuned chat model called Llama-2-Chat in three sizes: 7B, 13B, 70B. Llama-2-Chat was put through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Only 27,540 high-quality SFT annotations were used. Compared to Llama-2, Llama-2-Chat is a safer and more helpful model.

Training of Code Llama. Source: Meta 2023.

Based on the Llama-2 pre-trained model, Meta releases Code Llama in four sizes: 7B, 13B, 34B, 70B. These are fine-tuned for coding. Code Llama has three variants: CodeLlama foundation model, CodeLlama-Python and CodeLlama-Instruct. CodeLlama-Instruct is recommended for code generation from natural language prompts. The smaller models are fast and suited for real-time code completion tasks. The larger models are more accurate. The 7B model can be served from a single GPU.


Meta announces that Llama-2 models are now available on AWS Bedrock as a managed service. Other cloud platforms are set to follow shortly. Tens of thousands of startups are using the Llama models. On Hugging Face, Llama has 7000+ derivatives, some of which have improved common benchmarks by 10%. AMD, Intel, Nvidia, and Google are offering software and hardware optimizations for Llama. In October, Dell announces plans to bring Llama-2 to enterprises.


Xia et al. release Sheared Llama, a pruned version of Llama-2-7B. They release it in sizes 1.3B and 2.7B. They show that structured pruning is a cost-effective way to build small models and is preferable to training a small model from scratch.

Performance and carbon footprint of Llama-3 models. Source: Adapted from Meta Llama 2024a.

Meta releases Llama-3 in two sizes: 8B, 70B. Compared to Llama-2, both sizes use GQA, token vocabulary is increased from 32000 to 128256, tokenizer changes from BPE sentencepiece to BPE tiktoken, and pre-training data is increased from 2T tokens to 15T tokens. Meta also releases a fine-tuned model called Llama-3-Instruct in sizes 8B and 70B. Meta also announces a 400B model that's still being trained and could be released in a few months. It's seen that Llama-3-8B outperforms even the bigger Llama-2-13B model.


A search for "llama3" on Hugging Face shows 5700+ models, just one month after the release of Llama-3. A search for "llama" shows 33k+ models. A title-only search for "llama" on arXiv brings up 95 technical papers. These numbers exclude fine-tuned Llama models (e.g. Alpaca, Vicuna) that are not named "llama". The numbers show the popularity and rapid adoption of open-source LLMs.

Sample Code

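  • # Sample usage: text generation with the pre-trained Llama-3-8B model
    # Source: https://huggingface.co/docs/transformers/main/en/model_doc/llama3
    # Accessed 2024-05-30
    import torch
    import transformers
    # Gated model: accept Meta's license on Hugging Face and log in first
    model_id = "meta-llama/Meta-Llama-3-8B"
    # bfloat16 halves memory versus float32; device_map="auto" places weights on available devices
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    # The pipeline returns a list of dicts, each with a "generated_text" key
    print(pipeline("Hey how are you doing today?"))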


References

  1. Ainslie, J., J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. 2023. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." v3, arXiv, December 23. Accessed 2024-05-28.
  2. Bhargava, S. 2023. "Mastering Llama Math (Part-1): A Step-by-Step Guide to Counting Parameters in Llama-2." On Medium, October 22. Accessed 2024-05-30.
  3. Chan, L. 2023. "Meta announces Llama 2; open sources it for commercial use." Blog, LessWrong, July 19. Accessed 2024-05-30.
  4. HPC-AI Tech. 2024. "ColossalAI." On GitHub, May 30. Accessed 2024-05-30.
  5. Hendrycks, D. 2024. "Scaling Laws." Section 2.4 in Introduction to AI Safety, Ethics and Society, Center for AI Safety. Accessed 2024-05-24.
  6. Hugging Face. 2024a. "Model search: 'llama3'." Hugging Face, May 29. Accessed 2024-05-28.
  7. Hugging Face. 2024b. "Model search: 'llama'." Hugging Face, May 29. Accessed 2024-05-28.
  8. Huggy Llama. 2023. "config.json." huggyllama/llama-30b, on Hugging Face, April 4. Accessed 2024-05-28.
  9. IBM. 2023. "IBM Plans to Make Llama 2 Available within its Watsonx AI and Data Platform." Press release, IBM, August 9. Accessed 2024-05-30.
  10. Ibe, C. 2024. "Unlocking Low-Resource Language Understanding: Enhancing Translation with Llama 3 Fine-Tuning." On Medium, April 25. Accessed 2024-05-30.
  11. Isaac, M. 2024. "How A.I. Made Mark Zuckerberg Popular Again in Silicon Valley." The New York Times, May 29. Accessed 2024-05-30.
  12. Karpathy, A. 2024. "Congrats to @AIatMeta on Llama 3 release!!" Tweet, on X, April 19. Accessed 2024-05-30.
  13. Kerner, S. M. 2023. "Dell and Meta partner to bring Llama 2 open source AI to enterprise users on-premises." VentureBeat, October 31. Accessed 2024-05-30.
  14. Kyo. 2024. "Llama 3 Statistics." Tweet, on X, April 20. Accessed 2024-05-30.
  15. LMSYS. 2024. "Chatbot Arena Leaderboard." LMSYS, on Hugging Face, May 30. Accessed 2024-05-30.
  16. Labonne, M. 2024. "Arena ELO graph updated with new models." Tweet, on X, April 19. Accessed 2024-05-30.
  17. Li, C. 2020. "OpenAI's GPT-3 Language Model: A Technical Overview." Blog, Lambda Labs, June 3. Accessed 2024-05-30.
  18. Lowe, R. and J. Leike. 2022. "Aligning language models to follow instructions." OpenAI, January 27. Accessed 2024-05-30.
  19. Meta. 2023. "Introducing Code Llama, a state-of-the-art large language model for coding." Blog, Meta, August 24. Updated 2024-01-29. Accessed 2024-05-23.
  20. Meta. 2024. "Introducing Meta Llama 3: The most capable openly available LLM to date." Blog, Meta, April 18. Accessed 2024-05-30.
  21. Meta Llama. 2024a. "Meta-Llama-3-8B-Instruct." Meta Llama, on Hugging Face, May 13. Accessed 2024-05-28.
  22. Meta Llama. 2024b. "Llama 3 Evaluation Details." llama3, on GitHub, Meta Llama, April 23. Accessed 2024-05-31.
  23. Meta Llama. 2024c. "Model Details." llama3, on GitHub, Meta Llama, April 21. Accessed 2024-05-31.
  24. Morgan, T.P. 2024. "Meta’s Llama 3 AI Is Smart, But Who Is Going To Profit From It?" The Next Platform, April 22. Accessed 2024-05-30.
  25. OpenAI. 2024. "Models." Documentation, OpenAI. Accessed 2024-05-30.
  26. OpenLM Research. 2023. "OpenLLaMA: An Open Reproduction of LLaMA." open_llama, on GitHub, July 16. Accessed 2024-05-30.
  27. Pal, A., D. Karkhanis, M. Roberts, S. Dooley, A. Sundararajan, and S. Naidu. 2023. "Giraffe: Adventures in Expanding Context Lengths in LLMs." v1, arXiv, August 21. Accessed 2024-05-30.
  28. Sagio, A. 2023. "A brief history of LLaMA models." Blog, AGI Sphere, Sagio Development LLC, August 11. Accessed 2024-05-30.
  29. Spisak, J. and S. Edunov. 2023. "The Llama Ecosystem: Past, Present, and Future." Blog, Meta, September 27. Accessed 2024-05-30.
  30. Staniszewski, K. 2023. "LongLLaMA: Focused Transformer Training for Context Scaling." long_llama, on GitHub, November 8.
  31. Swarup, M. 2023. "How to Use Meta Llama 2 Large Language Model on Vultr Cloud GPU." Documentation, Vultr, August 10. Accessed 2024-05-30.
  32. Taori, R., I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. 2023. "Alpaca: A Strong, Replicable Instruction-Following Model." CRFM, Stanford University, March 13. Accessed 2024-05-23.
  33. TheBloke. 2023. "config.json." TheBloke/Llama-2-13B-fp16, on Hugging Face, July 20. Accessed 2024-05-28.
  34. Touvron, H., T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. 2023a. "LLaMA: Open and Efficient Foundation Language Models." v1, arXiv, February 27. Accessed 2024-05-30.
  35. Touvron, H., L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. 2023b. "Llama 2: Open Foundation and Fine-Tuned Chat Models." v2, arXiv, July 19. Accessed 2024-05-30.
  36. Unsloth. 2024a. "config.json." unsloth/llama-3-8b-bnb-4bit, on Hugging Face, May 25. Accessed 2024-05-28.
  37. Unsloth. 2024b. "config.json." unsloth/llama-3-70b-bnb-4bit, on Hugging Face, April 18. Accessed 2024-05-28.
  38. Vicuna. 2023. "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality." Blog, LMSYS, March 30. Accessed 2024-05-28.
  39. Wolfe, C. R. 2023. "LLaMA-2 from the Ground Up." Deep (Learning) Focus, on Substack, August 14. Accessed 2024-05-30.
  40. Xia, M., T. Gao, Z. Zeng, and D. Chen. 2024. "Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning." v2, arXiv, April 11. Accessed 2024-05-30.
  41. Zhang, B. and R. Sennrich. 2019. "Root Mean Square Layer Normalization." v1, arXiv, October 16. Accessed 2024-05-30.
  42. Zhao, W. X., K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie and J.-R. Wen. 2023. "A Survey of Large Language Models." v13, arXiv, November 24. Accessed 2024-05-30.
  43. arXiv. 2024. "Search: 'llama'." arXiv. Accessed 2024-05-30.

Further Reading

  1. Meta Llama: Open-Source Code on GitHub
  2. Llama 3: Documentation on Hugging Face
  3. Introducing Llama 3
  4. Llama 2: Technical Paper
  5. LLaMA: Technical Paper
  6. LLaMA-2 from the Ground Up

Cite As

Devopedia. 2024. "Llama (LLM)." Version 4, May 31. Accessed 2024-06-25. https://devopedia.org/llama-llm
Contributed by
1 author

Last updated on
2024-05-31 05:12:40