Llama (LLM)

Article Info

Contributed by
1 author

Last updated on
2024-05-31 05:12:40

Code Llama
Large Language Model
Language Modelling
LLM Pre-Training
LlamaIndex
Generative Pre-Trained Transformer

Article Versions

Version 4, 2024-05-31 05:12:40, by arvindpdmn: Completing all content.
Version 3, 2024-05-29 15:30:21, by arvindpdmn: Completed Milestones, Sample Code, See Also, Further Reading sections.
Version 2, 2024-05-09 11:38:57, by arvindpdmn: See Also and Further Reading sections added.
Version 1, 2024-05-09 05:52:49, by arvindpdmn: Initial questions added.

Three generations of Llama. Source: Devopedia 2024.

Llama is a Large Language Model (LLM) released by Meta. It's an open-source Foundation Model (FM) that researchers can fine-tune for their specific tasks. Meta released Llama-1 and Llama-2 in 2023, and Llama-3 in 2024. When it was first released, the case-sensitive acronym LLaMA (Large Language Model Meta AI) was common.

Meta released these models at different sizes, all below 100B parameters. Meta's approach has been to release small models pre-trained on lots of high-quality data. Smaller models are easier to fine-tune and deploy, which matters because independent researchers may not have access to large-scale infrastructure.

Since its release, the research community has fine-tuned numerous models using Llama as the base. Meta itself released some fine-tuned models: Llama-2-Chat, Code Llama (based on Llama-2), and Llama-3-Instruct.

Discussion

What downstream models has the community created from Llama?
Llama and its fine-tuned derivatives. Source: Zhao et al. 2023, fig. 5.
The developer community has fine-tuned Llama for many specific tasks. By being open source, Llama has advanced LLM research and democratized access to LLMs.
Often Llama is fine-tuned with instructions or conversations, leading to downstream models such as Alpaca, Alpaca-LoRA, BELLE, Vicuna, Koala, WizardLM, OpenAssistant Llama, and many more. This is similar to how InstructGPT was fine-tuned from GPT-3.
Though trained for coding, CodeLlama was fine-tuned to yield Llemma that's better in mathematics.
Compared to GPT-4 Turbo's 128K tokens, Llama-3's context window is only 8192 tokens. In LongLLaMA (fine-tuned from OpenLlama), the context window is 256k. There's also LongLLaMA Code (fine-tuned from CodeLlama). Giraffe (fine-tuned from Llama-2-13B) achieves a context window of 32k.
For multimodal apps, Vicuna is the preferred starting point for further fine-tuning. This has led to LLaVA, MiniGPT4, InstructBLIP, and PandaGPT. For the Chinese language, Colossal-Llama-2 (7B and 13B) has shown good performance. ColossalAI also accelerates Llama-3 pretraining by 18% compared to Megatron-DeepSpeed. This example shows how third-party libraries enhance the Llama ecosystem.
Is Meta's Llama really open source?
Meta claims that its Llama models are open source. Llama was initially accessible only by invitation. Llama-2 and Llama-3 were open source from the outset. Following Meta's lead, Microsoft, Mistral, Snowflake and Databricks started offering their own open-source models.
However, there's some debate about whether Llama is truly open. The pre-training and fine-tuning datasets are not published. The license has some limitations. While commercial use is permitted, Llama and its outputs can be used to improve only Llama and its fine-tuned models. A special license is needed if monthly active users exceed 700M.
To address this, OpenLM Research trained an LLM similar to Llama and open sourced it. Called OpenLlama, it comes in three sizes: 3B, 7B, 13B. These were trained on 1T tokens from various open datasets. Weights are available in EasyLM and PyTorch formats.
What's the architecture of Llama?
Llama architecture compared with the original transformer. Source: Ibe 2024.
Unlike the original transformer's encoder-decoder architecture, Llama uses a causal decoder-only architecture. Multiple layers are used (Nx in figure), each layer consisting of multi-head attention and a feedforward neural network. Rotary Position Embedding (RoPE), SwiGLU activation function and RMSNorm pre-normalization are used.
RoPE is a balance between absolute and relative positional embedding. It works better with longer sequences.
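The relative-position property of RoPE can be illustrated with a minimal sketch. It assumes the common formulation in which consecutive pairs of vector components are rotated by position-dependent angles with base 10000; the pairing convention, the `rope` helper and the toy vectors are illustrative assumptions, not Meta's implementation.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values)
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Each vector is encoded with its absolute position, yet the attention
# score depends only on the offset between the two positions.
score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2
score_b = dot(rope(q, 105), rope(k, 102))  # same offset, shifted by 100
print(abs(score_a - score_b) < 1e-9)
```

The position enters as an absolute rotation of each vector, but the dot product sees only the difference of the two rotations, which is the sense in which RoPE sits between absolute and relative embeddings.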
SwiGLU requires more compute compared to the traditional ReLU activation but it gives better results.
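The extra compute comes from a third projection matrix. The sketch below contrasts a SwiGLU feedforward block (gate, up and down projections) with a plain ReLU block (up and down only). It uses pure Python lists and made-up toy weights; the function names are hypothetical and biases are omitted for brevity.

```python
import math

def silu(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Three projections: the gated branch modulates the up branch elementwise.
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def relu_ffn(x, w_up, w_down):
    # Two projections only, as in the original transformer.
    hidden = [max(0.0, h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)

# Toy shapes: model dimension 2, hidden dimension 3 (hypothetical weights)
x = [1.0, -0.5]
w_gate = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]]
w_up = [[0.1, 0.3], [0.2, -0.1], [-0.4, 0.6]]
w_down = [[0.3, -0.2, 0.1], [0.0, 0.5, -0.3]]

print(swiglu_ffn(x, w_gate, w_up, w_down))
print(relu_ffn(x, w_up, w_down))
```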
Normalization happens before attention and FFNN. Unlike traditional LayerNorm, RMSNorm skips mean-centering and only rescales the input by its root mean square.
Memory bandwidth is a bottleneck during inference. To alleviate this without much loss of accuracy, Llama-3 replaces Multi-Head Attention (MHA) with Grouped-Query Attention (GQA), in which each group of query heads shares a single K and V head. GQA was first introduced in the larger Llama-2 models.
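The difference between the two normalizations can be seen in a small sketch. The helper names, the `eps` value and the toy inputs are assumptions made for illustration: RMSNorm divides by the root mean square without subtracting the mean, so it is insensitive to rescaling the input but not to shifting it, whereas LayerNorm is insensitive to both.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def layer_norm(x, gain, bias, eps=1e-6):
    """Classic LayerNorm: subtract the mean, then divide by the std deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for g, v, b in zip(gain, x, bias)]

ones, zeros = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]

base = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, 10x larger
shifted = [101.0, 102.0, 103.0]  # same spread, shifted mean

print(rms_norm(base, ones), rms_norm(scaled, ones))      # nearly identical
print(rms_norm(base, ones), rms_norm(shifted, ones))     # different
print(layer_norm(base, ones, zeros), layer_norm(shifted, ones, zeros))  # nearly identical
```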
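To get a rough feel for why sharing K and V heads relieves memory bandwidth, the sketch below estimates the key-value cache that must be read per generated token. It uses the Llama-3-70B shape from the parameter calculation in this article (80 layers, 64 query heads, 8 K/V heads, head dimension 128). The helper name and the 2-byte (16-bit) cache precision are assumptions.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V cached for one token across all layers.

    bytes_per_value=2 assumes a 16-bit cache (an assumption, not a Meta figure).
    """
    # One K vector and one V vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical MHA variant: every one of the 64 query heads has its own K and V
mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
# GQA as described: 64 query heads share 8 K/V heads
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(mha, gqa, mha // gqa)
```

Under these assumptions the cache shrinks by the ratio of query heads to K/V heads, which is a factor of 8 here.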
Llama uses Byte-level Byte Pair Encoding (BBPE). Llama-3's larger vocabulary of 128,256 yields 15% fewer tokens compared to Llama-2 that had a 32,000 vocabulary.
How do I calculate the number of model parameters in Llama?
Llama-2-13B model details. Source: Bhargava 2023.
Here's an example of Llama-2-13B parameter calculation. Vocabulary is 32000. Embedding dimension is 5120. There are 40 attention heads. Attention dimension is 128. In each attention layer we've $(40\times128)\times5120 = 5120\times5120$ matrices, one each for K, V, Q and output. There are 40 attention layers. Each MLP block has a projection layer dimension of 13824 with three projections: gated, up, and down. Each normalization layer has 5120 parameters. Thus we have,
$$\begin{align}\\&32000\times5120 & embeddings\\&+(128\times40\times5120\times4\,+ & attention:QKVO\\&\quad13824\times5120\times3\,+ & MLP:projections\\&\quad5120\times2) & RMSNorm\\&\quad\quad\quad\quad\times40 & layers\\&+32000\times5120 & output\\&+5120 & RMSNorm\\&=13,015,864,320\\\end{align}$$
Here's a similar calculation for Llama-3-70B. Since this model uses GQA, there are 64 Q and O heads but only 8 K and V heads. Q and O each count fully while K and V each count as 8/64 of a full matrix, which yields the factor $2+2/8=2.25$. Thus we have,
$$\begin{align}\\&128256\times8192 & embeddings\\&+(128\times64\times8192\times2.25\,+ & attention:QKVO\\&\quad28672\times8192\times3\,+ & MLP:projections\\&\quad8192\times2) & RMSNorm\\&\quad\quad\quad\quad\times80 & layers\\&+128256\times8192 & output\\&+8192 & RMSNorm\\&=70,553,706,496\\\end{align}$$
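The two calculations above can be reproduced with a short function. This is a sketch that follows the same breakdown (embeddings, attention, MLP, RMSNorm, output); the function and argument names are chosen here for illustration and do not come from any library.

```python
def llama_params(vocab, dim, n_layers, n_heads, n_kv_heads, head_dim, ffn_dim):
    """Count parameters using the breakdown from the equations above."""
    embeddings = vocab * dim
    # Q and O use all heads; K and V use only the (possibly fewer) KV heads.
    attention = dim * head_dim * (2 * n_heads + 2 * n_kv_heads)
    mlp = 3 * ffn_dim * dim      # gated, up and down projections
    norms = 2 * dim              # two RMSNorm layers per block
    per_layer = attention + mlp + norms
    output = vocab * dim
    final_norm = dim
    return embeddings + n_layers * per_layer + output + final_norm

llama2_13b = llama_params(vocab=32000, dim=5120, n_layers=40,
                          n_heads=40, n_kv_heads=40, head_dim=128, ffn_dim=13824)
llama3_70b = llama_params(vocab=128256, dim=8192, n_layers=80,
                          n_heads=64, n_kv_heads=8, head_dim=128, ffn_dim=28672)

print(f"{llama2_13b:,}")
print(f"{llama3_70b:,}")
```

With MHA the number of K/V heads equals the number of query heads, so the attention term reduces to four full matrices. With GQA it reduces to the 2.25 factor used for Llama-3-70B.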
Could you share details of Llama pre-training?
Perplexity (PPL) improves with more training. Source: Touvron et al. 2023b, fig. 5.
Llama-1-65B was pre-trained with 1.4T tokens on 2048 A100 GPUs (80GB each) over 21 days. Data was mostly from CommonCrawl and C4. Llama-2-70B saw 2T tokens but Llama-3-70B saw 15T tokens. Llama-3-8B and Llama-3-70B took 1.3M and 6.4M GPU hours respectively for pre-training.
Llama-3 knowledge cutoff was March 2023 (8B) and December 2023 (70B). Training used data, model, and pipeline parallelization on two 24K GPU clusters. Meta achieved an effective training time exceeding 95%. Both pre-trained and fine-tuned models were evaluated on a variety of benchmarks.
Llama-3 used high-volume high-quality data from various sources. About 5% of the data was non-English from 30 languages. Data was preprocessed with heuristic and safety filters. Personal information was removed. Semantic deduplication was done. Llama-2 helped classify data quality. Models continued to improve log-linearly when trained on far more data than the Chinchilla Scaling Law recommends. By that law, an 8B model needs only about 200B tokens, but Llama-3 models were trained with 15T tokens. Compared to GPT-4's reported 1.7T parameters, Meta's preference is to train smaller models with lots of pre-training data.
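The scale of this over-training can be estimated with a back-of-the-envelope sketch. It assumes the widely quoted approximation of about 20 training tokens per parameter for compute-optimal training, which is a rule of thumb rather than the exact law. The constant and function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb, not an exact statement of the law

def chinchilla_tokens(n_params):
    """Approximate compute-optimal token count for a model of n_params."""
    return TOKENS_PER_PARAM * n_params

def overtraining_factor(n_params, trained_tokens):
    """How many times more tokens were used than the rule of thumb suggests."""
    return trained_tokens / chinchilla_tokens(n_params)

for name, n_params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9)]:
    optimal = chinchilla_tokens(n_params)
    factor = overtraining_factor(n_params, 15e12)
    print(f"{name}: ~{optimal / 1e9:.0f}B tokens suggested, "
          f"15T used, {factor:.1f}x more")
```

For an 8B model the rule of thumb gives roughly 160B tokens, the same order of magnitude as the figure of about 200B quoted above.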
How's the performance of Llama?
Performance of Llama-3 pre-trained model. Source: Meta 2024.
Across various benchmarks, Llama-3 and Llama-3-Instruct outperformed other models of similar size. This is significant since Llama-2 couldn't compete against proprietary models GPT and PaLM. Llama-3-8B fared slightly worse than Gemma-7B on the ARC-Challenge. Llama-3-Instruct outperformed Gemini-1.5-Pro on MMLU, HumanEval and GSM-8K while the latter did better in GPQA and MATH benchmarks.
Llama-1-13B gave similar performance to GPT-3-175B despite being 10x smaller. This can be attributed to the larger training dataset: 1T tokens for Llama-1-13B versus 300B tokens for GPT-3. On the Natural Questions dataset, smaller models using a 5-shot approach performed similarly to Llama-1-65B 0-shot. However, smaller models aren't good at quantitative reasoning. While Llama isn't tuned to follow instructions, Llama-1-65B is able to follow basic instructions.
On the LMSYS Chatbot Arena Leaderboard (May 2024), Llama-3-70b-Instruct was the top performing open-source model. Overall, it was in the 11th position. The next best open-source model was Cohere's Command R+. The top 10 models were all proprietary models including GPT-4o, Gemini-1.5-Pro, Claude-Opus and Yi-Large. One researcher commented that the gap between proprietary and open-source models is narrowing.
Memory requirement of Llama-2 with 4-bit quantization is 37.6GB (70B), 8.9GB (13B) and 5.5GB (7B).
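A lower bound on these figures can be sketched from the parameter count alone: each weight takes `bits / 8` bytes. The helper below is illustrative and counts only the weights. The published figures are higher, presumably because of quantization metadata, layers kept at higher precision and runtime buffers, none of which this sketch attempts to model.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumed round figures for each model size)
for name, n_params in [("70B", 70e9), ("13B", 13e9), ("7B", 7e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"Llama-2-{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```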

Milestones

Feb
2023

LLaMA model in its different sizes. Source: Touvron et al. 2023a.

Meta releases an LLM by the name LLaMA (Large Language Model Meta AI). This is released in four sizes: 7B, 13B, 33B, 65B. Meta open sources the model but the model weights are shared only upon request. However, the weights get leaked online in March.

Mar
2023

Supervised fine-tuning of Alpaca-7B using synthetic data. Source: Taori et al. 2023.

A Stanford University research team releases the open-source Alpaca-7B model. It's fine-tuned from Llama-7B with 52k instruction-following demonstrations. These instructions themselves were generated by OpenAI's text-davinci-003. While the model shows good results, it also suffers from hallucinations, toxicity and stereotypes. Another research group releases Vicuna-13B, fine-tuned from Llama-13B with 70k user-shared ChatGPT conversations at a cost of only about $300.

May
2023

To address some licensing limitations of Meta's Llama, OpenLM Research releases OpenLlama with a more permissive license. Importantly, pre-training datasets are open. In July, they release v2 and v3 versions of the model. Since its architecture is the same as Llama's, OpenLlama's weights can serve as a drop-in replacement in existing Llama implementations.

Jul
2023

Training of Llama-2-Chat. Source: Touvron et al. 2023b, fig. 4.

Meta releases Llama-2 in three sizes: 7B, 13B, 70B. Although Meta developed a 34B model, it chooses not to release it. Compared to Llama-1, context window is increased from 2048 to 4096. 34B and 70B models use GQA rather than MHA. Meta also releases a fine-tuned chat model called Llama-2-Chat in three sizes: 7B, 13B, 70B. Llama-2-Chat was put through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Only 27,540 high-quality SFT annotations were used. Compared to Llama-2, Llama-2-Chat is a safer and more helpful model.

Aug
2023

Training of Code Llama. Source: Meta 2023.

Based on the Llama-2 pretrained model, Meta releases Code Llama in three sizes: 7B, 13B, 34B. A 70B size follows in January 2024. These are fine-tuned for coding. It has three variants: CodeLlama foundation model, CodeLlama-Python and CodeLlama-Instruct. CodeLlama-Instruct is recommended for code generation from natural language prompts. The smaller models are fast and suited for real-time code completion tasks. The larger models are more accurate. The 7B model can be served from a single GPU.

Sep
2023

Meta announces that Llama-2 models are now available on Amazon Bedrock as a managed service. Other cloud platforms are set to follow shortly. Tens of thousands of startups are using the Llama models. On Hugging Face, Llama has 7000+ derivatives, some of which have improved common benchmarks by 10%. AMD, Intel, Nvidia, and Google are offering software and hardware optimizations for Llama. In October, Dell announces plans to bring Llama-2 to enterprises.

Oct
2023

Xia et al. release Sheared Llama, a pruned version of Llama-2-7B. They release it in sizes 1.3B and 2.7B. They show that structured pruning is a cost-effective way to build small models and argue that it should be preferred over training a small model from scratch.

Apr
2024

Performance and carbon footprint of Llama-3 models. Source: Adapted from Meta Llama 2024a.

Meta releases Llama-3 in two sizes: 8B, 70B. Compared to Llama-2, both sizes use GQA, token vocabulary is increased from 32000 to 128256, tokenizer changes from BPE sentencepiece to BPE tiktoken, and pre-training data is increased from 2T tokens to 15T tokens. Meta also releases a fine-tuned model called Llama-3-Instruct in sizes 8B and 70B. Meta also announces a 400B model that's still being trained and could be released in a few months. Llama-3-8B outperforms even the bigger Llama-2-13B model.

May
2024

A search for "llama3" on Hugging Face shows 5700+ models, just one month after the release of Llama-3. A search for "llama" shows 33k+ models. A title-only search for "llama" on arXiv brings up 95 technical papers. These numbers exclude fine-tuned Llama models (e.g. Alpaca, Vicuna) that are not named "llama". The numbers show the popularity and rapid adoption of open-source LLMs.

Sample Code

python

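# Sample usage of the Hugging Face transformers text-generation pipeline
# Source: https://huggingface.co/docs/transformers/main/en/model_doc/llama3
# Accessed 2024-05-30
# Note: the model repository is gated; access must be granted on Hugging Face first.

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B"

# Load the weights in bfloat16 and let the library place them on available devices
generator = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# The pipeline returns a list of dicts, each with a "generated_text" key
result = generator("Hey how are you doing today?")
print(result[0]["generated_text"])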


Cite As

Devopedia. 2024. "Llama (LLM)." Version 4, May 31. Accessed 2024-06-25. https://devopedia.org/llama-llm
