Large Language Model

Article Info

Contributed by
1 author

Last updated on
2024-05-26 10:41:20

design modelling llm

LLM App
LLMs for Code
LLM Evaluation Metrics
LLM Hallucination
Prompt Engineering
Generative Artificial Intelligence

Article Versions

3 2024-05-26 10:41:20
4095,4094 3,4095

By arvindpdmn

Adding more refs and citations.
2 2024-05-25 11:29:27
4094,4072 2,4094

By arvindpdmn

Completing all content. More citations to be added.
1 2024-04-06 11:13:51
1,4072

By arvindpdmn

Initial questions added.

Chat Room

Submitting ...

You are editing an existing chat message.

A selection of LLMs on a timeline. Source: Zhao et al. 2023, fig. 3.

By seeing lots of text, a language model learns the probability of a sequence of words. A Large Language Model (LLM) also learns certain nuances of the language itself. Without being explicitly taught the rules of grammar, it encodes in its model parameters the syntax of the language and word semantics.

Given an input prompt, an LLM predicts the next most probable word. Hence, LLMs are generative in nature. LLMs come under the more general discipline of Generative AI.

There's no definite answer to what makes an LLM large. Any model trained on billions of words and learns a few billion parameters is perhaps an LLM. Above a certain threshold size, LLMs are seen to exhibit emergent behaviour.

Most users will use pre-trained LMs, perhaps fine-tune them for specific use cases and invoke them via apps.

Discussion

How are LLMs trained and deployed?
LLM pre-training, fine-tuning and prompting. Source: Wolfe 2024.
An LLM is trained on lots of unlabelled data. This is self-supervised learning: the model automatically learns latent patterns and relationships. Training data comes from a variety of sources including webpages, books, discussion forums, technical journals, code samples, product documentation, etc. The end result is a Pre-Trained Language Model (PLM).
It's possible to deploy a PLM for making inferences. However, a PLM is what we call a Foundation Model (FM). It's not trained for any specific task, such as language translation, code generation or text summarization. For better results, a PLM becomes the foundation on which it's fine-tuned for a specific task. Task-specific or domain-specific data (which may also include labels) is used for fine-tuning. This is much smaller than the pre-training dataset. The end result is a Fine-Tuned LLM.
LLMs are typically deployed in the cloud. Users interact with apps. Apps call the LLM APIs. Apps may use agents to mediate interactions between users and the LLM. Agents have personas and memory. They've access to tools and external knowledge sources. Agents may enhance a user query before it's fed into the LLM as a prompt.
What's the internal architecture of an LLM?
Architecture of GPT LLM. Source: Lee 2023, fig. 1.
LLMs are based on the transformer architecture that was invented in 2017. Transformers are a specific type of Artificial Neural Networks (ANNs). An LLM typically has many attention layers. Each layer consists of a multi-head attention block plus a feedforward neural network (FFNN). Output of one layer feeds into the next. Finally, an FFNN outputs the next token. Attention itself is an approach to learn and quantify how a word is related to other words surrounding it.
The original transformer used an encoder-decoder architecture because that research focused on machine translation. For example, to translate a sentence from English to French, the encoder would encode the English sentence and the decoder would predict the French words one at a time. BART (2020) and T5 (2022) used the encoder-decoder architecture. BERT (2018) was an encoder-only transformer. Most modern transformers including GPT-4, Claude 3 and Llama 3 are decoder-only. Decoder-only transformers are autoregressive, that is, each word is generated based on preceding words.
Recent research has brought a few variations of transformers. In addition, there's attempt to reformulate attention as RNNs.
What are the building blocks of LLM?
Illustrating GPT-4's tokenization. Source: OpenAI 2024.
Computers don't understand words the way humans do. Words are therefore represented as numbers. Transformers use a sequence of numbers called a vector. The number of items in a vector is called its dimension.
In reality, a word is decomposed into one or more of a more basic unit called a token. LLMs internally deal with tokens. There are many tokenizers: Byte Pair Encoding (BPE), SentencePiece, Unigram, WordPiece, etc. The output of a tokenizer is a vocabulary of tokens called encoding.
Embeddings are learned representations of tokens. Learning happens via language modelling tasks such as predicting the next token or masked tokens. Embeddings therefore capture the context in which tokens occur in the language. Mathematically, embeddings are tokens represented as vectors. Tokens that are similar or related (king and queen, coffee and tea) are likely to be close to one another in the vector space.
ChatGPT uses Byte-level BPE (BBPE) with 100k-token vocabulary at about 100 tokens per 75 words. An example embedding from OpenAPI is text-embedding-3-large of 3072-dimensional vectors. Assuming 32-bit floats, this has a memory requirement of \(3072\cdot4\cdot100000\) = \(1.23GB\).
What techniques are being used to improve LLMs?
Fine-tuning takes a foundation model and trains it for specific tasks. With this approach, many fine-tuned models can be obtained from a common foundation model. Fine-tuning is a lot less expensive than pre-training. It also requires far less training data. Some methods fine-tune the entire model. Others add extra parameters and only these are fine-tuned.
It's possible to elicit better responses from PLMs just by customizing the prompts. Called Prompt Engineering or In-Context Learning (ICL), user queries are enhanced with templates and a few illustrative examples of the task at hand.
Retrieval Augmented Generation (RAG) is a technique in which LLMs are given additional context along with the prompt. Given a user query, relevant context is retrieved from a knowledge base. This knowledge base contains private, up-to-date or domain-specific data. Context helps LLMs to generate more accurate responses.
LLMs are huge but their capacity is often underutilized. This means that similar performance can be obtained by a smaller model. Quantization, knowledge distillation and pruning are techniques for model compression. Using high-quality high-volume pre-training data, it's also possible to train a smaller model with only a small loss of accuracy.
What are some applications of LLM?
LLMs fine-tuned for various tasks. Source: Merritt 2023.
Considering text-only applications, LLMs are being used for information extraction, text summarization, question answering, commonsense reasoning, sentiment analysis, content generation, code generation, language translation, and more. For example, companies can provide customer support via chatbots that have access to product documents, user manuals and warranty information. E-commerce websites can provide an auto-generated product review based on user-written reviews. Job-seekers can use LLMs to write resumes customized for each job description. LLMs can create case summaries for legal teams or extract the sentiment from financial reports.
Multimodal applications span not just text but also speech, audio, image and video content. Applications include image captioning, object recognition, image generation, image enhancement, speech transcription, speech recognition, video generation, video question answering, video segmentation, and more. Video search and retrieval is possible. An audio podcast can be created from a technical publication. In radiology, LLMs can process text, handwritten notes and MRI scans for diagnosis. LLMs can present financial information in the form of charts.
What are some examples of LLMs?
Among the FMs are Claude 3 (Anthropic), Gemini (Google), GPT-4 (OpenAI), Jurassic-2 (AI21 Labs), PaLM 2 (Google), Stable LM (Stability AI), and Titan Text G1 (Amazon). Among the open-source FMs are Command R (Cohere), Falcon 180B (TTI), Jamba (AI21 Labs), Llama 3 (Meta), Mixtral 8x22B (Mistral AI), and T5 (Google). Open source means that architecture, weights and in some cases even pre-training datasets are published.
InstructGPT is fine-tuned from GPT-3 to follow instructions and align with human values. It can be used for technical documentation, customer support, translation, etc. Similar instruct LLMs fine-tuned from their corresponding base models include Alpaca-7B, Dolly-7B, Falcon-7B-Instruct and Mistral-7B-Instruct. WizardMath is fine-tuned from Llama-2 to solve math problems. Code Llama is fine-tuned from Llama-2 for code generation in many popular programming languages. Likewise, Codex is fine-tuned from GPT-3. Flan-T5 is fine-tuned from T5 on many tasks.
Some models are specialized during pre-training itself and fine-tuned further if necessary. Domain-specific PLMs include BloombergGPT (financial) and BioBERT (biomedical). CroissantLLM is pre-trained on English and French tokens. It can then be fine-tuned for chat, translation and summarization tasks. PolyCoder is pre-trained on code in a dozen programming languages.
What are some challenges with LLMs?
Called hallucination, LLMs can give seemingly convincing answers that are wrong. Answers could be self-contradictory, nonsensical, or ungrounded with respect to the input context. RAG and RLHF mitigate this problem.
From data, LLMs also learn some bad stuff: bias, hate speech, self-harm, jailbreaks, etc. Their responses may leak private or copyrighted information. It's therefore wise to filter responses before delivering them to users. Bad actors could use LLMs to create misinformation, adversarial attacks and malware.
LLMs are pre-trained on billions or even trillions of tokens. At such volumes, ensuring high-quality data is a challenge. Data contamination happens when training data finds its way into test datasets. Data pollution happens when LLM-generated data (with hallucination and misinformation) gets used for training the next generation of models. Training LLMs with synthetic training data can lead to model collapse.
Evaluation metrics and benchmarks are far from ideal. These may fail to evaluate LLMs at a qualitative level or in a domain-specific manner. In production, evaluation needs to be done continously since models tend to drift.

Milestones

1980

Interest towards language modelling is motivated by speech recognition. Statistics is used, with n-gram modelling being a popular approach. Language models in this era are therefore named Statistical Language Models (SLMs).

2003

Bengio et al. propose a language model based on a feedforward neural network. They learn a distributed representation of words (which would later be called embeddings) and a joint probability function of word sequences. Thus is born the Neural Language Model (NLM). In later years, RNN and LSTM architectures are used for NLM.

2014

Encoder-decoder model with attention. Source: Weng 2018, fig. 4.

For machine translation, Bahdanau et al. introduce the concept of attention to a seq2seq model. The decoder "pays attention" to important parts of the source sentence. This allows the decoder to do soft alignment with the encoder. Thus, the encoder isn't forced to compress all relevant information into a single vector.

2017

Vaswani et al. propose the transformer model based on the concept of self-attention where words attend to other words in the sequence. They use an encoder-decoder architecture. Essential concepts used in the research include input/output embeddings, positional embeddings, and multi-head attention. Unlike RNNs, transformers can be parallelized.

2018

The idea of a Pre-Trained Language Model (PLM) that can be later be fine-tuned for specific tasks in born. Two models released as PLMs include GPT (June) and BERT (October). Both use transformers but GPT is autoregressive and decoder-only whereas BERT is bidirectional and encoder-only. Due to these PLMs, some think that "NLP's ImageNet moment has arrived" and 2018 is NLP's "watershed moment".

2019

Illustrating PEFT (2b) and comparing it with full finetuning (2a). Source: Raschka 2023.

Rather than fine-tune all parameters of an LLM, it's more efficient to fine-tune only a few parameters. This approach is called Parameter-Efficient Fine-Tuning (PEFT). In time, this approach becomes popular and many variants of PEFT are proposed: LoRA (2021), , (IA)3 (2022), and QLoRA (2023).

2020

In-Context Learning with a few examples of the task. Source: Bashir 2023.

New research suggests that foundation models can be applied with better results even without fine-tuning. One approach called Retrieval-Augmented Generation supplements LLMs with external knowledge sources. The sources are searched and relevant information is added to the query before prompting the LLM. In another approach called In-Context Learning (ICL) the prompt includes a few examples of the task at hand.

2022

In January, OpenAI announces InstructGPT, a model that's fine-tuned from GPT-3. Unlike GPT-3 that can sometimes give wrong or unhelpful answers, InstructGPT is aligned to user needs. It can follow instructions in the prompts. Supervised training was done with a technique called Reinforcement Learning from Human Feedback (RLHF). In November, OpenAI releases to the public a similar conversational model called ChatGPT fine-tuned from GPT-3.5. ChatGPT becomes so popular that it reaches a 100M user base within 2 months.

May
2023

LLMs by size, volume of training data and release date. Source: Bhayana 2024, fig. 1.

Google releases PaLM-2-340B pre-trained on 3.6T tokens. Their larger model PaLM-540B (April 2022) was trained on only 780B tokens. Thus, we see research interest in smaller models pre-trained on relatively larger datasets. This is motivated by the Chinchilla Scaling Law, first noted in 2022. Other small models released in 2023 include Llama-2-7B, Mistral-7B, Orca-2-7B, and Phi-2-2.7B. Compare these with OpenAI's GPT-4 (March 2023) of 1.8T parameters trained on 13T tokens.

Feb
2024

Google's Gemini 1.5 Pro claims a context window of 1M tokens. This is an advancement compared to 100K in Claude 1.3 (Mar 2023) and 128K in GPT-4 Turbo (Nov 2023). A context window is essentially the number of tokens (input and output) over which attention is computed. A larger context window can capture long-distance relationships.

References

Article Stats

2133

Words

Authors

Edits

Chats

Likes

496

Hits

Cite As

Devopedia. 2024. "Large Language Model." Version 3, May 26. Accessed 2024-06-25. https://devopedia.org/large-language-model

Contributed by
1 author

Last updated on
2024-05-26 10:41:20

design modelling llm

LLM App
LLMs for Code
LLM Evaluation Metrics
LLM Hallucination
Prompt Engineering
Generative Artificial Intelligence

Large Language Model

Discussion

Milestones

References

Further Reading

Article Stats

Cite As

See Also

Large Language Model

Discussion

Milestones

References

Further Reading

Article Stats

Author-wise Stats for Article Edits

Cite As

See Also

Login