A foundation model is a large language model (LLM) trained on huge amounts of unlabelled and/or labelled data. It is not trained for, and does not excel at, any specific task. Rather, it is a model with a good general understanding of language, which makes it a strong starting point for fine-tuning on more specific tasks.
Fine-tuning is the process of taking a pre-trained model (foundation model) and training at least some of its weights for a specific use case. Fine-tuning is much more efficient than training from scratch, as we start from a trained model that already has a general understanding of language, and we get away with far less data. The use cases might be specific tasks such as sentiment analysis, or text generation for a specific domain such as legal documents. It is important to consider when to fine-tune and when to use zero-/few-shot learning instead.
A token is a unit of text (often close to a word) that a machine can process easily. Each token is represented as a number (an ID), which the model maps to a vector. Tokens can be words, subwords, characters or other units of text. We use tokens instead of raw characters because they give shorter sequences and carry more meaning per unit. A tokenizer takes in text and transforms it into tokens. Tokens are the basic unit when working with LLMs and you will encounter them in many places. For example, LLMs have token limits, and API usage is often billed per token in input and output. To get a feeling for how text is transformed into tokens, check the OpenAI tokenizer page.
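As a rough illustration of what a tokenizer does, the sketch below splits text into subword pieces by greedy longest-match against a tiny vocabulary and maps each piece to an ID. The vocabulary, the `toy_tokenize` helper and the `<unk>` entry are invented for this example. Real tokenizers learn a much larger vocabulary from data and differ in detail.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {"token": 0, "ize": 1, "r": 2, "s": 3, " ": 4, "the": 5, "<unk>": 6}

def toy_tokenize(text):
    """Greedy longest-match tokenization against VOCAB; returns (pieces, ids)."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: emit an unknown token for one character.
            pieces.append("<unk>")
            i += 1
    ids = [VOCAB[p] for p in pieces]
    return pieces, ids

pieces, ids = toy_tokenize("the tokenizers")
print(pieces)  # ['the', ' ', 'token', 'ize', 'r', 's']
print(ids)     # [5, 4, 0, 1, 2, 3]
```

Note how a single word ("tokenizers") becomes several tokens, which is why token counts are usually higher than word counts.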
Embeddings are machine-understandable representations of text. Sequences of tokens (typically limited to a few thousand tokens) are transformed into high-dimensional vectors. These embedding vectors carry inherent meaning and allow an LLM to capture the fundamental meaning of text. While embeddings are much older than LLMs, modern embeddings are usually learned during LLM training. They can then be used independently for other NLP tasks such as computing similarities, classification, or clustering.
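To make the idea of computing similarities concrete, here is a minimal sketch using cosine similarity, a common choice for comparing embeddings. The three-dimensional vectors are made up for illustration. Real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_invoice = cosine_similarity(embeddings["cat"], embeddings["invoice"])

# Related texts should score higher than unrelated ones.
print(f"cat vs kitten:  {sim_cat_kitten:.3f}")
print(f"cat vs invoice: {sim_cat_invoice:.3f}")
```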
In sentiment analysis, a text is analyzed for emotions, opinions and attitudes. These extracted metrics are important for many business applications, as many decisions are based on KPIs of products, marketing or user interactions. While sentiment analysis has been around for a while and there are many classical NLP techniques for it, LLMs have simplified and improved it a lot. LLM foundation models can be used out of the box for sentiment analysis, but models that have been fine-tuned for sentiment analysis with a labelled dataset perform better. Labels for text snippets could be 'positive', 'neutral' and 'negative'. There are many fine-tuned sentiment analysis models available, e.g. on Hugging Face. Only in specific cases, e.g. domain-specific language, does a model actually have to be trained on your own custom dataset.
Prompt engineering is the art of communicating to an LLM exactly what it should generate. More specific instructions usually help in guiding the LLM output. We usually think of prompt engineering in the following dimensions:
- Instruction: This should specify exactly what the task is, such as "summarize", "write", "classify", etc.
- Context: Additional contextual information (e.g. from a vector database) to steer the output.
- Input Data: The actual input or question for which the response is generated.
- Output Indicator: The output format or type we want.
Prompt engineering is usually an iterative process and requires a lot of experimentation. It helps to be precise in the instructions and avoid ambiguity. Often this leads to longer and longer prompts during experimentation, which is not ideal: tokens are limited, and it becomes more difficult for an LLM to identify which parts of the prompt are actually the most important.
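The four dimensions listed above can be sketched as a simple prompt template. The `build_prompt` helper, the section labels and the example strings are all made up for illustration. There is no single standard layout, and the best format depends on the model.

```python
def build_prompt(instruction, context, input_data, output_indicator):
    """Assemble a prompt from the four dimensions, one labelled section each."""
    parts = [
        f"Instruction: {instruction}",
        f"Context: {context}",
        f"Input: {input_data}",
        f"Output format: {output_indicator}",
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Classify the sentiment of the customer review.",
    context="The review is about a pair of wireless headphones.",
    input_data="The sound is great, but the battery dies after two hours.",
    output_indicator="One word: positive, neutral or negative.",
)
print(prompt)
```

Keeping the parts separate like this makes it easier to experiment with one dimension at a time.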
In the zero-shot approach, we tell the LLM what to generate in a single prompt, without giving any examples of the task, as described in the prompt-engineering section above. It's the usual approach for models like ChatGPT, which have been fine-tuned to follow instructions, e.g. using RLHF.
Few-Shot Approach / In-Context Learning
In-Context Learning (ICL) / few-shot learning is a prompt engineering method where we provide the model with a few examples of the task to perform directly in the prompt in natural language. For example, it could be a couple of pairs of prompts and answers and finally the real prompt for which we want to generate the answer. In this way, the LLM uses the examples provided and will generate an answer in the same style or format. It means that we can use LLM foundation models out of the box to perform our task without having to fine-tune.
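A few-shot prompt of the kind described above can be sketched as follows. The `build_few_shot_prompt` helper, the "Review:"/"Sentiment:" labels and the example reviews are invented for illustration.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate solved examples, then the real input with the answer left open."""
    blocks = [f"Review: {text}\nSentiment: {label}" for text, label in examples]
    # The final block ends right where the model should continue.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

examples = [
    ("I love this phone, the camera is amazing.", "positive"),
    ("The delivery was late and the box was damaged.", "negative"),
]
few_shot_prompt = build_few_shot_prompt(examples, "Works as expected, nothing special.")
print(few_shot_prompt)
```

Because every example ends with a single-word label, the model is nudged to answer in the same format.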
Step-By-Step Reasoning / Chain of Thought
Chain-of-Thought (CoT) prompting enables complex reasoning by having the model output intermediate reasoning steps. In many cases, it performs better than directly generating the final answer. CoT can be a zero-shot approach, for example by telling the LLM something like "Let's think step by step". Or it could be a few-shot approach, where we give examples as pairs of questions and answers in which the answers spell out the reasoning that leads to the final solution.
LLMs can often sound extremely confident, even when the generated content does not make sense. This is because LLMs have mainly been trained to complete text, so the style of the output can be confident while the actual content is made up. There are some mitigation techniques that can be used in zero-shot approaches, such as telling the LLM to "not provide an answer if you don't know", or controlling the generation by other means. If we constrain the model and limit its freedom to generate content, we usually decrease hallucinations. However, hallucinations are inherent to LLMs and can't be completely removed. In many cases, it is better to focus on good few-shot approaches or RAG.
Retrieval Augmented Generation (RAG)
In RAG, we provide contextual information directly in the prompt together with the instruction. That information usually comes from a larger (vector) database, and we need to select the best-matching information based on the user prompt. This is usually done with a database search, where we match the user prompt against all available chunks of text, using embedding distance, classical ranking algorithms, or a combination of the two. There are many more techniques for improving this retrieval step.
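The retrieval step can be sketched as ranking chunks by embedding similarity and pasting the best ones into the prompt. Everything below is a toy stand-in: the chunk texts, the pre-computed three-dimensional embeddings and the `retrieve` helper are invented. In practice an embedding model and a vector database would take their place.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical (text, embedding) pairs standing in for a vector database.
CHUNKS = [
    ("Refunds are processed within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.1, 0.9, 0.1]),
    ("Refund requests require the order number.", [0.8, 0.0, 0.2]),
]

def retrieve(query_embedding, chunks, top_n=2):
    """Return the texts of the top_n chunks most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_n]]

query = "How long does a refund take?"
query_embedding = [1.0, 0.1, 0.0]  # made up; would come from the same embedding model

context = retrieve(query_embedding, CHUNKS)
rag_prompt = (
    "Answer using only the context below.\n\nContext:\n"
    + "\n".join(f"- {chunk}" for chunk in context)
    + f"\n\nQuestion: {query}"
)
print(rag_prompt)
```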
Causal Language Modeling (CLM)
CLM is a way to train an LLM by predicting the next word or token given only the preceding ones. During training, the true next word/token is known from the text itself and serves as the training target. As such, no labelled data is needed. GPT-2 is an example of a causal language model.
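To show why no labels are needed, the sketch below turns a token sequence into (context, next token) training pairs. The `causal_lm_examples` helper is a made-up name, and words are used as tokens for readability.

```python
def causal_lm_examples(tokens):
    """Each prefix of the sequence is an input; the token that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["the", "cat", "sat", "down"]
clm_pairs = causal_lm_examples(tokens)
for context, target in clm_pairs:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> down
```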
Masked Language Modeling (MLM)
MLM is a way to train an LLM by predicting masked words or tokens given the surrounding words/tokens as input. During training, words/tokens are randomly masked and can be compared to what the model predicts. As such, unlabelled data is used during training. BERT is an example of a masked language model.
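The masking step can be sketched as follows. This is a simplification: the `mask_tokens` helper and the `[MASK]` string are illustrative, and the masking probability is set higher than the 15% used by BERT so that the short example is likely to show some masks.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, seed=0):
    """Randomly replace tokens with MASK; return the masked sequence and the targets."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked = []
    targets = {}  # position -> original token the model should predict
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

The model sees `masked` as input, including the tokens on both sides of each mask, and is trained to predict the entries in `targets`.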
With the temperature, we can control the randomness of the generated output. It modifies the probabilities of the next words by dividing the model's raw scores (logits) by the temperature before the final softmax is applied, which sharpens or flattens the resulting distribution. A lower temperature will mostly pick the most likely word, while a higher value creates more randomness and diversity in the output, as less likely words can be chosen. Read the documentation of the specific LLM and API that you're using for what values to use and for implementation details. Also note that different LLMs might be affected differently by the temperature parameter.
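The effect of temperature can be sketched with a softmax over made-up logits. The `softmax_with_temperature` helper and the logit values are illustrative. Actual APIs may clamp or interpret the parameter differently.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

low = softmax_with_temperature(logits, temperature=0.5)
high = softmax_with_temperature(logits, temperature=2.0)

# Low temperature concentrates probability on the top word,
# high temperature spreads it more evenly.
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```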
When generating the next word, the top-k parameter defines how many candidate words the next word is sampled from. E.g. with k=3, the model would only consider the 3 most likely words and sample one of them, typically in proportion to their renormalized probabilities. Top-k sampling can help constrain the output.
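A minimal sketch of top-k sampling is shown below. It assumes the word is drawn in proportion to the probabilities of the k remaining candidates, which is a common choice. The `top_k_sample` helper and the word probabilities are invented.

```python
import random

def top_k_sample(probs, k, rng):
    """Keep the k most likely words and sample one, weighted by probability."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [p for _, p in top]
    # random.choices normalizes the weights internally.
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical next-word probabilities.
next_word_probs = {"dog": 0.4, "cat": 0.3, "bird": 0.15, "car": 0.1, "idea": 0.05}

rng = random.Random(0)
samples = [top_k_sample(next_word_probs, k=3, rng=rng) for _ in range(10)]
print(samples)  # only "dog", "cat" and "bird" can appear
```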
When generating the next word with top-p sampling (also known as nucleus sampling), the model samples from the smallest set of most likely words whose cumulative probability reaches p. Unlike top-k, the number of candidate words adapts to the shape of the distribution, which allows for a bit more diversity in the generated text.
E.g. assume the next words, sorted by probability, are: p(w1)=0.17, p(w2)=0.1, p(w3)=0.09, p(w4)=0.05, p(w5)=0.04. For a value p=0.4, we would add up the probabilities of w1, w2, w3 and w4, which gives a cumulative value of 0.17 + 0.1 + 0.09 + 0.05 = 0.41. At this point we would stop, as the cumulative probability exceeds 0.4, and w5 would not be considered.
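The selection step from this example can be reproduced in a few lines. The `nucleus` helper is a made-up name, and only the candidate selection is shown. The actual sampling would then draw from the selected words.

```python
def nucleus(sorted_probs, p):
    """Collect words in order of probability until the cumulative mass reaches p."""
    selected = []
    cumulative = 0.0
    for word, prob in sorted_probs:
        selected.append(word)
        cumulative += prob
        if cumulative >= p:
            break
    return selected, cumulative

# Probabilities from the example, already sorted in descending order.
sorted_probs = [("w1", 0.17), ("w2", 0.1), ("w3", 0.09), ("w4", 0.05), ("w5", 0.04)]

selected, cumulative = nucleus(sorted_probs, p=0.4)
print(selected)              # ['w1', 'w2', 'w3', 'w4']
print(round(cumulative, 2))
```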
Transformers are the basic building blocks of LLMs, introduced in the paper "Attention Is All You Need". The original transformer consists of two main blocks: an encoder and a decoder. At a very high level, the encoder takes in embeddings and generates an abstract representation of them, extracting meaning and context. The decoder then takes this encoded representation and generates an output. Many LLMs use only one of the two parts: BERT is encoder-only, while GPT models are decoder-only. Since each transformer block contains multiple attention heads running in parallel, they can each focus on different information contained in the input.
Multi-headed (Self-) Attention Layer
At the core of a transformer block lies the multi-headed attention layer. An attention layer takes in a sequence of embedding vectors and calculates how each token relates to the other tokens. If both inputs are the same sequence of embeddings, we talk about self-attention. Attention can also be calculated between two different sequences, in which case we would just talk about attention. Or we might have very specific inputs, e.g. encoder-decoder attention in the transformer architecture. See this blog post for a detailed discussion about how and why attention works so well.
In LLMs, we use multi-headed attention. By running multiple attention heads in parallel, we can have different heads focus on different information. E.g. one head might focus on subject-object relationships, while another might focus on adjectives, another on sentiment, etc. Note that we do not specify what the individual attention heads do. What they learn emerges entirely from training the LLM on huge amounts of text!
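The core computation inside a single attention head can be sketched in plain Python as scaled dot-product self-attention. This is a simplified sketch: real heads first apply learned projection matrices to obtain queries, keys and values, whereas here the embeddings are used directly for all three. The two-dimensional embeddings are made up.

```python
import math

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Single-head scaled dot-product self-attention without learned projections."""
    dim = len(embeddings[0])
    outputs = []
    all_weights = []
    for query in embeddings:
        # How strongly does this token relate to every token (including itself)?
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights = softmax(scores)
        # The output is the attention-weighted average of all token vectors.
        output = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical
outputs, attention_weights = self_attention(token_embeddings)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

Multi-headed attention runs several such computations in parallel, each with its own learned projections, and combines the results.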
Context Limit / Context Window / Context Length
The number of tokens an LLM can process at once, usually shared between input and output (completion). This is one of the current limitations of LLMs, as it limits both the memory/history in a chat system and the number of context chunks in RAG systems. It's also important to consider that the longer the input prompt, the fewer tokens you will have for the output. Models with larger context windows are continually being developed, as this seems to be a limiting factor for users.
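The trade-off between prompt length and output length is simple arithmetic, sketched below under the assumption that input and output share one window. The `max_completion_tokens` helper and the limit of 4096 tokens are illustrative. Check the documentation of the model you use for its actual limit.

```python
def max_completion_tokens(context_limit, prompt_tokens):
    """Tokens left for the output when input and output share the context window."""
    remaining = context_limit - prompt_tokens
    if remaining < 0:
        raise ValueError("prompt exceeds the context window")
    return remaining

CONTEXT_LIMIT = 4096  # hypothetical model limit

print(max_completion_tokens(CONTEXT_LIMIT, prompt_tokens=500))   # 3596
print(max_completion_tokens(CONTEXT_LIMIT, prompt_tokens=3000))  # 1096
```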
LLM Model Size
The size of an LLM is usually measured by its number of parameters, typically in the billions. In some cases, this is obvious from the name, e.g. LLaMA-7B has 7 billion parameters. For other models, such as the newer GPT models, the exact number of parameters is not public and there are only educated guesses. The model size plays an important role:
- The larger the model, the more memory (RAM or VRAM) and computing power are needed to run inference.
- The model size correlates with the model's memory: how much a model can remember about what it was shown during training.
- Certain capabilities of LLMs often only emerge with a certain model size.
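The memory point above can be made concrete with a back-of-the-envelope estimate: the weights alone take roughly the number of parameters times the bytes per parameter. The `weight_memory_gb` helper is a made-up name, and the estimate ignores activations, caches and framework overhead, so actual usage is higher.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the model weights only, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

seven_billion = 7_000_000_000
for dtype in ("fp32", "fp16", "int8"):
    print(dtype, weight_memory_gb(seven_billion, dtype), "GB")
# fp32 28.0 GB
# fp16 14.0 GB
# int8 7.0 GB
```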
Reinforcement Learning from Human Feedback (RLHF)
Training technique where we fine-tune a pre-trained LLM on human feedback. The process starts with creating a reward model, which is a separate model that takes in a generated output from an LLM, and outputs a score (how good the input text is). The dataset to train the reward model is usually collected by users giving feedback on LLM responses, either live or offline. Once a reward model has been trained, it is then used as a reward function to fine-tune the pre-trained LLM. See this blog post for an in-depth discussion of how RLHF works.
Low-Rank Adaptation (LoRA)
Efficient fine-tuning method where the original weights are kept frozen, and only small 'update matrices' (hence adaptation) are trained on top of the original weights. Those 'update matrices' are generally added to the attention layers. Instead of full matrices, rank-decomposition matrices (hence low-rank) are trained, which have significantly fewer parameters. LoRAs require a lot less memory to train and are very portable, both due to the small number of parameters. They can also be applied to multi-modal models, e.g. in Stable Diffusion models, LoRA is applied in the cross-attention layers of the denoising U-Net (where image and text prompt connect).
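The parameter savings can be illustrated by counting. For a frozen weight matrix of shape d x k, a full update would train d * k values, while a rank-r decomposition into two matrices of shapes d x r and r x k trains only r * (d + k). The helper names, the layer size of 4096 and the rank of 8 below are assumptions chosen for illustration.

```python
def full_update_params(d, k):
    """Trainable values when updating a d x k weight matrix directly."""
    return d * k

def lora_params(d, k, r):
    """Trainable values for a rank-r update: a d x r matrix plus an r x k matrix."""
    return r * (d + k)

d, k, r = 4096, 4096, 8  # hypothetical layer size and rank

full = full_update_params(d, k)
low_rank = lora_params(d, k, r)
print(full)              # 16777216
print(low_rank)          # 65536
print(full // low_rank)  # 256 times fewer trainable parameters
```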
LangChain is an open-source library that makes it easy to build GenAI applications. It has integrations for most common tools like APIs, databases, memory, etc. and makes connecting those tools very easy. By chaining a sequence of calls, complex logic that goes beyond a single LLM call can be built easily: routing, RAG, memory, agents, etc.
LlamaIndex is an open-source library often used to build LLM apps (with any LLM integration, not limited to LLaMA). It specializes in connecting private or domain-specific data to LLMs. This includes data ingestion, structuring, indexing, retrieval, etc.