<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Denken</title>
	<atom:link href="https://www.aritrasen.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.aritrasen.com</link>
	<description></description>
	<lastBuildDate>Mon, 11 Sep 2023 12:02:57 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.3.8</generator>
	<item>
		<title>Announcement: Launch of my YouTube channel focused on Data Science</title>
		<link>https://www.aritrasen.com/announcement-launch-of-my-youtube-channel-focused-on-data-science/</link>
					<comments>https://www.aritrasen.com/announcement-launch-of-my-youtube-channel-focused-on-data-science/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Mon, 11 Sep 2023 12:02:21 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[YouTubeChannel]]></category>
		<guid isPermaLink="false">https://www.aritrasen.com/?p=1059</guid>

					<description><![CDATA[Want to get rid of Monday morning blues? Need a refresher in Python for Data Science? Here is something that can help &#8230; Happy to launch my YouTube channel with a first series of tutorials covering the bare-minimum Python you need to get started with Data Science. The series contains 7 video tutorials on the below-mentioned topics...]]></description>
										<content:encoded><![CDATA[
<p>Want to get rid of Monday morning blues? Need a refresher in Python for Data Science? Here is something that can help &#8230;</p>



<p>Happy to launch my YouTube channel with a first series of tutorials covering the bare-minimum Python you need to get started with Data Science. The series contains 7 video tutorials on the below-mentioned topics &#8211;</p>



<p>1. Python variables and datatypes <br>2. Datatype conversion and different types of loops<br>3. Functions, Scope of Variables, Lambda Functions<br>4. Sorting List, Tuple, Dictionary &amp; Zip Function<br>5. Class and Instances (Basics of OOPs)<br>6. NumPy<br>7. Pandas</p>



<p>Channel link &#8211; <a href="https://www.youtube.com/@AritraSen">Aritra Sen &#8211; YouTube</a><br>Series link &#8211; <a href="https://youtube.com/playlist?list=PLOrU905yPYXIJrnREmOXHKnq2DhMwKdU7&amp;si=lSmnQYxcyDX6zLQE">https://youtube.com/playlist?list=PLOrU905yPYXIJrnREmOXHKnq2DhMwKdU7&amp;si=lSmnQYxcyDX6zLQE</a><br><br>Future series of playlists to come &#8211;<br><em>&#8211; Deep Learning with Pytorch<br>&#8211; NLP Zero to LLM<br>&#8211; Graph Neural Network</em><br><br><strong><span style="text-decoration: underline;">Subscribe to the channel for more such content.</span></strong><br><br>Thanks,<br>Aritra <br></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/announcement-launch-of-my-youtube-channel-focused-on-data-science/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: How to do LLM inference on CPU using Llama-2 1.9</title>
		<link>https://www.aritrasen.com/generative-ai-llms-how-to-do-llm-inference-on-cpu-using-llama-2-1-9/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-how-to-do-llm-inference-on-cpu-using-llama-2-1-9/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Thu, 07 Sep 2023 11:45:58 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">https://www.aritrasen.com/?p=1042</guid>

					<description><![CDATA[In the last few posts, we talked about how to use the Llama-2 model for different NLP tasks, and in most cases I used a GPU in Kaggle kernels. However, there can be scenarios where you don&#8217;t have a GPU and need to build apps using CPU only. In this short...]]></description>
										<content:encoded><![CDATA[
<p>In the last few posts, we talked about how to use the Llama-2 model for different NLP tasks, and in most cases I used a GPU in Kaggle kernels. However, there can be scenarios where you don&#8217;t have a GPU and need to build apps using CPU only. In this short post we will see how we can use the <strong>ctransformers</strong> library to load Llama-2 and run inference on CPU only. <a href="https://github.com/marella/ctransformers">ctransformers</a> provides Python bindings for Transformer models implemented in C/C++ using the&nbsp;<a href="https://github.com/ggerganov/ggml">GGML</a>&nbsp;library. Run the below command to install the ctransformers library.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="264" height="118" src="https://www.aritrasen.com/wp-content/uploads/2023/09/image.png" alt="" class="wp-image-1043"/></figure></div>


<p>The ctransformers library essentially helps to load quantized models on a CPU. With the ever-increasing size of LLMs, quantization plays a crucial role in running these giant models efficiently on commodity hardware, with minimal compromise in model performance. Recently, 8-bit and 4-bit quantization have enabled&nbsp;running LLMs on consumer hardware. GGML (created by Georgi Gerganov, hence the name) was designed to be used with the llama.cpp library, which is written in C/C++ for efficient inference of Llama models. It can load GGML models and run them on a CPU. To get the Llama-2 7B GGML models at different quantization levels, visit this Hugging Face link &#8211; <a href="https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main">TheBloke/Llama-2-7B-Chat-GGML at main (huggingface.co)</a></p>



<p>Based on your choice of quantization, you can download the corresponding model file and place it in a local folder as shown below &#8211; </p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" width="321" height="65" src="https://www.aritrasen.com/wp-content/uploads/2023/09/image-1.png" alt="" class="wp-image-1044" srcset="https://www.aritrasen.com/wp-content/uploads/2023/09/image-1.png 321w, https://www.aritrasen.com/wp-content/uploads/2023/09/image-1-300x61.png 300w" sizes="(max-width: 321px) 100vw, 321px" /></figure></div>


<p>As you can see, I have downloaded two differently quantized models from the above link &#8211; a 2-bit and a 4-bit quantized model.</p>



<p id="8211">They follow a particular naming convention: “q” + the number of bits used to store the weights (precision) + a particular variant. Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke:</p>



<ul>
<li><code>q2_k</code>: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.</li>



<li><code>q3_k_l</code>: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K</li>



<li><code>q3_k_m</code>: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K</li>



<li><code>q3_k_s</code>: Uses Q3_K for all tensors</li>



<li><code>q4_0</code>: Original quant method, 4-bit.</li>



<li><code>q4_1</code>: Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models.</li>



<li><code>q4_k_m</code>: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K</li>



<li><code>q4_k_s</code>: Uses Q4_K for all tensors</li>



<li><code>q5_0</code>: Higher accuracy, higher resource usage and slower inference.</li>



<li><code>q5_1</code>: Even higher accuracy and resource usage, and slower inference.</li>



<li><code>q5_k_m</code>: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K</li>



<li><code>q5_k_s</code>: Uses Q5_K for all tensors</li>



<li><code>q6_k</code>: Uses Q8_K for all tensors</li>



<li><code>q8_0</code>: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.</li>
</ul>
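<p>As a quick illustration, the naming convention above can be parsed mechanically; the helper below is my own sketch, not part of any library:</p>

```python
def parse_quant_name(name: str):
    """Split a GGML quant-method name like 'q4_k_m' into (bits, variant)."""
    head, _, variant = name.partition("_")  # 'q4_k_m' -> 'q4', '_', 'k_m'
    bits = int(head[1:])                    # strip the leading 'q'
    return bits, variant

print(parse_quant_name("q4_k_m"))  # -> (4, 'k_m')
print(parse_quant_name("q8_0"))    # -> (8, '0')
```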



<p>Once you download the model and place it in your local file system, you can easily load it using the process shown below; notice how fast the model loads from disk &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" fetchpriority="high" width="589" height="416" src="https://www.aritrasen.com/wp-content/uploads/2023/09/image-3.png" alt="" class="wp-image-1046" srcset="https://www.aritrasen.com/wp-content/uploads/2023/09/image-3.png 589w, https://www.aritrasen.com/wp-content/uploads/2023/09/image-3-300x212.png 300w" sizes="(max-width: 589px) 100vw, 589px" /></figure></div>


<p>Once loaded, the 4-bit quantized Llama-2 model takes around 3.53 GB of disk space. In a similar way you can also load the 13B Llama-2 models on your CPU for inference, from this link &#8211; <a href="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main">https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main</a></p>
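<p>A minimal sketch of that loading step (install the library first with <code>pip install ctransformers</code>; the file name below is one of the quantized files from the Hugging Face repo above &#8211; use whichever one you downloaded). The script only attempts the load when the model file is actually present on disk:</p>

```python
from pathlib import Path

# Assumption: a GGML file downloaded from TheBloke/Llama-2-7B-Chat-GGML
# sits next to this script; adjust the path to your own download.
MODEL_PATH = Path("llama-2-7b-chat.ggmlv3.q4_0.bin")

def load_llm(path: Path):
    # Imported lazily so the script degrades gracefully if ctransformers
    # is not installed.
    from ctransformers import AutoModelForCausalLM
    return AutoModelForCausalLM.from_pretrained(str(path), model_type="llama")

if MODEL_PATH.exists():
    llm = load_llm(MODEL_PATH)
    print(llm("What is quantization?", max_new_tokens=64))
else:
    print(f"{MODEL_PATH} not found; download a GGML model file first.")
```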



<p>Using the 4-bit quantized Llama-2 model and Gradio, I have created the demo shown below, running on CPU only.</p>



<figure class="wp-block-embed aligncenter is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio"><div class="wp-block-embed__wrapper">
<iframe loading="lazy" title="LLM Demo" width="640" height="360" src="https://www.youtube.com/embed/gIbG8LhLirg?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</div><figcaption class="wp-element-caption"><strong>Demo of LLM using Llama-2 and Gradio</strong></figcaption></figure>



<blockquote class="wp-block-quote">
<p>Do let me know in the comments if you like the video or not and in case you want me to create YouTube videos along with blogposts in future. Thanks for reading.</p>



<p>Reference: <a href="https://towardsdatascience.com/quantize-llama-models-with-ggml-and-llama-cpp-3612dfbcc172">Quantize Llama models with GGML and llama.cpp | Towards Data Science</a></p>
</blockquote>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-how-to-do-llm-inference-on-cpu-using-llama-2-1-9/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: Reduce Hallucinations with Retrieval-Augmented-Generation (RAG) 1.8</title>
		<link>https://www.aritrasen.com/generative-ai-llms-reduce-hallucinations-with-retrieval-augmented-generation-rag-1-8/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-reduce-hallucinations-with-retrieval-augmented-generation-rag-1-8/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Sat, 26 Aug 2023 16:13:42 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">https://www.aritrasen.com/?p=1031</guid>

					<description><![CDATA[Though there is huge hype and excitement about LLMs, as they are really good at several NLP-related tasks, they also come with a few of the below-mentioned issues: Frozen in time &#8211; LLMs are “frozen in time” and lack up-to-date information. This is due to the fact that these LLMs are trained with...]]></description>
										<content:encoded><![CDATA[
<p>Though there is huge hype and excitement about LLMs, as they are really good at several NLP-related tasks, they also come with a few of the below-mentioned issues:</p>



<p><strong>Frozen in time</strong> &#8211; LLMs are “frozen in time” and lack up-to-date information. This is because these LLMs are trained with a cutoff date, beyond which they are not aware of anything; for the same reason, if you ask ChatGPT about Llama-2, it won&#8217;t be able to answer your questions correctly. LLMs tend to hallucinate on such unknown questions and give you a convincing wrong answer. </p>



<p><strong>Lack of domain-specific knowledge</strong>&nbsp;&#8211; LLMs are trained on open-source datasets for generalized tasks, meaning they do not know your or any company’s private data. So again, for domain-specific questions they tend to give you convincing wrong answers.</p>



<p>When a user sends either of the above-mentioned two types of questions to an LLM, it tends to hallucinate and give wrong answers, as shown below, due to the lack of context available to the LLM.</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-6.png" alt="" class="wp-image-1032" style="width:213px;height:541px" width="213" height="541" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-6.png 263w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-6-118x300.png 118w" sizes="(max-width: 213px) 100vw, 213px" /><figcaption class="wp-element-caption">LLM with Hallucination (Image: Author)</figcaption></figure></div>


<p>In the last <a href="https://www.aritrasen.com/generative-ai-llms-semantic-search-and-conversation-retrieval-using-vector-store-and-langchain-1-7/">blogpost</a> we talked about how we can split documents into chunks and then create embeddings. These embeddings can be stored in any Vector Store for future use, as shown below &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="723" height="162" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-7.png" alt="" class="wp-image-1033" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-7.png 723w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-7-300x67.png 300w" sizes="(max-width: 723px) 100vw, 723px" /><figcaption class="wp-element-caption">Vector Store (Image: Author)</figcaption></figure></div>


<p>These documents can be anything, like:<br>&#8211; Domain-specific documents<br>&#8211; Documents / knowledge bases related to the time frame after the training cut-off date of the LLM<br>&#8211; Company-specific sensitive internal documents<br>&#8211; etc.<br><em><strong>This type of Vector Store can act as a knowledge base.</strong></em> </p>



<p><strong>Retrieval Augmented Generation</strong> (RAG) can help us tackle the above-mentioned issues. Using the question embedding, RAG retrieves the nearest-neighbour context from the Knowledge Base and adds that context to the query. Once the context is added, RAG sends the context-aware query to the LLM for a relevant answer. As the model now has the correct context to answer the query, the problem of hallucination can be reduced drastically. This process is also much simpler than parameter-efficient or full fine-tuning. The whole process is shown below &#8211; </p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-8.png" alt="" class="wp-image-1035" style="width:621px;height:602px" width="621" height="602" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-8.png 728w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-8-300x291.png 300w" sizes="(max-width: 621px) 100vw, 621px" /><figcaption class="wp-element-caption">RAG (Image: Author)</figcaption></figure></div>
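<p>Stripped of any framework, the retrieve-then-augment step can be sketched like this (the toy word-overlap &#8220;embedding&#8221; stands in for a real LLM embedding; all names and the sample knowledge base are illustrative):</p>

```python
import math

def embed(text):
    """Toy bag-of-words vector; a real system would use LLM embeddings."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

knowledge_base = [
    "Llama-2 is an open LLM released by Meta in July 2023.",
    "Pandas is a Python library for tabular data analysis.",
]

def retrieve(question, chunks):
    # Nearest-neighbour search over the knowledge base.
    return max(chunks, key=lambda c: cosine(embed(question), embed(c)))

question = "When was Llama-2 released?"
context = retrieve(question, knowledge_base)
# The context-aware query that would be sent to the LLM:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```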


<p>New documents can easily be added back to the Knowledge Base, so the problem of <em>&#8216;frozen in time&#8217;</em> can also be solved using the RAG methodology.<br>Example:</p>



<figure class="wp-block-image size-large"><img decoding="async" loading="lazy" width="1024" height="412" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-9-1024x412.png" alt="" class="wp-image-1037" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-9-1024x412.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-9-300x121.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-9-768x309.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-9-850x342.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-9.png 1371w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Now let&#8217;s get our hands dirty with RAG and Llama-2.</p>



<p><script src="https://gist.github.com/aritrasen87/c574968feebb983cd32f2c9fb06cc75b.js"></script></p>



<p>Do like, share and comment if you have any questions or suggestions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-reduce-hallucinations-with-retrieval-augmented-generation-rag-1-8/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: Semantic Search and Conversation Retrieval QA using Vector Store and LangChain 1.7</title>
		<link>https://www.aritrasen.com/generative-ai-llms-semantic-search-and-conversation-retrieval-using-vector-store-and-langchain-1-7/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-semantic-search-and-conversation-retrieval-using-vector-store-and-langchain-1-7/#comments</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Fri, 25 Aug 2023 17:02:46 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">https://www.aritrasen.com/?p=1021</guid>

					<description><![CDATA[In the last few blogposts, we have gone through the basics of LLMs, different fine-tuning approaches and the basics of LangChain. In this post we will mainly work with embeddings from an LLM: how we can store these LLM embeddings in a Vector Store, and how, using this persistent vector DB, we can do semantic search....]]></description>
										<content:encoded><![CDATA[
<p>In the last few blogposts, we have gone through the basics of LLMs, different fine-tuning approaches and the basics of LangChain. In this post we will mainly work with embeddings from an LLM: how we can store these LLM embeddings in a Vector Store, and how, using this persistent vector DB, we can do semantic search. Below are the high-level steps we will follow to perform the required operations &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" loading="lazy" width="1024" height="462" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-5-1024x462.png" alt="" class="wp-image-1023" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-5-1024x462.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-5-300x135.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-5-768x347.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-5-850x384.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-5.png 1316w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Semantic Search using Vector Store (Credit: Author)</figcaption></figure></div>


<p>Before going into the coding, let&#8217;s go through the steps in detail &#8211; </p>



<ol>
<li><strong><span style="text-decoration: underline;">Loading Document:</span></strong><br>Using LangChain we can load different types of documents like pdf, csv, html etc. Follow this page to get a more detailed understanding of the different document loaders &#8211;  <a href="https://python.langchain.com/docs/modules/data_connection/document_loaders/">https://python.langchain.com/docs/modules/data_connection/document_loaders/</a><br>For some of the document loaders, like HTMLLoader and PDF Loader, we need to install dependent libraries like BeautifulSoup and pypdf.<br></li>



<li><strong><span style="text-decoration: underline;">Transform Documents to Chunks:</span></strong><br>Once we load the document, we can access it using page_content; however, sometimes this page content can be too large to be fed into the model (every LLM has a maximum input-token limit). So we can create document chunks using the below-mentioned processes &#8211;<br>1. By using a chunk size based on character length.<br>2. By using the size of input tokens.<br></li>



<li><strong><span style="text-decoration: underline;">Create Embeddings:</span></strong><br>Using LangChain we can create numeric embeddings of the text chunks. LangChain supports different LLM embeddings, like OpenAI embeddings, Sentence Transformer embeddings, etc.<br></li>



<li><strong><span style="text-decoration: underline;">Vector Store:</span></strong><br>Using a Vector Store, we can store these document embeddings (persistent storage) for future uses like semantic search. A user sends a search text; using the LLM we first convert that text to an embedding, and then, using this query embedding and the Vector Store embeddings, we can perform semantic search and retrieve the most relevant document/text from the Vector Store. For this tutorial we will use the open-source vector store named chromadb (<a href="https://python.langchain.com/docs/modules/data_connection/vectorstores/">Vector stores | <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain</a>). Using a vector store we can easily add, update, or delete vectors.<br><br>Now let&#8217;s get our hands dirty.</li>
</ol>
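<p>Step 2 above (chunking by character length) is simple to sketch without any library; the chunk size and overlap values below are arbitrary illustrations of what a splitter like LangChain&#8217;s character text splitter does:</p>

```python
def split_text(text, chunk_size=100, overlap=20):
    """Split text into fixed-size character chunks with some overlap,
    similar in spirit to LangChain's character-based text splitters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "word " * 60            # a stand-in for a loaded document's page_content
chunks = split_text(doc)
print(len(chunks), len(chunks[0]))  # number of chunks, size of first chunk
```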



<p><script src="https://gist.github.com/aritrasen87/09e5f8e590b937cf2a563b86f4057baf.js"></script></p>



<p>Do like, share and comment if you have any questions or suggestions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-semantic-search-and-conversation-retrieval-using-vector-store-and-langchain-1-7/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: LangChain + Llama-2-chat on Amazon mobile review dataset 1.6</title>
		<link>https://www.aritrasen.com/generative-ai-llms-langchain-llama-2-chat-on-amazon-mobile-review-dataset-1-6/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-langchain-llama-2-chat-on-amazon-mobile-review-dataset-1-6/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Thu, 17 Aug 2023 09:09:36 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">https://www.aritrasen.com/?p=1000</guid>

					<description><![CDATA[In the last post we talked in detail about how we can fine-tune a pretrained Llama-2 model using QLoRA. Llama-2 has two sets of models: the first, used in the previous blogpost, is the pretrained model; then there is an instruction-finetuned Llama-2 chat model, which we will use in this post....]]></description>
										<content:encoded><![CDATA[
<p>In the last post we talked in detail about how we can fine-tune a pretrained Llama-2 model using QLoRA. Llama-2 has two sets of models: the first, used in the previous blogpost, is the pretrained model; then there is an instruction-finetuned Llama-2 chat model, which we will use in this post. <br>Llama-2 has been pretrained on an extensive corpus of self-supervised data, followed by alignment with human preferences via techniques such as Reinforcement Learning from Human Feedback (RLHF) to obtain Llama-2-chat, as shown in the image below (Source: Llama-2 paper)</p>


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image.png" alt="" class="wp-image-1002" width="573" height="287" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image.png 573w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-300x150.png 300w" sizes="(max-width: 573px) 100vw, 573px" /></figure></div>

<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="753" height="331" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-2.png" alt="" class="wp-image-1004" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-2.png 753w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-2-300x132.png 300w" sizes="(max-width: 753px) 100vw, 753px" /></figure></div>


<p>Prompt formats are somewhat different for Llama-2 and Llama-2-chat, as shown below &#8211;</p>



<figure class="wp-block-image size-large"><img decoding="async" loading="lazy" width="1024" height="248" src="https://www.aritrasen.com/wp-content/uploads/2023/08/image-3-1024x248.png" alt="" class="wp-image-1005" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/image-3-1024x248.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-3-300x73.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-3-768x186.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-3-850x206.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/08/image-3.png 1422w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>
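<p>For the chat variant, the system prompt is wrapped in <code>&lt;&lt;SYS&gt;&gt;</code> tags inside an <code>[INST]</code> block. A small helper to build a single-turn prompt in that format (the helper itself is my own sketch; the tags follow the published Llama-2 chat format):</p>

```python
def llama2_chat_prompt(system_prompt: str, user_message: str) -> str:
    """Build a single-turn Llama-2-chat prompt string."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(llama2_chat_prompt("You are a helpful assistant.",
                         "Summarize this review in one line."))
```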



<p><strong><span style="text-decoration: underline;">Langchain:</span></strong></p>



<p>LangChain gives us the building blocks to interface with any language model.</p>



<ul>
<li><a href="https://python.langchain.com/docs/modules/model_io/prompts/">Prompts</a>: Templatize, dynamically select, and manage model inputs</li>



<li><a href="https://python.langchain.com/docs/modules/model_io/models/">Language models</a>: Make calls to language models through common interfaces.</li>



<li><a href="https://python.langchain.com/docs/modules/model_io/output_parsers/">Output parsers</a>: Extract information from model outputs.<br></li>
</ul>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" loading="lazy" width="1024" height="393" src="https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-1024x393.jpg" alt="" class="wp-image-1007" srcset="https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-1024x393.jpg 1024w, https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-300x115.jpg 300w, https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-768x295.jpg 768w, https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-1536x590.jpg 1536w, https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-2048x786.jpg 2048w, https://www.aritrasen.com/wp-content/uploads/2023/08/Langchain-850x326.jpg 850w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Langchain flow (Source: <a href="https://python.langchain.com/docs/modules/model_io/">Model I/O | <img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f99c.png" alt="🦜" class="wp-smiley" style="height: 1em; max-height: 1em;" /><img src="https://s.w.org/images/core/emoji/14.0.0/72x72/1f517.png" alt="🔗" class="wp-smiley" style="height: 1em; max-height: 1em;" /> Langchain</a>)</figcaption></figure></div>


<p>In the below notebook, we will try out the Llama-2-chat model and explore the benefits of using LangChain as a platform for several LLM tasks like &#8211; </p>



<ol>
<li>Text summarization</li>



<li>Sentiment Analysis</li>



<li>Topic extraction</li>



<li>Battery-issue identification from mobile reviews.</li>
</ol>



<p><script src="https://gist.github.com/aritrasen87/addcb02296a1d8f97ae48e52f2519a1e.js"></script></p>



<p>Do like, share and comment if you have any questions or suggestions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-langchain-llama-2-chat-on-amazon-mobile-review-dataset-1-6/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: Finetuning Llama2 with QLoRA on custom dataset 1.5</title>
		<link>https://www.aritrasen.com/generative-ai-llms-finetuning-llama2-with-qlora-on-custom-dataset/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-finetuning-llama2-with-qlora-on-custom-dataset/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Thu, 27 Jul 2023 14:47:12 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">http://www.aritrasen.com/?p=985</guid>

					<description><![CDATA[In the last post in this series, we went through the inner workings of the LoRA fine-tuning process. In this blogpost we will use the concepts of LoRA together with the quantization method. We will use the newly launched Llama-2, one of the biggest LLM launches in the history of open-source models. Below...]]></description>
										<content:encoded><![CDATA[
<p>In the last post in this series, we went through the inner workings of the LoRA fine-tuning process. In this blogpost we will use the concepts of LoRA together with the quantization method. We will use the newly launched Llama-2, one of the biggest LLM launches in the history of open-source models. Below are the steps followed in the given notebook, along with details about each process:</p>



<ol>
<li>Install required packages.</li>



<li>Prepare the dataset for instruction fine-tuning.</li>



<li>Define quantization_config using BitsAndBytes.</li>



<li>Load the Llama-2 shared model with quantization_config.</li>



<li>Create the Llama-2 tokenizer.</li>



<li>Create the peft_config to fine-tune LoRA for the q, v attention matrices.</li>



<li>Define the training arguments.</li>



<li>Create the trainer with SFTTrainer.</li>



<li>Train the model.</li>



<li>Inference phase.</li>
</ol>



<p>Before we start coding the whole process, let&#8217;s understand a few concepts which we are yet to go through in this blogpost series &#8211;</p>



<p><strong><span style="text-decoration: underline;">Data Preparation:</span></strong><br>In this post we will use the dialogsum dataset from the Hugging Face datasets module. The dataset has 4 features and is divided into train, test and validation splits. The features are &#8211; [&#8216;id&#8217;, &#8216;dialogue&#8217;, &#8216;summary&#8217;, &#8216;topic&#8217;]. Our features of interest are dialogue and summary; basically, we are trying to fine-tune our model for a text-summarization task using this dataset. We will prepare the dataset in such a way that it can be used for instruction fine-tuning. Instruction fine-tuning&nbsp;uses a set of labeled examples in the form of {prompt, instruction, input, output} pairs to further train the pre-trained model for a particular task. The below function is self-explanatory for the data-preparation step.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/07/image-1-1024x388.png" alt="" class="wp-image-986" width="697" height="264" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/image-1-1024x388.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-1-300x114.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-1-768x291.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-1-850x322.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-1.png 1358w" sizes="(max-width: 697px) 100vw, 697px" /><figcaption class="wp-element-caption">Preparation instruction-based dataset (Credit: Author)</figcaption></figure></div>


<p><strong><span style="text-decoration: underline;">Quantization:</span></strong><br>With the rapid development of LLMs, it feels like every other day we get new models that are indeed very large, with huge numbers of parameters. The most challenging aspect is fitting these models onto minimal hardware, such as a single GPU. For example, to fine-tune BLOOM-176B, you&#8217;d need 72 GPUs (8x 80GB A100 GPUs). A lot of research is going into ways to fit these models on easily accessible hardware, and one such way is quantization. To understand this process, let&#8217;s first look at the data types being used and how they are represented. The size of a model depends largely on the number of parameters and the precision (float32, float16 or bfloat16) of those parameters. The idea is to reduce the model size using lower precision without hurting model performance, as shown below &#8211;<br></p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="660" height="421" src="https://www.aritrasen.com/wp-content/uploads/2023/07/tf32-Mantissa-chart-hi-res-FINAL.png" alt="" class="wp-image-987" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/tf32-Mantissa-chart-hi-res-FINAL.png 660w, https://www.aritrasen.com/wp-content/uploads/2023/07/tf32-Mantissa-chart-hi-res-FINAL-300x191.png 300w" sizes="(max-width: 660px) 100vw, 660px" /><figcaption class="wp-element-caption">Credit: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/</figcaption></figure></div>


<ul>
<li>FP32: 8 bits are reserved for the &#8220;exponent&#8221;, 23 bits for the &#8220;mantissa&#8221; and 1 bit for the &#8220;sign&#8221; of the number. With this data type a huge range of numbers can be represented.</li>



<li>FP16: 5 bits are reserved for the exponent and 10 bits for the mantissa. Due to the reduced precision, a smaller range of numbers can be represented. This exposes FP16 numbers to the risk of overflow (trying to represent a number that is very large) and underflow (representing a number that is very small).</li>



<li>BF16: To tackle this problem of FP16, BF16 was introduced, where 8 bits are reserved for the exponent (the same as FP32) and 7 bits for the fraction.</li>
</ul>
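<p>The ranges implied by these bit layouts can be checked with a little arithmetic: the largest finite value of a format with e exponent bits and m mantissa bits is (2 &#8722; 2^&#8722;m) &#215; 2^(2^(e&#8722;1)&#8722;1). A quick sketch:</p>

```python
# Largest finite value for an IEEE-style float with the given exponent/mantissa bits.
def max_finite(e_bits: int, m_bits: int) -> float:
    max_exp = 2 ** (e_bits - 1) - 1          # largest biased exponent value
    return (2 - 2 ** -m_bits) * 2.0 ** max_exp

fp32_max = max_finite(8, 23)   # ~3.4e38
fp16_max = max_finite(5, 10)   # 65504 -> easy to overflow during training
bf16_max = max_finite(8, 7)    # same exponent range as FP32, fewer mantissa bits
```

<p>This is exactly why FP16 training is prone to overflow while BF16, sharing FP32&#8217;s exponent range, rarely overflows at the cost of precision.</p>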



<p>I hope this gives you an idea of how quantization can reduce the size of a model. There are several quantization techniques as well; for more details please refer to this wonderfully written Hugging Face blog &#8211; https://huggingface.co/blog/4bit-transformers-bitsandbytes . We will use the bitsandbytes library to load the Llama-2 model with quantization parameters.</p>
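<p>A typical 4-bit setup with the transformers/bitsandbytes integration looks like the sketch below; the exact parameter values used in the notebook may differ, so treat these as illustrative:</p>

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of a 4-bit NF4 quantization config (values are illustrative).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load weights in 4-bit precision
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do compute in bf16 for stability
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)
```

<p>This config object is then passed to the model loading call (e.g. via the quantization_config argument).</p>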


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="344" height="141" src="https://www.aritrasen.com/wp-content/uploads/2023/07/image-3.png" alt="" class="wp-image-990" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/image-3.png 344w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-3-300x123.png 300w" sizes="(max-width: 344px) 100vw, 344px" /><figcaption class="wp-element-caption">Quantization Config (Credit: Author)</figcaption></figure></div>


<p><strong><span style="text-decoration: underline;">Llama-2:</span></strong><br>Abstract from the Llama-2 paper by Meta:</p>



<p><em>In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.</em><br>Get the details of all the models available: <a href="https://huggingface.co/models?search=llama2">Models &#8211; Hugging Face</a></p>



<p><strong><span style="text-decoration: underline;">Sharded Model:</span></strong><br>A sharded model is helpful for distributed training of large pretrained models like LLMs. Sharding splits the model parameters, gradients, and optimizer states across data parallel processes, and it can also offload sharded model parameters to a CPU. In this coding exercise we have used the sharded version of Llama-2 so that it works on a single GPU. You can see that 14 shards are downloaded when initializing the model for the first time.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2-1024x621.png" alt="" class="wp-image-988" width="616" height="373" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2-1024x621.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2-300x182.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2-768x466.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2-850x516.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/07/Sharded_Model_Llama-2.png 1152w" sizes="(max-width: 616px) 100vw, 616px" /><figcaption class="wp-element-caption">Sharded Llama-2 model (Credit: Author)</figcaption></figure></div>


<p><strong><span style="text-decoration: underline;">Peft config:</span></strong><br>In the last blog post we discussed in detail that in LoRA we train task-specific low rank adapters, which are generally applied to the q and v matrices of the attention layers, while keeping all the pretrained model weights frozen. Using the peft library we will create new low rank adapters for q_proj and v_proj, as shown below, with a given rank (r=8).</p>
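<p>A sketch of such a configuration with the peft library (hyperparameter values here are illustrative, not necessarily those used in the notebook):</p>

```python
from peft import LoraConfig, TaskType

# Illustrative LoRA config targeting the query/value projections of attention.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # q and v attention matrices
)
```

<p>Wrapping the quantized base model with this config (via get_peft_model) leaves only the small adapter matrices trainable.</p>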


<div class="wp-block-image">
<figure class="aligncenter size-full is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/07/image-2.png" alt="" class="wp-image-989" width="290" height="255" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/image-2.png 388w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-2-300x264.png 300w" sizes="(max-width: 290px) 100vw, 290px" /><figcaption class="wp-element-caption">Configuring LoRA(Credit: Author)</figcaption></figure></div>


<p><strong><span style="text-decoration: underline;">Training Arguments:</span></strong><br>Using the <a rel="noreferrer noopener" href="https://huggingface.co/docs/trl/main/en/sft_trainer" target="_blank">Supervised Fine-tuning Trainer (huggingface.co)</a> (SFTTrainer) we fine-tune the Llama-2 model on our custom dataset. To keep the article short, please refer to <a href="https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments">Trainer (huggingface.co)</a> for details of the training arguments.</p>



<p><strong><span style="text-decoration: underline;">Inference using QLoRA Adapters:</span></strong><br>Once the adapter is trained, you can pass the saved adapter to the get_peft_model function along with the original model to get the new LoRA fine-tuned model.</p>


<div class="wp-block-image">
<figure class="aligncenter size-large"><img decoding="async" loading="lazy" width="1024" height="228" src="https://www.aritrasen.com/wp-content/uploads/2023/07/image-4-1024x228.png" alt="" class="wp-image-993" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/image-4-1024x228.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-4-300x67.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-4-768x171.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-4-850x190.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-4.png 1255w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption class="wp-element-caption">Llama2 + QLoRA = Finetuned model (Credit: Author)</figcaption></figure></div>


<p>This should give you an idea of the whole process of fine-tuning a Llama-2 model using QLoRA. I have run the model in a Kaggle kernel with 1 GPU and the whole process works fine. I will keep refining the process to improve the QLoRA outputs.<br><span style="text-decoration: underline;">Update: Code changes have been made to fix the repeating text output. The text summarization now works properly.</span></p>



<p><script src="https://gist.github.com/aritrasen87/48c9da7dd535e35c7146f7fbf75a486d.js"></script></p>



<p>Do like, share and comment if you have any questions or suggestions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-finetuning-llama2-with-qlora-on-custom-dataset/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: LoRA fine tuning 1.4</title>
		<link>https://www.aritrasen.com/generative-ai-llms-lora-fine-tuning-1-4/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-lora-fine-tuning-1-4/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Wed, 19 Jul 2023 13:42:29 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[LoRA]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">http://www.aritrasen.com/?p=970</guid>

					<description><![CDATA[In the last post we discussed two approaches to fine tuning using feature-based method, these options may not be always efficient in terms of computational complexity as well as time complexity. Full fine tuning of any LLM models needs to stitch the below mentioned steps together: Combination of all these steps can produce lot of...]]></description>
										<content:encoded><![CDATA[
<p>In the last post we discussed two approaches to fine-tuning using the feature-based method; these options are not always efficient in terms of computational complexity or time. Full fine-tuning of any LLM needs to stitch together the steps mentioned below:</p>



<ol>
<li>Load dataset in memory</li>



<li>Load pretrained model in the memory</li>



<li>Forward pass through the network</li>



<li>Loss calculation and gradient calculations</li>



<li>Optimize the weights</li>
</ol>



<p>The combination of all these steps can produce a lot of challenges in terms of &#8211;</p>



<ol>
<li>Memory requirements</li>



<li>Computational requirements like more GPU</li>



<li>Training time and associated cost</li>



<li>Inference time</li>
</ol>



<p>In a <a rel="noreferrer noopener" href="https://arxiv.org/pdf/2106.09685.pdf" target="_blank">paper</a> published by Microsoft, it was shown that there is an approach called Parameter Efficient Fine-Tuning which can help tackle the above-mentioned problems. In this paper a technique called <em>LoRA (Low Rank Adaptation)</em> was introduced. In principle the idea revolves around matrix decomposition into lower ranks. A full fine-tuning of an LLM goes through mainly two separate steps to generate the embeddings &#8211; 1. Forward pass through the network 2. Weight updates &#8211; and in the end we get the final embeddings as shown below &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="374" height="436" src="https://www.aritrasen.com/wp-content/uploads/2023/07/Weight_Updates.jpg" alt="" class="wp-image-971" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/Weight_Updates.jpg 374w, https://www.aritrasen.com/wp-content/uploads/2023/07/Weight_Updates-257x300.jpg 257w, https://www.aritrasen.com/wp-content/uploads/2023/07/Weight_Updates-300x350.jpg 300w" sizes="(max-width: 374px) 100vw, 374px" /><figcaption class="wp-element-caption"><span style="text-decoration: underline;"><strong>Full finetuning of LLMs (Image: Author)</strong></span></figcaption></figure></div>


<p>In the case of LoRA it was shown that the pretrained model has a low intrinsic dimension; in other words, there exists a low-dimension reparameterization that is as effective as full parameter fine-tuning. The pretrained weight updates can be decomposed into low rank matrices (the rank is the number of linearly independent rows or columns of a matrix) as shown below &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="685" height="342" src="https://www.aritrasen.com/wp-content/uploads/2023/07/Low-Rank-Matrix-Decomposition-1.jpg" alt="" class="wp-image-973" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/Low-Rank-Matrix-Decomposition-1.jpg 685w, https://www.aritrasen.com/wp-content/uploads/2023/07/Low-Rank-Matrix-Decomposition-1-300x150.jpg 300w" sizes="(max-width: 685px) 100vw, 685px" /><figcaption class="wp-element-caption"><span style="text-decoration: underline;"><strong>Low Rank representation (Image: Author)</strong></span></figcaption></figure></div>


<p>For example, imagine that W is the pretrained weight matrix with dimensions 512 x 64. If we wanted to fully fine-tune these weights, the total number of parameters would be 512 x 64 = 32,768, which is a lot of parameters to train. However, if we use two low rank matrices with rank 4, then these two matrices A and B can be represented (a low-dimension reparameterization) as follows </p>



<p>&#8211; A &#8211; 4 X 64 and B &#8211; 512 X 4.</p>



<p>So the total number of parameters would be (4 x 64 + 512 x 4) = 2,304, which is a lot less compared to approximately 32k parameters. During training, we keep the pre-trained model parameters frozen and only train these two low rank matrices. During inference we multiply these two matrices together and add the result back to the pre-trained model weights, as shown below &#8211;</p>
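<p>The parameter arithmetic above can be checked directly; a minimal sketch:</p>

```python
# Compare full fine-tuning vs. LoRA trainable-parameter counts for a d_out x d_in weight.
def param_counts(d_out: int, d_in: int, r: int):
    full = d_out * d_in              # every entry of W is trainable
    lora = r * d_in + d_out * r      # A is r x d_in, B is d_out x r
    return full, lora

full, lora = param_counts(512, 64, r=4)
print(full, lora)   # 32768 2304 -> roughly 14x fewer trainable parameters
```

<p>Note how the saving grows with the size of W: the LoRA count scales with r(d_in + d_out) rather than d_in &#215; d_out.</p>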


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="600" height="439" src="https://www.aritrasen.com/wp-content/uploads/2023/07/lora-animated.gif" alt="" class="wp-image-975"/><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">LoRA (Image: Huggingface)</span></strong></figcaption></figure></div>


<p>We can also train these low rank matrices for specific tasks, and at inference time we can add the task-specific LoRA weights back to the pretrained weights, as shown below &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="921" height="316" src="https://www.aritrasen.com/wp-content/uploads/2023/07/TaskSpecificLORA.jpg" alt="" class="wp-image-976" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/TaskSpecificLORA.jpg 921w, https://www.aritrasen.com/wp-content/uploads/2023/07/TaskSpecificLORA-300x103.jpg 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/TaskSpecificLORA-768x264.jpg 768w, https://www.aritrasen.com/wp-content/uploads/2023/07/TaskSpecificLORA-850x292.jpg 850w" sizes="(max-width: 921px) 100vw, 921px" /><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">Task Specific LoRA (Image: Author)</span></strong></figcaption></figure></div>
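<p>This merge step, W&#8217; = W + B&#183;A, can be written out in a few lines of plain Python (the shapes and numbers here are tiny and purely illustrative):</p>

```python
# Merge a task-specific LoRA update back into the frozen weight: W' = W + B @ A.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge(W, B, A):
    BA = matmul(B, A)                 # d_out x d_in update, same shape as W
    return [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

W = [[1.0, 0.0], [0.0, 1.0]]          # frozen pretrained weight (2 x 2)
B = [[1.0], [0.0]]                    # d_out x r, with r = 1
A = [[0.5, 0.5]]                      # r x d_in
W_task = merge(W, B, A)               # -> [[1.5, 0.5], [0.0, 1.0]]
```

<p>Because the merge is just an addition, swapping tasks means subtracting one BA product and adding another, with no change to W itself.</p>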


<p>In the previously mentioned paper it was shown that model performance similar to full fine-tuning can be achieved with LoRA, as shown below &#8211;</p>


<div class="wp-block-image">
<figure class="aligncenter size-full"><img decoding="async" loading="lazy" width="479" height="214" src="https://www.aritrasen.com/wp-content/uploads/2023/07/LoRA_Performence.jpg" alt="" class="wp-image-977" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/LoRA_Performence.jpg 479w, https://www.aritrasen.com/wp-content/uploads/2023/07/LoRA_Performence-300x134.jpg 300w" sizes="(max-width: 479px) 100vw, 479px" /><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">LoRA performance comparison (Image: LoRA Paper)</span></strong></figcaption></figure></div>


<p>In the LoRA paper, low rank adaptation was used for the different attention weight matrices like Q and V. The study in the paper was limited to adapting only the attention weights for downstream tasks, freezing the MLP modules (so they are not trained on downstream tasks), both for simplicity and parameter-efficiency. Surprisingly, it was observed that very good performance can be achieved with a rank as low as r=1 (r is a hyperparameter to tune).</p>


<div class="wp-block-image">
<figure class="aligncenter size-large is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/07/image-1024x293.png" alt="" class="wp-image-978" width="739" height="211" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/image-1024x293.png 1024w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-300x86.png 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-768x220.png 768w, https://www.aritrasen.com/wp-content/uploads/2023/07/image-850x243.png 850w, https://www.aritrasen.com/wp-content/uploads/2023/07/image.png 1049w" sizes="(max-width: 739px) 100vw, 739px" /><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">How to choose the rank? (Image: LoRA paper)</span></strong></figcaption></figure>


<p>In the next blog post we will implement LoRA in code. Do like and share the post if you find it useful. Thanks for reading.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-lora-fine-tuning-1-4/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: Feature base finetuning 1.3</title>
		<link>https://www.aritrasen.com/generative-ai-llms-feature-base-finetuning-1-3/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-feature-base-finetuning-1-3/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Wed, 12 Jul 2023 07:27:35 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[Finetuning]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[transformers]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">http://www.aritrasen.com/?p=959</guid>

					<description><![CDATA[In the last post we talked about how to do In-context finetuning using few shot techniques, In-context finetuning works when we don&#8217;t have much data, or we don&#8217;t have access to the full model. This technique has certain limitations like the more examples you add in the prompt the context length increases a lot and...]]></description>
										<content:encoded><![CDATA[
<p>In the last post we talked about how to do in-context fine-tuning using few-shot techniques. In-context fine-tuning works when we don&#8217;t have much data, or we don&#8217;t have access to the full model. This technique has certain limitations: the more examples you add to the prompt, the more the context length increases, and there is always a cut-off on how much benefit you can get out of in-context fine-tuning. </p>



<p>This is where feature-based fine-tuning comes in: it applies when we have a lot of data to fine-tune the LLM and we have full access to the LLM, for any downstream task like classification, sentiment analysis etc. In general, feature-based fine-tuning can be done using the two approaches mentioned below. I have already written two blog posts on these approaches, and I have attached the links to those tutorials here:</p>



<ol>
<li><strong>Update the weights of the pre-trained LLM model along with the classification layer.</strong><br>In practice, fine-tuning all layers almost always results in superior performance; however, this is resource-intensive and time-consuming, and hardware such as a GPU is almost essential.</li>
</ol>



<figure class="wp-block-embed is-type-wp-embed is-provider-denken wp-block-embed-denken"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="BPswx1VaTb"><a href="https://www.aritrasen.com/1-1-fine-tune-a-transformer-model-1-2/">1.1 &#8211; Fine Tune a Transformer Model (1/2)</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;1.1 &#8211; Fine Tune a Transformer Model (1/2)&#8221; &#8212; Denken" src="https://www.aritrasen.com/1-1-fine-tune-a-transformer-model-1-2/embed/#?secret=ShfmIiQtVK#?secret=BPswx1VaTb" data-secret="BPswx1VaTb" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">Code Example of Approach 1</span></strong></figcaption></figure>



<p><strong>2. Update only the weights of the classification layer, not the pre-trained LLM model.</strong><br>This amounts to using the pre-trained LLM as a feature extractor. This approach is much more efficient in terms of resource consumption and time required. Different heads can be trained for different downstream tasks using this approach.</p>



<figure class="wp-block-embed is-type-wp-embed is-provider-denken wp-block-embed-denken"><div class="wp-block-embed__wrapper">
<blockquote class="wp-embedded-content" data-secret="kDEVeinGWB"><a href="https://www.aritrasen.com/1-2-fine-tune-a-transformer-model-2-2/">1.2 – Fine Tune a Transformer Model (2/2)</a></blockquote><iframe class="wp-embedded-content" sandbox="allow-scripts" security="restricted" title="&#8220;1.2 – Fine Tune a Transformer Model (2/2)&#8221; &#8212; Denken" src="https://www.aritrasen.com/1-2-fine-tune-a-transformer-model-2-2/embed/#?secret=gpTkuQ3Ku7#?secret=kDEVeinGWB" data-secret="kDEVeinGWB" width="600" height="338" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
</div><figcaption class="wp-element-caption"><strong><span style="text-decoration: underline;">Code Example of Approach 2</span></strong></figcaption></figure>
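<p>The core of approach 2 is freezing the backbone and training only the head. A minimal PyTorch sketch, using a toy two-layer network as a stand-in for a real pretrained LLM (sizes are arbitrary):</p>

```python
import torch.nn as nn

# Toy stand-in for a pretrained backbone plus a new classification head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
head = nn.Linear(32, 2)

# Approach 2: freeze every backbone parameter; only the head stays trainable.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters() if not p.requires_grad)
```

<p>An optimizer built only from head.parameters() then updates the task head while the backbone acts purely as a feature extractor.</p>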



<figure class="wp-block-image size-full is-resized"><img decoding="async" loading="lazy" src="https://www.aritrasen.com/wp-content/uploads/2023/07/Feature_Based_finetuning.jpg" alt="" class="wp-image-966" width="756" height="541" title="Feature based finetuning of LLMs" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/Feature_Based_finetuning.jpg 756w, https://www.aritrasen.com/wp-content/uploads/2023/07/Feature_Based_finetuning-300x215.jpg 300w" sizes="(max-width: 756px) 100vw, 756px" /><figcaption class="wp-element-caption"><strong>Feature based finetuning of LLMs (Performence vs Training time) (Source : <a href="https://substack.com/@rasbt">SEBASTIAN RASCHKA, PHD</a>)</strong></figcaption></figure>



<p>From the above image we can see that feature-based fine-tuning requires more training time to reach optimal model performance, and these processes are not always resource-efficient fine-tuning approaches.<br>More fine-tuning approaches to come in this blog post series.</p>



<p>Do like, share and comment if you have any questions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-feature-base-finetuning-1-3/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: In Context Learning 1.2</title>
		<link>https://www.aritrasen.com/generative-ai-llms-in-context-learning-1-2/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-in-context-learning-1-2/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Mon, 10 Jul 2023 14:16:28 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Finetuning]]></category>
		<category><![CDATA[LLMs]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">http://www.aritrasen.com/?p=945</guid>

					<description><![CDATA[From this blog post onwards, we will talk about different fine-tuning approaches for LLMs. As discussed in the last last post In context learning helps in below mentioned two situations: 1. We don&#8217;t have access to the full model. We only have access to the API of the model.2. When we don&#8217;t have much data...]]></description>
										<content:encoded><![CDATA[
<p>From this blog post onwards, we will talk about different fine-tuning approaches for LLMs. As discussed in the last post, in-context learning helps in the two situations mentioned below: <br>1. We don&#8217;t have access to the full model; we only have access to the model&#8217;s API.<br>2. We don&#8217;t have much data to train any model.<br>Below, using an OpenAI API key, I show how we can do in-context learning.</p>



<p><script src="https://gist.github.com/aritrasen87/05fcbf4b1f86a67515748f274228fa03.js"></script></p>



<p>One of the limitations of in-context learning is that the context length grows with the number of examples added to the prompt, which is not an efficient fine-tuning approach. If we have a lot of data, a better approach is fine-tuning with instructions, as given in the <a href="https://platform.openai.com/docs/guides/fine-tuning" target="_blank" rel="noreferrer noopener">OpenAI documentation</a>.</p>
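<p>Independent of any particular API, the few-shot prompt itself is just careful string construction; a sketch (the labels and delimiters are illustrative assumptions):</p>

```python
# Build a few-shot sentiment prompt from labeled examples; format is illustrative.
def few_shot_prompt(examples, query):
    lines = ["Classify the sentiment as Positive or Negative.\n"]
    for text, label in examples:
        lines.append(f"Text: {text}\nSentiment: {label}\n")
    lines.append(f"Text: {query}\nSentiment:")
    return "\n".join(lines)

examples = [("I loved this movie!", "Positive"), ("Terrible service.", "Negative")]
prompt = few_shot_prompt(examples, "The food was great.")
```

<p>Note how every added example lengthens the prompt: this is exactly the context-length limitation discussed above.</p>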



<p>Do like, share and comment if you have any questions.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-in-context-learning-1-2/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Generative AI: LLMs: Finetuning Approaches 1.1</title>
		<link>https://www.aritrasen.com/generative-ai-llms-finetuning-approaches-1-1/</link>
					<comments>https://www.aritrasen.com/generative-ai-llms-finetuning-approaches-1-1/#respond</comments>
		
		<dc:creator><![CDATA[Aritra Sen]]></dc:creator>
		<pubDate>Thu, 06 Jul 2023 15:00:00 +0000</pubDate>
				<category><![CDATA[Aritra Sen]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[deep-learning]]></category>
		<category><![CDATA[Finetuning]]></category>
		<category><![CDATA[GenerativeAI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[pytorch]]></category>
		<category><![CDATA[Tutorials]]></category>
		<guid isPermaLink="false">http://www.aritrasen.com/?p=934</guid>

					<description><![CDATA[In the last post in this Generative AI with LLMs series we talked about different types of LLM model and how they are generally pre-trained. These Deep Learning language models with large numbers of parameters are generally trained on open-sourced data like Common Crawl, The Pile, MassiveText, blogs, Wikipedia, GitHub etc. These datasets are generally...]]></description>
										<content:encoded><![CDATA[
<p>In the last post in this Generative AI with LLMs series we talked about different types of LLM models and how they are generally pre-trained. These deep learning language models with large numbers of parameters are generally trained on open-source data like Common Crawl, The Pile, MassiveText, blogs, Wikipedia, GitHub etc. These datasets come from different domains and topics and are generic in nature. However, these LLMs may not perform as well on a specific task at hand without fine-tuning. For example, if you want to use a pretrained LLM for NLP tasks on bio-medical documents/texts, fine-tuning it (or in-context learning) on a corpus of bio-medical documents can significantly improve the model&#8217;s performance.<br><br>This blog post will show and discuss the different available approaches to fine-tuning for a task at hand. We will discuss each of the topics briefly in this post, and in future posts we will go into the topics in detail with code. <br><img decoding="async" loading="lazy" width="1281" height="336" class="wp-image-939" style="width: 1100px;" src="https://www.aritrasen.com/wp-content/uploads/2023/07/Finetuning_Roadmap-2.jpg" alt="" srcset="https://www.aritrasen.com/wp-content/uploads/2023/07/Finetuning_Roadmap-2.jpg 1281w, https://www.aritrasen.com/wp-content/uploads/2023/07/Finetuning_Roadmap-2-300x79.jpg 300w, https://www.aritrasen.com/wp-content/uploads/2023/07/Finetuning_Roadmap-2-1024x269.jpg 1024w, https://www.aritrasen.com/wp-content/uploads/2023/07/Finetuning_Roadmap-2-768x201.jpg 768w" sizes="(max-width: 1281px) 100vw, 1281px" /><br><br>Let&#8217;s get started with a brief discussion of each of the approaches shown above: <br><br><strong><span style="text-decoration: underline;">In context learning:</span></strong><br>In-context learning is the way to go when we don&#8217;t have access to the full LLM and we are using an API to access it, for example when we are using the OpenAI gpt-35-turbo model to make API calls. With the recent development of GPT-3/ChatGPT, we have seen that we can do zero-shot or few-shot prompting to get better results when we use these models for a task at hand. In few-shot prompting we provide one or more examples of the task embedded in the input prompt to the model. In the next blog post tutorial, we will go through how we can do this in Python.</p>



<p><strong><span style="text-decoration: underline;">Feature based finetuning:</span></strong><br>For feature-based fine-tuning we should have access to the full LLM, like BERT, which can be fine-tuned in generally two ways to get much better performance on a domain-specific downstream task like sentiment classification. In feature-based fine-tuning we attach a new classification or task-specific head and train only that newly added head, or we can also tune all the layers of the LLM along with the newly added head. A more in-depth discussion will follow in future blog posts.</p>



<p><strong><span style="text-decoration: underline;">Quantization (Model run time/space optimization):</span></strong></p>



<p>Quantization is generally a model optimization technique. While in feature-based fine-tuning we generally play with the number of parameters to fine-tune, quantization takes a different approach: we try to represent the weights, biases and gradients of the LLM with low precision data types like 8-bit integer (INT8) instead of the usual 32-bit floating point (FP32). By reducing the number of bits, we reduce the size of the model to be fine-tuned, which in turn helps the fine-tuning process by reducing memory usage and run time.</p>
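<p>The idea can be illustrated with a simple symmetric (absmax) INT8 round trip in pure Python; real libraries use more sophisticated schemes, so this is only a conceptual sketch:</p>

```python
# Symmetric absmax INT8 quantization: map floats into [-127, 127] and back.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]      # each value now fits in 8 bits
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]                # recover floats, with rounding error

w = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

<p>Each weight now needs 1 byte instead of 4, at the cost of a small rounding error between w and w_hat.</p>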



<p><strong><span style="text-decoration: underline;">Multitask instruction finetuning:</span></strong></p>



<p>So far, all the methods mentioned above talk about finetuning the whole LLM for a single downstream task. Tuning the model for a single downstream task can lead to a phenomenon named <strong>&#8216;catastrophic forgetting&#8217;</strong>, where the model learns to do the task for which it was finetuned but performs very poorly on other tasks. For example, an LLM finetuned for sentiment classification can start performing very poorly on other tasks like text summarization or named entity recognition. To avoid catastrophic forgetting we can finetune the model on a mixture of instruction prompts. The FLAN family of models, such as <a href="https://huggingface.co/docs/transformers/model_doc/flan-t5" data-type="URL" data-id="https://huggingface.co/docs/transformers/model_doc/flan-t5" target="_blank" rel="noreferrer noopener">FLAN-T5</a>, is trained this way.</p>
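<p>To make "mixture of instruction prompts" concrete, here is a hypothetical sketch of assembling one. The task names and templates are invented for illustration; the essential idea is wrapping each raw example in a task-specific instruction and interleaving tasks so no single one dominates.</p>

```python
# Hypothetical sketch of building a multitask instruction-tuning mixture.
import random

TEMPLATES = {
    "sentiment": "Classify the sentiment of this review: {text}",
    "summarize": "Summarize the following article: {text}",
    "ner": "List the named entities in this text: {text}",
}

def build_mixture(datasets, seed=0):
    mixture = []
    for task, rows in datasets.items():
        for text, target in rows:
            mixture.append({"prompt": TEMPLATES[task].format(text=text),
                            "target": target})
    random.Random(seed).shuffle(mixture)   # interleave tasks in one training stream
    return mixture

datasets = {
    "sentiment": [("Great phone, great battery.", "positive")],
    "summarize": [("A long article about solar power ...", "A short summary.")],
    "ner": [("Alice met Bob in Paris.", "Alice, Bob, Paris")],
}
mix = build_mixture(datasets)
print(len(mix))  # 3
```

<p>Finetuning on such a shuffled mixture keeps gradient updates from drifting toward any one task, which is what counteracts catastrophic forgetting.</p>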



<p><strong><span style="text-decoration: underline;">Parameter Efficient Fine Tuning (PEFT):</span></strong><br>PEFT reuses the pre-trained model with a minimal number of new parameters to be trained during the fine-tuning process. Go for PEFT when you want to optimize for the following criteria:</p>



<ol>
<li>Lower computational and hardware costs (requires fewer GPUs and less GPU time)</li>



<li>Shorter training time</li>



<li>Better modeling performance (by reducing overfitting)</li>



<li>Less storage, as the newly added or trained parameters are very small</li>
</ol>



<p>At a high level we can categorize PEFT into the approaches mentioned below, which we will discuss in detail with code implementations in future blog posts.</p>



<ol>
<li><span style="text-decoration: underline;"><em>LoRA (reparameterization):</em></span><br>In Low-Rank Adaptation (LoRA) fine-tuning, two new low-rank matrices (obtained via matrix decomposition) are introduced into the fine-tuning process while the pre-trained model weights are kept unchanged. These low-rank matrices can be specific to different tasks. At inference time the task-specific low-rank matrices can be added back to the pre-trained model weights to get better performance on the individual tasks.</li>
</ol>
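<p>The LoRA idea fits in a few lines of NumPy. The dimensions below are toy values chosen for illustration; the key property is that the update <code>A @ B</code> has rank at most <code>r</code>, so only a small fraction of the weight count is trainable.</p>

```python
# Illustrative sketch of the LoRA reparameterization: W stays frozen,
# only the low-rank factors A and B are trained.
import numpy as np

d, k, r = 64, 64, 4                       # r << d, k: the rank of the update
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))           # pretrained weight, frozen
A = rng.standard_normal((d, r)) * 0.01    # trainable low-rank factor
B = np.zeros((r, k))                      # trainable; zero init so A @ B starts at 0

W_eff = W + A @ B                         # merged weight used at inference
trainable = A.size + B.size
print(f"{trainable / W.size:.1%}")        # 12.5% of the full weight count
```

<p>Because <code>B</code> is initialized to zero, the merged weight equals the pretrained weight at the start of finetuning, and different task-specific <code>(A, B)</code> pairs can be swapped in against the same frozen <code>W</code>.</p>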



<p>    <span style="text-decoration: underline;"><em>2. Adapters:</em></span><br>This method adds additional parameters/layers to each transformer block and trains only these additional parameters, keeping the original parameters frozen. According to research, a BERT model trained with the adapter method reaches modeling performance comparable to a fully finetuned BERT model while requiring the training of only 3.6% of the parameters.</p>
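<p>A common adapter shape (assumed here, following the bottleneck design from the adapter literature) is a small down-projection, a nonlinearity, an up-projection and a residual connection, inserted inside each transformer block:</p>

```python
# Sketch of a bottleneck adapter block; only its parameters would be trained.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, hidden_size)     # project back up

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))     # residual connection

adapter = Adapter(hidden_size=768)                       # BERT-base hidden size
n_params = sum(p.numel() for p in adapter.parameters())
print(n_params)  # ~25k params, versus roughly 7M in one full BERT-base layer
```

<p>The residual connection means an adapter initialized near zero barely perturbs the pretrained network, which is what makes inserting it into every block safe.</p>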



<p>    <span style="text-decoration: underline;"><em>3. Soft prompt tuning:</em></span><br>Soft prompt tuning is different from traditional prompt tuning: instead of hand-crafting discrete text prompts, soft prompt tuning (and the related prefix tuning) prepends tunable tensors to the embeddings of the input.</p>
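<p>Mechanically, the "tunable tensors" are just a trainable matrix of virtual-token embeddings concatenated in front of the (frozen) input embeddings. A minimal sketch, with toy dimensions:</p>

```python
# Sketch of soft prompt tuning: only the soft prompt receives gradients.
import torch
import torch.nn as nn

n_virtual, hidden = 8, 32
soft_prompt = nn.Parameter(torch.randn(n_virtual, hidden))  # trainable "virtual tokens"

token_embeds = torch.randn(5, hidden)             # frozen embeddings of 5 real tokens
inputs = torch.cat([soft_prompt, token_embeds])   # prepend along the sequence axis
print(inputs.shape)  # torch.Size([13, 32]): 8 virtual + 5 real tokens
```

<p>During finetuning the frozen LLM consumes the extended sequence, and backpropagation updates only the few thousand soft-prompt values rather than billions of weights.</p>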



<p><strong><span style="text-decoration: underline;">Reinforcement Learning with Human Feedback (RLHF):</span></strong><br>In this process we include human feedback in the fine-tuning loop. The first step is to generate a human-labelled dataset where humans rank the LLM outputs based on certain criteria like toxicity, relevance or quality of output. Using this labelled dataset, a reward model is trained which assigns a reward to the outputs generated by the LLM. Based on the reward, an RL algorithm (e.g. Proximal Policy Optimization) fine-tunes the weights of the LLM. This technique is one of the main reasons behind the success of ChatGPT.</p>
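<p>The reward-model step can be illustrated with the pairwise (Bradley-Terry style) loss commonly used in this setup: the model is pushed to score the human-preferred response above the rejected one. The scalar scores below are made up for the example.</p>

```python
# Illustrative pairwise loss for reward-model training in RLHF.
import numpy as np

def reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected); small when chosen scores much higher."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

print(reward_loss(2.0, -1.0))  # small: reward model agrees with the human ranking
print(reward_loss(-1.0, 2.0))  # large: reward model disagrees, big gradient
```

<p>Minimizing this loss over many human-ranked pairs yields a reward model whose scalar output can then drive the PPO updates of the LLM's weights.</p>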



<p>In future blog posts we will go deeper into each of these techniques.</p>



<p>Thanks for your time, hope you enjoyed reading; do share the post if you liked it.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.aritrasen.com/generative-ai-llms-finetuning-approaches-1-1/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
