What are Quantized LLMs?

Most Large Language Models (LLMs) require 30GB+ of GPU memory or RAM to run inference. This raises the question: How are we able to fine-tune LLMs like Llama 3 8B on consumer GPUs or even use them locally? Not everyone has access to multiple expensive GPUs. The answer lies in model quantization. 

In this tutorial, we will learn about Quantization in LLMs and convert Google’s Gemma model into a quantized model using Llama.cpp. 

What is Quantization in LLMs?

Quantization in LLMs is a compression technique where the model’s precision is reduced, typically from a 32-bit floating-point number (FP32) to lower-precision data representation like 8-bit or even 4-bit integers. This compression significantly reduces the model size and the computational resources required to run it. 

For example, a model that originally requires 16GB of memory can be compressed to just 4.5GB through quantization. This allows the model to be used on devices with limited resources, including those with a single CPU. Moreover, the quantized models require less storage, they are energy efficient, and provide faster inference.
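To make the mapping concrete, here is a minimal sketch of symmetric 8-bit quantization using NumPy. The variable names are illustrative, not from any particular quantization library: each FP32 weight is mapped to an integer in [-127, 127] via a per-tensor scale, and dequantized back by multiplying with that scale.

```python
import numpy as np

# Illustrative FP32 weights (in practice, a weight tensor from the model).
weights = np.array([0.82, -1.30, 0.05, 2.47, -0.91], dtype=np.float32)

# Symmetric int8 quantization: one scale per tensor.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to approximate the original values.
recovered = q.astype(np.float32) * scale

# Storage drops from 4 bytes to 1 byte per weight (4x smaller),
# at the cost of a small rounding error bounded by half the scale.
max_error = np.abs(weights - recovered).max()
```

Real schemes like Q4_K_M are more elaborate (block-wise scales, 4-bit codes), but the principle is the same: store small integers plus a few scale factors instead of full-precision floats.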

How to Quantize an LLM using llama.cpp?

In this guide, we will learn to convert the Google Gemma 2B model from 5GB to 1.5GB using 4-bit quantization. We will use the llama.cpp framework to first convert the model into GGUF format and then quantize it. It is pretty straightforward, and you can also code along using the Colab notebook.

  1. Start a Google Colab session with the T4 GPU runtime. 
  2. Clone the llama.cpp repository.
  3. Build the framework using the `make` command with `LLAMA_CUDA=1` as an argument. 
  4. Install all the necessary Python packages.
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && git pull && make clean && LLAMA_CUDA=1 make
!pip install -r llama.cpp/requirements.txt
  5. Access to most of these LLMs is restricted. To access them, you have to log in to your Hugging Face account by providing your API key.
from huggingface_hub import notebook_login

notebook_login()

  6. Download the Gemma 2B model files into a new directory. 
from huggingface_hub import snapshot_download
model_name = "google/gemma-2b"
base_model = "./original_model/"
snapshot_download(repo_id=model_name, local_dir=base_model, local_dir_use_symlinks=False)


  7. Create a new directory for the quantized models.
  8. Run the Python script from the llama.cpp repository to convert the safetensors model files into GGUF format. 
!mkdir ./quantized_model/
!python llama.cpp/convert-hf-to-gguf.py ./original_model --outfile ./quantized_model/gemma-2b-fp16.gguf --outtype f16


  9. Run the `llama-quantize` binary from the llama.cpp repository to convert the Gemma model to a lower-precision data type using the `Q4_K_M` method. This quantizes most weights to 4-bit precision using the k-quant "medium" scheme, which keeps some of the more sensitive tensors at higher precision for better output quality.
! ./llama.cpp/llama-quantize ./quantized_model/gemma-2b-fp16.gguf ./quantized_model/gemma-2b-Q4_K_M.gguf Q4_K_M

Our model has been reduced to 1548.98 MB from 4780.29 MB, roughly a third of the original size.
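As a sanity check, the reported sizes imply about a 3x reduction, which lines up with going from 16-bit floats to roughly 5 effective bits per weight (Q4_K_M stores block scales and keeps some tensors at higher precision, so the average lands above its nominal 4 bits). A quick back-of-the-envelope check in Python, using the sizes printed by `llama-quantize`:

```python
# Sizes reported by llama-quantize for our run (in MB).
fp16_mb = 4780.29
q4_km_mb = 1548.98

# Overall compression ratio: roughly 3.1x.
ratio = fp16_mb / q4_km_mb

# Effective bits per weight: FP16 uses 16 bits, so divide by the ratio.
# Comes out near 5.2 bits, slightly above Q4_K_M's nominal ~4.8.
effective_bits = 16 / ratio
```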


  10. Download the quantized model by opening the Colab file browser and navigating to the "quantized_model" folder. 


  11. Download and install Jan from the official website: https://jan.ai/
  12. Import the quantized model by clicking the "Hub" menu on the left panel and then clicking the "Import Model" button. It is quite simple. 
  13. Select the quantized model on the "Model" tab in the right panel and start prompting. I am getting almost 50 tokens per second, which is quite good, and the model handles simple tasks well. 



Thanks to quantization, I can run models locally on my laptop with limited GPU memory. Quantized models are smaller, require less storage, and perform nearly as well as the original models when handling simple queries.

If you want an even simpler quantization workflow, try the GGUF My Repo Hugging Face Space. It is a web application deployed on Hugging Face: you just provide the name of a model uploaded to Hugging Face, and it quantizes that model for you. You don't need to run scripts or set up an environment; it is ready to go all the time. 

Image from Hugging Face Space


In this tutorial, we learned about quantization and how it democratizes access to large language models, letting everyday users run state-of-the-art models on their laptops. The main drawback of quantization is reduced accuracy, which can sometimes lead to poor results.

I will be writing a lot about large language models and the latest AI technology, so stay tuned and keep visiting the website.
