Big parameters, small GPUs
@stevhliu | July 11, 2025
I recently gave my first talk at Sonoma AI, a local AI meetup. The talk was about how Transformers and Diffusers reduce the memory required to load large models on consumer GPUs.
This post recaps and summarizes the talk with some additional details.
#memory maths
Llama 3.1 8B Instruct was downloaded over 5.5M times in the past month, making it the most downloaded text generation model on Hugging Face. But how much GPU memory is required to load this popular model for inference?
You can get a pretty good estimate by multiplying the number of parameters by the number of bytes per parameter (plus a little extra for the forward pass).
Llama 3.1 8B Instruct has 8B parameters and is stored in bfloat16 (half-precision), which is 2 bytes per parameter, so it needs roughly 16GB just for the weights.
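The same estimate as a quick back-of-the-envelope calculation (approximate, and excluding activations from the forward pass):

num_params = 8e9        # Llama 3.1 8B Instruct
bytes_per_param = 2     # bfloat16
print(f"{num_params * bytes_per_param / 1e9:.0f} GB")  # ~16 GB for the weights alone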
The problem is, many free-tier or consumer GPUs don't have that much memory. A free T4 GPU instance on Colaboratory has 16GB of GPU memory. But only 15GB of it is actually available. And buying a sufficiently powerful GPU can be expensive.
This is not very accessible.
Transformers and Diffusers lower the barrier to fitting large models into GPU memory.
#Big Model Inference
A model is typically loaded like this.
- Create the model with randomly initialized weights (16GB).
- Load the checkpoint weights into memory (16GB).
- Load the checkpoint weights into the model.
- Move the model to the device for inference.

The first two steps keep two full copies of the weights in memory at the same time, roughly 32GB at peak for this model.
Big Model Inference (BMI) loads a model like this.
1. Create an empty model on the PyTorch meta device. The meta device creates tensors without any data attached, only the expected shape and dtype, so they can be any size without worrying about memory constraints (see the sketch after this list). Transformers instantiates a model directly on the meta device, which avoids loading the model into memory twice.
2. A device_map optimally distributes the model weights. This is automatic, but you can also design your own device_map by assigning each module/layer to a device. From the shape and dtype of each tensor on the meta device, Transformers can figure out how much memory the actual weights require. It tries to fit as many weights as possible on your fastest device (GPU) first. If they don't all fit, it places the remaining weights on the CPU, and if that still isn't enough, the rest are offloaded to disk. It even accounts for layers that shouldn't be split, like layers with residual connections.
3. Load a model shard into memory instead of the entire model, so you only need enough CPU memory for the biggest shard rather than the whole checkpoint.
4. Load those weights into the empty model.
5. Move the weights to the device for inference.
6. Repeat from step 3, discarding each shard once its weights are placed, until all the weights are loaded.
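A minimal sketch of the meta device idea in plain PyTorch (the 8-billion-element tensor is just for illustration):

import torch

# A meta tensor has a shape and dtype but no data attached, so creating it
# costs essentially no memory no matter how large it is.
weight = torch.empty(8_000_000_000, dtype=torch.bfloat16, device="meta")
print(weight.shape, weight.device)                   # torch.Size([8000000000]) meta
print(weight.numel() * weight.element_size() / 1e9)  # ~16 GB if it held real data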
#device_map
For multiple GPUs, device_map can split the model weights using different strategies.
- auto splits the weights so each GPU is used equally.
- balanced_low_0 splits the weights so each GPU is used equally except the first one. This reserves space on the first GPU for working with the outputs of the model, such as the generate function.
- sequential fills the GPUs in order, so the last ones may not be used at all if they aren't necessary.
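You can also write your own device_map as a dictionary that maps module names to devices, as mentioned in the Big Model Inference steps above. A hypothetical, coarse-grained sketch (the module names assume a Llama-style model; you can go as fine-grained as individual layers):

import torch
from transformers import AutoModelForCausalLM

# Hypothetical manual device_map: the transformer body goes to GPU 0 and the
# LM head stays on the CPU. Finer-grained keys like "model.layers.0" also work.
device_map = {
    "model": 0,
    "lm_head": "cpu",
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    dtype=torch.bfloat16,
    device_map=device_map,
)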
Set the device_map argument in from_pretrained to distribute model weights across GPUs.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
dtype=torch.bfloat16,
device_map="auto"
)

Transformers uses hooks to make sure the weights are correctly moved and placed (a minimal sketch of the idea follows the list below).
- Make sure all the inputs of a module are on the same device as the weights.
- If the weights are offloaded to the CPU, move them to the GPU before the forward pass and back to the CPU after.
- If the weights are offloaded to disk, they are loaded onto the CPU, then the GPU before the forward pass, and their memory is freed afterwards.
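Here is a minimal sketch of the idea behind these hooks, using a plain PyTorch forward pre-hook (this illustrates the mechanism, not Transformers' actual implementation):

import torch
from torch import nn

# Move a module's inputs to whatever device its weights live on before forward runs.
# On a CPU-only machine this is a no-op, but it shows the mechanism.
def move_inputs_to_weight_device(module, args):
    device = next(module.parameters()).device
    return tuple(a.to(device) if torch.is_tensor(a) else a for a in args)

layer = nn.Linear(4, 4)
layer.register_forward_pre_hook(move_inputs_to_weight_device)
out = layer(torch.randn(2, 4))  # inputs are aligned with the weights automatically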
This is slower than tensor parallelism because the GPUs are used sequentially, which leaves some of them idle.
#dtype
The dtype is the data type of the elements in a tensor. It affects how much memory is required and what kind of numerical values a tensor can represent.
Tensor values are calculated from the sign, exponent, and significand (mantissa).
- The sign determines if a value is positive or negative.
- The exponent determines the scale or magnitude of the value and the range of values a number can represent.
- The significand determines the precision, or the number of significant digits.
fp32 is considered full precision and takes up 32 bits in memory: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand.
Lower-precision dtypes have fewer bits and require less memory to store.
fp16 is half-precision and takes up 16 bits in memory: 1 bit for the sign, 5 bits for the exponent, and 10 bits for the significand.
bf16 is also half-precision but represents a wider range of values: 1 bit for the sign, 8 bits for the exponent, and 7 bits for the significand.
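You can see the range and precision trade-off directly with torch.finfo:

import torch

# Compare bit width, largest representable value, and precision (eps) of each dtype.
for dt in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dt)
    print(dt, "| bits:", info.bits, "| max:", info.max, "| eps:", info.eps)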
A model is loaded in fp32 by default (PyTorch's default dtype) even if its weights were saved in fp16, because you can't access the model until after you've loaded it with from_pretrained().
Loading a model in fp32 and again in fp16 wastes memory. Use the dtype argument in from_pretrained to explicitly set the dtype to avoid this.
I recommend using the "auto" option to let Transformers automatically pick the optimal dtype from the model weights.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
dtype="auto"
)

#quantization
Quantization takes the idea of dtypes to an even lower level, usually from floating points to integers.
int8 takes up 8 bits in memory: 1 bit for the sign and 7 bits for the significand.
The original range of values is quantized to a lower range (a small numeric sketch follows this list).
- Map the min/max values from fp32 to int8.
- The min/max values have different distances to 0. 0 in fp32 doesn't equal 0 in int8.
- Calculate a scaling factor to get a linear mapping for the remaining values and adjust them with the zero-point value to account for the different distances to 0.
- Dequantize the weights with the scaling factor and zero-point so you can perform computations with your inputs (typically in fp16/bf16).
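A minimal numeric sketch of affine (asymmetric) int8 quantization for a single tensor; real backends like bitsandbytes are far more sophisticated, but the mechanics are similar:

import torch

# Quantize an fp32 tensor to int8 with a scaling factor and zero-point,
# then dequantize it for computation. The round trip is lossy.
w = torch.randn(4, 4)
qmin, qmax = -128, 127

scale = (w.max() - w.min()) / (qmax - qmin)       # linear mapping for the value range
zero_point = qmin - torch.round(w.min() / scale)  # accounts for fp32's 0 != int8's 0

q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.int8)
w_deq = (q.float() - zero_point) * scale          # dequantize before computing with inputs

print((w - w_deq).abs().max())                    # quantization error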
The quantization and dequantization steps may decrease inference speed and be lossy, especially for lower quantization levels like int4.
With Transformers, choose and configure a quantization backend. Then plug the quantization_config into from_pretrained to quantize a model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
quantization_config=quantization_config
)

#offloading
Offloading moves weights off the GPU to another device when they're not in use. This is useful for large models like Flux.1 [dev].
Flux.1 [dev] requires ~9GB of memory for the two text encoders and ~22GB for the transformer model. Loading and generating an image uses ~33GB in bf16.
Diffusers offers 3 offloading options.
- Model offloading moves a component (for example, the transformer) to the GPU only when it is needed. Other components are offloaded to the CPU.
import torch
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

- CPU offloading moves the weights of a given layer onto the GPU for computation and offloads them back to the CPU when they're no longer needed. It is extremely slow because of the many transfers between the CPU and GPU.
import torch
from diffusers import DiffusionPipeline
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16
)
pipeline.enable_sequential_cpu_offload()

- Group offloading moves groups of n layers at a time between the CPU and GPU, onloading each group to the GPU for computation and offloading it back afterwards. Unlike CPU offloading, it can use CUDA streams to prefetch the next group's parameters during computation. Overlapping computation and data transfer makes it much faster. You can even offload to disk if you need more memory.
import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
torch_dtype=torch.bfloat16
)
apply_group_offloading(
pipeline.transformer,
offload_type="block_level",
num_blocks_per_group=2,
offload_device=torch.device("cpu"),
onload_device=torch.device("cuda"),
use_stream=True,
)

#tensor parallelism
Tensor parallelism distributes model weights (tensors) across multiple GPUs. This lets you fit large models that wouldn't otherwise fit into a single GPU's memory.
It is faster because each GPU can perform computations in parallel and sync the results at the end to return the final output.
There is some communication overhead between GPUs, so it is best suited to single machines with multiple GPUs connected by fast intra-node links.
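A conceptual sketch of the idea, sharding one weight matrix column-wise across two hypothetical devices (everything runs on the CPU here for illustration):

import torch

# Split a weight matrix column-wise across two "GPUs", compute the partial
# results independently, then gather them to reproduce the full output.
x = torch.randn(1, 8)           # input activation
w = torch.randn(8, 16)          # full weight matrix
w0, w1 = w.chunk(2, dim=1)      # shard the columns across device 0 and device 1

y0 = x @ w0                     # each device computes with its own shard
y1 = x @ w1
y = torch.cat([y0, y1], dim=1)  # sync: concatenate the partial outputs

assert torch.allclose(y, x @ w, atol=1e-6)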
Set the tp_plan argument in from_pretrained to use tensor parallelism.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8",
dtype=torch.bfloat16,
tp_plan="auto"
)

#kv cache
Decoder models predict one token at a time. The predicted token is dependent on all of the previous context. Every time the model predicts a new token, it ends up performing some of the same calculations again.
Performing the same calculations every time is wasteful and slows down inference.
A key-value (kv) cache stores the previously calculated kv values and reuses them to avoid recomputation. At each step, you're only calculating the kv value for the current token rather than all the previous ones.
However, storing the kv values requires memory that grows linearly with sequence length.
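For a rough sense of scale, here is a sketch of the per-token cache size for a model like Llama 3.1 8B Instruct in bf16 (the 32 layers, 8 KV heads, and head dimension of 128 below are assumptions about that model's config):

# Keys and values are cached for every layer and KV head at each position.
num_layers, num_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2     # assumed config, bf16
per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem  # keys + values
print(per_token / 1024, "KiB per token")                   # 128 KiB
print(per_token * 128_000 / 1e9, "GB at a 128K context")   # ~16.8 GB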
Transformers provides two memory-optimized cache types.
- OffloadedCache moves the cache to the CPU. Only the current layer's cache is kept on the GPU so the model's forward method can use it. The next layer's cache is prefetched and the previous layer's cache is sent back to the CPU.
- QuantizedCache quantizes the cache.
Configure the cache_implementation argument in generate to use either cache type.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B-Instruct",
dtype=torch.bfloat16,
device_map="auto"
)
# tokenize an example prompt and move it to the model's device
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
inputs = tokenizer("The secret to baking a good cake is", return_tensors="pt").to(model.device)
model.generate(
**inputs,
do_sample=False,
max_new_tokens=23,
cache_implementation="offloaded"
)#resources
- This video visually explains how Big Model Inference works.
- The Quantization concepts docs explain different quantization schemes (affine, int4, and fp8) and techniques.
- The tensor parallelism chapter from the Ultra-Scale Playbook provides a more detailed explanation, including column-wise versus row-wise sharding.