Big parameters, small GPUs
@stevhliu | July 11, 2025
I recently gave my first talk at a local AI meetup, Sonoma AI. The talk was about how Transformers and Diffusers reduce the memory required to load large models on consumer GPUs.

This post recaps and summarizes the talk with some additional details and code examples.
table of contents
- memory maths
- Big Model Inference
- torch_dtype
- quantization
- offloading
- tensor parallelism
- device_map
- kv cache
- resources
#memory maths
Llama 3.1 8B Instruct has been downloaded over 5.5M times in the past month.
You can get a pretty good estimate of how much GPU memory is required to load a model for inference by multiplying the number of parameters by the number of bytes per parameter (plus a little extra for the forward pass).
Llama 3.1 8B Instruct has 8B parameters and is stored in bfloat16 (half-precision), which takes up 2 bytes (16 bits) per parameter, so loading it requires roughly 16GB of memory.
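As a quick back-of-the-envelope check (a minimal sketch of the estimate above, nothing Transformers-specific):

num_params = 8e9      # Llama 3.1 8B
bytes_per_param = 2   # bfloat16
print(num_params * bytes_per_param / 1e9)  # ~16 GB of weights, before forward pass overhead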
The problem is that many free-tier or consumer GPUs don't have that much memory. And if they do, they're expensive. A T4 GPU instance on Colaboratory has 16GB of GPU memory, but only 15GB of it is actually available.
This is not very accessible.
But with Transformers and Diffusers, it is possible to load these large models into memory even on consumer GPUs and run them for inference.
This talk focuses on how this is possible.
#Big Model Inference
A model is typically loaded according to the following steps.
- Create the model with randomly initialized weights (16GB).
- Load the model weights into memory (16GB).
- Load the weights into the model.
- Move the model onto the device for inference.
The first two steps each hold a full copy of the weights, so loading peaks at roughly twice the model's size (~32GB for Llama 3.1 8B Instruct).
Big Model Inference (BMI) loads a model like this instead.
1. Create an empty model without weights.
   The first step creates an empty model with the PyTorch meta device. The meta device creates tensors with the expected shape but without any data attached, so you can create tensors of any size without worrying about memory. Transformers instantiates a model directly on the meta device.
2. Plan where each model layer goes.
   The second step uses the device_map to optimally distribute the model weights. From the shape and dtype of each tensor on the meta device, you can figure out how much memory the actual weights require. Transformers tries to fit as many weights as possible on your fastest device (the GPU) first. If they don't all fit, it places the remaining weights on the CPU, and if that still isn't enough, the rest of the weights are offloaded to disk. It even accounts for certain layers that shouldn't be split, like layers with residual connections. This is done automatically, but you can also design your own device_map by assigning each module or layer to a device.
3. Load part of the weights in memory.
   The third step loads the model shard by shard instead of loading the entire model into memory at once. Once a shard is loaded, its weights are placed in the model and moved to the appropriate device. The loaded shard is then discarded and the next shard is loaded. Instead of requiring enough memory to fit the entire model, you only need enough CPU memory to load the biggest shard. Disk offload is an additional option if you don't have enough GPU and CPU memory.
4. Load the weights in the empty model.
5. Move the model onto the device for inference.
6. Repeat step 3 for each remaining shard until all the weights are loaded.
With BMI, the meta device avoids loading a model into memory twice.
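To make the meta device less abstract, here's a minimal plain-PyTorch sketch (not Transformers internals): tensors created on it have a shape and dtype but no storage.

import torch
from torch import nn

# A large layer created on the meta device allocates no weight memory.
layer = nn.Linear(16_384, 16_384, bias=False, device="meta")

print(layer.weight.shape)   # torch.Size([16384, 16384])
print(layer.weight.device)  # meta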
#torch_dtype
The dtype indicates the type of elements stored in a tensor. It affects how much memory is required and what kind of numerical values a tensor can represent.
The tensor values are calculated from the sign, exponent, and significand (mantissa).
- The sign determines whether a value is positive or negative.
- The exponent determines the scale or magnitude of the value and the range of values a number can represent.
- The significand determines the precision, or number of significant digits.
fp32 is considered full precision and takes up 32 bits in memory: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand.
fp16 is half-precision and takes up 16 bits in memory: 1 bit for the sign, 5 bits for the exponent, and 10 bits for the significand.
bf16 is also half-precision but can represent a wider range of values: 1 bit for the sign, 8 bits for the exponent, and 7 bits for the significand.
A lower precision dtype has fewer bits and requires less memory to store.
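You can inspect these trade-offs directly in PyTorch with torch.finfo (a small sketch): bf16 keeps fp32's 8 exponent bits, so it covers the same range, but with fewer significand bits it is less precise than fp16.

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    # max = largest representable value, eps = smallest step above 1.0
    print(dtype, info.max, info.eps)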
PyTorch loads a model in fp32 by default, even if the model weights are stored in fp16, because you can't access the model until after you've loaded it with from_pretrained(). It is a waste of memory to load a model in fp32 and then convert it to fp16. To avoid this, use the torch_dtype argument in from_pretrained to explicitly set the dtype. I recommend using the "auto" option to let Transformers automatically get the dtype from the model weights.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype="auto"
)
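As a sanity check, you can confirm the dtype that was picked up and roughly how much memory the weights occupy (get_memory_footprint reports the size of the parameters and buffers in bytes).

print(model.dtype)                          # torch.bfloat16
print(model.get_memory_footprint() / 1e9)   # roughly 16 GB for an 8B model in bf16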
#quantization
Quantization takes the idea of lower-precision dtypes even further, from floating points to integers.
int8 takes up 8 bits in memory: 1 bit for the sign and 7 bits for the value.
The range of values represented by fp32 is mapped to the much smaller range represented by int8. Consider a basic example of linear quantization below.

- Map the min/max values from fp32 to int8.
- The min/max values have different distances to 0 (0 in fp32 doesn't equal 0 in int8).
- Calculate a scaling factor to obtain a linear mapping for the remaining values and adjust them with the zero-point value to account for the different distances to 0.
- Dequantize the weights with the scaling factor and zero-point to perform computations with your inputs (presumably in fp16/bf16).
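Here is a minimal sketch of those steps with plain PyTorch tensors (illustrative only; real backends like bitsandbytes are more sophisticated than this).

import torch

w = torch.randn(4, 4)  # pretend these are fp32 weights

# 1. Compute the scale and zero-point from the tensor's min/max.
qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = qmin - torch.round(w.min() / scale)

# 2. Quantize: map the fp32 values onto the int8 grid.
w_q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.int8)

# 3. Dequantize before computing with the (fp16/bf16) inputs.
w_dq = (w_q.float() - zero_point) * scale
print((w - w_dq).abs().max())  # small quantization error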
The quantization and dequantization steps may decrease inference speed though. Quantization can also be lossy, especially for lower quantization levels like int4.
With Transformers, choose and configure a quantization backend, then plug the quantization_config into from_pretrained to quantize a model.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=quantization_config
)
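Comparing memory footprints shows the effect: the 8-bit model should come in at roughly half the bf16 size (the exact number depends on which modules the backend leaves unquantized).

print(model.get_memory_footprint() / 1e9)  # noticeably smaller than the ~16 GB bf16 footprint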
#offloading
Diffusers offers various offloading options. Offloading moves weights off the GPU to another device when they're not in use. This is useful for large models like Flux.1 [dev].
For Flux.1 [dev], the memory requirements are ~9GB for the two text encoders and ~22GB for the transformer model. Loading and generating an image takes ~33GB in bf16.
Diffusers offers 3 offloading options.
- Model offloading moves a component (for example, the transformer) to the GPU only when it is needed and the other components are offloaded to the CPU.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
- CPU offloading moves the weights for a given layer onto the GPU for computation and offloads them back to the CPU when they're not in use. It is extremely slow because of all the transfers between the CPU and GPU.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
pipeline.enable_sequential_cpu_offload()
- Group offloading moves groups of n layers at a time onto the GPU for computation and back to the CPU afterwards. The difference from CPU offloading is that it uses CUDA streams to prefetch the next layer's parameters during the current computation. Overlapping computation and data transfer makes it much faster. You can even offload to disk if you need more memory.
import torch
from diffusers import DiffusionPipeline
from diffusers.hooks import apply_group_offloading

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
apply_group_offloading(
    pipeline.transformer,
    offload_type="block_level",
    num_blocks_per_group=2,
    offload_device=torch.device("cpu"),
    onload_device=torch.device("cuda"),
    use_stream=True,
)
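With any of these options enabled, you then call the pipeline as usual and the weight transfers happen behind the scenes (prompt and step count below are just illustrative; for group offloading you'd also apply it to, or place, the remaining components).

image = pipeline(
    "a photo of a red fox in a snowy forest",
    num_inference_steps=28,
).images[0]
image.save("fox.png")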
#tensor parallelism
Tensor parallelism distributes model weights (tensors) across multiple GPUs. This helps you fit large models into memory that wouldn't otherwise fit on a single GPU.

It is also faster because each GPU can perform computations in parallel. Each GPU performs its calculations on its tensor slice and the results are synced at the end to return the final result.
There is a bit of communication overhead between GPUs, so it works best on a single machine with multiple GPUs connected by fast intra-node links.
Set the tp_plan argument in from_pretrained to use tensor parallelism.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    tp_plan="auto"
)
#device_map
For multiple GPUs, device_map
can split the model weights using different strategies.
auto
splits weights so each GPU is used equally.balanced_low_0
splits weights so each GPU is used equally except the first one. This reserves space for working with the outputs of the model, such as the generate function.sequential
fills the GPUs in order so the last one may not be used at all if not necessary.
Set the device_map argument in from_pretrained to distribute model weights across GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
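You can check where everything ended up via the hf_device_map attribute that Transformers attaches to the model (the layout in the comment is just an illustration).

print(model.hf_device_map)
# e.g. {"model.embed_tokens": 0, "model.layers.0": 0, ..., "lm_head": 1}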
To make sure weights are correctly moved and placed, Transformers uses hooks to:
- Make sure all the inputs of a module are on the same device as the weights.
- Move weights offloaded to the CPU to the GPU before the forward pass and back to the CPU after.
- Load weights offloaded to disk onto the CPU, then the GPU, before the forward pass, and free their memory afterwards.
This is slower than tensor parallelism because the GPUs are used sequentially, so some of them sit idle.

#kv cache
Autoregressive or decoder models predict one token at a time. The predicted token is dependent on all of the previous context. Every time the model predicts a new token, it ends up performing some of the same calculations again.
Performing the same calculations repeatedly is wasteful and slows down inference.
A key-value (kv) cache stores the previously calculated kv values and reuses them to avoid recomputation. At each step, you're only calculating the kv value for the current token rather than all the previous ones.
However, storing the kv values requires memory that grows linearly with sequence length.
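To get a feel for the growth, here's a rough estimate using Llama 3.1 8B's published config (32 layers, 8 KV heads from grouped-query attention, head dimension 128) in bf16.

num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# keys + values, per token, across all layers
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(bytes_per_token / 1024)            # ~128 KiB per token
print(8192 * bytes_per_token / 1024**3)  # ~1 GiB for an 8K-token context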
Transformers provides two memory-optimized cache types.
- OffloadedCache moves the cache to the CPU. Only the current layer's cache is kept on the GPU so the model's forward method can use it. The next layer's cache is prefetched while the previous layer's cache is sent back to the CPU.
- QuantizedCache quantizes the cache.
Configure the cache_implementation argument in generate to use either cache type.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# tokenize an example prompt so the snippet runs end to end
inputs = tokenizer("Paris is the capital of", return_tensors="pt").to(model.device)
model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=23,
    cache_implementation="offloaded"
)
#resources
- This video visually explains how Big Model Inference works.
- The Quantization concepts docs explain different quantization schemes (affine, int4, and fp8) and techniques.
- The tensor parallelism chapter from the Ultra-Scale Playbook provides a more detailed explanation, including column-wise versus row-wise sharding.