SEC. 03 / 2026

Transformers Compendium - Part 1

Science is made up of so many things that appear obvious after they are explained. - Kynes, Dune

A collection of working notes on how Transformers loads large models without doubling peak memory.

#PyTorch meta device

A basic loading flow materializes every parameter tensor and fills it with random values based on the model config, which consumes memory: a 70B fp16 model requires ~140GB. Next, the real parameters are read from the checkpoint into a second copy in CPU memory, requiring another ~140GB. Finally, the randomly initialized parameters are overwritten by the real ones.

During this process, peak memory is 2x the model size because two copies of the model exist at once. The work spent randomly initializing parameters is also wasted, since those values are eventually discarded.
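As a back-of-the-envelope check (plain arithmetic, not Transformers code):

params = 70e9                           # 70B parameters
bytes_per_param = 2                     # fp16
model_bytes = params * bytes_per_param  # ~140 GB for one copy
naive_peak = 2 * model_bytes            # random-init copy + checkpoint copy, ~280 GB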

# simplified: the meta device joins the init contexts so module creation allocates no real storage
init_contexts.extend([torch.device("meta"), init.meta_device_safe_creation_ops()])

# get_init_context returns torch.device("meta")
model_init_context = cls.get_init_context(dtype, is_quantized, _is_ds_init_called, allow_all_kernels)

config = copy.deepcopy(config)
with ContextManagers(model_init_context):
    # every parameter is placed on the meta device
    model = cls(config, *model_args, **model_kwargs)
    patch_output_recorders(model)

The PyTorch meta device lets Transformers build a model of any size regardless of device memory.

Building the model on the meta device consumes no memory for parameter data. The meta device holds only the metadata for the parameters the model expects, like shape and dtype, creating a skeleton of the model from the config.
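A minimal standalone illustration of the idea, in plain PyTorch rather than the Transformers code path:

import torch

with torch.device("meta"):
    weight = torch.empty(4096, 4096, dtype=torch.float16)

print(weight.is_meta, weight.shape, weight.dtype)  # True torch.Size([4096, 4096]) torch.float16
# no storage was allocated: only shape and dtype metadata exist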

From those expected shapes and dtypes, Transformers computes how many bytes each parameter slot will occupy once it's loaded. If a device map is being inferred for automatic placement, sharding, CPU/disk offload, or multi-device loading, those planned sizes help assign parameters to devices with enough available memory.
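A hypothetical sketch of how planned sizes could drive placement (greedy first-fit; the real inference logic in Accelerate is more involved):

def assign_devices(meta_params, budgets):
    # meta_params: {name: meta tensor}; budgets: {device: free bytes}, e.g. {0: ..., "cpu": ...}
    device_map = {}
    for name, param in meta_params.items():
        size = param.numel() * param.element_size()  # planned bytes once materialized
        for device, free in budgets.items():
            if size <= free:
                budgets[device] -= size
                device_map[name] = device
                break
        else:
            device_map[name] = "disk"  # nothing fits: offload to disk
    return device_map

The result is shaped like the device map below.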

{
    "model.layers.0.self_attn.q_proj.weight": "0",
    "model.layers.31.mlp.up_proj.weight":     "1",
    "model.layers.45.mlp.down_proj.weight":   "cpu",
    "model.layers.60.mlp.down_proj.weight":   "disk",
    ...
}

When a tensor is materialized, the safetensors slice is read into a temporary CPU allocation and then moved or cast to the parameter's target device. That temporary CPU allocation is released after the materialized tensor is returned.

At any given moment, only a small number of parameters occupy CPU memory. Peak transient memory comes from the in-flight tensors and any conversion temporaries, not from loading an entire shard at once.

#Lazy safetensor slices

Each tensor in a safetensors file has a unique name, or key, like model.layers.0.self_attn.q_proj.weight. This lets the loader know which weight block in the file corresponds to which parameter.

for k in file_pointer.keys():
    merged_state_dict[k] = file_pointer.get_slice(k)  # don't materialize yet

When a safetensors file is opened, get_slice(key) returns a lightweight slice object for a tensor that actually exists in the checkpoint. The slice exposes that stored tensor's shape and dtype without materializing the tensor bytes.

This is the same kind of metadata as the meta tensor, but it answers a different question. The meta tensor describes the empty parameter slot the model needs. The safetensors slice describes the real weight block available on disk for that slot.
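For example, a slice's metadata can be inspected without reading any tensor bytes (the file name is illustrative):

from safetensors import safe_open

with safe_open("model-00001-of-00030.safetensors", framework="pt") as f:
    q_slice = f.get_slice("model.layers.0.self_attn.q_proj.weight")
    print(q_slice.get_shape(), q_slice.get_dtype())  # metadata only; no bytes read yet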

A slice is very lightweight, so Transformers can walk through every entry in the checkpoint, sorted by key, which groups related tensors together (like an attention block's q_proj/k_proj/v_proj). The device map decides where each materialized tensor should go.

def _materialize_copy(tensor: torch.Tensor, device=None, dtype=None) -> torch.Tensor:
    # This slicing is what actually loads the tensor from the safetensors slice object
    tensor = tensor[...]
    if dtype is not None or device is not None:
        tensor = tensor.to(device=device, dtype=dtype)
    return tensor

When the loader is ready to materialize a checkpoint entry, it indexes the slice with tensor[...], which reads the tensor bytes from disk. The resulting real torch.Tensor is assigned into the parameter slot, overwriting the placeholder on the meta device. Parameters assigned to "disk" are indexed or offloaded instead and remain lazily loadable through Accelerate hooks.

Any parameter the model expects but the checkpoint doesn't contain remains on the meta device until finalization, when tied weights are handled and truly missing parameters are initialized.
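A quick way to see which slots are still unfilled, using plain PyTorch introspection (a sketch, not the Transformers finalization code):

still_on_meta = [name for name, param in model.named_parameters() if param.is_meta]
# e.g. ["lm_head.weight"] for a checkpoint that relies on weight tying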

#Releasing conversion temporaries

Transformers' dynamic weight loading system can load checkpoints whose parameter layout doesn't match what the model expects at runtime. Two methods align the formats:

  • WeightRenaming.convert() maps each checkpoint source tensor to its expected model key, renaming it if the names differ.
  • WeightConverter.convert() performs a series of operations (for example, concatenating q, k, v into a single tensor) that restructure the weights from their checkpoint layout into the model's expected layout.

These methods identify source patterns that describe how weights are named in a checkpoint. For example, model.layers.5.self_attn.qkv.weight matches the pattern *.self_attn.qkv.weight. Based on the pattern, checkpoint tensors are sorted into separate buckets in self.collected_tensors, a dict whose values are lists of pending loads for each pattern.
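A hypothetical sketch of the bucketing step, using fnmatch-style globs for illustration; checkpoint_keys and make_lazy_loader stand in for the real key iteration and the Future/callable machinery described below:

import fnmatch

patterns = ["*.self_attn.qkv.weight", "*.mlp.up_proj.weight"]
collected_tensors = {pattern: [] for pattern in patterns}

for key in checkpoint_keys:
    for pattern in patterns:
        if fnmatch.fnmatch(key, pattern):
            collected_tensors[pattern].append(make_lazy_loader(key))  # pending load, not a tensor
            break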

On the async path, each tensor is submitted to the thread pool as soon as it's collected, and the resulting Future is appended to the matching bucket. A tensor can be materialized as soon as a worker thread is free, even while the main thread continues scanning and grouping checkpoint keys.
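A minimal sketch of that submission step, assuming a standard ThreadPoolExecutor (slice_obj, device, dtype, and pattern are placeholders):

from concurrent.futures import ThreadPoolExecutor

thread_pool = ThreadPoolExecutor(max_workers=4)
future = thread_pool.submit(_materialize_copy, slice_obj, device, dtype)
collected_tensors[pattern].append(future)  # a worker may already be reading bytes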

def materialize_tensors(self):
    collected_tensors = {}
    for key in list(self.collected_tensors.keys()):
        tensors = self.collected_tensors.pop(key)  # <-- pop, not read
        # ... resolve futures/callables ...
        collected_tensors[key] = tensors
    return collected_tensors

When Transformers is ready to convert a bucket of collected source tensors, materialize_tensors() calls pop(key) to remove that pattern's bucket from self.collected_tensors. Popping drops the long-lived reference, so materialized tensors don't keep accumulating after their bucket has been processed. The popped Futures are then resolved into real torch.Tensors and stored in a local collected_tensors dict.

for first_param_name, mapping in tqdm(param_name_to_load.items(), desc="Loading weights"):
    realized_value = mapping.convert(...)
    for target_name, param in realized_value.items():
        set_param_for_module(model, target_name, param, ...)
    del realized_value  # <-- immediately free after setting

convert() renames, reshapes, or transforms the collected source tensors into the realized target tensors. An outer loading loop writes each realized tensor into the model, or offloads it if the device map assigned that parameter to "disk". The local results are deleted to release memory that is no longer required.

#Disabling async loading

has_on_the_fly_quantization = hf_quantizer is not None and not hf_quantizer.pre_quantized

if (
    is_env_variable_true("HF_DEACTIVATE_ASYNC_LOAD")
    or "disk" in device_map.values()
    or has_on_the_fly_quantization
):
    thread_pool = None

Under tighter memory loading scenarios, like disk offload or on-the-fly quantization, Transformers disables async loading and uses the sync path.

The async path is faster because worker threads materialize tensors while the main thread is still iterating over the checkpoint, collecting and sorting keys, or running convert().

But on the async path, worker threads that run ahead can temporarily materialize more tensors than the follow-up step can consume.

This happens with on-the-fly quantization: the loader explicitly disables the async path with thread_pool=None because worker threads may pull weights onto the device faster than the main thread can quantize and release them.

def _job():
    return _materialize_copy(tensor, device, dtype)

if thread_pool is not None:
    return thread_pool.submit(_job)
else:
    # Return the Callable here, not the Tensor itself, so we actually delay loading to avoid saturating
    # CPU memory during conversion
    return _job

The sync path doesn't submit a job to a worker thread. It returns the _job function (a callable) that remembers which safetensors slice to load and the target device and dtype, but hasn't read the tensor yet. This is unlike a Future on the async path, which may start reading immediately.

The loader stores _job in collected_tensors[...]. When materialize_tensors() invokes the callable with func(), the tensor bytes are read from disk. This keeps the collection phase cheap because while the loader is still scanning checkpoint keys and grouping tensors by pattern, collected_tensors holds small callables rather than real torch.Tensor objects.

If the sync path instead loaded tensors immediately, every matched weight would become a real torch.Tensor during collection and stay in collected_tensors until convert() processed it. Returning a callable avoids that buildup by deferring each read until the conversion step is ready to consume it.
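A hedged sketch of what resolving one collected entry might look like on either path (the "resolve futures/callables" elision in materialize_tensors() above):

from concurrent.futures import Future

def resolve(entry):
    if isinstance(entry, Future):
        return entry.result()  # async path: a worker may already hold the tensor
    return entry()             # sync path: the disk read happens right now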

#Disk offloading

if param_device == "disk" and (target_name not in model_buffers or offload_buffers):
    disk_offload_index = offload_and_maybe_resave_param(
        target_name, param, loading_info, disk_offload_folder, disk_offload_index, mapping
    )

For safetensors files, Transformers builds an index for each parameter that maps to "disk". The index records the shard file and tensor name that already hold that weight, so the weight can stay in its original file instead of being copied again.
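An index entry might look roughly like this for a weight that can be reused in place (a sketch; the field names are illustrative):

disk_offload_index = {
    "model.layers.60.mlp.down_proj.weight": {
        "safetensors_file": "model-00012-of-00030.safetensors",
        "weight_name": "model.layers.60.mlp.down_proj.weight",
        "dtype": "float16",
    },
}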

Tensors mapped to "disk" are either re-saved to disk_offload_folder or, if they're already present in the offload index, left in their original safetensors file.

Accelerate installs hooks that load those parameters on demand, such as during a forward pass. This avoids keeping disk-offloaded parameters on the CPU or GPU and avoids extra writes when the original safetensors file can be reused.

Transformers uses the sync path when parameters are offloaded to disk to keep loading sequential under disk-offload memory constraints.

#Weight tying

# Perform the actual tying
source_param = self.get_parameter_or_buffer(source_param_name)
if "." in target_param_name:
    parent_name, name = target_param_name.rsplit(".", 1)
    parent = self.get_submodule(parent_name)
else:
    name = target_param_name
    parent = self
# Tie the weights
setattr(parent, name, source_param)

Two parameter names, like lm_head and embed_tokens, may point to the same underlying tensor. This is known as weight tying. The model uses one shared embedding matrix to read input tokens and to project hidden states back into vocabulary logits. During training, the model updates one shared embedding instead of learning two separate matrices.

A useful side effect is that only one tensor allocation is needed for both parameter names.

Checkpoints often store only one copy of the tied weights (like embed_tokens) and omit lm_head, because it's understood that tie_weights() makes lm_head share the same matrix as embed_tokens.
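Conceptually, tying amounts to assigning the same Parameter object under both names (a sketch; the attribute paths vary by architecture):

model.lm_head.weight = model.model.embed_tokens.weight  # one shared tensor
assert model.lm_head.weight.data_ptr() == model.model.embed_tokens.weight.data_ptr()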

def mark_tied_weights_as_initialized(self, loading_info):
    """Adds the `_is_hf_initialized` flag on parameters that will be tied, in order to avoid initializing them
    later as they will be tied (overwritten) anyway.
    This is very important as most embeddings are tied, and they are huge params (vocabularies are often 256k), so
    running inits on them is very costly."""
    for tied_param in getattr(self, "all_tied_weights_keys", {}).keys():
        param = self.get_parameter(tied_param)
        param._is_hf_initialized = True

Loaded parameters are marked with _is_hf_initialized when they're assigned. mark_tied_weights_as_initialized() also marks known tied target parameters, such as a missing lm_head, so the post-load initialization pass does not allocate and initialize a tensor that will later be replaced by tie_weights(). After loading, only truly missing parameters are initialized using the model's default initialization scheme.

#Fuse dtype and device transfers

def _materialize_copy(tensor: torch.Tensor, device=None, dtype=None) -> torch.Tensor:
    ...
    if dtype is not None or device is not None:
        tensor = tensor.to(device=device, dtype=dtype)
    return tensor

When a lazy slice is materialized, device transfer and dtype casting are combined into one call. If the source dtype already matches the target dtype, this is only a device transfer.

The memory benefit appears when the source and target dtypes differ. Splitting the device transfer and dtype cast into two calls copies the tensor to the device in the source dtype, then allocates a second tensor for the cast.

Combining device and dtype lets PyTorch produce the destination tensor directly in the requested dtype on the requested device.
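A sketch of the difference, assuming an fp32 checkpoint entry loaded as fp16 onto GPU 0 (slice_obj is a placeholder for the lazy slice):

import torch

tensor = slice_obj[...]                    # CPU tensor, fp32
# split: two calls, a second GPU allocation for the cast
on_gpu = tensor.to("cuda:0")               # fp32 copy on the GPU
on_gpu = on_gpu.to(torch.float16)          # new fp16 tensor on the GPU
# fused: one call, destination created directly as fp16 on the GPU
fused = tensor.to(device="cuda:0", dtype=torch.float16)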

Thank you Cyril Vallez for the feedback and adding clarity to some of the mechanisms I was fuzzy on.