Megatron CP, HCP and DCP

Context parallelism (CP) relieves memory pressure when a sequence is too long for a single GPU to hold. In this post, I'm going to walk through several CP variants: p2p, a2a, hybrid CP, and dynamic CP.

Ring attention (`p2p` CP)

For a sequence, each CP rank holds $\mathbf Q$ for its own tokens, which are assigned in zig-zag order for load balance. During the forward pass, each rank sends $\mathbf {KV}$ to its next neighbor while receiving from its previous neighbor, in ring order. Meanwhile (with computation-communication overlap), each rank calculates its partial result of $\mathrm{softmax}(\mathbf{Q} \mathbf{K}^\mathrm T)$ using online softmax.

NVIDIA/Megatron-LM/megatron/core/utils.py, lines 2353 to 2364 in 2f754f4

🫣

Q: Why transmit $\mathbf{KV}$ rather than $\mathbf{Q}$?

The partial output $\mathbf{O}$ is stored locally and then fed into the online softmax at the next step.
If we stored $\mathbf{KV}$ on each rank and let $\mathbf{Q}$ rotate in flight, the intermediate results would have to travel together with $\mathbf{Q}$.
Attention variants like GQA and MQA have a smaller $\mathbf{KV}$ than $\mathbf{Q}$.
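To make the size argument concrete, here is a small back-of-the-envelope sketch. The function name and the shapes (4096 local tokens, 32 query heads, head dim 128) are hypothetical and only serve the illustration.

```python
def ring_payload_elements(seq_per_rank, num_q_heads, num_kv_heads, head_dim):
    """Elements sent per ring step if we rotate Q versus if we rotate K and V."""
    q_elems = seq_per_rank * num_q_heads * head_dim
    kv_elems = 2 * seq_per_rank * num_kv_heads * head_dim  # K plus V
    return q_elems, kv_elems


# Hypothetical shapes, chosen only for illustration.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    q, kv = ring_payload_elements(4096, 32, kv_heads, 128)
    print(f"{name}: KV/Q payload ratio = {kv / q}")
```

Note that rotating $\mathbf{Q}$ would also have to carry the running $\mathbf{O}$ (the same size as $\mathbf{Q}$) and the softmax statistics, so the real gap is larger than the ratio printed here.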

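The zig-zag split is also why we require $\mathrm{seq\_len} \bmod (2 \times \mathrm{cp\_size}) = 0$: the sequence is cut into $2 \times \mathrm{cp\_size}$ equal chunks, and each rank takes one chunk from the front and its mirror chunk from the back.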
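A minimal sketch of the zig-zag assignment for a single sequence, assuming the usual layout where rank $r$ takes chunk $r$ and chunk $2 \times \mathrm{cp\_size} - 1 - r$. The function name is hypothetical; this is not the Megatron or TE code.

```python
def zigzag_token_ids(seq_len, cp_size, rank):
    """Token ids owned by `rank` when the sequence is cut into 2 * cp_size chunks."""
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // num_chunks
    mirror = num_chunks - 1 - rank  # the chunk paired with `rank` from the back
    front = range(rank * chunk, (rank + 1) * chunk)
    back = range(mirror * chunk, (mirror + 1) * chunk)
    return list(front) + list(back)


for r in range(4):
    print(r, zigzag_token_ids(16, 4, r))
```

With 16 tokens and 4 ranks, rank 0 gets tokens 0, 1, 14, 15 and rank 3 gets tokens 6, 7, 8, 9.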

CP attention dispatch in Megatron

NVIDIA/Megatron-LM/megatron/core/extensions/transformer_engine.py, lines 1437 to 1463 in 03f4111

The actual ring attention and online softmax happen inside TE.

We now know that online softmax can compute the masked scores in one pass. But ring attention has an extra step that multiplies the scores by $\mathbf V$. So how does CP merge partial results that already include $\mathbf V$? We have:

$$
\mathbf{O}_a = \mathrm{softmax}(\mathbf{s}_a) \cdot \mathbf{V}_a = \frac{\exp(\mathbf{s}_a) \cdot \mathbf{V}_a}{z_a}, \quad \mathrm{where} \quad \mathbf{s}_a = \frac{\mathbf{Q}_a \mathbf{K}_a^\mathrm{T}}{\sqrt{d}}, \quad z_a = \sum \exp(\mathbf{s}_a).
$$

Similarly, we have $\mathbf{O}_b = \frac{\exp(\mathbf{s}_b) \cdot \mathbf{V}_b}{z_b}$ . As CP splits the computation along the sequence dimension, the merge is a special (weighted) reduction, analogous to the reduction in row-parallel matrix multiplication:

$$
\mathbf{O} = \frac{z_a \mathbf{O}_a + z_b \mathbf{O}_b}{z}, \quad z = z_a + z_b,
$$

we can maintain $z$ via LSE $L$ :

$$
L = \log(z),
$$

thus $\mathbf{O} = \exp(L_a - L) \mathbf{O}_a + \exp(L_b - L) \mathbf{O}_b$ . Note that this update is associative, so we only need to maintain $L_t$ and $\mathbf{O}_t$ at each ring step $t$ . Let $L'$ and $\mathbf{O}'$ be the partial LSE and output produced at step $t+1$ ; the running update is:

$$
\begin{aligned} m &= \max(L_t,\, L'), \\ L_{t+1} &= m + \log\!\left(\exp(L_t - m) + \exp(L' - m)\right), \\ \alpha &= \exp(L_t - L_{t+1}),\quad \beta = \exp(L' - L_{t+1}), \\ \mathbf{O}_{t+1} &= \alpha\, \mathbf{O}_t + \beta\, \mathbf{O}', \end{aligned}
$$

where $m$ subtracts the maximum for numerical stability, and $\alpha, \beta \in (0,1)$ with $\alpha + \beta = 1$ act as the soft-mixing weights on the running accumulator and the new partial.
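The recurrence can be checked numerically with a small pure-Python sketch. It handles one query vector against two KV blocks and ignores masking; `block_attention` and `merge` are hypothetical helper names, not TE functions.

```python
import math


def block_attention(q, keys, values):
    """Attention of one query against one KV block.

    Returns the partial output vector and the log-sum-exp (LSE) of the scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]
    return out, m + math.log(z)


def merge(out_t, lse_t, out_new, lse_new):
    """One step of the running update: fold a new partial into the accumulator."""
    m = max(lse_t, lse_new)
    lse_next = m + math.log(math.exp(lse_t - m) + math.exp(lse_new - m))
    alpha = math.exp(lse_t - lse_next)
    beta = math.exp(lse_new - lse_next)
    out_next = [alpha * a + beta * b for a, b in zip(out_t, out_new)]
    return out_next, lse_next


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Reference: attention over all keys at once.
ref_out, ref_lse = block_attention(q, keys, values)

# Ring style: two KV blocks arrive one after the other.
out, lse = block_attention(q, keys[:2], values[:2])
out_b, lse_b = block_attention(q, keys[2:], values[2:])
out, lse = merge(out, lse, out_b, lse_b)

print(max(abs(a - b) for a, b in zip(out, ref_out)))
```

The printed difference should be at floating-point noise level, since the merged result is the same softmax-weighted average computed in two pieces.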

📌

TE does not implement this recurrence on $\mathbf{O}$ literally. Only the LSE accumulator $L$ is streamed across ring steps; each step's partial $\mathbf{O}'$ and $L'$ are stashed into a per-step list. After the ring loop finishes, a single post-loop pass folds them in via $\mathbf{O} = \sum_i \exp(L_i - L)\,\mathbf{O}_i$ , which is mathematically equivalent to the streaming form once the final $L$ is known (the running $\alpha$ rescaling collapses). This may lead to larger runtime memory consumption, since every per-step partial stays alive until the loop ends.
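The equivalence of the two fold orders can be sketched on arbitrary partials. Outputs are scalars here for brevity, the values are made up, and the function names are hypothetical; this is not the TE implementation.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def fold_after_loop(partials):
    """Post-loop form: needs every (O_i, L_i) kept until the final L is known."""
    final_lse = logsumexp([lse for _, lse in partials])
    out = sum(math.exp(lse - final_lse) * o for o, lse in partials)
    return out, final_lse


def fold_streaming(partials):
    """Streaming form: only the running (O_t, L_t) pair is kept."""
    out, lse = partials[0]
    for o_new, lse_new in partials[1:]:
        lse_next = logsumexp([lse, lse_new])
        out = math.exp(lse - lse_next) * out + math.exp(lse_new - lse_next) * o_new
        lse = lse_next
    return out, lse


# Made-up per-step partials (O_i, L_i) from four ring steps.
partials = [(0.2, -1.0), (1.5, 0.3), (-0.7, 2.1), (0.9, -0.4)]
print(fold_after_loop(partials))
print(fold_streaming(partials))
```

Both calls should print the same pair up to rounding. The post-loop form trades memory (all partials are held) for a single final rescale.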

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 115 to 126 in 9ad2e7b

Ulysses ¹ (`a2a` CP)

There are two types of layers in an LLM: sequence-level (e.g., attention) and token-level (e.g., MoE, MLP). Token-level layers do not need the entire sequence, since positions within a sequence do not interact. Thus, we can split along the sequence dimension and let each CP rank handle a different part of the activation.

In Megatron, attention CP is handled by TE internally:

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 470 to 524 in 80ea313

It forms a three-stage software-defined pipeline:

local reshape
output buffer creation + async communication trigger
wait on communication and post-transform

The shape manipulation during this process is shown as follows:

Before attention CP pipeline

By default, the Megatron CP scheduler organizes tokens in zig-zag order for load balance. In a2a CP, however, each rank owns the full sequence after the all-to-all, so TE has to permute again to recover the causal order.
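The need for this extra permutation can be seen in a toy simulation that moves (chunk, head) labels instead of tensors. It assumes the receive buffer is filled in source-rank order; the function names are hypothetical and this is not the TE code.

```python
def a2a_seq_to_head(cp_size, num_heads):
    """Simulate the pre-attention all-to-all on (chunk, head) labels.

    Before: rank r holds zig-zag chunks r and 2*cp_size-1-r for all heads.
    After:  rank r holds every chunk, but only the heads in its own slice.
    """
    if num_heads % cp_size != 0:
        raise ValueError("a2a needs num_heads divisible by cp_size")
    heads_per_rank = num_heads // cp_size
    before = {
        r: [(c, h) for c in (r, 2 * cp_size - 1 - r) for h in range(num_heads)]
        for r in range(cp_size)
    }
    after = {r: [] for r in range(cp_size)}
    for src in range(cp_size):  # assumption: buffers are filled in source-rank order
        for chunk, head in before[src]:
            after[head // heads_per_rank].append((chunk, head))
    return before, after


def received_chunk_order(labels):
    """Chunk ids in the order they appear in a receive buffer."""
    order = []
    for chunk, _ in labels:
        if chunk not in order:
            order.append(chunk)
    return order


before, after = a2a_seq_to_head(cp_size=2, num_heads=4)
order = received_chunk_order(after[0])
restore = sorted(range(len(order)), key=order.__getitem__)
print(order)    # [0, 3, 1, 2]
print(restore)  # [0, 2, 3, 1]
```

Rank 0 receives chunks in the order 0, 3, 1, 2, so an index permutation such as `restore` is needed before a causal mask can be applied.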

After attention, each CP rank holds a full-sequence, head-sharded tensor, and TE needs to convert it back to the sequence-sharded format:

CP software pipeline code in TE

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 526 to 585 in 80ea313

This time, it's another 3-stage software pipeline conducting the steps:

reorder to the load-balanced token order
async collective
post-processing, including reshape

After attention CP pipeline

Hybrid (hierarchical) CP ²

In MBridge, you can enable the a2a+p2p CP via the following config:

cfg.model.context_parallel_size = 8
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [4, 2]   # prod == 8
cfg.dist.use_decentralized_pg = False                    # <- required for Bridge

Intrinsically, the a2a CP size is bounded by the number of KV heads, because the a2a operation in the attention layer shards heads across ranks. To exceed this limit, HCP combines intra-node a2a with inter-node p2p to provide a larger CP group. a2a runs within the inner CP ranks over the faster GPU interconnect, and p2p runs across nodes connected by InfiniBand (IB).

Hybrid CP with inner CP size = 4 and outer CP size = 2

We use icp and ocp to denote the inner and outer CP sizes, so the total CP size equals icp $\times$ ocp. The entire sequence is still split into CP-size chunks, where each a2a domain owns icp consecutive chunks along the sequence dimension.
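A sketch of how one CP group could be carved into inner (a2a) and outer (p2p) groups. It assumes the inner ranks are contiguous, which matches the intra-node placement described above but is an assumption about rank layout; the function name is hypothetical.

```python
def hierarchical_cp_groups(cp_ranks, sizes):
    """Split one CP group into inner (a2a) and outer (p2p) groups.

    sizes = [icp, ocp]; assumes inner ranks are contiguous within cp_ranks.
    """
    icp, ocp = sizes
    if icp * ocp != len(cp_ranks):
        raise ValueError("product of hierarchical sizes must equal the CP size")
    a2a_groups = [cp_ranks[i * icp:(i + 1) * icp] for i in range(ocp)]
    p2p_groups = [cp_ranks[j::icp] for j in range(icp)]
    return a2a_groups, p2p_groups


a2a, p2p = hierarchical_cp_groups(list(range(8)), [4, 2])
print(a2a)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(p2p)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each rank belongs to exactly one a2a group and one p2p group, so the KV-head limit applies only to icp while ocp extends the total CP size.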

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 1537 to 1561 in 80ea313

Inside TE, each ICP rank conducts a2a to flip the activation from context-split to head-split, so that each ICP rank now holds the full sequence of its a2a domain but only $1/\mathrm{icp}$ of the heads. Then, before attention, it restores the causal order from the load-balanced order using the index mapping chunk_ids_for_a2a obtained from get_seq_chunk_ids_for_reordering_before_attn. After the inner a2a, each rank has $[s/\mathrm{ocp},\, b,\, h/\mathrm{icp},\, d]$ for $\mathbf{Q}$ , $\mathbf{K}$ and $\mathbf{V}$ as the attention inputs.

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 1458 to 1470 in 80ea313

TE then specifies the send and receive peers for the p2p exchanges of $\mathbf{K}$ and $\mathbf{V}$ , with ocp ring steps in total. As in the pure p2p CP described above, each rank retains its own $\mathbf{Q}$ split and exchanges $\mathbf{K}$ / $\mathbf{V}$ for the attention calculation.

Software pipelining code for p2p CP attention in TE

NVIDIA/TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py, lines 1744 to 2021 in 80ea313

🤯

In HCP, the inner CP group undoes the load balancing and permutes tokens into causal order (e.g., $L_{00} - L_{04}$ ) in Fig. HCP. The outer CP group, by contrast, keeps the zig-zag order for load balance. For example, with ocp set to 2, outer CP rank 0 owns $\mathbf{Q}_0$ and $\mathbf{Q}_3$ and the other outer CP rank owns $\mathbf{Q}_1$ and $\mathbf{Q}_2$ , while tokens within each $\mathbf{Q}_i$ are in causal order.

`allgather` CP

There is another CP mode, allgather, which is probably the simplest among these four modes. On each CP rank, $\mathbf{K}$ and $\mathbf{V}$ are all-gathered to produce full-sequence, full-head attention inputs. The token order is also un-zig-zagged so that $\mathbf{K}$ and $\mathbf{V}$ are in causal order.

AG CP shape transform

The attention query $\mathbf{Q}$ is still sharded across CP ranks and remains in zig-zag order. Therefore, the computation is balanced across CP ranks.
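The balance claim can be verified with a short count of (Q chunk, KV chunk) block pairs under a causal mask. This is a simplified model that counts the diagonal block as a full block; the function name is hypothetical.

```python
def causal_blocks_per_rank(cp_size, zigzag=True):
    """Count (Q chunk, KV chunk) pairs each rank computes under a causal mask."""
    num_chunks = 2 * cp_size
    work = []
    for r in range(cp_size):
        if zigzag:
            chunks = (r, num_chunks - 1 - r)  # one front chunk, one mirror chunk
        else:
            chunks = (2 * r, 2 * r + 1)       # two neighboring chunks
        # Q chunk c attends to KV chunks 0..c, i.e. c + 1 blocks.
        work.append(sum(c + 1 for c in chunks))
    return work


print(causal_blocks_per_rank(4, zigzag=True))   # [9, 9, 9, 9]
print(causal_blocks_per_rank(4, zigzag=False))  # [3, 7, 11, 15]
```

With the zig-zag split every rank computes $2 \times \mathrm{cp\_size} + 1$ blocks, while a contiguous split leaves the last rank with five times the work of the first in this example.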

Sequence packing and balance

The traditional sbhd data layout requires all sequences within a batch to be padded to a fixed length, which wastes FLOPs. In Megatron, one can convert sbhd into the thd layout, where t is the total number of tokens within a batch, expressed as follows:

NVIDIA/Megatron-LM/megatron/core/packed_seq_params.py, lines 9 to 26 in 4415119

cu_seqlens_q(kv)_padded is provided as the static input spec for CUDA graphs. Combined with CP, the thd tokens are scattered across CP ranks.
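As a quick illustration of the thd bookkeeping, the cumulative sequence lengths are a running sum of the per-sequence lengths. The helper name `pack_thd` is hypothetical, and the lengths are the ones used in the example further down.

```python
from itertools import accumulate


def pack_thd(seq_lens):
    """Return cu_seqlens, the packed token count, and the padded sbhd token count."""
    cu_seqlens = [0] + list(accumulate(seq_lens))
    total_tokens = cu_seqlens[-1]                 # the "t" in thd
    padded_tokens = len(seq_lens) * max(seq_lens)  # sbhd pads every row to the longest
    return cu_seqlens, total_tokens, padded_tokens


cu, t, padded = pack_thd([12, 24, 6])
print(cu, t, padded)  # [0, 12, 36, 42] 42 72
```

Here packing processes 42 tokens where padded sbhd would process 72.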

TE CUDA kernel

NVIDIA/TransformerEngine/transformer_engine/common/fused_attn/context_parallel.cu, lines 80 to 106 in 80ea313

NVIDIA/TransformerEngine/transformer_engine/common/fused_attn/context_parallel.cu, lines 673 to 677 in 80ea313

Here is a concrete example. Given CP size = 3 and cu_seq_lens = $[0, 12, 36, 42]$ , the kernel outputs zig-zagged indices for each CP rank:

Rank 0 indices: [ 0,  1, 10, 11, 12, 13, 14, 15, 32, 33, 34, 35, 36, 41]
Rank 1 indices: [ 2,  3,  8,  9, 16, 17, 18, 19, 28, 29, 30, 31, 37, 40]
Rank 2 indices: [ 4,  5,  6,  7, 20, 21, 22, 23, 24, 25, 26, 27, 38, 39]
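The indices above can be reproduced with a few lines of Python. This is a reference sketch of the per-sequence zig-zag split, not a port of the CUDA kernel, and the function name is hypothetical.

```python
def thd_zigzag_indices(cu_seqlens, cp_size, rank):
    """Token indices owned by `rank` when every packed sequence is split zig-zag."""
    num_chunks = 2 * cp_size
    indices = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        seq_len = end - start
        if seq_len % num_chunks != 0:
            raise ValueError("each sequence must be divisible by 2 * cp_size")
        chunk = seq_len // num_chunks
        # Front chunk first, then its mirror chunk from the back.
        for chunk_id in (rank, num_chunks - 1 - rank):
            first = start + chunk_id * chunk
            indices.extend(range(first, first + chunk))
    return indices


for r in range(3):
    print(f"Rank {r} indices:", thd_zigzag_indices([0, 12, 36, 42], 3, r))
```

Each sequence is split independently, which is why every sequence length (not just the total) has to be divisible by $2 \times \mathrm{cp\_size}$.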

Positional embeddings

When sequence packing is enabled, TE must take care of the positional embedding (PE), as the token order has changed:

NVIDIA/TransformerEngine/transformer_engine/common/fused_rope/fused_rope.cu, lines 463 to 486 in 80ea313

We can see under thd format, o_stride_s_or_t is set as $h \times d$ and the batch dimension is ignored. Also, the CUDA kernel maps token positions into freqs with s_id_for_freqs considering both the sequence packing and CP split:

NVIDIA/TransformerEngine/transformer_engine/common/fused_rope/fused_rope.cu, lines 132 to 170 in 80ea313
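The position mapping implied by the zig-zag split can be sketched as follows. The function name `rope_position` is hypothetical and the code is derived from the layout rather than copied from the kernel; for a packed sequence, `local_len` would be that sequence's length divided by the CP size.

```python
def rope_position(local_pos, local_len, cp_size, cp_rank):
    """Map a position in a rank's local shard of one sequence to its global position.

    The first half of the shard is the front chunk, the second half the mirror chunk.
    """
    half = local_len // 2
    if local_pos < half:
        return cp_rank * half + local_pos
    return local_len * cp_size - (cp_rank + 1) * half + (local_pos - half)


# First packed sequence of the example above: length 12, CP size 3, 4 tokens per rank.
for r in range(3):
    print(r, [rope_position(p, 4, 3, r) for p in range(4)])
```

The output matches the first four indices of each rank in the zig-zag example, so each token gets the rotary frequency of its true position in the sequence.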

DCP: Dynamic DP $\times$ CP groups

Even with CP combined with sequence packing, some bottlenecks still hinder performance ³:

static CP is pinned to the worst-case sequence in the batch
equal pack lengths ≠ equal compute (DP imbalance)
CP communication stops hiding behind compute when packs are short

To overcome these, dynamic CP (DCP) is proposed: at each micro batch, the scheduler dynamically chooses the best combination of CP size and packed-sequence composition. With multiple CP groups of varied sizes (shared with the DP group), Megatron can scale the CP domain to accommodate the sequence length distribution.

Currently (2026/05), DCP can only be enabled via MCore. You may enable it for training via:

torchrun ... pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --context-parallel-size $CP \
    --hybrid-context-parallel \
    --max-seqlen-per-dp-cp-rank $MAX_SEQ_PER_RANK \
    ...

‼️

The --hybrid-context-parallel option here is a fundamentally different feature from the aforementioned HCP. See the "pitfalls" section in the MBridge docs.

The key argument is --max-seqlen-per-dp-cp-rank, which controls the maximal sequence length each DCP rank can receive.

NVIDIA/Megatron-LM/megatron/core/model_parallel_config.py, lines 56 to 70 in 4415119

DCP group

When DCP is enabled, Megatron creates several DCP groups whose sizes are powers of two, $2^k$ for $1 \le k \le \log_2(N)$ , where $N$ is the total number of ranks in the DP and CP dimensions.

NVIDIA/Megatron-LM/megatron/core/parallel_state.py, lines 421 to 443 in 4415119
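A rough sketch of what such a family of groups could look like. It assumes groups are contiguous rank blocks and that the full-size group is included; both are assumptions for illustration, and the function name is hypothetical rather than the Megatron API.

```python
def dcp_candidate_groups(num_ranks, min_size=2):
    """Enumerate power-of-two group sizes and, for each, contiguous rank blocks."""
    if num_ranks < 1 or num_ranks & (num_ranks - 1) != 0:
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    groups = {}
    size = min_size
    while size <= num_ranks:
        groups[size] = [list(range(i, i + size)) for i in range(0, num_ranks, size)]
        size *= 2
    return groups


for size, blocks in dcp_candidate_groups(8).items():
    print(size, blocks)
```

Pre-creating every size up front means the scheduler can later pick a CP size per sample without building a new communicator at runtime.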

❤️

Switching a rank between CP and DP does not require resharding model weights, so we can treat the exchange as "free".

One thing to mention is per-token gradient scaling across the DP-CP group: the final gradients are divided by the total number of tokens in this micro batch across all DP-CP ranks.

NVIDIA/Megatron-LM/megatron/core/distributed/finalize_model_grads.py, lines 546 to 559 in 4415119
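A toy comparison shows why the division uses the global token count. The numbers are made up, gradients are scalars for brevity, and the function names are hypothetical.

```python
def per_token_average(grad_sums, token_counts):
    """Sum gradients over all ranks, then divide by the total token count."""
    return sum(grad_sums) / sum(token_counts)


def mean_of_rank_means(grad_sums, token_counts):
    """Naive alternative: average each rank locally, then average across ranks."""
    per_rank = [g / n for g, n in zip(grad_sums, token_counts)]
    return sum(per_rank) / len(per_rank)


# Two ranks holding a different number of tokens (made-up values).
grad_sums = [12.0, 2.0]   # sum of per-token gradients on each rank
token_counts = [6, 2]

print(per_token_average(grad_sums, token_counts))   # 1.75
print(mean_of_rank_means(grad_sums, token_counts))  # 1.5
```

When ranks hold different token counts, as they do under DCP, the naive mean over-weights tokens on lightly loaded ranks, while the per-token form weights every token equally.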

Runtime CP size decision: `BalancedCPScheduler`

Given a batch of samples, max_seq_len_per_rank, and the total number of DCP ranks, Megatron provides a scheduler named BalancedCPScheduler to determine:

how DCP ranks are split into DP and CP groups
how the batch, whose sequences vary in length, is fed into different DP ranks

Concrete batch (8 ranks, max_seqlen_per_dp_cp_rank = 4K, workload = $L^2 \div$ cp_size, represented by length in the following diagram):

sample	length $L$	CP size	stage	assigned ranks
$s0$	32K	8	0	0..7
$s1$	16K	4	1	0..3
$s2$	8K	2	1	4..5
$s3$	4K	1	1	6
$s4$	4K	1	1	6
$s5$	4K	1	1	7
$s6$	4K	1	1	7
$s7$	2K	1	1	6
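The CP size column can be reproduced by rounding the required rank count up to a power of two. This rounding rule is an assumption that happens to match the table; the function names are hypothetical and this is not the BalancedCPScheduler code.

```python
def pick_cp_size(seq_len, max_seqlen_per_rank, num_ranks):
    """Smallest power-of-two CP size that keeps each rank within the length cap."""
    needed = -(-seq_len // max_seqlen_per_rank)  # ceiling division
    cp_size = 1
    while cp_size < needed:
        cp_size *= 2
    if cp_size > num_ranks:
        raise ValueError("sequence does not fit on the available ranks")
    return cp_size


def per_rank_workload(seq_len, cp_size):
    """Workload model from the text: L^2 / cp_size."""
    return seq_len ** 2 / cp_size


K = 1024
for name, length in [("s0", 32 * K), ("s1", 16 * K), ("s2", 8 * K), ("s3", 4 * K), ("s7", 2 * K)]:
    cp = pick_cp_size(length, 4 * K, 8)
    print(name, cp, per_rank_workload(length, cp))
```

Under this model the per-rank workload of the 32K sample is still 128 times that of a 4K sample, which is why equal CP shards do not imply equal compute.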

Sequence scheduling goal: pack one global batch's sub-sequences onto the GPU rectangle, balanced and idle-free, with each sub-sequence drawn as a rectangle: taller for longer sequences (more GPUs needed), wider for more per-GPU work.

Bucket samples by size class. Group sub-sequences by how many GPUs they need, with roughly equal total work per bucket. Buckets get processed largest-first.
Greedy fill. Walk the buckets and place each sub-sequence either into an existing group of the right size or onto fresh free GPUs — whichever leaves the worst-loaded GPU lighter.
Stop when balanced. Once column heights are close enough, close the round.
Trim overshoot. If one column ended up tall, peel its last-placed sub-sequence back into the leftover queue if that helps even things out.
Fill empty GPUs. If any GPUs are still idle, keep doubling the smallest existing group's reach until every GPU has work, sliding neighbors aside as needed.
Repeat per round. Whatever didn't fit goes into the next scheduling round, separated by a barrier so groups can safely reshape between rounds.

🫆

Concepts to distinguish:

global batch: per DataLoader.__next__ / per optimizer step
micro batch: per forward / backward
scheduling round: one per sync barrier across the DCP ranks

Note that Megatron does NOT provide an optimal scheduling plan for DCP.

Sequence packing with DCP

If you specify the thd format together with DCP, Megatron splits the packed sequences back into sbhd format, since two consecutive sequences on a rank may have different CP sizes and the THD sequence packing format would lose that flexibility. So each DCP rank runs exactly $t$ forward / backward passes, where $t$ is the number of sequences in the current group (scheduling round).
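The resulting control flow can be sketched as below. `split_packed`, `run_round`, and the `step_fn` callback are hypothetical names, and the stand-in callback only records what a real forward / backward step would receive.

```python
def split_packed(tokens, cu_seqlens):
    """Cut a packed token list back into individual sequences."""
    return [tokens[s:e] for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]


def run_round(tokens, cu_seqlens, cp_sizes, step_fn):
    """Run one forward / backward per sequence, each with its own CP size."""
    sequences = split_packed(tokens, cu_seqlens)
    return [step_fn(seq, cp) for seq, cp in zip(sequences, cp_sizes)]


tokens = list(range(14))
# Stand-in for a real training step: report (sequence length, CP size).
calls = run_round(tokens, [0, 8, 12, 14], [4, 2, 1], lambda seq, cp: (len(seq), cp))
print(calls)  # [(8, 4), (4, 2), (2, 1)]
```

Three sequences in the round lead to three separate passes, each free to use a different CP group.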

Outro

So when should you use each CP mode? I think the MBridge docs already have the answer:

😎

This post also answers the question: why is sbhd the popular batch format among recipes?

A: AlltoAll and split operations ship contiguous tensors if the first dimension is the sequence dim.
