Megatron CP, HCP and DCP

May 26, 2026

Context parallelism (CP) mitigates memory pressure when a sequence is longer than a single GPU can hold. In this post, I'm going to walk through several CP variants: p2p, a2a, hybrid CP, and dynamic CP.

Ring attention (p2p CP)

For a sequence, each CP rank holds $\mathbf{Q}$ for its tokens, which are assigned in zig-zag order for load balance. During the forward pass, each rank sends $\mathbf{KV}$ to its next neighbor while receiving from its previous neighbor, in ring order. Meanwhile (with computation-communication overlap), each rank also calculates the partial result of $\mathrm{softmax}(\mathbf{Q} \mathbf{K}^\mathrm{T})$ using online softmax.

🫣

Q: Why transmit $\mathbf{KV}$ rather than $\mathbf{Q}$?

  1. The partial output $\mathbf{O}$ is stored locally and then fed into the online softmax at the next step.
  2. If we kept $\mathbf{KV}$ on each rank and let $\mathbf{Q}$ rotate instead, the intermediate results would have to travel together with $\mathbf{Q}$.
  3. Attention variants like GQA and MQA have a smaller $\mathbf{KV}$ than $\mathbf{Q}$.
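
To put rough numbers on the third point, here is a small sketch that compares the per-step ring payload of $\mathbf{Q}$ against $\mathbf{K}$ plus $\mathbf{V}$. The helper name `ring_payload_bytes` and all shapes are illustrative assumptions, not Megatron or TE code.

```python
def ring_payload_bytes(tokens_per_rank, num_q_heads, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes one rank would send per ring step if it rotated Q vs. K and V."""
    q_bytes = tokens_per_rank * num_q_heads * head_dim * bytes_per_elem
    # K and V travel together, hence the factor of 2.
    kv_bytes = 2 * tokens_per_rank * num_kv_heads * head_dim * bytes_per_elem
    return q_bytes, kv_bytes

# Hypothetical shapes: 8K tokens per rank, 32 query heads, head_dim 128, bf16.
mha = ring_payload_bytes(8192, 32, 32, 128)   # MHA: as many KV heads as Q heads
gqa = ring_payload_bytes(8192, 32, 8, 128)    # GQA: 4 query heads per KV head
mqa = ring_payload_bytes(8192, 32, 1, 128)    # MQA: a single KV head

for name, (q, kv) in {"MHA": mha, "GQA": gqa, "MQA": mqa}.items():
    print(f"{name}: Q = {q / 2**20:.0f} MiB, KV = {kv / 2**20:.0f} MiB")
```

Under these assumed shapes, plain MHA actually ships twice as many bytes for $\mathbf{KV}$ as it would for $\mathbf{Q}$, so the bandwidth argument only applies to GQA and MQA; the first two points hold either way.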

The zig-zag split gives each rank two chunks, one from the front of the sequence and one from the back, and this is why we require $\mathrm{seq\_len} \bmod (2 \times \mathrm{cp\_size}) = 0$.
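
A minimal sketch of the zig-zag assignment may help here. It assumes the usual scheme where the sequence is cut into `2 * cp_size` equal chunks and rank `r` takes chunk `r` plus its mirror chunk from the tail; `zigzag_chunks` and `causal_work` are hypothetical helper names.

```python
def zigzag_chunks(seq_len, cp_size):
    """Return, for each CP rank, the token indices it owns under a zig-zag split."""
    assert seq_len % (2 * cp_size) == 0, "seq_len must be divisible by 2 * cp_size"
    chunk = seq_len // (2 * cp_size)
    owned = []
    for rank in range(cp_size):
        front = range(rank * chunk, (rank + 1) * chunk)
        back_id = 2 * cp_size - 1 - rank          # mirrored chunk from the tail
        back = range(back_id * chunk, (back_id + 1) * chunk)
        owned.append(list(front) + list(back))
    return owned

def causal_work(indices):
    """Number of (query, key) pairs a rank must score under a causal mask."""
    return sum(i + 1 for i in indices)

split = zigzag_chunks(seq_len=16, cp_size=4)
for rank, idx in enumerate(split):
    print(rank, idx, causal_work(idx))
```

Pairing an early chunk with a late one gives every rank the same number of causal (query, key) pairs, which is the load-balance property the zig-zag order is after.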

CP attention dispatch in Megatron

The actual ring attention and online softmax happen inside TE.


We now know that online softmax can compute the masked scores in one pass. But ring attention has an extra step: multiplying the scores by $\mathbf{V}$. So how does CP merge partial results that already have $\mathbf{V}$ folded in? We have:

$$\mathbf{O}_a = \mathrm{softmax}(\mathbf{s}_a) \cdot \mathbf{V}_a = \frac{\exp(\mathbf{s}_a) \cdot \mathbf{V}_a}{z_a}, \ \mathrm{where} \ \mathbf{s}_a = \frac{\mathbf{Q}_a \mathbf{K}_a^\mathrm{T}}{\sqrt{d}}, \ z_a = \sum \exp(\mathbf{s}_a).$$

Similarly, we have $\mathbf{O}_b = \frac{\exp(\mathbf{s}_b) \cdot \mathbf{V}_b}{z_b}$. As CP splits the computation along the sequence dimension, the merge is effectively a special (weighted) reduction, much like the one in row-parallel matrix multiplication:

$$\mathbf{O} = \frac{z_a \mathbf{O}_a + z_b \mathbf{O}_b}{z}, \quad z = z_a + z_b,$$

we can maintain $z$ via the log-sum-exp (LSE) $L$:

$$L = \log(z),$$

thus $\mathbf{O} = \exp(L_a - L) \mathbf{O}_a + \exp(L_b - L) \mathbf{O}_b$. Note that this update is associative, so we only need to maintain $L_t$ and $\mathbf{O}_t$ at each ring step $t$. Let $L'$ and $\mathbf{O}'$ be the partial LSE and output produced at step $t+1$; the running update is:

$$\begin{aligned}
m &= \max(L_t,\, L'), \\
L_{t+1} &= m + \log\!\left(\exp(L_t - m) + \exp(L' - m)\right), \\
\alpha &= \exp(L_t - L_{t+1}),\quad \beta = \exp(L' - L_{t+1}), \\
\mathbf{O}_{t+1} &= \alpha\, \mathbf{O}_t + \beta\, \mathbf{O}',
\end{aligned}$$

where subtracting the maximum $m$ keeps the exponentials numerically stable, and $\alpha, \beta \in (0,1)$ with $\alpha + \beta = 1$ act as the soft-mixing weights on the running accumulator and the new partial.
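
The running update can be checked numerically with a small NumPy sketch. It is a single-process toy, not TE code: there is no causal mask, the "ring" is just a loop over three KV chunks, and `partial_attention` and `merge` are hypothetical helper names.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention of q against one KV chunk; returns (output, per-row LSE)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    z = p.sum(axis=-1, keepdims=True)
    return (p @ v) / z, m + np.log(z)

def merge(o_t, l_t, o_new, l_new):
    """One step of the running LSE / output update."""
    m = np.maximum(l_t, l_new)
    l_next = m + np.log(np.exp(l_t - m) + np.exp(l_new - m))
    alpha, beta = np.exp(l_t - l_next), np.exp(l_new - l_next)
    return alpha * o_t + beta * o_new, l_next

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((12, 8))
v = rng.standard_normal((12, 8))

# Reference: attention over the whole KV at once.
o_ref, l_ref = partial_attention(q, k, v)

# "Ring": visit the KV in 3 chunks and fold each partial into the accumulator.
o_acc, l_acc = partial_attention(q, k[:4], v[:4])
for step in range(1, 3):
    sl = slice(4 * step, 4 * (step + 1))
    o_acc, l_acc = merge(o_acc, l_acc, *partial_attention(q, k[sl], v[sl]))

print(np.allclose(o_acc, o_ref), np.allclose(l_acc, l_ref))
```

Because the merge is associative, the order in which chunks arrive around the ring does not change the result.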

📌

TE does not implement this recurrence on $\mathbf{O}$ literally. Only the LSE accumulator $L$ is streamed across ring steps; each step's partial $\mathbf{O}'$ and $L'$ are stashed in a per-step list. After the ring loop finishes, a single post-loop pass folds them in via $\mathbf{O} = \sum_i \exp(L_i - L)\,\mathbf{O}_i$, which is mathematically equivalent to the streaming form once the final $L$ is known (the running $\alpha$ rescaling collapses). This may lead to higher runtime memory consumption.
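
The stash-then-fold variant can be sketched the same way. This is a NumPy toy under the same assumptions as before (no causal mask, a plain loop instead of a ring, hypothetical helper names); it only illustrates the arithmetic, not how TE organizes its buffers.

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention of q against one KV chunk; returns (output, per-row LSE)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    m = s.max(axis=-1, keepdims=True)
    p = np.exp(s - m)
    z = p.sum(axis=-1, keepdims=True)
    return (p @ v) / z, m + np.log(z)

rng = np.random.default_rng(1)
q = rng.standard_normal((4, 8))
kv_chunks = [(rng.standard_normal((5, 8)), rng.standard_normal((5, 8))) for _ in range(4)]

# Ring loop: stream only the LSE, stash every step's partial output and LSE.
stash, lse = [], None
for k, v in kv_chunks:
    o_i, l_i = partial_attention(q, k, v)
    stash.append((o_i, l_i))
    lse = l_i if lse is None else np.logaddexp(lse, l_i)

# Post-loop pass: rescale each stashed partial by exp(L_i - L) and sum.
o_folded = sum(np.exp(l_i - lse) * o_i for o_i, l_i in stash)

# Reference: attention over the concatenated KV.
k_all = np.concatenate([k for k, _ in kv_chunks])
v_all = np.concatenate([v for _, v in kv_chunks])
o_ref, l_ref = partial_attention(q, k_all, v_all)
print(np.allclose(o_folded, o_ref), len(stash))
```

The stash holds one output-sized tensor per ring step, which is where the extra memory comes from compared with a single running accumulator.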

Ulysses 1 (a2a CP)

An LLM has two types of layers: sequence-level (e.g., attention) and token-level (e.g., MoE, MLP). Token-level layers don't need the entire sequence, as no interaction happens between positions within a sequence. Thus, we can split along the sequence dimension and let each CP rank handle a different part of the activation.

In Megatron, attention CP is handled by TE internally:

It forms a 3-stage software defined pipeline:

  1. local reshape
  2. output buffer creation + async communication trigger
  3. wait on communication and post-transform

The shape manipulation during this process is shown as follows:

Before attention CP pipeline

By default, the Megatron CP scheduler arranges tokens in zig-zag order for load balance, but in a2a CP each rank now owns the full sequence, so TE has to permute again to recover the causal order.

After the attention, each CP rank holds a full-sequence, head-sharded tensor, and TE needs to convert it back to the sequence-sharded format:

CP software pipeline code in TE

This time, it's another 3-stage software pipeline conducting the steps:

  1. reorder to the load-balanced token order
  2. async collective
  3. post-processing, including reshape

After attention CP pipeline
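
The round trip in this section can be simulated in a single process. The sketch below assumes contiguous sequence shards (it skips the zig-zag reordering) and models the all-to-all as a plain list exchange; `all_to_all` is a hypothetical helper, not the collective TE calls.

```python
import numpy as np

cp, s, b, h, d = 4, 16, 2, 8, 3
full = np.arange(s * b * h * d, dtype=np.float32).reshape(s, b, h, d)

# Before attention: each rank holds a sequence shard [s/cp, b, h, d].
seq_shards = np.split(full, cp, axis=0)

def all_to_all(chunks_per_rank):
    """Simulated all-to-all: rank r receives chunk r from every rank."""
    n = len(chunks_per_rank)
    return [[chunks_per_rank[src][dst] for src in range(n)] for dst in range(n)]

# Forward a2a: split heads into cp groups, exchange, concatenate along sequence.
send = [np.split(x, cp, axis=2) for x in seq_shards]
head_shards = [np.concatenate(recv, axis=0) for recv in all_to_all(send)]
print(head_shards[0].shape)        # (16, 2, 2, 3): full sequence, h/cp heads

# Reverse a2a after attention: split sequence, exchange, concatenate along heads.
send_back = [np.split(x, cp, axis=0) for x in head_shards]
restored = [np.concatenate(recv, axis=2) for recv in all_to_all(send_back)]
print(all(np.array_equal(r, x) for r, x in zip(restored, seq_shards)))
```

Each rank sees every token while it holds only `h / cp` heads, so attention itself needs no further communication between the two exchanges.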

Hybrid (hierarchical) CP 2

In MBridge, you can enable the a2a+p2p CP via the following config:

cfg.model.context_parallel_size = 8
cfg.model.cp_comm_type = "a2a+p2p"
cfg.model.hierarchical_context_parallel_sizes = [4, 2]   # prod == 8
cfg.dist.use_decentralized_pg = False                    # <- required for Bridge

Intrinsically, the CP size is bounded by the number of KV heads due to the a2a operation in the attention layer. To exceed this limit, HCP combines intra-node a2a with inter-node p2p to support a larger CP group size. a2a happens among inner-CP ranks with faster GPU interconnects, and p2p runs across inter-node ranks connected via IB.

Hybrid CP with inner CP size = 4 and outer CP size = 2

We use icp and ocp to denote the inner and outer CP sizes, so the total CP size equals icp $\times$ ocp. The entire sequence is still split into CP size chunks, where each a2a domain owns icp consecutive tokens in the sequence dimension.

Inside TE, each ICP rank conducts an a2a to flip the activation from context-split to head-split, so that each ICP rank now maintains the full sequence but only $1/\mathrm{icp}$ of the heads. Then, before attention, it has to restore the causal order from the load-balancing order using the index mapping chunk_ids_for_a2a obtained from get_seq_chunk_ids_for_reordering_before_attn. After the inner a2a, each rank has $[s/\mathrm{ocp}, b, h/\mathrm{icp}, d]$ for $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ as the attention inputs.
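
As a quick sanity check on those shapes, here is a bookkeeping sketch. `hcp_shapes` is a hypothetical helper and the sizes are made up; it only tracks tensor shapes and performs no communication.

```python
def hcp_shapes(s, b, h, d, icp, ocp):
    """Per-rank activation shapes before and after the inner a2a."""
    cp = icp * ocp
    assert s % (2 * cp) == 0, "sequence must split into 2 * cp zig-zag chunks"
    assert h % icp == 0, "the inner a2a shards heads, so icp must divide the head count"
    before = (s // cp, b, h, d)              # context-split: every rank holds s/cp tokens
    after = (s // ocp, b, h // icp, d)       # head-split inside the a2a domain
    return before, after

# Hypothetical sizes: 64K tokens, 32 heads, inner a2a of 4, outer ring of 2.
before, after = hcp_shapes(s=65536, b=1, h=32, d=128, icp=4, ocp=2)
print(before, after)
```

The element count per rank is unchanged by the a2a; only the sharded dimension moves from sequence to heads. For $\mathbf{K}$ and $\mathbf{V}$, `h` is the number of KV heads, which is why only icp, not the full CP size, has to divide it.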

Then TE specifies the send and receive peers for the p2p exchanges of $\mathbf{K}$ and $\mathbf{V}$, with ocp ring steps in total. As in the pure p2p CP described earlier, each rank retains its own $\mathbf{Q}$ split and exchanges $\mathbf{K}$ / $\mathbf{V}$ for the attention calculation.

Software pipelining code for p2p CP attention in TE
🤯

In HCP, the inner CP group undoes the load balancing and permutes tokens into causal order (e.g., $L_{00} - L_{04}$ in Fig. HCP). The outer CP group still keeps the zig-zag order for load balance. For example, with ocp set to 2, outer CP rank 0 owns $\mathbf{Q}_0$ and $\mathbf{Q}_3$ and the other outer CP rank owns $\mathbf{Q}_1$ and $\mathbf{Q}_2$, while the tokens within each $\mathbf{Q}_i$ are in causal order.

allgather CP

There is another CP mode, allgather, which is probably the simplest of these 4 modes. On each CP rank, $\mathbf{K}$ and $\mathbf{V}$ are all-gathered to produce the full-sequence, full-head attention input. The token order is also un-zig-zagged so that $\mathbf{K}$ and $\mathbf{V}$ are in causal order.

AG CP shape transform

The attention query $\mathbf{Q}$ is still sharded across CP ranks and stays in zig-zag order. Therefore, the computation is balanced across CP ranks.
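
The gather-then-reorder step for $\mathbf{K}$ can be simulated with NumPy. This is a single-process sketch that assumes the allgather concatenates the local shards in rank order; it is not the TE implementation.

```python
import numpy as np

cp, s, d = 4, 16, 2
k_full = np.arange(s * d, dtype=np.float32).reshape(s, d)   # row i is token i's key

# Zig-zag ownership: rank r holds chunk r and chunk 2*cp-1-r.
chunks = np.split(k_full, 2 * cp, axis=0)
local_k = [np.concatenate([chunks[r], chunks[2 * cp - 1 - r]]) for r in range(cp)]

# Simulated allgather: every rank ends up with all ranks' local K, in rank order.
gathered = np.concatenate(local_k)

# Un-zig-zag: front halves in rank order, then back halves in reverse rank order.
halves = np.split(gathered, 2 * cp, axis=0)     # [r0 front, r0 back, r1 front, ...]
causal = np.concatenate([halves[2 * r] for r in range(cp)] +
                        [halves[2 * r + 1] for r in reversed(range(cp))])
print(np.array_equal(causal, k_full))
```

The same permutation would apply to $\mathbf{V}$, while $\mathbf{Q}$ stays in its local zig-zag layout.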

Sequence packing and balance

The traditional sbhd data layout requires all sequences within a batch to be padded to a fixed length, which wastes FLOPs. In Megatron, one can convert sbhd into the thd layout, where t is the total number of tokens within a batch, which is expressed as follows:

cu_seqlens_q(kv)_padded is provided as a static input spec for CUDA graphs. Combined with CP, the thd tokens are scattered across CP ranks.

TE CUDA kernel

Here is a concrete example: given CP size = 3 and cu_seq_lens = $[0, 12, 36, 42]$, the kernel outputs zig-zagged indices for each CP rank:

Rank 0 indices: [ 0,  1, 10, 11, 12, 13, 14, 15, 32, 33, 34, 35, 36, 41]
Rank 1 indices: [ 2,  3,  8,  9, 16, 17, 18, 19, 28, 29, 30, 31, 37, 40]
Rank 2 indices: [ 4,  5,  6,  7, 20, 21, 22, 23, 24, 25, 26, 27, 38, 39]
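
The indices above can be reproduced in plain Python. The function below is a hypothetical reference (`thd_zigzag_indices` is my own name), written under the assumption that every packed sequence is split into `2 * cp_size` equal chunks; it is not the TE CUDA kernel.

```python
def thd_zigzag_indices(cu_seqlens, cp_size):
    """Token indices owned by each CP rank for a packed (thd) batch."""
    per_rank = [[] for _ in range(cp_size)]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        length = end - start
        assert length % (2 * cp_size) == 0, "each sequence must split into 2 * cp_size chunks"
        chunk = length // (2 * cp_size)
        for rank in range(cp_size):
            front = start + rank * chunk
            back = start + (2 * cp_size - 1 - rank) * chunk   # mirrored chunk
            per_rank[rank] += list(range(front, front + chunk))
            per_rank[rank] += list(range(back, back + chunk))
    return per_rank

indices = thd_zigzag_indices([0, 12, 36, 42], cp_size=3)
for rank, idx in enumerate(indices):
    print(f"Rank {rank} indices: {idx}")
```

Each sequence is zig-zagged on its own, so every rank gets a balanced slice of every sequence rather than a balanced slice of the packed buffer as a whole.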

Positional embeddings

When sequence packing is enabled, TE has to take care of the positional embedding (PE), as the token order has changed:

We can see that under the thd format, o_stride_s_or_t is set to $h \times d$ and the batch dimension is ignored. Also, the CUDA kernel maps token positions to freqs via s_id_for_freqs, which accounts for both the sequence packing and the CP split:
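
As a rough NumPy analogue of that mapping (not the kernel itself), the sketch below computes which in-sequence position each local token should use for its rotary frequencies. `rope_positions` is a hypothetical name, and the zig-zag layout with `2 * cp_size` chunks per sequence is an assumption carried over from the packing example.

```python
import numpy as np

def rope_positions(cu_seqlens, cp_size, cp_rank):
    """Position of each local token inside its own sequence, for a zig-zag CP shard."""
    positions = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        half = (end - start) // (2 * cp_size)          # tokens per zig-zag chunk
        front = np.arange(half) + cp_rank * half
        back = np.arange(half) + (2 * cp_size - 1 - cp_rank) * half
        positions.append(np.concatenate([front, back]))
    return np.concatenate(positions)

pos = rope_positions([0, 12, 36, 42], cp_size=3, cp_rank=0)
print(pos.tolist())
```

Positions restart at 0 for every packed sequence and jump between the front and back chunks, so a naive `arange` over the local tokens would rotate most of them by the wrong angle.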

DCP: Dynamic DP$\times$CP groups

Even with CP combined with sequence packing, there are still some bottlenecks that hinder performance 3:

  • static CP is pinned to the worst-case sequence in the batch.
  • equal pack lengths ≠ equal compute (DP imbalance)
  • CP communication stops hiding behind compute when packs are short

To overcome these, dynamic CP (DCP) is proposed, with the idea that for each micro batch the scheduler dynamically chooses the best combination of CP size and packed-sequence composition. With multiple CP groups of varied CP sizes (shared with the DP group), Megatron can scale the CP domain to match the sequence length distribution.

Currently (2026/05), DCP can only be enabled via MCore. You can turn it on for training via:

torchrun ... pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --context-parallel-size $CP \
    --hybrid-context-parallel \
    --max-seqlen-per-dp-cp-rank $MAX_SEQ_PER_RANK \
    ...
‼️

The --hybrid-context-parallel option here is a fundamentally different feature from the HCP described above. See the "pitfalls" section in the MBridge docs.

The key argument is --max-seqlen-per-dp-cp-rank, which controls the maximal sequence length each DCP rank can receive.

DCP group

When DCP is enabled, Megatron creates several DCP groups whose sizes are powers of 2 in the range $[2, \log_2(N)]$, where $N$ is the total number of ranks in the DP and CP dimensions.

❤️

Switching a rank between CP and DP doesn't require resharding model weights, so we can treat such an exchange as "free".

One thing to mention is per-token gradient scaling across the DP-CP group: the final gradients are divided by the total number of tokens in the micro batch across all DP-CP ranks.
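
A tiny numeric sketch shows why the divisor is the global token count. The per-token losses below are made-up constants standing in for gradient contributions, and the three "ranks" are just list entries.

```python
import numpy as np

# Hypothetical per-token losses on 3 DP-CP ranks holding very different token counts.
token_losses = [np.full(4096, 1.0), np.full(1024, 2.0), np.full(256, 4.0)]

# Per-token scaling: sum everything, divide once by the global token count.
total_tokens = sum(len(x) for x in token_losses)
per_token = sum(x.sum() for x in token_losses) / total_tokens

# Naive alternative: average each rank's own mean, which over-weights short shards.
per_rank_mean = np.mean([x.mean() for x in token_losses])

global_mean = np.concatenate(token_losses).mean()
print(per_token, per_rank_mean, global_mean)
```

With uneven token counts per rank, which is the normal case under DCP, only the per-token form matches the mean over all tokens; averaging per-rank means would let the 256-token rank count as much as the 4096-token one.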

Runtime CP size decision: BalancedCPScheduler

Given a batch of samples, max_seq_len_per_rank and the total number of DCP ranks, Megatron provides a scheduler named BalancedCPScheduler to determine:

  • how DCP ranks are split into DP and CP groups
  • how the batch, whose sequences have various lengths, is fed into different DP ranks

Concrete batch (8 ranks, max_seqlen_per_dp_cp_rank = 4K, workload = $L^2 \div \mathrm{cp\_size}$, represented by length in the following diagram):

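| sample | length $L$ | CP size | stage | assigned ranks |
|--------|------------|---------|-------|----------------|
| $s_0$  | 32K        | 8       | 0     | 0..7           |
| $s_1$  | 16K        | 4       | 1     | 0..3           |
| $s_2$  | 8K         | 2       | 1     | 4..5           |
| $s_3$  | 4K         | 1       | 1     | 6              |
| $s_4$  | 4K         | 1       | 1     | 6              |
| $s_5$  | 4K         | 1       | 1     | 7              |
| $s_6$  | 4K         | 1       | 1     | 7              |
| $s_7$  | 2K         | 1       | 1     | 6              |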
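
The CP size column can be derived from the per-rank length cap. The sketch below assumes the CP size is the smallest power of two that keeps each rank at or under the cap; `cp_size_for` is a hypothetical helper, not a Megatron function.

```python
import math

def cp_size_for(length, max_seqlen_per_rank):
    """Smallest power-of-two CP size that keeps each rank at or under the cap."""
    need = math.ceil(length / max_seqlen_per_rank)
    return 1 << (need - 1).bit_length()

K = 1024
lengths = {"s0": 32 * K, "s1": 16 * K, "s2": 8 * K, "s3": 4 * K,
           "s4": 4 * K, "s5": 4 * K, "s6": 4 * K, "s7": 2 * K}

for name, length in lengths.items():
    cp = cp_size_for(length, max_seqlen_per_rank=4 * K)
    workload = length ** 2 / cp          # per-rank attention cost proxy, L^2 / cp_size
    print(f"{name}: L={length // K}K cp_size={cp} workload={workload / K**2:.0f}")
```

In units of K², the workloads come out as 128 for s0, 64 for s1, 32 for s2, 16 for each 4K sample and 4 for s7, which helps explain why the short samples are stacked on ranks 6 and 7 in stage 1.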

Sequence scheduling goal: pack one global batch's sub-sequences onto the GPU rectangle, balanced and idle-free, with each sub-sequence being a rectangle: taller for longer sequences (more GPUs needed), wider for more per-GPU work.

  1. Bucket samples by size class. Group sub-sequences by how many GPUs they need, with roughly equal total work per bucket. Buckets get processed largest-first.
  2. Greedy fill. Walk the buckets and place each sub-sequence either into an existing group of the right size or onto fresh free GPUs, whichever leaves the worst-loaded GPU lighter.
  3. Stop when balanced. Once column heights are close enough, close the round.
  4. Trim overshoot. If one column ended up tall, peel its last-placed sub-sequence back into the leftover queue if that helps even things out.
  5. Fill empty GPUs. If any GPUs are still idle, keep doubling the smallest existing group's reach until every GPU has work, sliding neighbors aside as needed.
  6. Repeat per round. Whatever didn't fit goes into the next scheduling round, separated by a barrier so groups can safely reshape between rounds.
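
To make the first two steps concrete, here is a heavily simplified greedy placement. It is my own toy, not BalancedCPScheduler: it assumes groups are aligned blocks of consecutive ranks, handles a single round, and uses workloads in units of K² for the stage-1 samples of the example batch.

```python
def greedy_round(samples, num_ranks):
    """Place (name, cp_size, work_per_rank) samples on ranks, widest group first."""
    load = [0.0] * num_ranks
    placement = {}
    # Step 1 (simplified): process samples with the largest CP size first.
    for name, cp, work in sorted(samples, key=lambda x: -x[1]):
        # Candidate groups: aligned blocks of `cp` consecutive ranks.
        blocks = [range(lo, lo + cp) for lo in range(0, num_ranks, cp)]
        # Step 2 (simplified): pick the block whose busiest rank is lightest.
        best = min(blocks, key=lambda blk: max(load[r] for r in blk))
        for r in best:
            load[r] += work
        placement[name] = list(best)
    return placement, load

stage1 = [("s1", 4, 64), ("s2", 2, 32), ("s3", 1, 16), ("s4", 1, 16),
          ("s5", 1, 16), ("s6", 1, 16), ("s7", 1, 4)]
placement, load = greedy_round(stage1, num_ranks=8)
print(placement)
print(load)
```

The toy leaves out the stopping rule, overshoot trimming, idle-GPU filling and multi-round logic, so its placement of the short samples need not match the table; it only illustrates the "widest first, lightest block wins" idea.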
🫆

Concepts to distinguish:

  • global batch: per DataLoader.__next__ / per optimizer step
  • micro batch: per forward / backward
  • schedule round: the number of sync barriers in the DCP ranks

Note that Megatron does NOT provide an optimal scheduling plan for DCP.

Sequence packing with DCP

If you specify the thd format together with DCP, Megatron will split the packed sequences back into sbhd format, since two consecutive sequences on a rank may have different CP sizes and the THD packing format would lose that flexibility. So there will be exactly $t$ forward / backward passes on each DCP rank, where $t$ is the number of sequences in the current group (scheduling round).

Outro

So when should you use each CP mode? I think the MBridge docs already have the answer:

😎

This post also answers the question: why is sbhd the popular batch format among recipes?

A: AlltoAll and split operations ship contiguous tensors if the first dimension is the sequence dim.

Footnotes

  1. https://arxiv.org/abs/2309.14509

  2. https://docs.nvidia.com/nemo/megatron-bridge/latest/training/hybrid-context-parallel.html

  3. https://developer.nvidia.com/blog/speeding-up-variable-length-training-with-dynamic-context-parallelism-and-nvidia-megatron-core/