
nov 12, 2026 / tamás takács / 14 min read

Trajectory Separation for Scalable Credit Assignment in Cooperative Multi-Agent Reinforcement Learning

Tamás Takács¹, László Gulyás¹, & Gergely Magyar¹

¹ELTE Eötvös Loránd University, Department of Artificial Intelligence, Budapest, Hungary

Accepted at the International Conference on Computational Collective Intelligence (ICCCI 2026). An early version was presented as a poster at the Magyar Machine Learning Találkozó (Hungarian ML Conference) in Budapest.

Abstract

Credit assignment in multi-agent reinforcement learning (MARL) is especially challenging with shared global rewards where individual contributions are difficult to disentangle, often causing the “lazy agent” phenomenon. We propose trajectory separation within the Proximal Policy Optimization (PPO) framework: a shared encoder processes all observations in a single forward pass, while per-entity trajectories isolate individual contributions to learning signals. This reduces inter-agent interference and enables fine-grained credit assignment without specialized architectures.

Evaluated in the Lux AI environment, a large-scale, competitive multi-agent benchmark from the NeurIPS 2023 competition, our method achieves:

14× the resource throughput of global-trajectory variants
+240% returns over IPPO and +1,282% over MAPPO
200K parameters and two hours of training, compared to 6.08M parameters and two days for the best competition RL entry

Ablation analysis confirms that each component of trajectory separation (per-entity rewards, critics, and advantages) is individually necessary, with per-entity rewards being the most critical (−78% throughput when removed). We further demonstrate that prioritizing data quality via trajectory separation outperforms increasing data quantity: training on only the single highest-advantage sample per step recovers 86% of full separation performance.

Introduction

Multi-agent reinforcement learning has produced striking emergent behaviors at scale, from large competitive benchmarks like Dota 2 and StarCraft II to research-grade environments such as Melting Pot and Lux AI. Centralized Training with Decentralized Execution (CTDE) became the workhorse paradigm, combining local observations with a shared global reward, but relying solely on a global reward obscures individual contributions and slows convergence.

We address this with a hybrid MARL framework that decomposes the global reward per entity and structures training into distinct trajectories for each agent group, providing clearer learning signals while still maintaining shared representations.

Motivating Paradoxes

When a MARL system trains with a single shared trajectory, poor credit assignment produces three concrete pathologies:

Posthumous reinforcement. Actions by surviving agents retroactively reinforce a deceased agent’s past behaviors, because the destroyed agent’s transitions remain in the shared rollout.
Inherited credit. Newly spawned entities inherit reinforcement from prior entities despite no causal involvement in the rewarded outcomes.
Diluted signals. Shared rewards reinforce all agents equally, diluting the per-agent gradient as team size grows.

Our trajectory separation mechanism is designed to eliminate all three at the source, before the PPO update sees them.

A central challenge in cooperative MARL is the credit assignment problem: determining each agent’s contribution to the team’s global reward.

Value factorization methods such as [VDN], [QMIX], Weighted QMIX, and QPLEX decompose the joint action-value into per-agent terms, but impose structural constraints (additive, monotonic, or mixing-network-based) and remain value-based, limiting scalability.

Policy-gradient approaches include [COMA] (counterfactual baselines), HAPPO (sequential trust-region updates), FACMAC (factored continuous critics), and MAT (attention-based sequence modeling). Newer work like ACAF and PRD targets asynchronous coordination and partial reward decoupling, respectively. All of these methods introduce specialized architectures or update schemes that add complexity and may not scale gracefully.

In contrast, our trajectory separation method operates directly at the trajectory level within standard PPO, requiring no mixing networks, sequential updates, or attention mechanisms. While the individual ingredients exist in isolation, prior work has not combined them at the trajectory level: value factorization decomposes the value function but maintains a single joint trajectory, whereas IPPO decomposes networks but shares the global reward. Trajectory separation decomposes the training signal itself: each entity maintains its own reward stream, value baseline, and advantage estimate while sharing a single encoder.

Background

We model the multi-agent setting as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$, where agents maximize expected discounted return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [\sum_{t=0}^T \gamma^t R(s_t, a_t)]$. Proximal Policy Optimization (PPO) stabilizes updates via a clipped surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ is the importance sampling ratio and $\hat{A}_t$ is the GAE-estimated advantage.
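The clipped objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `ppo_clip_objective`, the default `eps=0.2`, and the toy inputs are assumptions chosen for the example.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta), computed from log-probabilities
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

# Toy data (hypothetical): two samples with ratios 1.5 and 0.5.
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.75, 0.25]))
adv = np.array([1.0, -1.0])
objective = ppo_clip_objective(logp_new, logp_old, adv)
print(objective)  # approximately 0.2
```

In the first sample the ratio 1.5 is clipped to 1.2, which caps the gain from a positive advantage. In the second, the ratio 0.5 is clipped to 0.8, and the `min` keeps the more pessimistic term (−0.8 instead of −0.5).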

In the CTDE paradigm, MAPPO applies PPO with a centralized critic conditioning on the global state, and IPPO trains each agent independently with only local observations. Both serve as strong baselines but neither decomposes rewards or trajectories at the individual agent level.

Trajectory Separation

We combine trajectory separation with shared feature extractors, allowing individual agent contributions to be evaluated independently while utilizing common perceptual representations.

Let $\mathcal{N} = \{1, 2, \dots, N\}$ denote a set of agent groups. Each group $i$ generates its own trajectory $\tau_i = \{(s_t^i, a_t^i, r_t^i)\}_{t=0}^{T}$, and the entity-specific reward is derived from environment interactions as:

$$r_t^i = f_{\text{decomp}}(R_t^{\text{global}}, e_i, s_t)$$

The trajectory separation mechanism decouples these trajectories. A single environment step yields $N$ parallel training samples, improving the resolution of credit assignment without altering the PPO loss formulation. The total policy loss aggregates over all groups:

$$L^{\text{TS}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L^{\text{PPO}}_i(\theta)$$

Figure 1: **Trajectory-separated policy loss computation for N=4 groups over T=3 steps.** Per-group ratios and advantages yield per-cell losses, averaged over active cells into a scalar policy loss. Inactive timesteps (entity not yet spawned or destroyed) are excluded from the loss.
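The masked averaging described in the Figure 1 caption could look roughly like the following NumPy sketch. It assumes a `(T, N)` layout with one column per group and a boolean `active` mask; the function name `ts_policy_loss` and the toy data are hypothetical, and the loss is written as the negated clipped surrogate so that lower is better.

```python
import numpy as np

def ts_policy_loss(logp_new, logp_old, advantages, active, eps=0.2):
    """Per-cell clipped PPO loss, averaged over active (timestep, group) cells.

    All arrays have shape (T, N); `active` is False where a group's entity
    has not yet spawned or has already been destroyed.
    """
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(ratio * advantages,
                           np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    per_cell_loss = np.where(active, -surrogate, 0.0)  # inactive cells contribute nothing
    n_active = int(active.sum())
    return float(per_cell_loss.sum() / n_active) if n_active > 0 else 0.0

# Toy data (hypothetical): T=3 steps, N=4 groups, unchanged policy (ratio = 1).
T, N = 3, 4
logp = np.zeros((T, N))
advantages = np.arange(T * N, dtype=float).reshape(T, N)
active = np.ones((T, N), dtype=bool)
active[0, 3] = False  # group 3 not yet spawned at t=0
active[2, 0] = False  # group 0 destroyed before t=2
loss = ts_policy_loss(logp, logp, advantages, active)
print(loss)  # -5.5: mean advantage over the 10 active cells, negated
```

Averaging over active cells rather than over all `T * N` cells keeps the loss scale independent of how many entities happen to be alive in a given rollout.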

To correctly terminate trajectories, each of the $N$ groups maintains its own done flag: a group is marked done when all its constituent agents become inactive (e.g. destroyed or out of power), computed as the logical conjunction over member agents’ termination states. Crucially, groups can terminate at different timesteps within the same episode.

Figure 2: **Single-agent done signal (top) vs. group-level done flags under trajectory separation (bottom).** Groups terminate independently when all members become inactive. Once a group's done flag is set, GAE resets its advantage computation, preventing posthumous reinforcement from later timesteps.
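One way to realize the per-group reset is to run standard GAE column-wise with a `(T, N)` done array. The sketch below assumes `dones[t, i] = 1` means group `i` terminated at step `t`; the name `grouped_gae`, the default `gamma` and `lam`, and the toy rollout are illustrative assumptions, not values from the paper.

```python
import numpy as np

def grouped_gae(rewards, values, last_values, dones, gamma=0.99, lam=0.95):
    """GAE computed independently per group.

    rewards, values, dones: shape (T, N); last_values: shape (N,).
    A set done flag cuts both the bootstrap value and the advantage carry-over.
    """
    T, N = rewards.shape
    adv = np.zeros((T, N))
    next_adv = np.zeros(N)
    next_value = last_values
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv

# Toy rollout (hypothetical): group 1 terminates at t=0; the reward of 5 that
# later appears in its column (e.g. a slot reused by a new entity) must not flow back.
rewards = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 5.0]])
values = np.zeros((3, 2))
dones = np.zeros((3, 2))
dones[0, 1] = 1.0
adv = grouped_gae(rewards, values, np.zeros(2), dones, gamma=1.0, lam=1.0)
print(adv[:, 0])  # [3. 2. 1.]  group 0 accumulates its full future return
print(adv[0, 1])  # 1.0         group 1 at t=0 sees only its own reward
```

With a single shared done flag, the `5` would have been credited to group 1's action at `t=0`, which is exactly the posthumous-reinforcement pathology.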

Distinction from Independent Learning (IPPO)

While trajectory separation shares the per-entity focus of independent learning, three key differences explain its superior performance:

Shared encoder. TS uses a single forward pass with shared features, enabling implicit coordination, whereas standard IPPO uses fully independent networks. In our experiments, all methods share the same architecture for a fair comparison.
Per-entity value estimation. TS computes separate value estimates capturing both local and team-level information, whereas IPPO uses only local estimates.
Localized rewards. TS assigns entity-specific rewards proportional to individual contributions, whereas IPPO broadcasts the team reward equally, diluting credit.

Architecture and Environment

We adopt a pixel-to-pixel architecture: a convolutional encoder processes the global spatial observation, producing per-pixel outputs from which agent actions are derived at their respective positions. The shared backbone uses residual blocks with batch normalization, Leaky ReLU activations, and Squeeze-and-Excitation modules, with spectral normalization to stabilize training. Actor and critic networks share this backbone but use separate weights to avoid conflict between policy optimization and value estimation objectives.
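To make the pixel-to-pixel idea concrete, the sketch below reads each unit's action logits out of a per-pixel output map at that unit's grid position. The `(A, H, W)` layout, the function name `gather_unit_logits`, and the greedy `argmax` are assumptions for illustration; the convolutional encoder itself is not shown.

```python
import numpy as np

def gather_unit_logits(logit_map, positions):
    """Pick per-unit action logits from a per-pixel output map.

    logit_map: (A, H, W) array, one channel per action.
    positions: (K, 2) integer array of (row, col) unit positions.
    Returns a (K, A) array of logits, one row per unit.
    """
    rows, cols = positions[:, 0], positions[:, 1]
    return logit_map[:, rows, cols].T

# Toy data (hypothetical): 3 actions on a 4x4 grid, two units.
logit_map = np.arange(3 * 4 * 4).reshape(3, 4, 4)
positions = np.array([[0, 0], [2, 3]])
unit_logits = gather_unit_logits(logit_map, positions)
actions = unit_logits.argmax(axis=1)  # greedy choice, for illustration only
print(unit_logits.tolist())  # [[0, 16, 32], [11, 27, 43]]
```

Because every unit indexes into the same output map, one forward pass serves all entities regardless of how many are alive.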

Lux AI is a fully observable, turn-based 2D grid game where teams manage factories and mobile units to collect Ice and Ore under power and terrain constraints. We selected it because it provides large-scale, dynamic multi-agent coordination with long time horizons and spatial reasoning, with a NeurIPS-sponsored competition baseline to compare against.

Results

Global Trajectory vs. Separated Trajectories

Separated trajectory variants significantly outperform global trajectory variants. The separated approach enables agents to transfer over 40,000 ice to factories per episode, roughly $14\times$ the throughput of the single global trajectory (44,472 vs. 3,066), while keeping factories alive for the full 1,000-step episode.

Figure 3: **Impact of trajectory separation vs. global variants on episode length, ice transferred, and alive factories.** Separated trajectories converge faster and, by step 50,000 of training, keep factories alive until the end of the game.

Variant	Ice Transferred	Episode Length
Sep. all (ours)	44,472 ± 12,065	1,000 ± 0
Global fact., sep. units	43,528 ± 11,327	1,000 ± 0
Sep. fact., global units	3,573 ± 3,753	485 ± 315
Global fact., global units	6,162 ± 4,413	742 ± 290
Single global trajectory	3,066 ± 2,996	466 ± 302

Notably, the gap between “global factories + separated units” (43,528) and “separated factories + global units” (3,573) indicates that unit-level credit carries the signal: factory-level trajectory separation can be dropped at almost no cost, whereas unit-level separation is critical.

Optimal Number of Trajectories

We grouped units into $N$ groups by taking their unique IDs modulo $N$. Ice throughput increased monotonically with $N$: from 3,573 with a single group to 14,388 ($N{=}2$), 29,459 ($N{=}4$), 32,049 ($N{=}8$), 36,847 ($N{=}16$), and 44,472 with fully separated trajectories, a 12.4× improvement from one group to full separation. The jump from 16 groups to full separation yielded a further gain of about 20%, indicating that maximal decomposition best captured the diversity of individual agent behaviors.

Figure 4: **Effect of splitting the global trajectory into N separate groups on episode length, ice transferred, and alive factories.** Performance increases monotonically with N, with a further gain of about 20% from 16 groups to full separation.
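The modulo grouping can be sketched as follows. Summing member rewards into a group reward is an assumption made for this example (the text does not state the aggregation rule), and the names `modulo_groups` and `group_rewards` are hypothetical.

```python
import numpy as np

def modulo_groups(unit_ids, n_groups):
    """Assign each unit to a group by its ID modulo the number of groups."""
    return np.asarray(unit_ids) % n_groups

def group_rewards(unit_rewards, group_index, n_groups):
    """Aggregate per-unit rewards into per-group rewards (sum is an assumption)."""
    return np.bincount(group_index, weights=unit_rewards, minlength=n_groups)

# Toy data (hypothetical): five units split into N=4 groups.
unit_ids = [3, 4, 7, 8, 10]
unit_rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
groups = modulo_groups(unit_ids, 4)
rewards = group_rewards(unit_rewards, groups, 4)
print(groups.tolist())   # [3, 0, 3, 0, 2]
print(rewards.tolist())  # [6.0, 0.0, 5.0, 4.0]
```

With `n_groups=1` every unit lands in the same trajectory, and as `n_groups` grows past the number of units each unit ends up alone, which corresponds to full separation.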

Trajectory Number Reduction through Grouping

While trajectory separation provided efficient training, it could lead to imbalanced mini-batches due to varying numbers of active trajectories per step. We tested bounded grouping rules: map quadrants (22,385 ice), closest factory (27,009), and closest heavy unit (35,224). The closest-heavy heuristic outperformed spatial groupings, suggesting behavioral similarity was a stronger grouping criterion than spatial proximity, yet all heuristics underperformed fully separated trajectories (44,472). Any grouping sacrificed credit assignment resolution.

Figure 5: **Effectiveness of different agent grouping rules on episode length, ice transferred, and alive factories.** Non-random grouping strategies do not surpass random assignments, and all underperform fully separated trajectories.
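As an illustration of a bounded grouping rule, the sketch below assigns each unit to its closest factory. Manhattan distance is an assumption that seems natural on a grid; the text does not say which metric was used, and `closest_factory_groups` is a hypothetical name.

```python
import numpy as np

def closest_factory_groups(unit_pos, factory_pos):
    """Return, for each unit, the index of the nearest factory (Manhattan distance)."""
    unit_pos = np.asarray(unit_pos)
    factory_pos = np.asarray(factory_pos)
    # dist[u, f] = |row_u - row_f| + |col_u - col_f|
    dist = np.abs(unit_pos[:, None, :] - factory_pos[None, :, :]).sum(axis=-1)
    return dist.argmin(axis=1)

# Toy data (hypothetical): three units, two factories.
units = [[0, 0], [5, 5], [9, 1]]
factories = [[1, 1], [8, 2]]
factory_groups = closest_factory_groups(units, factories)
print(factory_groups.tolist())  # [0, 1, 1]
```

The number of trajectories is then bounded by the number of factories, at the cost of merging the reward streams of all units that share one.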

Train Sample Reduction

Given that all grouping methods underperformed full separation, we adopted training sample reduction, selecting the top $N$ training examples from each step. Selecting the single highest-advantage trajectory achieved 38,072 ice (86% of full separation), top absolute advantage reached 33,926, top return 23,156, and random selection only 2,550 (comparable to the global baseline at 3,066). This suggests that data quality matters more than data quantity: a single well-chosen training sample nearly matched using all samples.
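The selection rules compared above can be sketched for a single environment step as follows. The function name `select_top_k`, the `mode` strings, and the toy values are assumptions for illustration; inactive trajectories are never selected.

```python
import numpy as np

def select_top_k(advantages, active, k=1, mode="advantage"):
    """Indices of the k per-entity samples to keep for one environment step.

    advantages: (N,) advantage estimates; active: (N,) boolean mask.
    mode: "advantage" ranks by signed advantage, "abs_advantage" by magnitude.
    """
    advantages = np.asarray(advantages, dtype=float)
    active = np.asarray(active, dtype=bool)
    score = np.abs(advantages) if mode == "abs_advantage" else advantages
    score = np.where(active, score, -np.inf)  # push inactive entries to the end
    k = min(k, int(active.sum()))
    order = np.argsort(-score, kind="stable")
    return order[:k]

# Toy data (hypothetical): the largest advantage belongs to an inactive entity.
adv_step = [0.2, -1.5, 0.9, 2.0]
active_step = [True, True, True, False]
top_adv = select_top_k(adv_step, active_step, k=1)
top_abs = select_top_k(adv_step, active_step, k=1, mode="abs_advantage")
print(top_adv.tolist(), top_abs.tolist())  # [2] [1]
```

Ranking by signed advantage favors samples where an action turned out better than expected, while ranking by magnitude also keeps strongly negative surprises.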

We also investigated random sampling with different sample sizes (1, 2, 4, 8, 16). Performance improved with sample count, but sampling a single training example from an environment step with separate trajectories yielded the same suboptimal outcome as using a global trajectory, reinforcing that trajectory separation’s benefit came from the diversity of per-entity training signals rather than mere data augmentation.

Figure 7: **Impact of random sample size on episode length, ice transferred, and alive factories.** Random sampling does not match full separation; performance scales with sample count.

Ablation: Which Components Matter?

Systematically removing separation components reveals their relative importance:

Figure 8: **Ablation of separated components on episode length, ice transferred, and alive factories.** Removing per-entity rewards causes the largest drop (−78%). The global-advantage variant collapses below the single-trajectory baseline.

Replacing per-entity rewards with a global reward dropped ice throughput by 78% (44,472 → 9,614), the most critical component.
Replacing per-entity critics with a global critic still allowed convergence but reduced throughput by 31% (44,472 → 30,687).
Eliminating both (global advantage) collapsed performance to 2,643 ice, below even the single global trajectory baseline (3,066), since a fully global advantage yields identical values for all entities.

Figure 9: **Effect of varying own-reward proportion.** Performance increases monotonically with localized rewards: 0% own reward yields only 9,614 ice and fails to converge, while 75% converges but with 24% less throughput than 100% own reward.
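A simple way to picture the own-reward proportion is a linear blend between each entity's own reward and an equal share of the team total. This is only an assumed form: the exact blending rule is not given here, and `blend_rewards` and `own_fraction` are hypothetical names.

```python
import numpy as np

def blend_rewards(own_rewards, own_fraction):
    """Mix each entity's own reward with an equal share of the team reward.

    own_fraction = 1.0 gives fully localized rewards,
    own_fraction = 0.0 gives every entity the same team-average reward.
    """
    own = np.asarray(own_rewards, dtype=float)
    team_share = own.mean()  # assumed: team reward split equally across entities
    return own_fraction * own + (1.0 - own_fraction) * team_share

# Toy data (hypothetical): only the first of four entities earned reward.
own = [4.0, 0.0, 0.0, 0.0]
fully_own = blend_rewards(own, 1.0)
mostly_own = blend_rewards(own, 0.75)
fully_shared = blend_rewards(own, 0.0)
print(mostly_own.tolist())    # [3.25, 0.25, 0.25, 0.25]
print(fully_shared.tolist())  # [1.0, 1.0, 1.0, 1.0]
```

Under this form the total reward is the same for every `own_fraction`; only its distribution across entities changes, so the ablation isolates credit assignment from reward scale.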

Comparison with MAPPO and IPPO

We compared TS against MAPPO (centralized critic, team reward broadcast equally) and IPPO (independent critic, team reward broadcast equally). All three methods share the same network architecture and PPO hyperparameters; the only differences are in critic structure and reward assignment.

Figure 10: **Lux AI learning curves: TS vs. MAPPO and IPPO** (shading: ±1 std over 3 seeds). TS reaches the maximum episode length consistently across all seeds by step 49,152; MAPPO never reaches it, and IPPO achieves it only once.

Metric	TS (ours)	MAPPO	IPPO
Episode Length	1,000.0 ± 0.0	391.1 ± 109.0	866.5 ± 41.0
Return	87.51 ± 8.78	6.33 ± 3.60	25.70 ± 4.76
Factory Count	4,028.7 ± 25.6	1,024.2 ± 206.0	1,946.3 ± 91.7

TS achieves a return of $87.51$ compared to IPPO’s $25.70$ and MAPPO’s $6.33$, improvements of 240% and 1,282%, respectively. Welch’s $t$-tests and Cohen’s $d$ effect sizes indicate statistically significant differences with large effects ($p < 0.005$, $d > 7$), with the return metric yielding $d = 9.2$ for TS vs. MAPPO and $d = 8.8$ for TS vs. IPPO.
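The relative improvement and effect size can be recomputed from summary statistics alone. The sketch below assumes the ± values in the table are sample standard deviations over 3 seeds and uses the equal-sample-size pooled form of Cohen's $d$; both are assumptions, and the function names are hypothetical. It only computes the Welch statistic and degrees of freedom, not the $p$-value.

```python
import math

def relative_improvement(new, baseline):
    """Percentage improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d with the pooled standard deviation for equal sample sizes."""
    pooled = math.sqrt((std_a ** 2 + std_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Return metric, TS vs. IPPO, using the table values and n = 3 seeds.
gain = relative_improvement(87.51, 25.70)
d = cohens_d(87.51, 8.78, 25.70, 4.76)
t, df = welch_t(87.51, 8.78, 3, 25.70, 4.76, 3)
print(round(gain, 1), round(d, 2))  # 240.5 8.75
```

Under these assumptions the TS vs. IPPO return comparison gives $d \approx 8.75$, in line with the reported $d = 8.8$.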

Comparison with Competition Solutions

We compared against top submissions from the 2023 NeurIPS Lux AI Competition, focusing on factory survival.

Method	Params	Train Time	Env Steps
TS (ours)	200K	2 hours	102K
Organizer baseline	451K	2 days	1.4M
Best RL entry (FLG)	6.08M	2 days	65M

Our method reaches the factory-survival target with only 200K parameters, 2 hours of training, and 102K environment steps, compared to FLG’s 6.08M parameters, 2 days, and 65M steps, and the organizer baseline’s 451K parameters, 2 days, and 1.4M steps.

Conclusion

We presented a trajectory separation technique within a hybrid control framework that combines shared feature extractors with per-entity policy updates. In Lux AI, the method achieves $14\times$ the resource throughput of global-trajectory baselines and improves returns by 240% over IPPO and 1,282% over MAPPO, while using significantly fewer parameters and training steps than competition solutions.

A key limitation is that coordination in our framework emerges only through shared features, not explicit communication. In settings with sparse rewards or strong temporal dependencies, additional mechanisms such as causal influence modeling or tailored reward shaping may be required. More broadly, trajectory separation applies to any policy-gradient algorithm that maintains rollout buffers (e.g. TRPO, A3C). Future work will extend the approach to additional cooperative environments with varying team sizes, heterogeneous agent populations, and partial observability, and will explore adaptive blending of localized and centralized critics.

Citation

Read the full paper for more details: [Paper], and if you find this work informative please consider citing it:


@inproceedings{takacs2026trajectoryseparation,
author    = {Tak\'acs, Tam\'as and Guly\'as, L\'aszl\'o and Magyar, Gergely},
title     = {Trajectory Separation for Scalable Credit Assignment in Cooperative Multi-Agent Reinforcement Learning},
booktitle = {Proceedings of the International Conference on Computational Collective Intelligence (ICCCI)},
year      = {2026},
publisher = {Springer},
series    = {LNCS}
}
