Qwen Team has introduced Group Sequence Policy Optimization (GSPO), a new reinforcement learning algorithm for fine-tuning large language models (LLMs) that replaces DeepSeek's Group Relative Policy Optimization (GRPO), which they argue is fundamentally flawed. GRPO applies importance sampling at the token level, which introduces high variance and instability during training, especially for long sequences and Mixture-of-Experts (MoE) models. GSPO resolves this by shifting importance sampling to the sequence level, normalizing the ratio by sequence length and stabilizing gradients. The change improves training efficiency, scalability, and convergence, particularly for large MoE models such as Qwen3-30B-A3B-Base.
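For reference, the core quantities can be sketched as follows (notation reconstructed from the GSPO paper's formulation and may differ in detail): GRPO weights each token by its own importance ratio, while GSPO uses a single length-normalized sequence-level ratio inside a clipped objective.

```latex
% GRPO-style token-level importance ratio (one per token t of response y_i):
w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x,\, y_{i,<t})}
                       {\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x,\, y_{i,<t})}

% GSPO-style sequence-level, length-normalized ratio and clipped objective:
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}
                          {\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
J_{\mathrm{GSPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G}
  \min\!\bigl( s_i(\theta)\,\hat{A}_i,\;
               \operatorname{clip}\bigl(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\,\hat{A}_i \bigr) \right]
```

Here \hat{A}_i is the group-relative advantage (the reward of response y_i normalized within its group), G is the group size, and \varepsilon is the clipping range.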
GRPO’s token-level importance sampling introduces instability and gradient noise, often causing model collapse.
GSPO applies sequence-level importance sampling, reducing variance and enabling stable, efficient training, especially for MoE architectures (see the sketch below).
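As a toy illustration of the variance point (a minimal sketch, not the official Qwen or DeepSeek implementation; the function names and toy log-probabilities below are hypothetical): per-token ratios produce many noisy weights over a long response, whereas a single length-normalized sequence ratio averages that noise out.

```python
# Sketch: contrast GRPO-style token-level importance ratios with a
# GSPO-style sequence-level, length-normalized ratio on toy data.
import numpy as np

def token_level_ratios(logp_new, logp_old):
    """GRPO-style: one importance ratio per token, pi_theta / pi_theta_old."""
    return np.exp(logp_new - logp_old)

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style: a single length-normalized sequence ratio,
    (pi_theta(y|x) / pi_theta_old(y|x)) ** (1 / |y|)."""
    T = len(logp_new)
    return np.exp((logp_new.sum() - logp_old.sum()) / T)

rng = np.random.default_rng(0)
T = 512                                               # a long response
logp_old = rng.normal(-2.0, 0.5, size=T)              # toy per-token log-probs
logp_new = logp_old + rng.normal(0.0, 0.1, size=T)    # small policy update

tok = token_level_ratios(logp_new, logp_old)
seq = sequence_level_ratio(logp_new, logp_old)
print(f"token-level ratios: min={tok.min():.3f}, max={tok.max():.3f}")
print(f"sequence-level ratio (length-normalized): {seq:.3f}")
```

In GRPO each noisy per-token ratio multiplies its own token's gradient, so the noise compounds over long sequences; in GSPO the single sequence ratio scales the whole response, which is what the length normalization is intended to stabilize.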
Teams building LLM-based solutions stand to benefit from GSPO's improved scalability and training efficiency. By integrating models trained with GSPO, firms can deliver more reliable and performant AI services, particularly in domains that require complex reasoning or code generation.
Read more at: blog.netmind.ai
2025-08-04