
Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor–Policy Mismatch RL

Paper | Blog

Zhuoming Chen*, Hongyi Liu*, Yang Zhou*, Haizhong Zheng, Beidi Chen

Carnegie Mellon University
(* Equal contribution, ordered alphabetically by last name)


What is the Problem that Jackpot Solves?

General big picture

Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the cost, occupying more than 80% of the total. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces severe distribution mismatch that destabilizes learning. In particular: is it possible to perform rollouts with a completely different model from the one we ultimately want to train?

We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-k probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps. Taken together, our results show that OBRS-based alignment brings us a step closer to practically and effectively decoupling rollout generation from policy optimization in RL for LLMs.

Why Is TIS Not Enough?

Importance-sampling-based methods are vulnerable to extreme actor-policy mismatch. Once the actor drifts too far, many tokens that the actor samples with high probability have very low probability under the policy, i.e., $p_{\text{inf}} \gg p_{\text{target}}$. These actor trajectories are effectively treated as low-likelihood samples by the policy, causing truncated importance sampling (TIS) to train on tokens the policy would never select at inference and creating a widening train-inference mismatch. This motivates distribution alignment methods that directly modify $p_{\text{inf}}$. Jackpot builds on top of TIS and directly modifies the rollout trajectories and the rollout distribution via Optimal Budgeted Rejection Sampling (OBRS).
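To make the failure mode concrete, here is a minimal sketch of a per-token truncated importance weight (illustrative only; the function name and the default clip value are ours, not taken from the paper or the codebase):

import torch

def truncated_is_weight(logp_policy: torch.Tensor,
                        logp_inf: torch.Tensor,
                        clip_c: float = 2.0) -> torch.Tensor:
    # Truncated importance weight min(c, p_policy / p_inf) per token.
    # Under extreme mismatch p_inf >> p_policy, so the ratio collapses toward
    # zero: the update still consumes tokens the policy would almost never
    # produce at inference, merely down-weighting them instead of aligning
    # the rollout distribution itself.
    ratio = torch.exp(logp_policy - logp_inf)
    return torch.clamp(ratio, max=clip_c)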

Our Method: Jackpot

Key Ideas

General big picture

  • OBRS for alignment: naive rejection sampling is too strict, rejecting nearly all rollout tokens. Instead, under a rejection budget, Jackpot stochastically rejects tokens so as to minimize the KL divergence between the modified rollout distribution and the policy distribution, with theoretical guarantees (see the sketch after this list).
  • System efficiency: approximate the normalization constant with a top-k union over the inference and policy distributions, then apply a batch-level bias correction for stability.
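As a rough illustration of the budgeted acceptance rule, here is a sketch in the spirit of OBRS. It assumes an acceptance probability of the form min(1, p_policy / (c * p_inf)); the exact rule, and how the budget constant is chosen, are given in the paper.

import torch

def obrs_accept_prob(logp_policy: torch.Tensor,
                     logp_inf: torch.Tensor,
                     budget_c: float) -> torch.Tensor:
    # Acceptance probability min(1, p_policy / (c * p_inf)).
    # budget_c controls strictness: as budget_c -> 0 every token is accepted
    # (no alignment), while budget_c -> max(p_policy / p_inf) recovers exact
    # rejection sampling (fully aligned, but almost nothing survives under
    # extreme mismatch). Choosing budget_c to hit a target acceptance rate
    # is the budget knob.
    return torch.clamp(torch.exp(logp_policy - logp_inf) / budget_c, max=1.0)

Note that, per the implementation note in the next section, this acceptance behavior enters training as a loss reweighting rather than as a literal re-sampling of rollouts.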

Algorithm Overview (More details in the paper)

Jackpot proceeds in two phases per iteration (Algorithm 1 in the PDF):

  1. Efficient rollout: sample actions using the inference model once, and store top-k log-probabilities.
  2. PPO update with Jackpot reweighting: compute the standard PPO objective, approximate the OBRS normalizer Z using a top-k union of inference and policy logits, apply OBRS weights and truncated IS corrections, then update the policy with the reweighted loss.

Implementation note from the paper: Jackpot only reweights quantities from standard PPO/GRPO forward passes; it does not introduce extra model forward passes or re-rollouts.
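For step 2 above, one way to picture the top-k approximation of the normalizer is the following sketch. The tensor names and the exact estimator are illustrative assumptions; Algorithm 1 in the paper gives the precise procedure. With an acceptance rule of the form min(1, p_policy / (c * p_inf)), the normalizer Z is the expected acceptance rate under the rollout distribution, which reduces to a sum of element-wise minima that can be approximated over the union of both models' top-k tokens.

import torch

def approx_obrs_normalizer(logp_policy_topk: torch.Tensor,
                           logp_inf_topk: torch.Tensor,
                           budget_c: float) -> torch.Tensor:
    # Z = E_{x ~ p_inf}[ min(1, p_policy(x) / (c * p_inf(x))) ]
    #   = sum_x min(p_inf(x), p_policy(x) / c)
    # Both inputs are [batch, seq, k] log-probs evaluated on the same
    # candidate set: the union of the inference model's and the policy's
    # top-k tokens. Mass outside this union is ignored here; the paper's
    # batch-level bias correction is applied on top of such an estimate.
    return torch.minimum(logp_inf_topk.exp(),
                         logp_policy_topk.exp() / budget_c).sum(dim=-1)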

Empirical Studies

General big picture

  • In extreme actor–policy mismatch (using a small model for rollouts while training a larger, stronger model on those rollouts), naive off-policy training and TIS baselines often collapse, while Jackpot maintains stability over substantially more update steps.
  • Jackpot approaches on-policy performance in multiple settings, including a Qwen3-8B policy trained with a Qwen3-1.7B actor for up to 300 update steps with a batch size of 64 on the DeepScaleR dataset.
  • When the actor-policy mismatch is milder (e.g., off-policy staleness from large batch sizes, or FP8 KV-cache quantization), Jackpot allows removing the clipping ratio from the PPO objective (under the setting detailed in the paper) while training more stably than TIS.

Limitations Noted in the Paper

  • Even with Jackpot, joint training of the two models still eventually crashes after 300 steps with a batch size of 64.
  • The paper does not validate Jackpot on very large models (e.g., 32B variants) due to resource limits.

Code Structure

Our implementation is based on verl (https://github.com/verl-project/verl). For installation, simply run the following:

pip install -e .[vllm] 

We provide detailed example scripts with Jackpot support under examples/jackpot_examples_gsm8k, with instructions in examples/jackpot_examples_gsm8k/README.md.

Here is a detailed explanation of the parameters we added to the verl arguments for Jackpot:

# inside actor.yaml 
use_jackpot: true # Whether to enable Jackpot (OBRS) loss 
jackpot_log_probs_to_keep: 20 # Number of top-k log-probs to keep for Jackpot
jackpot_lambda: 1.0 # Scaling factor for Jackpot loss
jackpot_clip_ratio: 3.0 # Clipping ratio for Jackpot importance weights
jackpot_use_latest_logits: true # Whether to recompute Jackpot with the latest logits instead of the cached top-k; we recommend turning this on when using Jackpot
jackpot_use_topk_renorm: true # Whether to renormalize Jackpot weights using the top-k slice
jackpot_mask_only: false # Mask Jackpot weights without renormalization outside the top-k slice

Bibliography

If you find our work helpful, please consider citing us with the following BibTeX.

@misc{chen2026jackpotoptimalbudgetedrejection,
      title={Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning}, 
      author={Zhuoming Chen and Hongyi Liu and Yang Zhou and Haizhong Zheng and Beidi Chen},
      year={2026},
      eprint={2602.06107},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.06107}, 
} 
