Carnegie Mellon University
(* = Equal contribution, ordered alphabetically by last name)
Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the cost, occupying more than 80% of the total. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model for rollouts) could enable substantial efficiency gains, yet doing so introduces severe distribution mismatch that destabilizes learning. In particular, is it possible to perform rollouts using a completely different model from the one we ultimately want to train?
We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-k probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps. Taken together, our results show that OBRS-based alignment brings us a step closer to practically and effectively decoupling rollout generation from policy optimization in RL for LLMs.
Importance-sampling-based methods are vulnerable to extreme actor-policy mismatch.
Once the actor drifts too far, many tokens that the actor samples with high probability have very low probability under the policy, so their importance weights collapse toward zero (while the rare opposite cases explode); clipping or truncating those weights bounds the variance but discards most of the off-policy correction, destabilizing training.
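As a toy illustration (hypothetical numbers, not taken from the paper), the snippet below shows this failure mode: tokens the actor samples readily but the policy assigns low probability receive near-zero importance weights, and truncated IS (TIS) can cap the large weights but cannot recover the vanished ones.

import torch

# Hypothetical per-token log-probs for a short rollout (illustrative only).
# The actor (rollout model) and the policy disagree strongly on tokens 2 and 3.
actor_logp  = torch.tensor([-0.5, -1.0, -0.2, -0.3])   # log mu(a_t | s_t)
policy_logp = torch.tensor([-0.6, -1.2, -7.0, -9.0])   # log pi(a_t | s_t)

# Plain importance weights pi/mu: near zero wherever the policy disagrees.
is_weights = torch.exp(policy_logp - actor_logp)        # ~[0.90, 0.82, 0.001, 0.0002]

# Truncation bounds the weights from above, but the tokens whose weights have
# already collapsed contribute (almost) nothing to the gradient, so the
# correction is effectively one-sided and the update becomes biased.
tis_weights = torch.clamp(is_weights, max=2.0)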
- OBRS for alignment: naive rejection sampling is too strict, rejecting all rollout tokens. Instead, under a rejection budget, Jackpot stochastically rejects tokens so as to minimize the KL divergence between the modified rollout distribution and the policy distribution, with a theoretical guarantee (see the sketch after this list).
- System efficiency: approximate the normalization constant with a top-k union over the inference and policy distributions, then apply batch-level bias correction for stability.
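A minimal numerical sketch of the budgeted-rejection idea is below. It assumes the textbook acceptance rule a(x) = min(1, p(x) / (c * q(x))), with c calibrated by bisection so the expected acceptance rate matches a budget; the exact Jackpot acceptance function, its token-level application, and the accompanying guarantees are given in the paper.

import numpy as np

# Toy vocabulary-level sketch of budgeted rejection sampling (not the repo code).
rng = np.random.default_rng(0)
V = 8
q = rng.dirichlet(np.ones(V))        # rollout (actor) distribution
p = rng.dirichlet(np.ones(V))        # target (policy) distribution

def acceptance(c):
    """Per-token acceptance probabilities and the induced acceptance rate E_q[a]."""
    a = np.minimum(1.0, p / (c * q))
    return a, float(np.sum(q * a))

def calibrate(budget, lo=1e-6, hi=1e6, iters=100):
    """Bisect on c (in log space) until the acceptance rate matches the budget."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        _, rate = acceptance(mid)
        lo, hi = (lo, mid) if rate < budget else (mid, hi)
    return hi

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# A tighter budget (lower target acceptance rate) permits more rejection,
# so the modified rollout distribution q_tilde moves closer to p.
for budget in (1.0, 0.5, 0.25):
    c = calibrate(budget)
    a, Z = acceptance(c)
    q_tilde = q * a / Z              # distribution of the accepted tokens
    print(f"budget={budget:.2f}  KL(q_tilde || p)={kl(q_tilde, p):.4f}")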
Jackpot proceeds in two phases per iteration (Algorithm 1 in the PDF):
- Efficient rollout: sample actions using the inference model once, and store the top-k log-probabilities.
- PPO update with Jackpot reweighting: compute the standard PPO objective, approximate the OBRS normalizer Z using a top-k union of inference and policy logits, apply OBRS weights and truncated IS corrections, then update the policy with the reweighted loss (sketched below).
Implementation note from the paper: Jackpot only reweights quantities from standard PPO/GRPO forward passes; it does not introduce extra model forward passes or re-rollouts.
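The sketch below illustrates how such a reweighting could be computed from cached quantities, under illustrative assumptions: the acceptance takes the form min(1, pi / (c * mu)), the normalizer Z is summed only over the union of both models' top-k tokens, the weight a/Z for the sampled token is clipped (cf. jackpot_clip_ratio below), and a batch-level renormalization to mean 1 stands in for the bias correction. The function name, shapes, and exact formulas are assumptions for illustration; the repo's implementation is the reference.

import torch

def jackpot_style_weights(policy_logits, actor_logits, actions, c=1.0, k=20, clip=3.0):
    """Illustrative per-token weights. policy_logits, actor_logits: [T, V]; actions: [T]."""
    log_pi = torch.log_softmax(policy_logits, dim=-1)   # log pi(. | s_t)
    log_mu = torch.log_softmax(actor_logits, dim=-1)    # log mu(. | s_t)

    # Union of the two models' top-k token ids, as a boolean mask over the vocab.
    topk_ids = torch.cat([log_pi.topk(k, dim=-1).indices,
                          log_mu.topk(k, dim=-1).indices], dim=-1)           # [T, 2k]
    union = torch.zeros_like(log_pi, dtype=torch.bool).scatter_(-1, topk_ids, True)

    # Z ~= sum over the union of mu(x) * min(1, pi(x) / (c * mu(x)))
    #    = sum over the union of min(mu(x), pi(x) / c).
    z = (torch.minimum(log_mu.exp(), log_pi.exp() / c) * union).sum(dim=-1)  # [T]
    z = z.clamp_min(1e-8)

    # OBRS-style weight a/Z for the sampled token, with truncated-IS style clipping.
    pi_a = torch.gather(log_pi, -1, actions.unsqueeze(-1)).squeeze(-1).exp()
    mu_a = torch.gather(log_mu, -1, actions.unsqueeze(-1)).squeeze(-1).exp()
    accept = (pi_a / (c * mu_a)).clamp(max=1.0)
    w = (accept / z).clamp(max=clip)

    # Batch-level correction (one plausible form): renormalize to mean 1 so the
    # reweighting does not change the effective scale of the loss across batches.
    return (w / w.mean().clamp_min(1e-8)).detach()

# The weights then multiply the usual per-token PPO/GRPO loss, e.g.
#   loss = (jackpot_style_weights(...) * per_token_ppo_loss).mean()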
- In extreme actor-policy mismatch (using a small model to generate rollouts while training a larger, stronger model on them), naive off-policy training and TIS baselines often collapse, while Jackpot maintains stability over substantially more update steps.
- Jackpot approaches on-policy performance in multiple settings, including a Qwen3-8B policy trained with a Qwen3-1.7B actor for up to 300 update steps with batch size 64 on the DeepScaleR dataset.
- When actor-policy mismatch is milder (e.g., off-policy staleness from large batch sizes, or FP8 KV-cache quantization), Jackpot allows removing the clipping ratio from the PPO objective (under the setting detailed in the paper) while remaining more stable than TIS during training.
- Even with Jackpot, joint training of the two models still eventually crashes after 300 steps with batch size 64.
- The paper does not validate Jackpot on very large models (e.g., 32B variants) due to limited resources.
We base our implementation on verl (https://github.com/verl-project/verl). For installation, you can simply run:
pip install -e .[vllm]
We provide detailed example running scripts with Jackpot support under examples/jackpot_examples_gsm8k, including instructions in examples/jackpot_examples_gsm8k/README.md.
Here is a detailed explanation of all the parameters we added to the verl arguments for Jackpot:
# inside actor.yaml
use_jackpot: true # Whether to enable Jackpot (OBRS) loss
jackpot_log_probs_to_keep: 20 # Number of top-k log-probs to keep for Jackpot
jackpot_lambda: 1.0 # Scaling factor for Jackpot loss
jackpot_clip_ratio: 3.0 # Clipping ratio for Jackpot importance weights
jackpot_use_latest_logits: true # Whether to recompute Jackpot with the latest logits instead of cached top-k; we recommend turning it on when using Jackpot
jackpot_use_topk_renorm: true # Whether to renormalize Jackpot weights using the top-k slice
jackpot_mask_only: false # Mask Jackpot weights without renormalization outside the top-k slice

If you think our work is helpful, please consider citing us using the following BibTeX.
@misc{chen2026jackpotoptimalbudgetedrejection,
title={Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning},
author={Zhuoming Chen and Hongyi Liu and Yang Zhou and Haizhong Zheng and Beidi Chen},
year={2026},
eprint={2602.06107},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.06107},
} 

