
Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor–Policy Mismatch RL

Paper | Blog

Zhuoming Chen*, Hongyi Liu*, Yang Zhou*, Haizhong Zheng, Beidi Chen

Carnegie Mellon University
(* Equal contribution, ordered alphabetically by last name)


What is the Problem that Jackpot Solves?

General big picture

Reinforcement learning (RL) for large language models (LLMs) remains expensive, largely because rollout generation dominates the cost, occupying more than 80% of the total. Decoupling rollout generation from policy optimization (e.g., using a more efficient model to generate rollouts) could enable substantial efficiency gains, yet doing so introduces severe distribution mismatch that destabilizes learning. In particular: is it possible to perform rollouts with a completely different model from the one we ultimately want to train?

We propose Jackpot, a framework that leverages Optimal Budgeted Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-k probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps. Taken together, our results show that OBRS-based alignment brings us a step closer to practically and effectively decoupling rollout generation from policy optimization in RL for LLMs.

Why Is TIS Not Enough?

Importance-sampling-based methods are vulnerable to extreme actor-policy mismatch. Once the actor drifts too far, many tokens that the actor samples with high probability have very low probability under the policy, i.e., $p_{\text{inf}} \gg p_{\text{target}}$. These actor trajectories are effectively treated as low-likelihood samples by the policy, causing truncated importance sampling (TIS) to train on tokens the policy would never select at inference and creating a widening train-inference mismatch. This motivates distribution alignment methods that directly modify $p_{\text{inf}}$. Jackpot builds on top of TIS and directly modifies the rollout trajectories and the rollout distribution via Optimal Budgeted Rejection Sampling (OBRS).
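To make the failure mode concrete, here is a minimal sketch of a per-token truncated importance weight (illustrative only; the function name and the default clip value are ours, not taken from the paper or the codebase):

import torch

def truncated_is_weight(logp_policy: torch.Tensor,
                        logp_inf: torch.Tensor,
                        clip_c: float = 2.0) -> torch.Tensor:
    # Truncated importance weight min(c, p_policy / p_inf) per token.
    # Under extreme mismatch p_inf >> p_policy, so the ratio collapses toward
    # zero: the update still consumes tokens the policy would almost never
    # produce at inference, merely down-weighting them instead of aligning
    # the rollout distribution itself.
    ratio = torch.exp(logp_policy - logp_inf)
    return torch.clamp(ratio, max=clip_c)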

Our Method: Jackpot

Key Ideas

General big picture

  • OBRS for alignment: naive rejection sampling is too strict, rejecting nearly all rollout tokens. Instead, under a rejection budget, Jackpot stochastically rejects tokens so as to minimize the KL divergence between the modified rollout distribution and the policy distribution, with theoretical guarantees (see the sketch after this list).
  • System efficiency: approximate the normalization constant with a top-k union over the inference and policy distributions, then apply a batch-level bias correction for stability.
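As a rough illustration of the budgeted acceptance rule, here is a sketch in the spirit of OBRS. It assumes an acceptance probability of the form min(1, p_policy / (c * p_inf)); the exact rule, and how the budget constant is chosen, are given in the paper.

import torch

def obrs_accept_prob(logp_policy: torch.Tensor,
                     logp_inf: torch.Tensor,
                     budget_c: float) -> torch.Tensor:
    # Acceptance probability min(1, p_policy / (c * p_inf)).
    # budget_c controls strictness: as budget_c -> 0 every token is accepted
    # (no alignment), while budget_c -> max(p_policy / p_inf) recovers exact
    # rejection sampling (fully aligned, but almost nothing survives under
    # extreme mismatch). Choosing budget_c to hit a target acceptance rate
    # is the budget knob.
    return torch.clamp(torch.exp(logp_policy - logp_inf) / budget_c, max=1.0)

Note that, per the implementation note in the next section, this acceptance behavior enters training as a loss reweighting rather than as a literal re-sampling of rollouts.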

Algorithm Overview (More details in the paper)

Jackpot proceeds in two phases per iteration (Algorithm 1 in the PDF):

  1. Efficient rollout: sample actions using the inference model once, and store top-k log-probabilities.
  2. PPO update with Jackpot reweighting: compute the standard PPO objective, approximate the OBRS normalizer Z using a top-k union of inference and policy logits, apply OBRS weights and truncated IS corrections, then update the policy with the reweighted loss.

Implementation note from the paper: Jackpot only reweights quantities from standard PPO/GRPO forward passes; it does not introduce extra model forward passes or re-rollouts.
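For step 2 above, one way to picture the top-k approximation of the normalizer is the following sketch. The tensor names and the exact estimator are illustrative assumptions; Algorithm 1 in the paper gives the precise procedure. With an acceptance rule of the form min(1, p_policy / (c * p_inf)), the normalizer Z is the expected acceptance rate under the rollout distribution, which reduces to a sum of element-wise minima that can be approximated over the union of both models' top-k tokens.

import torch

def approx_obrs_normalizer(logp_policy_topk: torch.Tensor,
                           logp_inf_topk: torch.Tensor,
                           budget_c: float) -> torch.Tensor:
    # Z = E_{x ~ p_inf}[ min(1, p_policy(x) / (c * p_inf(x))) ]
    #   = sum_x min(p_inf(x), p_policy(x) / c)
    # Both inputs are [batch, seq, k] log-probs evaluated on the same
    # candidate set: the union of the inference model's and the policy's
    # top-k tokens. Mass outside this union is ignored here; the paper's
    # batch-level bias correction is applied on top of such an estimate.
    return torch.minimum(logp_inf_topk.exp(),
                         logp_policy_topk.exp() / budget_c).sum(dim=-1)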

Empirical Studies

General big picture

  • In extreme actor–policy mismatch (using a small model for rollouts while training a larger, stronger model on those rollouts), naive off-policy training and TIS baselines often collapse, while Jackpot maintains stability over substantially more update steps.
  • Jackpot approaches on-policy performance in multiple settings, including a Qwen3-8B policy trained with a Qwen3-1.7B actor for up to 300 update steps with a batch size of 64 on the DeepScaleR dataset.
  • When the actor-policy mismatch is milder (e.g., off-policy staleness from large batch sizes, or FP8 KV-cache quantization), Jackpot allows removing the clipping ratio from the PPO objective (under the setting detailed in the paper) while training more stably than TIS.

Limitations Noted in the Paper

  • Even with Jackpot, joint training of the two models still eventually crashes after 300 steps with a batch size of 64.
  • The paper does not validate Jackpot on very large models (e.g., 32B variants) due to resource limits.

Code Structure

Our implementation is based on verl (https://github.com/verl-project/verl). For installation, simply run the following:

pip install -e .[vllm] 

We provide detailed example scripts with Jackpot support under examples/jackpot_examples_gsm8k, with instructions in examples/jackpot_examples_gsm8k/README.md.

Here is a detailed explanation of the parameters we added to the verl arguments for Jackpot:

# inside actor.yaml 
use_jackpot: true # Whether to enable Jackpot (OBRS) loss 
jackpot_log_probs_to_keep: 20 # Number of top-k log-probs to keep for Jackpot
jackpot_lambda: 1.0 # Scaling factor for Jackpot loss
jackpot_clip_ratio: 3.0 # Clipping ratio for Jackpot importance weights
jackpot_use_latest_logits: true # Whether to recompute Jackpot with the latest logits instead of the cached top-k; we recommend turning this on when using Jackpot
jackpot_use_topk_renorm: true # Whether to renormalize Jackpot weights using the top-k slice
jackpot_mask_only: false # Mask Jackpot weights without renormalization outside the top-k slice

Bibliography

If you find our work helpful, please consider citing us with the following BibTeX.

@misc{chen2026jackpotoptimalbudgetedrejection,
      title={Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning}, 
      author={Zhuoming Chen and Hongyi Liu and Yang Zhou and Haizhong Zheng and Beidi Chen},
      year={2026},
      eprint={2602.06107},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.06107}, 
} 
