Skip to content

Conversation

@omrib40
Copy link

@omrib40 omrib40 commented Feb 1, 2026

What does this PR do?

Type of change: New feature

Overview: Adding ultrachat_200k for data utils.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Summary by CodeRabbit

  • New Features
    • Added support for the ultrachat_200k dataset, making it available for model training and evaluation workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

@omrib40 omrib40 requested a review from a team as a code owner February 1, 2026 16:39
@omrib40 omrib40 requested a review from ChenhanYu February 1, 2026 16:39
@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 1, 2026

📝 Walkthrough

Walkthrough

Added a new dataset configuration entry "ultrachat_200k" to the SUPPORTED_DATASET_CONFIG dictionary. The configuration specifies HuggingFaceH4 dataset path, train_sft split, and a preprocessing function that joins message content with newlines.

Changes

Cohort / File(s) Summary
Dataset Configuration Addition
modelopt/torch/utils/dataset_utils.py
Added "ultrachat_200k" dataset configuration with HuggingFace path, training split specification, and preprocessing logic to join message content fields with newlines.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add ultrachat 200k to data_utils' accurately describes the main change—adding a new dataset configuration entry to the dataset utilities.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant