fix(wc): respect C/POSIX locale for character counting #11006

Open
naoNao89 wants to merge 2 commits into uutils:main from naoNao89:fix-wc-locale-chars

Conversation

@naoNao89
Contributor

naoNao89 commented Feb 18, 2026

In the C/POSIX locale, wc -m now counts bytes rather than UTF-8 characters. This matches GNU coreutils, where MB_CUR_MAX == 1 in these locales.

Fixes #9712
Fixes #5831
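
To make the behavior change concrete, here is a minimal sketch of the two counting modes. It is not the code from this PR: `count_chars` is a hypothetical helper, the `chars_are_bytes` parameter only borrows the flag name visible in the diff, and lossy UTF-8 decoding is an assumption made to keep the example short.

```rust
/// Hypothetical helper: count "characters" as bytes when the locale is
/// single-byte (C/POSIX), and as Unicode scalar values otherwise.
fn count_chars(input: &[u8], chars_are_bytes: bool) -> usize {
    if chars_are_bytes {
        // MB_CUR_MAX == 1: every byte is one character.
        input.len()
    } else {
        // Assumption for this sketch: decode lossily, so each invalid
        // sequence is replaced by U+FFFD and counted as one character.
        String::from_utf8_lossy(input).chars().count()
    }
}

fn main() {
    // "Tiếng Việt\n", written with escapes so the byte layout is unambiguous.
    let text = "Ti\u{1EBF}ng Vi\u{1EC7}t\n".as_bytes();
    println!("UTF-8 locale:   {}", count_chars(text, false)); // prints 11
    println!("C/POSIX locale: {}", count_chars(text, true)); // prints 15
}
```

The two accented letters each take three bytes in UTF-8, which accounts for the difference of four between the counts.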

@github-actions

GNU testsuite comparison:

GNU test failed: tests/rm/isatty. tests/rm/isatty is passing on 'main'. Maybe you have to rebase?
GNU test failed: tests/tail/retry. tests/tail/retry is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/date/date-locale-hour is no longer failing!
Congrats! The gnu test tests/cp/link-heap is now passing!

@codspeed-hq

codspeed-hq bot commented Feb 18, 2026

Merging this PR will improve performance by ×2.2

⚡ 1 improved benchmark
✅ 287 untouched benchmarks
⏩ 38 skipped benchmarks [1]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- | --- |
| Simulation | wc_chars_large_line_count[100000] | 1,022.9 µs | 455.3 µs | ×2.2 |

Comparing naoNao89:fix-wc-locale-chars (224212a) with main (289d701)

Footnotes

  1. 38 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@sylvestre
Contributor

As no human would write such duplication of code, I guess it is LLM-generated...
Please review your changes before submitting them for review.

@cakebaker
Contributor

cakebaker commented Feb 18, 2026

In the newly added tests you always use -m or -cm. However, this means only your changes in count_fast.rs are tested due to the logic in word_count_from_reader in wc.rs. To test your changes in wc.rs you also have to provide -w or -L.
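
The coverage gap described above can be pictured with a small sketch. It is a deliberately simplified, hypothetical model of the dispatch and not the actual `word_count_from_reader` code: the `CountPath` enum and `pick_path` function are invented for illustration, and the only rule encoded is the one stated in the comment (flag sets without -w or -L stay in the fast path).

```rust
/// Which implementation ends up doing the counting (hypothetical names).
#[derive(Debug, PartialEq)]
enum CountPath {
    /// Byte/char/line counting handled in count_fast.rs.
    Fast,
    /// The word-aware loop in wc.rs.
    Full,
}

/// Simplified assumption: asking for words (-w) or max line length (-L)
/// forces the full loop; every other flag combination uses the fast path.
fn pick_path(flags: &str) -> CountPath {
    if flags.contains('w') || flags.contains('L') {
        CountPath::Full
    } else {
        CountPath::Fast
    }
}

fn main() {
    for flags in ["-m", "-cm", "-mw", "-mL"] {
        println!("{:<4} -> {:?}", flags, pick_path(flags));
    }
}
```

Under this model, tests that pass only -m or -cm never reach the character-counting code in wc.rs, which is why at least one test needs -w or -L.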

@naoNao89
Contributor Author

naoNao89 commented Feb 18, 2026

Sorry, refactored 💀 to remove the duplicated is_c_or_posix_locale()

@naoNao89 force-pushed the fix-wc-locale-chars branch 3 times, most recently from d906c13 to 1ed5ccd (February 18, 2026 12:51)

    }
    if SHOW_CHARS {
        total.chars += 1;
        if chars_are_bytes {
Contributor

Seriously?!
Please review your patches before submitting them...

Contributor Author

I thought clippy could catch an empty if :v

@github-actions

GNU testsuite comparison:

GNU test failed: tests/date/date-locale-hour. tests/date/date-locale-hour is passing on 'main'. Maybe you have to rebase?
Skipping an intermittent issue tests/pr/bounded-memory (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/cut/bounded-memory is no longer failing!

Modify wc -m to count bytes instead of UTF-8 characters when LC_ALL,
LC_CTYPE, or LANG is set to C or POSIX. This matches GNU coreutils
behavior where MB_CUR_MAX == 1 in these locales.

Changes:
- Add is_c_or_posix_locale() helper in count_fast.rs
- Export and reuse function in wc.rs to avoid duplication
- Update fast path and UTF-8 decoding path
- Add regression tests with Vietnamese text

Fixes uutils#9712, fixes uutils#5831.
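
As an illustration of the locale check this commit message describes, a helper along these lines would follow the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG). This is a sketch under assumptions, not the function from count_fast.rs: the closure parameter is added here only to make the logic testable, `effective_ctype_locale` is a hypothetical name, and treating "nothing set" as C/POSIX is an assumption.

```rust
use std::env;

/// Return the first non-empty value among LC_ALL, LC_CTYPE and LANG.
/// `get` looks up a variable; empty values are treated as unset.
fn effective_ctype_locale(get: impl Fn(&str) -> Option<String>) -> Option<String> {
    for name in ["LC_ALL", "LC_CTYPE", "LANG"] {
        if let Some(value) = get(name) {
            if !value.is_empty() {
                return Some(value);
            }
        }
    }
    None
}

/// Hypothetical variant of is_c_or_posix_locale() that takes the lookup
/// function as a parameter instead of reading the environment directly.
fn is_c_or_posix_locale(get: impl Fn(&str) -> Option<String>) -> bool {
    match effective_ctype_locale(get) {
        // Exact match only: names such as "C.UTF-8" are multibyte locales.
        Some(name) => name == "C" || name == "POSIX",
        // Assumption: with no locale variables set, the default is C/POSIX.
        None => true,
    }
}

fn main() {
    let chars_are_bytes = is_c_or_posix_locale(|name| env::var(name).ok());
    println!("chars_are_bytes = {}", chars_are_bytes);
}
```

Passing the lookup as a closure keeps the precedence rules testable without mutating the process environment.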
@naoNao89 force-pushed the fix-wc-locale-chars branch from 1ed5ccd to bf04096 (February 18, 2026 13:02)
@github-actions

GNU testsuite comparison:

GNU test failed: tests/rm/isatty. tests/rm/isatty is passing on 'main'. Maybe you have to rebase?
Note: The gnu test tests/rm/many-dir-entries-vs-OOM is now being skipped but was previously passing.

Add tests with -w flag to ensure both count_fast.rs and wc.rs
paths are tested for locale-aware character counting.
@github-actions

GNU testsuite comparison:

Congrats! The gnu test tests/cut/bounded-memory is no longer failing!

Development

Successfully merging this pull request may close these issues.

- wc -m returns character count instead of byte count in C/POSIX locale
- Discrepancy in output length with special characters

3 participants