tac: support non-UTF-8 separator #10934

Merged
ChrisDryden merged 1 commit into uutils:main from victor-prokhorov:tac-non-utf8-separator on Feb 21, 2026
Conversation

@victor-prokhorov
Contributor

fixes #9502

Open question: GNU tac also accepts non-UTF-8 separators in regex mode. Right now, in this PR, I return an error for that case. Should I handle it in this PR, or in a separate one?
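
For context, the behaviour under discussion can be sketched at the byte level. This is a minimal illustration, not the code from this PR: the function name `tac_bytes` is hypothetical, it handles only a literal (non-regex) separator, it assumes the default mode in which the separator trails each record, and it returns the input unchanged for an empty separator, which is an assumption made only to keep the sketch total.

```rust
/// Reverse the records of `data`, where each record ends with `sep`.
/// Everything is handled as raw bytes, so `sep` need not be valid UTF-8.
/// (Hypothetical helper for illustration; not the uutils implementation.)
fn tac_bytes(data: &[u8], sep: &[u8]) -> Vec<u8> {
    // Assumption of this sketch: an empty separator splits nothing.
    if sep.is_empty() {
        return data.to_vec();
    }

    // Collect records, each one including its trailing separator.
    let mut records: Vec<&[u8]> = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < data.len() {
        if data[i..].starts_with(sep) {
            let end = i + sep.len();
            records.push(&data[start..end]);
            start = end;
            i = end;
        } else {
            i += 1;
        }
    }
    // A final record without a trailing separator is kept as-is.
    if start < data.len() {
        records.push(&data[start..]);
    }

    // Emit the records in reverse order.
    let mut out = Vec::with_capacity(data.len());
    for record in records.iter().rev() {
        out.extend_from_slice(record);
    }
    out
}
```

Calling `tac_bytes(b"a\xffb\xffc\xff", b"\xff")` reverses the three records even though `0xFF` can never appear in valid UTF-8, which is why a `&str`-based separator cannot express this case.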

@ChrisDryden
Collaborator

The ultimate goal is to get the non-UTF-8 handling to fully match how GNU handles it. Can you try to handle it in the same PR?

@codspeed-hq

codspeed-hq bot commented Feb 14, 2026

Merging this PR will improve performance by 3.55%

⚡ 1 improved benchmark
✅ 287 untouched benchmarks
⏩ 40 skipped benchmarks (see footnote 1)

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | cp_large_file[16] | 389.6 µs | 376.3 µs | +3.55% |

Comparing victor-prokhorov:tac-non-utf8-separator (5bde2a8) with main (a972eee)


Footnotes

  1. 40 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@ChrisDryden
Collaborator

Fixing this should also fix the tac-locale test, so you can use that test to check whether the fix is working properly.

@github-actions

GNU testsuite comparison:

GNU test failed: tests/date/date-locale-hour. tests/date/date-locale-hour is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/misc/usage_vs_getopt (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/tac/tac is no longer failing!
Congrats! The gnu test tests/tac/tac-locale is no longer failing!

@github-actions

GNU testsuite comparison:

Skip an intermittent issue tests/tail/symlink (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)
Congrats! The gnu test tests/tac/tac is no longer failing!
Congrats! The gnu test tests/tac/tac-locale is no longer failing!

@github-actions

GNU testsuite comparison:

GNU test failed: tests/date/date-locale-hour. tests/date/date-locale-hour is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/tac/tac is no longer failing!
Congrats! The gnu test tests/tac/tac-locale is no longer failing!
Congrats! The gnu test tests/tail/tail-n0f is now passing!

@victor-prokhorov
Contributor Author

I found this trick of turning off the u flag here: https://docs.rs/regex/1.10.4/regex/#opt-out-of-unicode-support. Should it be included in the inline documentation?
I used expect because the invariant is upheld by construction. Is returning a Result preferred? Please advise.
The integration tests I wrote somewhat overlap the GNU test scripts. Let me know if you prefer them removed.


match c {
match b {
_ if inside_brackets && !b.is_ascii() => {
Collaborator

TIL GNU also ignores non-ASCII bytes inside bracket expressions

last_byte = Some(*b);
}
_ if !b.is_ascii() => {
let _ = write!(result, r"(?-u:\x{b:02x})");
Collaborator

Great find!
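
The two reviewed fragments above can be put together into a small standalone sketch. This is a simplified illustration rather than the merged code: the function name `bytes_to_pattern` is hypothetical, bracket tracking here ignores escaped brackets and other edge cases that the real code presumably handles (it tracks a `last_byte`), and only the standard library is used. Each non-ASCII byte outside a bracket expression becomes `(?-u:\xNN)`, and non-ASCII bytes inside brackets are dropped, following the review comment about GNU's behaviour.

```rust
use std::fmt::Write;

/// Turn raw separator bytes into a pattern string for a byte-oriented regex.
/// (Hypothetical helper for illustration; simplified from the PR's approach.)
fn bytes_to_pattern(sep: &[u8]) -> String {
    let mut result = String::new();
    let mut inside_brackets = false;

    for &b in sep {
        match b {
            // Non-ASCII bytes inside a bracket expression are skipped.
            _ if inside_brackets && !b.is_ascii() => {}
            // Elsewhere, match the exact byte with Unicode mode turned off.
            _ if !b.is_ascii() => {
                // Writing to a String cannot fail, so the result is ignored.
                let _ = write!(result, r"(?-u:\x{b:02x})");
            }
            b'[' => {
                inside_brackets = true;
                result.push('[');
            }
            b']' => {
                inside_brackets = false;
                result.push(']');
            }
            // Remaining bytes are ASCII and can be copied through unchanged.
            _ => result.push(b as char),
        }
    }
    result
}
```

For the single byte `0xFF` this yields the pattern `(?-u:\xff)`. A pattern like that can match invalid UTF-8, so with the `regex` crate it would need to be compiled with `regex::bytes::Regex` rather than the `&str`-based `regex::Regex`.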

@ChrisDryden merged commit 23eb2ad into uutils:main on Feb 21, 2026
155 of 157 checks passed
@ChrisDryden
Collaborator

Thanks!

Development

Successfully merging this pull request may close these issues.

tac cannot handle seperators that are not UTF-8.

2 participants