Restructure regexp encoding validation #3984

Merged
kddnewton merged 1 commit into main from regex on Mar 11, 2026
@kddnewton
Collaborator

Move all the logic from prism.c into regexp.c. Now regexp.c does two passes. The first pass scans the raw source to track escape types, non-ASCII literals, and multibyte validity for encoding validation. The second pass scans the unescaped content for named capture extraction (needed because escape sequences like line continuations alter group names).
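
To illustrate why named captures are read from the unescaped content, here is a minimal sketch, not the code from this PR. The helpers `sketch_unescape` and `sketch_first_capture_name` are hypothetical names, and the sketch assumes that a line continuation is a backslash immediately followed by a newline and that it simply disappears when the content is unescaped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `raw` into `out`, dropping backslash-newline
 * pairs (line continuations). Every other escape is copied through
 * unchanged. Returns the number of bytes written, excluding the NUL. */
static size_t sketch_unescape(const char *raw, char *out, size_t out_size) {
    assert(raw != NULL && out != NULL && out_size > 0);
    size_t written = 0;

    for (size_t i = 0; raw[i] != '\0'; i++) {
        if (raw[i] == '\\' && raw[i + 1] == '\n') {
            i++; /* skip the newline as well */
            continue;
        }
        if (raw[i] == '\\' && raw[i + 1] != '\0') {
            /* Keep other escapes as a pair so "\\" followed by a newline
             * is not mistaken for a continuation. */
            if (written + 2 >= out_size) break;
            out[written++] = raw[i++];
            out[written++] = raw[i];
            continue;
        }
        if (written + 1 >= out_size) break;
        out[written++] = raw[i];
    }

    out[written] = '\0';
    return written;
}

/* Hypothetical helper: copy the name of the first `(?<name>` group in
 * `pattern` into `name`. Simplified: it does not skip escaped parentheses
 * and gives up if the first `(?<` is a lookbehind. */
static bool sketch_first_capture_name(const char *pattern, char *name, size_t name_size) {
    assert(pattern != NULL && name != NULL);
    const char *open = strstr(pattern, "(?<");
    if (open == NULL) return false;

    const char *start = open + 3;
    if (*start == '=' || *start == '!') return false; /* lookbehind */

    const char *end = strchr(start, '>');
    if (end == NULL) return false;

    size_t length = (size_t) (end - start);
    if (length == 0 || length >= name_size) return false;

    memcpy(name, start, length);
    name[length] = '\0';
    return true;
}
```

With a raw source of `(?<fo`, a backslash, a newline, and `o>x)`, extracting the name from the raw bytes yields a name that still contains the backslash and newline, while extracting it after `sketch_unescape` yields `foo`.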

Fixed a couple of things along the way. ascii_only was previously computed by scanning the unescaped content; it is now computed as we go, which avoids scanning again. Unicode properties also now properly produce an error for regexps with modifiers.
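
As a rough sketch of computing ascii_only "as we go", the following shows a single scan over the raw bytes that records escape kinds and non-ASCII literals at the same time. This is an illustration under assumptions, not the code in regexp.c: `sketch_scan_t`, its fields, and `sketch_scan_raw` are hypothetical names, and only `\x` and `\u` escapes are tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what one scan over the raw source records. */
typedef struct {
    bool ascii_only;         /* no byte >= 0x80 appeared in the source */
    bool has_hex_escape;     /* saw a backslash followed by 'x' */
    bool has_unicode_escape; /* saw a backslash followed by 'u' */
} sketch_scan_t;

/* Walk the raw bytes once, updating every flag in the same loop so that
 * no separate scan is needed just to decide ascii_only. */
static sketch_scan_t sketch_scan_raw(const uint8_t *source, size_t length) {
    assert(source != NULL || length == 0);
    sketch_scan_t result = { true, false, false };

    for (size_t index = 0; index < length; index++) {
        uint8_t byte = source[index];

        if (byte >= 0x80) {
            result.ascii_only = false;
        } else if (byte == '\\' && index + 1 < length) {
            /* Consume the escaped byte so that an escaped backslash
             * followed by 'x' is not mistaken for a hex escape. */
            uint8_t next = source[++index];
            if (next == 'x') {
                result.has_hex_escape = true;
            } else if (next == 'u') {
                result.has_unicode_escape = true;
            } else if (next >= 0x80) {
                result.ascii_only = false;
            }
        }
    }

    return result;
}
```

The design point is that the flags share one loop: the scan already has to look at every byte to classify escapes, so tracking whether any byte is non-ASCII costs nothing extra.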

Fixes #2104
Fixes #2620
Fixes #3734

@kddnewton kddnewton force-pushed the regex branch 3 times, most recently from dcb65b7 to fa0502a Compare March 11, 2026 03:54
@kddnewton kddnewton merged commit ae5ea48 into main Mar 11, 2026
67 checks passed
@kddnewton kddnewton deleted the regex branch March 11, 2026 04:21