@@ -386,73 +386,29 @@ Names (identifiers and keywords)
386386:data:`~token.NAME` tokens represent *identifiers*, *keywords*, and
387387*soft keywords*.
388388
389- Within the ASCII range (U+0001..U+007F), the valid characters for names
390- include the uppercase and lowercase letters (``A-Z`` and ``a-z``),
391- the underscore ``_`` and, except for the first character, the digits
392- ``0`` through ``9``.
389+ Names are composed of the following characters:
390+
391+ * uppercase and lowercase letters (``A-Z`` and ``a-z``),
392+ * the underscore (``_``),
393+ * digits (``0`` through ``9``), which cannot appear as the first character, and
394+ * non-ASCII characters. Valid names may only contain "letter-like" and
395+ "digit-like" characters; see :ref:`lexical-names-nonascii` for details.
393396
394397Names must contain at least one character, but have no upper length limit.
395398Case is significant.
396399
397- Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like"
398- and "number-like" characters from outside the ASCII range, as detailed below.
399-
400- All identifiers are converted into the `normalization form`_ NFKC while
401- parsing; comparison of identifiers is based on NFKC.
402-
403- Formally, the first character of a normalized identifier must belong to the
404- set ``id_start``, which is the union of:
405-
406- * Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
407- * Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
408- * Unicode category ``<Lt>`` - titlecase letters
409- * Unicode category ``<Lm>`` - modifier letters
410- * Unicode category ``<Lo>`` - other letters
411- * Unicode category ``<Nl>`` - letter numbers
412- * {``"_"``} - the underscore
413- * ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
414- to support backwards compatibility
415-
416- The remaining characters must belong to the set ``id_continue``, which is the
417- union of:
418-
419- * all characters in ``id_start``
420- * Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
421- * Unicode category ``<Pc>`` - connector punctuations
422- * Unicode category ``<Mn>`` - nonspacing marks
423- * Unicode category ``<Mc>`` - spacing combining marks
424- * ``<Other_ID_Continue>`` - another explicit set of characters in
425- `PropList.txt`_ to support backwards compatibility
426-
427- Unicode categories use the version of the Unicode Character Database as
428- included in the :mod:`unicodedata` module.
429-
430- These sets are based on the Unicode standard annex `UAX-31`_.
431- See also :pep:`3131` for further details.
432-
433- Even more formally, names are described by the following lexical definitions:
400+ Formally, names are described by the following lexical definitions:
434401
435402.. grammar-snippet::
436403 :group: python-grammar
437404
438- NAME: `xid_start` `xid_continue`*
439- id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start>
440- id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue>
441- xid_start: <all characters in `id_start` whose NFKC normalization is
442- in (`id_start` `xid_continue`*)">
443- xid_continue: <all characters in `id_continue` whose NFKC normalization is
444- in (`id_continue`*)">
445- identifier: <`NAME`, except keywords>
405+ NAME: `name_start` `name_continue`*
406+ name_start: "a"..."z" | "A"..."Z" | "_" | <non-ASCII character>
407+ name_continue: `name_start` | "0"..."9"
408+ identifier: <`NAME`, except keywords>
446409
447- A non-normative listing of all valid identifier characters as defined by
448- Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode
449- Character Database.
450-
451-
452- .. _UAX-31: https://www.unicode.org/reports/tr31/
453- .. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
454- .. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
455- .. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
410+ Note that not all names matched by this grammar are valid; see
411+ :ref:`lexical-names-nonascii` for details.
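
As a rough illustration of that note, the grammar above can be approximated
with a regular expression and compared with :meth:`str.isidentifier`.
This is only a sketch: ``NAME_GRAMMAR`` and ``matches_grammar`` are
hypothetical names, and the pattern assumes that "non-ASCII character" means
any code point from U+0080 upward.

```python
import re

# Hypothetical approximation of the NAME grammar rule: one start character
# followed by any number of continue characters. "Non-ASCII" is assumed to
# mean any code point from U+0080 upward.
NAME_GRAMMAR = re.compile(
    r"[A-Za-z_\u0080-\U0010FFFF][A-Za-z0-9_\u0080-\U0010FFFF]*"
)


def matches_grammar(text):
    """Return True if the whole string matches the approximate grammar."""
    return NAME_GRAMMAR.fullmatch(text) is not None


# Compare the grammar-level match with the full validity check.
for candidate in ["spam_1", "1spam", "蛇", "€"]:
    print(repr(candidate), matches_grammar(candidate), candidate.isidentifier())
```

Here ``€`` matches the approximate grammar because it is a non-ASCII
character, yet :meth:`str.isidentifier` rejects it, which is the gap the
additional validation rules close.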
456412
457413
458414.. _keywords:
@@ -555,6 +511,95 @@ characters:
555511 :ref:`atom-identifiers`.
556512
557513
514+ .. _lexical-names-nonascii:
515+
516+ Non-ASCII characters in names
517+ -----------------------------
518+
519+ Names that contain non-ASCII characters need additional normalization
520+ and validation beyond the rules and grammar explained
521+ :ref:`above <identifiers>`.
522+ For example, ``ř_1``, ``蛇``, or ``साँप`` are valid names, but ``r〰2``,
523+ ``€``, or ``🐍`` are not.
524+
525+ This section explains the exact rules.
526+
527+ All names are converted into the `normalization form`_ NFKC while parsing.
528+ This means that some typographic variants of characters are converted to
529+ their "basic" form. For example, ``fiⁿₐˡᵢᶻₐᵗᵢᵒₙ`` normalizes to
530+ ``finalization``, so Python treats them as the same name::
531+
532+ >>> fiⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3
533+ >>> finalization
534+ 3
535+
536+ .. note::
537+
538+ Normalization is done at the lexical level only.
539+ Run-time functions that take names as *strings* generally do not normalize
540+ their arguments.
541+ For example, the variable defined above is accessible at run time in the
542+ :func:`globals` dictionary as ``globals()["finalization"]`` but not
543+ ``globals()["fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"]``.
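
The difference between lexical and run-time behavior can be reproduced with
:func:`unicodedata.normalize`. The following is a minimal sketch, assuming the
assignment is executed in a scratch dictionary (the hypothetical ``namespace``)
rather than in the module globals:

```python
import unicodedata

fancy = "fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"

# Apply the same normalization form that is used for names while parsing.
plain = unicodedata.normalize("NFKC", fancy)
print(plain)

# Compile and run the assignment in a scratch namespace (hypothetical name).
namespace = {}
exec(fancy + " = 3", namespace)

# The name is stored in its normalized form; the string lookup itself
# performs no normalization.
print(plain in namespace)
print(fancy in namespace)
```

Code that receives names as strings and needs to match source-level names
may therefore want to normalize them explicitly before the lookup.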
544+
545+ Similarly to how ASCII-only names must contain only letters, digits and
546+ the underscore, and cannot start with a digit, a valid name must
547+ start with a character in the "letter-like" set ``xid_start``,
548+ and the remaining characters must be in the "letter- and digit-like" set
549+ ``xid_continue``.
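
One way to check these rules from Python code is :meth:`str.isidentifier`.
The sketch below applies it to the examples from the start of this section.
Because :meth:`str.isidentifier` also accepts keywords, the hypothetical helper
``is_usable_name`` additionally filters them out with
:func:`keyword.iskeyword`; soft keywords are not considered here.

```python
import keyword


def is_usable_name(text):
    """Hypothetical helper: a valid name that is not a (hard) keyword."""
    return text.isidentifier() and not keyword.iskeyword(text)


# Examples taken from the start of this section.
valid = ["ř_1", "蛇", "साँप"]
invalid = ["r〰2", "€", "🐍"]

for name in valid + invalid + ["class"]:
    print(repr(name), is_usable_name(name))
```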
550+
551+ These sets are based on the *XID_Start* and *XID_Continue* sets as defined by the
552+ Unicode standard annex `UAX-31`_.
553+ Python's ``xid_start`` additionally includes the underscore (``_``).
554+ Note that Python does not necessarily conform to `UAX-31`_.
555+
556+ A non-normative listing of characters in the *XID_Start* and *XID_Continue*
557+ sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_
558+ file in the Unicode Character Database.
559+ For reference, the construction rules for the ``xid_*`` sets are given below.
560+
561+ The set ``id_start`` is defined as the union of:
562+
563+ * Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``)
564+ * Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``)
565+ * Unicode category ``<Lt>`` - titlecase letters
566+ * Unicode category ``<Lm>`` - modifier letters
567+ * Unicode category ``<Lo>`` - other letters
568+ * Unicode category ``<Nl>`` - letter numbers
569+ * {``"_"``} - the underscore
570+ * ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_
571+ to support backwards compatibility
572+
573+ The set ``xid_start`` then closes this set under NFKC normalization, by
574+ removing all characters whose normalization is not of the form
575+ ``id_start id_continue*``.
576+
577+ The set ``id_continue`` is defined as the union of:
578+
579+ * ``id_start`` (see above)
580+ * Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``)
581+ * Unicode category ``<Pc>`` - connector punctuations
582+ * Unicode category ``<Mn>`` - nonspacing marks
583+ * Unicode category ``<Mc>`` - spacing combining marks
584+ * ``<Other_ID_Continue>`` - another explicit set of characters in
585+ `PropList.txt`_ to support backwards compatibility
586+
587+ Again, ``xid_continue`` closes this set under NFKC normalization.
588+
589+ Unicode categories use the version of the Unicode Character Database as
590+ included in the :mod:`unicodedata` module.
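
The category-based part of these definitions can be explored with
:func:`unicodedata.category`. The following is a rough approximation only:
the helper names are hypothetical, and the sketch deliberately ignores
``<Other_ID_Start>``, ``<Other_ID_Continue>`` and the NFKC closure step, so it
does not reproduce ``xid_start`` and ``xid_continue`` exactly.

```python
import unicodedata

# General categories listed above for id_start, and the additional
# categories listed for id_continue.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA_CATEGORIES = {"Nd", "Pc", "Mn", "Mc"}


def in_id_start_approx(char):
    """Approximate id_start; ignores Other_ID_Start (assumption)."""
    return char == "_" or unicodedata.category(char) in ID_START_CATEGORIES


def in_id_continue_approx(char):
    """Approximate id_continue; ignores Other_ID_Continue (assumption)."""
    return (in_id_start_approx(char)
            or unicodedata.category(char) in ID_CONTINUE_EXTRA_CATEGORIES)


# Version of the Unicode Character Database used by this interpreter.
print(unicodedata.unidata_version)

for char in ["A", "_", "7", "\u0301", "€"]:
    print(repr(char), unicodedata.category(char),
          in_id_start_approx(char), in_id_continue_approx(char))
```

The combining acute accent (U+0301) shows the asymmetry between the two sets:
as a nonspacing mark it may continue a name but not start one.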
591+
592+ .. _UAX-31: https://www.unicode.org/reports/tr31/
593+ .. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt
594+ .. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt
595+ .. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms
596+
597+ .. seealso::
598+
599+ * :pep:`3131` -- Supporting Non-ASCII Identifiers
600+ * :pep:`672` -- Unicode-related Security Considerations for Python
601+
602+
558603.. _literals:
559604
560605Literals