Unicode Objects and Codecs¶
Unicode Objects¶
Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536; otherwise, code points must be below 1114112 (which is the full Unicode range).
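The size classes above can be sketched in plain C. This is only an illustration of the thresholds, not CPython's actual implementation: `min_char_width` and its `is_ascii` out-parameter are hypothetical names that are not part of the C API, and the sketch assumes that the storage width is picked from the largest code point in the string.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the CPython API.
   Returns the number of bytes per code point (1, 2 or 4) needed to store
   the given code points, or 0 if any value lies outside the Unicode range.
   *is_ascii is set to 1 when every code point is below 128, else to 0. */
static int
min_char_width(const uint32_t *code_points, size_t n, int *is_ascii)
{
    uint32_t max_cp = 0;

    *is_ascii = 0;
    for (size_t i = 0; i < n; i++) {
        if (code_points[i] > max_cp) {
            max_cp = code_points[i];
        }
    }
    if (max_cp >= 1114112) {
        /* Beyond 0x10FFFF: not a valid Unicode code point. */
        return 0;
    }
    *is_ascii = (max_cp < 128);
    if (max_cp < 256) {
        return 1;   /* fits in 8 bits per code point */
    }
    if (max_cp < 65536) {
        return 2;   /* fits in 16 bits per code point */
    }
    return 4;       /* needs 32 bits per code point */
}
```

Under these assumptions, a string containing a single code point at or above 65536 is stored with four bytes per code point throughout, while a purely ASCII string needs only one byte each.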
UTF-8 representation is created on demand and cached in the Unicode object.
Note
The Py_UNICODE representation was removed in Python 3.12, together with the deprecated APIs that relied on it.
See PEP 623 for more information.
Unicode Type¶
These are the basic Unicode object types used for the Unicode implementation in Python:
-
PyTypeObject PyUnicode_Type¶
- Part of the Stable ABI.
This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.
-
PyTypeObject PyUnicodeIter_Type¶
- Part of the Stable ABI.
This instance of PyTypeObject represents the Python Unicode iterator type. It is used to iterate over Unicode string objects.
-
type Py_UCS4¶
-
type Py_UCS2¶
-
type Py_UCS1¶
- Part of the Stable ABI.
These types are typedefs for unsigned integer types wide enough to contain characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with single Unicode characters, use Py_UCS4.
Added in version 3.3.
-
type PyASCIIObject¶
-
type PyCompactUnicodeObject¶
-
type PyUnicodeObject¶
These subtypes of PyObject represent a Python Unicode object. In almost all cases, they shouldn’t be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.
Added in version 3.3.
The structure of a particular object can be determined using the following macros. The macros cannot fail; their behavior is undefined if their argument is not a Python Unicode object.
-
PyUnicode_IS_COMPACT(o)¶
True if o uses the PyCompactUnicodeObject structure.
Added in version 3.3.
-
PyUnicode_IS_COMPACT_ASCII(o)¶
True if o uses the PyASCIIObject structure.
Added in version 3.3.
The following APIs are C macros and static inlined functions for fast checks and access to internal read-only data of Unicode objects: