Unicode Objects and Codecs

Unicode Objects

Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536; otherwise, code points must be below 1114112 (which is the full Unicode range).

UTF-8 representation is created on demand and cached in the Unicode object.

Note

The Py_UNICODE representation was removed in Python 3.12, together with the deprecated APIs that relied on it. See PEP 623 for more information.

Unicode Type

These are the basic Unicode object types used for the Unicode implementation in Python:

PyTypeObject PyUnicode_Type
Part of the Stable ABI.

This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

PyTypeObject PyUnicodeIter_Type
Part of the Stable ABI.

This instance of PyTypeObject represents the Python Unicode iterator type. It is used to iterate over Unicode string objects.

type Py_UCS4
type Py_UCS2
type Py_UCS1
Part of the Stable ABI.

These types are typedefs for unsigned integer types wide enough to contain characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with single Unicode characters, use Py_UCS4.

Added in version 3.3.

type PyASCIIObject
type PyCompactUnicodeObject
type PyUnicodeObject

These subtypes of PyObject represent a Python Unicode object. In almost all cases, they shouldn’t be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.

Added in version 3.3.

The structure of a particular object can be determined using the following macros. The macros cannot fail; their behavior is undefined if their argument is not a Python Unicode object.

PyUnicode_IS_COMPACT(o)

True if o uses the PyCompactUnicodeObject structure.

Added in version 3.3.

PyUnicode_IS_COMPACT_ASCII(o)

True if o uses the PyASCIIObject structure.

Added in version 3.3.

The following APIs are C macros and static inlined functions for fast checks and access to internal read-only data of Unicode objects:

int PyUnicode_Check(PyObject *obj)

Return true if the object obj is a Unicode object or an instance of a Unicode subtype. This function always succeeds.

int PyUnicode_CheckExact(PyObject *obj)

Return true if the object obj is a Unicode object, but not an instance of a subtype. This function always succeeds.

Py_ssize_t PyUnicode_GET_LENGTH(PyObject *unicode)

Return the length of the Unicode string, in code points. unicode has to be a Unicode object in the “canonical” representation (not checked).

Added in version 3.3.

Py_UCS1 *PyUnicode_1BYTE_DATA(PyObject *unicode)
Py_UCS2 *PyUnicode_2BYTE_DATA(PyObject *unicode)
Py_UCS4 *PyUnicode_4BYTE_DATA(PyObject *unicode)

Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4 integer types for direct character access. No check is made that the canonical representation actually has the corresponding character size; use PyUnicode_KIND() to select the right function.

Added in version 3.3.

PyUnicode_1BYTE_KIND
PyUnicode_2BYTE_KIND
PyUnicode_4BYTE_KIND

Return values of the PyUnicode_KIND() macro.

Added in version 3.3.

Changed in version 3.12: PyUnicode_WCHAR_KIND has been removed.

int PyUnicode_KIND(PyObject *unicode)

Return one of the PyUnicode kind constants (see above) that indicate how many bytes per character this Unicode object uses to store its data. unicode has to be a Unicode object in the “canonical” representation (not checked).

Added in version 3.3.

void *PyUnicode_DATA(PyObject *unicode)

Return a void pointer to the raw Unicode buffer. unicode has to be a Unicode object in the “canonical” representation (not checked).

Added in version 3.3.

void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, Py_UCS4 value)

Write the code point value to the given zero-based index in a string.

The kind value and data pointer must have been obtained from a string using PyUnicode_KIND() and PyUnicode_DATA() respectively. You must hold a reference to that string while calling PyUnicode_WRITE(). All requirements of PyUnicode_WriteChar() also apply.

The function performs no checks for any of its requirements, and is intended for usage in loops.

Added in version 3.3.

Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

Read a code point from canonical representation data (as obtained with PyUnicode_DATA()). No checks or ready calls are performed.

Added in version 3.3.

Py_UCS4 PyUnicode_READ_CHAR(PyObject *unicode, Py_ssize_t index)

Read a character from a Unicode object unicode, which must be in the “canonical” representation. This is less efficient than PyUnicode_READ() if you do multiple consecutive reads.

Added in version 3.3.

Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *unicode)

Return the maximum code point that is suitable for creating another string based on unicode, which must be in the “canonical” representation. The result is an upper bound rather than the exact maximum (it is always one of 127, 255, 65535 or 1114111), but obtaining it is more efficient than iterating over the string.

Added in version 3.3.

int PyUnicode_IsIdentifier(PyObject *unicode)
Part of the Stable ABI.

Return 1 if the string is a valid identifier according to the language definition, section Names (identifiers and keywords). Return 0 otherwise.

Changed in version 3.9: The function no longer calls Py_FatalError() if the string is not ready.

unsigned int PyUnicode_IS_ASCII(PyObject *unicode)

Return true if the string only contains ASCII characters. Equivalent to str.isascii().

Added in version 3.2.

Py_hash_t PyUnstable_Unicode_GET_CACHED_HASH(PyObject *str)
This is Unstable API. It may change without warning in minor releases.

If the hash of str, as returned by PyObject_Hash(), has been cached and is immediately available, return it. Otherwise, return -1 without setting an exception.

If str is not a string (that is, if PyUnicode_Check(str) is false), the behavior is undefined.

This function never fails with an exception.

Note that there are no guarantees on when an object’s hash is cached, and the (non-)existence of a cached hash does not imply that the string has any other properties.

Unicode Character Properties

Unicode provides many different character properties. The most commonly needed ones are available through these macros, which are mapped to C functions depending on the Python configuration.

int Py_UNICODE_ISSPACE(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a whitespace character.

int Py_UNICODE_ISLOWER(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a lowercase character.

int Py_UNICODE_ISUPPER(