Unicode Objects and Codecs

Unicode Objects

Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536; otherwise, code points must be below 1114112 (which is the full Unicode range).
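The size classes described above can be pictured with a small standalone sketch. This is not CPython's code: the function name bytes_per_code_point is hypothetical, and only the thresholds come from the paragraph above. The sketch assumes that the "below 128" and "below 256" cases both fit in one byte per code point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API: returns the narrowest
   per-code-point storage width (in bytes) that can hold max_code_point,
   using the thresholds 256, 65536 and 1114112 named in the text. */
int bytes_per_code_point(uint32_t max_code_point)
{
    /* Code points must lie below 1114112 (0x110000). */
    assert(max_code_point < 1114112u);

    if (max_code_point < 256u) {
        return 1;   /* covers both the "< 128" and the "< 256" case */
    }
    if (max_code_point < 65536u) {
        return 2;
    }
    return 4;
}
```

A string takes the width required by its largest code point, so a single character at or above 65536 moves the whole string to four bytes per code point in this model.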

Py_UNICODE* and UTF-8 representations are created on demand and cached in the Unicode object. The Py_UNICODE* representation is deprecated and inefficient; it should be avoided in performance- or memory-sensitive situations.

Due to the transition between the old APIs and the new APIs, Unicode objects can internally be in two states depending on how they were created:

  • "canonical" Unicode objects are all objects created by a non-deprecated Unicode API. They use the most efficient representation allowed by the implementation.
  • "legacy" Unicode objects have been created through one of the deprecated APIs (typically PyUnicode_FromUnicode()) and only bear the Py_UNICODE* representation; you will have to call PyUnicode_READY() on them before calling any other API.

Unicode Type

These are the basic Unicode object types used for the Unicode implementation in Python:

Py_UCS4
Py_UCS2
Py_UCS1

These types are typedefs for unsigned integer types wide enough to contain characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with single Unicode characters, use Py_UCS4.

New in version 3.3.

Py_UNICODE

This is a typedef of wchar_t, which is a 16-bit type or 32-bit type depending on the platform.

Changed in version 3.3: In previous versions, this was a 16-bit type or a 32-bit type depending on whether you selected a "narrow" or "wide" Unicode version of Python at build time.

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

These subtypes of PyObject represent a Python Unicode object. In almost all cases, they shouldn’t be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.

New in version 3.3.

PyTypeObject PyUnicode_Type

This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects:

int PyUnicode_Check(PyObject *o)

Return true if the object o is a Unicode object or an instance of a Unicode subtype.

int PyUnicode_CheckExact(PyObject *o)

Return true if the object o is a Unicode object, but not an instance of a subtype.

int PyUnicode_READY(PyObject *o)

Ensure the string object o is in the "canonical" representation. This is required before using any of the access macros described below.

Returns 0 on success and -1 with an exception set on failure, which in particular happens if memory allocation fails.

New in version 3.3.

Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

Return the length of the Unicode string, in code points. o has to be a Unicode object in the "canonical" representation (not checked).

New in version 3.3.

Py_UCS1* PyUnicode_1BYTE_DATA(PyObject *o)
Py_UCS2* PyUnicode_2BYTE_DATA(PyObject *o)
Py_UCS4* PyUnicode_4BYTE_DATA(PyObject *o)

Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4 integer types for direct character access. No check is made that the canonical representation actually has the matching character size; use PyUnicode_KIND() to select the right macro. Make sure PyUnicode_READY() has been called before accessing this.

New in version 3.3.

PyUnicode_WCHAR_KIND
PyUnicode_1BYTE_KIND
PyUnicode_2BYTE_KIND
PyUnicode_4BYTE_KIND

Return values of the PyUnicode_KIND() macro.

New in version 3.3.

int PyUnicode_KIND(PyObject *o)

Return one of the PyUnicode kind constants (see above) that indicate how many bytes per character this Unicode object uses to store its data. o has to be a Unicode object in the "canonical" representation (not checked).

New in version 3.3.

void* PyUnicode_DATA(PyObject *o)

Return a void pointer to the raw Unicode buffer. o has to be a Unicode object in the "canonical" representation (not checked).

New in version 3.3.

void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, Py_UCS4 value)

Write into canonical representation data (as obtained with PyUnicode_DATA()). This macro does not do any sanity checks and is intended for use in loops. The caller should cache the kind value and data pointer as obtained from other macro calls. index is the index in the string (starting at 0) and value is the new code point value to write at that location.

New in version 3.3.

Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

Read a code point from canonical representation data (as obtained with PyUnicode_DATA()). No checks or ready calls are performed.

New in version 3.3.
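The way PyUnicode_WRITE() and PyUnicode_READ() take a kind and a data pointer can be pictured with a standalone model. It is only a sketch and makes no Python API calls: the names model_kind, model_read, model_write and model_count are hypothetical, and the model assumes that a kind value is simply the number of bytes per code point. Like the macros, it does no bounds or range checking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kind constants: in this model the value
   is the number of bytes used per code point. */
enum model_kind { MODEL_1BYTE = 1, MODEL_2BYTE = 2, MODEL_4BYTE = 4 };

/* Read the code point at index from a buffer whose element width is kind. */
uint32_t model_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case MODEL_1BYTE:
        return ((const uint8_t *)data)[index];
    case MODEL_2BYTE:
        return ((const uint16_t *)data)[index];
    default:
        assert(kind == MODEL_4BYTE);
        return ((const uint32_t *)data)[index];
    }
}

/* Store value at index; the caller must ensure value fits the width. */
void model_write(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case MODEL_1BYTE:
        ((uint8_t *)data)[index] = (uint8_t)value;
        break;
    case MODEL_2BYTE:
        ((uint16_t *)data)[index] = (uint16_t)value;
        break;
    default:
        assert(kind == MODEL_4BYTE);
        ((uint32_t *)data)[index] = value;
        break;
    }
}

/* Loop usage: kind and data are obtained once and reused for every read. */
size_t model_count(int kind, const void *data, size_t length, uint32_t needle)
{
    size_t hits = 0;
    for (size_t i = 0; i < length; i++) {
        if (model_read(kind, data, i) == needle) {
            hits++;
        }
    }
    return hits;
}
```

In real extension code the corresponding values would come from PyUnicode_KIND(), PyUnicode_DATA() and PyUnicode_GET_LENGTH(), fetched once before the loop.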

Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

Read a character from a Unicode object o, which must be in the "canonical" representation. This is less efficient than PyUnicode_READ() if you do multiple consecutive reads.

New in version 3.3.

PyUnicode_MAX_CHAR_VALUE(PyObject *o)

Return the maximum code point that is suitable for creating another string based on o, which must be in the "canonical" representation. This is always an approximation but more efficient than iterating over the string.

New in version 3.3.

int PyUnicode_ClearFreeList()

Clear the free list. Return the total number of freed items.

Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

Return the size of the deprecated Py_UNICODE representation, in code units (this includes surrogate pairs as 2 units). o has to be a Unicode object (not checked).

Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using PyUnicode_GET_LENGTH().
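On a platform where Py_UNICODE is a 16-bit type, a code point of 65536 or above occupies a surrogate pair, so the size in code units can exceed the length in code points. A minimal standalone sketch of that counting follows; the helper name utf16_code_units is hypothetical and the code makes no Python API calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of 16-bit code units needed to store the
   given code points, counting each surrogate pair as 2 units. */
size_t utf16_code_units(const uint32_t *code_points, size_t length)
{
    assert(code_points != NULL || length == 0);

    size_t units = 0;
    for (size_t i = 0; i < length; i++) {
        /* Code points at or above 0x10000 do not fit in one 16-bit unit. */
        units += (code_points[i] >= 0x10000u) ? 2 : 1;
    }
    return units;
}
```

For a three-code-point string containing one code point above 65535, this model reports 4 code units, while the length in code points stays 3.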

Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

Return the size of the deprecated Py_UNICODE representation in bytes. o has to be a Unicode object (not checked).

Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using PyUnicode_GET_LENGTH().

Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
const char* PyUnicode_AS_DATA(PyObject *o)

Return a pointer to a Py_UNICODE representation of the object. The returned buffer is always terminated with an extra null code point. It may also contain embedded null code points, which would cause the string to be truncated when used in most C functions. The AS_DATA form casts the pointer to const char *. The o argument has to be a Unicode object (not checked).

Changed in version 3.3: This macro is now inefficient, because in many cases the Py_UNICODE representation does not exist and needs to be created, and it can fail (return NULL with an exception set). Try to port the code to use the new PyUnicode_nBYTE_DATA() macros or use PyUnicode_WRITE() or PyUnicode_READ().

Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using the PyUnicode_nBYTE_DATA() family of macros.

Unicode Character Properties

Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration.

int Py_UNICODE_ISSPACE(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is a whitespace character.

int Py_UNICODE_ISLOWER(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is a lowercase character.

int Py_UNICODE_ISUPPER(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is an uppercase character.

int Py_UNICODE_ISTITLE(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is a titlecase character.

int