Unicode Objects and Codecs¶
Unicode Objects¶
Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536; otherwise, code points must be below 1114112 (which is the full Unicode range).
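The thresholds above can be made concrete with a small standalone sketch. It does not use the CPython headers; `bytes_per_code_point` and `width_for_string` are hypothetical helper names, and the code only models how a storage width could be picked from the largest code point in a string. Strings below 128 and below 256 both fit in one byte per code point, so the sketch treats them alike.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the CPython API: return the number of
   bytes per code point (1, 2 or 4) needed for a string whose largest code
   point is max_cp, following the thresholds described in the text. */
static int bytes_per_code_point(uint32_t max_cp)
{
    assert(max_cp < 1114112);   /* must lie inside the Unicode range */
    if (max_cp < 256)
        return 1;               /* covers both the <128 and <256 cases */
    if (max_cp < 65536)
        return 2;
    return 4;
}

/* Hypothetical helper: scan a buffer of code points and report the width
   that a compact representation would need for it. */
static int width_for_string(const uint32_t *cps, size_t n)
{
    uint32_t max_cp = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > max_cp)
            max_cp = cps[i];
    }
    return bytes_per_code_point(max_cp);
}
```

A single code point at or above 65536 forces four bytes per code point for the whole string in this model, which is why the width is derived from the maximum rather than chosen per character.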
Py_UNICODE* and UTF-8 representations are created on demand and cached
in the Unicode object. The Py_UNICODE* representation is deprecated
and inefficient; it should be avoided in performance- or memory-sensitive
situations.
Due to the transition between the old APIs and the new APIs, Unicode objects can internally be in two states depending on how they were created:
- "canonical" Unicode objects are all objects created by a non-deprecated Unicode API. They use the most efficient representation allowed by the implementation.
- "legacy" Unicode objects have been created through one of the deprecated APIs (typically PyUnicode_FromUnicode()) and only bear the Py_UNICODE* representation; you will have to call PyUnicode_READY() on them before calling any other API.
Unicode Type¶
These are the basic Unicode object types used for the Unicode implementation in Python:
-
Py_UCS4¶ -
Py_UCS2¶ -
Py_UCS1¶ These types are typedefs for unsigned integer types wide enough to contain characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with single Unicode characters, use
Py_UCS4.
New in version 3.3.
-
Py_UNICODE¶ This is a typedef of
wchar_t, which is a 16-bit type or 32-bit type depending on the platform.
Changed in version 3.3: In previous versions, this was a 16-bit type or a 32-bit type depending on whether you selected a "narrow" or "wide" Unicode version of Python at build time.
-
PyASCIIObject¶ -
PyCompactUnicodeObject¶ -
PyUnicodeObject¶ These subtypes of
PyObject represent a Python Unicode object. In almost all cases, they shouldn't be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.
New in version 3.3.
-
PyTypeObject
PyUnicode_Type¶ This instance of
PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.
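Because Py_UNICODE is a typedef of wchar_t, the number of code units needed for one code point depends on how wide that type is on the platform. The following standalone sketch illustrates the difference; it does not include the CPython headers, and `units_for_code_point` and `units_for_string` are hypothetical names. It assumes that a 16-bit unit stores code points above 0xFFFF as a surrogate pair, while a 32-bit unit stores every code point in a single unit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of code units needed for code point cp
   when each unit is unit_bits wide (16 or 32). */
static size_t units_for_code_point(uint32_t cp, int unit_bits)
{
    assert(unit_bits == 16 || unit_bits == 32);
    assert(cp <= 0x10FFFF);
    if (unit_bits == 16 && cp > 0xFFFF)
        return 2;               /* stored as a surrogate pair */
    return 1;
}

/* Hypothetical helper: total code units for a buffer of n code points. */
static size_t units_for_string(const uint32_t *cps, size_t n, int unit_bits)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += units_for_code_point(cps[i], unit_bits);
    return total;
}
```

On a given platform the unit width could be obtained as `sizeof(wchar_t) * CHAR_BIT`. The sketch shows why a length counted in code points and a size counted in Py_UNICODE code units can differ for the same string.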
The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects:
-
int
PyUnicode_Check(PyObject *o)¶ Return true if the object o is a Unicode object or an instance of a Unicode subtype.
-
int
PyUnicode_CheckExact(PyObject *o)¶ Return true if the object o is a Unicode object, but not an instance of a subtype.
-
int
PyUnicode_READY(PyObject *o)¶ Ensure the string object o is in the "canonical" representation. This is required before using any of the access macros described below.
Returns 0 on success and -1 with an exception set on failure, which in particular happens if memory allocation fails.
New in version 3.3.
-
Py_ssize_t
PyUnicode_GET_LENGTH(PyObject *o)¶ Return the length of the Unicode string, in code points. o has to be a Unicode object in the "canonical" representation (not checked).
New in version 3.3.
-
Py_UCS1*
PyUnicode_1BYTE_DATA(PyObject *o)¶ -
Py_UCS2*
PyUnicode_2BYTE_DATA(PyObject *o)¶ -
Py_UCS4*
PyUnicode_4BYTE_DATA(PyObject *o)¶ Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4 integer types for direct character access. No check is made that the canonical representation actually has the matching character size; use PyUnicode_KIND() to select the right macro. Make sure PyUnicode_READY() has been called before accessing this.
New in version 3.3.
-
PyUnicode_WCHAR_KIND¶ -
PyUnicode_1BYTE_KIND¶ -
PyUnicode_2BYTE_KIND¶ -
PyUnicode_4BYTE_KIND¶ Return values of the
PyUnicode_KIND() macro.
New in version 3.3.
-
int
PyUnicode_KIND(PyObject *o)¶ Return one of the PyUnicode kind constants (see above) that indicate how many bytes per character this Unicode object uses to store its data. o has to be a Unicode object in the "canonical" representation (not checked).
New in version 3.3.
-
void*
PyUnicode_DATA(PyObject *o)¶ Return a void pointer to the raw Unicode buffer. o has to be a Unicode object in the "canonical" representation (not checked).
New in version 3.3.
-
void
PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, Py_UCS4 value)¶ Write into the data of a canonical representation (as obtained with PyUnicode_DATA()). This macro does not do any sanity checks and is intended for usage in loops. The caller should cache the kind value and data pointer as obtained from other macro calls. index is the index in the string (starts at 0) and value is the new code point value which should be written to that location.
New in version 3.3.
-
Py_UCS4
PyUnicode_READ(int kind, void *data, Py_ssize_t index)¶ Read a code point from a canonical representation data (as obtained with
PyUnicode_DATA()). No checks or ready calls are performed.
New in version 3.3.
-
Py_UCS4
PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)¶ Read a character from a Unicode object o, which must be in the "canonical" representation. This is less efficient than PyUnicode_READ() if you do multiple consecutive reads.
New in version 3.3.
-
PyUnicode_MAX_CHAR_VALUE(PyObject *o)¶ Return the maximum code point that is suitable for creating another string based on o, which must be in the "canonical" representation. This is always an approximation but more efficient than iterating over the string.
New in version 3.3.
-
int
PyUnicode_ClearFreeList()¶ Clear the free list. Return the total number of freed items.
-
Py_ssize_t
PyUnicode_GET_SIZE(PyObject *o)¶ Return the size of the deprecated Py_UNICODE representation, in code units (this includes surrogate pairs as 2 units). o has to be a Unicode object (not checked).
Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using PyUnicode_GET_LENGTH().
-
Py_ssize_t
PyUnicode_GET_DATA_SIZE(PyObject *o)¶ Return the size of the deprecated Py_UNICODE representation in bytes. o has to be a Unicode object (not checked).
Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using PyUnicode_GET_LENGTH().
-
Py_UNICODE*
PyUnicode_AS_UNICODE(PyObject *o)¶ -
const char*
PyUnicode_AS_DATA(PyObject *o)¶ Return a pointer to a Py_UNICODE representation of the object. The returned buffer is always terminated with an extra null code point. It may also contain embedded null code points, which would cause the string to be truncated when used in most C functions. The AS_DATA form casts the pointer to const char *. The o argument has to be a Unicode object (not checked).
Changed in version 3.3: This macro is now inefficient, because in many cases the Py_UNICODE representation does not exist and needs to be created, and it can fail (return NULL with an exception set). Try to port the code to use the new PyUnicode_nBYTE_DATA() macros or use PyUnicode_WRITE() or PyUnicode_READ().
Deprecated since version 3.3, will be removed in version 4.0: Part of the old-style Unicode API, please migrate to using the PyUnicode_nBYTE_DATA() family of macros.
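The advice to cache the kind value and data pointer before looping can be illustrated with a standalone model of kind-dispatched access. This sketch deliberately avoids the CPython headers so that it compiles on its own: `MODEL_1BYTE_KIND` and its siblings, `model_read`, `model_write` and `count_non_ascii` are all hypothetical stand-ins, not CPython's implementation. In real extension code the kind, data pointer and length would instead come from PyUnicode_KIND(), PyUnicode_DATA() and PyUnicode_GET_LENGTH() after a successful PyUnicode_READY() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kind constants; the values are simply
   chosen as the number of bytes per code point. */
enum model_kind {
    MODEL_1BYTE_KIND = 1,
    MODEL_2BYTE_KIND = 2,
    MODEL_4BYTE_KIND = 4
};

/* Read the code point at index from a buffer of the given kind.
   Like the macros described above, no bounds checks are performed. */
static uint32_t model_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case MODEL_1BYTE_KIND:
        return ((const uint8_t *)data)[index];
    case MODEL_2BYTE_KIND:
        return ((const uint16_t *)data)[index];
    default:
        assert(kind == MODEL_4BYTE_KIND);
        return ((const uint32_t *)data)[index];
    }
}

/* Write value at index; the caller must ensure it fits the kind. */
static void model_write(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case MODEL_1BYTE_KIND:
        ((uint8_t *)data)[index] = (uint8_t)value;
        break;
    case MODEL_2BYTE_KIND:
        ((uint16_t *)data)[index] = (uint16_t)value;
        break;
    default:
        assert(kind == MODEL_4BYTE_KIND);
        ((uint32_t *)data)[index] = value;
        break;
    }
}

/* Count code points >= 128. kind and data are obtained once by the
   caller and reused for every read, as the text recommends. */
static size_t count_non_ascii(int kind, const void *data, size_t length)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (model_read(kind, data, i) >= 128)
            count++;
    }
    return count;
}
```

Passing the kind and data pointer into the loop, rather than looking them up from the object on every iteration, is the pattern that makes the READ-style access cheaper than repeated per-character lookups.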
Unicode Character Properties¶
Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration.
-
int
Py_UNICODE_ISSPACE(Py_UNICODE ch)¶ Return
1 or 0 depending on whether ch is a whitespace character.
-
int
Py_UNICODE_ISLOWER(Py_UNICODE ch)¶ Return
1 or 0 depending on whether ch is a lowercase character.
-
int
Py_UNICODE_ISUPPER(Py_UNICODE ch)¶ Return
1 or 0 depending on whether ch is an uppercase character.
-
int
Py_UNICODE_ISTITLE(Py_UNICODE ch)¶ Return
1 or 0 depending on whether ch is a titlecase character.
- int
