Unicode Objects and Codecs
Unicode Objects
Unicode Type
These are the basic Unicode object types used for the Unicode implementation in Python:
-
Py_UNICODE
This type represents the storage type which Python uses internally as the basis for holding Unicode ordinals. Python’s default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds then use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).
Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep this in mind when writing extensions or interfaces.
-
PyTypeObject PyUnicode_Type
This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as unicode and types.UnicodeType.
The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects:
-
int PyUnicode_Check(PyObject *o)
Return true if the object o is a Unicode object or an instance of a Unicode subtype.
Changed in version 2.2: Allowed subtypes to be accepted.
-
int PyUnicode_CheckExact(PyObject *o)
Return true if the object o is a Unicode object, but not an instance of a subtype.
New in version 2.2.
-
Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
Return the size of the object, counted in Py_UNICODE units. o has to be a PyUnicodeObject (not checked).
Changed in version 2.5: This macro previously returned an int type. This might require changes in your code for properly supporting 64-bit systems.
-
Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
Return the size of the object’s internal buffer in bytes. o has to be a PyUnicodeObject (not checked).
Changed in version 2.5: This macro previously returned an int type. This might require changes in your code for properly supporting 64-bit systems.
-
Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
Return a pointer to the internal Py_UNICODE buffer of the object. o has to be a PyUnicodeObject (not checked).
-
const char* PyUnicode_AS_DATA(PyObject *o)
Return a pointer to the internal buffer of the object. o has to be a PyUnicodeObject (not checked).
-
int PyUnicode_ClearFreeList()
Clear the free list. Return the total number of freed items.
New in version 2.6.
Unicode Character Properties
Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration.
-
int Py_UNICODE_ISSPACE(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a whitespace character.
-
int Py_UNICODE_ISLOWER(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a lowercase character.
-
int Py_UNICODE_ISUPPER(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is an uppercase character.
-
int Py_UNICODE_ISTITLE(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a titlecase character.
-
int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a linebreak character.
-
int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a decimal character.
-
int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a digit character.
-
int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is a numeric character.
-
int Py_UNICODE_ISALPHA(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is an alphabetic character.
-
int Py_UNICODE_ISALNUM(Py_UNICODE ch)
Return 1 or 0 depending on whether ch is an alphanumeric character.
These APIs can be used for fast direct character conversions:
-
Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
Return the character ch converted to lower case.
-
Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
Return the character ch converted to upper case.
-
Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
Return the character ch converted to title case.
-
int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
Return the character ch converted to a non-negative decimal integer. Return -1 if this is not possible. This macro does not raise exceptions.
-
int Py_UNICODE_TODIGIT(Py_UNICODE ch)
Return the character ch converted to a single digit integer. Return -1 if this is not possible. This macro does not raise exceptions.
-
double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
Return the character ch converted to a double. Return -1.0 if this is not possible. This macro does not raise exceptions.
Plain Py_UNICODE
To create Unicode objects and access their basic sequence properties, use these APIs:
-
PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
Return value: New reference.
Create a Unicode object from the Py_UNICODE buffer u of the given size. u may be NULL which causes the contents to be undefined. It is the user’s responsibility to fill in the needed data. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object. Therefore, modification of the resulting Unicode object is only allowed when u is NULL.
Changed in version 2.5: This function used an int type for size. This might require changes in your code for properly supporting 64-bit systems.
-
PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
Return value: New reference.
Create a Unicode object from the char buffer u. The bytes will be interpreted as being UTF-8 encoded. u may also be NULL which causes the contents to be undefined. It is the user’s responsibility to fill in the needed data. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object. Therefore, modification of the resulting Unicode object is only allowed when u is NULL.
New in version 2.6.
-
PyObject* PyUnicode_FromString(const char *u)
Return value: New reference.
Create a Unicode object from a UTF-8 encoded null-terminated char buffer u.
New in version 2.6.
-
PyObject* PyUnicode_FromFormat(const char *format, ...)
Return value: New reference.
Take a C printf()-style format string and a variable number of arguments, calculate the size of the resulting Python unicode string and return a string with the values formatted into it. The variable arguments must be C types and must correspond exactly to the format characters in the format string. The following format characters are allowed:

Format Characters   Type                 Comment
%%                  n/a                  The literal % character.
%c                  int                  A single character, represented as a C int.
%d                  int                  Exactly equivalent to printf("%d").
%u                  unsigned int         Exactly equivalent to printf("%u").
%ld                 long                 Exactly equivalent to printf("%ld").
%lu                 unsigned long        Exactly equivalent to printf("%lu").
%zd                 Py_ssize_t           Exactly equivalent to printf("%zd").
%zu                 size_t               Exactly equivalent to printf("%zu").
%i                  int                  Exactly equivalent to printf("%i").
%x                  int                  Exactly equivalent to printf("%x").
%s                  char*                A null-terminated C character array.
%p                  void*                The hex representation of a C pointer. Mostly equivalent to printf("%p") except that it is guaranteed to start with the literal 0x regardless of what the platform’s printf yields.
%U                  PyObject*            A unicode object.
%V                  PyObject*, char *    A unicode object (which may be NULL) and a null-terminated C character array as a second parameter (which will be used, if the first parameter is NULL).
%S                  PyObject*            The result of calling PyObject_Unicode().
%R                  PyObject*            The result of calling PyObject_Repr().

An unrecognized format character causes all the rest of the format string to be copied as-is to the result string, and any extra arguments discarded.
New in version 2.6.
-
PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
Return value: New reference.
Identical to PyUnicode_FromFormat() except that it takes exactly two arguments.
New in version 2.6.
