xml.parsers.expat --- Fast XML parsing using Expat


Note

If you need to parse untrusted or unauthenticated data, see XML security.

The xml.parsers.expat module is a Python interface to the Expat non-validating XML parser. The module provides a single extension type, xmlparser, that represents the current state of an XML parser. After an xmlparser object has been created, various attributes of the object can be set to handler functions. When an XML document is then fed to the parser, the handler functions are called for the character data and markup in the XML document.

This module uses the pyexpat module to provide access to the Expat parser. Direct use of the pyexpat module is deprecated.
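As a minimal sketch of the handler mechanism described above, the following assigns three handler functions to a parser and records the events they receive. The `events` list and the handler function names are illustrative choices, not part of the module.

```python
import xml.parsers.expat

events = []  # illustrative container: collected event tuples, in document order

def start_element(name, attrs):
    # Called for each start tag; attrs is a dict of attribute names to values.
    events.append(("start", name, attrs))

def end_element(name):
    # Called for each end tag.
    events.append(("end", name))

def char_data(data):
    # Called for character data; a text run may arrive in more than one call.
    events.append(("text", data))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# The second argument marks this as the final (and only) chunk of input.
parser.Parse('<parent id="top"><child>hi</child></parent>', True)
```

Because character data can be delivered in several pieces, code that needs the full text of an element should concatenate the pieces rather than rely on a single call.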

This module provides one exception and one type object:

exception xml.parsers.expat.ExpatError

The exception raised when Expat reports an error. See section ExpatError Exceptions for more information on interpreting Expat errors.

exception xml.parsers.expat.error

Alias for ExpatError.

xml.parsers.expat.XMLParserType

The type of the return values from the ParserCreate() function.

The xml.parsers.expat module contains two functions:

xml.parsers.expat.ErrorString(errno)

Returns an explanatory string for a given error number errno.
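One possible use of ErrorString() is to turn the numeric `code` attribute of a caught ExpatError into a readable message. The helper name `describe_failure` below is hypothetical and exists only for this sketch.

```python
import xml.parsers.expat as expat

def describe_failure(data):
    """Return Expat's explanation for why data failed to parse, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as exc:
        # exc.code holds Expat's internal error number for this failure.
        return expat.ErrorString(exc.code)
    return None

# The end tag </a> does not match the still-open <b> element.
message = describe_failure("<a><b></a>")
```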

xml.parsers.expat.ParserCreate(encoding=None, namespace_separator=None)

Creates and returns a new xmlparser object. encoding, if specified, must be a string naming the encoding used by the XML data. Expat doesn't support as many encodings as Python does, and its repertoire of encodings can't be extended; it supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII. If encoding [1] is given it will override the implicit or explicit encoding of the document.

Parsers created through ParserCreate() are called "root" parsers, in the sense that they do not have any parent parser attached. Non-root parsers are created by xmlparser.ExternalEntityParserCreate().

Expat can optionally do XML namespace processing for you, enabled by providing a value for namespace_separator. The value must be a one-character string; a ValueError will be raised if the string has an illegal length (None is considered the same as omission). When namespace processing is enabled, element type names and attribute names that belong to a namespace will be expanded. The element name passed to the element handlers StartElementHandler and EndElementHandler will be the concatenation of the namespace URI, the namespace separator character, and the local part of the name. If the namespace separator is a zero byte (chr(0)) then the namespace URI and the local part will be concatenated without any separator.

For example, if namespace_separator is set to a space character (' ') and the following document is parsed:

<?xml version="1.0"?>
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
  <py:elem1 />
  <elem2 xmlns="" />
</root>

StartElementHandler will receive the following strings for each element:

http://default-namespace.org/ root
http://www.python.org/ns/ elem1
elem2
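The namespace example above can be reproduced with a short script. This is a sketch only; the `names` list and the lambda handler are illustrative.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
  <py:elem1 />
  <elem2 xmlns="" />
</root>
"""

names = []  # element names as passed to StartElementHandler

# A one-character separator enables namespace processing.
parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.Parse(DOCUMENT, True)
```

After parsing, `names` holds the three strings listed above: the two namespaced elements carry their URI and the separator, while `elem2`, which resets the default namespace, is reported by its local name alone.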

Due to limitations in the Expat library used by pyexpat, the xmlparser instance returned can only be used to parse a single XML document. Call ParserCreate for each document to provide unique parser instances.
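The single-document restriction can be illustrated as follows. The helper `parse_each` is a hypothetical name; it creates a fresh parser per document, and the second half of the sketch shows that reusing a finished parser raises ExpatError.

```python
import xml.parsers.expat

def parse_each(documents):
    """Parse several documents, creating one parser instance per document."""
    names = []
    for text in documents:
        parser = xml.parsers.expat.ParserCreate()  # fresh parser every time
        parser.StartElementHandler = lambda name, attrs: names.append(name)
        parser.Parse(text, True)
    return names

element_names = parse_each(["<first/>", "<second/>"])

# Reusing a parser after its final Parse() call is rejected by Expat.
reused = xml.parsers.expat.ParserCreate()
reused.Parse("<first/>", True)
try:
    reused.Parse("<second/>", True)
    reuse_failed = False
except xml.parsers.expat.ExpatError:
    reuse_failed = True
```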

See also

The Expat XML Parser

Home page of the Expat project.

XMLParser Objects

xmlparser objects have the following methods:

xmlparser.Parse(data[, isfinal])

Parses the contents of the string data, calling the appropriate handler functions to process the parsed data. isfinal must be true on the final call to this method; it allows the parsing of a single file in fragments, not the submission of multiple files. data can be the empty string at any time.
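A minimal sketch of feeding one document in fragments: every chunk is passed with isfinal false, and an empty final call signals the end of input. The function name `parse_in_chunks` and the way the input is split are assumptions made for illustration.

```python
import xml.parsers.expat

def parse_in_chunks(chunks):
    """Feed one document to Expat piece by piece and return its element names."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    for chunk in chunks:
        parser.Parse(chunk, False)  # more input may follow
    parser.Parse("", True)          # empty final call closes the document
    return names

# Chunk boundaries may fall anywhere, including in the middle of a tag.
names = parse_in_chunks(["<doc><it", "em/><item", "/></doc>"])
```

Handlers are not guaranteed to fire immediately after each chunk is pushed, so results should only be treated as complete once the final call has returned.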

xmlparser.ParseFile(file)

Parse XML data reading from the object file. file only needs to provide the read(nbytes) method, returning the empty string when there's no more data.
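As a sketch, any object with a suitable read() method can be passed to ParseFile(). This example assumes a binary file-like object, here an in-memory io.BytesIO standing in for a file opened in binary mode.

```python
import io
import xml.parsers.expat

names = []
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)

# io.BytesIO provides read(nbytes) and returns b"" once the data is exhausted.
source = io.BytesIO(b"<doc><item/><item/></doc>")
parser.ParseFile(source)
```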

xmlparser.SetBase(base)

Sets the base to be used for resolving relative URIs in system identifiers in declarations. Resolving relative identifiers is left to the application: this value will be passed through as the base argument to the ExternalEntityRefHandler(), NotationDeclHandler(), and UnparsedEntityDeclHandler() functions.

xmlparser.GetBase()

Returns a string containing the base set by a previous call to SetBase(), or None if SetBase() hasn't been called.
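The relationship between SetBase() and GetBase() can be sketched briefly. The URL is a placeholder value chosen for this example.

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()

before = parser.GetBase()                  # no base has been set yet
parser.SetBase("http://example.org/docs/")  # placeholder base URI
after = parser.GetBase()
```

Expat does not resolve anything itself: the stored value is handed back to the application through the base argument of the relevant handlers.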

xmlparser.GetInputContext()

Returns the input data that generated the current event as a string. The data is in the encoding of the entity which contains the text. When called while an event handler is not active, the return value is None.

xmlparser.ExternalEntityParserCreate(context[, encoding])

Create a "child" parser which can be used to parse an external parsed entity referred to by content parsed by the parent parser. The context parameter should be the string passed to the ExternalEntityRefHandler() handler function, described below. The child parser is created with the ordered_attributes and specified_attributes set to the values of this parser.
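The following sketch shows one way a child parser might be used inside an ExternalEntityRefHandler. The in-memory `EXTERNAL_ENTITIES` mapping is a hypothetical stand-in for whatever lookup the application performs; real code should be careful about fetching external entities for untrusted input.

```python
import xml.parsers.expat

# Hypothetical in-memory store mapping system identifiers to entity content.
EXTERNAL_ENTITIES = {"ext.xml": b"<child/>"}

DOCUMENT = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE root [<!ENTITY ext SYSTEM "ext.xml">]>'
    b"<root>&ext;</root>"
)

names = []

def on_start(name, attrs):
    names.append(name)

def on_external_entity(context, base, system_id, public_id):
    # Create a child of the root parser for this external parsed entity.
    child = parser.ExternalEntityParserCreate(context)
    child.StartElementHandler = on_start
    child.Parse(EXTERNAL_ENTITIES[system_id], True)
    return 1  # a non-zero return value tells Expat the entity was handled

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start
parser.ExternalEntityRefHandler = on_external_entity
parser.Parse(DOCUMENT, True)
```

The child parser handles the entity's content while the parent is paused at the reference, so elements from both appear in document order.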

xmlparser.SetParamEntityParsing(flag)

Control parsing of parameter entities (including the external DTD subset). Possible flag values are XML_PARAM_ENTITY_PARSING_NEVER, XML_PARAM_ENTITY_PARSING_UNLESS_STANDALONE and XML_PARAM_ENTITY_PARSING_ALWAYS. Return true if setting the flag was successful.

xmlparser.UseForeignDTD([flag])

Calling this with a true value for flag (the default) will cause Expat to call the ExternalEntityRefHandler with None for all arguments to allow an alternate DTD to be loaded. If the document does not contain a document type declaration, the ExternalEntityRefHandler will still be called, but the StartDoctypeDeclHandler and EndDoctypeDeclHandler will not be called.

Passing a false value for flag will cancel a previous call that passed a true value, but otherwise has no effect.

This method can only be called before the Parse() or ParseFile() methods are called; calling it after either of those have been called causes ExpatError to be raised with the code attribute set to errors.codes[errors.XML_ERROR_CANT_CHANGE_FEATURE_ONCE_PARSING].

xmlparser.SetReparseDeferralEnabled(enabled)

Warning

Calling SetReparseDeferralEnabled(False) has security implications, as detailed below; please make sure to understand these consequences prior to using the SetReparseDeferralEnabled method.

Expat 2.6.0 introduced a security mechanism called "reparse deferral": to avoid denial of service through quadratic runtime from reparsing large tokens, reparsing of unfinished tokens is now delayed by default until a sufficient amount of input is reached. Due to this delay, registered handlers may, depending on the sizing of input chunks pushed to Expat, no longer be called right after pushing new input to the parser. Where immediate feedback and taking over responsibility of protecting against denial of service from large tokens are both wanted, calling SetReparseDeferralEnabled(False) disables reparse deferral for the current Expat parser instance, temporarily or altogether. Calling SetReparseDeferralEnabled(True) allows re-enabling reparse deferral.

Note that SetReparseDeferralEnabled() has been backported to some prior releases of CPython as a security fix. Check for availability of SetReparseDeferralEnabled() using hasattr() if used in code running across a variety of Python versions.

Added in version 3.13.
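A minimal sketch of the hasattr() check suggested above, for code that has to run on Python versions with and without this method. The function name `make_low_latency_parser` is hypothetical, and disabling reparse deferral is only appropriate when the application itself guards against very large tokens.

```python
import xml.parsers.expat

def make_low_latency_parser():
    """Create a parser, disabling reparse deferral when the method exists."""
    parser = xml.parsers.expat.ParserCreate()
    supported = hasattr(parser, "SetReparseDeferralEnabled")
    if supported:
        # Security trade-off: the caller takes over protection against
        # denial of service from large tokens.
        parser.SetReparseDeferralEnabled(False)
    return parser, supported

parser, supported = make_low_latency_parser()
```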

xmlparser.GetReparseDeferralEnabled()

Returns whether reparse deferral is currently enabled for the given Expat parser instance.

Added in version 3.13.

xmlparser objects have the following methods to mitigate some common XML vulnerabilities.

xmlparser.SetAllocTrackerActivationThreshold(threshold, /)

Sets the number of allocated bytes of dynamic memory needed to activate protection against disproportionate use of RAM.

By default, parser objects have an allocation activation threshold of 64 MiB, or equivalently 67,108,864 bytes.

An ExpatError is raised if this method is called on a non-root parser. The corresponding lineno and offset should not be used as they may have no special meaning.

Added in version 3.14.1.

xmlparser.SetAllocTrackerMaximumAmplification(max_factor, /)

Sets the maximum amplification factor between direct input and bytes of dynamic memory allocated.

The amplification factor is calculated as allocated / direct while parsing, where direct is the number of bytes read from the primary document in parsing and allocated is the number of bytes of dynamic memory allocated in the parser hierarchy.

The max_factor value must be a non-NaN