urllib.parse --- Parse URLs into components
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a "relative URL" to an absolute URL given a "base URL."
The module has been designed to match the internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: file, ftp,
gopher, hdl, http, https, imap, itms-services, mailto, mms,
news, nntp, prospero, rsync, rtsp, rtsps, rtspu,
sftp, shttp, sip, sips, snews, svn, svn+ssh,
telnet, wais, ws, wss.
CPython implementation detail: The inclusion of the itms-services URL scheme can prevent an app from
passing Apple's App Store review process for the macOS and iOS App Stores.
Handling for the itms-services scheme is always removed on iOS; on
macOS, it may be removed if CPython has been built with the
--with-app-store-compliance option.
The urllib.parse module defines functions that fall into two broad
categories: URL parsing and URL quoting. These are covered in detail in
the following sections.
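As a quick orientation, the sketch below exercises one or two functions from each category. The URL and file name are made up for illustration; the functions themselves (urlsplit, urlunsplit, urljoin, quote, unquote) are part of urllib.parse.

```python
from urllib.parse import quote, unquote, urljoin, urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "https://example.com/docs/guide.html?lang=en#intro"

# URL parsing: split into components, then recombine them.
parts = urlsplit(url)
rebuilt = urlunsplit(parts)
print(parts.scheme, parts.netloc, parts.path)  # https example.com /docs/guide.html
print(rebuilt == url)                          # True

# URL parsing: resolve a relative reference against a base URL.
joined = urljoin(url, "../images/logo.png")
print(joined)                                  # https://example.com/images/logo.png

# URL quoting: percent-encode and decode a path segment.
quoted = quote("release notes.txt")
print(quoted)                                  # release%20notes.txt
print(unquote(quoted))                         # release notes.txt
```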
This module's functions use the deprecated term netloc (or net_loc),
which was introduced in RFC 1808. However, this term has been obsoleted by
RFC 3986, which introduced the term authority as its replacement.
The use of netloc is continued for backward compatibility.
URL Parsing
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
- urllib.parse.urlsplit(urlstring, scheme=None, allow_fragments=True)
Parse a URL into five components, returning a 5-item named tuple SplitResult or SplitResultBytes. This corresponds to the general structure of a URL: scheme://netloc/path?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:

>>> from urllib.parse import urlsplit
>>> urlsplit("scheme://netloc/path?query#fragment")
SplitResult(scheme='scheme', netloc='netloc', path='/path', query='query', fragment='fragment')
>>> o = urlsplit("http://docs.python.org:80/3/library/urllib.parse.html?"
...              "highlight=params#url-parsing")
>>> o
SplitResult(scheme='http', netloc='docs.python.org:80', path='/3/library/urllib.parse.html', query='highlight=params', fragment='url-parsing')
>>> o.scheme
'http'
>>> o.netloc
'docs.python.org:80'
>>> o.hostname
'docs.python.org'
>>> o.port
80
>>> o._replace(fragment="").geturl()
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'

Following the syntax specifications in RFC 1808, urlsplit() recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

>>> from urllib.parse import urlsplit
>>> urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', query='', fragment='')
>>> urlsplit('www.cwi.nl/%7Eguido/Python.html')
SplitResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', query='', fragment='')
>>> urlsplit('help/Python.html')
SplitResult(scheme='', netloc='', path='help/Python.html', query='', fragment='')

The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one. It should be the same type (text or bytes) as urlstring, except that the default value '' is always allowed, and is automatically converted to b'' if appropriate.
If the allow_fragments argument is false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.
The return value is a named tuple, which means that its items can be accessed by index or as named attributes, which are:

Attribute   Index   Value                                 Value if not present
---------   -----   -----------------------------------   --------------------
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
query       3       Query component                       empty string
fragment    4       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None

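The following is a small sketch of how the attributes listed above behave, together with the scheme and allow_fragments arguments. The URLs and credentials are invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL with credentials and a port, for illustration only.
result = urlsplit("HTTP://user:secret@Example.COM:8080/a/b?x=1#top")

# Index access and attribute access return the same item.
assert result[0] == result.scheme == "http"  # the scheme is lowercased
assert result[1] == result.netloc == "user:secret@Example.COM:8080"

# The derived attributes are computed from netloc; they are not tuple items.
print(len(result))                       # 5
print(result.username, result.password)  # user secret
print(result.hostname, result.port)      # example.com 8080

# The scheme argument is used only when the URL has no scheme of its own.
relative = urlsplit("a/b?x=1", scheme="https")
print(relative.scheme)    # https
print(relative.hostname)  # None (there is no netloc)

# With allow_fragments=False, '#top' stays in the preceding component.
no_frag = urlsplit("http://example.com/a#top", allow_fragments=False)
print(no_frag.path, repr(no_frag.fragment))  # /a#top ''
```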
Reading the port attribute will raise a ValueError if an invalid port is specified in the URL. See section Structured Parse Results for more information on the result object.
Unmatched square brackets in the netloc attribute will raise a ValueError.
Characters in the netloc attribute that decompose under NFKC normalization (as used by the IDNA encoding) into any of /, ?, #, @, or : will raise a ValueError. If the URL is decomposed before parsing, no error will be raised.
Following some of the WHATWG spec that updates RFC 3986, leading C0 control and space characters are stripped from the URL. \n, \r and tab \t characters are removed from the URL at any position.
As is the case with all named tuples, the subclass has a few additional methods and attributes that are particularly useful. One such method is _replace(). The _replace() method will return a new SplitResult object replacing specified fields with new values.

>>> from urllib.parse import urlsplit
>>> u = urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
>>> u
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', query='', fragment='')
>>> u._replace(scheme='http')
SplitResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', query='', fragment='')

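The error conditions and character stripping described above can be sketched as follows. The helper port_or_error and all URLs are hypothetical, and the exact exception messages are deliberately not shown because they may differ between versions.

```python
from urllib.parse import urlsplit

def port_or_error(url):
    """Hypothetical helper: return the port, or a marker string if it is invalid."""
    try:
        return urlsplit(url).port
    except ValueError:
        return "invalid port"

# Parsing itself succeeds; the error surfaces only when .port is read.
print(port_or_error("http://example.com:8080/"))   # 8080
print(port_or_error("http://example.com:99999/"))  # invalid port

# Unmatched square brackets in the netloc fail at parse time.
try:
    urlsplit("http://[::1/")
    bracket_error = None
except ValueError as exc:
    bracket_error = exc

# U+FF03 (fullwidth number sign) turns into '#' under NFKC, so it is rejected.
try:
    urlsplit("http://example.com\uff03@other.example/")
    nfkc_error = None
except ValueError as exc:
    nfkc_error = exc

print(bracket_error is not None, nfkc_error is not None)  # True True

# Tabs and newlines are removed wherever they appear in the URL.
cleaned = urlsplit("http://exa\tmple.com/pa\nth")
print(cleaned.netloc, cleaned.path)  # example.com /path
```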
Warning: urlsplit() does not perform validation. See URL parsing security for details.
Changed in version 3.2: Added IPv6 URL parsing capabilities.
Changed in version 3.3: The fragment is now parsed for all URL schemes (unless allow_fragments is false), in accordance with RFC 3986. Previously, an allowlist of schemes that support fragments existed.
Changed in version 3.6: Out-of-range port numbers now raise ValueError, instead of returning None.
Changed in version 3.8: Characters that affect netloc parsing under NFKC normalization will now raise ValueError.
Changed in version 3.10: ASCII newline and tab characters are stripped from the URL.
Changed in version 3.12: Leading WHATWG C0 control and space characters are stripped from the URL.
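Because urlsplit() does not validate its input, callers usually need their own checks. The sketch below is one possible approach, not a complete validator: the function name looks_like_web_url and the ALLOWED_SCHEMES policy are assumptions made for this example.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: policy chosen for this sketch

def looks_like_web_url(url):
    """Hypothetical check layered on top of urlsplit(); not a complete validator."""
    try:
        parts = urlsplit(url)
        _ = parts.port  # reading .port may raise ValueError
    except ValueError:
        return False
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.hostname)

# urlsplit() accepts both of these without raising...
no_host = urlsplit("http:///no-host")
script = urlsplit("javascript:alert(1)")
print(repr(no_host.netloc), script.scheme)  # '' javascript

# ...so the caller decides what counts as acceptable.
print(looks_like_web_url("https://example.com/a"))     # True
print(looks_like_web_url("http:///no-host"))           # False
print(looks_like_web_url("javascript:alert(1)"))       # False
print(looks_like_web_url("http://example.com:99999"))  # False
```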
- urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None, separator='&')
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
The optional argument keep_blank_values is a flag indicating whether blank values in percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. The default false value indicates that blank values are to be ignored and treated as if they were not included.
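A minimal sketch of the dictionary shape and of the keep_blank_values flag, using made-up query strings:

```python
from urllib.parse import parse_qs

query = "tag=python&tag=urls&empty=&page=2"  # hypothetical query string

# Repeated names collect into one list; blank values are dropped by default.
default = parse_qs(query)
print(default)  # {'tag': ['python', 'urls'], 'page': ['2']}

# keep_blank_values=True retains 'empty' with an empty-string value.
kept = parse_qs(query, keep_blank_values=True)
print(kept)  # {'tag': ['python', 'urls'], 'empty': [''], 'page': ['2']}

# Percent escapes and '+' are decoded in the returned values.
decoded = parse_qs("q=a%20b+c")
print(decoded)  # {'q': ['a b c']}
```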
The optional argument strict_parsing is a flag indicating what to do with parsing errors. If false (the default), errors are silently ignored. If true, errors raise a