urllib.parse --- Parse URLs into components
This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a "relative URL" to an absolute URL given a "base URL."
The module has been designed to match the internet RFC on Relative Uniform
Resource Locators. It supports the following URL schemes: file, ftp,
gopher, hdl, http, https, imap, itms-services, mailto, mms,
news, nntp, prospero, rsync, rtsp, rtsps, rtspu,
sftp, shttp, sip, sips, snews, svn, svn+ssh,
telnet, wais, ws, wss.
CPython implementation detail: The inclusion of the itms-services URL scheme can prevent an app from
passing Apple's App Store review process for the macOS and iOS App Stores.
Handling for the itms-services scheme is always removed on iOS; on
macOS, it may be removed if CPython has been built with the
--with-app-store-compliance option.
The urllib.parse module defines functions that fall into two broad
categories: URL parsing and URL quoting. These are covered in detail in
the following sections.
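As a rough orientation, the sketch below touches both categories: splitting and recombining a URL, resolving a relative URL against a base, and quoting a path. The example.com URLs and variable names are illustrative assumptions, not part of the module.

```python
from urllib.parse import quote, unquote, urljoin, urlsplit, urlunsplit

# URL parsing: split into components, then combine them again.
url = "https://example.com/docs/index.html?lang=en#top"
parts = urlsplit(url)
rebuilt = urlunsplit(parts)

# Convert a relative URL to an absolute one, given a base URL.
absolute = urljoin("https://example.com/docs/index.html", "../images/logo.png")

# URL quoting: percent-encode characters that are unsafe in a path,
# and reverse the encoding with unquote().
quoted = quote("/path with spaces/é")
restored = unquote(quoted)

print(parts.netloc)   # example.com
print(absolute)       # https://example.com/images/logo.png
print(quoted)         # /path%20with%20spaces/%C3%A9
```

Note that quote() leaves "/" untouched by default, which is why the path separators survive the round trip.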
This module's functions use the deprecated term netloc (or net_loc),
which was introduced in RFC 1808. However, this term has been obsoleted by
RFC 3986, which introduced the term authority as its replacement.
The use of netloc is continued for backward compatibility.
URL Parsing
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
- urllib.parse.urlsplit(urlstring, scheme=None, allow_fragments=True)
Parse a URL into five components, returning a 5-item named tuple SplitResult or SplitResultBytes. This corresponds to the general structure of a URL: scheme://netloc/path?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up into smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:
>>> from urllib.parse import urlsplit
>>> urlsplit("scheme://netloc/path?query#fragment")
SplitResult(scheme='scheme', netloc='netloc', path='/path', query='query', fragment='fragment')
>>> o = urlsplit("http://docs.python.org:80/3/library/urllib.parse.html?"
...              "highlight=params#url-parsing")
>>> o
SplitResult(scheme='http', netloc='docs.python.org:80', path='/3/library/urllib.parse.html', query='highlight=params', fragment='url-parsing')
>>> o.scheme
'http'
>>> o.netloc
'docs.python.org:80'
>>> o.hostname
'docs.python.org'
>>> o.port
80
>>> o._replace(fragment="").geturl()
'http://docs.python.org:80/3/library/urllib.parse.html?highlight=params'
Following the syntax specifications in RFC 1808, urlsplit() recognizes a netloc only if it is properly introduced by '//'. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
>>> from urllib.parse import urlsplit
>>> urlsplit('//www.cwi.nl:80/%7Eguido/Python.html')
SplitResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', query='', fragment='')
>>> urlsplit('www.cwi.nl/%7Eguido/Python.html')
SplitResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', query='', fragment='')
>>> urlsplit('help/Python.html')
SplitResult(scheme='', netloc='', path='help/Python.html', query='', fragment='')
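Because a bare "host/path" string is treated as a path, code that accepts loosely typed input sometimes prepends '//' before splitting. The helper below is a minimal sketch of that idea; split_assuming_host is a hypothetical name, not a function provided by urllib.parse, and the heuristic is an assumption about what the caller wants.

```python
from urllib.parse import urlsplit


def split_assuming_host(text):
    """Hypothetical helper: treat scheme-less, slash-less input as 'host/path'."""
    parts = urlsplit(text)
    # Only retry when nothing marked a scheme or netloc and the input
    # does not already look like an absolute path.
    if not parts.scheme and not parts.netloc and not text.startswith("/"):
        parts = urlsplit("//" + text)
    return parts


guessed = split_assuming_host("www.cwi.nl/%7Eguido/Python.html")
print(guessed.netloc)  # www.cwi.nl
print(guessed.path)    # /%7Eguido/Python.html
```

This heuristic would also turn a genuinely relative path such as 'help/Python.html' into a netloc of 'help', so it is only appropriate where the input is known to name a host.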
The scheme argument gives the default addressing scheme, to be used only if the URL does not specify one. It should be the same type (text or bytes) as urlstring, except that the default value '' is always allowed, and is automatically converted to b'' if appropriate.
If the allow_fragments argument is false, fragment identifiers are not recognized. Instead, they are parsed as part of the path, parameters or query component, and fragment is set to the empty string in the return value.
The return value is a named tuple, which means that its items can be accessed by index or as named attributes, which are:
Attribute   Index   Value                                 Value if not present
---------   -----   -----------------------------------   --------------------
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
query       3       Query component                       empty string
fragment    4       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None
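The following sketch illustrates the scheme and allow_fragments arguments together with index and attribute access on the result. The example.com URLs, the credentials, and the variable names are illustrative assumptions.

```python
from urllib.parse import SplitResultBytes, urlsplit

# The scheme argument is only a fallback: it is ignored when the URL has one.
fallback = urlsplit("//www.example.com/index.html", scheme="https")
explicit = urlsplit("http://www.example.com/index.html", scheme="https")

# Bytes input yields a SplitResultBytes whose items are bytes.
as_bytes = urlsplit(b"//www.example.com/index.html")

# With allow_fragments=False, '#...' stays in the preceding component.
no_frag = urlsplit("http://www.example.com/a?q=1#section", allow_fragments=False)

# Items can be read by index, by name, or by unpacking the five fields.
parts = urlsplit("https://User:Secret@WWW.Example.COM:8443/a/b?x=1#frag")
scheme, netloc, path, query, fragment = parts

print(fallback.scheme, explicit.scheme)  # https http
print(no_frag.query)                     # q=1#section
print(parts[1])                          # User:Secret@WWW.Example.COM:8443
print(parts.hostname, parts.port)        # www.example.com 8443

# The extra attributes are None when the URL has no such part.
bare = urlsplit("/a/b")
print(bare.hostname, bare.port)          # None None
```

netloc keeps the original casing and the user information, whereas hostname is lowercased and port is converted to an integer.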