
Python urllib Library | Python’s urllib.request for HTTP Requests

In this tutorial, we will learn about Python's urllib.request module: we will make a GET request to a sample URL, fetch JSON data from a sample REST API, and send a POST request. The urllib library plays an essential role in opening URLs from Python code, and it has many features beyond that, which we explore later. Let's start with a brief introduction to the urllib library.

Python urllib Module

The Python urllib module allows us to access websites via Python code. It provides the flexibility to run GET, POST, and PUT requests and to work with JSON data in our Python code. We can download data, access websites, parse data, and modify headers using the urllib library. Note the related names: urllib2 was the Python 2 version of the library, while urllib3 is a separate, third-party library with some advanced features such as connection pooling. The urllib described here is the Python 3 standard-library package.

Some websites don't allow programs to scrape their data and put load on their servers. When they detect such programs, they may block them or serve different data than a regular user would see. However, we can often overcome such situations with simple code: we only need to modify the user-agent, which is included in the request header that we send.

For those who don't know, headers are the bits of data that we share with the server so that the server knows about us. By default, this is where Python informs the website that we are trying to access it with urllib, along with the Python version we are using.

The urllib package consists of several modules for working with URLs. These are given below:

  • The urllib.request for opening and reading URLs
  • The urllib.parse for parsing URLs
  • The urllib.error for the exceptions raised by urllib.request
  • The urllib.robotparser for parsing robots.txt files

The urllib package ships with the Python installation, so it normally does not need to be installed separately. If you want the third-party urllib3 library, it can be installed with pip using the command below.
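pip install urllib3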

This installs urllib3 in your environment; the standard-library urllib itself is already available.

Basics of HTTP Messages

To understand urllib.request better, we need to know about HTTP messages. An HTTP message is text, transmitted as a stream of bytes, that follows the grammar specified by RFC 7230. A decoded HTTP message can be as simple as the one below.
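A minimal GET request, for instance (the host name is illustrative):

GET / HTTP/1.1
Host: www.example.com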

This GET request targets the root (/) using the HTTP/1.1 protocol. An HTTP message has two main parts: metadata and body. The message above is all metadata with no body. A response consists of the same two parts, as shown below.
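A matching response might look like this (the headers and body are illustrative):

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<html><body>Hello!</body></html>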

As we can see, the response starts with the status code 200 OK on its status line; that line and the headers that follow are the metadata of the response, with the headers arriving as key-value pairs describing it. We have shown standard HTTP messages here, but some servers may break these rules or follow older specifications.

The urllib.request

Let's understand the following example of the urllib.request.urlopen() method. We need to import the urllib.request module and assign the opened URL to a variable, on which we can call the read() method to read the data.

When we make a request with urllib.request.urlopen(), it returns an HTTPResponse object. We can pass the HTTPResponse object to the dir() function to see its different methods and properties.

Example —
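A minimal version of such a request (the target URL is illustrative; any reachable site works):

import urllib.request

response = urllib.request.urlopen('https://www.google.com')
print(dir(response))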

Output:

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', '_checkSeekable', '_checkWritable', '_check_close', '_close_conn', '_get_chunk_left', '_method', '_peek_chunked', '_read1_chunked', '_read_and_discard_trailer', '_read_next_chunk_size', '_read_status', '_readall_chunked', '_readinto_chunked', '_safe_read', '_safe_readinto', 'begin', 'chunk_left', 'chunked', 'close', 'closed', 'code', 'debuglevel', 'detach', 'fileno', 'flush', 'fp', 'getcode', 'getheader', 'getheaders', 'geturl', 'headers', 'info', 'isatty', 'isclosed', 'length', 'msg', 'peek', 'read', 'read1', 'readable', 'readinto', 'readinto1', 'readline', 'readlines', 'reason', 'seek', 'seekable', 'status', 'tell', 'truncate', 'url', 'version', 'will_close', 'writable', 'write', 'writelines']

Example —
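A sketch that reads the response body, again assuming www.google.com as the target:

import urllib.request

response = urllib.request.urlopen('https://www.google.com')
print(response.read())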

Output:

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-IN"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="miKBzQYvtiMON4gUmZ+2qA==">(function(){window.google={kEI:'lj1QYuXdLNOU-AaAz5XIDg',kEXPI:'0,1302536,56873,1710,4348,207,4804,2316,383,246,5,1354,4013,1238,1122515,1197791,610,380090,16114,19398,9286,17572,4859,1361,9290,3027,17582,4020,978,13228,3847,4192,6430,22112,629,5081,1593,1279,2742,149,1103,841,6296,3514,606,2023,1777,520,14670,3227,2845,7,17450,16320,1851,2614,13142,3,346,230,1014,1,5444,149,11325,2650,4,1528,2304,6463,576,22023,3050,2658,7357,30,13628,13795,7428,5800,2557,4094,4052,3,3541,1,16807,22235,2,3110,2,14022,1931,784,255,4550,743,5853,10463,1160,4192,1487,1020,2381,2718,18233,28,2,2,5,7754,4567,6259,3008,3,3712,2775,13920,830,422,5835,11862,3105,1539,2794,8,4669,1413,1394,445,2,2,1,6395,4561,10491,2378,701,252,1407,10,1,8591,113,5011,5,1453,637,162,2,460,2,430,586,969,593,5214,2215,2343,1866,1563,4987,791,6071,2,1,5829,227,161,983,3110,773,1217,2780,933,504,1259,957,749,6,601,23,31,748,100,1392,2738,92,2552,106,197,163,1315,1133,316,364,810,3,333,233,1586,229,81,967,2,440,357,298,258,1863,400,1698,417,4,881,219,20,396,2,397,481,75,5444944,1269,5994875,651,2801216,882,444,3,1877,1,2562,1,748,141,795,563,1,4265,1,1,2,1331,4142,2609,155,17,13,72,139,4,2,20,2,169,13,19,46,5,39,96,548,29,2,2,1,2,1,2,2,7,4,1,2,2,2,2,2,2,353,513,186,1,1,158,3,2,2,2,2,2,4,2,3,3,269,1601,141,1002,98,214,133,10,9,9,2,10,3,1,11,10,6,239

The above code returns the source code of www.google.com; we can clean up the result using regular expressions. Web pages are built with HTML, CSS, and JavaScript, but our program usually only needs to extract the text, and that is where regular expressions become useful.

There are two common methods of data transfer: GET and POST. The GET method is used to fetch data, and the POST method is used to send data to the server, as when we post some data and receive a response based on it. We can make both GET and POST requests to a URL.

Now let’s see the uses of urllib.parse.

The urllib.parse

The urllib.parse module defines functions to manipulate URLs and their component parts, building them up or breaking them down. We could hard-code values into the URL string ourselves, but urllib.parse can do it for us. Let's understand the following example.

Example —
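A sketch of encoding a dictionary of form values and sending them with a POST request (the server URL and field names are illustrative):

import urllib.parse
import urllib.request

url = 'https://www.example.com/search'
values = {'q': 'python urllib'}

data = urllib.parse.urlencode(values)  # encode the dictionary as a query string
data = data.encode('utf-8')            # urlopen() expects bytes
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)
print(response.read())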

In the above code, we defined the values to send with the POST method. To do so, we first encode all of the values with urllib.parse.urlencode(), then encode the resulting string to utf-8 bytes. We then open the URL with a Request carrying that data, and finally read the response with read().

Example —
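Sending the same kind of POST to Google's search endpoint produces the error below (a sketch; the URL is illustrative):

import urllib.parse
import urllib.request

url = 'https://www.google.com/search'
data = urllib.parse.urlencode({'q': 'python'}).encode('utf-8')
req = urllib.request.Request(url, data)
response = urllib.request.urlopen(req)  # raises HTTPError: 405 Method Not Allowed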

Output:

urllib.error.HTTPError: HTTP Error 405: Method Not Allowed

When we run the above code, Google returns 405, Method Not Allowed: Google doesn't allow a plain Python script to POST to its search endpoint. We can use another website with a search bar to send such a request from Python.

Sometimes websites don't like to be visited by spammers or robots, or they treat them differently. In the past, many websites blocked such programs outright; Wikipedia used to do so, but now it serves an error page instead, much like Google.

We send headers each time we request a URL; they carry basic information about us. Within the headers there is a User-Agent value that identifies the browser accessing the website's server, and it is also how tools like Google Analytics determine which browser we are using.

By default, urllib identifies itself with a user-agent such as Python-urllib/3.4 when running under Python 3.4. Websites may not recognize this value, or they may block it entirely.

Example —
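A sketch of the same request with a browser-like User-Agent (the URL and agent string are illustrative):

import urllib.request

url = 'https://www.google.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)
print(response.read())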

We have modified our headers and run the same request, and the response we receive is indeed different.

Common urllib.request Troubles

A program can throw errors while running urllib.request code. In this section, we will deal with a couple of the most common errors encountered when getting started.

Before understanding the specific errors, we will learn how to implement error handling.

Implement Error Handling

Web development is plagued with errors, and resolving them can take a lot of time. Here we will define how to handle HTTP, URL, and timeout errors when using urllib.request.

An HTTP status code is attached to every response in its status line. If we can read a status code in the response, the request reached its target. There are predefined status codes used with responses; for example, 200 and 201 represent successful requests, while 404 or 405 means something went wrong.

There can also be a connection error, or the URL provided may be incorrect. In either case, urllib.request raises a URLError.

Finally, the server may take a long time to respond: maybe the network connection is slow, the server is down, or the server is programmed to ignore specific requests. To handle this, we can pass a timeout argument to the urlopen() function so that a TimeoutError is raised after the specified time. We need to catch all of these with a try…except clause.

Example —
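One possible version of this example (the function name request_func comes from the explanation below; the rest is a sketch):

from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def request_func(url):
    try:
        with urlopen(url, timeout=10) as response:
            print(response.status)
            body = response.read()
            return body, response
    except HTTPError as error:
        # The server answered, but with an error status code.
        print(error.status, error.reason)
    except URLError as error:
        # The URL is wrong or the server could not be reached.
        print(error.reason)
    except TimeoutError:
        # No response arrived within the timeout.
        print('Request timed out')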

The request_func() function takes a URL string as an argument and tries to fetch it with urllib.request. If the URL is incorrect or unreachable, it catches the URLError; it also catches the HTTPError object raised when the server returns an error status. If there is no error, it prints the status and returns the body and response.

As we can observe, we call the urlopen() function with the timeout parameter, which causes a TimeoutError to be raised after the specified number of seconds. The program will wait for ten seconds, which is a reasonable amount of time to wait for a response.

Handling 403 Error

The 403 status means that the server understood the request but won't fulfill it. It is a common error when working with web scraping. As discussed above, it can often be solved with the User-Agent header. Two closely related 4xx codes are sometimes confused:

  • 401 Unauthorized
  • 403 Forbidden

401 Error: It is returned by the server when the user isn't identified or logged in and must do something to gain access, like logging in or registering.

403 Error: It is returned by the server when the user is identified but doesn't have access to the resource.

Luckily, we can often handle such an error by modifying the User-Agent string; user-agent databases on the web list realistic values to use.
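A sketch of catching a 403 and retrying with a browser-like User-Agent (the URL and agent string are illustrative):

from urllib.error import HTTPError
from urllib.request import Request, urlopen

url = 'https://www.example.com'
try:
    response = urlopen(url)
except HTTPError as error:
    if error.code == 403:
        # Retry, identifying ourselves as a regular browser.
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
        response = urlopen(req)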

The Request Package Ecosystem

This section clarifies the package ecosystem around HTTP requests in Python. There are multiple packages and no single clear standard, so we must choose the right package for our requirements.

What are urllib2 and urllib3?

The urllib library has been split into parts over time. The original urllib was introduced in an early version of Python, around 1.2, and a revamped urllib2 was added around version 1.6. When Python 3 was introduced, the original urllib was dropped, and urllib2 lost the 2, taking on the urllib name and being reorganized into the following modules:

  • urllib.error
  • urllib.parse
  • urllib.request
  • urllib.response
  • urllib.robotparser

The urllib3 package is a third-party library that was developed while urllib2 was still around, and it is independently maintained.

When to Use requests over urllib.request

It is important to know when to use the requests library and when to use urllib.request. The urllib.request module is meant to be a low-level library that exposes many details about the workings of HTTP requests.

The requests library is highly recommended when interacting with many different REST APIs. Its official documentation claims that it is "built for human beings", and it has successfully created an intuitive, secure, and straightforward API around HTTP. The requests library makes complex tasks easy, especially around character encoding, whereas with urllib we have to be aware of encodings ourselves and take steps to ensure an error-free experience.
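For instance, fetching JSON from a REST API takes only a couple of lines (requests is a third-party package installed with pip install requests; the GitHub URL is just an example):

import requests

response = requests.get('https://api.github.com')
print(response.status_code)  # e.g. 200
print(response.json())       # the body, already decoded from JSON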

If you are more focused on learning standard Python and how HTTP requests work, urllib.request is a good place to start. You could even go further and use the very low-level http modules.

Conclusion

We have discussed the urllib library in detail and learned how to make requests from a Python script. Since it is a built-in module, using it keeps our projects dependency-free for longer. This tutorial explored the urllib library and its lower-level modules such as urllib.request, and it also explained the common errors and how to resolve them.


Introduction

You may also find the following article on fetching web resources with Python useful:

  • Basic Authentication — A tutorial on Basic Authentication, with examples in Python.

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations — like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many “URL schemes” (identified by the string before the ":" in URL — for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib, with enough detail about HTTP to help you through. It is not intended to replace the urllib.request docs, but is supplementary to them.

Fetching URLs

The simplest way to use urllib.request is as follows:

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
   html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary location, you can do so via the shutil.copyfileobj() and tempfile.NamedTemporaryFile() functions:

import shutil
import tempfile
import urllib.request

with urllib.request.urlopen('http://python.org/') as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(response, tmp_file)

with open(tmp_file.name) as html:
    pass

Many uses of urllib will be that simple (note that instead of an ‘http:’ URL we could have used a URL starting with ‘ftp:’, ‘file:’, etc.). However, it’s the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses — the client makes requests and servers send responses. urllib.request mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call .read() on the response:

import urllib.request

req = urllib.request.Request('http://www.voidspace.org.uk')
with urllib.request.urlopen(req) as response:
   the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself, to the server; this information is sent as HTTP "headers". Let's look at each of these in turn.

Data

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script or other web application). With HTTP, this is often done using what’s known as a POST request. This is often what your browser does when you submit a HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib.parse library.

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.parse.urlencode(values)
data = data.encode('ascii') # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
   the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML forms — see HTML Specification, Form Submission for more details).

If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have “side-effects”: they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST requests from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows:

>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values)  # The order may differ from below.  
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.

Headers

We’ll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites (Google for example) dislike being browsed by programs, or send different versions to different browsers (Browser sniffing is a very bad practice for website design — building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.). By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header (The user agent for MSIE 6 is ‘Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)’). When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.

For details of more HTTP request headers, see Quick Reference to HTTP Headers.

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python' }
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
   the_page = response.read()

The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.

Handling Exceptions

urlopen raises URLError when it cannot handle a response (though as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

The exception classes are exported from the urllib.error module.

URLError

Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn’t exist. In this case, the exception raised will have a ‘reason’ attribute, which is a tuple containing an error code and a text error message.

e.g.

>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
... except urllib.error.URLError as e:
...     print(e.reason)      
...
(4, 'getaddrinfo failed')

HTTPError

Every HTTP response from the server contains a numeric “status code”. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a “redirection” that requests the client fetch the document from a different URL, urllib will handle that for you). For those it can’t handle, urlopen will raise an HTTPError. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer ‘code’ attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100–299 range indicate success, you will usually only see error codes in the 400–599 range.

http.server.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience.

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info, methods as returned by the urllib.response module:

>>> req = urllib.request.Request('http://www.python.org/fish.html')
>>> try:
...     urllib.request.urlopen(req)
... except urllib.error.HTTPError as e:
...     print(e.code)
...     print(e.read())  
...
404
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
  ...
  <title>Page Not Found</title>\n
  ...

Wrapping it Up

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.

Number 1

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass

Note

The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.

Number 2

from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
else:
    # everything is fine
    pass

info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl(), and is defined in the module urllib.response.

geturl — this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info — this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an http.client.HTTPMessage instance.

Typical headers include ‘Content-length’, ‘Content-type’, and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
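A quick sketch of both methods (python.org as an illustrative URL):

import urllib.request

with urllib.request.urlopen('http://python.org/') as response:
    print(response.geturl())                # the final URL, after any redirects
    print(response.info()['Content-Type'])  # one header from the HTTPMessage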

Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib.request.OpenerDirector). Normally we have been using the default opener — via urlopen — but you can create custom openers. Openers use handlers. All the “heavy lifting” is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
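For example, here is a sketch of an opener with cookie support added through build_opener, using the standard HTTPCookieProcessor handler:

import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
with opener.open('http://python.org/') as response:
    html = response.read()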

Other sorts of handlers you might want can deal with proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopen function: there’s no need to call install_opener, except as a convenience.

Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject – including an explanation of how Basic Authentication works — see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a ‘realm’. The header looks like: WWW-Authenticate: SCHEME realm="REALM".

e.g.

WWW-Authenticate: Basic realm="cPanel Users"

The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is ‘basic authentication’. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn’t care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs “deeper” than the URL you pass to .add_password() will also match.

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

Note

In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations – ProxyHandler (if a proxy setting such as an http_proxy environment variable is set), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, DataHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the ‘http:’ scheme component and the hostname and optionally the port number) e.g. "http://example.com/" or an “authority” (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the “userinfo” component — for example "joe:password@example.com" is not correct.

Proxies

urllib will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that’s a good thing, but there are occasions when it may not be helpful.


One way to do this is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:

>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)

Note

Currently urllib.request does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe.

urllib opener for SSL proxy (CONNECT method): ASPN Cookbook Recipe.

Note

HTTP_PROXY will be ignored if a variable REQUEST_METHOD is set; see the documentation on getproxies().

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib uses the http.client library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the http.client or urllib.request levels. However, you can set the default timeout globally for all sockets using

import socket
import urllib.request

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)

© 2001–2022 Python Software Foundation
Licensed under the PSF License.
https://docs.python.org/3.11/howto/urllib2.html

