I would like to read the following HTML:
import pandas as pd
daily_info = pd.read_html('https://www.investing.com/earnings-calendar/', flavor='html5lib')
print(daily_info)
Unfortunately, this appears:
urllib.error.HTTPError: HTTP Error 403: Forbidden
Is there any way to fix it?
asked Apr 24, 2017 at 14:00
JamesHudson81
Pretend to be a browser:
import pandas as pd
import requests

url = 'https://www.investing.com/earnings-calendar/'
header = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, headers=header)
dfs = pd.read_html(r.text)
Result:
In [201]: len(dfs)
Out[201]: 7
In [202]: dfs[0]
Out[202]:
0 1 2 3
0 NaN NaN NaN NaN
In [203]: dfs[1]
Out[203]:
Unnamed: 0 Company EPS / Forecast Revenue / Forecast.1 Market Cap Time
0 Monday, April 24, 2017 NaN NaN NaN NaN NaN NaN NaN
1 NaN Acadia (AKR) -- / 0.11 -- / -- 2.63B NaN
2 NaN Agree (ADC) -- / 0.39 -- / -- 1.34B NaN
3 NaN Alcoa (AA) -- / 0.53 -- / -- 5.84B NaN
4 NaN American Campus (ACC) -- / 0.27 -- / -- 6.62B NaN
5 NaN Ameriprise Financial (AMP) -- / 2.52 -- / -- 19.76B NaN
6 NaN Avacta Group (AVTG) -- / -- 1.26M / -- 47.53M NaN
7 NaN Bank of Hawaii (BOH) 1.2 / 1.08 165.8M / -- 3.48B NaN
8 NaN Bank of Marin (BMRC) 0.74 / 0.8 -- / -- 422.29M NaN
9 NaN Banner (BANR) -- / 0.68 -- / -- 1.82B NaN
10 NaN Barrick Gold (ABX) -- / 0.2 -- / -- 22.44B NaN
11 NaN Barrick Gold (ABX) -- / 0.28 -- / -- 30.28B NaN
12 NaN Berkshire Hills Bancorp (BHLB) -- / 0.54 -- / -- 1.25B NaN
13 NaN Brookfield Canada Office Properties (BOXC) -- / -- -- / -- NaN NaN
...
answered Apr 24, 2017 at 14:05
This is what I initially tried:
pd.read_html('https://www.redfin.com/home-trends/city/30749/NY/New-York')
But it throws this at me:
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-20-c20ce10f612c> in <module>()
31 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
32 }
---> 33 pd.read_html('https://www.redfin.com/home-trends/city/30749/NY/New-York')
34
35
12 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
I also tried using request headers to spoof the user agent like this:
req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
pd.read_html('https://www.redfin.com/home-trends/city/30749/NY/New-York', header=req_headers)
But I get this (slightly different) exception:
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-19-6528c4149478> in <module>()
31 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
32 }
---> 33 pd.read_html('https://www.redfin.com/home-trends/city/30749/NY/New-York', header=req_headers)
34
35
12 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
647 class HTTPDefaultErrorHandler(BaseHandler):
648 def http_error_default(self, req, fp, code, msg, hdrs):
--> 649 raise HTTPError(req.full_url, code, msg, hdrs, fp)
650
651 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
I found this on Stack Overflow and tried the following solution:
req = Request('https://www.redfin.com/home-trends/city/30749/NY/New-York', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tables = pd.read_html('https://www.redfin.com/home-trends/city/30749/NY/New-York')
print(tables[0])
But alas, it still gives a 403 error at webpage = urlopen(req).read()
Does anyone know how to get around a 403 HTTP error when trying to read html using pandas?
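Note that read_html's header argument selects which row of the table becomes the column header; it does not send HTTP headers, which is why the second attempt fails identically. Below is a sketch of the same requests-based workaround used in the earlier answer, assuming Redfin actually serves a plain <table> to non-browser clients (it may not):
import pandas as pd
import requests

url = 'https://www.redfin.com/home-trends/city/30749/NY/New-York'
req_headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}

# Fetch the page ourselves so the HTTP headers are actually sent,
# then hand the downloaded HTML text to pandas.
r = requests.get(url, headers=req_headers)
r.raise_for_status()            # still raises if the site rejects the request
tables = pd.read_html(r.text)   # needs at least one <table> in the page
print(tables[0])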
The CSV file is downloadable. I can download the file and use read_csv, but I want to read the file via a direct URL in Jupyter. I used the following code, but I get the HTTP 403 Forbidden error.
from io import StringIO
import pandas as pd
import requests
url = "https://fineli.fi/fineli/en/elintarvikkeet/resultset.csv"
s = requests.get(url).text
c = pd.read_csv(StringIO(s))
c
How do I read the CSV file via the URL directly in Python, with the delimiter ";"?
asked Apr 23, 2019 at 6:03
KHAN irfan
The problem is that the URL you have doesn't accept "non-browser" requests. The default User-Agent header of Python requests is
'User-Agent': 'python-requests/2.13.0'
You can pass your own headers as an argument, like this:
from io import StringIO
import pandas as pd
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
url = "https://fineli.fi/fineli/en/elintarvikkeet/resultset.csv"
s = requests.get(url, headers=headers).text
c = pd.read_csv(StringIO(s), sep=";")
c
answered Apr 23, 2019 at 6:28
Tasos
I read the file using the following code:
from urllib.request import urlopen, Request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
reg_url = "https://fineli.fi/fineli/en/elintarvikkeet/resultset.csv"
req = Request(url=reg_url, headers=headers)
html = urlopen(req).read()
print(html)
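If the goal is still a DataFrame rather than raw bytes, the payload fetched this way can be handed straight to pandas; a small sketch along the same lines, using the ';' delimiter from the question:
from io import BytesIO
from urllib.request import Request, urlopen

import pandas as pd

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
reg_url = "https://fineli.fi/fineli/en/elintarvikkeet/resultset.csv"
req = Request(url=reg_url, headers=headers)
raw = urlopen(req).read()                # CSV payload as bytes
df = pd.read_csv(BytesIO(raw), sep=";")  # the file is ';'-delimited
print(df.head())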
answered Apr 23, 2019 at 6:27
KHAN irfan
- the urllib Module in Python
- Check robots.txt to Prevent urllib HTTP Error 403 Forbidden Message
- Adding Cookie to the Request Headers to Solve urllib HTTP Error 403 Forbidden Message
- Use Session Object to Solve urllib HTTP Error 403 Forbidden Message
Today's article explains how to deal with the error message (exception) urllib.error.HTTPError: HTTP Error 403: Forbidden, produced by the error class on behalf of the request class when it faces a forbidden resource.
the urllib Module in Python
The urllib module handles URLs for Python via different protocols. It is popular with web scrapers who want to obtain data from a particular website.
urllib contains classes, methods, and functions for operations such as making requests, parsing URLs, and reading robots.txt. It has four submodules: request, error, parse, and robotparser.
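For orientation, a minimal sketch that touches each of the four submodules (example.com is a placeholder):
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# parse: split a URL into its components
parts = urllib.parse.urlparse("https://www.example.com/path?q=1")
print(parts.netloc, parts.path, parts.query)

# robotparser: read a site's robots.txt and ask whether a crawl is allowed
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.example.com/path"))

# request + error: fetch a resource and handle HTTP failures
try:
    with urllib.request.urlopen("https://www.example.com/") as resp:
        body = resp.read()
except urllib.error.HTTPError as exc:
    print(exc.code, exc.reason)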
Check robots.txt to Prevent urllib HTTP Error 403 Forbidden Message
When using the urllib module to interact with servers via the request class, we might run into specific errors. One of those errors is the HTTP 403 error.
We get the urllib.error.HTTPError: HTTP Error 403: Forbidden error message from the urllib package while reading a URL. HTTP 403, the Forbidden Error, is an HTTP status code indicating that the server refuses access to the requested resource. In other words, when we see this error message, the server understood the request but declined to process or authorize it.
To understand why the website we are accessing is not processing our request, we should check an important file, robots.txt. Before web scraping or otherwise interacting with a website, it is often advised to review this file to know what to expect and to avoid further trouble.
To check it on any website, we can follow the format below:
https://<website.com>/robots.txt
For example, here are the YouTube, Amazon, and Google robots.txt files:
https://www.youtube.com/robots.txt
https://www.amazon.com/robots.txt
https://www.google.com/robots.txt
Checking the YouTube robots.txt gives the following result.
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid-'90s wiped out all humans.
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml
We can see a lot of Disallow tags there. A Disallow tag marks an area of the website that is not accessible; any request to those areas will not be processed and is forbidden. In other robots.txt files, we might also see an Allow tag. For example, http://youtube.com/comment is forbidden to any external request, even one made with the urllib module.
Let's write code to scrape data from a website that returns an HTTP 403 error when accessed.
Example Code:
import urllib.request
import re
webpage = urllib.request.urlopen('https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi').read()
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.findall(findlink, webpage)
print(len(row_array))
Output:
Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\index.py", line 7, in <module>
    webpage = urllib.request.urlopen('https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi').read()
  File "C:\Python310\lib\urllib\request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python310\lib\urllib\request.py", line 525, in open
    response = meth(req, response)
  File "C:\Python310\lib\urllib\request.py", line 634, in http_response
    response = self.parent.error(
  File "C:\Python310\lib\urllib\request.py", line 563, in error
    return self._call_chain(*args)
  File "C:\Python310\lib\urllib\request.py", line 496, in _call_chain
    result = func(*args)
  File "C:\Python310\lib\urllib\request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
The reason is that we are forbidden from accessing the website. If we check the robots.txt file, we will notice that https://www.cmegroup.com/markets/ does not carry a Disallow tag. However, if we scroll further down the robots.txt file of the site we wanted to scrape, we find the following.
User-agent: Python-urllib/1.17
Disallow: /
The above text means that the user agent named Python-urllib is not allowed to crawl any URL on the site; in other words, requests made with Python's urllib module are not welcome.
Therefore, check or parse robots.txt to know what resources we have access to; we can parse the robots.txt file with the robotparser class, as sketched below. This can prevent our code from running into the urllib.error.HTTPError: HTTP Error 403: Forbidden error message.
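A minimal sketch of such a robotparser check, using the cmegroup.com URL from the example above (the exact answers depend on the live robots.txt):
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.cmegroup.com/robots.txt")
rp.read()

# Ask whether a given user agent may crawl a URL before actually requesting it.
url = "https://www.cmegroup.com/markets/products.html"
print(rp.can_fetch("Python-urllib", url))
print(rp.can_fetch("Mozilla/5.0", url))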
Adding Cookie to the Request Headers to Solve urllib HTTP Error 403 Forbidden Message
Passing a valid user agent as a header parameter often fixes the problem quickly, but a website may also use cookies as an anti-scraping measure: it sets cookies and expects them to be echoed back, rejecting requests that omit them (and scraping such a site may well be against its policy).
from urllib.request import Request, urlopen

def get_page_content(url, head):
    req = Request(url, headers=head)
    return urlopen(req)

url = 'https://example.com'
head = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://example.com',
    'Cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
Output:
b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta
...
<p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
Passing a valid user agent as a header parameter will quickly fix the problem.
Use Session Object to Solve urllib HTTP Error 403 Forbidden Message
Sometimes, even using a user agent won't stop this error from occurring. The Session object of the requests module can then be used.
import requests

url = "https://stackoverflow.com/search?q=html+error+403"
session_obj = requests.Session()
response = session_obj.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)
Output:
200
The above article finds the cause of the urllib.error.HTTPError: HTTP Error 403: Forbidden error and the solutions for handling it. This error is essentially caused by mod_security, as different web pages use different security mechanisms to distinguish between human visitors and automated computers (bots).
I concur:
Python 3.6.9 (default, Apr 18 2020, 01:56:04)
IPython 5.5.0 -- An enhanced Interactive Python.

In [1]: import datapackage
In [2]: import pandas
In [3]: package = datapackage.Package('https://datahub.io/JohnSnowLabs/population-figures-by-country/datapackage.json')
In [4]: pandas.read_csv(package.resources[4].descriptor['path'])
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-4a930361f83a> in <module>()
----> 1 pandas.read_csv(package.resources[4].descriptor['path'])
...
/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651
    652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
The urllib.error.HTTPError: HTTP Error 403: Forbidden error occurs when you try to scrape a webpage using the urllib.request module and mod_security blocks the request. There are several reasons why you get this error; let's take a look at each of the use cases in detail.
Usually, websites are protected with an App Gateway, WAF rules, etc., which monitor whether requests come from actual users or are triggered through an automated bot system. The mod_security or WAF rule will block these requests, treating them as spider/bot requests. These security features are standard measures to prevent DDoS attacks on the server.
Now, coming back to the error: when you make a request to a site using urllib.request, you normally do not set any user agent or headers, so by default urllib sends something like Python-urllib/3.x, which is easily detected by mod_security.
mod_security is usually configured so that any request made without a valid (browser) User-Agent header is blocked and returns urllib.error.HTTPError: HTTP Error 403: Forbidden.
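To see that default user agent, and to replace it process-wide instead of per request, the standard library offers build_opener/install_opener; a short sketch:
import urllib.request

# The default opener identifies itself as Python-urllib/<version>,
# which mod_security / WAF rules commonly reject.
print(urllib.request.build_opener().addheaders)

# Installing an opener with a browser-style User-Agent changes the default
# for every later urllib.request.urlopen() call in this process.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)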
Example of 403 forbidden error
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/')
webpage = urlopen(req).read()
Output:
  File "C:\Users\user\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
urllib.error.HTTPError: HTTP Error 403: Forbidden
The easy way to resolve the error is by passing a valid user-agent as a header parameter, as shown below.
from urllib.request import Request, urlopen
req = Request('https://www.yahoo.com', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
Alternatively, you can even set a timeout if you are not getting the response from the website. Python will raise a socket exception if the website doesn’t respond within the mentioned timeout period.
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
In some cases, such as fetching a real-time Bitcoin or stock market value, you may be sending requests every second, and the server can block you if too many requests come from the same IP address, throwing a 403 security error.
If you get this error because of too many requests, consider adding a delay between requests to resolve it, as in the sketch below.
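A simple way to add that delay, sketched with the same urllib calls used above (the two-second pause and the URL list are placeholders to adjust):
import time
from urllib.request import Request, urlopen

urls = ['http://www.cmegroup.com/'] * 5   # hypothetical list of pages to poll

for url in urls:
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req, timeout=10).read()
    time.sleep(2)   # pause between requests so the server is not flooded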
I'm looking to get data from tables on a specific esports site, and I seem to be struggling.
I was told that the pandas library could help me achieve this in just a few lines.
import pandas as pd
tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')
print(tables[0])
I'm trying to edit this to get it to work, but I'm having no success.
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')
print(tables[0])
I was sure this might be the solution I was looking for, or something close to it, but when I try to solve the problem this way, I have no success.
Traceback (most recent call last):
  File "C:\Users\antho\OneDrive\Documents\Python\tables clloud9.py", line 6, in <module>
    webpage = urlopen(req).read()
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
All I want at this point is for the table at that link to be pulled.
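A sketch of the fix this question is reaching for: download the page once with a browser-style User-Agent and feed the downloaded HTML to read_html instead of passing the URL again (hltv.org may sit behind additional bot protection, so even this can still be rejected):
import pandas as pd
from urllib.request import Request, urlopen

url = 'https://www.hltv.org/stats/teams/matches/5752/Cloud9'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read().decode('utf-8')   # fetch the HTML once, with headers sent
tables = pd.read_html(webpage)                  # parse tables from the HTML string, not the URL
print(tables[0])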