What is Web Scraping?
- The automated gathering of data from the internet. A program that performs web scraping is often called a bot.
Why Web Scraping?
- It lets us gather far more data than we could by hand in a browser: a scraper can work through hundreds of pages at once, instead of one page at a time.
Our First Web Scraper
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
urllib - in Python 3.x, the urllib2 of Python 2.x was renamed to urllib and split into several submodules: a. urllib.request b. urllib.parse c. urllib.error. urllib is a standard Python library and it contains functions to request data across the web, handle cookies, and even change metadata such as headers.
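As a quick illustration of the submodules (a minimal sketch using only the standard library; the query string here is made up for illustration):

```python
from urllib.parse import urlparse, urlencode

# urllib.parse splits a URL into its components
parts = urlparse("http://pythonscraping.com/pages/page1.html?lang=en")
print(parts.netloc)   # the host:  pythonscraping.com
print(parts.path)     # the path:  /pages/page1.html
print(parts.query)    # the query: lang=en

# ...and can also build query strings for use with urllib.request
print(urlencode({"q": "web scraping", "page": 1}))
```

urllib.error, used later in the exception-handling section, holds the exception classes that urllib.request can raise.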
BeautifulSoup
BeautifulSoup (the bs4 package) is not a default library, so we have to download/install it. Windows - pip install beautifulsoup4 Linux - sudo apt-get install python-bs4 You can visit crummy.com for details on BeautifulSoup.
Our First Web Scraper with bs4
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/blog")
bsObj = BeautifulSoup(html.read(), 'lxml')
# print(html.read())
print(bsObj.h1)
If you are getting an error like this -- bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? -- make sure you install the lxml parser, using -- pip install lxml
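BeautifulSoup takes the parser name as its second argument: 'html.parser' is bundled with Python, while 'lxml' must be installed separately as above. A minimal sketch on an inline HTML string (the markup is made up for illustration) so it runs without a network request:

```python
from bs4 import BeautifulSoup

# A small, hypothetical HTML snippet standing in for a downloaded page
doc = "<html><body><h1>An Interesting Title</h1></body></html>"

# 'html.parser' always works; swap in 'lxml' once it is installed
soup = BeautifulSoup(doc, "html.parser")
print(soup.h1.get_text())
```

Naming the parser explicitly also silences the warning bs4 prints when it has to guess one.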
Exception Handling:
So if you look at the line --
html = urlopen("http://pythonscraping.com/pages/page1.html")
there are two main things that can go wrong:
--The page is not found, or there is an error retrieving it (an HTTPError)
--The server is not found (a URLError)
In both cases, urlopen raises an exception. We can handle it in the following way.
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://pythonscraping.com")
except HTTPError as e:
    print(e)
    # return null, break, or take some other action
except URLError as e:
    print("The server could not be found")
else:
    print("It worked")
    # the program continues
Example 2
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.prettify())
# print(bsObj.nonExistingTag.someTag)
Example 3
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com")
bsObj = BeautifulSoup(html.read(), 'lxml')
# print(bsObj.prettify())
# print(bsObj.nonExistingTag.someTag)
try:
    badContent = bsObj.nonExistingTag.anotherTag
    # badContent = bsObj.prettify()
    print("I'm in the try block")
except AttributeError as AE:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found here EITHER")
    else:
        # print(badContent)
        print("Ok, Ok, Ok... I got you")
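The two failure modes above (a tag that does not exist, and a child of a missing tag) are often wrapped in a single helper that simply returns None when anything is missing. A minimal sketch, using an inline HTML string in place of a live page so it runs without a network connection (the function name and tag names are just examples):

```python
from bs4 import BeautifulSoup

def get_tag_text(html_text, tag_name):
    """Return the text of the first matching tag, or None if it is missing."""
    bsObj = BeautifulSoup(html_text, "html.parser")
    try:
        # Accessing a missing tag returns None; calling .get_text() on
        # that None raises AttributeError, which we catch here
        content = getattr(bsObj, tag_name).get_text()
    except AttributeError:
        return None
    return content

doc = "<html><body><h1>An Interesting Title</h1></body></html>"
print(get_tag_text(doc, "h1"))              # the tag exists
print(get_tag_text(doc, "nonExistingTag"))  # missing tag -> None
```

Callers then only need a single `is None` check instead of repeating the try/except at every lookup.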