Web Scraping

What is Web Scraping?

- The automated gathering of data from the internet. A program that performs web scraping is often called a bot.

Why Web Scraping?

- It lets us gather far more data than we normally could through a browser. A scraper can search through hundreds of pages at once, instead of one page at a time in a browser.

Our First Web Scraper

from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
urllib - in Python 3.x, Python 2.x's urllib2 was renamed to urllib and split into several submodules:
a. urllib.request
b. urllib.parse
c. urllib.error
urllib is a standard Python library; it contains functions to request data across the web, handle cookies, and even change metadata such as headers.
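Since urllib ships with Python, the other submodules can be tried without installing anything. A minimal sketch of urllib.parse, which splits URLs apart and builds query strings (the query values below are made up for illustration):

```python
from urllib.parse import urlparse, urlencode

# Split a URL into its components
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.scheme)  # http
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# Build a query string for a GET request
query = urlencode({"q": "web scraping", "page": 1})
print(query)  # q=web+scraping&page=1
```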

 

Beautiful Soup

Beautiful Soup (bs4) is not part of the standard library, so we have to install it first.
Windows - pip install beautifulsoup4
Linux - sudo apt-get install python-bs4
You can visit crummy.com for details on Beautiful Soup.
Our First Scraper with bs4

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/blog")
bsObj = BeautifulSoup(html.read(), 'lxml')
# print(html.read())
print(bsObj.h1)
If you get an error like this --
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
--
make sure you install the lxml parser: pip install lxml

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\SouravG>pip install lxml
Collecting lxml
Downloading https://files.pythonhosted.org/packages/3c/06/d8db111411e4ad0677d72985af1d45ead6f1d0e8d09d80bddc45c1f91cfe/lxml-4.2.3-cp36-cp36m-win_amd64.whl (3.6MB)
100% |████████████████████████████████| 3.6MB 620kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.3
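If installing lxml is not an option, BeautifulSoup can also fall back on Python's built-in html.parser, which needs no extra install. A minimal offline sketch (the HTML string here is made up for illustration, so no network is needed):

```python
from bs4 import BeautifulSoup

# Parse a small in-memory HTML string instead of a live page
html = "<html><body><h1>An Interesting Title</h1><p>Hello</p></body></html>"
bsObj = BeautifulSoup(html, "html.parser")  # built-in parser, no install needed

print(bsObj.h1)             # <h1>An Interesting Title</h1>
print(bsObj.h1.get_text())  # An Interesting Title
print(bsObj.p.get_text())   # Hello
```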

 

Exception Handling:

Look again at the line --
html = urlopen("http://pythonscraping.com/pages/page1.html")
There are two main things that can go wrong:
-- The page is not found, or there is an error retrieving it (this raises HTTPError)
-- The server is not found at all (this raises URLError)
In either case, urlopen raises an exception. We can handle it in the following way.
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://pythonscraping.com")
except HTTPError as e:
    print(e)
    # return None, break, or take other action
except URLError as e:
    print("The server could not be found")
else:
    print("It worked")
    # the program continues
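HTTPError also carries the HTTP status code, so inside the handler you can tell a 404 from a 500. A small sketch that constructs an HTTPError by hand just to inspect its attributes (the URL is made up for illustration):

```python
from urllib.error import HTTPError

# urlopen raises HTTPError for responses like 404 or 500; the exception
# object itself carries the status code and message.
err = HTTPError(url="http://example.com/missing", code=404,
                msg="Not Found", hdrs=None, fp=None)
print(err.code)    # 404
print(err.reason)  # Not Found
print(err)         # HTTP Error 404: Not Found
```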

 

Example 2

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com")
bsObj = BeautifulSoup(html.read(), 'lxml')
print(bsObj.prettify())
# print(bsObj.nonExistingTag.someTag)

 

Example 3

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com")
bsObj = BeautifulSoup(html.read(), 'lxml')
# print(bsObj.prettify())
# print(bsObj.nonExistingTag.someTag)

try:
    badContent = bsObj.nonExistingTag.anotherTag
    # badContent = bsObj.prettify()
    print("I'm in the try block")
except AttributeError as AE:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found here EITHER")
    else:
        # print(badContent)
        print("Ok, Ok, Ok... I got you")
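All of these pieces can be combined into one reusable helper that returns the page's first h1, or None on any failure, whether the network call or the tag lookup went wrong. This is a sketch: get_title and its opener parameter are names chosen for illustration, and the fake opener lets the example run offline.

```python
from io import BytesIO
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup

def get_title(url, opener=urlopen):
    # Return the page's <h1> tag, or None on any failure.
    try:
        html = opener(url)
    except (HTTPError, URLError):
        return None              # network problem: bad page or bad server
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1    # AttributeError if <body> is missing
    except AttributeError:
        return None
    return title                 # a Tag, or None if <h1> is missing

# Offline usage: a fake opener serves canned HTML instead of hitting the network
fake = lambda url: BytesIO(b"<html><body><h1>Hi</h1></body></html>")
print(get_title("http://example.com", opener=fake))  # <h1>Hi</h1>
```

Injecting the opener keeps the default behavior (urlopen) while making the helper easy to test without a connection.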