What is web scraping in Python?
Web scraping is not permitted everywhere: some websites do not allow others to access or reuse their information, and scraping them can break their terms of use or the law. So I recommend that you do not scrape those websites and use their APIs instead. This article is for study purposes only.
Why web scraping?
The main idea behind web scraping is to fetch and access information from other sources. The preferred way to access information is through an API, which is highly recommended. But if no API is available and you still need the information, you can use web scraping. Web scraping has the following main features:
- Fetch/download a web page (the browser usually does this automatically when you open a page).
- Access information from that web page, including links, images, etc.
- Format or arrange the fetched data as required.
- Use the data later.
- Save the data to a local file, database, or Excel sheet.
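The second feature above, accessing links and images, can be sketched with the bs4 module that the example section of this article installs. This is only an illustration: the HTML snippet and its URLs are made up, and it uses Python's built-in "html.parser" so that it runs without network access or an extra parser package.

```python
import bs4

# Hypothetical page content, inlined so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.com/contact">Contact</a>
  <a>No target</a>
  <img src="/static/logo.png" alt="Logo">
</body></html>
"""

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

# The same idea for images and their src attribute.
images = [img["src"] for img in soup.find_all("img", src=True)]

print(links)
print(images)
```

Filtering on href=True and src=True avoids a KeyError on tags that lack the attribute, such as the third <a> in the snippet.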
Web scraping process
We have already discussed the idea behind web scraping. In simple words, it is the process of fetching and extracting information from other websites. As you can see in the image below, web scraping consists of the following three steps:
- Fetch information from the website.
- Extract information using a web scraping program. You can extract anything from a web page, including HTML, links, images, etc.
- Organize the fetched information in a specific structure such as a database or file system.
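The three steps above can be sketched end to end as follows. To keep the sketch runnable offline, step 1 is replaced by an inlined, made-up HTML string. The function names extract_headings and to_csv are hypothetical helpers, not part of any library, and CSV is just one possible target structure.

```python
import csv
import io

import bs4

# Stand-in for a downloaded page (step 1) so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
  <h2>First post</h2>
  <h2>Second post</h2>
</body></html>
"""


def extract_headings(html):
    """Step 2: pull the text of every <h2> out of the HTML."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2")]


def to_csv(headings):
    """Step 3: organize the extracted text as CSV rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["position", "heading"])
    for position, heading in enumerate(headings, start=1):
        writer.writerow([position, heading])
    return buffer.getvalue()


headings = extract_headings(SAMPLE_HTML)
csv_text = to_csv(headings)
print(csv_text)
```

To store the result in a real file, the same writer can be pointed at a file opened with open("headings.csv", "w", newline="") instead of the in-memory buffer.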
Web scraping Python example
Now let’s take an example of web scraping in Python. I am using the following modules/software for the code below:
- Jupyter Notebook (optional)
- bs4 module
- requests module
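You can install the above packages using:
pip install bs4 and pip install requests. The examples below also pass 'lxml' to BeautifulSoup as the parser, which needs the lxml package: pip install lxml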
Web scraping Python startup:
If you want to scrape a web page, you first need to analyze it completely. To do so, you need a basic understanding of front-end development, because you have to analyze the structure and CSS of the page. You can fetch the data of any tag, class, or id using web scraping. You just need to know the structure and the name of the tag, class, or id in which the data is placed.
I am going to fetch data from our own website (not recommended for you). If you are using the Google Chrome web browser, you can take the following steps to find the tag, class, or id of any content:
- Select the element/text on the website that you want to fetch.
- Right-click on the screen > Inspect.
- Navigate between the different elements to see their parent/child elements, like below:
Example – 1. Fetch an HTML tag
import bs4
import requests

# Download the page
website_link = requests.get("http://onlinetutorial.co.in")
# Parse the HTML with the lxml parser
ab = bs4.BeautifulSoup(website_link.text, 'lxml')
# Select every <h2> tag on the page
heading2 = ab.select("h2")
print(heading2)
- As you can see in the example above, you first need to import the modules (make sure to install them first).
- Fetch the web page using the get method of the requests module, passing the URL of that page.
- Use the BeautifulSoup class of bs4 to parse the fetched page with a specific parser (here, lxml). If you want to learn more about web scraping, you can visit the link https://en.wikipedia.org/wiki/Web_scraping
- Select the element you want to fetch using the select method.
- If you have done the above steps successfully, you will see the data enclosed in a Python list. Later on you can loop through that data to use it more conveniently.
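The steps above can also be wrapped in two small functions with some basic error handling. This is a sketch under assumptions, not the only way to do it. The names fetch_html and select_tags are hypothetical helpers, the 10-second timeout is an arbitrary choice, and the demonstration parses an inlined snippet with Python's built-in "html.parser" so that it runs offline.

```python
import bs4
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text


def select_tags(html, css_selector):
    """Parse HTML and return the list of tags matching a CSS selector."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup.select(css_selector)


# Offline demonstration on an inlined snippet; for a live page you would
# call select_tags(fetch_html("http://onlinetutorial.co.in"), "h2") instead.
sample = "<h1>Site</h1><h2>Intro</h2><p>Text</p><h2>Setup</h2>"
heading2 = select_tags(sample, "h2")
print(heading2)
```

Keeping the download separate from the parsing makes it possible to test the selector logic on saved HTML without hitting the website again.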
Example – 2. Looping through elements:
You can loop through the result because it is a list. You can also fetch individual items by index. Suppose you want to access the first <h2> of the page; you can do this (continuing the code above):
print(heading2[0]) # Will return the first <h2>
But if you want to loop through all elements of the result list, you can do as follows (continuing the code above):
# To return whole elements, i.e. tag with text
for i in heading2:
    print(i)

# To return text only
for i in heading2:
    print(i.text)
As a result, it returns all the headings (<h2>) of the page.
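Here is a small offline sketch of the same indexing and looping, with two defensive touches that are my own additions rather than part of the original code: a guard for an empty result list, and get_text(strip=True) to trim whitespace. The HTML snippet is made up for illustration.

```python
import bs4

# Made-up snippet; the first heading has extra whitespace around its text.
sample = "<h2> First heading </h2><h2>Second heading</h2>"
heading2 = bs4.BeautifulSoup(sample, "html.parser").select("h2")

# Guard the index access: select() returns an empty list when nothing matches,
# and heading2[0] on an empty list would raise IndexError.
first = heading2[0] if heading2 else None

# .text keeps the whitespace exactly as it appears in the page,
# while get_text(strip=True) trims it.
raw_texts = [tag.text for tag in heading2]
clean_texts = [tag.get_text(strip=True) for tag in heading2]

for position, text in enumerate(clean_texts, start=1):
    print(position, text)
```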
Example – 3. Fetch elements using class or id:
You can also select elements by their class or id. The syntax is the same, except:
- You can select an id using the hash symbol (#) followed by the name of the id. Example: #IdName
- You can select a class using the dot symbol (.) followed by the name of the class. Example: .className
Here is the code for doing the same:
import bs4
import requests

# Download and parse the page
website_link = requests.get("http://onlinetutorial.co.in")
ab = bs4.BeautifulSoup(website_link.text, 'lxml')
# Select every element with the class widget_categories
heading2 = ab.select(".widget_categories")
for i in heading2:
    print(i.text)
- Here I am using the class named widget_categories; you can use your own class name instead.
- As you can see, I have used the dot symbol to select a class; you can use the hash symbol instead to select an id.
- If you want to access the data of a single block, you can use its id.
- But if you want to access the data of more than one block, I recommend using the class name.
- Furthermore, you can loop through the blocks to get each block’s contents.
- This is because a class name can be shared by many elements, while an id is unique.
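The difference between class and id selection can be sketched offline as follows. The HTML snippet is made up: widget_categories is the class from the example above, while the sidebar id is a hypothetical name chosen for illustration. For the single-block case the sketch uses select_one, which returns one tag or None rather than a list.

```python
import bs4

# Made-up snippet: one block with an id, containing two blocks that share a class.
SAMPLE_HTML = """
<div id="sidebar">
  <div class="widget_categories">Python</div>
  <div class="widget_categories">Django</div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# A class can repeat, so select() returns every matching block as a list.
blocks = soup.select(".widget_categories")
block_texts = [block.get_text(strip=True) for block in blocks]

# An id should be unique, so select_one() returns a single tag,
# or None when no element has that id.
sidebar = soup.select_one("#sidebar")
missing = soup.select_one("#does-not-exist")

print(block_texts)
print(sidebar is not None, missing is None)
```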
So that is all for today’s article. I hope you enjoyed the post. We will discuss much more about web scraping in our further tutorials. Thanks!