Web scraping tutorial
In our previous tutorial, we talked a lot about what web scraping is and walked through simple web scraping. We scraped a page using Python's requests module, including an example that collected all the <h2> tags of a website. If you want to learn the basics of web scraping, you can visit the link below:
But the code behind that link only works for some websites, because we are using a simple request to get the data. As we already know, web scraping is a very powerful tool, yet the plain request method does not let us fetch data from protected websites, such as those that render their content with JavaScript or block simple HTTP clients. In this tutorial we will discuss fetching data from such websites, which do not allow their data to be fetched directly.
Web scraping is restricted or outright forbidden in many cases; some websites do not allow you to access or reuse their information, and scraping them can violate their terms of service. So I highly recommend that you do not scrape those websites; use their APIs instead. This article is for study purposes only.
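One quick way to see what a site permits is its robots.txt file, and Python's standard library can parse those rules. The sketch below feeds the parser a hand-written robots.txt (the rules here are made up for illustration), so no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, inlined for illustration.
# In practice you would call rp.set_url("https://example.com/robots.txt")
# followed by rp.read() to fetch and parse the real file.
rules = """
User-agent: *
Disallow: /players/private/
Allow: /players/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch() tells you whether a given user agent may request a URL.
print(rp.can_fetch("*", "https://example.com/players/"))           # True
print(rp.can_fetch("*", "https://example.com/players/private/x"))  # False
```

Checking this before scraping is a small courtesy that also keeps you on the right side of a site's stated policy.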
How to use web scraping on various websites
As discussed earlier, most websites do not allow their data to be scraped directly. So we need a platform from which we can scrape a secure website's data, or in other words, protected data.
Here we need automated programs through which we can easily extract a website's data. To implement web scraping for such websites, we need the following:
- Jupyter Notebook (optional)
- Selenium (its WebDriver)
- bs4 (its BeautifulSoup)
What is Selenium?
Selenium is a portable software-testing framework for web applications. It is open source and free of cost, i.e. anyone can download it easily. Basically, Selenium is used for automation: opening browsers, running tests, and so on. Selenium has various components, which we can use easily and without any cost. They are listed below:
- Selenium IDE
- Selenium RC
- Selenium WebDriver
- Selenium Grid
To install Selenium, execute the following command at the command prompt:
pip install selenium
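The other requirements from the list above can be installed the same way. The package names below are the usual PyPI ones; lxml is assumed here only because the example code later in this article asks BeautifulSoup for the 'lxml' parser:

```shell
pip install beautifulsoup4   # provides the bs4 package
pip install lxml             # HTML parser used by the example code
```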
Introduction to Selenium and WebDriver
In the description above, we discussed the various components of Selenium. We will use Selenium's WebDriver for web scraping. WebDriver is a tool for automating web-application testing and for exploring page data. Moreover, it checks whether everything behaves in the correct and expected way (testing).
Here are some important points about Selenium WebDriver:
- We will use WebDriver for web scraping rather than the simple request method.
- It allows us to fetch a website's data easily.
- There are different web drivers for different browsers.
- Basically, WebDriver accepts commands and sends them to the particular browser.
- Moreover, it also retrieves the results of those commands.
Install Web driver
First of all, we need to install a web driver. Here I am using the Chrome browser's driver. If you are using the same, you can visit the link below. If you are using another browser (like Firefox, IE, etc.), you have to install that particular browser's driver. Here are the steps to install the driver for Chrome:
- Click To Install Chrome Web driver
- Open the link above and click on ChromeDriver 2.42
- Download the zip file for your OS, i.e. whether you need the driver for Mac, Linux, or Windows.
- Extract the file to any location and copy that location's path, which you will pass to WebDriver.
So, we have talked a lot about web drivers and Selenium. Let's take an example of web scraping using Selenium. Here I am fetching each player's name and profile link from the website https://www.nba.com. You can use any website instead.
Here is the code for the same:
from selenium import webdriver
from bs4 import BeautifulSoup

# Path to the chromedriver executable you extracted earlier
driver = webdriver.Chrome(executable_path=r'C:\Users\akd62\OneDrive\Desktop\chromedriver.exe')

url = 'https://www.nba.com/players'
driver.get(url)

# Hand the rendered page source to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')

# The player links sit inside a div with class "static"
div = soup.find('div', class_='static')
for a in div.find_all('a'):
    print(a.text)
    print("Player Name: ", a['title'])
    print("More Details:", "https://www.nba.com" + a['href'])
    print('')

driver.quit()
- First of all, it opens a web browser (set the driver path here).
- Then it loads the given link in the browser.
- It uses the driver's get() method for that.
- Moreover, it uses the BeautifulSoup module to extract data (same as in the previous article).
- Here we extract the anchor <a> tags inside the div with class static.
- Furthermore, a for loop visits each and every tag with the same properties.
- At last, we close the driver using the quit() method.
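The parsing half of the script can be tried on its own, without a browser, by feeding BeautifulSoup a saved HTML string in place of driver.page_source. The markup below is invented for illustration; a real page's class names and attributes may differ:

```python
from bs4 import BeautifulSoup

# A stand-in for driver.page_source; this markup is made up for illustration.
html = """
<div class="static">
  <a title="Player One" href="/player/1">Player One</a>
  <a title="Player Two" href="/player/2">Player Two</a>
</div>
"""

# 'html.parser' is Python's built-in parser; the article uses 'lxml',
# which works the same way once the lxml package is installed.
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='static')
for a in div.find_all('a'):
    print("Player Name: ", a['title'])
    print("More Details:", "https://www.nba.com" + a['href'])
```

This separation is handy for debugging: save driver.page_source to a file once, then iterate on the BeautifulSoup selectors without reopening the browser each time.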
So, that is all for our web scraping tutorial (using Selenium and WebDriver). I hope you guys enjoyed the post. Thanks!