In the first part of this series, we introduced ourselves to the concept of web scraping using two Python libraries: requests and BeautifulSoup. The results were then stored in a JSON file. In this walkthrough, we'll tackle web scraping with a slightly different approach using the Selenium Python library. We'll then store the results in a CSV file using the pandas library.
The code used in this example is on GitHub.
Selenium is a framework designed to automate tests for web applications. You can write a Python script to control browser interactions automatically, such as link clicks and form submissions. In addition to all this, Selenium comes in handy when we want to scrape data from JavaScript-generated content on a webpage, that is, when the data shows up only after a number of AJAX requests. Nonetheless, both BeautifulSoup and Scrapy are perfectly capable of extracting data from a webpage. The choice of library boils down to how the data in that particular webpage is rendered.
Another problem one might encounter while web scraping is the possibility of your IP address being blacklisted. I partnered with Scraper API, a startup specializing in strategies that ease the worry of your IP address being blocked while web scraping. They utilize IP rotation so you can avoid detection, boasting over 20 million IP addresses and unlimited bandwidth.
In addition to this, they provide CAPTCHA handling as well as a headless browser, so you'll appear to be a real user and not get detected as a web scraper. For more on its usage, check out my post on web scraping with Scrapy, although you can use it with both BeautifulSoup and Selenium.
If you want more info as well as an intro to the Scrapy library, check out my post on the topic.
Using this Scraper API link and the code lewis10, you'll get a 10% discount off your first purchase!
For additional resources on the Selenium library and best practices, see these articles by Towards Data Science and AccordBox.
We'll be using two Python libraries: Selenium and pandas. To install them, simply run pip install selenium pandas.
In addition to this, you'll need a browser driver to simulate browser sessions. Since I am on Chrome, we'll be using that for the walkthrough.
For this example, we'll be extracting data from Quotes to Scrape, a site made specifically for practising web scraping. We'll then extract all the quotes with their authors and store them in a CSV file.
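Here's a minimal sketch of the setup, assuming a Selenium 3-style API with the ChromeDriver path stored in a webdriver variable (adjust the path for your machine):

```python
from selenium.webdriver import Chrome
import pandas as pd

# Path to the ChromeDriver executable downloaded earlier (adjust for your system)
webdriver = "/path/to/chromedriver"

# Create a browser instance controlled by the driver
driver = Chrome(webdriver)
```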
The code above imports the Chrome driver and pandas libraries. We then make an instance of Chrome by using driver = Chrome(webdriver).
Note that the webdriver variable points to the driver executable we downloaded previously for our browser of choice. If you happen to prefer Firefox, import like so:
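```python
from selenium.webdriver import Firefox
```

In that case, point the driver path at the geckodriver executable instead.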
On closer inspection of the site's URL, we'll notice that the pagination URL is http://quotes.toscrape.com/js/page/{{current_page_number}}/, where the last part is the current page number. Armed with this information, we can proceed to make a pages variable to store the exact number of web pages to scrape data from. In this instance, we'll be extracting data from just 10 web pages in an iterative manner.
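A rough sketch of that loop, assuming a pages variable and the driver created earlier, might look like this:

```python
pages = 10  # number of pages to scrape

for page in range(1, pages + 1):
    url = "http://quotes.toscrape.com/js/page/" + str(page) + "/"
    driver.get(url)
    # extraction of the quotes on each page happens here (see below)
```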
The driver.get(url) command makes an HTTP GET request to our desired webpage. From here, it's important to know the exact number of items to extract from the webpage. From our previous walkthrough, we defined web scraping as:
This is the process of extracting information from a webpage by taking advantage of patterns in the web page's underlying code.
We can use web scraping to gather unstructured data from the internet, process it and store it in a structured format.
On inspecting each quote element, we observe that each quote is enclosed within a div with the class name quote. By running driver.find_elements_by_class_name('quote'), we get a list of all elements within the page exhibiting this pattern.
To begin extracting the information from the webpages, we'll take advantage of the aforementioned patterns in the web pages' underlying code.
We'll start by iterating over the quote elements. This allows us to go over each quote and extract a specific record. Inspecting the markup, we notice that the quote text is enclosed within a span with the class text and the author within a small tag with the class name author.
Finally, we store the quote_text and author variables in a tuple, which we append to the Python list named total.
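Inside the page loop, the extraction might look roughly like this, using the Selenium 3-style selectors and the class names noted above:

```python
total = []  # initialised once, before the page loop

# inside the loop, after driver.get(url):
quotes = driver.find_elements_by_class_name('quote')

for quote in quotes:
    quote_text = quote.find_element_by_class_name('text').text
    author = quote.find_element_by_class_name('author').text
    total.append((quote_text, author))
```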
Using the pandas library, we'll initiate a dataframe to store all the records (the total list) and specify the column names as quote and author. Finally, we export the dataframe to a CSV file, which we named quoted.csv in this case.
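A sketch of that export step, with the column names and file name mentioned above:

```python
df = pd.DataFrame(total, columns=['quote', 'author'])
df.to_csv('quoted.csv')
```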
Don't forget to close the Chrome driver using driver.close().
You'll notice that I used the find_elements_by_class_name method in this walkthrough. This is not the only way to find elements; this tutorial by Klaus explains in detail how to use other selectors.
If you prefer to learn using videos, this series by Lucid Programming was very useful to me: https://www.youtube.com/watch?v=zjo9yFHoUl8
And with that, hopefully, you too can make a simple web scraper using Selenium 😎.
If you enjoyed this post subscribe to my newsletter to get notified whenever I write new posts.
I recently made a collaborations page on my website. Have an interesting project in mind or want to fill a part-time role? You can now book a session with me directly from my site.
Thanks.
The Internet evolves fast, and modern websites pretty often use dynamic content loading mechanisms to provide the best user experience. On the other hand, this makes it harder to extract data from such web pages, as it requires the execution of the page's internal JavaScript while scraping. Let's review several conventional techniques that allow data extraction from dynamic websites using Python.
A dynamic website is a type of website that can update or load content after the initial HTML load. The browser receives basic HTML with JavaScript and then loads the content using the received JavaScript code. Such an approach increases page load speed and prevents reloading the same layout each time you'd like to open a new page.
Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.
In contrast to dynamic websites, static websites contain all the requested content on page load. A great example of a static website is example.com: the whole content of this website is loaded as plain HTML during the initial page load.
To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not include any request to fetch information, just a render of different HTML after the page load.
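A minimal sketch of such a page might look like this, using the replacement text that appears later in this walkthrough:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Dynamic page example</title>
  </head>
  <body>
    <div id="test">Web Scraping is hard</div>
    <script>
      // Replace the initial text as soon as the page has loaded
      document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt';
    </script>
  </body>
</html>
```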
All we have here is an HTML file with a single <div> in the body that contains the text Web Scraping is hard, but after the page load, that text is replaced with the text generated by the JavaScript.
To prove this, let's open this page in the browser and observe the dynamically replaced text. Alright, so the browser displays the text, and HTML tags wrap it.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.
BeautifulSoup is one of the most popular Python libraries for HTML parsing; a large share of Python web scraping tutorials use this library to extract the required content from HTML.
Let's use BeautifulSoup to extract the text inside the <div> from our sample above.
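A minimal sketch of that snippet, assuming the page above is saved as test.html in the working directory:

```python
import os

from bs4 import BeautifulSoup

# Open the test HTML file from the local directory
with open(os.path.join(os.getcwd(), 'test.html')) as f:
    page_content = f.read()

# Parse the raw HTML (no JavaScript is executed here)
soup = BeautifulSoup(page_content, 'html.parser')

# Find the tag with id="test" and print its text
print(soup.find(id='test').text)
```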
This code snippet uses the os library to open our test HTML file (test.html) from the local directory and creates an instance of BeautifulSoup stored in the soup variable. Using soup, we find the tag with id test and extract the text from it.
In the screenshot from the first part of the article, we've seen that the content of the test page is I ❤️ ScrapingAnt, but the code snippet outputs the original Web Scraping is hard text instead.
And the result is different from our expectation (unless you've already figured out what is going on there). Everything is correct from the BeautifulSoup perspective - it parsed the data from the provided HTML file, but we want to get the same result as the browser renders. The reason is the dynamic JavaScript that has not been executed during HTML parsing.
We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.
Below you can find four different ways to execute a dynamic website's JavaScript and provide valid data to an HTML parser: Selenium, Pyppeteer, Playwright, and a web scraping API.
Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.
To use Selenium with Chrome/Chromium, we'll need to download the webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:
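```bash
pip install selenium
```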
The Selenium instantiation and scraping flow is the following: start the webdriver, open the target page, grab the rendered HTML, and pass it to an HTML parser.
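Here's a minimal sketch, assuming a Selenium 3-style API and the ChromeDriver executable placed in the project folder (the file path is a placeholder):

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

CHROMEDRIVER_PATH = './chromedriver'  # path to the downloaded driver

options = Options()
options.headless = True  # run Chrome without a visible window

driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=options)

# Load the test page; a regular remote URL works the same way
driver.get('file:///path/to/test.html')

# Grab the HTML after JavaScript has run and hand it to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='test').text)

driver.quit()
```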
And finally, we'll receive the required result: I ❤️ ScrapingAnt.
Selenium usage for dynamic website scraping with Python is not complicated and allows you to choose a specific browser and version, but it consists of several moving components that have to be maintained. The code itself also contains some boilerplate parts like the setup of the browser, the webdriver, and so on.
I like to use Selenium for my web scraping projects, but you can find easier ways to extract data from dynamic web pages below.
Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library. It can do mostly the same things Puppeteer can, but using Python instead of NodeJS.
Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you're doing manually with the browser: copy page's text, download images, save page as HTML, PDF, etc.
To install Pyppeteer you can execute the following command:
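```bash
pip install pyppeteer
```

On the first run, Pyppeteer will also download a compatible Chromium build for itself.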
The usage of Pyppeteer for our needs is much simpler than with Selenium.
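A minimal sketch using Pyppeteer's async API, again feeding the rendered HTML to BeautifulSoup (the local file path is a placeholder):

```python
import asyncio

from bs4 import BeautifulSoup
from pyppeteer import launch

async def main():
    # Launch a headless Chromium instance
    browser = await launch()
    page = await browser.newPage()

    # Open the test page; any URL works the same way
    await page.goto('file:///path/to/test.html')

    # Get the HTML after JavaScript has rendered the content
    page_content = await page.content()
    await browser.close()

    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').text)

asyncio.get_event_loop().run_until_complete(main())
```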
I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.
As we can expect, the result is the same: I ❤️ ScrapingAnt.
We did it again without worrying about finding, downloading, and connecting a webdriver to a browser. However, Pyppeteer looks abandoned and not properly maintained. This situation may change in the near future, but I'd suggest looking at a more powerful library.
Playwright can be considered an extended Puppeteer, as it allows using more browser types (Chromium, Firefox, and WebKit) to automate modern web app testing and scraping. You can use the Playwright API in JavaScript & TypeScript, Python, C#, and Java. And it's excellent, as the original Playwright maintainers support Python.
The API is almost the same as Pyppeteer's, but it has both sync and async versions.
Installation is as simple as always:
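```bash
pip install playwright
playwright install
```

The second command downloads the browser binaries that Playwright controls.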
Let's rewrite the previous example using Playwright.
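A minimal sketch using the sync API with Chromium (the file path is a placeholder):

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch headless Chromium; p.firefox and p.webkit work the same way
    browser = p.chromium.launch()
    page = browser.new_page()

    # Open the test page and grab the rendered HTML
    page.goto('file:///path/to/test.html')
    page_content = page.content()

    browser.close()

soup = BeautifulSoup(page_content, 'html.parser')
print(soup.find(id='test').text)
```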
As has become a good tradition, we can observe our beloved output once more: I ❤️ ScrapingAnt.
We've gone through several different data extraction methods with Python, but is there any more straightforward way to implement this job? How can we scale our solution and scrape data with several threads?
Meet the web scraping API!
The ScrapingAnt web scraping API provides the ability to scrape dynamic websites with a single API call. It already handles headless Chrome and rotating proxies, so the response will already contain the JavaScript-rendered content. ScrapingAnt's proxy pool prevents blocking and provides a constant and high data extraction success rate.
Usage of web scraping API is the simplest option and requires only basic programming skills.
You do not need to maintain the browser, libraries, proxies, webdrivers, or any other aspect of the web scraper, and you can focus on the most exciting part of the work - data analysis.
As the web scraping API runs on the cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
To check it out as HTML, we can use another great tool: HTMLPreview
The final test URL for scraping dynamic web data looks like this: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html
The scraping code itself is the simplest of all four described approaches. We'll use the ScrapingAntClient library to access the web scraping API.
Let's install it first:
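```bash
pip install scrapingant-client
```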
And then use the installed library.
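A minimal sketch with the client, assuming the general_request call and an API token from the ScrapingAnt dashboard (the token below is a placeholder):

```python
from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient

# Create the client with your personal API token
client = ScrapingAntClient(token='<YOUR_SCRAPINGANT_API_TOKEN>')

# The API renders the page's JavaScript in a cloud headless browser
result = client.general_request(
    'http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html'
)

# Parse the already-rendered HTML returned by the API
soup = BeautifulSoup(result.content, 'html.parser')
print(soup.find(id='test').text)
```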
To get your API token, please visit the Login page to authorize in the ScrapingAnt user panel. It's free.
And the result is still the required one.
All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.
Check out the documentation for more info about ScrapingAnt API.
Today we've checked out four free tools that allow scraping dynamic websites with Python. All of them use a headless browser (or an API with a headless browser) under the hood to correctly render the internal JavaScript inside an HTML page. Below you can find links with more information about those tools to help you choose the handiest one:
Happy web scraping, and don't forget to use proxies to avoid blocking 🚀