Our MBA in Data Journalism students produced a series of tutorials as their final work in the Low Code discipline: Transforming data into guidelines without programming, taught by professor Adriano Belisário. This month you will be able to check out some of their work and have fun with the tutorials they created. Today you can check out the tutorial made by Carolina Timm.
Web Scraper: a tutorial with the beginning, middle and end of a scraping exercise + practical tips
Hey guys. Tired of spending hours copying and pasting information that could clearly be in a table ready to download and not posted on a website?
Would you like to have a simple way to collect all the information you need at once? How about a table with structured data of your choice extracted at the end of the collection?
Well then, rest assured: what you are looking for is called data scraping, a method of collecting information in an automated way. Here, we'll learn about one tool in particular: Web Scraper.
Why read this tutorial?
By the end of this tutorial, you will know how to install and use Web Scraper to create something called a Sitemap, containing Selectors capable of navigating autonomously and extracting the information you specify. To guide us, we will scrape Rotten Tomatoes (a film and series review site), more specifically the list titled 200 Best LGBTQ+ Movies of All Time.
Contextualizing the chosen website
The data of interest in this tutorial is at this URL: https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/
Note that the ranking is distributed over four pages of results, in this order of presentation: 200-151; 150-101; 100-51; 50-1. For each film, the page shows basic information: film poster, title, year of release, rating (the site's classic “tomatometer”), whether there is critical consensus, synopsis, cast and position in the ranking.
It is possible to locate and read the content on the website, but the elements are not organized in a structured way that allows easy extraction, handling and analysis of the data available there. Without a scraper, it would be necessary to organize the elements manually, repeating the process of selecting each piece of information for each film 200 times (literally). In other words, it would be a repetitive, time-consuming task. So, from here on we go hand in hand with a tool called Web Scraper.
Getting to know Web Scraper
webscraper.io is a free extension that you can install on your browser in a few minutes. At first glance, before installation, this is what it looks like:
Just click to add the extension to your browser and authorize the access permission. To check that the installation worked, go to Extensions: the Web Scraper icon and name should already appear there.
Step by step
First, install Web Scraper on your browser.
Access Rotten Tomatoes.
At any point on the page, right-click and click Inspect.
If the section opens on the side of your browser, I recommend moving it to the bottom. To do this, simply click on the three dots in the right corner and choose the icon that illustrates this way of viewing:
Next, locate the Web Scraper tab at the end of the first row of the Inspect panel.
When you click on it, the extension displays three contents:
> Sitemaps, where the sitemaps created or imported into your browser are “saved” (at this moment, right after installation, this tab will be empty);
> Sitemap, the space for the “current” Sitemap;
> Create new sitemap, which has two options (Create Sitemap and Import Sitemap):
Create Sitemap is the option that interests us at the moment. After clicking on it, you must fill in two fields and finish by clicking on Create Sitemap.
> Sitemap name: the name that will identify your Sitemap. It must contain only lowercase letters, without accents and without spaces. Here, it's called filmes_lgbtq
> Start URL : copy and paste the URL of the page: https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/
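If you prefer to derive a valid name from a free-form label, the rule above (lowercase, no accents, no spaces) can be sketched in Python. This is only an illustration: the function name `to_sitemap_name` is hypothetical, and keeping digits and underscores is an assumption based on the name used in this tutorial, not a statement of everything Web Scraper accepts.

```python
import re
import unicodedata


def to_sitemap_name(label: str) -> str:
    """Turn a free-form label into lowercase ASCII joined by underscores."""
    # Split accented characters into base letter + accent mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", label)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse every run of other characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_only.lower())
    return slug.strip("_")


print(to_sitemap_name("Filmes LGBTQ"))
```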
As soon as you click on Create Sitemap, you will go to the Sitemap filmes_lgbtq tab, which already presents you with the selectors field.
Click on Add new selector. The first selector that we will create will bring together all the information that accompanies each of the films: film poster, title, year of release, website rating, synopsis, cast and position in the ranking. Here, it will be called “film”. In Type, you will be faced with several options:
Our selector is Element. Check the Multiple box, as we want the selector to continue through the other films and select the same pattern of information that we indicated for the first one:
To select, click Select . Then, just move the cursor over the page until all the elements of the first film are selected (poster, title, year and so on). Then, scroll down the page and make that same selection in the film below. Okay, at this point, Web Scraper has already understood your idea and will select the next ones on this page.
Confirm the action by clicking Done selecting. Check that Multiple is ticked and save. We already have our first selector \o/
From now on, we will create selectors within this main selector for each of the information that interests us.
However, before that, a quick aside: if you click on Data Preview and check the information, you will notice that only the films on the first page were selected. Wow, so do I need to create a different Sitemap for each of the four pages and put everything together in a spreadsheet editor after exporting? No, the good news is that you don't need all that work. Just change the Start URL.
Click on the other pages in the ranking, one by one, and note the URL of each one. Did you notice how the number at the end changes?
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/2/
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/3/
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/4/
In Sitemap filmes_lgbtq, you will have these alternatives:
Choose Edit metadata .
In Start URL, we will change it to https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]. This way, the four pages will be scraped.
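To make the range notation concrete, here is a minimal Python sketch of what a pattern like [1-4] stands for: one URL per number in the range. It only illustrates the idea and is not the extension's own code; the function name `expand_range` is hypothetical.

```python
import re


def expand_range(url_pattern: str) -> list[str]:
    """Expand one [start-end] numeric range in a URL into concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url_pattern)
    if match is None:
        # No range in the pattern: the URL stands for itself.
        return [url_pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix = url_pattern[:match.start()]
    suffix = url_pattern[match.end():]
    # The range is inclusive on both ends, so 1-4 yields four URLs.
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


pattern = "https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]"
for url in expand_range(pattern):
    print(url)
```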
After saving this change, click on Sitemap filmes_lgbtq and go back to Selectors. Now, instead of adding a new “main” selector, we will create selectors within the already created selector. To do this, click on film:
Notice that “film” is now shown next to _root at the top. Now, click Add new selector. Shall we start by selecting the title? The Type is now Text and you no longer need to check Multiple: after all, there is only one title within our main “film” selector. With the cursor, select only the film title and repeat this same selection with the film below.
Web Scraper already understands what you want to select as “title”. To confirm, you can again consult the Data Preview.
Save the selector ☺
With this tactic, we will continue creating selectors within the main “film” selector:
One selector for each piece of information: title, year, poster, ranking, cast, synopsis and evaluation. Will they all be Type Text? Hmm, almost. The only exception is the poster, which will be Type Image.
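The parent and child logic can be mimicked in a few lines of Python. The sketch below is not how Web Scraper works internally: it runs on a tiny, invented HTML fragment (two placeholder films) whose class names are borrowed from the CSS selectors in this tutorial's Sitemap, and the real page is more complex.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup with two invented entries.
PAGE = """
<div>
  <div class="countdown-item">
    <div class="countdown-index">#2</div>
    <h2><a>Example Film A</a> <span class="subtle">(2001)</span></h2>
  </div>
  <div class="countdown-item">
    <div class="countdown-index">#1</div>
    <h2><a>Example Film B</a> <span class="subtle">(2015)</span></h2>
  </div>
</div>
"""


def scrape(page: str) -> list[dict]:
    root = ET.fromstring(page)
    rows = []
    # "film" plays the Element selector with Multiple checked: one match per film.
    for film in root.findall(".//div[@class='countdown-item']"):
        # Each child selector is evaluated inside a single film,
        # which is why it returns one value and needs no Multiple.
        rows.append({
            "title": film.findtext(".//h2/a"),
            "year": film.findtext(".//span[@class='subtle']"),
            "ranking": film.findtext(".//div[@class='countdown-index']"),
        })
    return rows


for row in scrape(PAGE):
    print(row)
```

Each dictionary corresponds to one row of the final table, and each key to one child selector.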
Have all the selectors already been created out there? Excellent!
In the filmes_lgbtq Sitemap, now look for Scrape. Yes, this is the button that makes the scraping happen automatically. There is no need to change the speed fields: when you click on Start Scraping, the extension will open a new window and collect data from the selectors created. Don't close the window with the Web Scraper icon, just let it happen ☺
After the notification that scraping has finished, in Sitemap filmes_lgbtq, you can export the information in CSV format with Export data as CSV.
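If you later want to work with the exported file outside a spreadsheet editor, a short Python sketch like the one below can read it and sort the films by ranking. Everything about the data here is an assumption: the two rows are invented placeholders, the column names simply reuse the selector names from this tutorial, and the formats "(2001)", "#2" and "90%" are guesses at how the values arrive. A real export would be opened with `open(...)` instead of the in-memory sample.

```python
import csv
import io
import re

# Invented placeholder rows standing in for the exported file.
SAMPLE = """title,year,ranking,evaluation
Example Film A,(2001),#2,90%
Example Film B,(2015),#1,97%
"""


def first_int(text: str) -> int:
    """Return the first run of digits in text as an integer."""
    return int(re.search(r"\d+", text).group())


def load_rows(csv_file) -> list[dict]:
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "title": row["title"],
            # Strip the decoration around the numbers so they can be sorted and compared.
            "year": first_int(row["year"]),
            "ranking": first_int(row["ranking"]),
            "evaluation": first_int(row["evaluation"]),
        })
    return sorted(rows, key=lambda r: r["ranking"])


for row in load_rows(io.StringIO(SAMPLE)):
    print(row)
```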
Once exported, this is what your spreadsheet will look like:
Additionally, you can also export the Sitemap , which will be generated like this:
{"_id":"filmes_lgbtq","startUrl":["https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]"],"selectors":[{"id":"film","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.countdown-item:nth-of-type(n+2)","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["film"],"selector":"h2 a","multiple":false,"regex":"","delay":0},{"id":"year","type":"SelectorText","parentSelectors":["film"],"selector":"span.subtle","multiple":false,"regex":"","delay":0},{"id":"poster","type":"SelectorImage","parentSelectors":["film"],"selector":"img","multiple":false,"delay":0},{"id":"ranking","type":"SelectorText","parentSelectors":["film"],"selector":"div.countdown-index","multiple":false,"regex":"","delay":0},{"id":"cast","type":"SelectorText","parentSelectors":["film"],"selector":"div.cast","multiple":false,"regex":"","delay":0},{"id":"synopsis","type":"SelectorText","parentSelectors":["film"],"selector":"div.synopsis","multiple":false,"regex":"","delay":0},{"id":"evaluation","type":"SelectorText","parentSelectors":["film"],"selector":"span.tMeterScore","multiple":false,"regex":"","delay":0}]}
Out of curiosity, this is the content that you paste into the Import Sitemap when you want to consult this Sitemap in another browser or share it so that other users can access it.
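Before sharing a Sitemap, it can help to check that every selector points to a parent that actually exists, since a child whose parent name is misspelled has nowhere to run. The Python sketch below does this on a trimmed copy of the export (three selectors only, for brevity); the function name `orphan_selectors` is hypothetical.

```python
import json

# Trimmed copy of the exported Sitemap, reflowed over several lines.
SITEMAP_JSON = """
{"_id": "filmes_lgbtq",
 "startUrl": ["https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]"],
 "selectors": [
   {"id": "film", "type": "SelectorElement", "parentSelectors": ["_root"],
    "selector": "div.countdown-item:nth-of-type(n+2)", "multiple": true, "delay": 0},
   {"id": "title", "type": "SelectorText", "parentSelectors": ["film"],
    "selector": "h2 a", "multiple": false, "regex": "", "delay": 0},
   {"id": "poster", "type": "SelectorImage", "parentSelectors": ["film"],
    "selector": "img", "multiple": false, "delay": 0}
 ]}
"""


def orphan_selectors(sitemap: dict) -> list[str]:
    """Return ids of selectors whose parent is neither _root nor another selector."""
    known = {"_root"} | {sel["id"] for sel in sitemap["selectors"]}
    return [sel["id"] for sel in sitemap["selectors"]
            if not set(sel["parentSelectors"]) <= known]


sitemap = json.loads(SITEMAP_JSON)
for sel in sitemap["selectors"]:
    print(sel["id"], sel["type"], "inside", ", ".join(sel["parentSelectors"]))
print("orphans:", orphan_selectors(sitemap))
```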
This is our finish line. Good practices and long life with Web Scraper!