Copy-and-paste is a thing of the past, the fashion now is to scrape your data

Sep 15, 2021

Our MBA in Data Journalism students produced a series of tutorials as their final work in the Low Code discipline: Transforming data into guidelines without programming, taught by professor Adriano Belisário. This month you will be able to check out some of their work and have fun with the tutorials they created. Today you can check out the tutorial made by Carolina Timm.

Web Scraper: a tutorial with the beginning, middle and end of a scraping exercise + practical tips

Hey guys. Tired of spending hours copying and pasting information that could clearly be in a table ready to download and not posted on a website?
Would you like to have a simple way to collect all the information you need at once? How about a table with structured data of your choice extracted at the end of the collection?
Well then, rest assured: what you are looking for is called data scraping, a method of collecting information in an automated way. Here, we'll learn about one tool in particular: Web Scraper.

Why read this tutorial?

By the end of this tutorial, you will know how to install and use Web Scraper to create a Sitemap containing Selectors that navigate a site autonomously and extract the information you specify. To guide us, we will scrape Rotten Tomatoes (a film and series review site), more specifically the list titled 200 Best LGBTQ+ Movies of All Time.

Contextualizing the chosen website

The data of interest in this tutorial is at this URL: https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/

Note that the ranking is spread over four pages of results, presented in this order: 200-151; 150-101; 100-51; 50-1. Each entry shows the same basic information: film poster, title, year of release, rating (the site's classic “Tomatometer”), critics consensus (when there is one), synopsis, cast and position in the ranking.

[Screenshot: the first page of the ranking on Rotten Tomatoes]
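If you like to see the page layout as arithmetic, here is a minimal Python sketch of the countdown described above. It assumes the 200 films are split evenly, 50 per page, which matches the ranges listed; the names FILMS_TOTAL, PAGES and ranks_on_page are made up for this illustration.

```python
FILMS_TOTAL = 200  # size of the ranking, taken from the title of the list
PAGES = 4          # number of result pages


def ranks_on_page(page: int) -> range:
    """Ranks shown on a 1-based page, in countdown order (highest number first)."""
    if not 1 <= page <= PAGES:
        raise ValueError(f"page must be between 1 and {PAGES}")
    per_page = FILMS_TOTAL // PAGES  # assumption: an even split, 50 films per page
    first = FILMS_TOTAL - (page - 1) * per_page
    return range(first, first - per_page, -1)


for page in range(1, PAGES + 1):
    ranks = ranks_on_page(page)
    print(f"page {page}: {ranks[0]}-{ranks[-1]}")
```

This is handy later on for checking that a scrape returned the positions you expect from each page.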

It is possible to locate and read the content on the website, but the elements are not organized in a structured way that allows easy extraction, handling and analysis of the data. Without a scraper, you would have to organize the elements manually, repeating the process of selecting each piece of information for each film 200 times (literally). In other words, it would be a repetitive and time-consuming task. So, from here on, we go hand in hand with a tool called Web Scraper.

Getting to know Web Scraper

Web Scraper (webscraper.io) is a free extension that you can install on your browser in a few minutes. At first glance, before installation, this is what it looks like:

[Screenshot: the Web Scraper site with the Add to Chrome button]

Just click to add the extension to your browser and authorize the access permission. To check that the installation worked, go to Extensions: the Web Scraper icon and name should already appear there.

Step by step

First, install Web Scraper on your browser.
Then go to Rotten Tomatoes.

At any point on the page, right-click and click Inspect.

[Screenshot: right-click menu with the Inspect option]

If the panel opens on the side of your browser, I recommend moving it to the bottom. To do this, click on the three dots in the right corner of the panel and choose the icon that shows it docked at the bottom:

[Screenshot: option to dock the panel at the bottom]

Next, locate Web Scraper at the end of the first row of the panel.
When you click on it, the extension displays three tabs:
> Sitemaps, where the sitemaps created or imported into your browser are saved (right after installation, this tab will be empty);
> Sitemap, the space for the current Sitemap;
> Create new sitemap, which has two options: Create Sitemap and Import Sitemap.

[Screenshot: the Create new sitemap menu]

Create Sitemap is the option that interests us at the moment. After clicking on it, you must fill in two fields and finish by clicking on Create Sitemap.
> Sitemap name: the name that will identify your Sitemap. It must contain only lowercase letters, without accents and without spaces. Here, it's called filmes_lgbtq
> Start URL: copy and paste the URL of the page: https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/

[Screenshot: the Sitemap name and Start URL fields]
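If you are not sure a name follows the rule for the Sitemap name field (lowercase, no accents, no spaces), a small Python sketch like the one below can tidy it up for you. The function name to_sitemap_name is made up for this illustration, and keeping digits and underscores is an assumption based on the name used in this tutorial, not a documented rule of the extension.

```python
import re
import unicodedata


def to_sitemap_name(label: str) -> str:
    """Turn a free-form label into a lowercase, accent-free, space-free name."""
    # Split accented characters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Lowercase, then replace every run of other characters with an underscore.
    cleaned = re.sub(r"[^a-z0-9]+", "_", without_accents.lower())
    return cleaned.strip("_")


print(to_sitemap_name("Filmes LGBTQ"))
```

For example, the label "Filmes LGBTQ" becomes filmes_lgbtq, the name used for the Sitemap in this tutorial.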

As soon as you click on Create Sitemap, you will be taken to the Sitemap filmes_lgbtq tab, which already shows you the Selectors list.

[Screenshot: the Add new selector button]

Click on Add new selector. The first selector we create will bring together all the information that accompanies each film: poster, title, year of release, website rating, synopsis, cast and position in the ranking. Here, it will be called film. In Type, you will find several options:

[Screenshot: the film selector with the Type options]

Our selector type is Element. Check the Multiple box, as we want the selector to keep going through the other films and selecting the same pattern of information that we indicate for the first one:

[Screenshot: the film selector settings]

To make the selection, click Select. Then move the cursor over the page until all the elements of the first film are highlighted (poster, title, year and so on) and click. Next, scroll down the page and make the same selection on the film below. At this point, Web Scraper has understood the pattern and will select the remaining films on this page.

[Screenshot: the film selector with multiple films selected]

Confirm the action by clicking Done selecting. Check that Multiple is still ticked and save. We already have our first selector \o/
From now on, we will create selectors within this main selector, one for each piece of information that interests us.

Before that, though, a quick aside: if you click on Data Preview and check the information, you will notice that only the films on the first page were selected. So do I need to create a different Sitemap for each of the four pages and put everything together in a spreadsheet editor after exporting? No, the good news is that you don't need all that work. Just change the Start URL.

Click through the other pages of the ranking, one by one, and note the URL of each one. Did you notice how the number at the end changes?
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/2/
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/3/
https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/4/

In Sitemap filmes_lgbtq, you will have these options:

[Screenshot: the Sitemap tab menu]

Choose Edit metadata.
In Start URL, change the address to https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4] so that the range in square brackets covers pages 1 to 4. This way, all four pages will be scraped.
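To picture what the [1-4] range stands for, here is a minimal Python sketch that expands such a pattern into the individual page addresses. It only illustrates the idea and is not the extension's own code; the function name expand_range_url is made up, and the sketch handles only a plain start-end range.

```python
import re


def expand_range_url(pattern: str) -> list[str]:
    """Expand one "[start-end]" numeric range in a URL into concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        return [pattern]  # no range present: the URL is used as it is
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{number}{suffix}" for number in range(start, end + 1)]


START_URL = (
    "https://editorial.rottentomatoes.com/guide/"
    "best-lgbt-movies-of-all-time/[1-4]"
)
for url in expand_range_url(START_URL):
    print(url)
```

The loop prints four addresses, ending in 1, 2, 3 and 4, which are the four pages of the ranking.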

[Screenshot: saving the sitemap metadata]

After saving this change, click on Sitemap filmes_lgbtq and go back to Selectors. Now, instead of adding a new “main” selector, we will create selectors within the selector we already created. To do this, click on film:

[Screenshot: adding a selector within the film selector]

Notice that film now appears next to _root at the top. Now click Add new selector. Shall we start with the title? This time the Type is Text, and you no longer need to tick Multiple: after all, there is only one title within each main “film” selector. With the cursor, select only the film title and repeat the same selection on the film below.

[Screenshot: the title selector]

Web Scraper now understands what you want to select as “title”. To confirm, you can check Data Preview again.

[Screenshot: Data Preview]

Save the selector ☺

With this tactic, we will continue creating selectors within the main “film” selector:

[Screenshot: selectors within the main film selector]

One selector for each piece of information: title, year, poster, ranking, cast, synopsis and evaluation. Will they all be Type Text? Hmm, almost. The only exception is the poster, which will be Type Image.

[Screenshot: the Image selector]

Have all the selectors already been created out there? Excellent!

In Sitemap filmes_lgbtq, now look at the Scrape option. Yes, this is the button that makes the scraping happen automatically. There is no need to change the speed fields: when you click on Start Scraping, the extension will open a new window and collect the data defined by the selectors you created. Don't close the window with the Web Scraper icon, just let it happen ☺

[Screenshot: Web Scraper after the scrape]

Once you are notified that the scraping has finished, go to Sitemap filmes_lgbtq > Export data as CSV to download the information in CSV format.
Once exported, this is what your spreadsheet will look like:

[Screenshot: the exported spreadsheet]
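If you later want to tidy the exported file outside a spreadsheet editor, the Python sketch below shows one possible way. Everything in it is an assumption made for illustration: the rows are placeholders (not real scraped data), the column names simply repeat the selector names from this tutorial, and the cell formats ("#2", "(2001)", "90%") are a guess at how the text may come out of the page.

```python
import csv
import io
import re

# Hypothetical stand-in for the exported file: placeholder rows, not real data.
SAMPLE_CSV = """title,year,ranking,evaluation
Example Film A,(2001),#2,90%
Example Film B,(1999),#1,95%
Example Film C,(2015),#3,88%
"""


def first_int(text: str) -> int:
    """Return the first run of digits in a cell such as "#2" or "90%"."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return int(match.group())


def load_films(handle) -> list[dict]:
    """Read the rows, convert the numeric columns and sort by ranking."""
    films = []
    for row in csv.DictReader(handle):
        films.append({
            "title": row["title"].strip(),
            "year": first_int(row["year"]),
            "ranking": first_int(row["ranking"]),
            "evaluation": first_int(row["evaluation"]),
        })
    return sorted(films, key=lambda film: film["ranking"])


films = load_films(io.StringIO(SAMPLE_CSV))
for film in films:
    print(film["ranking"], film["title"], film["year"], film["evaluation"])
```

With a real export, you would pass an opened file to load_films instead of the sample text, and adjust the column names to whatever appears in the first row of your CSV.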

Additionally, you can also export the Sitemap, which will be generated like this:

{"_id":"filmes_lgbtq","startUrl":["https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]"],"selectors":[{"id":"film","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.countdown-item:nth-of-type(n+2)","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["film"],"selector":"h2 a","multiple":false,"regex":"","delay":0},{"id":"year","type":"SelectorText","parentSelectors":["film"],"selector":"span.subtle","multiple":false,"regex":"","delay":0},{"id":"poster","type":"SelectorImage","parentSelectors":["film"],"selector":"img","multiple":false,"delay":0},{"id":"ranking","type":"SelectorText","parentSelectors":["film"],"selector":"div.countdown-index","multiple":false,"regex":"","delay":0},{"id":"cast","type":"SelectorText","parentSelectors":["film"],"selector":"div.cast","multiple":false,"regex":"","delay":0},{"id":"synopsis","type":"SelectorText","parentSelectors":["film"],"selector":"div.synopsis","multiple":false,"regex":"","delay":0},{"id":"evaluation","type":"SelectorText","parentSelectors":["film"],"selector":"span.tMeterScore","multiple":false,"regex":"","delay":0}]}

For reference, this is the content you paste into Import Sitemap when you want to open this Sitemap in another browser or share it with other users.
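Before sharing or importing a Sitemap, it can be worth checking that every selector points to a parent that actually exists, since a mistyped parent name would leave a selector hanging. The Python sketch below does that on a shortened sitemap written in the same shape as the export (the main film selector plus two child selectors). The function name orphan_selectors is made up for this illustration, and the check is my own suggestion, not a feature of the extension.

```python
import json

# A shortened sitemap in the shape of the export: one main selector, two children.
SITEMAP_JSON = '''
{"_id": "filmes_lgbtq",
 "startUrl": ["https://editorial.rottentomatoes.com/guide/best-lgbt-movies-of-all-time/[1-4]"],
 "selectors": [
   {"id": "film", "type": "SelectorElement", "parentSelectors": ["_root"],
    "selector": "div.countdown-item:nth-of-type(n+2)", "multiple": true, "delay": 0},
   {"id": "title", "type": "SelectorText", "parentSelectors": ["film"],
    "selector": "h2 a", "multiple": false, "regex": "", "delay": 0},
   {"id": "poster", "type": "SelectorImage", "parentSelectors": ["film"],
    "selector": "img", "multiple": false, "delay": 0}
 ]}
'''


def orphan_selectors(sitemap: dict) -> list[str]:
    """List selectors whose parent is neither "_root" nor another selector's id."""
    known = {"_root"} | {selector["id"] for selector in sitemap["selectors"]}
    return [
        selector["id"]
        for selector in sitemap["selectors"]
        if any(parent not in known for parent in selector["parentSelectors"])
    ]


sitemap = json.loads(SITEMAP_JSON)
print("Selectors without a valid parent:", orphan_selectors(sitemap))
```

An empty list means every child selector sits under _root or under a selector that exists in the Sitemap.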

This is our finish line. Happy scraping, and long life to Web Scraper!

