Our MBA in Data Journalism students produced a series of tutorials as their final project for the course Low Code: Turning Data into Stories Without Programming, taught by professor Adriano Belisário. This month you can check out some of their work and have fun with the tutorials they created. The first on the list is by student Beatriz Pinheiro.
Despite the recent turnaround in Brazilian women's football, with changes to the calendar, more games being broadcast and growing interest from the media and the public, the professionalization of the sport, which was banned by decree for 40 years in Brazil, still lags behind. This lag is reflected in the historical records of women's football and hampers journalistic coverage, since it is hard to find systematized data to support story ideas.
With this scenario in mind, this tutorial presents Web Scraper as a tool that can help explore and build more user-friendly databases about teams, players and competitions. The aim is not only to make journalistic work easier, but also to contribute to the historical record and to the development of Brazilian women's football.
Understanding the tool
Web Scraper is a Google Chrome extension that extracts data using a website's HTML code as its source. This code structures the information on a page into elements, which work like “boxes” that hold the data in order. Web Scraper's job is to pull the data out of these boxes and turn it into a structured spreadsheet.
The data source used in this tutorial is Soccerway Mulheres, a website that brings together statistics from women's football around the world, including teams, athletes, championships, games and results. For this exercise, we will use the Brazilian Women's Championship A1 - 2020 table as an example and scrape information about every athlete who played in the competition.
We will collect the following data for every player: club, name, position, age, games played and goals scored in the championship. Done manually, this would mean opening the page of each of the 16 teams, opening each player's page, and copying and pasting the desired information into a spreadsheet.
Besides being extremely laborious and tiring, manual collection is more prone to errors, which would put the entire analysis of the collected data at risk. That's where Web Scraper comes in: it automates the steps described above.
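To make the “boxes” idea more concrete, here is a minimal Python sketch of what a scraper does behind the scenes: it walks through the HTML, finds the elements of interest (here, links) and turns them into rows. The HTML fragment, team names and paths are made up for illustration, and this is not how Web Scraper itself is implemented; it only uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment: a table whose rows hold team links.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/teams/team-a/">Team A</a></td></tr>
  <tr><td><a href="/teams/team-b/">Team B</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collects the text and href of every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per link found
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.rows.append({"team": name, "href": self._href})
            self._href = None


def extract_links(html):
    """Return one row per link found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.rows


rows = extract_links(SAMPLE_HTML)
for row in rows:
    print(row)
```

Each dictionary in `rows` plays the role of one spreadsheet line. Web Scraper does the equivalent work through point-and-click selectors, so none of this code is needed to follow the tutorial.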
Hands on
The first step to starting the scraping process is to install Web Scraper, which can be done through this link. Then, just click on the extensions icon in the top right corner of Google Chrome and select Web Scraper to activate it.
With the extension installed and the Brazilian Women's Championship classification open, right-click on the page and select Inspect. Notice that a panel opens at the bottom of the screen, showing information about the page's code. Pay attention to the Web Scraper tab, the last one in the menu, on the right.
With the Web Scraper tab open, click Create new sitemap and select Create sitemap. Two blank fields will appear. The first, Sitemap name, should be filled in with the name of your robot, which will scrape the information. In our case, we will name it “brasileirao-feminino-2020”. The second, Start URL, defines the starting page for the scrape. In this example, that is the Brasileirão Feminino classification page.
Next, we will define the first parameter for the robot to scrape. To do this, click Add new selector, and the Id, Type and Select fields will appear. The Id field names the information we want to extract. In this exercise we want information from each of the teams in the Brasileirão Feminino, so we will call the selector “teams”.
The Type field indicates the type of HTML element to be scraped, which can be text, link or image, among other options that appear when you click on the field. Looking at the Brasileirão Feminino classification, we notice that each team in the table is a link that leads to the team's individual page. Therefore, in this step, we will select the Link type.
The next step is to activate the Select button and click on the name of each team. Note that each link is highlighted with a red box and that, from the second click on, the tool recognizes the selection we want to make by itself. Check that everything is correct and confirm with the green Done selecting button, which appears above the inspection bar.
Don't forget to check the Multiple box, so that all the selected elements, that is, all the teams, are recognized by the selector. Finally, click Save selector at the bottom of the page and that's it: we have our first selector.
Our objective here is to collect information about the athletes from each team, so let's access the Corinthians page, the first in the classification, as an example, and scroll to the part where the athletes' information is.
In the Web Scraper control bar, click on the “teams” selector we just created and repeat the previous process, this time for each athlete on the team. Create a new selector, enter the name “players” in the Id field and choose Link again as the element type, so that the robot visits each player's page. Then click the Select button, select the name of each player and click Done selecting, remembering to check the Multiple box. Finally, save the selector.
The next step is to open the page of one of the athletes, click on the “players” selector in the Web Scraper control bar, and repeat the process for the information we are looking for. This time we want the position each athlete plays, so we will name the selector “position”. The element type is now Text, and we don't need to check the Multiple box, since there is only one block of information of interest. Then save the selector.
From here on, the process is the same for the other athlete information we are looking for: age, games and goals for the season.
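The selectors created so far form a small tree: “teams” sits at the root, “players” lives inside “teams”, and the text selectors live inside “players”. The Python sketch below writes that tree out as data. The key names are modeled on the JSON that Web Scraper shows under Export Sitemap and should be treated as an assumption, since the format may differ between versions; the start URL and every CSS selector are hypothetical placeholders, not the real ones from Soccerway.

```python
import json


def make_selector(selector_id, selector_type, parent, css, multiple):
    """Build one selector entry; `parent` is the id of the selector above it."""
    return {
        "id": selector_id,
        "type": selector_type,
        "parentSelectors": [parent],
        "selector": css,
        "multiple": multiple,
    }


# The URL and all CSS selectors are placeholders (assumptions for illustration).
sitemap = {
    "_id": "brasileirao-feminino-2020",
    "startUrl": ["https://example.com/classification"],
    "selectors": [
        make_selector("teams", "SelectorLink", "_root", "td.team a", True),
        make_selector("players", "SelectorLink", "teams", "td.player a", True),
        make_selector("position", "SelectorText", "players", ".position", False),
        make_selector("age", "SelectorText", "players", ".age", False),
        make_selector("games", "SelectorText", "players", ".games", False),
        make_selector("goals", "SelectorText", "players", ".goals", False),
    ],
}


def children_of(sitemap, parent_id):
    """Return the ids of the selectors nested directly under `parent_id`."""
    return [
        s["id"] for s in sitemap["selectors"] if parent_id in s["parentSelectors"]
    ]


def print_tree(sitemap, parent_id="_root", indent=0):
    """Print the selector hierarchy, one level of indentation per depth."""
    for child in children_of(sitemap, parent_id):
        print("  " * indent + child)
        print_tree(sitemap, child, indent + 1)


print_tree(sitemap)
sitemap_json = json.dumps(sitemap, indent=2)
```

Reading the tree top to bottom mirrors what the robot does: follow each team link, then each player link, then read the text fields on the player's page.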
Scraping the data
Once that's done, it's time to scrape the data. In the Web Scraper control bar, click on Sitemap brasileirao-feminino-2020, choose the Scrape option, and then click Start scraping.
Now it's time to rest, because the robot is already working: see that a new browser window opens, in which the tool accesses the pages of each team and each athlete from the Brasileirão Feminino 2020 to scrape the data we determined.
When the process is finished, click the Refresh button, and Web Scraper will show a preview of the table built from the scraped data. Then click again on Sitemap brasileirao-feminino-2020 and select Export data as CSV.
Done! Now we have the complete table, with information on all the players who played in the Brasileirão Feminino 2020.
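For readers who later want to go beyond the spreadsheet, the exported CSV can also be explored with a few lines of Python. The sketch below sums goals per team. It assumes the CSV columns are named after the selector ids used in this tutorial (“teams”, “players”, “position”, “age”, “games”, “goals”); the real export may include extra columns, and the sample rows are invented placeholders, not real championship data.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for the exported file (not real data).
SAMPLE_CSV = """teams,players,position,age,games,goals
Team A,Player 1,Attacker,25,10,7
Team A,Player 2,Defender,30,12,1
Team B,Player 3,Midfielder,22,9,
"""


def goals_by_team(csv_text):
    """Sum the 'goals' column per team, counting blank or non-numeric cells as 0."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        goals = (row["goals"] or "").strip()
        totals[row["teams"]] += int(goals) if goals.isdigit() else 0
    return dict(totals)


totals = goals_by_team(SAMPLE_CSV)
for team, goals in sorted(totals.items()):
    print(team, goals)
```

To run it on the real export, replace `SAMPLE_CSV` with the contents of the downloaded file, for example by reading it with `open(...)`. Treating blank cells as zero is a design choice for this sketch; in a real analysis it is worth checking why a value is missing before counting it.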