The rvest package (as in “harvest”) allows you to scrape information from a web page and read it into R. In this chapter, we’ll explain the basics of rvest and walk you through an example.
16.1 Web page basics
16.1.1 HTML
HTML (Hyper Text Markup Language) defines the content and structure of a web page. In Chrome, you can view the HTML that generates a given web page by navigating to View > Developer > Developer tools.
A series of elements, like paragraphs, headers, and tables, make up every HTML page. Here’s a very simple web page and the HTML that generates it.
The words surrounded by < > are HTML tags. Tags define where an element starts and ends. Elements, like paragraphs (<p>), headings (<h1>), and tables (<table>), start with an opening tag (<tagname>) and end with the corresponding closing tag (</tagname>).
Elements can be nested inside other elements. For example, notice that the <tr> tags, which generate rows of a table, are nested inside the <table> tag, and the <td> tags, which define the cells, are nested inside <tr> tags.
The HTML contains all the information we’d need if we wanted to read the animal data into R, but we’ll need rvest to extract the table and turn it into a data frame.
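As a minimal sketch of the structure described above, the following R code stores a small page as a string and parses it with rvest. The page content (heading, paragraph, animal names, and values) is hypothetical and made up for illustration; it is not the example page itself.

```r
library(rvest)

# Hypothetical stand-in for the example page: a heading, a paragraph,
# and a table whose rows (<tr>) contain cells (<th> or <td>).
example_html <- "
<html>
  <body>
    <h1>Animals</h1>
    <p>A short table of animals.</p>
    <table>
      <tr><th>animal</th><th>legs</th></tr>
      <tr><td>cat</td><td>4</td></tr>
      <tr><td>duck</td><td>2</td></tr>
    </table>
  </body>
</html>
"

# read_html() accepts literal HTML as well as a URL or file path.
example_page <- read_html(example_html)

# Count nested elements: rows inside the table, and data cells inside rows.
n_rows <- length(html_nodes(example_page, "table tr"))
n_cells <- length(html_nodes(example_page, "table tr td"))
```

Here the table has three rows (one header row and two data rows), and only the data rows contribute <td> cells.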
16.1.2 CSS
CSS (Cascading Style Sheets) defines the appearance of HTML elements. CSS selectors are often used to style particular subsets of elements, but you can also use them to extract elements from a web page.
CSS selectors often reflect the structure of the web page. For example, the CSS selector for the example page’s heading is
body > h1
and the selector for the entire table is
body > table
You don’t need to generate CSS selectors yourself. In the next section, we’ll show you how to use your browser to figure out the correct selector.
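To illustrate how these two selectors behave, here is a small sketch that applies them to a hypothetical page with the same shape as the example (a heading and a table that are direct children of <body>). The page content is an assumption made for the demonstration.

```r
library(rvest)

# Hypothetical page: <h1> and <table> are both direct children of <body>.
page <- read_html("
<html><body>
  <h1>Animals</h1>
  <table>
    <tr><th>animal</th><th>legs</th></tr>
    <tr><td>cat</td><td>4</td></tr>
  </table>
</body></html>
")

# `body > h1` selects an <h1> that is a direct child of <body>.
heading_text <- html_text(html_node(page, "body > h1"))

# `body > table` selects the whole table element.
table_tag <- html_name(html_node(page, "body > table"))
```

The `>` in a selector means "direct child of", which is why these selectors mirror the nesting of the HTML.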
16.2 Scrape data with rvest
Our World in Data compiled data on world famines and made it available in a table.
Using this table as an example, we’ll show you how to use rvest to scrape a web page’s HTML, read in a particular element, and then convert HTML to a data frame.
16.2.1 Read HTML
First, copy the URL of the web page and store it in a variable.
Next, use rvest::read_html() to read all of the HTML into R.
read_html() returns the HTML for the entire page, which contains far more information than we need, so next we’ll extract just the famines data table.
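The reading step might look like the sketch below. The URL is a hypothetical placeholder, not the address of the famines page, and the variable names are assumptions. So that the example runs without network access, it reads a tiny stand-in page from a temporary file; read_html() is called the same way for a URL.

```r
library(rvest)

# Hypothetical placeholder; replace with the URL copied from your browser.
url_data <- "https://example.com/famines"

# With network access, you would read the live page like this:
# page <- read_html(url_data)

# Offline stand-in: write a tiny page to a temporary file and read that.
tmp <- tempfile(fileext = ".html")
writeLines("<html><body><h1>Famines</h1><p>Intro text</p></body></html>", tmp)
page <- read_html(tmp)

# The result is an xml_document holding the entire page.
page_class <- class(page)
```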
16.2.2 Find the CSS selector
We’ll find the CSS selector of the famines table and then use that selector to extract the data.
In Chrome, right-click on a cell near the top of the table, then click Inspect (or Inspect element in Safari or Firefox).
The developer console will open and highlight the HTML element corresponding to the cell you clicked.
Hovering over different HTML elements in the Elements pane will highlight different parts of the web page.
Move your mouse up the HTML document, hovering over different lines until the entire table (and only the table) is highlighted. This will often be a line with a <table> tag.
Right-click on the line, then click Copy > Copy selector (Firefox: Copy > CSS selector; Safari: Copy > Selector Path).
Return to RStudio, create a variable for your CSS selector, and paste in the selector you copied.
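One way to sanity-check a pasted selector is to count how many elements it matches. The sketch below assumes a hypothetical selector (#famines-table) and a hypothetical stand-in page; the selector you copy from your browser will look different.

```r
library(rvest)

# Hypothetical selector; paste the one copied from your browser instead.
css_selector <- "#famines-table"

# Hypothetical stand-in for the scraped page, containing two tables.
page <- read_html('
<html><body>
  <table id="other"><tr><td>x</td></tr></table>
  <table id="famines-table"><tr><td>y</td></tr></table>
</body></html>
')

# html_nodes() returns every match; a good table selector matches exactly one.
matches <- html_nodes(page, css_selector)
n_matches <- length(matches)
```

If the count is zero, the selector was probably copied from the wrong line or the page structure differs from what the browser showed.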
16.2.3 Extract the table
You already saw how to read HTML into R with rvest::read_html(). Next, use rvest::html_node() to select just the element identified by your CSS selector.
The data is still in HTML. Use rvest::html_table() to turn the output into a data frame. Note that in rvest versions before 1.0.0, html_table() returns a data.frame object, not a tibble; to convert it to a tibble, use as_tibble(). From rvest 1.0.0 onward, html_table() returns a tibble directly.
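Put together, the extraction steps could be sketched as follows. The page, selector, column names, and values here are all hypothetical stand-ins, not the real famines data. The call to tibble::as_tibble() is harmless when html_table() already returns a tibble.

```r
library(rvest)

# Hypothetical stand-in for the page and selector from the previous steps.
page <- read_html('
<html><body>
  <p>Some other content</p>
  <table id="famines-table">
    <tr><th>year</th><th>country</th></tr>
    <tr><td>1900</td><td>Country A</td></tr>
    <tr><td>1950</td><td>Country B</td></tr>
  </table>
</body></html>
')
css_selector <- "#famines-table"

famines <-
  page %>%
  html_node(css = css_selector) %>%  # select only the table element
  html_table() %>%                   # convert the HTML table to a data frame
  tibble::as_tibble()                # ensure the result is a tibble
```

In rvest 1.0.0 and later, html_element() is the newer name for html_node(); html_node() still works.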
Now, the data is ready for wrangling in R.
Note that html_table() will only work if the HTML element you’ve supplied is a table. If, for example, we wanted to extract a paragraph of text, we’d use html_text() instead.
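A minimal sketch of the text case, using a hypothetical page and a simple p selector chosen for illustration:

```r
library(rvest)

# Hypothetical page with one paragraph and one table.
page <- read_html("
<html><body>
  <p>First paragraph.</p>
  <table><tr><td>1</td></tr></table>
</body></html>
")

# Select the first <p> element and extract its text content.
paragraph_text <-
  page %>%
  html_node("p") %>%
  html_text()
```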