Viewing scraped HTML sessions from rvest and friends
Hadley Wickham’s excellent rvest R package is an extremely easy way to scrape any HTML page.
It also provides limited utilities to do the scraping interactively:
html_session
, which actually starts a browsing session in R interpreter,back
,follow_link
,jump_to
implement page navigation using history or link selectors (CSS/XPath)html_nodes
acceptsession
objects and return HTML nodes inrvest
-friendly format
However, there is no easy way to see the code which is actually parsed.
Since rvest uses httr to make HTML requests, returned code can be very different from what is seen on a browser.
Similar issues exist even if you use other R packages such as httr (which, basically, deals only with the HTTP requests), curl or RCurl (same, but the hard way!).
In this post, we provide some quick functions to access the HTML code and view it in a browser. We will also learn how to exploit the S3 object system in R to help us!
Using rvest
Suppose that we have just scraped a page:
library(xml2)
library(rvest)
url <- 'https://nghttp2.org/httpbin/'
s <- rvest::html_session(url)
class(s)
## [1] "session"
s
holds the scraped page (it is a session
-object), we can access the page contents using xml2::read_html()
:
s_tree <- xml2::read_html(s)
class(s_tree)
## [1] "xml_document" "xml_node"
s_tree
## {xml_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n <a href="https://github.com/requests/httpbin" class="git ...
We can see that s_tree
has class xml2::xml_document
, but the code is still hidden. The function as.character
helps:
as.character(s_tree)
## [1] "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"UTF-8\">\n<title>httpbin.org</title>\n<link href=\"https:/ ...
Here it is! Now we want to view it on a browser.
R base
package contains the function browseURL()
, which however opens the browser to an URL or a file. So we have to store it on disk.
Let’s create a function which does all of this: getting a session
object, extracting the HTML code, casting to character, writing it to disk and taking care of cleaning the output.
view_rvest <- function(s) {
# Cast the session to character
stopifnot(class(s) == 'session')
s_tree <- xml2::read_html(s)
s_code <- as.character(s_tree)
# Make a temporary file, fill it with text
temp_file <- tempfile(fileext = '.html')
f <- file(temp_file, open = 'w')
write(s_code, f)
close(f)
browseURL(temp_file)
# Wait a while, then delete it
Sys.sleep(3)
unlink(temp_file)
}
view_rvest(s)
And voilà!
Using httr
Suppose, now, that we want to do the same with an httr
request.
E.g., the code from the previous page can be retrieved by:
req <- httr::GET(url)
httr::content(req, as = 'text')
## [1] "<!DOCTYPE html>\n<html lang=\"en\">\n\n<head>\n <meta charset=\"UTF-8\">\n <title>httpbin.org</title>\n <link href=\"https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+C ...
where req
is a response
-object.
The resulting do-it-all function would be:
view_httr <- function(req) {
# Cast the session to character
stopifnot(class(req) == 'response')
s_code <- httr::content(req, as = 'text')
# Make a temporary file, fill it with text
temp_file <- tempfile(fileext = '.html')
f <- file(temp_file, open = 'w')
write(s_code, f)
close(f)
browseURL(temp_file)
# Wait a while, then delete it
Sys.sleep(3)
unlink(temp_file)
}
view_httr(req)
And voilà! (2)
Generalizing with S3 objects
So far, we wrote two functions:
view_rvest
, operating ontext
objectsview_httr
, operating onrequest
objects
We would like to have a single function which elegantly deals with both cases, and allows for easy extensions.
Either we write a wrapper, and distinguish between the two with an if
, or we use the S3 object system in R.
It is actually very simple and hackish, but it works well!
In short, we write a generic method, which then calls (dispatches) the appropriate method for the object type.
The generic method will be seen by the user, the following methods will be, in general, hidden.
view_html <- function(x) {
UseMethod('view_html', x)
}
What it does when called, is to grab the class of x
, and call the method named as view_html.class
(where, of course, class is class(x)
).
So, we just need to rename:
view_httr
toview_html.response
(asx
is aresponse
, fromhttr
)view_rvest
toview_html.session
(asx
is asession
, fromrvest
)
To be more compact, we can also introduce the view_html.character
method which writes character vectors to disk and opens a browser.
Finally, here are the resulting methods:
The “smart” calls:
view_html(s)
view_html(req)
view_html('test')
Thank you for reading!