# tidyethnicnews: An R Package for Turning Ethnic NewsWatch Search Results into Tidyverse-ready Dataframes
Author: Jae Yeon Kim
File an issue if you have problems, questions or suggestions.
The Ethnic NewsWatch database, provided by ProQuest, has compiled more than 2.5 million articles published in US ethnic newspapers and magazines, covering a wide range of US minority communities.
This data source helps researchers investigate the political voice of minority groups in the US with rich and extensive text data. `tidyethnicnews` provides functions to turn Ethnic NewsWatch search results into tidyverse-ready dataframes.
## Install the current development version from GitHub

```r
devtools::install_github("jaeyk/tidyethnicnews", dependencies = TRUE)
```
`tidyethnicnews` should be used in strict accordance with Ethnic NewsWatch's Terms of Use. ProQuest does not allow web scraping. Instead, I recommend manually downloading search results from the database. Below, I provide a step-by-step guide. This method still works as of June 13, 2020. However, please note that ProQuest's download policy could change at any time.
1. Visit the website of a library that you are a member of (assuming that the library has access to the Ethnic NewsWatch database).
2. Access the database through the library website.
3. Search the database using a specified query. In general, I recommend limiting the search results to **full text**, excluding **duplicates** (one of the page options), and increasing the items per page to **100** for high data quality and efficient data processing.
4. Select all the items that appear on a specific page. (If your search yields 1,000 items, you should have 10 pages.)
5. Click the `...` action button and save the search result using the **PRINT** option. There are multiple file formats for saving the search result, but **HTML** is the most flexible for extracting various attributes. The printed search result should look like the sample screenshot in Figure 1.

Figure 1. Sample screenshot.

6. Save the printed search result (an HTML file) in a directory.
7. Manually repeat steps 4–6. Caution should be taken, as automating this process would violate ProQuest's Terms of Use.
Now you have a set of unstructured HTML files. Collecting the data has been tedious. By contrast, turning these files into a tidyverse-ready dataframe is incredibly easy and lightning fast with the help of `tidyethnicnews`. The parsed text data has four columns: `text`, `author`, `date` (publication date), and `source` (publisher). Its rows are newspaper articles. You can enrich the text data by adding other covariates, searching keywords, exploring topics, and classifying main issues, as sketched below.
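For instance, here is a minimal sketch of a keyword search, assuming `df` is a dataframe produced by the functions introduced below (the keyword "election" is a hypothetical example):

```r
library(dplyr)
library(stringr)

# A sketch: flag articles mentioning a hypothetical keyword, "election".
# Assumes `df` is a parsed dataframe with `text` and `source` columns.
df_keyword <- df %>%
  mutate(mentions_election = str_detect(
    text, regex("election", ignore_case = TRUE)
  ))

# Count flagged articles by publisher
df_keyword %>%
  count(source, mentions_election)
```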
## `html_to_dataframe()`: Turn an HTML file into a dataframe

As a test, select one of the HTML files and inspect whether the result is desirable.
```r
# Load library
library(tidyethnicnews)

# You need to choose an HTML search result
filepath <- file.choose()

# Assign the parsed result to the `df` object
df <- html_to_dataframe(filepath)
```
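For a quick sanity check, you might inspect the column names and the first few rows (a base R sketch):

```r
# Quick sanity checks on the parsed result
names(df) # expect: "text" "author" "source" "date"
head(df)  # inspect the first few articles
```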
The `df` object should have four columns: `text`, `author`, `source`, and `date`. According to a performance test done with the `microbenchmark` package, the `html_to_dataframe()` function takes on average 0.0005 seconds to turn 100 newspaper articles into a tidy dataframe.
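If you want to reproduce the timing on your own machine, a minimal sketch with `microbenchmark` might look like this (assuming `filepath` points to the HTML file chosen above):

```r
library(microbenchmark)

# Time the parser on a single HTML search result
microbenchmark(
  html_to_dataframe(filepath),
  times = 100 # number of repetitions
)
```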
This function helps you clean and wrangle your text data without needing to know anything about parsing HTML or manipulating string patterns (regular expressions).
## Notes on the Error Messages
If an article is missing an `author` field, the value in the `source` field is moved to the `author` field and `NA` is recorded in the `source` field during the wrangling process. When this issue occurs, you will receive an error message saying:

> "NAs were found in source column. The problem will be fixed automatically."
As stated in the message, the problem will be fixed automatically, and there's nothing you need to worry about. If you want to be sure, please take a look at the source code.
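If you would like to verify this behavior on your own data, here is a minimal base R sketch:

```r
# Count articles whose source column ended up as NA
sum(is.na(df$source))

# Inspect the affected rows, if any
df[is.na(df$source), ]
```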
## `html_to_dataframe_all()`: Turn all HTML files saved in a directory into a dataframe

The next step is scaling up. The `html_to_dataframe_all()` function allows you to turn all the HTML files saved in a particular directory into a single dataframe with one command.
Again, the `df_all` object should have four columns: `text`, `author`, `source`, and `date`. I tested the running time using the `tictoc` package. With parallel processing, the `html_to_dataframe_all()` function takes 12.34 seconds to turn 5,684 articles into a dataframe (on average, about 0.002 seconds per article).
```r
# Load libraries
library(tidyethnicnews)
library(furrr) # for parallel processing

# You need to designate a directory path
dirpath <- tcltk::tk_choose.dir()

# Parallel processing
n_cores <- availableCores() - 1

plan(multiprocess, # multicore, if supported, otherwise multisession
     workers = n_cores) # the maximum number of workers

# Assign the parsed result to the `df_all` object
df_all <- html_to_dataframe_all(dirpath)
```
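To reproduce the timing on your own directory, here is a minimal sketch with `tictoc` (assuming `dirpath` is set as above):

```r
library(tictoc)

# Time the full parsing run
tic("parsing all HTML files")
df_all <- html_to_dataframe_all(dirpath)
toc()
```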