# tidyethnicnews: An R Package for Turning Ethnic NewsWatch Search Results into Tidyverse-ready Dataframes
Author: Jae Yeon Kim
File an issue if you have problems, questions or suggestions.
tidyethnicnews provides functions to turn Ethnic NewsWatch search results into tidyverse-ready dataframes. This data source helps researchers investigate the political voice of minority groups in the US with rich and extensive text data.
## Install the current development version from GitHub

```r
devtools::install_github("jaeyk/tidyethnicnews", dependencies = TRUE)
```
1. Visit the website of a library that you are a member of (assuming that the library has access to the Ethnic NewsWatch database).
2. Access the database through the library website.
3. Search the database using a specified query. In general, I recommend limiting the search results to full text, excluding duplicates (one of the page options), and increasing the number of items per page to 100 for high data quality and efficient data processing.
4. Select all the items that appear on a specific page. (If your search yields 1,000 items, you should have 10 pages.)
5. Click the ... action button and save the search result. HTML is the most flexible format for extracting various attributes. The printed search result should look like the sample screenshot in Figure 1.
6. Save the printed search result (an HTML file) in a directory.

Figure 1. Sample screenshot.
Now, you have a set of unstructured HTML files. Collecting the data has been tedious. By contrast, turning these files into a tidyverse-ready dataframe is incredibly easy and lightning fast with the help of tidyethnicnews. The parsed text data has four columns: `text`, `author`, `date`, and `source` (publisher). Its rows are newspaper articles. You can enrich the text data by adding other covariates, searching keywords, exploring topics, and classifying main issues.
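Because the output is a plain dataframe, it drops straight into a tidyverse workflow. Here is a minimal sketch (not part of the package) of a keyword search and per-source count; the toy data below is a stand-in for a real parsed result with the four columns described above:

```r
library(dplyr)

# Toy stand-in for a parsed Ethnic NewsWatch result
df <- tibble(
  text   = c("The city council debated immigration reform.",
             "Local election turnout rose sharply.",
             "Immigration policy dominated the town hall."),
  author = c("A. Reporter", "B. Writer", "C. Editor"),
  source = c("Paper A", "Paper B", "Paper A"),
  date   = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))
)

# Flag articles that mention a keyword, then count them by source
res <- df %>%
  mutate(mentions_immigration = grepl("immigration", text, ignore.case = TRUE)) %>%
  count(source, mentions_immigration)

res
```

The same pattern extends naturally to joining covariates, filtering by date, or feeding the `text` column into a topic model.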
As a test, select one of the HTML files and inspect whether the result is desirable.
```r
# Load the library
library(tidyethnicnews)

# You need to choose an HTML search result
filepath <- file.choose()

# Assign the parsed result to the `df` object
df <- html_to_dataframe(filepath)
```
The `df` object should have four columns: `text`, `author`, `source`, and `date`. According to a performance test done with the microbenchmark package, the `html_to_dataframe()` function takes on average 0.0005 seconds to turn 100 newspaper articles into a tidy dataframe.
This function helps you clean and wrangle your text data without needing to know anything about parsing HTML and manipulating string patterns (regular expressions).
## Notes on the Error Messages
If an article is missing an author field, the value in the source field is moved to the author field and NA is recorded in the source field during the wrangling process. When this issue occurs, you will receive a message saying: "NAs were found in source column. The problem will be fixed automatically."
As stated in the message, the problem is fixed automatically, and there is nothing you need to worry about. If you want reassurance, take a look at the source code.
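For intuition, the automatic fix can be sketched in a few lines of base R. This is a hypothetical illustration, not the package's actual implementation: rows where `source` is NA get the publisher name moved back out of `author`, and `author` is marked as genuinely missing.

```r
# Hypothetical sketch of the automatic fix (the real code lives in the
# package source): when `source` is NA, the publisher name has slipped
# into `author`, so move it back and record `author` as missing.
fix_shifted_fields <- function(df) {
  shifted <- is.na(df$source)
  df$source[shifted] <- df$author[shifted]
  df$author[shifted] <- NA_character_
  df
}

# Toy example: the second row has the publisher stuck in `author`
toy <- data.frame(
  author = c("A. Reporter", "Paper B"),
  source = c("Paper A", NA),
  stringsAsFactors = FALSE
)

fixed <- fix_shifted_fields(toy)
```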
The next step is scaling up. The `html_to_dataframe_all()` function allows you to turn all the HTML files saved in a particular directory into a single dataframe with one command. The `df_all` object should have the same four columns: `text`, `author`, `source`, and `date`. I tested the running-time performance using the tictoc package. With parallel processing, the `html_to_dataframe_all()` function takes 12.34 seconds to turn 5,684 articles into a dataframe (about 0.002 seconds per article on average).
```r
# Load libraries
library(tidyethnicnews)
library(furrr) # for parallel processing

# You need to designate a directory path.
dirpath <- tcltk::tk_choose.dir()

# Parallel processing
n_cores <- availableCores() - 1
plan(multiprocess, # multicore, if supported, otherwise multisession
     workers = n_cores) # the maximum number of workers

# Assign the parsed result to the `df_all` object
df_all <- html_to_dataframe_all(dirpath)
```
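If you want to time your own corpus, the tictoc measurement can be reproduced along these lines. The `Sys.sleep()` call below is a stand-in for the real parsing call so the snippet runs on its own; swap in `html_to_dataframe_all(dirpath)` for an actual measurement:

```r
library(tictoc)

tic("parsing all HTML files")
Sys.sleep(0.1) # stand-in for: df_all <- html_to_dataframe_all(dirpath)
timing <- toc()

# toc() returns the start and stop times, so the elapsed
# seconds can be computed directly:
elapsed <- timing$toc - timing$tic
```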