An R Package for Turning Tweet JSON Files into a Tidyverse-ready Dataframe
Author: Jae Yeon Kim
File an issue if you have problems, questions or suggestions.
Twitter data is an important resource for social science research. (As of July 1, 2020, if you queried “Twitter” in Google Scholar, it returned more than seven million search results.) However, parsing a great deal of Twitter JSON data is not an easy task for researchers with little programming experience. This package provides functions to clean and wrangle Tweet JSON files simply and quickly.
1. Tidying a Tweet JSON file
Most social science researchers have only worked with clearly structured data (e.g., spreadsheets). Tweets downloaded from the Twitter API are stored in a semi-structured file format called JSON (JavaScript Object Notation). In simple terms, it has some structure but does not resemble a spreadsheet. With tidytweetjson, one does not need to understand the structure of a JSON file. The package helps you get tidy data from a Tweet JSON file simply and quickly.
2. Tidying multiple Tweet JSON files
The other challenge is that, often, the number of tweets you want to analyze is large (several GBs or TBs). One way to work around this problem is to split the JSON file, parse each smaller JSON file into a dataframe, and join the results. With tidytweetjson, one does not need to write a for loop to perform this task. The package helps you parse and join multiple JSON files into a tidyverse-ready dataframe with just one command.
## Install the current development version from GitHub
devtools::install_github("jaeyk/tidytweetjson",
dependencies = TRUE)
tidytweetjson should be used in strict accordance with Twitter’s developer terms.
Sign up for a Twitter developer account.
Either search for tweets or turn a tweet ID dataset into tweets (called hydrating) using the Twitter API. I highly recommend using twarc, a command line tool and Python library, to archive Twitter JSON data. Twarc is fast, reliable, and easy to use. If you are using Twarc for the first time, refer to this tutorial. You just need to type one or two commands in the command line to download the Twitter data you want. The following are examples.
# search
$ twarc search covid19 > search.jsonl
# hydrate (covid19.tsv is a hypothetical file of tweet IDs)
$ twarc hydrate covid19.tsv > search.jsonl
Both commands save the tweets in the jsonl format; the jsonl extension refers to JSON Lines files. This structure is useful for splitting JSON data into smaller chunks if the file is too large.
# Divide the JSON file into chunks of 1000 lines (tweets)
# Linux and Windows (in Bash)
$ split -1000 search.jsonl
# macOS
$ gsplit -1000 search.jsonl
If you are a macOS user, you need to install coreutils by typing brew install coreutils in a terminal window. You should then use gsplit instead of split.
If you are a Windows user and you have not installed Bash yet, please install it (here is a guide). Bash allows you to use Linux commands in Windows.
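If you would rather stay in R than use Bash, the same chunking can be sketched in base R. This is only an illustration and not part of tidytweetjson: the function name split_jsonl() and the output file naming are assumptions made for the example. It reads the whole file into memory, so it suits files that fit in RAM; for very large files the split command above is the better choice.

```r
# Hypothetical helper (not part of tidytweetjson): split a JSON Lines file
# into smaller files of `lines_per_file` lines (tweets) each.
split_jsonl <- function(path, lines_per_file = 1000, out_dir = dirname(path)) {
  # Each line of a .jsonl file is one complete JSON object (one tweet)
  lines <- readLines(path, warn = FALSE)

  # Assign every line to a chunk: lines 1-1000 -> chunk 1, 1001-2000 -> chunk 2, ...
  chunk_id <- ceiling(seq_along(lines) / lines_per_file)
  chunks <- split(lines, chunk_id)

  # Name the outputs after the input file, e.g. search_0001.jsonl
  stem <- tools::file_path_sans_ext(basename(path))
  out_paths <- file.path(out_dir, sprintf("%s_%04d.jsonl", stem, seq_along(chunks)))

  for (i in seq_along(chunks)) {
    writeLines(chunks[[i]], out_paths[i])
  }

  # Return the paths of the files that were written
  invisible(out_paths)
}
```

Calling split_jsonl("search.jsonl") would write the chunk files next to the original file and return their paths.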
Now, you have a Tweet JSON file or a list of them. Collecting the Tweet JSON files may have been tedious, especially if you have never done this before. By contrast, turning these files into a tidyverse-ready dataframe is easy and fast with the help of tidytweetjson.
The parsed JSON data has a tidy structure. It has nine columns: (user) id, country_code (country code), (user) location, created_at (time stamp), full_text (tweets), retweet_count, favorite_count, user.followers_count, and user.friends_count. Its rows are tweets.
jsonl_to_df(): Turn a Tweet JSON file into a dataframe
As a test, select one of the JSON files and inspect whether the result looks as expected. According to a performance test done with the microbenchmark package, the jsonl_to_df() function takes 5 seconds on average to turn 1000 tweets into a tidy dataframe.
# Load library
library(tidytweetjson)
# You need to choose a Tweet JSON file
filepath <- file.choose()
# Assign the parsed result to the `df` object
df <- jsonl_to_df(filepath)
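One way to inspect the result is to confirm that the nine columns are present. The sketch below is an illustration in base R, not a function from tidytweetjson: check_tweet_df() is a hypothetical name, and it assumes the column names are spelled exactly as listed in this README.

```r
# Column names as listed in this README (assumed to be the exact names)
expected_cols <- c("id", "country_code", "location", "created_at", "full_text",
                   "retweet_count", "favorite_count",
                   "user.followers_count", "user.friends_count")

# Hypothetical helper (not part of tidytweetjson): stop with an informative
# message if any expected column is missing from the parsed data frame.
check_tweet_df <- function(df, expected = expected_cols) {
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}
```

After parsing, check_tweet_df(df) returns silently when all nine columns are there and raises an error that names the missing ones otherwise.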
jsonl_to_df_all(): Turn all Tweet JSON files, saved in a directory, into a dataframe
Again, the df_all object should have nine columns. If your JSON data is large (>10GB), I recommend running future::plan("multisession") before using this function to speed up the process. I tested the running time using the tictoc package. With parallel processing, the jsonl_to_df_all() function takes 241.68 seconds, or about 4 minutes, to turn 1,927,000 tweets into a data frame.
# Load libraries
library(tidytweetjson)
library(future) # provides availableCores() and plan()
# You need to designate a directory path where you saved the list of JSON files.
dirpath <- tcltk::tk_choose.dir()
# Parallel processing
n_cores <- availableCores() - 1 # leave one core free
# multisession runs workers in background R sessions on all platforms;
# the older multiprocess strategy is deprecated in recent versions of future
plan(multisession,
workers = n_cores) # the maximum number of workers
# Assign the parsed result to the `df_all` object
df_all <- jsonl_to_df_all(dirpath)
add_date(): Add a date variable that parses Tweet timestamps into date format
The created_at variable (Tweet timestamps) is not immediately usable for data modeling and visualization in R. The add_date() function parses the created_at variable into a new variable called date that is in the Year-Month-Day date format.
# Load dplyr for %>% and select()
library(dplyr)
# Add a date variable to the data frame
df <- add_date(df)
# First 5 rows of the original and parsed columns
df[1:5,] %>% select(created_at, date)
created_at date
Mon Jan 27 18:40:17 +0000 2020 2020-01-27
Tue Jan 28 20:52:03 +0000 2020 2020-01-28
Fri Jan 31 19:33:48 +0000 2020 2020-01-31
Wed Jan 29 17:02:34 +0000 2020 2020-01-29
Fri Jan 31 14:21:47 +0000 2020 2020-01-31
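To make the conversion concrete, here is a minimal base R sketch of how a timestamp in this format could be turned into a date. It is not the implementation used by add_date(); parse_tweet_date() is a hypothetical name, and the sketch assumes timestamps always follow the six-field layout shown above with a +0000 (UTC) offset.

```r
# Hypothetical helper (not the add_date() implementation): turn timestamps such as
# "Mon Jan 27 18:40:17 +0000 2020" into Date values in Year-Month-Day format.
parse_tweet_date <- function(created_at) {
  # Fields: weekday, month, day, time, UTC offset, year
  parts <- strsplit(created_at, " ", fixed = TRUE)

  # month.abb is the built-in vector "Jan", "Feb", ..., so this step
  # does not depend on the locale of the R session
  month <- vapply(parts, function(p) match(p[2], month.abb), integer(1))
  day   <- vapply(parts, function(p) as.integer(p[3]), integer(1))
  year  <- vapply(parts, function(p) as.integer(p[6]), integer(1))

  # Assumes the offset is +0000, so the calendar date is the UTC date
  as.Date(sprintf("%04d-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}
```

For instance, parse_tweet_date("Mon Jan 27 18:40:17 +0000 2020") gives the date 2020-01-27, matching the first row of the output above.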
add_US_location(): Add a dummy variable that identifies whether a Twitter user is located in the US
The location variable indicates the location of a Twitter user. However, this variable is difficult to use because Twitter users record their locations in non-unified ways (e.g., Berkeley, Berkeley, CA, Berkeley, USA, The People's Republic of Berkeley). The add_US_location() function checks whether the string pattern, the location input, matches the names of major US cities (population greater than about 40,000) and states. It also validates the data quality by checking that the string pattern does not match the names of non-US countries. The function returns a dummy variable called US_location in which a 1 represents located in the US, and a 0 represents not located in the US.
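The matching logic can be illustrated with a simplified base R sketch. This is not the implementation of add_US_location(): is_us_location() is a hypothetical name, the sketch uses R's built-in state.name and state.abb vectors, it leaves out the city names entirely, and the list of non-US countries is a tiny made-up sample.

```r
# Hypothetical helper (not the add_US_location() implementation): return 1 if a
# location string looks like it is in the US, and 0 otherwise.
is_us_location <- function(location) {
  # Full state names, matched as whole words regardless of case
  state_pattern <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

  # Two-letter state abbreviations and "USA", matched as whole words in upper case
  # only, so that ordinary words such as "in" or "or" do not count
  abb_pattern <- paste0("\\b(", paste(c(state.abb, "USA"), collapse = "|"), ")\\b")

  # Tiny illustrative sample of non-US countries (an assumption for this sketch)
  non_us <- c("Canada", "Mexico", "United Kingdom", "India", "Australia")
  non_us_pattern <- paste0("\\b(", paste(non_us, collapse = "|"), ")\\b")

  looks_us <- grepl(state_pattern, location, ignore.case = TRUE, perl = TRUE) |
    grepl(abb_pattern, location, perl = TRUE)
  looks_non_us <- grepl(non_us_pattern, location, ignore.case = TRUE, perl = TRUE)

  # US only if a US pattern matches and no non-US country name does
  as.integer(looks_us & !looks_non_us)
}
```

With this sketch, "Berkeley, CA" and "Berkeley, USA" are coded 1, while "The People's Republic of Berkeley" and "Toronto, Canada" are coded 0. Because city names are left out here, a bare "Berkeley" is also coded 0, which is exactly the gap that matching against a city list is meant to close.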