Chapter 9 Big data
9.1 The Big Picture
- Big data problem: data is too big to fit into memory (=local environment).
- R reads data into random-access memory (RAM) at once, and this object lives in memory entirely. So, if object.size > memory.size, the process will crash R.
- Therefore, the key to dealing with big data in R is reducing the size of data you want to bring into it.
Techniques to deal with big data
- Medium-sized file (1-2 GB)
- Try to reduce the size of the file using slicing and dicing
- Tools:
- R:
data.table::fread(file path, select = c("column 1", "column 2"))
. This command imports data faster thanread.csv()
does. - Command-line:
csvkit
- a suite of command-line tools to and working with CSV
- R:
- Large file (> 2-10 GB)
- Put the data into a database and ACCESS it
- Explore the data and pull the objects of interest
Databases
- Types of databases
- Relational database = a collection of tables (fixed columns and rows): SQL is a staple tool to define, query (the focus of the workshop today), control, and manipulate this type of database
- Non-relational database = a collection of documents (MongoDB), key-values (Redis and DyanoDB), wide-column stores (Cassandra and HBase), or graph (Neo4j and JanusGraph). Note that this type of database does not preclude SQL. NoSQL stands for “not only SQL.”
Relational database example
9.2 SQL
Structured Query Language. Called SEQUEL and was developed by IBM Corporation in the 1970s.
Remains the standard language for a relational database management system.
It’s a DECLARATIVE language (what to do > how to do)
- Database management systems figure an optimal way to execute a query (query optimization)
SELECT COLUMN FROM TABLE
9.2.1 Learning objectives
Embracing a new mindset: shifting from ownership (opening CSVs stored in your laptop) to access (accessing data stored in a database)
Learning how to use R and SQL to access and query a database
9.2.2 SQL and R
- SQL and R
SQL | R |
---|---|
SELECT | select() for columns, mutate() for expressions, summarise() for aggregates |
FROM | which data frame |
WHERE | filter() |
GROUP BY | group_by() |
HAVING | filter() after group_by() |
ORDER BY | arrange() |
LIMIT | head() |
Challenge 1
- Can you tell me the difference in the order in which the following
R
andSQL
codes were written to manipulate data? For instance, in R, what command comes first? In contrast, in SQL, what command comes first?
- R example
%>% # Data
data select() %>% # Column
filter() %>% # Row
group_by() %>% # Group by
summarise(n = n()) %>% # n() is one of the aggregate functions in r; it's count() used inside summarise() function
filter() %>% # Row
order_by() # Arrange
- SQL example (in a SQL chunk, use
--
instead of#
to comment)
SELECT column, aggregation (count())` -- Column
FROM data # Data
WHERE condition -- Filter rows
GROUP BY column -- Group by
HAVING condition -- Filter rows after group by
ORDER BY column -- Arrange
9.2.4 Packages
pacman::p_load()
reduces steps for installing and loading several packages simultaneously.
# pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
# The rest of pkgs
::p_load(
pacman# tidyverse packages
tidyverse, # using SQL queries
DBI, # SQLite
RSQLite, # use database with dplyr
dbplyr, # glue to automate workflow
glue, # toy data
nycflights13 )
## Installing package into '/home/jae/R/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
## also installing the dependency 'plogr'
##
## RSQLite installed
9.2.5 NYC flights data
- The flight on-time performance data from the Bureau of Transportation Statistics of the U.S. government. The data goes back to 1987, and its size is more than 20 gigabytes. For practice, we only use a small subset of the original data (flight data departing NYC in 2013) provided by RStudio.
9.2.6 Workflow
- Create/connect to a database
Note that the server also can be your laptop (called localhost).
Short answer: To do so, you need interfaces between R and a database. We use
RSQLite
in this tutorial because it’s easy to set up.Long answer: The
DBI
package in R provides a client-side interface that allowsdplyr
to work with databases. DBI is automatically installed when you installdbplyr
. However, you need to install a specific backend engine (a tool for communication between R and a database management system) for the database (e.g.,RMariaDB
,RPostgres
,RSQLite
). In this workshop, we use SQLite because it is the easiest to get started with. I love PostgreSQL because it’s open-source and also powerful to do many amazing things (e.g., text mining, geospatial analysis). If you want to build a data warehouse, an analytical platform, consider using Spark (Hadoop).
- Copy a table to the database
Option 1: You can create a table and insert rows manually. You also need to define the data schema (the database structure) to do that.
Table
- Collection of rows
- Collection of columns (fields or attributes)
- Each col has a type:
- String:
VARCHAR(20)
- Integer:
INTEGER
- Floating-point:
FLOAT
,DOUBLE
- Date/time:
DATE
,TIME
,DATETIME
- String:
- Schema: the structure of the database
- The table name
- The names and types of its columns
- Various optional additional information
- Constraints
- Syntax:
column datatype constraint
- Examples:
NOT NULL
,UNIQUE
,INDEX
- Syntax:
- Constraints
-- Create table
CREATE TABLE students (
id INT AUTO_INCREMENT,
VARCHAR(30),
name DATE,
birth FLOAT,
gpa INT,
grad PRIMARY KEY(id));
-- Insert one additional row
INSERT INTO students(name, birth, gpa, grad)
VALUES ('Adam', '2000-08-04', 4.0, 2020);
- Option 2: Copy a file (object) to a table in a database using
copy_to
). We take this option as it’s fast, and we would like to focus on querying in this workshop.
- Query the table
- Main focus
Pull the results of interests (data) using
collect()
Disconnect the database
9.2.6.1 Create a database
# Define a backend engine
<- RSQLite::SQLite()
drv
# Create an empty in-memory database
<- DBI::dbConnect(drv,
con dbname = ":memory:")
# Connect to an existing database
#con <- DBI::dbConnect(RMariaDB::MariaDB(),
# host = "database.rstudio.com",
# user = "hadley",
# password = rstudioapi::askForPassword("Database password")
#)
dbListTables(con)
## character(0)
# character(0) = NULL
- Note that
con
is empty at this stage.
9.2.6.2 Copy an object as a table to the database (push)
# Copy objects to the data
# copy_to() comes from dplyr
copy_to(dest = con,
df = flights)
copy_to(dest = con,
df = airports)
copy_to(dest = con,
df = planes)
copy_to(dest = con,
df = weather)
# If you need, you can also select which columns you would like to copy:
# copy_to(dest = con,
# df = flights,
# name = "flights",
# indexes = list(c("year", "tailnum", "dest")))
# Show two tables in the database
dbListTables(con)
## [1] "airports" "flights" "planes" "sqlite_stat1" "sqlite_stat4"
## [6] "weather"
# Show the columns/attributes/fields of a table
dbListFields(con, "flights")
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
dbListFields(con, "weather")
## [1] "origin" "year" "month" "day" "hour"
## [6] "temp" "dewp" "humid" "wind_dir" "wind_speed"
## [11] "wind_gust" "precip" "pressure" "visib" "time_hour"
9.2.6.3 Quick demonstrations:
SELECT desired columns
FROM tables
Select all columns (*) from
flights
table and show thefirst ten rows
Note that you can combine SQL and R commands thanks to
dbplyr.
Option 1
::dbGetQuery(con,
DBI"SELECT * FROM flights;") %>% # SQL
head(10) # dplyr
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## 7 19 B6 507 N516JB EWR FLL 158 1065 6 0
## 8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
## 9 -8 B6 79 N593JB JFK MCO 140 944 6 0
## 10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
## time_hour
## 1 1357034400
## 2 1357034400
## 3 1357034400
## 4 1357034400
## 5 1357038000
## 6 1357034400
## 7 1357038000
## 8 1357038000
## 9 1357038000
## 10 1357038000
- Option 2 (works faster)
SELECT *
FROM flights
LIMIT 10
Option 3 (automating workflow)
- When local variables are updated, the SQL query is also automatically updated. This approach is called parameterized query (or prepared statement).
######################## PREPARATION ########################
# Local variables
<- "flights"
tbl <- "dep_delay"
var <- 10
num
# Glue SQL query string
# Note that to indicate a numeric value, you don't need.
<- glue_sql("
sql_query SELECT {`var`}
FROM {`tbl`}
LIMIT {num}
", .con = con)
######################## EXECUTION ########################
# Run the query
dbGetQuery(con, sql_query)
## dep_delay
## 1 2
## 2 4
## 3 2
## 4 -1
## 5 -6
## 6 -4
## 7 -5
## 8 -3
## 9 -3
## 10 -2
Challenge 2
Can you rewrite the above code using LIMIT
instead of head(10)
?
You may notice that using only SQL code makes querying faster.
Select
dep_delay
andarr_delay
from flights table, show the first ten rows, then turn the result into a tibble.
Challenge 3
Could you remind me how to see the list of attributes of a table? Let’s say you want to see the flights
table attributes. How can you do it?
- Collect the selected columns and filtered rows
<- dbGetQuery(con,
df "SELECT dep_delay, arr_delay FROM flights;") %>%
head(10) %>%
collect()
Counting rows
- Count all (*)
dbGetQuery(con,
"SELECT COUNT(*)
FROM flights;")
## COUNT(*)
## 1 336776
dbGetQuery(con,
"SELECT COUNT(dep_delay)
FROM flights;")
## COUNT(dep_delay)
## 1 328521
- Count distinct values
dbGetQuery(con,
"SELECT COUNT(DISTINCT dep_delay)
FROM flights;")
## COUNT(DISTINCT dep_delay)
## 1 527
9.2.6.4 Tidy-way: dplyr -> SQL
Thanks to the dbplyr
package, you can use the dplyr
syntax to query SQL.
- Note that pipe (%) works.
# tbl select tables
<- con %>% tbl("flights")
flights <- con %>% tbl("airports")
airports <- con %>% tbl("planes")
planes <- con %>% tbl("weather") weather
select
=SELECT
%>%
flights select(contains("delay"))
## # Source: SQL [?? x 2]
## # Database: sqlite 3.40.1 [:memory:]
## dep_delay arr_delay
## <dbl> <dbl>
## 1 2 11
## 2 4 20
## 3 2 33
## 4 -1 -18
## 5 -6 -25
## 6 -4 12
## 7 -5 19
## 8 -3 -14
## 9 -3 -8
## 10 -2 8
## # ℹ more rows
Challenge 4
Your turn: write the same code in SQL. Don’t forget to add the connection
argument to your SQL code chunk.
mutate
=SELECT
AS
%>%
flights select(distance, air_time) %>%
mutate(speed = distance / (air_time / 60))
## # Source: SQL [?? x 3]
## # Database: sqlite 3.40.1 [:memory:]
## distance air_time speed
## <dbl> <dbl> <dbl>
## 1 1400 227 370.
## 2 1416 227 374.
## 3 1089 160 408.
## 4 1576 183 517.
## 5 762 116 394.
## 6 719 150 288.
## 7 1065 158 404.
## 8 229 53 259.
## 9 944 140 405.
## 10 733 138 319.
## # ℹ more rows
Challenge 5
Your turn: write the same code in SQL. (
Hint: mutate(new_var = var 1 * var2
(R) = SELECT var1 * var2 AS near_var
(SQL)
filter
=WHERE
%>%
flights filter(month == 1, day == 1) # filter(month ==1 & day == 1) Both work in the same way.
## # Source: SQL [?? x 19]
## # Database: sqlite 3.40.1 [:memory:]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dbl>
Challenge 6
Your turn: write the same code in SQL (hint: filter(condition1, condition2)
= WHERE condition1 and condition2
)
Additional tips
Note that R and SQL operators are not exactly alike. R uses !=
for Not equal to
. SQL uses <>
or !=
. Furthermore, there are some cautions about using NULL
(NA; unknown or missing): it should be IS NULL
or IS NOT NULL
not =NULL
or !=NULL
(this makes sense because NULL represents an absence of a value).
Another pro-tip is LIKE
operator, used in a WHERE
statement to find values based on string patterns.
SELECT DISTINCT(origin) -- Distinct values from origin column
FROM flights
WHERE origin LIKE 'J%'; -- Find any origin values that start with "J"
origin |
---|
JFK |
%
is one of the wildcards you can use for string matching. %
matches any number of characters. So, J%
matches Jae, JFK, Joseph, etc. _
is another useful wildcard that matches exactly one character. So J_
matches only JA, JE, etc. If wildcards are not enough, then you should consider using regular expressions.
arrange
=ORDER BY
%>%
flights arrange(carrier, desc(arr_delay)) %>%
show_query()
## <SQL>
## SELECT *
## FROM `flights`
## ORDER BY `carrier`, `arr_delay` DESC
Challenge 7
Your turn: write the same code in SQL.
Hint: arrange(var1, desc(var2)
(R) = ORDER BY var1, var2 DESC
(SQL)
summarise
=SELECT
AS
andgroup by
=GROUP BY
%>%
flights group_by(month, day) %>%
summarise(delay = mean(dep_delay))
## `summarise()` has grouped output by "month". You can override using the
## `.groups` argument.
## Warning: Missing values are always removed in SQL aggregation functions.
## Use `na.rm = TRUE` to silence this warning
## This warning is displayed once every 8 hours.
## # Source: SQL [?? x 3]
## # Database: sqlite 3.40.1 [:memory:]
## # Groups: month
## month day delay
## <int> <int> <dbl>
## 1 1 1 11.5
## 2 1 2 13.9
## 3 1 3 11.0
## 4 1 4 8.95
## 5 1 5 5.73
## 6 1 6 7.15
## 7 1 7 5.42
## 8 1 8 2.55
## 9 1 9 2.28
## 10 1 10 2.84
## # ℹ more rows
Challenge 8
Your turn: write the same code in SQL (hint: in SQL the order should be SELECT group_var1, group_var2, AVG(old_var) AS new_var
-> FROM
-> GROUP BY
)
- If you feel too much challenged, here’s a help.
%>%
flights group_by(month, day) %>%
summarise(delay = mean(dep_delay)) %>%
show_query() # Show the SQL equivalent!
## `summarise()` has grouped output by "month". You can override using the
## `.groups` argument.
## <SQL>
## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
## FROM `flights`
## GROUP BY `month`, `day`
Joins
Using joins is more straightforward in R than it is in SQL.
However, more flexible joins exist in SQL, and they are not available in R.
- Joins involving 3+ tables are not supported.
- Some advanced joins available in SQL are not supported.
- For more information, check out
tidyquery
to see the latest developments.
SQL command
FROM one table LEFT JOIN another table ON condition = condition
(ON
in SQL = BY
in R)
SELECT *
FROM flights AS f
LEFT JOIN weather AS w
ON f.year = w.year AND f.month = w.month
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 1 | 39.02 | 26.06 | 59.37 | 270 | 10.35702 | NA | 0 | 1012.0 | 10 | 1357020000 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 2 | 39.02 | 26.96 | 61.63 | 250 | 8.05546 | NA | 0 | 1012.3 | 10 | 1357023600 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 3 | 39.02 | 28.04 | 64.43 | 240 | 11.50780 | NA | 0 | 1012.5 | 10 | 1357027200 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 4 | 39.92 | 28.04 | 62.21 | 250 | 12.65858 | NA | 0 | 1012.2 | 10 | 1357030800 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 5 | 39.02 | 28.04 | 64.43 | 260 | 12.65858 | NA | 0 | 1011.9 | 10 | 1357034400 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 6 | 37.94 | 28.04 | 67.21 | 240 | 11.50780 | NA | 0 | 1012.4 | 10 | 1357038000 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 7 | 39.02 | 28.04 | 64.43 | 240 | 14.96014 | NA | 0 | 1012.2 | 10 | 1357041600 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 8 | 39.92 | 28.04 | 62.21 | 250 | 10.35702 | NA | 0 | 1012.2 | 10 | 1357045200 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 9 | 39.92 | 28.04 | 62.21 | 260 | 14.96014 | NA | 0 | 1012.7 | 10 | 1357048800 |
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 | EWR | 2013 | 1 | 1 | 10 | 41.00 | 28.04 | 59.65 | 260 | 13.80936 | NA | 0 | 1012.4 | 10 | 1357052400 |
Can anyone explain why SQL query using dplyr
then translated by show_query()
looks more complex than the above? (Hint)
%>%
flights left_join(weather, by = c("year", "month")) %>%
show_query()
## <SQL>
## SELECT
## `flights`.`year` AS `year`,
## `flights`.`month` AS `month`,
## `flights`.`day` AS `day.x`,
## `dep_time`,
## `sched_dep_time`,
## `dep_delay`,
## `arr_time`,
## `sched_arr_time`,
## `arr_delay`,
## `carrier`,
## `flight`,
## `tailnum`,
## `flights`.`origin` AS `origin.x`,
## `dest`,
## `air_time`,
## `distance`,
## `flights`.`hour` AS `hour.x`,
## `minute`,
## `flights`.`time_hour` AS `time_hour.x`,
## `weather`.`origin` AS `origin.y`,
## `weather`.`day` AS `day.y`,
## `weather`.`hour` AS `hour.y`,
## `temp`,
## `dewp`,
## `humid`,
## `wind_dir`,
## `wind_speed`,
## `wind_gust`,
## `precip`,
## `pressure`,
## `visib`,
## `weather`.`time_hour` AS `time_hour.y`
## FROM `flights`
## LEFT JOIN `weather`
## ON (
## `flights`.`year` = `weather`.`year` AND
## `flights`.`month` = `weather`.`month`
## )
9.2.6.5 Collect (pull)
collect()
is used to pull the data. Depending on the data size, it may take a long time to run.The following code won’t work.
Error in UseMethod(“collect”) : no applicable method for ‘collect’ applied to an object of class “c(‘LayerInstance’, ‘Layer’, ‘ggproto’, ‘gg’)”
<- flights %>%
origin_flights_plot group_by(origin) %>%
tally() %>%
ggplot() +
geom_col(aes(x = origin, y = n)) %>%
collect()
- This works.
<- flights %>%
df group_by(origin) %>%
tally() %>%
collect()
<- ggplot(df) +
origin_flights_plot geom_col(aes(x = origin, y = n))
origin_flights_plot
9.2.7 Things we didn’t cover
9.2.7.1 Subquery
Subquery = a query nested inside a query
This hypothetical example is inspired by dofactory blog post.
SELECT names -- Outer query
FROM consultants
WHERE Id IN (SELECT ConsultingId
FROM consulting_cases
WHERE category = 'r' AND category = 'sql'); -- Subquery
9.2.7.2 Common table expression (WITH clauses)
This is just a hypothetical example inspired by [James LeDoux’s blog post](https://jamesrledoux.com/code/sql-cte-common-table-expressions.
-- cases about R and SQL from dlab-database
WITH r_sql_consulting_cases AS ( -- The name of the CTE expression
-- The CTE query
SELECT
id
FROM
dlab WHERE
LIKE '%sql%'
tags AND
LIKE '%r%'
tags
),-- count the number of open cases about this consulting category
-- The outer query
SELECT status, COUNT(status) AS open_status_count
FROM dlab as d
INNER JOIN r_sql_consulting_cases as r
ON d.id = r.id
WHERE status = 'open';
9.2.8 References
- csv2db - for loading large CSV files in to a database
- R Studio, Database using R
- Ian Cook, “Bridging the Gap between SQL and R” rstudio::conf 2020 slides
- Data Carpentry contributors, SQL database and R, Data Carpentry, September 10, 2019.
- Introduction to dbplyr
- Josh Erickson, SQL in R, STAT 701, University of Michigan
- SQL zine by Julia Evans
- q - a command-line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files)