Chapter 1 Hello World

print("Hello, World!")
## [1] "Hello, World!"

Make simple things simple, and complex things possible. - Alan Kay

This is the website for Computational Thinking for Social Scientists. This open-access book aims to help social scientists think computationally and become proficient with the computational tools and techniques used in computational social science research. Mastering these tools and techniques not only enables social scientists to collect, wrangle, analyze, and interpret data with less pain and more fun, but also lets them work on research projects that previously seemed impossible.

Horace Mann, the first great American advocate of public education, claimed that “Education, then, beyond all other devices of human origin, is a great equalizer of conditions of men—the balance wheel of the social machinery.” I believe in this potential of education; however, I also fully acknowledge that quality education is not equally accessible. Often, the gap between education and technology is greater among historically disadvantaged groups than among advantaged groups. As an educator, I offer this book as my small contribution to making this democratic vision of education possible, at least in the emerging field of computational social science.

That said, this book is not intended to be a comprehensive guide for computational social science or any particular programming language, computational tool, or technique. If you are interested in a general introduction to computational social science, I highly recommend Matthew Salganik’s Bit By Bit (2017). Salganik’s book is comprehensive, accessible, and pedagogically friendly.

The book comprises two main subjects (fundamentals and applications) and eight main sessions.

1.2 Part II Applications

  1. How to collect and parse semi-structured data at scale (e.g., using APIs and web scraping)

  2. How to analyze high-dimensional data (e.g., text) using machine learning

  3. How to access, query, and manage big data using SQL

The book teaches how to do all of these, mostly in R, and sometimes in bash and Python.

  • Why R? R is free, easy to learn (thanks to tidyverse and RStudio), fast (thanks to Rcpp), cross-platform (Mac/Windows/Linux), and open (16,000+ packages, counting only those available on CRAN). It also has a large, growing, and inclusive community (#rstats).

  • Why R + Python + bash?

    “For R and Python, Python is first and foremost a programming language. And that has a lot of good features, but it tends to mean, that if you are going to do data science in Python, you have to first learn how to program in Python. Whereas I think you are going to get up and running faster with R, than with Python because there’s just a bunch more stuff built in and you don’t have to learn as many programming concepts. You can focus on being a great political scientist or whatever you do and learning enough R that you don’t have to become an expert programmer as well to get stuff done.” - Hadley Wickham

    • However, this low barrier to entry in the R community also creates a challenge.

    “Compared to other programming languages, the R community tends to be more focused on results instead of processes. Knowledge of software engineering best practices is patchy: for instance, not enough R programmers use source code control or automated testing. Inconsistency is rife across contributed packages, even within base R. You are confronted with over 20 years of evolution every time you use R. R is not a particularly fast programming language, and poorly written R code can be terribly slow. R is also a profligate user of memory.” - Hadley Wickham

    • RStudio, especially the tidyverse team, has made heroic efforts to overcome the limitations mentioned above. Readers will learn about these recent advances in the R ecosystem and how to complement R with Python and bash.

    • Nevertheless, if you’re serious about programming, I highly recommend learning Python. Learning Python also helps you fill gaps in software engineering knowledge, which in turn can make you more proficient in R.

1.3 Special thanks

This book is collected as much as it is authored. It is a remixed version of PS239T, a graduate-level computational methods course at UC Berkeley, originally developed by Rochelle Terman (Assistant Professor of Political Science, Chicago) and then revised by Rachel Bernhard (Assistant Professor of Political Science, UC Davis). I served as a TA for PS239T in Spring 2018, taught it as lead instructor in Spring 2019, and co-taught it with Nick Kuipers (Postdoc, Stanford) in Spring 2020. Other teaching materials draw on the workshops I created for the D-Lab and the Data Science Discovery Program at UC Berkeley and for the Summer Institute in Computational Social Science hosted by Howard University and Mathematica. Wherever I am aware of related books, articles, slides, blog posts, or YouTube video clips, I have cited them.

1.4 Suggestions, questions, or comments

If you find typos, errors, or missing citations, please feel free to report them by creating an issue in the GitHub repository associated with this book.

1.5 License

This work is licensed under a Creative Commons Attribution 4.0 International License.