Chapter 2 Computational thinking
2.1 Why computational thinking
If social scientists want to know how to work smart and not just hard, they need to take full advantage of the power of modern programming languages, and that power is automation.
Let’s think about the following two cases (these examples come from the column I contributed to the UC Berkeley D-Lab website):
- Case 1: Suppose a social scientist needs to collect data on civic organizations in the United States from websites, Internal Revenue Service reports, and social media posts. Since the number of these organizations is large, the researcher cannot collect a significant amount of data from various sources alone. Therefore, they would hire undergraduates and distribute tasks. This is a typical data collection plan in social science research, and it is labor-intensive. Automation is not part of the game plan, yet it is critical for many reasons. Because the process is costly, no one is likely to replicate or update the data collection effort.
Case 1 illustrates that research is hard to reproduce and scale without efficient data analytics pipelines.
- Case 2: An alternative is to write computer programs that automatically collect such data, parse it, and store it in interconnected databases. Additionally, someone may need to maintain and validate the quality of the data infrastructure. Nevertheless, this approach lowers the cost of the data collection process, thereby substantially increasing reproducibility and scalability. Furthermore, the researcher can document their code and publicly share it using their GitHub repository or even gather some of the functions they used and distribute them as open-source libraries.
Case 2 demonstrates the advantages of automation and its usefulness to both the academic community and the general public.
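The collect, parse, and store steps in Case 2 can be sketched in a few lines. This is only a minimal illustration in Python, not the actual pipeline behind the example: the organization records are made up, the collection step is replaced by hard-coded strings (a real project would request pages or call APIs), and the function names `parse_record` and `store_records` are hypothetical.

```python
import sqlite3

# Hypothetical raw records standing in for text collected from websites.
# A real collection step (web requests, API calls) is omitted here.
RAW_RECORDS = [
    "name=Riverside Food Bank; state=CA; founded=1998",
    "name=Lakeview Tenants Union; state=IL; founded=2011",
]


def parse_record(raw):
    """Turn one 'key=value; ...' string into a dictionary."""
    fields = dict(part.strip().split("=", 1) for part in raw.split(";"))
    fields["founded"] = int(fields["founded"])  # keep the year as a number
    return fields


def store_records(records, connection):
    """Save the parsed records in a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS organizations "
        "(name TEXT, state TEXT, founded INTEGER)"
    )
    connection.executemany(
        "INSERT INTO organizations VALUES (:name, :state, :founded)", records
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database for the demo
parsed = [parse_record(raw) for raw in RAW_RECORDS]
store_records(parsed, connection)
row_count = connection.execute(
    "SELECT COUNT(*) FROM organizations"
).fetchone()[0]
print(row_count)
```

Once each step is a function, rerunning the whole collection next year, or on ten times as many organizations, costs little more than running the script again.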
To gain these benefits, one must learn how to program. In the age of data science, programming is as valuable a skill as writing in social science research. The extent to which a researcher can automate the research process can determine its efficiency, reproducibility, and scalability.
Below is an insightful quote from Hadley Wickham, who won the 2019 COPSS Presidents’ Award for his outstanding contribution to statistics by developing tidyverse R packages that have transformed how people wrangle, analyze, and visualize data. Even if you do not work with big data or machine learning, you can still benefit from learning how to program and applying the programming skills to data analysis because it is a “force multiplier.”
Every modern statistical and data analysis problem needs code to solve it. You shouldn’t learn just the basics of programming, spend some time gaining mastery. Improving your programming skills pays off because code is a force multiplier: once you’ve solved a problem once, code allows you to solve it much faster in the future. As your programming skill increases, the generality of your solutions improves: you solve not just the precise problem you encountered, but a wider class of related problems (in this way programming skill is very much like mathematical skill). Finally, sharing your code with others allows them to benefit from your experience. - Hadley Wickham
However, I do not claim that social scientists should learn programming the way software engineers do. For social scientists, programming is a means, not an end. I encourage readers to think about which aspects of the social science research process can be automated. Again, programming is just a way to teach a machine to perform these tasks.
Teaching a computer to perform a particular task requires computational thinking: “formulating a problem and expressing its solution in a way that a computer—human or machine—can effectively carry out” (defined by Jeannette M. Wing). Specifically, this means readers need to get familiar with how computers think about data and handle them.
2.2 How to teach and learn computational thinking
This book teaches this art in incremental steps:
- From graphical user interface to command-line interface (ch 3)
- From short programs to long programs (ch 4-5)
- The ultimate goal is to solve complex problems at scale using computation (ch 6-7)
In this textbook, I will cover programming concepts, with an emphasis on practical application. As John Chambers’ quote suggests, this approach helps you learn computational thinking and apply it in specific contexts by coding and solving problems.
“[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.” - Stages in the Evolution of S by John Chambers (S is the progenitor of R)
Here are some valuable reminders for learning.
Beginners! Learning programming is a long game. The essential component of learning (for almost any subject) is consistency. Never stop writing code, even if your current code falls far short of perfection.
Intermediate programmers! Try to empower, not intimidate, newbies. The most important rule in the computational social science community (at least, in my opinion) is being nice. Please read David Robinson’s “A Million Lines of Bad Code” for more insights.
Finally, have fun. I’ve discussed the benefits of learning programming, but I’ve been a teacher long enough to understand that this won’t persuade everyone to learn how to program, particularly those who have had negative experiences with STEM education.
Instead, I will try to make the materials as accessible as possible by emphasizing the following two ideas in teaching: showing the BIG PICTURE and walking through the WORKFLOW. With Margaret Ng (Assistant Professor of Journalism, UIUC), I discussed the pedagogical importance of these two concepts in teaching computational social science to everyone. The article was published in PS: Political Science and Politics (APSA’s official journal on the field’s professionalization). If you are interested in my full argument, please read the article.
Here is a quick summary of why I think inclusive teaching of programming matters for social science students.
Showing the big picture: When teaching a new skill or technique, it is important to remind students of the input and output data types. Students from Excel, SPSS, or Stata backgrounds may not be used to considering data structure when working with data. Therefore, providing guideposts is crucial to help them avoid making obvious mistakes (such as providing a character vector when a numeric vector is needed for input data) and to see the connection between different skills (such as using APIs and web scraping).
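The character-versus-numeric mistake mentioned above is easy to reproduce. The sketch below uses Python lists as a stand-in for vectors, with made-up ages and a hypothetical `average` function; it is meant only to show why knowing the input type matters, not to prescribe a particular way of checking it.

```python
def average(values):
    """Return the mean of a list of numbers."""
    if not all(isinstance(v, (int, float)) for v in values):
        raise TypeError("average() needs numbers, not text")
    return sum(values) / len(values)


# Numbers often arrive as text when read from a CSV file or a web page.
ages_as_text = ["34", "41", "29"]

# Passing text where numbers are expected fails.
try:
    average(ages_as_text)
    text_input_failed = False
except TypeError:
    text_input_failed = True

# Converting to the expected input type first makes the same call work.
ages_as_numbers = [int(age) for age in ages_as_text]
mean_age = average(ages_as_numbers)
print(text_input_failed, mean_age)
```

Stating the expected input and output types up front ("a list of numbers in, one number out") is the kind of guidepost that helps students diagnose this error on their own.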
Walking through the workflow: Break down the steps required to move from input to output data. This approach helps students feel less overwhelmed by the complex steps required to solve a particular task. It also helps them learn how to formulate a workflow when encountering the same problem in a different context. Although the contexts may not be identical, students can find patterns across them. Finally, teaching the workflow means breaking down these steps and putting them together, ideally using functions. Acquiring this skill is critical for students to advance from beginner to intermediate programmers who can write readable and reusable code.
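Here is one way breaking a workflow into functions and putting them back together might look. The example is a sketch under stated assumptions: the survey responses, the three-point agreement scale, and the function names (`clean`, `to_scores`, `summarize`, `run_workflow`) are all invented for illustration.

```python
def clean(responses):
    """Step 1: trim whitespace and drop empty responses."""
    return [r.strip() for r in responses if r.strip()]


def to_scores(responses):
    """Step 2: map agreement labels (text) to numeric scores."""
    scale = {"disagree": 1, "neutral": 2, "agree": 3}
    return [scale[r.lower()] for r in responses]


def summarize(scores):
    """Step 3: compute the mean score."""
    return sum(scores) / len(scores)


def run_workflow(responses):
    """Chain the steps: raw text in, one summary number out."""
    return summarize(to_scores(clean(responses)))


# Hypothetical raw survey responses with messy spacing and casing.
raw = [" Agree", "neutral ", "", "DISAGREE", "agree"]
result = run_workflow(raw)
print(result)
```

Each function does one small job and can be tested separately, and `run_workflow` can be reused on a different survey as long as the responses follow the same scale.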