QSS 20. Modern Statistical Computing

QSS 20 is a foundational, required course in the Quantitative Social Science curriculum that equips students with the computing literacy to conduct social science research in the age of “big data.” The skills you learn in QSS 20 are building blocks for data science applications in research, industry, nonprofits, and government. The course builds on your introductory programming course and teaches you how to draw meaningful insights from real-world, often messy datasets, so you can help incorporate data into decision-making and analysis. It will also teach you to pick up new methods quickly and find patterns in large-scale data. These skills are essential because methods and tools for modern statistical computing develop at a rapid pace, and in real-world data science you sometimes don’t know which tools you need until you need them.

Our topics will cover data wrangling and visualization, such as merging datasets and using SQL for database manipulation; data extraction via APIs and web-scraping; processing and analyzing text as data; and supervised machine learning. Students will also be exposed to Git/GitHub for version control and reproducibility, the command line for scalable computing, and LaTeX for smart typesetting and collaborative workflows. In addition to introductory coding modules via DataCamp, in-class activities, and problem sets, the course will culminate in a group-based, applied data science project using a real-world dataset with social impact.

Prerequisites

The course will move fast, cover a lot of material, and require active engagement with applied projects. To ensure your success, you must have taken or received AP credit for COSC 1, ENGS 20, or an equivalent (with the QSS chair’s permission) in order to enroll. An introductory statistics course is also recommended.

Goals

After completing this course, students will be able to do the following in Python:

  • Write effective and well-documented user-defined functions

  • Work with and visualize various data structures, like lists and DataFrames

  • Manipulate a variety of data, including flat files and text data

  • Write API queries to systematically access and custom-build web-based databases

  • Train and apply supervised machine learning algorithms

  • Write SQL queries to pull, aggregate, and summarize data stored in database tables

Office hours

By day of the week

  • Monday: 2:15-3:15 PM (Prof. OH), 7-8 PM (peer tutoring)

  • Tuesday: 1-2 PM (Eunice OH)

  • Wednesday: 2:15-3:15 PM (Prof. OH)

  • Thursday: 9-10 PM (peer tutoring)

  • Friday: 2-3 PM (Ramsey OH)

  • Saturday: none

  • Sunday: 2-3 PM (Ramsey OH), 8-9 PM (peer tutoring)

DataCamp as learning resource

This class has no textbook. Instead, we will use DataCamp to introduce you to course concepts, and you’re encouraged to use other online resources like StackOverflow to fill in the gaps. Each DataCamp module opens with a short video introducing a concept (e.g., loops), which you can click through if you’re already familiar with it. Next comes a series of tasks in which you write code, submit it to test whether it does what it’s supposed to, and then progress to the next task. You can access these modules on the course page within DataCamp, which you’ll sign up for using your Dartmouth email. You can join that here.

I suggest completing the assigned modules before the corresponding class, so you can get more from the in-class activities. However, completing the modules within a few days of class can also be a good way to review and deepen what you learned in class; either option is acceptable. See the DataCamp course page for specific assigned modules (which should generally match the course schedule) as well as optional modules you are welcome to complete if useful to you (these are listed as due on March 30 but are NOT required).

DataCamp plays a minor, supporting role in the course, helping prepare you with the basic syntax for in-class activities. Accordingly, DataCamp modules are graded on a completion-only basis and are worth only 5% of your grade. You will receive full credit for DataCamp modules so long as you complete them by March 14 (when your final papers are due). Even so, if you’d prefer to skip the DataCamp modules, you can talk to the professor about having the 5% reapportioned to your problem sets.

Any DataCamp modules listed as being due March 30 are not required. These are optional extra practice with the course tools and concepts to support your learning—especially those of you new to Python. You will have access to these modules (for free) until March 30, after which point only DataCamp subscribers can access them.