Meaningful Text Analysis with Word Embeddings

Syllabus

A week-long workshop, taught at Digital Humanities Summer Institute, Summer 2022.

Instructor: Jonathan Reeve, Department of English and Comparative Literature, Columbia University.

Email: jonathan.reeve@columbia.edu. Whenever possible and appropriate, though, please communicate with me through the course chatroom.

Website: https://dhsi2022.jonreeve.com

Dates: 6–10 June, 2022, at 11:00 Victoria / 14:00 New York / 18:00 UTC, for one hour.

Classroom: https://meet.jit.si/dhsi2022-word-embeddings

Chatroom: DHSI2022 Word Embeddings, on Matrix. We’ll use this room for all of our communication. A good program for Matrix chat is Element, available for Web, Android, iOS, macOS, Windows, and Linux.

Course Description

Word embeddings provide new ways of understanding language by incorporating contexts, meanings, and senses of words into their digital representations. They are a new technology, developed by researchers at Google, which now powers the most advanced computational language tasks, such as machine translation, automatic summarization, and information extraction. Since they represent more than just the surface forms of words, their applications for humanities scholarship are profound. This course will serve as a hands-on introduction to word embeddings, using the Python programming language in conjunction with the spaCy package for natural language processing. Participants are encouraged to bring their own collections of text to analyze, and will create meaningful explorations of them by the end of the course. No prior programming experience is necessary.

Course Communications

Since this is an online-only course this summer, we’ll have to get creative with the ways we communicate. Here are our modes of communication:

Daily video lectures. These are pre-recorded video lectures, to be watched before we meet each day.
Daily videoconferences. These are our class times.
Chatroom. This is for any other communication. Feel free to post any questions or comments you may have there.
Marginalia, using Hypothes.is, on our readings.

Readings

I’ve chosen five readings that I hope will be of interest to you. I made the unconventional decision, for a digital humanities course, of choosing primary texts from technical disciplines, and so they may seem somewhat like they’re written in a foreign language. Don’t worry about understanding every bit of them. But don’t ignore their implied challenge, either.

We’ll discuss the readings using Hypothes.is. Feel free to write any annotations you may have, in the virtual margins, and to reply to other annotations. Try to write at least one per reading.

Technical stack

We’ll be using Google Colaboratory as our computing environment. It runs in the cloud, on Google’s servers, so you don’t need anything more than a web browser to run it. It does require that you have a Google account, however.

One important note about Colab is that the virtual machine’s state (its memory of executed code) is wiped after a certain period of inactivity, around one hour.

Before the course

Please fill out this short initial survey, whether you are a participant, auditor, or anyone else.
Please introduce yourself to everyone in our course chatroom. You may have to create a Matrix account.
Create a Hypothes.is account, if you don’t already have one, and write an annotation on our first reading. I recommend using your real name as your username, so that it’s easier to know who’s who.

Monday, 6 June: Theory of Word Embeddings

Lecture video 1
[NB: the lecture video is from 2021, but applies to this year’s course as well. Please watch the lecture video before we meet over videoconference.]
Colab notebook as a GitHub Gist
Class videoconference: 11:00 Pacific / 14:00 New York / 18:00 UTC, in our videoconference room on Jitsi.
Reading: Chapter 6 of Jurafsky, Dan, and James H. Martin. Speech and Language Processing. Third edition draft.
- Please write at least one annotation using the Hypothes.is annotation layer, before class.
- Original here

Tuesday, 7 June: Introduction to Python for Text Analysis

Lecture video 2. Please watch before class.
Lecture notebook 2
Class videoconference: 11:00 Pacific / 14:00 New York / 18:00 UTC, in our videoconference room on Jitsi.
Reading: Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781 (2013).
- Please write at least one annotation using the Hypothes.is annotation layer.

Wednesday, 8 June: Hands-on With Pre-Trained Word Embeddings

Lecture video 3
Lecture notebook 3
Class videoconference: 11:00 Pacific / 14:00 New York / 18:00 UTC, in our videoconference room on Jitsi.
Reading: Kozlowski, Austin C., Matt Taddy, and James A. Evans. (2019) “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84:5.
- Also available here, via Sage
- Please write at least one annotation using the Hypothes.is annotation layer.

Thursday, 9 June: Practicum in Text Analysis

Lecture video 4
Lecture notebook 4
Class videoconference: 11:00 Pacific / 14:00 New York / 18:00 UTC, in our videoconference room on Jitsi.
Reading: Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018) “Word embeddings quantify 100 years of gender and ethnic stereotypes.” PNAS 115:16.
- Originally here, at PNAS
- Please write at least one annotation using the Hypothes.is annotation layer.

Friday, 10 June: Lab Work

Lecture video 5
Lecture notebook 5
Class videoconference: 11:00 Pacific / 14:00 New York / 18:00 UTC, in our videoconference room on Jitsi.
Reading: Bolukbasi, Tolga, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. “Man is to computer programmer as woman is to homemaker? debiasing word embeddings.” arXiv preprint arXiv:1607.06520 (2016).
- Original via ArXiv
- Please write at least one annotation using the Hypothes.is annotation layer.