Meaningful Text Analysis with Word Embeddings

Syllabus

A week-long workshop, taught at the Digital Humanities Summer Institute, Summer 2022.

Instructor: Jonathan Reeve, Department of English and Comparative Literature, Columbia University.

Email: jonathan.reeve@columbia.edu. Please communicate with me through the course chatroom, though, whenever possible and appropriate.

Website: https://dhsi2022.jonreeve.com

Dates: 6–10 June 2022, daily at 11:00 Victoria / 14:00 New York / 18:00 UTC, for one hour.

Classroom: https://meet.jit.si/dhsi2022-word-embeddings

Chatroom: DHSI2022 Word Embeddings, on Matrix. We’ll use this room for all of our communication. A good program for Matrix chat is Element, available for Web, Android, iOS, macOS, Windows, and Linux.

Course Description

Word embeddings provide new ways of understanding language by incorporating the contexts, meanings, and senses of words into their digital representations. They are a new technology, developed by researchers at Google, that now powers the most advanced computational language tasks, such as machine translation, automatic summarization, and information extraction. Since they represent more than just the surface forms of words, their applications for humanities scholarship are profound. This course will serve as a hands-on introduction to word embeddings, and will use the Python programming language in conjunction with the spaCy package for natural language processing. Participants are encouraged to bring their own collections of text to analyze, and will create meaningful explorations of them by the end of the course. No prior programming experience is necessary.

Course Communications

Since this is an online-only course this summer, we’ll have to get creative with the ways we communicate. Here are our modes of communication:

  1. Daily video lectures. These are pre-recorded video lectures, to be watched before we meet each day.
  2. Daily videoconferences. These are our class times.
  3. Chatroom. This is for any other communication. Feel free to post any questions or comments you may have there.
  4. Marginalia, using Hypothes.is, on our readings.

Readings

I’ve chosen five readings that I hope will be of interest to you. I made the unconventional decision, for a digital humanities course, of choosing primary texts from technical disciplines, and so they may seem somewhat like they’re written in a foreign language. Don’t worry about understanding every bit of them. But don’t ignore their implied challenge, either.

We’ll discuss the readings using Hypothes.is. Feel free to write any annotations you may have, in the virtual margins, and to reply to other annotations. Try to write at least one per reading.

Technical stack

We’ll be using Google Colaboratory as our computing environment. It runs in the cloud, on Google’s servers, so you don’t need anything more than a web browser to run it. It does require that you have a Google account, however.

One important note about Colab is that the virtual machine’s state (its memory of executed code) is wiped after a certain period of inactivity, around one hour.
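
One way to soften the effect of a wiped session is to write results you care about to a file, and read them back after re-running your setup cells. What follows is only a minimal sketch, not an official Colab recipe: the names SAVE_DIR, save_result, and load_result are hypothetical, and the example data is made up. It saves to a temporary folder so that it runs anywhere. Files on the Colab virtual machine itself are also temporary, so in practice you would probably point SAVE_DIR at a folder on a mounted Google Drive.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical location for saved results. A temporary folder is used here so
# the sketch runs anywhere; in Colab you would likely choose a folder on a
# mounted Google Drive instead, since the virtual machine's own disk is
# temporary too.
SAVE_DIR = Path(tempfile.gettempdir()) / "dhsi_saved_results"


def save_result(name, data):
    """Write a JSON-serializable result to disk and return the file path."""
    SAVE_DIR.mkdir(parents=True, exist_ok=True)
    path = SAVE_DIR / f"{name}.json"
    path.write_text(json.dumps(data), encoding="utf-8")
    return path


def load_result(name):
    """Read a saved result back, or return None if nothing was saved."""
    path = SAVE_DIR / f"{name}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up example data: word counts from an imaginary text.
counts = {"whale": 2, "sea": 1}
save_result("word_counts", counts)

# Later, for instance after the session has been reset:
restored = load_result("word_counts")
```

JSON is used here because it is human-readable and handles simple values such as numbers, strings, lists, and dictionaries. Other kinds of objects would need a different format.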

Before the course

Monday, 6 June: Theory of Word Embeddings

Tuesday, 7 June: Introduction to Python for Text Analysis

Wednesday, 8 June: Hands-on With Pre-Trained Word Embeddings

Thursday, 9 June: Practicum in Text Analysis

Friday, 10 June: Lab Work