IQMR 2026

Introduction to Text as Data (Modules 19, 23)

Monday, June 22; Tuesday, June 23

Eggers Hall, Room 060

Fiona Shen-Bayh (University of Maryland)

What does it mean to transform texts into data? How do computers read and analyze qualitative information quantitatively? Do computational analyses of texts map onto qualitative understanding of human language? This unit explores these questions and more by introducing students to the “text as data” pipeline, beginning with the curation of digital texts and concluding with the measurement of political concepts in lexical terms. Our first lab-based session will examine what it means to transform a collection of documents into machine-readable objects, after which we will cover step by step how to collate and clean a digital corpus in Python. The next three lab-based sessions will examine introductory approaches to quantitative text analysis, including counting, vectorizing, and embedding techniques. Our final session will conclude with a group discussion of text-based measurement strategies, in which we critically question whether such methods can produce reliable and valid measures of the concepts political scientists care about.

This module sequence has a required prerequisite. You can find more information and instructions on this Navigator page. The prerequisite is due on June 14.

Participants who did not attend the module sequence from the beginning may not join later in the sequence.

Quantifying texts (M19, June 22)

8:45am - 10:15am – Transforming text into data

What does it mean to transform text into data? This is a task of translation: translating human language (i.e., natural language) into computer language. This module digs into the mechanics of translating qualitative information into machine-readable data. We will also examine various ways of reading texts into Python.
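As a rough preview of what "reading texts into Python" can look like, here is a minimal sketch that uses only the standard library. The folder layout, the file names, and the helper name read_corpus are assumptions made for illustration; the lab materials may use different tools. The last lines show the translation idea directly: the word "court" is stored as a sequence of numbers.

```python
from pathlib import Path
import tempfile


def read_corpus(folder):
    """Read every .txt file in `folder` into a dict of {filename: text}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # Stating the encoding explicitly avoids platform-dependent defaults.
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus


# Build a tiny throwaway corpus so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "speech_a.txt").write_text("The court ruled today.", encoding="utf-8")
    Path(tmp, "speech_b.txt").write_text("Parliament voted on the bill.", encoding="utf-8")
    corpus = read_corpus(tmp)

# To the computer, a word is a sequence of byte values.
code_points = list("court".encode("utf-8"))
print(corpus)
print(code_points)
```

Keeping the file name as the dictionary key preserves a link between each text and its source document, which is useful once metadata is attached later.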

Required readings:

Suggested readings:

1:30pm - 3:00pm – Building a corpus

The first step in any text-as-data project is understanding how to build a corpus. In this module, we will cover common approaches to text selection and representation for computational analysis, including the bag of words assumption, tokenization, and document-term matrices. By the end of the session, you will know how to collate a body of texts for a variety of modeling techniques.
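To make tokenization and the document-term matrix concrete, the sketch below builds one by hand with the standard library. The two example sentences, the simple letters-only tokenizer, and the function names are illustrative assumptions, not the session's actual code; real projects often rely on a dedicated library for this step.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = [
    "The court ruled against the state.",
    "The state appealed the ruling.",
]
vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

Each row is one document and each column is one term. Word order is discarded, which is exactly what the bag of words assumption means.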

Required readings:

Suggested readings:

3:30pm - 5:00pm – Counting

When we transform texts into data, we create opportunities to summarize qualitative information in quantitative terms. The most intuitive way of doing so is through counting. While simple in theory, counting can reveal important trends both within documents and across a corpus — which terms are distinctive to particular documents, the prevalence of key terms across the corpus as a whole, and other lexical patterns that may be relevant for your research question. This session will explore basic principles of counting as well as their more complex applications, including the analysis of lexicons and sentiments using dictionaries.
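A dictionary-based count can be sketched in a few lines. The word lists below are tiny, made-up examples rather than a validated sentiment lexicon, and the scoring rule (positive matches minus negative matches, divided by the number of tokens) is one common convention assumed here for illustration.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries; a real analysis would use a validated lexicon.
POSITIVE = {"good", "fair", "just"}
NEGATIVE = {"bad", "corrupt", "unjust"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text):
    """(positive matches - negative matches) / number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)


text = "The ruling was fair and just, but the process was corrupt."
term_counts = Counter(tokenize(text))
score = sentiment_score(text)
print(term_counts.most_common(3))
print(score)
```

Dividing by document length keeps long documents from scoring higher simply because they contain more words.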

Required readings:

Suggested readings:

Modeling and measuring texts (M23, June 23)

8:45am - 10:15am – Vectorizing

When we transform texts into machine-readable data, we store our text data as vectors. Vectorizing texts opens up a world of analytical possibilities in vector space. Using properties of geometry, we can compute differences and similarities between text vectors in order to analyze differences and similarities between the original texts. This session will cover the “geometry of text” and illustrate how to use Euclidean distance and cosine similarity to draw meaningful comparisons between texts in vector space.
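The two measures named above can be written out directly. This is a minimal sketch with made-up count vectors (three terms per document); the vectors and variable names are assumptions for illustration only.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = [2, 1, 0]  # hypothetical term counts for a short document
doc_b = [4, 2, 0]  # same proportions as doc_a, but twice as long
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(euclidean_distance(doc_a, doc_b), cosine_similarity(doc_a, doc_b))
print(euclidean_distance(doc_a, doc_c), cosine_similarity(doc_a, doc_c))
```

The pair doc_a and doc_b illustrates the practical difference: Euclidean distance treats them as far apart because one document is longer, while cosine similarity treats them as pointing in the same direction because their term proportions match.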

Required readings:

1:30pm - 3:00pm – Embedding

You can infer the meaning of a word by the company it keeps — this is the intuition behind embedding, a type of vector space model wherein every word (or text) in a corpus is represented by a vector. Building upon the previous session, in this module we will explore a more advanced approach to word vectorizing using the popular Word2Vec algorithm. We will generate our own embeddings in order to analyze the semantic similarity (and dissimilarity) of words in a particular corpus.
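The "company it keeps" intuition can be illustrated without Word2Vec itself. The sketch below is not Word2Vec: it builds simple count-based context vectors, a deliberately swapped-in and much cruder technique, whereas Word2Vec learns dense vectors through a prediction task. The three sentences, the window size of 2, and the function names are all assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vectors[word].update(left + right)
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


sentences = [
    "the judge issued the ruling",
    "the magistrate issued the ruling",
    "the farmer planted the crop",
]
vectors = cooccurrence_vectors(sentences)
sim_judge_magistrate = cosine_similarity(vectors["judge"], vectors["magistrate"])
sim_judge_farmer = cosine_similarity(vectors["judge"], vectors["farmer"])
print(sim_judge_magistrate, sim_judge_farmer)
```

Because "judge" and "magistrate" appear in identical contexts, their vectors are more similar to each other than either is to "farmer". The remaining similarity with "farmer" comes entirely from the shared word "the", which hints at why frequent function words are often removed or down-weighted.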

Required readings:

Suggested readings:

3:30pm - 5:00pm – Measuring

From counting to embedding and beyond, text-as-data approaches have introduced novel ways of measuring complex concepts. But are these measures valid and reliable? How should we evaluate the quantification of qualitative information? This session will be a critical discussion of measurement strategies and best practices in social science research.

Required readings:

Suggested readings:

For the extra curious, some additional readings with creative measurement strategies: