IQMR 2025

Text as Data (Modules 2, 6)

Monday, June 16; Tuesday, June 17

Fiona Shen-Bayh (University of Maryland)

What does “text as data” mean? How do scholars transform texts into data that can be read and analyzed quantitatively? How are these data and methods used in service of political science research questions? This unit explores these questions by examining a variety of computational approaches to analyzing texts. Drawing on insights from research in political science, economics, sociology, and the humanities, the course aims to equip students with a toolkit for applied text-as-data research.

Participants who did not attend the module sequence from the beginning may not join later in the sequence.

Quantifying texts (M2, June 16)

8:45am - 10:15am – Transforming text into data

What does it mean to transform text into data? At its core, this is a task of translation: converting human language (i.e., natural language) into a form computers can process. This module digs into the mechanics of translating qualitative information into machine-readable data. We will also examine various ways of reading texts into Python.
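
As a first illustration, the sketch below reads a folder of plain-text files into Python. The folder name corpus/, the .txt extension, and the UTF-8 encoding are assumptions made for illustration, not course requirements.

    from pathlib import Path

    # Collect every .txt file in a (hypothetical) corpus/ folder into a dict,
    # keyed by file name, with the raw text as the value.
    documents = {}
    for path in sorted(Path("corpus").glob("*.txt")):
        documents[path.stem] = path.read_text(encoding="utf-8")

    print(f"Loaded {len(documents)} documents")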

Required readings:

Suggested readings:

1:30pm - 3:00pm – Building a corpus

The first step in any text-as-data project is building a corpus. In this module, we will cover common approaches to text selection and representation for computational analysis, including the bag-of-words assumption, tokenization, and document-term matrices. By the end of the session, you will know how to assemble a body of texts suitable for a variety of modeling techniques.
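
To preview the mechanics, here is a minimal sketch of a bag-of-words pipeline using scikit-learn's CountVectorizer; the two toy documents are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "The court upheld the ruling.",
        "The ruling was appealed to the high court.",
    ]

    # Tokenize, lowercase, and count terms to form a document-term matrix.
    vectorizer = CountVectorizer(lowercase=True)
    dtm = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # vocabulary (matrix columns)
    print(dtm.toarray())                       # term counts per document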

Required readings:

Suggested readings:

3:30pm - 5:00pm – Counting

When we transform texts into data, we create opportunities to summarize qualitative information in quantitative terms. The most intuitive way of doing so is counting. While simple in theory, counting can reveal important trends both within documents and across a corpus – which terms are distinctive to particular documents, how prevalent key terms are across the corpus as a whole, and other lexical patterns that may be relevant to your research question. This session will explore basic principles of counting as well as more complex applications, including dictionary-based analyses of lexicons and sentiment.
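
Below is a minimal sketch of counting and a dictionary lookup in plain Python; the toy sentence and the two-word lexicon are assumptions for illustration, not a validated sentiment dictionary.

    from collections import Counter

    text = "the court dismissed the appeal and praised the lower court"
    tokens = text.lower().split()

    # Term frequencies within a single document
    counts = Counter(tokens)
    print(counts.most_common(3))

    # A (hypothetical) two-word dictionary for a crude sentiment score
    positive = {"praised"}
    negative = {"dismissed"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    print("dictionary sentiment score:", score)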

Required readings:

Suggested readings:

Modeling and measuring texts (M6, June 17)

8:45am - 10:15am – Vectorizing

When we transform texts into machine-readable data, we store them as vectors. Vectorizing texts opens up a world of analytical possibilities in vector space. Using basic geometry, we can compute distances and similarities between text vectors in order to compare the original texts. This session will cover the “geometry of text” and illustrate how to use Euclidean distance and cosine similarity to draw meaningful comparisons between texts in vector space.
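
The sketch below illustrates the idea with two invented count vectors standing in for rows of a document-term matrix, comparing them with Euclidean distance and cosine similarity in NumPy.

    import numpy as np

    # Toy count vectors over a shared four-term vocabulary (invented values)
    doc_a = np.array([2, 0, 1, 3])
    doc_b = np.array([1, 1, 0, 4])

    euclidean = np.linalg.norm(doc_a - doc_b)
    cosine = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

    print("Euclidean distance:", euclidean)
    print("Cosine similarity:", cosine)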

Required readings:

1:30pm - 3:00pm – Embedding

You can infer the meaning of a word from the company it keeps – this is the intuition behind embeddings, a type of vector-space model in which every word (or text) in a corpus is represented by a vector. Building on the previous session, this module explores a more advanced approach to vectorizing words using the popular Word2Vec algorithm. We will generate our own embeddings in order to analyze the semantic similarity (and dissimilarity) of words in a particular corpus.
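
As a preview, the sketch below trains a Word2Vec model with gensim on a handful of invented tokenized sentences; the corpus is far too small to yield meaningful embeddings and only illustrates the workflow.

    from gensim.models import Word2Vec

    # Toy corpus: each document is a list of tokens (invented for illustration)
    sentences = [
        ["judge", "ruled", "against", "the", "state"],
        ["court", "ruled", "against", "the", "appeal"],
        ["judge", "sat", "on", "the", "court"],
    ]

    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

    # Words that appear in similar contexts should receive similar vectors
    print(model.wv.most_similar("judge", topn=3))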

Required readings:

Suggested readings:

3:30pm - 5:00pm – Measuring

From counting to embedding and beyond, text-as-data approaches have introduced novel ways of measuring complex concepts. But are these measures valid and reliable? How should we evaluate the quantification of qualitative information? This session will be a critical discussion of measurement strategies and best practices in social science research.

Required readings:

Suggested readings: