Selecting Texts for Quantitative Work

Kristoffer Nielbo & Ryan Nichols

We’ve reviewed unstructured data so the next step is to find a full-text database that is relevant to our research question. While research libraries and the internet may have many relevant primary sources for our use, a complete database, if there is a database, might be harder to find. To make things even more difficult, if our research question requires looking at historical texts, those texts may not be digitized. In some cases, a digital version will be hard produce due to a manuscript’s non-standard form. For example, if your text is drawn from Eighteenth Century Collections Online, many documents will be in pdf image form. These can be downloaded. You can use Adobe Pro or AABBYY to convert the document into a pdf text document. However, even then you are left with headaches including errors and systemic use of “f” instead of “s.”

Nielbo-Nichols_Blog Post 2_Figure 1

Humanities research projects often start by building a database from scratch or extending existing databases. This will typically involve identification of content and digitization of new texts. There is a tradeoff between working with incomplete databases and building a new database. On the one hand, there is the Garbage In, Garbage Out principle to consider: the quality of the output the database gives us will depend on the quality of content we put into it. On the other hand, making sure we are not putting “garbage” in our database is not a cost-efficient process. Fortunately, there has been a massive increase in high quality full-text databases.

In recent decades, enormous amounts of effort and funding have been dedicated to building massive database of historical texts, such as the Hathi Trust Digital Library. These represent resources of historically unprecedented power and scope for the study of cultural history. Too often, however, the term “digital humanities” is associated with database-building techniques when texts in these databases can be analyzed in novel ways. With all the money and time spent, these databases are primarily used as faster versions of resources we already have, such as concordances. By getting more familiar with data mining, Humanities scholars can take advantage of a new way of analyzing texts with these massive databases.

Nielbo-Nichols_Blog Post 2_Figure 2

Having identified or built a database, the second step is to select a smaller section of texts within the database to be our research corpus. Sometimes distinguishing between a database and a research corpus can be confusing when our research question requires that we build a new database. However, it is important that we make the distinction because databases can be used for a wide range of research questions while a research corpus is made for specific ones. Selection of a research corpus can apply both probability and nonprobability sampling (e.g., Jockers & Mimno 2013; Nichols et al in review; Slingerland & Chaduk 2011). Regarding generalization, it is important to notice that, although high performance allows us to model every text relevant to our research question, we have to treat the research corpus as a sample of a larger data set. This is because, as seen with historical texts, it is unlikely that our research corpus contains all of the relevant texts that exist or have existed.

The next important step is choosing a document size. Although there are many document sizes we can choose for our research corpus, the selected size will affect the granularity of our answer to the guiding research question. For example, if we want to model a topic space of the books of the New Testament, the document size is books, and the project will have 27 documents.[1] If, instead, we want to count the number of keywords in the New Testament, the document size is a collection of books and the project will have one document. Document sizes can also be sentences, verses, or some external formal criteria like strings of 1000 characters. Of course, our research question can have multiple document sizes, but this will require parallel preprocessing and, eventually, different models.

One final consideration is the available metadata for the research corpus. Metadata are data that describe general features of the data, including date, location, authorship, or translation, to mention a few. Author demographics such as age, gender, ethnicity and religion might be interesting for reconstructing specific past minds (e.g. Baunvig & Nielbo in review). Demographic features can also be used in models for numerical prediction and classification.  In many cases, general metadata will be available from the database and, sometimes, have dedicated structured fields in the text. With more specific queries, we will have to collect metadata from secondary sources.

Baunvig, Katrine F., and Kristoffer L. Nielbo. in prep. “Grundtvig’s Mind: Proxies for Bipolar Disorder in the Collected Works of N.F.S. Grundtvig”

Jockers, Matthew L., and David Mimno. 2013. “Significant Themes in 19th-Century Literature.” Poetics 41 (6): 750–69.

Nichols, Ryan, Kristoffer L. Nielbo, Edward Slingerland, Uffe Bergeton, Carson

Logan, and Scott Kleinman. in review. “Topic Modeling Ancient Chinese Texts: Knowledge Discovery in Databases for Humanists”

Slingerland, Edward, and Maciej Chudek. 2011. “The Prevalence of Mind-Body Dualism in Early China.” Cognitive Science 35 (5): 997–1007.

[1] The New Testament from the King James Version of the Bible is used throughout the following posts See For the sake of simplicity we use this small research corpus for all examples. Code for all example is available at: