Preprocessing Texts

Kristoffer Nielbo & Ryan Nichols

To begin preprocessing texts, we need to segment the string of words in each document into its constituent parts. This process is called tokenization. Tokenization consists of identifying the smallest units of analysis in the text mining project (Weiss, Indurkhya & Zhang 2010). Tokens can be single words or groups of words, called n-grams (e.g., Grant & Walsh 2015). If your research project involves word sequences such as ‘Jesus Christ’, a bigram, and ‘Son of God’, a trigram, then you’ll include both in the model.
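As a rough illustration of tokenization and n-grams, here is a minimal Python sketch that uses only the standard library. The function names `tokenize` and `ngrams` and the sample phrase are our own assumptions for the example, and the regular expression is a deliberately simple stand-in for a real tokenizer.

```python
import re

def tokenize(text):
    """Split a string into word-level tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text)

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample phrase, chosen to contain the bigram and trigram from the text.
phrase = "Jesus Christ, the Son of God"
tokens = tokenize(phrase)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(tokens)
print(("Jesus", "Christ") in bigrams)
print(("Son", "of", "God") in trigrams)
```

Whether you keep unigrams, bigrams, trigrams, or a mix depends on which word sequences matter for your research question.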

The New Testament includes 27 documents and about 160,000 word-level tokens comprising about 14,000 unique word types. (A ‘token’ is an instance of a ‘type’. The previous sentence included seven word types and eight tokens, since ‘a’ was used twice.) To find out how many times a word appears in a document, we can create a simple word-by-document matrix with 27 rows × 14,000 columns, giving 378,000 entries. The majority of entries will contain a zero, meaning that a given word does not appear in a given document. For example, the word ‘wine’ does not occur in 1 Corinthians. The goal of preprocessing is to transform a high-dimensional and noisy dataset into a lower-dimensional and cleaner dataset. By removing irrelevant information and ordering linguistic forms, we can build a numerical representation of the New Testament that is tailored to our research project. Let’s look at some concrete ways to achieve these two goals.
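To make the type/token distinction and the word-by-document matrix concrete, here is a small Python sketch. The two toy documents (`doc_a`, `doc_b`) are invented for the example and are not drawn from the corpus; the real matrix would have 27 rows and thousands of columns.

```python
from collections import Counter

# Types versus tokens, counted case-insensitively.
sentence = "A token is an instance of a type"
tokens = sentence.lower().split()
types = set(tokens)
print(len(tokens), len(types))

# Hypothetical two-document corpus standing in for the 27 books.
docs = {
    "doc_a": "the wine and the bread",
    "doc_b": "the bread of life",
}

# Columns: every word type in the corpus, in alphabetical order.
vocab = sorted({word for text in docs.values() for word in text.split()})

# Rows: one list of counts per document; absent words get a zero.
matrix = {}
for name, text in docs.items():
    counts = Counter(text.split())
    matrix[name] = [counts[word] for word in vocab]

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Even in this tiny case a third of the cells are zero; with real data the proportion is much higher, which is why sparse representations are commonly used.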

One obvious way we can reduce dimensionality and make our dataset more manageable is by removing punctuation and numbers, converting all upper-case letters to lower-case, and applying a stop word (sometimes ‘stopword’) filter (Banchs 2013). Applying all of those techniques to the New Testament corpus reduced the number of word types to 5975, in other words, an almost 60% reduction. We can remove stop words, like ‘the’, ‘is’, ‘at’, and ‘which’, for two reasons. First, they are frequent words in any given language but do not convey any discriminating information. Second, they are distributed more or less equally across all the documents. There is no general agreement on which words need to be in a stop word filter, so you can choose the word list in accordance with your research question. But software experts have probably done this work for you already.
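The cleaning steps just described can be sketched in a few lines of Python. The stop word set below is a tiny, hypothetical list for illustration only, the function name `preprocess` is our own, and the regular expression assumes plain English text (it would also strip accented letters).

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"the", "is", "at", "which", "and", "of", "in", "was"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation and numbers, then filter out stop words."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [token for token in text.split() if token not in stop_words]

raw = "In the beginning was the Word, and the Word was with God (John 1:1)."
clean = preprocess(raw)
print(clean)
```

Note that lowercasing merges ‘Word’ and ‘word’ into one type, which may or may not be what you want for a text where capitalization carries meaning.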

English language stopwords from:

Many text mining packages and tools, like NLTK for Python, tm for R, RapidMiner, and SAS, offer standardized filters for a wide range of contemporary languages. In addition, many stopword lists can be found in the public domain (as here). Keep in mind, though, that these standardized filters might be insufficient when working with historical and non-Western literature. If, however, the research corpus is built from a larger database, you can simply use the most frequent words from the database. The stop word filter applied to our New Testament corpus removed an additional 93 types based on a list of 174 English stop words. The 81 stop words that matched nothing reflect the effect of applying a contemporary filter to a historical text (in this case the 17th-century translation). Using a technique like this is often just fine, but you’ll want to alert your readers to the shortcut. Alternatively, if you would like to craft your own stop word list (as you’ll have to if you are working in an unusual linguistic environment), consider first building a list of words by frequency. Then you can identify frequent terms that are of no interest for your research question. (See early blog posts about AntConc if you want a quick way to do this.)
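If you want to build a frequency list yourself rather than use a tool like AntConc, a sketch along the following lines may be enough to get started. The function name `candidate_stop_words` and the three short lines of text are assumptions made up for the example; you would still need to review the candidates by hand.

```python
from collections import Counter

def candidate_stop_words(documents, top_n):
    """Return the top_n most frequent words across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical mini-corpus in an archaic register.
corpus = [
    "and he saith unto them",
    "and they came unto him",
    "and he went unto the mount",
]
candidates = candidate_stop_words(corpus, 2)
print(candidates)
```

Here the archaic ‘unto’ surfaces as a candidate, the kind of word a contemporary stop word list is likely to miss.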

Another way to reduce dimensionality is by systematically changing the words in your corpus themselves. In literary texts, a single word appears in a variety of grammatical forms. This often makes it necessary to reduce these variations to their common base form (Bird, Klein & Loper 2009). There are two main ways of doing this. Stemming reduces a word to its stem by simply removing its ending. For example, applying Porter’s stemming algorithm (Porter 1980) to the New Testament corpus reduces words like ‘pray’, ‘prayed’, and ‘praying’ to the common base of ‘pray’. ‘Prayeth’, however, will not be reduced. Since Porter’s algorithm lacks a rule that strips the ending ‘-eth’, the archaic third-person indicative form of ‘pray’ remains. Lemmatization, using morphological analysis and specific vocabularies, allows us to transform archaic language and irregular forms (Manning, Raghavan & Schütze 2008). Lemmatization reduces various linguistic forms of a word to their common canonical form, the lemma, so that both ‘prayeth’ and ‘prayest’ are included in the ‘pray’ type.

In addition to these two main ways of reducing dimensionality, you can apply another kind of transformation using thesauri and lexical databases, like WordNet. WordNet can be made to detect synonyms in a text. The goal is to replace synonymous words with a basic name form. In the New Testament corpus, it can be relevant to replace ‘Christ’, ‘Son of Man’, and ‘Son of God’ with the common denominator ‘Jesus’.
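The difference between suffix stripping and dictionary-based lemmatization can be illustrated with a toy Python sketch. To be clear, this is not Porter’s algorithm: the suffix rules, the `strip_suffix` and `lemmatize` helpers, and the small lemma dictionary are all hypothetical and exist only to show why an archaic ending slips through rules written for modern English.

```python
# Toy suffix rules; NOT Porter's algorithm, just an illustration.
MODERN_SUFFIXES = ("ing", "ed")
ARCHAIC_SUFFIXES = ("eth", "est")

def strip_suffix(word, suffixes, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["pray", "prayed", "praying", "prayeth", "prayest"]

# Rules for modern endings leave the archaic forms untouched.
modern = [strip_suffix(w, MODERN_SUFFIXES) for w in words]
# Adding archaic rules collapses all five forms to one stem.
extended = [strip_suffix(w, MODERN_SUFFIXES + ARCHAIC_SUFFIXES) for w in words]

# Hypothetical lemma dictionary; it can also handle irregular forms.
LEMMAS = {"prayeth": "pray", "prayest": "pray", "spake": "speak"}

def lemmatize(word):
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

print(modern)
print(extended)
print(lemmatize("spake"))
```

A dictionary lookup handles irregular forms such as ‘spake’ that no suffix rule can reach, at the cost of having to build or obtain a vocabulary suited to the period.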

Stop word lists, stemming, and lemmatization are ways of removing irrelevant information and ordering linguistic forms through dimension reduction. In some cases, however, you might want to add information to the corpus in order to answer your research question. This might be because, in a majority of cases, text mining projects targeting the content of texts disregard syntactic information. There can be many reasons for this, but many models rely on a bag-of-words assumption (Banchs 2013). A bag-of-words model of language essentially disregards word order. Instead, a document is treated as a bag containing all its words without sequential position, thus making word frequency central. But because syntactic information is relevant in studying verse and prose, there exists a range of tools for analyzing sentence structure while including word class. For instance, part-of-speech (POS) tagging is a set of techniques for grammatical annotation that tags words with their parts of speech. These tags include NN (noun, singular), NNS (noun, plural), and VB (verb, base) (Bird, Klein & Loper 2009). Note that POS tagging is hindered by prior use of some of the transformations mentioned above. If both POS tags and other transformations are necessary, POS tagging should generally precede all other transformations. For more information about POS tagging, and to download a handy, free POS tagger, see the Stanford Natural Language Processing Lab’s page.
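To see what the bag-of-words assumption discards, consider this short Python sketch. The two example sentences and the `bag_of_words` helper are our own inventions for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, ignoring word order entirely."""
    return Counter(text.lower().split())

# Two hypothetical sentences with the same words in a different order.
first = "God loved the world"
second = "the world loved God"

same_bag = bag_of_words(first) == bag_of_words(second)
print(same_bag)
```

The two sentences say different things, yet their bags are identical. That loss of word order is exactly the information that syntactic tools such as POS tagging aim to recover.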

The next article resumes the series by turning to what comes after all the preprocessing: modeling the corpus.

Banchs, Rafael E. 2013. Text Mining with MATLAB. 2013 edition. Springer.

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. 1st edition. Cambridge, Mass.: O’Reilly Media.

Grant, Will J., and Erin Walsh. 2015. “Social Evidence of a Changing Climate: Google Ngram Data Points to Early Climate Change Impact on Human Society.” Weather 70 (7): 195–97.

Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. 1st edition. New York: Cambridge University Press.

Porter, M.F. (1980) 2006. “An Algorithm for Suffix Stripping.” Program: Electronic Library and Information Systems 40 (3): 211–18.

Weiss, Sholom M., Nitin Indurkhya, and Tong Zhang. 2010. Fundamentals of Predictive Text Mining. Springer.