Evaluation and Interpretation

Kristoffer Nielbo & Ryan Nichols

This is the final article in our series, and here we turn to evaluating and interpreting the results of text analytics. The value of a text analytics result is rarely self-evident. After all, any pattern extracted from a research corpus is only one out of many possible patterns. For a pattern to yield valuable knowledge, we must consider the model from which it was found.

Model evaluation concerns the pattern’s quality and the model’s suitability (Witten, Frank & Hall 2011). We can use the results from word counting and association mining in procedures for evaluating the model’s fit and generalizability (e.g., Baunvig & Nielbo, in prep). For example, we can model how often ‘Jesus’ and ‘said’ occur in the Historical books versus the Epistles of the New Testament. A linear model predicts the frequencies of ‘Jesus’ and ‘said’ in the Historical, Pauline, or Non-Pauline class, and we can apply statistical hypothesis testing to that model. The ‘Jesus-said’ model is statistically reliable, and an effect size measure such as η² (Richardson 2011) tells us how well it explains the data. In this case, the model explains 55% of the variation. The model therefore finds strong support in the data and confirms that the frequency of ‘Jesus said’ is a function of document class in the New Testament.
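To make this kind of evaluation concrete, the sketch below shows how a one-way analysis of variance and the accompanying η² could be computed in Python with pandas and statsmodels. The per-document frequencies and the column names (klass, freq) are illustrative assumptions, not the actual data behind the figures reported above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical relative frequencies of the bigram 'Jesus said' per document,
# grouped by document class (Historical, Pauline, Non-Pauline).
data = pd.DataFrame({
    "klass": ["Historical"] * 5 + ["Pauline"] * 5 + ["NonPauline"] * 5,
    "freq":  [0.012, 0.015, 0.011, 0.014, 0.013,
              0.001, 0.000, 0.002, 0.001, 0.000,
              0.002, 0.003, 0.001, 0.002, 0.001],
})

# One-way ANOVA: does document class predict bigram frequency?
model = ols("freq ~ C(klass)", data=data).fit()
anova = sm.stats.anova_lm(model, typ=2)

# Effect size eta squared: the share of total variation explained by class.
eta_sq = anova.loc["C(klass)", "sum_sq"] / anova["sum_sq"].sum()
print(anova)
print(f"eta squared = {eta_sq:.2f}")
```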

Since models of relations between documents rely on machine learning, we evaluate them with procedures from that field. Machine learning offers a range of evaluation procedures, such as estimator score methods, internal scoring strategies, and functions for assessing performance (Bishop 2007). We can evaluate a document clustering model by asking how much of the variation in the document-term matrix the cluster solution explains compared to treating all documents as one cluster. For the hard clustering model, this figure is around 30%, which is acceptable given that each document has more than 1,500 features. Adding more clusters would increase the explained variation but make interpretation harder; a model with 27 clusters would simply recreate the 27 books it is supposed to explain.
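A minimal sketch of this kind of cluster evaluation, assuming scikit-learn and a small made-up corpus in place of the actual New Testament document-term matrix: the explained variation is taken here as one minus the within-cluster sum of squares (the fitted model’s inertia_) divided by the total sum of squares of the one-cluster baseline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: in practice these would be the 27 New Testament books
# (or slices of them) as plain-text strings.
documents = [
    "jesus said unto them follow me",
    "paul an apostle of jesus christ by the will of god",
    "in the beginning was the word",
    "grace be unto you and peace from god our father",
]

# Build a document-term matrix (here tf-idf weighted).
dtm = TfidfVectorizer().fit_transform(documents).toarray()

# Total variation: squared distances to the corpus centroid,
# i.e., the 'one cluster' baseline.
total_ss = np.sum((dtm - dtm.mean(axis=0)) ** 2)

# Hard clustering with k clusters; inertia_ is the within-cluster sum of squares.
k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(dtm)

# Share of variation in the document-term matrix explained by the cluster solution.
explained = 1 - km.inertia_ / total_ss
print(f"{k} clusters explain {explained:.0%} of the variation")
```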

A confusion matrix, discussed in the previous entry, evaluates a classifier’s performance (Kohavi & Provost 1998). Predictive accuracy is one such performance measure, and it was around 90% for the New Testament example: the classifier predicted the correct class value for 90% of the New Testament slices. Two other popular performance measures are precision, the proportion of documents assigned to a category that actually belong to it, and recall, the proportion of documents in a category that the classifier detects (Witten, Frank & Hall 2011). Both measures range from 0 (worst performance) to 1 (best performance). Applied to the New Testament classifier, they show that it performed well: precision = .88; recall = .97.
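The sketch below shows how these performance measures could be computed with scikit-learn. The gold labels and predictions are made-up stand-ins for the New Testament slices, not the output of the classifier reported above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical gold labels and classifier predictions for ten text slices;
# 1 = Historical, 0 = Epistle.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```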

Formal evaluation supports our interpretation of the model’s results. Most text mining endeavors aim to discover new information, but we often find support for existing theories instead. Such replication serves as an informal validation of the model: if text mining techniques can reproduce previous findings in new ways, they may win over sceptics. Once validated, the techniques can be turned on new problems or used to critique current views in a given field. For instance, a recent topic modeling study of an ancient Chinese text of disputed date, the Book of Documents or Shujing 書經 (Nichols et al., in review), reproduced the scholarly consensus on the dates of individual chapters but also suggested areas in which that consensus might be wrong.

As with all Humanities research, researchers’ domain expertise is invaluable for interpreting and contextualizing results. If you, as an area expert, are asked to interpret the results of a document classification model, you should trust the evidence you have amassed about the primary texts and their historical features, while remaining open to the ways the model’s results might challenge conventional thinking. As text mining grows, it will likely create new areas for scholars to apply their knowledge. Without such expertise, the results of text mining historical and cultural data will not be very useful.

In this series of articles we have introduced the main elements of the text mining workflow so that Humanities scholars can see its potential for their research. The format has forced us to condense the information into an accessible introduction to text mining. Current developments in character-level models, word embeddings, and deep structured learning are changing text mining as we write. Nevertheless, the general workflow remains the same: text selection, preprocessing, modeling, and interpretation. Because the knowledge produced by this workflow depends on Humanities domain expertise, we need historical, cultural, and literary researchers who are willing and able to participate in text mining projects. Given the constantly growing number of digitized historical texts, text mining gives us the tools to take advantage of these new scholarly resources.

Text mining is not meant to replace traditional qualitative methods in the Humanities. On the contrary, detailed readings of selected texts can, and should, complement large-scale quantitative analysis. Projects that rely primarily on document-level text mining also need a qualitative assessment of the data. The classical Humanities disciplines offer critical insights drawn from their long experience with finding patterns qualitatively. Finally, Humanities research constitutes a rich theoretical resource for connecting the internal and external factors of a text. Text mining is a new and exciting tool in our toolbox. Moreover, the participation of Humanities scholars in text mining projects can shed light on the scientific and societal relevance of Humanities research.