Text Analysis


Overview

Text Analysis (also known as Text Mining) is the process of extracting information from a textual dataset. Its applications range from tracking the fluctuation of specific words or themes over time to stylometry, a form of inquiry that combines literary theory and linguistics to compare the underlying structure of texts from one writer or genre to another.

All Text Analysis projects in the Digital Humanities require the curation of a dataset. This dataset might be a single author’s entire oeuvre, part of a magazine print run, or a collection of poems from across multiple centuries.

Text Analysis programs use a range of methods and techniques to extract information from texts. These include word counts, trends, keyword density, correlations and topic modelling. As with all Digital Humanities projects, the question of what makes for meaningful information is always open for discussion. For this reason, a good digital project in Text Analysis should always start with a clear, compelling research question.
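To make two of these techniques concrete, here is a minimal sketch of word counts and keyword density in Python, using only the standard library. The sample sentence and the function names (tokenize, word_counts, keyword_density) are invented for illustration and are not taken from any of the tools discussed on this page. The tokenizer is a deliberate simplification that assumes unaccented English text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: plain English text; accented letters and digits are ignored.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Return a Counter mapping each word to the number of times it occurs."""
    return Counter(tokenize(text))


def keyword_density(text, keyword):
    """Return the share of all tokens in the text that match the keyword."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


# Hypothetical sample text, invented for this example.
sample = "The sea was calm. The sea was grey, and the ship was small."

counts = word_counts(sample)
print(counts.most_common(3))           # the three most frequent words
print(keyword_density(sample, "sea"))  # fraction of tokens that are "sea"
```

Raw counts like these are only a starting point. Whether a frequent word is meaningful still depends on the research question, which is why common words such as "the" are often filtered out before interpretation.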

Tools

There are many computer-assisted Text Analysis tools, each of which performs slightly different functions and demands different skills of its user.

Some downloadable applications that require little to no programming skills include NVivo, Tableau and Cowo.

Easy-to-use tools that don’t require programming skills include Voyant Tools, JSTOR Lab’s Text Analyzer, Netlytic, and Wordle.

Other Text Analysis tools require small amounts of command line usage. Some examples include Stanford’s CoreNLP and MALLET, a tool that generates topic models.

Some platforms that require more advanced programming skills include NLTK, the Natural Language Toolkit for Python, and Bookworm, which tracks word frequencies over time.
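As a rough illustration of what tracking word frequencies over time involves, the sketch below groups a tiny corpus by year and computes how often a word appears relative to all words from that year. It is not Bookworm's or NLTK's implementation. The corpus, the years and the function name relative_frequency_by_year are all hypothetical, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of (year, text) pairs, invented for this example.
corpus = [
    (1850, "The railway came to the town and the town grew."),
    (1850, "A letter arrived by coach."),
    (1900, "The railway and the telegraph joined every town."),
    (1900, "The telegraph brought news by the hour."),
]


def relative_frequency_by_year(documents, word):
    """Return {year: occurrences of word / total tokens published that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in documents:
        # Simplified tokenizer: assumes unaccented English text.
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


trend = relative_frequency_by_year(corpus, "telegraph")
for year, freq in trend.items():
    print(year, round(freq, 3))
```

Dividing by the total number of tokens per year matters because archives rarely hold the same amount of text for every period. Without that step, a rise in raw counts may simply reflect a larger collection.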

Datasets

Central to any Text Analysis project is the curation of a dataset, which frequently requires digitisation and optical character recognition (OCR), the process of turning words from images into searchable text.

The internet is text-centric, so there is a huge array of textual datasets freely available online. Some examples of these to get you started include:

Three reference datasets have been curated and transformed as part of the Tinker environment. You can find them here.

How do I get started?

Before beginning a Digital Humanities project of your own, it can be useful to follow a recipe.

In the same way that people learn to cook by using recipes (ingredients, utensils, steps etc.), researchers can learn how to use new digital tools and methods by following a series of steps.

Have a go at one of these Text Analysis recipes:

Readings