Text Analysis (also called Text Mining) is the process of extracting information from a textual dataset. It covers a variety of applications, from tracking the fluctuation of specific words or themes over time to stylometry, a form of inquiry that combines literary theory and linguistics to compare the underlying structure of texts from one writer or genre to another.
All Text Analysis projects in the Digital Humanities require the curation of a dataset. This dataset might be a single author’s entire oeuvre, part of a magazine print run, or a collection of poems from across multiple centuries.
Text Analysis programs use a range of methods and techniques to extract information from texts. These include word counts, trends, keyword density, correlations and topic modelling. As with all Digital Humanities projects, the question of what makes for meaningful information is always open for discussion. For this reason, a good digital project in Text Analysis should always start with a clear, compelling research question.
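Two of the simpler measures listed above, word counts and keyword density, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`tokenize`, `word_counts`, `keyword_density`), the sample sentence and the very basic tokenisation rule are made up for this example and do not come from any particular Text Analysis tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Assumption: a word is a run of ASCII letters or apostrophes.
    Real projects usually need a more careful tokeniser.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_counts(text):
    """Return a Counter mapping each word to how often it occurs."""
    return Counter(tokenize(text))


def keyword_density(text, keyword):
    """Return the share of all tokens that match the keyword (0.0 to 1.0)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)


# Hypothetical sample text, standing in for a curated dataset.
sample = "The whale surfaced. The crew watched the whale dive again."

counts = word_counts(sample)
print(counts.most_common(2))
print(keyword_density(sample, "whale"))
```

Even a sketch this small raises the question of what counts as meaningful: the most frequent word here is "the", which is why many projects filter out common function words before interpreting the counts.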
There are a large number of computer-assisted Text Analysis tools, each of which performs slightly different functions and requires different skills in its user.
Central to any Text Analysis project is the curation of a dataset, which frequently requires digitisation and optical character recognition (OCR), the process of turning words from images into searchable text.
The internet is text-centric, so there is a huge array of textual datasets freely available online. Some examples to get you started include:
Three reference datasets have been curated and transformed as part of the Tinker environment. You can find them here.
How do I get started?
Before beginning a Digital Humanities project of your own, it can be useful to follow a recipe.
In the same way that people learn to cook by using recipes (ingredients, utensils, steps etc.), researchers can learn how to use new digital tools and methods by following a series of steps.
Have a go at one of these Text Analysis recipes:
- Seven Ways Humanists are Using Computers to Understand Text by Ted Underwood
- Stanford Literary Lab Pamphlets