Create your dataset

One of the first things you will need to do when undertaking data-driven research or study in the Humanities, Arts and Social Sciences (HASS) is to create your dataset.

This dataset might comprise materials you have located yourself or those collated by others. For a digital HASS project, this often entails either finding a collection of texts or images online or digitising physical holdings.

You might find that datasets are frequently referred to as a corpus (plural: corpora). This is because a good deal of terminology in the Digital Humanities derives from linguistics, where a text-based dataset is the most common object of analysis.

Datasets can consist of collated text documents, materials to be digitised and transcribed, statistical and tabular data, images and a wide range of other materials from diverse sources.

The documents in your dataset will need to be computer-readable before you can engage in meaningful computational analysis. For large numbers of documents, this usually means undertaking some form of transcription. For example, you may be able to run your documents through optical character recognition (OCR) software. OCR turns images of text (a scanned PDF, for example) into searchable, machine-actionable text.

For some handwritten or older typeset documents, OCR is not an option. In these cases, Tinker provides a variety of tools, software and how-to information to help you transcribe the material yourself.

If you are working with tabular data, you may need to clean it before analysis. To do this you can use tools such as OpenRefine, a purpose-built piece of software for removing errors and inconsistencies in data.
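
To give a sense of what "cleaning" tabular data can involve, here is a minimal Python sketch using only the standard library. It is not OpenRefine and does not use its interface; it simply illustrates two typical fixes: tidying stray whitespace and removing rows that differ only in spacing or capitalisation. The sample data, the column names and the clean_rows function are hypothetical, made up for this illustration.

```python
import csv
import io


def clean_rows(rows):
    """Tidy whitespace in every cell and drop rows that repeat an earlier one.

    Two rows count as duplicates if they match once whitespace is tidied
    and capitalisation is ignored. The first occurrence is kept.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Strip leading/trailing spaces and collapse repeated internal spaces.
        tidy = {key: " ".join((value or "").split()) for key, value in row.items()}
        # Case-insensitive fingerprint used only for spotting duplicates.
        fingerprint = tuple(tidy[key].casefold() for key in sorted(tidy))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


# Hypothetical sample data with typical inconsistencies.
RAW_CSV = """name,place
 Ada  Lovelace ,London
ada lovelace,london
Mary Shelley, Bath
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = clean_rows(rows)

for row in cleaned:
    print(row["name"], "|", row["place"])
```

Real datasets usually need more judgement than this (for example, deciding whether two spellings of a name refer to the same person), which is where an interactive tool such as OpenRefine is helpful.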

How can Tinker help? The Tinker platform provides links to tools and resources that can assist with the creation of your corpus and guide you through the process of digital analysis. We also provide case studies and examples, and can connect you with other researchers, trainers and teachers doing similar work!