HathiTrust Research Center

For the 2016-18 academic years I was on fellowship at, or otherwise affiliated with, the HathiTrust Research Center (HTRC). The work from that time was finally released in late 2018. A brief summary is below.

Data Capsules

HTRC Data Capsules are virtual machines provisioned for researchers at HathiTrust member institutions that give access to the fulltext and OCR scans of both public domain and in-copyright texts. I launched two new features with much help from Yu Ma, Samitha Liyanage, Leena Unnikrishnan, Charitha Madurangi, and Eleanor Dickson Koehl.

The first feature was the HTRC Workset Toolkit, which provides a command line interface (CLI) for interacting with and downloading volumes in the HathiTrust digital library, along with tools for metadata and collection management. The collection management tools are especially handy: a user can go from a collection URL to a list of volume IDs or record IDs for later download or metadata retrieval.
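A hedged sketch of that workflow, assuming the volumes, metadata, and download subcommands as I recall them from the Workset Toolkit documentation (the collection URL and filenames are placeholders):

    # List the volume IDs in a HathiTrust collection (URL is a placeholder)
    htrc volumes "https://babel.hathitrust.org/cgi/mb?a=listis;c=123456789" > volumes.txt

    # Retrieve bibliographic metadata for those volumes
    htrc metadata volumes.txt > metadata.json

    # Download the fulltext of those volumes for analysis
    htrc download -o workset/ volumes.txt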

The second feature was the addition of the InPhO Topic Explorer to the Data Capsule’s default software stack. This allows the Topic Explorer to train models on the raw fulltext of public domain and in-copyright texts, as opposed to the word counts exposed by the Extracted Features dataset (described below).
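Inside a capsule, the workflow is the standard Topic Explorer one. A minimal sketch, assuming the usual init/prep/train/launch subcommands and a hypothetical workset path (the generated config filename is also an assumption):

    # Build a corpus from the capsule's fulltext (path is hypothetical)
    topicexplorer init /media/secure_volume/workset

    # Filter stopwords and very rare or very common terms
    topicexplorer prep workset.ini

    # Train models with 20, 40, and 60 topics
    topicexplorer train workset.ini -k 20 40 60

    # Browse the trained models in the interactive visualization
    topicexplorer launch workset.ini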

One notion critical to the use of Data Capsules is that of non-consumptive research: research products cannot allow reconstruction of the original text for human reading. Algorithmic analysis is considered a “transformative use” covered by fair use. These products can then be exported from a Data Capsule after review.

Algorithms

However, some analysis pipelines are guaranteed to produce valid non-consumptive products. These have been added to an HTRC Algorithms portal for batch processing, without the need to work inside a Data Capsule. I added the InPhO Topic Explorer to this portal.

Extracted Features

Finally, the coolest non-consumptive dataset is the HTRC Extracted Features Dataset, which consists of word counts, part-of-speech tags, and other page-level details for 15.7 million public domain and in-copyright texts. The genius of the Extracted Features is that bag-of-words models (like topic models!) do not require anything more than word counts, so analyses can be performed on local computers rather than in a Data Capsule or other sandboxed environment.
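To make the “local computers” point concrete: the per-volume feature files are compressed JSON that can be synced directly to a laptop. A hedged sketch, assuming the public rsync endpoint described in the dataset documentation (the listing filename and volume path are placeholders; real paths follow a pairtree scheme derived from the volume ID):

    # Fetch the dataset's file listing (filename is an assumption)
    rsync -azv data.analytics.hathitrust.org::features/listing/file_listing.txt .

    # Sync one volume's extracted features file (path is a placeholder)
    rsync -azv data.analytics.hathitrust.org::features/<pairtree-path-to-volume>.json.bz2 .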

I did not create the Extracted Features dataset, but I did create a way to integrate it with the Topic Explorer. Now, using the command topicexplorer init --htrc htrcids.txt, where htrcids.txt is a file with one HathiTrust volume ID per line, models can be built from the extracted features of any set of volumes.
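For example (the volume IDs are placeholders, and the generated config filename is an assumption):

    # htrcids.txt: one HathiTrust volume ID per line (IDs are placeholders)
    mdp.39015012345678
    uc1.b0000123456

    # Build the corpus from Extracted Features word counts, then train as usual
    topicexplorer init --htrc htrcids.txt
    topicexplorer train htrcids.ini -k 20 40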