HathiTrust Research Center

For the 2016-18 academic years I was on fellowship or otherwise affiliated with the HathiTrust Research Center (HTRC). The work from that time was finally released in late 2018. A brief summary is below.
Data Capsules

HTRC Data Capsules are virtual machines provisioned for researchers at HathiTrust Member Institutions that give access to the full text and OCR scans of both public domain and in-copyright texts. I launched two new features with much help from Yu Ma, Samitha Liyanage, Leena Unnikrishnan, Charitha Madurangi, and Eleanor Dickson Koehl.
The first feature was the HTRC Workset Toolkit. This tool provides a command line interface (CLI) for interacting with and downloading volumes in the HathiTrust digital library. It also has tools for metadata management and collection management. The collection management tools are really great because a user can go from a collection URL to a list of volume IDs or record IDs for later download or metadata retrieval.
The second feature was the addition of the InPhO Topic Explorer to the Data Capsule’s default software stack. This allows the Topic Explorer to train models on the raw full text of public domain and in-copyright texts, as opposed to the word counts exposed by the extracted features.
One notion critical to the use of data capsules is that of non-consumptive research. In summary, research products cannot allow for reconstruction of the original text for human reading. Algorithmic analysis is considered a “transformative use” covered by fair use. These products can then be exported from a data capsule after review.
Algorithms

However, some analysis pipelines are guaranteed to produce valid non-consumptive products. These have been added to an HTRC Algorithms portal for batch processing. I added the InPhO Topic Explorer to this tool.
Extracted Features

Finally, the coolest non-consumptive dataset is the HTRC Extracted Features Dataset, which consists of word counts, part-of-speech tags, and other page-level details for 15.7 million public domain and in-copyright texts. The genius of the Extracted Features is that bag-of-words models (like topic models!) do not require anything more than word counts, so analyses can be performed on local computers, rather than in a data capsule or other sandboxed environment.
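To make the bag-of-words point concrete, here is a minimal Python sketch. It assumes each page is simply a dictionary mapping a token to its count; that is a simplification for illustration, not the actual Extracted Features file schema, and the function name and sample counts are made up.

```python
from collections import Counter


def volume_bag_of_words(pages):
    """Sum per-page token counts into a single volume-level bag of words."""
    total = Counter()
    for page_counts in pages:
        # Counter.update adds counts instead of overwriting them.
        total.update(page_counts)
    return total


# Hypothetical page-level counts (not real HTRC data).
pages = [
    {"whale": 3, "sea": 2},
    {"whale": 1, "ship": 4},
]

bag = volume_bag_of_words(pages)
print(bag.most_common())
```

Word order never enters the computation, which is why counts alone are enough input for a model of this kind.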
I did not create the Extracted Features dataset, but I created a way to integrate it with the Topic Explorer. Now, using the command topicexplorer init --htrc htrcids.txt, where htrcids.txt is a file with one HathiTrust Volume ID per line, models can be built on the extracted features over any set of volumes.
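As a rough sketch of preparing that input file, the Python snippet below writes a list of volume IDs to htrcids.txt, one per line. The helper name write_id_file is hypothetical and the IDs are placeholders rather than real HathiTrust volumes; only the file name and the one-ID-per-line layout come from the command above.

```python
from pathlib import Path


def write_id_file(volume_ids, path="htrcids.txt"):
    """Write volume IDs to a text file, one per line.

    Whitespace is stripped, and blank or repeated IDs are dropped
    while the original order is preserved.
    """
    seen = set()
    cleaned = []
    for vid in volume_ids:
        vid = vid.strip()
        if vid and vid not in seen:
            seen.add(vid)
            cleaned.append(vid)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Placeholder IDs for illustration only.
ids = ["mdp.00000000000001", " mdp.00000000000002 ", "", "mdp.00000000000001"]
written = write_id_file(ids)
print(f"Wrote {len(written)} volume IDs to htrcids.txt")
```

The resulting file can then be passed to topicexplorer init --htrc htrcids.txt.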