
HathiTrust Research Center

For the 2016-18 academic years I was on fellowship or affiliated with the HathiTrust Research Center (HTRC). The work from that time was finally released in late 2018. A brief summary is below.

Data Capsules

HTRC Data Capsules are virtual machines provisioned for researchers at HathiTrust Member Institutions that give access to the fulltext and OCR scans of both public domain and in-copyright texts. I launched two new features with much help from Yu Ma, Samitha Liyanage, Leena Unnikrishnan, Charitha Madurangi, and Eleanor Dickson Koehl.

The first feature was the HTRC Workset Toolkit. This tool provides a command line interface (CLI) for interacting with and downloading volumes in the HathiTrust digital library. It also has tools for metadata management and collection management. The collection management tools are especially useful because a user can go from a collection URL to a list of volume IDs or record IDs for later download or metadata retrieval.

The second feature was the addition of the InPhO Topic Explorer to the Data Capsule’s default software stack. This allows the Topic Explorer to train models on the raw fulltext of public domain and in-copyright texts, as opposed to the word counts exposed by the Extracted Features.

One notion critical to the use of data capsules is non-consumptive research. In summary, research products cannot allow for reconstruction of the original text for human reading. Algorithmic analysis is considered a “transformative use” covered by fair use. These products can then be exported from a data capsule after review.


However, some analysis pipelines are guaranteed to produce valid non-consumptive products. These have been added to an HTRC Algorithms portal for batch processing. I added the InPhO Topic Explorer to this tool.

Extracted Features

Finally, the coolest non-consumptive dataset is the HTRC Extracted Features Dataset, which consists of word counts, part-of-speech tags, and more page-level details for 15.7 million public domain and in-copyright texts. The genius of the Extracted Features is that bag-of-words models (like topic models!) do not require anything more than word counts, so analyses can be performed on local computers, rather than in a data capsule or other sandboxed environment.
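To make the word-count point concrete, here is a minimal sketch of collapsing page-level counts into a single volume-level bag of words. The `pages` structure and the `volume_bag_of_words` helper are my own illustration, not the actual Extracted Features file format or any HTRC library call; the real files carry more fields per page.

```python
from collections import Counter

# Hypothetical page-level token counts for one volume. The real
# Extracted Features files are richer; this layout is an assumption
# made only for illustration.
pages = [
    {"the": 12, "whale": 3, "sea": 2},
    {"the": 9, "ship": 4, "sea": 1},
]


def volume_bag_of_words(pages):
    """Sum per-page token counts into one volume-level bag of words."""
    total = Counter()
    for page_counts in pages:
        # Counter.update adds counts instead of overwriting them.
        total.update(page_counts)
    return total


bag = volume_bag_of_words(pages)
print(bag.most_common(3))
```

Nothing here needs word order or the original sentences, which is why this kind of analysis can run outside a sandbox.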

I did not create the Extracted Features dataset, but I created a way to integrate it with the Topic Explorer. Now, with the command topicexplorer init --htrc htrcids.txt, where htrcids.txt is a file with one HathiTrust Volume ID per line, models can be built on the extracted features for any set of volumes.
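For what it's worth, a small sketch of preparing such an ID file in Python. The `write_id_file` helper is hypothetical, and the IDs are placeholders rather than real HathiTrust volumes; it only writes a plain text file with one ID per line.

```python
from pathlib import Path


def write_id_file(volume_ids, path):
    """Write unique, non-empty volume IDs to path, one per line, keeping order."""
    seen = set()
    cleaned = []
    for vid in volume_ids:
        vid = vid.strip()
        if vid and vid not in seen:
            seen.add(vid)
            cleaned.append(vid)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Placeholder IDs for illustration only, not real HathiTrust volumes.
ids = write_id_file(
    ["example.0001", "example.0002 ", "example.0001", ""],
    "htrcids.txt",
)
```

The resulting htrcids.txt can then be handed to the topicexplorer init --htrc command mentioned above.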


InPhO for All: Why APIs Matter

This month Colin Allen and I published “InPhO for All: Why APIs Matter” in the Journal of the Chicago Colloquium on Digital Humanities and Computer Science (JDHCS). It’s a short piece setting up the API development narrative for digital humanists. Abstract, citation, and paper link follow.

The unique convergence of humanities scholars, computer scientists, librarians, and information scientists in digital humanities projects highlights the collaborative opportunities such research entails. Unfortunately, the relatively limited human resources committed to many digital humanities projects have led to unwieldy initial implementations and underutilization of semantic web technology, creating a sea of isolated projects without integratable data. Furthermore, the use of standards for one particular purpose may not suit other kinds of scholarly activities, impeding collaboration in the digital humanities. By designing and utilizing an Application Platform Interface (API), projects can reduce these barriers, while simultaneously reducing internal support costs and easing the transition to new development teams. Our experience developing an API for the Indiana Philosophy Ontology (InPhO) Project highlights these benefits.

Jaimie Murdock and Colin Allen. InPhO for All: Why APIs Matter. In Journal of the Chicago Colloquium on Digital Humanities and Computer Science (JDHCS). Evanston, Illinois, 2011. [paper]
