HathiTrust Research Center

For the 2016-18 academic years I was on fellowship or affiliated with the HathiTrust Research Center (HTRC). The work from that time was finally released in late 2018. A brief summary is below.

Data Capsules

HTRC Data Capsules are virtual machines provisioned for researchers at HathiTrust Member Institutions that give access to the full text and OCR scans of both public domain and in-copyright texts. I launched two new features with much help from Yu Ma, Samitha Liyanage, Leena Unnikrishnan, Charitha Madurangi, and Eleanor Dickson Koehl.

The first feature was the HTRC Workset Toolkit. This tool provides a command line interface (CLI) for interacting with and downloading volumes in the HathiTrust digital library. It also has tools for metadata management and collection management. The collection management tools are especially useful: a user can go from a collection URL to a list of volume IDs or record IDs for later download or metadata retrieval.

The second feature was the addition of the InPhO Topic Explorer to the Data Capsule’s default software stack. This allows the Topic Explorer to train models on the raw full text of public domain and in-copyright texts, as opposed to the word counts exposed by the extracted features.

One notion critical to the use of data capsules is non-consumptive research. In summary, research products cannot allow for reconstruction of the original text for human reading. Algorithmic analysis is considered a “transformative use” covered by fair use. These products can then be exported from a data capsule after review.


However, some analysis pipelines are guaranteed to produce valid non-consumptive products. These have been added to an HTRC Algorithms portal for batch processing. I added the InPhO Topic Explorer to this tool.

Extracted Features

Finally, the coolest non-consumptive dataset is the HTRC Extracted Features Dataset, which consists of word counts, part-of-speech tags, and more page-level details for 15.7 million public domain and in-copyright texts. The genius of the Extracted Features is that bag-of-words models (like topic models!) do not require anything more than word counts, so analyses can be performed on local computers, rather than a data capsule or other sandboxed environment.
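To make the bag-of-words point concrete, here is a minimal sketch of how page-level word counts can be summed into a volume-level bag of words. The `pages` layout and the `volume_bag_of_words` helper are hypothetical and for illustration only; they are not the actual Extracted Features schema or any HTRC API.

```python
from collections import Counter

# Hypothetical page-level word counts for a single volume.
# This layout is an assumption for illustration, not the real
# Extracted Features file format.
pages = [
    {"darwin": 2, "species": 5, "origin": 1},
    {"species": 3, "selection": 4},
    {"natural": 2, "selection": 1, "darwin": 1},
]


def volume_bag_of_words(pages):
    """Sum per-page word counts into one volume-level bag of words."""
    total = Counter()
    for page_counts in pages:
        # Counter.update adds counts rather than replacing them.
        total.update(page_counts)
    return total


bag = volume_bag_of_words(pages)
print(bag.most_common(3))  # [('species', 8), ('selection', 5), ('darwin', 3)]
```

Word order is never needed at any step, which is why counts alone are enough input for a topic model.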

I did not create the extracted features dataset, but I created a way to integrate it with the Topic Explorer. Now, using the command topicexplorer init --htrc htrcids.txt, where htrcids.txt is a file with one HathiTrust Volume ID per line, models can be built on the extracted features over any set of volumes.
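As a rough sketch of preparing that input file, the snippet below writes a de-duplicated list of IDs, assuming one ID per line. The `write_volume_ids` helper and the placeholder IDs are hypothetical and not part of the Topic Explorer; the temporary directory is only there to keep the example free of side effects.

```python
import tempfile
from pathlib import Path


def write_volume_ids(ids, path):
    """Write unique, non-empty volume IDs to `path`, one per line."""
    seen = set()
    cleaned = []
    for raw in ids:
        vol_id = raw.strip()
        # Skip blank entries and anything already written.
        if vol_id and vol_id not in seen:
            seen.add(vol_id)
            cleaned.append(vol_id)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Placeholder IDs for illustration only; substitute real volume IDs.
example_ids = ["mdp.0000000001", " mdp.0000000002 ", "", "mdp.0000000001"]

out_path = Path(tempfile.mkdtemp()) / "htrcids.txt"
written = write_volume_ids(example_ids, out_path)
print(written)
```

The resulting file is what gets passed to the --htrc option in the command above.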


What have I been up to?

Hello from the Land of Enchantment! In October, our family moved to Albuquerque, New Mexico – our fourth state in two years. Understandably, blogging has been a bit slower, but we’re finally getting settled, so I’m going to start with some basics before doing research updates and then expanding on some of these things.

The day after the election my then-fiancée Emily got offered a fellowship at the National Academies of Science. We packed up and left for DC, for what we thought was just 6 months. In order to keep us afloat, I took full-time work at the Internet Archive. We got married in May and when she was offered a (semi-)permanent position, we moved to suburban Maryland.

The Archive does heroic work for preservation and access, but was not a good fit for me, so in January 2018 I started a new position at Cornell University, working on the arXiv Next Generation (arXiv-NG) publishing platform. I’ve gotten a kick out of the fact that I moved to an organization with the exact same URL pronunciation (archive.org → arxiv.org), and the work we do has tremendous impact on scientific communications, with over 22 million monthly article downloads.

Skipping ahead to June 2018, our son Javier was born. With expenses already at their limit before adding a child and the challenges of employment in Trump’s Washington, we had to relocate in October. After triangulating what was important to us in a home – a bilingual blue state with sunny weather, low cost of living, and a lack of anti-vaxxers – New Mexico was our choice. It’s definitely delivered on the promise: not a day goes by where I wish we were back East. There’s something exciting about being on the frontier, somewhere no one in my family history has ever lived.

tl;dr: moved 3 times, got married, had a baby, archive.org → arxiv.org

Sunset at Volcanoes Day Use Area, Petroglyph National Monument, Albuquerque, NM


Towards Cultural-Scale Models of Full Text

For the past year, Colin and I have been on a HathiTrust Advanced Collaborative Support (ACS) Grant. This project has examined how topic models differ between library subject areas. For example, some areas may have a “canon,” meaning that a low number of topics selects the same themes, no matter what the corpus size is. In contrast, still-emerging fields may not agree on the overall thematic structure. We also looked at how sample size affects these models. We’ve uploaded the initial technical report to the arXiv:

Towards Cultural Scale Models of Full Text
Jaimie Murdock, Jiaan Zeng, Colin Allen
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the Hathi Trust.
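For readers who want a feel for what comparing a subsample model to a whole-corpus model involves, here is a minimal sketch. It assumes each topic is a probability distribution over a shared vocabulary, uses Jensen-Shannon distance as the topic distance, and defines overlap as the fraction of whole-corpus topics that are the nearest match for at least one subsample topic. These definitions and the toy numbers are assumptions for illustration; the report's exact measures are not spelled out in this post.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (0 = identical)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))


def nearest_topics(sample_model, full_model):
    """For each sample topic, find (index, distance) of the closest full-model topic."""
    matches = []
    for topic in sample_model:
        distances = [jensen_shannon_distance(topic, ref) for ref in full_model]
        best = min(range(len(distances)), key=distances.__getitem__)
        matches.append((best, distances[best]))
    return matches


# Toy topic-word distributions over a four-word vocabulary (made up).
full_model = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
sample_model = [[0.6, 0.3, 0.1, 0.0], [0.1, 0.1, 0.2, 0.6]]

matches = nearest_topics(sample_model, full_model)
mean_distance = sum(d for _, d in matches) / len(matches)
overlap = len({i for i, _ in matches}) / len(full_model)
print(mean_distance, overlap)
```

Under these definitions, a larger subsample would be expected to push the mean distance down and the overlap toward 1.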


Psychonomics 2015

This weekend I was in Chicago for the Psychonomic Society and Society for Computers in Psychology meetings. Emily and I stayed Thursday through Saturday and experienced a record first snow of the season. I hope that our fellow conference-goers made it back safely as well.

Chicago is one of the best food towns we’ve ever been to: we cannot recommend Gino’s East deep-dish pizza and Santorini’s Greek restaurant enough.

Below are some conference observations and highlights.

Conference Impressions
As an abstract-only, non-proceedings conference, it is a great opportunity to showcase work that is still developing or under review. For an idea of the breadth of the conference, please look at the abstract book. The talks were of varying quality, but the rapt attention of the audience and quality of questions were excellent. Next year it will be in Boston on November 17-20.

Distributed Cognition
One of the best talks was by Steven Sloman on “The Illusion of Explanatory Depth and the Community of Knowledge”:

Asking people to explain how something works reveals an illusion of explanatory depth: Typically, people know less about the causal mechanism they are describing than they think they do (Rozenblit & Keil, 2002). I report studies showing that explanation shatters people’s sense of understanding in politics. I also show that people’s sense of understanding increases when they are informed that someone else understands and that this effect is not attributable to task demands or understandability inferences. The evidence suggests that our sense of understanding resides in a community of knowledge: People fail to distinguish the knowledge inside their heads from the knowledge in other people’s heads.

The article detailing that explanation shatters political understanding is quite accessible. The further results about “a community of knowledge” are under review.

Prof. Sloman is the conference chair for the International Conference on Thinking on August 3-6, 2016 at Brown University. Submission deadline is March 31, 2016.

The Science of Narrative
Another excellent talk was by Mark Finlayson who studies “the science of narrative”. He developed “Analogical Story Merging” (ASM), which can replicate Vladimir Propp’s theory of the structure of folktale plots. This process is described in his dissertation, which is an excellent synthesis of literary theory and computer science.

Prof. Finlayson is hosting the 7th International Workshop on Computational Models of Narrative at Digital Humanities 2016 in Kraków, Poland on July 11-12. The call for papers is pending.


There were two talks in the Bilingualism track that were particularly interesting. Conor McLennan and Sara Incera reported that mouse tracking behavior in bilinguals doing a word discrimination task shows the same sort of reaction delay as in expert discrimination tasks. This correlates with confidence in answers – experts may take longer but move directly to their answers. The results are published in Bilingualism.

Another talk looked at how multilingualism affects vocabulary size using a massive online experiment. While the task of identifying whether a word is known or not is riddled with false positives, the results were interesting in and of themselves. Multilinguals tended to have larger vocabularies across languages, and L2 learners tended to actually have a larger vocabulary than L1 native speakers within a language. The results are published in The Quarterly Journal of Experimental Psychology.


Darwin’s Semantic Voyage

The preprint of my project “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks” was released on arXiv on Friday. The paper is joint work with my advisors Colin Allen and Simon DeDeo.

This has consumed my life for the past year and I’m incredibly proud of the results. It’s an entertaining read — printing pages “1-11,24-28” gives the main body and references. 12-23 are the “supporting information” explaining some of the archival work, mathematics, and model verification, but absolutely not central to the key points of the paper.

The key point for digital humanities is that we’ve come up with a way to characterize an individual’s reading behaviors and identify key biographical periods from their life. Darwin is incredibly well-studied, so our results largely confirm existing history of science work. However, by adjusting the granularity we can also suggest hypotheses for further investigation – in this case, the period of Darwin’s life from 1851-1853 after his daughter’s death. For less well-studied individuals, this may help humanists gain traction on narrative organization when interacting with large historical archives.

The key point for cognitive scientists is that we can now characterize information foraging behaviors on multiple timescales using an information theoretic measure of cognitive surprise. While many people have studied foraging behavior in individuals on the order of minutes, or in cultures on the order of decades – this is the first study that looks at how an individual interacts with the products of their culture over the course of a lifetime.
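To give a flavor of what such a measure can look like, here is a minimal sketch that assumes each reading is summarized as a distribution over topics and that surprise is the Kullback-Leibler divergence of a new reading from earlier ones. That choice of measure, the direction of the divergence, and the toy numbers are assumptions for illustration, so see the paper for the actual definitions.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy topic distributions for three readings in chronological order (made up).
readings = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.20, 0.70],
]


def text_to_text_surprise(readings):
    """Surprise of each reading relative to the one read just before it."""
    return [
        kl_divergence(readings[i], readings[i - 1])
        for i in range(1, len(readings))
    ]


def past_to_text_surprise(readings):
    """Surprise of each reading relative to the average of all earlier readings."""
    surprises = []
    for i in range(1, len(readings)):
        past = readings[:i]
        mean_past = [sum(column) / len(past) for column in zip(*past)]
        surprises.append(kl_divergence(readings[i], mean_past))
    return surprises


t2t = text_to_text_surprise(readings)
p2t = past_to_text_surprise(readings)
print(t2t, p2t)
```

In this toy sequence the second reading stays close to the first (low surprise, exploitation), while the third jumps to a different topic mix (high surprise, exploration).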

It’s important to note that we don’t say anything about how his reading affected his writing – that’s for paper #2!

Also, I’ll be presenting this work at the 2015 Conference on Complex Systems this Friday at Arizona State University, with slides available on Google Slides.

Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks
Jaimie Murdock, Colin Allen, Simon DeDeo
Abstract: Search in an environment with an uncertain distribution of resources involves a trade-off between local exploitation and distant exploration. This extends to the problem of information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. Darwin built his theory of natural selection in part by synthesizing disparate parts of Victorian science. When we analyze his extensively self-documented reading we find shifts, on multiple timescales, between choosing to remain with familiar topics and seeking cognitive surprise in novel fields. On the longest timescales, these shifts correlate with major intellectual epochs of his career, as detected by Bayesian epoch estimation. When we compare Darwin’s reading path with publication order of the same texts, we find Darwin more adventurous than the culture as a whole.


Thomas Jefferson’s Mind: Polymathic and Polygraphic

This summer I am a Fellow at the International Center for Jefferson Studies (ICJS) at Monticello writing a grant proposal for the study of Thomas Jefferson’s Mind through his libraries. The day I arrived – May 8, 2015 – was the 200th anniversary of Jefferson’s sale to the Library of Congress of 6,497 volumes for $23,950 to replace the Library which was burned during the British invasion of Washington on August 24, 1814. This sale more than doubled the library’s catalog of 3,076 volumes and forever changed this national institution.

On Thursday, June 4, Colin Allen and I will be presenting our proposal to the Center via a public lecture for feedback from the community of historians, librarians, and other vested interests. The abstract and our bios are below.

Thomas Jefferson’s Mind: Polymathic and Polygraphic
Jaimie Murdock and Colin Allen
ICJS Fellows’ Forum, June 4, 2015

Jefferson collected thousands of books, wrote nearly 20,000 letters, and generated tens of thousands of other papers, keeping copies made with his famous double-penned “polygraph” machine. The digitization of his own writings, and the possibility of recreating his libraries from digital collections such as the HathiTrust, presents members of the public with unprecedented opportunities to explore and understand the interleaved themes of Jefferson’s life and career through the network of themes crisscrossing his reading and writing. We present our proposal to the Prototype Phase of the NEH Digital Projects for the Public Program, TJmind, which aids humanistic interpretation through novel, informative interfaces to both the books he read and the letters he wrote. These tools will make it possible for members of the public interested in Thomas Jefferson to discover themes writ large and small, and to drill down from high-level “distant readings” of the words that shaped his work into specific texts. By encouraging the public to engage directly with the texts and supporting their own “close readings” of thematically related documents, we can educate a broad audience about the interpretive process of the humanities. Our approach will also allow scholars who wish to apply computational methods in more depth to address research questions of interest to historians and other humanists, and to make these investigations available to a broad, general audience. Through both scholarly and public interactions, we set out on a discovery of the many facets of Jefferson, polymathic and polygraphic: from his interest in the Virginia climate to his concerns about the effects of Barbary pirates on American foreign trade, and of course from his historically significant role in designing the constitution, to his central position in defending American interests during two major wars.

Colin Allen is a Provost’s Professor at Indiana University, where he teaches in the Cognitive Science Program and the Department of History and Philosophy of Science. He is a philosopher of science whose research spans animal cognition, the prospects for artificial moral agents, and algorithmic analysis of philosophical texts. http://pages.iu.edu/~colallen/

Jaimie Murdock is a joint PhD Student in Cognitive Science and Informatics at Indiana University studying the dynamics of expert learning and innovation through the lenses of history and philosophy of science, machine learning, and cognition. He is a prolific programmer with eight years of research experience in digital humanities as the lead developer of the InPhO Project. http://jamram.net/


Topic Modeling Tutorial at JCDL2015

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]



We draw your attention to the astonishingly low half-day tutorial fees:

Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40

Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60

Hope to see you there!


Six Upcoming Talks

For the past 6 months, I’ve been very busy working on a number of collaborations with Simon DeDeo and Colin Allen. Now, I’m taking to the road to show the fruit of my labors. Below are 6 upcoming talks, tutorials, and workshops about this work on topic modeling, Charles Darwin, information foraging, and the HathiTrust. I hope to see you there!

Topics over Time: Into Darwin’s Mind (Local)
Network Science @ IU Talks
Monday, March 9 — 12:30-1pm
Social Science Research Commons
Slides: http://jamr.am/DarwinIUNetSci
Video coming soon!

Topic Modeling with the HathiTrust Data Capsule
HathiTrust UnCamp 2015
Monday, March 30
Ann Arbor, MI
Presenters: Jaimie Murdock, Colin Allen

Topic-driven Foraging (Local)
Goldstone, Todd, Landy Lab
Friday, April 10 — 9-10am
MSB II Gill Conference Room

Visualization Techniques for LDA (Local)
Cognitive Science 25th Anniversary
Interactive Systems Open House
Friday, April 17 — 3:30-5:15pm
Location TBD

Topic Modeling & Network Analysis (Local)
Catapult Center Workshops
Friday, April 24 — 1-4pm
Wells Library E159
Presenter: Colin Allen

HT Data Capsule & Topic Modeling for Non-consumptive Research
JCDL 2015 Tutorial
Sunday, June 21 — 9am-noon
Knoxville, TN
Presenters: Jaimie Murdock, Jiaan Zeng, Robert McDonald


Wisdom of the Few?

Wisdom of the Few? “Supertaggers” in Collaborative Tagging Systems

Jared Lorince, Sam Zorowitz, Jaimie Murdock, Peter M. Todd

A folksonomy is ostensibly an information structure built up by the “wisdom of the crowd”, but is the “crowd” really doing the work? Tagging is in fact a sharply skewed process in which a small minority of “supertagger” users generate an overwhelming majority of the annotations. Using data from three large-scale social tagging platforms, we explore (a) how to best quantify the imbalance in tagging behavior and formally define a supertagger, (b) how supertaggers differ from other users in their tagging patterns, and (c) if effects of motivation and expertise inform our understanding of what makes a supertagger. Our results indicate that such prolific users not only tag more than their counterparts, but in quantifiably different ways. These findings suggest that we should question the extent to which folksonomies achieve crowdsourced classification via the “wisdom of the crowd”, especially for broad folksonomies like Last.fm as opposed to narrow folksonomies like Flickr.
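As a rough illustration of what quantifying that imbalance can look like, the sketch below computes two standard inequality summaries over per-user annotation counts: the Gini coefficient and the share of annotations contributed by the most active users. Both are generic measures chosen for illustration, and the counts are made up; they are not necessarily the definitions or data used in the paper.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 = equal, near 1 = concentrated."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the values in ascending order.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def top_share(counts, fraction):
    """Share of all annotations made by the top `fraction` of users."""
    values = sorted(counts, reverse=True)
    k = max(1, round(len(values) * fraction))
    return sum(values[:k]) / sum(values)


# Made-up annotation counts for ten users, for illustration only.
annotations_per_user = [500, 300, 20, 10, 5, 5, 3, 2, 2, 1]

print(gini(annotations_per_user))
print(top_share(annotations_per_user, 0.2))
```

Here the two most active users of ten account for 800 of the 848 annotations, the kind of skew the abstract describes.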

Preprint of article in review available at arXiv:1502.02777 [cs.SI]


Topic Explorer at AAAI

Next week, I’ll be headed to Austin, TX for AAAI-15 to present a demo of the Topic Explorer. Accompanying the presentation is a short paper:

Topic models remain a black box both for modelers and for end users in many respects. From the modelers’ perspective, many decisions must be made which lack clear rationales and whose interactions are unclear – for example, how many topics the algorithms should find (K), which words to ignore (aka the “stop list”), and whether it is adequate to run the modeling process once or multiple times, producing different results due to the algorithms that approximate the Bayesian priors. Furthermore, the results of different parameter settings are hard to analyze, summarize, and visualize, making model comparison difficult. From the end users’ perspective, it is hard to understand why the models perform as they do, and information-theoretic similarity measures do not fully align with humanistic interpretation of the topics. We present the Topic Explorer, which advances the state-of-the-art in topic model visualization for document-document and topic-document relations. It brings topic models to life in a way that fosters deep understanding of both corpus and models, allowing users to generate interpretive hypotheses and to suggest further experiments. Such tools are an essential step toward assessing whether topic modeling is a suitable technique for AI and cognitive modeling applications.

Jaimie Murdock and Colin Allen. (2015) Visualization Techniques for Topic Model Checking. [demo track] in Proceedings of the 29th AAAI Conference (AAAI-15). Austin, Texas, USA, January 25-29, 2015.
