Category: digital humanities

Towards Cultural-Scale Models of Full Text

For the past year, Colin and I have been on a HathiTrust Advanced Collaborative Support (ACS) Grant. This project has examined how topic models differ between library subject areas. For example, some areas may have a “canon,” meaning that a model with a small number of topics selects the same themes regardless of corpus size. In contrast, still-emerging fields may not agree on the overall thematic structure. We also looked at how sample size affects these models. We’ve uploaded the initial technical report to the arXiv:

Towards Cultural Scale Models of Full Text
Jaimie Murdock, Jiaan Zeng, Colin Allen
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the Hathi Trust.
http://arxiv.org/abs/1512.05004
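
To make “topic distance” and “topic overlap” a little more concrete, here is a minimal sketch of how two topic models might be compared. It is an illustration under stated assumptions, not the procedure from the report: it assumes each topic is a probability distribution over a shared vocabulary, uses Jensen-Shannon distance as the distance measure, and defines overlap as the share of full-corpus topics that are the nearest neighbor of at least one sample topic. The names `js_distance` and `compare_models` are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # max() guards against tiny negative values from floating-point error.
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def compare_models(sample_topics, full_topics):
    """Return (mean nearest-topic distance, overlap) for two topic models.

    Assumed definitions: each sample topic is matched to its closest
    full-corpus topic; overlap is the fraction of full-corpus topics
    that are matched at least once.
    """
    nearest = []
    for topic in sample_topics:
        distances = [js_distance(topic, other) for other in full_topics]
        best = min(range(len(distances)), key=distances.__getitem__)
        nearest.append((best, distances[best]))
    mean_distance = sum(d for _, d in nearest) / len(nearest)
    overlap = len({i for i, _ in nearest}) / len(full_topics)
    return mean_distance, overlap


# Toy topics over a three-word vocabulary (made-up numbers).
full_model = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
good_sample = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]
poor_sample = [[0.7, 0.2, 0.1], [0.75, 0.15, 0.1]]

good = compare_models(good_sample, full_model)
poor = compare_models(poor_sample, full_model)
```

In this toy setup the well-matched sample recovers both full-corpus topics, while the poorly matched sample collapses onto one of them, so its overlap is lower and its mean distance is higher.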

January 26, 2016
Darwin’s Semantic Voyage

The preprint of my project “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks” was released on arXiv on Friday. The paper is joint work with my advisors Colin Allen and Simon DeDeo.

This has consumed my life for the past year and I’m incredibly proud of the results. It’s an entertaining read: printing pages “1-11,24-28” gives the main body and references. Pages 12-23 are the “supporting information” explaining some of the archival work, mathematics, and model verification, but they are not central to the key points of the paper.

The key point for digital humanities is that we’ve come up with a way to characterize an individual’s reading behaviors and identify key biographical periods from their life. Darwin is incredibly well-studied, so our results largely confirm existing history of science work. However, by adjusting the granularity we can also suggest hypotheses for further investigation – in this case, the period of Darwin’s life from 1851-1853 after his daughter’s death. For less well-studied individuals, this may help humanists gain traction on narrative organization when interacting with large historical archives.

The key point for cognitive scientists is that we can now characterize information foraging behaviors on multiple timescales using an information-theoretic measure of cognitive surprise. While many people have studied foraging behavior in individuals on the order of minutes, or in cultures on the order of decades, this is the first study that looks at how an individual interacts with the products of their culture over the course of a lifetime.
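
For readers who want a feel for what an information-theoretic measure of surprise can look like, here is a minimal sketch. It assumes each text is summarized as a probability distribution over topics and uses Kullback-Leibler divergence from the previous text to the next one as the surprise score. That choice, the function name `surprise`, and the toy numbers are illustrative assumptions; the paper has the exact formulation.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits between two discrete distributions.

    Assumes q is strictly positive wherever p is, which holds for
    smoothed topic distributions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surprise(previous_text, next_text):
    """Hypothetical surprise score: how poorly the previous text's topic
    mixture predicts the next text's topic mixture."""
    return kl_divergence(next_text, previous_text)


# Toy topic mixtures over three topics (made-up numbers).
previous_book = [0.70, 0.20, 0.10]
similar_book = [0.65, 0.25, 0.10]   # stays with familiar topics
distant_book = [0.05, 0.15, 0.80]   # jumps to a new domain

exploit_score = surprise(previous_book, similar_book)
explore_score = surprise(previous_book, distant_book)
```

A low score corresponds to exploitation (reading more of the same), and a high score corresponds to exploration (moving into unfamiliar territory).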

It’s important to note that we don’t say anything about how his reading affected his writing – that’s for paper #2!

Also, I’ll be presenting this work at the 2015 Conference on Complex Systems this Friday at Arizona State University, with slides available on Google Slides.

Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks
Jaimie Murdock, Colin Allen, Simon DeDeo
Abstract: Search in an environment with an uncertain distribution of resources involves a trade-off between local exploitation and distant exploration. This extends to the problem of information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. Darwin built his theory of natural selection in part by synthesizing disparate parts of Victorian science. When we analyze his extensively self-documented reading we find shifts, on multiple timescales, between choosing to remain with familiar topics and seeking cognitive surprise in novel fields. On the longest timescales, these shifts correlate with major intellectual epochs of his career, as detected by Bayesian epoch estimation. When we compare Darwin’s reading path with publication order of the same texts, we find Darwin more adventurous than the culture as a whole.

September 30, 2015
Thomas Jefferson’s Mind: Polymathic and Polygraphic

This summer I am a Fellow at the International Center for Jefferson Studies (ICJS) at Monticello writing a grant proposal for the study of Thomas Jefferson’s Mind through his libraries. The day I arrived – May 8, 2015 – was the 200th anniversary of Jefferson’s sale to the Library of Congress of 6,497 volumes for $23,950 to replace the Library which was burned during the British invasion of Washington on August 24, 1814. This sale more than doubled the library’s catalog of 3,076 volumes and forever changed this national institution.

On Thursday, June 4, Colin Allen and I will be presenting our proposal to the Center via a public lecture for feedback from the community of historians, librarians, and other vested interests. The abstract and our bios are below.

Thomas Jefferson’s Mind: Polymathic and Polygraphic
Jaimie Murdock and Colin Allen
ICJS Fellows’ Forum, June 4, 2015

Jefferson collected thousands of books, wrote nearly 20,000 letters, and generated tens of thousands of other papers, keeping copies made with his famous double-penned “polygraph” machine. The digitization of his own writings, and the possibility of recreating his libraries from digital collections such as the HathiTrust, presents members of the public with unprecedented opportunities to explore and understand the interleaved themes of Jefferson’s life and career through the network of themes crisscrossing his reading and writing. We present our proposal to the Prototype Phase of the NEH Digital Projects for the Public Program, TJmind, which aids humanistic interpretation through novel, informative interfaces to both the books he read and the letters he wrote. These tools will make it possible for members of the public interested in Thomas Jefferson to discover themes writ large and small, and to drill down from high-level “distant readings” of the words that shaped his work into specific texts. By encouraging the public to engage directly with the texts and supporting their own “close readings” of thematically related documents, we can educate a broad audience about the interpretive process of the humanities. Our approach will also allow scholars who wish to apply computational methods in more depth to address research questions of interest to historians and other humanists, and to make these investigations available to a broad, general audience. Through both scholarly and public interactions, we set out on a discovery of the many facets of Jefferson, polymathic and polygraphic: from his interest in the Virginia climate to his concerns about the effects of Barbary pirates on American foreign trade, and of course from his historically significant role in designing the constitution, to his central position in defending American interests during two major wars.

Colin Allen is a Provost’s Professor at Indiana University, where he teaches in the Cognitive Science Program and the Department of History and Philosophy of Science. He is a philosopher of science whose research spans animal cognition, the prospects for artificial moral agents, and algorithmic analysis of philosophical texts. http://pages.iu.edu/~colallen/

Jaimie Murdock is a joint PhD Student in Cognitive Science and Informatics at Indiana University studying the dynamics of expert learning and innovation through the lenses of history and philosophy of science, machine learning, and cognition. He is a prolific programmer with eight years of research experience in digital humanities as the lead developer of the InPhO Project. http://jamram.net/

May 30, 2015
Topic Modeling Tutorial at JCDL2015

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!

Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]

We draw your attention to the astonishingly low half-day tutorial fees:

Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40

Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60

Hope to see you there!

http://www.jcdl2015.org/registration

May 14, 2015