Towards Cultural-Scale Models of Full Text

For the past year, Colin and I have been on a HathiTrust Advanced Collaborative Support (ACS) Grant. This project has examined how topic models differ between library subject areas. For example, some areas may have a “canon,” meaning that a small number of topics selects the same themes no matter what the corpus size is. In contrast, still-emerging fields may not agree on the overall thematic structure. We also looked at how sample size affects these models. We’ve uploaded the initial technical report to the arXiv:

Towards Cultural Scale Models of Full Text
Jaimie Murdock, Jiaan Zeng, Colin Allen
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the Hathi Trust.
http://arxiv.org/abs/1512.05004
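
For readers who want a feel for what "topic distance" and "topic overlap" could mean in practice, here is a minimal sketch. The abstract above does not say which measures the report uses, so both choices here are assumptions for illustration: Jensen-Shannon distance between two topics' word distributions, and Jaccard overlap of their top-n words. The function names and toy numbers are hypothetical.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two probability
    distributions given as equal-length lists over the same vocabulary.
    Assumed measure for illustration; ranges from 0 (identical) to 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


def top_word_overlap(topic_a, topic_b, n=10):
    """Jaccard overlap of the top-n words of two topics, each given as a
    {word: probability} dict. Assumed measure for illustration."""
    top_a = set(sorted(topic_a, key=topic_a.get, reverse=True)[:n])
    top_b = set(sorted(topic_b, key=topic_b.get, reverse=True)[:n])
    union = top_a | top_b
    if not union:
        return 0.0
    return len(top_a & top_b) / len(union)


# Toy topics from two hypothetical models (made-up numbers).
sample_topic = {"species": 0.5, "selection": 0.3, "island": 0.2}
corpus_topic = {"species": 0.6, "island": 0.3, "voyage": 0.1}
overlap = top_word_overlap(sample_topic, corpus_topic, n=2)
```

Under these assumptions, a model trained on a subsample "matches" the full-corpus model when its topics sit at a small distance from, and share most top words with, their nearest full-corpus counterparts.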


Psychonomics 2015

This weekend I was in Chicago for the Psychonomic Society and Society for Computers in Psychology meetings. Emily and I stayed Thursday through Saturday and experienced a record first snow of the season. I hope that our fellow conference-goers made it back safely as well.

Chicago is one of the best food towns we’ve ever been to: we cannot recommend Gino’s East deep-dish pizza and Santorini’s Greek restaurant enough.

Below are some conference observations and highlights.

Conference Impressions
Because Psychonomics is an abstract-only, non-proceedings conference, it is a great opportunity to showcase developing or under-review work. For an idea of the breadth of the conference, please look at the abstract book. The talks were of varying quality, but the rapt attention of the audience and the quality of the questions were excellent. Next year it will be in Boston on November 17-20.

Distributed Cognition
One of the best talks was by Steven Sloman on “The Illusion of Explanatory Depth and the Community of Knowledge”:

Asking people to explain how something works reveals an illusion of explanatory depth: Typically, people know less about the causal mechanism they are describing than they think they do (Rozenblit & Keil, 2002). I report studies showing that explanation shatters people’s sense of understanding in politics. I also show that people’s sense of understanding increases when they are informed that someone else understands and that this effect is not attributable to task demands or understandability inferences. The evidence suggests that our sense of understanding resides in a community of knowledge: People fail to distinguish the knowledge inside their heads from the knowledge in other people’s heads.

The article detailing that explanation shatters political understanding is quite accessible. The further results about “a community of knowledge” are under review.

Prof. Sloman is the conference chair for the International Conference on Thinking on August 3-6, 2016 at Brown University. Submission deadline is March 31, 2016.

The Science of Narrative
Another excellent talk was by Mark Finlayson, who studies “the science of narrative”. He developed “Analogical Story Merging” (ASM), which can replicate Vladimir Propp’s theory of the structure of folktale plots. This process is described in his dissertation, which is an excellent synthesis of literary theory and computer science.

Prof. Finlayson is hosting the 7th International Workshop on Computational Models of Narrative at Digital Humanities 2016 in Kraków, Poland on July 11-12. The call for papers is pending.

Bilingualism

There were two talks in the Bilingualism track that were particularly interesting. Conor McLennan and Sara Incera reported that mouse tracking behavior in bilinguals doing a word discrimination task shows the same sort of reaction delay as in expert discrimination tasks. This correlates with confidence in answers – experts may take longer but move directly to their answers. The results are published in Bilingualism.

Another talk looked at how multilingualism affects vocabulary size using a massive online experiment. While the task of identifying whether a word is known or not is riddled with false positives, the results were interesting in and of themselves. Multilinguals tended to have larger vocabularies across languages, and L2 learners tended to actually have a larger vocabulary than L1 native speakers within a language. The results are published in The Quarterly Journal of Experimental Psychology.


Darwin’s Semantic Voyage

The preprint of my project “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks” was released on arXiv on Friday. The paper is joint work with my advisors Colin Allen and Simon DeDeo.

This has consumed my life for the past year and I’m incredibly proud of the results. It’s an entertaining read: printing pages “1-11,24-28” gives the main body and references. Pages 12-23 are the “supporting information” explaining some of the archival work, mathematics, and model verification, which is not central to the key points of the paper.

The key point for digital humanities is that we’ve come up with a way to characterize an individual’s reading behaviors and identify key biographical periods from their life. Darwin is incredibly well-studied, so our results largely confirm existing history of science work. However, by adjusting the granularity we can also suggest hypotheses for further investigation – in this case, the period of Darwin’s life from 1851-1853 after his daughter’s death. For less well-studied individuals, this may help humanists gain traction on narrative organization when interacting with large historical archives.

The key point for cognitive scientists is that we can now characterize information foraging behaviors on multiple timescales using an information theoretic measure of cognitive surprise. While many people have studied foraging behavior in individuals on the order of minutes, or in cultures on the order of decades – this is the first study that looks at how an individual interacts with the products of their culture over the course of a lifetime.
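
To make "an information theoretic measure of cognitive surprise" a little more concrete, here is a minimal sketch. It assumes each text is summarized as a topic distribution and that surprise is the Kullback-Leibler divergence from the previous text to the next one; that choice, the function names, and the numbers are my illustration here, so see the paper for the actual definitions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits: how surprising distribution p is to a reader
    who expects q. Assumes q is nonzero wherever p is nonzero."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def reading_surprise(readings):
    """Surprise of each text relative to the one read just before it.
    `readings` is a list of topic distributions in reading order."""
    return [kl_divergence(cur, prev)
            for prev, cur in zip(readings, readings[1:])]


# Hypothetical topic mixtures for three consecutive books (made-up numbers).
readings = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],  # close to the previous book: low surprise
    [0.1, 0.1, 0.8],  # a jump to a new field: high surprise
]
surprises = reading_surprise(readings)
```

In this toy sequence the second value is much larger than the first, which is the kind of signal one would read as a shift from staying with familiar topics to exploring a novel field.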

It’s important to note that we don’t say anything about how his reading affected his writing – that’s for paper #2!

Also, I’ll be presenting this work at the 2015 Conference on Complex Systems this Friday at Arizona State University, with slides available on Google Slides.

Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks
Jaimie Murdock, Colin Allen, Simon DeDeo
Abstract: Search in an environment with an uncertain distribution of resources involves a trade-off between local exploitation and distant exploration. This extends to the problem of information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. Darwin built his theory of natural selection in part by synthesizing disparate parts of Victorian science. When we analyze his extensively self-documented reading we find shifts, on multiple timescales, between choosing to remain with familiar topics and seeking cognitive surprise in novel fields. On the longest timescales, these shifts correlate with major intellectual epochs of his career, as detected by Bayesian epoch estimation. When we compare Darwin’s reading path with publication order of the same texts, we find Darwin more adventurous than the culture as a whole.


Thomas Jefferson’s Mind: Polymathic and Polygraphic

This summer I am a Fellow at the International Center for Jefferson Studies (ICJS) at Monticello writing a grant proposal for the study of Thomas Jefferson’s Mind through his libraries. The day I arrived – May 8, 2015 – was the 200th anniversary of Jefferson’s sale to the Library of Congress of 6,497 volumes for $23,950 to replace the Library which was burned during the British invasion of Washington on August 24, 1814. This sale more than doubled the library’s catalog of 3,076 volumes and forever changed this national institution.

On Thursday, June 4, Colin Allen and I will be presenting our proposal to the Center via a public lecture for feedback from the community of historians, librarians, and other vested interests. The abstract and our bios are below.

Thomas Jefferson’s Mind: Polymathic and Polygraphic
Jaimie Murdock and Colin Allen
ICJS Fellows’ Forum, June 4, 2015

Jefferson collected thousands of books, wrote nearly 20,000 letters, and generated tens of thousands of other papers, keeping copies made with his famous double-penned “polygraph” machine. The digitization of his own writings, and the possibility of recreating his libraries from digital collections such as the HathiTrust, presents members of the public with unprecedented opportunities to explore and understand the interleaved themes of Jefferson’s life and career through the network of themes crisscrossing his reading and writing. We present our proposal to the Prototype Phase of the NEH Digital Projects for the Public Program, TJmind, which aids humanistic interpretation through novel, informative interfaces to both the books he read and the letters he wrote. These tools will make it possible for members of the public interested in Thomas Jefferson to discover themes writ large and small, and to drill down from high-level “distant readings” of the words that shaped his work into specific texts. By encouraging the public to engage directly with the texts and supporting their own “close readings” of thematically related documents, we can educate a broad audience about the interpretive process of the humanities. Our approach will also allow scholars who wish to apply computational methods in more depth to address research questions of interest to historians and other humanists, and to make these investigations available to a broad, general audience. Through both scholarly and public interactions, we set out on a discovery of the many facets of Jefferson, polymathic and polygraphic: from his interest in the Virginia climate to his concerns about the effects of Barbary pirates on American foreign trade, and of course from his historically significant role in designing the constitution, to his central position in defending American interests during two major wars.

Colin Allen is a Provost’s Professor at Indiana University, where he teaches in the Cognitive Science Program and the Department of History and Philosophy of Science. He is a philosopher of science whose research spans animal cognition, the prospects for artificial moral agents, and algorithmic analysis of philosophical texts. http://pages.iu.edu/~colallen/

Jaimie Murdock is a joint PhD Student in Cognitive Science and Informatics at Indiana University studying the dynamics of expert learning and innovation through the lenses of history and philosophy of science, machine learning, and cognition. He is a prolific programmer with eight years of research experience in digital humanities as the lead developer of the InPhO Project. http://jamram.net/


Topic Modeling Tutorial at JCDL2015

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]

We draw your attention to the astonishingly low half-day tutorial fees:

Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40

Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60

Hope to see you there!


Six Upcoming Talks

For the past 6 months, I’ve been very busy working on a number of collaborations with Simon DeDeo and Colin Allen. Now, I’m taking to the road to show the fruit of my labors. Below are 6 upcoming talks, tutorials, and workshops about this work on topic modeling, Charles Darwin, information foraging, and the HathiTrust. I hope to see you there!

Topics over Time: Into Darwin’s Mind (Local)
Network Science @ IU Talks
Monday, March 9 — 12:30-1pm
Social Science Research Commons
Slides: http://jamr.am/DarwinIUNetSci
Video coming soon!

Topic Modeling with the HathiTrust Data Capsule
HathiTrust UnCamp 2015
Monday, March 30
Ann Arbor, MI
Presenters: Jaimie Murdock, Colin Allen

Topic-driven Foraging (Local)
Goldstone, Todd, Landy Lab
Friday, April 10 — 9-10a
MSB II Gill Conference Room

Visualization Techniques for LDA (Local)
Cognitive Science 25th Anniversary
Interactive Systems Open House
Friday, April 17 — 3:30-5:15pm
Location TBD

Topic Modeling & Network Analysis (Local)
Catapult Center Workshops
Friday, April 24 — 1-4pm
Wells Library E159
Presenter: Colin Allen

HT Data Capsule & Topic Modeling for Non-consumptive Research
JCDL 2015 Tutorial
Sunday, June 21 — 9am-noon
Knoxville, TN
Presenters: Jaimie Murdock, Jiaan Zeng, Robert McDonald


Wisdom of the Few?

Wisdom of the Few? “Supertaggers” in Collaborative Tagging Systems

Jared Lorince, Sam Zorowitz, Jaimie Murdock, Peter M. Todd

A folksonomy is ostensibly an information structure built up by the “wisdom of the crowd”, but is the “crowd” really doing the work? Tagging is in fact a sharply skewed process in which a small minority of “supertagger” users generate an overwhelming majority of the annotations. Using data from three large-scale social tagging platforms, we explore (a) how to best quantify the imbalance in tagging behavior and formally define a supertagger, (b) how supertaggers differ from other users in their tagging patterns, and (c) if effects of motivation and expertise inform our understanding of what makes a supertagger. Our results indicate that such prolific users not only tag more than their counterparts, but in quantifiably different ways. These findings suggest that we should question the extent to which folksonomies achieve crowdsourced classification via the “wisdom of the crowd”, especially for broad folksonomies like Last.fm as opposed to narrow folksonomies like Flickr.

Preprint of article in review available at arXiv:1502.02777 [cs.SI]
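
As a rough illustration of what "quantifying the imbalance in tagging behavior" can look like, here is a minimal sketch using two generic inequality measures: the Gini coefficient of per-user annotation counts and the share of annotations produced by the most active users. These are assumptions chosen for illustration, not necessarily the measures used in the paper, and the counts below are made up.

```python
def gini(counts):
    """Gini coefficient of non-negative per-user annotation counts.
    0 means every user tags equally; values near 1 mean a few users
    produce almost everything."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


def top_share(counts, fraction=0.1):
    """Fraction of all annotations produced by the top `fraction` of
    users (at least one user is always included)."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    if total == 0:
        return 0.0
    k = max(1, round(len(xs) * fraction))
    return sum(xs[:k]) / total


# Hypothetical annotation counts for ten users (made-up numbers).
annotations_per_user = [1, 1, 2, 3, 3, 5, 8, 40, 120, 900]
imbalance = gini(annotations_per_user)
share = top_share(annotations_per_user, fraction=0.1)
```

A cutoff on either measure is one possible way to operationalize "supertagger", though where to put that cutoff is exactly the kind of question the paper takes up.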


Topic Explorer at AAAI

Next week, I’ll be headed to Austin, TX for AAAI-15 to present a demo of the Topic Explorer. Accompanying the presentation is a short paper:

Topic models remain a black box both for modelers and for end users in many respects. From the modelers’ perspective, many decisions must be made which lack clear rationales and whose interactions are unclear – for example, how many topics the algorithms should find (K), which words to ignore (aka the “stop list”), and whether it is adequate to run the modeling process once or multiple times, producing different results due to the algorithms that approximate the Bayesian priors. Furthermore, the results of different parameter settings are hard to analyze, summarize, and visualize, making model comparison difficult. From the end users’ perspective, it is hard to understand why the models perform as they do, and information-theoretic similarity measures do not fully align with humanistic interpretation of the topics. We present the Topic Explorer, which advances the state-of-the-art in topic model visualization for document-document and topic-document relations. It brings topic models to life in a way that fosters deep understanding of both corpus and models, allowing users to generate interpretive hypotheses and to suggest further experiments. Such tools are an essential step toward assessing whether topic modeling is a suitable technique for AI and cognitive modeling applications.

Jaimie Murdock and Colin Allen. (2015) Visualization Techniques for Topic Model Checking. [demo track] in Proceedings of the 29th AAAI Conference (AAAI-15). Austin, Texas, USA, January 25-29, 2015.


Nothing was ever the same again…

On a frigid night in February, I was supposed to show a prospective student around Bloomington, but it turned into an evening with a beautiful linguist flirting about prescriptivism. Before she left, I asked for her number. She gave it to me, double checked it, kissed me, said “call me, let me know if you’re serious,” and left the club. I was. When I got three calls during our first date saying that my apartment was on fire, I told her “I’m not a fireman!” and did everything I could to rush back to her.


Granddad’s Indonesian Career

The Granddad had two tenures in Indonesia at the Bogor Agricultural School (Institut Pertanian Bogor – IPB) from 1968-1970 and 1980-1985. IPB became independent from the University of Indonesia in 1963, and Granddad’s work was instrumental in its reorganization as the first degree-granting agricultural school in Indonesia. In his first term, he created the 4-year undergraduate curriculum and set general education requirements, helping the university exceed its goal of 20,000 graduates by the year 2000. In his second term, he served as Director of the Graduate Education Project and began issuing doctoral degrees.

IPB was founded on a “tri-darma” of teaching, research, and extension, which matched the educational philosophy of the American land grant universities that trained Granddad. His design for the flagship Darmaga Campus, located on a Dutch rubber plantation, recognized that a university hosts not only research and faculty, but also students and their families. Therefore, it included classroom buildings, research and teaching fields, extension offices, residence halls, and chapels. On a tour of the campus on Christmas Eve 2013, he was especially proud that the church and the mosque were located on the same courtyard sharing the same playground, that actual rubber, banana, rice, and corn fields for the students had been preserved, and that the library had been vastly expanded. He visited his IPB colleagues every year from his retirement at UW-Madison through his death in 2014 (pictured laughing in 2006, below).

Granddad talking about his career on our way to the IPB Darmaga Campus (December 2013)

Gathering of The Granddad and IPB colleagues at Aunt Cindy’s house (December 2006)
