## Towards Cultural-Scale Models of Full Text

For the past year, Colin and I have been working under a HathiTrust Advanced Collaborative Support (ACS) grant. The project examines how topic models differ between library subject areas. For example, some areas may have a "canon", meaning that a model with a small number of topics selects the same themes regardless of corpus size, while still-emerging fields may not agree on an overall thematic structure. We also looked at how sample size affects these models. We've uploaded the initial technical report to the arXiv:

Towards Cultural Scale Models of Full Text
Jaimie Murdock, Jiaan Zeng, Colin Allen
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the Hathi Trust.
http://arxiv.org/abs/1512.05004
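The "topic distance" and "topic overlap" measures in the abstract can be illustrated with a small sketch. This is my own toy illustration, using Jensen-Shannon distance between topic-word distributions; the function names and the greedy matching strategy are assumptions, not the paper's code:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_distances(model_a, model_b):
    """Pairwise Jensen-Shannon distances between the topic-word
    distributions (rows summing to 1) of two topic models."""
    return np.array([[jensenshannon(ta, tb) for tb in model_b]
                     for ta in model_a])

def match_topics(model_a, model_b):
    """Greedily pair each topic in model_a with its nearest topic in
    model_b; the mean matched distance is one rough measure of the
    'topic distance' between the two models."""
    d = topic_distances(model_a, model_b)
    pairs = [(i, int(np.argmin(row))) for i, row in enumerate(d)]
    return pairs, float(np.mean([d[i, j] for i, j in pairs]))

# Two toy 3-topic models over a 4-word vocabulary
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(4), size=3)
b = rng.dirichlet(np.ones(4), size=3)
pairs, dist = match_topics(a, b)
```

In the study's terms, as the subsample grows, a distance like this between the subsample's model and the full-corpus model shrinks.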

## Psychonomics 2015

This weekend I was in Chicago for the Psychonomic Society and Society for Computers in Psychology meetings. Emily and I stayed Thursday through Saturday and experienced a record first snow of the season. I hope that our fellow conference-goers made it back safely as well.

Chicago is one of the best food towns we’ve ever been to: we cannot recommend Gino’s East deep-dish pizza and Santorini’s Greek restaurant enough.

Below are some conference observations and highlights.

Conference Impressions
Because it is an abstract-only conference with no proceedings, it offers a great opportunity to showcase work that is still developing or under review. For an idea of the breadth of the conference, please look at the abstract book. The talks were of varying quality, but the audience's rapt attention and the quality of the questions were excellent. Next year it will be in Boston on November 17-20.

Distributed Cognition
One of the best talks was by Steven Sloman on “The Illusion of Explanatory Depth and the Community of Knowledge”:

Asking people to explain how something works reveals an illusion of explanatory depth: Typically, people know less about the causal mechanism they are describing than they think they do (Rozenblit & Keil, 2002). I report studies showing that explanation shatters people’s sense of understanding in politics. I also show that people’s sense of understanding increases when they are informed that someone else understands and that this effect is not attributable to task demands or understandability inferences. The evidence suggests that our sense of understanding resides in a community of knowledge: People fail to distinguish the knowledge inside their heads from the knowledge in other people’s heads.

The article detailing that explanation shatters political understanding is quite accessible. The further results about “a community of knowledge” are under review.

Prof. Sloman is the conference chair for the International Conference on Thinking on August 3-6, 2016 at Brown University. Submission deadline is March 31, 2016.

The Science of Narrative
Another excellent talk was by Mark Finlayson, who studies "the science of narrative". He developed "Analogical Story Merging" (ASM), which can replicate Vladimir Propp's theory of the structure of folktale plots. The process is described in his dissertation, an excellent synthesis of literary theory and computer science.

Prof. Finlayson is hosting the 7th International Workshop on Computational Models of Narrative at Digital Humanities 2016 in Kraków, Poland on July 11-12. The call for papers is pending.

Bilingualism

There were two talks in the Bilingualism track that were particularly interesting. Conor McLennan and Sara Incera reported that mouse-tracking behavior in bilinguals performing a word-discrimination task shows the same sort of reaction delay seen in expert discrimination tasks. This correlates with confidence in answers: experts may take longer to start, but then move directly to their answers. The results are published in Bilingualism.

Another talk looked at how multilingualism affects vocabulary size using a massive online experiment. While the task of identifying whether a word is known or not is riddled with false positives, the results were interesting in and of themselves. Multilinguals tended to have larger vocabularies across languages, and L2 learners actually tended to have larger vocabularies than native L1 speakers within a language. The results are published in The Quarterly Journal of Experimental Psychology.

## Darwin’s Semantic Voyage

The preprint of my project “Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks” was released on arXiv on Friday. The paper is joint work with my advisors Colin Allen and Simon DeDeo.

This has consumed my life for the past year and I'm incredibly proud of the results. It's an entertaining read: printing pages "1-11,24-28" gives the main body and references, while pages 12-23 are the "supporting information" explaining some of the archival work, mathematics, and model verification, which is not central to the key points of the paper.

The key point for digital humanities is that we’ve come up with a way to characterize an individual’s reading behaviors and identify key biographical periods from their life. Darwin is incredibly well-studied, so our results largely confirm existing history of science work. However, by adjusting the granularity we can also suggest hypotheses for further investigation – in this case, the period of Darwin’s life from 1851-1853 after his daughter’s death. For less well-studied individuals, this may help humanists gain traction on narrative organization when interacting with large historical archives.

The key point for cognitive scientists is that we can now characterize information foraging behaviors on multiple timescales using an information-theoretic measure of cognitive surprise. While many people have studied foraging behavior in individuals on the order of minutes, or in cultures on the order of decades, this is the first study that looks at how an individual interacts with the products of their culture over the course of a lifetime.

It’s important to note that we don’t say anything about how his reading affected his writing – that’s for paper #2!

Also, I'll be presenting this work at the 2015 Conference on Complex Systems this Friday at Arizona State University, with slides available on Google Slides.

Exploration and Exploitation of Victorian Science in Darwin’s Reading Notebooks
Jaimie Murdock, Colin Allen, Simon DeDeo
Abstract: Search in an environment with an uncertain distribution of resources involves a trade-off between local exploitation and distant exploration. This extends to the problem of information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. Darwin built his theory of natural selection in part by synthesizing disparate parts of Victorian science. When we analyze his extensively self-documented reading we find shifts, on multiple timescales, between choosing to remain with familiar topics and seeking cognitive surprise in novel fields. On the longest timescales, these shifts correlate with major intellectual epochs of his career, as detected by Bayesian epoch estimation. When we compare Darwin’s reading path with publication order of the same texts, we find Darwin more adventurous than the culture as a whole.
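The "cognitive surprise" in the abstract is information-theoretic. As a rough sketch of the general idea (not the paper's exact model), the surprise of a newly read text can be taken as the KL divergence between its topic mixture and the average mixture of texts read so far:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def surprise(current, past):
    """Cognitive surprise of a newly read text, sketched as the KL
    divergence between its topic mixture and the mean mixture of
    previously read texts (all vectors sum to 1)."""
    background = np.mean(past, axis=0)
    return float(entropy(current, background))

# Toy example: a reader who has stayed within familiar topics
past = np.array([[0.70, 0.20, 0.10],
                 [0.60, 0.30, 0.10]])
familiar = np.array([0.65, 0.25, 0.10])  # close to prior reading
novel    = np.array([0.05, 0.15, 0.80])  # a jump to a new field
```

Exploitation keeps `surprise` low; exploration (Darwin seeking out a new field) spikes it.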

## Topic Modeling Tutorial at JCDL2015

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM's secure mode to download texts and analyze their contents. [tutorial paper]

We draw your attention to the astonishingly low half-day tutorial fees:

| Half-Day Tutorial/Workshop | Early Registration (by May 22!) | Late/Onsite Registration |
|---|---|---|
| ACM/IEEE/SIG/ASIS&T Members | $70 | $95 |
| Non-ACM/IEEE/SIG/ASIS&T Members | $95 | $120 |
| ACM/IEEE/SIG/ASIS&T Student | $20 | $40 |
| Non-member Student | $40 | $60 |

Hope to see you there!

## Six Upcoming Talks

For the past 6 months, I’ve been very busy working on a number of collaborations with Simon DeDeo and Colin Allen. Now, I’m taking to the road to show the fruit of my labors. Below are 6 upcoming talks, tutorials, and workshops about this work on topic modeling, Charles Darwin, information foraging, and the HathiTrust. I hope to see you there!

Topics over Time: Into Darwin’s Mind (Local)
Network Science @ IU Talks
Monday, March 9 — 12:30-1pm
Social Science Research Commons
Slides: http://jamr.am/DarwinIUNetSci
Video coming soon!

Topic Modeling with the HathiTrust Data Capsule
HathiTrust UnCamp 2015
Monday, March 30
Ann Arbor, MI
Presenters: Jaimie Murdock, Colin Allen

Topic-driven Foraging (Local)
Goldstone, Todd, Landy Lab
Friday, April 10 — 9-10a
MSB II Gill Conference Room

Visualization Techniques for LDA (Local)
Cognitive Science 25th Anniversary
Interactive Systems Open House
Friday, April 17 — 3:30-5:15pm
Location TBD

Topic Modeling & Network Analysis (Local)
Catapult Center Workshops
Friday, April 24 — 1-4pm
Wells Library E159
Presenter: Colin Allen

HT Data Capsule & Topic Modeling for Non-consumptive Research
JCDL 2015 Tutorial
Sunday, June 21 — 9am-noon
Knoxville, TN
Presenters: Jaimie Murdock, Jiaan Zeng, Robert McDonald

## Wisdom of the Few?

Wisdom of the Few? “Supertaggers” in Collaborative Tagging Systems

Jared Lorince, Sam Zorowitz, Jaimie Murdock, Peter M. Todd

A folksonomy is ostensibly an information structure built up by the "wisdom of the crowd", but is the "crowd" really doing the work? Tagging is in fact a sharply skewed process in which a small minority of "supertagger" users generate an overwhelming majority of the annotations. Using data from three large-scale social tagging platforms, we explore (a) how to best quantify the imbalance in tagging behavior and formally define a supertagger, (b) how supertaggers differ from other users in their tagging patterns, and (c) if effects of motivation and expertise inform our understanding of what makes a supertagger. Our results indicate that such prolific users not only tag more than their counterparts, but in quantifiably different ways. These findings suggest that we should question the extent to which folksonomies achieve crowdsourced classification via the "wisdom of the crowd", especially for broad folksonomies like Last.fm as opposed to narrow folksonomies like Flickr.

Preprint of article in review available at arXiv:1502.02777 [cs.SI]
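One standard way to quantify this kind of imbalance is the Gini coefficient. The paper evaluates several candidate measures; this is just one illustration of the idea, not the paper's definition of a supertagger:

```python
def gini(counts):
    """Gini coefficient of per-user tag counts:
    0 = everyone tags equally; values near 1 mean a few
    'supertaggers' produce nearly all annotations."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula over the ascending-sorted values
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

even = [10] * 10          # a perfectly egalitarian crowd
skewed = [1] * 9 + [991]  # one supertagger dominates
```

Here `gini(even)` is 0 while `gini(skewed)` approaches 1, matching the skew the abstract describes.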

## Topic Explorer at AAAI

Next week, I'll be headed to Austin, TX for AAAI-15 to present a demo of the Topic Explorer. Accompanying the presentation is a short paper:

Topic models remain a black box both for modelers and for end users in many respects. From the modelers’ perspective, many decisions must be made which lack clear rationales and whose interactions are unclear – for example, how many topics the algorithms should find (K), which words to ignore (aka the “stop list”), and whether it is adequate to run the modeling process once or multiple times, producing different results due to the algorithms that approximate the Bayesian priors. Furthermore, the results of different parameter settings are hard to analyze, summarize, and visualize, making model comparison difficult. From the end users’ perspective, it is hard to understand why the models perform as they do, and information-theoretic similarity measures do not fully align with humanistic interpretation of the topics. We present the Topic Explorer, which advances the state-of-the-art in topic model visualization for document-document and topic-document relations. It brings topic models to life in a way that fosters deep understanding of both corpus and models, allowing users to generate interpretive hypotheses and to suggest further experiments. Such tools are an essential step toward assessing whether topic modeling is a suitable technique for AI and cognitive modeling applications.

Jaimie Murdock and Colin Allen. (2015) Visualization Techniques for Topic Model Checking. [demo track] in Proceedings of the 29th AAAI Conference (AAAI-15). Austin, Texas, USA, January 25-29, 2015.

## The InPhO Topic Explorer

This week, I launched The InPhO Topic Explorer. Through an interactive visualization, the Topic Explorer exposes one way search engine results are generated and allows more focused exploration than a simple list of related documents. Using the LDA machine learning algorithm, the explorer infers topics from arbitrary text corpora. The current demo is trained on the Stanford Encyclopedia of Philosophy, but I will be expanding it to other collections in the next few weeks.

The color bands within each article's row show the topic distribution within that article, and the relative size of each band indicates the weight of that topic in the article. The full width of each row indicates the similarity to the focus article. Each topic's label and color are arbitrarily assigned, but are consistent across articles in the browser.

Display options include topic normalization, alphabetical sort, and topic sort. When topics are normalized, each bar expands to the full width, so topic weights can be compared across documents. Clicking a topic reorders the documents according to that topic's weight, and reorders the topic bars according to the topic weights in the highest-weighted document.
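The effect of normalization can be sketched in a few lines (a toy illustration with hypothetical names, not the Explorer's actual code): each row is a document's topic mixture, scaled by its similarity to the focus article unless normalization is on.

```python
import numpy as np

def bar_widths(doc_topics, similarity, normalize=False):
    """Band widths for each document row in the visualization.
    doc_topics: one row of topic weights per document.
    similarity: each document's similarity to the focus article."""
    rows = doc_topics / doc_topics.sum(axis=1, keepdims=True)
    if normalize:
        return rows                         # every row spans full width
    return rows * similarity[:, None]       # row width tracks similarity

doc_topics = np.array([[2.0, 2.0],
                       [1.0, 3.0]])
similarity = np.array([1.0, 0.5])
```

With `normalize=True` both rows sum to 1, making per-document topic weights directly comparable; without it, the second row is half as wide as the first.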

By varying the number of topics, one can get a finer or coarser-grained analysis of the areas discussed in the articles. The visualization currently has 20, 40, 60, 80, 100, and 120 topic models for the Stanford Encyclopedia of Philosophy.

In contrast to a search engine, which displays articles based on a single similarity measure, the topic explorer lets you reorder results based on what you're interested in. For example, if you're looking at animal consciousness (80 topics), you can click on topic 46 to see those that are closest in the "animals" category, while 46 shows "consciousness" and 42 shows "perception" (arbitrary labels chosen). Some topics contain many general words like "theory", "case", "would", and "even"; these general argumentative topics can indicate areas where debate is still ongoing.

In early explorations, the visualization already highlights some interesting phenomena:

• For central articles, such as kant (40 topics), one finds that a single topic (topic 30) comprises much of the article. By increasing the number of topics, such as to kant (120 topics), topic 77 now captures the “kant”-ness of the article, but several other components can now be explored. This shows the value of having multiple topic models.
• For creationism (120 topics), one can see that the particular blend of topics generating that article is a true outlier: the probability of generating the next closest document is only just over .5. Compare this to the distribution of top articles related to animal-consciousness (120 topics) or kant (120 topics). Can you find other outliers in the SEP?

The underlying dataset was generated using the InPhO VSM module’s LDA implementation. See Wikipedia: Latent Dirichlet Allocation for more on the LDA topic modeling approach or “Probabilistic Topic Models” (Blei, 2012) for a recent review.
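The Explorer's models were built with the InPhO VSM module, but the same pipeline can be sketched with scikit-learn as a stand-in (an illustration only, not the Explorer's actual code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the mind and the body in early modern philosophy",
    "perception consciousness and the animal mind",
    "logic truth and formal proof in mathematics",
    "set theory proof and the foundations of mathematics",
]

# Bag-of-words counts; a real run would also apply a stop list
counts = CountVectorizer().fit_transform(docs)

# Train a small LDA model; K=2 topics for this toy corpus
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one mixture row per document
```

Each row of `doc_topics` is the per-document topic distribution that the Explorer renders as colored bands; training with several values of `n_components` gives the coarser and finer-grained views described above.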

Source code and issue tracking are available at GitHub.

## Containing the Semantic Explosion

Yesterday afternoon, I delivered a talk to the PhiloWeb Workshop at the WWW2012 Conference titled “Containing the Semantic Explosion” with Cameron Buckner and Colin Allen. It is an overview of the InPhO Project architecture, known as dynamic ontology, and a preview of some forthcoming data mining tools. [slides]

## Talks

Last week I wrote and then gave two lectures on “Categorization” and “Practical Parallelism”. It was a ton of fun to prepare them, and actually giving them made me realize how much I miss teaching. Abstracts and slides follow.

## Categorization

Student Organization for Cognitive Science (SOCS)
November 15, 2011 @ 5:30pm

Abstract: Categorization is a fundamental problem in cognitive science that goes by a multitude of names: In artificial intelligence, categorization is known as clustering; in mathematics, the problem is partitioning. There are many applications in linguistics, vision, and memory research. In this talk, I will provide a brief overview of exemplar vs. prototype models in the cognitive sciences (Goldstone & Kersten 2003), followed by an introduction to three different general-purpose clustering algorithms: k-means (MacQueen 1967), qt-clust (Heyer et al. 1999), and information-theoretic clustering (Gokcay & Principe 2002). Open-source Python implementations of each algorithm will be provided.
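As a taste of the open-source implementations, here is a minimal k-means sketch (my own toy version, not the talk's provided code): assign each point to its nearest centroid, then move each centroid to its cluster's mean, and repeat.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means over tuples: alternate between assigning
    points to their nearest centroid and recomputing centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # k distinct points as seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((x - y) ** 2
                                      for x, y in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster empties
                centroids[i] = tuple(sum(dim) / len(cluster)
                                     for dim in zip(*cluster))
    return centroids, clusters

# Two well-separated blobs in the plane
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, k=2)
```

On this toy data the algorithm recovers the two blobs regardless of which points seed the centroids.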

Slides

## Practical Parallelism

CS Club Tech Talk
November 17, 2011 @ 7pm

Abstract: In this talk, I will give a brief overview of several key parallelism concepts and practical tools for several languages. After this talk, attendees should have the resources to recognize and solve “painfully parallel problems”. Topics will include: threads vs. processes, Amdahl’s Law, shared vs. distributed memory, synchronization, locks, pipes, queues, process pools, futures, OpenMP, MapReduce, Hadoop, and GPU programming.
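To give one example from the topic list, Amdahl's Law caps the speedup available when only a fraction p of a program can be parallelized across n workers; a quick sketch:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: overall speedup when a fraction p of the work
    parallelizes perfectly across n workers and the rest is serial."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / workers)

# Even with effectively unlimited workers, a 95%-parallel
# program is bounded by its 5% serial portion
limit = amdahl_speedup(0.95, 10**9)
```

Here `limit` approaches 20x: the serial 5% dominates no matter how many cores you throw at the rest.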

Slides