Blog

Topic Modeling Tutorial at JCDL2015

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!

Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]

We draw your attention to the astonishingly low half-day tutorial fees:

Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40

Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60

Hope to see you there!

http://www.jcdl2015.org/registration

May 14, 2015
Six Upcoming Talks

For the past 6 months, I’ve been very busy working on a number of collaborations with Simon DeDeo and Colin Allen. Now, I’m taking to the road to show the fruit of my labors. Below are 6 upcoming talks, tutorials, and workshops about this work on topic modeling, Charles Darwin, information foraging, and the HathiTrust. I hope to see you there!

Topics over Time: Into Darwin’s Mind (Local)
Network Science @ IU Talks
Monday, March 9 – 12:30-1pm
Social Science Research Commons
Slides: http://jamr.am/DarwinIUNetSci
Video coming soon!

Topic Modeling with the HathiTrust Data Capsule
HathiTrust UnCamp 2015
Monday, March 30
Ann Arbor, MI
Presenters: Jaimie Murdock, Colin Allen

Topic-driven Foraging (Local)
Goldstone, Todd, Landy Lab
Friday, April 10 – 9-10am
MSB II Gill Conference Room

Visualization Techniques for LDA (Local)
Cognitive Science 25th Anniversary
Interactive Systems Open House
Friday, April 17 – 3:30-5:15pm
Location TBD

Topic Modeling & Network Analysis (Local)
Catapult Center Workshops
Friday, April 24 – 1-4pm
Wells Library E159
Presenter: Colin Allen

HT Data Capsule & Topic Modeling for Non-consumptive Research
JCDL 2015 Tutorial
Sunday, June 21 – 9am-noon
Knoxville, TN
Presenters: Jaimie Murdock, Jiaan Zeng, Robert McDonald

March 16, 2015
Wisdom of the Few?

Wisdom of the Few? “Supertaggers” in Collaborative Tagging Systems

Jared Lorince, Sam Zorowitz, Jaimie Murdock, Peter M. Todd

A folksonomy is ostensibly an information structure built up by the “wisdom of the crowd”, but is the “crowd” really doing the work? Tagging is in fact a sharply skewed process in which a small minority of “supertagger” users generate an overwhelming majority of the annotations. Using data from three large-scale social tagging platforms, we explore (a) how to best quantify the imbalance in tagging behavior and formally define a supertagger, (b) how supertaggers differ from other users in their tagging patterns, and (c) if effects of motivation and expertise inform our understanding of what makes a supertagger. Our results indicate that such prolific users not only tag more than their counterparts, but in quantifiably different ways. These findings suggest that we should question the extent to which folksonomies achieve crowdsourced classification via the “wisdom of the crowd”, especially for broad folksonomies like Last.fm as opposed to narrow folksonomies like Flickr.
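To give a feel for what "quantifying the imbalance" can mean, here is a minimal Python sketch using two generic inequality measures: the Gini coefficient and the share of annotations contributed by the most active users. These are common textbook measures and are only an assumption here, not necessarily the ones used in the paper; the function names and the per-user counts are made up for illustration.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal, values near 1 mean concentrated."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


def top_share(counts, fraction):
    """Share of all annotations produced by the most active `fraction` of users."""
    xs = sorted(counts, reverse=True)
    k = max(1, round(len(xs) * fraction))
    return sum(xs[:k]) / sum(xs)


# Hypothetical annotation counts for ten users (made-up numbers, not real data).
annotations_per_user = [1, 1, 2, 2, 3, 3, 5, 8, 25, 950]

print("Gini coefficient:", gini(annotations_per_user))
print("Share from top 10% of users:", top_share(annotations_per_user, 0.10))
```

A threshold on either number could then serve as one possible working definition of a "supertagger", though where to draw that line is exactly the kind of question the paper takes up.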

Preprint of article in review available at arXiv:1502.02777 [cs.SI]

February 11, 2015
Topic Explorer at AAAI

Next week, I’ll be headed to Austin, TX for AAAI-15 to present a demo of the Topic Explorer. Accompanying the presentation is a short paper:

Topic models remain a black box both for modelers and for end users in many respects. From the modelers’ perspective, many decisions must be made which lack clear rationales and whose interactions are unclear – for example, how many topics the algorithms should find (K), which words to ignore (aka the “stop list”), and whether it is adequate to run the modeling process once or multiple times, producing different results due to the algorithms that approximate the Bayesian priors. Furthermore, the results of different parameter settings are hard to analyze, summarize, and visualize, making model comparison difficult. From the end users’ perspective, it is hard to understand why the models perform as they do, and information-theoretic similarity measures do not fully align with humanistic interpretation of the topics. We present the Topic Explorer, which advances the state-of-the-art in topic model visualization for document-document and topic-document relations. It brings topic models to life in a way that fosters deep understanding of both corpus and models, allowing users to generate interpretive hypotheses and to suggest further experiments. Such tools are an essential step toward assessing whether topic modeling is a suitable technique for AI and cognitive modeling applications.

Jaimie Murdock and Colin Allen. (2015) Visualization Techniques for Topic Model Checking. [demo track] in Proceedings of the 29th AAAI Conference (AAAI-15). Austin, Texas, USA, January 25-29, 2015.

January 20, 2015
Granddad’s Indonesian Career

The Granddad had two tenures in Indonesia at the Bogor Agricultural School (Institut Pertanian Bogor – IPB) from 1968-1970 and 1980-1985. IPB became independent from the University of Indonesia in 1963, and Granddad’s work was instrumental in its reorganization as the first degree-granting agricultural school in Indonesia. In his first term, he created the 4-year undergraduate curriculum and set general education requirements, helping the university exceed its goal of 20,000 graduates by the year 2000. In his second term, he served as Director of the Graduate Education Project and began issuing doctoral degrees.

IPB was founded on a “tri-darma” of teaching, research, and extension, which matched the educational philosophy of the American land grant universities that trained Granddad. His design for the flagship Darmaga Campus was located on a Dutch rubber plantation and recognized that a university hosted not only research and faculty, but also students and their families. Therefore, it included classroom buildings, research and teaching fields, extension offices, residence halls and chapels. On a tour of the campus on Christmas Eve 2013, he was especially proud that the church and the mosque were located on the same courtyard sharing the same playground, that actual rubber, banana, rice, and corn fields for the students had been preserved, and that the library had been vastly expanded. He visited his IPB colleagues every year from his retirement at UW-Madison through his death in 2014 (pictured laughing in 2006, below).
Granddad talking about his career on our way to the IPB Darmaga Campus (December 2013)
Gathering of The Granddad and IPB colleagues at Aunt Cindy’s house (December 2006)

December 5, 2014
Debut Album: “We are the 123s!”

On November 21st, The 123s will release our debut album “We are the 123s!” Recorded on June 10, 2014 at Russian Recording in Bloomington, IN, the album features 7 tracks and is available for streaming at http://wearethe123s.com/

Additionally, we’ll be having a FREE release show:

“We are the 123s!” Release Show
November 21, 2014 10 pm
Max’s on the Square
106 W 6th St, Bloomington, IN

Finally, we’ve released a full set of music videos from the live recording:

A lot of hard work went into this album, and I’m very excited to share it with everyone! Physical copies on a “vinyl” CD are available as well; message me for more details.

November 21, 2014
Granddad

On August 29, Granddad passed away suddenly at 86 in his home on Terrapin Creek. As the public obituary shows, Granddad was a legendary man: a Professor of Soil Science for 39 years at University of Wisconsin – Madison, he led the green revolution in Indonesia and Brazil (for which he received doctorates in 1985 and 2014, respectively). As President of the Midwest Universities Consortium on International Activities (MUCIA), he helped many other institutions and countries coordinate humanitarian aid. After retirement, Granddad still traveled to Indonesia every year and worked on the Ponderosa through his last day.

At his funeral, all his grandchildren were given the opportunity to speak and my eulogy is below.

After Grandma passed away, Granddad started a new tradition of writing his grandchildren a Christmas letter every year. In them, he told us his life’s story – from childhood on Terrapin Creek to finding the love of his life to moving away for school and then his first job. Throughout everything Granddad’s letters were filled with love and his profound sense of finding home, wherever he was.

In the past two years, Granddad and I recognized that I was following in part of his footsteps by becoming a PhD student. This Christmas, I made plans to see him in Indonesia. Granddad and I arrived in Jakarta within an hour of each other. He had just come back from the mission field in Sulawesi, and was undeniably sick. Granddad’s health was never a complaint, it was just a statement. When his lung stopped working almost a decade ago, he didn’t. When he visited the doctor in Indonesia, the doctor asked to take a picture of him. Granddad asked why and the doctor said “My dad is 84 and giving up on life, you’re 86 and your life is just beginning!”

Granddad’s life always was just beginning – he started every day in gratitude and as his letters showed us, even recollections of the past started with thankfulness for the day he was given and the future he had created for his family and the world.

As he was feeling better, he started whistling again as he was in the house. One morning I asked him to take me to IPB – the Bogor Agricultural School – that he worked at for 7 years. Twenty minutes later, in a moment that was very Granddad, he said “car’s out front, let’s go.” Now, I was expecting him to take a few days to make arrangements, so I hurried off to get shoes. When we got in the car he started telling me all about the work he had done there restructuring the curriculum and I hadn’t realized he literally designed the university – from the library to offices to fields to chapels. They had set a goal of 20,000 graduates by the year 2000, which they met early!

Granddad made an immediate impact on so many lives, but the life and work he created was built to last. We each saw that first-hand as he mentored so many of us. Now, as his letters stop, we are left to find our own path, but the lessons he gave us of love and dedication will live on forever.

September 18, 2014
The InPhO Topic Explorer
This week, I launched The InPhO Topic Explorer. Through an interactive visualization, The InPhO Topic Explorer exposes one way search engine results are generated and allows more focused exploration than just a list of related documents. Using the LDA machine learning algorithm, the explorer infers topics from arbitrary text corpora. The current demo is trained on the Stanford Encyclopedia of Philosophy, but I will be expanding this to other collections in the next few weeks.

The color bands within each article’s row show the topic distribution within that article, and the relative size of each band indicates the weight of that topic in the article. The full width of each row indicates the similarity to the focus article. Each topic’s label and color are arbitrarily assigned, but are consistent across articles in the browser per topic.
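As a rough sketch of how a row could be laid out, the Python below scales an article's topic weights by its similarity to the focus article, so the bands sum to the row's full width. The similarity used here (one minus the Jensen-Shannon distance) is an assumption chosen for illustration, and `js_distance`, `row_bands`, and the topic weights are hypothetical names and numbers rather than the explorer's actual code.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two probability distributions, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def row_bands(focus, doc):
    """Band widths for one row: total width is the similarity, split by topic weight."""
    similarity = 1.0 - js_distance(focus, doc)
    return [similarity * weight for weight in doc]


# Made-up topic distributions over three topics.
focus_article = [0.7, 0.2, 0.1]
other_article = [0.1, 0.2, 0.7]

print(row_bands(focus_article, focus_article))  # the focus article fills the whole row
print(row_bands(focus_article, other_article))  # a less similar article gets a shorter row
```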

Display options include topic normalization, alphabetical sort and topic sort. By normalizing topics, the full width of each bar expands and topic weights per document can be compared. By clicking a topic, the documents will reorder according to that topic’s weight and topic bars will reorder according to the topic weights in the highest weighted document.
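The normalization and topic-sort options can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the explorer's implementation: `normalize_row`, `sort_by_topic`, and the document-topic weights are all hypothetical.

```python
def normalize_row(bands):
    """Stretch a row to full width so topic weights are comparable across documents."""
    total = sum(bands)
    return [b / total for b in bands] if total > 0 else list(bands)


def sort_by_topic(doc_topics, topic):
    """Reorder documents by one topic's weight, and topics by the top document's weights.

    doc_topics maps a document name to its list of topic weights.
    """
    docs = sorted(doc_topics, key=lambda d: doc_topics[d][topic], reverse=True)
    top_weights = doc_topics[docs[0]]
    topic_order = sorted(range(len(top_weights)), key=lambda k: top_weights[k], reverse=True)
    return docs, topic_order


# Made-up topic weights for three articles over three topics.
doc_topics = {
    "kant": [0.70, 0.20, 0.10],
    "hume": [0.30, 0.60, 0.10],
    "animal-consciousness": [0.10, 0.25, 0.65],
}

docs, topic_order = sort_by_topic(doc_topics, topic=1)
print(docs)         # documents ordered by their weight on topic 1
print(topic_order)  # topic bars ordered by the top document's weights
print(normalize_row([0.2, 0.1, 0.1]))
```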

By varying the number of topics, one can get a finer or coarser-grained analysis of the areas discussed in the articles. The visualization currently has 20, 40, 60, 80, 100, and 120 topic models for the Stanford Encyclopedia of Philosophy.

In contrast to a search engine, which displays articles based on a similarity measure, the topic explorer allows you to reorder results based on what you’re interested in. For example, if you’re looking at animal consciousness (80 topics), you can click on topic 46 to see those that are closest in the “animals” category, while 46 shows “consciousness” and 42 shows “perception” (arbitrary labels chosen). Some topics have a lot of words like “theory”, “case”, “would”, and “even”. These general argumentative topics can be indicative of areas where debate is still ongoing.

In early explorations, the visualization already highlights some interesting phenomena:
- For central articles, such as kant (40 topics), one finds that a single topic (topic 30) comprises much of the article. By increasing the number of topics, such as to kant (120 topics), topic 77 now captures the “kant”-ness of the article, but several other components can now be explored. This shows the value of having multiple topic models.
- For creationism (120 topics), one can see that the particular blend of topics generating that article is truly an outlier, with the probability only just over .5 of generating the next closest document; compare this to the distribution of top articles related to animal-consciousness (120 topics) or kant (120 topics). Can you find other outliers in the SEP?
The underlying dataset was generated using the InPhO VSM module’s LDA implementation. See Wikipedia: Latent Dirichlet Allocation for more on the LDA topic modeling approach or “Probabilistic Topic Models” (Blei, 2012) for a recent review.

Source code and issue tracking are available at GitHub.

Please share any notes in the comments below!
August 12, 2014
2013 in Review

As 2013 comes to an end, I’ve found myself in Indonesia again. With Granddad turning 86 and deciding to take an extended 2 month trip, it seemed like an important time to go and events in my own life lined up well — no finals, no school until January 13th, and no particular attachments in Bloomington. I’m spending 10 days with family, then off to Bali for 5 days, the beach for 4 days, 2 more days in Bogor, and then back to America. As in 1990 and 2007, I will leave on January 8th, 2014, which is apparently my Indonesian expiration date.

The opportunity to explore Bogor and just unplug from my normal life has given me time for reflection and pause on what has been an eventful and fantastic year. I summarized much of the first half of the year earlier, but since then I’ve been moving swiftly.

In July, I returned to DC to give a talk at the International Association for Computing and Philosophy, came back to Indy to give a poster at the Joint Conference on Digital Libraries, and then left for a 2 week vacation in the Bay. The vacation was amazing: I saw The Postal Service reunion, went on a road trip down California 1, checked out a music festival in Santa Cruz, then headed to Outside Lands in SF. When I got back, I ran off to Illinois to give a presentation and then moved down the hall to a new 2-bedroom apartment with a loft and 2-story ceilings. September and October were a blur of shows, homework, and settling into my new place.

Perhaps November is the most emblematic of all the ways I’ve grown: I ran my first half-marathon (2:03!), organized my first retreat, hosted Friendsgiving, played with The 123s at The Bishop, gave presentations for all my classes, and hosted Mom’s Thanksgiving. None of these things would’ve been possible at the start of the year.

For the first time in years, I feel caught up on life and comfortable in my own skin. While I still get overwhelmed, I’m starting to recognize that it’s going to work out. 2013 was a rediscovery of my values, and it feels like 2014 was the destination. I can’t wait to see what’s next.

December 25, 2013
One down, N to go

This has been a very intense year, but the end has been worth it. In August, I started graduate school at Indiana University in the Computer Science Program. By October, I started having my first round of grad school anxieties – was a PhD worth it? Was I just doing more of the same by staying at IU? Was I going to grow? Several job offers and much discernment later, I realized that I truly wanted my doctorate, but that I had not positioned myself in the right programs – my interests are intensely interdisciplinary and more cognitive than computational. So, after some negotiations, I transferred from Computer Science to the Complex Systems Group in Informatics, which is a much better fit for my research goals.

After this academic identity crisis, I came down with mono in December. Since I was the AI for the 75-student Data Structures course, I had to take incompletes in my coursework to focus my much-diminished energy on teaching. Despite the setback, mono was a very positive catalyst for me. I finally got to a doctor, which woke me up to the reality of what I had done to my body over the past 6 years: I was 23 and my blood pressure was in the hypertension range. For some reason the nurses weren’t freaked out about this, the doctor just said to check it out in a few months, but I knew something was wrong. So while recovering from mono, I decided to change things. I quit drinking to focus on my incompletes, started hitting the gym 5 times a week, picked up running, and have lost 40 pounds since January. I have collarbones, wristbones, and an Adam’s apple. It’s fucking awesome. Plus, I finished my first year of graduate school with a 3.83 GPA! 😀

Research-wise, I’ve been distilling a new research area and imagining what my committee will look like. Right now, I’m diving into a literature review on what Colin is calling “biographically-plausible corpora”. The general intuition is that while “big data” approaches can create excellent recommendations, humans gain expertise from much smaller datasets. Thus, instead of training semantic models on 50 million books, what happens if you train them only on 50 or 500 books? I’ll be presenting this work at a symposium at IACAP 2013 in July.

I’ve also had two side projects. The first is a return to Polyworld to examine correlations between TSE complexity and social behavior – an ALife approach to the social brain hypothesis. The second is an examination of the information flow between science and the humanities using the PhilPapers index and the UCSD Map of Science. Preliminary results are being presented as a poster at the Joint Conference on Digital Libraries (JCDL) and we’re aiming for a journal article by the end of the summer.

Outside of school, I’ve been really enjoying myself musically. In January, I joined The 123s, playing alto sax on early rock, blues, and soul covers (stuff like Ray Charles, Aretha Franklin, Little Richard, Smokey Robinson, and Chuck Berry). This month Afro-Hoosier got a new trombone player, which has allowed me to switch to bari full-time. I’m playing gigs every other week, and on May 17th I’ll be playing my first gig in another town – a fundraiser out in Lafayette. On May 23rd, I’ll be headlining at the Bishop with The 123s. In the next 3 months I’ll be seeing Of Monsters and Men, Cold War Kids, Portugal. the Man, Todd Snider, The Wailers, The Postal Service, and all the bands at Outside Lands. Life has been good to my ears.

So, all in all, I feel pretty great about where I’ve come this year. It took a bit of soul-searching to realize how much I wanted my PhD, and a lot of work to get my body ready for it, but I’m ready now and extremely satisfied with my position.

May 12, 2013