Category: compsci

Towards Cultural-Scale Models of Full Text

For the past year, Colin and I have been on a HathiTrust Advanced Collaborative Support (ACS) Grant. This project has examined how topic models differ between library subject areas. For example, some areas may have a “canon” meaning that a low number of topics selects the same themes, no matter what the corpus size is. In contrast, still emerging fields may not agree on the overall thematic structure. We also looked at how sample size affects these models. We’ve uploaded the initial technical report to the arXiv:

Towards Cultural Scale Models of Full Text
Jaimie Murdock, Jiaan Zeng, Colin Allen
In this preliminary study, we examine whether random samples from within given Library of Congress Classification Outline areas yield significantly different topic models. We find that models of subsamples can equal the topic similarity of models over the whole corpus. As the sample size increases, topic distance decreases and topic overlap increases. The requisite subsample size differs by field and by number of topics. While this study focuses on only five areas, we find significant differences in the behavior of these areas that can only be investigated with large corpora like the Hathi Trust.
http://arxiv.org/abs/1512.05004

January 26, 2016
The InPhO Topic Explorer
This week, I launched The InPhO Topic Explorer. Through an interactive visualization, The InPhO Topic Explorer exposes one way search engine results are generated and allows more focused exploration than just a list of related documents. It uses the LDA machine learning algorithm, the explorer infers topics from arbitrary text corpora. The current demo is trained on the Stanford Encyclopedia of Philosophy, but I will be expanding this to other collections in the next few weeks.

The color bands within each article’s row show the topic distribution within that article, and the relative sizes of each band indicates the weight of that topic in the article. The full width of each row indicates the similarity to the focus article. Each topic’s label and color is arbitrarily assigned, but is consistent across articles in the browser per topic.

Display options include topic normalization, alphabetical sort and topic sort. By normalizing topics, the full width of each bar expands and topic weights per document can be compared. By clicking a topic, the documents will reorder acoording to that topic’s weight and topic bars will reorder according to the topic weights in the highest weighted document.

By varying the number of topics, one can get a finer or coarser-grained analysis of the areas discussed in the articles. The visualization currently has 20, 40, 60, 80, 100, and 120 topic models for the Stanford Encyclopedia of Philosophy.

In contrast to a search engine, which displays articles based on a similarity measure, the topic explorer allows you to reorder results based on what you’re interested in. For example, if you’re looking at animal consciousness (80 topics), you can click on topic 46 to see those that are closest in the “animals” category, while 46 shows “consciousness” and 42 shows “perception” (arbitrary labels chosen). Some topics have a lot of words like “theory”, “case”, “would”, and “even”. These general argumentative topics can be indicative of areas where debate is still ongoing.

In early explorations, the visualization already highlights some interesting phenomena:
- For central articles, such as kant (40 topics), one finds that a single topic (topic 30) comprises much of the article. By increasing the number of topics, such as to kant (120 topics), topic 77 now captures the “kant”-ness of the article, but several other components can now be explored. This shows the value of having multiple topic models.
- ForÂ creationism (120 topics), one can see that the particular blend of topics generating thatÂ article is truly an outlier, with the probability only just over .5 of generating the next closest document; compare this to the distribution of top articles related toÂ animal-consciousness (120 topics)Â orÂ kant (120 topics). Â Can you find other outliers in the SEP?
The underlying dataset was generated using theÂ InPhO VSM module’sÂ LDA implementation. SeeÂ Wikipedia: Latent Dirichlet AllocationÂ for more on the LDA topic modeling approach orÂ “Probabilistic Topic Models” (Blei, 2012)Â for a recent review.

Source codeÂ andÂ issue trackingÂ are available atÂ GitHub.

Please share any notes in the comments below!
August 12, 2014
Containing the Semantic Explosion

Yesterday afternoon, I delivered a talk to the PhiloWeb Workshop at the WWW2012 Conference titled “Containing the Semantic Explosion” with Cameron Buckner and Colin Allen. It is an overview of the InPhO Project architecture, known as dynamic ontology, and a preview of some forthcoming data mining tools. [slides]

The explosion of semantic data on the information web, and within digital philosophy, requires new techniques for organizing and linking these knowledge repositories. These must address concerns about consistency, completeness, maintenance, usability, and pragmatics, while reducing the cost of double experts trained both in ontology design and the target domain. Folksonomy approaches address concerns about usability and personnel at the expense of consistency, completeness, and maintenance. Upper-level formal ontologies address concerns about consistency and completeness, but require double experts for the initial construction and maintenance of the representation. At the Indiana Philosophy Ontology (InPhO) Project, we have developed a general methodology called dynamic ontology, which alleviates the need for double experts, while addressing concerns about consistency, completeness and change through machine learning over a domain corpus, and concerns about usability and pragmatics through human input and semantic web standards. This representation can then be used by other projects in digital philosophy, such as the Stanford Encyclopedia of Philosophy (SEP) and PhilPapers, along with resources outside of digital philosophy enabled by the LinkedHumanities project. [slides]

April 18, 2012
Talks

Last week I wrote and then gave two lectures on “Categorization” and “Practical Parallelism”. It was a ton of fun to prepare them, and actually giving them made me realize how much I miss teaching. Abstracts and slides follow.

Categorization

Student Organization for Cognitive Science (SOCS)
November 15, 2011 @ 5:30pm

Abstract: Categorization is a fundamental problem in cognitive science that goes by a multitude of names: In artificial intelligence, categorization is known as clustering; in mathematics, the problem is partitioning. There are many applications in linguistics, vision, and memory research. In this talk, I will provide a brief overview of exemplar vs. prototype models in the cognitive sciences (Goldstone & Kersten 2003), followed by an introduction to three different general-purpose clustering algorithms: k-means (MacQueen 1967), qt-clust (Heyer et al 1999), and information-theoretic clustering (Gokcayso & Principe 2002). Open-source Python implementations of each algorithm will be provided.

Slides

Practical Parallelism

CS Club Tech Talk
November 17, 2011 @ 7pm

Abstract: In this talk, I will give a brief overview of several key parallelism concepts and practical tools for several languages. After this talk, attendees should have the resources to recognize and solve “painfully parallel problems”. Topics will include: threads vs. processes, Amdahl’s Law, shared vs. distributed memory, synchronization, locks, pipes, queues, process pools, futures, OpenMP, MapReduce, Hadoop, and GPU programming.

Slides

November 20, 2011
Speciation and Information Theory

For the past two semesters, I’ve been doing some exploratory work marrying speciation with information theory in the framework of the Polyworld artificial life simulator. The simulation gives us a nice framework for mathematically “pure” evolutionary theory and exploration of neural complexity. We’ve applied clustering algorithms to the genetic information, revealing evidence of both sympatric and allopatric speciation events. The key algorithmic intuition is that genes which are highly selected for will conserve, while those which are not will descend to a random distribution (and thus high entropy), so each dimension (gene) can be weighted by its information certainty to alleviate the curse of dimensionality.

The work was accepted as a poster and extended abstract for the Genetic and Evolutionary Computing Conference (GECCO), and was accepted as a full paper for the European Conference on Artificial Life (ECAL). The full paper is substantially revised from the initial GECCO submission, and provides an introduction to several problems of biological, computational, and information theoretic importance. The visualizations, including several videos showing the cluster data, were especially fun to create, and I’m proud of the finished product.

There are still several more research directions from this work: the allopatric and sympatric effects have not been differentiated, only one environment was analyzed (consistent with past work on evolution of complexity), the clustering algorithm’s thresholds were not explored for hierarchical effects, alternate clustering algorithms were not explored (future open-source project for me: clusterlib), … Still, the present work is encapsuled, the source is in the Polyworld trunk, and it was accepted for publication.

Abstract, citation, and paper follow.

Complex artiï¬cial life simulations can yield substantially distinct populations of agents corresponding to different adaptations to a common environment or specialized adaptations to different environments. Here we show how a standard clustering algorithm applied to the artiï¬cial genomes of such agents can be used to discover and characterize these subpopulations. As gene changes propagate throughout the population, new subpopulations are produced, which show up as new clusters. Cluster centroids allow us to characterize these different subpopulations and identify their distinct adaptation mechanisms. We suggest these subpopulations may reasonably be thought of as species, even if the simulation software allows interbreeding between members of the different subpopulations, and provide evidence of both sympatric and allopatric speciation in the Polyworld artiï¬cial life system. Analyzing intra- and inter-cluster fecundity differences and offspring production rates suggests that speciation is being promoted by a combination of post-zygotic selection (lower ï¬tness of hybrid offspring) and pre-zygotic selection (assortative mating), which may be fostered by reinforcement (the Wallace effect).

Jaimie Murdock and Larry Yaeger. Identifying Species by Genetic Clustering. In Proceedings of the 2011 European Conference on Artificial Life. Paris, France, 2011. [paper]

June 13, 2011
Two New Publications

This past week brought two publication deadlines, a conference submission deadline, and preparation for a software demo at Harvard. Needless to say, I am exhausted, but it was well worth the effort.

The first publication is a 2-page summary of work I’ve been doing with Prof. Larry Yaeger looking at speciation mechanisms in artificial life simulations. This was a condesnation of a paper submission for the Genetic and Evolutionary Computing Conference, and I’m really pleased with how much we were able to squeeze in. Abstract, citation, and link follow:

Artificial life simulations can yield distinct populations of agents representing different adaptations to a common environment or specialized adaptations to different environments. Here we apply a standard clustering algorithm to the genomes of such agents to discover and characterize these subpopulations. As evolution proceeds new subpopulations are produced, which show up as new clusters. Cluster centroids allow us to characterize these different subpopulations and identify their distinct adaptation mechanisms. We suggest these subpopulations may reasonably be thought of as species, even if the simulation software allows interbreeding between members of the different subpopulations. Our results indicate both sympatric and allopatric speciation are present in the Polyworld artificial life system. Our analysis suggests that intra- and inter-cluster fecundity differences may be sufficient to foster sympatric speciation in artificial and biological ecosystems.

Jaimie Murdock and Larry Yaeger. Genetic Clustering for Species Identification. In Proceedings of the Genetic and Ecolutionary Computation Conference (GECCO) 2011. Dublin, Ireland, 2011. [paper]

The second publication is an expansion of the work on ontology evaluation presented last year at the 2010 International Conference on Knowledge Engineering and Ontology Development (KEOD) in Valencia, Spain. We’ve completely rewritten the section on our volatility score, and tightened up the language throughout. The 20-page behemoth will be published as a chapter in an upcoming volume of Springer-Verlag’s Communications in Computer and Information Science (CCIS) series. Abstract, citation, and link follow:

Ontology evaluation poses a number of difficult challenges requiring different evaluation methodologies, particularly for a "dynamic ontology" generated by a combination of automatic and semi-automatic methods. We review evaluation methods that focus solely on syntactic (formal) correctness, on the preservation of semantic structure, or on pragmatic utility. We propose two novel methods for dynamic ontology evaluation and describe the use of these methods for evaluating the different taxonomic representations that are generated at different times or with different amounts of expert feedback. These methods are then applied to the Indiana Philosophy Ontology (InPhO), and used to guide the ontology enrichment process.

Jaimie Murdock, Cameron Buckner and Colin Allen. Evaluating Dynamic Ontologies. Communications in Computer and Information Science (Lecture Notes). Spencer-Verlag. 2011. [chapter]

April 22, 2011
Published!

In June, my paper "Two Methods for Evaluating Dynamic Ontologies" was accepted to the 2nd Knowledge Engineering and Ontology Development (KEOD) Conference in Valencia, Spain on October 25-28. The paper was co-authored with Cameron Buckner, a graduate student in Philosophy, and Colin Allen, a Professor in Cognitive Science and History & Philosophy of Science, and details some of our work with the Indiana Philosophy Ontology (InPhO) Project.

This paper is the culmination of two summers of research on knowledge representation. If you’re interested in the InPhO project, section 3 of the paper is a reasonably accessible summary. The paper as a whole deals with a subproblem in ontologies – how do you quantify the quality of a candidate knowledge representation? We hypothesize that the structure of a domain corpus should be reflected in the structure of a taxonomy of that domain, and that a better taxonomy will better match the corpus statistics.

I’ll be headed to Valencia October 22-31, and the Hutton Honors College has generously approved a travel grant to cover expenses for the week. I’ve set up my flights to and from Madrid, and I’ll have 2 days before and 3 days after the conference to wander around Spain — I’ve never been to Europe before, so I’m extremely excited!

The abstract is below:

Ontology evaluation poses a number of difficult challenges requiring different evaluation methodologies, particularly for a "dynamic ontology" representing a complex set of concepts and generated by a combination of automatic and semi-automatic methods. We review evaluation methods that focus solely on syntactic (formal) correctness, on the preservation of semantic structure, or on pragmatic utility. We propose two novel methods for dynamic ontology evaluation and describe the use of these methods for evaluating the different taxonomic representations that are generated at different times or with different amounts of expert feedback. The proposed "volatility" and "violation" scores represent an attempt to merge syntactic and semantic considerations. Volatility calculates the stability of the methods for ontology generation and extension. Violation measures the degree of "ontological fit" to a text corpus representative of the domain. Combined, they support estimation of convergence towards a stable representation of the domain. No method of evaluation can avoid making substantive normative assumptions about what constitutes "correct" representation, but rendering those assumptions explicit can help with the decision about which methods are appropriate for selecting amongst a set of available ontologies or for tuning the design of methods used to generate a hierarchically organized representation of a domain.

September 7, 2010
More Curriculum Musings

I’ve been making a bunch of comments on Computer Science education lately. The New York Times has an excellent article about “Making Computer Science More Enticing” which focuses on Stanford’s new curriculum. The Stanford curriculum is very similar to IU’s new specialization-based curriculum and seems to be an excellent approach to “teaching the discipline”.

Also, I found the “definitive” document on CS education – The ACM/IEEE Computing Curriculum 2008 Update [PDF].

Why so much focus on education? Computer Science is a (relatively) new discipline with a multitude of high-impact applications, giving us an imperative to train students quickly. Unfortunately, the speed at which our field is moving can cause us to lose sight of the philosophy behind the science.

If someone wants to learn Biology, you would point them to Campbell & Reece. If someone wants to learn computation, where do you point them? A list of books. There are books focused on introducing algorithms and functional programming (SICP); there are tomes focused on general computation (Knuth); there are books focused on application (the entire O’Reilly library); there are definitive texts on specific languages (The C Programming Language, The Scheme Programming Language); there does not seem to be a widely-accepted, integrative introduction that emphasizes computation — algorithms and models. From what I’m observing in CS curricula across the country, the coursework is moving in this direction, but we still need this cohesive “Introduction to Computing” book.

As a final message, this video linked in the NYT article captures the beauty, richness and excitement of our discipline right now — “It’s sort of like you’re geometers and you’re living in the time of Euclid”:

February 21, 2010
Computer Studies
The latest issue of Communications of the ACM, the premier computer science journal, contains an interesting article by IU Professor Dennis Groth — Why an Informatics Degree? The article has much to say about the necessity of application and applied computing as a measure of computer science success.

However, there are some questions left unanswered. First, I address two questions in philosophy of science: “What is Computer Science?” and “Why Informatics?” I then address the pedagogical implications of these questions in a section on “Computer Studies”.

What is Computer Science?

Any new discipline needs to consider its philosophy in order to establish a methodology and range of study. Prof. Groth’s definitions of Computer Science and Informatics do not quite capture these considerations:

Computer science is focused on the design of hardware and software technology that provides computation. Informatics, in general, studies the intersection of people, information, and technology systems.

In explicitly linking the science to its implementation, this definition of Computer Science fumbles away its essence. Yes, the technology is important and provides a crucial instrument on which to study computation, but at its core computer science studies computation — information processing. Computer science empirically examines this question by studying algorithms (or procedures) in the context of a well-defined model (or system).

This conflation of implementation and quantum is extremely pervasive. For example, Biology is “the study of life”, but in a (typical) biology class one never addresses the basic question: “What is life?” The phenomena of life can be studied independently of the specific carbon-based implementation we have encountered. This doesn’t deny the practical utility of modern biology, but it does raise the question of how useful our study of the applied life is to our understanding of life itself. (If you’re interested in this line of questioning, I highly recommend Larry Yaeger’s course INFO-I486 Artificial Life.)

Similarly, Computer Science can study procedures independently of the hardware and software implementations. Consider the sorting problem. (If you are unfamiliar with sorting, see the Appendix: Sorting Example.) One would not start by looking at processor architecture or software deisgn, but would instead focus on the algorithm. Pure Computer Science has nothing to do with hardware or software, they are just an extremely practical medium on which we experiment.

Why Informatics?

Informatics seems to be ahead of itself here in asking “Why an Informatics degree?” before asking the more fundamental “Why Informatics?” There are two primary definitions implied in the article. The more popular answer is that “Informatics solves interdisciplinary problems through computation”. The second, emerging answer is that “Informatics studies the interaction of people and technology”.

The first definition defines a methodology but does not define a subject. It should be obvious that we live in a collaborative, interdisciplinary world. Fields should inform one another but there is still a distinction between fields: Biology studies life; Computer Science studies computation; Cognitive Science studies cognition; Chemistry studies chemicals; etc. One can approach any problem with any number of techniques – computing is one part of this problem-solving toolkit, along with algebra, calculus, logic and rhetoric. However, each of the particular sciences should answer some natural question – whether that be a better explanation of life, computation, mathematics or cognition. Positing a discipline as the use of one field to address problems in another field is not a new field. It’s applied [field] or [field] engineering.

The other definition, that informatics studies the interaction of people and technology, hints at a new discipline studying a quantum of “interaction”. This area has tons of exciting research, especially in human-computer interaction (HCI) and network science. Further emphasizing this would go a long ways toward creating a new discipline and set a clear distinction between the informaticist and the computer scientist. Computer scientists study computation; informaticists study interaction; both should be encouraged. As it stands, both study “computers” and both step on each other’s toes.

Computer Studies

This discussion of philosophies has important implications for how we structure computer-related education (formalized as Computer Studies). Despite major differences in our approaches, it does seem clear that Computer Science and Informatics should work together, especially in applications.

However, as currently implemented at IU, the Informatics curriculum is a liberal arts degree in technology. Formal education should teach either a vocation, a discipline or (ideally) both. Informatics seems to answer to neither claim by emphasizing how informaticists “solve problems with computers” without diving into programming or modeling. If it aims to teach such a vocation, then more application is necessary to give expertise; if it aims to teach a discipline, it is fine to do that through application, but we must recognize that application is only useful insofar as it benefits theory (and vice versa). Additionally, if the field does indeed have a quantum of interaction, then interaction should be the forefront of the curriculum.

IU’s Computer Science ex-department is a valiant effort to teach a discipline – in the span of 4 years we cover at least 3 distinct programming paradigms (functional, object-oriented and logic) spread over 4 distinct languages, bristling with an exploration of algorithms. That being said, I would be surprised if more than 25% of the graduating class could explain a Turing Machine.

Not everyone is into theory – most people really just want to “solve problems with computers” and have a good job. Where do these programmers go? Informatics does not address this challenge, and shouldn’t attempt to. The answer is software engineering – just as applied physics finds a home in classical engineering. By establishing a third program for those clearly interested in application, IU would have a very solid “computer studies” program (as distinguished from computation or technology). [A friend has pointed out that IU cannot legally offer an engineering degree, so we’d have to get creative on the name or tell people to go to Purdue. This works as a general model of Computer Studies pedagogy.]

As another example of how to split “computer studies”, Georgia Tech recently moved to a three-prong approach with the School of Computer Science (CS), School of Interactive Computing (IC), and Computational Science and Engineering Division (CSE). My view of Informatics roughly correlates to that of IC; the Computer Science programs are equivalent but include software engineering. The CSE division is a novel concept, presently captured by IU’s School of Informatics, and it seems this is another working group, but I feel it is best captured by adjunct faculty and interdisciplinary programs, rather than a whole new field.
Appendix: Sorting Example

Let’s say we have a list of numbers and want to sort them from smallest to largest. One naive way is to compare each term to the next one, and swap them if they are in the wrong order and restart until you can make it to the end without swapping:

1: *4 3* 2 1 -> 3 *4 2* 1 -> 3 2 *4 1* -> 3 2 1 4 2: *3 2* 1 4 -> 2 *3 1* 4 -> 2 1 *3 4* -> 2 1 3 4 3: *2 1* 3 4 -> 1 *2 3* 4 -> 1 2 *3 4* -> 1 2 3 4 4: *1 2* 3 4 -> 1 *2 3* 4 -> 1 2 *3 4* -> 1 2 3 4

This is called bubble sort, and solves the problem of sorting. However, consider what you’d have to do to sort a bigger list: each time you make a swap you have to rescan the whole list! A smarter way to sort this list would be to divide the list into two smaller lists, sort the smaller lists, and then merge them together:

1a: *4 3* -> 3 4 1b: *2 1* -> 1 2 Now merge: 2a: *3* 4 -> *3* 4 -> 1 2 3 4 2b: *1* 2 -> 1 *2* -^

This only takes 4 comparisons, compared to 12! We just did a classic problem in Computer Science without even once mentioning computer hardware or writing a single line of code!
View document source.
February 17, 2010
Computer Science

Indiana University’s Department of Computer Science has been completely absorbed by the School of Informatics (SoI). I’m not entirely comfortable with this decision, as what we do in Computer Science (theory and algorithms) is very different from Informatics (applications to other areas). Also, Computer Science students are a very different breed from Informatics students – there’s a number of differences in the curriculum.

Anyways, the new SoI bulletin has completely revised the BS CS degree program. It is now much more streamlined. Core courses have been reduced from 6 to 4 and upper-level requirements have been reduced from 7 courses divided amongst various first letter and second number distinctions to 5 courses in a simplified concentration. My concentrations will be Artificial Intelligence and Programming Languages.

These changes have dramatically altered the next three years. I was 5 CS courses away from graduation. Under the new requirements I have only 3 more, which can be from a broad list of related courses. Instead of taking every undergrad CS course, I’m now going to be able to take Artificial Life, Bioinspired Computing, The Computer and Natural Language (NLP), and Search Informatics: Google Under the Hood (MapReduce). These have all been on my radar, but since they were in the Informatics program, they did not meet any CS requirements. Now I’m able to shave a semester off my graduation and take some (hopefully) more interesting courses.

There are some issues with the changes – there is a lot less emphasis on theory, which is the hallmark of the IU CS program. Since I’m staying on an extra year for the Professional Masters program I’m not concerned about my education, but it is alarming that people can get away with only 6 CS courses when the old program required 13. (I’ll graduate with 10.)

At any rate, I’ll be out by Fall 2011 instead of sometime in 2012, and that’s awesome.

September 4, 2009