jsLDA: Topic modeling PoC for Digital Humanities

SciencesPo / Columbia Program

The goal of this project is to demonstrate the potential of machine learning and AI for topic modeling in the Digital Humanities.

Run a model

Instructions:

When you open the page, it loads a file containing documents and a file containing stopwords. The default is a corpus of paragraphs from US State of the Union speeches. It is large enough to give interesting results but small enough to train quickly.

All words have initially been assigned randomly to topics. Click the "Run 50 iterations" button to start training. The iteration count will increase each time the algorithm passes through the dataset.
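The random start and the per-iteration pass described above can be sketched as follows. This is a minimal illustration that assumes training uses collapsed Gibbs sampling, a common choice for LDA; it is not taken from the page's own source. The names initModel and sweep, the smoothing values alpha and beta, and the representation of documents as arrays of word ids are all hypothetical.

```javascript
// Hypothetical sketch, assuming collapsed Gibbs sampling for LDA.
// docs: array of documents, each an array of word ids in [0, vocabSize).
function initModel(docs, numTopics, vocabSize, random = Math.random) {
  const model = {
    numTopics,
    vocabSize,
    alpha: 0.1,  // assumed document-topic smoothing
    beta: 0.01,  // assumed topic-word smoothing
    iterations: 0,
    assignments: [],
    docTopicCounts: docs.map(() => new Array(numTopics).fill(0)),
    topicWordCounts: Array.from({ length: numTopics }, () =>
      new Array(vocabSize).fill(0)),
    topicTotals: new Array(numTopics).fill(0),
  };
  // Assign every word token to a uniformly random topic.
  docs.forEach((doc, d) => {
    model.assignments.push(doc.map((w) => {
      const t = Math.floor(random() * numTopics);
      model.docTopicCounts[d][t]++;
      model.topicWordCounts[t][w]++;
      model.topicTotals[t]++;
      return t;
    }));
  });
  return model;
}

// One iteration: a single pass over every token in the dataset.
function sweep(model, docs, random = Math.random) {
  const { numTopics: K, vocabSize: V, alpha, beta } = model;
  const weights = new Array(K);
  docs.forEach((doc, d) => {
    doc.forEach((w, i) => {
      // Remove the token's current assignment from the counts.
      const old = model.assignments[d][i];
      model.docTopicCounts[d][old]--;
      model.topicWordCounts[old][w]--;
      model.topicTotals[old]--;
      // Weight each topic by its use in this document and its affinity for this word.
      let total = 0;
      for (let t = 0; t < K; t++) {
        weights[t] = (model.docTopicCounts[d][t] + alpha) *
          (model.topicWordCounts[t][w] + beta) /
          (model.topicTotals[t] + V * beta);
        total += weights[t];
      }
      // Draw a new topic in proportion to the weights.
      let u = random() * total;
      let t = 0;
      while (t < K - 1 && u >= weights[t]) {
        u -= weights[t];
        t++;
      }
      model.assignments[d][i] = t;
      model.docTopicCounts[d][t]++;
      model.topicWordCounts[t][w]++;
      model.topicTotals[t]++;
    });
  });
  model.iterations++;
}
```

Under these assumptions, a button that runs 50 iterations would amount to calling sweep(model, docs) 50 times on the same model.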

The topics on the right side of the page should now look more interesting. Run more iterations if you would like -- there's probably still a lot of room for improvement after only 50 iterations.

Once you're satisfied with the model, you can click on a topic from the list on the right to sort documents in descending order by their use of that topic. Proportions are weighted so that longer documents will come first. You can also explore correlations between topics by clicking the "Topic Correlations" tab. Pairs of topics that are correlated appear as blue circles; pairs that are anti-correlated appear as red circles.
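One way to get a ranking in which longer documents come first is to add a small smoothing constant to each topic count before dividing by document length. The sketch below illustrates that idea only; the function name rankDocumentsByTopic, the smoothing value, and the exact formula are assumptions, not the page's documented weighting.

```javascript
// Hypothetical helper: rank documents by their smoothed share of one topic.
// docTopicCounts[d][t] is the number of tokens in document d assigned to topic t.
function rankDocumentsByTopic(docTopicCounts, topic, smoothing = 1.0) {
  if (docTopicCounts.length === 0) return [];
  const numTopics = docTopicCounts[0].length;
  return docTopicCounts
    .map((counts, docIndex) => {
      const length = counts.reduce((a, b) => a + b, 0);
      // Smoothing pulls short documents toward a uniform share, so a long
      // document devoted to the topic outranks a short one.
      const score = (counts[topic] + smoothing) /
        (length + numTopics * smoothing);
      return { docIndex, score };
    })
    .sort((a, b) => b.score - a.score);
}
```

With two topics and a smoothing of 1, a 20-token document entirely on topic 0 scores 21/22, while a 2-token document entirely on topic 0 scores 3/4, so the longer one sorts first.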

Using your own documents:

If you would like to explore your own collection, you can upload documents and stopword list files directly to the browser. No data is sent over the internet. Remember that "document" really means "segment of text". A few hundred words is a good length; longer passages tend to shift their topical focus, making inference more difficult. The format for the documents file is one document per line, with each line consisting of

[doc ID] [tab] [label] [tab] [text...]

(This is the default format for Mallet.) The values in the "label" field are treated as a sequence of categories, which are shown in the "timeseries" tab in the order in which they appear in the documents file.
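If you are preparing a documents file with a script, it can help to check that it parses as expected. The following is a minimal sketch of a reader for the tab-separated layout above. The function name parseDocuments and the choice to skip lines with fewer than three fields are assumptions made for illustration; the page itself may handle malformed lines differently.

```javascript
// Hypothetical parser for "[doc ID] [tab] [label] [tab] [text...]" lines.
function parseDocuments(fileText) {
  const docs = [];
  for (const line of fileText.split(/\r?\n/)) {
    if (line.trim() === "") continue;  // ignore blank lines
    const fields = line.split("\t");
    if (fields.length < 3) continue;   // assumed policy: skip malformed lines
    const [id, label, ...rest] = fields;
    // Rejoin the remainder so tabs inside the text are preserved.
    docs.push({ id, label, text: rest.join("\t") });
  }
  return docs;
}

// Example input: the IDs and labels here are made up.
const sample = "doc1\t1790\tFellow citizens of the Senate\n" +
               "doc2\t1791\tIn meeting you again";
const parsedSample = parseDocuments(sample);
```

Documents stay in file order, which matters because the label sequence is displayed in the order the documents appear.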

The format for stopwords is one word per line. The "Vocabulary" tab allows you to dynamically add and remove stopwords, and shows which words appear in many topics and which are more specific. Unicode is supported, so most languages that separate words with whitespace (i.e., not CJK) should work.
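To see why whitespace-delimited languages work and CJK does not, consider a simple Unicode-aware tokenizer that treats each run of letters as a word. This is a sketch under assumptions: the names parseStopwords and tokenize, the lowercasing step, and the letter-run pattern are illustrative and may not match the page's actual tokenization.

```javascript
// Hypothetical reader for a one-word-per-line stopword file.
function parseStopwords(fileText) {
  return new Set(
    fileText
      .split(/\r?\n/)
      .map((w) => w.trim().toLowerCase())
      .filter((w) => w !== "")
  );
}

// Hypothetical tokenizer: lowercase, keep runs of Unicode letters
// (plus combining marks), and drop stopwords.
function tokenize(text, stopwords) {
  const words = text.toLowerCase().match(/[\p{L}\p{M}]+/gu) || [];
  return words.filter((w) => !stopwords.has(w));
}
```

Because the pattern splits only where non-letter characters occur, a sentence written without spaces between words would come out as a single long token.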

To save data from a trained model, go to the "Downloads" tab. The links on that tab generate files in your browser; again, no data is sent over the internet.