This post explains how to fit a topic model to a set of documents. It uses Latent Dirichlet Allocation (LDA) and the Java-based software MALLET, which uses Gibbs sampling to fit its topic models. When writing this, I drew on the MALLET documentation, and on tutorials by Allen Riddell at TAToM, and by Shawn Graham, Scott Weingart and Ian Milligan at The Programming Historian.
- Make sure that you have the Java Development Kit installed.
- Download MALLET from its website.
- Unpack MALLET into a directory on your system. The name of the directory will differ depending on the MALLET version. When I created this tutorial it was mallet-2.0.8.
Open Terminal in the MALLET folder (mallet-2.0.8). You need to be in this directory to run MALLET commands.
Enter ./bin/mallet to see a list of the available commands.
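Before going further, it can help to confirm that Java is on your PATH and that you really are in the MALLET folder. Here is a minimal sketch of such a check. The function name check_mallet_setup and its two optional arguments are my own invention for illustration, not part of MALLET.

```shell
# Hypothetical helper: check that Java and the MALLET launcher can be found.
# Usage: check_mallet_setup [mallet_dir] [java_command]
check_mallet_setup() {
  mallet_dir="${1:-.}"    # directory that should contain bin/mallet
  java_cmd="${2:-java}"   # name of the Java executable to look for

  if ! command -v "$java_cmd" >/dev/null 2>&1; then
    echo "$java_cmd not found on PATH; install Java first" >&2
    return 1
  fi
  if [ ! -x "$mallet_dir/bin/mallet" ]; then
    echo "No executable bin/mallet in $mallet_dir; are you in the MALLET folder?" >&2
    return 1
  fi
  echo "Setup looks fine"
}
```

Paste the function into Terminal and then enter check_mallet_setup with no arguments to check the current directory.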
First, put all the .txt documents that you want to topic model in a data directory on your computer. I used a folder named text_data inside the MALLET directory, which is what the commands below assume. Second, put a file named extra_stopwords.txt in the MALLET root directory. Put in this file any custom stop words that you want to remove, separated by whitespace.
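As a small illustration, the stop word file can be created straight from Terminal. The three words below are placeholders that I made up; replace them with words that suit your own corpus. Note that this overwrites any existing extra_stopwords.txt in the current directory.

```shell
# Hypothetical custom stop words, one per line (any whitespace works as a separator).
printf '%s\n' chapter volume said > extra_stopwords.txt

# Count the words in the file as a quick check.
wc -w < extra_stopwords.txt
```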
To import the documents in our data directory, we enter ./bin/mallet import-dir followed by a set of options. We will use these:
- --input text_data — because text_data is the folder with our text files.
- --output dataset.mallet — because we want to save this imported dataset to a file called dataset.mallet.
- --remove-stopwords — because we want to remove standard English stop words.
- --extra-stopwords extra_stopwords.txt — because we want to remove our extra stopwords.
- --keep-sequence — because MALLET's topic modelling needs each document stored as a sequence of words, in their original order, rather than as a table of word counts.
Now, let's get rid of the annoying .DS_Store files that macOS creates in its directories. We don't want those system files included in the analysis, so enter this in Terminal, in the MALLET directory:
rm -v **/.DS_Store
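How the ** pattern behaves depends on your shell. In zsh it searches subdirectories recursively, while in bash (unless the globstar option is turned on) it only looks one level down. If the command above does not do what you expect, a find-based version is a possible alternative. This is a sketch, and the function name remove_ds_store is my own.

```shell
# Hypothetical helper: delete every .DS_Store file under a directory.
# Usage: remove_ds_store [directory]   (defaults to the current directory)
remove_ds_store() {
  # -print lists each file before -delete removes it, much like rm -v.
  find "${1:-.}" -type f -name '.DS_Store' -print -delete
}
```

For example, remove_ds_store text_data would clean only the data directory.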
Then, enter this:
./bin/mallet import-dir --input text_data --output dataset.mallet --remove-stopwords --extra-stopwords extra_stopwords.txt --keep-sequence
MALLET will save the imported corpus as dataset.mallet in the MALLET directory.
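If you want to make sure that MALLET was pointed at the right folder, you can count the .txt files in the data directory and compare that number with what you expect. This is only a rough sanity check, and the function name count_txt_files is my own.

```shell
# Hypothetical helper: count the .txt files under a directory.
# Usage: count_txt_files [directory]   (defaults to text_data)
count_txt_files() {
  # tr strips the padding that some versions of wc add around the number.
  find "${1:-text_data}" -type f -name '*.txt' | wc -l | tr -d '[:space:]'
}
```

Entering count_txt_files on its own counts the files in text_data.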
Fitting the topic model
We now have a MALLET corpus, so let's fit a topic model to it.
For this, we enter ./bin/mallet train-topics followed by some options. We will use these:
- --input dataset.mallet — because dataset.mallet is the corpus that we created above, and want to analyse.
- --num-topics 20 — because we want MALLET to identify 20 topics.
- --num-iterations 200 — because we want MALLET to iterate the analysis 200 times, gradually refining the model.
- --output-doc-topics doc-topics.txt — because we want to save a file named doc-topics.txt containing information about the topic composition of documents.
- --output-topic-keys topic-model.txt — because we want information about the identified topics to be saved to a file named topic-model.txt.
So, enter this in Terminal, in the MALLET directory:
./bin/mallet train-topics --input dataset.mallet --num-topics 20 --num-iterations 200 --output-doc-topics doc-topics.txt --output-topic-keys topic-model.txt
MALLET will save the two output files, doc-topics.txt and topic-model.txt, in the MALLET directory.
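To take a first look at the results, the two output files can be read with awk. The sketch below rests on assumptions about the file layouts that you should check against your own files. I assume that each line of topic-model.txt holds a topic number, a weight and the top words, separated by tabs. I also assume that each line of doc-topics.txt holds a document number, the document name and then one proportion per topic, in topic order, which is the layout I expect from mallet-2.0.8 (older versions wrote this file differently). The function names are my own.

```shell
# Hypothetical helper: print each topic number with its top words.
# Assumed layout per line: topic number <TAB> weight <TAB> top words
show_topics() {
  awk -F'\t' '{ printf "Topic %s: %s\n", $1, $3 }' "${1:-topic-model.txt}"
}

# Hypothetical helper: print each document with its strongest topic.
# Assumed layout per line: doc number <TAB> name <TAB> one proportion per topic
dominant_topics() {
  awk -F'\t' '
    /^#/ { next }                       # skip a header line, if there is one
    NF >= 3 {
      best = 3                          # column 3 holds topic 0
      for (i = 4; i <= NF; i++)
        if ($i + 0 > $best + 0) best = i
      printf "%s\t%s\t%s\n", $2, best - 3, $best
    }
  ' "${1:-doc-topics.txt}"
}
```

Enter show_topics to list the topics, and dominant_topics to get one line per document with its name, its strongest topic and that topic's proportion.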