The MALLET package for topic modelling outputs several files when running an analysis. One of these is the topic composition file, which holds the probabilities of every topic in every document. I wrote the script below to identify which documents are most strongly associated with a particular topic.
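A minimal sketch of the idea in R (it assumes MALLET 2.0.8's doc-topics layout, where each row holds a document index, its source path and one proportion per topic; topic_5 is a placeholder for whichever topic you are after):

```r
# Read the topic composition file. MALLET 2.0.8 writes one row per
# document: index, source path, then one proportion per topic.
# (Older MALLET versions output topic/proportion pairs instead.)
doc_topics <- read.table("doc-topics.txt", sep = "\t",
                         stringsAsFactors = FALSE)
n_topics <- ncol(doc_topics) - 2
colnames(doc_topics) <- c("doc", "source",
                          paste0("topic_", 0:(n_topics - 1)))

topic <- "topic_5"  # placeholder: the topic you are interested in
ranked <- doc_topics[order(doc_topics[[topic]], decreasing = TRUE),
                     c("source", topic)]
head(ranked, 10)  # the ten documents most strongly associated with that topic
```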
- Open Terminal in your desired working directory.
- Start R by typing R.
- Load the required libraries:
```r
library(RCurl)
library(streamR)
library(ROAuth)
library(RJSONIO)
library(stringr)
```
- Set up authentication. Create an app at dev.twitter.com and use your consumer key and consumer secret in the code below:
```r
token <- "https://api.twitter.com/oauth/request_token"
access <- "https://api.twitter.com/oauth/access_token"
authorize <- "https://api.twitter.com/oauth/authorize"
consumerkey <- "YOUR CONSUMER KEY"
consumersecret <- "YOUR CONSUMER SECRET"
oauth <- OAuthFactory$new(consumerKey = consumerkey,
                          consumerSecret = consumersecret,
                          requestURL = token,
                          accessURL = access,
                          authURL = authorize)
```
- Now, do the actual handshake with the API. Running the code below will open a browser where you will be given a PIN to paste back into Terminal.
```r
oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
```
- You can now save your authentication details, so that you can skip the above steps next time.
```r
save(oauth, file = "oauth.Rdata")
```
- With authentication set up, we can start a data collection job from this working directory. First, load streamR and reload the saved credentials:
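```r
library(streamR)
load("oauth.Rdata")  # restores the oauth object saved above
```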
- The code below will initiate the data collection.
```r
filterStream(file.name = "drinks_tweets.json",  # saves tweets into a .json file
             track = c("coffee", "tea"),        # collects tweets that include these keywords
             language = "en",                   # collects tweets in a specific language only
             timeout = 10800,                   # in seconds (3 hours); use 0 for open-ended collection
             oauth = oauth)                     # uses the "oauth" object as your credentials
```
- When done, the JSON file can be parsed into an R data frame:
```r
drinks_tweets.df <- parseTweets("drinks_tweets.json", simplify = FALSE)
```
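As a quick sanity check (a hypothetical example; parseTweets returns a data frame that includes a text column with the tweet texts), you can count how many of the collected tweets mention each keyword:

```r
nrow(drinks_tweets.df)  # total number of parsed tweets
# tweets mentioning each of the tracked keywords
sum(grepl("coffee", drinks_tweets.df$text, ignore.case = TRUE))
sum(grepl("tea", drinks_tweets.df$text, ignore.case = TRUE))
```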
This post is about how to fit a topic model to a set of documents. It uses the method of Latent Dirichlet Allocation (LDA) and leverages the Java-based software MALLET, which creates its topic models with Gibbs sampling. When writing this, I drew on the MALLET documentation, and on tutorials by Allen Riddell at TAToM, and by Shawn Graham, Scott Weingart and Ian Milligan at The Programming Historian.
- Make sure that you have the Java Development Kit installed.
- Download MALLET from its website.
- Unpack MALLET into a directory on your system. The name of the directory will differ depending on the MALLET version. When I created this tutorial it was mallet-2.0.8.
- Open Terminal in the MALLET folder (mallet-2.0.8). You need to be in this directory to be able to run MALLET commands.
- Enter ./bin/mallet to see a list of the available commands.
First, put all the .txt documents that you want to topic model in a data directory on your computer. I used text_data. Second, put a file named extra_stopwords.txt in the MALLET root directory. In this file, put any custom stop words that you want removed, separated by whitespace.
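For illustration, extra_stopwords.txt could be as simple as this (the words are placeholders; use whatever noise terms occur in your own corpus):

```
rt
via
amp
```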
To import the documents in our data directory, we enter ./bin/mallet import-dir followed by a set of options. We will use these:
- --input text_data — because text_data is the folder with our text files.
- --output dataset.mallet — because we want to save this imported dataset to a file called dataset.mallet.
- --remove-stopwords — because we want to remove standard English stop words.
- --extra-stopwords extra_stopwords.txt — because we want to remove our extra stopwords.
- --keep-sequence — because topic training needs the documents stored as sequences of words, with word order preserved, rather than as bags of word counts.
Now, let's get rid of the annoying .DS_Store files that macOS creates in its directories. We don't want those system files included in the analysis, so enter this in Terminal, in the MALLET directory (the ** pattern works in zsh, the macOS default shell; in bash you first need to enable globstar):
```
rm -v **/.DS_Store
```
Then, enter this:
```
./bin/mallet import-dir --input text_data --output dataset.mallet --remove-stopwords --extra-stopwords extra_stopwords.txt --keep-sequence
```
MALLET will save the output file.
Fitting the topic model
We now have a MALLET corpus. Let's fit a topic model to it.
For this, we enter ./bin/mallet train-topics followed by some options. We will use these:
- --input dataset.mallet — because dataset.mallet is the corpus that we created above, and want to analyse.
- --num-topics 20 — because we want MALLET to identify 20 topics.
- --num-iterations 200 — because we want MALLET to iterate the analysis 200 times, gradually refining the model.
- --output-doc-topics doc-topics.txt — because we want to save a file named doc-topics.txt containing information about the topic composition of documents.
- --output-topic-keys topic-model.txt — because we want information about the identified topics to be saved to a file named topic-model.txt.
So, enter this in Terminal, in the MALLET directory:
```
./bin/mallet train-topics --input dataset.mallet --num-topics 20 --num-iterations 200 --output-doc-topics doc-topics.txt --output-topic-keys topic-model.txt
```
MALLET will save the output files.
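For orientation: each line in topic-model.txt holds a topic number, the topic's Dirichlet weight and its most probable words, while doc-topics.txt holds one row per document with its topic proportions. A hypothetical topic-model.txt excerpt (invented words, for illustration only):

```
0   0.25   coffee cup morning espresso caffeine brew
1   0.25   tea green herbal leaves kettle brew
```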
Deen Freelon maintains a great Python script, called fb_scrape_public, for collecting research data in a structured format from Facebook. Freelon, besides working with coding and computational methods to extract and analyse digital datasets, has done lots of interesting work, such as a 2016 report on #blacklivesmatter and online struggles for offline justice. The fb_scrape_public script is updated regularly to respond to changes in the Facebook API. For the time being [April 2017], this is how the script can be run:
- Download it from GitHub.
- Despite what some earlier instructions said, don't edit anything inside the script itself. Just put it in a directory.
- Create your own Facebook app at https://developers.facebook.com/apps. It doesn't matter what you call it; you just need the unique client ID (app ID) and app secret of your new app.
- From within the fb_scrape_public directory, open a Python 3 shell in Terminal. The code below will collect data into a CSV file, provided that you replace AppID and AppSecret with the keys for your own app, and FacebookID with the ID of the page/group/profile you want to scrape. If needed, get the Facebook IDs from here.
```python
>>> from fb_scrape_public import scrape_fb
>>> fsp = scrape_fb('AppID', 'AppSecret', 'FacebookID')
```
If you use this tool in publications, make sure to cite it:
The Oxford Dictionaries Word of the Year 2016 is post-truth – an adjective defined as ‘relating to or denoting circumstances in which objective facts are less influential in shaping public opinion than appeals to emotion and personal belief’.