Twitter provides access to its data through a set of different APIs. Two of these are the Search API and the Streaming API. The public version of the Search API reaches back 7 days in time. According to Twitter it "behaves similarly to, but not exactly like the Search feature available in Twitter mobile or web clients". It does not aspire to be a source of complete data. The public Streaming API returns tweets in real time that match one or more filter predicates.
Although neither of these methods promises access to *all* tweets on a given topic, both still return large amounts of relevant data for a number of applications. Twitter data retrieved via the public APIs may be used for research as long as Twitter's guidelines are followed.
The btf (Back to Future) tweet collector is a method for collecting tweets for a set of keywords, both a few days back in time and in real time. It uses the Search API to search backwards and the Streaming API to collect new tweets as they are posted. Tweet objects are parsed and written to two databases, one per API, which are eventually merged into a single database.
The Jupyter notebook btf_jupyter.ipynb goes through all the steps in the data collection process.
For the most complete results, however, it is better to run the backward search and the forward stream at the same time, rather than waiting for one to complete before starting the other. This is how to run the collection using a set of scripts from Terminal:
To set up the query, edit q.txt with one keyword per line:
$ nano q.txt
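As an illustration of the one-keyword-per-line format, the following is a minimal sketch of how such a file could be read into a list of query terms. The function name `load_keywords` is hypothetical, and the actual scripts may parse q.txt differently; skipping blank lines and trimming whitespace are assumptions made here.

```python
from pathlib import Path


def load_keywords(path="q.txt"):
    """Return the non-empty lines of the query file, one keyword per entry.

    Hypothetical helper: leading/trailing whitespace is trimmed and blank
    lines are skipped (assumptions, not confirmed behaviour of btf).
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling `load_keywords("q.txt")` on a file with three keyword lines would return a list of three strings, in file order.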
To create the two databases:
$ python dbs.py
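What this step amounts to can be sketched as below, assuming the databases are SQLite files. The file names `search.db` and `stream.db`, the table name `tweets`, its columns, and the function `create_databases` are all illustrative assumptions; the schema actually created by dbs.py may differ.

```python
import sqlite3

# Assumed schema: one row per tweet, keyed on the tweet id so that the
# same tweet collected twice can be detected later.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    id         TEXT PRIMARY KEY,
    created_at TEXT,
    author     TEXT,
    text       TEXT
)
"""


def create_databases(paths=("search.db", "stream.db")):
    """Create one SQLite file per collection job, all with the same schema."""
    for path in paths:
        conn = sqlite3.connect(path)  # creates the file if it does not exist
        try:
            conn.execute(SCHEMA)
            conn.commit()
        finally:
            conn.close()
```

Giving both databases an identical schema keeps the later merge a plain row copy.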
Launch the live stream, preferably in a separate Terminal window (or a separate screen session):
$ python stream.py
Launch the search job, likewise in its own Terminal window or screen session:
$ python search.py
After both scripts have run for as long as desired, merge the two databases into one:
$ python mergedb.py
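Because the search and the stream can overlap in time, the same tweet may end up in both databases. One way to merge without duplicates is sketched below. It assumes SQLite files that each contain a `tweets` table with the columns `id`, `created_at`, `author` and `text`, where `id` is the tweet id; these names, the target file `merged.db` and the function `merge_databases` are hypothetical and not taken from mergedb.py.

```python
import sqlite3


def merge_databases(sources, target="merged.db"):
    """Copy all tweets from the source files into target, one row per tweet id.

    Returns the number of rows in the merged table.
    """
    conn = sqlite3.connect(target)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS tweets ("
            "id TEXT PRIMARY KEY, created_at TEXT, author TEXT, text TEXT)"
        )
        for path in sources:
            conn.execute("ATTACH DATABASE ? AS src", (path,))
            # INSERT OR IGNORE skips rows whose id is already in the target,
            # so a tweet found by both search and stream is stored once.
            conn.execute(
                "INSERT OR IGNORE INTO tweets (id, created_at, author, text) "
                "SELECT id, created_at, author, text FROM src.tweets"
            )
            conn.commit()  # close the transaction before detaching
            conn.execute("DETACH DATABASE src")
        return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
    finally:
        conn.close()
```

A call such as `merge_databases(["search.db", "stream.db"])` can be repeated safely, since rows already present in the target are ignored.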
To work further with the data, either start from the btf_parser.ipynb Jupyter notebook or simply save all the data as a CSV file:
$ python makecsv.py
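For readers who want to adapt the export, a minimal sketch of dumping a table to CSV with Python's standard library follows. It assumes the merged data lives in an SQLite file with a table named `tweets`; the file names `merged.db` and `tweets.csv` and the function `export_csv` are hypothetical, and makecsv.py may work differently.

```python
import csv
import sqlite3


def export_csv(db_path="merged.db", csv_path="tweets.csv"):
    """Write every row of the (assumed) tweets table to a CSV file with a header."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT * FROM tweets")
        # cursor.description holds one tuple per column; item 0 is its name.
        header = [column[0] for column in cursor.description]
        # newline="" lets the csv module control line endings, which keeps
        # tweets containing line breaks intact inside quoted fields.
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()
```

The `csv` module quotes fields that contain commas, quotes or line breaks, which matters for free-text tweet content.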
If needed at any point, to delete all databases in the directory:
$ python dbkill.py