Store tweets in a CSV file without using the Twitter API
In this project we continue exploring big data and social networks. Once again we use Twitter as a source of information, but this time we will collect tweets without needing a Twitter developer account.
Twitter allows users to publish short text messages of up to 280 characters. These messages, called tweets, are usually collected through the platform API, but in this example we will use Twint to gather data without using that official API.
The goal is to implement a Python script that stores tweets written about a topic, in a specific language and between selected dates. With that kind of dataset, you can later analyze public opinion on a topic, user or event over time.
The complete code is available at https://github.com/al118345/Tweepy_Example/blob/master/twint_ejemplo.py.
Library installation
This example uses Python 3.8 together with the Twint library (https://github.com/twintproject/twint).
To run the example, install the repository dependencies with pip install -r requirements.txt.
Script analysis
The script is straightforward. It searches for tweets in Spanish (es) that contain the word spain and stores up to 100 of the most recent matches in the file spain.csv.
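The full listing lives in the linked repository, so the following is only a minimal sketch of what such a script might look like, not necessarily the exact code of the example. It assumes Twint's Config attributes Search, Lang, Limit, Store_csv and Output, plus twint.run.Search. The helper names build_search_settings and run_search are hypothetical and exist only for this illustration.

```python
def build_search_settings():
    """Return the search described in the text as plain data."""
    return {
        "Search": "spain",      # word the tweets must contain
        "Lang": "es",           # Spanish tweets only
        "Limit": 100,           # stop after roughly 100 tweets
        "Store_csv": True,      # write the results as CSV
        "Output": "spain.csv",  # destination file
    }


def run_search(settings):
    """Copy the settings onto a Twint config and launch the search."""
    # Imported here so the settings can be inspected without Twint installed.
    import twint

    config = twint.Config()
    for name, value in settings.items():
        setattr(config, name, value)
    twint.run.Search(config)


settings = build_search_settings()
print(settings)

# Uncomment to collect tweets; this needs Twint and network access:
# run_search(settings)
```

Keeping the settings in a dictionary makes it easy to save the exact query next to the CSV, which helps later when the dataset has to be reproduced.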
Before looking at the most important parts of the code, it is useful to review the kind of information that Twint can store for each tweet. Besides the main text, the output may include identifiers, timestamps, usernames, hashtags, links, reply and retweet counters, likes, detected language and many other metadata fields. Some of them may be empty because not every tweet exposes the same information.
| id | conversation_id | created_at | date | time | timezone | user_id | username | name | place | tweet | language | mentions | urls | photos | replies_count | retweets_count | likes_count | hashtags | cashtags | link | retweet | quote_url | video | thumbnail | near | geo | source | user_rt_id | user_rt | retweet_id | reply_to | retweet_date | translate | trans_src | trans_dest |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1594720567554281472 | 1594720567554281472 | 2022-11-21 16:53:06 CET | 2022-11-21 | 16:53:06 | +0100 | 185115193 | adabagcompany | Cengiz Adabag |  | Spain Holiday Warning To Tourists Over Rise In Simple Money Scam https://t.co/LKsLhKfsTl | en | [] | ['https://canadanews.fr/spain-holiday-warning-to-tourists-over-rise-in-simple-money-scam/'] | [] | 0 | 0 | 0 | [] | [] | https://twitter.com/adabagcompany/status/1594720567554281472 | False |  | 0 |  |  |  |  |  |  |  | [] |  |  |  |  |
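As the sample row suggests, list-like fields such as urls or hashtags are written as text that looks like a Python list, and the counters arrive as strings. The sketch below shows one possible way to load such a file and convert those columns. It assumes a comma-delimited CSV with the header shown above; the function names parse_list, parse_row and read_tweets are hypothetical, and the in-memory sample is invented for the demonstration.

```python
import ast
import csv
import io

# Columns that the sample row shows as list literals or counters.
LIST_COLUMNS = ("mentions", "urls", "photos", "hashtags", "cashtags", "reply_to")
COUNT_COLUMNS = ("replies_count", "retweets_count", "likes_count")


def parse_list(raw):
    """Turn text such as "['a', 'b']" into a real list; empty text gives []."""
    if not raw:
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]  # keep unparseable text instead of losing it
    return value if isinstance(value, list) else [value]


def parse_row(row):
    """Convert the list and counter columns that are present in one row."""
    parsed = dict(row)
    for name in LIST_COLUMNS:
        if name in parsed:
            parsed[name] = parse_list(parsed[name])
    for name in COUNT_COLUMNS:
        if parsed.get(name):
            parsed[name] = int(parsed[name])
    return parsed


def read_tweets(handle):
    """Read every row from an open CSV file or file-like object."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Invented two-line sample standing in for spain.csv.
sample = io.StringIO(
    "id,tweet,urls,hashtags,likes_count\n"
    "1,hello,\"['https://example.com/a']\",[],3\n"
)
tweets = read_tweets(sample)
print(tweets[0]["urls"], tweets[0]["likes_count"])
```

With a real file, the same function could be called as read_tweets(open("spain.csv", newline="", encoding="utf-8")).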
You can also configure twint.Config() with parameters such as:
- Lang: target tweet language.
- Limit: maximum number of tweets.
- Since: start date.
- Until: end date.
- Store_json: store output in JSON.
- Output: destination file name.
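To illustrate how these parameters fit together, here is a small sketch that restricts a search to a date range and asks for JSON output. It assumes that Since and Until accept dates written as YYYY-MM-DD. The helper configure_date_range, the example dates and the file name spain.json are hypothetical, and a SimpleNamespace stands in for twint.Config() so the sketch runs without Twint installed.

```python
from datetime import date
from types import SimpleNamespace


def configure_date_range(config, topic, lang, since, until, limit, output):
    """Set the parameters listed above on a Twint-style config object."""
    if since > until:
        raise ValueError("since must not be later than until")
    config.Search = topic
    config.Lang = lang
    config.Since = since.isoformat()  # formatted as YYYY-MM-DD
    config.Until = until.isoformat()
    config.Limit = limit
    config.Store_json = True
    config.Output = output
    return config


# Stand-in object for the demonstration. With Twint installed, pass
# twint.Config() instead and then call twint.run.Search(config).
demo = configure_date_range(
    SimpleNamespace(),
    topic="spain",
    lang="es",
    since=date(2022, 11, 1),
    until=date(2022, 11, 30),
    limit=100,
    output="spain.json",
)
print(demo)
```

Checking the date order before the search starts avoids launching a query that cannot return anything.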
In short, this small script is enough to build a reusable CSV dataset without the official Twitter API, which can later be used for sentiment analysis, topic analysis or any other social-data experiment.
Current limitations and responsible use
This article should be read as an educational example. Twitter, now X, has changed its access policies several times, and unofficial scraping tools can stop working when the platform changes its frontend, rate limits or legal terms. For production work, the official API, approved data providers or already published datasets are more stable options.
The CSV is also only a sample of public conversation, not a neutral representation of society. Before drawing conclusions, remove duplicates, document the date range, preserve the query used, check the language field and be careful with personal data. A reproducible notebook should explain exactly how the tweets were collected and why that sample is valid for the research question.
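Some of these checks can be sketched in a few lines. The example below removes repeated tweet ids, keeps only rows whose language field matches the one requested, and builds a small record of how the sample was obtained. The function names clean_tweets and provenance_record, the sample rows and the dates are hypothetical; the column names id and language follow the table shown earlier. Note that the sample row in that table is tagged en, which is exactly the kind of mismatch this filter would catch.

```python
import json


def clean_tweets(rows, language):
    """Drop repeated tweet ids and rows whose language tag differs."""
    seen_ids = set()
    kept = []
    for row in rows:
        tweet_id = row.get("id")
        if not tweet_id or tweet_id in seen_ids:
            continue  # skip rows without an id and repeated tweets
        seen_ids.add(tweet_id)
        if row.get("language") == language:
            kept.append(row)
    return kept


def provenance_record(query, language, since, until, collected, kept):
    """Describe how the sample was built so the run can be repeated."""
    return {
        "query": query,
        "language": language,
        "since": since,
        "until": until,
        "rows_collected": collected,
        "rows_kept": kept,
    }


# Invented rows standing in for the contents of the CSV file.
rows = [
    {"id": "1", "language": "es", "tweet": "hola"},
    {"id": "1", "language": "es", "tweet": "hola"},
    {"id": "2", "language": "en", "tweet": "hello"},
]
cleaned = clean_tweets(rows, "es")
record = provenance_record(
    "spain", "es", "2022-11-01", "2022-11-30", len(rows), len(cleaned)
)
print(json.dumps(record, indent=2))
```

Saving the record next to the cleaned file keeps the query, the date range and the effect of the filters in one place.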
For related background, see the pages about geolocated tweets with Tweepy and social data in the coronavirus project.