Archived Twint example: store tweets in a CSV file

In this project we continue exploring big data and social networks. Once again we use Twitter as a source of information, but this example is now best read as an archived data-collection exercise.

Current note about Twitter/X access

Twint was useful for learning how a scraper can export public posts to CSV, but the Twitter/X ecosystem has changed substantially. Unofficial collectors can break when the frontend, rate limits or terms change. For a current research or production workflow, start with the official X API, approved data providers or public datasets, and keep this script as historical context about data capture and cleaning.

Twitter allows users to publish short text messages of up to 280 characters. These messages, called tweets, are usually collected through the platform API. In the original version of this exercise, Twint was used to gather data without that official API; today that detail should be treated as historical context.

The goal is to implement a Python script that stores tweets written about a topic, in a specific language and between selected dates. With that kind of dataset, you can later analyze public opinion on a topic, user or event over time.

The complete code is available at https://github.com/al118345/Tweepy_Example/blob/master/twint_ejemplo.py.

Library installation

In this example I used Python 3.8 together with the Twint library (https://github.com/twintproject/twint).

To run the example, install the repository dependencies with pip install -r requirements.txt.

Script analysis

The script is straightforward. It searches for tweets in Spanish (es) that contain the word spain and stores the latest 100 matching tweets in the file spain.csv.
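
A minimal sketch of such a search is shown here. It is not the author's exact code, and it assumes Twint is installed and can still reach the platform, which may no longer be true. The helper names build_search_settings and run_search are hypothetical; the keys (Search, Lang, Limit, Store_csv, Output) are attribute names of Twint's Config object.

```python
def build_search_settings(query="spain", lang="es", limit=100, output="spain.csv"):
    """Collect the values to copy onto a twint.Config() object."""
    return {
        "Search": query,      # text the tweets must contain
        "Lang": lang,         # target tweet language
        "Limit": limit,       # maximum number of tweets to fetch
        "Store_csv": True,    # write the results as CSV
        "Output": output,     # destination file name
    }


def run_search(settings):
    """Copy the settings onto a Twint config and launch the search."""
    import twint  # imported lazily so build_search_settings works without Twint

    config = twint.Config()
    for name, value in settings.items():
        setattr(config, name, value)
    twint.run.Search(config)


settings = build_search_settings()
# run_search(settings)  # uncomment to run; needs Twint and network access
```

Keeping the settings in a plain dictionary makes the query easy to log or store next to the CSV, so the dataset can later be traced back to the exact search that produced it.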

Before looking at the most important parts of the code, it is useful to review the kind of information that Twint can store for each tweet. Besides the main text, the output may include identifiers, timestamps, usernames, hashtags, links, reply and retweet counters, likes, detected language and many other metadata fields. Some of them may be empty because not every tweet exposes the same information.

Fields: conversation_id, created_at, date, time, timezone, user_id, username, name, place, language, mentions, urls, photos, replies_count, retweets_count, likes_count, hashtags, cashtags, link, retweet, quote_url, video, thumbnail, near, geo, source, user_rt_id, user_rt, retweet_id, reply_to, retweet_date, translate, trans_src, trans_dest

Sample row values, in order: 1594720567554281472 | 2022-11-21 16:53:06 CET | 2022-11-21 | 16:53:06 | +0100 | 185115193 | adabagcompany | Cengiz Adabag | Spain Holiday Warning To Tourists Over Rise In Simple Money Scam https://t.co/LKsLhKfsTl | [] | ['https://canadanews.fr/spain-holiday-warning-to-tourists-over-rise-in-simple-money-scam/'] | [] | https://twitter.com/adabagcompany/status/1594720567554281472 | False | []
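
List-like fields such as urls arrive in the CSV as text (for example ['https://...']), so they need parsing before analysis. The sketch below is one possible way to do it, assuming a comma-delimited file with a header row. The helper name parse_rows and the LIST_FIELDS selection are illustrative choices, and the in-memory sample stands in for spain.csv.

```python
import ast
import csv
import io

# Columns assumed to hold Python-style list literals.
LIST_FIELDS = ("mentions", "urls", "photos", "hashtags", "cashtags")


def parse_rows(csv_file):
    """Read rows as dictionaries and turn list-like text into real lists."""
    rows = []
    for row in csv.DictReader(csv_file):
        for field in LIST_FIELDS:
            raw = row.get(field) or "[]"  # missing or empty cell -> empty list
            row[field] = ast.literal_eval(raw)
        rows.append(row)
    return rows


URL = "https://canadanews.fr/spain-holiday-warning-to-tourists-over-rise-in-simple-money-scam/"
sample = io.StringIO(
    "username,urls,photos\n"
    f"adabagcompany,['{URL}'],[]\n"
)
rows = parse_rows(sample)
print(rows[0]["username"], len(rows[0]["urls"]))
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for text that comes from an external file. To read a real export, pass an open file object instead of the StringIO sample.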

You can also configure twint.Config() with parameters such as:

Lang: target tweet language.
Limit: maximum number of tweets.
Since: start date.
Until: end date.
Store_json: store output in JSON.
Output: destination file name.
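
The date parameters are easy to get wrong, so it can help to validate them before starting a collection. This is a small sketch under the assumption that Since and Until are given as YYYY-MM-DD strings. The function name date_range_settings and the default output file name are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed input format for Since and Until


def date_range_settings(since, until, output="spain.json"):
    """Validate a date range and return values for a twint.Config() object."""
    start = datetime.strptime(since, DATE_FORMAT)  # raises ValueError if malformed
    end = datetime.strptime(until, DATE_FORMAT)
    if start >= end:
        raise ValueError("since must be earlier than until")
    return {
        "Since": since,       # start date
        "Until": until,       # end date
        "Store_json": True,   # store output in JSON
        "Output": output,     # destination file name
    }


range_settings = date_range_settings("2022-11-01", "2022-11-21")
```

Each key can then be assigned to the attribute of the same name on a twint.Config() instance.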

In short, this small script shows how a reusable CSV dataset was built without the official Twitter API. Today the same research idea should be checked against current X access rules before running a collector.

Current limitations and responsible use

This article should be read as an educational example. Twitter, now X, has changed its access policies several times, and unofficial scraping tools can stop working when the platform changes its frontend, rate limits or legal terms. For production work, the official API, approved data providers or already published datasets are more stable options.

The CSV is also only a sample of public conversation, not a neutral representation of society. Before drawing conclusions, remove duplicates, document the date range, preserve the query used, check the language field and be careful with personal data. A reproducible notebook should explain exactly how the tweets were collected and why that sample is valid for the research question.
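
Some of these checks can be sketched in a few lines. The example below assumes the rows are already loaded as dictionaries with link, language and date columns, uses link as the duplicate key, and relies on dates in YYYY-MM-DD form so that text sorting matches date order. The function names clean_rows and describe_sample are hypothetical.

```python
def clean_rows(rows, language="es", key="link"):
    """Drop repeated tweets and rows whose language field does not match."""
    seen = set()
    cleaned = []
    for row in rows:
        identifier = row.get(key)
        if identifier in seen:
            continue  # duplicate of a tweet seen earlier
        seen.add(identifier)
        if row.get("language") != language:
            continue  # language field disagrees with the query
        cleaned.append(row)
    return cleaned


def describe_sample(rows, query):
    """Summarize the query and date range so the sample can be documented."""
    dates = sorted(row["date"] for row in rows)
    return {
        "query": query,
        "rows": len(rows),
        "first_date": dates[0] if dates else None,
        "last_date": dates[-1] if dates else None,
    }


example_rows = [
    {"link": "a", "language": "es", "date": "2022-11-21"},
    {"link": "a", "language": "es", "date": "2022-11-21"},
    {"link": "b", "language": "en", "date": "2022-11-20"},
    {"link": "c", "language": "es", "date": "2022-11-19"},
]
cleaned = clean_rows(example_rows)
summary = describe_sample(cleaned, query="spain")
```

Saving the summary next to the cleaned file keeps the query and the covered dates attached to the data. Handling of personal data still needs a separate decision that code alone cannot make.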

For related background, see the pages about geolocated tweets with Tweepy and social data in the coronavirus project.