Code to store only those tweets that have geolocation using Tweepy and Python.
In this project we carry out a small investigation at the intersection of Big Data and social networks. More specifically, we use Twitter as the text source, since this social network lets us collect geolocated tweets by area.
What this Tweepy example solves
The goal is not to download every tweet about a topic, but to keep only the records that include usable geographic information. That distinction is important because most tweets do not include coordinates, and a dataset that mixes located and non-located messages is difficult to map or analyze reliably.
The workflow is simple: authenticate with the Twitter API, open a streaming listener, check whether the coordinate fields exist and write a normalized CSV row. If you need the broader collection context, the related article about capturing coronavirus tweets with Tweepy explains the base pipeline before the geolocation filter is applied.
For readers unfamiliar with it, Twitter lets users send short plain-text messages of up to 280 characters. These messages, called tweets, appear on the user's main page and can be captured through an API provided by the social network itself.
This example implements a Python script that continuously stores every geolocated tweet written in English about the Coronavirus. With that data we can investigate in which country or city a given topic is discussed the most.
For more information about how Tweepy works, see https://1938.com.es/app-coronavirus-twitter or, if you prefer a video tutorial, https://www.youtube.com/watch?v=vCFioQizM4w
The full code can be downloaded from the following URL: https://github.com/al118345/Tweepy_Example/blob/master/Tweepy_ejemplo_localizacion.py
Script analysis
The code is not mysterious at all. The goal of this implementation is to search for all tweets in English (en) that contain the word Coronavirus and store only those that carried geolocation information at the moment they were written.
Before looking at the most important parts of the code, the following table shows which location information can be obtained from a tweet.
| status.created_at | status.text | status.geo | status.coordinates | status.place |
|---|---|---|---|---|
| 2021-05-24 14:49:33 | Coronavirus Update: WHO head slams ‘scandalous inequity’ in COVID vaccines with 10 countries accounting for 75% of doses administered https://t.co/3myGLUUaKm #Nifty #Sipgrab #UnitingPeopleWithThePossibilities | {'type': 'Point', 'coordinates': [12.8898216, 77.65212771]} | {'type': 'Point', 'coordinates': [77.65212771, 12.8898216]} | Place(_api=<tweepy.api.API object at 0x7f9a270791c0>, id='5f55bb82cf16ac81', url='https://api.twitter.com/1.1/geo/id/5f55bb82cf16ac81.json', place_type='city', name='Bengaluru South', full_name='Bengaluru South, India', country_code='IN', country='India', bounding_box=BoundingBox(_api=<tweepy.api.API object at 0x7f9a270791c0>, type='Polygon', coordinates=[[[77.330578, 12.731936], [77.330578, 13.114293], [77.786319, 13.114293], [77.786319, 12.731936]]]), attributes={}) |
As the table shows, a tweet stores its location in several formats. On the one hand, the place property gives the full name of the city, state or country where the user was when writing the tweet; on the other hand, coordinates and geo hold the point coordinates. Note that the two point fields use opposite axis orders: in the row above, geo lists latitude first, while coordinates lists longitude first.
However, not every tweet has this feature enabled. As a rough figure, only about 1 in every 100 tweets written about the coronavirus carries location information.
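The difference in axis order between the two point fields is easy to get wrong when mapping, so here is a minimal sketch of how it could be handled. It assumes the fields arrive as plain dictionaries shaped like the ones in the table; the function name lat_lon_from_fields is hypothetical and does not come from the original script.

```python
from typing import Optional, Tuple


def lat_lon_from_fields(
    coordinates: Optional[dict], geo: Optional[dict]
) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) from tweet-style point fields, or None."""
    if coordinates and coordinates.get("type") == "Point":
        # The coordinates field lists longitude first, then latitude.
        lon, lat = coordinates["coordinates"]
        return (lat, lon)
    if geo and geo.get("type") == "Point":
        # The geo field lists latitude first, then longitude.
        lat, lon = geo["coordinates"]
        return (lat, lon)
    return None


# Values copied from the table row above.
coords = {"type": "Point", "coordinates": [77.65212771, 12.8898216]}
geo = {"type": "Point", "coordinates": [12.8898216, 77.65212771]}

print(lat_lon_from_fields(coords, geo))
print(lat_lon_from_fields(None, geo))
```

Both calls yield the same (latitude, longitude) pair, which is a quick way to confirm that the two fields describe the same point.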
To keep only the tweets that carry geographic information, look at line 15 of the full script linked above.
That line stores only the tweets whose coordinates field is filled in. It sits inside a function whose only purpose is, given a tweet (status), to check whether it is geopositioned: if so, the tweet is written to the file; if not, the function moves on to the next one.
The tweets reach that function from the listener setup, whose only purpose is to activate the listener in charge of obtaining the tweets published for a given topic and language.
Finally, remember that you must obtain and fill in the access credentials for the script to work correctly. The following URL explains how to obtain them: https://1938.com.es/app-coronavirus-twitter
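The geolocation check described above can be sketched roughly as follows. This is not the original function: the name write_if_geolocated, the CSV column order and the sample values are assumptions made for illustration, and the status objects are simple stand-ins rather than real Tweepy objects. In a Tweepy 3.x listener, logic like this would typically be called from the on_status callback.

```python
import csv
import io
from types import SimpleNamespace


def write_if_geolocated(status, writer) -> bool:
    """Write one CSV row for a geolocated status; return True if written."""
    coordinates = getattr(status, "coordinates", None)
    if not coordinates:
        # No coordinates on this tweet: skip it and wait for the next one.
        return False
    # The coordinates field lists longitude first, then latitude.
    lon, lat = coordinates["coordinates"]
    writer.writerow([status.created_at, status.text, lat, lon])
    return True


# Stand-ins for the objects a listener would receive (illustrative values).
located = SimpleNamespace(
    created_at="2021-05-24 14:49:33",
    text="Coronavirus Update",
    coordinates={"type": "Point", "coordinates": [77.65212771, 12.8898216]},
)
unlocated = SimpleNamespace(
    created_at="2021-05-24 14:50:00",
    text="No location here",
    coordinates=None,
)

buffer = io.StringIO()  # an in-memory file, so the sketch leaves nothing on disk
writer = csv.writer(buffer)
results = [write_if_geolocated(s, writer) for s in (located, unlocated)]

print(results)  # [True, False]
print(buffer.getvalue())
```

Returning a boolean instead of writing silently makes it easy to count how many tweets were kept versus discarded during a capture session.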
Practical limitations
This kind of dataset should be interpreted carefully. Geolocated tweets are a small and biased subset of the full conversation: users must have location enabled, the API must expose the field and the capture process must remain active while the message is published. For that reason, the result is useful for technical experimentation and exploratory visualization, but it should not be treated as a statistically complete sample of public opinion.
A good next step is to combine the CSV with a data preparation workflow, remove duplicates and enrich the records with country or city labels before drawing maps or dashboards. For that part, the article on data analysis and dataset preparation gives a complementary structure for cleaning and interpreting tabular information.
What to improve before using it in a real analysis
For a more robust project, store the tweet identifier, creation date, language, coordinates, place name and raw API payload separately. That makes it possible to deduplicate records, repeat the cleaning process and enrich the dataset later with country codes or administrative regions.
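One possible shape for such a record is sketched below. It assumes the raw payload is a dictionary laid out like a Twitter API v1.1 tweet object (id_str, created_at, lang, coordinates, place); the TweetRecord class, the record_from_payload helper and the sample payload, including its identifier, are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TweetRecord:
    tweet_id: str
    created_at: str
    lang: str
    latitude: float
    longitude: float
    place_name: Optional[str]
    raw_json: str  # the untouched payload, kept so cleaning can be repeated


def record_from_payload(payload: dict) -> TweetRecord:
    """Split a geolocated tweet payload into separately stored fields."""
    # The coordinates field lists longitude first, then latitude.
    lon, lat = payload["coordinates"]["coordinates"]
    place = payload.get("place") or {}
    return TweetRecord(
        tweet_id=payload["id_str"],
        created_at=payload["created_at"],
        lang=payload.get("lang", ""),
        latitude=lat,
        longitude=lon,
        place_name=place.get("full_name"),
        raw_json=json.dumps(payload, sort_keys=True),
    )


# Made-up payload for illustration; the id is not a real tweet identifier.
payload = {
    "id_str": "1",
    "created_at": "Mon May 24 14:49:33 +0000 2021",
    "lang": "en",
    "coordinates": {"type": "Point", "coordinates": [77.65212771, 12.8898216]},
    "place": {"full_name": "Bengaluru South, India", "country_code": "IN"},
}
record = record_from_payload(payload)
print(record.tweet_id, record.latitude, record.longitude, record.place_name)
```

Keeping the raw payload as a JSON string next to the extracted fields means new columns, such as the country code, can be derived later without capturing the tweets again.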
It is also useful to separate collection from analysis. The streaming script should only capture and persist data; a second script can validate coordinates, remove broken rows and prepare the CSV for maps, dashboards or text mining workflows.
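As a minimal sketch of that second script, the function below drops rows with broken or out-of-range coordinates and removes duplicates. The column names tweet_id, latitude and longitude are assumptions about how the CSV was written, and the sample rows are invented for the example.

```python
import csv
import io


def clean_rows(rows):
    """Yield rows with valid coordinates, skipping duplicates by tweet_id."""
    seen = set()
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric coordinates
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue  # outside the valid latitude/longitude ranges
        tweet_id = row.get("tweet_id")
        if not tweet_id or tweet_id in seen:
            continue  # no identifier, or already emitted
        seen.add(tweet_id)
        yield row


# Invented sample: one duplicate, one non-numeric value, one impossible latitude.
raw_csv = """tweet_id,latitude,longitude
1,12.8898216,77.65212771
1,12.8898216,77.65212771
2,not-a-number,10.0
3,95.0,10.0
4,40.4168,-3.7038
"""

cleaned = list(clean_rows(csv.DictReader(io.StringIO(raw_csv))))
print([row["tweet_id"] for row in cleaned])  # ['1', '4']
```

Because the function only consumes an iterable of dictionaries, it can read from a file on disk just as easily as from the in-memory sample used here.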