TwitterScraper is a Python library for pulling tweets about any topic from Twitter. So what can we do with these captured tweets? That depends on the intended use. In our article “How to Do a Sentiment Analysis?”, we said that social media platforms such as Twitter and Facebook hold plenty of data for sentiment analysis, but we did not explain how to obtain it. In this article, we will learn how to pull the data required for sentiment analysis from Twitter.
Twitter provides developers with a REST API to access and search its data, as well as a Streaming API that can be used to access real-time data.
Most software written to access Twitter data goes through an API (application programming interface), and this brings limitations. Gaining access to these APIs is also not something that happens quickly: as a developer, you must apply to Twitter and provide a valid reason for accessing the data.
With Twitter’s Search API, you can send only 180 requests every 15 minutes. Since each request returns at most 100 tweets, that caps you at 4 × 180 × 100 = 72,000 tweets per hour. With TwitterScraper, you can extract as much data as you want regardless of this limit, as long as your internet speed and bandwidth are sufficient.
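To make that ceiling explicit, here is the same arithmetic as a tiny Python snippet (nothing library-specific, just the numbers quoted above):

# Twitter Search API ceiling: 180 requests per 15-minute window,
# at most 100 tweets per request, 4 windows per hour
print(4 * 180 * 100)  # 72000 tweets per hour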
Another big disadvantage of the Twitter Search API is that you can only access tweets written in the last 7 days. This is quite annoying for anyone who wants to include older data in their model. TwitterScraper has no such limitation.
Installation
To install TwitterScraper, run the command below.
pip install twitterscraper
Another way is to download the GitHub repository and then run the following command.
python setup.py install
If you are using Docker:
docker build -t twitterscraper:build .
and you can run your Docker container with the command:
docker run --rm -it -v/:/app/data twitterscraper:build
Command Line
You can store tweets in JSON (JavaScript Object Notation) format by running the command below on the command line.
twitterscraper "Trump OR Clinton" --limit 100 --output=tweets.json
You can change the limit as you wish, and you can cancel the search by pressing Ctrl + C while tweets are being collected. After the process completes, the captured tweets are safely stored in your JSON file.
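Once the file is written, you can also inspect it from Python. A minimal sketch, assuming the output file is named tweets.json as in the command above:

import pandas as pd

# twitterscraper writes the tweets as a JSON array, which pandas can read directly
tweets = pd.read_json('tweets.json')
print(tweets['text'].head())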
Python Code
As in every article, we will of course include Python code here as well. Let's get started!
We start by importing the necessary libraries:
from twitterscraper import query_tweets
import datetime as dt
import pandas as pd
We will explain the necessary libraries in detail later in our article. We can move on to the next step.
The query_tweets() function takes the following parameters:
- limit: Stops pulling data once the given limit is reached. The count will always be a multiple of 20, since tweets are retrieved in batches of 20. You can also stop the pull early by pressing Ctrl + C.
- lang: Retrieves tweets in the specified language. More than 30 languages are currently supported; print the help message for the full list.
- begindate: Starts pulling data from the date you specify, in YYYY-MM-DD format. The default value of this parameter is 2006-03-21.
- enddate: Retrieves data up to the date you specify, in YYYY-MM-DD format. The default value of this parameter is today.
- query: The word or phrase you want to search for, given as a string.
We can now define our parameter values for TwitterScraper. In this Python application, I want to capture tweets containing “Atatürk”, and I use the datetime library to set my start and end dates.
limit = 1000
begin_date = dt.date(2019, 4, 15)
end_date = dt.date(2019, 4, 18)
lang = "turkish"
query = "Atatürk"
After defining the parameters, we can now move on to the extraction phase.
ataturk = query_tweets(query, begindate=begin_date, enddate=end_date, limit=limit, lang=lang)
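Before converting anything, it is worth checking how many tweets came back. A minimal sketch; query_tweets() returns a plain Python list of Tweet objects, so len() works:

# Count the Tweet objects returned by the query
print(len(ataturk))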
As we can see above, we pulled 1,027 tweets, but they are still Tweet objects. We convert them to a DataFrame using the pandas library.
df_ataturk = pd.DataFrame(t.__dict__ for t in ataturk)
After converting, we get a DataFrame with 21 columns. These columns are:
df_ataturk.columns

OUTPUT: Index(['screen_name', 'username', 'user_id', 'tweet_id', 'tweet_url', 'timestamp', 'timestamp_epochs', 'text', 'text_html', 'links', 'hashtags', 'has_media', 'img_urls', 'video_url', 'likes', 'retweets', 'replies', 'is_replied', 'is_reply_to', 'parent_tweet_id', 'users'], dtype='object')
I only want to keep the tweet text in the DataFrame, so I need to filter the columns.
df_ataturk = df_ataturk[['text']]
Let's look at our first 6 tweets.
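A one-line preview does it (a minimal sketch, using the filtered DataFrame from the previous step):

# Show the first six rows of the text-only DataFrame
df_ataturk.head(6)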
As a result, we pulled data from Twitter using the TwitterScraper library, without needing an API. Even more varied results can be obtained by changing the parameters.
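If the end goal is the sentiment analysis mentioned at the start, a natural next step is to persist the captured text for later processing. A minimal sketch; the file name ataturk_tweets.csv is just an example:

# Save the tweet texts to CSV for a later sentiment-analysis step
df_ataturk.to_csv('ataturk_tweets.csv', index=False, encoding='utf-8')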