Analyzing tweets using Twitter APIv2, Tweepy, Elasticsearch, and Kibana
Big Data Processing
Harvard University Extension School "Principles of Big Data Processing" CSCI E-88, Fall 2022
Project Goal and Problem Statement
This project aims to study real-time tweets about the cloud marketplaces for Amazon Web Services, Google Cloud Platform, and Microsoft Azure. I will demonstrate how to build a system that collects Twitter data mentioning these marketplaces, indexes it into Elasticsearch for further analytics, and visualizes it with Kibana.
Big Data Source
For Twitter, I used the Twitter API v2 data dictionary to pull the 'root-level' tweet attributes.
I expect to find which marketplace, and more specifically which partner marketplace program, has the most mentions worldwide.
Pipeline Overview and Technologies used
Collect data using Twitter API v2 using Python/Tweepy.
- Tweepy is an open-source Python package that gives you a convenient way to access the Twitter API with Python. It includes a set of classes and methods that represent Twitter's models and API endpoints, and it transparently handles implementation details such as data encoding and decoding. [Data is exported to a CSV file]
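The collection step described above can be sketched as follows. The bearer token, search query, and output path are assumptions, not values from the original project.

```python
# Sketch of the collection step: pull recent tweets mentioning the three
# cloud marketplaces with Tweepy and export them to a CSV file.
import csv

# Hypothetical search query covering the three marketplaces.
QUERY = '"AWS Marketplace" OR "Azure Marketplace" OR "Google Cloud Marketplace" -is:retweet'

def tweets_to_rows(tweets):
    """Flatten tweet dicts (id/created_at/text) into CSV rows."""
    return [[t["id"], t["created_at"], t["text"]] for t in tweets]

def write_csv(rows, path="tweets.csv"):
    """Write the flattened rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(rows)

if __name__ == "__main__":
    import tweepy  # pip install tweepy

    # Assumption: app-only authentication with a bearer token.
    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
    resp = client.search_recent_tweets(
        query=QUERY,
        tweet_fields=["created_at"],  # root-level attributes from the data dictionary
        max_results=100,
    )
    rows = tweets_to_rows(
        {"id": t.id, "created_at": t.created_at, "text": t.text}
        for t in (resp.data or [])
    )
    write_csv(rows)
```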
Messaging/Stream Processing Tier: Push data to Elasticsearch using Python
- Eland and Pandas are used to push the CSV file to Elasticsearch
- Eland is a Python client and toolkit for DataFrames and machine learning in Elasticsearch.
- https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html
- Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language
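The ingestion step can be sketched as below: load the exported CSV with pandas, then bulk-index it into Elasticsearch with eland's `pandas_to_eland`. The Elasticsearch URL and index name are assumptions for a local setup.

```python
# Sketch: load the collected CSV with pandas and index it into Elasticsearch
# via eland. The Elasticsearch URL and index name are assumptions.
import pandas as pd

def load_tweets(path="tweets.csv"):
    """Read the exported CSV and parse the timestamp column."""
    df = pd.read_csv(path)
    df["created_at"] = pd.to_datetime(df["created_at"])
    return df

if __name__ == "__main__":
    import eland as ed  # pip install eland
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumption: local single-node cluster
    df = load_tweets()
    # pandas_to_eland bulk-indexes the DataFrame into the destination index.
    ed.pandas_to_eland(
        pd_df=df,
        es_client=es,
        es_dest_index="tweets",
        es_if_exists="replace",  # recreate the index on each run
        es_refresh=True,
    )
```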
Visualization Tier: Kibana with Elasticsearch to visualize the received data and discover which cloud marketplaces are the most popular
- Kibana is a free and open front-end application that sits on top of the Elastic Stack, providing search and data visualization capabilities for data indexed in Elasticsearch
- Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
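The kind of comparison the Kibana visualizations make is, under the hood, an Elasticsearch aggregation. A minimal sketch, assuming the `tweets` index and the match phrases from the collection step:

```python
# Sketch: count tweets per marketplace with a filters aggregation, the same
# kind of query Kibana issues for a bar chart. Index name and phrases are
# assumptions carried over from the collection step.
def marketplace_agg_query():
    """Build an Elasticsearch query body counting mentions per marketplace."""
    return {
        "size": 0,  # we only need the aggregation buckets, not the hits
        "aggs": {
            "marketplace": {
                "filters": {
                    "filters": {
                        "aws": {"match_phrase": {"text": "AWS Marketplace"}},
                        "azure": {"match_phrase": {"text": "Azure Marketplace"}},
                        "gcp": {"match_phrase": {"text": "Google Cloud Marketplace"}},
                    }
                }
            }
        },
    }

if __name__ == "__main__":
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumption: local cluster
    resp = es.search(index="tweets", body=marketplace_agg_query())
    for name, bucket in resp["aggregations"]["marketplace"]["buckets"].items():
        print(name, bucket["doc_count"])
```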
The index, created in the Python script
Data being visualized
Conclusions and Lessons Learned
Limitations of the technologies used
- The main limitation I ran into was trying to run everything locally on my M1 Mac, since some components were not compatible with the ARM chip.
- Another limitation was that I had to keep the script running to maintain the pipeline, because I was ingesting with Python.
What I would have done differently
- One thing I wanted to do was measure tweet sentiment (positivity/negativity). Although AWS showed a larger number of tweets, perhaps Azure or GCP had a higher ratio of positive tweets.
- This functionality is included in the Twitter API - https://developer.twitter.com/en/blog/community/2020/how-to-analyze-the-sentiment-of-your-own-tweets
Alternative Technologies I would have considered.
- Rather than hosting everything locally, I would have used Elastic Cloud or run the ELK stack on a cloud Compute Engine instance
Where would I like to take this project next?
- I want to run the Python script in Google Cloud Run, set the code to run daily, and include the date in the index name. This would allow me to collect data daily and visualize it in Kibana over time. I would also want to add the sentiment functionality. Regarding Elastic, I would continue to use the cloud version over hosting it locally. Using Elastic Cloud with GCP Cloud Run would allow for full automation.
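The date-in-the-index idea above can be sketched as a small helper; the base name `tweets` is an assumption:

```python
# Sketch of the daily-index naming mentioned above: suffix the index with the
# run date so each scheduled run writes to its own index.
from datetime import date

def daily_index_name(base="tweets", day=None):
    """Return an index name like 'tweets-2022-12-01' for time-based indices."""
    day = day or date.today()
    return f"{base}-{day.isoformat()}"
```

Kibana could then visualize across all the daily indices with an index pattern such as `tweets-*`.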
- Issues running the ELK stack locally on an M1 Mac
- Issues authenticating with Elasticsearch from the Python script
- Issues authenticating with the Twitter API v2
GitHub URL with Source Code