Analyzing tweets using Twitter APIv2, Tweepy, Elasticsearch, and Kibana

Data Processing

Harvard University Extension School "Principles of Big Data Processing" CSCI E-88, Fall 2022
Final Project

Project Goal and Problem Statement

This project studies real-time tweets about the cloud marketplaces of Amazon Web Services, Google Cloud Platform, and Microsoft Azure. I will demonstrate how to build a system that collects Twitter data mentioning these marketplaces, indexes it into Elasticsearch for further analytics, and visualizes it with Kibana.

Terms:
“AWS Marketplace”

“GCP Marketplace”

“Azure Marketplace”
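
The three terms above can be combined into a single search query. The sketch below assumes the Twitter API v2 search syntax, in which a double-quoted string matches an exact phrase and `OR` joins alternatives. The `build_query` helper and the retweet filter are illustrative choices and may differ from what the project's script does.

```python
# Hypothetical helper: combine the marketplace terms into one Twitter API v2 query.
TERMS = ["AWS Marketplace", "GCP Marketplace", "Azure Marketplace"]


def build_query(terms, exclude_retweets=True):
    """Return a query string that matches any of the exact phrases in `terms`."""
    # Double quotes request an exact-phrase match; OR joins the alternatives.
    phrases = " OR ".join(f'"{term}"' for term in terms)
    query = f"({phrases})"
    if exclude_retweets:
        # Assumption: drop retweets so that each original mention is counted once.
        query += " -is:retweet"
    return query


if __name__ == "__main__":
    print(build_query(TERMS))
```

Keeping the terms in one list means the same list can drive both the search query and any later per-term counting.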

Big Data Source

Twitter APIv2
For Twitter, I pulled the ‘root-level’ attributes described in the Twitter API v2 data dictionary.

https://developer.twitter.com/en/docs/twitter-api/data-dictionary/introduction

Expected results

I expect to find which marketplace, and more specifically which partner marketplace program, has the most mentions worldwide.
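
One way to check this expectation outside Kibana is to count how many collected tweets mention each term. The following is a minimal sketch that assumes the tweets are available as plain strings. `count_mentions` is a hypothetical helper, and matching is a simple case-insensitive substring test.

```python
from collections import Counter

TERMS = ["AWS Marketplace", "GCP Marketplace", "Azure Marketplace"]


def count_mentions(texts, terms=TERMS):
    """Count how many tweets mention each term (case-insensitive substring match)."""
    # Start every term at zero so that terms with no mentions still appear.
    counts = Counter({term: 0 for term in terms})
    for text in texts:
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                counts[term] += 1
    return counts


if __name__ == "__main__":
    sample = [  # made-up example tweets, not project data
        "Our product is now on AWS Marketplace!",
        "Comparing Azure Marketplace and aws marketplace listings",
    ]
    for term, n in count_mentions(sample).most_common():
        print(f"{term}: {n}")
```

A tweet that mentions two marketplaces is counted once for each, which matches the idea of counting mentions rather than tweets.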

Pipeline showing Twitter -> Tweepy -> Python -> Elasticsearch -> Kibana

Pipeline Overview and Technologies used

Collect data from the Twitter API v2 using Python and Tweepy.

  • Tweepy is an open-source Python package that provides a convenient way to access the Twitter API from Python. It includes a set of classes and methods that represent Twitter's models and API endpoints, and it transparently handles implementation details such as data encoding and decoding. In this project, the collected data is exported to a CSV file.
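
The collection step can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: it uses Tweepy's `Client.search_recent_tweets`, and the chosen field list and the helper names `fetch_recent_tweets` and `write_csv` are hypothetical.

```python
import csv

# Assumed subset of root-level tweet attributes to keep.
FIELDS = ["id", "text", "created_at", "lang", "author_id"]


def fetch_recent_tweets(bearer_token, query, limit=100):
    """Return up to `limit` recent tweets matching `query` as plain dicts."""
    import tweepy  # imported here so the CSV helper works without Tweepy installed

    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(
        query=query,
        tweet_fields=["created_at", "lang", "author_id"],
        max_results=limit,  # the recent-search endpoint accepts 10 to 100
    )
    tweets = response.data or []  # response.data is None when nothing matched
    return [{name: getattr(tweet, name) for name in FIELDS} for tweet in tweets]


def write_csv(rows, out_file):
    """Write tweet dicts to an open text file object as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

A typical call would open the output with `open("tweets.csv", "w", newline="", encoding="utf-8")` and pass the file object to `write_csv`; the file name is only an example.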

Messaging/Stream Processing Tier: Push data to Elasticsearch using Python
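
Pushing the collected rows into Elasticsearch might look like the sketch below. It assumes the official `elasticsearch` Python client in its 8.x line (where credentials are passed as `basic_auth`) and its `helpers.bulk` function. The index name, URL, credentials, and the helper names `to_actions` and `push_to_elasticsearch` are placeholders.

```python
def to_actions(rows, index_name):
    """Yield one bulk-index action per tweet dict."""
    for row in rows:
        yield {
            "_index": index_name,
            # Reusing the tweet id as the document id means re-running the
            # script overwrites existing documents instead of duplicating them.
            "_id": row["id"],
            "_source": row,
        }


def push_to_elasticsearch(rows, index_name, url, username, password):
    """Bulk-index `rows` and return the number of documents indexed."""
    # Imported here so to_actions can be used without the client installed.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(url, basic_auth=(username, password))
    success_count, _errors = helpers.bulk(es, to_actions(rows, index_name))
    return success_count
```

Separating the action generator from the network call keeps the document shape easy to inspect before anything is sent to the cluster.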

Visualization Tier: Kibana with Elasticsearch to visualize the received data and discover which cloud marketplaces are the most popular

  • Kibana is a free and open front-end application that sits on top of the Elastic Stack, providing search and data visualization capabilities for data indexed in Elasticsearch.
  • https://www.elastic.co/what-is/kibana
  • Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
  • https://www.elastic.co/what-is/elasticsearch

Results

The index, which was created in the Python script

Index Management page in Kibana

Index

Query to check index
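
A comparable check can be made from Python. The sketch below assumes an already-connected client from the official `elasticsearch` package (8.x) and uses its `count` and `search` methods; `check_index` is a hypothetical helper name.

```python
def check_index(es, index_name):
    """Return (document count, one sample document or None) for `index_name`."""
    # count() reports how many documents the index currently holds.
    count = es.count(index=index_name)["count"]
    # A match_all search with size=1 fetches a single document to eyeball.
    result = es.search(index=index_name, query={"match_all": {}}, size=1)
    hits = result["hits"]["hits"]
    sample = hits[0]["_source"] if hits else None
    return count, sample
```

A non-zero count and a sample document with the expected fields are a quick sign that ingestion worked before building visualizations.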

Data being visualized

Conclusions and Lessons Learned

Limitations of the technology used

  • The main limitation I ran into was running everything locally on my M1 Mac, because of compatibility problems with its ARM chip.
  • Another limitation was that, because I was ingesting with Python, I had to run the script each time to keep the pipeline up to date.

What I would have done differently

Alternative Technologies I would have considered.

  • Rather than hosting everything locally, I would have used Elastic Cloud or run the ELK stack on a cloud compute instance.

Where would I like to take this project next?

  • I want to run the Python script in Google Cloud Run, scheduled to run daily, and include the date in the index name. This would allow me to collect data daily and visualize it in Kibana over time. I would also like to add sentiment analysis. For Elastic, I would continue to use the cloud version rather than hosting it locally. Using Elastic Cloud with Google Cloud Run would allow for full automation.
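
Including the date in the index name could be as simple as the sketch below. The `tweets` prefix, the `YYYY.MM.DD` suffix format, and the `daily_index_name` helper are assumptions chosen for illustration.

```python
from datetime import date


def daily_index_name(prefix="tweets", day=None):
    """Return an index name such as 'tweets-2022.12.01' for the given day."""
    day = day or date.today()
    # Elasticsearch index names must be lowercase; the date suffix is numeric.
    return f"{prefix}-{day:%Y.%m.%d}"


if __name__ == "__main__":
    print(daily_index_name())
```

With names of this shape, a single Kibana data view over `tweets-*` would cover every daily index.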

Issues encountered

  • Issues running the ELK stack locally on an M1 Mac
  • Issues authenticating with Elasticsearch from the Python script
  • Issues authenticating with the Twitter API v2

Keynote Presentation

GitHub URL with Source Code

https://github.com/netdevmike/CSCIE-88-Final-Project
