Analyzing tweets using Twitter APIv2, Tweepy, Elasticsearch, and Kibana
Big Data Processing
Harvard University Extension School "Principles of Big Data Processing" CSCI E-88, Fall 2022
Project Goal and Problem Statement
This project aims to study real-time tweets about the cloud marketplaces for Amazon Web Services, Google Cloud Platform, and Microsoft Azure. I will demonstrate how to build a system that collects Twitter data mentioning these marketplaces, indexes it into Elasticsearch for further analytics, and visualizes it with Kibana.
Big Data Source
For Twitter, I used the Twitter API v2 data dictionary to pull the 'root-level' tweet attributes.
I expect to find which marketplace, and more specifically which partner marketplace program, has the most mentions worldwide.
Pipeline Overview and Technologies used
Collect data using Twitter API v2 using Python/Tweepy.
- Tweepy is an open-source Python package that gives you a convenient way to access the Twitter API with Python. It includes a set of classes and methods that represent Twitter's models and API endpoints, and it transparently handles implementation details such as data encoding and decoding. [Data is exported to a CSV file]
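The collection step described above can be sketched as follows. The bearer token, search query, and output path are assumptions, not values from the original project.

```python
# Sketch of the collection step: pull recent tweets mentioning the three
# cloud marketplaces with Tweepy and export them to a CSV file.
import csv

# Hypothetical search query covering the three marketplaces.
QUERY = '"AWS Marketplace" OR "Azure Marketplace" OR "Google Cloud Marketplace" -is:retweet'

def tweets_to_rows(tweets):
    """Flatten tweet dicts (id/created_at/text) into CSV rows."""
    return [[t["id"], t["created_at"], t["text"]] for t in tweets]

def write_csv(rows, path="tweets.csv"):
    """Write the flattened rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text"])
        writer.writerows(rows)

if __name__ == "__main__":
    import tweepy  # pip install tweepy

    # Assumption: app-only authentication with a bearer token.
    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
    resp = client.search_recent_tweets(
        query=QUERY,
        tweet_fields=["created_at"],  # root-level attributes from the data dictionary
        max_results=100,
    )
    rows = tweets_to_rows(
        {"id": t.id, "created_at": t.created_at, "text": t.text}
        for t in (resp.data or [])
    )
    write_csv(rows)
```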
Messaging/Stream Processing Tier: Push data to Elasticsearch using Python
- Eland and Pandas are used to push the CSV file to Elasticsearch
- Eland is a Python client and toolkit for DataFrames and machine learning in Elasticsearch.
- https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html
- Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language
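The ingestion step can be sketched as below: load the exported CSV with pandas, then bulk-index it into Elasticsearch with eland's `pandas_to_eland`. The Elasticsearch URL and index name are assumptions for a local setup.

```python
# Sketch: load the collected CSV with pandas and index it into Elasticsearch
# via eland. The Elasticsearch URL and index name are assumptions.
import pandas as pd

def load_tweets(path="tweets.csv"):
    """Read the exported CSV and parse the timestamp column."""
    df = pd.read_csv(path)
    df["created_at"] = pd.to_datetime(df["created_at"])
    return df

if __name__ == "__main__":
    import eland as ed  # pip install eland
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumption: local single-node cluster
    df = load_tweets()
    # pandas_to_eland bulk-indexes the DataFrame into the destination index.
    ed.pandas_to_eland(
        pd_df=df,
        es_client=es,
        es_dest_index="tweets",
        es_if_exists="replace",  # recreate the index on each run
        es_refresh=True,
    )
```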
Visualization Tier: Kibana with Elasticsearch to visualize the received data and discover which cloud marketplaces are the most popular
- Kibana is a free and open front-end application that sits on top of the Elastic Stack, providing search and data visualization capabilities for data indexed in Elasticsearch
- Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured.
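The kind of comparison the Kibana visualizations make is, under the hood, an Elasticsearch aggregation. A minimal sketch, assuming the `tweets` index and the match phrases from the collection step:

```python
# Sketch: count tweets per marketplace with a filters aggregation, the same
# kind of query Kibana issues for a bar chart. Index name and phrases are
# assumptions carried over from the collection step.
def marketplace_agg_query():
    """Build an Elasticsearch query body counting mentions per marketplace."""
    return {
        "size": 0,  # we only need the aggregation buckets, not the hits
        "aggs": {
            "marketplace": {
                "filters": {
                    "filters": {
                        "aws": {"match_phrase": {"text": "AWS Marketplace"}},
                        "azure": {"match_phrase": {"text": "Azure Marketplace"}},
                        "gcp": {"match_phrase": {"text": "Google Cloud Marketplace"}},
                    }
                }
            }
        },
    }

if __name__ == "__main__":
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumption: local cluster
    resp = es.search(index="tweets", body=marketplace_agg_query())
    for name, bucket in resp["aggregations"]["marketplace"]["buckets"].items():
        print(name, bucket["doc_count"])
```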
The index, created in the Python script
Data being visualized
Conclusions and Lessons Learned
Limitations of the technologies used
- The main limitation I ran into was trying to run everything locally on my M1 Mac, since some components were not compatible with the ARM chip.
- Another limitation was that I had to keep the script running to maintain the pipeline, because I was ingesting with Python.
What I would have done differently
- One thing I wanted to do was measure tweet sentiment (positivity/negativity). Although AWS showed a larger number of tweets, perhaps Azure or GCP had a higher ratio of positive tweets.
- This functionality is included in the Twitter API - https://developer.twitter.com/en/blog/community/2020/how-to-analyze-the-sentiment-of-your-own-tweets
Alternative Technologies I would have considered.
- Rather than hosting everything locally, I would have used Elastic Cloud or run the ELK stack on a cloud Compute Engine instance
Where would I like to take this project next?
- I want to run the Python script in Google Cloud Run, set the code to run daily, and include the date in the index name. This would allow me to collect data daily and visualize it in Kibana over time. I would also want to add the sentiment functionality. Regarding Elastic, I would continue to use the cloud version over hosting it locally. Using Elastic Cloud with GCP Cloud Run would allow for full automation.
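The date-in-the-index idea above can be sketched as a small helper; the base name `tweets` is an assumption:

```python
# Sketch of the daily-index naming mentioned above: suffix the index with the
# run date so each scheduled run writes to its own index.
from datetime import date

def daily_index_name(base="tweets", day=None):
    """Return an index name like 'tweets-2022-12-01' for time-based indices."""
    day = day or date.today()
    return f"{base}-{day.isoformat()}"
```

Kibana could then visualize across all the daily indices with an index pattern such as `tweets-*`.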
- Issues running the ELK stack locally on an M1 Mac
- Issues authenticating with Elasticsearch from the Python script
- Issues authenticating with the Twitter API v2
GitHub URL with Source Code