Program  

Courses
Location
Corporate
Our Students
Resources
Bootcamp Programs
Short Courses
Portfolio Courses
Bootcamp Programs

Launch your career in Data and AI through our bootcamp programs

  • Industry-leading curriculum
  • Real portfolio/industry projects
  • Career support program
  • Both Full-time & Part-time options.
Data Science & Big Data
Data Engineering

Become a data analyst through building hands-on data/business use cases

Become an AI/ML engineer by getting specialized in deep learning, computer vision, NLP, and MLOps

Become a DevOps Engineer by learning AWS, Docker, Kubernetes, IaaS, IaC (Terraform), and CI/CD

Short Courses

Improve your data & AI skills through self-paced and instructor-led courses

  • Industry-leading curriculum
  • Portfolio projects
  • Part-time flexible schedule
AI ENGINEERING
Portfolio Courses

Learn to build impressive data/AI portfolio projects that get you hired

  • Portfolio project workshops
  • Work on real industry data & AI project
  • Job readiness assessment
  • Career support & job referrals

Build data strategies and solve ML challenges for real clients

Help real clients build BI dashboard and tell data stories

Build end to end data pipelines in the cloud for real clients

Location

Choose to learn at your comfort home or at one of our campuses

Corporate Partners

We’ve partnered with many companies on corporate upskilling, branding events, talent acquisition, as well as consulting services.

AI/Data Transformations with our customized and proven curriculum

Do you need expert help on data strategies and project implementations? 

Hire Data, AI, and Engineering talents from WeCloudData

Our Students

Meet our amazing alumni working in the Data industry

Read our students’ stories on how WeCloudData have transformed their career

Resources

Check out our events and blog posts to learn and connect with like-minded professionals working in the industry

Let’s get together and enjoy the fun from treasure hunting in massive real-world datasets

Read blogs and updates from our community and alumni

Explore different Data Science career paths and how to get started

Blog

Student Blog

Live Twitter Sentiment Analysis

May 26, 2020

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya.

This Live Twitter Sentiment Analyzer helps track present sentiment for a given track word.

In this document, I will describe the work flow I followed to develop this SaaS app.

Contents

  1. Data Pipeline Map
  2. Data Collection
  3. Preparing Data for Data Analysis
  4. Training the Machine Learning Model
  5. Developing the Dashboard

Tools/Software Used

  • AWS: EC2, Kinesis Firehose, and S3
  • Entire App is developed on Python!
  • Python Libraries:
  • Data Extraction/Munging: Tweepy, Nltk, Boto
  • Data Cleaning: Numpy, Pandas
  • Model Development: Nltk, sklearn,
  • Dashboard Development: Panel, Bokeh

1. Data Pipeline Map

Following is the scheme for development and deployment of sentiment analysis app:

Data Collection:

  • Data was streamed using Twitter API from 23-Dec-2019 to 06-Jan-2020.
  • Real-time data was pushed and assembled in AWS-Kinesis messaging queue.
  • Every 15 minutes data was delivered using AWS-Firehose to store in AWS-S3.
  • Data was read from data store AWS-S3 using a subclass of NLTK CorpusReader.
  • Data was ingested in chunks tokenized, lemmatized, and vectorized.
  • Various models were trained to find the best predictive model.
  • Model visualization are developed using Bokeh and Panel.

Model Execution:

  • Dashboard has been deployed with trained model on AWS-EC2 instance for real-time sentiment analysis.
  • The App uses tweepy to read real-time tweets and makes sentiment predictions using the model.

2. Data Collection

Twitter allows access to programmatically read tweets using API for various languages. We are using “tweepy” (python) to read data from Twitter. Data from 23-Dec-2019 to 06-Dec-2020 (~ 36 million Tweets) was collected.

Data Ingestion Schematic diagram

Corpus Scheme:

Data has been stored in AWS-S3 in the directory structure as shown in figure.

twitter-|sentiment|-|Year| -> |Month| -> |Date| -> |Tab delimited files|

Methodology for Sentiment segregation: Twitter allows data stream to be filtered using track words. Emojis are an excellent way to determine emotion from a text. Therefore, emojis have been used as track words to filter for negative sentiment and positive sentiment tweets. This helps in producing noisy signal for true underlying sentiment. Additionally, training on large amount of data helps in clarifying the signal and develop a good general predictive model. The methodology is inspired by the following work: https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

Following emojis/smileys are used to filer tweets for sentiment segregated data:

u”U********” are utf-8 uni-codes for emojis

Additionally, tweets have been filtered to include English language tweets only.

3. Preparing Data for Analysis

NLTK — CorpusReader class has been extended to read the data as pandas DataFrame in chunks. CorpusReader helps in navigating and reading appropriate files by matching regular expressions in complex directory structures.

Example: This code creates list of all the negative data files

Following code read each file from positive and negative file lists and yields chunks of DataFrames containing both emotions (Positive: 1, Negative: 0)

Raw tweets are transformed to numerical vector form as follows:

Tokenization/Vectorization

  • “@” Twitter handles are removed
  • Stop words (a, an, the, to, etc.) are removed
  • URLs in tweets have been removed
  • Emoji and smileys used to filter and get data are removed
  • Tokens are further lemmatized using NLTK WordNetLemmatizer
  • Tokens are vectorized using sklearn TFIDF/Count vectorizer

4. Training the Machine Learning Model

Naive Bayes, and Support Vector Machine models are trained. The data was split into 67% training and 33% testing data. Following four combinations of models were tried:

  • TFIDF Vectorizer with MultinomialNB
  • TFIDF Vectorizer with SVM for Classification
  • Count Vectorizer with MultinomialNB
  • Count Vectorizer with SVM for Classification

Test Accuracy for training models

Test accuracy of all the models was fairly comparable. For final model for execution phase we selected “TFIDF Vectorizer with MultinomialNB”. We opted for MultinomialNB because it has faster execution time.

In final execution Neutral class has been artificially created. Given that response in training data is only for positive and negative sentiment, the prediction classes have been defined as follows:

5. Developing the Dashboard

Bokeh and Panel libraries have been used to develop plots and represent the dashboard.

Bokeh DataTable class is used for populating the live tweets and sentiments:

Bokeh Vbar plot is used to plot sentiment distribution:

Console objects are Plane TextInput and Button Classes

All the Dashboard elements are assembled using panel Row and Column classes. Finally, the panel code is served on an AWS-EC2 instance:

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Udayan, check out his Medium posts here.

Other blogs you might like
Student Blog
The blog is posted by WeCloudData’s student Luis Vieira. I will be showing how to build a real-time dashboard on…
by Student WeCloudData
October 21, 2020
Uncategorized
Take a central role The Bank of Canada has a vision to be “a leading central bank—dynamic, engaged and…
by Shaohua Zhang
May 21, 2020
Uncategorized
Big Data for Data Scientists – Info Session from WeCloudData…
by WeCloudData
November 9, 2019
Previous
Next

Kick start your career transformation

WeCloudData

WeCloudData is the leading data science and AI academy. Our blended learning courses have helped thousands of learners and many enterprises make successful leaps in their data journeys.

Sign up for newsletter
This field is for validation purposes and should be left unchanged.