Creating a Data Warehouse Using Amazon Redshift for StackOverflow Data

November 4, 2020

This blog is posted by WeCloudData's student Sneha Mehrin.

Steps to Create a Data Warehouse and Automate the Process of Loading Pre-Processed Data Using a PySpark Script in EMR


This article is part of the series and a continuation of the previous post, where we processed the streamed data using Spark on EMR.

Why use Redshift?

Redshift is a fully managed data warehouse solution provided by Amazon. It is designed for analytic workloads and delivers fast query and I/O performance for any dataset size by using two key techniques: columnar storage and massively parallel processing.

Why Is Columnar Storage So Important?

  • Transactional operations differ significantly from analytical queries.
  • Transactional operations are indexed and very fast; analytical queries, however, run over huge datasets and can take a long time to compute.
  • Relational databases store data in row form and are indexed by primary key for faster access. This is perfect for insert, append, or upsert operations.
  • Analytical queries, on the other hand, are interested in aggregates (for example, the sum of revenue for a particular region). Performing such a query in a row-based database requires scanning the entire table, including all the unused columns, which is inefficient and performance-heavy.
  • This is addressed by columnar databases, which store data in columns instead of rows. Storing the values of the same column sequentially on disk gives the data warehouse much faster reads.
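To make the row-versus-column difference concrete, here is a tiny Python sketch (a toy illustration of the idea, not how Redshift is actually implemented) using the revenue-by-region example:

```python
# Toy model contrasting a row-oriented layout with a column-oriented one.
rows = [
    {"region": "east", "revenue": 100, "units": 5},
    {"region": "west", "revenue": 250, "units": 9},
    {"region": "east", "revenue": 175, "units": 7},
]

# Row store: the aggregate must walk whole records, touching
# columns (like "units") that the query never asked for.
row_total = sum(r["revenue"] for r in rows if r["region"] == "east")

# Column store: each column is its own contiguous array, so the
# scan reads only the two columns the query actually needs.
columns = {
    "region":  ["east", "west", "east"],
    "revenue": [100, 250, 175],
    "units":   [5, 9, 7],
}
col_total = sum(
    rev
    for region, rev in zip(columns["region"], columns["revenue"])
    if region == "east"
)

print(row_total, col_total)  # both are 275; the column scan never touches "units"
```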

To Summarise, Here Are the Key Advantages of Redshift

  • Read less data
  • Better compression
  • Perfect for OLAP
  • Parallel processing, by distributing queries across multiple nodes

How to Create a Redshift Cluster?

Amazon has made it insanely easy to create a Redshift cluster. Below are the key steps for creating one.

Step 1: Log in to the console, choose Redshift as the service, and click Create cluster

Step 2: Configure the user name and password

Step 3: Choose an IAM role for Redshift (this can be the default IAM role)

Step 4: Choose a security group and make the cluster publicly accessible

The security group step is really important, as without it you cannot connect to Redshift using DataGrip.

In the security group, make sure inbound rules are configured for Redshift.

You can set the source to either My IP or the Anywhere option.

Step 5: Configure DataGrip to connect to Redshift

DataGrip is designed to query, create and manage databases. If you have a JDBC connection, you can easily connect to any database.

You can install DataGrip from https://www.jetbrains.com/help/datagrip/installation-guide.html

Step 6: From the properties panel of the Redshift cluster, copy the JDBC URL

Step 7: Provide the URL, user name, and password, and test the connection in DataGrip

Step 8: Create a table in Redshift with the query below

The schema should match that of the source data (in this case, our Stack Overflow data).

create table stackoverflow (
  question_id   varchar(13),
  is_answered   varchar(10),
  view_count    int,
  answer_count  int,
  score         int,
  creation_date date
);

Our basic Redshift cluster creation and configuration is done!!

Next Steps

  • We need to write our pre-processed dataframe into the Redshift cluster using PySpark. This will be the last step in the script we created in the previous post.
  • Below is the snippet of the code for writing our spark dataframe into our redshift cluster.

We use spark-redshift for this purpose, a library for loading data into Spark SQL DataFrames from Amazon Redshift and writing them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.

However, this is easier said than done, and I think this was the most challenging part of this project.

The key dependencies for PySpark to be able to call the spark-redshift library are as follows:

  • To connect to Redshift using the Amazon Redshift JDBC driver, the application or Java code must be able to access the driver JAR files. The driver JARs can be downloaded from http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver and need to be available on the EMR cluster before the PySpark script runs.
  • The PySpark script must be able to find the necessary packages at the moment it executes. org.apache.spark:spark-avro_2.11:2.4.3 and com.databricks:spark-redshift_2.11:2.0.1 are the two packages that need to be added as part of the spark-submit step.
  • Since Redshift uses S3 as temporary storage, you need to set up proper access keys in the script. To achieve this, we add our AWS access key and secret access key to the PySpark script.

After adding all these dependencies, our final script looks like below :
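The original script appears as a screenshot in the source post and isn't reproduced here. As a rough, hypothetical sketch of such a job (the bucket names, cluster endpoint, and all credentials below are placeholders, not the author's actual values), the write step might look like this:

```python
# Hypothetical sketch only: bucket names, the cluster endpoint, and all
# credentials below are placeholders, not the values used in the post.
def build_redshift_jdbc_url(host, port, database, user, password):
    """Assemble the JDBC URL that the spark-redshift data source expects."""
    return (f"jdbc:redshift://{host}:{port}/{database}"
            f"?user={user}&password={password}")

def write_to_redshift(df, jdbc_url, temp_s3_dir, table="stackoverflow"):
    """Append a pre-processed DataFrame to a Redshift table, staging the
    data through S3 so Redshift can COPY it in bulk."""
    (df.write
       .format("com.databricks.spark.redshift")
       .option("url", jdbc_url)
       .option("dbtable", table)
       .option("tempdir", temp_s3_dir)  # S3 staging area used by the COPY
       .mode("append")
       .save())

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stackoverflow-to-redshift").getOrCreate()

    # S3 access keys so spark-redshift can use the tempdir (placeholders).
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

    # Pre-processed output from the previous post (placeholder path).
    df = spark.read.json("s3://your-bucket/preprocessed/")

    url = build_redshift_jdbc_url(
        "your-cluster.abc123.us-east-1.redshift.amazonaws.com",
        5439, "dev", "awsuser", "YOUR_PASSWORD")
    write_to_redshift(df, url, "s3n://your-bucket/temp/")

# On the EMR cluster this script would end with a call to main().
```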

We also have a bash script which calls this PySpark script with all the necessary packages and JARs.
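The wrapper itself isn't shown in the original post. A plausible Execute.sh, assuming the driver JAR and the script sit in /home/hadoop (the JAR file name and script name below are placeholders), could look like:

```shell
#!/bin/bash
# Hypothetical Execute.sh: the JAR file name and script path are placeholders.
# Submits the PySpark job with the Redshift JDBC driver JAR plus the
# avro and spark-redshift packages the library depends on.
spark-submit \
  --jars /home/hadoop/RedshiftJDBC42.jar \
  --packages org.apache.spark:spark-avro_2.11:2.4.3,com.databricks:spark-redshift_2.11:2.0.1 \
  /home/hadoop/load_to_redshift.py
```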

What Next?

  • Now that we have our scripts ready, we need to test them on an EMR cluster before we automate everything with boto3.

Let’s see how to do that!!

Key Steps for testing in EMR Cluster

Step 1: Create a dummy EMR cluster with Spark added.

Step 2: Copy the JAR file and our two scripts into the hadoop folder of the EMR cluster

You can run the commands below in a terminal to copy the files from your local machine to the EMR cluster.
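The exact commands appear as a screenshot in the source post. Using the bigdata.pem key and the cluster address from the ssh step that follows, they would look roughly like this (the JAR and script file names are placeholders):

```shell
# Copy the JDBC driver JAR and the two scripts from the local machine
# into the hadoop user's home directory on the EMR master node.
# File names are placeholders; key and host match the ssh step below.
scp -i bigdata.pem RedshiftJDBC42.jar \
    hadoop@ec2-3-88-110-90.compute-1.amazonaws.com:/home/hadoop/
scp -i bigdata.pem load_to_redshift.py Execute.sh \
    hadoop@ec2-3-88-110-90.compute-1.amazonaws.com:/home/hadoop/
```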

If the copy is successful, your terminal will show a confirmation message.

Step 3: SSH into the EMR cluster and check that the files exist

ssh -i bigdata.pem hadoop@ec2-3-88-110-90.compute-1.amazonaws.com

You can use the ls command to list all the files in the /home/hadoop directory.

Step 4: Execute the bash script to run our PySpark script

sudo sh Execute.sh

If your script is successful, you will see a success message in your terminal.

Step 5: After this is done, you can check the records in your stackoverflow table in DataGrip

If you get any errors while loading data into Redshift, they won't be very evident from the terminal.

You will need to run the query below to understand the error.

SELECT le.starttime, d.query, d.line_number, d.colname, d.value, le.raw_line, le.err_reason
FROM stl_loaderror_detail d
JOIN stl_load_errors le ON d.query = le.query
ORDER BY le.starttime DESC;

Automating the Above Steps Using boto3

Now that we know our script works, we need to automate it. I want the EMR cluster to be transient, as it only runs for about an hour every day.

So I created two scripts: one to create the EMR cluster and copy all the scripts from my S3 location, and a second to execute the bash script that runs the PySpark script.
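Those two scripts aren't reproduced in the post. A condensed, hypothetical sketch of the first one, using boto3's run_job_flow (the instance types, IAM roles, and bucket names below are placeholders, not the author's actual configuration), might look like:

```python
# Hypothetical sketch of the cluster-creation script; instance types, roles,
# and the S3 bucket below are placeholders, not the author's actual values.
def build_job_flow_args(name, log_uri, script_bucket):
    """Arguments for emr.run_job_flow: a small cluster with Spark that
    pulls the job scripts down from S3 via a bootstrap action."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Keep the cluster alive so the second script can run the job;
            # it is terminated once the daily load finishes.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [{
            "Name": "copy-scripts-from-s3",
            "ScriptBootstrapAction": {
                "Path": f"s3://{script_bucket}/bootstrap/copy_scripts.sh",
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def main():
    import boto3
    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(**build_job_flow_args(
        "stackoverflow-daily", "s3://your-bucket/emr-logs/", "your-bucket"))
    print("Launched cluster:", response["JobFlowId"])

# Scheduled via cron on the EC2 instance, this script would call main().
```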

We need to schedule these scripts on an EC2 instance using cron jobs.

  • I intend to schedule the first script, which launches the cluster, at 12:00 am every day. This ensures it picks up all the streams created the previous day.
  • The second script, which runs the PySpark job, should run one hour after that.
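Assuming the two scripts are saved as create_cluster.py and run_job.py (hypothetical names and paths), the crontab on the EC2 instance could look roughly like this:

```shell
# Hypothetical crontab entries: script names and paths are placeholders.
# 12:00 am every day: launch the transient EMR cluster and copy scripts from S3.
0 0 * * * /usr/bin/python3 /home/ec2-user/create_cluster.py
# 1:00 am: execute the bash script that runs the PySpark job.
0 1 * * * /usr/bin/python3 /home/ec2-user/run_job.py
```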

The reason I had to do it this way is that AWS runs the steps in a random order to optimize resources, so I was not able to run them sequentially.

Now, let’s finally get to the data visualization part using Einstein Analytics.

This is covered in detail in this post.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Sneha, check out her Medium posts here.
