IBM Capstone Data Engineering Project

Overview

This project explored several data engineering technologies, concepts and skills that I acquired while completing the IBM Data Engineering Professional Certificate. You can find all the screenshots and scripts pertaining to this project on GitHub.

Data Platform Architecture and OLTP Database

Designed and implemented a data platform using MySQL as an OLTP database, and another using MongoDB.

PostgreSQL Data Warehouse

Designed and implemented a data warehouse on PostgreSQL using a star schema and performed various queries.

Data Analytics and IBM Cognos Dashboards

Loaded the data into IBM Cognos Analytics and created dashboards.

ETL & Data Pipeline (Airflow, Python and Bash)

Automated the process of loading data from MySQL to a PostgreSQL data warehouse.

Used Airflow to create a pipeline that analyzes web server logs, extracts the required lines and fields, transforms and loads the data.

Big Data Analytics with PySpark

Used PySpark and data from a webserver to analyze search terms, and loaded a pretrained sales forecasting model to predict the forecast for a future year based on given sales data.

Below is a summary of some of the tasks I performed and some of the screenshots I took during the project.

In the first section of the project, I created a table on MySQL for sales data. And then I inserted sales data from a sales_data.sql into the table. I also queried the table, performed operations and exported the data.

I performed a similar operation with another database in MongoDB. I imported a file into it, performed queries, created an index to improve query performance and exported the database.

I also designed and created a star schema for a database which was supposed to hold ecommerce data on PostgreSQL. Then I performed several queries on the database, from simple select queries to groupingsets, cubes, rollups and created a materialized view.

I imported a dataset into IBM Cognos Dashboards and created dashboards such as a bar graph to show mobile phone sales in each quarter, a line graph to show sales for each month of 2022, and a pie chart to show sales for three product categories.

I automated the process of retrieving the latest records from a MySQL table and inserting them into a PostgreSQL data warehouse. Below are the Python functions that fetch the records, insert them and the output I got after executing the script.

I used Airflow to orchestrate a data pipeline that extracts specific IP addresses from a access log file and loads them into a destination file.

I used PySpark to load a sales prediction model, apply it to a sales data set, and predict the sales for the year 2023.