Are you interested in building a data engineering project from scratch, but not sure where to start? Look no further! In this blog post, I'll share my experience working on my first data engineering project and the lessons I learned along the way.
Introduction
At the end of 2022 I made the decision that I wanted to make a "career switch" from being a Cloud Native Engineer to becoming a Data Engineer.
I had made a rough roadmap for myself on how I thought I could make the switch. I first studied for and passed the DP-900 Azure Data Fundamentals exam and right away started studying for the DP-203 Azure Data Engineering exam. I quickly noticed that my SQL was rusty and that I lacked a solid base of knowledge and experience with Python.
After asking for advice and looking around on the internet, I decided to do the Data Engineering with Python Career Track from Datacamp, but before I could start it I first completed the SQL Fundamentals, Python Fundamentals and Python Programming Skill Tracks. Each track consists of short videos explaining a subject, followed by hands-on exercises to practice it. I really enjoyed the Datacamp courses and would recommend them to anyone who wants to learn (more about) SQL, Python and Data Engineering.
Instead of focusing on the DP-203 exam, I decided I first wanted to get more hands-on experience with SQL and Python. So I started working on my first data project and made sure to use different data sources and data formats to extract, transform and load.
Project Purpose
For my first data engineering project, I wanted to build a project to gain hands-on experience with Python and SQL, but also use it for my portfolio.
The question was: what data am I going to use? You can find free data everywhere for a data project, but I wanted to have my own data. A couple of months ago I placed hygrometers in several rooms in the house, which measure the humidity and temperature. That is when I got the idea to gather the humidity data, combine it with outside weather data and my electricity and gas usage, and get some insights out of it.
The project's main focus is to extract, transform, and load the data into a PostgreSQL database.
Tools and Technologies
Cloud Services
When it comes to cloud providers, my expertise and experience lie with Azure. So this project uses the following Azure services:
Azure Storage Account
Azure PostgreSQL
Programming Language
Since one of my goals was gaining hands-on experience with Python and SQL in a data engineering project, it is no surprise that I used Python as the main language for this project.
Data Platform
I did not want to make the setup too complex for my first project, because one of the goals is to gain hands-on experience with Python and SQL. I chose Github Action workflows to run the data pipelines, because of my experience with Github and Github Actions.
Versioning
Since I chose Github Actions to run the data pipelines, the most logical choice was to also use Github for versioning.
Project Setup
Directory Structure
The project's structure is:
.github/workflows: This directory contains the Github Workflows.
More information about the specific workflows can be found in the Workflows section.
/data: This directory contains the CSV file with humidity data.
/etl: This directory contains the code for the ETL process. It contains:
__init__.py: This file initialises the etl package and imports modules from the /etl directory.
constants.py: Contains all the variables used across the project.
humidity_data_etl_pipeline.py: Contains the user-defined functions specific to transforming the humidity data.
utils.py: Contains generic user-defined functions used across the project.
weather_data_etl_pipeline.py: Contains the user-defined functions specific to transforming the weather data fetched from weather API.
/scripts/bash: This directory contains a Bash script that uploads the CSV file to an Azure storage account.
requirements.txt: This file contains the Python packages that are required to run the project.
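Putting these paths together, the directory layout looks roughly like this:

.
├── .github/workflows/
├── data/
├── etl/
│   ├── __init__.py
│   ├── constants.py
│   ├── humidity_data_etl_pipeline.py
│   ├── utils.py
│   └── weather_data_etl_pipeline.py
├── scripts/
│   └── bash/
└── requirements.txt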
Package
In the etl package I have different modules; the diagram below shows the dependencies between the modules.
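As an illustration of how the modules hang together, here is a minimal sketch of what the etl package's __init__.py could look like; the actual imports in the project may differ.
######################
#   Example sketch   #
######################
# etl/__init__.py - re-export the modules so they can be used as e.g. `from etl import utils`
from etl import constants
from etl import utils
from etl import humidity_data_etl_pipeline
from etl import weather_data_etl_pipeline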
Workflows
The project has four workflows:
codeql-code-analysis.yml: This workflow runs semantic code analysis against the repository's source code to find security vulnerabilities.
flake8-lint.yml: This workflow runs the flake8 linter on the Python code in the /etl directory. It runs:
On demand/ Manually
With a pull request
On push to the main branch
humidity-data-etl-pipeline.yml: This workflow extracts, transforms, and loads the humidity data from the uploaded CSV file into the Azure PostgreSQL database. The workflow uses the constants and functions defined in the /etl directory. It runs:
On demand/ Manually
upload-humidity-data.yml: This workflow executes the Bash script located in the /scripts/bash directory, which uploads the CSV file to an Azure storage account. It runs:
On demand/ Manually
weather-data-etl-pipeline.yml: This workflow fetches weather data from a weather API, transforms the data using the constants and functions defined in the /etl directory, and loads the data into the same Azure PostgreSQL database as the humidity data. It runs:
On demand/ Manually
Data Engineering Processes
ETL Process
For this project I used the ETL process, which is a combination of three processes:
Extract
Raw data is extracted from one or multiple sources
E.g. JSON, XML, CSV, Relational databases
Transform
The raw data is adjusted to the organisation's requirements
E.g. Converting data types, Cleaning data, Combining data
Load
The data that has been transformed can be made available to users
E.g. Written to a database/ data warehouse.
For my project the overview is pretty simple, as you can see. I have (for now) two data sources that provide raw data (CSV and JSON), and with the use of Github Workflows I run data pipelines to extract the data, transform it and write it to a PostgreSQL database.
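To make that flow concrete, here is a minimal sketch of how such a pipeline can be structured in Python. The function names and the file name are hypothetical and heavily simplified; the real logic lives in the /etl modules described earlier.
######################
#   Example sketch   #
######################
import pandas as pd

def extract() -> pd.DataFrame:
    # Read the raw data, e.g. from a CSV file (placeholder file name)
    return pd.read_csv("humidity.csv")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Clean the data, e.g. drop rows with missing measurements
    return df.dropna()

def load(df: pd.DataFrame) -> None:
    # Write the transformed data to the target database (placeholder)
    print(f"Loading {len(df)} rows into PostgreSQL")

if __name__ == "__main__":
    load(transform(extract()))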
Extract
The two data sources are:
An Azure Storage Account to which a CSV file with the humidity data from my living room is uploaded
A weather API that provides data in JSON format
Azure Authentication
To fetch a file from the storage account, I first need to authenticate. I used the DefaultAzureCredential class from the azure.identity package.
######################
# Basic example code #
######################
# Install the azure identity package
pip install azure-identity
# Import the DefaultAzureCredential Class
from azure.identity import DefaultAzureCredential
# Create credentials object
credential = DefaultAzureCredential()
The DefaultAzureCredential class relies on the following environment variables:
AZURE_CLIENT_ID
AZURE_CLIENT_SECRET
AZURE_TENANT_ID
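As a small sanity check (not part of azure.identity itself, just a sketch), the pipeline can verify that these variables are set before creating the credential:
######################
#   Example sketch   #
######################
import os
from azure.identity import DefaultAzureCredential

# Fail early if one of the required environment variables is missing
required = ["AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise EnvironmentError("Missing environment variables: " + ", ".join(missing))

credential = DefaultAzureCredential()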
Azure Storage Account
To connect to my storage account I used the BlobClient class from the azure.storage.blob package.
######################
# Basic example code #
######################
# Install the blob package
pip install azure-storage-blob
# Import the BlobClient Class
from azure.storage.blob import BlobClient
# Get BlobClient object
storage_account_name = "examplestorageaccount"
account_url = "https://" + storage_account_name + ".blob.core.windows.net"
storage_container_name = "example-container"
blob_name = "example_blob"
credential = DefaultAzureCredential()
blob_client = BlobClient(
    account_url,
    container_name=storage_container_name,
    blob_name=blob_name,
    credential=credential
)
# Download the blob contents as bytes
blob_content = blob_client.download_blob().readall()
Weather API
To get data from an API we can use the Requests HTTP library.
######################
# Basic example code #
######################
# Install Requests
python -m pip install requests
# Import Requests
import requests
# Get the response object
response = requests.get('https://exampleapi.com/api')
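Most weather APIs also expect query parameters (for example a location and an API key), and it is good practice to check the HTTP status before using the response. A small sketch with a placeholder URL and parameter names:
######################
#   Example sketch   #
######################
import requests

# Placeholder URL and parameters; a real weather API defines its own
response = requests.get(
    "https://exampleapi.com/api",
    params={"location": "Amsterdam", "key": "example_api_key"},
    timeout=30,
)
# Raise an exception for 4xx/5xx status codes
response.raise_for_status()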
Transform
CSV
To transform the CSV data, I used a Pandas DataFrame.
######################
# Basic example code #
######################
# Import pandas
import pandas as pd
# Import io to wrap the downloaded bytes in a file-like object
import io
# Convert the downloaded CSV bytes into a DataFrame
df = pd.read_csv(io.BytesIO(blob_content))
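Once the CSV is in a DataFrame, the transformation step typically converts data types and cleans the data, as mentioned earlier. A short sketch of such steps; the column names here are made up and will differ from the actual humidity CSV:
######################
#   Example sketch   #
######################
# Hypothetical column names, for illustration only
df["timestamp"] = pd.to_datetime(df["timestamp"])        # convert to datetime
df["humidity"] = df["humidity"].astype(float)            # ensure a numeric type
df = df.dropna(subset=["timestamp", "humidity"])         # drop incomplete rows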
Response object
In my case the weather API returned JSON, so I used the built-in JSON decoder from the Requests library. As with the CSV data, I converted the JSON object into a Pandas DataFrame.
######################
# Basic example code #
######################
# Get the JSON object from the response object
json_response = response.json()
# Convert to a dataframe
df = pd.DataFrame(json_response)
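If the JSON returned by the API is nested, which is common for weather APIs, pandas can flatten it first with json_normalize. A small sketch, assuming a nested response:
######################
#   Example sketch   #
######################
# Flatten nested JSON into columns such as "main.temp" (structure is hypothetical)
df = pd.json_normalize(json_response)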
Load
PostgreSQL
For my project I am using an Azure PostgreSQL Database, so I am using Psycopg to connect to the instance. Psycopg is the most popular PostgreSQL database adapter for the Python programming language.
######################
# Basic example code #
######################
# Install Psycopg
pip install psycopg2
# Import Psycopg and its sql module for safely composing queries
import psycopg2
from psycopg2 import sql
# Create a connection object
conn = psycopg2.connect(
    host=host,
    user=user,
    password=password,
    dbname=dbname,
    sslmode=sslmode
)
# Create a cursor to execute PostgreSQL commands
cursor = conn.cursor()
# Execute a PostgreSQL command (the table name is passed as a safely quoted identifier)
cursor.execute(
    sql.SQL("SELECT * FROM {};").format(sql.Identifier(table_name))
)
# Fetch the results
rows = cursor.fetchall()
# Close the cursor and connection
cursor.close()
conn.close()
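The example above only reads from a table; for the load step the transformed rows also have to be written to PostgreSQL. A sketch using the execute_values helper from psycopg2.extras, run before the connection is closed; the table and column names are placeholders:
######################
#   Example sketch   #
######################
from psycopg2.extras import execute_values

# Insert the DataFrame rows into a (placeholder) table
insert_query = """
    INSERT INTO humidity_measurements (measured_at, temperature, humidity)
    VALUES %s;
"""
rows = list(df.itertuples(index=False, name=None))

cursor = conn.cursor()
execute_values(cursor, insert_query, rows)
conn.commit()
cursor.close()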
Lessons Learned
Before I start writing any code for my next project, I will plan to:
Write unit tests during the development process
Write data quality checks during the development process (a minimal example is sketched below)
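For example, a simple data quality check can be written as a unit test with pytest; the column name and range here are hypothetical:
######################
#   Example sketch   #
######################
import pandas as pd

def test_humidity_values_are_within_range():
    # Hypothetical transformed data; in practice this would come from the pipeline
    df = pd.DataFrame({"humidity": [45.0, 55.3, 60.1]})
    # Relative humidity should always be between 0 and 100 percent
    assert df["humidity"].between(0, 100).all()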
What's up next?
In the project I have added a TODO.md with tasks I still want to implement.
After this first project I will start working on different setups, using the same data but with tools like Airflow, Mage, Azure Data Factory, Azure Synapse, Azure Databricks, PySpark, etc. This way I can get more hands-on experience with different tools and technologies.
So keep an eye on my blog and/or LinkedIn Posts!
References
Fundamentals of Data Engineering by Joe Reis & Matt Housley