to lead the cloud industry with a whopping 32% share, thanks to its early market entry, robust technology and comprehensive service offerings. Nonetheless, many users find AWS difficult to navigate, and this discontent has led more firms and organisations to prefer its competitors, Microsoft Azure and Google Cloud Platform.
Despite its steeper learning curve and less intuitive interface, AWS remains the leading cloud provider because of its reliability, hybrid cloud capabilities and unmatched breadth of services. More importantly, adopting the right set of strategies can significantly reduce configuration complexity, streamline workflows, and boost performance.
In this article, I’ll introduce an efficient approach to setting up an entire ETL pipeline with orchestration on AWS, based on my own experience. It should also give you a refreshed view on producing data with AWS, or make the configuration feel less of a struggle if this is your first time using AWS for such tasks.
Strategy for Designing an Efficient Data Pipeline
AWS has the most comprehensive ecosystem, with a vast range of services. Building a production-ready data warehouse on AWS requires at least the following services:
- IAM – Although this service isn’t part of the workflow itself, it’s the foundation for accessing all the other services.
- AWS S3 – Data Lake storage
- AWS Glue – ETL processing
- Amazon Redshift – Data Warehouse
- CloudWatch – Monitoring and logging
You also need access to Airflow if you have to schedule more complex dependencies and conduct advanced retries for error handling, although Redshift can handle some basic cron jobs on its own.
To make your work easier, I highly recommend installing an IDE (Visual Studio Code or PyCharm, or of course your own favourite IDE). An IDE dramatically improves your efficiency with complex Python code, local testing/debugging, version control integration and team collaboration. In the next section, I’ll provide step-by-step configurations.
Initial Setup
Here are the steps for the initial configuration:
- Launch a virtual environment in your IDE
- Install dependencies – mainly, we need to install the libraries that will be used later on.
pip install apache-airflow==2.7.0 boto3 pandas pyspark sqlalchemy
- Install AWS CLI – this step allows you to write scripts to automate various AWS operations and makes managing AWS resources more efficient.
- AWS Configuration – run aws configure and enter your IAM user credentials when prompted:
  - AWS Access Key ID: from your IAM user.
  - AWS Secret Access Key: from your IAM user.
  - Default region: us-east-1 (or your selected region).
  - Default output format: json.
Once configured, you can verify that boto3 picks up the credentials with the short check below.
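This is a minimal sanity check of my own (not part of the original setup steps); it assumes the default credential chain and region you just configured, and simply asks STS who you are:
# verify_aws_setup.py -- confirm boto3 can authenticate with the configured credentials
import boto3

def verify_credentials():
    # STS returns the account and IAM identity tied to the active credentials
    sts = boto3.client('sts')
    identity = sts.get_caller_identity()
    print(f"Account: {identity['Account']}")
    print(f"Caller ARN: {identity['Arn']}")

if __name__ == "__main__":
    verify_credentials()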
- Integrate Airflow – here are the steps:
  - Initialize Airflow
  - Create DAG files in Airflow
  - Run the web server at http://localhost:8080 (login: admin/admin)
  - Open another terminal tab and start the scheduler
export AIRFLOW_HOME=$(pwd)/airflow

# Initialize Airflow
airflow db init

# Create the admin user
airflow users create \
    --username admin \
    --password admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com

# Run the webserver at http://localhost:8080
airflow webserver --port 8080

# In another terminal tab, start the scheduler
airflow scheduler
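To confirm that the scheduler picks up DAG files, you can drop a minimal DAG into $AIRFLOW_HOME/dags. This is just a sketch of my own; the file name and the dag_id hello_airflow are arbitrary placeholders:
# $AIRFLOW_HOME/dags/hello_dag.py -- a minimal DAG to verify the installation
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id='hello_airflow',
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the UI
    catchup=False,
) as dag:
    # A single no-op task; if it shows up in the UI, the setup works
    EmptyOperator(task_id='noop')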
Development Workflow: COVID-19 Data Case Study
I’m using JHU’s public COVID-19 dataset (CC BY 4.0 licensed) for demonstration purposes. You can refer to the data here.
The chart below shows the workflow from data ingestion to loading the data into Redshift tables in the development environment.
Data Ingestion
In the first step, ingesting the data into AWS S3, I processed the data by melting it into long format and converting the date format. I saved the data in Parquet format to improve storage efficiency, enhance query performance and reduce storage costs. The code for this step is below:
import os
import sys

import boto3
import pandas as pd

def process_covid_data():
    try:
        # Load raw data
        url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
        df = pd.read_csv(url)

        # --- Data Processing ---
        # 1. Melt to long format
        df = df.melt(
            id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
            var_name='date_str',
            value_name='confirmed_cases'
        )

        # 2. Convert dates (JHU format: MM/DD/YY) and drop rows that fail to parse
        df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y', errors='coerce')
        df = df.dropna(subset=['date'])

        # 3. Save as partitioned Parquet
        output_dir = "covid_processed"
        df.to_parquet(
            output_dir,
            engine='pyarrow',
            compression='snappy',
            partition_cols=['date']
        )

        # 4. Upload to S3
        s3 = boto3.client('s3')
        total_files = 0
        for root, _, files in os.walk(output_dir):
            for file in files:
                local_path = os.path.join(root, file)
                # Build the S3 key with forward slashes regardless of the local OS
                s3_path = 'raw/covid/' + os.path.relpath(local_path, output_dir).replace(os.sep, '/')
                s3.upload_file(
                    Filename=local_path,
                    Bucket='my-dev-bucket',
                    Key=s3_path
                )
            total_files += len(files)

        print(f"Successfully processed and uploaded {total_files} Parquet files")
        print(f"Data covers from {df['date'].min()} to {df['date'].max()}")
        return True

    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        return False

if __name__ == "__main__":
    process_covid_data()
After running the Python code, you should be able to see the Parquet files in your S3 bucket under the raw/covid/ prefix.
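If you’d rather verify this from Python than click through the console, a small boto3 sketch like the following (assuming the same my-dev-bucket name) lists what landed under the prefix:
# list_raw_covid.py -- list the uploaded Parquet files under raw/covid/
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

count = 0
for page in paginator.paginate(Bucket='my-dev-bucket', Prefix='raw/covid/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])
        count += 1

print(f"{count} objects found")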

ETL Pipeline Development
AWS Glue is mainly used for ETL pipeline development. Although it can also be used for data ingestion even when the data hasn’t been loaded to S3 yet, its strength lies in processing data once it’s in S3 for data warehousing purposes. Here’s the PySpark script for the data transformation:
# transform_covid.py
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import current_date

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw Parquet data from S3
df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-dev-bucket/raw/covid/"]},
    format="parquet"
).toDF()

# Add transformations here
df_transformed = df.withColumn("load_date", current_date())

# Write to processed zone
df_transformed.write.parquet(
    "s3://my-dev-bucket/processed/covid/",
    mode="overwrite"
)
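The script above is meant to run as a Glue job. Once the job is created in the Glue console, you can also trigger and monitor it from your local environment with boto3 instead of clicking through the console; the job name covid_dev_transform below is a placeholder of mine, not something defined earlier in this article:
# run_glue_job.py -- trigger a Glue job and poll until it finishes
import time

import boto3

glue = boto3.client('glue')

# Start the job; replace the placeholder name with whatever you called it in the console
run = glue.start_job_run(JobName='covid_dev_transform')
run_id = run['JobRunId']

# Poll the run state every 30 seconds
while True:
    status = glue.get_job_run(JobName='covid_dev_transform', RunId=run_id)
    state = status['JobRun']['JobRunState']
    print(f"Job state: {state}")
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        break
    time.sleep(30)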

The next step is to load the data into Redshift. In the Redshift console, click “Query editor v2” on the left side; there you can edit your SQL code and run the Redshift COPY.
-- Create a table covid_data in the dev schema
CREATE TABLE dev.covid_data (
    "Province/State" VARCHAR(100),
    "Country/Region" VARCHAR(100),
    "Lat" FLOAT8,
    "Long" FLOAT8,
    date_str VARCHAR(100),
    confirmed_cases FLOAT8
)
DISTKEY("Country/Region")
SORTKEY(date_str);

-- COPY data into Redshift
COPY dev.covid_data (
    "Province/State",
    "Country/Region",
    "Lat",
    "Long",
    date_str,
    confirmed_cases
)
FROM 's3://my-dev-bucket/processed/covid/'
IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
REGION 'your-region'
FORMAT PARQUET;
Then you’ll see the data successfully loaded into the data warehouse.
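As an optional check from Python, the Redshift Data API can run a quick row count against the new table. This is only a sketch; the cluster identifier, database and DbUser are placeholders to replace with your own (for Redshift Serverless you would pass WorkgroupName instead of ClusterIdentifier and DbUser):
# validate_load.py -- count rows in dev.covid_data via the Redshift Data API
import time

import boto3

client = boto3.client('redshift-data')

# Placeholders: replace with your cluster, database and user
resp = client.execute_statement(
    ClusterIdentifier='my-dev-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql='SELECT COUNT(*) FROM dev.covid_data;'
)
statement_id = resp['Id']

# Wait for the statement to finish
while True:
    desc = client.describe_statement(Id=statement_id)
    if desc['Status'] in ('FINISHED', 'FAILED', 'ABORTED'):
        break
    time.sleep(2)

if desc['Status'] == 'FINISHED':
    result = client.get_statement_result(Id=statement_id)
    print("Row count:", result['Records'][0][0]['longValue'])
else:
    print("Statement failed:", desc.get('Error'))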

Pipeline Automation
The simplest way to automate your data pipeline is to schedule jobs in the Redshift query editor v2 by creating a stored procedure (I have a more detailed introduction to SQL stored procedures; you can refer to this article).
CREATE OR REPLACE PROCEDURE dev.run_covid_etl()
AS $$
BEGIN
    TRUNCATE TABLE dev.covid_data;
    COPY dev.covid_data
    FROM 's3://my-dev-bucket/raw/covid'
    IAM_ROLE 'arn:aws:iam::your-account-id:role/RedshiftLoadRole'
    REGION 'your-region'
    FORMAT PARQUET;
END;
$$ LANGUAGE plpgsql;

Alternatively, you can use Airflow to run the scheduled jobs.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

default_args = {
    'owner': 'data_team',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'retries': 2
}

with DAG(
    'redshift_etl_dev',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:
    run_etl = RedshiftSQLOperator(
        task_id='run_covid_etl',
        redshift_conn_id='redshift_dev',
        sql='CALL dev.run_covid_etl()',
    )
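The DAG above assumes an Airflow connection named redshift_dev already exists. You can create it in the Airflow UI (Admin → Connections) or register it programmatically; the sketch below is my own and uses placeholder host and credentials that you would replace with your cluster’s values:
# create_redshift_conn.py -- register the 'redshift_dev' connection used by the DAG
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id='redshift_dev',
    conn_type='redshift',
    host='your-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com',  # placeholder endpoint
    schema='dev',
    login='awsuser',
    password='your-password',
    port=5439,
)

session = settings.Session()
# Only add the connection if it does not already exist
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()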
Production Workflow
An Airflow DAG is a powerful way to orchestrate the entire ETL pipeline when there are many dependencies, and it is also good practice in a production environment.
After developing and testing your ETL pipeline, you can automate your tasks in the production environment using Airflow.

Here is a checklist of the key preparation steps for a successful deployment in Airflow (a small preflight script to verify some of them is sketched right after this list):
- Create the S3 bucket my-prod-bucket
- Create the Glue job prod_covid_transformation in the AWS Console
- Create the Redshift stored procedure prod.load_covid_data()
- Configure the Airflow connections
- Configure SMTP for emails in airflow.cfg
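As a convenience of my own (not part of the original checklist), a short boto3 script can confirm that the production bucket and the Glue job exist before you enable the DAG; the names match the checklist above:
# preflight_check.py -- verify the production bucket and Glue job exist before enabling the DAG
import boto3
from botocore.exceptions import ClientError

def check_bucket(name):
    try:
        boto3.client('s3').head_bucket(Bucket=name)
        print(f"S3 bucket '{name}' is reachable")
    except ClientError as e:
        print(f"S3 bucket '{name}' check failed: {e}")

def check_glue_job(name):
    try:
        boto3.client('glue').get_job(JobName=name)
        print(f"Glue job '{name}' exists")
    except ClientError as e:
        print(f"Glue job '{name}' check failed: {e}")

if __name__ == "__main__":
    check_bucket('my-prod-bucket')
    check_glue_job('prod_covid_transformation')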
Then the deployment of the data pipeline in Airflow looks like this:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator
from airflow.operators.email import EmailOperator

# 1. DAG CONFIGURATION
default_args = {
    'owner': 'data_team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2023, 1, 1)
}

# 2. DATA INGESTION FUNCTION
def load_covid_data():
    import pandas as pd

    url = "https://github.com/CSSEGISandData/COVID-19/raw/master/archived_data/archived_time_series/time_series_19-covid-Confirmed_archived_0325.csv"
    df = pd.read_csv(url)

    df = df.melt(
        id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
        var_name='date_str',
        value_name='confirmed_cases'
    )
    df['date'] = pd.to_datetime(df['date_str'], format='%m/%d/%y')

    # Writing directly to S3 requires the s3fs package to be installed
    df.to_parquet(
        's3://my-prod-bucket/raw/covid/',
        engine='pyarrow',
        partition_cols=['date']
    )

# 3. DAG DEFINITION
with DAG(
    'covid_etl',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
) as dag:
    # Task 1: Ingest Data
    ingest = PythonOperator(
        task_id='ingest_data',
        python_callable=load_covid_data
    )

    # Task 2: Transform with Glue
    transform = GlueJobOperator(
        task_id='transform_data',
        job_name='prod_covid_transformation',
        script_args={
            '--input_path': 's3://my-prod-bucket/raw/covid/',
            '--output_path': 's3://my-prod-bucket/processed/covid/'
        }
    )

    # Task 3: Load to Redshift (uses the default Redshift connection unless redshift_conn_id is set)
    load = RedshiftSQLOperator(
        task_id='load_data',
        sql="CALL prod.load_covid_data()"
    )

    # Task 4: Notifications
    notify = EmailOperator(
        task_id='send_email',
        to='your-email-address',
        subject='ETL Status: {{ ds }}',
        html_content='ETL job completed: View Logs'
    )

    # Define the task order
    ingest >> transform >> load >> notify
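Before enabling the daily schedule, you can smoke-test the whole DAG locally. One option (assuming Airflow 2.5 or newer, which covers the 2.7.0 pinned earlier) is dag.test(), which runs every task in-process for a single logical date:
# Append to the bottom of the DAG file to execute one full run locally,
# without needing the scheduler or webserver to be running.
if __name__ == "__main__":
    dag.test()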
My Final Thoughts
Although some users, especially those who are new to the cloud and looking for simple solutions, tend to be daunted by AWS’s high barrier to entry and overwhelmed by its vast selection of services, it is worth the time and effort, and here are the reasons:
- The process of configuring, designing, building and testing the data pipelines gives you a deep understanding of a typical data engineering workflow. These skills will benefit you even if you build your projects on other cloud services, such as Azure, GCP and Alibaba Cloud.
- AWS’s mature ecosystem and the huge array of services it offers let users customise their data architecture strategies and enjoy more flexibility and scalability in their projects.
Thanks for reading! I hope this article helps you build your cloud-based data pipeline!