
How to Build Simple ETL Pipelines With GitHub Actions


ETLs don't have to be complex. If yours are simple, use GitHub Actions.

Photo by Roman Synkevych 🇺🇦 on Unsplash

If you're into software development, you probably know what GitHub Actions is. It's a utility from GitHub for automating development tasks, or, in popular terms, a DevOps tool.

But people rarely use it for building ETL pipelines.

The first things that come to mind when discussing ETLs are Airflow, Prefect, or similar tools. They are, without a doubt, the best out there for task orchestration. But many of the ETLs we build are simple, and hosting a separate tool for them is often overkill.

You can use GitHub Actions instead.

This article focuses on GitHub Actions. But if you're on Bitbucket or GitLab, you can use their respective alternatives too.

We can run our Python, R, or Julia scripts on GitHub Actions. So as a data scientist, you don't have to learn a new language or tool for this. You can even get email notifications when any of your ETL tasks fail.

You still get 2,000 minutes of computation per month on a free account. If you can estimate that your ETL workload fits within this range, GitHub Actions is worth a try; a pipeline that runs once a day and takes about five minutes, for instance, uses only around 150 minutes a month.

How do we start building ETLs on GitHub Actions?

Getting started with GitHub Actions is straightforward. You can follow the official docs, or follow the simple steps below.

In your repository, create a directory at .github/workflows. Then create a YAML config file, actions.yaml, inside it with the following content.

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *'  # Runs at 12:00 AM UTC every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Extract data
        run: python extract.py

      - name: Transform data
        run: python transform.py

      - name: Load data
        run: python load.py

The above YAML automates an ETL (Extract, Transform, Load) pipeline. The workflow is triggered every day at 12:00 AM UTC and consists of a single job that runs on the ubuntu-latest environment (whatever version is available at the time).

The steps in this configuration are simple.

The job has five steps: the first two check out the code and set up the Python environment, respectively, while the next three execute the extract.py, transform.py, and load.py scripts sequentially.

This workflow provides an automated and efficient way of extracting, transforming, and loading data on a daily basis using GitHub Actions.
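While you're still testing the pipeline, it's handy to be able to trigger it by hand instead of waiting for the schedule. One way to do that, shown as a small sketch below, is to add a workflow_dispatch trigger alongside the schedule; the rest of the file stays the same, and a "Run workflow" button appears in the Actions tab.

on:
  workflow_dispatch:     # Allows manual runs from the Actions tab
  schedule:
    - cron: '0 0 * * *'  # Still runs at 12:00 AM UTC every day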

The Python scripts may vary depending on the scenario. Here's one of many possible ways.

# extract.py
# --------------------------------
import requests

response = requests.get("https://api.example.com/data")
with open("data.json", "w") as f:
    f.write(response.text)

# transform.py
# --------------------------------
import json

with open("data.json", "r") as f:
    data = json.load(f)

# Perform transformation
transformed_data = [item for item in data if item["key"] == "value"]

# Save transformed data
with open("transformed_data.json", "w") as f:
    json.dump(transformed_data, f)

# load.py
# --------------------------------
import json
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to the database
engine = create_engine("postgresql://myuser:mypassword@localhost:5432/mydatabase")

# Create metadata object
metadata = MetaData()

# Define table schema
mytable = Table(
    "mytable",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("column1", String),
    Column("column2", String),
)

# Create the table if it doesn't exist yet
metadata.create_all(engine)

# Read transformed data from file
with open("transformed_data.json", "r") as f:
    data = json.load(f)

# Load data into the database (engine.begin() commits the transaction)
with engine.begin() as conn:
    for item in data:
        conn.execute(
            mytable.insert().values(column1=item["column1"], column2=item["column2"])
        )

The above scripts read data from a dummy API, transform it, and push it to a Postgres database.
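One improvement worth considering: the extract step above accepts whatever the API returns. If you want a failed request to fail the workflow run (and trigger GitHub's failure notification), a minimal sketch of a more defensive variant, still using the same hypothetical endpoint, could look like this.

# extract.py (a more defensive variant)
# --------------------------------
import requests

# Time out after 30 seconds and raise on HTTP error status codes,
# so the GitHub Actions step fails visibly instead of writing bad data
response = requests.get("https://api.example.com/data", timeout=30)
response.raise_for_status()

with open("data.json", "w") as f:
    f.write(response.text)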

Things to consider when deploying ETL pipelines to GitHub Actions.

1. Security: Keep your secrets secure by using GitHub's secret store, and avoid hardcoding secrets into your workflows.

Have you noticed that the sample code above has hardcoded database credentials? That's not right for a production system.

We have better ways to handle secrets, such as database credentials, securely.

If you don't encrypt your secrets in GitHub Actions, they will be visible to anyone who has access to the repository's source code. That means if an attacker gains access to the repository, or the source code is leaked, they will be able to see your secret values.

To guard against this, GitHub provides a feature called encrypted secrets, which lets you store secret values securely in the repository settings. Encrypted secrets are only accessible to authorized users and are never exposed in plaintext in your GitHub Actions workflows.

Here's how it works.

In the repository settings sidebar, you'll find "Secrets and variables" for Actions. You can create your secrets there.

Screenshot by the author.

Secrets created here are not visible to anyone. They are encrypted and can be used in the workflow. Even you can't read them back, but you can update them with a new value.

Once you've created the secrets, you can pass them to your scripts as environment variables in the GitHub Actions configuration. Here's how it works:

name: ETL Pipeline

on:
  schedule:
    - cron: '0 0 * * *'  # Runs at 12:00 AM UTC every day

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      ...

      - name: Load data
        env:  # Pass the secrets to the script as environment variables
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASS: ${{ secrets.DB_PASS }}
        run: python load.py

Now we can modify the Python script to read the credentials from environment variables.

# load.py
# --------------------------------
import json
import os
from sqlalchemy import create_engine, Table, Column, Integer, String, MetaData

# Connect to the database using credentials from environment variables
engine = create_engine(
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}@localhost:5432/mydatabase"
)
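Note that the connection string still points to localhost, which only works if the database runs on the runner itself (for example, as a service container). If you're loading into a managed Postgres instance, you'd typically store the host as a secret too. Here's a sketch, assuming a hypothetical DB_HOST secret exposed as an environment variable in the same way as the credentials.

# load.py (connection string built entirely from environment variables)
# --------------------------------
import os
from sqlalchemy import create_engine

# DB_HOST is a hypothetical extra secret, passed in just like DB_USER and DB_PASS
engine = create_engine(
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASS']}"
    f"@{os.environ['DB_HOST']}:5432/mydatabase"
)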

2. Dependencies: Make sure to use the correct versions of your dependencies to avoid any issues.

Your Python project may already have a requirements.txt file that specifies dependencies along with their versions. Or, for more sophisticated projects, you may be using modern dependency management tools like Poetry.

You should have a step that sets up your environment before you run the other pieces of your ETL. You can do that by adding the following step to your YAML configuration.

- name: Install dependencies
  run: pip install -r requirements.txt
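If you manage dependencies with Poetry instead, the same step might look like the sketch below, assuming a pyproject.toml and poetry.lock are committed to the repository.

- name: Install dependencies
  run: |
    pip install poetry
    poetry install --no-root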

3. Timezone settings: GitHub Actions uses the UTC timezone, and as of writing this post, you can't change it.

Thus you must make sure you're scheduling in the correct timezone. You can use an online converter or manually convert your local time to UTC before configuring the cron expression.
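If you'd rather not do the conversion in your head, a quick check with Python's standard zoneinfo module works too (available since Python 3.9, the version used in the workflow above); the timezone below is just an example.

from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# When does a 6:00 AM run in, say, Asia/Kolkata happen in UTC?
local_run = datetime(2024, 1, 15, 6, 0, tzinfo=ZoneInfo("Asia/Kolkata"))
print(local_run.astimezone(timezone.utc))  # 2024-01-15 00:30:00+00:00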

The most important caveat of GitHub Actions scheduling is the uncertainty in execution time. Even though you've configured it to run at a specific point in time, if demand is high at that moment, your job will be queued, so there can be a short delay in the actual start time.

If your job depends on exact execution times, GitHub Actions scheduling is probably not a good option. Using a self-hosted runner in GitHub Actions may help.
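Switching to a self-hosted runner is roughly a one-line change in the job definition, as in the sketch below, once you've registered your own machine as a runner in the repository settings.

jobs:
  etl:
    runs-on: self-hosted  # Your own machine, so no queueing behind shared runners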

4. Resource Usage: Avoid overloading the resources provided by GitHub.

Even though GitHub Actions gives you 2,000 minutes of free run time on a free account, the rules change a bit if you use an OS other than Linux.

If you're using a Windows runner, you'll get only half of that, because every minute counts as two. In a macOS environment, you'll get only one-tenth, since every minute counts as ten.

Conclusion

GitHub Actions is a DevOps tool. But we can use it to run any scheduled task. In this post, we've discussed how to create an ETL pipeline that periodically fetches data from an API and pushes it to a database.

For simple ETLs, this approach is easy to develop and deploy.

But scheduled jobs in GitHub Actions don't always run at the exact scheduled time. Hence, this approach isn't suitable for tasks that are tightly time-bound.
