
Modern Data Engineering


Platform Specific Tools and Advanced Techniques

Photo by Christopher Burns on Unsplash

The modern data ecosystem keeps evolving and new data tools emerge from time to time. In this article, I want to discuss the crucial things that affect data engineers, and how to use them to power advanced analytics pipelines and operational excellence.

I'd like to address some popular data engineering questions:

  • Modern data engineering (DE). What’s it?
  • Does your DE work well enough to fuel advanced data pipelines and Business intelligence (BI)?
  • Are your data pipelines efficient?
  • What’s required from the technological perspective to enable operational excellence?

Back in October, I wrote about the rise of the Data Engineer: the role, its challenges, responsibilities, daily routine and how to become successful in this field. The data engineering landscape is constantly changing, but the major trends seem to stay the same.

As a data engineer, I am tasked with designing efficient data processes almost every day. So here are a few things to consider that can help us answer these questions.

Modern data engineering trends

  • ETL vs ELT
  • Simplified data connectors and API integrations
  • ETL frameworks explosion
  • Data infrastructure as code
  • Data Mesh and decentralized data management
  • Democratization of Business intelligence pipelines using AI
  • Focus on data literacy

ELT vs ETL

Popular SQL data transformation tools like Dataform and DBT made a big contribution to the popularisation of the ELT approach [1]. It simply makes sense to perform the required data transformations, such as cleansing, enrichment and extraction, in the place where the data is stored. Often it is a data warehouse solution (DWH) in the central part of our infrastructure. Cloud platform leaders have made DWH infrastructure management (Snowflake, BigQuery, Redshift, Firebolt) so easy that in many scenarios they can outperform a dedicated in-house infrastructure management team in terms of cost-effectiveness and speed.

Data warehouse example. Image by author.
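As a quick illustration of ELT, here is a minimal sketch, assuming an authenticated BigQuery client; the dataset and table names (raw.payment_events, analytics.user_daily_spend) are illustrative. The raw data is already loaded into the warehouse, and the transformation runs where the data lives:

from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# The transformation runs inside the warehouse: clean and aggregate raw events
# and materialise the result as a new table (illustrative names).
transform_sql = """
CREATE OR REPLACE TABLE analytics.user_daily_spend AS
SELECT
    user_id,
    DATE(created_at) AS dt,
    SUM(CAST(amount AS NUMERIC)) AS total_spend
FROM raw.payment_events
WHERE amount IS NOT NULL
GROUP BY user_id, dt
"""

client.query(transform_sql).result()  # wait for the job to finish

This is essentially the pattern that tools like DBT and Dataform manage at scale, adding dependency management and tests on top.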

It also might be a datalake in the middle, depending on the type of our data platform and the tools we use. In that case, SQL often stops being an option, making it difficult to query the data for users who are not familiar with programming. Tools like Databricks, Tabular and Galaxy are trying to solve this problem, and it really looks like the future. Indeed, datalakes can store all types of data, including unstructured data, and we still need to be able to analyse these datasets.

Datalake example. Image by author.

Just imagine transactionally consistent datalake tables with point-in-time snapshot isolation.

I previously wrote about this in one of my stories on the Apache Iceberg table format [2].
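As a quick illustration, here is a minimal PySpark sketch, assuming a Spark session with an Iceberg catalog named demo is already configured; the table name and timestamp are illustrative. Iceberg's snapshot metadata makes such a point-in-time query a one-liner:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog "demo" configured via Spark conf elsewhere

# Point-in-time query against an Iceberg table (time travel over snapshot metadata).
df = spark.sql("""
    SELECT *
    FROM demo.db.payment_events
    FOR SYSTEM_TIME AS OF '2023-04-20 00:00:00'
""")
df.show()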

Simplified data integrations

Managed solutions like Fivetran and Stitch were built to manage third-party API integrations with ease. These days many companies choose this approach to simplify data interactions with their external data sources. It would be the right way to go for data analyst teams that are not familiar with coding.

Indeed, why would we build a data connector from scratch if it already exists and is being managed in the cloud?

The downside of this approach is its pricing model, though.

Very often it is row-based and can become quite expensive at an enterprise level of data ingestion, i.e. big data pipelines. This is where open-source alternatives come into play. Frameworks like Airbyte and Meltano might be an easy and quick way to deploy a data-source integration microservice.

If you don't have time to learn a new ETL framework, you can create a simple data connector yourself. If you know a little bit of Python, it will be a trivial task. In one of my previous articles I wrote about how easy it is to create a microservice that pulls data from the NASA API [3]:

Consider this code snippet for app.py

import requests

session = requests.Session()

url = "https://api.nasa.gov/neo/rest/v1/feed"
api_key = "your_api_key"
request_params = {
    'api_key': api_key,
    'start_date': '2023-04-20',
    'end_date': '2023-04-21'
}

# Pull Near Earth Object data for the given date range and stream the response
response = session.get(url, params=request_params, stream=True)
print(response.status_code)

It can be deployed on any cloud vendor's platform and scheduled to run with the required frequency. It is always a good practice to use something like Terraform to deploy our data pipeline applications.

ETL frameworks explosion

We can witness a "Cambrian explosion" of various ETL frameworks for data extraction and transformation. It is no surprise that many of them are open-source and Python-based.

Luigi [8] is one of them, and it helps to create ETL pipelines. It was created by Spotify to manage massive data processing workloads. It has a command-line interface and great visualization features. However, even basic ETL pipelines require a certain level of Python programming skills. From my experience, I can tell that it is great for strict and straightforward pipelines. I find it particularly difficult to implement complex branching logic with Luigi, but it works great in many scenarios.
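To give a flavour of it, here is a minimal Luigi sketch, assuming a local filesystem for targets; the task and file names are illustrative. Each task declares its output and its upstream dependency, and Luigi works out the execution order:

import datetime
import luigi

class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw_orders_{self.date}.csv")

    def run(self):
        # In a real pipeline this would call an API or a database;
        # here we just write a couple of illustrative rows.
        with self.output().open("w") as f:
            f.write("id,amount\n1,10.5\n2,20.0\n")

class TransformOrders(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/clean_orders_{self.date}.csv")

    def run(self):
        # Simple row-level transformation of the extracted file.
        with self.input().open() as src, self.output().open("w") as dst:
            for line in src:
                dst.write(line.strip().lower() + "\n")

if __name__ == "__main__":
    luigi.build([TransformOrders(date=datetime.date(2023, 4, 20))], local_scheduler=True)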

Python ETL (PETL) [9] is one of the most widely used open-source ETL frameworks for straightforward data transformations. It is invaluable for working with tables, extracting data from external data sources and performing basic ETL on data. In many ways it is similar to Pandas, but the latter has more analytics capabilities under the hood. PETL is great for aggregation and row-level ETL.
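Here is a minimal PETL sketch, assuming a hypothetical orders.csv file with order_id, customer and amount columns; the transformation chain is illustrative:

import petl as etl

# Load the hypothetical orders.csv and apply a few row-level transformations.
table = (
    etl.fromcsv("orders.csv")
    .convert("amount", float)                 # cast the amount column to float
    .select(lambda rec: rec.amount > 0)       # keep only positive amounts
    .addfield("amount_gbp", lambda rec: round(rec.amount * 0.8, 2))  # illustrative conversion rate
)

etl.tocsv(table, "orders_clean.csv")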

Bonobo [10] is another open-source lightweight data processing tool which is great for rapid development, automation and parallel execution of batch-processing data pipelines. What I like about it is that it makes it very easy to work with various data file formats, i.e. SQL, XML, XLS, CSV and JSON. It will be a great tool for those with minimal Python knowledge. Among other advantages, I like that it works well with semi-complex data schemas. It is good for simple ETL and can run in Docker containers (it has a Docker extension).
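A minimal Bonobo sketch might look like this; the extract/transform/load functions and the in-memory source are illustrative:

import bonobo

def extract():
    # Hypothetical in-memory source; in practice this could read an API, a CSV or a database.
    yield {"id": 1, "name": "alice"}
    yield {"id": 2, "name": "bob"}

def transform(row):
    # Simple enrichment step.
    yield {**row, "name": row["name"].title()}

def load(row):
    # A real loader would write to a file, a warehouse or a queue.
    print(row)

def get_graph():
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph

if __name__ == "__main__":
    bonobo.run(get_graph())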

Pandas is an absolute beast in the world of data, and there is no need to cover its capabilities in this story. It is worth mentioning that its dataframe transformations have been included as one of the fundamental methods of data loading for many modern data warehouses. Consider this sample that loads data into the BigQuery data warehouse:

from google.cloud import bigquery
from google.oauth2 import service_account
import json
import pandas
...
# Authenticate the BigQuery client:
service_acount_str = config.get('BigQuery')  # service account info comes from a config module loaded elsewhere
credentials = service_account.Credentials.from_service_account_info(service_acount_str)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

...
def load_table_from_dataframe(table_schema, table_name, dataset_id):
    #! Source data file format must be an outer-array JSON:
    """
    [
        {"id":"1"},
        {"id":"2"}
    ]
    """
    blob = """
    [
        {"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]},
        {"id":"2","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
    ]
    """
    body = json.loads(blob)
    print(pandas.__version__)

    table_id = client.dataset(dataset_id).table(table_name)
    job_config = bigquery.LoadJobConfig()
    schema = create_schema_from_yaml(table_schema)  # helper that builds a BigQuery schema from YAML (see sketch below)
    job_config.schema = schema

    df = pandas.DataFrame(
        body,
        # In the loaded table, the column order reflects the order of the
        # columns in the DataFrame.
        columns=["id", "first_name", "last_name", "dob", "addresses"],
    )
    df['addresses'] = df.addresses.astype(str)
    df = df[['id', 'first_name', 'last_name', 'dob', 'addresses']]

    print(df)

    load_job = client.load_table_from_dataframe(
        df,
        table_id,
        job_config=job_config,
    )

    load_job.result()
    print("Job finished.")

Apache Airflow, for example, is not an ETL tool per se, but it helps to organise our ETL pipelines into a nice visualization of dependency graphs (DAGs) that describe the relationships between tasks. A typical Airflow architecture includes a metadata-based scheduler, executors, workers and tasks.

For example, we can run ml_engine_training_op after we export data into cloud storage (bq_export_op) and make this workflow run daily or weekly.

ML model training using Airflow. Image by author.

Consider the example below. It creates a simple data pipeline graph that exports data into a cloud storage bucket and then trains an ML model using MLEngineTrainingOperator.

"""DAG definition for recommendation_bespoke model training."""

import airflow
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_to_gcs import BigQueryToCloudStorageOperator
from airflow.hooks.base_hook import BaseHook
from airflow.operators.app_engine_admin_plugin import AppEngineVersionOperator
from airflow.operators.ml_engine_plugin import MLEngineTrainingOperator

import datetime

def _get_project_id():
"""Get project ID from default GCP connection."""

extras = BaseHook.get_connection('google_cloud_default').extra_dejson
key = 'extra__google_cloud_platform__project'
if key in extras:
project_id = extras[key]
else:
raise ('Must configure project_id in google_cloud_default '
'connection from Airflow Console')
return project_id

PROJECT_ID = _get_project_id()

# Data set constants, utilized in BigQuery tasks. You may change these
# to adapt to your data.
DATASET = 'staging' #'analytics'
TABLE_NAME = 'recommendation_bespoke'

# GCS bucket names and region, may also be modified.
BUCKET = 'gs://rec_wals_eu'
REGION = 'us-central1' #'europe-west2' #'us-east1'
JOB_DIR = BUCKET + '/jobs'

default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': airflow.utils.dates.days_ago(2),
'email': ['mike.shakhomirov@gmail.com'],
'email_on_failure': True,
'email_on_retry': False,
'retries': 5,
'retry_delay': datetime.timedelta(minutes=5)
}

# Default schedule interval using cronjob syntax - will be customized here
# or within the Airflow console.
schedule_interval = '00 21 * * *'

dag = DAG('recommendations_training_v6', default_args=default_args,
schedule_interval=schedule_interval)

dag.doc_md = __doc__

#
#
# Task Definition
#
#

# BigQuery training data export to GCS

training_file = BUCKET + '/data/recommendations_small.csv' # just a couple of records for staging

t1 = BigQueryToCloudStorageOperator(
task_id='bq_export_op',
source_project_dataset_table='%s.recommendation_bespoke' % DATASET,
destination_cloud_storage_uris=[training_file],
export_format='CSV',
dag=dag
)

# ML Engine training job
training_file = BUCKET + '/data/recommendations_small.csv'
job_id = 'recserve_{0}'.format(datetime.datetime.now().strftime('%Y%m%d%H%M'))
job_dir = BUCKET + '/jobs/' + job_id
output_dir = BUCKET
delimiter=','
data_type='user_groups'
master_image_uri='gcr.io/my-project/recommendation_bespoke_container:tf_rec_latest'

training_args = ['--job-dir', job_dir,
'--train-file', training_file,
'--output-dir', output_dir,
'--data-type', data_type]

master_config = {"imageUri": master_image_uri,}

t3 = MLEngineTrainingOperator(
task_id='ml_engine_training_op',
project_id=PROJECT_ID,
job_id=job_id,
training_args=training_args,
region=REGION,
scale_tier='CUSTOM',
master_type='complex_model_m_gpu',
master_config=master_config,
dag=dag
)

t3.set_upstream(t1)

Bubbles [11] is another open-source tool for ETL in the Python world. It is great for rapid development, and I like how it works with metadata to describe data pipelines. The creators of Bubbles call it an "abstract framework" and say that it can be used from many other programming languages, not exclusively from Python.

There are many other tools with more specific applications, e.g. extracting data from web pages (PyQuery, BeautifulSoup, etc.) and parallel data processing. It could be a topic for another story, but I wrote about some of them before, e.g. the joblib library [12].

Data infrastructure as code

Infrastructure as code (IaC) is a popular and very functional approach for managing data platform resources. Even for data, it is pretty much a standard right now, and it definitely looks great on your CV, telling your potential employers that you are familiar with DevOps standards. Using tools like Terraform (platform agnostic) and CloudFormation, we can integrate our development work and deployments (operations) with ease.

Generally, we would want to have staging and production data environments for our data pipelines. It helps to test our pipelines and facilitates collaboration between teams.

Consider this diagram below. It explains how data environments work.

Data environments. Image by author.

Often we would need an extra sandbox for testing purposes, or to run data transformation unit tests when our ETL services trigger CI/CD workflows. I previously wrote about this before.

Using AWS CloudFormation template files, we can describe the required resources and their dependencies, so we can launch and configure them together as a single stack.
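As a minimal sketch of that idea in Python, assuming a template file data_stack.yaml describing the pipeline resources and AWS credentials configured locally; the stack and parameter names are illustrative:

import boto3

cloudformation = boto3.client("cloudformation", region_name="eu-west-1")

# Read a local CloudFormation template that describes our data pipeline resources
# (e.g. a bucket, a Lambda-based connector and its IAM role).
with open("data_stack.yaml") as f:
    template_body = f.read()

# Create the whole stack in one call; all resources are launched and configured together.
cloudformation.create_stack(
    StackName="staging-data-pipeline",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "staging"}],
    Capabilities=["CAPABILITY_IAM"],
)

# Block until the stack is fully created.
waiter = cloudformation.get_waiter("stack_create_complete")
waiter.wait(StackName="staging-data-pipeline")

The same template, with a different Environment parameter, gives us an identical production stack.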

If you are a data professional, this approach will certainly help you work with different data environments and replicate data platform resources faster and more consistently, without errors.

The issue is that many data practitioners are not familiar with IaC, and that creates a lot of errors during the development process.

Data Mesh and decentralized data management

The data space has evolved significantly during the last decade, and now we have a lot of data tools and frameworks. Data Mesh describes the state where we have different data domains (company departments) with their own teams and shared data resources. Each team has its own goals, KPIs, data roles and responsibilities.

For a long period of time, data bureaucracy has been a real pain for many companies.

This data platform type [4] may appear a bit chaotic, but it was meant to become a successful and efficient choice for companies where decentralization enables different teams to access cross-domain datasets and run analytics or ETL tasks on their own.

Indeed, Snowflake might be your favourite data warehouse solution if you are a data analyst and not familiar with Spark. However, it often becomes a non-trivial problem when you want to read datalake data without data engineering help. In this scenario, a set of metadata records on datasets can be extremely useful, and that is why Data Mesh is so successful.

It equips users with knowledge about the data, its origins and how other teams can make the best of datasets they were not previously aware of.

Sometimes datasets and data source connections become very intricate, and it is always a good practice to have a single-source-of-truth data silo or repository with metadata and dataset descriptions.

In one of my previous stories [5] I wrote about the role of SQL as a unified querying language for teams and data. Indeed, it is analytical, self-descriptive and can even be dynamic, which makes it a great tool for all data users.

Often it all turns into a big mes(s/h)

This fact makes SQL-based templating engines like DBT, Jinja and Dataform very popular. Just imagine that you have an SQL-like platform where all datasets and their transformations are described and defined thoroughly [6].

Dataform's dependency graph and metadata. Image by author.
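To show why templating is so convenient, here is a minimal Jinja sketch; the table, column and parameter names are illustrative. The same template can be rendered for any dataset or date range:

from jinja2 import Template

# Hypothetical incremental-load template; names are illustrative.
sql_template = Template("""
SELECT
    user_id,
    SUM(amount) AS total_amount
FROM {{ dataset }}.payment_transaction
WHERE dt BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY user_id
""")

# Render the template with concrete parameters before sending it to the warehouse.
query = sql_template.render(
    dataset="analytics",
    start_date="2023-04-20",
    end_date="2023-04-21",
)
print(query)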

It might be a big challenge to understand how data teams relate to data sources and schemas. Very often it is all tangled in a spaghetti of dataset dependencies and ETL transformations.
Data engineering plays a critical role in mentoring, improving data literacy and empowering the rest of the company with state-of-the-art data processing techniques and best practices.

Democratization of Business Intelligence pipelines using AI

Improving data accessibility has always been a popular topic in the data space, but it is interesting to see how the whole data pipeline design process is becoming increasingly accessible to groups that were not familiar with data before. Now almost every department can use built-in AI capabilities to create complex BI transformations on data.

All they need to do is describe what they want, BI-wise, in their own words.

For example, BI tools like ThoughtSpot use AI with an intuitive "Google-like search interface" [7] to gain insights from data stored in any modern DWH solution such as Google BigQuery, Redshift, Snowflake or Databricks.

The Modern Data Stack includes BI tools that help with data modelling and visualization. Many of them already have built-in AI capabilities to gain data insights faster based on user behaviour.

I believe it is a fairly easy task to integrate GPT and BI. In the next couple of years, we will see many new products using this tech.

GPT can pre-process text data to generate a SQL query that understands your intent and answers your question.
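As a minimal sketch of that idea, assuming the openai Python client (v1+) and an API key in the environment; the schema description and model name are illustrative, and the generated SQL should be reviewed before it touches the warehouse:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative schema description the model needs in order to write correct SQL.
schema = "Table payment_transaction(user_id STRING, amount NUMERIC, dt DATE)"
question = "What was the total payment amount per user last week?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": f"You translate questions into BigQuery SQL. Schema: {schema}"},
        {"role": "user", "content": question},
    ],
)

generated_sql = response.choices[0].message.content
print(generated_sql)  # review before running against the warehouse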

Conclusion

In this article, I tried to give a very high-level overview of the major data trends that affect the data engineering role today. Data Mesh and templated SQL with dependency graphs facilitate data literacy and have democratized the whole analytics process. Advanced data pipelines with intricate ETL techniques and transformations can now be transparent for everyone in the organisation. Data pipelines are becoming increasingly accessible to other teams, and they don't need to know programming to learn and understand the complexity of ETL; Data Mesh and metadata help to solve this problem. From my experience, I can tell that I keep seeing more and more people learning SQL to contribute to the transformation layer. Companies born during the "advanced data analytics" age have the luxury of easy access to cloud vendor products and their managed services. It definitely helps to acquire the required data skills and improve them to gain a competitive advantage.

Recommended read

[1] https://medium.com/towards-data-science/data-pipeline-design-patterns-100afa4b93e3

[2] https://towardsdatascience.com/introduction-to-apache-iceberg-tables-a791f1758009

[3] https://towardsdatascience.com/python-for-data-engineers-f3d5db59b6dd

[4] https://medium.com/towards-data-science/data-platform-architecture-types-f255ac6e0b7

[5] https://medium.com/towards-data-science/advanced-sql-techniques-for-beginners-211851a28488

[6] https://medium.com/towards-data-science/easy-way-to-create-live-and-staging-environments-for-your-data-e4f03eb73365

[7] https://docs.thoughtspot.com/cloud/latest/search-sage

[8] https://github.com/spotify/luigi

[9] https://petl.readthedocs.io/en/stable/

[10] https://www.bonobo-project.org

[11] http://bubbles.databrewery.org/

[12] https://medium.com/towards-data-science/how-to-become-a-data-engineer-c0319cb226c2
