
Data Engineering: A Formula 1-inspired Guide for Beginners


A Glossary with Use Cases for First-Timers in Data Engineering

A happy Data Engineer at work

Are you a data engineering rookie curious to learn more about modern data infrastructures? I bet you are, and this article is for you!

In this guide, Data Engineering meets Formula 1. But we'll keep it simple.

I strongly believe that the best way to describe a concept is through examples, although some of my university professors used to say, "If you need an example to explain it, it means you didn't get it".
In any case, I wasn't paying enough attention during university classes, and today I'll walk you through data layers using, guess what, an example.

Imagine this: next year, a new team on the grid, Red Thunder Racing, will call us (yes, you and me) to set up their new data infrastructure.

In today's Formula 1, data is at the core, far more than it was 20 or 30 years ago. Racing teams improve performance with an exceptional data-driven approach, gaining millisecond after millisecond.

Very High Level Design

It's not only about lap times; Formula 1 is a multi-billion-dollar business. Boosting fan engagement isn't only for fun; making the sport more attractive isn't only for the drivers' enjoyment. These activities generate revenue.
A strong data infrastructure is a must-have to compete in the F1 business.

We'll build a data architecture to support our racing team, starting from the three canonical layers: Data Lake, Data Warehouse, and Data Mart.

Data Lake

A data lake would serve as a repository for raw and unstructured data generated from various sources across the Formula 1 ecosystem: telemetry data from the cars (e.g. tyre pressure per second, speed, fuel consumption), driver configurations, lap times, weather conditions, social media feeds, ticketing, fans registered to marketing events, merchandise purchases, …

All kinds of data can be stored in our consolidated data lake: unstructured (audio, video, images), semi-structured (JSON, XML), and structured (CSV, Parquet, Avro).
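To make this concrete, here is a minimal sketch (in Python, with hypothetical paths and field names) of how a single raw telemetry record could be landed in the lake, partitioned by source and ingestion date:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_raw_record(lake_root: Path, source: str, record: dict) -> Path:
    """Write one raw record to the lake, partitioned by source and ingestion date."""
    now = datetime.now(timezone.utc)
    partition = lake_root / source / now.strftime("%Y/%m/%d")
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"{now.strftime('%H%M%S%f')}.json"
    path.write_text(json.dumps(record))
    return path

# Example: a single (made-up) telemetry sample from car #14
sample = {"car": 14, "lap": 3, "speed_kmh": 287.4, "tyre_pressure_psi": 21.9}
landed = land_raw_record(Path("/tmp/red_thunder_lake"), "telemetry", sample)
```

Notice that the raw record is stored as-is: in a data lake, we defer cleaning and structuring to the layers above.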

Data Lake & Data Integration

We'll face our first challenge when we integrate and consolidate everything in a single place. We'll create batch jobs extracting records from marketing tools, and we'll also deal with real-time streaming telemetry data (and rest assured, there will be very low latency requirements for that).

We'll have a long list of systems to integrate, and each will support a different protocol or interface: Kafka Streaming, SFTP, MQTT, REST API, and more.

We won't be alone in this data collection; thankfully, there are data integration tools available on the market that can be adopted to configure and maintain ingestion pipelines in a single place (e.g. in alphabetical order: Fivetran, Hevo, Informatica, Segment, Stitch, Talend, …).
Instead of relying on hundreds of Python scripts scheduled via crontab, or on custom processes handling data streaming from Kafka topics, these tools will help us simplify, automate, and orchestrate all these processes.
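Stripped to its essence, every batch ingestion job follows the same shape. Here is a toy sketch of that shape, where the connector is a stub standing in for a real one (SFTP, REST API, a marketing tool's export, …):

```python
from typing import Callable, Iterable

def run_batch_job(extract: Callable[[], Iterable[dict]],
                  load: Callable[[dict], None]) -> int:
    """A tiny batch job: pull every record from a source and hand it to the lake loader."""
    count = 0
    for record in extract():
        load(record)
        count += 1
    return count

# Stubbed connector standing in for a real extraction from a marketing tool
def fake_marketing_extract():
    yield {"fan_id": 1, "event": "signup"}
    yield {"fan_id": 2, "event": "merch_purchase"}

landed = []
n = run_batch_job(fake_marketing_extract, landed.append)
```

Integration tools essentially give you hundreds of pre-built `extract` connectors plus the scheduling, monitoring, and retries around them.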

Data Warehouse

After a few weeks spent defining all the data streams we need to integrate, we are now ingesting a remarkable amount of data into our data lake. It's time to move on to the next layer.

The data warehouse is used to clean, structure, and store processed data from the data lake, providing a structured, high-performance environment for analytics and reporting.

At this stage, it's no longer about ingesting data; we'll focus more and more on business use cases. We should consider how the data will be used by our colleagues, offering regularly refreshed, structured datasets about:

  • Car Performance: telemetry data is cleaned, normalised, and integrated to provide a unified view.
  • Strategy and Trend Review: past race data is used to identify trends, assess driver performance, and understand the impact of specific strategies.
  • Team KPIs: pit stop times, tyre temperatures before pit stops, budget control on car developments.

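A tiny illustration of the cleaning step (field names, units, and thresholds are invented for the example): discard implausible sensor readings and normalise units before the data reaches the warehouse:

```python
def normalise_telemetry(raw: list[dict]) -> list[dict]:
    """Drop implausible readings and convert everything to one unit convention."""
    cleaned = []
    for r in raw:
        speed = r.get("speed_kmh")
        psi = r.get("tyre_pressure_psi")
        if speed is None or psi is None or not (0 <= speed <= 400):
            continue  # discard sensor glitches and incomplete samples
        cleaned.append({
            "car": r["car"],
            "speed_kmh": round(speed, 1),
            "tyre_pressure_bar": round(psi * 0.0689476, 2),  # psi -> bar
        })
    return cleaned

rows = normalise_telemetry([
    {"car": 14, "speed_kmh": 287.44, "tyre_pressure_psi": 21.9},
    {"car": 14, "speed_kmh": 9999, "tyre_pressure_psi": 22.0},  # glitch, dropped
])
```

In a real warehouse this logic would run inside a transformation pipeline, not a loose script, but the idea is the same.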
Data Warehouse & Data Transformation

We'll have quite a few pipelines dedicated to data transformation and normalisation.
As with data integration, there are many products on the market to simplify and efficiently manage data pipelines. These tools can streamline our data processes, reducing operational costs and increasing development effectiveness (e.g. in alphabetical order: Apache Airflow, Azure Data Factory, dbt, Google Dataform, …).
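Under the hood, all these tools schedule tasks in dependency order. A toy version of that core idea, using only the Python standard library (task names are invented):

```python
from graphlib import TopologicalSorter

def run_pipeline(tasks: dict, deps: dict) -> list:
    """Run tasks in dependency order, the way an orchestrator would."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        tasks[name]()  # a real tool adds retries, logging, alerting, backfills...
    return order

log = []
tasks = {
    "extract_lake": lambda: log.append("extract_lake"),
    "transform": lambda: log.append("transform"),
    "load_warehouse": lambda: log.append("load_warehouse"),
}
# Each task maps to the set of tasks it depends on
deps = {"transform": {"extract_lake"}, "load_warehouse": {"transform"}}
order = run_pipeline(tasks, deps)
```

What you actually pay these tools for is everything around this loop: retries, backfills, observability, and alerting when a task fails at 3 a.m.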

Data Marts

There's a thin line between Data Warehouses and Data Marts.
Let's not forget that we're working for Red Thunder Racing, a big company with thousands of employees involved in diverse areas.
Data must be accessible and tailored to each business unit's requirements. Data models are built around business needs.

Data marts are specialized subsets of information warehouses that deal with specific business functions.

  • Car Performance Mart: the R&D Team analyses data related to engine efficiency, aerodynamics, and reliability. Engineers will use this data mart to optimise the car's setup for different race tracks, or to run simulations to find the best car configuration for given weather conditions.
  • Fan Engagement Mart: the Marketing Team analyses social media data, fan surveys, and viewer ratings to understand fan preferences. The Marketing Team uses this data to drive tailored marketing strategies and merchandise development, and to improve their Fan360 knowledge.
  • Financial Analytics Mart: the Finance Team needs data as well (lots of numbers, I guess!). Now more than ever, racing teams have to deal with budget restrictions and regulations. It's necessary to keep track of budget allocations, revenues, and cost overviews in general.
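In code terms, a data mart is often just a curated projection of warehouse tables. A minimal sketch (column names are hypothetical): the Fan Engagement Mart keeps the marketing columns and leaves finance-only ones behind:

```python
# Hypothetical warehouse rows mixing marketing and finance-only columns
WAREHOUSE_FANS = [
    {"fan_id": 1, "country": "IT", "merch_spend": 120.0, "acquisition_cost": 8.5},
    {"fan_id": 2, "country": "UK", "merch_spend": 45.0, "acquisition_cost": 3.2},
]

def build_fan_engagement_mart(rows: list[dict]) -> list[dict]:
    """Project only the columns the Marketing Team needs."""
    keep = ("fan_id", "country", "merch_spend")
    return [{k: r[k] for k in keep} for r in rows]

mart = build_fan_engagement_mart(WAREHOUSE_FANS)
```

In practice this projection would be a view or a scheduled transformation in the warehouse platform, not Python, but the principle holds: each mart exposes only what its business unit needs.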

Furthermore, it's often a requirement to ensure that sensitive data remains accessible only to authorised teams. For instance, the Research and Development team may require exclusive access to telemetry data, analysed through a specific data model. However, they won't be permitted (or interested) to access financial reports.
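A simple sketch of that idea (team and mart names are invented; a real setup would live in the data platform's access controls, not in application code): grants map each team to the marts it may query:

```python
# Invented team-to-mart grants for illustration
GRANTS = {
    "rnd": {"car_performance"},
    "marketing": {"fan_engagement"},
    "finance": {"financial_analytics", "fan_engagement"},
}

def can_access(team: str, mart: str) -> bool:
    """True if the team has been granted access to the mart."""
    return mart in GRANTS.get(team, set())
```

R&D gets its telemetry, and nobody outside Finance sees the financial numbers.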

Data Mart & Data Modeling

Our layered data architecture will enable Red Thunder Racing to leverage the power of data for car performance optimisation, strategic decision-making, enhanced marketing campaigns… and beyond!

That’s it?

Absolutely not! We barely scratched the surface of a data architecture. There are probably hundreds of other integration points we should consider; moreover, we didn't go beyond merely mentioning data transformation and data modeling.

We didn't cover the Data Science domain at all, which probably deserves its own article; the same goes for data governance, data observability, data security, and more.

But hey, as they say, "Rome wasn't built in a day". We already have quite a lot on our plate for today, including the first draft of our data architecture (below).

Data Architecture — Overview

Data Engineering is a magical realm, with a plethora of books dedicated to it.

Along the journey, data engineers will engage with countless integration tools, diverse data platforms aiming to cover one or more of the layers mentioned above (e.g. in alphabetical order: AWS Redshift, Azure Synapse, Databricks, Google BigQuery, Snowflake, …), business intelligence tools (e.g. Looker, Power BI, Tableau, ThoughtSpot, …), and data pipeline tools.

Our data engineering journey at Red Thunder Racing has just begun, and we should leave plenty of room for flexibility in our toolkit!

Data layers can often be combined, sometimes within a single platform. Data platforms and tools are raising the bar and closing gaps every day by releasing new features. The competition in this market is intense.

  • Do you always need a data lake? It depends.
  • Do you always need data stored as soon as possible (a.k.a. streaming and real-time processing)? It depends: what data freshness do your business users actually require?
  • Do you always need to rely on third-party tools for data pipeline management? It depends!
  • …? It depends!

If you have any questions or suggestions, please feel free to reach out to me on LinkedIn. I promise I'll answer with something different from: It depends!

Opinions expressed in this article are solely my own and do not reflect the views of my employer. Unless otherwise noted, all images are by the author.

The story, all names, and incidents portrayed in this article are fictitious. No identification with actual places, buildings, or products is intended or should be inferred.
