
Pandas for Data Engineers

Advanced techniques to process and load data efficiently

AI-generated image using Kandinsky

In this story, I would like to talk about the things I like about Pandas and use often in the ETL applications I write to process data. We'll touch on exploratory data analysis, data cleaning and data frame transformations. I'll show some of my favourite techniques to optimize memory usage and process large amounts of data efficiently with this library. Working with relatively small datasets in Pandas isn't a problem: it handles data frames with ease and provides a very convenient set of commands to process them. For data transformations on much bigger data frames (1 GB and more), I would normally use Spark and distributed compute clusters. They can handle terabytes and petabytes of data, but all that hardware will probably also cost a lot of money to run. That's why Pandas may be a better choice when we have to deal with medium-sized datasets in environments with limited memory resources.

Pandas and Python generators

In one of my previous stories I wrote about how to process data efficiently using generators in Python [1].

It's a simple trick to optimize memory usage. Imagine that we have a huge dataset somewhere in external storage. It could be a database or just a simple large CSV file. Imagine that we need to process this 2–3 TB file and apply some transformation to each row of data in it. Let's assume we have a service that will perform this task, and that it has only 32 GB of memory. This limits us in data loading: we won't be able to load the whole file into memory to split it line by line with a simple Python split('\n') operator. The solution would be to process it row by row, yielding each one in turn and freeing the memory for the next. This helps us create a constantly streaming flow of ETL data into the final destination of our data pipeline. It can be anything: a cloud storage bucket, another database, a data warehouse solution (DWH), a streaming topic or another…
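Here is a minimal sketch of that idea with Pandas. The file names, the value column and the transformation are hypothetical placeholders, not part of any real pipeline; the point is the generator pattern itself.

import pandas as pd

def stream_transformed_chunks(path, chunksize=100_000):
    """Yield transformed pieces of a large CSV one chunk at a time,
    so only `chunksize` rows are held in memory at once."""
    # read_csv with a chunksize returns an iterator of data frames
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk["value"] = chunk["value"] * 2  # placeholder row transformation
        yield chunk

first = True
for piece in stream_transformed_chunks("large_input.csv"):
    # append each processed chunk to the destination as soon as it is ready
    piece.to_csv("output.csv", mode="w" if first else "a",
                 header=first, index=False)
    first = False

Because each chunk is written out and discarded before the next one is read, peak memory stays proportional to the chunk size rather than to the size of the file.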
