If you have some experience in data science, you have surely faced the task of developing algorithms from tabular data. Common challenges of this sort are, for instance, Titanic - Machine Learning from Disaster or the Boston Housing dataset.
Data represented in tabular form (e.g. CSV files) can be stored in row-major or column-major order. In computing, row-major order and column-major order are methods for storing multidimensional arrays in linear storage such as random access memory. Depending on the paradigm with which a format was designed, there are best practices to follow to optimize its read and write times. Unfortunately, data scientists fairly often use libraries such as pandas the wrong way, wasting valuable time.
Row-major format means that in a table, consecutive rows are saved next to each other in memory. So if I am reading row i, then accessing row i+1 will be a very fast operation. CSV is a typical row-major format.
Formats that follow the column-major paradigm, such as Parquet, instead save consecutive columns next to each other in memory.
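As a minimal sketch of the difference, numpy lets us build the same 2x3 array in both layouts and inspect the order in which the values actually sit in memory (ravel(order='K') reads the buffer in storage order):

import numpy as np

a = np.arange(6).reshape(2, 3)   # row-major ('C' order) is numpy's default
f = np.asfortranarray(a)         # same values, column-major ('F' order)

print(a.ravel(order='K'))        # [0 1 2 3 4 5]  -> rows are contiguous
print(f.ravel(order='K'))        # [0 3 1 4 2 5]  -> columns are contiguous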
In Machine Learning we often have the case where the rows are the data samples and the columns are the features. So we will use a CSV file if we need to access samples quickly, and Parquet if we often need to access features (e.g. to calculate statistics).
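For instance, here is a small sketch (the file name and column names are made up for the example) of how Parquet's column-major layout pays off: we can read back a single feature without touching the rest of the file. Note that to_parquet/read_parquet require an engine such as pyarrow or fastparquet to be installed.

import pandas as pd

df = pd.DataFrame({"feature_a": range(1_000), "feature_b": range(1_000)})
df.to_parquet("samples.parquet")  # requires pyarrow or fastparquet

# Only the requested column is read from disk
feature_a = pd.read_parquet("samples.parquet", columns=["feature_a"])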
Pandas
Pandas is a library widely used in data science, especially when dealing with tabular data. Pandas is built on the concept of the DataFrame, precisely a tabular representation of data. The DataFrame, though, follows the column-major paradigm: internally, each column is stored as its own array.
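A quick sketch of what this means in practice (the toy DataFrame here is made up): selecting a column essentially hands back the array that pandas already holds, while selecting a row forces pandas to gather one value from every column.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

col = df["a"]     # cheap: column "a" is already a contiguous array internally
row = df.iloc[0]  # costly: one value must be collected from each column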
So iterating over a DataFrame row by row, as is often done, can be very slow. Let's look at an example: we import the BostonHousing DataFrame and iterate over it.
import pandas as pd
import time

# load the Boston Housing dataset from a public CSV
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
df.head()
In this first experiment, we iterate through the columns of the DataFrame (df.columns), then access all the elements in each column, and measure the time it takes to complete the process.
# iterating df by column
start = time.time()
for col in df.columns:
    for item in df[col]:
        pass
print(time.time() - start, " seconds")
# OUTPUT: 0.0021004676818847656 seconds
Instead, in this second experiment we iterate over the rows of the DataFrame with df.iloc, which returns the contents of the entire row.
# iterating df by row
n_rows = len(df)
start = time.time()
for i in range(n_rows):
    for item in df.iloc[i]:
        pass
print(time.time() - start, " seconds")
# OUTPUT: 0.059470415115356445 seconds
As you can see, the runtime of the second experiment is much higher than that of the first (roughly 28x here). In this case our dataset was very small, but if you try with your own, larger working dataset, you will notice how this difference becomes increasingly pronounced.
Fortunately, the numpy library comes to our rescue. When we use numpy we can specify the major order we want to use; by default, row-major order is used.
So what we can do is convert the pandas DataFrame to a numpy array and iterate over the latter row by row. Let's look at some experiments.
We first convert the DataFrame to a numpy array.
df_np = df.to_numpy()
n_rows, n_cols = df_np.shape
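One caveat, worth checking on your own data: depending on the dtypes involved, to_numpy() may hand back an array that is not row-major. np.ascontiguousarray forces C order, copying only if needed.

import numpy as np

print(df_np.flags['C_CONTIGUOUS'])  # False would mean a column-major layout

# Force row-major (C) order; a no-op if the array is already C-ordered
df_np = np.ascontiguousarray(df_np)
print(df_np.flags['C_CONTIGUOUS'])  # True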
Now let's iterate over the data by column, and measure the time.
# iterating numpy by column
start = time.time()
for j in range(n_cols):
    for item in df_np[:, j]:
        pass
print(time.time() - start, " seconds")
# OUTPUT: 0.002185821533203125 seconds
Now the same thing, iterating by rows.
# iterating numpy by row
start = time.time()
for i in range(n_rows):
    for item in df_np[i]:
        pass
print(time.time() - start, " seconds")
# OUTPUT: 0.0023500919342041016 seconds
We see that by using numpy the speed of both experiments improves! Moreover, the difference between the two is now minimal: each row of a C-ordered numpy array is just a lightweight view into the same contiguous buffer, whereas df.iloc has to build a new Series object for every row.
In this article, we introduced the difference between the row-major and column-major paradigms when dealing with tabular data, and we identified a common mistake made by many data scientists using Pandas. The time difference in accessing the data was minimal in this case because we used a small dataset, but be careful: the larger the dataset, the more pronounced this difference becomes, and you might lose a lot of time just reading the data. As a solution, always try to use numpy whenever possible.
Follow me for more articles of this kind!😉
Marcello Politi