
5 Signs You’ve Become an Advanced Pandas User Without Even Realizing It


3. Friends with Pandas

If there is one thing that makes Pandas the king of data analysis libraries, it has got to be its integration with the rest of the data ecosystem.

For instance, by now you must have realized that you can change the plotting backend of Pandas from Matplotlib to Plotly, hvPlot, HoloViews, Bokeh, or Altair.

Yes, Matplotlib is best friends with Pandas, but every once in a while you fancy something interactive like Plotly or Altair.

import pandas as pd
import plotly.express as px

# Set the default plotting backend to Plotly
pd.options.plotting.backend = 'plotly'

Speaking of backends, you have also noticed that Pandas added a fully supported PyArrow implementation for its read_* functions in the brand-new 2.0.0 version.

import pandas as pd

pd.read_csv(file_name, engine='pyarrow')

When Pandas had only the NumPy backend, there were many limitations, such as little support for non-numeric data types, near-total disregard for missing values, and no support for complex data structures (dates, timestamps, categoricals).
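The missing-value limitation is easy to reproduce. The sketch below uses made-up values and assumes only that pandas is installed: with the default NumPy-backed dtypes, a single missing entry turns an integer column into floats, while the nullable "Int64" extension dtype keeps the integers and marks the gap as <NA>.

```python
import pandas as pd

# Illustrative values only; None stands for a missing entry
numpy_backed = pd.Series([1, 2, None])
nullable = pd.Series([1, 2, None], dtype="Int64")

print(numpy_backed.dtype)  # float64: the integers were upcast to hold NaN
print(nullable.dtype)      # Int64: integers preserved, gap stored as <NA>
```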

Before 2.0.0, Pandas had been cooking up in-house solutions to these problems, but they weren’t as good as some heavy users had hoped. With the PyArrow backend, loading data is considerably faster, and it brings a set of data types that Apache Arrow users are familiar with:

import pandas as pd

# Parse with the PyArrow engine and store columns as Arrow-backed dtypes
pd.read_csv(file_name, engine='pyarrow', dtype_backend='pyarrow')

Another cool feature of Pandas that I’m sure you use all the time in JupyterLab is styling DataFrames.

Since Project Jupyter is so awesome, Pandas developers added a bit of HTML/CSS magic under the .style attribute so you can spice up plain old DataFrames in a way that reveals additional insights:

df.sample(20, axis=1).describe().T.style.bar(
    subset=["mean"], color="#205ff2"
).background_gradient(
    subset=["std"], cmap="Reds"
).background_gradient(
    subset=["50%"], cmap="coolwarm"
)

Image by author.

4. The data sculptor

Since Pandas is a data analysis and manipulation library, the truest sign that you are a pro is how flexibly you can shape and transform datasets to suit your purposes.

While most online courses provide ready-made, cleaned data in columnar format, datasets in the wild come in many shapes and forms. For instance, one of the most annoying data formats is row-based (quite common with financial data):

import pandas as pd

# create example DataFrame
df = pd.DataFrame(
    {
        "Date": [
            "2022-01-01",
            "2022-01-02",
            "2022-01-01",
            "2022-01-02",
        ],
        "Country": ["USA", "USA", "Canada", "Canada"],
        "Value": [10, 15, 5, 8],
    }
)

df

Image by author

You must be able to convert the row-based format into a more useful one, like the example below, using the pivot function:

pivot_df = df.pivot(
    index="Date",
    columns="Country",
    values="Value",
)

pivot_df


You may also need to perform the opposite of this operation, called a melt.

Here is an example with the melt function of Pandas, which turns columnar data into a row-based format:

df = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-02", "2022-01-03"],
        "AAPL": [100.0, 101.0, 99.0],
        "GOOG": [200.0, 205.0, 195.0],
        "MSFT": [50.0, 52.0, 48.0],
    }
)

df

Image by author

melted_df = pd.melt(
    df, id_vars=["Date"], var_name="Stock", value_name="Price"
)

melted_df

Image by author
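One way to build intuition is to treat pivot and melt as inverses of each other. The sketch below reuses the stock prices from the example above (the variable names wide, long_df, and restored are only illustrative), melts the wide table, and then pivots it back:

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-02", "2022-01-03"],
        "AAPL": [100.0, 101.0, 99.0],
        "GOOG": [200.0, 205.0, 195.0],
        "MSFT": [50.0, 52.0, 48.0],
    }
)

# Wide -> long: one row per (Date, Stock) pair
long_df = pd.melt(wide, id_vars=["Date"], var_name="Stock", value_name="Price")

# Long -> wide again: one column per stock, dates as the index
restored = long_df.pivot(index="Date", columns="Stock", values="Price")

print(long_df.shape)   # (9, 3)
print(restored.shape)  # (3, 3)
```

The restored table carries Date as its index rather than as a regular column; calling reset_index() on it brings the column back.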

Such functions can be quite difficult to understand and even harder to apply.

There are other similar ones like pivot_table, which creates a pivot table that can compute different types of aggregations for each value in the table.
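A small sketch of the difference, using invented data in which the USA has two rows for the same date: plain pivot refuses to reshape duplicated index/column pairs, while pivot_table aggregates them (here with a sum, chosen only for illustration).

```python
import pandas as pd

# Made-up records; note the two USA rows for 2022-01-01
records = pd.DataFrame(
    {
        "Date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-01", "2022-01-02"],
        "Country": ["USA", "USA", "USA", "Canada", "Canada"],
        "Value": [10, 20, 15, 5, 8],
    }
)

# pivot cannot decide which of the duplicated values to keep
try:
    records.pivot(index="Date", columns="Country", values="Value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates by aggregating them
totals = records.pivot_table(
    index="Date", columns="Country", values="Value", aggfunc="sum"
)
print(totals)
```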

Another pair of functions is stack/unstack, which can collapse/explode DataFrame indices. crosstab computes a cross-tabulation of two or more factors; by default it computes a frequency table of the factors, but it can also compute other summary statistics.
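To make these concrete, here is a minimal sketch with invented data (the column names Country and Plan are hypothetical). stack moves the column labels into an inner index level, unstack reverses that, and crosstab counts how often each pair of values occurs:

```python
import pandas as pd

prices = pd.DataFrame(
    {"AAPL": [100.0, 101.0], "GOOG": [200.0, 205.0]},
    index=pd.Index(["2022-01-01", "2022-01-02"], name="Date"),
)

# stack: column labels become an inner index level, giving a Series
stacked = prices.stack()

# unstack: the inner index level goes back to being columns
unstacked = stacked.unstack()

# crosstab: frequency table of two categorical columns
users = pd.DataFrame(
    {
        "Country": ["USA", "USA", "Canada", "Canada", "USA"],
        "Plan": ["free", "paid", "free", "free", "paid"],
    }
)
counts = pd.crosstab(users["Country"], users["Plan"])
print(counts)
```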

Then there’s groupby. Though the basics of this function are simple, its more advanced use cases are very hard to master. If the contents of the Pandas groupby function were made into a separate library, it would be larger than most in the Python ecosystem.

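# Group by category and by month (monthly frequency on the date column),
# then find the total revenue for each group.
# Note: pandas 2.2+ spells the month-end alias 'ME'; 'M' is deprecated there.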

grouped = df.groupby(['category', pd.Grouper(key='date', freq='M')])
monthly_revenue = grouped['revenue'].sum()
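The snippet above assumes a DataFrame with category, date, and revenue columns. A self-contained sketch with invented sales figures could look like the following. Note that it swaps pd.Grouper for a month key built with dt.to_period("M"), a different technique chosen here because the period alias "M" is spelled the same way across recent pandas versions:

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2023-01-05", "2023-01-20", "2023-02-10", "2023-01-15", "2023-02-25"]
        ),
        "category": ["A", "A", "A", "B", "B"],
        "revenue": [100, 50, 70, 30, 20],
    }
)

# Month key as a Period Series, e.g. 2023-01
month = sales["date"].dt.to_period("M").rename("month")

# Total revenue per (category, month) pair
monthly_revenue = sales.groupby(["category", month])["revenue"].sum()
print(monthly_revenue)
```

The result is a Series indexed by category and month; calling unstack() on it would put the months into columns.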

Skillfully choosing the right function for a particular situation is a sign that you are a true data sculptor.

Read parts two and three to learn the ins and outs of the functions mentioned in this section.
