Being built on top of numpy made it hard for pandas to handle missing values in a hassle-free, flexible way, since numpy has no native representation of missing values for non-float data types (there is no integer NaN, for instance).
For example, integers are automatically converted to floats as soon as a missing value is introduced, which is just not ideal:
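Here is a minimal sketch of the behavior, using a made-up `points` column since the original dataset isn't shown here:

```python
import pandas as pd

# A hypothetical "points" column with no missing values
points = pd.Series([10, 25, 7, 3], name="points")
print(points.dtype)  # int64

# The same data with a single None: numpy has no integer NaN,
# so pandas silently upcasts the whole column to float64
points = pd.Series([10, None, 7, 3], name="points")
print(points.dtype)  # float64
```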
Note how points automatically changes from int64 to float64 after the introduction of a single None value.
There is nothing worse for a data flow than wrong typesets, especially within a data-centric AI paradigm.
Erroneous typesets directly impact data preparation decisions, cause incompatibilities between different chunks of data, and, even when they pass silently, they may compromise certain operations and make them return nonsensical results.
For example, at the Data-Centric AI Community, we're currently working on a project around synthetic data for data privacy. One of the features, NOC (number of children), has missing values, and it is therefore automatically converted to float when the data is loaded. Then, when passing the data into a generative model as a float, we might get output values as decimals such as 2.5. Unless you're a mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5 children is not OK.
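To make this concrete, here is a hypothetical stand-in for that dataset (the real project data isn't shown here): a NOC column with one missing entry, loaded the default way:

```python
import io
import pandas as pd

# Hypothetical CSV standing in for the project data: NOC has one missing value
csv = io.StringIO("id,NOC\n1,2\n2,\n3,3\n")

df = pd.read_csv(csv)
print(df["NOC"].dtype)   # float64: one missing entry upcasts the whole column
print(df["NOC"].mean())  # 2.5, exactly the kind of fractional value
                         # a downstream model may end up producing
```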
In pandas 2.0, we can load the data with dtype_backend='numpy_nullable', so we can keep our original data types (int64 in this case):
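A minimal sketch, reusing the hypothetical NOC data from above:

```python
import io
import pandas as pd

csv = io.StringIO("id,NOC\n1,2\n2,\n3,3\n")

# pandas 2.0+: request the nullable numpy-backed dtypes at load time
df = pd.read_csv(csv, dtype_backend="numpy_nullable")
print(df["NOC"].dtype)  # Int64 (nullable integer), not float64
print(df["NOC"])        # the missing entry shows up as <NA>
```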
It might seem like a subtle change, but under the hood it means that pandas can now natively leverage Arrow's implementation of dealing with missing values. This makes operations much more efficient, since pandas doesn't have to implement its own version for handling null values for each data type.
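For comparison, pandas 2.0 also accepts dtype_backend='pyarrow', which stores the column in Arrow memory directly (this sketch assumes pyarrow is installed):

```python
import io
import pandas as pd

csv = io.StringIO("id,NOC\n1,2\n2,\n3,3\n")

# Same data, but backed by Arrow rather than numpy
df = pd.read_csv(csv, dtype_backend="pyarrow")
print(df["NOC"].dtype)  # int64[pyarrow], with the missing entry as <NA>
```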
