Missing Value Imputation, Explained: A Visual Guide with Code Examples for Beginners


DATA PREPROCESSING

One (tiny) dataset, six imputation methods?

Let’s discuss something that every data scientist, analyst, or curious number-cruncher has to deal with sooner or later: missing values. Now, I know what you’re thinking — “Oh great, another missing value guide.” But hear me out. I’m going to show you how to tackle this problem using not one, not two, but six different imputation methods, all on a single dataset (with helpful visuals as well!). By the end of this, you’ll see why domain knowledge is worth its weight in gold (something even our AI friends might struggle to replicate).

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Before we get into our dataset and imputation methods, let’s take a moment to understand what missing values are and why they’re such a common headache in data science.

What Are Missing Values?

Missing values, often represented as NaN (Not a Number) in pandas or NULL in databases, are essentially holes in your dataset. They’re the empty cells in your spreadsheet, the blanks in your survey responses, the data points that got away. In the world of data, not all absences are created equal, and understanding the nature of your missing values is crucial for deciding how to handle them.

Image by author.

Why Do Missing Values Occur?

Missing values can sneak into your data for a variety of reasons. Here are some common ones:

  1. Data Entry Errors: Sometimes, it’s just human error. Someone might forget to input a value or accidentally delete one.
  2. Sensor Malfunctions: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.
  3. Survey Non-Response: In surveys, respondents might skip questions they’re uncomfortable answering or don’t understand.
  4. Merged Datasets: When combining data from multiple sources, some entries might not have corresponding values in all datasets.
  5. Data Corruption: During data transfer or storage, some values might get corrupted and become unreadable.
  6. Intentional Omissions: Some data may be intentionally left out due to privacy concerns or irrelevance.
  7. Sampling Issues: The data collection method might systematically miss certain types of data.
  8. Time-Sensitive Data: In time series data, values may be missing for periods when data wasn’t collected (e.g., weekends, holidays).

Types of Missing Data

Understanding the type of missing data you’re dealing with can help you select the most appropriate imputation method. Statisticians generally categorize missing data into three types (a small simulation sketch follows the list):

  1. Missing Completely at Random (MCAR): The missingness is completely random and doesn’t depend on any other variable. For example, a lab sample that was accidentally dropped.
  2. Missing at Random (MAR): The probability of missing data depends on other observed variables but not on the missing value itself. For example, men might be less likely to answer survey questions about emotions.
  3. Missing Not at Random (MNAR): The missingness depends on the value of the missing data itself. For example, people with high incomes might be less likely to report their income in a survey.
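
To make these categories concrete, here is a minimal sketch (my own toy example, not part of the golf dataset) that simulates each mechanism on a made-up income survey:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
survey = pd.DataFrame({
    'gender': rng.choice(['M', 'F'], size=1_000),
    'income': rng.normal(50_000, 15_000, size=1_000),
})
n = len(survey)

# MCAR: every value has the same 10% chance of going missing
survey['income_mcar'] = survey['income'].mask(rng.random(n) < 0.10)

# MAR: missingness depends on another observed column (gender), not on income
survey['income_mar'] = survey['income'].mask((survey['gender'] == 'M') & (rng.random(n) < 0.30))

# MNAR: missingness depends on the unobserved value itself (high earners skip the question)
survey['income_mnar'] = survey['income'].mask((survey['income'] > 70_000) & (rng.random(n) < 0.50))

# Share of missing values produced by each mechanism
print(survey.filter(like='income_').isnull().mean())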

Why Care About Missing Values?

Missing values can significantly impact your analysis:

  1. They can introduce bias if not handled properly.
  2. Many machine learning algorithms can’t handle missing values out of the box (see the quick demo after this list).
  3. They can result in the loss of valuable information if instances with missing values are simply discarded.
  4. Improperly handled missing values can lead to incorrect conclusions or predictions.
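
To see point 2 in action, here is a quick sketch: most scikit-learn estimators refuse to fit on data containing NaN (the exact error message may vary by version):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)  # raises because X contains a NaN
except ValueError as err:
    print(f"Fit failed: {err}")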

That’s why it’s crucial to have a solid strategy for dealing with missing values. And that’s exactly what we’re going to explore in this article!

First things first, let’s introduce our dataset. We’ll be working with a golf course dataset that tracks various factors affecting the crowdedness of the course. This dataset has a bit of everything — numerical data, categorical data, and yes, plenty of missing values.

This dataset is artificially created by the author (inspired by [1]) to promote learning.
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
'08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display basic information about the dataset
print(df.info())

# Display the first few rows of the dataset
print(df.head())

# Display the count of missing values in each column
print(df.isnull().sum())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 20 non-null object
1 Weekday 20 non-null int64
2 Holiday 19 non-null float64
3 Temp 16 non-null float64
4 Humidity 17 non-null float64
5 Wind 19 non-null float64
6 Outlook 17 non-null object
7 Crowdedness 15 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.3+ KB

Date Weekday Holiday Temp Humidity Wind Outlook Crowdedness
0 08-01 0 0.0 25.1 99.0 0.0 rainy 0.14
1 08-02 1 0.0 26.4 NaN 0.0 sunny NaN
2 08-03 2 0.0 NaN 96.0 0.0 rainy 0.21
3 08-04 3 0.0 24.1 68.0 0.0 overcast 0.68
4 08-05 4 NaN 24.7 98.0 0.0 rainy 0.20

Date 0
Weekday 0
Holiday 1
Temp 4
Humidity 3
Wind 1
Outlook 3
Crowdedness 5
dtype: int64

As we can see, our dataset contains 20 rows and 8 columns:

  • Date: The date of the observation
  • Weekday: Day of the week (0–6, where 0 is Monday)
  • Holiday: Boolean indicating whether it’s a holiday (0 or 1)
  • Temp: Temperature in Celsius
  • Humidity: Humidity percentage
  • Wind: Wind condition (0 or 1, possibly indicating calm or windy)
  • Outlook: Weather outlook (sunny, overcast, or rainy)
  • Crowdedness: Percentage of course occupancy

And look at that! We have missing values in every column except Date and Weekday. Perfect for our imputation party.

Now that we have our dataset loaded, let’s tackle these missing values with six different imputation methods. We’ll use a different strategy for each type of data.

Method 1: Listwise Deletion

Listwise deletion, also known as complete case analysis, involves removing entire rows that contain any missing values. This method is straightforward and preserves the distribution of the data, but it can result in a significant loss of information if many rows contain missing values.

👍 Common Use: Listwise deletion is often used when the number of missing values is small and the data is missing completely at random (MCAR). It’s also useful when you need a complete dataset for certain analyses that can’t handle missing values.

In Our Case: We’re using listwise deletion for rows that have at least 4 missing values. These rows won’t provide enough reliable information, and removing them will help us focus on the more complete data points. However, we’re being cautious and only removing rows with significant missing data to preserve as much information as possible.

# Count missing values in each row
missing_count = df.isnull().sum(axis=1)

# Keep only rows with fewer than 4 missing values
df_clean = df[missing_count < 4].copy()
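
For comparison, strict listwise deletion (dropping every row that contains any missing value) is a one-liner in pandas, though on a dataset this small it would throw away a large share of the rows:

# Strict listwise deletion: drop every row with at least one missing value
df_complete = df.dropna()
print(f"{len(df_complete)} of {len(df)} rows remain after strict listwise deletion")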

We’ve removed 2 rows that had too many missing values. Now let’s move on to imputing the remaining missing data.

Method 2: Simple Imputation (Mean and Mode)

Simple imputation involves replacing missing values with a summary statistic of the observed values. Common approaches include using the mean, median, or mode of the non-missing values in a column.

👍 Common Use: Mean imputation is often used for continuous variables when the data is missing at random and the distribution is roughly symmetric. Mode imputation is typically used for categorical variables.

In Our Case: We’re using mean imputation for Humidity and mode imputation for Holiday. For Humidity, assuming the missing values are random, the mean provides a reasonable estimate of the typical humidity. For Holiday, since it’s a binary variable (holiday or not), the mode gives us the most common state, which is a sensible guess for missing values.

# Mean imputation for Humidity
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())

# Mode imputation for Holiday
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])
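
If you would rather learn these statistics as part of a scikit-learn pipeline (so the same fill values can be reused on new data), here is a sketch of the same two fills with SimpleImputer, assuming the df_clean from above:

from sklearn.impute import SimpleImputer

# Mean imputation for Humidity, learned from the observed values
mean_imputer = SimpleImputer(strategy='mean')
df_clean[['Humidity']] = mean_imputer.fit_transform(df_clean[['Humidity']])

# Most-frequent (mode) imputation for Holiday
mode_imputer = SimpleImputer(strategy='most_frequent')
df_clean[['Holiday']] = mode_imputer.fit_transform(df_clean[['Holiday']])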

Method 3: Linear Interpolation

Linear interpolation estimates missing values by assuming a linear relationship between known data points. It’s particularly useful for time series data or data with a natural ordering.

👍 Common Use: Linear interpolation is often used for time series data, where missing values can be estimated based on the values before and after them. It’s also useful for any data where there’s expected to be a roughly linear relationship between adjacent points.

In Our Case: We’re using linear interpolation for Temperature. Since temperature tends to change gradually over time and our data is ordered by date, linear interpolation can provide reasonable estimates for the missing temperature values based on the temperatures recorded on nearby days.

df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')
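
As a quick illustration of what the linear method does, a single gap is filled with the midpoint of its two neighbours:

# Toy example: the missing middle value becomes the average of its neighbours
s = pd.Series([24.0, np.nan, 28.0])
print(s.interpolate(method='linear'))  # 24.0, 26.0, 28.0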

Method 4: Forward/Backward Fill

Forward fill (or “last observation carried forward”) propagates the last known value forward to fill gaps, while backward fill does the opposite. This method assumes that the missing value is likely to be similar to the nearest known value.

👍 Common Use: Forward/backward fill is often used for time series data, especially when the value is likely to remain constant until changed (like in financial data) or when the most recent known value is the best guess for the current state.

In Our Case: We’re using a combination of forward and backward fill for Outlook. Weather conditions often persist for several days, so it’s reasonable to assume that a missing Outlook value is likely to be similar to the Outlook of the previous or following day.

# Forward fill first, then backward fill to cover any leading gap
df_clean['Outlook'] = df_clean['Outlook'].ffill().bfill()
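
The order matters here: forward fill handles most gaps, and the trailing backward fill only exists to cover a gap at the very start of the series, which has no earlier value to carry forward. A toy illustration:

# Toy example: ffill copies the last known value forward;
# bfill then fills the leading gap that ffill can't reach
s = pd.Series([np.nan, 'sunny', np.nan, 'rainy'])
print(s.ffill().bfill().tolist())  # ['sunny', 'sunny', 'sunny', 'rainy']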

Method 5: Constant Value Imputation

This method involves replacing all missing values in a variable with a specific constant value. This constant could be chosen based on domain knowledge or as a safe default.

👍 Common Use: Constant value imputation is often used when there’s a logical default value for missing data, or when you want to explicitly flag that a value was missing (by using a value outside the normal range of the data).

In Our Case: We’re using constant value imputation for the Wind column, replacing missing values with -1. This approach explicitly flags imputed values (since -1 is outside the normal 0–1 range for Wind) and preserves the information that these values were originally missing.

df_clean['Wind'] = df_clean['Wind'].fillna(-1)

Method 6: KNN Imputation

K-Nearest Neighbors (KNN) imputation estimates missing values by finding the K most similar samples in the dataset (just like the KNN classification algorithm) and using their values to impute the missing data. This method can capture complex relationships between variables.

👍 Common Use: KNN imputation is flexible and can be used for both continuous and categorical variables. It’s particularly useful when there are expected to be complex relationships between variables that simpler methods might miss.

In Our Case: We’re using KNN imputation for Crowdedness. Crowdedness likely depends on a combination of factors (like temperature, holiday status, etc.), and KNN can capture these complex relationships to provide more accurate estimates of the missing crowdedness values.

from sklearn.impute import KNNImputer

# One-hot encode the 'Outlook' column
outlook_encoded = pd.get_dummies(df_clean['Outlook'], prefix='Outlook')

# Prepare features for KNN imputation
features_for_knn = ['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind']
knn_features = pd.concat([df_clean[features_for_knn], outlook_encoded], axis=1)

# Apply KNN imputation to the features plus the Crowdedness target
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)),
    columns=list(knn_features.columns) + ['Crowdedness'],
    index=df_clean.index  # keep the original index so the values align with df_clean
)

# Update the original dataframe with the imputed Crowdedness values
df_clean['Crowdedness'] = df_imputed['Crowdedness']
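
One caveat with the code above: KNNImputer measures distance on the raw features, so a wide-range column like Humidity (roughly 68–99) can dominate binary columns like Holiday or Wind. A variant worth trying is to scale everything to [0, 1] first and invert the scaling afterwards; this sketch would replace the fit/transform step above rather than run after it:

from sklearn.preprocessing import MinMaxScaler

# Combine features and target, then scale to [0, 1] so no column dominates the distance
combined = pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(combined)  # NaNs are ignored when fitting and kept in the output

# Impute in the scaled space, then map back to the original units
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)
imputed = scaler.inverse_transform(imputed_scaled)

df_scaled = pd.DataFrame(imputed, columns=combined.columns, index=df_clean.index)
df_clean['Crowdedness'] = df_scaled['Crowdedness']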

So, there you have it! Six different ways to handle missing values, all applied to our golf course dataset.

Now, all missing values are filled!
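
A quick sanity check confirms it; every column should now report zero missing values:

# Verify that no missing values remain in any column
print(df_clean.isnull().sum())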

Let’s recap how each method tackled our data:

  1. Listwise Deletion: Helped us focus on more complete data points by removing rows with extensive missing values.
  2. Simple Imputation: Filled in Humidity with the average value and Holiday with the most common value.
  3. Linear Interpolation: Estimated missing Temperature values based on the trend of surrounding days.
  4. Forward/Backward Fill: Guessed missing Outlook values from adjacent days, reflecting the persistence of weather patterns.
  5. Constant Value Imputation: Flagged missing Wind data with -1, preserving the fact that these values were originally unknown.
  6. KNN Imputation: Estimated Crowdedness based on similar days, capturing complex relationships between variables.

Each method tells a different story about our missing data, and the “right” choice depends on what we know about our golf course operations and what questions we’re trying to answer.

The key takeaway? Don’t just blindly apply imputation methods. Understand your data, consider the context, and select the method that makes the most sense for your specific situation.
