12 Ways to Handle Missing Values in Data
1. Delete the row that has missing values
2. Delete the entire column that has missing values
3. Impute missing values with Mean
4. Impute missing values with Median
5. Impute missing values with Mode
6. Impute missing values with a new category
7. Last observation carried forward (LOCF) method
8. Interpolate Missing Values
9. Use IterativeImputer or Regression
10. Nearest Neighbours Imputation (KNNImputer)
11. Imputation using Deep Learning Library: Datawig
12. Use Algorithms that support missing values
Conclusion

Photo by Isaac Smith on Unsplash

Many machine learning algorithms fail if the dataset contains missing values. Missing records can also hurt the accuracy of the entire analysis. That is why it is necessary to handle missing values in the data. Let’s see how you can deal with missing values like an expert!

1. Delete the row that has missing values

If you have only a few missing values and your total data size is large enough for your analysis, then you can remove the rows that have missing data. It is a quick and dirty method that can be applied when you need to analyse your data quickly without having to handle its missing values.

Cons

  • Loss of data
  • Works only when the number of missing values is small and the data size is large
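
As a rough illustration of row deletion, here is a minimal sketch using pandas `dropna`. The toy data frame and its column names are made up for this example and are not from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has a missing age, another a missing salary.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "age": [25.0, np.nan, 31.0, 40.0],
    "salary": [50000.0, 62000.0, np.nan, 80000.0],
})

# Drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0, how="any")

print(rows_dropped)
print(f"kept {len(rows_dropped)} of {len(df)} rows")
```

Half of the rows are lost in this tiny example, which is why the method only makes sense when few values are missing.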

2. Delete the entire column that has missing values

Sometimes no information is better than incomplete information. So if you have a large number of missing values in the same column, then you can remove the entire column from the analysis to avoid working with incomplete data.
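
A minimal sketch of column deletion with pandas follows. The 50% cut-off and the column names are assumptions chosen for illustration; the right threshold depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "bonus" is mostly missing.
df = pd.DataFrame({
    "age": [25.0, 31.0, 40.0, 28.0],
    "salary": [50000.0, np.nan, 80000.0, 61000.0],
    "bonus": [np.nan, np.nan, np.nan, 5000.0],
})

# Fraction of missing values in each column.
missing_fraction = df.isna().mean()

# Assumed cut-off: drop columns where more than half of the values are missing.
threshold = 0.5
cols_to_drop = missing_fraction[missing_fraction > threshold].index
cols_dropped = df.drop(columns=cols_to_drop)

print(missing_fraction)
print(cols_dropped.columns.tolist())
```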

3. Impute missing values with Mean

If you have missing values in a numeric column and you can’t afford to lose data by removing the entire row or column, then you can impute the missing values with the mean of the entire column.

Cons

  • Only works on numeric data
  • Only works well if there are no outliers
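
Mean imputation can be sketched in pandas as below. The values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
salary = pd.Series([50.0, 60.0, np.nan, 70.0], name="salary")

# mean() skips NaN by default, so this is (50 + 60 + 70) / 3.
mean_value = salary.mean()

# Replace the missing entry with the column mean.
salary_mean_filled = salary.fillna(mean_value)
print(salary_mean_filled.tolist())
```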

4. Impute missing values with Median

If you have outliers in the column, then the mean will be much higher or lower than most of the values. In that case we should use the median to impute missing values.

Pros

  • No Data Loss
  • Works on datasets with outliers

Cons

  • Only works on numeric data
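
The sketch below, on made-up numbers, shows how a single outlier pulls the mean away from the typical values while the median stays close to them.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value and one outlier (1000).
salary = pd.Series([50.0, 60.0, np.nan, 70.0, 1000.0], name="salary")

mean_value = salary.mean()      # dragged upwards by the outlier
median_value = salary.median()  # middle of the observed values

# Impute with the median instead of the mean.
salary_median_filled = salary.fillna(median_value)
print(mean_value, median_value)
```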

5. Impute missing values with Mode

If there are missing values in a categorical column, then we cannot calculate its mean or median. In that case we impute missing values using the most frequent category in the column, that is, the mode.

Pros

  • Better than deleting the entire row

Cons

  • Not recommended if there are too many missing values
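
A minimal pandas sketch of mode imputation, using an invented `city` column:

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
city = pd.Series(["Paris", "Rome", None, "Paris", "Oslo"], name="city")

# mode() returns a Series (there can be ties); take the first entry.
most_frequent = city.mode()[0]

# Replace the missing entry with the most frequent category.
city_mode_filled = city.fillna(most_frequent)
print(city_mode_filled.tolist())
```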

6. Impute missing values with a new category

If your categorical column has many missing values, then it is safer to create a new category for them. For example, all missing entries can be assigned to a new category called ‘missing’.

Pros

Cons

  • Adds a new feature to the analysis
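
One way this could look in pandas is sketched below. The column and its values are hypothetical; only the category name ‘missing’ comes from the text above.

```python
import pandas as pd

# Hypothetical categorical column where most values are missing.
city = pd.Series(["Paris", None, "Rome", None, None], name="city")

# Put all missing entries into their own category instead of guessing a value.
city_with_missing = city.fillna("missing")

print(city_with_missing.value_counts())
```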

7. Last observation carried forward (LOCF) method

Let’s say a student drops out of a study before the session ends. His or her last observed score is then used for all subsequent (i.e., missing) observation points. This is called the Last Observation Carried Forward method, or LOCF. This method is only applicable to data with repeated measures per subject. Similarly, we can also do a backward fill (filling with the next available value).

Pros

  • Better than removing all records for that subject (in this case, the student)

Cons

  • Gets influenced by outliers. Sometimes a forward or backward fill can be far from the actual values because of outliers. For example, during the Covid-19 period some exams were not taken and marks were forward filled using the previous exam marks. If a student is an overall good performer but somehow did not perform well in the most recent exam, then after applying LOCF she is given a low score, which is far from reality. Here her marks in the previous exam were outliers, and the LOCF method was affected by them.

The same applies if an average student happens to get very high marks in the most recent exam, and the scores in the following exams are imputed from those good marks.
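
A small sketch of forward and backward fill on repeated-measures data follows. The students, exams and scores are invented, and grouping by subject is an assumption made here so that one student's score is never carried over to another student.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: one row per student per exam.
scores = pd.DataFrame({
    "student": ["A", "A", "A", "B", "B", "B"],
    "exam": [1, 2, 3, 1, 2, 3],
    "score": [80.0, np.nan, np.nan, np.nan, 65.0, 70.0],
})
scores = scores.sort_values(["student", "exam"])

# LOCF: carry the last observed score forward within each student.
scores["locf"] = scores.groupby("student")["score"].ffill()

# Backward fill: use the next available score within each student.
scores["backfill"] = scores.groupby("student")["score"].bfill()

print(scores)
```

Student B has no earlier observation for exam 1, so LOCF leaves that value missing, while the backward fill takes it from exam 2.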

8. Interpolate Missing Values

It is also a good idea to interpolate missing values using the rest of the values in the series. Interpolation means filling the gaps in the chart: fit a line or a curve through the known data points, and then fill in each missing value with the point that falls on that line or curve.

Pros

  • More accurate than forward or backward fill

Cons

  • Doesn’t work well if the other values in the column vary a lot

In Python, the pandas library has a built-in method called interpolate that imputes missing values by interpolation, and it lets us choose between methods such as ‘linear’, ‘polynomial’ and ‘quadratic’.
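
A minimal sketch of linear interpolation with pandas on a made-up series is shown below. Only the `linear` method is run here, because the `polynomial` and `quadratic` methods additionally need SciPy installed (and `polynomial` needs an `order` argument).

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced series with gaps.
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Fill each gap with the point on the straight line between its neighbours.
linear = s.interpolate(method="linear")

print(linear.tolist())
```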

9. Use IterativeImputer or Regression

If the variable that has missing values is related to another column in the dataset, then we can impute the missing values using a regression model. Suppose the data has name, age and salary. If age and salary are related, such that younger people are likely to have lower salaries, then a better way to impute missing salaries is to fit a regression model of salary on age and then predict each missing salary from the corresponding age.

This is very easy to do in Python using the IterativeImputer class from the scikit-learn library.

Pros

  • More accurate than forward or backward fill

Cons

  • Requires other columns that are correlated with the column that has missing values
  • Doesn’t work well if the underlying relationship is not significant
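
A minimal sketch of regression-based imputation with scikit-learn's `IterativeImputer` follows. The age and salary values are invented. Note that the class is still marked experimental in scikit-learn, so it has to be enabled with an explicit import first.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to enable the class
from sklearn.impute import IterativeImputer

# Hypothetical data where salary rises roughly linearly with age.
df = pd.DataFrame({
    "age": [22.0, 25.0, 30.0, 35.0, 40.0, 45.0],
    "salary": [44000.0, 51000.0, np.nan, 69000.0, 81000.0, 90000.0],
})

# Each column with missing values is modelled from the other columns
# (the default estimator is a Bayesian ridge regression).
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

The missing salary for age 30 should land between the salaries observed for ages 25 and 35, because the model predicts it from age.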

10. Nearest Neighbours Imputation (KNNImputer)

For example, if the Salary column has a missing value for an employee with age 25 and tenure 2 years, this method tries to find the employees with the closest possible age and tenure and takes the mean of their salaries. This is called the K-Nearest Neighbours imputation method.

In Python, the scikit-learn library provides KNNImputer, which can impute missing values using this method.

Pros

  • More accurate than forward or backward fill

Cons

  • Doesn’t work well if the other values in the column with missing data vary a lot
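
The age and tenure example can be sketched with scikit-learn's `KNNImputer` as below. The employee data and the choice of two neighbours are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical employees; the salary for age 25 / tenure 2 is missing.
df = pd.DataFrame({
    "age": [24.0, 25.0, 26.0, 50.0, 52.0],
    "tenure": [1.0, 2.0, 3.0, 20.0, 25.0],
    "salary": [40000.0, np.nan, 44000.0, 90000.0, 95000.0],
})

# Use the 2 nearest rows (by age and tenure) and average their salaries.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed.loc[1, "salary"])
```

The two closest employees are aged 24 and 26, so the imputed salary is the mean of 40000 and 44000. Distances depend on the scale of each column, so in practice it may help to scale the features first.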

11. Imputation using Deep Learning Library: Datawig

This method works very well with categorical, continuous, and non-numerical features. Datawig is a library that learns ML models using deep neural networks to impute missing values in a data frame. Datawig can take a data frame and fit an imputation model for each column with missing values, with all other columns as inputs.

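import datawig

# df_train / df_test: assumed pandas DataFrames; column names are hypothetical
imputer = datawig.SimpleImputer(
    input_columns=['age', 'tenure'],  # columns used as inputs
    output_column='salary',           # column to impute
    output_path='imputer_model')      # where the model is stored

# Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

# Impute missing values and return the original dataframe with predictions
imputed = imputer.predict(df_test)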

Pros

  • Can be more accurate than the other approaches

Cons

  • May take a long time to run

12. Use Algorithms that support missing values

Some machine learning and data mining algorithms can work with missing values, for example the K-Nearest Neighbours algorithm and Naive Bayes, depending on the implementation. These algorithms can be used when the dataset contains null or missing values.

Pros

  • No need to handle missing values

Cons

  • Limited to certain data analysis approaches and algorithms

Conclusion

Every dataset is different, and before handling missing values we should always understand the data: how the column with missing values is related to the other columns, what the data type of that column is, and what percentage of the data is missing. This will help us find an appropriate method for missing value imputation.

Reference

[1] Datawig: https://github.com/awslabs/datawig
