You're going to come across plenty of data inconsistencies: nulls, negative values, string inconsistencies, and so on. If these aren't handled early in your data analysis workflow, querying and analysing your data afterwards can be a real pain.
Now, I've done data cleaning before using SQL and Excel, but never with Python. So, to get familiar with Pandas (one of Python's data analysis libraries), I'll be dabbling in some data cleaning.
In this article, I'll be sharing a repeatable, beginner-friendly data cleaning workflow. By the end of this article, you should be pretty confident using Python for data cleaning and analysis.
The Dataset we’ll be working with
I'll be working with a synthetic, messy HR dataset containing typical real-world errors (inconsistent dates, mixed data types, compound columns). This dataset is from Kaggle, and it's designed for practising data cleaning, transformation, exploratory analysis, and preprocessing for data visualisation and machine learning.
The dataset contains over 1,000 rows and 13 columns, including employee information such as names, department-region combinations, contact details, status, salaries, and performance scores. It includes examples of:
- Duplicates
- Missing values
- Inconsistent date formats
- Erroneous entries (e.g., non-numeric salary values)
- Compound columns (e.g., “Department_Region” like “Cloud Tech-Texas” that could be split)
It contains columns like:
- Employee_ID: Unique synthetic ID (e.g., EMP1001)
- First_Name, Last_Name: Randomly generated personal names
- Name: Full name (may be redundant with first/last)
- Age: Includes missing values
- Department_Region: Compound column (e.g., “HR-Florida”)
- Status: Employee status (Active, Inactive, Pending)
- Join_Date: Inconsistent format (YYYY/MM/DD)
- Salary: Includes invalid entries (e.g., “N/A”)
- Email, Phone: Synthetic contact information
- Performance_Score: Categorical performance rating
- Remote_Work: Boolean flag (True/False)
You can access the dataset here and play around with it.
The dataset is fully synthetic. It doesn't contain any real individuals' data and is safe to use for public, academic, or commercial projects.
This dataset is in the public domain under the CC0 1.0 Universal license. You are free to use, modify, and distribute it without restriction.
Overview of the Cleaning Workflow
The data cleaning workflow I'll be working with consists of five simple stages:
- Load
- Inspect
- Clean
- Review
- Export
Let's dive deeper into each of these stages.
Step 1 — Load the CSV (And Handle the First Hidden Issues)
There are a few things to keep in mind before loading your dataset. Admittedly, this is an optional step, and we probably won't encounter most of these issues in our dataset, but it doesn't hurt to know about them. Here are some key things to consider while loading.
Encoding issues (utf-8, latin-1)
Encoding defines how characters are stored as bytes in the file. Python and Pandas usually default to UTF-8, which handles most modern text and special characters globally. However, if the file was created on an older system or in a non-English environment, it might use a different encoding, most commonly Latin-1.
So if you try to read a Latin-1 file as UTF-8, Pandas will encounter bytes it doesn't recognise as valid UTF-8 sequences. You'll typically see a UnicodeDecodeError when you try to read a CSV with encoding issues.
If the default load fails, you can try specifying a different encoding:
# First attempt (the default UTF-8)
try:
    df = pd.read_csv('messy_data.csv')
except UnicodeDecodeError:
    # Second attempt with a common alternative
    df = pd.read_csv('messy_data.csv', encoding='latin-1')
Wrong delimiters
CSV stands for "Comma-Separated Values," but in reality, many files use other characters as separators, like semicolons (common in Europe), tabs, or even pipes (|). Pandas defaults to the comma (,).
So, if your file uses a semicolon (;) but you load it with the default comma delimiter, Pandas will treat each entire row as a single value. The result is a DataFrame with one column containing whole lines of data, making it impossible to work with.
The fix is pretty simple. Check the raw file (opening it in a text editor like VS Code or Notepad++ works best) to see which character separates the values. Then, pass that character to the sep argument like so:
# If the file uses semicolons
df = pd.read_csv('messy_data.csv', sep=';')
# If the file uses tabs (TSV)
df = pd.read_csv('messy_data.csv', sep='\t')
Columns that import incorrectly
Sometimes, Pandas guesses the data type for a column based on the first few rows, but later rows contain unexpected data (e.g., text mixed into a column that started with numbers).
For instance, Pandas may correctly identify 0.1, 0.2, 0.3 as floats, but if row 100 contains the value N/A, Pandas might force the entire column into an object (string) type to accommodate the mixed values. This sucks because you lose the ability to perform fast, vectorised numeric operations on that column until you clean up the bad values.
To fix this, I use the dtype argument to explicitly tell Pandas what data type a column should be. This prevents silent type casting.
df = pd.read_csv('messy_data.csv', dtype={'price': float, 'quantity': 'Int64'})
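And if a column has already been read in as an object because of stray text values, you can coerce it to numeric after loading. Here's a minimal sketch, assuming a salary column polluted by junk entries like "N/A":
# Coerce leftover text values to NaN so the column becomes numeric
# (a sketch; assumes a 'salary' column containing entries like "N/A")
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')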
Reading the first few rows
You can save time by checking the first few rows directly during the loading process using the nrows parameter. This is great, especially when you're working with large datasets, because it allows you to test encoding and delimiters without loading the entire 10 GB file.
# Load only the first 50 rows to verify encoding and delimiter
temp_df = pd.read_csv('large_messy_data.csv', nrows=50)
print(temp_df.head())
Once you've confirmed the arguments are correct, you can load the full file.
Let's load the Employee dataset. I don't expect to see any issues here.
import pandas as pd
df = pd.read_csv('Messy_Employee_dataset.csv')
df
Output:
1020 rows × 12 columns
Now we can move on to Step 2: Inspection.
Step 2 — Inspect the Dataset
I treat this phase like a forensic audit: I'm looking for evidence of chaos hidden beneath the surface. If I rush this step, I guarantee myself a world of pain and analytical errors down the road. I always run these five crucial checks before writing any transformation code.
The following methods give me a full health report on my 1,020 employee records:
1. df.head() and df.tail(): Understanding the Boundaries
I always start with a visual check. I use df.head() and df.tail() to see the first and last five rows. This is my quick sanity check to see whether all the columns look aligned and whether the data visually makes sense.
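For completeness, the check itself is just:
# Visual sanity check: first and last five rows
print(df.head())
print(df.tail())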
My Finding:
When I ran df.head(), I noticed my Employee ID was sitting in a regular column, and the DataFrame was using the default numerical index (0, 1, 2, …) instead.
While I know I could set Employee ID as the index, I'll leave it for now. The bigger immediate visual risk I'm looking for here is data misaligned into the wrong column, or obvious leading/trailing spaces in names that will cause trouble later.
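One quick way to catch those sneaky spaces is to compare a name column with its stripped version. A minimal sketch, using this dataset's First_Name column:
# Count First_Name values that carry leading/trailing whitespace
has_spaces = df['First_Name'].astype(str) != df['First_Name'].astype(str).str.strip()
print(f"Names with stray whitespace: {has_spaces.sum()}")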
2. df.info(): Spotting Datatype Problems and Missingness
This is the most critical method. It tells me the column names, the data types (Dtype), and the exact number of non-null values.
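The call itself is a one-liner:
# Column names, dtypes, and non-null counts in one view
df.info()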
My Findings on 1,020 Rows:
- Missing Age: My total entry count is 1,020, but the Age column only has 809 non-null values. That's a significant amount of missing data that I'll have to decide how to handle later: do I impute it, or do I drop the rows?
- Missing Salary: The Salary column has 996 non-null values, which is only a minor gap, but still something I need to resolve.
- The ID Type Check: Employee_ID is correctly listed as an object (string). This is exactly what I want. IDs are identifiers, not numbers to be averaged, and the string type prevents Pandas from accidentally stripping leading zeros.
3. Data Integrity Check: Duplicates and Unique Counts
After checking dtypes, I want to know whether I have duplicate records and how consistent my categorical data is.
- Checking for Duplicates: I ran df.duplicated().sum() and got a result of 0. This is great! It means I don't have identical rows cluttering up my dataset. (The exact calls are shown in the snippet after this list.)
- Checking Unique Values (df.nunique()): I use this to understand the range of values within each column. Low counts in categorical columns are fine, but I look for columns that should be unique but aren't, or columns with too many unique values, suggesting typos.
- Employee_ID has 1,020 unique values. This is perfect: it means all records are unique.
- The First_Name / Last_Name fields have eight unique values. This confirms the dataset's synthetic nature. My analysis won't be skewed by a large variety of names, since I'll treat them as standard strings.
- Department_Region has 36 unique values. There's high potential for typos here; 36 unique values for department/region is too many. I'll need to check this column for spelling variations (e.g., "HR" vs. "Human Resources") in the next step.
- Email (64 unique values). With 1,020 employees, having only 64 unique emails suggests many employees share the same placeholder email. I'll flag this column for exclusion from analysis, since it's useless for identifying individuals.
- Phone (1,020 unique values). This is good, since it confirms phone numbers are unique identifiers.
- Age / Performance Rating / Status / Remote Work (2–4 unique values). These low counts are expected for categorical or ordinal data, meaning they're ready for encoding.
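Here are those two checks in code:
# Count fully duplicated rows
print(df.duplicated().sum())
# Unique value counts per column, to spot suspicious category counts
print(df.nunique())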
4. df.describe(): Catching Odd and Impossible Values
I use df.describe() to get a statistical summary of all my numerical columns. This is where truly impossible values (the "red flags") show up immediately. I mostly focus on the min and max rows.
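Again, a single call does the job:
# Statistical summary of the numeric columns; focus on min and max
print(df.describe())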
My Findings:
I immediately noticed a problem in what I assumed to be the Phone Number column, which Pandas had mistakenly converted to a numerical type.
mean   -4.942253e+09
min    -9.994973e+09
max    -3.896086e+06
25%    -7.341992e+09
50%     4.943997e+09
75%    -2.520391e+09
It seems all the phone number values were massive negative numbers! This confirms two things:
- Pandas incorrectly inferred this column as numeric, even though phone numbers are strings.
- There must be characters in the original text that Pandas couldn't interpret (for instance, parentheses, dashes, or country codes). I need to convert this column to an object type and clean it up completely.
5. df.isnull().sum(): Quantifying Missing Data
While df.info() gives me non-null counts, df.isnull().sum() gives me the total count of nulls, which is a cleaner way to quantify my next steps.
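For reference:
# Total null count per column
print(df.isnull().sum())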
My Findings:
Age has 211 nulls (1,020 - 809 = 211), and Salary has 24 nulls (1,020 - 996 = 24). These precise counts set the stage for Step 3.
This inspection process is my safety net. If I had missed the negative phone numbers, any analytical step that involved numerical data would have failed or, worse, silently produced skewed results.
By identifying the need to treat Phone as a string and the large number of missing values in Age, I now have a concrete cleaning list. This prevents runtime errors and, critically, ensures that my final analysis is based on plausible, non-corrupted data.
Step 3 — Standardise Column Names, Correct Dtypes, and Handle Missing Values
With my list of flaws in hand (missing Age, missing Salary, the terrible negative Phone Numbers, and the messy categorical data), I move into the heavy lifting. I treat this step in three sub-phases: ensuring consistency, fixing corruption, and filling gaps.
1. Standardising Column Names and Setting the Index (The Consistency Rule)
Before I do any serious data manipulation, I enforce strict consistency on column names. Why? Because accidentally typing df['Employee ID '] instead of df['employee_id'] is a silent, frustrating error. Once the names are clean, I set the index.
My golden rule: snake_case and lowercase everywhere, and ID columns should be the index.
I use a simple chained command to strip whitespace, replace spaces with underscores, and convert everything to lowercase.
# The Standardization Command
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
# Before: ['Employee_ID', 'First_Name', 'Phone']
# After: ['employee_id', 'first_name', 'phone']
Now that our columns are standardised, I can move on to setting employee_id as the index.
# Set the Employee ID as the DataFrame index
# This is crucial for efficient lookups and clean merges later.
df.set_index('employee_id', inplace=True)
# Let’s review it real quick
print(df.index)
Output:
Index(['EMP1000', 'EMP1001', 'EMP1002', 'EMP1003', 'EMP1004', 'EMP1005',
'EMP1006', 'EMP1007', 'EMP1008', 'EMP1009',
...
'EMP2010', 'EMP2011', 'EMP2012', 'EMP2013', 'EMP2014', 'EMP2015',
'EMP2016', 'EMP2017', 'EMP2018', 'EMP2019'],
dtype='object', name='employee_id', length=1020)
Perfect, everything is in place.
2. Fixing Data Types and Corruption (Tackling the Negative Phone Numbers)
My df.describe() check revealed the most urgent structural flaw: the Phone column, which was imported as a garbage numerical type. Since phone numbers are identifiers (not quantities), they must be strings.
In this phase, I'll convert the entire column to a string type, which will turn all those negative scientific-notation numbers into human-readable text (though still full of non-digit characters). I'll leave the actual text cleaning (removing parentheses, dashes, etc.) for a dedicated standardisation step (Step 4).
# Fix the Phone dtype immediately
# Note: The column name is now 'phone' due to the standardisation in 3.1
df['phone'] = df['phone'].astype(str)
3. Handling Missing Values (The Age & Salary Gaps)
Finally, I address the gaps revealed by df.info(): the 211 missing Age values and the 24 missing Salary values (out of 1,020 total rows). My strategy depends entirely on the column's role and the magnitude of the missing data:
- Salary (24 missing values): In this case, dropping the rows with missing values is the best strategy. Salary is a critical metric for financial analysis, and imputing it risks skewing conclusions. Since only a small fraction (about 2.3%) is missing, I choose to drop the incomplete records.
- Age (211 missing values): Filling the missing values is the best strategy here. Age is often a feature for predictive modelling (e.g., turnover), and dropping about 20% of my data is too costly. I'll fill the missing values with the median age to avoid skewing the distribution the way the mean would.
I execute this strategy with two separate commands:
# 1. Removal: Drop rows missing the critical 'salary' data
df.dropna(subset=['salary'], inplace=True)
# 2. Imputation: Fill missing 'age' with the median
median_age = df['age'].median()
df['age'] = df['age'].fillna(median_age)
After these commands, I run df.info() or df.isnull().sum() again, just to verify that the non-null counts for salary and age now reflect a clean dataset.
# Rechecking the null counts for salary and age
df['salary'].isnull().sum()
df['age'].isnull().sum()
Output:
np.int64(0)
So far so good!
By addressing the structural and missing-data issues here, the next steps can focus entirely on value standardisation, such as the messy 36 unique values in department_region, which we tackle in the next phase.
Step 4 — Value Standardization: Making Data Consistent
My DataFrame now has the right structure, but the values inside are still dirty. This step is about consistency. If "IT," "i.t," and "Info. Tech" all mean the same department, I need to force them into a single, clean value ("IT"). This prevents errors in grouping, filtering, and any statistical analysis based on categories.
1. Cleaning Corrupted String Data (The Phone Number Fix)
Remember the corrupted phone column from Step 2? It's currently a mess of negative scientific-notation numbers that we converted to strings in Step 3. Now it's time to extract the actual digits.
So, I'll remove every non-digit character (dashes, parentheses, dots, etc.) and convert the result into a clean, unified format. Regular expressions (via .str.replace()) are perfect for this. I use \D to match any non-digit character and replace it with an empty string.
# The phone column is currently a string like '-9.994973e+09'
# We use a regex to remove everything that isn't a digit
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)
# We can also truncate or format the resulting string if needed
# For instance, keeping only the last 10 digits:
df['phone'] = df['phone'].str.slice(-10)
print(df['phone'])
Output:
employee_id
EMP1000 1651623197
EMP1001 1898471390
EMP1002 5596363211
EMP1003 3476490784
EMP1004 1586734256
...
EMP2014 2470739200
EMP2016 2508261122
EMP2017 1261632487
EMP2018 8995729892
EMP2019 7629745492
Name: phone, Length: 996, dtype: object
Looks much better now. This is always a good practice for cleaning identifiers that contain noise (like IDs with leading characters or zip codes with extensions).
2. Separating and Standardizing Categorical Data (Fixing the 36 Regions)
My df.nunique() check revealed 36 unique values in the department_region column. When I reviewed all the unique values in the column, the output showed that they're all neatly structured as department-region (e.g., devops-california, finance-texas, cloud tech-new york).
One way to solve this is to split the single column into two dedicated columns. I'll split the column on the hyphen (-) and assign the parts to new columns: department and region.
# 1. Split the combined column into two recent, clean columns
df[['department', 'region']] = df['department_region'].str.split('-', expand=True)
Next, I'll drop the department_region column, since it's redundant now.
# 2. Drop the redundant combined column
df.drop('department_region', axis=1, inplace=True)
Let's review our new columns:
print(df[['department', 'region']])
Output:
department region
employee_id
EMP1000 devops california
EMP1001 finance texas
EMP1002 admin nevada
EMP1003 admin nevada
EMP1004 cloud tech florida
... ... ...
EMP2014 finance nevada
EMP2016 cloud tech texas
EMP2017 finance new york
EMP2018 hr florida
EMP2019 devops illinois
[996 rows x 2 columns]
After splitting, the new department column has only 6 unique values (e.g., 'devops', 'finance', 'admin', etc.). That's great news: the values are already standardised and ready for analysis! We could always map similar departments into a single category, but I'm going to skip that; I don't want to get too advanced in this article.
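For the curious, here's a minimal sketch of what that mapping would look like; the near-duplicate labels below are hypothetical, since this dataset doesn't actually need the fix:
# Hypothetical consolidation map for near-duplicate department labels
department_map = {
    'human resources': 'hr',
    'information technology': 'it',
}
df['department'] = df['department'].str.strip().replace(department_map)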
3. Converting Date Columns (The Join_Date Fix)
The Join_Date column is usually read in as a string (object) type, which makes time-series analysis impossible. This means we have to convert it to a proper Pandas datetime object.
pd.to_datetime() is the core function. I often use errors='coerce' as a safety net; if Pandas can't parse a date, it converts that value to NaT (Not a Time), which is a clean null value, preventing the whole operation from crashing.
# Convert the join_date column to datetime objects
df['join_date'] = pd.to_datetime(df['join_date'], errors='coerce')
Converting the dates enables powerful time-series analysis, like calculating average employee tenure or identifying turnover rates by year.
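For instance, once join_date is a real datetime, a rough tenure calculation becomes a one-liner. A quick sketch (the tenure_years column is something I'm adding purely for illustration):
# Rough tenure in years, measured from today (rows with NaT stay NaT)
df['tenure_years'] = (pd.Timestamp.today() - df['join_date']).dt.days / 365.25
print(df['tenure_years'].describe())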
After this step, every value in the dataset is clean, uniform, and correctly formatted. The categorical columns (like department and region) are ready for grouping and visualisation, and the numerical columns (like salary and age) are ready for statistical modelling. The dataset is officially ready for analysis.
Step 5 — Final Quality Check and Export
Before closing the notebook, I always perform one last audit to make sure everything is in order, and then I export the data so I can analyse it later.
The Final Data Quality Check
This is quick. I re-run the two most important inspection methods to verify that all my cleaning commands actually worked:
- df.info(): I confirm there are no more missing values in the critical columns (age, salary) and that the data types are correct (phone is a string, join_date is datetime).
- df.describe(): I make sure the statistical summary shows plausible numbers. The phone column should now be absent from this output (since it's a string), and age and salary should have logical minimum and maximum values.
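In code, this final audit is just a couple of calls:
# Final audit before export
df.info()
print(df.describe())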
If these checks pass, I know the data is reliable.
Exporting the Clean Dataset
The final step is to save this cleaned version of the data. I normally save it as a new CSV file to keep the original messy file intact for reference. I'd use index=False if I didn't want the employee_id (which is now the index) saved as a separate column, or index=True to save the index as the first column in the new CSV.
# Exporting the clean DataFrame to a new CSV file
# We use index=True to keep our primary key (employee_id) in the exported file
df.to_csv('cleaned_employee_data.csv', index=True)
By exporting with a clear, new filename (e.g., _clean.csv), you officially mark the end of the cleaning phase and give yourself a clean slate for the next phase of the project.
Conclusion
Honestly, I used to feel overwhelmed by a messy dataset. The missing values, the weird data types, the cryptic columns: it felt like facing blank-page syndrome.
But this structured, repeatable workflow changed everything. By focusing on Load, Inspect, Clean, Review, and Export, we established order immediately: standardising column names, making employee_id the index, and using smart strategies for imputation and for splitting messy columns.
Now I can jump straight into the fun analysis part without constantly second-guessing my results. If you struggle with the initial data cleaning step, try out this workflow; I'd love to hear how it goes. If you want to play around with the dataset, you can download it here.
Wanna connect? Feel free to say hi on these platforms.
