Feature Engineering Techniques for Healthcare Data Analysis — Part II.

-

We’ll continue our work on feature engineering — this remains the core objective of this project.

Upon completing all feature engineering tasks, I’ll save the results to a CSV file as the final deliverable, marking the project’s completion.

Our primary objective here remains consistent: refining data through feature engineering. In the previous tutorial, we explored several techniques and stopped at this cell.

# 57. Value counts for 'admission_source_id' after recategorization
df['admission_source_id'].value_counts()

I’ll now continue working in the same notebook, picking up where we left off. In our dataset, we have three variables — diag_1, diag_2, and diag_3 — each representing a medical diagnosis.

So, how should we handle these variables? I don’t have a background in medical diagnoses, nor am I a healthcare professional.

In cases like this, what do we do? Research. If needed, we consult experts, or we study reference materials.

Let’s start by taking a look at the data, shall we?

# 58. Viewing the data
df[['diag_1', 'diag_2', 'diag_3']].head()

I’ll filter the DataFrame to focus on diag_1, diag_2, and diag_3, each containing numerical ICD-9 codes that classify specific diseases (primary, secondary, and additional) for each patient.

Using these codes directly would make the analysis too granular, so instead we’ll group them into four comorbidity-based categories — a healthcare concept that highlights when multiple health conditions coexist.

This step shifts our approach from raw disease codes to a more interpretable, high-level metric. Rather than complex code, this involves interpretive decisions for better insight extraction.

If we keep the codes as-is, our analysis will remain focused on disease classifications alone. But by consolidating the data from diag_1, diag_2, and diag_3 into a new comorbidity variable, we gain richer insights. Effective feature engineering means converting available information into higher-value metrics.

To proceed, we’ll define this new variable based on a clear criterion — comorbidity. This way, our transformation is clinically relevant and adaptable for other analyses. Even when domain knowledge is limited, we can consult field experts to guide the feature design.

I’ll walk through creating this feature in Python, transforming the raw diagnoses into a feature that captures critical patient health patterns, underscoring the power of domain-driven feature engineering.

Applying Feature Engineering Strategies

We’re working here to uncover hidden insights within our dataset by transforming the variables.

This information exists, but it’s not immediately visible; we need feature engineering to reveal it. The visible details, like individual disease codes, are straightforward and valuable in their own right, but there’s often more depth in the hidden layers of information.

By extracting these invisible insights, we can analyze the data from a new angle or perspective — a shift that can greatly enhance day-to-day data analysis. Personally, I see feature engineering as more of an art than a purely technical task.

The Python programming we’re doing isn’t particularly complex; the real skill is in reaching a level of abstraction where we can see insights that aren’t immediately obvious.

This ability to abstract develops with experience — working on diverse projects, learning from mistakes, and gradually noticing that nearly every dataset holds hidden information that, when properly engineered, can enhance the analysis. That’s precisely what we’re working on here together.

Based on our exploration, we’ve decided to create a new variable from these three diagnostic columns. We’ll apply comorbidity as our guiding criterion, which will allow us to group these variables based on whether the patient has multiple coexisting conditions.

To proceed, I’ll create a new DataFrame named diagnosis that will contain diag_1, diag_2, and diag_3. This setup allows us to focus exclusively on these columns as we implement the comorbidity-based transformation.

# 59. Concatenating the 3 variables into a dataframe
diagnosis = df[['diag_1', 'diag_2', 'diag_3']]

Here, I have the values for you — they’re all disease codes.

# 60. Viewing the data
diagnosis.head(10)

Also, note that we have no missing values.

# 61. Checking for missing values
diagnosis.isnull().any()

To create a new variable based on comorbidity, our first step is to establish a clear criterion that defines it within our dataset. In practical terms, comorbidity simply means the presence of more than one disorder in a patient. For instance, if a patient has three diagnoses corresponding to three different conditions, it’s likely they have comorbidities.

Imagine a patient diagnosed with both depression and diabetes — these conditions may be interconnected. Our aim is to detect these overlaps and extract useful information. This process transforms raw data into actionable insights.

Feature engineering, in this sense, goes beyond the obvious. Many professionals focus only on visible data — analyzing it as it is, without uncovering deeper, interconnected patterns. However, invisible information can reveal more nuanced insights, and uncovering it requires experience and a refined sense of abstraction.

To determine the comorbidity of different conditions, we’ll need to use domain knowledge. Here’s where understanding patterns in the medical field helps us apply relevant criteria. For example:

  1. Mental Health and Chronic Conditions: Someone diagnosed with social anxiety and depression has comorbid mental health conditions. Similar patterns apply to other pairs, like diabetes and cardiovascular diseases or infectious diseases and dementia.
  2. Eating Disorders: Commonly overlap with anxiety disorders and substance abuse, forming a complex comorbid profile.

When identifying these connections, it’s often helpful to refer to a data dictionary or consult with the business or healthcare team, especially if we’re unfamiliar with the specific disorders. The goal isn’t simply to look knowledgeable but to learn and leverage expert insights. Many times, insights from others reveal aspects of the data that we may not have anticipated.

Our task now is to set up criteria for comorbidity within this dataset. This will involve:

  • Creating a function to analyze the diagnoses.
  • Assigning codes to identify specific disorders, which we’ll use to determine whether a patient has multiple overlapping health issues.

Once the criteria are defined, we’ll translate them into Python code, generating a new variable that represents the comorbidity level for each patient. This new feature will allow us to explore how overlapping conditions impact health outcomes in a structured, data-driven way.

Let’s begin by setting up the Python function to implement this approach.

# 63. Function that calculates comorbidity
import re  # regular expressions, used for the code matching below

def calculate_comorbidity(row):

    # 63.a Code 250 indicates diabetes
    diabetes_disease_codes = "^[2][5][0]"

    # Codes 39x (x = value between 0 and 9)
    # Codes 4zx (z = value between 0 and 6, and x = value between 0 and 9)
    # 63.b These codes indicate circulatory problems
    circulatory_disease_codes = "^[3][9][0-9]|^[4][0-6][0-9]"

    # 63.c Initialize return variable
    value = 0

    # Value 0 indicates that:
    # 63.d Neither diabetes nor circulatory problems were detected in the patient
    if (not bool(re.match(diabetes_disease_codes, str(row['diag_1']))) and
            not bool(re.match(diabetes_disease_codes, str(row['diag_2']))) and
            not bool(re.match(diabetes_disease_codes, str(row['diag_3']))) and
            not bool(re.match(circulatory_disease_codes, str(row['diag_1']))) and
            not bool(re.match(circulatory_disease_codes, str(row['diag_2']))) and
            not bool(re.match(circulatory_disease_codes, str(row['diag_3'])))):
        value = 0

    # Value 1 indicates that:
    # 63.e At least one diabetes diagnosis, and no circulatory problems,
    # was detected in the patient
    elif ((bool(re.match(diabetes_disease_codes, str(row['diag_1']))) or
           bool(re.match(diabetes_disease_codes, str(row['diag_2']))) or
           bool(re.match(diabetes_disease_codes, str(row['diag_3'])))) and
          not bool(re.match(circulatory_disease_codes, str(row['diag_1']))) and
          not bool(re.match(circulatory_disease_codes, str(row['diag_2']))) and
          not bool(re.match(circulatory_disease_codes, str(row['diag_3'])))):
        value = 1

    # Value 2 indicates that:
    # 63.f No diabetes, but at least one circulatory problem diagnosis,
    # was detected in the patient
    elif (not bool(re.match(diabetes_disease_codes, str(row['diag_1']))) and
          not bool(re.match(diabetes_disease_codes, str(row['diag_2']))) and
          not bool(re.match(diabetes_disease_codes, str(row['diag_3']))) and
          (bool(re.match(circulatory_disease_codes, str(row['diag_1']))) or
           bool(re.match(circulatory_disease_codes, str(row['diag_2']))) or
           bool(re.match(circulatory_disease_codes, str(row['diag_3']))))):
        value = 2

    # Value 3 indicates that:
    # 63.g At least one diabetes diagnosis and at least one circulatory problem
    # diagnosis were detected simultaneously in the patient
    elif ((bool(re.match(diabetes_disease_codes, str(row['diag_1']))) or
           bool(re.match(diabetes_disease_codes, str(row['diag_2']))) or
           bool(re.match(diabetes_disease_codes, str(row['diag_3'])))) and
          (bool(re.match(circulatory_disease_codes, str(row['diag_1']))) or
           bool(re.match(circulatory_disease_codes, str(row['diag_2']))) or
           bool(re.match(circulatory_disease_codes, str(row['diag_3']))))):
        value = 3

    return value

At first glance, I know this Python code might look intimidating, right? What’s this huge block of code? Don’t worry — it’s much simpler than it seems. Follow the explanation with me here.

I have a function called calculate_comorbidity, which takes a row from my DataFrame as input, processes it, and outputs a result. I then call this function, like so.

%%time

# 64. Applying the comorbidity function to the data
df['comorbidity'] = diagnosis.apply(calculate_comorbidity, axis=1)

Notice that I’m calling it on the diagnosis DataFrame, which contains the values for diag_1, diag_2, and diag_3. I’m applying the function and generating a new column. So, what does this function actually do?

First, when we enter the function, we create a Python variable called diabetes_disease_codes. I’m using diabetes as one of the health conditions here, since it’s a critical issue, right? What’s the code for diabetes? It’s 250.

Where did I get this information? I pulled it from the ICD table. If you visit this table, which contains classification codes for diseases, you’ll see that 250 corresponds to diabetes.

The patient with ID 2 was diagnosed with diabetes in the second diagnosis. So, I retrieved the diabetes code, which is 250.

However, I added the caret symbol (^). Why did I do this? Because I’m creating a string that will be used as a regular expression to search within my DataFrame.

In fact, I’m using it below, have a look:

# Value 0 indicates that:
# 63.d Neither diabetes nor circulatory problems were detected in the patient
if (not bool(re.match(diabetes_disease_codes, str(row['diag_1']))) and
        not bool(re.match(diabetes_disease_codes, str(row['diag_2']))) and
        not bool(re.match(diabetes_disease_codes, str(row['diag_3']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_1']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_2']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_3'])))):
    value = 0

re is the Python package for regular expressions, used specifically for searching data based on defined patterns.

Here, I’ll use it to search for diabetes_disease_codes in diag_1, diag_2, and diag_3. This is a way to check whether these columns contain the code 250.

In addition to diabetes, I’ll also use circulatory_disease_codes for circulatory conditions.

To identify circulatory issues, I’ll create a pattern based on the ICD-9 code system. Specifically:

  • Code pattern “39x”: where x ranges from 0 to 9.
  • Code pattern “4zx”: where z ranges from 0 to 6 and x from 0 to 9.

Using this information, I created a regular expression to target these ranges:

  1. I start with the caret (^), which anchors the match at the beginning of the string, followed by 39 to capture any codes that start with “39” and end with any digit (0–9).
  2. I use the pipe (|) operator, meaning “or”, to expand the pattern to include codes starting with “4”, followed by a digit from 0 to 6 and then a digit from 0 to 9.

By combining these patterns, we can filter for general circulatory issues without being too specific. This regular expression enables a flexible but targeted approach for our analysis.
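As a quick sanity check, the two patterns can be tested against a few illustrative codes (the sample values here are for demonstration only):

```python
import re

diabetes_disease_codes = "^[2][5][0]"                      # matches codes starting with 250
circulatory_disease_codes = "^[3][9][0-9]|^[4][0-6][0-9]"  # 390-399 and 400-469

print(bool(re.match(diabetes_disease_codes, "250.83")))     # True  (diabetes)
print(bool(re.match(circulatory_disease_codes, "401")))     # True  (circulatory)
print(bool(re.match(circulatory_disease_codes, "470")))     # False (outside 390-469)
print(bool(re.match(diabetes_disease_codes, "276")))        # False
```

Note that re.match only anchors at the start of the string, which is exactly the behavior we want for code prefixes.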

Creating the Filter

I’ll apply this pattern as a filter on diag_1, diag_2, and diag_3. The result will determine the variable named value (defined earlier in #63.c), which serves as our return variable.

The value variable is initialized to 0 and later adjusted based on specific criteria.

Classification Values

We’ll establish four distinct categories for comorbidity:

  • Value 0: No comorbidities detected.
  • Value 1: Diabetes detected, no circulatory issues.
  • Value 2: Circulatory issues detected, no diabetes.
  • Value 3: Each diabetes and circulatory issues detected.

This new variable will consolidate information from diag_1, diag_2, and diag_3 into a single categorical feature with four levels based on these conditions, streamlining our data and enhancing its usability for downstream analysis.

# Value 0 indicates that:
# 63.d Neither diabetes nor circulatory problems were detected in the patient
if (not bool(re.match(diabetes_disease_codes, str(row['diag_1']))) and
        not bool(re.match(diabetes_disease_codes, str(row['diag_2']))) and
        not bool(re.match(diabetes_disease_codes, str(row['diag_3']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_1']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_2']))) and
        not bool(re.match(circulatory_disease_codes, str(row['diag_3'])))):
    value = 0

Let’s break down what’s happening in the code:

I’m using re, Python’s regular expressions package, to match specific patterns in each diagnosis column (diag_1, diag_2, and diag_3). Specifically, I’m checking whether each diagnosis contains a diabetes code or a circulatory issue code.

Here’s the process:

  1. Convert each diagnosis into a string format suitable for regular expression searches.
  2. Check each column (diag_1, diag_2, diag_3) for diabetes or circulatory codes using re.match.
  3. Convert these checks into Boolean values (True if a match is found, False if not).
  4. Negate the results to identify when no matches for either diabetes or circulatory issues exist in any of the three diagnoses.

The result:

  • If no diabetes or circulatory codes are present across all three columns (diag_1, diag_2, diag_3), value is set to 0.

By negating the Boolean checks, we classify cases where each diabetes and circulatory issues are absent as 0, marking this category because the baseline for patients without these comorbidities.

If re.match finds a match, the code is present. But that’s not the goal here; we want cases without diabetes or circulatory codes. That’s why we negate the result.

Note how I’m also using not for the circulatory checks. If all six negated checks are True (meaning neither diabetes nor circulatory issues are present in diag_1, diag_2, or diag_3), we set value to 0.
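The negation step is easy to see in isolation: re.match returns a match object (truthy) or None (falsy), so wrapping it in bool and negating flags the absence of a code. A minimal sketch with illustrative values:

```python
import re

code = "^[2][5][0]"  # the diabetes pattern from the function

print(not bool(re.match(code, "250.6")))  # False - a diabetes code was found
print(not bool(re.match(code, "428")))    # True  - no diabetes code present
```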

For value 1, we capture cases where at least one diagnosis has diabetes but no circulatory problem. Here, I’ve removed the not for diabetes, while keeping it for the circulatory codes to isolate diabetes-only cases.

So, if it finds a diabetes diagnosis, even when it doesn’t find a circulatory problem, it will assign the value 1.

For value 2, we capture the opposite case: no diabetes, but at least one diagnosis of circulatory problems.

Here, I kept the not condition specifically for diabetes and removed it for circulatory problems. Notice the detail here: we’re using both AND and OR logic, following the rules we defined to assign the value.

Finally, if at least one diabetes diagnosis and at least one circulatory problem diagnosis are detected simultaneously, we assign value 3.

Notice that the OR operator applies across the diagnoses (diag_1, diag_2, and diag_3) for both diabetes and circulatory issues. This allows the overall condition to return True if any one diagnosis meets the criteria.

With this setup, the calculate_comorbidity function consolidates information from diag_1, diag_2, and diag_3 into a new variable that reflects comorbidity status — an example of domain-based feature engineering. This function classifies the comorbidity status into four categories based on the rules we established.

Here, we’re focusing specifically on diabetes and circulatory issues to streamline the example. This approach, however, can easily be adapted to create variables for other comorbid conditions if needed.

Now, create the function and run the next instruction to apply it.

%%time

# 64. Applying the comorbidity function to the data
df['comorbidity'] = diagnosis.apply(calculate_comorbidity, axis=1)

# -> CPU times: user 6.72 s, sys: 4.43 ms, total: 6.73 s
# Wall time: 6.78 s

It takes a bit of time to process the entire dataset, doesn’t it? Notice that I’m using diagnosis, which contains precisely the three variables diag_1, diag_2, and diag_3. This step takes just under seven seconds.
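As an aside, the same four-level encoding can be computed without a row-wise apply, using pandas’ vectorized string matching. This is a sketch, not part of the original notebook; it relies on the fact that diabetes contributes 1 and circulatory problems contribute 2, so the sum reproduces the 0–3 mapping. The toy DataFrame below stands in for the real diagnosis columns:

```python
import pandas as pd

# Toy stand-in for the diagnosis DataFrame (illustrative values only)
diagnosis = pd.DataFrame({'diag_1': ['250.83', '401', '7'],
                          'diag_2': ['276',    '250', '8'],
                          'diag_3': ['255',    'V45', '9']})

# True per row if any of the three columns matches the pattern
diab = diagnosis.apply(lambda s: s.astype(str).str.match(r"250")).any(axis=1)
circ = diagnosis.apply(lambda s: s.astype(str).str.match(r"39[0-9]|4[0-6][0-9]")).any(axis=1)

# 0 = neither, 1 = diabetes only, 2 = circulatory only, 3 = both
comorbidity = diab.astype(int) + 2 * circ.astype(int)
print(comorbidity.tolist())  # [1, 3, 0]
```

Series.str.match anchors at the start of the string, so the explicit caret isn’t needed here; on the full dataset this avoids calling the function once per row.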

Let’s now check the shape of the dataset, and then take a look at the data itself.

# 65. Shape
df.shape

# (98052, 43)

# 66. Viewing the data
df.head()

Take a look at what we’ve accomplished here. The comorbidity variable is now added at the very end of our dataset.

Now we have a new variable that identifies whether a patient has both diabetes and circulatory issues simultaneously.

This goes beyond technical work — it’s almost an art. We’ve uncovered hidden insights and created a valuable new variable.

This allows us to perform further analyses, which we’ll explore shortly. Let’s check the unique values of this variable.

# 67. Unique values in 'comorbidity'
df['comorbidity'].unique()

# > array([1, 3, 2, 0])

As you can see, we have precisely the four categories we defined in the function: 0, 1, 2, and 3.

Now, let’s check the count and frequency of each category.

# 68. Unique value counts in 'comorbidity'
df['comorbidity'].value_counts()

So, we observe that the highest frequency is for category 2, while the lowest is for category 3.

Let’s take a closer look at what category 2 represents.

# Value 2 indicates that:
# 63.f No diabetes, but at least one circulatory problem diagnosis,
# was detected in the patient

No diabetes, but at least one circulatory problem diagnosis, was detected in the patient. This applies to the majority of cases, indicating that many patients have a circulatory issue without an accompanying diabetes diagnosis.

This raises some important questions:

  • Do these patients require a different treatment approach?
  • Does this condition influence their hospital readmission rates?

These findings open up numerous avenues for further analysis. Now, let’s identify the category with the fewest entries: category 3.

# Value 3 indicates that:
# 63.g At least one diagnosis of diabetes and at least one diagnosis of
# circulatory problems were detected simultaneously in the patient

A simultaneous diagnosis of diabetes and circulatory issues is the least frequent, with category 2 being the most common.

This analysis goes beyond the obvious, unlocking deeper insights through feature engineering that others might overlook.

These comorbidity insights weren’t created — they were simply hidden within the data. By combining existing columns, we generated a variable that answers questions not yet asked. This process takes time and experience, and it can elevate your data analysis.

To wrap up, let’s create a chart. But first, let’s delete the original columns, diag_1, diag_2, and diag_3, as we’ve consolidated them into the comorbidity variable. While other diseases might be present, our focus here is strictly on diabetes and circulatory issues.

# 69. Dropping individual diagnosis variables
df.drop(['diag_1', 'diag_2', 'diag_3'], axis=1, inplace=True)

Delete those columns now, and then let’s proceed by creating a cross-tabulation between comorbidity and readmission status.

# 70. Calculating the percentage of comorbidity by type and target variable class
percent_com = pd.crosstab(df['comorbidity'], df['readmitted'], normalize='index') * 100

Remember this variable? Now I’ll calculate the percentages and display them for you.
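The normalize='index' argument in cell #70 is what turns raw counts into within-row percentages. A toy example (hypothetical values, not our dataset) makes the behavior clear:

```python
import pandas as pd

toy = pd.DataFrame({'comorbidity': [0, 0, 2, 2, 2],
                    'readmitted':  [0, 1, 1, 1, 0]})

# normalize='index' divides each cell by its row total, so each row sums to 100
pct = pd.crosstab(toy['comorbidity'], toy['readmitted'], normalize='index') * 100

print(pct.loc[0, 0])              # 50.0 - half of the comorbidity-0 patients were not readmitted
print(round(pct.loc[2, 1], 1))    # 66.7 - two of the three comorbidity-2 patients were readmitted
```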

Zero (0) indicates no readmission, while one (1) indicates readmission. Among patients with no comorbidities — no occurrence of diabetes or circulatory issues — 44% were readmitted, revealing key insights already embedded in the data.

Category 2 shows the highest readmission rate at 48%. This highlights a correlation: patients in this group are more likely to be readmitted.

These findings, uncovered through feature engineering, demonstrate how hidden information can guide operational strategies. Let’s proceed with visualizing these insights.

# 71. Plot

# Prepare the figure from the data
fig = percent_com.plot(kind='bar',
                       figsize=(16, 8),
                       width=0.5,
                       edgecolor='g',
                       color=['b', 'r'])

# Annotate each bar with its percentage
for i in fig.patches:
    fig.text(i.get_x() + 0.00,
             i.get_height() + 0.3,
             str(round(i.get_height(), 2)),
             fontsize=15,
             color='black',
             rotation=0)

# Title and display
plt.title("Comorbidity vs Readmissions", fontsize=15)
plt.show()

I’ll create the plot using the comorbidity percentages we’ve calculated.

I’ll set up a bar chart with parameters and formatting, adding titles and labels for clarity, and ensuring each group is distinct and easy to interpret.

The X-axis displays comorbidity levels (0, 1, 2, and 3).

Blue bars represent patients not readmitted, while red bars indicate those readmitted, allowing a clear visual comparison across each comorbidity level.

  • The largest blue bar, corresponding to category 0 (patients with no comorbidities like diabetes or circulatory issues), shows that about 55% of those patients weren’t readmitted, suggesting effective treatment and lower readmission rates in the absence of comorbid conditions.
  • The red bar at category 2 represents patients with circulatory problems. This group shows a notably higher readmission rate, aligning with expectations that patients with such conditions are at greater risk of requiring further medical care.

This graph reflects more than a simple visualization; it encapsulates critical steps:

  1. Understanding the domain-specific problem.
  2. Defining criteria for comorbidity.
  3. Applying feature engineering to remodel raw data into actionable insights.
  4. Using Python for automated data processing.

The underlying question, one we likely wouldn’t have asked without these steps, is: does having multiple simultaneous conditions impact readmission rates? The data provides a clear yes.

This insight enables healthcare providers to better support high-risk patients and potentially lower readmissions — a testament to how data analysis can turn hidden insights into concrete, actionable strategies rooted in data-driven evidence rather than speculation.

Have we completed the feature engineering work? Not quite. There’s one more aspect of the data that I haven’t shown you yet.

# 72. Viewing the data
df.head()

Let’s take a look at the columns to see how the dataset is organized after our feature engineering efforts.

# 73. Viewing column names
df.columns

The dataset includes 23 medications, each indicating whether a change was made during the patient’s hospitalization. This prompts the question: does a medication change impact the likelihood of readmission?

Consider two scenarios:

  1. No change in medication; the patient recovers and returns home.
  2. A significant dosage adjustment occurs, potentially causing side effects and leading to a return to the hospital.

To investigate this, rather than plotting all 23 variables (which may behave similarly), we’ll chart four selected medications to highlight specific trends.

# 74. Plot
fig = plt.figure(figsize=(20, 15))

ax1 = fig.add_subplot(221)
ax1 = df.groupby('miglitol').size().plot(kind='bar', color='green')
plt.xlabel('miglitol', fontsize=15)
plt.ylabel('Count', fontsize=15)

ax2 = fig.add_subplot(222)
ax2 = df.groupby('nateglinide').size().plot(kind='bar', color='magenta')
plt.xlabel('nateglinide', fontsize=15)
plt.ylabel('Count', fontsize=15)

ax3 = fig.add_subplot(223)
ax3 = df.groupby('acarbose').size().plot(kind='bar', color='black')
plt.xlabel('acarbose', fontsize=15)
plt.ylabel('Count', fontsize=15)

ax4 = fig.add_subplot(224)
ax4 = df.groupby('insulin').size().plot(kind='bar', color='cyan')
plt.xlabel('insulin', fontsize=15)
plt.ylabel('Count', fontsize=15)

plt.show()

I created four plots for four variables, each representing a different medication. Below, you’ll find the results visualized across four distinct charts.

Consider the first medication in the chart. Do we know its specifics? No, and for our purposes we don’t need to. All we need is to understand the four possible categories:

  • Up: an increase in dosage
  • Down: a reduction in dosage
  • Steady: dosage maintained at the regular level
  • No: no modification (the medication was not prescribed or changed)

This is sufficient for our analysis. Deep domain knowledge isn’t required here; the focus is on identifying these categories.

Now, let’s interpret the chart: for one medication, most entries are labeled No, meaning no change in dosage. A thin pink line stands out, indicating cases with a Steady dosage.

In some cases, the medication remained Steady, which could be notable, especially for certain patients.

However, for most, there was no modification in dosage.

Now, observe the light blue chart — the distribution here is more varied, indicating a broader range of dosage adjustments.

Some patients had a reduction in dosage, others had no modification, some remained Steady, and a few experienced an increase. This is our current view of the medication variables.

Now, do we need feature engineering here? Instead of displaying all four categories, we could simplify by creating a binary variable: did the medication change or not? This would streamline the analysis by recoding the categories into binary information.

This recoding allows us to look at these variables differently, extracting hidden insights. By counting total medication modifications per patient, we can create a new attribute that may reveal correlations with the frequency of changes.

Another attribute could track the total number of medications a patient received, which we can analyze against readmission rates.

Let’s implement this strategy.

# 75. List of medication variable names (3 variables were previously removed)
medications = ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide',
'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'troglitazone', 'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
'glimepiride-pioglitazone', 'metformin-pioglitazone']

First, let’s create a Python list containing the column names that represent the medications. In previous steps, we already removed three variables.

Therefore, while the original dataset had 23 medication variables, we now have only 20, because three were deleted due to issues identified earlier and are no longer part of our analysis.

With the list created, let’s iterate over it in a loop to implement the next steps.

# 76. Loop to adjust the value of medication variables
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df[colname] = df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)

For each column in the medications list, I locate it in the DataFrame, append a temp suffix to form a new column name, and apply a lambda function:

  • If x is “No” or “Steady”, return 0.
  • Otherwise, return 1.

This recodes the variable from four categories to just two (0 or 1), simplifying our interpretation. We can then confirm the new columns at the end of the DataFrame.

Check that the temp variables are now present, right at the end of the dataset.
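On a single column, the recode looks like this — a small sketch using the dataset’s No/Steady/Up/Down coding with made-up values:

```python
import pandas as pd

insulin = pd.Series(['No', 'Steady', 'Up', 'Down', 'No'])

# 0 = no dosage change (No or Steady), 1 = dosage changed (Up or Down)
changed = insulin.apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)
print(changed.tolist())  # [0, 0, 1, 1, 0]
```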

Now I’ll create a new variable to store the number of medication dosage changes.

# 78. Creating a variable to store the count per patient
df['num_med_dosage_changes'] = 0

I’ll create the variable and initialize it to 0. Then I’ll run another loop to update it.

# 79. Counting medication dosage changes
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df['num_med_dosage_changes'] = df['num_med_dosage_changes'] + df[colname]
        del df[colname]

For each column in the medications list, I look it up in the DataFrame, rebuild the temporary column name with the temp suffix, then:

  • Add the value in df[colname] to df['num_med_dosage_changes'] to accumulate the dosage-change count per patient.
  • Delete the temporary column to keep the DataFrame clean.

Finally, calling value_counts on df['num_med_dosage_changes'] reveals the dosage adjustment frequency across patients, offering insight into treatment patterns.
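As an aside, the two loops (recode into temp columns, then accumulate and delete) can be collapsed into a single vectorized expression — a sketch under the assumption that the medication columns contain only No/Steady/Up/Down, shown here on a toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'insulin':   ['Up',   'No',     'Steady'],
                    'metformin': ['Down', 'Steady', 'Up']})

# Up or Down counts as a dosage change; No and Steady do not
num_changes = toy[['insulin', 'metformin']].isin(['Up', 'Down']).sum(axis=1)
print(num_changes.tolist())  # [2, 0, 1]
```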

# 80. Checking the total count of medication dosage changes
df.num_med_dosage_changes.value_counts()

The distribution of dosage changes is as follows:

  • 0 changes: 71,309
  • 1 change: 25,350
  • 2 changes: 1,281
  • 3 changes: 107
  • 4 changes: 5
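The counts above can be sanity-checked directly: they sum to the full dataset, and the share of patients with at least one dosage change follows immediately.

```python
# Counts taken from the value_counts output above
counts = {0: 71309, 1: 25350, 2: 1281, 3: 107, 4: 5}

total = sum(counts.values())
print(total)  # 98052 - matches the number of rows in df

at_least_one = sum(v for k, v in counts.items() if k >= 1)
print(round(at_least_one / total * 100, 1))  # 27.3 - percent of patients with a dosage change
```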

Now, let’s check the dataset head to verify the new variable has been correctly incorporated.

# 81. Viewing the data
df.head()

Run the command, scroll to the end, and there it is — the new variable has been successfully added at the end of the dataset.

Now I know the exact count of medication dosage changes for each patient. For instance, the first patient had one change, the second had none, the third had one, and so on.

Next, we’ll adjust the medication columns to reflect whether each medication is being administered to a patient. This is an additional modification to simplify the dataset.

As you’ve observed, the attribute engineering strategy here mainly involves loops. We start with the first loop:

# 76. Loop to adjust the value of medication variables
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df[colname] = df[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)

Then the second loop:

# 79. Counting medication dosage changes
for col in medications:
    if col in df.columns:
        colname = str(col) + 'temp'
        df['num_med_dosage_changes'] = df['num_med_dosage_changes'] + df[colname]
        del df[colname]

The strategy here is technical, but the real challenge is abstracting the data: understanding what each variable represents and viewing it from a different angle.

This abstraction allows us to extract new features through feature engineering. It’s not a simple task — it requires experience to “see” invisible insights.

Once you grasp this idea, the programming becomes straightforward. Now, let’s move on to modify the medication columns.

# 82. Recoding medication columns
for col in medications:
    if col in df.columns:
        df[col] = df[col].replace('No', 0)
        df[col] = df[col].replace('Steady', 1)
        df[col] = df[col].replace('Up', 1)
        df[col] = df[col].replace('Down', 1)

I’ll loop once more through the medication list, iterating over each column. I’ll replace No with zero (indicating the medication was not administered), while Steady, Up, and Down become one (indicating the medication was administered). This recodes each variable to zero or one.

After this, we’ll create a new column to reflect how many medications are being administered to each patient.

# 83. Variable to store the count of medications per patient
df['num_med'] = 0

Then we populate the new variable.

# 84. Populating the new variable
for col in medications:
    if col in df.columns:
        df['num_med'] = df['num_med'] + df[col]

Let’s take a look at the value_counts.

# 85. Checking the total count of medications
df['num_med'].value_counts()

One medication was administered to most patients (45,447 cases), with 22,702 receiving none, 21,056 receiving two, and 7,485 receiving three.

Only five patients required six medications. After creating these new columns, the original medication columns are no longer needed, as they’ve served their purpose for insight generation. We can now discard them.

# 86. Removing the medication columns
df = df.drop(columns=medications)

Just as I did with the comorbidity variable, where I used the diag columns to create a new variable and then no longer needed the originals, I’m doing the same thing here.

So, I simply dropped them. Take a look at the shape.

# 87. Shape
df.shape

# (98052, 22)

We now have 22 columns. Here is the head of the dataset.

# 88. Viewing the data
df.head()

Our dataset is getting better and better.

Each time simpler. Each time more compact. Making our analysis work easier.

Let’s take a look at the dtypes.

# 89. Variables and their data types
df.dtypes

What are your thoughts on this topic?
Let us know in the comments below.
