The Intuition Behind the Concordance Index — Survival Analysis


Ranking accuracy versus absolute accuracy

Photo taken by the author and her Border Collie. “Be grateful for what you have. Be fearless for what you want”

How long would you keep your gym membership before you decide to cancel it? Or Netflix, if you are a series fan but too busy to allocate two hours of your time to your sofa and your TV? Or when should you upgrade or replace your smartphone? Which route is best given traffic, road closures and the time of day? Or how long until your car needs servicing? These are all regular (but not trivial) questions we face (some of them) in our daily life without thinking too much (or at all) about the thought process we go through on the many factors that influence our next course of action. Surely (or perhaps after reading these lines) one would be interested to know which factor or factors have the greatest influence on the expected time until a given event (from the above, or any other for that matter) occurs. In statistics, this is referred to as time-to-event analysis or survival analysis. And that is the focus of this study.

In survival analysis one aims to analyze the time until an event occurs. In this article, I will be employing survival analysis to predict when a registered member is likely to leave (churn), specifically the number of days until a member cancels his/her membership contract. Since the variable of interest is the number of days, one key element to explicitly reinforce at this point: the time-to-event dependent variable is of a continuous type, a variable that can take any value within a certain range. For this, survival analysis is the technique to use.

DATA

This study was conducted using a proprietary dataset provided by a private organization in the tutoring industry. The data includes records anonymized for confidentiality purposes, collected over a period of two years, namely July 2022 to October 2024. All analyses were conducted in compliance with ethical standards, ensuring data privacy and anonymity. Accordingly, to respect the confidentiality of the data provider, any specific organizational details and/or unique identifier details have been omitted.

The final dataset after data pre-processing (i.e. tackling nulls, normalizing to handle outliers, aggregating to remove duplicates and grouping to a sensible level) contains a total of 44,197 records at unique-identifier level. A total of 5 columns were input into the model, namely: 1) Age, 2) Number of visits, 3) First visit, 4) Last visit during membership and 5) Tenure, the latter representing the number of days holding a membership and hence the time-to-event target variable. The visit-based variables are feature-engineered products generated for this study from the original, existing variables, by performing calculations and aggregations on the raw data for each identifier over the period under analysis. Finally, and very importantly, the dataset is ONLY composed of uncensored records. That is, all unique identifiers have experienced the event by the time of the analysis, namely membership cancellation. Therefore there is no censored data in this analysis where individuals survived (did not cancel their membership) beyond their observed duration. This is essential when choosing the modelling technique, as I will explain next.
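
To make the pre-processing concrete, below is a minimal sketch in pandas. The column names (member_id, visit_date, join_date, cancel_date, age) and the file name visits.csv are hypothetical stand-ins, since the real schema is confidential; the aggregations shown are an assumption about how the feature engineering could look, not the study's exact code.

```python
import pandas as pd

# Hypothetical raw schema: one row per visit, plus membership dates.
raw = pd.read_csv("visits.csv", parse_dates=["visit_date", "join_date", "cancel_date"])

# Tackle nulls and remove duplicate visit rows
raw = raw.dropna(subset=["member_id", "visit_date"]).drop_duplicates()

# Aggregate to one record per unique identifier
members = raw.groupby("member_id").agg(
    age=("age", "max"),
    n_visits=("visit_date", "count"),
    join_date=("join_date", "first"),
    cancel_date=("cancel_date", "first"),
    first_visit=("visit_date", "min"),
    last_visit=("visit_date", "max"),
)

# Feature-engineered visit variables, in days since joining
members["first_visit_days"] = (members["first_visit"] - members["join_date"]).dt.days
members["last_visit_days"] = (members["last_visit"] - members["join_date"]).dt.days

# Tenure: number of days holding a membership (the time-to-event target)
members["tenure_days"] = (members["cancel_date"] - members["join_date"]).dt.days
```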

Among all the different techniques used in survival analysis, three stand out as the most commonly used:

Kaplan-Meier Estimator

  • It is a non-parametric model, hence no assumptions are made about the distribution of the data.
  • KM is not interested in how individual features affect churn, thus it does not offer feature-based insights.
  • It is widely used for exploratory analysis to assess what the survival curve looks like (see the sketch after this list).
  • Very importantly, it does not provide personalized predictions.
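
Purely as an illustration (KM is not part of this study's final modelling), here is a minimal exploratory sketch using lifelines' KaplanMeierFitter, reusing the hypothetical members DataFrame from the DATA sketch above:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
# Fully uncensored data: with no event_observed argument,
# lifelines assumes every subject experienced the event.
kmf.fit(durations=members["tenure_days"])
kmf.plot_survival_function()
plt.xlabel("Days of membership")
plt.ylabel("S(t): probability membership is still active")
plt.show()
```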

Cox Proportional Hazard (PH) Model

  • The Cox PH model is a semi-parametric model, so it does not assume any specific distribution of the survival time, making it more flexible for a wider range of data.
  • It estimates the hazard function.
  • It relies heavily on both censored and uncensored data to be able to distinguish between individuals “at risk” of experiencing the event versus those who have already had the event. Thus, if only uncensored data is analyzed, the model assumes all individuals experienced the event, yielding biased results and leading the Cox PH to perform poorly.

AFT Model

  • It does not require censored data. Thus, it can be used where everyone has experienced the event.
  • It directly models the relationship between the covariates and the survival time.
  • It is used when time-to-event outcomes are of primary interest.
  • The model estimates the time-to-event explicitly. Thus, it provides direct predictions of the duration until cancellation.

Given the characteristics of the dataset used in this study, I have chosen the Accelerated Failure Time (AFT) model as the most suitable technique. This choice is driven by two key aspects: (1) the dataset contains only uncensored data, and (2) the analysis focuses on generating individual-level predictions for each unique identifier.

Now, before diving any deeper into the methodology and model output, I will cover some key concepts:

Survival Function: Provides insight into the probability of survival over time.

Hazard Function: The rate at which the event occurs at time t. It captures how the risk of the event changes over time.

Time-to-event: Refers to the (target) variable capturing the time until an event occurs.

Censoring: A flag referring to those events that have not yet occurred for some of the subjects within the timeframe of the analysis. NOTE: In this piece of work only uncensored data is analyzed; that is, the survival time for all of the subjects under study is known.

Concordance Index: A measure of how well the model predicts the relative ordering of survival times. It is a measure of ranking accuracy rather than absolute accuracy, assessing the proportion of all pairs of subjects whose predicted survival times align with the actual outcomes.

Akaike Information Criterion (AIC): A measure that evaluates the quality of a model, penalizing for the number of irrelevant variables used. When comparing several models, the one with the lowest AIC is considered the best.

Next, I will expand on the first two concepts.

In mathematical terms:

The survival function is given by:

S(t) = P(T > t)    (1)

where,

T is a random variable representing the time to event — duration until the event occurs.

S(t) is the probability that the event has not yet occurred by time t.

The hazard function, on the other hand, is given by:

h(t) = f(t) / S(t)    (2)

where,

f(t) is the probability density function (PDF), which describes the rate at which the event occurs at time t.

S(t) is the survival function that describes the probability of surviving beyond time t.

Since the PDF f(t) can be expressed in terms of the survival function by taking the derivative of S(t) with respect to t:

f(t) = -dS(t)/dt    (3)

substituting the derivative of S(t) into the hazard function:

h(t) = -(dS(t)/dt) / S(t)    (4)

taking the derivative of the Log Survival Function:

d/dt [log S(t)]    (5)

from the chain rule of differentiation it follows:

d/dt [log S(t)] = (1/S(t)) · dS(t)/dt    (6)

thus, the relationship between the hazard and survival functions is defined as follows:

h(t) = -d/dt [log S(t)]    (7)

the hazard rate captures how quickly the survival probability changes at a specific point in time.

The hazard function is always non-negative; it can never go below zero. Its shape can increase, decrease, stay constant, or vary in more complex ways.

Simply put, the hazard function is a measure of the instantaneous risk of experiencing the event at a point in time t. It tells us how likely the subject is to experience the event right then. The survival function, on the other hand, measures the probability of surviving beyond a given point in time. That is the overall probability of not experiencing the event up to time t.
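
As a quick numeric sanity check of equation (7) (not part of the study's modelling), the sketch below compares h(t) = f(t)/S(t) against -d/dt log S(t) for a Weibull distribution with arbitrary, illustrative shape and scale parameters:

```python
import numpy as np
from scipy.stats import weibull_min

shape, scale = 1.5, 300.0                     # arbitrary illustrative parameters
t = np.linspace(1, 1000, 2000)

S = weibull_min.sf(t, shape, scale=scale)     # survival function S(t)
f = weibull_min.pdf(t, shape, scale=scale)    # density f(t)

hazard_direct = f / S                          # equation (2)
hazard_from_logS = -np.gradient(np.log(S), t)  # equation (7), numerically

# Compare on interior points (np.gradient is less accurate at the edges)
rel_err = np.max(np.abs(hazard_from_logS[1:-1] / hazard_direct[1:-1] - 1))
print(f"max relative difference: {rel_err:.2e}")  # small, confirming (7)
```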

The survival function is monotonically non-increasing over time, as more and more individuals experience the event. This is illustrated in the histogram below, plotting the time-to-event variable: Tenure.

Generated by the author by plotting the time-to-event target variable from the dataset under study.

At t = 0, no individual has experienced the event (no member has cancelled their membership yet), thus

S(0) = 1    (8)

Eventually all individuals experience the event, so the survival function tends to zero:

S(t) → 0 as t → ∞    (9)

MODEL

For the purposes of this article, I will be focusing on a multivariate parametric model: the Accelerated Failure Time (AFT) model, which explicitly estimates the continuous time-to-event target variable.

Given the AFT Model:

T = exp(Xβ + ε)    (10)

Taking the natural logarithm on both sides of the equation results in:

log(T) = Xβ + ε    (11)

where,

log(T) is the logarithm of the survival time, namely the time-to-event (duration), which as shown by equation (11) is a linear function of the covariates.

X is the vector of covariates.

β is the vector of regression coefficients.

and this is very important:

The coefficients β in the model describe how the covariates accelerate or decelerate the event time, namely the survival time. In an AFT model (the focus of this piece), the coefficients directly affect the survival time (not the hazard function), specifically:

if exp(β) > 1 (i.e., β > 0), the survival time is longer, hence decelerating the time to event. That is, the member will take longer to terminate his/her membership (experiencing the event later).

if exp(β) < 1 (i.e., β < 0), the survival time is shorter, hence accelerating the time to event. That is, the member will terminate his/her membership earlier (experiencing the event sooner).

finally,

ϵ is the random error term representing unobserved factors that affect the survival time.

Now, a few explicit points based on the above:

  1. this is a Multivariate approach, where the time-to-event (duration) target variable is fit on multiple covariates.
  2. a Parametric approach, since the model holds an assumption about a specific shape of the survival time distribution.
  3. three algorithms sitting under the AFT model umbrella have been implemented (a minimal fitting sketch follows their descriptions below). These are:

3.1) Weibull AFT Model

  • The model is flexible and can capture different patterns of survival. It supports consistently monotonic increasing/decreasing hazard functions; that is, at any two points defined by the function, the later point is at least as high (or as low) as the earlier point.
  • One does not need to explicitly model the hazard function. The model has two parameters from which the survival function is derived: shape, which determines the shape of the distribution and hence helps determine the skewness of the data, and scale, which determines the spread of the distribution. This PLUS a regression coefficient associated with each covariate. The shape parameter dictates the monotonic behavior of the hazard function, which in turn affects the behavior of the survival function.
  • Right-skewed and left-skewed distributions of the time-to-event target variable are examples of these.

3.2) LogNormal AFT Model

  • Focuses on modelling the log-transformed survival time; that is, the logarithm of the survival time is assumed to be approximately normally distributed.
  • Supports right-skewed distributions of the time-to-event target variable. Allows for non-monotonic hazard functions, useful when the risk of the event does not follow a straightforward pattern.
  • It does not require the hazard function to be explicitly modelled.
  • Two important parameters (plus any regression coefficients): scale and location, the former representing the standard deviation of the log-transformed survival time, the latter representing its mean. The location represents the intercept when no covariates are included; otherwise it represents the linear combination of the covariates.

3.3) Generalized Gamma AFT Model

  • A good fit for a wide range of survival data patterns. A highly adaptable parametric model that accommodates the above-mentioned shapes as well as more complicated mathematical forms of the survival function.
  • It can be used to test whether simpler models (i.e. Weibull, LogNormal) can be used instead, since it encompasses these as special cases.
  • It does not require the hazard function to be specified.
  • It has three parameters apart from the regression coefficients: shape, scale and location, the latter corresponding to the log of the median survival time when covariates are not included, thus the intercept in the model.

TIP: There is a large amount of literature focusing specifically on each of these algorithms and their features, which I strongly suggest the reader explores for a deeper understanding.
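
Before turning to the outputs, here is a minimal fitting sketch using the lifelines library, assuming the hypothetical members DataFrame and column names from the DATA sketch earlier. lifelines covers the Weibull and LogNormal fits; the Generalized Gamma model in this study was fitted with R's flexsurv, so it is omitted from this Python sketch.

```python
from lifelines import WeibullAFTFitter, LogNormalAFTFitter

cols = ["tenure_days", "age", "n_visits", "first_visit_days", "last_visit_days"]
df = members[cols].reset_index(drop=True)  # hypothetical column names

results = {}
for name, fitter in [("Weibull", WeibullAFTFitter()),
                     ("LogNormal", LogNormalAFTFitter())]:
    # No event_col passed: lifelines treats every record as uncensored
    fitter.fit(df, duration_col="tenure_days")
    fitter.print_summary()                 # coefficients, exp(coef), p-values
    results[name] = (fitter.concordance_index_, fitter.AIC_)

for name, (c_index, aic) in results.items():
    print(f"{name}: C-index={c_index:.3f}, AIC={aic:.1f}")
```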

Lastly, the performance of the above algorithms is analyzed focusing on the Concordance Index (yes, the C-index, our metric of interest) and the Akaike Information Criterion (AIC). These are shown next with the models' outputs:

REGRESSION OUTPUTS

Weibull AFT Model

Generated by the author using the lifelines library

Log Normal AFT Model

Generated by the author using the lifelines library

Generalized Gamma AFT Model

Generated by the author using the flexsurv library

On the right-hand side, the graphs for each predictor are shown, plotting the log accelerated failure rate on the x-axis, hence each predictor's positive/negative impact on the survival time (lengthening or shortening it, respectively). As shown, all models concur across predictors on the direction of the effect on survival time, providing a consistent conclusion about each predictor's positive or negative impact. Now, in terms of the Concordance Index and AIC, the LogNormal and Weibull models both show the highest C-index values, BUT the LogNormal model dominates due to its lower AIC. Thus, the LogNormal is selected as the model with the best fit.

Focusing on the LogNormal AFT Model and the interpretation of the estimated coefficient for each covariate (coef), all predictors show a p-value lower than the conventional 5% significance level, hence rejecting the null hypothesis and indicating a statistically significant impact on the survival time. Age shows a negative coefficient of -0.06, indicating that as age increases, the member is likely to experience the event sooner, hence terminating his/her membership earlier. That is: each additional year of age represents a 6% decrease in survival time, the latter being multiplied by a factor of 0.94 (exp(coef)), hence accelerating the time to event. In contrast, number of visits, first visit since joining and last visit all show a strong positive effect on survival, indicating a strong association between more visits, early engagement and recent engagement, and increased survival time.
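
To make the Age reading concrete: the acceleration factor is simply the exponential of the coefficient, so a coefficient of -0.06 multiplies the predicted survival time by about 0.94 per additional year. A one-line check:

```python
import numpy as np

coef_age = -0.06                 # estimated Age coefficient from the summary
factor = np.exp(coef_age)        # exp(coef) ≈ 0.94
print(f"Each extra year of age multiplies survival time by {factor:.2f} "
      f"(a {1 - factor:.0%} decrease)")
```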

Now, in terms of the Concordance Index across models (the focus of this analysis), the Generalized Gamma AFT Model is the one with the lowest C-index value, hence the model with the weakest predictive accuracy; that is, the model with the weakest ability to correctly rank survival times based on the predicted risk scores. This highlights an important aspect of model performance: regardless of a model's ability to capture the correct direction of the effect across predictors, this does not necessarily guarantee predictive accuracy, specifically the ability to discriminate between subjects who experience the event sooner versus later, as measured by the concordance index. The C-index explicitly evaluates the ranking accuracy of the model rather than its absolute accuracy. This is a fundamental distinction lying at the heart of this analysis, which I will expand on next.

CONCORDANCE INDEX (C-INDEX)

A “ranked survival time” refers to the predicted risk scores produced by the model for each individual, used to rank, and hence discriminate between, individuals who experience the event earlier compared to those who experience it later. The Concordance Index is a measure of ranking accuracy rather than absolute accuracy; specifically, the C-index assesses the proportion of all pairs of individuals whose predicted survival times align with the actual outcomes. In absolute terms, there is no concern with how precise the model is at predicting the exact number of days it took a member to cancel their membership; instead, the concern is how accurately the model ranks individuals, i.e. whether the ordering of the actual and predicted cancellation times aligns. The figures below illustrate this:

Drawn by the author based on actual and estimated values from the validation dataset.

The two instances above are taken from the validation set after the model was trained on the training set and predictions were generated for unseen data. These examples illustrate cases where the predicted survival time (as estimated by the model) exceeds the actual survival time. The horizontal parallel lines represent time.

For Member 1, the actual membership duration was 390 days, whereas the model predicted a duration of 486 days — an overestimation of 96 days. Similarly, Member 2's actual membership duration was 1,003 days, but the model predicted the membership cancellation to occur 242 days later than it actually did, that is, a membership duration of 1,245 days.

Despite these discrepancies in absolute predictions (and this is important): the model correctly ranked the two members in terms of risk, accurately predicting that Member 1 would cancel their membership before Member 2. This distinction between absolute error and relative ranking is a critical aspect of model evaluation. Consider the following hypothetical scenario:

Drawn by the author based on actual and estimated values from the validation dataset.

if the model had predicted a membership duration of 1,200 days for Member 1 instead of 486 days, this would not affect the ranking. The model would still predict that Member 1 terminates their membership sooner than Member 2, regardless of the magnitude of the error in the prediction (i.e., the number of days). In survival analysis, any prediction for Member 1 that falls before the dotted line in the graph would maintain the same ranking, classifying this as a concordant pair. This idea is central to calculating the C-index, which measures the proportion of all pairs that are concordant in the dataset.
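
This invariance is easy to check with lifelines' concordance utility. The sketch below uses the Member 1 and Member 2 values quoted above, including the hypothetical 1,200-day prediction:

```python
from lifelines.utils import concordance_index

actual = [390, 1003]                            # Member 1, Member 2
print(concordance_index(actual, [486, 1245]))   # 1.0: the pair is concordant
# Hypothetical scenario: predicting 1,200 days for Member 1 leaves the
# ordering, and hence the C-index, unchanged.
print(concordance_index(actual, [1200, 1245]))  # still 1.0
```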

A few hypothetical scenarios are shown below. In each of them, the magnitude of the error (the difference between the actual event time and the predicted event time, i.e. the absolute error) increases or decreases. However, the ranking accuracy remains unchanged.

Drawn by the author based on actual and estimated values from the validation dataset.

The instances below are also taken from the validation set, BUT for these the model predicts the termination of the membership before the actual event occurs. For Member 3, the actual membership duration is 528 days, but the model predicted termination 130 days earlier, namely a 398-day membership duration. Similarly, for Member 4, the model anticipates the termination of the membership before the actual event. In both cases, the model correctly ranks Member 4 to terminate their membership before Member 3.

Drawn by the author based on actual and estimated values from the validation dataset.

In the hypothetical scenario below, even if the model had predicted the termination 180 days earlier for Member 3, the ranking would remain unchanged. This would still be classified as a concordant pair. We can repeat this analysis multiple times, and in 88% of cases the LogNormal model will produce this result, as indicated by the concordance index. That is: cases where the model correctly predicts the relative ordering of the individuals' survival times.

Drawn by the author based on actual and estimated values from the validation dataset.
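
To connect this to how the metric is computed, below is a from-first-principles sketch of the C-index as the proportion of concordant pairs (tied predictions counted as half). The Member 1-3 values come from the figures above; Member 4's values are hypothetical, since no numbers are quoted for them:

```python
from itertools import combinations

def c_index(actual, predicted):
    """Proportion of comparable pairs whose predicted ordering
    matches the actual ordering (tied predictions count as 0.5)."""
    concordant, ties, total = 0, 0, 0
    for i, j in combinations(range(len(actual)), 2):
        if actual[i] == actual[j]:
            continue                   # no actual ordering to predict
        total += 1
        if predicted[i] == predicted[j]:
            ties += 1
        elif (actual[i] < actual[j]) == (predicted[i] < predicted[j]):
            concordant += 1
    return (concordant + 0.5 * ties) / total

# Members 1-3 from the figures; Member 4 (actual 380, predicted 350) is hypothetical.
print(c_index([390, 1003, 528, 380], [486, 1245, 398, 350]))  # 5 of 6 pairs: ≈ 0.83
```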

As with everything, the key is to identify when it is strategic to use survival analysis based on the task at hand. Use cases where ranking individuals via survival analysis is the most effective strategy, as opposed to focusing on reducing the absolute error, include:

Customer retention — businesses rank customers by their likelihood of churning. Survival analysis allows the most at-risk customers to be identified, so retention efforts can be targeted accordingly.

Employee attrition — HR analytics. Organizations use survival analysis to predict and rank employees by their likelihood of leaving the company. Similar to the above, this allows the most at-risk employees to be identified, aiming to improve retention rates and reduce turnover costs.

Healthcare — resource allocation. Survival models can be used to rank patients based on their risk of adverse outcomes (e.g. disease progression). Here, correctly identifying which patients are at the highest risk and need urgent intervention, allowing limited resources to be allocated more effectively, is more critical and hence more relevant than the exact survival time.

Credit risk — finance. Financial institutions employ survival models to rank borrowers based on their risk of default. Thus, they are more concerned with identifying the riskiest customers, to make more informed lending decisions, than with pinpointing the exact month of default. This can positively guide loan approvals (among other decisions).

In all of the above, the relative ranking of subjects (e.g., who is at higher or lower risk) directly drives actionable decisions and resource allocation. Absolute error in the survival time predictions may not significantly affect the outcomes, as long as the ranking accuracy (C-index) remains high. This demonstrates why models with a high C-index can be highly effective, even when their absolute predictions are less precise.

IN SUMMARY

In survival analysis, it is crucial to distinguish between absolute error and ranking accuracy. Absolute error refers to the difference between the predicted and actual event times, in this analysis measured in days. Metrics such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) quantify the magnitude of these discrepancies, hence measuring the overall predictive accuracy of the model. However, these metrics do not capture the model's ability to correctly rank subjects by their likelihood of experiencing the event sooner or later.
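
A toy illustration of the distinction, using three made-up durations: shifting every prediction by a constant 300 days inflates MAE and RMSE while the C-index stays perfect, because the ordering is untouched:

```python
import numpy as np
from lifelines.utils import concordance_index

actual = np.array([390, 528, 1003])
predicted = actual + 300                              # constant 300-day offset

mae = np.mean(np.abs(predicted - actual))             # 300.0
rmse = np.sqrt(np.mean((predicted - actual) ** 2))    # 300.0
c = concordance_index(actual, predicted)              # 1.0: ordering preserved

print(f"MAE={mae:.0f} days, RMSE={rmse:.0f} days, C-index={c:.2f}")
```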

Ranking accuracy, on the other hand, evaluates how well the model orders subjects based on their predicted risk, regardless of the exact time prediction, as illustrated above. This is where the concordance index (C-index) plays a key role. The C-index measures the model's ability to correctly rank pairs of individuals, with higher values indicating better ranking accuracy. A C-index of 0.88 suggests that the model correctly ranks the risk of membership termination 88% of the time.

Thus, while absolute error provides helpful insight into the precision of time predictions, the C-index focuses on the model's ability to rank subjects correctly, which is often more important in survival analysis. A model with a high C-index can be highly effective at ranking individuals even if it carries some degree of absolute error, making it a powerful tool for predicting relative risks over time.
