to many areas of knowledge, helping us deal with uncertainty, calculate probabilities, and support decisions along the way.
One of those areas that relies heavily on statistics is the medical industry, using tools like T-Tests, A/B Tests, or Survival Analysis. This last one is the subject of this article.
Survival analysis originated in the medical and biological sciences, where researchers modeled, as the primary event, the death of a patient or organism. That's the reason for the name.
However, statisticians realized the technique was so powerful that it could be applied to many other areas of life, and so it spread to the business domain, especially after the rise of Data Science.
Let’s learn more about it.
Survival Analysis
Survival Analysis [SA] is a branch of statistics used to predict the amount of time it takes for a specific event to occur.[1]
Also known as Time-to-event, this study can determine how long it will take for something to happen while accounting for the fact that some events haven’t happened yet by the time the data is collected.
The examples are not only in the medical and biological sciences, but everywhere.
- Time until a machine fails
- Time until a customer cancels a subscription
- Time until the customer buys again
Now, given that we are trying to estimate a number, rather than a group or class, this means we are dealing with a type of regression problem. So why can’t we go with OLS Linear Regression?
Why Use Survival Analysis?
Standard regression models like OLS or Logistic Regression struggle with survival data because they are designed to handle completed events, not “ongoing” stories.
Imagine you want to predict finish times for a 10-mile race, but the race is still going on. Two hours in, you want to use the data collected so far to make an estimate.
The regular regression algorithms will fail because:
- OLS: You only have data from those who have already finished the race. Using only their data creates a heavy bias toward the fastest runners.
- Logistic Regression: It can tell you whether someone finished the race, but it treats a runner who finished in 30 minutes the same as one who finished in 8 hours.
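A quick simulation makes the OLS-style bias concrete. The finish times below are made-up numbers for illustration: we snapshot the race at the 2-hour mark and compare the true average finish time with the average computed from finishers only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical finish times (hours) for 1,000 runners
true_times = rng.normal(loc=2.5, scale=0.8, size=1000).clip(min=0.5)

# Snapshot at the 2-hour mark: only finishers are fully observed
finished = true_times <= 2.0

true_mean = true_times.mean()
observed_mean = true_times[finished].mean()  # drop the censored runners

print(f"True mean finish time:     {true_mean:.2f} h")
print(f"Mean using finishers only: {observed_mean:.2f} h")  # biased low
```

Dropping the still-running (censored) observations systematically underestimates the finish time, which is exactly what survival analysis is designed to avoid.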
The Fundamentals of Survival Analysis
Let us go over a few important concepts for understanding Survival Analysis.
First, we must understand the birth and death of a data point.
- Birth: The moment we started to measure that data point. For example, the moment a patient is diagnosed with cancer, or the day a person is hired by a company. Notice that the observations don’t need to start all at the same time.
- Death: It happens at the occurrence of the event of interest. The day the employee left the company.
Now, the interesting thing about SA is that the study or the observation can end before the event happens. In this case, we will have another important concept: the censored data point.
- Censoring (Non-death): If the study ends or a subject drops out before the event happens, the data is “censored,” meaning we only know they survived at least until that point.
Data can be censored in different ways, though.
- Right Censoring: Most common. The event occurs after the observation period ends or the subject drops out.
Data point C is right-censored. Image by the author.
- Left Censoring: The event occurred before the study started.
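A minimal sketch of how this is typically encoded for modeling (the column names are illustrative): each subject gets a duration and an event flag, and a right-censored subject simply carries event = 0.

```python
import pandas as pd

# Hypothetical subjects: 'duration' is months observed,
# 'event' is 1 if the event happened, 0 if censored (still active at cutoff)
obs = pd.DataFrame({
    "subject": ["A", "B", "C"],
    "duration": [12, 30, 18],
    "event": [1, 1, 0],  # C is right-censored: survived at least 18 months
})
print(obs)
```

This duration-plus-event-flag pair is also exactly the input that survival libraries like lifelines expect.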
Great. It is important to note that survival analysis is a way to estimate the probability of an event occurring as a function of time. By treating survival as a function of time, we can answer questions that a single probability score can’t, such as: “At what specific month does the risk of a customer churning peak?”
Now that we know the basics, let’s learn more about the functions involved in SA.
Survival Function
The survival function S(t) expresses the probability of the event not occurring as a function of time. It will naturally decrease as time passes, since more and more individuals will experience the event.
So, applying it to our employee churn example, we would see the probability that an employee is still in the company after N years.
Survival Function. Image by the author.
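Formally, the survival function is simply the probability that the event time T exceeds t:

```latex
S(t) = P(T > t)
```

Because it is the probability of "not yet happened," it starts at S(0) = 1 and can only stay flat or decrease as time passes.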
Hazard Function
The hazard function indicates the instantaneous risk of the event occurring at a given point in time. It is the counterpart of the survival function, representing the risk of churn (instead of the probability of staying in the company).
In other words, it calculates the probability that the employees who have not churned so far will do so at this point in time.
Hazard Function. Image by the author.
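In notation, the hazard is the instantaneous event rate among those still at risk, which also connects it to the density f(t) and the survival function:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
```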
Choosing Your Model for Survival Analysis
As you see, SA is a topic that can get deep and dense real quick. But let’s try to keep it simple.
There are two main models used when performing survival analysis. One is the Kaplan-Meier, which is simpler but does not consider the effect of additional predictor variables, and it requires a few assumptions to work.
The other one is the Cox Proportional Hazard model, which is the industry standard because it can take other variables into the model, it is more stable mathematically, and it works well even if some assumptions are violated.
Let’s learn more about them.
Kaplan-Meier
- Works well with right-censored data (remember? when the event occurs after the observation period ends)
- Intuitive model
- Non-parametric: does not assume any underlying distribution
- A few assumptions are still required: dropouts are not related to the event; entry time does not affect survival risk; and event times are known accurately
- Returns a survival function that looks like a staircase
When to use:
- Simple survival analysis without other covariates or predictors.
- Great for quick visualizations.
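The staircase shape comes directly from the Kaplan-Meier estimator, which multiplies together the fraction surviving at each observed event time (d_i events out of n_i subjects still at risk at time t_i):

```latex
\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```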
Cox Proportional Hazard
- Industry standard
- Accepts additional predictors or covariates
- Works well even if some assumptions are violated
- Estimates a hazard function, which tends to be more stable than survival functions
When to use:
- Estimate on data with multiple predictor (covariate) variables.
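The model's name comes from its form: a baseline hazard h_0(t), shared by all subjects, is scaled multiplicatively by the covariates. The exponentiated coefficients exp(beta) are the hazard ratios that lifelines reports:

```latex
h(t \mid x_1, \dots, x_p) = h_0(t) \, \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)
```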
Next, let’s get our hands on some code.
Code
In this section, we will learn how to model an SA using both models previously presented.
The dataset chosen for this exercise is the Telco Customer Churn, which you can find in the UCI Machine Learning Repository under the Creative Commons license.
View of the dataset. Image by the author.
Next, let’s import the packages needed.
# Data
from ucimlrepo import fetch_ucirepo
# Data Wrangling
import pandas as pd
import numpy as np
# DataViz
import matplotlib.pyplot as plt
import seaborn as sns
# Lifelines Survival Analysis
from lifelines import KaplanMeierFitter
from lifelines import CoxPHFitter
# fetch dataset
telco_churn = fetch_ucirepo(id=563)
# data (as pandas dataframes)
X = telco_churn.data.features
y = telco_churn.data.targets
# Pandas df
df = pd.concat([X, y], axis=1)
df.head(3)
Implementing Kaplan-Meier
Now, as mentioned, the Kaplan-Meier [KM] model is really simple and straightforward to use, making it a good choice for visualizations. All we need are two variables: a duration and an event indicator.
Then, we can instantiate the KM model and fit it to the data, using Subscription Length (total months of subscription) as the predictor, and Churn as the event observed.
# Instantiate K-M
kmf = KaplanMeierFitter()
# Fit the model
kmf.fit(df['Subscription Length'],
        event_observed=df['Churn'],
        label='Customer Churn')
Done. Next, we can visualize the survival function.
# Plot survival curve
plt.figure(figsize=(12, 5))
kmf.plot_survival_function()
plt.title('Kaplan-Meier Survival Curve: Telco Customer Lifetime')
plt.xlabel('Time (months)')
plt.ylabel('Probability of Remaining Subscribed')
plt.grid(True)
plt.show()
This is great! We can see that the survival probability stays above 90% for approximately the first 35 months.
Kaplan-Meier model is great for visualizations. Image by the author.
If we want to confirm, we can easily check in code and learn that, in fact, 90% of customers are still with the company at month 34.
# Checking survival rate at 34 months
kmf.survival_function_at_times(34)
Customer Churn
34 0.900613
If we want to know the median time at which people churn, we can use KM's attribute .median_survival_time_. This is the point in time (t) where the survival probability drops to 50%. In our case, it will be inf because the survival function never drops under 0.5. But if the result were 24 (for example), it would mean that half of your customers will have churned by month 24.
# Time (t) when Survival drops under 50%
median_survival = kmf.median_survival_time_
print(f"Median Customer Lifetime: {median_survival} months")
We can also perform other analyses, such as comparisons between groups. Imagine that this Telco company classifies its customers into two groups:
- Heavy-users: Frequency of Use > median
- Soft-users: Frequency of Use <= median
We can compare both survival functions from these two groups.
# Column Groups
df['Heavy_User'] = np.where(df['Frequency of use'] > df['Frequency of use'].median(), 1, 0)
df.head()
plt.figure(figsize=(12, 5))
plt.title('Kaplan-Meier Survival Curve: Telco Customer Lifetime')
plt.xlabel('Time (months)')
plt.ylabel('Survival Probability')
# Fit the model for Soft users and plot
kmf.fit(df[df.Heavy_User == 0]['Subscription Length'], df[df.Heavy_User == 0]['Churn'], label='Soft User')
ax = kmf.plot_survival_function()
# Fit the model for Heavy users and plot
kmf.fit(df[df.Heavy_User == 1]['Subscription Length'], df[df.Heavy_User == 1]['Churn'], label='Heavy User')
ax = kmf.plot_survival_function(ax=ax)
plt.show()
And there it is. While heavy users stay with the company steadily throughout the whole timeframe, soft users start churning quickly after the 30th month; their median survival time is 40 months.
Survival comparison between groups. Image by the author.
When comparing groups, you must make sure that the difference is statistically significant. For that, the package lifelines has the log-rank test implemented. It is a hypothesis test:
- H0 (null hypothesis): The survival curves of the two populations do not differ.
- Ha (alternative hypothesis): The survival curves of the two populations are different.
from lifelines.statistics import logrank_test

# Perform the Log-Rank Test
results = logrank_test(df[df.Heavy_User == 0]['Subscription Length'],
                       df[df.Heavy_User == 1]['Subscription Length'],
                       event_observed_A=df[df.Heavy_User == 0]['Churn'],
                       event_observed_B=df[df.Heavy_User == 1]['Churn'])

# Print Results
print(f"P-value: {results.p_value}")
print(f"Test Statistic: {results.test_statistic}")
if results.p_value < 0.05:
    print("Result: Statistically significant difference between groups.")
else:
    print("Result: No significant difference detected.")
P-value: 7.23487469906141e-103
Test Statistic: 463.7794219211866
Result: Statistically significant difference between groups.
Implementing Cox Proportional Hazard
The first cool thing that you can do with the Cox Proportional Hazard [CPH] Model is checking how other variables can influence the survival of your observed individual.
Let’s break it down.
- We start by choosing some covariates
- We filter the dataset
- Instantiate the model
- Fit the model
# 1. Prepare the data
# Selecting the time, the event, and our chosen covariates
cols_to_use = [
    'Subscription Length',  # Time (t)
    'Churn',                # Event (E)
    'Charge Amount',        # Covariate 1
    'Complains',            # Covariate 2
    'Frequency of use'      # Covariate 3
]
# Dropping any missing values for the model
df_model = df[cols_to_use].dropna()

# 2. Initialize and fit the Cox model
# Use the penalizer to stabilize the math if not converging.
cph = CoxPHFitter(penalizer=0.1)
cph.fit(df_model,
        duration_col='Subscription Length',
        event_col='Churn')

# 3. Display the results
cph.print_summary()

# 4. Visualize the influence of covariates
cph.plot()
This is our beautiful result.
CPH model. Image by the author.
How can we interpret this?
The dashed vertical line at 0.0 is the neutral point.
- If a variable’s point sits at 0, it has no effect on churn.
- To the Right (> 0): Increases the hazard (makes churn happen faster).
- To the Left (< 0): Decreases the hazard (makes the customer stay longer).
- In the table, the most important column for business stakeholders is the Hazard Ratio, exp(coef). It tells us the multiplier effect on the risk of churn.
[TABLE] Complains (5.36): A customer who complains is 5.36 times (or 436%) more likely to churn at any given time than a customer who doesn’t complain. This is a massive effect.
[GRAPHIC] Complains (High Hazard): This is our strongest predictor. Customers with complaints are roughly 5.4 times more likely to churn at any given moment compared to those who don’t.
[TABLE] Frequency of use (0.99): While the p-value says this is technically significant, an HR of 0.99 is effectively 1. It means the impact on churn is negligible (only a 1% change).
[GRAPHIC] Frequency of Use (Neutral): The square is sitting almost exactly on the 0.0 line. In this specific model, how often a customer uses the service doesn’t significantly change when they churn.
[TABLE] Charge Amount (0.83): For every one-unit increase in charge, the risk of churn drops by 17% (1 − 0.83 = 0.17). Higher-paying customers are more stable.
[GRAPHIC] Charge Amount (Protective Factor): The square is to the left of the zero line. Higher charges are associated with a lower risk of churn.
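Since the exp(coef) column is just the exponential of the raw coefficient, you can recover the hazard ratios yourself. The coefficients below are illustrative approximations chosen to mirror the values discussed above, not the exact fitted output:

```python
import numpy as np

# Illustrative log-hazard-ratio coefficients (not the exact fitted values)
coefs = {"Complains": 1.68, "Frequency of use": -0.01, "Charge Amount": -0.19}

for name, coef in coefs.items():
    hr = np.exp(coef)            # hazard ratio = exp(coefficient)
    change = (hr - 1) * 100      # percent change in hazard per unit increase
    print(f"{name}: HR = {hr:.2f} ({change:+.0f}% hazard per unit)")
```

This is the same transformation lifelines applies when it prints the exp(coef) column in the summary table.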
We can also take a look at both the Survival and the Hazard functions for this model.
Survival and Hazard functions from the CPH model. Image by the author.
The curve is similar to the KM model. Let’s compare the survival probability at the same 34th month.
# Extract the baseline survival probability at time 34
survival_at_34 = cph.baseline_survival_.loc[34]
print(f"Baseline Survival Probability at period 34: {survival_at_34.values[0]:.4f}")
Baseline Survival Probability at period 34: 0.9294
It is almost 3 percentage points higher than the Kaplan-Meier estimate, at ~93%.
And to close this article, let’s pick two different customers, one without complaints and the other with complaints, and let’s compare their survival probabilities at the 34th month.
# 1. Pick a customer (or predict for a new one)
individual = df_model.iloc[[110, 111]]
# 2. Predict their full survival curve
pred_survival = cph.predict_survival_function(individual)
# 3. Get the value at time 34
prob110_at_34 = pred_survival.loc[34].values[0]
prob111_at_34 = pred_survival.loc[34].values[1]
print(f"Customer 110 (no complaints) Probability of 'Surviving' to period 34: {prob110_at_34:.2%}")
print(f"Customer 111 (has complaints) Probability of 'Surviving' to period 34: {prob111_at_34:.2%}")
Customer 110 (no complaints) Probability of 'Surviving' to period 34: 93.94%
Customer 111 (has complaints) Probability of 'Surviving' to period 34: 61.68%
Big difference, huh? More than 30 percentage points. And we can finally calculate the time in months when each customer is expected to churn.
# Time Until Churn (Expected life) by customer
pred_churn = cph.predict_expectation(df_model.iloc[[110, 111]])
# Get the values in months
prob110_churn = pred_churn.loc[110]
prob111_churn = pred_churn.loc[111]
print(f"Customer 110 (no complaints) expected churn at: {prob110_churn:.0f} months")
print(f"Customer 111 (has complaints) expected churn at: {prob111_churn:.0f} months")
Customer 110 (no complaints) expected churn at: 41 months
Customer 111 (has complaints) expected churn at: 31 months
Definitely, complaints make a difference in churn for this Telco company.
Before You Go
Well, survival analysis is much more than just a statistical function. Companies can use it to understand customer behavior.
The Kaplan-Meier and Cox Proportional Hazard models provide actionable insights into subscriber longevity. We’ve seen how variables like customer value and service complaints directly affect churn, allowing decision makers to pursue more targeted retention strategies.
Data professionals who understand these models can build a powerful tool for companies to improve their relationship with their user base. Use these tools to stay ahead of the curve. Literally.
If you liked this content, find me on my website.
https://gustavorsantos.me
GitHub Repository
https://github.com/gurezende/Survival-Analysis
References
[1. Survival Analysis Definition](https://en.wikipedia.org/wiki/Survival_analysis)
[2. The Complete Introduction to Survival Analysis in Python](https://medium.com/data-science/the-complete-introduction-to-survival-analysis-in-python-7523e17737e6)
[3. Introduction to Customer Survival Analysis: Understanding Customer Lifetimes](https://medium.com/@slavyolov/introduction-to-customer-survival-analysis-understanding-customer-lifetimes-6e4ba41d7724)
[4. Ultimate Guide to Survival Analysis](https://www.graphpad.com/guides/survival-analysis)
[5. What is the difference between Kaplan-Meier (KM) and Cox Proportional Hazards (CPH) ratio?](https://www.droracle.ai/articles/218904/what-is-the-difference-between-kaplan-meier-km-and-cox)
[6. Lifelines Documentation](https://lifelines.readthedocs.io/en/latest/)
[7. Survival Analysis in R For Beginners](https://www.datacamp.com/tutorial/survival-analysis-R)

