how math can solve so many problems in the real world. When I was in grade school, I certainly did not see it that way. I never hated math, by the way, and neither did I have trouble learning most of the basic concepts.
However, I confess that for most of the classes beyond classic arithmetic, I usually thought, “I will never use that for anything in my life”.
Those were other times, though. There was no Internet, no data science, and computers were barely a thing. But time passes. Life happens, and we get to see the day when we will solve important business problems with good old math!
In this post, we will use the famous linear regression for a different problem: predicting customer churn.
Linear Regression vs Churn
Customer churn rarely happens overnight. In many cases, customers will gradually reduce their purchasing frequency before stopping completely. Some call that silent churn [1].
Churn can be predicted with traditional churn models, but these (1) require labeled churn data; (2) are sometimes complex to explain; and (3) detect churn only after it has already happened.
On the other hand, this project shows a different solution, answering a simpler question:
Is this customer slowing down their shopping?
This question is answered with the following logic.
We use monthly purchase trends and linear regression to measure customer momentum over time. If the customer keeps spending more, the monthly totals will grow over time, producing an upward trend (or a positive slope in a linear regression, if you will). The opposite is also true: shrinking monthly totals add up to a downtrend.
Let’s break down the logic in small steps, and understand what we will do with the data:
- Aggregate customer transactions by month
- Create a continuous time index (e.g. 1, 2, 3…n)
- Fill missing months with zero purchases
- Fit a linear regression line
- Use the slope (converted to degrees) to quantify buying behavior
- Assessment: A negative slope indicates declining engagement. A positive slope indicates increasing engagement.
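As a minimal sketch of this idea, assume two made-up monthly spending series (the numbers are illustrative and not taken from the dataset used in this post). The sign of the fitted slope is what separates a growing customer from a declining one:

```python
import numpy as np
from scipy import stats

# Hypothetical monthly totals for two customers (illustrative values only)
months = np.arange(1, 13)  # continuous time index 1..12
growing = np.array([10, 12, 11, 15, 18, 17, 20, 22, 25, 24, 28, 30], dtype=float)
declining = growing[::-1]  # the same amounts, reversed in time

# Fit one regression line per customer and keep only the slope
slope_up = stats.linregress(months, growing).slope
slope_down = stats.linregress(months, declining).slope

print(f"growing:   slope = {slope_up:.2f}")
print(f"declining: slope = {slope_down:.2f}")
```

The first slope comes out positive and the second negative, which is the whole signal this approach relies on.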
Well, let’s move on to the implementation next.
Code
The first thing is importing some modules into a Python session.
# Imports
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Then, we will generate some data that simulates customer transactions. You can look at the complete code in this GitHub repository. The generated dataset has the columns customer_id, transaction_date, and total_amt, and looks like the next picture.
Dataset generated for this exercise. Image by the author.
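If you want something to experiment with before opening the repository, here is a rough stand-in generator. It is not the author's generator: the seed, the number of customers, the number of rows, and the amount range are all assumptions made for this sketch. Everything is kept inside a single calendar year, since the month number is used as the time index later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical setup: 20 customers, 300 transactions spread over 2024
n_rows = 300
customer_ids = [f"C_{i:03d}" for i in range(1, 21)]
days = pd.date_range("2024-01-01", "2024-12-31", freq="D")

df = pd.DataFrame({
    "customer_id": rng.choice(customer_ids, size=n_rows),
    # Pick random calendar days by position
    "transaction_date": days[rng.integers(0, len(days), size=n_rows)],
    "total_amt": rng.uniform(10, 500, size=n_rows).round(2),
})
df = df.sort_values("transaction_date").reset_index(drop=True)
print(df.head())
```

Because the amounts here are purely random, most customers will show no strong trend. The repository data is the one to use for reproducing the plots in this post.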
Now we will create a new column that extracts the month of the date, so it becomes easier for us to group the data later.
# Create new column month
df['mth'] = df['transaction_date'].dt.month

# Group customers by month
df_group = (
    df
    .groupby(['mth', 'customer_id'])
    ['total_amt']
    .sum()
    .reset_index()
)
Here is the result.
Grouped data. Image by the author.
A quick check shows that a few customers did not make a transaction every month.
That leads us to the next point: if a customer has no purchase in a given month, we have to add that month with a $0 expense.
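One way to run that check, sketched on a tiny stand-in for the grouped data (the values and the three-month window are hypothetical; with a full year you would compare against 12):

```python
import pandas as pd

# Tiny stand-in for the grouped data (hypothetical values)
df_group = pd.DataFrame({
    "mth": [1, 2, 3, 1, 3],
    "customer_id": ["C_001", "C_001", "C_001", "C_002", "C_002"],
    "total_amt": [100.0, 80.0, 120.0, 50.0, 40.0],
})
n_months = 3  # use 12 for a full year of data

# Count the distinct months in which each customer bought something
months_active = df_group.groupby("customer_id")["mth"].nunique()

# Customers with at least one month without purchases
incomplete = months_active[months_active < n_months]
print(incomplete)
```

In this toy data, only C_002 is returned, because it has no purchase in month 2.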
Let’s build a function that can do that and also calculate the slope of the customer’s shopping trend.
This function looks enormous, but we will go over it in smaller chunks. Let’s do this.
- Filter the data for a given customer using Pandas query() method.
- Make a quick group and check if the customer has at least one purchase for every month.
- If not, we will add the missing months with a $0 expense. I implemented this by merging a temporary dataframe, holding the 12 months with $0 each, with the original data. After the merge on months, the missing periods show up as rows with NaN in the original data column, which can be filled with $0.
- Then, we normalize the axes. Remember that the X-axis is an index from 1 to 12, but the Y-axis is the expense amount, in thousands of dollars. So, to avoid distortion in our slope, we normalize everything to the same scale, between 0 and 1. For that, we use the custom function min_max_standardize.
- Next, we can plot the regression using another custom function.
- Then we will calculate the slope, which is the first result returned from the function scipy.stats.linregress().
- Finally, to calculate the angle of the slope in degrees, we will appeal to pure mathematics, using the concept of arc tangent to calculate the angle between the X-axis and the linear regression slope line. In Python, just use the functions np.arctan() and np.degrees() from numpy.
Arctan concept. Image by the author.
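To see why the axes are rescaled before the angle is computed, here is a small sketch with made-up numbers. The helper min_max is a stand-in written for this example, and the spending series is hypothetical:

```python
import numpy as np
from scipy import stats

def min_max(vals):
    """Rescale values to the [0, 1] range."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.min()) / (vals.max() - vals.min())

months = np.arange(1, 13)
amounts = 1000.0 * months  # hypothetical: spending grows by $1,000 each month

raw_slope = stats.linregress(months, amounts).slope
scaled_slope = stats.linregress(min_max(months), min_max(amounts)).slope

# Angle between the X-axis and the regression line, in degrees
raw_angle = np.degrees(np.arctan(raw_slope))        # close to 90: dominated by the dollar scale
scaled_angle = np.degrees(np.arctan(scaled_slope))  # 45: steady growth across the full range

print(f"raw: {raw_angle:.2f} degrees | scaled: {scaled_angle:.2f} degrees")
```

On the raw scale, almost any growing customer would sit near 90 degrees, so the angles could not be compared. After rescaling both axes to the 0 to 1 range, a slope of 1 maps to 45 degrees and the angle becomes readable.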
# Standardize the data
# Note: returns NaN when all values are equal (zero range)
def min_max_standardize(vals):
    return (vals - np.min(vals)) / (np.max(vals) - np.min(vals))

# ------------
# Quick function to plot the regression
def plot_regression(x, y, cust):
    # Fit once and reuse the result for the line and the title
    fit = stats.linregress(x, y)
    angle = np.degrees(np.arctan(fit.slope))
    plt.scatter(x, y, color='gray')
    plt.plot(x,
             fit.slope * np.array(x) + fit.intercept,
             color='red',
             linestyle='--')
    plt.suptitle("Slope of the Linear Regression [Expenses x Time]")
    plt.title(f"Customer {cust} | Slope: {angle:.0f} degrees. Positive = Buying more | Negative = Buying less", size=9, color='gray')
    plt.show()

# ------------
def get_trend_degrees(customer, plot=False):
    # Filter the data
    one_customer = df.query('customer_id == @customer')
    one_customer = one_customer.groupby('mth').total_amt.sum().reset_index().rename(columns={'mth': 'period_idx'})

    # Check if all months are in the data
    cnt = one_customer.groupby('period_idx').period_idx.nunique().sum()

    # If not, add 0 to the months without transactions
    if cnt < 12:
        # Create a DataFrame with all 12 months
        all_months = pd.DataFrame({'period_idx': range(1, 13), 'total_amt': 0})

        # Merge with the existing one_customer data.
        # Use a 'left' merge to keep all 12 months from 'all_months' and fill missing total_amt.
        one_customer = pd.merge(all_months, one_customer, on='period_idx', how='left', suffixes=('_all', ''))

        # Combine the total_amt columns, preferring the actual data over the 0 from all_months
        one_customer['total_amt'] = one_customer['total_amt'].fillna(one_customer['total_amt_all'])

        # Drop the temporary _all column
        one_customer = one_customer.drop(columns=['total_amt_all'])

        # Sort by period_idx to ensure correct order
        one_customer = one_customer.sort_values(by='period_idx').reset_index(drop=True)

    # Min Max Standardization
    X = min_max_standardize(one_customer['period_idx'])
    y = min_max_standardize(one_customer['total_amt'])

    # Plot
    if plot:
        plot_regression(X, y, customer)

    # Calculate slope
    slope = stats.linregress(X, y)[0]

    # Calculate angle degrees
    angle = np.arctan(slope)
    angle = np.degrees(angle)

    return angle
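As a side note, the zero-filling step can also be written with reindex() instead of a merge. This is an alternative sketch, not the code used in the function above, and the monthly totals are hypothetical:

```python
import pandas as pd

# Hypothetical monthly totals for one customer with purchases in only 3 months
monthly = pd.Series({2: 150.0, 5: 90.0, 11: 40.0}, name="total_amt")

# Target index with all 12 months; months with no purchases become 0
all_months = pd.RangeIndex(1, 13, name="period_idx")
filled = monthly.reindex(all_months, fill_value=0.0).reset_index()
print(filled)
```

The result has one row per month from 1 to 12, with the three real totals kept and every other month set to 0.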
Great. It is time to put this function to the test. Let’s get two customers:
- C_014.
- This is an uptrend customer who’s buying more over time.
# Example of strong customer
get_trend_degrees('C_014', plot=True)
The plot it yields shows the trend. We notice that, even though there are some weaker months in between, overall, the amounts tend to increase as time passes.
Uptrending customer. Image by the author.
The trend is 32 degrees, thus pointing well up, indicating a strong relationship with this customer.
- C_003.
- This is a downtrend customer who’s buying less over time.
# Example of a customer who is buying less
get_trend_degrees('C_003', plot=True)
Downtrending customer. Image by the author.
Here, the expenses over the months are clearly decreasing, making the slope of this line point down. The line is at 29 degrees negative, indicating that this customer is moving away from the brand and needs an incentive to come back.
Before You Go
Well, that is a wrap. This project demonstrates a simple, interpretable approach to detecting declining customer purchase behavior using linear regression.
Instead of relying on complex churn models, we analyze purchase trends over time to identify when customers are slowly disengaging.
This simple model can give us a good sense of the direction the customer is moving in, whether toward a better relationship with the brand or away from it.
Certainly, with other data from the business, it is possible to improve this logic, apply a tuned threshold, and quickly identify potential churners every month based on past data.
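A rough sketch of what that monthly screening could look like is below. The customer IDs, the monthly totals, and the cutoff of -10 degrees are all hypothetical, and trend_degrees is a compact stand-in for the function built earlier, with an extra guard for flat spending that the original does not have:

```python
import numpy as np
import pandas as pd
from scipy import stats

def trend_degrees(monthly_totals):
    """Angle (degrees) of the regression line fitted on min-max scaled axes."""
    y = np.asarray(monthly_totals, dtype=float)
    x = np.arange(1, len(y) + 1, dtype=float)
    if y.max() == y.min():
        return 0.0  # flat spending: no trend (also avoids dividing by zero)
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    return float(np.degrees(np.arctan(stats.linregress(x, y).slope)))

# Hypothetical 12-month totals per customer, already zero-filled
totals = pd.DataFrame({
    "C_A": [50, 60, 55, 70, 80, 75, 90, 95, 110, 100, 120, 130],
    "C_B": [130, 120, 100, 110, 95, 90, 75, 80, 70, 55, 60, 50],
    "C_C": [80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80],
}, index=pd.RangeIndex(1, 13, name="period_idx"))

THRESHOLD = -10.0  # hypothetical cutoff in degrees; tune it with real business data

angles = totals.apply(trend_degrees)  # one angle per customer (column)
at_risk = angles[angles < THRESHOLD].index.tolist()

print(angles.round(1))
print("Potential churners:", at_risk)
```

Only C_B, whose spending falls steadily, ends up below the cutoff. Where to place the threshold is a business decision that should be validated against past churn, not a property of the method.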
Before wrapping up, I would like to give proper credit to the original post that inspired me to learn more about this implementation. It is a post from Matheus da Rocha that you can find here, in this link.
Finally, find more about me on my website.
https://gustavorsantos.me
GitHub Repository
Here you find the full code and documentation.
https://github.com/gurezende/Linear-Regression-Churn/tree/main
References
[1. Forbes] https://www.forbes.com/councils/forbesbusinesscouncil/2023/09/15/is-silent-churn-killing-your-business-four-indicators-to-monitor
[2. NumPy Arctan] https://numpy.org/doc/2.1/reference/generated/numpy.arctan.html
[3. Arctan Explanation] https://www.cuemath.com/trigonometry/arctan/
[4. NumPy Degrees] https://numpy.org/doc/2.1/reference/generated/numpy.degrees.html
[5. SciPy Linregress] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html

