task in Data Science, especially if we are performing an A/B Test to understand the effects of a given variable over those groups.
The problem is that the world is just… well, real. I mean, it is very beautiful to think of a controlled environment where we can isolate just one variable and measure the effect of it. But what happens most of the time is that life just runs over everything, and the next thing you know, your boss is asking you to compare the effect of the latest campaign on customers’ expenses.
But you never prepared the data for the experiment. All you have is the ongoing data before and after the campaign.
Enter Propensity Score Matching
In simple terms, Propensity Score Matching (PSM) is a statistical technique used to see if a specific action (a “treatment”) actually caused a result.
Because we can’t go back in time and see what would have happened if someone had made a different choice, we find a “twin” in the data, someone who looks almost exactly like them but didn’t take the treatment action, and compare their results instead. Finding these “statistical twins” helps us compare customers fairly, even when you haven’t run a perfectly randomized experiment.
The Problem With Averages
Simple averages assume the groups were identical to begin with. When you compare a simple average of a treated group to a control group, you are measuring all the pre-existing differences that led people to choose that treatment in the first place.
Suppose we want to test a new energy gel for runners. If we just compare everyone who used the gel to everyone who didn’t, we are ignoring important factors like the levels of experience and knowledge of the runners. People who bought the gel might be more experienced, have better shoes, or even train harder and be supervised by a professional. They were already “predisposed” to run faster anyway.
PSM acknowledges the differences and acts like a scout:
- The Scouting Report: For every runner who used the gel, the scout looks at their stats: age, years of experience, and average training miles.
- Finding the Twin: The scout then looks through the group of runners who didn’t use the gel to find a “twin” with the exact same stats.
- The Comparison: Now, you compare the finish times of these “twins.”
Notice how we are now comparing similar groups: high performers against high performers, low against low. This way, we control for the confounding factors that could also drive the outcome and measure the real impact of the energy gel.
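The scouting analogy can be sketched in a few lines of pandas. The runners, stats, and finish times below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical runners (made-up data): two used the gel, two did not
runners = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara", "Dan"],
    "experience_years": [8, 2, 7, 1],
    "used_gel": [1, 1, 0, 0],
    "finish_min": [41, 55, 43, 58],
})

treated = runners[runners["used_gel"] == 1]
control = runners[runners["used_gel"] == 0]

# For each gel user, find the control runner with the closest experience
effects = []
for _, t in treated.iterrows():
    twin = control.iloc[(control["experience_years"] - t["experience_years"]).abs().argmin()]
    effects.append(t["finish_min"] - twin["finish_min"])

print(sum(effects) / len(effects))  # → -2.5
```

Ana (8 years) is paired with Cara (7 years) and Ben (2 years) with Dan (1 year), so each comparison is experienced-vs-experienced or novice-vs-novice: within these toy pairs, gel users finish about 2.5 minutes faster than their twins.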
Great. Let’s move on to learn how to implement this model.
Step-by-Step of PSM
Now we will go over the steps to implement PSM on our data. This is important so we can build intuition and have a logical sequence to follow when applying the technique to any dataset.
- The first step is creating a simple Logistic Regression model. This well-known classification model estimates the probability that a subject belongs to the treatment group. In simpler words: what is the propensity of that individual to take the action being studied?
- From step one, we add the propensity score (the predicted probability) as a new column in the dataset.
- Next, we will use the Nearest Neighbors algorithm to scan the control group and find the person with the closest score to each treated user.
- As a “quality filter”, we add a threshold for calibration (the caliper). If the distance to the “closest” match is still higher than that threshold, we toss the pair out. It’s better to have a smaller, cleaner sample than a large, biased one.
- We evaluate the matched pairs using the Standardized Mean Difference (SMD), which checks whether the two groups are actually comparable.
Let’s code then!
Dataset
For the purpose of this exercise, I will generate a dataset of 1000 rows with the following variables:
- Age of the person
- Past expenses with this company
- A binary flag indicating the use of a mobile device
- A binary flag indicating whether the person saw the advertising
age past_spend is_mobile saw_ad
0 29 557.288206 1 1
1 45 246.829612 0 1
2 24 679.609451 0 0
3 67 1039.030017 1 1
4 20 323.241117 0 1
You can find the code that generated this dataset in the GitHub repository.
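If you want to reproduce something similar without the repository, here is a minimal sketch of a comparable synthetic generator. The distributions and coefficients are my own illustrative assumptions, not the exact ones used for the article’s dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

age = rng.integers(18, 70, n)
past_spend = rng.gamma(shape=2.0, scale=250.0, size=n).round(2)
is_mobile = rng.integers(0, 2, n)

# Treatment assignment depends on the covariates: this creates the
# selection bias that PSM is meant to correct (weights are illustrative)
logit = -2.0 + 0.03 * age + 0.001 * past_spend + 0.5 * is_mobile
p_treat = 1 / (1 + np.exp(-logit))
saw_ad = rng.binomial(1, p_treat)

df = pd.DataFrame({"age": age, "past_spend": past_spend,
                   "is_mobile": is_mobile, "saw_ad": saw_ad})
print(df.head())
```

The key design choice is that `saw_ad` is not random: older, higher-spending, mobile users are more likely to be treated, so a naive comparison of means would be biased.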
Code Implementation
Next, we are going to implement the PSM using Python. Let’s start importing the modules.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
Now, we can start by creating the propensity score.
Step 1: Calculating the Propensity Scores
In this step, we will just run a LogisticRegression model that takes into consideration the age, past_spend, and is_mobile variables and estimates the probability that this person saw the advertising.
The goal is not 99% accuracy in the prediction, but balanced covariates: ensuring that the treated and control groups have nearly identical average characteristics (like age and expenses), so that any difference in the outcome can be attributed to the treatment rather than to pre-existing differences.
# Step 1: Calculate the Propensity Scores
# Define covariates and treatment
covariates = ['age', 'past_spend', 'is_mobile']
treatment_col = 'saw_ad'

# 1. Estimate Propensity Scores (probability of treatment)
lr = LogisticRegression()
X = df[covariates]
y = df[treatment_col]

# Fit a Logistic Regression
lr.fit(X, y)

# Store the probability of being in the 'Treatment' group
df['pscore'] = lr.predict_proba(X)[:, 1]
So, after fitting the model, we slice the predict_proba() results to keep only the column with the probability of being in the treatment group (the prediction for saw_ad == 1).
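If the [:, 1] slicing looks mysterious, a tiny standalone example shows what predict_proba returns: one column per class, with column 1 holding the probability of the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: the probability of class 1 grows with x
X_toy = np.array([[0.], [1.], [2.], [3.]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
print(proba.shape)   # (4, 2)
print(proba[:, 1])   # probabilities of the positive class, rising with x
```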
Propensity Score added to the dataset. Image by the author.
Next, we will split the data into control and test.
- Control: people who did not see the advertising.
- Treatment: people who saw the advertising.
# 2. Split into Treatment and Control
treated = df[df[treatment_col] == 1].copy()
control = df[df[treatment_col] == 0].copy()
It is time to find the statistical twins in this data.
Step 2: Finding the Matching Pairs
In this step, we will use NearestNeighbors, also from Scikit-Learn, to find the matching pairs for our observations. The idea is simple.
- We have two groups with their propensity to be part of the treatment group, considering all the confounding variables.
- So we find the one observation from the control dataset that matches the most with each one from the treatment dataset.
- We use pscore and age for this match. It could be the propensity score alone, but after looking at the matched pairs, I saw that adding age gave a better match.
# 3. Use Nearest Neighbors to find matches
# We use a 'caliper', a threshold to ensure matches aren't too far apart
caliper = 0.05
nn = NearestNeighbors(n_neighbors=1, radius=caliper)
nn.fit(control[['pscore', 'age']])

# Find the matching pairs (kneighbors ignores radius; the caliper is applied manually below)
distances, indices = nn.kneighbors(treated[['pscore', 'age']])
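A caveat worth knowing about this match: pscore lives in [0, 1] while age spans decades, so the Euclidean distance (and the 0.05 caliper) is dominated by age, and kept pairs effectively require identical ages. If you prefer both features to contribute comparably, one option is to standardize them first. Here is a self-contained sketch of that variant; the synthetic control/treated frames and the scaling choice are my own assumptions, not the article’s code:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Small synthetic stand-ins for the article's treated/control frames
rng = np.random.default_rng(0)
control = pd.DataFrame({"pscore": rng.uniform(0, 1, 200),
                        "age": rng.integers(18, 70, 200)})
treated = pd.DataFrame({"pscore": rng.uniform(0, 1, 100),
                        "age": rng.integers(18, 70, 100)})

# Standardize so pscore and age contribute comparably to the distance.
# Note: after scaling, any caliper is in standardized units and needs re-tuning.
scaler = StandardScaler().fit(control[["pscore", "age"]])
nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(control[["pscore", "age"]]))
distances, indices = nn.kneighbors(scaler.transform(treated[["pscore", "age"]]))

print(distances.shape, indices.shape)  # one nearest control per treated row
```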
Now that we have the pairs, we can calibrate the model to discard those that are not too close to each other.
Step 3: Calibrating the Model
This code snippet filters distances and indices based on the caliper to identify valid matches, then extracts the original Pandas indices for the successfully matched control and treated observations. Any pair whose distance exceeds the threshold is discarded.
Then we just concatenate both datasets with the remaining observations that passed the quality control.
# 4. Filter out matches that are outside our 'caliper' (quality control)
matched_control_idx = [control.index[i[0]] for d, i in zip(distances, indices) if d[0] <= caliper]
matched_treated_idx = [treated.index[i] for i, d in enumerate(distances) if d[0] <= caliper]

# Combine the matched pairs into a new balanced dataframe
matched_df = pd.concat([df.loc[matched_treated_idx], df.loc[matched_control_idx]])
Ok. We have a dataset with matched pairs of customers who saw the advertising and did not see it. And the best thing is that we are now able to compare similar groups and isolate the effect of the advertising campaign.
print(matched_df.saw_ad.value_counts())
saw_ad
1 532
0 532
Name: count, dtype: int64
Let’s see if our model gave good matches.
Step 4: Evaluation
To evaluate a PSM model, the best metrics are:
- Standardized Mean Difference (SMD)
- The standard deviation of the Propensity Score
- A visual check of the overlap between the two distributions
Let’s begin by checking the propensity score statistics.
# Check the standard deviation (spread around the mean) of the Propensity Score
matched_df[['pscore']].describe().T
Propensity Score stats. Image by the author.
These statistics suggest that our propensity score matching process has created a dataset where the treated and control groups have very similar propensity scores. The small standard deviation and the concentrated interquartile range (25%-75%) indicate good overlap and balance of propensity scores. This is a positive sign that our matching was effective in bringing the distributions of covariates closer together between the treated and control groups.
Moving on: to compare the means of the other covariates, like age and is_mobile, after Propensity Score Matching, we can look at the Standardized Mean Differences (SMD). A small SMD (typically below 0.1, or 0.05 for a stricter bar) indicates that the means of the covariate are well balanced between the treated and control groups, suggesting successful matching.
We will calculate the SMD with a custom function that takes the means and standard deviations of a given covariate and computes the metric.
def calculate_smd(df, covariate, treatment_col):
    treated_group = df[df[treatment_col] == 1][covariate]
    control_group = df[df[treatment_col] == 0][covariate]

    mean_treated = treated_group.mean()
    mean_control = control_group.mean()
    std_treated = treated_group.std()
    std_control = control_group.std()

    # Pooled standard deviation
    pooled_std = np.sqrt((std_treated**2 + std_control**2) / 2)

    if pooled_std == 0:
        return 0  # Avoid division by zero if there's no variance
    else:
        return (mean_treated - mean_control) / pooled_std
# Calculate SMD for each covariate
smd_results = {}
for cov in covariates:
    smd_results[cov] = calculate_smd(matched_df, cov, treatment_col)

smd_df = pd.DataFrame.from_dict(smd_results, orient='index', columns=['SMD'])

# Interpretation of SMD values
for index, row in smd_df.iterrows():
    smd_value = row['SMD']
    interpretation = "well-balanced (excellent)" if abs(smd_value) < 0.05 else \
                     "reasonably balanced (good)" if abs(smd_value) < 0.1 else \
                     "moderately balanced" if abs(smd_value) < 0.2 else \
                     "poorly balanced"
    print(f"The covariate '{index}' has an SMD of {smd_value:.4f}, indicating it is {interpretation}.")
SMD
age 0.000000
past_spend 0.049338
is_mobile 0.000000
The covariate 'age' has an SMD of 0.0000, indicating it is well-balanced (excellent).
The covariate 'past_spend' has an SMD of 0.0493, indicating it is well-balanced (excellent).
The covariate 'is_mobile' has an SMD of 0.0000, indicating it is well-balanced (excellent).
SMD < 0.05 or 0.1: This is often considered well-balanced or excellent balance. Most researchers aim for an SMD less than 0.1, and ideally less than 0.05.
We can see that our variables pass this test!
Finally, let’s check the distributions overlay between Control and Treatment.
# Control and Treatment Distribution Overlays
plt.figure(figsize=(10, 6))
sns.histplot(data=matched_df, x='past_spend', hue='saw_ad', kde=True, alpha=.4)
plt.title('Distribution of Past Spend for Treated vs. Control Groups')
plt.xlabel('Past Spend')
plt.ylabel('Density / Count')
plt.legend(title='Saw Ad', labels=['Control (0)', 'Treated (1)'])
plt.show()
Distributions overlay: They should be one over the other and similar in shape. Image by the author.
It looks nice. The distributions are entirely overlapping and have a fairly similar shape.
This is a sample of the matched pairs. You can find the code to build this on GitHub.
Sample of the matched pairs dataset. Image by the author.
With that said, I believe we can conclude that this model is working properly, and we can move on to check the results.
Results
Ok, since we have matching groups and distributions, let’s move on to the results. We will check the following:
- Difference of Means between the two groups
- T-Test to check for statistical difference
- Cohen’s D to calculate the effect size.
Here are the statistics of the matched dataset.
Stats on the final dataset. Image by the author.
After Propensity Score Matching, the estimated causal effect of seeing the ad (saw_ad) on past_spend can be inferred from the difference in means between the matched treated and control groups.
# Difference of averages
avg_past_spend_treated = matched_df[matched_df['saw_ad'] == 1]['past_spend'].mean()
avg_past_spend_control = matched_df[matched_df['saw_ad'] == 0]['past_spend'].mean()
past_spend_difference = avg_past_spend_treated - avg_past_spend_control

print(f"Average past_spend (Treated): {avg_past_spend_treated:.2f}")
print(f"Average past_spend (Control): {avg_past_spend_control:.2f}")
print(f"Difference in average past_spend: {past_spend_difference:.2f}")
- Average past_spend (Treated Group): 541.97
- Average past_spend (Control Group): 528.14
- Difference in Average past_spend (Treated – Control): 13.82
This indicates that, on average, users who saw the ad (treated) spent approximately 13.82 more than users who did not see the ad (control), after accounting for the observed covariates.
Let’s check if the difference is statistically significant.
# T-Test
treated_spend = matched_df[matched_df['saw_ad'] == 1]['past_spend']
control_spend = matched_df[matched_df['saw_ad'] == 0]['past_spend']

t_stat, p_value = stats.ttest_ind(treated_spend, control_spend, equal_var=False)

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")

if p_value < 0.05:
    print("The difference in past_spend between treated and control groups is statistically significant (p < 0.05).")
else:
    print("The difference in past_spend between treated and control groups is NOT statistically significant (p >= 0.05).")
T-statistic: 0.805
P-value: 0.421
The difference in past_spend between treated and control groups
is NOT statistically significant (p >= 0.05).
The difference is not statistically significant: the standard deviation within the groups is still very high (~280), so the 13.82 gap is small relative to the noise.
Let us also run a calculation of the effect size using Cohen’s D.
# Cohen's D effect size
def cohens_d(df, outcome_col, treatment_col):
    treated_group = df[df[treatment_col] == 1][outcome_col]
    control_group = df[df[treatment_col] == 0][outcome_col]

    mean1, std1 = treated_group.mean(), treated_group.std()
    mean2, std2 = control_group.mean(), control_group.std()
    n1, n2 = len(treated_group), len(control_group)

    # Pooled standard deviation
    s_pooled = np.sqrt(((n1 - 1) * std1**2 + (n2 - 1) * std2**2) / (n1 + n2 - 2))

    if s_pooled == 0:
        return 0  # Avoid division by zero
    else:
        return (mean1 - mean2) / s_pooled

# Calculate Cohen's d for 'past_spend'
d_value = cohens_d(matched_df, 'past_spend', 'saw_ad')
print(f"Cohen's d for past_spend: {d_value:.3f}")

# Interpret Cohen's d
if abs(d_value) < 0.2:
    interpretation = "negligible effect"
elif abs(d_value) < 0.5:
    interpretation = "small effect"
elif abs(d_value) < 0.8:
    interpretation = "medium effect"
else:
    interpretation = "large effect"

print(f"This indicates a {interpretation}.")
Cohen’s d for past_spend: 0.049
This indicates a negligible effect.
The difference is small, suggesting a negligible average treatment effect on past_spend in this matched sample.
With that, we conclude this article.
Before You Go
Causal inference is the area of Data Science that tells us why something happens, rather than just how likely it is to happen.
Many times, you will face the challenge of understanding why something works (or doesn't) in a business. Companies love that kind of answer, even more so when it saves money or increases sales.
Just remember the basic steps to create your model.
- Run a Logistic Regression to calculate propensity scores
- Split the data into Control and Treatment
- Run Nearest Neighbors to find the closest match between Control and Treatment observations, so you can isolate the real effect.
- Evaluate your model using SMD
- Calculate your results.
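For quick reuse, the recipe above can be condensed into one function. This is a sketch under simplifying assumptions of my own: it matches on the propensity score alone (the article also uses age), applies the caliper to the raw 1-NN distance, and is demonstrated on a small synthetic dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_match(df, covariates, treatment_col, caliper=0.05):
    """Return a matched dataframe via 1-NN propensity score matching."""
    # 1. Propensity scores from a logistic regression
    df = df.copy()
    ps = LogisticRegression().fit(df[covariates], df[treatment_col])
    df["pscore"] = ps.predict_proba(df[covariates])[:, 1]
    # 2. Split into treatment and control
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]
    # 3. Nearest neighbor on pscore only (a simplification vs. the article)
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    dist, idx = nn.kneighbors(treated[["pscore"]])
    # 4. Caliper filter, then stack the surviving pairs
    keep = dist[:, 0] <= caliper
    t_idx = treated.index[keep]
    c_idx = control.index[idx[keep, 0]]
    return pd.concat([df.loc[t_idx], df.loc[c_idx]])

# Tiny synthetic check (data is illustrative only)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"age": rng.integers(18, 70, 300),
                     "x": rng.normal(size=300)})
demo["t"] = rng.binomial(1, 0.5, 300)
matched = psm_match(demo, ["age", "x"], "t")
print(matched["t"].value_counts())  # equal counts of treated and control rows
```

Note that this sketch matches with replacement (a control row can serve as the twin for several treated rows), which is one of several standard PSM variants.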
If you liked this content, you can find out more about me on my website.
https://gustavorsantos.me
GitHub Repository
https://github.com/gurezende/Propensity-Score-Matching

