Introduction
Recently, I started thinking about the parallels between point-anomaly detection and trend detection. When it comes to points, it’s generally intuitive, and the z-score solves most problems. What took me a while to figure out was applying some kind of statistical test to trends: singular points become whole distributions, and the standard deviation that made a lot of sense when I was looking at one point started to feel plain wrong. This is what I uncovered.
For easier understanding, I’ve peppered this post with some simulations I set up and some charts I created as a result.
Z-Scores: When they stop working
Most people reach for the z-score the moment they want to spot something weird. It’s dead simple:
$$ z = \frac{x - \mu}{\sigma} $$
\(x\) is your new observation, \( \mu \) is what “normal” usually looks like, \( \sigma \) is how much things normally wiggle. The number you get tells you: “this point is this many standard deviations away from the pack.”
A z of 3? That’s roughly the “holy crap” line — under a normal distribution, you only see something that far out about 0.27% of the time (two-tailed). Feels clean. Feels honest.
Why it magically becomes standard normal (quick derivation)
Start with any normal variable \(X \sim N(\mu, \sigma^2)\).
- Subtract the mean → \(x - \mu\). Now the center is zero.
- Divide by the standard deviation → \( (x - \mu) / \sigma \). Now the spread (variance) is exactly 1.
Do both and you get:
$$ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) $$
That’s it. Any normal variable, no matter its original mean or scale, gets squashed and stretched into the same boring bell curve we all memorized. That’s why z-scores feel universal — they let you use the same lookup tables everywhere.
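A quick numerical sanity check of this (the \( \mu \) and \( \sigma \) here are my own made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary normal: mu = 37, sigma = 4 (made-up values)
x = rng.normal(37, 4, 100_000)

# Standardize: subtract the mean, divide by the standard deviation
z = (x - 37) / 4

# The result is (approximately) standard normal
print(f"mean = {z.mean():.3f}, std = {z.std():.3f}")
```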
The catch
In the real world we almost never know the true \( \mu \) and \( \sigma \). We estimate them from recent data — say the last 7 points.
Here’s the dangerous bit: do you include the current point in that window or not?
If you do, a huge outlier inflates your \( \sigma \) on the spot. Your z-score shrinks. The anomaly hides itself. You end up thinking “eh, not that weird after all.”
If you exclude it (shift by 1, use only the previous window), you get a fair fight: “how strange is this new point compared to what was normal before it arrived?”
Most solid implementations do the latter. Include the point and you’re basically smoothing, not detecting.
The snippet below shows the difference.
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Set dpi to 250 for high-resolution plots
plt.rcParams['figure.dpi'] = 250

# Generate 30-point series: base level 10, slight upward trend in last 10 points, noise, one big outlier
n = 30
t = np.arange(n)
base = 10 + 0.1 * t[-10:]  # small trend only in last part
data = np.full(n, 10.0)
data[:20] = 10 + np.random.normal(0, 1.5, 20)
data[20:] = base + np.random.normal(0, 1.5, 10)
data[15] += 8  # big outlier at index 15
df = pd.DataFrame({'value': data}, index=t)

# Rolling window size
window = 7

# Version 1: EXCLUDE current point (recommended for detection)
df['roll_mean_ex'] = df['value'].shift(1).rolling(window).mean()
df['roll_std_ex'] = df['value'].shift(1).rolling(window).std()
df['z_ex'] = (df['value'] - df['roll_mean_ex']) / df['roll_std_ex']

# Version 2: INCLUDE current point (self-dampening)
df['roll_mean_inc'] = df['value'].rolling(window).mean()
df['roll_std_inc'] = df['value'].rolling(window).std()
df['z_inc'] = (df['value'] - df['roll_mean_inc']) / df['roll_std_inc']

# Three stacked subplots: series + means, rolling stds, z-score comparison
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 12), sharex=True)

# Top plot: original + means
ax1.plot(df.index, df['value'], 'o-', label='Observed', color='black', alpha=0.7)
ax1.plot(df.index, df['roll_mean_ex'], label='Rolling mean (exclude current)', color='blue')
ax1.plot(df.index, df['roll_mean_inc'], '--', label='Rolling mean (include current)', color='red')
ax1.set_title('Time Series + Rolling Means (window=7)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Middle plot: rolling stds
ax2.plot(df.index, df['roll_std_ex'], label='Rolling std (exclude current)', color='blue')
ax2.plot(df.index, df['roll_std_inc'], '--', label='Rolling std (include current)', color='red')
ax2.set_title('Rolling Standard Deviations')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Bottom plot: Z-scores comparison
ax3.plot(df.index, df['z_ex'], 'o-', label='Z-score (exclude current)', color='blue')
ax3.plot(df.index, df['z_inc'], 'x--', label='Z-score (include current)', color='red')
ax3.axhline(3, color='gray', linestyle=':', alpha=0.6)
ax3.axhline(-3, color='gray', linestyle=':', alpha=0.6)
ax3.set_title('Z-Scores: Exclude vs Include Current Point')
ax3.set_xlabel('Time')
ax3.set_ylabel('Z-score')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
The difference between including vs excluding the current (evaluated) point.
P-values
You compute z, then ask: under the null (“this came from the same distribution as my window”), what’s the chance I’d see something this extreme?
Two-tailed p-value = 2 × (1 − cdf(|z|)) in the standard normal.
z = 3 → p ≈ 0.0027 → “probably not random noise.”
z = 1.5 → p ≈ 0.1336 → “eh, could happen.”
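Those numbers fall straight out of the standard normal CDF; a small helper with scipy (the function name is mine):

```python
from scipy import stats

def two_tailed_p(z: float) -> float:
    """Two-tailed p-value under the standard normal: 2 * (1 - cdf(|z|))."""
    return 2 * stats.norm.sf(abs(z))  # sf(x) = 1 - cdf(x), numerically stabler

print(f"{two_tailed_p(3):.4f}")    # 0.0027
print(f"{two_tailed_p(1.5):.4f}")  # 0.1336
```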
Simple. Until the assumptions start falling apart.
Assumptions
The z-score (and its p-value) assumes two things:
- The window data is roughly normal (or at least the tails behave).
- Your estimated \( \sigma \) is close enough to the true population value.
A skewed window, for example, violates #1. An observation nominally “within 3\(\sigma\)” might then cover only, say, 85% of the distribution, rather than the expected 99.7%.
Similarly, with a small enough window, the estimated \( \sigma \) is noisy, causing z-scores to swing more than they should.
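To get a feel for how noisy a 7-point \( \sigma \) estimate really is, here is a small simulation (the parameters are my own choice, matching the earlier examples):

```python
import numpy as np

rng = np.random.default_rng(42)
true_sigma = 1.5

# 10,000 independent 7-point windows from the same N(0, 1.5^2) process,
# each yielding its own sigma estimate
estimates = rng.normal(0, true_sigma, size=(10_000, 7)).std(axis=1, ddof=1)

print(f"true sigma: {true_sigma}")
print(f"5th / 95th percentile of estimates: "
      f"{np.percentile(estimates, 5):.2f} / {np.percentile(estimates, 95):.2f}")
```

The same underlying noise level yields estimates ranging from roughly half to one and a half times the truth, and the z-score denominator swings with them.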
Hypothesis Testing Basics: Rejecting the Null, Not Proving the Alternative
Hypothesis testing provides the formal framework for deciding whether observed data support a claim of interest. The structure is consistent across tools like the z-score and t-statistic.
The process begins with two competing hypotheses:
- The null hypothesis (H₀) represents the default assumption: no effect, no difference, or no trend. In anomaly detection, H₀ states that the observation belongs to the same distribution as the baseline data. In trend analysis, H₀ typically states that the slope is zero.
- The alternative hypothesis (H₁) represents the claim under investigation: there is an effect, a difference, or a trend.
The test statistic (z-score or t-statistic) quantifies how far the data deviate from what would be expected under H₀.
The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H₀ is true. A small p-value indicates that such an extreme result is unlikely under the null.
The decision rule is straightforward:
- If the p-value is below a pre-specified significance level (commonly 0.05), reject H₀.
- If the p-value exceeds the threshold, fail to reject H₀.
A key point is that failing to reject H₀ does not prove H₀ is true. It only indicates that the data do not provide sufficient evidence against it. Absence of evidence is not evidence of absence.
The two-tailed test is standard for anomaly detection and many trend tests because deviations can occur in either direction. The p-value is therefore calculated as twice the one-tailed probability.
For the z-score, the test relies on the standard normal distribution under the null. For small samples or when the variance is estimated from the data, the t-distribution is used instead, as discussed in later sections.
This framework applies uniformly: the test statistic measures deviation from the null, the distribution provides the reference for how unusual that deviation is, and the p-value translates that unusualness into a decision rule.
The assumptions underlying the distribution (normality of errors, independence) must hold for the p-value to be interpreted correctly. When those assumptions are violated, the reported probabilities lose reliability, which becomes a central concern when extending the approach beyond point anomalies.
The Signal-to-Noise Principle: Connecting Z-Scores and t-Statistics
The z-score and the t-statistic are both instances of the ratio
$$ \frac{\text{signal}}{\text{noise}}. $$
The signal is the deviation from the null value: \(x - \mu\) for point anomalies and \(\hat{\beta}_1 - 0\) for the slope in linear regression.
The noise term is the measure of variability under the null hypothesis. For the z-score, noise is \(\sigma\) (standard deviation of the baseline observations). For the t-statistic, noise is the standard error \(\text{SE}(\hat{\beta}_1)\).
Standard Error vs Standard Deviation
The standard deviation measures the spread of individual observations around their mean. For a sample, it is the square root of the sample variance, typically denoted s:
$$ s = \sqrt{ \frac{1}{n-1} \sum (x_i - \bar{x})^2 }. $$
The standard error quantifies the variability of a summary statistic (such as the sample mean or a regression coefficient) across repeated samples from the same population. It is always smaller than the standard deviation because averaging or estimating reduces variability.
For the sample mean, the standard error is
$$ \text{SE}(\bar{x}) = \frac{s}{\sqrt{n}}, $$
where s is the sample standard deviation, and n is the sample size. The division by \(\sqrt{n}\) reflects the fact that the mean of n independent observations has variance equal to the population variance divided by n.
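The \(\sqrt{n}\) shrinkage is easy to verify by simulation; a sketch with arbitrary parameters of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 25, 10.0

# Many samples of size n: compare the spread of individual observations
# with the spread of the sample means
samples = rng.normal(0, sigma, size=(20_000, n))
spread_of_points = samples.std()              # ~ sigma
spread_of_means = samples.mean(axis=1).std()  # ~ sigma / sqrt(n)

print(f"std of observations: {spread_of_points:.2f}")  # ~10
print(f"std of sample means: {spread_of_means:.2f}")   # ~10 / sqrt(25) = 2
```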
In regression, the standard error of the slope \(\text{SE}(\hat{\beta}_1)\) depends on the residual variance s², the spread of the predictor variable, and the sample size, as shown in the previous section. Unlike the standard deviation of the response variable, which includes both signal and noise, the standard error isolates the uncertainty in the parameter estimate itself.
The distinction is essential: standard deviation describes the dispersion of the raw data, while standard error describes the precision of an estimated quantity. Using the standard deviation in place of the standard error for a derived statistic (such as a slope) mixes signal into the noise, leading to incorrect inference.
The ratio quantifies the observed effect relative to the variability expected if the null hypothesis were true. A large value indicates that the effect is unlikely under random variation alone.
In point anomaly detection, \(\sigma\) is the standard deviation of the individual observations around \(\mu\). In trend detection, the quantity of interest is \(\hat{\beta}_1\) from the model \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\). The standard error is
$$ \text{SE}(\hat{\beta}_1) = \sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }, $$
where \(s^2\) is the residual mean squared error after fitting the line.
Using the raw standard deviation of \(y_i\) as the denominator would yield
$$ \frac{\hat{\beta}_1}{\sqrt{\text{Var}(y)}} $$
and include both the systematic trend and the random fluctuations in the denominator, which inflates the noise term and underestimates the strength of the trend.
The t-statistic uses
$$ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} $$
and follows the t-distribution with \(n-2\) degrees of freedom because \(s^2\) is estimated from the residuals. This estimation of variance introduces additional uncertainty, which is reflected in the wider tails of the t-distribution compared with the standard normal.
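The heavier tails are easy to quantify: here is the two-tailed probability of |stat| ≥ 3 under the standard normal versus t-distributions with a few (arbitrarily chosen) degrees of freedom:

```python
from scipy import stats

# Two-tailed probability of |stat| >= 3 under each reference distribution
p_normal = 2 * stats.norm.sf(3)
print(f"normal:    {p_normal:.4f}")  # 0.0027

for df in (5, 10, 30, 100):
    p_t = 2 * stats.t.sf(3, df)
    # Heavier tails: the same cutoff is far less "extreme" for small df
    print(f"t, df={df:>3}: {p_t:.4f}")
```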
The same signal-to-noise structure appears in most test statistics. The F-statistic compares explained variance to residual variance:
$$ F = \frac{\text{explained MS}}{\text{residual MS}}. $$
The chi-square statistic compares observed to expected frequencies, scaled by expected values:
$$ \chi^2 = \sum \frac{(O_i – E_i)^2}{E_i}. $$
In each case, the statistic is a ratio of observed deviation to expected variation under the null. The z-score and t-statistic are specific realisations of this principle adapted to tests about means or regression coefficients.
When Z-Scores Break: The Trend Problem
The z-score performs reliably when applied to individual observations against a stable baseline. Extending it to trend detection, however, introduces fundamental issues that undermine its validity.
Consider a time series where the goal is to test whether a linear trend exists. One might compute the ordinary least squares slope \(\hat{\beta}_1\) and attempt to standardise it using the z-score framework by dividing by the standard deviation of the response variable:
$$ z = \frac{\hat{\beta}_1}{\sqrt{\text{Var}(y)}}. $$
This approach is incorrect. The standard deviation \(\sqrt{\text{Var}(y)}\) measures the total spread of the response variable, which includes both the systematic trend (the signal) and the random fluctuations (the noise). When a trend is present, the variance of y is inflated by the trend itself. Placing this inflated variance in the denominator reduces the magnitude of the test statistic, leading to underestimation of the trend’s significance.
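A minimal sketch of this inflation, with made-up parameters: the raw std of a trending series is blown up by the trend itself, while the residual std after detrending recovers the actual noise level.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.arange(n)
noise_sigma = 1.5

# Same noise level, with and without a trend
flat = rng.normal(0, noise_sigma, n)
trending = 0.1 * x + rng.normal(0, noise_sigma, n)

# Residual std after removing the fitted line (the "clean" noise estimate)
slope, intercept = np.polyfit(x, trending, 1)
residual_std = (trending - (intercept + slope * x)).std(ddof=2)

print(f"std of flat series:       {flat.std(ddof=1):.2f}")      # ~1.5
print(f"std of trending series:   {trending.std(ddof=1):.2f}")  # inflated by the trend
print(f"residual std (detrended): {residual_std:.2f}")          # back to ~1.5
```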
A common alternative is to use the standard deviation estimated from data before the suspected trend begins, for example from observations prior to some time t = 10. This appears logical but fails as well, for a different reason: the process may not be stationary.
A short refresher on stationarity
Stationarity in a time series means that the statistical properties of the process (mean, variance, and autocovariance structure) remain constant over time.
A stationary series has no systematic change in level (no trend), no change in spread (constant variance), and no dependence of the relationship between observations on the specific time point, making it predictable and suitable for standard statistical modeling.
If the core properties of our distribution (which here is our window) change, the pre-trend \(\sigma\) is no longer representative of the variability during the trend period. The test statistic then reflects an irrelevant noise level, producing either false positives or false negatives depending on how the variance has evolved.
The core problem is that the quantity being tested—the slope—is a derived summary statistic computed from the same data used to estimate the noise. Unlike point anomalies, where the test observation is independent of the baseline window, the trend parameter is entangled with the data. Any attempt to use the raw variance of y mixes signal into the noise estimate, violating the requirement that the denominator should represent variability under the null hypothesis of no trend.
This contamination is not a minor technical detail. It systematically biases the test toward conservatism when a trend exists, because the denominator grows with the strength of the trend. The result is that genuine trends are harder to detect, and the reported p-values are larger than they should be.
These limitations explain why the z-score, despite its simplicity and intuitive appeal, cannot be directly applied to trend detection without modification. The t-statistic addresses precisely this issue by constructing a noise measure that excludes the fitted trend, as explained in the next section.
A quick simulation to compare the results of the t-statistic with the “wrong”/naive z-score result:
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# ────────────────────────────────────────────────
# Data generation (same as before)
np.random.seed(42)
n = 30
t = np.arange(n)
data = np.full(n, 10.0)
data[:20] = 10 + np.random.normal(0, 1.5, 20)
data[20:] = 10 + 0.1 * t[20:] + np.random.normal(0, 1.5, 10)
data[15] += 8  # outlier at index 15
df = pd.DataFrame({'time': t, 'value': data})

# ────────────────────────────────────────────────
# Fit regression on last 10 points only (indices 20 to 29)
last10 = df.iloc[20:].copy()
slope, intercept, r_value, p_value, std_err = stats.linregress(
    last10['time'], last10['value']
)
last10['fitted'] = intercept + slope * last10['time']
t_stat = slope / std_err

# Naive "z-statistic" — using std(y) / sqrt(n) as denominator (wrong for trend)
z_std_err = np.std(last10['value']) / np.sqrt(len(last10))
z_stat = slope / z_std_err

# Print comparison
print("Correct t-statistic (using proper SE of slope):")
print(f"  Slope: {slope:.4f}")
print(f"  SE of slope: {std_err:.4f}")
print(f"  t-stat: {t_stat:.4f}")
print(f"  p-value (t-dist): {p_value:.6f}\n")
print("Naive 'z-statistic' (using std(y)/sqrt(n) — incorrect):")
print(f"  Slope: {slope:.4f}")
print(f"  Wrong SE: {z_std_err:.4f}")
print(f"  z-stat: {z_stat:.4f}")

# ────────────────────────────────────────────────
# Plot with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), sharex=True)

# Top: correct t-statistic plot
ax1.plot(df['time'], df['value'], 'o-', color='black', alpha=0.7, linewidth=1.5,
         label='Full time series')
ax1.plot(last10['time'], last10['fitted'], color='red', linewidth=2.5,
         label=f'Linear fit (last 10 pts): slope = {slope:.3f}')
ax1.axvspan(20, 29, color='red', alpha=0.08, label='Fitted window')
ax1.text(22, 11.5, f'Correct t-statistic = {t_stat:.3f}\np-value = {p_value:.4f}',
         fontsize=12, bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))
ax1.set_title('Correct t-Test: Linear Fit on Last 10 Points')
ax1.set_ylabel('Value')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# Bottom: naive z-statistic plot (showing the mistake)
ax2.plot(df['time'], df['value'], 'o-', color='black', alpha=0.7, linewidth=1.5,
         label='Full time series')
ax2.plot(last10['time'], last10['fitted'], color='red', linewidth=2.5,
         label=f'Linear fit (last 10 pts): slope = {slope:.3f}')
ax2.axvspan(20, 29, color='red', alpha=0.08, label='Fitted window')
ax2.text(22, 11.5, f'Naive z-statistic = {z_stat:.3f}\n(uses std(y)/√n — wrong denominator)',
         fontsize=12, bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))
ax2.set_title('Naive "Z-Test": Using std(y)/√n Instead of SE of Slope')
ax2.set_xlabel('Time')
ax2.set_ylabel('Value')
ax2.legend(loc='upper left')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Correct t-statistic (using proper SE of slope):
Slope: 0.2439
SE of slope: 0.1412
t-stat: 1.7276
p-value (t-dist): 0.114756
Naive 'z-statistic' (using std(y)/sqrt(n) — incorrect):
Slope: 0.2439
Wrong SE: 0.5070
z-stat: 0.4811
Comparing the t-test for trend detection vs the Naive z-test
Enter the t-Statistic: Designed for Estimated Noise
The t-statistic addresses the limitations of the z-score by explicitly accounting for uncertainty in the variance estimate. It is the appropriate tool when testing a parameter, such as a regression slope, where the noise level must be estimated from the same data used to compute the parameter.
Consider the linear regression model
$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, $$
where the errors \(\epsilon_i\) are assumed to be independent and normally distributed with mean 0 and constant variance \(\sigma^2\).
The ordinary least squares estimator of the slope is
$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. $$
Under the null hypothesis H₀: \(\beta_1 = 0\), the expected value of \(\hat{\beta}_1\) is zero.
The standard error of \(\hat{\beta}_1\) is
$$ \text{SE}(\hat{\beta}_1) = \sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }, $$
where \(s^2\) is the unbiased estimate of \(\sigma^2\), computed as the residual mean squared error:
$$ s^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2. $$
The t-statistic is then
$$ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }}. $$
Under the null hypothesis and the model assumptions, this statistic follows a t-distribution with n−2 degrees of freedom.
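As a sanity check, these formulas can be computed by hand and compared against `scipy.stats.linregress` (the data here are arbitrary simulated values of my own):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.arange(20, dtype=float)
y = 0.3 * x + rng.normal(0, 1.5, 20)

# Manual OLS slope and its standard error
x_c = x - x.mean()
beta1 = (x_c * (y - y.mean())).sum() / (x_c ** 2).sum()
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)
s2 = (resid ** 2).sum() / (len(x) - 2)  # residual mean squared error
se = np.sqrt(s2 / (x_c ** 2).sum())
t_manual = beta1 / se

# Compare with scipy's built-in fit
res = stats.linregress(x, y)
print(np.isclose(t_manual, res.slope / res.stderr))  # True
```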
A quick refresher on degrees of freedom
Degrees of freedom represent the number of independent values that remain available to estimate a parameter after certain constraints have been imposed by the data or the model.
In the simplest case, when estimating the variance of a sample, one degree of freedom is lost because the sample mean must be calculated first. The deviations from this mean are constrained to sum to zero, so only n−1 values can vary freely. Dividing the sum of squared deviations by n−1 (rather than n) corrects for this loss and provides an unbiased estimate of the population variance:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. $$
This adjustment, known as Bessel’s correction, ensures that the sample variance does not systematically underestimate the population variance. The same principle applies in regression: fitting a line with an intercept and slope uses two degrees of freedom, leaving n−2 for estimating the residual variance.
In general, degrees of freedom equal the sample size minus the number of parameters estimated from the data. The t-distribution uses these degrees of freedom to adjust its shape: fewer degrees of freedom produce heavier tails (greater uncertainty), while larger values cause the distribution to approach the standard normal.
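Bessel's correction shows up clearly in simulation: averaging many small-sample variance estimates, the divide-by-n version systematically undershoots while the divide-by-(n−1) version does not (toy parameters are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0  # sigma = 2

# Many tiny samples of n = 5
samples = rng.normal(0, 2.0, size=(50_000, 5))
biased = samples.var(axis=1, ddof=0).mean()    # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divide by n-1 (Bessel)

print(f"true variance: {true_var}")
print(f"mean of ddof=0 estimates: {biased:.2f}")    # ~3.2 (= 4 * (n-1)/n)
print(f"mean of ddof=1 estimates: {unbiased:.2f}")  # ~4.0
```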
The key distinction from the z-score is the use of \(s^2\) rather than a fixed \(\sigma^2\). Because the variance is estimated from the residuals, the denominator incorporates sampling uncertainty in the variance estimate. This uncertainty widens the distribution of the test statistic, which is why the t-distribution has heavier tails than the standard normal for small degrees of freedom.
As the sample size increases, the estimate \(s^2\) becomes more precise, the t-distribution converges to the standard normal, and the distinction between t and z diminishes.
The t-statistic therefore provides a more accurate assessment of significance when the noise level is unknown and must be estimated from the data. By basing the noise measure on the residuals after removing the fitted trend, it avoids mixing the signal into the noise denominator, which is the central flaw in naive applications of the z-score to trends.
Here’s a simulation showing how sampling under different scenarios produces different p-value distributions:
- Sampling from the null distribution leads to a uniform p-value distribution: when the null is true, you’re essentially equally likely to get any p-value.
- Say you add a little shift and bump your mean by 4: the test is now essentially confident the samples come from a different distribution, so the p-value distribution skews left (toward 0).
- Interestingly, unless your test is extremely conservative (that is, unlikely to reject the null hypothesis), it’s hard to get a skew toward 1. The third set of plots shows my unsuccessful attempt: I repeatedly sample from an extremely tight distribution around the mean of the null distribution, hoping that would maximize the p-values.
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from tqdm import trange

n_simulations = 10_000
n_samples = 30
baseline_mu = 50
sigma = 10
df = n_samples - 1

def run_sim(true_mu, sigma_val):
    t_stats, p_vals = [], []
    for _ in trange(n_simulations):
        # Generate a sample and test it against the baseline mean
        sample = np.random.normal(true_mu, sigma_val, n_samples)
        t, p = stats.ttest_1samp(sample, baseline_mu)
        t_stats.append(t)
        p_vals.append(p)
    return np.array(t_stats), np.array(p_vals)

# 1. Null is True (Ideal)
t_null, p_null = run_sim(baseline_mu, sigma)

# 2. Effect Exists (Shifted)
t_effect, p_effect = run_sim(baseline_mu + 4, sigma)

# 3. Too Perfect (variance suppressed, mean forced to baseline)
# We use a tiny sigma so the sample mean is always basically the baseline.
# Even then, we still get a uniform p-value distribution.
t_perfect, p_perfect = run_sim(baseline_mu, 0.1)

# Plotting
fig, axes = plt.subplots(3, 2, figsize=(12, 13))
x = np.linspace(-5, 8, 200)
t_pdf = stats.t.pdf(x, df)

scenarios = [
    (t_null, p_null, "Null is True (Ideal)", "skyblue", "salmon"),
    (t_effect, p_effect, "Effect Exists (Shifted)", "lightgreen", "gold"),
    (t_perfect, p_perfect, "Too Perfect (Still Uniform)", "plum", "lightgrey")
]

for i, (t_data, p_data, title, t_col, p_col) in enumerate(scenarios):
    # t-statistic plots
    axes[i, 0].hist(t_data, bins=50, density=True, color=t_col, alpha=0.6, label="Simulated")
    axes[i, 0].plot(x, t_pdf, 'r--', lw=2, label="Theoretical T-dist")
    axes[i, 0].set_title(f"{title}: T-Statistics")
    axes[i, 0].legend()
    # p-value plots
    axes[i, 1].hist(p_data, bins=20, density=True, color=p_col, alpha=0.7, edgecolor='black')
    axes[i, 1].set_title(f"{title}: P-Values")
    axes[i, 1].set_xlim(0, 1)
    if i == 0:
        axes[i, 1].axhline(1, color='red', linestyle='--', label='Uniform Reference')
        axes[i, 1].legend()

plt.tight_layout()
plt.show()
Simulating p-values:
(a) Null-distribution sampling
(b) Mean-shift sampling
(c) Unsuccessful right-skew simulation attempt
Alternatives and Extensions: When t-Statistics Are Not Enough
The t-statistic provides a robust parametric approach for trend detection under normality assumptions. Several alternatives exist when those assumptions are untenable or when greater robustness is required.
The Mann-Kendall test is a non-parametric method that assesses monotonic trends without requiring normality. It counts the number of concordant and discordant pairs in the data: for every pair of observations (\(x_i\), \(x_j\)) with \(i < j\), it checks whether the trend is increasing (\(x_j > x_i\)), decreasing (\(x_j < x_i\)), or tied. The test statistic \(S\) is the difference between the number of increases and decreases:
$$ S = \sum_{i<j} \text{sgn}(x_j - x_i), $$
where sgn is the sign function (1 for positive, −1 for negative, 0 for ties). Under the null hypothesis of no trend, \(S\) is approximately normally distributed for large \(n\), allowing computation of a z-score and p-value. The test is rank-based and insensitive to outliers or non-normal distributions.
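For illustration, the S statistic and its normal approximation fit in a few lines; this is a minimal sketch of my own that ignores the tie correction used by full implementations:

```python
import numpy as np
from scipy import stats

def mann_kendall(x):
    """Minimal Mann-Kendall test: returns (S, z, two-tailed p).
    Assumes no ties; the usual variance formula gains a correction term with ties."""
    x = np.asarray(x)
    n = len(x)
    # S = sum of sgn(x_j - x_i) over all pairs i < j
    s = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z under the null of no trend
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    p = 2 * stats.norm.sf(abs(z))
    return s, z, p

rng = np.random.default_rng(0)
trend = 0.2 * np.arange(30) + rng.normal(0, 1.0, 30)
print(mann_kendall(trend))  # strongly positive S, small p
```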
Sen’s slope estimator complements the Mann-Kendall test by providing a measure of trend magnitude. It computes the median of all pairwise slopes:
$$ Q = \text{median} \left( \frac{x_j - x_i}{j - i} \right) \quad \text{for all } i < j. $$
This estimator is robust to outliers and does not assume linearity.
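A minimal sketch of the estimator (the helper name is mine); note how a single wild outlier barely moves it:

```python
import numpy as np

def sens_slope(y):
    """Sen's slope: median of all pairwise slopes over time indices."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(len(y)) for j in range(i + 1, len(y))]
    return np.median(slopes)

# A clean linear series vs the same series with one huge outlier
y = 2.0 * np.arange(20)
y_out = y.copy()
y_out[10] += 100
print(sens_slope(y), sens_slope(y_out))  # 2.0 2.0
```

Because most pairwise slopes never touch the outlier, the median is unchanged; an OLS slope on the same data would shift noticeably.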
The bootstrap method offers a flexible, distribution-free alternative. To test a trend, fit the linear model to the original data to obtain \(\hat{\beta}_1\). Then, resample the data with replacement many times (typically 1000–10,000 iterations), refit the model each time, and collect the distribution of bootstrap slopes. The p-value is the proportion of bootstrap slopes that are more extreme than zero (or the original estimate, depending on the null). Confidence intervals can be constructed from the percentiles of the bootstrap distribution. This approach makes no parametric assumptions about errors and works well for small or irregular samples.
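One common variant, resampling (x, y) pairs, could look like this sketch (the data and iteration count are my own toy choices); note that for autocorrelated time series, block or residual bootstraps are usually preferred:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 30
x = np.arange(n, dtype=float)
y = 0.1 * x + rng.normal(0, 1.5, n)

observed_slope = stats.linregress(x, y).slope

# Bootstrap: resample (x, y) pairs with replacement, refit each time
boot_slopes = np.empty(5000)
for b in range(5000):
    idx = rng.integers(0, n, n)
    boot_slopes[b] = stats.linregress(x[idx], y[idx]).slope

# Percentile confidence interval for the slope
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {observed_slope:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```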
Each alternative trades off different strengths. Mann-Kendall and Sen’s slope are computationally simple and robust but assume monotonicity rather than strict linearity. Bootstrap methods are highly flexible and can incorporate complex models, though they require more computation. The choice depends on the data characteristics and the specific question: parametric power when assumptions hold, non-parametric robustness when they do not.
In Conclusion
The z-score and t-statistic both measure deviation from the null hypothesis relative to expected variability, but they serve different purposes. The z-score assumes a known or stable variance and is well-suited to detecting individual point anomalies against a baseline. The t-statistic accounts for uncertainty in the variance estimate and is the correct choice when testing derived parameters, such as regression slopes, where the noise must be estimated from the same data.
The key difference lies in the noise term. Using the raw standard deviation of the response variable for a trend mixes signal into the noise, leading to biased inference. The t-statistic avoids this by basing the noise measure on residuals after removing the fitted trend, providing a cleaner separation of effect from variability.
When normality or independence assumptions do not hold, alternatives such as the Mann-Kendall test, Sen’s slope estimator, or bootstrap methods offer robust options without parametric requirements.
In practice, the choice of method depends on the question and the data. For point anomalies in stable processes, the z-score is efficient and sufficient. For trend detection, the t-statistic (or a robust alternative) is necessary to ensure reliable conclusions. Understanding the assumptions and the signal-to-noise distinction helps select the appropriate tool and interpret results with confidence.
Code
Colab
General Code Repository
References and Further Reading
- Hypothesis testing: solid university lecture notes covering hypothesis-testing basics, including types of errors and p-values. Purdue University Northwest: Chapter 5 Hypothesis Testing
- t-statistic: detailed lecture notes on t-tests for small samples, including comparisons to z-tests and p-value calculations. MIT OpenCourseWare: Single Sample Hypothesis Testing (t-tests)
- z-score: practical tutorial explaining z-scores in hypothesis testing, with examples and visualizations for mean comparisons. Towards Data Science: Hypothesis Testing with Z-Scores
- Trend significance scoring: step-by-step blog on performing the (non-parametric) Mann-Kendall trend test for detecting monotonic trends and assessing significance. It’s in R. GeeksforGeeks: How to Perform a Mann-Kendall Trend Test in R
- p-value: clear, beginner-friendly explanation of p-values, common misconceptions, and their role in hypothesis testing. Towards Data Science: P-value Explained
- t-statistic vs z-statistic: blog comparing t-test and z-test differences, when to use each, and practical applications. Statsig: T-test vs. Z-test
- Additional university notes on hypothesis testing: comprehensive course notes from Georgia Tech covering hypothesis testing, test statistics (z and t), and p-values. Georgia Tech: Hypothesis Testing Notes

