A Case for the T-statistic


Introduction

I began thinking about the parallels between point-anomaly detection and trend detection. For points, it’s generally intuitive, and the z-score solves most problems. What took me some time to figure out was how to apply some form of statistical test to trends — singular points are now whole distributions, and the standard deviation that made so much sense when I was comparing one point started to feel plain wrong. This is what I uncovered.

To make things easier to follow, I’ve peppered this post with some simulations I set up and a few charts I created as a result.

Z-Scores: When they stop working

Most people reach for the z-score the moment they need to spot something weird. It’s dead simple:

$$ z = \frac{x - \mu}{\sigma} $$

\(x\) is your recent observation, \(\mu\) is what “normal” usually looks like, \(\sigma\) is how much things normally wiggle. The number you get tells you: “this point is this many standard deviations away from the pack.”

A z of 3? That’s roughly the “holy crap” line — under a normal distribution, you only see something that far out about 0.27% of the time (two-tailed). Feels clean. Feels honest.

Why it magically becomes standard normal (quick derivation)

Start with any normal variable \(X \sim N(\mu, \sigma^2)\).

  1. Subtract the mean → \(x - \mu\). Now the center is zero.
  2. Divide by the standard deviation → \((x - \mu) / \sigma\). Now the spread (variance) is exactly 1.

Do both and you get:

$$ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) $$

That’s it. Any normal variable, regardless of its original mean or scale, gets squashed and stretched into the same boring bell curve we all memorized. That’s why z-scores feel universal — they let you use the same lookup tables everywhere.
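
As a quick numerical check of that derivation, here’s a minimal sketch of my own (not from the original notebook): draw from an arbitrary normal distribution, standardize it, and confirm the result has mean near 0 and standard deviation near 1.

Code
import numpy as np

rng = np.random.default_rng(0)

# Any normal variable: mean 50, standard deviation 10
x = rng.normal(50, 10, 100_000)

# Standardize: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

print(round(z.mean(), 4), round(z.std(), 4))  # ~0.0 and ~1.0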

The catch

In the real world we almost never know the true \(\mu\) and \(\sigma\). We estimate them from recent data — say the last 7 points.

Here’s the dangerous bit: do you include the current point in that window or not?

If you do, a big outlier inflates your \(\sigma\) on the spot. Your z-score shrinks. The anomaly hides itself. You end up thinking “eh, not that weird after all.”

If you exclude it (shift by 1, use only the previous window), you get a fair fight: “how strange is this new point compared with what was normal before it arrived?”

Most solid implementations do the latter. Include the point and you’re basically smoothing, not detecting.

This snippet should give you an example.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# set dpi to 250 for high-resolution plots
plt.rcParams['figure.dpi'] = 250

# Generate 30-point series: base level 10, slight upward trend in last 10 points, noise, one big outlier
n = 30
t = np.arange(n)
base = 10 + 0.1 * t[-10:]  # small trend only in last part
data = np.full(n, 10.0)
data[:20] = 10 + np.random.normal(0, 1.5, 20)
data[20:] = base + np.random.normal(0, 1.5, 10)
data[15] += 8  # big outlier at index 15

df = pd.DataFrame({'value': data}, index=t)

# Rolling window size
window = 7

# Version 1: EXCLUDE current point (recommended for detection)
df['roll_mean_ex'] = df['value'].shift(1).rolling(window).mean()
df['roll_std_ex']  = df['value'].shift(1).rolling(window).std()
df['z_ex'] = (df['value'] - df['roll_mean_ex']) / df['roll_std_ex']

# Version 2: INCLUDE current point (self-dampening)
df['roll_mean_inc'] = df['value'].rolling(window).mean()
df['roll_std_inc']  = df['value'].rolling(window).std()
df['z_inc'] = (df['value'] - df['roll_mean_inc']) / df['roll_std_inc']

# Add the Z-scores comparison as a 3rd subplot
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 12), sharex=True)

# Top plot: original + means
ax1.plot(df.index, df['value'], 'o-', label='Observed', color='black', alpha=0.7)
ax1.plot(df.index, df['roll_mean_ex'], label='Rolling mean (exclude current)', color='blue')
ax1.plot(df.index, df['roll_mean_inc'], '--', label='Rolling mean (include current)', color='red')
ax1.set_title('Time Series + Rolling Means (window=7)')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Middle plot: rolling stds
ax2.plot(df.index, df['roll_std_ex'], label='Rolling std (exclude current)', color='blue')
ax2.plot(df.index, df['roll_std_inc'], '--', label='Rolling std (include current)', color='red')
ax2.set_title('Rolling Standard Deviations')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Bottom plot: Z-scores comparison
ax3.plot(df.index, df['z_ex'], 'o-', label='Z-score (exclude current)', color='blue')
ax3.plot(df.index, df['z_inc'], 'x--', label='Z-score (include current)', color='red')
ax3.axhline(3, color='gray', linestyle=':', alpha=0.6)
ax3.axhline(-3, color='gray', linestyle=':', alpha=0.6)
ax3.set_title('Z-Scores: Exclude vs Include Current Point')
ax3.set_xlabel('Time')
ax3.set_ylabel('Z-score')
ax3.legend()
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
The difference between including vs excluding the current (evaluated) point.

P-values

You compute z, then ask: under the null (“this came from the same distribution as my window”), what’s the chance I’d see something this extreme?

Two-tailed p-value = 2 × (1 − cdf(|z|)) under the standard normal.

z = 3 → p ≈ 0.0027 → “probably not random noise.”
z = 1.5 → p ≈ 0.1336 → “eh, could happen.”

Easy. Until the assumptions start falling apart.
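
A hedged sketch of that calculation (assuming scipy is available, as in the later snippets):

Code
from scipy import stats

def two_tailed_p(z):
    # Probability of seeing |Z| at least this large under the standard normal
    return 2 * (1 - stats.norm.cdf(abs(z)))

print(two_tailed_p(3.0))  # ~0.0027
print(two_tailed_p(1.5))  # ~0.1336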

Assumptions

The z-score (and its p-value) assumes two things:

  1. The window data is roughly normal (or at least the tails behave).
  2. Your estimated \(\sigma\) is close enough to the true population value.

A skewed window, for instance, violates #1. This means that the “within \(3\sigma\)” region might actually cover only 85% of observations, rather than the expected 99.7%.

Similarly, with a sufficiently small window, the estimated \(\sigma\) is noisy, causing z-scores to swing more than they should.
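
To get a feel for assumption #2, here’s a small simulation of my own (not from the original post’s code): estimate \(\sigma\) from many 7-point windows drawn from the same process and look at how widely the estimates scatter around the true value.

Code
import numpy as np

rng = np.random.default_rng(42)
true_sigma = 1.5

# 10,000 windows of 7 points each, all drawn from the same N(0, 1.5^2) process
estimates = rng.normal(0, true_sigma, size=(10_000, 7)).std(axis=1, ddof=1)

print(f"true sigma: {true_sigma}")
print(f"middle 90% of 7-point estimates: "
      f"{np.percentile(estimates, 5):.2f} to {np.percentile(estimates, 95):.2f}")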

Hypothesis Testing Basics: Rejecting the Null, Not Proving the Alternative

Hypothesis testing provides the formal framework for deciding whether observed data support a claim of interest. The structure is consistent across tools like the z-score and the t-statistic.

The method begins with two competing hypotheses:

  • The null hypothesis (H₀) represents the default assumption: no effect, no difference, or no trend. In anomaly detection, H₀ states that the observation belongs to the same distribution as the baseline data. In trend analysis, H₀ typically states that the slope is zero.
  • The alternative hypothesis (H₁) represents the claim under investigation: there is an effect, a difference, or a trend.

The test statistic (z-score or t-statistic) quantifies how far the data deviate from what would be expected under H₀.

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H₀ is true. A small p-value indicates that such an extreme result is unlikely under the null.

The decision rule is simple:

  • If the p-value is below a pre-specified significance level (commonly 0.05), reject H₀.
  • If the p-value exceeds the threshold, fail to reject H₀.

A key point is that failing to reject H₀ does not prove H₀ is true. It only indicates that the data don’t provide sufficient evidence against it. Absence of evidence is not evidence of absence.

The two-tailed test is standard for anomaly detection and many trend tests because deviations can occur in either direction. The p-value is therefore calculated as twice the one-tailed probability.

For the z-score, the test relies on the standard normal distribution under the null. For small samples, or when the variance is estimated from the data, the t-distribution is used instead, as discussed in later sections.

This framework applies uniformly: the test statistic measures deviation from the null, the distribution provides the reference for how unusual that deviation is, and the p-value translates that unusualness into a decision rule.

The assumptions underlying the distribution (normality of errors, independence) must hold for the p-value to be interpreted appropriately. When those assumptions are violated, the reported probabilities lose reliability, which becomes a central concern when extending the approach beyond point anomalies.

The Signal-to-Noise Principle: Connecting Z-Scores and t-Statistics

The z-score and the t-statistic are both instances of the ratio

$$ \frac{\text{signal}}{\text{noise}}. $$

The signal is the deviation from the null value: \(x - \mu\) for point anomalies and \(\hat{\beta}_1 - 0\) for the slope in linear regression.

The noise term is the measure of variability under the null hypothesis. For the z-score, noise is \(\sigma\) (the standard deviation of the baseline observations). For the t-statistic, noise is the standard error \(\text{SE}(\hat{\beta}_1)\).

Standard Error vs Standard Deviation

The sample standard deviation measures the spread of the individual observations:

$$ s = \sqrt{ \frac{1}{n-1} \sum (x_i - \bar{x})^2 }. $$

The standard error measures the uncertainty in a statistic computed from those observations; for the sample mean,

$$ \text{SE}(\bar{x}) = \frac{s}{\sqrt{n}}, $$

which shrinks as the sample size grows.
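
A tiny sketch of the distinction on simulated data (illustrative only):

Code
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 30)

s = x.std(ddof=1)           # standard deviation: spread of individual observations
se = s / np.sqrt(len(x))    # standard error: uncertainty of the sample mean

print(f"s = {s:.2f}, SE(mean) = {se:.2f}")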

The ratio quantifies the observed effect relative to the variability expected if the null hypothesis were true. A large value indicates that the effect is unlikely under random variation alone.

In point anomaly detection, \(\sigma\) is the standard deviation of the individual observations around \(\mu\). In trend detection, the quantity of interest is \(\hat{\beta}_1\) from the model \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\). The standard error is

$$ \text{SE}(\hat{\beta}_1) = \sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }, $$

where \(s^2\) is the residual mean squared error after fitting the line.

Using the raw standard deviation of \(y_i\) as the denominator would yield

$$ \frac{\hat{\beta}_1}{\sqrt{\text{Var}(y)}} $$

and include both the systematic trend and the random fluctuations in the denominator, which inflates the noise term and underestimates the strength of the trend.

The t-statistic uses

$$ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} $$

and follows the t-distribution with \(n-2\) degrees of freedom because \(s^2\) is estimated from the residuals. This estimation of the variance introduces additional uncertainty, which is reflected in the wider tails of the t-distribution compared with the standard normal.
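
To make the denominator concrete, here’s a sketch (on simulated data, not the series used elsewhere in this post) that computes \(\text{SE}(\hat{\beta}_1)\) from the residuals by hand, checks it against scipy’s linregress, and shows how the naive std(y) denominator shrinks the statistic.

Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.arange(30)
y = 10 + 0.1 * x + rng.normal(0, 1.5, 30)

# Manual OLS slope and its standard error from the residuals
x_c = x - x.mean()
beta1 = np.sum(x_c * (y - y.mean())) / np.sum(x_c**2)
beta0 = y.mean() - beta1 * x.mean()
resid = y - (beta0 + beta1 * x)
s2 = np.sum(resid**2) / (len(x) - 2)       # residual mean squared error
se_beta1 = np.sqrt(s2 / np.sum(x_c**2))    # SE of the slope

res = stats.linregress(x, y)
print(beta1 / se_beta1, res.slope / res.stderr)  # manual t vs library t: should agree

# Naive denominator: mixes the trend into the "noise" and shrinks the statistic
print(beta1 / np.sqrt(np.var(y)))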

The same signal-to-noise structure appears in most test statistics. The F-statistic compares explained variance to residual variance:

$$ F = \frac{\text{explained MS}}{\text{residual MS}}. $$

The chi-square statistic compares observed to expected frequencies, scaled by expected values:

$$ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}. $$

In each case, the statistic is a ratio of observed deviation to expected variation under the null. The z-score and t-statistic are specific realisations of this principle adapted to tests about means or regression coefficients.

When Z-Scores Break: The Trend Problem

The z-score performs reliably when applied to individual observations against a stable baseline. Extending it to trend detection, however, introduces fundamental issues that undermine its validity.

Consider a time series where the goal is to test whether a linear trend exists. One might compute the ordinary least squares slope \(\hat{\beta}_1\) and attempt to standardise it using the z-score framework by dividing by the standard deviation of the response variable:

$$ z = \frac{\hat{\beta}_1}{\sqrt{\text{Var}(y)}}. $$

This approach is wrong. The standard deviation \(\sqrt{\text{Var}(y)}\) measures the total spread of the response variable, which includes both the systematic trend (the signal) and the random fluctuations (the noise). When a trend is present, the variance of \(y\) is inflated by the trend itself. Placing this inflated variance in the denominator reduces the magnitude of the test statistic, leading to underestimation of the trend’s significance.

A common alternative is to use the standard deviation estimated from data before the suspected trend begins, for example from observations prior to some time t = 10. This appears logical but can fail for a different reason: the process may not be stationary.

If the core properties of our distribution (our window, in this case) change, the pre-trend \(\sigma\) is no longer representative of the variability during the trend period. The test statistic then reflects an irrelevant noise level, producing either false positives or false negatives depending on how the variance has evolved.

The core problem is that the quantity being tested — the slope — is a derived summary statistic computed from the same data used to estimate the noise. Unlike point anomalies, where the test observation is independent of the baseline window, the trend parameter is entangled with the data. Any attempt to use the raw variance of \(y\) mixes signal into the noise estimate, violating the requirement that the denominator should represent variability under the null hypothesis of no trend.

This contamination is not a minor technical detail. It systematically biases the test toward conservatism when a trend exists, because the denominator grows with the strength of the trend. The result is that real trends are harder to detect, and the reported p-values are larger than they should be.

These limitations explain why the z-score, despite its simplicity and intuitive appeal, cannot be directly applied to trend detection without modification. The t-statistic addresses precisely this issue by constructing a noise measure that excludes the fitted trend, as explained in the next section.

A quick simulation to compare the result of the t-statistic with the “incorrect”/naive z-score result:

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# ────────────────────────────────────────────────
# Data generation (same as before)
np.random.seed(42)
n = 30
t = np.arange(n)
data = np.full(n, 10.0)
data[:20] = 10 + np.random.normal(0, 1.5, 20)
data[20:] = 10 + 0.1 * t[20:] + np.random.normal(0, 1.5, 10)
data[15] += 8  # outlier at index 15

df = pd.DataFrame({'time': t, 'value': data})

# ────────────────────────────────────────────────
# Fit regression on the trailing window (indices 18 to 29)
last10 = df.iloc[18:].copy()
slope, intercept, r_value, p_value, std_err = stats.linregress(
    last10['time'], last10['value']
)
last10['fitted'] = intercept + slope * last10['time']
t_stat = slope / std_err

# Naive "z-statistic" — using std(y) / sqrt(n) as denominator (incorrect for trend)
z_std_err = np.std(last10['value']) / np.sqrt(len(last10))
z_stat = slope / z_std_err

# Print comparison
print("Correct t-statistic (using proper SE of slope):")
print(f"  Slope: {slope:.4f}")
print(f"  SE of slope: {std_err:.4f}")
print(f"  t-stat: {t_stat:.4f}")
print(f"  p-value (t-dist): {p_value:.6f}\n")

print("Naive 'z-statistic' (using std(y)/sqrt(n) — incorrect):")
print(f"  Slope: {slope:.4f}")
print(f"  Flawed SE: {z_std_err:.4f}")
print(f"  z-stat: {z_stat:.4f}")

# ────────────────────────────────────────────────
# Plot with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), sharex=True)

# Top: Correct t-statistic plot
ax1.plot(df['time'], df['value'], 'o-', color='black', alpha=0.7, linewidth=1.5,
         label='Full time series')
ax1.plot(last10['time'], last10['fitted'], color='red', linewidth=2.5,
         label=f'Linear fit (last 10 pts): slope = {slope:.3f}')
ax1.axvspan(20, 29, color='red', alpha=0.08, label='Fitted window')

ax1.text(22, 11.5, f'Correct t-statistic = {t_stat:.3f}\np-value = {p_value:.4f}',
         fontsize=12, bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))

ax1.set_title('Correct t-Test: Linear Fit on Last 10 Points')
ax1.set_ylabel('Value')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# Bottom: Naive z-statistic plot (showing the error)
ax2.plot(df['time'], df['value'], 'o-', color='black', alpha=0.7, linewidth=1.5,
         label='Full time series')
ax2.plot(last10['time'], last10['fitted'], color='red', linewidth=2.5,
         label=f'Linear fit (last 10 pts): slope = {slope:.3f}')
ax2.axvspan(20, 29, color='red', alpha=0.08, label='Fitted window')

ax2.text(22, 11.5, f'Naive z-statistic = {z_stat:.3f}\n(uses std(y)/√n — incorrect denominator)',
         fontsize=12, bbox=dict(facecolor='white', alpha=0.9, edgecolor='gray'))

ax2.set_title('Naive "Z-Test": Using std(y)/√n Instead of SE of Slope')
ax2.set_xlabel('Time')
ax2.set_ylabel('Value')
ax2.legend(loc='upper left')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Correct t-statistic (using proper SE of slope):
  Slope: 0.2439
  SE of slope: 0.1412
  t-stat: 1.7276
  p-value (t-dist): 0.114756

Naive 'z-statistic' (using std(y)/sqrt(n) — incorrect):
  Slope: 0.2439
  Flawed SE: 0.5070
  z-stat: 0.4811
Comparing the t-test for trend detection vs the naive z-test.

Enter the t-Statistic: Designed for Estimated Noise

The t-statistic addresses the limitations of the z-score by explicitly accounting for uncertainty in the variance estimate. It is the appropriate tool when testing a parameter, such as a regression slope, where the noise level must be estimated from the same data used to compute the parameter.

Consider the linear regression model

$$ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, $$

where the errors \(\epsilon_i\) are assumed to be independent and normally distributed with mean 0 and constant variance \(\sigma^2\).

The ordinary least squares estimator of the slope is

$$ \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. $$

Under the null hypothesis H₀: \(\beta_1 = 0\), the expected value of \(\hat{\beta}_1\) is zero.

The standard error of \(\hat{\beta}_1\) is

$$ \text{SE}(\hat{\beta}_1) = \sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }, $$

where \(s^2\) is the unbiased estimate of \(\sigma^2\), computed as the residual mean squared error:

$$ s^2 = \frac{1}{n-2} \sum (y_i - \hat{y}_i)^2. $$

The t-statistic is then

$$ t = \frac{\hat{\beta}_1}{\text{SE}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\sqrt{ \frac{s^2}{\sum (x_i - \bar{x})^2} }}. $$

Under the null hypothesis and the model assumptions, this statistic follows a t-distribution with n−2 degrees of freedom.

A quick refresher on degrees of freedom

Estimating the mean from the data uses up one degree of freedom, which is why the sample variance divides by \(n-1\):

$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2. $$

In regression, two parameters (the intercept and the slope) are estimated, so the residual variance above divides by \(n-2\) instead.

The key distinction from the z-score is the use of \(s^2\) rather than a fixed \(\sigma^2\). Because the variance is estimated from the residuals, the denominator carries sampling uncertainty of its own. This uncertainty widens the distribution of the test statistic, which is why the t-distribution has heavier tails than the standard normal for small degrees of freedom.

As the sample size increases, the estimate \(s^2\) becomes more precise, the t-distribution converges to the standard normal, and the distinction between t and z diminishes.

The t-statistic therefore provides a more accurate assessment of significance when the noise level is unknown and must be estimated from the data. By basing the noise measure on the residuals after removing the fitted trend, it avoids mixing the signal into the noise denominator, which is the central flaw in naive applications of the z-score to trends.
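
A quick way to see those heavier tails (my own sketch): compare the two-tailed 5% critical values of the t-distribution with the standard normal’s 1.96 as the degrees of freedom grow.

Code
from scipy import stats

# |t| beyond which a two-tailed test rejects at the 5% level
for df in [5, 10, 30, 100, 1000]:
    print(df, round(float(stats.t.ppf(0.975, df)), 3))

print("normal", round(float(stats.norm.ppf(0.975)), 3))  # ~1.96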

Here’s a simulation to see how sampling under different scenarios results in different p-value distributions:

  1. Sampling from the null distribution results in a uniform p-value distribution: you’re essentially equally likely to get any p-value if you sample from the null distribution.
  2. Say you add a little shift — you bump your mean by 4: you’re now essentially confident that it comes from a different distribution, so your p-value distribution skews left.
  3. Interestingly, unless your test is extremely conservative (that is, unlikely to reject the null hypothesis), it’s unlikely to get a skew towards 1. The third set of plots shows my unsuccessful attempt, where I repeatedly sample from an extremely tight distribution around the mean of the null distribution, hoping that would maximize my p-values.
Code
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from tqdm import trange

n_simulations = 10_000
n_samples = 30
baseline_mu = 50
sigma = 10
df = n_samples - 1

def run_sim(true_mu, sigma_val):
    t_stats, p_vals = [], []
    for _ in trange(n_simulations):
        # Generate sample
        sample = np.random.normal(true_mu, sigma_val, n_samples)
        t, p = stats.ttest_1samp(sample, baseline_mu)
        t_stats.append(t)
        p_vals.append(p)
    return np.array(t_stats), np.array(p_vals)

# 1. Null is True (Ideal)
t_null, p_null = run_sim(baseline_mu, sigma)

# 2. Effect Exists (Shifted)
t_effect, p_effect = run_sim(baseline_mu + 4, sigma)

# 3. Too Perfect (Variance suppressed, Mean forced to baseline)
# We use a tiny sigma so the sample mean is always basically the baseline. Even then, we still get a uniform p-value distribution.
t_perfect, p_perfect = run_sim(baseline_mu, 0.1) 

# Plotting
fig, axes = plt.subplots(3, 2, figsize=(12, 13))
x = np.linspace(-5, 8, 200)
t_pdf = stats.t.pdf(x, df)

scenarios = [
    (t_null, p_null, "Null is True (Ideal)", "skyblue", "salmon"),
    (t_effect, p_effect, "Effect Exists (Shifted)", "lightgreen", "gold"),
    (t_perfect, p_perfect, "Too Perfect (Still Uniform)", "plum", "lightgrey")
]

for i, (t_data, p_data, title, t_col, p_col) in enumerate(scenarios):
    # T-Stat Plots
    axes[i, 0].hist(t_data, bins=50, density=True, color=t_col, alpha=0.6, label="Simulated")
    axes[i, 0].plot(x, t_pdf, 'r--', lw=2, label="Theoretical T-dist")
    axes[i, 0].set_title(f"{title}: T-Statistics")
    axes[i, 0].legend()
    
    # P-Value Plots
    axes[i, 1].hist(p_data, bins=20, density=True, color=p_col, alpha=0.7, edgecolor='black')
    axes[i, 1].set_title(f"{title}: P-Values")
    axes[i, 1].set_xlim(0, 1)
    if i == 0:
        axes[i, 1].axhline(1, color='red', linestyle='--', label='Uniform Reference')
        axes[i, 1].legend()

plt.tight_layout()
plt.show()
Simulating p-values:
(a) Null distribution Sampling
(b) Mean shift sampling
(c) Unsuccessful right-skew simulation attempt

Alternatives and Extensions: When t-Statistics Are Not Enough

The t-statistic provides a strong parametric approach for trend detection under normality assumptions. Several alternatives exist when those assumptions are untenable or when greater robustness is required.

The Mann-Kendall test is a non-parametric method that assesses monotonic trends without requiring normality. It counts the number of concordant and discordant pairs in the data: for each pair of observations \((x_i, x_j)\) with \(i < j\), it checks whether the series is increasing (\(x_j > x_i\)), decreasing (\(x_j < x_i\)), or tied. The test statistic \(S\) is the difference between the number of increases and decreases:

$$ S = \sum_{i<j} \text{sgn}(x_j - x_i), $$

where sgn is the sign function (1 for positive, −1 for negative, 0 for ties). Under the null hypothesis of no trend, \(S\) is approximately normally distributed for large \(n\), allowing computation of a z-score and p-value. The test is rank-based and insensitive to outliers or non-normal distributions.
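
A minimal sketch of the statistic and its normal approximation (my own helper, ignoring the tie correction; dedicated packages exist for production use):

Code
import numpy as np
from scipy import stats

def mann_kendall(x):
    x = np.asarray(x)
    n = len(x)
    # S: sum of sgn(x_j - x_i) over all pairs with i < j
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    # Variance of S under H0 (no ties)
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected z-score and two-tailed p-value
    z = (s - np.sign(s)) / np.sqrt(var_s) if s != 0 else 0.0
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return s, z, p

rng = np.random.default_rng(42)
y = 10 + 0.1 * np.arange(30) + rng.normal(0, 1.5, 30)
print(mann_kendall(y))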

Sen’s slope estimator complements the Mann-Kendall test by providing a measure of trend magnitude. It computes the median of all pairwise slopes:

$$ Q = \text{median} \left( \frac{x_j - x_i}{j - i} \right) \quad \text{for all } i < j. $$

This estimator is robust to outliers and does not assume linearity.
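
For reference, scipy ships a Theil–Sen implementation; here’s a short sketch on simulated data, with the manual median-of-pairwise-slopes alongside as a check:

Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
t = np.arange(30)
y = 10 + 0.1 * t + rng.normal(0, 1.5, 30)

# Library version: Theil-Sen slope with a confidence interval
slope, intercept, lo, hi = stats.theilslopes(y, t)

# Manual version: median of all pairwise slopes
pairwise = [(y[j] - y[i]) / (j - i) for i in range(len(t)) for j in range(i + 1, len(t))]

print(slope, np.median(pairwise))  # should match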

The bootstrap method offers a flexible, distribution-free alternative. To test a trend, fit the linear model to the original data to obtain \(\hat{\beta}_1\). Then, resample the data with replacement many times (typically 1,000–10,000 iterations), refit the model each time, and collect the distribution of bootstrap slopes. The p-value is the proportion of bootstrap slopes that are more extreme than zero (or the original estimate, depending on the null). Confidence intervals can be constructed from the percentiles of the bootstrap distribution. This approach makes no parametric assumptions about the errors and works well for small or irregular samples.
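
A sketch of a pairs bootstrap for the slope under those assumptions (simulated data, percentile confidence interval):

Code
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
t = np.arange(30)
y = 10 + 0.1 * t + rng.normal(0, 1.5, 30)

# Slope on the original data
beta1 = stats.linregress(t, y).slope

# Resample (t, y) pairs with replacement and refit each time
n_boot = 5000
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, len(t), len(t))
    boot_slopes[b] = stats.linregress(t[idx], y[idx]).slope

# Percentile confidence interval; a CI excluding 0 suggests a real trend
ci_low, ci_high = np.percentile(boot_slopes, [2.5, 97.5])
print(f"slope = {beta1:.3f}, 95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")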

Each alternative trades off different strengths. Mann-Kendall and Sen’s slope are computationally simple and robust but assess monotonicity rather than strict linearity. Bootstrap methods are highly flexible and can accommodate complex models, though they require more computation. The choice depends on the data characteristics and the specific question: parametric power when the assumptions hold, non-parametric robustness when they don’t.


In Conclusion

The z-score and t-statistic both measure deviation from the null hypothesis relative to expected variability, but they serve different purposes. The z-score assumes a known or stable variance and is well suited to detecting individual point anomalies against a baseline. The t-statistic accounts for uncertainty in the variance estimate and is the right choice when testing derived parameters, such as regression slopes, where the noise must be estimated from the same data.

The key difference lies in the noise term. Using the raw standard deviation of the response variable for a trend mixes signal into the noise, leading to biased inference. The t-statistic avoids this by basing the noise measure on residuals after removing the fitted trend, providing a cleaner separation of effect from variability.

When normality or independence assumptions don’t hold, alternatives such as the Mann-Kendall test, Sen’s slope estimator, or bootstrap methods offer robust options without parametric requirements.

In practice, the choice of method depends on the question and the data. For point anomalies in stable processes, the z-score is efficient and sufficient. For trend detection, the t-statistic (or a robust alternative) is essential for reliable conclusions. Understanding the assumptions and the signal-to-noise distinction helps you pick the right tool and interpret results with confidence.


Code

Colab

General Code Repository


References and Further Reading

  • Hypothesis testing: A solid university lecture-notes overview covering hypothesis testing basics, including types of errors and p-values. Purdue University Northwest: Chapter 5 Hypothesis Testing
  • t-statistic Detailed lecture notes on t-tests for small samples, including comparisons to z-tests and p-value calculations. MIT OpenCourseWare: Single Sample Hypothesis Testing (t-tests)
  • z-score Practical tutorial explaining z-scores in hypothesis testing, with examples and visualizations for mean comparisons. Towards Data Science: Hypothesis Testing with Z-Scores
  • Trend significance scoring: Step-by-step blog on performing the Mann-Kendall trend test (non-parametric) for detecting monotonic trends and assessing significance. It’s in R. GeeksforGeeks: How to Perform a Mann-Kendall Trend Test in R
  • p-value Clear, beginner-friendly explanation of p-values, common misconceptions, and their role in hypothesis testing. Towards Data Science: P-value Explained
  • t-statistic vs z-statistic: Blog comparing t-test and z-test differences, when to use each, and practical applications. Statsig: T-test vs. Z-test
  • Additional university notes on hypothesis testing. Comprehensive course notes from Georgia Tech covering hypothesis testing, test statistics (z and t), and p-values. Georgia Tech: Hypothesis Testing Notes
