Diagnose Linear Regression with Cross-Validation and Residuals

Evaluate a linear regression model with a holdout set, cross-validation, and residual inspection.

A fitted regression model is only the beginning. Evaluate it on unseen data, compare cross-validation folds, and inspect residuals before trusting its predictions.

Build a Reproducible Example

import pandas as pd

data = pd.DataFrame(
    {
        "area": [60, 75, 90, 105, 120, 135, 150, 165, 180, 195],
        "bedrooms": [1, 2, 2, 3, 3, 3, 4, 4, 4, 5],
        "age": [35, 20, 18, 12, 10, 8, 6, 5, 4, 2],
        "price": [310, 365, 420, 505, 560, 615, 690, 735, 790, 875],
    }
)

X = data[["area", "bedrooms", "age"]]
y = data["price"]

Hold Out Test Data

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows (3 of 10) for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("MAE:", mean_absolute_error(y_test, predictions))
print("R2:", r2_score(y_test, predictions))

Add Cross-Validation

from sklearn.model_selection import KFold, cross_validate

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
    return_train_score=True,
)

# scikit-learn reports MAE negated so that higher is always better; flip the sign back.
print("Test R2:", scores["test_r2"])
print("Test MAE:", -scores["test_neg_mean_absolute_error"])

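Compare the scores across folds, not just their mean. If one fold scores far worse than the others, the model is sensitive to which rows it was trained on, and a single favorable split can hide that. With only 10 rows, each test fold holds two rows, so per-fold R2 is especially noisy and MAE is the steadier number to read.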
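
As a rough sketch of how the fold spread could be summarized, the block below tabulates the per-fold scores and the gap between training and test error. It repeats the example data so that it runs on its own; the names `summary`, `train_mae`, and `test_mae` are illustrative and not part of the original example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Same example data as earlier in the post, repeated so this block is standalone.
data = pd.DataFrame(
    {
        "area": [60, 75, 90, 105, 120, 135, 150, 165, 180, 195],
        "bedrooms": [1, 2, 2, 3, 3, 3, 4, 4, 4, 5],
        "age": [35, 20, 18, 12, 10, 8, 6, 5, 4, 2],
        "price": [310, 365, 420, 505, 560, 615, 690, 735, 790, 875],
    }
)
X = data[["area", "bedrooms", "age"]]
y = data["price"]

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=cv,
    scoring=("r2", "neg_mean_absolute_error"),
    return_train_score=True,
)

# Flip the sign so both columns are ordinary (positive) MAE values.
train_mae = -scores["train_neg_mean_absolute_error"]
test_mae = -scores["test_neg_mean_absolute_error"]

# One row per fold: training error, test error, and test R2 side by side.
summary = pd.DataFrame(
    {"train_mae": train_mae, "test_mae": test_mae, "test_r2": scores["test_r2"]}
)
summary.index.name = "fold"

print(summary.round(2))
print("Test MAE mean:", round(test_mae.mean(), 2))
print("Test MAE std:", round(test_mae.std(ddof=1), 2))
```

A test MAE that varies widely across folds, or that sits far above the training MAE, hints that the fit depends heavily on which rows were held out.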

Inspect Residuals

import matplotlib.pyplot as plt

# Positive residual: the model under-predicted; negative: it over-predicted.
residuals = y_test - predictions

plt.scatter(predictions, residuals)
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.show()

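In a residual plot for a well-specified model, the points scatter roughly evenly around zero. A curve suggests a missing nonlinear term, a widening spread suggests non-constant error variance, and distinct groups suggest a missing feature or interaction. With the three-row test set used here the plot contains only three points, so read it as a demonstration of the technique rather than as evidence about this model.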
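
Because the holdout set in this example is so small, one possible complement is to compute out-of-fold residuals for every row with `cross_val_predict`, so that each residual comes from a model that never saw that row. This is a sketch under that assumption, not part of the original workflow; `oof_predictions`, `oof_residuals`, and `report` are hypothetical names, and the data is repeated so the block runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Same example data as earlier in the post, repeated so this block is standalone.
data = pd.DataFrame(
    {
        "area": [60, 75, 90, 105, 120, 135, 150, 165, 180, 195],
        "bedrooms": [1, 2, 2, 3, 3, 3, 4, 4, 4, 5],
        "age": [35, 20, 18, 12, 10, 8, 6, 5, 4, 2],
        "price": [310, 365, 420, 505, 560, 615, 690, 735, 790, 875],
    }
)
X = data[["area", "bedrooms", "age"]]
y = data["price"]

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each prediction is made by a model fitted on the other four folds.
oof_predictions = cross_val_predict(LinearRegression(), X, y, cv=cv)
oof_residuals = y - oof_predictions

report = pd.DataFrame({"predicted": oof_predictions, "residual": oof_residuals})

# Sorting by prediction makes a trend or widening spread easier to spot in text form.
print(report.sort_values("predicted").round(1))
print("Mean residual:", round(oof_residuals.mean(), 2))
print("Largest absolute residual:", round(oof_residuals.abs().max(), 2))
```

The same `report` columns can be passed to `plt.scatter` in place of the holdout predictions and residuals to plot all ten points.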

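Correlated features can also make coefficients unstable: when two features rise and fall together, as area and bedrooms do in this dataset, the model can trade weight between them with little change in its predictions. Cross-validation helps reveal whether the results change substantially with the training rows.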
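
One way to check this, sketched here as an illustration, is to look at the feature correlation matrix and then refit the model on each training fold to see how much the coefficients move. The block uses `cross_validate` with `return_estimator=True` to keep the fitted models; the names `feature_corr`, `results`, and `coefs` are hypothetical, and the data is repeated so the block runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Same example data as earlier in the post, repeated so this block is standalone.
data = pd.DataFrame(
    {
        "area": [60, 75, 90, 105, 120, 135, 150, 165, 180, 195],
        "bedrooms": [1, 2, 2, 3, 3, 3, 4, 4, 4, 5],
        "age": [35, 20, 18, 12, 10, 8, 6, 5, 4, 2],
        "price": [310, 365, 420, 505, 560, 615, 690, 735, 790, 875],
    }
)
X = data[["area", "bedrooms", "age"]]
y = data["price"]

# Pairwise Pearson correlation between the features.
feature_corr = X.corr()
print(feature_corr.round(2))

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# return_estimator=True keeps the model fitted on each training fold.
results = cross_validate(LinearRegression(), X, y, cv=cv, return_estimator=True)

# One row of coefficients per fold, one column per feature.
coefs = pd.DataFrame(
    [est.coef_ for est in results["estimator"]], columns=X.columns
)
coefs.index.name = "fold"

print(coefs.round(2))
print("Coefficient std across folds:")
print(coefs.std().round(2))
```

Coefficients that swing widely or change sign between folds, while the fold scores stay similar, are a typical symptom of correlated features.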

Next Steps

A later article will cover building a reusable scikit-learn pipeline to keep preprocessing and evaluation together.

This post is licensed under CC BY 4.0 by the author.