Linear Regression#
Now that we have introduced the learning problem and its notation somewhat more formally, let us study a simple but instructive regression problem that is known in the statistics literature as shrinkage.
Suppose that we are given the training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{m}$ shown below.

Training dataset (m=10) for the regression model. The green curve is the unknown target function $\sin(2\pi x)$.
Since each output is a noisy, real-valued observation of the underlying sinusoid, $y = \sin(2\pi x) + \epsilon$ with Gaussian noise $\epsilon$, this is a regression problem.
!pip install git+https://github.com/pantelis-classes/PRML.git#egg=prml
import seaborn as sns
# Apply the default theme
sns.set_theme()
import numpy as np
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
%matplotlib inline
from prml.preprocess import GaussianFeature, PolynomialFeature, SigmoidalFeature
from prml.linear import (
BayesianRegression,
EmpiricalBayesRegression,
LinearRegression,
RidgeRegression
)
np.random.seed(1234)
def create_toy_data(func, sample_size, std, domain=[0, 1]):
    x = np.linspace(domain[0], domain[1], sample_size)
    np.random.shuffle(x)
    y = func(x) + np.random.normal(scale=std, size=x.shape)
    return x, y

def sinusoidal(x):
    return np.sin(2 * np.pi * x)
x_train, y_train = create_toy_data(sinusoidal, 10, 0.25)
x_test = np.linspace(0, 1, 100)
y_test = sinusoidal(x_test)
plt.figure(figsize=[10,8])
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test, y_test, "-g", label=r"target function $\sin(2\pi x)$")
plt.legend()
plt.show()

Let us now pick a hypothesis set that corresponds, in general, to the following sum of basis functions,

$$\hat{y}(x, \mathbf{w}) = \sum_{j=0}^{M} w_j \phi_j(x) = \mathbf{w}^\top \boldsymbol{\phi}(x),$$

where $\mathbf{w} = (w_0, \dots, w_M)^\top$ is a weight vector and the $\phi_j$ are fixed basis functions, for example polynomial ($\phi_j(x) = x^j$), Gaussian, or sigmoidal, as plotted below.
x = np.linspace(-1, 1, 100)
X_polynomial = PolynomialFeature(11).transform(x[:, None])
X_gaussian = GaussianFeature(np.linspace(-1, 1, 11), 0.1).transform(x)
X_sigmoidal = SigmoidalFeature(np.linspace(-1, 1, 11), 10).transform(x)
plt.figure(figsize=(20, 5))
for i, X in enumerate([X_polynomial, X_gaussian, X_sigmoidal]):
    plt.subplot(1, 3, i + 1)
    for j in range(12):
        plt.plot(x, X[:, j])
txt = "Polynomial, Gaussian and Sigmoidal basis functions"
plt.figtext(0.5, 0.01, txt, wrap=True, horizontalalignment='center', fontsize=12)

Our job is to find the final hypothesis, i.e. the weight vector $\mathbf{w}$, that best approximates the unknown target function.
The loss function chosen for this regression problem corresponds to the sum of the squared displacements between each data point and our hypothesis. In the case of Gaussian errors, minimizing this sum of squares gives rise to an (unbiased) Maximum Likelihood estimate of the model parameters. Contrast this with the sum of absolute differences, which is the Maximum Likelihood criterion under Laplacian noise and is less sensitive to outliers.
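To make the connection explicit, here is the standard statement in the notation of the basis-function model above, assuming i.i.d. Gaussian noise with precision $\beta$. The sum-of-squares error is

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{m}\left(y_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2,$$

and the log-likelihood of the training targets is

$$\ln p(\mathbf{y}\mid\mathbf{w},\beta) = -\beta\,E(\mathbf{w}) + \frac{m}{2}\ln\beta - \frac{m}{2}\ln(2\pi),$$

so maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing $E(\mathbf{w})$.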
Now our job has become to choose two things: the weight vector $\mathbf{w}$ and the model capacity, that is, the number of basis functions $M$.




Various final hypotheses for different choices of the model order $M$.
Obviously you can reduce the training error to almost zero by selecting a model that is complicated enough (e.g. M=9) to perfectly fit the training data when m is small.
But this is not what you want to do: when met with test data, such a model will perform far worse than a less complicated model that is closer to the true model (e.g. M=3). This central observation in statistical learning is called overfitting; the sketch below makes the effect concrete. In addition, you may not have the time to iterate over M, which matters especially in online learning settings.
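To see the effect numerically, here is a minimal sketch (an addition to the notebook, reusing the x_train, y_train, x_test, y_test arrays and the PolynomialFeature and LinearRegression classes already imported above) that sweeps M and compares the training RMSE with the RMSE against the noiseless test curve:
# Training RMSE and RMSE against the noiseless target for each polynomial degree M
def rmse(a, b):
    return np.sqrt(np.mean(np.square(a - b)))

degrees = list(range(10))
train_errors, test_errors = [], []
for M in degrees:
    feature = PolynomialFeature(M)
    model = LinearRegression()
    model.fit(feature.transform(x_train), y_train)
    train_errors.append(rmse(model.predict(feature.transform(x_train)), y_train))
    test_errors.append(rmse(model.predict(feature.transform(x_test)), y_test))

plt.figure(figsize=[10, 6])
plt.plot(degrees, train_errors, "o-", label="training RMSE")
plt.plot(degrees, test_errors, "o-", label="test RMSE")
plt.xlabel("$M$")
plt.ylabel("RMSE")
plt.legend()
plt.show()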
M=9
# Pick one of the three features below
feature = PolynomialFeature(M)
#feature = GaussianFeature(np.linspace(0, 1, M), 0.1)
# feature = SigmoidalFeature(np.linspace(0, 1, M), 10)
X_train = feature.transform(x_train)
X_test = feature.transform(x_test)
model = LinearRegression()
model.fit(X_train, y_train)
plt.figure(figsize=[10,8])
plt.plot(model.w)
plt.xlabel("index of $w$")
plt.ylabel("$w$")
y, y_std = model.predict(X_test, return_std=True)
plt.figure(figsize=[10,8])
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test, y, label="mean")
plt.fill_between(
x_test, y - y_std, y + y_std,
color="orange", alpha=0.5, label="std.")
plt.legend()
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.show()


To avoid overfitting we have multiple strategies. One straightforward strategy is suggested by the wild oscillations of the weights $w_j$ plotted above: penalize large weights by adding a term proportional to $\lVert\mathbf{w}\rVert^2$ to the sum-of-squares loss.
This type of solution is called regularization, and because it effectively shrinks the dynamic range of the weights it is also known in statistics as shrinkage or ridge regression. We have introduced a new hyperparameter $\lambda$ (passed as the alpha argument in the code below) that controls the strength of the penalty and must itself be chosen, typically by monitoring the error on a validation set.
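Concretely, ridge regression minimizes the penalized sum-of-squares error

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{m}\left(y_n - \mathbf{w}^\top\boldsymbol{\phi}(x_n)\right)^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2,$$

which still has a closed-form solution for $\mathbf{w}$; this is the objective solved by RidgeRegression below, with $\lambda$ supplied as alpha.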
The plot below shows the regularized final hypothesis obtained with ridge regression for $\lambda = 10^{-3}$; in practice you would search over several values of $\lambda$ and keep the one with the lowest validation error (a sketch of such a search follows the fit).
model = RidgeRegression(alpha=1e-3)
model.fit(X_train, y_train)
y = model.predict(X_test)
plt.figure(figsize=[10,8])
plt.scatter(x_train, y_train, facecolor="none", edgecolor="b", s=50, label="training data")
plt.plot(x_test, y, label="Regularized Final Hypothesis")
plt.legend()
plt.show()
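As a sketch of the $\lambda$ search mentioned above (an addition, reusing the degree-9 design matrices X_train and X_test from earlier, with the noiseless y_test curve standing in for a proper validation set; in practice you would hold out part of the training data instead):
# Try a grid of regularization strengths and keep the one with the lowest
# error on the held-out curve
alphas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
val_errors = []
for alpha in alphas:
    ridge = RidgeRegression(alpha=alpha)
    ridge.fit(X_train, y_train)
    val_errors.append(np.sqrt(np.mean((ridge.predict(X_test) - y_test) ** 2)))
best_alpha = alphas[int(np.argmin(val_errors))]
print(f"best alpha: {best_alpha:g}, RMSE: {min(val_errors):.3f}")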

Let's reflect on the MSE and how model complexity gives rise to the various components of the generalization error. The mean squared error of an estimator $\hat{\theta}$ of a quantity $\theta$ decomposes as

$$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \mathrm{Bias}(\hat{\theta})^2 + \mathrm{Var}(\hat{\theta}),$$

which means that the MSE captures both the bias and the variance of the estimated target variables. As shown in the plots above, increasing model capacity can dramatically increase the variance of the final hypothesis.
As capacity increases (x-axis), bias (dotted) tends to decrease and variance (dashed) tends to increase, yielding another U-shaped curve for generalization error (bold curve). If we vary capacity along one axis, there is an optimal capacity, with underfitting when the capacity is below this optimum and overfitting when it is above.
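The decomposition can also be checked empirically. The sketch below (an addition, reusing create_toy_data, sinusoidal, x_test, and y_test from above) refits the model on many freshly sampled datasets and estimates the squared bias and the variance of the predictions at the test points:
# Estimate squared bias and variance of the predictions by refitting the
# model on many independently sampled training sets
def bias_variance(M, n_datasets=200, sample_size=10, std=0.25):
    feature = PolynomialFeature(M)
    Phi_test = feature.transform(x_test)
    predictions = []
    for _ in range(n_datasets):
        x, y = create_toy_data(sinusoidal, sample_size, std)
        model = LinearRegression()
        model.fit(feature.transform(x), y)
        predictions.append(model.predict(Phi_test))
    predictions = np.asarray(predictions)
    bias_sq = np.mean((predictions.mean(axis=0) - y_test) ** 2)
    variance = np.mean(predictions.var(axis=0))
    return bias_sq, variance

for M in [1, 3, 9]:
    bias_sq, variance = bias_variance(M)
    print(f"M={M}: bias^2={bias_sq:.3f}, variance={variance:.3f}")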
Bias and Variance Decomposition during the training process#
Apart from the composition of the generalization error for various model capacities, it is interesting to make some general comments regarding its decomposition during training. Early in training the bias is large because the predictor output is far from the target function, while the variance is very small because the data has had little influence yet. Late in training the bias is small because the predictor has learned the underlying function. However, if we train for too long, the predictor will also have learned the noise specific to that dataset (overfitting), and the variance will be large because this noise varies between training and test datasets.
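To watch this during an actual training run, here is a rough sketch, not part of the PRML package: it fits the M=9 polynomial model with plain gradient descent on the sum-of-squares loss and records the training RMSE and the RMSE against the noiseless test curve as iterations accumulate. Early stopping acts like the regularizer discussed above, while very long runs drive the fit toward the unregularized least-squares solution; the learning rate and iteration budget are illustrative choices.
# Gradient descent on the sum-of-squares loss for the M=9 model, recording
# training RMSE and RMSE against the noiseless target every 100 steps
feature = PolynomialFeature(9)
Phi_train = feature.transform(x_train)
Phi_test = feature.transform(x_test)

w = np.zeros(Phi_train.shape[1])
learning_rate = 0.1
train_curve, test_curve = [], []
for step in range(20000):
    gradient = Phi_train.T @ (Phi_train @ w - y_train) / len(y_train)
    w -= learning_rate * gradient
    if step % 100 == 0:
        train_curve.append(np.sqrt(np.mean((Phi_train @ w - y_train) ** 2)))
        test_curve.append(np.sqrt(np.mean((Phi_test @ w - y_test) ** 2)))

plt.figure(figsize=[10, 6])
plt.plot(train_curve, label="training RMSE")
plt.plot(test_curve, label="test RMSE")
plt.xlabel("iteration (x100)")
plt.ylabel("RMSE")
plt.legend()
plt.show()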
References#
This notebook makes extensive use of the Pattern Recognition and Machine Learning (PRML) Python package installed above.