The Learning Problem
The Supervised (Inductive) Learning Problem Statement
Let us start with a classic formal definition of the supervised learning problem.
Vapnik’s formulation of the learning problem (enhanced with notation from the Deep Learning book)
The description below is taken from Vadimir Vapnik’s classic book Statistical Learing Theory, albeit with some enhancements on terminology to make it more in line with our needs.
The generator is a source of situations that determines the environment in which the target function (he calls it supervisor) and the learning algorithm act. Here we consider the simplest environment: the data generator generates the vectors $\mathbf{x} \in \mathcal{X}$ independently and identically distributed (i.i.d.) according to some unknown (but fixed) probability distribution function $p_{data}(\mathbf{x})$.
The vector $\mathbf x$ are inputs to the target function (or operator); the target function returns the output values $y \in \mathcal{Y}$. The target function which transforms the vectors $\mathbf{x}$ into values $y$, is unknown but we know that it exists and does not change. The target function effectively returns the output $y$ on the vector $\mathbf x$ according to a conditional distribution function $p(y | \mathbf x)$.
The learning algorithm observes data that is drawn randomly and independently from the joint distribution function $p_{data}(\mathbf x , y) = p_{data}(\mathbf{x}) p_{data}(y | \mathbf x)$. The sampling distribution denoted as $\hat{p}_{data}$ produces _examples_ of inputs $\mathbf{x}$ and the targets (labels) $y$.
$$\mathtt{data} = \{ (\mathbf{x}_1, y_1), \dots, (\mathbf{x}_m, y_m) \}$$
During what is called training, the learning algorithm constructs some operator which will he used for prediction of the supervisor’s answer $y_i$ on any specific vector $\mathbf{x}_i$ generated by the generator. The goal of the learning algorithm is to construct an approximation to the target function $f$. Recall that we do not know this function but we do know that it exists. We will symbolize this approximation $g$ and we will call it a hypothesis drawn from a hypothesis set. The hypothesis is parametrized with a set of weights, the vector $\mathbf w$ and can be iteratively constructed so the final hypothesis, that results from the best possible $\mathbf w$, $\mathbf w^*$, is the one that is used to produce the predicted label $\hat{y}$.
The ability to optimally predict, according to a criterion, when observing data that we have never seen before, the test set, is called generalization. Note that in the literature supervised learning is also called inductive learning. Induction is reasoning from observed training cases to general rules (e.g. the final hypothesis function), which are then applied to the test cases.
In summary, to learn we need three components:
- Data that may be stored (batch) or streamed (online).
- Learning algorithm that optimizes an objective function
- Hypothesis set