The primary goal of machine learning is to leverage machine intelligence to derive an appropriate answer to a given question or task. Mathematically, the true relationship between a family of questions and their corresponding answers can be thought of as a composition of several hidden, unknown, and non-elementary functions.
Hence creating a universal engine capable of solving any type of task is highly challenging and nearly impractical—although reasoning LLMs have recently shown promising progress in this direction. As a result, different approaches are typically employed to address specific types of tasks. One common approach is supervised learning, which is well-suited for scenarios where the desired output varies systematically with changes in the input. 1
Given a set of data points $\{x_n\}_{n=1}^N$ and $\{y_n\}_{n=1}^N$, our goal is to construct a model that learns a function $f$ that nicely predicts (or helps to predict) the corresponding output $y$ from input $x$. Our data $x$ and $y$ are often fixed-dimensional vectors of numbers, i.e. $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^C$ for arbitrary numbers $D$ and $C$. We usually focus on problems where $y$ is univariate ($C = 1$).
Terminology
- The inputs $x$ are also called features, covariates, or predictors.
- The outputs $y$ are also called labels, targets, or responses.
- A pair $(x_n, y_n)$ is called a training example, and the list of $N$ training examples $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ is called a training set.
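As a concrete illustration (toy numbers, not from any real dataset), a training set with $N = 4$ examples, inputs in $\mathbb{R}^2$, and univariate outputs can be stored as plain arrays:

```python
import numpy as np

# Toy training set D = {(x_n, y_n)}: N = 4 examples,
# each input x_n in R^2 (D = 2), each output y_n a scalar (C = 1).
X = np.array([[1.0, 2.0],
              [0.5, -1.0],
              [3.0, 0.0],
              [-2.0, 1.5]])          # shape (N, D) = (4, 2)
y = np.array([1.2, -0.3, 2.5, 0.7])  # shape (N,) = (4,)
```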
Regression vs Classification
Supervised learning tasks can be divided into two categories according to the nature of the output space $\mathcal{Y}$.
Tasks where the output space $\mathcal{Y}$ is continuous (e.g. $\mathcal{Y} = \mathbb{R}$) are called regression problems. Linear regression is a popular method for solving regression problems.
Tasks where $\mathcal{Y}$ is a set of discrete labels (e.g. $\mathcal{Y} = \{1, 2, \dots, C\}$) are called classification problems. In particular, if there are just two classes, often formulated as $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, +1\}$, the task is called binary classification. Logistic regression, SVMs, and LDA are examples of methods for solving classification problems.
Models, Training and Prediction
Now that we have our task specified, we need to find a way to solve it.
In practice, we first construct a parameterized model and then adjust the parameters based on chosen evaluation methods to make the model as close to the true function as possible. Afterward, we can use our model to answer arbitrary questions within the family that we began with. These two processes are called training and prediction, respectively.
Deterministic Models
Models in supervised learning refer to parameterized functions that can be used to derive the desired output from inputs. The most intuitive way to construct a model is to directly express the underlying function that maps the inputs to the outputs. These models are called deterministic models, in contrast to probabilistic models that give the probability distribution of the output given the input. Deterministic models are more intuitive as they directly give the desired targets as function outputs, while probabilistic models are more flexible and can capture the uncertainty in the mapping.
Let's first look at deterministic models. We use the notation $f(x; \theta)$ to refer to a model parameterized by $\theta$, where $x$ is the input variable.
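A minimal sketch of this notation, assuming a linear model (the parameter values are purely illustrative):

```python
import numpy as np

def f(x, theta):
    """A deterministic model f(x; theta): here simply a linear map."""
    return float(np.dot(x, theta))

theta = np.array([2.0, -1.0])   # illustrative parameter values
x = np.array([1.0, 1.0])
prediction = f(x, theta)        # 2*1 + (-1)*1 = 1.0
```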
Training : Empirical Risk Minimization
The process of determining the optimal parameter values for a model is referred to as training, fitting, or learning (from the model's perspective). Since our goal is to make the model closely approximate the true underlying function, we define a loss function, which quantifies the discrepancy between the model's predictions and the observed targets.
Loss functions in general have the form $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: a pair of two vectors of the same dimension from an arbitrary set is given as input, and an evaluation of their closeness is derived as output. Loss functions do not necessarily have to be symmetric, i.e. a change in the order of the inputs may give different outputs. For a supervised learning setup, a specific target $y_n$ and its estimate $\hat{y}_n = f(x_n; \theta)$ will be the inputs to the loss function.
Now we define a cost function, or the empirical risk, as the average loss over the entire training set:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(y_n, f(x_n; \theta)\bigr)$$
Then our task reduces to an optimization problem: find the parameters that minimize the empirical risk.

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \; \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(y_n, f(x_n; \theta)\bigr)$$
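The steps above can be sketched end to end; the sketch below assumes a linear model and the squared loss, trained by plain gradient descent on synthetic data (all values illustrative):

```python
import numpy as np

# Synthetic regression data: y = x . theta_true + noise.
rng = np.random.default_rng(0)
N, D = 100, 3
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ theta_true + 0.1 * rng.normal(size=N)

def empirical_risk(theta):
    """Average squared loss over the training set."""
    return np.mean((y - X @ theta) ** 2)

# Minimize the empirical risk by gradient descent.
theta = np.zeros(D)
lr = 0.1
for _ in range(500):
    grad = -2.0 / N * X.T @ (y - X @ theta)  # gradient of the mean squared loss
    theta -= lr * grad
```

After training, `theta` is close to `theta_true`, and the remaining empirical risk is roughly the noise variance.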
Prediction
After $\hat{\theta}$ is found, we can use $f(x; \hat{\theta})$ as a prediction of the label corresponding to a new input $x$.
Probabilistic Models
In many cases though, it is hard to perfectly determine the correct output given the input. This may be due to lack of knowledge of the input-output mapping (called epistemic uncertainty or model uncertainty), or due to intrinsic (irreducible) stochasticity in the mapping (called aleatoric uncertainty or data uncertainty).
In these cases, we instead capture the uncertainty itself by modeling the conditional probability distribution of the variables. This can be done for either the inputs or the outputs, and in both cases the variables may be discrete or continuous.
If the variable of our interest is discrete, e.g. a class label $y$ in a classification task, we can set our model to take a data point $x$ as input and return the probability mass of $y$ over the possible labels as output. Therefore the probability that $y$ equals a specific label $c$ will be:

$$p(y = c \mid x) = f_c(x; \theta)$$

where $f_c(x; \theta)$ denotes the probability mass the model assigns to label $c$.
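For instance, a common choice of discrete model is a linear map followed by a softmax, so the output is a valid probability mass over $C$ labels (the weights below are random placeholders, not a trained model):

```python
import numpy as np

def softmax(z):
    """Map arbitrary scores to a probability mass (stable against overflow)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
C, D = 3, 4                      # 3 classes, 4 input features (illustrative)
W, b = rng.normal(size=(C, D)), np.zeros(C)

x = rng.normal(size=D)
p = softmax(W @ x + b)           # p[c] plays the role of p(y = c | x; theta)
```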
If we are to predict continuous variables, we can set our model to take a data point (from either $\mathcal{X}$ or $\mathcal{Y}$) as input and return a probability density over its counterpart as output. Generally we design our model to take $x$ as input and output a distribution over $y$, i.e. we estimate the probability of $y$ given $x$. These models are called discriminative, and we get the following conditional probability distribution:

$$p(y \mid x; \theta)$$
In contrast, if the model takes $y$ as input and outputs a distribution over $x$, i.e. we estimate the probability of $x$ given $y$, the model is called generative. Note that we are predicting $x$, not $y$, so $y$ need not be continuous, and this case includes classification problems too. Now our conditional probability distribution is:

$$p(x \mid y; \theta)$$
It is also common to model only a portion of the distribution and fix other attributes. For example, in regression problems, it is common to assume the output distribution is a Gaussian. Here we can make only the mean depend on the inputs and assume the variance is fixed. The resulting model will look like:

$$p(y \mid x; \theta) = \mathcal{N}\bigl(y \mid f(x; \theta), \sigma^2\bigr)$$
Training : Maximum Likelihood Estimation (MLE)
Suppose we constructed a model that predicts the probability distribution $p(y \mid x)$.2 Say the model is parameterized by $\theta$; then we can denote the distribution as $p(y \mid x; \theta)$. Intuitively, given a set of i.i.d. samples $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$, we can say that the model is good if it assigns high probability to the actually sampled values of $y$. In other words, the model should be built in a way that maximizes the likelihood of the given samples. Mathematically, we define this likelihood3 as:

$$\operatorname{Lik}(\theta) = p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(y_n \mid x_n; \theta)$$
The second equality comes from the i.i.d. assumption of the samples. In practice, it is common to convert the maximization problem to a minimization problem and use the negative log likelihood as the loss function. The negative log likelihood is defined as:

$$\mathrm{NLL}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid x_n; \theta)$$
The factor $\frac{1}{N}$ is added for convenience, to format the objective into a finite-sum optimization problem. Now our task reduces to an optimization problem: find the parameters that minimize the negative log likelihood.

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \; \mathrm{NLL}(\theta)$$
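As a sanity check connecting MLE back to empirical risk minimization, plugging the fixed-variance Gaussian model from above into the NLL gives (a standard derivation):

$$
\begin{aligned}
\mathrm{NLL}(\theta)
&= -\frac{1}{N}\sum_{n=1}^{N}\log \mathcal{N}\!\bigl(y_n \mid f(x_n;\theta),\, \sigma^2\bigr) \\
&= \frac{1}{2\sigma^2}\,\frac{1}{N}\sum_{n=1}^{N}\bigl(y_n - f(x_n;\theta)\bigr)^2
   + \frac{1}{2}\log\bigl(2\pi\sigma^2\bigr)
\end{aligned}
$$

so for a fixed $\sigma^2$, minimizing the NLL is exactly empirical risk minimization with the squared loss.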
Training : Cross-Entropy Loss
As outlined above, a loss function measures the distance between two vectors of the same dimension. When the negative log-likelihood (NLL) is used as the loss function, a natural question arises: what kind of distance is it evaluating? Intuitively, since our model estimates a probability distribution, it seems plausible that the NLL represents a meaningful measure of the difference between the true distribution and the model's estimation.
This intuition is supported by two key concepts: the Kullback-Leibler (KL) divergence and entropy. For simplicity, we will focus on discrete random variables for now. In the context of information theory, given probability masses $p$ and $q$, the entropy and cross-entropy are defined as follows:

$$\mathbb{H}(p) = -\sum_{x} p(x) \log p(x), \qquad \mathbb{H}(p, q) = -\sum_{x} p(x) \log q(x)$$
The entropy of a probability distribution can be interpreted as a measure of uncertainty, or lack of predictability, of the data drawn from the distribution. To make this precise, from an informational perspective, entropy corresponds to the expected value of the information content of a data source. The choice of the logarithm base may vary across applications; e.g. $\log_2$ measures information in units of bits, in which case the entropy is the expected number of bits needed to encode a sample from the source.
Meanwhile, the cross-entropy of two probability distributions has a similar form, but the difference is that it evaluates the expected information content of a data source (sampled from $p$) under a different distribution ($q$). Analogous to the "bits" interpretation of entropy, the cross-entropy can be understood as the expected number of bits needed to compress data sampled from a distribution $p$ when using codes optimized for another distribution $q$.
Now the KL divergence is defined and can be expanded as follows:

$$D_{\mathrm{KL}}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x) = -\mathbb{H}(p) + \mathbb{H}(p, q)$$
Note that the KL divergence is the sum of the negative entropy and the cross-entropy. Based on the "bits" interpretation of entropy and cross-entropy, the KL divergence represents the extra number of bits required to encode data from one distribution using a coding scheme optimized for a different distribution. This, intuitively, serves as a measure of the divergence or distance between the two distributions: the farther apart the distributions are, the greater the extra number of bits required will be.
Now suppose $p$ is the empirical distribution of our interest and $q$ is our model. Then $p$ can be defined as follows, where we put a probability mass $\frac{1}{N}$ on each observed training data point and zero mass everywhere else:

$$p(y) = \frac{1}{N} \sum_{n=1}^{N} \delta_{y, y_n}$$
In this case where $y$ is discrete, $\delta$ represents the Kronecker delta. Also note that $\mathbb{H}(p)$ is independent of the parameter $\theta$; only $\mathbb{H}(p, q)$ relates to $\theta$. Finally we can show the equivalence of maximum likelihood estimation and KL divergence minimization as follows:

$$\underset{\theta}{\operatorname{argmin}} \; D_{\mathrm{KL}}(p \parallel q_\theta) = \underset{\theta}{\operatorname{argmin}} \; \mathbb{H}(p, q_\theta) = \underset{\theta}{\operatorname{argmin}} \left( -\frac{1}{N} \sum_{n=1}^{N} \log q_\theta(y_n) \right) = \underset{\theta}{\operatorname{argmin}} \; \mathrm{NLL}(\theta)$$
Now our initial intuition that maximum likelihood estimation minimizes the distance between the true distribution and our model has been validated. Moreover, since minimizing the KL divergence involved nothing but minimizing the cross-entropy, we can directly use the cross-entropy as a loss function, simply called the cross-entropy loss, as a more concrete and explicit formulation of the NLL for use in optimization.
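The discrete definitions above are straightforward to compute directly; a minimal sketch with toy distributions (natural log throughout):

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

def kl_divergence(p, q):
    """KL(p || q) = -H(p) + H(p, q)."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.5])   # "true" distribution (illustrative)
q = np.array([0.9, 0.1])   # model distribution (illustrative)
```

Note that `kl_divergence(p, p)` is zero, and the divergence grows as `q` drifts away from `p`, matching the "extra bits" interpretation.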
The same can be shown for continuous random variables. For probability densities $p$ and $q$, the entropy, cross-entropy, and KL divergence are defined as follows:

$$\mathbb{H}(p) = -\int p(x) \log p(x) \, dx, \qquad \mathbb{H}(p, q) = -\int p(x) \log q(x) \, dx, \qquad D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
Again, suppose $p$ is the empirical distribution of our interest and $q$ is our model. This time $p$ is as follows, which is slightly different in that we use the Dirac delta function instead of the Kronecker delta:

$$p(y) = \frac{1}{N} \sum_{n=1}^{N} \delta(y - y_n)$$
Similarly, we can show the equivalence of maximum likelihood estimation and KL divergence minimization:

$$\underset{\theta}{\operatorname{argmin}} \; D_{\mathrm{KL}}(p \parallel q_\theta) = \underset{\theta}{\operatorname{argmin}} \; \mathbb{H}(p, q_\theta) = \underset{\theta}{\operatorname{argmin}} \left( -\frac{1}{N} \sum_{n=1}^{N} \log q_\theta(y_n) \right) = \underset{\theta}{\operatorname{argmin}} \; \mathrm{NLL}(\theta)$$
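For the continuous case, the definitions can be checked numerically; the sketch below compares the closed-form KL divergence between two univariate Gaussians (a standard formula) against a direct Riemann-sum approximation of the integral, with illustrative parameters:

```python
import numpy as np

def gauss(x, mu, s):
    """Density of N(mu, s^2)."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

mu_p, s_p = 0.0, 1.0   # p = N(0, 1)  (illustrative)
mu_q, s_q = 1.0, 2.0   # q = N(1, 4)  (illustrative)

# Closed form: KL(N(mu_p, s_p^2) || N(mu_q, s_q^2))
kl_closed = (np.log(s_q / s_p)
             + (s_p**2 + (mu_p - mu_q)**2) / (2.0 * s_q**2)
             - 0.5)

# Direct Riemann-sum approximation of  integral p(x) log(p(x)/q(x)) dx
x = np.linspace(-10.0, 10.0, 200_001)
p, q = gauss(x, mu_p, s_p), gauss(x, mu_q, s_q)
kl_numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])
```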
Prediction
After $\hat{\theta}$ is found, we can sample from the model to get the predicted output. For discriminative models, we can sample from the distribution $p(y \mid x; \hat{\theta})$ to get the predicted label. For generative models, we can sample from the distribution $p(x \mid y; \hat{\theta})$ to get the predicted input.
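A sketch of prediction from a discriminative probabilistic model, assuming the model has already produced class probabilities for some input (the values below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.7, 0.2])             # assumed p(y | x; theta_hat) for some x

sampled_label = rng.choice(len(p), p=p)   # a random draw from p(y | x)
map_label = int(np.argmax(p))             # deterministic alternative: the mode
```

In practice the mode (argmax) is often reported instead of a sample, since it is the single most probable label.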
Footnotes
- Some other widely used approaches are unsupervised learning and reinforcement learning. Together with supervised learning, these three are the most common paradigms in machine learning. The target tasks of each method can be described in the following format, where text in brackets represents elements that may vary depending on the specific task:
  - Supervised Learning: Produce [output] satisfying [condition] relative to [input], given a collection of ([input], [output]) pairs.
  - Unsupervised Learning: Produce [output] satisfying [condition], given a collection of possible [output]s.
  - Reinforcement Learning: Find the optimal next action to perform in a [single|multi]-state environment, given rewards and [∅|transitions]. ↩
- In Bayesian statistics, the likelihood is defined as the probability of the variable of our interest given a variable with prior knowledge. In this case, the likelihood is the probability of the observed data given the parameters of the model, i.e. we are using $p(\mathcal{D} \mid \theta)$ and the likelihood of $\theta$ interchangeably. ↩