Machine Learning
Supervised learning & Unsupervised learning
Starting Point
Outcome measurement $Y$ (also called the dependent variable, response, or target)
- In the regression problem, $Y$ is quantitative.
- In the classification problem, $Y$ takes values in a finite, unordered set.
Vector of $p$ predictor measurements $X$ (also called inputs, regressors, covariates, features, independent variables)
Unsupervised Learning
Starting Point
- No outcome variable, just a set of predictors (features) measured on a set of samples.
- The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation (see the sketch after this list).
- Difficult to know how well you are doing.
- Different from supervised learning, but can be useful as a pre-processing step for supervised learning or as an exploratory analysis tool
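As a rough illustration of the last objective above (finding linear combinations of features with the most variation), here is a minimal principal-component sketch in NumPy; the synthetic data and variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

# Synthetic, unlabeled data: 200 samples, 5 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Center the features, then take the SVD of the data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions: linear combinations of the
# features, ordered by how much variation they capture.
explained_var = s**2 / (len(Xc) - 1)
print("First principal direction:", Vt[0])
print("Variance explained by each component:", explained_var)
```

Because there is no outcome variable, there is no single error metric here; judging the result is exactly the "difficult to know how well you are doing" problem noted above.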
Our objectives
- Accurately predict unseen test cases.
- Understand which inputs affect the outcome, and how.
- Assess the quality of our predictions and inferences.
The goal of ML is to generalize knowledge beyond the training examples.
Philosophy
- It is important to understand the ideas behind the various techniques, in order to know how and when to use them.
- One has to understand the simpler methods first, in order to grasp the more sophisticated ones.
- It is important to accurately assess the performance of a method, to know how well or how badly it is working [simple methods often perform as well as fancier ones!]
Supervised learning: regression problems
Identify the features and the response
$X$(Independent variable, feature, covariate, input): TV, Radio, Newspaper
$Y$(Dependent variable, target, response, output): Sales
We try to build a model: $$ \text{Sales} \approx f(\text{TV, Radio, Newspaper}) $$ We can refer to the input vector collectively as: $$ X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right) $$ Now we can write our model as: $$ Y=f(X) + \varepsilon $$ where $\varepsilon$ captures measurement errors and other discrepancies.
What is the regression function?
The ideal $f(x)=E(Y|X=x)$ is called the regression function.
What is our goal?
$f(x)$ is the optimal predictor of $Y$ with respect to mean-squared prediction error: it minimizes $$ E\left[(Y-g(X))^{2} \mid X=x\right] $$ over all functions $g$.
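A one-step decomposition (a standard identity, added here only for completeness) shows why the conditional mean is the minimizer: for any candidate $g$, $$ E\left[(Y-g(X))^{2} \mid X=x\right]=E\left[(Y-f(x))^{2} \mid X=x\right]+(f(x)-g(x))^{2}, $$ since the cross term is proportional to $E[Y-f(x) \mid X=x]=0$. The first term does not depend on $g$, so the error is minimized by taking $g(x)=f(x)$.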
How to estimate $f$?
Typically we have few if any data points with $X=x$ exactly, so we cannot compute $E(Y|X=x)$!
What we do is to relax the definition and let:
$$ \hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x)) $$ where $\mathcal{N}(x)$ is some neighborhood of $x$.
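A minimal sketch of this relaxed estimator, taking $\mathcal{N}(x)$ to be the $k$ nearest training points (a nearest-neighbor average); the training data and the choice $k=5$ are illustrative assumptions.

```python
import numpy as np

def knn_regress(x0, X_train, y_train, k=5):
    """Estimate f(x0) as the average response over the k nearest training points."""
    dists = np.abs(X_train - x0)        # 1-D predictor for simplicity
    nearest = np.argsort(dists)[:k]     # indices of the k closest x_i
    return y_train[nearest].mean()      # Ave(Y | X in N(x0))

# Illustrative training data: y = sin(x) + noise.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 6, size=100)
y_train = np.sin(X_train) + rng.normal(scale=0.3, size=100)

print(knn_regress(2.0, X_train, y_train))   # rough estimate of E(Y | X = 2)
```

As the neighborhood (here, $k$) grows, the estimate becomes smoother but less local; this is the same flexibility choice discussed below.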
Build a linear model
$$ f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p} $$
- A linear model is specified in terms of $p+1$ parameters $\beta_0,\beta_1,\ldots,\beta_p$.
- We estimate the parameters by fitting the model to training data (a fitting sketch follows this list).
- Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
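One common way to fit such a model is ordinary least squares; the sketch below does this with NumPy on synthetic data standing in for the TV/Radio/Newspaper example (the data and the true coefficients are assumptions for illustration).

```python
import numpy as np

# Synthetic data with p = 3 predictors (stand-ins for TV, Radio, Newspaper).
rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 100, size=(n, 3))
beta_true = np.array([3.0, 0.05, 0.2, 0.01])        # assumed beta_0, ..., beta_3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)

# Fit f_L(X) = beta_0 + beta_1 X_1 + ... + beta_p X_p by least squares.
X_design = np.column_stack([np.ones(n), X])          # column of 1s for the intercept beta_0
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients:", beta_hat)           # should be close to beta_true
```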
Interpretability and Flexibility
- Why is under-fitting bad?
- If a model under-fits, it cannot even fit the training data well, let alone generalize to test data or real-world cases.
- Why is over-fitting bad?
- Although the model fits the training data well, it fits too well: it captures noise in the training data, so it cannot be used reliably on new cases.
- How do we know when the fit is just right?
- Parsimony vs. black box
- We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
Assessing Model Accuracy
Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\{x_{i}, y_{i}\}_{1}^{n}$, and we wish to see how it performs.
We could compute the average squared prediction error over $\operatorname{Tr}$: $$ MSE_{\operatorname{Tr}}=\operatorname{Ave}_{i \in \operatorname{Tr}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$ And then we compute it using fresh test data $\operatorname{Te}=\{x_{i}, y_{i}\}_{1}^{n}$: $$ MSE_{\operatorname{Te}}=\operatorname{Ave}_{i \in \operatorname{Te}}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2} $$
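In code the two quantities are computed identically, just over different index sets; the sketch below uses an illustrative dataset and a polynomial fit whose degree plays the role of flexibility (all names and data are assumptions, not the notes' example).

```python
import numpy as np

def mse(y, y_hat):
    """Average squared prediction error Ave[(y_i - f_hat(x_i))^2]."""
    return np.mean((y - y_hat) ** 2)

# Illustrative data split into training (Tr) and test (Te) index sets.
rng = np.random.default_rng(3)
x = rng.uniform(0, 6, size=300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
train, test = np.arange(200), np.arange(200, 300)

degree = 3                                                    # flexibility of the fit
f_hat = np.poly1d(np.polyfit(x[train], y[train], degree))     # fit on Tr only

print("MSE_Tr:", mse(y[train], f_hat(x[train])))
print("MSE_Te:", mse(y[test], f_hat(x[test])))
```

Increasing `degree` keeps driving $MSE_{\operatorname{Tr}}$ down, while $MSE_{\operatorname{Te}}$ eventually starts to rise, which is what the figure described next illustrates.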
[Figure: the black curve is the truth; the red curve (right panel) is $MSE_{\operatorname{Te}}$ and the grey curve is $MSE_{\operatorname{Tr}}$; orange, blue, and green curves/squares correspond to fits of different flexibility.]
Choosing the flexibility based on average test error amounts to a bias-variance trade-off.
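The trade-off can be stated precisely with the standard decomposition of the expected test error at a point $x_0$ (a textbook identity, included here for completeness): $$ E\left[\left(y_{0}-\hat{f}\left(x_{0}\right)\right)^{2}\right]=\operatorname{Var}\left(\hat{f}\left(x_{0}\right)\right)+\left[\operatorname{Bias}\left(\hat{f}\left(x_{0}\right)\right)\right]^{2}+\operatorname{Var}(\varepsilon) $$ More flexible methods typically have lower bias but higher variance, so the test error is minimized at an intermediate flexibility.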
Supervised learning: classification problems
Here the response variable $Y$ is qualitative, e.g. email is one of $\mathcal{C} = \{\text{spam}, \text{ham}\}$ (ham is good email); digit class is one of $\mathcal{C} = \{0, 1, \ldots, 9\}$.
Our goals
- Build a classifier $C(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$
- Assess the uncertainty in each classification
- Understand the roles of the different predictors among $X=(X_1, X_2,\ldots,X_p)$
Bayes optimal classifier
Suppose the $K$ elements in $\mathcal{C}$ are numbered $1,2,\ldots,K$. Let: $$ p_k(x)=\operatorname{Pr}(Y=k \mid X=x), \quad k=1,2,\ldots,K $$ These are the conditional (posterior) class probabilities at $x$. If these class probabilities are known, the Bayes optimal classifier at $x$ is: $$ C(x)=j \text { if } p_{j}(x)=\max \left\{p_{1}(x), p_{2}(x), \ldots, p_{K}(x)\right\} $$
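A minimal sketch of this rule, assuming the class probabilities $p_k(x)$ are handed to us by a function `class_probs` (in practice they must be estimated, which is exactly what classification methods try to do):

```python
import numpy as np

def bayes_classifier(x, class_probs):
    """Return the class j that maximizes p_j(x), given a function returning
    the vector (p_1(x), ..., p_K(x)) of conditional class probabilities."""
    p = np.asarray(class_probs(x))
    return int(np.argmax(p)) + 1            # classes numbered 1, ..., K

# Illustrative two-class example with assumed known probabilities:
# p_1(x) = sigmoid(x), p_2(x) = 1 - sigmoid(x).
def class_probs(x):
    p1 = 1.0 / (1.0 + np.exp(-x))
    return [p1, 1.0 - p1]

print(bayes_classifier(1.5, class_probs))   # -> 1, since p_1(1.5) > p_2(1.5)
print(bayes_classifier(-2.0, class_probs))  # -> 2
```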