supervised learning 的缺点: little labelled data available generating labels is expensive Unsupervised learning Discover structure in only X1, X2, …, Xp Part of explorative data analysis, very useful in data science Main problem: no clear way to check the answer Two types of approaches: dimensionality reduction, clustering Principal component analysis(PCA) find a linear low-dimensional representation that captures as much variation in the data as possible( projection from p dimensions to d < p) Dimensionality reduction How to investigate high-dimensional data? scatterplots, correlation or covariance, select most interesting 概念 Projecting points on a line Zi (principal component, PC): maximum variation along the line Zi数据在这个方向上的分布变化幅度最大,也就是minimum distance to the line Zi通过最大化方差来选择主成分方向,等价于通过最小化距离来选择低维子空间。 Principal components are ordered: Z1 is the direction of most variation,Z2 is the direction of most variation perpendicular to Z1, Z3 is the direction of most variation perpendicular to Z1 and Z2 Find the first principal component 通过使载荷向量的长度最大化(1)来最大化方差 size does matter Scaling to unit standard deviation is useful if features have different ranges Scaling is essential if measured in different units, If units are the same, maybe we do not want to scale to unit variance as we may inflate noise这是因为单位方差假设所有特征的方差应该相等,但在这种情况下,这种假设可能不合适。 biplot PVE(proportion of variance explained)
Separating Hyperplanes when you want to find a decision boundary, avoid estimating densities linear decision boundry,就是设置这个式子为0,就是它的decision boundry hyperplanes separate feature space into regions for new X, 可以带入最后一个式子,然后看y=1,还是y=-1来分类 exercise1. This problem involves hyperplanes in two dimensions. (a) Sketch the hyperplane 1 + 3X1 − X2 = 0. Indicate the set of points for which 1 + 3X1 − X2 > 0, as well as the set of points for which 1 + 3X1 − X2 < 0. (b) On the same plot, sketch the hyperplane −2 + X1 + 2X2 = 0. Indicate the set of points for which −2 + X1 + 2X2 > 0, as well as the set of points for which −2 + X1 + 2X2 < 0. Maximum margin classifiers (怎么找一个最佳的hyperplane) generalization link to regularization for generalization, the decision boundary should lie between the class boundaries Maximize perpendicular distance(最大化垂直距离) between the decision boundary and the ==nearest observations==: the margin M the decision boundary then only depends on a few points on the margin, the support vectors:if you remove one from other ob, nothing changes 所以, leave one out will fail Construction maximize the margin(尝试最大边际化), under the constraint that training observations are classified correctly Limitation ● separable classes ● linear separability ● two classes Problems Maximum margin classifier is prone to ==overfitting==: very sensitive to training set When classes ==overlap==, separating hyperplane does not exist We need to make a trade-off between errors on the training set and predicted performance on the test set (generalization) Solution: The soft margin 为了解决上面的问题 Solution: ==allow (some, small) errors on the training set,==introducing slack (松弛)variables ∊i ≥ 0»>Add slack variables to maximum margin classifier, but limit total slack to C: the trade-off parameter 变化:M» M(1-e),然后引入一个C,误差平方和<=C C influence solution: There is no a prori best choice for C The multi-class case one-versus-one one- versus-all exercise Optimization(optional) 从最大化M(margin)到最小化一个系数w和常数b的平方和 但是为了解决is a (large) quadratic programming (QP) problem 又引入了lagrange multipliers alphai The nonlinear case Ideal: classes may become linearly separable if higher order terms are added(cf. nonlinear regression) ![[22e573c5252d8db1948d586ddaa54dd.png]] Questions feature是要平方还是开方,等等,我不知道 efficiently train the SVC,就是p= 10 ,如果要3次,就会有286种组合方式了 The Kernel Trick 训练阶段,我们需要计算所有训练样本之间的内积,如 xi 和 xi’ 之间的内积。训练完成后,当我们需要对新的样本进行分类时,我们只需要计算支持向量 (support vectors) 与新样本之间的内积。 引入一个用于泛化的核函数Think of kernel functions as similarities: large when the inputs are very alike, small when they are not 到这一步了,才有两种方法可以选择K(x_i, x) Polynomial kernel Radial kernel The support vector machine Choosing Kernels What kernel functions should we use? type: prior knowledge of problem, trial-and-error parameters: cross-validation, like for C exercise More Kernels A large number of kernels have been proposed,not limited to numerical/vector data! ● Vector kernels ● Set kernels ● String kernels ● Empirical kernel map ● Kernel kernels ● Kernel combination ● Kernels on graphs ● Kernels in graphs ● Kernels on probabilistic models Recap Separating hyperplane: any plane that separates classes The support vector machine has evolved from fundamental work by Vapnik: separable, linear problems: maximum margin classifier non-separable, linear problems - add slack variables:support vector classifier non-separable, non-linear problems - use kernel trick:support vector machine Training involves quadratic programming (optimization) The final classifier only depends on the support vectors
Decision Trees We categorize data through chain of simple decisions. a binary tree, which is a graph data structure in which each node has at most two children Terminology input features: data to be classified root node: starts the decision process decision nodes: Each decision node (split node) implements a test function with discrete outcomes The test function of each decision node splits the input space into regions leaf nodes: A leaf node indicates the end of a sequence of decisions A single (output) class is associated to each leaf node Algorithm to make predictions Increasing node purity every decision should increase the node purity Quantifying Node Purity Measuring Node Purity Low Entropy: more pure;Higher Entropy: less pure»H(p) how flat Gini Index: higher gini index ,less pure 0.25* 0.75 4 times 0.45* 0.55 2 times + 0.05 * 0.95 2 times Regression Trees leaf nodes return the average sample variance as node purity,high variance before splitting, low variacne after splitting Classification versus Regression Trees How to find the thresholde t? Try different thresholds optimum as best recordede threshold Binary Splitting we test as many thresholds as we have points ![[ab2533bf07de0f47d6c10fc874cc20f.png]] Growing a Decision tree with Recursive Binary Splitting: perform binary splitting recursively for each decision node to maximize node purity We end uo with one sample per node = a 1-NN regressor/classifier Overfitting of DTs Tree fits training data perfectly (zero variance) Tree extremely large, and uninterpretable Regularization of DTs Restricting Depth Tree fits training data perfectly (zero variance) Tree extremely large, and uninterpretable Tree Pruning Slow Grows(限制生长,greedy) Strategy: Grow first the tree slowly with additional checks during growing • Simple checks: maximum depth,max number of leaf nodes. • Add decision node only if increase in node purity is above a certain threshold Fast Cutters(先长再切, non-greedy) Strategy: Grow the full tree and then remove not important nodes later. • Cost complexity pruning: we identify the weakest link in the tree to remove Cost Complexity Pruning Remove leafs to reduce the complexity, while maintaining a low training error, i.e., cost. The hyperparameter 𝛼 balances both objectives. Clamada(T) = C(T) + lamada T 就是cost(training error)+complexity(number leafs) higher alpha: produces shallower tree with less nodes 如何利用 cost complexity pruning来构建树 ![[638c2052311e5c539c5d454e45e1a73.png]] Advantages and Disadvantages of Trees Advantages: ● Inexpensive to construct ● Extremely fast at classifying unknown records ● Easy to interpret for small-sized trees ● Robust to noise (especially when methods to avoid overfitting are employed) ● Can easily handle redundant or irrelevant attributes (unless the attributes are interacting) Disadvantages: ● Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree. ● Does not take into account interactions between attributes ● Each decision boundary involves only a single attribute ● too deep ,more sensitive to noise Bagging, Forest,Boosting Bagging是在干啥: we split the feature space into regions The modes(i.e., centers) of the data distributions are well assigned to the classes. But the uncertain overlapping region is a mess low bias each tree fiting vey well high variance > difference between different trees higher Bias - Variance Trade-off: Finding a balance between fitting and overfitting for tree-like classifiiers Strategy 1: Reduce DT complexity Greedy (restrict DT complexity) Non-greedy (post-hoc pruning of the decision tree) Stragegy 2 Ensemble multiple Decision Trees together集成多个决策树:森林!! Step 1 Bagging split different data subset: randomly sampled subsets (bootstrap) ensemble prediction: majority voting for classification output averaging for regression effect: Decision boundaries become blurry: more realistic boundary Out-of-Bag Error Estimation Process: Fit decision tree DT_i on 𝒟_i , Test its performance on the out-of-bag set 𝒟_oob Repeat for each tree in the bag Idea: - This OOB error approximates the test error. Similar idea to cross-validation, but for every DT separately Re-gaining Interpretability Bagging trees improves accuracy over a single tree through increasing diversity in the decisions 意思就是随机性增加了,所以增加了精度 But we loose interpretability: Each tree has different decision rules Interpreting many decision rules in parallel is difficult Strategy: Measure the increase in node purity across trees 就是看每个感兴趣点(橘色点)的node purity,然后求平均 if bagging trees are still highly correlated with each other
Motivation prediction accuracy avoid overfitting 特别是p>n 的时候,就是变量大于样本量 model interpreteability lower practical requirements:fewer measurements, less computation visualization, if selecting 2-3 features Introduction linear model, least squares minimization Solution feature subset selection: select feature extraction(PCA): map to shrinkage: adjust coefficients so that some features are used to a lesser extent (or not at all) regularization: reducing the model flexibility Feature subset selection Basic approach: select the best features Problem: the k best individual features are not necessarily the best set of k features(bc, the corelated ) Best Subset Selection(Exhaustive) Search space 2^p -1 subsets Why not try all possible subsets then? 虽然尝试所有可能的子集是理论上的完美方法,但在实际情况下,由于搜索空间的巨大,这几乎是不可行的 algorithm: 为了在这个庞大的搜索空间中找到最佳的子集,最佳子集选择方法通常使用贪婪算法。该算法从空集开始,然后在每一步中添加一个特征,从而构建一个特征子集。在每一步中,算法评估当前特征子集的性能,并选择对性能有最大贡献的特征添加到子集中。该过程一直持续到达到所需的特征数量k为止。 let M_0 denote the null model for k =1,2….p>fit all models that use k predictors> M_k is the best of these(smallest RSS or largest R^2)> select a single best model from M_0….M_p based on CV error, C,AIC,BIC,adjusted R^2 缺点 computationally intensive when 𝑝 is large, risk finding models that do well on training data but do not generalize well Stepwise Selection Forward Selection start with no predictors, iteratively add predictors until a stop criterion is reached Computational advantage is clear, but may miss optimal subset
Motivation Bias and Variance Prediction errors are due to Bias: the inability of the model to capture essential interactions, processes, or dynamics of the true system. Bias is associated with ==non-flexibility==, or ==lacking information in the data set== Variance: sensitivity of coefficients towards errors or variations in the data. Variance is associated with ==flexibility.== Irreducible errors: measurement errors that cannot be predicted 为什么 test 会U形,training 随着flexibility 下降 The challenge lies in finding a method for which both the variance and the squared bias are low. 两个评价方面 Model assessment (performance) Model selection (flexibility) Test error and training error test error:used in its training training error: not used in its training 其实下面这个问题就是那个图 The training error rate often is quite different from the test error rate, and in particular the training error can considerably ==underestimate== the test error What is the most likely cause, bias or variance? Variance, a sensitivity towards the configuration of the data (training error is small, test error is large) Best estimation of test error: ==a large== designated test set. (大量的test set) Sometimes not available mathematical ==adjustment== to the training error rate in order to estimate the test error rate For example: R2adjusted. Others: Cp statistic, AIC and BIC In this lecture we consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held out observations The Validation Set Approach Here we ==randomly== divide the available set of samples into two parts: a ==training set ==and a ==validation or hold-out set== Drawbacks test error is highly variable, and only half of the obsevations » hence, the validation set error tends to overestimate the test error 50% of the data: inability to incorporate all information from the data set (bias) Leave one out cross validation Steps spliting the set of ob into two parts single ob is used for the validation set remaining ob make up the training set prediction y^_1 ,using x_1 Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − y ̂1)2 provides an approximately unbiased estimate for the test error repeat MSE1….MSEn - average these: CV(n) Advantages and Disadvantages (compared to the validation set approach) It tends not to overestimate the test error rate as much as the validation set (there is no randomness in the training/validation set splits) LOOCV has the potential to be ==expensive== to implement, since the model has to be fit n times (if n is large, and if each individual model is slow to fit) k-Fold Cross-Validation step divide data into K roughly equal-sized parts We leave out part k, fit the model to the other K − 1 parts (combined), and then obtain predictions for the left-out kth part CV(K) Algorithm we are interested only in the location of the minimum point in the estimated test MSE curve K-fold cross-validation Since each training set is only (K − 1)/K as big as the original training set, the prediction error will typically be biased upwards. Why? ==Bias is increased by reducing the data set == This ==bias is minimized when K = n (LOOCV), but this estimate has high variance ==(since all training sets are nearly equal, so the MSE’s are highly correlated, and depend much on which data set is used) K = 5 or 10 provides a good compromise for the bias-variance trade-off The Bootstrap data set is too small The bootstrap can be used to ==estimate the uncertainty associated with a given estimator== or statistical learning method The bootstrap works by==repeatedly== sampling from the same data set, thereby creating multiple data sets 就是有放回抽样 minimize the risk (反正就有个公式) Time series if the data is a time series, we can’t simply sample the observations with replacement. Why not? The data is correlated over time; this correlation represents the dynamics. Randomly resampling will completely alter the dynamics
KNN,LDA and QDA are Bayesian classification Classification Finding class k that maximizing the conditional probability 举例来说,如果你的任务是将一张图片分类为狗、猫、或鸟,那么你可能会得到每个类别的概率,比如: 狗:0.8 猫:0.1 鸟:0.05 在这种情况下,P最大化的类别是狗,因为它有最高的概率。 K-nearest neighbor classifier non-parametric what probability the point belongs to the class j,as the fraction of points in ==N_0== whose reponse values equal ==j== N_0 contains the ==K== points that are closest to x_0 Example K= 3 表示有三个最近的点在这个sample 点的附近 the region of the cricle :the last nearest neighbor Size of the green circle varies, depending on the location of the black cross x1 10000 but x2 78. the scale , the dominated by x1 Decision boundary Corresponding KNN ==decision boundary== P(Y = blue | X) =P(Y = orange | X) = 0.5 if there are three classes,=0.33 迭代error, test overfiting on the nosiy We have the Bayes Decision boundary automatically How to choose K K= 1: decision boundary overly flexible high variance ,overfitting K=100: decision boundary close to linear high bias underfitting K= 10 just right in this case only ==validation set== to choose which K is the best Training and test error 1/K related to flexibility of kNN K smaller, 1/K bigger, more flexible
used for classification Motivation linear regression quantitative vs. qualitative(categorical) data quantitative data > regression quantitative Y qualitative data> classification qualitative Y class and regression can be used together Why not linear regression in case of variable Y with >2 categories: Coding categories as numbers (ordering and distance between categories) is arbitrary In case of variable Y with 2 categories: Coding with 0/1 would work Even with 2 categories there is an issue: Example: credit card default / not default 但是还是不好predict probability The Logistic Model main idea: predict probability P(Y=1| X=..) Logistic function S-shaped curve between 0 and 1: 这里Linear regression 范围会小于0,但是它的意义是概率 How do we see that the curves have same B0? They cross the x= 0 at same point Is B0 >0 or B0 <0? if at the mentioned point the Y value is smaller than 0.5 , the B0<0 temperature of meidian low and high Odds The ratio of the probability of one event to that of an alternative event Logistic function leads to logit that is linear in X Logit = log(Odds) B1 in linear regression: gives average change of Y associated with one-unit increase in X B1 in logistic regression: gives average change of log odds (logit) with one-unit increase in X
Simple linear regression assume a model : (Y≈β_0+β_1X) β ̂0intercept β ̂1 slope Given some estimates β ̂0 and β ̂1, we can predict future sales y ̂=β ̂0+β ̂1x
Y= f(X)+e Y: (response, dependent variable, predicted value) X: (predictor,independent variable,feature) f: unknow relationship e: random error(1. with 平均值等于0 mean zero 2. 模拟了个体之间的差异) the multivariate case:Usually more than 2 input dimensions! p: number of input dimensions 要分析几个变量 n: number of data points(samples in the data) in a sample 样本量 Prediction 是啥?就是找到Y就行了,f这个关系可以是一个黑盒 Y and f are ==unknown== , so we estimate f in order to predict Y from konwn X values 有帽子的f 和Y是estimated / predicted f is ==estimated== using ==training data==, consisting of X values and corresponding Y values. Then, the Y values can be ==predicted== for new X values 如果说e 的平均值是0, 那么 Y^ = f(X)^ Error of the model Y- Y with hat Can be estimated from the data set: mean squared error: 1/N sum(yi-yi with hat)^2 Reducible and irreducible error reducible : change the ==learning techniques and models ==and ==better training data== » minimized while estimating f (inference) irreducible: cannot be reduced because of ==unmeasured but relevant inputs ==or ==unmeasured variation(nosie)==» set upper bound on the accuracy of predicting Y(prediction) Inference 就是虽然俺们也预测了Y,但是重要的是找到f 关系的方程 estimate f , but ==understanding how== X influence Y Do not treat f with hat as black box Summary Prediction: estimating f^ to get good prediction of Y^ Inference: estimating f ^ to get an understanding of the relationship between X1-Xp and Y Prediction accuracy vs. model interpretablity 预测准确度:预测准确度是指模型在测试集上的预测结果与实际观测值之间的一致程度 模型解释度:线性回归模型通常具有很高的解释性,因为它可以明确地表示变量之间的线性关系,并且可以解释每个自变量对因变量的影响程度 ==linear models==: high interpretability and sometimes high accuracy»inference ==highly non-linear==: low interpretability, high accuracy»prediction Choice depends:prediction or inference Parametric methods choose the functional form of f, then learn its parameters from training data,using least squares or a differenct method 就是先假设确定了一个方法去作为f,然后求参数 ad: much easier to estimate a set of parameters than to fit an arbitrary function ==less training data== dis: if the chosen functional form is too far from the truth, prediction and inference results can be poor >we don’t know the relationship in advance Non parametric based on training data itself非参数方法不需要事先确定模型的函数形式或者参数的数量,而是从数据中直接学习模型的结构。这使得非参数方法在处理复杂的数据结构或者对数据分布了解不充分时更具有灵活性。 ad: ==good fit== ,even if the input-output relations are complex dis: require much ==more training data== risk of ==overfitting== modelling 了 the nosie e (don’ t have enough data) Overfitting是指模型在训练数据上表现很好,但在测试数据(或新数据)上表现较差的现象,因为噪声等»指模型过于复杂,过度拟合了训练数据中的噪声和细微特征的情况。 Overfitting 的原因有: 模型过于复杂,训练数据量不足,特征选择不当,训练数据和测试数据的分布不一致 Supervised vs Unsupervised Learning 最大区别在有没有标签 Assessing Model Accuracy Measuring the quality of fit 回归和分类的区别 回归:预测连续的变量 分类:预测离散的变量 Mean squared error(regression) training MSE isnot important: 增加模型的灵活性可以使其更容易适应复杂的数据模式和关系,从而可能降低训练 MSE。more flexiblity less taining MSE。 test MSE(unseen) different parts of field ==goal: select the model with the smallest test MSE== Underfitting:模型过于简单,无法捕捉数据中的真实模式和关系的情况。欠拟合的模型通常对训练数据和测试数据的表现都较差 Overfitting:模型过于复杂,过度拟合了训练数据中的噪声和细微特征的情况。过拟合的模型在训练数据上表现很好,但在未见过的测试数据上表现较差 Bias vs. variance 因为Biase和variance,所以MSE才是U形的 Bias : model too simple Bias refers to the error that is introduced by approximating a real-life problem by a too simpler model真实值和期望值之间的差异 无论我们使用多少训练数据,这个误差都会存在 more flexible methods have less bias Variance: model too complex 如果换成不同的training sample, f的预测有多大的变化 如果model 有High variance,就是很小的变化都会在f的预测结果上产生很大的变化方差越大意味着模型对数据的变化更敏感。 more flexible methods have higher variance ==more flexible, less bias, more variance== Good test set performance requires ==low variance as well as low squared bias.== (a) inflexible biase »fiexible Flexible is generally better. A flexible method has many degrees of freedom, so it can follow the patterns in the data, even if they are highly non-linear. If the data is more linear, there are enough data points to train the parameters so that the model turns out more linear as well. If flexibility is chosen extremely large, overfitting could still occur. 大样本量:对于极大的样本量,灵活的模型往往表现更好,因为它们有更多的数据可供学习 predictor 量少:当预测变量的数量较少时,意味着数据可能具有简单的关系,这种关系可以由灵活的模型充分捕获而无需担心过拟合 (b) inflexible There is a high risk of overfitting. (c) flexible (d) inflexible (flexible model» there lots of nosiy » fiting the nosiy) 就是用对了,就是low,low。 For Classification Setting Instead of MSE, we get ==error rate==:I= 1 if the pefect model Again, there is a ==training error rate and a test error rate==. They express the fraction of incorrect classifications不正确分类的比例 Training set and test set 就是用training set 去训练模型,然后在调整参数的时候用validation set 找哪个参数合适,同时也找哪个模型是更适合这个数据的,再去test set里面验证这个模型好不好 Training set to ==train the model== Validation set to ==optimize the hyper parameters== Test set to test the performance of the model on an ==independent ==part of the data set. To get an estimate on how good it will ==work in practice== with limited amount of data 少量数据 可以用 cross validation
What is machine learning? 分类 supervised learning : classification and regression unsupervised learning: cluastering and structure Supervised learning the relationship f between input X(predictor,independent variable,feature) and output Y(response, dependent variable, predicted value) based on training data> Y= f(X) Classification and regression区别 1. tomato ripe> regression methods for regression: linear regression, decision trees/random forest/neural networks 2. object database> representation/object featrues> feature space(distance measure to deteccted object)> when new image poped up, we can use this model to detect object methods for classification: Logistic regression K-Nearest Neighbors Linear/Quadratic Discriminant Analysis线性、二次判别分析 Decision Trees/Random Forest Support Vector Machines Neural Networls Unsupervised learning learning structure, no Y, only X 分为:clustering example(K-Means Clustering, Expectation Maximization,Hierarchical Clustering ) dimensionality reduction(Principal Component Analysis) Which model is the best? Model and feature selection How to optimally use the training /test data? Resampling methods Cross-validation Bootstrapping