The relational model is a meta-model, described as an E-R model. It covers domains, integrity, and operators to retrieve, derive, and modify data.
Chapter 7: how to model and design a database
Data modelling = the art of database design
Find out which facts to store:
● things (entities)
● what to keep about them (attributes)
● what might happen
● how the facts link (relationships)
Data modelling is not process modelling, data flow, or experimental design.
Make the database metadata: identify the facts to be stored in the database. This is done by the client and the analyst.
Building blocks: entity, attribute, relationship, identifier (open question: what do "+" and "frim’boss" mean here?)
Aim 1: a well-formed data model
● construction rules are obeyed
● no ambiguity
Aim 2: a high-fidelity image
● faithfully describes the world and its relationships
● the data model is complete, understandable, accurate, and makes sense
7 habits of highly effective data modellers
What is many-to-many (m:m)?
● A sale includes many items
● An item can appear in many sales
How to link the tables in an m:m relationship: we cannot link them directly, so a table sits between them. Its PK is saleno plus lineno (or itemno).
Query: join for multiple tables
Set operator: UNION
List all items that A) were sold on 1995-01-16, or B) are brown.
Compared with OR, UNION additionally keeps only unique rows; otherwise the two give the same result. UNION does not check whether combining the two result sets is logical, it just puts them together.
Set operator: INTERSECT
List all items that A) were sold on 1995-01-16, and B) are brown.
INTERSECT is not supported by MySQL before version 8.0.31.
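The m:m link and the two set operators can be tried out from Python. This is a minimal sketch, not the course's own database: it uses SQLite (through the standard `sqlite3` module) instead of MySQL, the column names `saleno`, `lineno`, and `itemno` follow the notes, and every other name and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (saleno INTEGER PRIMARY KEY, saledate TEXT);
CREATE TABLE item (itemno INTEGER PRIMARY KEY, itemname TEXT, itemcolor TEXT);
-- lineitem sits between sale and item and resolves the m:m relationship
CREATE TABLE lineitem (
    saleno INTEGER REFERENCES sale(saleno),
    lineno INTEGER,
    itemno INTEGER REFERENCES item(itemno),
    PRIMARY KEY (saleno, lineno)
);
INSERT INTO sale VALUES (1, '1995-01-15'), (2, '1995-01-16');
INSERT INTO item VALUES (1, 'Pocket knife', 'Brown'),
                        (2, 'Compass', NULL),
                        (3, 'Hat', 'Khaki');
INSERT INTO lineitem VALUES (1, 1, 1), (2, 1, 1), (2, 2, 2);
""")

# A) items sold on 1995-01-16 (three-table join through lineitem)
SOLD = """
SELECT item.itemname FROM item
JOIN lineitem ON lineitem.itemno = item.itemno
JOIN sale ON sale.saleno = lineitem.saleno
WHERE sale.saledate = '1995-01-16'
"""
# B) brown items
BROWN = "SELECT itemname FROM item WHERE itemcolor = 'Brown'"

# UNION: A or B, duplicates removed; INTERSECT: A and B
union_rows = sorted(r[0] for r in conn.execute(SOLD + " UNION " + BROWN))
intersect_rows = sorted(r[0] for r in conn.execute(SOLD + " INTERSECT " + BROWN))
print(union_rows)
print(intersect_rows)
```

The pocket knife satisfies both conditions, yet it appears only once in the UNION result, which is the uniqueness noted above.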
Chapter 4: One-to-Many Relationship
Link multiple entities together
A relationship between one instance of an entity and many instances of another entity
Why do we need multiple tables? A single table leads to redundant data and redundant operations.
Potential problems
● insert: the exchange rate has to be duplicated
● delete: deleting the company also removes the data about India
● update
How to solve this problem
● Split: move the redundant information into its own table
● Link: connect the two tables with a foreign key, which is the same as the PK of the other table
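The split-and-link fix can be sketched with SQLite from Python. The table and column names (`nation`, `stock`, `natcode`, `exchrate`, ...) and all values are hypothetical; the only things taken from the notes are the idea of an exchange rate stored once per nation and a foreign key that repeats the other table's PK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Split: the exchange rate is stored once per nation
CREATE TABLE nation (natcode TEXT PRIMARY KEY, natname TEXT, exchrate REAL);
-- Link: stock.natcode is a foreign key equal to the PK of nation
CREATE TABLE stock (
    stkcode TEXT PRIMARY KEY,
    stkfirm TEXT,
    natcode TEXT REFERENCES nation(natcode)
);
INSERT INTO nation VALUES ('IND', 'India', 0.0228), ('UK', 'United Kingdom', 1.0);
INSERT INTO stock VALUES ('BD', 'Bombay Duck', 'IND'),
                         ('FC', 'Freedonia Copper', 'UK'),
                         ('PT', 'Patagonian Tea', 'UK');
""")

# Join the two tables back together through the foreign key
joined = conn.execute("""
    SELECT stkfirm, natname, exchrate
    FROM stock JOIN nation ON stock.natcode = nation.natcode
    ORDER BY stkfirm
""").fetchall()

# Delete problem solved: removing the only Indian company keeps India's rate
conn.execute("DELETE FROM stock WHERE stkcode = 'BD'")
rate_after_delete = conn.execute(
    "SELECT exchrate FROM nation WHERE natcode = 'IND'").fetchone()[0]

# The foreign key also rejects a stock that points at an unknown nation
try:
    conn.execute("INSERT INTO stock VALUES ('XX', 'Nowhere Ltd', 'ZZ')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(joined, rate_after_delete, rejected)
```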
Project structure
Advanced topic: improve data processing using a DB; link a Python program to manage the DB server
Information system
● step 1: data collection, sensing
● step 2: data management, database
● step 3: data analysis, machine learning
Data storage: local RAM, files, database, distributed storage, cloud, and blockchain
Why a database? Shareable, transportable, accurate, fast, secure, and private
Database systems
Challenges of DM: redundancy, poor interface, lack of data integration, lack of scalability
Single Entity
What is a database? A coherent collection of data that is searchable
● collection of data: related tables
● searchable: DBMS
Basic structure of a database
● rows: things
● columns: attributes
Definitions of database terms: entity, instance, primary key
Entity-Relationship Diagrams (ERD): an ERD shows the relationships between entities
Manage a database with Structured Query Language (SQL)
● create a database
● query a database
Define a table: CREATE TABLE, PRIMARY KEY
Insert rows into a table: INSERT INTO ... VALUES
Query a database: SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY
Select: SELECT ... FROM
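The SQL keywords listed above, and the idea of driving a database from a Python program, fit in one short sketch. It assumes SQLite via the standard `sqlite3` module rather than a separate DB server, and the `share` table, its columns, and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a table: each row is a thing (instance), each column an attribute
conn.execute("""
    CREATE TABLE share (
        shrcode  TEXT PRIMARY KEY,
        shrfirm  TEXT,
        shrprice INTEGER,
        shrqty   INTEGER,
        country  TEXT
    )
""")

# Insert rows into the table
rows = [
    ("AA", "Alpha", 10, 100, "UK"),
    ("BB", "Beta", 20, 50, "UK"),
    ("CC", "Gamma", 5, 10, "IN"),
    ("DD", "Delta", 30, 1, "IN"),
]
conn.executemany("INSERT INTO share VALUES (?, ?, ?, ?, ?)", rows)

# Query: filter rows, group them, and order the groups
result = conn.execute("""
    SELECT country, COUNT(*) AS n
    FROM share
    WHERE shrprice >= 10
    GROUP BY country
    ORDER BY n DESC
""").fetchall()
print(result)
```

WHERE drops the one share priced below 10 before grouping, so the counts are 2 for UK and 1 for IN.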
W3L1 Unsupervised learning lab
PCA
● Scaling
● Loadings: `pca_loadings = pca_fit.components_.T`
● Principal component scores: `pca = PCA(random_state=42)`, then `pca_df = pd.DataFrame(pca.fit_transform(X_scaled))`
● PVE
Clustering: hierarchical clustering
● `hc_avg = linkage(X, method='average')`
● `hc_complete = linkage(X, method='complete')`
● `hc_single = linkage(X, method='single')`
● `dend = dendrogram(hc_avg)`
● `cut_tree(hc_avg, 2).ravel()` cuts the tree into just two clusters
Correlation-based distance can be used by passing the `metric='correlation'` argument to the `linkage` function (`method` only selects the linkage type). If we have a data set with at least three features, we can compute the correlation between two data points and treat it as a distance measure. Specifically, we can use the Pearson correlation coefficient r, which measures the strength of the linear relationship between two variables; the distance is then 1 - r.
Applied
PCA
● Get the Proportion of Variance Explained (PVE) from `pca.explained_variance_ratio_`
● Compute the PVE by hand, without the code above: ![[a231610ce8f6af47eafbf207cd98180.png]] use this formula and write this loop ![[2421ca477df65ec16149569ec3aaecd.png]]
Hierarchical clustering
● With complete linkage: `hc_complete = linkage(USArrests, method='complete')`
● Cut into 3 clusters or 4 clusters
● Scale, then draw the dendrogram: `scaler = StandardScaler()`, `X_scaled = scaler.fit_transform(USArrests)`, `hc_complete_sc = linkage(X_scaled, method='complete')`, `cut_tree(hc_complete_sc, 4).ravel()`
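For the "PVE by hand" step above, here is a hedged NumPy-only sketch. The formula image is not reproduced in these notes, so the sketch assumes the usual definition: the PVE of component m is the sum of its squared scores divided by the total sum of squares of the scaled data. The data is random and made up, and the loadings come from an SVD rather than from `pca_fit.components_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data with correlated columns (50 samples, 4 features)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

# Scaling: zero mean and unit variance per column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Loadings: one column per principal component
_, _, vt = np.linalg.svd(X_scaled, full_matrices=False)
loadings = vt.T

# PVE loop: squared scores of component m over the total sum of squares
total = np.sum(X_scaled ** 2)
pve = np.zeros(loadings.shape[1])
for m in range(loadings.shape[1]):
    scores_m = X_scaled @ loadings[:, m]
    pve[m] = np.sum(scores_m ** 2) / total

print(pve, pve.sum())
```

The values sum to 1 and should match `pca.explained_variance_ratio_` when `PCA` is fitted on the same scaled data.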
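Sampling Methods and Model Selection
8. Cross-validation
● Generate the data set.
● Set a random seed, then compute the LOOCV errors that result from fitting the following four models by least squares (polynomials of increasing degree, from first to fourth).
● Look at the statistical significance of the coefficient estimates obtained by fitting each of these models by least squares. Do these results agree with the conclusions drawn from the cross-validation results?
● Conclusion: the t-statistics of X1 and X2 have the largest absolute values, which corresponds to the quadratic model.
9. Bootstrap
● Compute the std and mean of the data set directly.
● Estimate the std and mean of the data set with the bootstrap:

    import numpy as np

    def boot(var, n):
        """Bootstrap the mean of a pandas Series; return (mean, std) of the n bootstrap means."""
        m = np.zeros(n)
        for i in range(n):
            # frac=1 with replace=True: resample 100% of the data, with replacement
            v = var.sample(frac=1, replace=True)
            m[i] = np.mean(v)
        res1 = np.mean(m)
        res2 = np.std(m)
        print('mu: %.5f; se: %.5f' % (res1, res2))
        return res1, res2

● Call: `result = boot(medv, 1000)  # close to (b)`
● Confidence interval:

    print('lowerbd:%.2f' % (result[0] - 2 * result[1]))
    print('upperbd:%.2f' % (result[0] + 2 * result[1]))

    from scipy import stats
    stats.t.interval(0.95,              # confidence level
                     df=len(df) - 1,    # degrees of freedom
                     loc=mu_hat,        # sample mean
                     scale=mu_hat_se)   # standard error of the mean

● Median, and the standard error of the median
● Provide an estimate for the tenth percentile of y
# 6.9 Generate a simulated data set
`np.random.randn(1)` generates one random number from the standard normal distribution (mean 0, standard deviation 1).
(1) 10 × 100 values of X; y is a polynomial
(2) Put them in a data frame
Run these two kinds of model selection:
(1) best subset selection

    lm = OLS(fit_intercept=True)
    lm_exhaustive = exhaustive_search(lm, X_new_df, y, nvmax=len(X_new_df.columns))

(2) forward and backward

    lm = OLS(fit_intercept=True)
    lm_forward = forward_search(lm, X_new_df, y, nvmax=len(X_new_df.columns))
    lm = OLS(fit_intercept=True)
    lm_backward = backward_search(lm, X_new_df, y, nvmin=1)

Now fit a ==lasso model== to the simulated data, again using X, X^2, …, X^10 as predictors. ==Use cross-validation to select the optimal value of λ==. (This is really one whole step; the parts belong together.) Create a plot of the ==cross-validation error as a function of λ==. (This also belongs with the previous step.) Report the resulting ==coefficient estimates== and discuss the results. ==(Also scale the data.)==
1. Scale

    scaler = StandardScaler()
    X_new = scaler.fit_transform(X_new)

2. Cross-validate over a grid of λ values

    # 50 values from 1000 down to 0.001, equally spaced on a log scale
    lambdas = 10 ** np.linspace(3, -3, 50)
    mean_scores = np.zeros(len(lambdas))
    std_scores = np.zeros(len(lambdas))
    for i, lambda_ in enumerate(lambdas):
        # one score per fold, computed with whatever `scoring` is set to; cv.mean() averages them
        cv = cross_val_score(Lasso(lambda_, max_iter=10000), X_new, y, cv=10, scoring='neg_mean_squared_error')
        mean_scores[i] = cv.mean()
        std_scores[i] = cv.std()

3. ![[d26871bc8cfbf8b45cc98feed382a87.png]] Here yerr is the standard deviation; as the lecture also mentioned, the standard deviation is used as the error bar. Find the point mn with the lowest mean CV error (with `scoring='neg_mean_squared_error'` this is the largest mean score, or the smallest value once the sign is flipped); the corresponding lambda is opt. ![[1e233ecedc7b7977808dd41ae0a9776.png]] In this result, variables whose coefficient is 0 are no longer part of the model.
4. Now generate a response vector y according to the model y = β0 + β7X^7 + e, and perform best subset selection and the lasso. Discuss the results. (Same procedure with a different model: best subset selection and the lasso.)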
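The lasso steps above (scale, cross-validate over a λ grid, pick the best λ, read off the coefficients) can be run end to end as a sketch. The simulated model and its coefficients are assumptions made up here, not the exercise's values, and the grid is shortened to 20 values between 10 and 0.001 to keep the run quick. Only scikit-learn calls already used in these notes appear.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x = rng.normal(size=100)
# Assumed true model for this sketch: only X, X^2 and X^3 matter
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.5 * x ** 3 + rng.normal(size=100)

# Predictors X, X^2, ..., X^10, then scaling
X_poly = np.column_stack([x ** k for k in range(1, 11)])
X_new = StandardScaler().fit_transform(X_poly)

lambdas = 10 ** np.linspace(1, -3, 20)
mean_scores = np.zeros(len(lambdas))
for i, lambda_ in enumerate(lambdas):
    cv = cross_val_score(Lasso(alpha=lambda_, max_iter=10000), X_new, y,
                         cv=10, scoring="neg_mean_squared_error")
    mean_scores[i] = cv.mean()

# Scores are negative MSE, so flip the sign before looking for the minimum
cv_mse = -mean_scores
opt_lambda = lambdas[np.argmin(cv_mse)]

# Refit at the selected lambda and count coefficients shrunk exactly to zero
best = Lasso(alpha=opt_lambda, max_iter=10000).fit(X_new, y)
n_zero = int(np.sum(best.coef_ == 0))
print(opt_lambda, best.coef_, n_zero)
```

Small λ values may trigger convergence warnings with these highly correlated polynomial features; they do not stop the run.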
F → T: Adding more variables to a model will most likely decrease the training errors. (Increasing p reduces the training error.)
T?: K = n should give the least bias.
T → F: Adding more data in the training set will likely decrease the total sum of squares (TSS). (Increasing n does not decrease TSS: it sums squared deviations over all training points, so adding points generally makes it larger.)
T → F: The likelihood function gives the probability of the model parameters given the data. (The likelihood function is the probability of observing the data given the model parameters, not the probability of the parameters given the data.)
F?: LDA is not only used for the ordinary binomial (two-class) case.
T?: For a problem with p features, any separating hyperplane has p+1 parameters.
F?: Weight decay regularizes neural networks by minimizing the number of errors. (Weight decay regularizes a neural network by penalizing the size of the model's weights.)
T: For a model of the form y = β0 + β1x1 + β2x2 it holds that β -> 2β implies y -> 2y. In other words, doubling the parameter vector will double the explained variable.
T: The linear SVC (support vector classifier) just has a single user-defined parameter.
F: The curse of dimensionality states that the more features, the more computation is required. (Rather, it says that accuracy decreases.)
F: Bayesian classification assigns the label of the class with the highest prior probability. (It should be the posterior.)
F: We have 2 models with the same number of variables, but with a different training error. The model with the highest training error is more likely to have the smallest test error of the 2.
Neurons and gradient descent
Neural nets are interconnected networks of simple processing units, a.k.a. neurons.
It remains just a parallel: the artificial neuron is just an approximation!
The output y is modelled as a weighted linear combination of the inputs xj. In addition, a transfer function, or activation function, is used to make a decision.
Gradient descent: how to find the right weights?
● We want to find the minimum of an error function
● We start with an initial guess of the parameter b
● We change its value in the direction of steepest descent (derivatives are slopes)
● We continue until reaching a minimum
Steps
● To update a weight b, we subtract from its value the derivative, multiplied by a
● a is called the learning rate: it decides by how much you multiply the step vector. A large a can lead to faster convergence than a small one, but too large an a can lead to disaster
● r is the iteration
● Why the minus sign? Because we want to move towards a minimum!
Momentum
● Numerical example
● The learning rate controls oscillation and speed
● Momentum uses a bit of the previous step
The models using them
● multi-layer perceptron
● convolutional neural networks (deep learning)
Multilayer perceptron (MLP) (from linear classifier to nonlinear version)
● It is a feed-forward network: it goes from inputs to outputs, without loops
● Every neuron includes an activation function (e.g. sigmoid, see earlier slides)
Chain rule
● How to compute gradient descent on a cascade of operations? How to reach all trainable weights?
● The forward pass: you put a data point in and obtain the prediction
● The backward pass: you update parameters by back-propagating errors
Exercise: describe the architecture of the following network.
● How many layers are there? 2
● How many neurons per layer? 3, 1
● How many trainable parameters (weights and biases)? 10
Convolutional Neural Networks (CNN): they proposed a network learning spatial filters on top of spatial filters.
They were (and are) called convolutional neural networks.
The CNN was considered interesting, but very hard to train. It needed
● loads of training data > nobody had them
● a lot of computational power > same
What changed: big data and graphics processing
How they work
● They are conceptually very similar to MLPs
● But their weights (b) are 2D convolutional filters
● Because of this, they are very well suited to images
In convolutional neural networks, the filters are spatial (on 2D grids).
● local: they convolve the values of the image in a local window
● shared: the same filter is applied everywhere in the image
Why shared? "Recycling" the same operation means far fewer parameters to learn!
Learn
By "learns", I mean adapting the filter weights to minimize the prediction error (i.e. backpropagation)
Steps
● Convolutional filters start with random numbers
● Iteratively they are improved: each coefficient is updated along the direction of largest gradient of the cost function, downhill
● At the end, the filters become quite meaningful!
Summary
● Perceptrons are "neuron-inspired" linear discriminants
● Multilayer perceptrons are trainable, nonlinear discriminants
● Feed-forward neural networks in general can be used for classification, regression and feature extraction
● There is a large body of alternative neural nets
● Key problems in the application of ANNs are choosing the right size and good training parameters
● Convolutional neural networks have a constrained architecture encoding prior knowledge of the problem
● Deep learning is concerned with constructing extremely large neural networks that depend on special hardware (GPUs), to be able to train them, and on specific tricks (e.g. rectified linear units) to prevent overfitting
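The gradient-descent update (learning rate a, minus sign, momentum) and the "local, shared" filter idea can be combined in one small NumPy sketch: a single 3×3 filter starts from random numbers and is updated until it reproduces the output of a known filter. The image, the target filter, the learning rate 0.1, and the momentum 0.5 are all made-up values, and the squared-error cost is an assumption of this sketch. It is not a full CNN: no layers, no activation function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for u in range(kh):
        for v in range(kw):
            # the same weight kernel[u, v] is reused at every image position
            out += kernel[u, v] * image[u:u + oh, v:v + ow]
    return out

def loss_gradient(image, kernel, target):
    """Gradient of the mean squared error with respect to each filter weight."""
    kh, kw = kernel.shape
    err = conv2d_valid(image, kernel) - target
    oh, ow = err.shape
    grad = np.zeros_like(kernel)
    for u in range(kh):
        for v in range(kw):
            grad[u, v] = 2.0 * np.mean(err * image[u:u + oh, v:v + ow])
    return grad

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 16))              # made-up "image"
true_kernel = np.array([[1.0, 0.0, -1.0],
                        [2.0, 0.0, -2.0],
                        [1.0, 0.0, -1.0]])     # made-up filter to recover
target = conv2d_valid(image, true_kernel)

kernel = rng.normal(size=(3, 3))               # filters start with random numbers
step = np.zeros_like(kernel)
a, momentum = 0.1, 0.5                         # assumed learning rate and momentum
for r in range(500):                           # r is the iteration
    # minus sign: move against the gradient, towards a minimum;
    # the momentum term re-uses a bit of the previous step
    step = momentum * step - a * loss_gradient(image, kernel, target)
    kernel = kernel + step

final_error = float(np.mean((conv2d_valid(image, kernel) - target) ** 2))
print(np.round(kernel, 3), final_error)
```

The shared filter has 9 trainable weights. A fully connected layer mapping the same 16×16 input to the same 14×14 output would need 256 × 196 = 50176 weights, which is the saving that "recycling" the operation buys.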
Clustering methods: hierarchical
dissimilarity matrix → (cluster) → dendrogram → (cut) → clustering
Bottom-up or agglomerative (as opposed to top-down or divisive)
Disadvantages:
● once clustered, objects stay clustered
● hard clustering: objects are assigned to a single cluster
Clustering methods: K-means
data, criterion, number of clusters K → (cluster) → clustering
For every selected number of clusters K, choose the optimal clustering. What is optimal?
K-means definitions
● The clusters are taken as a union (together they make up the data set), and clusters do not overlap
● Optimal clustering: the squared distances between all pairs in each cluster or, equivalently, the squared distances to the cluster means, are minimal
● Problem: need known W(C_k), but there are possibilities (!), so use iterative optimization
K-means algorithm
In K-means each xi belongs to exactly one cluster. So I fix k clusters, randomly put the data points into them, compute the mean (and variance) of each cluster, find the nearest cluster for each point, and iterate over all points until the assignment no longer changes.
● Choose the number of clusters K and randomly assign each sample to a cluster
● Iterate until nothing changes:
(a) for each cluster, calculate the centroid (mean)
(b) re-assign each sample to the cluster whose mean is closest (in the Euclidean sense)
● Guaranteed to only decrease the criterion (why?): the procedure guarantees that no iteration increases the value of the objective function.
Exercise 3
Choice of K
Rule of thumb: look for a "drop" in the criterion
K-means problems
Clusters can lose all samples. Why?
● Poorly chosen initial centroids: some centroids may end up in a region that contains no samples, so that cluster loses all its samples.
● Non-convex clusters: if the data set contains non-convex clusters, e.g. ring-shaped or crescent-shaped ones, K-means may split them into several smaller clusters, which can leave a cluster without samples.
● Imbalanced data: if one cluster has very few samples while the others have many, that cluster may lose all its samples, because the K-means update is based on the mean of the samples.
Solutions
● remove the cluster and continue with K - 1 means
● alternatively, split the largest cluster into two, or add a random mean, to continue with K means
Clustering result depends on initialization: the algorithm can get stuck in local minima
Solution: start from (many) different random initialisations and keep the best clustering (lowest sum of W(C_k))
Limitations
● K-means model not necessarily optimal
● Equal cluster models not necessarily optimal
● Hard clustering not necessarily optimal
Exercise
The K-means model
What cluster model actually underlies K-means?
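● spherical, uniform
● implicit in criterion
Choosing an explicit model can help to:
● understand the result
● quantify the model fit
● try alternative models
● make assumptions explicit
Mixture models
● Model the probability distribution of the data: this gives a model for the overall data distribution
● "soft" clustering: captures uncertainty in assignments
● parameters can be found using maximum likelihood
Distribution-based clustering
● Each cluster is described by a probability density function
● The total dataset is described by a mixture of density functions
● Clustering means maximizing the mixture fit; cluster assignment is based on posterior probabilities
I want the probability that point x belongs to each cluster, and I assign x to the cluster with the highest probability. But the probability of x given cluster k (knowing that a point belongs to k, the probability that it is x) is easier to compute: it comes from the Gaussian density, using the mean and the sum of squares. So that is computed first, and the posterior then follows from Bayes' rule.
Fitting mixture models
For a mixture of Gaussians the steps are:
1. Initialise the parameters: randomly initialise the mean, variance, and weight of each Gaussian.
2. Compute how probable each sample is under each Gaussian: for each sample xi, compute P(xi∣k) for each Gaussian k, and from it the probability that xi belongs to k. (Note that mixture models also work for other component densities!)
3. Update the parameters: for each Gaussian k, update the mean, variance, and weight according to the probability that each sample belongs to it. This update uses maximum likelihood estimation. (Put simply: pick what gives the largest probability.)
4. Repeat steps 2 and 3 until the parameters no longer change or the maximum number of iterations is reached.
5. Output the parameter estimates, which can be used to describe the distribution of the data.
Mixture of Gaussians
Latent variable
Problem: need to simultaneously estimate two interdependent things… no closed form solution!
● cluster membership of each object
● density parameters of each cluster
Solution: the EM algorithm
Expectation-Maximization algorithm: a general class of algorithms for this type of problem. Repeatedly:
● recalculate cluster membership of each sample (E)
● recalculate density parameters of each cluster (M)
The EM algorithm is a parameter estimation method for probabilistic models with latent variables. Its basic idea: assume the observed data is generated in two steps, first the latent variables and then the observed data. In the E-step, the algorithm uses the current parameters to estimate the posterior probability of the latent variables, i.e. their distribution. In the M-step, it uses the E-step result to estimate the model parameters. Iterating the E-step and M-step eventually converges. The goal of EM is to maximize the likelihood, i.e. the probability of generating the observed data, which yields the best parameter estimates (in general a local optimum).
The EM algorithm for MoGs
In short: the probability of the latent variable is known given the current parameters (but it changes whenever the means and covariances of the Gaussians change). Computing the probability of the latent variable means computing the responsibility; plugging the current parameters into the Gaussian expression for it is the E-step. Finding the parameters that maximize the likelihood is the M-step: for example, differentiate the likelihood, set the derivative to 0, and solve for the maximum. This is usually called maximum likelihood estimation (MLE).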
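The hard (K-means) and soft (EM) procedures described in these notes can be compared on a toy problem. The sketch below is pure Python on 1D data with two made-up Gaussian clusters. One choice here differs from the steps above and is an assumption of the sketch: the mixture means are initialised from a K-means run instead of randomly. All function names are hypothetical.

```python
import math
import random

def kmeans_1d(xs, k, iters=100, seed=0):
    """Random initial assignment, then alternate (a) centroids and (b) re-assignment."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in xs]
    means = []
    for _ in range(iters):
        means = []
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            # a cluster that lost all samples gets a random data point as new mean
            means.append(sum(members) / len(members) if members else rng.choice(xs))
        new_labels = [min(range(k), key=lambda c: (x - means[c]) ** 2) for x in xs]
        if new_labels == labels:      # nothing changes: stop
            break
        labels = new_labels
    return means, labels

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, init_means, iters=200):
    """EM for a 1D mixture of Gaussians, started from the given means."""
    k, n = len(init_means), len(xs)
    means = list(init_means)
    grand_mean = sum(xs) / n
    variances = [sum((x - grand_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each sample (posterior)
        resp = []
        for x in xs:
            p = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(p)
            resp.append([pc / total for pc in p])
        # M-step: maximum likelihood update of mean, variance and weight
        for c in range(k):
            nk = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, xs)) / nk
            variances[c] = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, xs)) / nk
            weights[c] = nk / n
    return weights, means, variances, resp

data_rng = random.Random(42)
xs = ([data_rng.gauss(0.0, 1.0) for _ in range(150)]
      + [data_rng.gauss(6.0, 1.0) for _ in range(150)])   # made-up data

km_means, labels = kmeans_1d(xs, k=2)
gmm_weights, gmm_means, gmm_vars, resp = em_gmm_1d(xs, km_means)
print(sorted(km_means), sorted(gmm_means), gmm_weights)
```

K-means returns one label per sample, while each row of `resp` holds the soft membership of a sample in both clusters. For points between the two clusters these probabilities are not 0 or 1, which is the uncertainty that hard clustering cannot express.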