Notes : <Hands-on ML with Sklearn & TF> Chapter 5
- capable of performing linear and nonlinear classification, regression, and even outlier detection
- particularly well suited to classification of complex but small- or medium-sized datasets
Linear SVM Classification
- Instead of just fitting any line that separates the two classes, the SVM fits the widest possible street between them (large margin classification).
- Instances outside the street's borders have no influence on the decision boundary; it is determined only by the instances located on the edge of the street, which are called support vectors.
- sensitive to the feature scales; apply feature scaling first, e.g. with StandardScaler
Soft Margin Classification
- Strictly requiring all instances to be off the street and on the correct side is called hard margin classification; it only works if the data is linearly separable, and it is sensitive to outliers.
- In Scikit-Learn's SVM classes you can control the number of margin violations with the C hyperparameter: a smaller C value leads to a wider street but more margin violations.
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)

svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
))
svm_clf.fit(X, y)
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))))
svm_clf.predict([[5.5, 1.7]])  # LinearSVC doesn't output probabilities
array([ 1.])
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

svc_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),
))

m = len(X)
C = 1
sgd_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C))),
))

svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))
[ 1.] [ 1.]
- SGDClassifier is slower than LinearSVC, but it can handle huge datasets and online classification tasks; SVC is slower still.
- The LinearSVC class regularizes the bias term, so you should subtract the mean first, or let StandardScaler do it automatically.
- set loss='hinge'
- for better performance set dual=False, unless there are more features than training instances
Nonlinear SVM Classification
- add more features, such as polynomial features, so that a linear classifier can still be used
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline((
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge')),
))
polynomial_svm_clf.fit(X, y)
Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))))
import matplotlib.pyplot as plt

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
- Grid search can be used to find the best hyperparameters. It is usually faster to do a coarse search first, then a finer search around the best values found (a sketch follows).
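A hedged sketch of this coarse-then-fine search (the parameter grids are my own illustration, not the book's), reusing the moons dataset X, y built in the code above:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# coarse pass: a wide logarithmic grid
coarse_grid = {'gamma': [0.01, 0.1, 1, 10], 'C': [0.1, 1, 10, 100]}
coarse = GridSearchCV(SVC(kernel='rbf'), coarse_grid, cv=3)
coarse.fit(X, y)

# fine pass: a narrower grid centered on the best coarse values
g, c = coarse.best_params_['gamma'], coarse.best_params_['C']
fine_grid = {'gamma': [g/2, g, 2*g], 'C': [c/2, c, 2*c]}
fine = GridSearchCV(SVC(kernel='rbf'), fine_grid, cv=3)
fine.fit(X, y)
print(fine.best_params_)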
Adding Similarity Features
- add a similarity feature for each landmark, computed with a similarity function that measures how much each instance resembles that particular landmark
- $$Gaussian\ Radial\ Basis\ Function:\ \ \phi_{\gamma}(x, \ell)=e^{-\gamma \left \| x-\ell \right \|^2}$$ Here $\left \| x-\ell \right \|$ is the distance from the instance to the landmark; the new features computed this way can then replace the original ones.
- how to select landmarks? The simplest approach is to create a landmark at the location of each and every instance in the dataset (see the sketch below)
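A minimal sketch of this idea (my own illustration, not the book's code): treating every training instance as a landmark, the similarity features from the formula above can be computed with sklearn.metrics.pairwise.rbf_kernel, turning an m×n dataset into an m×m feature matrix that a plain linear classifier can use. The gamma value here is an arbitrary choice.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# one landmark per training instance -> m similarity features per instance
X_sim = rbf_kernel(X, X, gamma=0.3)   # X_sim[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(X_sim.shape)                    # (100, 100)

# the transformed features can be fed to a plain linear classifier
lin_clf = LinearSVC(C=10).fit(X_sim, y)
print(lin_clf.score(X_sim, y))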
Gaussian RBF Kernel
- When the dataset is large, computing a similarity feature for every landmark is very expensive.
- The magic of SVMs is the kernel trick: you get the same result as if you had actually added all those features, without having to add them.
rbf_kernel_svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
))
rbf_kernel_svm_clf.fit(X, y)
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma=5, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))))
%matplotlib inline
from sklearn.svm import SVC

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline((
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ))
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(14, 9))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)
plt.show()
- Increasing $\gamma$ makes the bell-shaped curve narrower, so each instance's range of influence is smaller.
- Decreasing $\gamma$ makes the bell-shaped curve wider, so each instance's range of influence is larger.
- $C$ behaves similarly to $\gamma$: both act like regularization hyperparameters.
- Other kernels are used more rarely or only in narrow application domains.
- String kernels are often used for text classification or DNA sequences, e.g. the string subsequence kernel, or kernels based on the Levenshtein distance.
- how to choose a kernel:
- try the linear kernel first; LinearSVC is fast
- then try SVC(kernel='linear')
- if the training set is not too large, try the Gaussian RBF kernel
- try other kernels using cross-validation and grid search
Computational Complexity
Class | Time Complexity | Out-of-core Support | Scaling Required | Kernel Trick
---|---|---|---|---
LinearSVC | $O(m \times n)$ | No | Yes | No
SGDClassifier | $O(m \times n)$ | Yes | Yes | No
SVC | $O(m^2 \times n)$ to $O(m^3 \times n)$ | No | Yes | Yes
SVM Regression
- The difference from the classifier: SVM regression tries to fit as many instances as possible inside the margin (the street, whose width is controlled by $\epsilon$) while limiting instances that fall outside it.
from sklearn.svm import LinearSVR
import numpy.random as rnd

rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True, intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000, random_state=None, tol=0.0001, verbose=0)
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto', kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices where the condition holds (nonzero positions)

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # actually the instances outside the margin
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # prediction at this x position, used to mark epsilon on the plot

def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)

plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
    '', xy=(eps_x1, eps_y_pred), xycoords='data',
    xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon), textcoords='data',
    arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
)
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
Under the Hood
Decision Function and Predictions
- $$ w^T \cdot x + b = w_1 x_1 + \cdots + w_n x_n + b \ \ ;\ \ bias=b,\ weight\ vector=w \\ \widehat{y}=\left\{\begin{matrix} 0\ \ if\ w^T \cdot x +b <0\\ 1\ \ if\ w^T \cdot x +b \geq 0 \end{matrix}\right.$$
- Training a linear SVM classifier means finding the value of $w$ and $b$ that make this margin as wide as possible while avoiding margin violations or limiting them.
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)
from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1s, x2, 0, color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^t \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)

svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])
plt.show()
Training Objective
- Requiring every instance to be classified correctly (hard margin) gives the constraint: $$t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1$$
- Training then becomes a constrained optimization problem: $$ \underset{w,b}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w = \frac{1}{2}\left \| w \right \|^2 \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1,\ \ for\ i=1,2,3,...,m $$ We minimize $\frac{1}{2}\left \| w \right \|^2$ rather than $\left \| w \right \|$ because the latter is not differentiable at $w=0$.
- To get a soft margin, introduce a slack variable $\varsigma^{(i)} \geq 0$ for each instance; it measures how much that instance is allowed to violate the margin.
- $w$ determines the slope of the decision function: the smaller $w$, the wider the margin. The smaller $\varsigma$, the fewer margin violations, but the margin then tends to be narrower.
- The hyperparameter $C$ balances these two conflicting objectives ($w$ versus $\varsigma$).
- $$ \underset{w,b,\varsigma}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m} \varsigma^{(i)} = \frac{1}{2}\left \| w \right \|^2 + C\sum_{i=1}^{m} \varsigma^{(i)} \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1-\varsigma^{(i)},\ \ and\ \varsigma^{(i)} \geq 0\ ,\ for\ i=1,2,3,...,m $$
Quadratic Programming
- $$ \underset{p}{Minimize}\ \ \frac{1}{2} p^T \cdot H \cdot p + f^T \cdot p \\ subject\ to\ \ \ A \cdot p \leq b \\ where\ \left\{\begin{matrix} p\ \ is\ an\ n_p\ dimensional\ vector(=number\ of\ parameters)\\ H\ \ is\ an\ n_p\ \times\ n_p\ matrix\\ f\ \ is\ an\ n_p\ dimensional\ vector\\ A\ \ is\ an\ n_c\ \times\ n_p\ matrix(n_c=number\ of\ constraints)\\ b\ \ is\ an\ n_c\ dimensional\ vector \end{matrix}\right. $$
- an off-the-shelf QP solver can be used to train a hard margin linear SVM simply by passing it the preceding parameters (a sketch follows)
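A hedged sketch of that mapping, assuming the third-party cvxopt package as the off-the-shelf QP solver and a tiny linearly separable toy set of my own: p = (w_1, ..., w_n, b), H penalizes only w, and each training instance contributes one constraint row -t^{(i)}[x^{(i)}, 1] <= -1. This is an illustration of the formulation above, not the book's code.

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_qp(X, t):
    """Solve the hard-margin linear SVM QP for p = (w_1, ..., w_n, b)."""
    m, n = X.shape
    H = np.zeros((n + 1, n + 1))
    H[:n, :n] = np.eye(n)                      # only w is penalized, not the bias b
    f = np.zeros(n + 1)
    A = -t[:, None] * np.c_[X, np.ones(m)]     # -t_i * (w.x_i + b) <= -1
    b = -np.ones(m)
    sol = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b))
    p = np.array(sol['x']).ravel()
    return p[:n], p[n]                         # w, b

# toy linearly separable data; labels must be -1/+1 in this formulation
X_toy = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
t_toy = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = hard_margin_svm_qp(X_toy, t_toy)
print(w, b)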
The Dual Problem
- Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem.
- For an SVM, the primal and dual problems have the same solution.
- Dual form of the linear SVM objective: $$ \underset{\alpha }{minimize}\ \ \frac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} x^{(i)T} x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}\\ subject\ to\ \ \alpha^{(i)} \geq 0\ \ for\ i=1,2,...,m $$
- from dual to primal: $$ \widehat{w} = \sum_{i=1}^{m} {\widehat{\alpha}}^{(i)} t^{(i)} x^{(i)} \\ \widehat{b} = \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}(\widehat{w} \cdot x^{(i)})) $$
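A quick numerical check of the dual-to-primal formula for $\widehat{w}$ (my own illustration, not from the book): for a linear kernel, Scikit-Learn's SVC stores $\widehat{\alpha}^{(i)} t^{(i)}$ for each support vector in dual_coef_, so summing over the support vectors should reproduce coef_.

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)

svc = SVC(kernel='linear', C=1).fit(X, y)

# w_hat = sum_i alpha_i t_i x_i, summed over the support vectors only
w_from_dual = svc.dual_coef_ @ svc.support_vectors_
print(w_from_dual, svc.coef_)   # both rows should match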
Kernelized SVM
A kernel is a function capable of computing the dot product $\phi (a)^T \cdot \phi (b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$.
$$\begin{align*}Linear &:\ \ \ K(a,b)=a^T \cdot b \\ Polynomial &:\ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\ Gaussian\ RBF &:\ \ \ K(a,b)=exp(-\gamma \left \| a-b \right \|^2) \\ Sigmoid &:\ \ \ K(a,b)=tanh(\gamma a^T \cdot b + r) \end{align*}$$
- As long as $K$ satisfies Mercer's condition (it is continuous and symmetric in its arguments), a mapping $\phi$ with $K(a,b)=\phi (a)^T \cdot \phi (b)$ is guaranteed to exist. We may not know what $\phi$ is, but $K$ can still be used as a kernel.
- When computing $\widehat w$ a $\phi (x)$ term appears, which may be very expensive (or impossible) to evaluate. Plugging $\widehat w$ into the decision function lets the kernel trick take over (a sanity check follows the derivation): $$\begin{align*}h_{\widehat w,\widehat b}(\phi (x^{(n)})) &= \widehat w^T \cdot \phi(x^{(n)})+\widehat b \\ &= \left(\sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \phi(x^{(i)})\right)^T \cdot \phi(x^{(n)}) +\widehat b \\ &= \sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \left(\phi(x^{(i)})^T \cdot \phi(x^{(n)})\right) +\widehat b \\ &= \sum_{i=1,\widehat{\alpha}^{(i)} > 0}^{m}\widehat {\alpha}^{(i)} t^{(i)} K(x^{(i)},x^{(n)}) + \widehat b \\ \widehat{b} &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\,\widehat{w}^T \cdot \phi(x^{(i)})\right) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\left(\sum_{j=1}^{m} \widehat {\alpha}^{(j)} t^{(j)} \phi(x^{(j)})\right)^T \cdot \phi(x^{(i)})\right) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\sum_{j=1,\widehat{\alpha}^{(j)} > 0}^{m}\widehat {\alpha}^{(j)} t^{(j)} K(x^{(i)},x^{(j)})\right)\end{align*}$$
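A hedged sanity check of this kernelized decision function (my own illustration, not the book's code): a fitted SVC exposes its support vectors, the products $\widehat{\alpha}^{(i)} t^{(i)}$ (dual_coef_) and $\widehat{b}$ (intercept_), so the sum above can be evaluated by hand and compared with decision_function().

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
svc = SVC(kernel='rbf', gamma=5, C=1).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, svc.support_vectors_, gamma=5)        # K(x_new, x_sv)
h_manual = K @ svc.dual_coef_.ravel() + svc.intercept_      # sum_i alpha_i t_i K(x_sv_i, x_new) + b
print(np.allclose(h_manual, svc.decision_function(X_new)))  # expected: True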
Online SVMs
- "online" means learning incrementally, typically as new instances arrive
- for a linear SVM classifier, minimize the cost function: $$J(w,b)=\frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m}max(0,1-t^{(i)}(w^T\cdot x^{(i)}+b))$$
- $hinge\ loss\ function:\ max(0,\ 1-t)$, which equals 0 when $t \geq 1$ (a numpy sketch of this cost follows)
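A minimal numpy sketch of this cost (my own illustration): the hinge loss and $J(w,b)$ written exactly as above, in a form that could be minimized incrementally with SGD on one instance or mini-batch at a time. The label convention is $t^{(i)} \in \{-1, +1\}$.

import numpy as np

def hinge_loss(t, scores):
    """max(0, 1 - t * (w.x + b)), element-wise; zero for correctly classified
    instances that lie outside the margin."""
    return np.maximum(0.0, 1.0 - t * scores)

def linear_svm_cost(w, b, X, t, C=1.0):
    """J(w, b) = 1/2 * w.w + C * sum_i hinge_loss(t_i, w.x_i + b)."""
    scores = X @ w + b
    return 0.5 * w @ w + C * hinge_loss(t, scores).sum()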