
Notes : <Hands-on ML with Sklearn & TF> Chapter 5


 
  1. capable of performing linear and nonlinear classification, regression, and even outlier detection
  2. particularly well suited to classification of complex but small or medium-sized datasets
 

Linear SVM Classification

 
  1. The goal is not just to find a line that separates the two classes correctly, but to fit the widest possible "street" between them.
  2. Instances located off the street have no influence on the decision boundary; it is fully determined by the instances on the edge of the street (or inside it), which are called support vectors.
  3. sensitive to the feature scales; apply feature scaling, e.g. with StandardScaler
 

Soft Margin Classification

  1. strictly requiring all instances to be off the street (on the correct side) is called hard margin classification: it only works if the data is linearly separable, and it is sensitive to outliers
  2. in Scikit-Learn's SVM classes you can control margin violations with the C hyperparameter: a smaller C value leads to a wider street (but more margin violations)
In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target']==2).astype(np.float64)

svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
))
svm_clf.fit(X, y)
Out[1]:
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))
In [2]:
svm_clf.predict([[5.5, 1.7]])  # does not output probabilities
Out[2]:
array([ 1.])
In [3]:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

svc_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),
))

m = len(X)
C = 1
sgd_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C))),
))

svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))
 
[ 1.] [ 1.]
 
  1. SGDClassifier is slower than LinearSVC, but it can handle huge datasets and online classification tasks; SVC is slower still
  2. the LinearSVC class regularizes the bias term, so you should center the training set first by subtracting the mean; StandardScaler does this automatically
  3. set loss='hinge' (it is not the default)
  4. for better performance set dual=False, unless there are more features than training instances (a minimal sketch follows below)
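A minimal sketch of these tips (not from the book), reusing X and y from the cell above. Note that scikit-learn only accepts dual=False together with the default loss='squared_hinge', so the hinge loss is dropped in this variant.

# Sketch of the LinearSVC tips above (assumes X, y from the iris cell).
# StandardScaler centers the data, so regularizing the bias term is harmless.
# Note: scikit-learn supports dual=False only with the default
# loss='squared_hinge', not with loss='hinge'.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaler = StandardScaler().fit(X)
fast_svc = LinearSVC(C=1, loss='squared_hinge', dual=False)  # primal solver: preferable when m > n
fast_svc.fit(scaler.transform(X), y)
print(fast_svc.predict(scaler.transform([[5.5, 1.7]])))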
 

Nonlinear SVM Classification

 
  1. add more features, such as polynomial features, so that a linear classifier can handle nonlinear datasets
In [4]:
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline((
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge')),
))
polynomial_svm_clf.fit(X, y)
Out[5]:
Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))
In [21]:
import matplotlib.pyplot as plt

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
 
[Figure: decision boundary of the polynomial-feature LinearSVC on the moons dataset]
 
  • You can use grid search to find the best hyperparameters; it is common to run a coarse search first, then a finer search around the best values found.
 

Adding Similarity Features

  1. add similarity features, computed with a similarity function that measures how much each instance resembles a particular landmark
  2. $$Gaussian\ Radial\ Basis\ Function\ :\ \phi_\gamma (x, \mathscr{l})=e^{-\gamma \left \| x-\mathscr{l} \right \|^2}$$ where $\left \| x-\mathscr{l} \right \|$ is the distance from the instance to the landmark; the new features are computed from these similarities and the original features can then be dropped
  3. how to select landmarks: the simplest way is to create a landmark at the location of each and every instance in the dataset (see the sketch below)
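A small sketch (not from the book) of adding similarity features by hand, assuming X and y are the moons data loaded above; every training instance serves as a landmark, and the gamma value here is arbitrary.

# Sketch: compute Gaussian RBF similarity features explicitly. Each of the m
# training instances is a landmark, so every instance becomes an
# m-dimensional vector of similarities to all landmarks.
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

gamma = 1.0                                    # arbitrary choice for illustration
landmarks = X                                  # one landmark per training instance
X_sim = rbf_kernel(X, landmarks, gamma=gamma)  # shape (m, m): exp(-gamma * ||x - l||^2)

lin_clf = LinearSVC(C=10, loss='hinge')        # a plain linear classifier on the new features
lin_clf.fit(X_sim, y)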
 

Gaussian RBF Kernel

  • When the dataset is large, computing all the additional features is very expensive.
  • The magic of SVMs is that the kernel trick gives you the same result as if you had actually added those features, without having to add them.
In [13]:
rbf_kernel_svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
))
rbf_kernel_svm_clf.fit(X, y)
Out[13]:
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))
In [22]:
%matplotlib inline
from sklearn.svm import SVC

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline((
            ("scaler", StandardScaler()),
            ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C))
        ))
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(14, 9))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)
plt.show()
 
[Figure: 2x2 grid of RBF-kernel SVC decision boundaries for gamma in {0.1, 5} and C in {0.001, 1000}]
 
  • As $\gamma$ increases, each instance's range of influence shrinks and the bell-shaped curve gets narrower.
  • As $\gamma$ decreases, each instance's range of influence grows and the bell-shaped curve gets wider.
  • $C$ behaves similarly to $\gamma$: both act like regularization hyperparameters, so reduce them if the model is overfitting.
 
  1. Other kernels are used more rarely or only in narrow domains.
  2. String kernels are common for text classification and DNA sequences, e.g. the string subsequence kernel or kernels based on the Levenshtein distance.
  3. how to choose:
    1. try the linear kernel first; LinearSVC is fast
    2. then SVC(kernel='linear')
    3. if the training set is not too large, try the RBF kernel
    4. try other kernels using cross-validation and grid search (see the sketch below)
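A hedged sketch of step 4: comparing kernels and hyperparameters with cross-validated grid search on the moons X, y from above. The parameter grids are illustrative only; a coarse search like this would normally be followed by a finer one around the best values.

# Sketch: choose the kernel and its hyperparameters with GridSearchCV
# (illustrative grids; not prescriptive values).
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = [
    {'svm__kernel': ['linear'], 'svm__C': [0.1, 1, 10, 100]},
    {'svm__kernel': ['rbf'], 'svm__C': [0.1, 1, 10, 100],
     'svm__gamma': [0.01, 0.1, 1, 10]},
]
pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])
grid_search = GridSearchCV(pipe, param_grid, cv=5)   # coarse search first
grid_search.fit(X, y)
print(grid_search.best_params_)                      # then search more finely around these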
 

Computational Complexity

 
Class | Time Complexity | Out-of-core Support | Scaling Required | Kernel Trick
LinearSVC | $O(m\times n)$ | No | Yes | No
SGDClassifier | $O(m\times n)$ | Yes | Yes | No
SVC | $O(m^2\times n)$ to $O(m^3\times n)$ | No | Yes | Yes
 

SVM Regression

  1. The difference from the classifier: instead of fitting the widest possible street between two classes, SVM regression tries to fit as many instances as possible on the street (within the margin) while limiting margin violations.
In [33]:
from sklearn.svm import LinearSVR
import numpy.random as rnd

rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
Out[33]:
LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)
In [34]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
Out[34]:
SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [35]:
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices where the condition holds (nonzero entries)

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # actually the instances outside the margin
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # position used to annotate epsilon

def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)

plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
        '', xy=(eps_x1, eps_y_pred), xycoords='data',
        xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon),
        textcoords='data', arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
    )
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
 
[Figure: LinearSVR fits with epsilon = 1.5 (left) and epsilon = 0.5 (right); off-margin instances highlighted]
 

Under the Hood

 

Decision Function and Predictions

  1. $$ w^T \cdot x + b = w_1 x_1 + \cdots + w_n x_n + b \ \ ;\ \ bias=b,\ weight\ vector=w \\ \widehat{y}=\left\{\begin{matrix} 0\ \ if\ w^T \cdot x +b <0\\ 1\ \ if\ w^T \cdot x +b \geq 0 \end{matrix}\right.$$
  2. Training a linear SVM classifier means finding the value of $w$ and $b$ that make this margin as wide as possible while avoiding margin violations or limiting them.
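To make the decision function concrete, here is a small self-contained sketch (not from the book) that retrains a LinearSVC on the iris petal features, as in the next cell, computes $w^T \cdot x + b$ by hand, and checks it against decision_function.

# Sketch: compute the decision function w^T.x + b by hand and compare it with
# LinearSVC's decision_function.
import numpy as np
from sklearn import datasets
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris['data'][:, (2, 3)]                  # petal length, petal width
y_iris = (iris['target'] == 2).astype(np.float64)

clf = LinearSVC(C=10, loss='hinge')
clf.fit(X_iris, y_iris)

w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([5.5, 1.7])
h = x_new.dot(w) + b                              # w^T . x + b
print(h, clf.decision_function([x_new]))          # the two values match
print(int(h >= 0), clf.predict([x_new]))          # predict class 1 iff h >= 0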
In [36]:
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target']==2).astype(np.float64)
In [47]:
from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1, x2, np.zeros_like(x1), color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^t \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)

svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])
plt.show()
 
[Figure: 3D plot of the decision function h over petal length and petal width]
 

Training Objective

  1. Require every instance to be classified correctly (hard margin): $$\Rightarrow\ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1$$
  2. Training then becomes the constrained optimization problem $$ \underset{w,b}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w = \frac{1}{2}\left \| w \right \|^2 \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1,\ \ for\ i=1,2,3,...,m $$ ($\frac{1}{2}\left \| w \right \|^2$ is minimized instead of $\left \| w \right \|$ because the latter is not differentiable at $w=0$.)
  3. To obtain a soft margin, introduce a slack variable $\varsigma^{(i)} \geq 0$ for each instance, which measures how much that instance is allowed to violate the margin.
  4. $w$ sets the slope of the decision function: the smaller $w$, the wider the margin; the smaller $\varsigma$, the fewer the margin violations, but then the margin also tends to shrink.
  5. $C$ trades off these two objectives (a small $w$ versus small slack values).
  6. $$ \underset{w,b,\varsigma}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m} \varsigma^{(i)} = \frac{1}{2}\left \| w \right \|^2 + C\sum_{i=1}^{m} \varsigma^{(i)} \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1-\varsigma^{(i)},\ \ and\ \varsigma^{(i)} \geq 0\ ,\ for\ i=1,2,3,...,m $$
 

Quadratic Programming

  1. $$ \underset{p}{Minimize}\ \ \frac{1}{2} p^T \cdot H \cdot p + f^T \cdot p \\ subject\ to\ \ \ A \cdot p \leq b \\ where\ \left\{\begin{matrix} p\ \ is\ an\ n_p\ dimensional\ vector(=number\ of\ parameters)\\ H\ \ is\ an\ n_p\ \times\ n_p\ matrix\\ f\ \ is\ an\ n_p\ dimensional\ vector\\ A\ \ is\ an\ n_c\ \times\ n_p\ matrix(n_c=number\ of\ constraints)\\ b\ \ is\ an\ n_c\ dimensional\ vector \end{matrix}\right. $$
  2. use an off-the-shelf QP solver, passing it the preceding parameters, to train a hard margin linear SVM (a sketch of this follows below)
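A sketch of that idea, assuming the cvxopt package is installed (this is not the book's code): the parameter vector is $p = [b, w_1, ..., w_n]$, and the tiny value on the bias entry of $H$ is only a numerical convenience to keep the matrix well-conditioned, not part of the formulation.

# Sketch: train a hard margin linear SVM by handing the QP above to cvxopt.
# cvxopt.solvers.qp minimizes (1/2) p^T P p + q^T p subject to G p <= h.
import numpy as np
from cvxopt import matrix, solvers
from sklearn import datasets

iris = datasets.load_iris()
X_qp = iris['data'][:, (2, 3)]                    # petal length, petal width
t = np.where(iris['target'] == 0, 1.0, -1.0)      # setosa vs. rest: linearly separable

m, n = X_qp.shape
X_dot = np.c_[np.ones(m), X_qp]                   # prepend 1 for the bias term

H = np.eye(n + 1)
H[0, 0] = 1e-6                                    # (almost) do not regularize the bias
f = np.zeros(n + 1)
A = -t[:, None] * X_dot                           # -t_i * [1, x_i]
b_vec = -np.ones(m)                               # A p <= b  <=>  t_i (w.x_i + b) >= 1

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b_vec))
p = np.array(sol['x']).ravel()
b_hat, w_hat = p[0], p[1:]
print('b =', b_hat, 'w =', w_hat)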
 

The Dual Problem

  1. the constrained optimization problem above is known as the primal problem; it is possible to express a different but closely related dual problem
  2. for the SVM, the primal and the dual problem have the same solution
  3. Dual form of the linear SVM objective: $$ \underset{\alpha }{minimize}\ \ \frac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} x^{(i)T} x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}\\ subject\ to\ \ \alpha^{(i)} \geq 0\ \ for\ i=1,2,...,m $$
  4. from the dual to the primal solution (see the sketch below): $$ \widehat{w} = \sum_{i=1}^{m} {\widehat{\alpha}}^{(i)} t^{(i)} x^{(i)} \\ \widehat{b} = \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\left(\widehat{w}^T \cdot x^{(i)}\right)\right) $$
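In scikit-learn, a fitted SVC exposes the dual solution: dual_coef_ holds $\widehat{\alpha}^{(i)} t^{(i)}$ for the support vectors and support_vectors_ holds the corresponding $x^{(i)}$, so for a linear kernel $\widehat w$ can be recovered as above and compared with coef_. A small sketch (not from the book) on the iris petal features:

# Sketch: recover the primal w from the dual solution of a linear-kernel SVC.
# dual_coef_ stores alpha_i * t_i (support vectors only), so
# w_hat = sum_i alpha_i t_i x_i reduces to dual_coef_ @ support_vectors_.
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris['data'][:, (2, 3)]
y_iris = (iris['target'] == 2).astype(np.float64)

svc = SVC(kernel='linear', C=1)
svc.fit(X_iris, y_iris)

w_from_dual = svc.dual_coef_.dot(svc.support_vectors_).ravel()
print(w_from_dual)          # same as the primal weights below
print(svc.coef_.ravel())    # coef_ is only available for kernel='linear'
print(svc.intercept_)       # b_hat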
 

Kernelized SVM
A kernel is a function capable of computing the dot product $\phi (a)^T \cdot \phi (b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$.
$$\begin{align*}
Linear &:\ \ \ K(a,b)=a^T \cdot b \\
Polynomial &:\ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\
Gaussian\ RBF &:\ \ \ K(a,b)=\exp(-\gamma \left \| a-b \right \|^2) \\
Sigmoid &:\ \ \ K(a,b)=\tanh(\gamma a^T \cdot b + r)
\end{align*}$$

  1. As long as $K$ satisfies Mercer's condition (continuous and symmetric in its arguments), a transformation $\phi$ with $K(a,b)=\phi (a)^T \cdot \phi (b)$ is guaranteed to exist, even if we do not know what it is; that is enough to use $K$ as a kernel.
  2. Computing $\widehat w$ involves $\phi (x)$, which may be hard or impossible to compute. Substituting $\widehat w$ into the decision function instead gives (see the sketch below for this formula in code): $$\begin{align*}
h_{\widehat w,\widehat b}(\phi (x^{(n)})) &= \widehat w^T \cdot \phi(x^{(n)})+\widehat b \\
&= \left(\sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \phi(x^{(i)})\right)^T \cdot \phi(x^{(n)}) +\widehat b \\
&= \sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \left(\phi(x^{(i)})^T \cdot \phi(x^{(n)})\right) +\widehat b \\
&= \sum_{i=1,\widehat{\alpha}^{(i)} > 0}^{m}\widehat {\alpha}^{(i)} t^{(i)} K(x^{(i)},x^{(n)}) + \widehat b \\
\widehat{b} &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\left(\widehat{w}^T \cdot \phi(x^{(i)})\right)\right) \\
&= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\left(\sum_{j=1}^{m} \widehat {\alpha}^{(j)} t^{(j)} \phi(x^{(j)})\right)^T \cdot \phi(x^{(i)})\right)\\
&= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\sum_{j=1,\widehat{\alpha}^{(j)} > 0}^{m}\widehat {\alpha}^{(j)} t^{(j)} K(x^{(i)},x^{(j)})\right)
\end{align*}$$
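A sketch (not from the book) of the last formula in code: for a fitted RBF-kernel SVC, the decision function can be reproduced from the support vectors alone, with rbf_kernel standing in for $\phi(x^{(i)})^T \cdot \phi(x^{(n)})$; it refits an SVC on the moons data to stay self-contained.

# Sketch of kernelized prediction: the decision function of a fitted RBF SVC
# equals sum_i alpha_i t_i K(x_i, x_new) + b over the support vectors only.
# dual_coef_ already stores alpha_i * t_i.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_m, y_m = make_moons(n_samples=100, noise=0.15, random_state=42)
svc = SVC(kernel='rbf', gamma=5, C=1000)
svc.fit(X_m, y_m)

x_new = np.array([[0.5, 0.0]])
K = rbf_kernel(svc.support_vectors_, x_new, gamma=svc.gamma)   # K(x_i, x_new)
manual = svc.dual_coef_.dot(K) + svc.intercept_
print(manual.ravel(), svc.decision_function(x_new))            # the two match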
 

Online SVMs

  1. means learning incrementally, typically as new instances arrive
  2. for Linear SVM Classifier:$$J(w,b)=\frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m}max(0,1-t^{(i)}(w^T\cdot x^{(i)}+b))$$
  3. the hinge loss function is $max(0,\ 1-t)$ (a small sketch follows below)
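A small sketch (not from the book) combining both points: the hinge loss written out, and incremental training of a linear SVM via SGDClassifier's partial_fit on mini-batches; the batch split is purely illustrative.

# Sketch: hinge loss, plus incremental (online) training with partial_fit.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import SGDClassifier

def hinge_loss(t):
    """max(0, 1 - t), where t = y_true * (w.x + b) with y_true in {-1, +1}."""
    return np.maximum(0, 1 - t)

X_m, y_m = make_moons(n_samples=1000, noise=0.15, random_state=42)
sgd_clf = SGDClassifier(loss='hinge', alpha=0.001)
for batch_idx in np.array_split(np.arange(len(X_m)), 10):   # 10 mini-batches
    sgd_clf.partial_fit(X_m[batch_idx], y_m[batch_idx], classes=[0, 1])
print(sgd_clf.score(X_m, y_m))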

 
