Notes : <Hands-on ML with Sklearn & TF> Chapter 5
- capable of performing linear and nonlinear classification, regression, and even outlier detection
- particularly well suited to classification of complex but small- or medium-sized datasets
Linear SVM Classification
- Instead of just fitting any line that separates the two classes, the SVM fits the widest possible street between them (large margin classification).
- Instances outside the street's borders have no influence on the decision boundary; it is determined only by the instances located on the edge of the street, which are called support vectors.
- sensitive to the feature scales; apply feature scaling first, e.g. with StandardScaler
Soft Margin Classification
- Strictly requiring all instances to be off the street and on the correct side is called hard margin classification; it only works if the data is linearly separable, and it is sensitive to outliers.
- In Scikit-Learn's SVM classes you can control the number of margin violations with the C hyperparameter: a smaller C value leads to a wider street but more margin violations.
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)

svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge')),
))
svm_clf.fit(X, y)
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))))
svm_clf.predict([[5.5, 1.7]])  # LinearSVC doesn't output probabilities
array([ 1.])
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

svc_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='linear', C=1)),
))

m = len(X)
C = 1
sgd_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('sgd', SGDClassifier(loss='hinge', alpha=1/(m*C))),
))

svc_clf.fit(X, y)
sgd_clf.fit(X, y)
print(svc_clf.predict([[5.5, 1.7]]), sgd_clf.predict([[5.5, 1.7]]))
[ 1.] [ 1.]
- SGDClassifier is slower than LinearSVC, but it can handle huge datasets and online classification tasks; SVC is slower still.
- The LinearSVC class regularizes the bias term, so you should subtract the mean first, or let StandardScaler do it automatically.
- set loss='hinge'
- for better performance set dual=False, unless there are more features than training instances
Nonlinear SVM Classification
- add more features, such as polynomial features, so that a linear classifier can still be used
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline((
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge')),
))
polynomial_svm_clf.fit(X, y)
Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))))
import matplotlib.pyplot as plt

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

def plot_datasets(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], 'bs')
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], 'g^')
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r'$x_1$', fontsize=10)
    plt.ylabel(r'$x_2$', fontsize=10, rotation=0)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
plt.show()
- Grid search can be used to find the best hyperparameters. It is usually faster to do a coarse search first, then a finer search around the best values found (a sketch follows).
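A hedged sketch of this coarse-then-fine search (the parameter grids are my own illustration, not the book's), reusing the moons dataset X, y built in the code above:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# coarse pass: a wide logarithmic grid
coarse_grid = {'gamma': [0.01, 0.1, 1, 10], 'C': [0.1, 1, 10, 100]}
coarse = GridSearchCV(SVC(kernel='rbf'), coarse_grid, cv=3)
coarse.fit(X, y)

# fine pass: a narrower grid centered on the best coarse values
g, c = coarse.best_params_['gamma'], coarse.best_params_['C']
fine_grid = {'gamma': [g/2, g, 2*g], 'C': [c/2, c, 2*c]}
fine = GridSearchCV(SVC(kernel='rbf'), fine_grid, cv=3)
fine.fit(X, y)
print(fine.best_params_)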
Adding Similarity Features
- add a similarity feature for each landmark, computed with a similarity function that measures how much each instance resembles that particular landmark
- $$Gaussian\ Radial\ Basis\ Function:\ \ \phi_{\gamma}(x, \ell)=e^{-\gamma \left \| x-\ell \right \|^2}$$ Here $\left \| x-\ell \right \|$ is the distance from the instance to the landmark; the new features computed this way can then replace the original ones.
- how to select landmarks? The simplest approach is to create a landmark at the location of each and every instance in the dataset (see the sketch below)
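A minimal sketch of this idea (my own illustration, not the book's code): treating every training instance as a landmark, the similarity features from the formula above can be computed with sklearn.metrics.pairwise.rbf_kernel, turning an m×n dataset into an m×m feature matrix that a plain linear classifier can use. The gamma value here is an arbitrary choice.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# one landmark per training instance -> m similarity features per instance
X_sim = rbf_kernel(X, X, gamma=0.3)   # X_sim[i, j] = exp(-gamma * ||x_i - x_j||^2)
print(X_sim.shape)                    # (100, 100)

# the transformed features can be fed to a plain linear classifier
lin_clf = LinearSVC(C=10).fit(X_sim, y)
print(lin_clf.score(X_sim, y))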
Gaussian RBF Kernel
- When the dataset is large, computing a similarity feature for every landmark is very expensive.
- The magic of SVMs is the kernel trick: you get the same result as if you had actually added all those features, without having to add them.
rbf_kernel_svm_clf = Pipeline((
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001)),
))
rbf_kernel_svm_clf.fit(X, y)
Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma=5, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))))
%matplotlib inline
from sklearn.svm import SVC

gamma1, gamma2 = 0.1, 5
C1, C2 = 0.001, 1000
hyperparams = (gamma1, C1), (gamma1, C2), (gamma2, C1), (gamma2, C2)

svm_clfs = []
for gamma, C in hyperparams:
    rbf_kernel_svm_clf = Pipeline((
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=gamma, C=C)),
    ))
    rbf_kernel_svm_clf.fit(X, y)
    svm_clfs.append(rbf_kernel_svm_clf)

plt.figure(figsize=(14, 9))
for i, svm_clf in enumerate(svm_clfs):
    plt.subplot(221 + i)
    plot_predictions(svm_clf, [-1.5, 2.5, -1, 1.5])
    plot_datasets(X, y, [-1.5, 2.5, -1, 1.5])
    gamma, C = hyperparams[i]
    plt.title(r"$\gamma = {}, C = {}$".format(gamma, C), fontsize=10)
plt.show()
- Increasing $\gamma$ makes the bell-shaped curve narrower, so each instance's range of influence is smaller.
- Decreasing $\gamma$ makes the bell-shaped curve wider, so each instance's range of influence is larger.
- $C$ behaves similarly to $\gamma$: both act like regularization hyperparameters.
- Other kernels are used more rarely or only in narrow application domains.
- String kernels are often used for text classification or DNA sequences, e.g. the string subsequence kernel, or kernels based on the Levenshtein distance.
- how to choose a kernel:
- try the linear kernel first; LinearSVC is fast
- then try SVC(kernel='linear')
- if the training set is not too large, try the Gaussian RBF kernel
- try other kernels using cross-validation and grid search
Computational Complexity
Class | Time Complexity | Out-of-core Support | Scaling Required | Kernel Trick
---|---|---|---|---
LinearSVC | $O(m \times n)$ | No | Yes | No
SGDClassifier | $O(m \times n)$ | Yes | Yes | No
SVC | $O(m^2 \times n)$ to $O(m^3 \times n)$ | No | Yes | Yes
SVM Regression
- The difference from the classifier: SVM regression tries to fit as many instances as possible inside the margin (the street, whose width is controlled by $\epsilon$) while limiting instances that fall outside it.
from sklearn.svm import LinearSVR
import numpy.random as rnd

rnd.seed(42)
m = 50
X = 2 * rnd.rand(m, 1)
y = (4 + 3 * X + rnd.randn(m, 1)).ravel()

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True, intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000, random_state=None, tol=0.0001, verbose=0)
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto', kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
svm_reg1 = LinearSVR(epsilon=1.5)
svm_reg2 = LinearSVR(epsilon=0.5)
svm_reg1.fit(X, y)
svm_reg2.fit(X, y)

def find_support_vectors(svm_reg, X, y):
    y_pred = svm_reg.predict(X)
    off_margin = (np.abs(y - y_pred) >= svm_reg.epsilon)
    return np.argwhere(off_margin)  # indices where the condition holds (nonzero positions)

svm_reg1.support_ = find_support_vectors(svm_reg1, X, y)  # actually the instances outside the margin
svm_reg2.support_ = find_support_vectors(svm_reg2, X, y)

eps_x1 = 1
eps_y_pred = svm_reg1.predict([[eps_x1]])  # prediction at this x position, used to mark epsilon on the plot

def plot_svm_regression(svm_reg, X, y, axes):
    x1s = np.linspace(axes[0], axes[1], 100).reshape(100, 1)
    y_pred = svm_reg.predict(x1s)
    plt.plot(x1s, y_pred, "k-", linewidth=2, label=r"$\hat{y}$")
    plt.plot(x1s, y_pred + svm_reg.epsilon, "k--")
    plt.plot(x1s, y_pred - svm_reg.epsilon, "k--")
    plt.scatter(X[svm_reg.support_], y[svm_reg.support_], s=180, facecolors='#FFAAAA')
    plt.plot(X, y, "bo")
    plt.xlabel(r"$x_1$", fontsize=18)
    plt.legend(loc="upper left", fontsize=18)
    plt.axis(axes)

plt.figure(figsize=(9, 4))
plt.subplot(121)
plot_svm_regression(svm_reg1, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg1.epsilon), fontsize=18)
plt.ylabel(r"$y$", fontsize=18, rotation=0)
#plt.plot([eps_x1, eps_x1], [eps_y_pred, eps_y_pred - svm_reg1.epsilon], "k-", linewidth=2)
plt.annotate(
    '', xy=(eps_x1, eps_y_pred), xycoords='data',
    xytext=(eps_x1, eps_y_pred - svm_reg1.epsilon), textcoords='data',
    arrowprops={'arrowstyle': '<->', 'linewidth': 1.5}
)
plt.text(0.91, 5.6, r"$\epsilon$", fontsize=20)
plt.subplot(122)
plot_svm_regression(svm_reg2, X, y, [0, 2, 3, 11])
plt.title(r"$\epsilon = {}$".format(svm_reg2.epsilon), fontsize=18)
plt.show()
Under the Hood
Decision Function and Predictions
- $$ w^T \cdot x + b = w_1 x_1 + \cdots + w_n x_n + b \ \ ;\ \ bias=b,\ weight\ vector=w \\ \widehat{y}=\left\{\begin{matrix} 0\ \ if\ w^T \cdot x +b <0\\ 1\ \ if\ w^T \cdot x +b \geq 0 \end{matrix}\right.$$
- Training a linear SVM classifier means finding the value of $w$ and $b$ that make this margin as wide as possible while avoiding margin violations or limiting them.
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)
from mpl_toolkits.mplot3d import Axes3D

def plot_3D_decision_function(ax, w, b, x1_lim=[4, 6], x2_lim=[0.8, 2.8]):
    x1_in_bounds = (X[:, 0] > x1_lim[0]) & (X[:, 0] < x1_lim[1])
    X_crop = X[x1_in_bounds]
    y_crop = y[x1_in_bounds]
    x1s = np.linspace(x1_lim[0], x1_lim[1], 20)
    x2s = np.linspace(x2_lim[0], x2_lim[1], 20)
    x1, x2 = np.meshgrid(x1s, x2s)
    xs = np.c_[x1.ravel(), x2.ravel()]
    df = (xs.dot(w) + b).reshape(x1.shape)
    m = 1 / np.linalg.norm(w)
    boundary_x2s = -x1s*(w[0]/w[1])-b/w[1]
    margin_x2s_1 = -x1s*(w[0]/w[1])-(b-1)/w[1]
    margin_x2s_2 = -x1s*(w[0]/w[1])-(b+1)/w[1]
    ax.plot_surface(x1s, x2, 0, color="b", alpha=0.2, cstride=100, rstride=100)
    ax.plot(x1s, boundary_x2s, 0, "k-", linewidth=2, label=r"$h=0$")
    ax.plot(x1s, margin_x2s_1, 0, "k--", linewidth=2, label=r"$h=\pm 1$")
    ax.plot(x1s, margin_x2s_2, 0, "k--", linewidth=2)
    ax.plot(X_crop[:, 0][y_crop==1], X_crop[:, 1][y_crop==1], 0, "g^")
    ax.plot_wireframe(x1, x2, df, alpha=0.3, color="k")
    ax.plot(X_crop[:, 0][y_crop==0], X_crop[:, 1][y_crop==0], 0, "bs")
    ax.axis(x1_lim + x2_lim)
    ax.text(4.5, 2.5, 3.8, "Decision function $h$", fontsize=15)
    ax.set_xlabel(r"Petal length", fontsize=15)
    ax.set_ylabel(r"Petal width", fontsize=15)
    ax.set_zlabel(r"$h = \mathbf{w}^t \cdot \mathbf{x} + b$", fontsize=18)
    ax.legend(loc="upper left", fontsize=16)

svm_clf2 = LinearSVC(C=10, loss='hinge')
svm_clf2.fit(X, y)

fig = plt.figure(figsize=(11, 6))
ax1 = fig.add_subplot(111, projection='3d')
plot_3D_decision_function(ax1, w=svm_clf2.coef_[0], b=svm_clf2.intercept_[0])
plt.show()
Training Objective
- Requiring every instance to be classified correctly (hard margin) gives the constraint: $$t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1$$
- Training then becomes a constrained optimization problem: $$ \underset{w,b}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w = \frac{1}{2}\left \| w \right \|^2 \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1,\ \ for\ i=1,2,3,...,m $$ We minimize $\frac{1}{2}\left \| w \right \|^2$ rather than $\left \| w \right \|$ because the latter is not differentiable at $w=0$.
- To get a soft margin, introduce a slack variable $\varsigma^{(i)} \geq 0$ for each instance; it measures how much that instance is allowed to violate the margin.
- $w$ determines the slope of the decision function: the smaller $w$, the wider the margin. The smaller $\varsigma$, the fewer margin violations, but the margin then tends to be narrower.
- The hyperparameter $C$ balances these two conflicting objectives ($w$ versus $\varsigma$).
- $$ \underset{w,b,\varsigma}{minimize}\ \ \ \ \frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m} \varsigma^{(i)} = \frac{1}{2}\left \| w \right \|^2 + C\sum_{i=1}^{m} \varsigma^{(i)} \\ subject\ to \ \ \ \ t^{(i)}(w^T \cdot x^{(i)} +b) \geq 1-\varsigma^{(i)},\ \ and\ \varsigma^{(i)} \geq 0\ ,\ for\ i=1,2,3,...,m $$
Quadratic Programming
- $$ \underset{p}{Minimize}\ \ \frac{1}{2} p^T \cdot H \cdot p + f^T \cdot p \\ subject\ to\ \ \ A \cdot p \leq b \\ where\ \left\{\begin{matrix} p\ \ is\ an\ n_p\ dimensional\ vector(=number\ of\ parameters)\\ H\ \ is\ an\ n_p\ \times\ n_p\ matrix\\ f\ \ is\ an\ n_p\ dimensional\ vector\\ A\ \ is\ an\ n_c\ \times\ n_p\ matrix(n_c=number\ of\ constraints)\\ b\ \ is\ an\ n_c\ dimensional\ vector \end{matrix}\right. $$
- an off-the-shelf QP solver can be used to train a hard margin linear SVM simply by passing it the preceding parameters (a sketch follows)
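A hedged sketch of that mapping, assuming the third-party cvxopt package as the off-the-shelf QP solver and a tiny linearly separable toy set of my own: p = (w_1, ..., w_n, b), H penalizes only w, and each training instance contributes one constraint row -t^{(i)}[x^{(i)}, 1] <= -1. This is an illustration of the formulation above, not the book's code.

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_qp(X, t):
    """Solve the hard-margin linear SVM QP for p = (w_1, ..., w_n, b)."""
    m, n = X.shape
    H = np.zeros((n + 1, n + 1))
    H[:n, :n] = np.eye(n)                      # only w is penalized, not the bias b
    f = np.zeros(n + 1)
    A = -t[:, None] * np.c_[X, np.ones(m)]     # -t_i * (w.x_i + b) <= -1
    b = -np.ones(m)
    sol = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b))
    p = np.array(sol['x']).ravel()
    return p[:n], p[n]                         # w, b

# toy linearly separable data; labels must be -1/+1 in this formulation
X_toy = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]])
t_toy = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = hard_margin_svm_qp(X_toy, t_toy)
print(w, b)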
The Dual Problem
- Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem.
- For an SVM, the primal and dual problems have the same solution.
- Dual form of the linear SVM objective: $$ \underset{\alpha }{minimize}\ \ \frac{1}{2}\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} x^{(i)T} x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}\\ subject\ to\ \ \alpha^{(i)} \geq 0\ \ for\ i=1,2,...,m $$
- from dual to primal: $$ \widehat{w} = \sum_{i=1}^{m} {\widehat{\alpha}}^{(i)} t^{(i)} x^{(i)} \\ \widehat{b} = \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}(1-t^{(i)}(\widehat{w} \cdot x^{(i)})) $$
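A quick numerical check of the dual-to-primal formula for $\widehat{w}$ (my own illustration, not from the book): for a linear kernel, Scikit-Learn's SVC stores $\widehat{\alpha}^{(i)} t^{(i)}$ for each support vector in dual_coef_, so summing over the support vectors should reproduce coef_.

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]
y = (iris['target'] == 2).astype(np.float64)

svc = SVC(kernel='linear', C=1).fit(X, y)

# w_hat = sum_i alpha_i t_i x_i, summed over the support vectors only
w_from_dual = svc.dual_coef_ @ svc.support_vectors_
print(w_from_dual, svc.coef_)   # both rows should match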
Kernelized SVM
A kernel is a function capable of computing the dot product $\phi (a)^T \cdot \phi (b)$ based only on the original vectors $a$ and $b$, without having to compute (or even to know about) the transformation $\phi$.
$$\begin{align*}Linear &:\ \ \ K(a,b)=a^T \cdot b \\ Polynomial &:\ \ \ K(a,b)=(\gamma a^T \cdot b + r)^d \\ Gaussian\ RBF &:\ \ \ K(a,b)=exp(-\gamma \left \| a-b \right \|^2) \\ Sigmoid &:\ \ \ K(a,b)=tanh(\gamma a^T \cdot b + r) \end{align*}$$
- As long as $K$ satisfies Mercer's condition (it is continuous and symmetric in its arguments), a mapping $\phi$ with $K(a,b)=\phi (a)^T \cdot \phi (b)$ is guaranteed to exist. We may not know what $\phi$ is, but $K$ can still be used as a kernel.
- When computing $\widehat w$ a $\phi (x)$ term appears, which may be very expensive (or impossible) to evaluate. Plugging $\widehat w$ into the decision function lets the kernel trick take over (a sanity check follows the derivation): $$\begin{align*}h_{\widehat w,\widehat b}(\phi (x^{(n)})) &= \widehat w^T \cdot \phi(x^{(n)})+\widehat b \\ &= \left(\sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \phi(x^{(i)})\right)^T \cdot \phi(x^{(n)}) +\widehat b \\ &= \sum_{i=1}^{m} \widehat {\alpha}^{(i)} t^{(i)} \left(\phi(x^{(i)})^T \cdot \phi(x^{(n)})\right) +\widehat b \\ &= \sum_{i=1,\widehat{\alpha}^{(i)} > 0}^{m}\widehat {\alpha}^{(i)} t^{(i)} K(x^{(i)},x^{(n)}) + \widehat b \\ \widehat{b} &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\,\widehat{w}^T \cdot \phi(x^{(i)})\right) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\left(\sum_{j=1}^{m} \widehat {\alpha}^{(j)} t^{(j)} \phi(x^{(j)})\right)^T \cdot \phi(x^{(i)})\right) \\ &= \frac{1}{n_s} \sum_{i=1,{\widehat{\alpha}}^{(i)}>0}^{m}\left(1-t^{(i)}\sum_{j=1,\widehat{\alpha}^{(j)} > 0}^{m}\widehat {\alpha}^{(j)} t^{(j)} K(x^{(i)},x^{(j)})\right)\end{align*}$$
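A hedged sanity check of this kernelized decision function (my own illustration, not the book's code): a fitted SVC exposes its support vectors, the products $\widehat{\alpha}^{(i)} t^{(i)}$ (dual_coef_) and $\widehat{b}$ (intercept_), so the sum above can be evaluated by hand and compared with decision_function().

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
svc = SVC(kernel='rbf', gamma=5, C=1).fit(X, y)

X_new = X[:5]
K = rbf_kernel(X_new, svc.support_vectors_, gamma=5)        # K(x_new, x_sv)
h_manual = K @ svc.dual_coef_.ravel() + svc.intercept_      # sum_i alpha_i t_i K(x_sv_i, x_new) + b
print(np.allclose(h_manual, svc.decision_function(X_new)))  # expected: True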
Online SVMs
- "online" means learning incrementally, typically as new instances arrive
- for a linear SVM classifier, minimize the cost function: $$J(w,b)=\frac{1}{2}w^T\cdot w + C\sum_{i=1}^{m}max(0,1-t^{(i)}(w^T\cdot x^{(i)}+b))$$
- $hinge\ loss\ function:\ max(0,\ 1-t)$, which equals 0 when $t \geq 1$ (a numpy sketch of this cost follows)
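A minimal numpy sketch of this cost (my own illustration): the hinge loss and $J(w,b)$ written exactly as above, in a form that could be minimized incrementally with SGD on one instance or mini-batch at a time. The label convention is $t^{(i)} \in \{-1, +1\}$.

import numpy as np

def hinge_loss(t, scores):
    """max(0, 1 - t * (w.x + b)), element-wise; zero for correctly classified
    instances that lie outside the margin."""
    return np.maximum(0.0, 1.0 - t * scores)

def linear_svm_cost(w, b, X, t, C=1.0):
    """J(w, b) = 1/2 * w.w + C * sum_i hinge_loss(t_i, w.x_i + b)."""
    scores = X @ w + b
    return 0.5 * w @ w + C * hinge_loss(t, scores).sum()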