Learning scikit-learn: Preprocessing (Part 1)

2024-11-26 06:36:02, 203 views

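I. Standardization: mean removal and variance scaling

Standardizing datasets: many estimators may perform poorly when individual features are far from standard normally distributed (Gaussian with zero mean and unit variance). In practice, the shape of the distribution is often ignored: the data is centered by subtracting each feature's mean, and then scaled by dividing each non-constant feature by its standard deviation.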

scale

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

Note: each feature (column) of the scaled data has zero mean and unit variance.

>>> X_scaled.mean(axis=0)  # column means
array([ 0.,  0.,  0.])
>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
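To make the centering and scaling steps concrete, the following is a minimal sketch that reproduces the result of scale by hand with NumPy. The names `mean`, `std`, and `X_manual` are illustrative and do not come from the original post.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Per-feature (column) statistics. np.std defaults to the population
# standard deviation (ddof=0), which matches what scale() divides by.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Center each column, then divide by its standard deviation.
X_manual = (X - mean) / std

# The hand-computed result should agree with the library call
# up to floating-point error.
print(np.allclose(X_manual, preprocessing.scale(X)))
```

This only works as written because no column is constant; a constant column would have a standard deviation of zero and would need special handling.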

StandardScaler

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> scaler.mean_
array([ 1.        ,  0.        ,  0.33333333])
>>> scaler.scale_  # this attribute was named std_ in older scikit-learn releases
array([ 0.81649658,  0.81649658,  1.24721913])
>>> scaler.transform(X)
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
>>> scaler.transform([[-1., 1., 0.]])  # scale new data with the fitted mean and std
array([[-2.44948974,  1.22474487, -0.26726124]])

Note: scale and StandardScaler can also be used to scale the target values of a regression model.
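As a rough sketch of scaling regression targets, the example below standardizes a 1-D target vector and maps it back to the original units. The target values in `y` are made up for illustration. The reshape to a single column is an assumption that recent scikit-learn releases require 2-D input here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression targets, for illustration only.
y = np.array([3.0, 5.0, 7.0, 9.0])

y_scaler = StandardScaler()

# Reshape the 1-D targets into one column, scale, then flatten again.
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values (for example, model predictions
# made in the scaled space) back to the original units.
y_restored = y_scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(y_restored)
```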

1. Scaling features to a range

MinMaxScaler (min-max scaling)

Formula: X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)); X_scaled = X_std * (max - min) + min, where (min, max) is the target feature range.

Training (fit_transform())

>>> X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

Testing (transform())

>>> X_test = np.array([[ -3., -1., 4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
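The min-max formula can be checked by hand. The sketch below applies it with NumPy and compares the result against MinMaxScaler for a non-default target range. The range (-1, 1) and the helper `min_max` are illustrative choices, not part of the original post.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[-3., -1., 4.]])

lo, hi = -1.0, 1.0  # hypothetical target range, chosen for illustration

# The minimum and maximum are learned from the training data only.
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)

def min_max(X):
    """Apply X_std * (max - min) + min using the training statistics."""
    X_std = (X - data_min) / (data_max - data_min)
    return X_std * (hi - lo) + lo

scaler = MinMaxScaler(feature_range=(lo, hi)).fit(X_train)

# Both comparisons should hold up to floating-point error.
print(np.allclose(min_max(X_train), scaler.transform(X_train)))
print(np.allclose(min_max(X_test), scaler.transform(X_test)))
```

Because the statistics come from the training set, test values outside the training range map outside the target range, as the -1.5 and 1.66666667 in the example above show.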

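2. Normalization

Normalization scales each individual sample to unit norm. It is the basis of the vector space model (VSM), which is often used in text classification and clustering.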

normalize

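sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True)

norm: 'l1' or 'l2', the norm used to normalize each non-zero sample (or each non-zero feature if axis=0). axis: 1 (the default) normalizes each sample (row) independently; 0 normalizes each feature (column).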

Normalizer

>>> X
array([[ 1., -1.,  2.],
       [ 2.,  0.,  0.],
       [ 0.,  1., -1.]])
>>> normalizer = preprocessing.Normalizer().fit(X)
>>> normalizer
Normalizer(copy=True, norm='l2')

Note: the normalizer instance can then be used as a transformer on other data.
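A short sketch of how the norm and axis arguments behave, and of reusing a Normalizer on new data. The variable names and the new sample `[-1., 1., 0.]` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Default axis=1: every sample (row) is rescaled independently.
X_l2 = normalize(X, norm='l2')
X_l1 = normalize(X, norm='l1')

# axis=0 rescales every feature (column) instead.
X_cols = normalize(X, norm='l2', axis=0)

print(np.linalg.norm(X_l2, axis=1))    # Euclidean norm of each row
print(np.abs(X_l1).sum(axis=1))        # sum of absolute values of each row
print(np.linalg.norm(X_cols, axis=0))  # Euclidean norm of each column

# Normalizer is stateless, so fit() learns nothing from X; the
# instance simply applies the same row-wise rescaling to new data.
normalizer = Normalizer(norm='l2').fit(X)
X_new = normalizer.transform([[-1., 1., 0.]])
print(X_new)
```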

3. Binarization

Feature binarization thresholds numerical feature values to obtain boolean values, which can be useful for downstream probabilistic estimators.

>>> binarizer = preprocessing.Binarizer().fit(X)
>>> binarizer
Binarizer(copy=True, threshold=0.0)
>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])
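The threshold does not have to be 0. As a minimal sketch, the example below uses an arbitrary threshold of 1.1 (an illustrative value) and compares the result with a plain NumPy comparison, since values strictly greater than the threshold map to 1 and all others to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])

# Illustrative threshold; only the two entries equal to 2. exceed it.
binarizer = Binarizer(threshold=1.1).fit(X)
X_bin = binarizer.transform(X)

# Equivalent element-wise comparison written directly in NumPy.
X_manual = (X > 1.1).astype(float)

print(X_bin)
print(np.array_equal(X_bin, X_manual))
```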

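4. Encoding categorical features

Categorical features are often encoded as integers. OneHotEncoder converts a feature with m possible values into m binary features, exactly one of which is active.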

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<type 'float'>,
       n_values='auto', sparse=True)
>>> # Read the output left to right: the first two bits '1, 0' encode the value 0 of the first feature.
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

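By default, the number of values each feature can take is inferred from the dataset. It can be specified explicitly through the n_values parameter (replaced by the categories parameter in newer scikit-learn releases).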
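Specifying the possible values explicitly matters when the training data does not contain every value. The sketch below assumes a scikit-learn release (0.20 or later) in which the per-feature value sets are passed through the categories parameter. The training rows and category lists are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumption: scikit-learn 0.20 or later, where each feature's allowed
# values are listed explicitly (and in sorted order) via `categories`.
enc = OneHotEncoder(categories=[[0, 1], [0, 1, 2], [0, 1, 2, 3]])

# The training rows never contain the value 1 for the second feature
# or the values 1 and 2 for the third; the explicit lists cover them.
enc.fit([[1, 2, 3], [0, 2, 0]])

# 2 + 3 + 4 = 9 output columns, one per declared category.
encoded = enc.transform([[1, 0, 0]]).toarray()
print(encoded)
```

Without the explicit lists, the encoder would only allocate columns for the values it actually saw during fit.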
