
scikit-learn: 4. Dataset preprocessing (clean data, reduce dimensionality, expand dimensionality, generate/extract features)

This article is based on: http://scikit-learn.org/stable/data_transforms.html


This post covers data preprocessing, in four parts:

data cleaning, dimensionality reduction (PCA and friends), dimensionality expansion (kernel methods), and extracting custom-defined features.

Haha - focusing on preprocessing really is the sensible thing to do.

The key sentence, quoted without translation: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.


The difference between fit, transform, and fit_transform:

fit: learns the model's parameters from the training set (e.g. variance and median; it might also be a vocabulary).

transform: maps training/test data into the representation defined by the parameters learned in fit (e.g. scales the test set using the training set's variance and median; vectorizes the test set under the vocabulary learned by fit).

fit_transform: performs fit and transform in a single step.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
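A minimal sketch of this fit/transform contract, using StandardScaler as the transformer (my own example; the data and variable names are illustrative, not from the original post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean and std from the training set
X_train_scaled = scaler.transform(X_train)  # apply the learned parameters
X_test_scaled = scaler.transform(X_test)    # the test set is scaled with TRAINING statistics

# fit_transform is equivalent to fit(...) followed by transform(...)
assert np.allclose(X_train_scaled, StandardScaler().fit_transform(X_train))
print(scaler.mean_)  # mean learned by fit: [ 2. 20.]
```

Because the test set is transformed with parameters learned from the training set, `X_test_scaled` here comes out as all zeros: the single test row happens to coincide with the training mean.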


Eight main sections; translations will be added gradually:

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46991465
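As a rough illustration of what section 4.1 covers (my own sketch, not taken from the linked translation): a Pipeline chains transformers into a final estimator, while a FeatureUnion concatenates the outputs of several transformers side by side:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Pipeline: each step's transform() output feeds the next step
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.score(X, y))

# FeatureUnion: 2 PCA components + 4 scaled features -> 6 columns
union = FeatureUnion([("pca", PCA(n_components=2)), ("scale", StandardScaler())])
print(union.fit_transform(X).shape)  # (150, 6)
```

Calling `pipe.fit` runs fit_transform on every intermediate step and fit on the final estimator, so the whole chain behaves like a single estimator.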

4.2. Feature extraction

4.2.3. Text feature extraction

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46997379

4.2.4. Image feature extraction

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/46992105


4.3. Preprocessing data

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47016313

4.3.1. Standardization, or mean removal and variance scaling

4.3.2. Normalization

4.3.3. Binarization

4.3.4. Encoding categorical features

4.3.5. Imputation of missing values
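A hedged sketch of the 4.3 operations listed above, written against the current scikit-learn API (note: the 2015-era docs placed imputation in `sklearn.preprocessing.Imputer`; modern versions use `sklearn.impute.SimpleImputer`):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, OneHotEncoder

# 4.3.5 Imputation of missing values: replace NaN with the column mean
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(X_imputed)  # the NaN becomes (4 + 6) / 2 = 5.0

# 4.3.3 Binarization: values above the threshold become 1, others 0
print(Binarizer(threshold=2.0).fit_transform(X_imputed))

# 4.3.4 Encoding categorical features: one-hot encoding
colors = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(colors).toarray())
```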

4.4. Unsupervised dimensionality reduction

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47066239

4.4.1. PCA: principal component analysis

4.4.2. Random projections

4.4.3. Feature agglomeration

4.5. Random Projection

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47067003

4.5.1. The Johnson-Lindenstrauss lemma

4.5.2. Gaussian random projection

4.5.3. Sparse random projection

4.6. Kernel Approximation

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47068223

4.6.1. Nystroem Method for Kernel Approximation

4.6.2. Radial Basis Function Kernel

4.6.3. Additive Chi Squared Kernel

4.6.4. Skewed Chi Squared Kernel

4.6.5. Mathematical Details

4.7. Pairwise metrics, Affinities and Kernels

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47068895

4.7.1. Cosine similarity

4.7.2. Linear kernel

4.7.3. Polynomial kernel

4.7.4. Sigmoid kernel

4.7.5. RBF kernel

4.7.6. Chi-squared kernel

4.8. Transforming the prediction target (y)

For the translated article, see: http://blog.csdn.net/mmc2015/article/details/47069869

4.8.1. Label binarization

4.8.2. Label encoding



