scikit-learn: 4.2. Feature extraction (feature extraction, not feature selection)
http://scikit-learn.org/stable/modules/feature_extraction.html
Writing this while sick, from an internet cafe. Support is appreciated.
1. First, two concepts need to be kept apart: feature extraction and feature selection. Feature extraction is very different from feature selection: the former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features (that is, choosing the better features from among those already extracted).
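To make the distinction concrete, here is a minimal sketch. The four toy documents and their labels are made up for illustration, and CountVectorizer and SelectKBest are just one possible pairing of an extraction step with a selection step, not something this post prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and labels, for illustration only.
docs = ["the cat sat", "the dog barked", "cat and dog", "the bird sang"]
labels = [0, 1, 0, 1]

# Feature extraction: raw text -> numeric word-count matrix.
extractor = CountVectorizer()
X_extracted = extractor.fit_transform(docs)

# Feature selection: keep the 2 columns that score best under chi2.
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X_extracted, labels)

print(X_extracted.shape, "->", X_selected.shape)
```

Extraction creates the numeric columns; selection only decides which of the existing columns to keep.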
The rest is split into four parts. The most important one is still part 4, text feature extraction.
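2. Loading features from dicts
The class for this is DictVectorizer, which turns feature dicts (feature name mapped to value) into a numeric matrix that estimators can use. String-valued features are one-hot encoded; numeric values are passed through unchanged.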
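A small sketch of DictVectorizer in use. The city and temperature records are illustrative data, not taken from this post.

```python
from sklearn.feature_extraction import DictVectorizer

# Illustrative records: one categorical feature, one numeric feature.
measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]

vec = DictVectorizer()
X = vec.fit_transform(measurements)  # scipy.sparse matrix by default

print(vec.feature_names_)  # column names, e.g. "city=Dubai"
print(X.toarray())
```

The categorical "city" feature expands into one column per observed value, while "temperature" stays a single numeric column.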
Features such as the PoS (part-of-speech) tags of the words around a token can be vectorized in the same way into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization).
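As a rough sketch of that idea: the two feature windows below are hypothetical (the key names such as "word-1" and "pos+1" are only a naming convention chosen here), and the TfidfTransformer step is optional.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical windows of word/PoS features around two target words.
pos_window = [
    {"word-2": "the", "pos-2": "DT", "word-1": "cat", "pos-1": "NN",
     "word+1": "on", "pos+1": "PP"},
    {"word-2": "a", "pos-2": "DT", "word-1": "dog", "pos-1": "NN",
     "word+1": "in", "pos+1": "PP"},
]

vec = DictVectorizer()                     # sparse output by default
pos_vectorized = vec.fit_transform(pos_window)

# Optional normalization step before handing the matrix to a classifier.
pos_tfidf = TfidfTransformer().fit_transform(pos_vectorized)

print(pos_vectorized.shape)
print(vec.feature_names_)
```

Each distinct (key, string value) pair becomes one binary column, so shared tags such as "pos-2=DT" map to the same column in both rows.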
3. Feature hashing
The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
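Because it hashes, it only keeps the integer index of each feature and not the original string name of the feature, so there is no inverse_transform method.
FeatureHasher accepts dicts, (feature, value) pairs, or strings, depending on the constructor parameter input_type. The result is a scipy.sparse matrix. With strings, each occurrence has an implied value of 1, so ['feat1', 'feat2', 'feat2'] is interpreted as [('feat1', 1), ('feat2', 2)].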
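The three input types can be compared side by side. This is a sketch under two assumptions chosen here for readability: n_features=16 (far smaller than the default) and alternate_sign=False, so that the stored values are plain counts rather than signed ones.

```python
from sklearn.feature_extraction import FeatureHasher

# Assumed settings for the demo: tiny table, no sign flipping.
params = dict(n_features=16, alternate_sign=False)

# One sample expressed three ways.
X_str = FeatureHasher(input_type="string", **params).transform(
    [["feat1", "feat2", "feat2"]]
)
X_pair = FeatureHasher(input_type="pair", **params).transform(
    [[("feat1", 1), ("feat2", 2)]]
)
X_dict = FeatureHasher(input_type="dict", **params).transform(
    [{"feat1": 1, "feat2": 2}]
)

print(X_str.toarray())
print((X_str.toarray() == X_pair.toarray()).all())
print((X_str.toarray() == X_dict.toarray()).all())
```

No fitting is needed because the column index comes straight from the hash of the feature name, which is also why the original names cannot be recovered afterwards.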
4. Text feature extraction
There is too much material here, so it is written up separately. See this blog post: http://blog.csdn.net/mmc2015/article/details/46997379
5. Image feature extraction
Patch extraction:
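The extract_patches_2d function extracts small patches from an image, stored as a two-dimensional array, or three-dimensional with color information along the third axis. reconstruct_from_patches_2d rebuilds the original image from all of its patches, averaging the regions where patches overlap.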
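A minimal sketch of extraction followed by reconstruction. The 4x4 RGB "image" is just synthetic numbers made up for the demo.

```python
import numpy as np
from sklearn.feature_extraction import image

# Synthetic 4x4 image with 3 color channels (illustrative data).
one_image = np.arange(4 * 4 * 3, dtype=float).reshape((4, 4, 3))

# All overlapping 2x2 patches: (4-2+1) * (4-2+1) = 9 of them.
patches = image.extract_patches_2d(one_image, (2, 2))
print(patches.shape)

# Rebuild the image; overlapping pixels are averaged.
reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
print(np.allclose(one_image, reconstructed))
```

Because every patch comes unmodified from the same image, averaging the overlaps gives back the original pixel values.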
The PatchExtractor class works the same way as extract_patches_2d, except that it can take several images as input at once.
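A short sketch with a batch of five synthetic images (again, made-up data), showing that the patches of all images end up stacked in one array.

```python
import numpy as np
from sklearn.feature_extraction import image

# Five synthetic 4x4 RGB images stacked along the first axis.
five_images = np.arange(5 * 4 * 4 * 3, dtype=float).reshape(5, 4, 4, 3)

extractor = image.PatchExtractor(patch_size=(2, 2))
batch_patches = extractor.transform(five_images)

# 9 patches per image, 5 images.
print(batch_patches.shape)

# The first 9 patches should match what extract_patches_2d gives for image 0.
first_image_patches = image.extract_patches_2d(five_images[0], (2, 2))
print(np.allclose(batch_patches[:9], first_image_patches))
```

Since PatchExtractor follows the estimator interface, it can also be placed in a pipeline.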
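Connectivity graph of an image:
The idea is to decide, for every pair of pixels in the image, whether the two are connected, mainly based on the difference (gradient) between neighboring pixels. The result is a connectivity matrix.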
The function img_to_graph returns such a connectivity matrix from a 2D or 3D image. Similarly, grid_to_graph builds a connectivity matrix for images given only the shape of these images.
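Both functions can be tried on a tiny example. The 3x3 image below is synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph, img_to_graph

# Synthetic 3x3 grayscale image (illustrative data).
img = np.arange(9, dtype=float).reshape(3, 3)

# Graph over the 9 pixels; edges between neighbors carry gradient values.
gradient_graph = img_to_graph(img)

# Same neighborhood structure, built from the shape alone.
connectivity = grid_to_graph(3, 3)

print(gradient_graph.shape, connectivity.shape)
print(connectivity.toarray())
```

Both results are sparse matrices with one row and one column per pixel. A matrix like `connectivity` is what clustering estimators that take a connectivity argument, such as agglomerative clustering, expect.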
Here is an intuitive example: http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html#example-cluster-plot-lena-ward-segmentation-py
Headache. Off to sleep.