
A First Look at Classification Algorithms (1): the kNN Nearest-Neighbor Algorithm

Example: suppose we want to build a classifier from 1000 rows of training samples, dividing people into 3 classes (like, so-so, dislike). Each sample has 3 main features:

 A: number of frequent-flyer miles earned per year

 B: percentage of time spent playing video games

 C: liters of ice cream consumed per week

1. Reading the data

```python
from numpy import zeros

filename = 'D://machine_learn//Ch02//datingTestSet2.txt'

def file2matrix(filename):
    fr = open(filename)
    a = fr.readlines()
    numberOfLines = len(a)                  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # matrix of features to return
    classLabelVector = []                   # labels to return
    index = 0
    for line in a:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]   # row `index` = the three feature values
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)
```
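If the data file is not at hand, the same parsing logic can be exercised against an in-memory string. The three rows below are invented for illustration; only the tab-separated layout (three floats and an integer label) matches the real file:

```python
import io
import numpy as np

# Hypothetical rows in the datingTestSet2.txt layout:
# miles \t game-time% \t ice-cream \t label
sample = ("40920\t8.326976\t0.953952\t3\n"
          "14488\t7.153469\t1.673904\t2\n"
          "26052\t1.441871\t0.805124\t1\n")

def parse_lines(fr):
    lines = fr.readlines()
    mat = np.zeros((len(lines), 3))
    labels = []
    for i, line in enumerate(lines):
        parts = line.strip().split('\t')
        mat[i, :] = [float(x) for x in parts[0:3]]   # first three fields are features
        labels.append(int(parts[-1]))                # last field is the class label
    return mat, labels

mat, labels = parse_lines(io.StringIO(sample))
print(mat.shape, labels)   # (3, 3) [3, 2, 1]
```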

2. Normalizing the data: because the values of feature A are far larger than those of B and C, the three features must be min-max normalized so that they contribute with truly equal weight.

```python
from numpy import zeros, shape, tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum
    maxVals = dataSet.max(0)    # column-wise maximum
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals

normMat, ranges, minVals = autoNorm(data)
```
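A quick sanity check of the min-max formula on a toy matrix (the numbers are invented; the first column deliberately has a much larger scale than the second, mimicking feature A versus B/C, and NumPy broadcasting stands in for the `tile` calls with the same result):

```python
import numpy as np

toy = np.array([[10000.0, 1.0],
                [40000.0, 3.0],
                [70000.0, 5.0]])

minVals = toy.min(0)
ranges = toy.max(0) - minVals
normed = (toy - minVals) / ranges   # broadcasting replaces tile()

print(normed)   # each column now spans exactly [0, 1]
```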

3. Classifying with the kNN algorithm

  3.1 The idea behind kNN, briefly

 kNN keeps the entire training set. To classify a new point, it computes the distance (Euclidean here) from that point to every training sample, takes the k samples with the smallest distances, and assigns the class that occurs most often among those k neighbors.
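This idea fits in a few lines on a hypothetical four-point training set, before the full implementation:

```python
import numpy as np
from collections import Counter

# Four made-up labeled training points and one query point.
train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_labels = ['A', 'A', 'B', 'B']
query = np.array([0.1, 0.1])

dists = np.sqrt(((train - query) ** 2).sum(axis=1))   # Euclidean distances
nearest3 = dists.argsort()[:3]                        # indices of the 3 nearest
vote = Counter(train_labels[i] for i in nearest3).most_common(1)[0][0]
print(vote)   # 'B' — two of the three nearest neighbors are class B
```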

  3.2 Implementing kNN in Python

```python
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # differences to every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                # vote among the k nearest
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

  3.3 Applying kNN to the data above and computing the error rate

```python
def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

datingClassTest()
```
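The error-rate bookkeeping inside the loop reduces to comparing predicted and true labels over the held-out rows. On made-up label vectors:

```python
# Hypothetical predicted vs. true labels for eight held-out samples.
predicted = [3, 2, 1, 1, 2, 3, 1, 3]
actual    = [3, 2, 1, 2, 2, 3, 1, 1]

errors = sum(1 for p, a in zip(predicted, actual) if p != a)
error_rate = errors / len(actual)
print(errors, error_rate)   # 2 0.25
```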

4. Visualizing the classification results

```python
import matplotlib
import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
ax.set_xlabel('B')   # percentage of time spent playing video games
ax.set_ylabel('C')   # liters of ice cream consumed per week
# size and color of each point are scaled by its class label
ax.scatter(data[:, 1], data[:, 2], 15.0 * array(labels), array(labels))
# one marker per class, placed next to the text labels as a makeshift legend
ax.scatter([20, 20, 20], [1.8, 1.6, 1.4],
           15 * array(list(set(labels))), list(set(labels)))
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, '%s' % legends[0])
ax.text(22, 1.6, '%s' % legends[1])
ax.text(22, 1.4, '%s' % legends[2])
plt.show()
```

 

 
