A First Look at Classification Algorithms (1): the kNN (k-Nearest Neighbors) Algorithm
Example: suppose we want to build a classifier from 1,000 rows of training samples, splitting the data into 3 classes (like, so-so, dislike). Each sample has 3 main features:
A: frequent-flyer miles earned per year
B: percentage of time spent playing video games
C: liters of ice cream consumed per week
1. Reading the data
from numpy import zeros

filename = 'D://machine_learn//Ch02//datingTestSet2.txt'

def file2matrix(filename):
    fr = open(filename)
    a = fr.readlines()
    numberOfLines = len(a)                  # get the number of lines in the file
    returnMat = zeros((numberOfLines, 3))   # prepare matrix to return
    classLabelVector = []                   # prepare labels to return
    index = 0
    for line in a:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]   # row `index` = the 3 feature columns
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)
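To see the parsing logic in isolation, here is a minimal, self-contained sketch: it writes two tab-separated sample rows (3 feature columns plus a label column, mimicking the layout of datingTestSet2.txt) to a temporary file and parses them the same way. The sample values are made up for illustration.

```python
import os
import tempfile
import numpy as np

# two made-up rows in the same tab-separated layout as datingTestSet2.txt
sample = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"

def file2matrix(filename):
    # same parsing steps as in the article: one row per line,
    # first 3 columns are features, last column is the integer label
    with open(filename) as fr:
        lines = fr.readlines()
    returnMat = np.zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(sample)
    path = f.name

data, labels = file2matrix(path)
os.remove(path)
print(data.shape, labels)
```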
2. Normalizing the data: because feature A's values are far larger than those of B and C, the three features must be min-max scaled so that they carry truly equal weight in the distance computation.
from numpy import tile, zeros, shape

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # column-wise minimum
    maxVals = dataSet.max(0)    # column-wise maximum
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
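A quick sanity check of the scaling on a toy matrix (values invented for illustration): after autoNorm, every column spans exactly [0, 1], so no single feature dominates. autoNorm is redefined below so the snippet runs standalone.

```python
import numpy as np

def autoNorm(dataSet):
    # min-max scale each column into [0, 1], as in section 2
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = (dataSet - np.tile(minVals, (m, 1))) / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

# toy rows: (miles, game-time %, ice cream liters)
toy = np.array([[10000., 2., 0.5],
                [20000., 8., 1.5],
                [15000., 5., 1.0]])
norm, ranges, minVals = autoNorm(toy)
print(norm)
```

Each column now runs from 0 to 1, and the middle row, which sits exactly halfway on every feature, maps to 0.5 everywhere.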
3. Classifying with the kNN algorithm
3.1 The idea behind kNN, in brief: for a query sample, compute its distance to every training sample, take the k nearest neighbors, and assign the query the label held by the majority of those k neighbors.
3.2 Implementing kNN in Python
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # difference to every sample
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5                    # Euclidean distances
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):                                # vote among the k nearest
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
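A toy run of this classifier on four 2-D points (a made-up dataset, not the dating data): with k=3, a query near group A is out-voted into label 'A'. classify0 is redefined here so the example is self-contained.

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Euclidean-distance kNN with majority vote, as in section 3.2
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.9, 0.9], group, labels, 3))  # 'A'
```

The query [0.9, 0.9] has the two 'A' points and one 'B' point among its 3 nearest neighbors, so 'A' wins the vote 2 to 1.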
3.3 Applying kNN to the data above and measuring the error rate
def datingClassTest():
    hoRatio = 0.50      # hold out 50% of the samples for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
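The same hold-out scheme can be sketched in miniature on synthetic 2-D data instead of datingTestSet2.txt: the first half of the shuffled samples is scored against the second half, exactly as datingClassTest does with hoRatio = 0.50. The clusters and the simplified voting helper below are illustrative assumptions, not part of the original.

```python
import numpy as np

def classify0(inX, dataSet, labels, k):
    # simplified kNN: Euclidean distances, then the most common
    # label among the k nearest (same idea as section 3.2)
    distances = np.sqrt(((dataSet - np.asarray(inX)) ** 2).sum(axis=1))
    nearest = [labels[i] for i in distances.argsort()[:k]]
    return max(set(nearest), key=nearest.count)

rng = np.random.default_rng(0)
# two well-separated Gaussian clusters with labels 1 and 2
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
y = [1] * 20 + [2] * 20
order = rng.permutation(40)
X, y = X[order], [y[i] for i in order]

numTestVecs = 20      # hold out 50%, as hoRatio = 0.50 does above
errorCount = sum(
    classify0(X[i], X[numTestVecs:], y[numTestVecs:], 3) != y[i]
    for i in range(numTestVecs)
)
print("the total error rate is: %f" % (errorCount / numTestVecs))
```

Because the two clusters barely overlap, the measured error rate comes out near zero; on the real dating data the error rate depends on the features and on k.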
4. Visualizing the classification results
from numpy import array
import matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(data[:,0],data[:,1])
ax.set_xlabel('B')
ax.set_ylabel('C')
# size and color each point by its class label
ax.scatter(data[:, 1], data[:, 2], 15.0 * array(labels), array(labels))
# draw one reference marker per class, then label them by hand
ax.scatter([20, 20, 20], [1.8, 1.6, 1.4], 15 * array(list(set(labels))), list(set(labels)))
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, '%s' % legends[0])
ax.text(22, 1.6, '%s' % legends[1])
ax.text(22, 1.4, '%s' % legends[2])
plt.show()