'''
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (MxN)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label
'''
from numpy import *
import operator
from os import listdir

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # tile(inX, (dataSetSize,1)) stacks inX into dataSetSize rows (repeated once along the columns),
    # so it can be subtracted from dataSet row by row
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)  # sum(axis=1) adds up the elements of each row
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()  # argsort() returns the original indices in sorted order
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest point
        # Count how often each label occurs among the k nearest points: classCount.get()
        # returns the current count if voteIlabel is already a key, otherwise the default 0
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # operator.itemgetter(1) sorts the (label, count) tuples by their second element;
    # reverse=True sorts in descending order, largest count first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # finally, return the most frequent label

Create a new file kNN.py, put the core kNN code above into it, and add the data-set creation function createDataSet():

# Create the data set and labels
def createDataSet():
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

k-Nearest Neighbors: a simple example with four data points

x=[1,1,0,0];
y=[1.1,1,0,0.1];
L={'A','A','B','B'}; % the 4 labels
plot(x,y,'.'); % plot the 4 points
axis([-0.2 1.2 -0.2 1.2])
for ii=1:4
    text(x(ii)+0.01,y(ii)+0.01,L{ii}); % place each label at its point's coordinates
    % a small offset keeps the text from overlapping the point
end
figure(gcf);

>>> import kNN
>>> group, labels = kNN.createDataSet()
>>> kNN.classify0([0,0],group,labels,3)
'B'
>>> kNN.classify0([1,0],group,labels,3)
'B'
>>> kNN.classify0([2,2],group,labels,3)
'A'
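As a cross-check on the three results above, the same distance-and-vote procedure can be sketched in standard-library Python, without NumPy. This is an illustrative sketch rather than the article's code: the function name knn_classify is hypothetical, and math.dist assumes Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Return the majority label among the k points closest to query.

    Hypothetical helper for illustration; it is not part of kNN.py.
    """
    # Pair each Euclidean distance with the label of the point it came from.
    dist_label = [(math.dist(query, p), lab) for p, lab in zip(points, labels)]
    # Sort by distance only; Python's sort is stable, so ties keep data-set order.
    dist_label.sort(key=lambda pair: pair[0])
    # Tally the labels of the k nearest points and return the most common one.
    votes = Counter(lab for _, lab in dist_label[:k])
    return votes.most_common(1)[0][0]

# Same four points and labels as createDataSet()
group = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']

print(knn_classify([0, 0], group, labels, 3))  # B
print(knn_classify([1, 0], group, labels, 3))  # B
print(knn_classify([2, 2], group, labels, 3))  # A
```

For [1,0] the points [1.0,1.0] and [0,0] are both at distance 1.0, but with k=3 both fall inside the neighborhood, so the tie does not change the vote (one 'A', two 'B').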
Original article: http://blog.csdn.net/geekmanong/article/details/50517909