Algorithm overview: naive Bayes rests on two assumptions: the features are mutually independent, and every feature is equally important.
Suppose there are classes 1, 2, 3, ..., n and a data point d to classify. If the probabilities that d belongs to classes 1, 2, 3, ... are p1, p2, p3, ..., then the class with the highest probability is the class assigned to d; in other words, choose the decision with the highest probability.
By Bayes' rule, p(ci|x,y) = p(x,y|ci) p(ci) / p(x,y). If p(c1|x,y) > p(c2|x,y), then (x,y) belongs to class c1; otherwise it belongs to class c2.
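To make the decision rule concrete, here is a tiny sketch with made-up numbers (the priors and likelihoods below are purely illustrative): since p(x,y) is the same denominator for every class, comparing p(ci|x,y) reduces to comparing the numerators p(x,y|ci) p(ci).

# Made-up priors and likelihoods, purely to illustrate the decision rule
pc1, pc2 = 0.5, 0.5            # p(c1), p(c2)
pxy_c1, pxy_c2 = 0.3, 0.1      # p(x,y|c1), p(x,y|c2)

score1 = pxy_c1 * pc1          # numerator of p(c1|x,y)
score2 = pxy_c2 * pc2          # numerator of p(c2|x,y)
print 1 if score1 > score2 else 2   # prints 1, so (x,y) is assigned to class c1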
4.1 Preparing the data: building word vectors
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 is abusive, 0 not
    return postingList, classVec
def createVocabList(dataSet):
    vocabSet = set([])  # create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print "the word: %s is not in my Vocabulary!" % word
    return returnVec
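A quick sketch of how these pieces fit together (the vocabulary comes from a set, so its ordering, and hence the positions of the 1s, will vary between runs):

# Illustrative usage of the functions above
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print len(myVocabList)                            # 32 distinct words across the six posts
print setOfWords2Vec(myVocabList, listOPosts[0])  # 0/1 vector marking which vocabulary words appear in post 0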
4.2 Computing probabilities from word vectors
This corresponds to computing p(x,y|ci) in the formula above; p(ci) is computed along the way.
from numpy import *  # provides ones(), log(), and array() used below

def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)  # number of training documents
    numWords = len(trainMatrix[0])   # number of distinct words, i.e. the length of the vocabulary list
    pAbusive = sum(trainCategory)/float(numTrainDocs)  # fraction of abusive documents, i.e. p(c1)
    p0Num = ones(numWords); p1Num = ones(numWords)  # per-word counts within each class, initialized to 1
                                                    # (so a single zero count cannot drive p(ci|w) to zero)
    p0Denom = 2.0; p1Denom = 2.0  # total word count within each class (duplicates included), initialized to 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)  # the quotient is the conditional probability p(x,y|ci); take the log to avoid underflow
    p0Vect = log(p0Num/p0Denom)
    return p0Vect, p1Vect, pAbusive
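A minimal sketch of training on the six posts above and looking at the outputs (illustrative; the exact vector values depend on the ordering of the vocabulary list):

# Illustrative training run on the posts from loadDataSet()
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = []
for postinDoc in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
print pAb    # 0.5: three of the six posts are labeled abusive
print p1V    # per-word log p(w_k|c1); 'stupid' gets the largest value, appearing in all three abusive posts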
4.3 The classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # In log space, addition stands in for multiplication; p(w) is the same for every class,
    # so it can be dropped and only the numerator p(w|ci)p(ci) needs to be compared.
    p1 = sum(vec2Classify*p1Vec) + log(pClass1)       # log p(w|c1) + log p(c1)
    p0 = sum(vec2Classify*p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb)
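Running testingNB() on the six posts should classify the two test entries along these lines (the exact spacing comes from Python 2's print):

    ['love', 'my', 'dalmation'] classified as: 0
    ['stupid', 'garbage'] classified as: 1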
A small puzzle: p1Vec holds, for each word, the conditional log-probability of that word under class 1 estimated from the training data, while thisDoc is just a 0/1 vector marking whether each vocabulary word appears in the test entry. Can multiplying the two really give the conditional probability of testEntry under class 1? It feels a little odd. In fact it does work: the element-wise product keeps exactly the log-probabilities of the words that appear in the test entry (the zeros discard the rest), and summing those logs is the same as taking the log of the product of p(w_k|c1) over the words present, which under the independence assumption is log p(w|c1).
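A minimal numeric check of this point (the probabilities below are made up and are not the trained values):

from numpy import array, log, exp

p1Vec = log(array([0.5, 0.25, 0.125, 0.125]))  # made-up p(w_k|c1) for a four-word vocabulary
thisDoc = array([1, 0, 1, 0])                  # the test entry contains words 0 and 2 only

logProb = sum(thisDoc * p1Vec)                 # keeps and sums the log-probabilities of the words present
print exp(logProb)                             # 0.0625, i.e. 0.5 * 0.125, the product p(w_0|c1) * p(w_2|c1)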
The code above implements the set-of-words model: only the presence or absence of a word is used as a feature, and word counts are ignored. If word counts are taken into account, we get the bag-of-words model.
4.4 The bag-of-words model
Only the setOfWords2Vec function needs to change: each time a word is encountered, the corresponding value in the word vector is incremented instead of just being set to 1.
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
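A quick contrast between the two encodings on an input with a repeated word (the three-word vocabList below is hypothetical, not the one built from the posts):

vocabList = ['stupid', 'dog', 'my']              # hypothetical mini-vocabulary
inputSet = ['stupid', 'dog', 'stupid']

print setOfWords2Vec(vocabList, inputSet)        # [1, 1, 0]  -- presence/absence only
print bagOfWords2VecMN(vocabList, inputSet)      # [2, 1, 0]  -- repeated words are counted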
With these functions in place, they can be applied to the book's two examples: filtering spam e-mail and inferring regional preferences from personal ads. Go practice!
Original post: http://www.cnblogs.com/woshikafeidouha/p/3575531.html