Stage Assignment 1: Complete Chinese and English Word Frequency Counting

Date: 2018-09-29 11:00:16

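1. English novel word frequency count:

a. Prepare a UTF-8 encoded text file (file)

b. Read the file into a string (str)

c. Preprocess the text

d. Split the text into a list of words (list)

e. Count the words using a set and a dict

f. Sort by frequency with list.sort(key=)

g. Exclude function words that carry no meaning of their own, such as pronouns, articles, and conjunctions

h. Output the TOP 20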

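# Prepare a UTF-8 encoded text file and read it into a lowercased string
with open("geci.txt", "r", encoding="utf-8") as fo:
    text = fo.read().lower()
print(text)

# Preprocess the text: replace punctuation with spaces
sep = ".,;:?!-"
for ch in sep:
    # str.replace returns a new string, so the result must be assigned back
    text = text.replace(ch, " ")
print(text)

# Split the text into a list of words
words = text.split()
print(words)

# Count the words using a set and a dict
wordset = set(words)
print(len(wordset), wordset)
for word in wordset:
    print(word, words.count(word))

# Exclude function words such as pronouns, articles, and conjunctions
stopwords = {'the', 'you', 'and', 'or', 'we', 'a', 'me', 'on', 'of'}
wordset = wordset - stopwords

# Sort by frequency with list.sort(key=)
counts = {}
for word in wordset:
    counts[word] = words.count(word)
print(len(counts), counts)
wclist = list(counts.items())
wclist.sort(key=lambda x: x[1], reverse=True)
print(wclist)

# Output the TOP 20 (slicing avoids an IndexError when there are fewer than 20 words)
for item in wclist[:20]:
    print(item)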

The output is as follows:

[Image: screenshot of the program output]
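The same steps can also be written more compactly with the standard library. The following is only a minimal sketch, not the original author's code: it assumes that `re.findall` with the pattern `[a-z']+` is an acceptable tokenizer for the text, and the function name `top_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

# Same stop-word set as in the script above
STOPWORDS = {'the', 'you', 'and', 'or', 'we', 'a', 'me', 'on', 'of'}

def top_words(text, n=20, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    # Lowercase, then keep runs of letters and apostrophes;
    # every other character acts as a separator
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text, used only to show the call
sample = "The road, the long road! We walk and walk the road."
print(top_words(sample, n=3))  # [('road', 3), ('walk', 2), ('long', 1)]
```

`Counter` makes a single pass over the words, whereas calling `list.count` once per distinct word rescans the whole list each time, which becomes slow on a full novel.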

2. Chinese novel word frequency count:

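import jieba

# Read the novel as a UTF-8 string
with open("简爱.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(text)

# Segment the text into a list of words
words = jieba.lcut(text)

# Count the words, skipping single-character tokens
counts = {}
for word in words:
    if len(word) == 1:
        continue
    counts[word] = counts.get(word, 0) + 1

# Disabled attempt to merge tokens that were split apart. Note that '这' is
# never a key here, because single-character tokens are skipped above.
# counts['这笑声'] = counts['笑声'] + counts['这']
# del counts['这']

# Sort by frequency, highest first
wclist = list(counts.items())
wclist.sort(key=lambda x: x[1], reverse=True)
print(wclist)

# Output the TOP 20 (slicing avoids an IndexError when there are fewer than 20 words)
for item in wclist[:20]:
    print(item)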

The output is as follows:

[Image: screenshot of the program output]
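The commented-out lines in the script suggest an attempt to merge the counts of related tokens. One possible way to do this is sketched below, under stated assumptions: the function takes an already segmented token list (such as the result of `jieba.lcut`) so that it runs without jieba installed, and the name `count_tokens`, the `aliases` parameter, and the sample tokens are hypothetical, not part of the original post or of jieba.

```python
from collections import Counter

def count_tokens(tokens, n=20, aliases=None):
    """Count multi-character tokens and return the n most frequent.

    tokens:  a list of strings, e.g. the result of jieba.lcut(text)
    aliases: optional dict mapping a variant token to the form it is counted as
    """
    aliases = aliases or {}
    counts = Counter()
    for tok in tokens:
        tok = aliases.get(tok, tok)  # fold variants into a single key
        if len(tok) == 1:            # skip single characters, as the script above does
            continue
        counts[tok] += 1
    return counts.most_common(n)

# Hand-written token list standing in for real jieba output
tokens = ["简", "笑声", "，", "罗切斯特", "先生", "笑声", "罗切斯特", "笑声"]
print(count_tokens(tokens, n=2))  # [('笑声', 3), ('罗切斯特', 2)]
```

Folding aliases before counting avoids having to add and delete dictionary keys afterwards, which fails with a `KeyError` whenever one of the keys was never counted.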


Original post: https://www.cnblogs.com/cx1234/p/9722229.html
