Analyze how often each word occurs in a document (an English article) and print the 10 most frequent words.

Date: 2014-03-03 12:24:59

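Program outline:

1. Read the file and store the words.

2. Count each word in a map, then sort the entries by value (the count).

3. Print the 10 most frequent words.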
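The first step of the outline can be sketched on its own. This is a minimal sketch, not the program's actual code: `countWords` is a hypothetical helper name, and it takes any `std::istream` so that the same logic works for a file or an in-memory string. It does no punctuation handling.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts whitespace-separated tokens from any input stream.
std::map<std::string, int> countWords(std::istream& in)
{
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];   // operator[] creates a missing entry with the value 0
    return counts;
}
```

Passing a `std::istringstream` built from a short sentence is a quick way to try it without a file on disk.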

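Diary: This problem felt a bit tricky at first. I had done similar exercises before, such as counting the words in a piece of English text and reading input from a file, but I had never combined the two.

Afterwards I worked through the design of the program.

First, pin down exactly what the program has to do: print the 10 most frequent words.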

So we need something that processes and analyzes a text. I count the words in a map and sort its entries by value, with the help of a vector.
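The reason a vector is needed can be shown in a few lines. The sketch below assumes a C++11 compiler (for the lambda); `sortByCount` is a hypothetical name, and the alphabetical tie-break is an addition that the program in this post does not have.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: copies the map's entries into a vector and sorts
// them by count, highest first.
std::vector<std::pair<std::string, int>>
sortByCount(const std::map<std::string, int>& counts)
{
    // std::map iterators are bidirectional and the map is always ordered by
    // key, so std::sort cannot be applied to the map itself. A vector gives
    // std::sort the random access iterators it requires.
    std::vector<std::pair<std::string, int>> entries(counts.begin(), counts.end());

    std::sort(entries.begin(), entries.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) {
                  if (a.second != b.second)
                      return a.second > b.second;  // higher count first
                  return a.first < b.first;        // ties: alphabetical (assumption)
              });
    return entries;
}
```

Without the tie-break, words with equal counts may come out in any relative order, because `std::sort` is not a stable sort.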

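After that I used standard library functions to insert, sort, read, and finally output the words. The most novel part for me was map, an associative container: inserting a node does not invalidate existing iterators, and erasing a node invalidates only the iterators to that node. Insertion, lookup, and deletion are also efficient, each taking logarithmic time, and the entries are always kept ordered by key.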
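The iterator behavior described above can be checked with a small experiment. This is only an illustration; `iteratorSurvivesChanges` is a hypothetical function written for this purpose.

```cpp
#include <map>
#include <string>

// Hypothetical demonstration: returns true if an iterator obtained before
// further insertions and an unrelated erase still refers to the same entry.
bool iteratorSurvivesChanges()
{
    std::map<std::string, int> counts;
    counts["map"] = 1;
    std::map<std::string, int>::iterator it = counts.find("map");

    counts["vector"] = 1;    // insertion: existing iterators stay valid
    counts["apple"] = 1;
    counts.erase("vector");  // erase: only iterators to "vector" are invalidated

    ++it->second;            // the old iterator is still safe to use
    return it->first == "map" && it->second == 2 && counts.size() == 2;
}
```

The same experiment with a `std::vector` would not be safe, since a `push_back` that triggers reallocation invalidates every iterator into the vector.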

#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

using namespace std;
#define COUNT 10 // how many of the most frequent words to print
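// Counts words and reports the most frequent ones
class WordTop10
{
private:
    map<string, int> mapWord;             // word -> number of occurrences
    vector<pair<string, int> > pair_vec;  // the map's entries, copied out so they can be sorted by value
public:
    void insertWord(const string& word);       // add one occurrence of a word to the map
    void sortWord();                           // sort the entries by value (count)
    void readFile(const string& strFileName);  // read words from a file
    void outPut();                             // print the top COUNT words
};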

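// Add one occurrence of a word to the map
void WordTop10::insertWord(const string& word)
{
    if (word.empty())   // a token made only of punctuation leaves nothing to count
    {
        return;
    }
    // operator[] creates a missing entry with the value 0,
    // so a single increment covers both new and existing words
    ++mapWord[word];
}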
// Comparison for descending order by count
bool cmp(const pair<string, int>& a, const pair<string, int>& b)
{
    return a.second > b.second;
}
// Sort the map's entries by value
void WordTop10::sortWord()
{
    // A map is always ordered by key, so copy the entries into a vector first.
    // assign() also discards the results of any earlier call.
    pair_vec.assign(mapWord.begin(), mapWord.end());
    sort(pair_vec.begin(), pair_vec.end(), cmp);   // highest count first
}

// Print the COUNT most frequent words, one per line, as "word<TAB>count"
void WordTop10::outPut()
{
    int i = 0;
    for (vector<pair<string, int> >::iterator cur = pair_vec.begin();
         cur != pair_vec.end() && i < COUNT; ++cur, ++i)
    {
        cout << cur->first << "\t" << cur->second << endl;
    }
}

// Read words from a file
void WordTop10::readFile(const string& strFileName)
{
    string text;
    // c_str() returns a const char*, which the ifstream constructor accepts
    // (before C++11 there was no constructor taking a std::string)
    ifstream in(strFileName.c_str());

    if (!in)
    {
        cout << "Failed to open the file" << endl;
        return;   // nothing to read
    }

    while (in >> text)
    {
        // Input is split on whitespace, so punctuation may still be attached
        // to either end of a word. The unsigned char cast avoids undefined
        // behavior in ispunct for negative char values.

        // Strip trailing punctuation
        while (!text.empty() && ispunct(static_cast<unsigned char>(text[text.length() - 1])))
            text.erase(text.length() - 1);

        // Strip leading punctuation
        while (!text.empty() && ispunct(static_cast<unsigned char>(text[0])))
            text.erase(0, 1);

        if (text.empty())   // the token was punctuation only
            continue;

        // Many tokens contain "--" between two words. The ends were handled
        // above; here the token is split at the first "--" in the middle.
        size_t npos = text.find("--");
        if (npos != string::npos)
        {
            insertWord(text.substr(0, npos));
            insertWord(text.substr(npos + 2));
        }
        else
        {
            insertWord(text);
        }
    }

    in.close();
}

int main()
{
    WordTop10 wordTop;

    wordTop.readFile("H:\\haha.txt");
    wordTop.sortWord();
    wordTop.outPut();
    return 0;
}
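Since only the first 10 entries are printed, the full sort is more work than strictly needed. As a possible alternative (a swapped-in technique, not what the program above does), `std::partial_sort` orders just the leading entries. In this sketch `WordCount`, `byCountDesc`, and `topN` are hypothetical names, and the alphabetical tie-break is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> WordCount;

// Orders by count descending, then alphabetically so ties are deterministic.
bool byCountDesc(const WordCount& a, const WordCount& b)
{
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

// Hypothetical helper: returns the n most frequent entries, best first.
std::vector<WordCount> topN(const std::map<std::string, int>& counts, std::size_t n)
{
    std::vector<WordCount> entries(counts.begin(), counts.end());
    n = std::min(n, entries.size());   // fewer than n distinct words is fine
    // Only the first n positions end up sorted; the rest are left unordered.
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(), byCountDesc);
    entries.resize(n);
    return entries;
}
```

For a short article the difference is negligible; it only starts to matter when the number of distinct words is much larger than the number printed.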


Original post: http://www.cnblogs.com/dxl12306/p/3576735.html
