Using Lucene's CharFilter to remove script blocks and HTML tags from text

Posted: 2015-05-11 17:37:20

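1. Prepare the data. Here I read a record from the database whose content contains HTML tags and script blocks.

Code:

    @Before
    public void init() {
        // SQLService and BaseDao are project-specific database helpers, not part of Lucene.
        SQLService sqlService = new SQLService();
        sqlService.regist(null);
        BaseDao bd = new BaseDao();
        // Fetch the row whose content holds the HTML and script to be stripped.
        String sql = "select * from t where title like '% 每天读一遍,舌头更无敌%'";
        lists = bd.getList(sql);
        System.out.println(lists.size());
        content = lists.get(0).get("content").toString();
        // System.out.println(content);
    }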
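
To try the next step without a database, the content field could be filled from a hardcoded string instead. The following is only a sketch under that assumption: the class name SampleContent and the markup it returns are made up for illustration and are not the author's data.

```java
public class SampleContent {

    /** Returns a small, made-up HTML fragment with tags, a script block and line breaks. */
    public static String html() {
        return "<div class=\"post\">\r\n"
             + "\t<script type=\"text/javascript\">alert('hi');</script>\r\n"
             + "\t<p>1. Can you can a can as a canner can can a can?</p>\r\n"
             + "\t<p>2. I scream, you scream, we all scream for ice-cream!</p>\r\n"
             + "</div>";
    }

    public static void main(String[] args) {
        // In init(), one could assign: content = SampleContent.html();
        System.out.println(html());
    }
}
```

The fragment deliberately contains a script block, nested tags, tabs and CRLF line breaks, so each filter configured in the next step has something to remove.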

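2. Use the character filters HTMLStripCharFilter and MappingCharFilter. Both extend Reader, so they can be consumed like any other Reader.

Code:

    // Imports used by this test.
    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.CharFilter;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.junit.Test;

    @Test
    public void test2() throws IOException {
        StringBuilder sb = new StringBuilder();
        // Strip HTML tags and script blocks.
        HTMLStripCharFilter htmlscript = new HTMLStripCharFilter(new StringReader(content));

        // Add a mapping filter, mainly to remove line breaks.
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("\r", ""); // carriage return
        builder.add("\t", ""); // horizontal tab
        builder.add("\n", ""); // line feed
        CharFilter cs = new MappingCharFilter(builder.build(), htmlscript);

        // Drain the filter chain exactly as one would drain any Reader.
        char[] buffer = new char[10240];
        int count;
        while ((count = cs.read(buffer)) != -1) {
            sb.append(buffer, 0, count);
        }
        System.out.println(sb.toString());
        cs.close();

        // String keywords = HanLP.extractKeyword(sb.toString(), 20).toString();
        // System.out.println(keywords);
    }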
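
The idea that a char filter is just a Reader wrapped around another Reader can be illustrated with the JDK alone. The sketch below is not Lucene's implementation: ReaderChainDemo, WhitespaceDroppingReader, drain and strip are hypothetical names, and the filter is a simplified stand-in for the MappingCharFilter configured above (it drops \r, \t and \n but does no offset correction).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderChainDemo {

    /** Hypothetical stand-in for the mapping filter: drops \r, \t and \n. */
    static class WhitespaceDroppingReader extends FilterReader {
        WhitespaceDroppingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int c;
            // Skip filtered characters until a kept character or end of stream.
            while ((c = in.read()) != -1) {
                if (c != '\r' && c != '\t' && c != '\n') {
                    return c;
                }
            }
            return -1;
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int written = 0;
            while (written < len) {
                int c = read();
                if (c == -1) {
                    break;
                }
                cbuf[off + written++] = (char) c;
            }
            // Report end of stream only when nothing at all was read.
            return written == 0 ? -1 : written;
        }
    }

    /** Drains any Reader with the same loop used in test2(). */
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, count);
        }
        reader.close();
        return sb.toString();
    }

    /** Convenience wrapper: filters a String through the reader chain. */
    static String strip(String input) {
        try {
            return drain(new WhitespaceDroppingReader(new StringReader(input)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // prints: line oneline two
        System.out.println(strip("line one\r\n\tline two\n"));
    }
}
```

Because the wrapper only overrides the read methods, the calling code does not need to know how many filters are stacked. That is the same property that lets test2() pass the HTMLStripCharFilter straight into the MappingCharFilter.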

Result:

亲爱的小伙伴们,累了,就放松一下吧!1. Can you can a can as a canner can can a can?­你能够像罐头工人一样装罐头吗?­
2. I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won't wish the wish you wish to
wish.­ 我希望梦想着你梦想中的梦想,但是如果你梦想着女巫的梦想,我就不想梦想着你梦想中的梦想。­3. I scream, you scream, we all scream
for ice-cream!­ 我叫喊,你叫喊,我们都喊着要冰淇淋!­4. How many cookies could a good cook cook if a good cook could cook cookies?­
A good cook could cook as much cookies as a good cook who could cook cookies.­ 如果一个好的厨师能做小甜饼,那么他能做多少小甜饼呢?
一个好的厨师能做出和其它好厨师一样多的小甜饼。­5. The driver was drunk and drove the doctor's car directly into the deep ditch.
这个司机喝醉了,他把医生的车开进了一个大深沟里。­6. Whether the weather be fine or whether the weather be not.­Whether the weather
be cold or whether the weather be hot.­We'll weather the weather whether we like it or not.­无论是晴天或是阴天。­无论是冷或是暖,
­不管喜欢与否,我们都要经受风霜雨露。­7. Peter Piper picked a peck of pickled peppers.­ A peck of pickled peppers Peter Piper
picked.­ If Peter Piper picked a peck of pickled peppers,­ Where's the peck of pickled peppers Peter Piper picked?­
彼德派柏捏起一撮泡菜。­ 彼德派柏捏起的是一撮泡菜。­ 那么彼德派捏起的泡菜在哪儿?­8. I thought a thought. But the thought I thought
wasn't the thought I thought I thought.­ If the thought I thought I thought had been the thought I thought, I wouldn't
have thought so much.­ 我有一种想法,但是我的这种想法不是我曾经想到的那种想法。如果这种想法是我曾经想到的想法,我就不会想那么多了。
­9. Amid the mists and coldest frosts,­ With barest wrists and stoutest boasts,­ He thrusts his fists against the posts,­
And still insists he sees the ghosts.­ 雾蒙蒙,冰霜冻,­ 手腕儿空空,话儿涌,­ 只见他猛所拳头往柱子上砸,­ 直说自己把鬼碰。
­10. Badmin was able to beat Bill at billiards, but Bill always beat Badmin badly at badminton.­
巴德明在台球上能够打败比尔,但是打羽毛球比尔常常大败巴德明。­11. Betty beat a bit of butter to make a better butter.­
贝蒂敲打一小块黄油要做一块更好的奶油面。­12. Rita repeated what Reardon recited when Reardon read the remarks.­

 


Original article: http://www.cnblogs.com/a198720/p/4494941.html
