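In most deployments MongoDB serves as the data storage module; as a database, it generally should not take on additional responsibilities.
From a specialization standpoint, handing text search to a dedicated search engine is usually the better choice.
Popular search engines generally have ready-made tools for working with MongoDB, so combining the two is straightforward.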
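1. Sphinx and mongodb-sphinx
Sphinx is a full-text search engine written in C++. It integrates very well with MySQL and can import data from MySQL with little effort.
Sphinx offers no native support for other databases, but it does provide the xmlpipe2 interface: any program that emits data in that format can feed documents to Sphinx.
For MongoDB, mongodb-sphinx (https://github.com/georgepsarakis/mongodb-sphinx) is one implementation of the xmlpipe2 interface.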
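To make the xmlpipe2 idea concrete, here is a minimal sketch of a program that prints a document stream in that format. It is not mongodb-sphinx's code: the function name to_xmlpipe2, the sample document, and the field names title and created are all made up for illustration, and the documents come from a plain list rather than from MongoDB.

```python
from xml.sax.saxutils import escape, quoteattr


def to_xmlpipe2(docs, text_fields, attributes):
    """Render documents as an xmlpipe2 stream (illustrative helper).

    docs        -- iterable of dicts, each with an integer "id" key
    text_fields -- names of the full-text fields
    attributes  -- mapping of attribute name -> Sphinx attribute type
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    # The schema declares which elements are full-text fields and which are attributes.
    for name in text_fields:
        out.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attributes.items():
        out.append("<sphinx:attr name=%s type=%s/>" % (quoteattr(name), quoteattr(attr_type)))
    out.append("</sphinx:schema>")
    # Each document carries its ID plus one element per declared field or attribute.
    for doc in docs:
        out.append('<sphinx:document id="%d">' % doc["id"])
        for name in list(text_fields) + list(attributes):
            out.append("<%s>%s</%s>" % (name, escape(str(doc.get(name, ""))), name))
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)


sample = [{"id": 1, "title": "Q & A", "created": 1366045854}]
stream = to_xmlpipe2(sample, ["title"], {"created": "timestamp"})
print(stream)
```

On the Sphinx side, a source declared with type = xmlpipe2 runs the program named in xmlpipe_command and reads this stream from its standard output.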
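mongodb-sphinx ships with a StackOverflow sample dataset and example command-line arguments. Import the sample data into MongoDB, then run the following command to load the data into Sphinx: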
./mongodb-sphinx.py -d stackoverflow -c posts --text-fields profile_image link --attributes last_activity_date _id --attribute-types timestamp string --timestamp-from=1366045854 --id-field=post_id
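Commonly used options include:
-d the database, -c the collection, -H the MongoDB host address, -p the MongoDB port
-f the start timestamp, -u the end timestamp, -t the fields to build the full-text index on
-a attributes that are not full-text indexed; --attribute-types gives the type of each attribute listed in -a, such as string, timestamp, integer, and so on
--id-field the field used as the document ID, --threads the number of threads
One important caveat: mongodb-sphinx assumes that _id in the MongoDB data is an ObjectID, that is, an ID that embeds time information. If you use your own ID scheme, the time-based filtering will not work correctly and you will need to modify the code yourself.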
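The reason an ObjectID can stand in for a timestamp is that its first 4 bytes are the creation time in Unix seconds, stored big-endian. The sketch below shows how a start timestamp can be turned into a lower bound for _id and read back again. Whether mongodb-sphinx builds its time filter exactly this way is an assumption; the helper names objectid_floor and objectid_time are hypothetical.

```python
import struct


def objectid_floor(ts):
    """Hex form of the smallest 12-byte ObjectID whose embedded time is ts."""
    # 4 bytes of big-endian Unix seconds, followed by 8 zero bytes.
    return struct.pack(">I", ts).hex() + "00" * 8


def objectid_time(oid_hex):
    """Read the embedded Unix timestamp back out of an ObjectID hex string."""
    return struct.unpack(">I", bytes.fromhex(oid_hex[:8]))[0]


lower = objectid_floor(1366045854)
print(lower, objectid_time(lower))
```

A query such as _id >= lower then selects documents created at or after that time, which is why custom, non-ObjectID IDs break this kind of filtering. With pymongo installed, bson.ObjectId.from_datetime offers a similar conversion.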
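2. ElasticSearch and Mongo-Connector
With es 2.0 and earlier, the tool most commonly used to bring MongoDB data into es was mongodb-river.
From es5 onward, however, plugins can no longer be installed the way they were in earlier versions, so the mongodb-river tutorials found online no longer work.
In addition, mongodb-river has not been updated for several years, so its es5 support may lag behind other tools.
MongoDB officially provides a similar tool, Mongo-Connector (https://github.com/mongodb-labs/mongo-connector).
Installation is simple: pip install mongo-connector
Mongo-Connector supports several different search engines. For es it supports multiple versions, including 1.x, 2.x, and 5.x; you only need to install the corresponding doc-manager.
Alternatively, pip install 'mongo-connector[elastic5]' installs everything in one step, after which it is ready to use.
Before using it, switch MongoDB to replica set mode, because only then does MongoDB record an oplog.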
$ mongod --replSet singleNodeRepl
$ mongo
> rs.initiate()
# MongoDB is now running on port 27017
Next, create a configuration file, for example to set the password (XXX is a placeholder):
{"authentication": {"password": "XXX"}}
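If you prefer to generate the file rather than edit it by hand, the following is a rough sketch. The key names mainAddress, authentication, docManagers, docManager, and targetURL are taken from the project's sample configuration reproduced later in this post; the values, the temporary path, and the choice of which keys to include are assumptions to adapt to your own setup.

```python
import json
import os
import tempfile

# Minimal configuration; "XXX" is a placeholder, not a real password.
config = {
    "mainAddress": "localhost:27017",
    "authentication": {"password": "XXX"},
    "docManagers": [
        {"docManager": "elastic_doc_manager", "targetURL": "localhost:9200"}
    ],
}

# Write to a scratch directory so the sketch does not overwrite an existing file.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as fh:
    json.dump(config, fh, indent=4)

# Read it back to confirm the file is valid JSON.
with open(path) as fh:
    loaded = json.load(fh)
print(json.dumps(loaded, indent=4))
```

Writing the file with json.dump avoids hand-editing mistakes such as an unquoted password value.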
The project ships with a sample configuration file:
{
    "__comment__": "Configuration options starting with '__' are disabled",
    "__comment__": "To enable them, remove the preceding '__'",
    "mainAddress": "localhost:27017",
    "oplogFile": "/var/log/mongo-connector/oplog.timestamp",
    "noDump": false,
    "batchSize": -1,
    "verbosity": 0,
    "continueOnError": false,
    "logging": {
        "type": "file",
        "filename": "/var/log/mongo-connector/mongo-connector.log",
        "__format": "%(asctime)s [%(levelname)s] %(name)s:%(lineno)d - %(message)s",
        "__rotationWhen": "D",
        "__rotationInterval": 1,
        "__rotationBackups": 10,
        "__type": "syslog",
        "__host": "localhost:514"
    },
    "authentication": {
        "__adminUsername": "username",
        "__password": "password",
        "__passwordFile": "mongo-connector.pwd"
    },
    "__comment__": "For more information about SSL with MongoDB, please see http://docs.mongodb.org/manual/tutorial/configure-ssl-clients/",
    "__ssl": {
        "__sslCertfile": "Path to certificate to identify the local connection against MongoDB",
        "__sslKeyfile": "Path to the private key for sslCertfile. Not necessary if already included in sslCertfile.",
        "__sslCACerts": "Path to concatenated set of certificate authority certificates to validate the other side of the connection",
        "__sslCertificatePolicy": "Policy for validating SSL certificates provided from the other end of the connection. Possible values are 'required' (require and validate certificates), 'optional' (validate but don't require a certificate), and 'ignored' (ignore certificates)."
    },
    "__fields": ["field1", "field2", "field3"],
    "__namespaces": {
        "excluded.collection": false,
        "excluded_wildcard.*": false,
        "*.exclude_collection_from_every_database": false,
        "included.collection1": true,
        "included.collection2": {},
        "included.collection4": {
            "includeFields": ["included_field", "included.nested.field"]
        },
        "included.collection5": {
            "rename": "included.new_collection5_name",
            "includeFields": ["included_field", "included.nested.field"]
        },
        "included.collection6": {
            "excludeFields": ["excluded_field", "excluded.nested.field"]
        },
        "included.collection7": {
            "rename": "included.new_collection7_name",
            "excludeFields": ["excluded_field", "excluded.nested.field"]
        },
        "included_wildcard1.*": true,
        "included_wildcard2.*": true,
        "renamed.collection1": "something.else1",
        "renamed.collection2": {
            "rename": "something.else2"
        },
        "renamed_wildcard.*": {
            "rename": "new_name.*"
        },
        "gridfs.collection": {
            "gridfs": true
        },
        "gridfs_wildcard.*": {
            "gridfs": true
        }
    },
    "docManagers": [
        {
            "docManager": "elastic_doc_manager",
            "targetURL": "localhost:9200",
            "__bulkSize": 1000,
            "__uniqueKey": "_id",
            "__autoCommitInterval": null
        }
    ]
}
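Because the sample marks disabled options with a leading '__', it can be hard to see at a glance which settings are actually in effect. The sketch below filters a shortened copy of the sample down to its active options. This is an illustration only, not mongo-connector's own loader; the function name active_options is hypothetical, and dropping '__' keys at every nesting level is an assumption based on the sample's comment.

```python
import json


def active_options(node):
    """Recursively drop keys that start with "__" (disabled or comment entries)."""
    if isinstance(node, dict):
        return {k: active_options(v) for k, v in node.items() if not k.startswith("__")}
    if isinstance(node, list):
        return [active_options(item) for item in node]
    return node


# A shortened copy of the sample configuration.
sample = json.loads("""
{
    "__comment__": "Configuration options starting with '__' are disabled",
    "mainAddress": "localhost:27017",
    "logging": {"type": "file", "__type": "syslog", "__host": "localhost:514"},
    "authentication": {"__adminUsername": "username", "__password": "password"},
    "__fields": ["field1", "field2", "field3"],
    "docManagers": [
        {"docManager": "elastic_doc_manager", "targetURL": "localhost:9200", "__bulkSize": 1000}
    ]
}
""")

effective = active_options(sample)
print(json.dumps(effective, indent=4))
```

In this shortened copy only mainAddress, the file logging type, and the doc manager's name and target URL survive; the authentication block ends up empty because every entry in it is disabled.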
Then run mongo-connector -c config.json to start synchronizing the data.
Source: http://www.cnblogs.com/ruizhang3/p/6667066.html