舆情系统每日热词用到了lda主题聚类
原先的版本是python项目,分词应用Jieba,LDA应用Gensim
项目工作良好
有以下几点问题
1 舆情产品基于elasticsearch大数据,es内应用lucene分词,python的jieba分词和lucene分词结果并不一致(或需额外的工作保持一致),早期需求只是展示每日热词,分词不一致并不是个问题,现在的新的需求,要求lda和数据无缝结合,es集成jieba,再把es内的数据全用全量数据重新分词,考虑工作量和技术难度上都不现实,只好改lda的分词算法了(实际应用上,不同的分词算法在lda提取主题和热词的场景下几乎没有影响)
2 python项目限于单点计算,不好扩展
应用lucene分词,再计算lda,切换至java的技术栈是最简单的办法
java下也有很多lda的算法实现,只作可行性验证
这篇主要调研的是spark-mllib,官方有lda实现
http://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda
ls examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala
root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample
2019-03-21 10:28:07 WARN  Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
2019-03-21 10:28:07 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-03-21 10:28:07 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Missing argument <input>...
LDAExample: an example LDA app for plain text data.
Usage: LDAExample [options] <input>...
  --k <value>              number of topics. default: 20
  --maxIterations <value>  number of iterations of learning. default: 10
  --docConcentration <value>
                           amount of topic smoothing to use (> 1.0) (-1=auto).  default: -1.0
  --topicConcentration <value>
                           amount of term (word) smoothing to use (> 1.0) (-1=auto).  default: -1.0
  --vocabSize <value>      number of distinct word types to use, chosen by frequency. (-1=all)  default: 10000
  --stopwordFile <value>   filepath for a list of stopwords. Note: This must fit on a single machine.  default:
  --algorithm <value>      inference algorithm to use. em and online are supported. default: em
  --checkpointDir <value>  Directory for checkpointing intermediate results.  Checkpointing helps with recovery and eliminates temporary shuffle files on disk.  default: None
  --checkpointInterval <value>
                           Iterations between each checkpoint.  Only used if checkpointDir is set. default: 10
  <input>...               input paths (directories) to plain text corpora.  Each text file line should hold 1 document.
2019-03-21 10:28:10 INFO  ShutdownHookManager:54 - Shutdown hook called
2019-03-21 10:28:10 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-0022e162-fa2a-4e0e-8e89-c78ea7962dd1
官方示例
Refer to the LDA Scala docs and DistributedLDAModel Scala docs for details on the API.
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(‘ ‘).map(_.toDouble)))
// Index documents with unique IDs
val corpus = parsedData.zipWithIndex.map(_.swap).cache()
// Cluster the documents into three topics using LDA
val ldaModel = new LDA().setK(3).run(corpus)
// Output topics. Each is a distribution over words (matching word count vectors)
println(s"Learned topics (as distributions over vocab of ${ldaModel.vocabSize} words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print(s"Topic $topic :")
  for (word <- Range(0, ldaModel.vocabSize)) {
    print(s"${topics(word, topic)}")
  }
  println()
}
// Save and load model.
ldaModel.save(sc, "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
val sameModel = DistributedLDAModel.load(sc,
  "target/org/apache/spark/LatentDirichletAllocationExample/LDAModel")
文件为"data/mllib/sample_lda_data.txt"
root@wx-social-consume2:/usr/spark-2.4.0# ./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt
2019-03-21 10:29:55 WARN  Utils:66 - Your hostname, wx-social-consume2 resolves to a loopback address: 127.0.0.1; using 10.10.3.47 instead (on interface em1)
2019-03-21 10:29:55 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-03-21 10:29:56 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-03-21 10:29:58 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-03-21 10:29:58 INFO  SparkContext:54 - Submitted application: LDAExample with {
  input:	List(data/mllib/sample_lda_data.txt),
  k:	20,
  maxIterations:	10,
  docConcentration:	-1.0,
  topicConcentration:	-1.0,
  vocabSize:	10000,
  stopwordFile:	,
  algorithm:	em,
  checkpointDir:	None,
  checkpointInterval:	10
}
2019-03-21 10:29:58 INFO  SecurityManager:54 - Changing view acls to: root
2019-03-21 10:29:58 INFO  SecurityManager:54 - Changing modify acls to: root
2019-03-21 10:29:58 INFO  SecurityManager:54 - Changing view acls groups to:
2019-03-21 10:29:58 INFO  SecurityManager:54 - Changing modify acls groups to:
2019-03-21 10:29:58 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2019-03-21 10:29:58 INFO  Utils:54 - Successfully started service ‘sparkDriver‘ on port 63504.
2019-03-21 10:29:58 INFO  SparkEnv:54 - Registering MapOutputTracker
2019-03-21 10:29:58 INFO  SparkEnv:54 - Registering BlockManagerMaster
2019-03-21 10:29:58 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2019-03-21 10:29:58 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2019-03-21 10:29:58 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-ac172871-a0e5-41af-9e76-a09ea01c3428
2019-03-21 10:29:58 INFO  MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2019-03-21 10:29:58 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2019-03-21 10:29:59 INFO  log:192 - Logging initialized @4022ms
2019-03-21 10:29:59 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-03-21 10:29:59 INFO  Server:419 - Started @4109ms
2019-03-21 10:29:59 WARN  Utils:66 - Service ‘SparkUI‘ could not bind on port 4040. Attempting port 4041.
2019-03-21 10:29:59 INFO  AbstractConnector:278 - Started ServerConnector@1386313f{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2019-03-21 10:29:59 INFO  Utils:54 - Successfully started service ‘SparkUI‘ on port 4041.
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1921994e{/jobs,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a8df0b3{/jobs/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c112f5f{/jobs/job,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fd05028{/jobs/job/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a2d3909{/stages,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fb392c4{/stages/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@194d329e{/stages/stage,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@541179e7{/stages/stage/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24386839{/stages/pool,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b32b129{/stages/pool/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@439e3cb4{/storage,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1c9fbb61{/storage/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7b81616b{/storage/rdd,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15d42ccb{/storage/rdd/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@279dd959{/environment,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46383a78{/environment/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36c281ed{/executors,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@244418a{/executors/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b5a078a{/executors/threadDump,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4c361f63{/executors/threadDump/json,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ed922e1{/static,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@35f3a22c{/,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a0c5e9{/api,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@32e652b6{/jobs/job/kill,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4ba02375{/stages/stage/kill,null,AVAILABLE,@Spark}
2019-03-21 10:29:59 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://10.10.3.47:4041
2019-03-21 10:29:59 INFO  SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/scopt_2.11-3.7.0.jar at spark://10.10.3.47:63504/jars/scopt_2.11-3.7.0.jar with timestamp 1553164199210
2019-03-21 10:29:59 INFO  SparkContext:54 - Added JAR file:///usr/spark-2.4.0/examples/jars/spark-examples_2.11-2.4.0.jar at spark://10.10.3.47:63504/jars/spark-examples_2.11-2.4.0.jar with timestamp 1553164199211
2019-03-21 10:29:59 INFO  Executor:54 - Starting executor ID driver on host localhost
2019-03-21 10:29:59 INFO  Utils:54 - Successfully started service ‘org.apache.spark.network.netty.NettyBlockTransferService‘ on port 29540.
2019-03-21 10:29:59 INFO  NettyBlockTransferService:54 - Server created on 10.10.3.47:29540
2019-03-21 10:29:59 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2019-03-21 10:29:59 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO  BlockManagerMasterEndpoint:54 - Registering block manager 10.10.3.47:29540 with 366.3 MB RAM, BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 10.10.3.47, 29540, None)
2019-03-21 10:29:59 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6587305a{/metrics/json,null,AVAILABLE,@Spark}
Corpus summary:
	 Training set size: 12 documents
	 Vocabulary size: 10 terms
	 Training set size: 62 tokens
	 Preprocessing time: 5.008004889 sec
Finished training LDA model.  Summary:
	 Training time: 3.565145577 sec
	 Training data average log likelihood: -20.14862821928427
20 topics:
TOPIC 0
0	0.2776616245970018
1	0.2467437437143153
2	0.162799944092254
3	0.14357469330901143
4	0.06928101473026803
9	0.04699050355830519
5	0.030203661449368247
6	0.007640498460509008
7	0.007552251606432597
8	0.007552064482534436
TOPIC 1
0	0.36469051153508564
1	0.18646127739213272
2	0.15522432783704704
3	0.13106181033451642
4	0.06584596207691033
9	0.04395627892970295
5	0.030062711796482375
8	0.007597985215275549
7	0.00759765588135554
6	0.007501479001491501
TOPIC 2
0	0.31931887563375516
1	0.1962082806692423
2	0.17231677857837657
3	0.14159208272582544
4	0.07269228847838642
9	0.044804492369582984
5	0.030074171215304125
7	0.007707667829625457
8	0.007707603060281631
6	0.007577759439619957
TOPIC 3
0	0.33172427138380095
1	0.2099908538476618
2	0.1529315801366396
3	0.13055578109706914
4	0.07654434907166915
9	0.045380994421043694
5	0.03023101958436096
8	0.007550809491464853
7	0.007550747971239613
6	0.00753959299505035
TOPIC 4
0	0.32678177168416583
1	0.2207731082949569
2	0.1515085115377772
3	0.13531105489561282
4	0.06780498015453475
9	0.04539833136958728
5	0.02983504958532268
6	0.007557895637294794
7	0.007514688233324302
8	0.007514608607423405
TOPIC 5
0	0.3239146979457089
1	0.21573894993478243
2	0.14525808340440868
3	0.14073654896536467
4	0.07677900991703532
9	0.04501240647651081
5	0.03001971166224193
6	0.007555937737508214
7	0.00749238505224968
8	0.007492268904189478
TOPIC 6
0	0.31540967245137175
1	0.2455372478674326
2	0.15387618434880607
3	0.11776433445702658
4	0.06958508618823488
9	0.0454694806523595
5	0.029862323423880312
6	0.007550844343998941
7	0.007472470469363186
8	0.007472355797526257
TOPIC 7
0	0.3288373150630124
1	0.1984831879713159
2	0.14673351329227885
3	0.13721294614635954
4	0.09099655027113343
9	0.04408525595608986
5	0.031035301494144973
8	0.007542328434232405
7	0.007542260756614362
6	0.007531340614818258
TOPIC 8
0	0.3193586979466173
2	0.19372032290375615
1	0.17385196290359722
3	0.15067297324203144
4	0.06431584983802775
9	0.04459867057873216
5	0.030109171006041484
7	0.007889712095309039
8	0.007889665041634011
6	0.007592974444253261
TOPIC 9
0	0.30194242995872983
1	0.19005321586219182
2	0.18325896599060398
3	0.13814239333008232
4	0.08613769346253837
9	0.04615261626883708
5	0.031097388101704492
8	0.007811116080942582
7	0.00781095323650845
6	0.007593227707861054
TOPIC 10
0	0.3018333962608689
1	0.2101327107384649
2	0.17976426989184977
3	0.13028433691427393
4	0.0785628713958075
9	0.04589483819874114
5	0.030473519782385595
8	0.007732641246914153
7	0.007732610576531091
6	0.007588804994162942
TOPIC 11
0	0.2731844012758353
1	0.2437828502953355
2	0.15602289141674863
3	0.14124037120394928
4	0.08621787369766198
9	0.04595175938704448
5	0.03092529389836489
6	0.007624331986973666
8	0.007525230917852694
7	0.007524995920233587
TOPIC 12
0	0.2754166406308875
1	0.23790442921451685
2	0.16407059546150252
3	0.15224175554500483
4	0.06911748308147463
9	0.04812626048120002
5	0.030286734901316135
6	0.007658288924620451
8	0.007588920737731278
7	0.0075888910217458685
TOPIC 13
0	0.30106045484425825
1	0.24631760902734479
3	0.14410465573241527
2	0.12524822006978883
4	0.08639636802254201
9	0.04423754911967344
5	0.030456963732366213
6	0.007563767806759953
7	0.00730722647659661
8	0.0073071851682546506
TOPIC 14
0	0.31562044416659885
1	0.2526998226062497
2	0.14535277752245734
3	0.12218888456011873
4	0.0662622160602047
9	0.04569769393216447
5	0.02981990651641202
6	0.007553483942629782
7	0.007402409879010876
8	0.00740236081415358
TOPIC 15
0	0.32058662644832936
1	0.20394629563372832
2	0.14716518634378675
3	0.1428368219219527
4	0.08608013435759047
9	0.046286592753669746
5	0.030458243185553725
6	0.007557253951254248
8	0.007541483951171858
7	0.007541361452962736
TOPIC 16
0	0.34720661993137836
1	0.19335921917713203
2	0.1574642973682748
3	0.1315805755079897
4	0.07262519571107433
9	0.044460628102199806
5	0.03055492723901837
7	0.0076115120495338995
8	0.007611307835359196
6	0.0075257170780395
TOPIC 17
0	0.276113881630822
1	0.23014600960245993
2	0.1791490549120877
3	0.13887566068288787
4	0.0755460207226412
9	0.04662339718458672
5	0.03051544813778674
7	0.007695876502113754
8	0.007695757393934347
6	0.00763889323067976
TOPIC 18
0	0.3143607517778961
1	0.23372981830486098
2	0.15462561577264133
3	0.12748419732486796
4	0.07250495549430869
9	0.04461512010586714
5	0.030101883279393397
6	0.007561984484167818
8	0.007507862285458234
7	0.007507811170538331
TOPIC 19
0	0.27641365880548574
1	0.25811740868556055
2	0.15554256853273
3	0.1299929833331776
4	0.08206677473861888
9	0.04536961458219285
5	0.029948048909439938
6	0.007602171935254187
7	0.007473418458492894
8	0.007473352019047361
官方用例测试通过
./bin/run-example mllib.LDAExample data/mllib/sample_lda_data.txt --k=3
Corpus summary:
	 Training set size: 12 documents
	 Vocabulary size: 10 terms
	 Training set size: 62 tokens
	 Preprocessing time: 3.952237148 sec
Finished training LDA model.  Summary:
	 Training time: 2.096613705 sec
	 Training data average log likelihood: -19.94356723200465
3 topics:
TOPIC 0
0	0.3859910534194794
1	0.27784612698328287
3	0.11309885992982183
2	0.07754302744583665
4	0.05944898611629052
9	0.03956337141492645
5	0.026961271416303733
6	0.00729179016688132
7	0.006153234509962162
8	0.006102278597215016
TOPIC 1
0	0.27760966762034767
2	0.20133458026264603
1	0.17674769647471528
3	0.17119003866520263
4	0.05856966790108555
9	0.054142861315096644
5	0.036444242827270906
6	0.008131736037023057
7	0.008064284551869805
8	0.007765224344742546
TOPIC 2
0	0.26781477963691924
1	0.20419276965972555
2	0.1988283635339889
3	0.12491741400170082
4	0.10934994319103938
9	0.0426866734196486
5	0.027519743427499636
8	0.008867804538854353
7	0.008517400571616986
6	0.007305108019006569
看sample_lda_data.txt的内容
root@wx-social-consume2:/usr/spark-2.4.0# cat data/mllib/sample_lda_data.txt
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0
矩阵11行,怎么只输出了10列,理解有错?
TOPIC 2
0	0.26781477963691924
1	0.20419276965972555
2	0.1988283635339889
3	0.12491741400170082
4	0.10934994319103938
9	0.0426866734196486
5	0.027519743427499636
8	0.008867804538854353
7	0.008517400571616986
6	0.007305108019006569
看代码
val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = 10)
是只取出了top10的词,理解应该没问题
离线单点计算
原文:https://www.cnblogs.com/zihunqingxin/p/10591799.html