
Task 1 - Sentiment Analysis on Movie Reviews

Date: 2020-03-17 20:22:06
'''
 0 - negative
 1 - somewhat negative
 2 - neutral
 3 - somewhat positive
 4 - positive
'''

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


train_data = pd.read_csv('train.tsv', sep='\t')
print(train_data.head())   # training set

test_data = pd.read_csv('test.tsv', sep='\t')
print(test_data.head())   # test set

print('--------- Starting feature extraction')
# ---------------------------------- Feature extraction

vectorizer = CountVectorizer(ngram_range=(1, 3),  # unigram to trigram features
                             max_features=150000)

corpus_train = train_data['Phrase']   # training corpus
corpus_test = test_data['Phrase']   # test corpus

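# Build the vocabulary from both the training and test phrases
vectorizer.fit(pd.concat([corpus_train, corpus_test]))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(vectorizer.get_feature_names_out()[:10])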

X_train = vectorizer.transform(corpus_train)    # vectorize the training phrases
X_test = vectorizer.transform(corpus_test)    # vectorize the test phrases

y_train = list(train_data['Sentiment'])

print(type(X_train))
print(X_train[:1])
print(y_train[:5])

print('--------- Starting training')
# ----------------------------------- Training

model = LogisticRegression(max_iter=1000000)
model.fit(X_train, y_train)

y_test = model.predict(X_test)
#y_test = [4]*66292

print('--------- Writing output')
# ------------------------------------ Output

output = pd.DataFrame({'PhraseId': test_data.PhraseId,
                       'Sentiment': y_test})
output.to_csv('my_submission.csv', index=False)  # write the submission file


print('ok')
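
The `ngram_range=(1,3)` setting in the script above makes `CountVectorizer` count single words, word pairs, and word triples. The following is a minimal pure-Python sketch of that idea, not scikit-learn's actual implementation. It assumes the vectorizer's default tokenization (lowercasing, tokens of two or more word characters), and the helper name `word_ngrams` and the sample phrase are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: mimic CountVectorizer's default tokenization, i.e. lowercase
# the text and keep only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 3)):
    """Hypothetical helper: list the word n-grams for each n in ngram_range (inclusive)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the phrase
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


phrase = "A series of escapades"   # illustrative phrase, not taken from train.tsv
counts = Counter(word_ngrams(phrase))
print(sorted(counts))
# ['escapades', 'of', 'of escapades', 'series', 'series of', 'series of escapades']
```

The single-character token "a" is dropped by the token pattern, so a short phrase yields fewer features than its word count suggests. With `max_features=150000`, the real vectorizer additionally keeps only the most frequent n-grams across the corpus, which this sketch does not model.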


Original article: https://www.cnblogs.com/holaworld/p/12512889.html
