As a breather in our series on the past and present of Boosting algorithms, today we look at a very interesting Kaggle competition: using news data to predict financial markets. The competition was hosted on Kaggle by the hedge fund Two Sigma, with a prize pool of $100,000, and drew many strong Kagglers who contributed ideas. This post covers the competition's background, data, submission process, and evaluation method, then presents one solution: an EDA (exploratory data analysis) workflow covering time-series analysis, data visualization, and outlier handling, ending with a submission. The kernel is by Andrew Lukyanenko. In one line: two hours of EDA, five minutes of modeling.
Can we predict stock price movements by analyzing news content? Today, multi-dimensional data lets investors make better decisions; the main challenge is extracting useful signal from this ocean of information. Two Sigma hosted a competition on Kaggle to predict stock prices from news data, giving Kagglers a chance to advance this research, and the results could have significant economic impact around the world.
The data for this competition comes from the following sources: market data is provided by Intrinio, and the news data by Thomson Reuters.
You must predict a signed confidenceValue for each assetCode's 10-day market-adjusted return. If you believe a stock will have a large positive return relative to the broad market over the next ten days, you can assign it a large, positive confidenceValue (close to \(1.0\)). If you believe it will have a negative return, you can assign it a large, negative confidenceValue (close to \(-1.0\)). If you are unsure, you can assign a confidenceValue near zero. A universe flag (see the data description for details), \(u_{ti}\), indicates whether a particular asset is included in scoring on a particular day.
This is a Kernels-only, two-stage competition, and the second stage is a genuine forecast of the future. In stage one, participants build models, and the leaderboard reflects scores over a historical period. At the end of stage one, code is frozen and the leaderboard switches to scores on future data: Kaggle re-runs each participant's selected Kernel on the future data and re-submits the submission file that Kernel generates.
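For reference, the evaluation metric can be reconstructed from this description: if \(\hat{y}_{ti}\) is your confidenceValue for asset \(i\) on day \(t\), \(r_{ti}\) its 10-day market-adjusted return, and \(u_{ti}\) the universe flag, each day contributes
\[ x_t = \sum_i \hat{y}_{ti}\, r_{ti}\, u_{ti} \]
and the final score is \(\bar{x}_t / \sigma(x_t)\), the mean of the daily values divided by their standard deviation, i.e. a Sharpe-ratio-like quantity over the daily portfolio returns implied by your confidence values.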
All submissions in this competition are made through the Kernels environment. The environment provides a custom Python module that participants must use to access the competition data, make predictions, and write a proper submission file; this module ensures that models cannot use future information when predicting. For simplicity, a submission file covers the historical stage-1 period as well as the future stage-2 period, which means that at any given time only one submission file is "valid" (participants predict over both stages' time spans at once). During scoring, Kaggle ignores predicted values that fall outside the current stage.
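A minimal sketch of how the module's API fits together, using only the calls that appear later in this post:
from kaggle.competitions import twosigmanews

env = twosigmanews.make_env()  # one env per Kernel session
market_train_df, news_train_df = env.get_training_data()

# one iteration per prediction day; predictions are handed back via env.predict
for market_obs_df, news_obs_df, predictions_template_df in env.get_prediction_days():
    # fill predictions_template_df['confidenceValue'] with values in [-1, 1] here
    env.predict(predictions_template_df)

env.write_submission_file()  # formats and writes the submission automatically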
When you call env.write_submission_file(), the Kernels environment automatically formats and creates the submission file; there is no need to build one manually. It looks like:
time,assetCode,confidenceValue
2017-01-03,RPXC.O,0.1
2017-01-04,RPXC.O,0.02
2017-01-05,RPXC.O,-0.3
etc.
Financial instruments are identified by an assetCode (note that a single company may have multiple assetCodes). Depending on your purpose, you can use assetCode, assetName, or time to JOIN the market data with the news data.
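For instance, a rough sketch of such a join on calendar date and a single extracted asset code (hypothetical market_df/news_df frames; the data_prep function later in this kernel does essentially this):
import pandas as pd

# news 'assetCodes' holds a set literal such as "{'AAPL.O', 'AAPL.OQ'}", so extract one code first
news_df['assetCode'] = news_df['assetCodes'].map(lambda s: list(eval(s))[0])
market_df['date'] = market_df['time'].dt.date
news_df['date'] = news_df['firstCreated'].dt.date

merged = pd.merge(market_df, news_df, how='left', on=['date', 'assetCode'])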
Within the market data, you will find the following columns:
time (datetime64[ns, UTC]) - the current time (in the market data, all rows are stamped 22:00 UTC)
assetCode (object) - a unique ID for the asset
assetName (category) - the name corresponding to a group of assetCodes. These may be "Unknown" if the corresponding assetCode has no rows in the news data.
universe (float64) - a boolean indicating whether the instrument is included in scoring on that day. Not provided outside the training time period. The trading universe on a given date is the set of instruments available for trading (the scoring function does not score instruments outside the trading universe). The universe changes daily.
volume (float64) - trading volume for the day
close (float64) - the day's close price (not adjusted for splits or dividends)
open (float64) - the day's open price (not adjusted for splits or dividends)
returnsClosePrevRaw1 (float64) - see the returns note below
returnsOpenPrevRaw1 (float64) - see the returns note below
returnsClosePrevMktres1 (float64) - see the returns note below
returnsOpenPrevMktres1 (float64) - see the returns note below
returnsClosePrevRaw10 (float64) - see the returns note below
returnsOpenPrevRaw10 (float64) - see the returns note below
returnsClosePrevMktres10 (float64) - see the returns note below
returnsOpenPrevMktres10 (float64) - see the returns note below
returnsOpenNextMktres10 (float64) - market-residualized return over the next 10 days. This is the target variable used in competition scoring. The market data has been filtered so that returnsOpenNextMktres10 is never null.
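The returns columns follow a naming convention: Raw means unadjusted returns while Mktres means market-residualized (the broad market's movement taken out); Prev/Next means backward/forward looking; and the trailing 1 or 10 is the window in days. As a hedged illustration (my reconstruction from the naming, not the official computation), the raw backward-looking variants could be derived per asset roughly like this:
# hypothetical sketch: raw returns for a single asset's rows, sorted by time,
# assuming consecutive rows are consecutive trading days
df = df.sort_values('time')
df['returnsClosePrevRaw1'] = df['close'] / df['close'].shift(1) - 1
df['returnsOpenPrevRaw1'] = df['open'] / df['open'].shift(1) - 1
df['returnsClosePrevRaw10'] = df['close'] / df['close'].shift(10) - 1
df['returnsOpenPrevRaw10'] = df['open'] / df['open'].shift(10) - 1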
The news data contains news articles and asset-level information:
time (datetime64[ns, UTC]) - UTC timestamp when the data was available on the feed
sourceTimestamp (datetime64[ns, UTC]) - UTC timestamp when this news item was created
firstCreated (datetime64[ns, UTC]) - UTC timestamp for the first version of the item
sourceId (object) - ID of the news item
headline (object) - the headline
urgency (int8) - the item type (1: alert, 3: article)
takeSequence (int16) - take sequence number of the news item, starting at 1. For a given story, alerts and articles have separate sequences.
provider (category) - identifier of the organization providing the news item (e.g., RTRS for Reuters News, BSW for Business Wire)
subjects (category) - topic codes and company identifiers relevant to this news item. Topic codes describe the item's subject matter and can cover asset classes, geographies, events, industries/sectors, and other types.
audiences (category) - identifies the desktop news products the item belongs to; these are typically tailored to specific audiences (e.g., "M" for Money International News Service, "FB" for French General News Service)
bodySize (int32) - size of the current version of the story body
companyCount (int8) - number of companies explicitly listed in the news item
headlineTag (object) - the Thomson Reuters headline tag for the news item
marketCommentary (bool) - boolean indicating whether the item discusses general market conditions
sentenceCount (int16) - total number of sentences in the item
wordCount (int32) - total number of words in the item
assetCodes (category) - asset codes mentioned in the item
assetName (category) - the asset name
firstMentionSentence (int16) - the first sentence in which the scored asset is mentioned
relevance (float32) - a decimal indicating the relevance of the news item to the asset, ranging from 0 to 1. If the asset is mentioned in the headline, relevance is set to 1. When the item is an alert (urgency == 1), relevance should instead be gauged by firstMentionSentence.
Input: import the packages we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
stop = set(stopwords.words('english'))
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score
Input: the official way to get the data
# official way to get the data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
print('Done!')
Output: import succeeded
Loading the data... This could take a minute.
Done!
Done!
Input: we have two parts of data, market data and news data; let's explore each in turn.
(market_train_df, news_train_df) = env.get_training_data()
This is a very interesting dataset containing the stock prices of many companies over more than a decade. Looking at the data itself, we can see long-term trends, companies rising and falling, and much more.
Input: print the dimensions of the market data
print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')
Output: number of samples and features
4072956 samples and 16 features in the training market dataset.
Input: see what the first five rows look like
market_train_df.head()
Output: the first five rows as a dataframe (table omitted)
Input: randomly select 10 assets and visualize their closing prices over time
data = []
for asset in np.random.choice(market_train_df['assetName'].unique(), 10):
    asset_df = market_train_df[(market_train_df['assetName'] == asset)]
    data.append(go.Scatter(
        x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = asset_df['close'].values,
        name = asset
    ))
layout = go.Layout(dict(title = "Closing prices of 10 random assets",
                        xaxis = dict(title = 'Month'),
                        yaxis = dict(title = 'Price (USD)')),
                   legend = dict(orientation = "h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')
Output: closing prices over time for the 10 randomly chosen assets (plot omitted)
Input: trends of closing-price quantiles over time
data = []
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['close'].quantile(i).reset_index()
    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['close'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of closing prices by quantiles",
                        xaxis = dict(title = 'Month'),
                        yaxis = dict(title = 'Price (USD)')),
                   legend = dict(orientation = "h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')
Analysis: it is fascinating to watch the market fall and then recover. Notice that when the market suffers severe price declines, the higher quantile prices keep increasing over time while the lower quantile prices drop. Perhaps the wealth gap keeps widening...
Input: look at the price declines in detail; compute the daily difference between close and open, then the mean of its per-day standard deviation
market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby('time').agg({'price_diff': ['std', 'min']}).reset_index()
print(f"Average standard deviation of price change within a day in {grouped['price_diff']['std'].mean():.4f}.")
Output: the average daily standard deviation is 1.0335
Average standard deviation of price change within a day is 1.0335.
Input: visualize the ten dates with the largest standard deviation
g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * g['price_diff']['min']).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode = 'markers',
    marker = dict(
        size = g['price_diff']['std'].values,
        color = g['price_diff']['std'].values,
        colorscale = 'Portland',
        showscale = True
    ),
    text = g['min_text'].values
)
data = [trace]
layout = go.Layout(
    autosize = True,
    title = 'Top 10 months by standard deviation of price change within a day',
    hovermode = 'closest',
    yaxis = dict(
        title = 'price_diff',
        ticklen = 5,
        gridwidth = 2
    ),
    showlegend = False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='scatter2010')
Output: one point stands out with a very large deviation. A guess at the cause: could it be that prices swing violently when the market crashes? That explanation doesn't hold up, though; no market crash happened in January 2010... More likely this is caused by outliers, which we handle next.
Input: inspect the 10 rows with the largest price differences
market_train_df.sort_values('price_diff')[:10]
Output: (table omitted)
Analysis: we can see that the price of "Towers Watson & Co" stock is nearly 10000... that is very likely the outlier we are looking for. But what about Bank of New York Mellon Corp? Checking Yahoo's data, Bank of New York Mellon Corp's numbers agree with those provided by the competition. Archrock Inc, on the other hand, shows a price equal to 999, which looks suspicious; checking Archrock Inc on Yahoo confirms it is indeed another outlier.
Input: how many rows show a within-day price move of 20% or more?
market_train_df['close_to_open'] = np.abs(market_train_df['close'] / market_train_df['open'])
print(f"In {(market_train_df['close_to_open'] >= 1.2).sum()} lines price increased by 20% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.8).sum()} lines price decreased by 20% or more.")
Output:
In 1211 lines price increased by 20% or more.
In 778 lines price decreased by 20% or more.
Input: keep digging into strange cases: how many rows show the price doubling or halving within a day?
print(f"In {(market_train_df['close_to_open'] >= 2).sum()} lines price increased by 100% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.5).sum()} lines price decreased by 50% or more.")
Output:
In 38 lines price increased by 100% or more.
In 16 lines price decreased by 50% or more.
Input: let's assume that rows whose price doubles or halves within a day are outliers and replace them. A quick solution is to replace the outlying value in such rows with the company's mean open or close price (a median or mode would also work).
market_train_df['assetName_mean_open'] = market_train_df.groupby('assetName')['open'].transform('mean')
market_train_df['assetName_mean_close'] = market_train_df.groupby('assetName')['close'].transform('mean')
# If the open price is too far from the mean open price for this company, replace it;
# otherwise replace the close price.
for i, row in market_train_df.loc[market_train_df['close_to_open'] >= 2].iterrows():
    if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
        market_train_df.loc[i, 'open'] = row['assetName_mean_open']
    else:
        market_train_df.loc[i, 'close'] = row['assetName_mean_close']
for i, row in market_train_df.loc[market_train_df['close_to_open'] <= 0.5].iterrows():
    if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
        market_train_df.loc[i, 'open'] = row['assetName_mean_open']
    else:
        market_train_df.loc[i, 'close'] = row['assetName_mean_close']
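Since iterrows over a four-million-row frame is slow, here is a vectorized sketch of the same replacement rule (equivalent in intent; not taken from the original kernel):
# boolean masks: rows where the daily move doubled or halved the price
mask = (market_train_df['close_to_open'] >= 2) | (market_train_df['close_to_open'] <= 0.5)
open_off = (market_train_df['assetName_mean_open'] - market_train_df['open']).abs()
close_off = (market_train_df['assetName_mean_close'] - market_train_df['close']).abs()
fix_open = mask & (open_off > close_off)     # open is the stranger value: replace it
fix_close = mask & ~(open_off > close_off)   # otherwise replace close
market_train_df.loc[fix_open, 'open'] = market_train_df.loc[fix_open, 'assetName_mean_open']
market_train_df.loc[fix_close, 'close'] = market_train_df.loc[fix_close, 'assetName_mean_close']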
Input: re-visualize the deviations
market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby(['time']).agg({'price_diff': ['std', 'min']}).reset_index()
g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * np.round(g['price_diff']['min'], 2)).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode = 'markers',
    marker = dict(
        size = g['price_diff']['std'].values * 5,
        color = g['price_diff']['std'].values,
        colorscale = 'Portland',
        showscale = True
    ),
    text = g['min_text'].values
)
data = [trace]
layout = go.Layout(
    autosize = True,
    title = 'Top 10 months by standard deviation of price change within a day',
    hovermode = 'closest',
    yaxis = dict(
        title = 'price_diff',
        ticklen = 5,
        gridwidth = 2
    ),
    showlegend = False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='scatter2010')
Output: this looks normal now.
Input: examine the target variable
data = []
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].quantile(i).reset_index()
    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['returnsOpenNextMktres10'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of returnsOpenNextMktres10 by quantiles",
                        xaxis = dict(title = 'Month'),
                        yaxis = dict(title = 'Price (USD)')),
                   legend = dict(orientation = "h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')
Input: we can see that the extreme quantiles deviate substantially while the mean changes little. Let's keep only the data from 2010 onwards and look at the mean of the target variable.
data = []
market_train_df = market_train_df.loc[market_train_df['time'] >= '2010-01-01 22:00:00+0000']
price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].mean().reset_index()
data.append(go.Scatter(
    x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = price_df['returnsOpenNextMktres10'].values,
    name = 'mean'
))
layout = go.Layout(dict(title = "Trend of returnsOpenNextMktres10 mean",
                        xaxis = dict(title = 'Month'),
                        yaxis = dict(title = 'Price (USD)')),
                   legend = dict(orientation = "h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')
Output: the swings look large, but in fact they all stay below 8%, much like random noise...
Input: examine the variables prefixed with 'returns'
data = []
for col in ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
            'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
            'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
            'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
            'returnsOpenNextMktres10']:
    df = market_train_df.groupby('time')[col].mean().reset_index()
    data.append(go.Scatter(
        x = df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = df[col].values,
        name = col
    ))
layout = go.Layout(dict(title = "Trend of mean values",
                        xaxis = dict(title = 'Month'),
                        yaxis = dict(title = 'Price (USD)')),
                   legend = dict(orientation = "h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')
Output: the backward-looking 10-day returns fluctuate the most. Now let's turn to the news data:
news_train_df.head()
print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')
Output:
9328827 samples and 35 features in the training news dataset.
Input: the file is too large to process all the text directly, so let's first look at a word cloud built from the last 1,000,000 headlines.
text = ' '.join(news_train_df['headline'].str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in headline')
plt.axis("off")
plt.show()
Output: (word cloud omitted)
Input: a look at urgency
# Let's also limit the time period
news_train_df = news_train_df.loc[news_train_df['time'] >= '2010-01-01 22:00:00+0000']
(news_train_df['urgency'].value_counts() / 1000000).plot(kind='bar');
plt.xticks(rotation=30);
plt.title('Urgency counts (mln)');
Output: there appear to be almost no rows with urgency equal to 2.
Input: words-per-sentence statistics
news_train_df['sentence_word_count'] = news_train_df['wordCount'] / news_train_df['sentenceCount']
plt.boxplot(news_train_df['sentence_word_count'][news_train_df['sentence_word_count'] < 40]);
Output: no glaring outliers; most sentences have 15-25 words.
Input: news providers
news_train_df['provider'].value_counts().head(10)
Output: we can see that Reuters (RTRS) is by far the largest provider.
RTRS 5517624
PRN 503267
BSW 472612
GNW 145309
MKW 129621
LSE 64250
HIIS 56489
RNS 39833
CNW 30779
ONE 25233
Name: provider, dtype: int64
Input: headline tag types
(news_train_df['headlineTag'].value_counts() / 1000)[:10].plot(kind='barh');
plt.title('headlineTag counts (thousands)');
Output: missing headline tags are quite prevalent.
Input: sentiment analysis
for i, j in zip([-1, 0, 1], ['negative', 'neutral', 'positive']):
    df_sentiment = news_train_df.loc[news_train_df['sentimentClass'] == i, 'assetName']
    print(f'Top mentioned companies for {j} sentiment are:')
    print(df_sentiment.value_counts().head(5))
    print('')
Output: Apple is the top-1 company for both positive and negative sentiment.
Top mentioned companies for negative sentiment are:
Apple Inc 22518
JPMorgan Chase & Co 20647
BP PLC 19328
Goldman Sachs Group Inc 17955
Bank of America Corp 17704
Name: assetName, dtype: int64
Top mentioned companies for neutral sentiment are:
HSBC Holdings PLC 19462
Credit Suisse AG 14632
Deutsche Bank AG 12959
Barclays PLC 12414
Apple Inc 10994
Name: assetName, dtype: int64
Top mentioned companies for positive sentiment are:
Apple Inc 19020
Barclays PLC 18051
Royal Dutch Shell PLC 15484
General Electric Co 14163
Boeing Co 14080
Name: assetName, dtype: int64
Input: add some features that may help the model train better, such as the daily price swing (the ratio of close to open), and normalize the features (to reduce the impact of each feature's absolute magnitude on the result).
#%%time
# code mostly taken from this kernel: https://www.kaggle.com/ashishpatel26/bird-eye-view-of-two-sigma-xgb
def data_prep(market_df, news_df):
    # market features
    market_df['time'] = market_df.time.dt.date
    market_df['returnsOpenPrevRaw1_to_volume'] = market_df['returnsOpenPrevRaw1'] / market_df['volume']
    market_df['close_to_open'] = market_df['close'] / market_df['open']
    market_df['volume_to_mean'] = market_df['volume'] / market_df['volume'].mean()
    # news features
    news_df['sentence_word_count'] = news_df['wordCount'] / news_df['sentenceCount']
    news_df['time'] = news_df.time.dt.hour
    news_df['sourceTimestamp'] = news_df.sourceTimestamp.dt.hour
    news_df['firstCreated'] = news_df.firstCreated.dt.date
    news_df['assetCodesLen'] = news_df['assetCodes'].map(lambda x: len(eval(x)))
    news_df['assetCodes'] = news_df['assetCodes'].map(lambda x: list(eval(x))[0])
    news_df['headlineLen'] = news_df['headline'].apply(lambda x: len(x))
    # note: this overwrites the set-size computed above with the length of the code string
    news_df['assetCodesLen'] = news_df['assetCodes'].apply(lambda x: len(x))
    news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['time'].transform('count')
    news_df['asset_sentence_mean'] = news_df.groupby(['assetName', 'sentenceCount'])['time'].transform('mean')
    # label-encode headline tags
    lbl = {k: v for v, k in enumerate(news_df['headlineTag'].unique())}
    news_df['headlineTagT'] = news_df['headlineTag'].map(lbl)
    # aggregate news per (day, assetCode) and join onto the market rows
    kcol = ['firstCreated', 'assetCodes']
    news_df = news_df.groupby(kcol, as_index=False).mean()
    market_df = pd.merge(market_df, news_df, how='left', left_on=['time', 'assetCode'],
                         right_on=['firstCreated', 'assetCodes'])
    # label-encode asset codes
    lbl = {k: v for v, k in enumerate(market_df['assetCode'].unique())}
    market_df['assetCodeT'] = market_df['assetCode'].map(lbl)
    market_df = market_df.dropna(axis=0)
    return market_df
market_train_df.drop(['price_diff', 'assetName_mean_open', 'assetName_mean_close'], axis=1, inplace=True)
market_train = data_prep(market_train_df, news_train_df)
print(market_train.shape)
up = market_train.returnsOpenNextMktres10 >= 0
fcol = [c for c in market_train.columns if c not in ['assetCode', 'assetCodes', 'assetCodesLen', 'assetName', 'assetCodeT',
                                                     'firstCreated', 'headline', 'headlineTag', 'marketCommentary', 'provider',
                                                     'returnsOpenNextMktres10', 'sourceId', 'subjects', 'time', 'time_x', 'universe', 'sourceTimestamp']]
X = market_train[fcol].values
up = up.values
r = market_train.returnsOpenNextMktres10.values
# Scaling of X values
mins = np.min(X, axis=0)
maxs = np.max(X, axis=0)
rng = maxs - mins
X = 1 - ((maxs - X) / rng)
Output:
(611261, 54)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:
invalid value encountered in subtract
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:
invalid value encountered in true_divide
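A note on the scaling and on those warnings: the expression X = 1 - ((maxs - X) / rng) is just min-max normalization in disguise, since \(1 - \frac{\max - x}{\max - \min} = \frac{x - \min}{\max - \min}\), mapping each column to \([0, 1]\). The RuntimeWarnings are presumably produced by columns whose range is zero or non-finite, which makes the subtraction and division invalid for those entries.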
Input: train LightGBM (a future post can cover the parameters of the various boosting algorithms in detail)
X_train, X_test, up_train, up_test, r_train, r_test = model_selection.train_test_split(X, up, r, test_size=0.1, random_state=99)
params = {'learning_rate': 0.01, 'max_depth': 12, 'boosting': 'gbdt', 'objective': 'binary',
          'metric': 'auc', 'is_training_metric': True, 'seed': 42}
model = lgb.train(params, train_set=lgb.Dataset(X_train, label=up_train), num_boost_round=2000,
                  valid_sets=[lgb.Dataset(X_train, label=up_train), lgb.Dataset(X_test, label=up_test)],
                  verbose_eval=100, early_stopping_rounds=100)
Training log: watch the AUC
Training until validation scores don't improve for 100 rounds.
[100] valid_0's auc: 0.570258 valid_1's auc: 0.566332
[200] valid_0's auc: 0.573703 valid_1's auc: 0.567868
[300] valid_0's auc: 0.577024 valid_1's auc: 0.568927
[400] valid_0's auc: 0.580109 valid_1's auc: 0.569985
[500] valid_0's auc: 0.582933 valid_1's auc: 0.570694
[600] valid_0's auc: 0.585372 valid_1's auc: 0.571191
[700] valid_0's auc: 0.58784 valid_1's auc: 0.571578
[800] valid_0's auc: 0.590147 valid_1's auc: 0.571726
[900] valid_0's auc: 0.592448 valid_1's auc: 0.571908
[1000] valid_0's auc: 0.594658 valid_1's auc: 0.57203
[1100] valid_0's auc: 0.596887 valid_1's auc: 0.572259
[1200] valid_0's auc: 0.598918 valid_1's auc: 0.572422
[1300] valid_0's auc: 0.601052 valid_1's auc: 0.572563
[1400] valid_0's auc: 0.603196 valid_1's auc: 0.57269
[1500] valid_0's auc: 0.605227 valid_1's auc: 0.572756
[1600] valid_0's auc: 0.60723 valid_1's auc: 0.572837
[1700] valid_0's auc: 0.609211 valid_1's auc: 0.572897
[1800] valid_0's auc: 0.611181 valid_1's auc: 0.573038
[1900] valid_0's auc: 0.613095 valid_1's auc: 0.573162
[2000] valid_0's auc: 0.615015 valid_1's auc: 0.573307
Did not meet early stopping. Best iteration is:
[2000] valid_0's auc: 0.615015 valid_1's auc: 0.573307
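Note that the model is trained as a binary classifier estimating \(p = P(\text{returnsOpenNextMktres10} \ge 0)\); in the submission loop below, this probability is mapped to the required confidence value in \([-1, 1]\) via \(\hat{y} = 2p - 1\), so a maximally uncertain prediction of \(p = 0.5\) yields a confidence of 0.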
Input: inspect feature importances
def generate_color():
    # random hex color like '#a1b2c3'
    color = '#{:02x}{:02x}{:02x}'.format(*map(lambda x: np.random.randint(0, 255), range(3)))
    return color

df = pd.DataFrame({'imp': model.feature_importance(), 'col': fcol})
df = df.sort_values(['imp', 'col'], ascending=[True, False])
colors = [generate_color() for _ in range(len(df))]
data = [
    go.Bar(
        orientation = 'h',
        x = df.imp,
        y = df.col,
        name = 'Features',
        textfont = dict(size=20),
        marker = dict(
            color = colors,
            line = dict(color='#000000', width=0.5),
            opacity = 0.87
        )
    )
]
layout = go.Layout(
    title = 'Feature Importance of LGB',
    xaxis = dict(title='Importance', ticklen=5, zeroline=False, gridwidth=2),
    yaxis = dict(title='Feature', ticklen=5, gridwidth=2),
    showlegend = True
)
py.iplot(dict(data=data, layout=layout), filename='horizontal-bar')
Input: make the submission
days = env.get_prediction_days()
import time

n_days = 0
prep_time = 0
prediction_time = 0
packaging_time = 0
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days += 1
    if n_days % 50 == 0:
        print(n_days, end=' ')
    t = time.time()
    market_obs_df = data_prep(market_obs_df, news_obs_df)
    market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
    X_live = market_obs_df[fcol].values
    X_live = 1 - ((maxs - X_live) / rng)
    prep_time += time.time() - t

    t = time.time()
    lp = model.predict(X_live)
    prediction_time += time.time() - t

    t = time.time()
    confidence = 2 * lp - 1   # map probability in [0, 1] to confidence in [-1, 1]
    preds = pd.DataFrame({'assetCode': market_obs_df['assetCode'], 'confidence': confidence})
    predictions_template_df = predictions_template_df.merge(preds, how='left').drop('confidenceValue', axis=1).fillna(0).rename(columns={'confidence': 'confidenceValue'})
    env.predict(predictions_template_df)
    packaging_time += time.time() - t

env.write_submission_file()
Output:
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:
invalid value encountered in true_divide
50 100 150
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:
invalid value encountered in subtract
200 250 300 350 400 450 500 550 600 Your submission file has been saved. Once you `Commit` your Kernel and it finishes running, you can submit the file to the competition from the Kernel Viewer `Output` tab.
Original post: https://www.cnblogs.com/AIKaggle/p/11556108.html