Predicting Stock Prices from News: Hedge Fund Two Sigma Seeks Intelligent Solutions

Date: 2019-09-20 13:52:16

Introduction
Evaluation metric
Submission instructions
- Submission content
- Submission file
Data description
- Market data
- News data
Solution
Summary

WeChat public account: AIKaggle

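Introduction

As an interlude in the series on the past and present of Boosting algorithms, today's post covers a very interesting Kaggle competition: using news data to predict the financial market. The competition was launched on Kaggle by the hedge fund Two Sigma with a prize pool of 100,000 US dollars, and it attracted many strong Kagglers who shared their ideas. This post introduces the competition's background, data, submission process, and evaluation method. It then walks through one solution and its EDA (exploratory data analysis), including time-series analysis, data visualization, and outlier handling, and ends with a submission. The kernel was written by Andrew Lukyanenko. In one sentence: two hours of EDA, five minutes of modeling.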

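Can we predict stock price performance by analyzing news content? Today, data from many dimensions lets investors make better investment decisions; the main challenge is extracting useful information from this ocean of data and putting it to use. Two Sigma hosted a competition on Kaggle to predict stock prices from news data, giving Kagglers a chance to advance this research and explore how news can predict stock prices. The results could have a significant economic impact worldwide.
The competition data comes from the following sources:
Intrinio provides the market data.
Thomson Reuters provides the news data.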

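Evaluation metric

In this competition you must predict a signed confidence value \(\hat y_{ti} \in [-1,1]\), which is multiplied by the 10-day market-adjusted return of the given assetCode. If you expect a stock to have a large positive return relative to the broad market over the next ten days, assign it a large positive confidenceValue (close to \(1.0\)). If you expect a negative return, assign a large negative confidenceValue (close to \(-1.0\)). If you are unsure, assign a confidenceValue close to zero.
For each day in the evaluation period we compute:
\[x_t =\sum_{i}\hat y_{ti}r_{ti}u_{ti}\]
where \(r_{ti}\) is the 10-day market-adjusted return of asset \(i\) on day \(t\), and \(u_{ti}\) is the 0/1 universe variable (see the data description) indicating whether a given asset is included in scoring on a given day.
Your submission score is the mean of the daily \(x_t\) values divided by their standard deviation:
\[score = \frac{\bar x_t}{\sigma(x_t)}\]
If the standard deviation of the predictions is 0, the score is defined as 0.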
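
The metric can be sketched in a few lines of pandas. This is an illustration rather than the official scorer: the function name `competition_score` and the toy numbers are made up here, the zero-score rule is interpreted as a zero standard deviation of the daily \(x_t\), and the population standard deviation (`ddof=0`) is an assumption, since the text does not say which variant is used.

```python
import numpy as np
import pandas as pd


def competition_score(df: pd.DataFrame) -> float:
    """Mean of the daily x_t divided by its standard deviation (0.0 if that is 0)."""
    # Per-row term: confidence * 10-day market-adjusted return * universe flag.
    term = df["confidenceValue"] * df["returnsOpenNextMktres10"] * df["universe"]
    # x_t: sum of the terms over all assets for each day t.
    x_t = term.groupby(df["time"]).sum()
    # Assumption: population standard deviation (ddof=0).
    sigma = x_t.std(ddof=0)
    if sigma == 0 or np.isnan(sigma):
        return 0.0
    return float(x_t.mean() / sigma)


# Toy data (made up for illustration): two days, two assets.
toy = pd.DataFrame({
    "time": ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04"],
    "assetCode": ["A", "B", "A", "B"],
    "confidenceValue": [1.0, -1.0, 0.5, 1.0],
    "returnsOpenNextMktres10": [0.02, -0.01, 0.02, 0.05],
    "universe": [1.0, 1.0, 1.0, 0.0],
})
score = competition_score(toy)  # daily x_t values are 0.03 and 0.01
```

Because the score is a mean divided by a standard deviation, it rewards predictions that are consistently right across days, not just right on average.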

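Submission instructions

Submission content

This is a Kernels-only, two-stage competition in which the second stage is a true prediction of the future. In stage one, participants build models and the leaderboard reflects scores on a historical period. At the end of stage one the code is frozen and the leaderboard switches to scores on future data. Kaggle reruns each participant's selected Kernel on the future data and resubmits the submission file that the Kernel produces.
All submissions go through the Kernels environment, which provides a custom Python module that participants must use to access the competition data, make predictions, and write a valid submission file. The module ensures that a model cannot use future information when predicting. For simplicity, a submission file covers both the historical stage-one period and the future stage-two period. This means only one submission file is "active" at any given time (participants predict the time spans of both stages at once). During scoring, Kaggle ignores predictions outside the current stage.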

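Submission file

You must submit directly from Kaggle Kernels. When you call env.write_submission_file(), the Kernels environment formats and creates the submission file automatically; there is no need to build it by hand.
The submission format is:

time,assetCode,confidenceValue
2017-01-03,RPXC.O,0.1
2017-01-04,RPXC.O,0.02
2017-01-05,RPXC.O,-0.3
etc.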

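Data description

In this competition you predict future stock prices from two data sources:

1. Market data (2007 to present), provided by Intrinio: financial market information such as open price, close price, trading volume, and calculated returns.
2. News data (2007 to present), provided by Thomson Reuters: information about news articles and alerts concerning the assets, such as article details, sentiment, and other commentary.

Each asset is identified by an assetCode (note that a single company may have several assetCodes). Depending on your goal, you can JOIN the market data and the news data on assetCode, assetName, or time.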
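
A minimal sketch of such a join, under simplifying assumptions: the toy frames below are made up, each news row carries a single `assetCode` (the real news table stores several codes per row in `assetCodes`), and the helper columns `date`, `news_count`, and `mean_relevance` are illustrative names. News is first aggregated to one row per day and asset, then left-joined onto the market rows.

```python
import pandas as pd

# Toy frames (made up for illustration).
market = pd.DataFrame({
    "time": pd.to_datetime(
        ["2017-01-03 22:00", "2017-01-04 22:00", "2017-01-05 22:00"], utc=True),
    "assetCode": ["RPXC.O"] * 3,
    "close": [10.0, 10.5, 10.2],
})
news = pd.DataFrame({
    "time": pd.to_datetime(
        ["2017-01-03 09:15", "2017-01-03 14:40", "2017-01-04 11:00"], utc=True),
    "assetCode": ["RPXC.O"] * 3,  # assumption: one code per news row
    "relevance": [1.0, 0.5, 0.25],
})

# Join key: calendar date (UTC) plus asset code.
market["date"] = market["time"].dt.date
news["date"] = news["time"].dt.date

# Collapse the news to one row per (date, asset).
daily_news = news.groupby(["date", "assetCode"], as_index=False).agg(
    news_count=("relevance", "count"),
    mean_relevance=("relevance", "mean"),
)

# Left join keeps every market row; days without news get NaN.
joined = market.merge(daily_news, on=["date", "assetCode"], how="left")
```

A left join is used so that market rows without any news on that day are kept; how to treat the resulting missing values is a separate modeling decision.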

Market data

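The data covers a subset of US-listed financial assets. The set of included assets changes daily and is determined by traded amount and the availability of information, so assets may enter and leave this subset. Gaps in the provided data therefore do not necessarily mean that the data does not exist; those rows may simply have been excluded by the selection criteria. The market data contains returns calculated over different time spans. All returns in this data set have the following properties:
- Returns are calculated either open-to-open (from the open of one trading day to the open of another) or close-to-close (from the close of one trading day to the close of another).
- Returns are either raw (not adjusted against any benchmark) or market-residualized (Mktres), meaning that the movement of the market as a whole has been accounted for, leaving only the movement inherent to the asset.
- Returns can be calculated over any interval. Horizons of 1 day and 10 days are provided.
- Returns are tagged "Prev" if they look backward in time and "Next" if they look forward.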
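
To make the "Prev" and "Next" directions concrete, here is a small sketch of raw 1-day close-to-close returns on made-up prices. The column names `prev_raw_1` and `next_raw_1` are illustrative, and the simple ratio definition is an assumption; the exact conventions behind the competition's columns (and the market-residualization step in particular) are not specified in the text and are not reproduced here.

```python
import pandas as pd

# Toy prices (made up): four consecutive trading days for two assets.
prices = pd.DataFrame({
    "assetCode": ["A"] * 4 + ["B"] * 4,
    "close": [100.0, 110.0, 99.0, 99.0, 50.0, 50.0, 55.0, 60.5],
})

by_asset = prices.groupby("assetCode")["close"]

# Backward-looking ("Prev"): today's close relative to the previous close.
prices["prev_raw_1"] = by_asset.pct_change(1)

# Forward-looking ("Next"): the next close relative to today's close.
prices["next_raw_1"] = by_asset.shift(-1) / prices["close"] - 1
```

Grouping by asset keeps the first row of one asset from being compared with the last row of another. A 10-day horizon follows the same pattern with a lag of 10.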

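The market data contains the following columns:

time(datetime64[ns, UTC]) - the current time (in the market data, all rows are taken at 22:00 UTC)
assetCode(object) - a unique ID of the asset
assetName(category) - the name that corresponds to a group of assetCodes. These may be "Unknown" if the corresponding assetCode has no rows in the news data.
universe(float64) - a boolean indicating whether the asset is included in scoring on that day. This value is not provided outside the training data period. The trading universe on a given date is the set of assets available for trading (the scoring function does not consider assets that are not in the trading universe). The trading universe changes daily.
volume(float64) - trading volume in shares for the day
close(float64) - the close price for the day (not adjusted for splits or dividends)
open(float64) - the open price for the day (not adjusted for splits or dividends)
returnsClosePrevRaw1(float64) - see the explanation of returns above
returnsOpenPrevRaw1(float64) - see the explanation of returns above
returnsClosePrevMktres1(float64) - see the explanation of returns above
returnsOpenPrevMktres1(float64) - see the explanation of returns above
returnsClosePrevRaw10(float64) - see the explanation of returns above
returnsOpenPrevRaw10(float64) - see the explanation of returns above
returnsClosePrevMktres10(float64) - see the explanation of returns above
returnsOpenPrevMktres10(float64) - see the explanation of returns above
returnsOpenNextMktres10(float64) - the market-residualized return over the next 10 days. This is the target variable used in competition scoring. The market data has been filtered so that returnsOpenNextMktres10 is never null.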

News data

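The news data contains information about news articles and the assets they mention.

time(datetime64[ns, UTC]) - UTC timestamp showing when the data became available on the feed
sourceTimestamp(datetime64[ns, UTC]) - UTC timestamp of when this news item was created
firstCreated(datetime64[ns, UTC]) - UTC timestamp for the first version of the item
sourceId(object) - ID of the news item
headline(object) - the item's headline
urgency(int8) - type of the item (1: alert, 3: article)
takeSequence(int16) - the take sequence number of the news item, starting at 1. For a given story, alerts and articles have separate sequences.
provider(category) - identifier of the organization that provided the news item (e.g. RTRS for Reuters News, BSW for Business Wire)
subjects(category) - topic codes and company identifiers related to the news item. Topic codes describe the item's subject matter and can cover asset classes, geographies, events, industries/sectors, and other types.
audiences(category) - identifies which desktop news product(s) the news item belongs to, typically tailored to specific audiences (e.g. "M" for Money International News Service, "FB" for French General News Service)
bodySize(int32) - size of the current version of the story body
companyCount(int8) - number of companies explicitly listed in the news item
headlineTag(object) - the Thomson Reuters headline tag for the news item
marketCommentary(bool) - whether the item discusses general market conditions
sentenceCount(int16) - total number of sentences in the news item
wordCount(int32) - total number of words in the news item
assetCodes(category) - codes of the assets mentioned in the item
assetName(category) - name of the asset
firstMentionSentence(int16) - the first sentence in which the scored asset is mentioned:
- 1: the headline
- 2: the first sentence of the story body
- 3: the second sentence of the story body, and so on
- 0: the scored asset was not found in the item's headline or body
relevance(float32) - a decimal number indicating the relevance of the news item to the asset, ranging from 0 to 1. If the asset is mentioned in the headline, relevance is set to 1. When the item is an alert (urgency == 1), relevance should be gauged by firstMentionSentence instead.
There are many more, fairly similar columns that are omitted here for brevity. Next comes the EDA from one Kernel.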

Solution

Preparing the data

Input: import the packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
stop = set(stopwords.words('english'))


import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import accuracy_score

Input: the official way to get the data

# official way to get the data
from kaggle.competitions import twosigmanews
env = twosigmanews.make_env()
print('Done!')

Output: the environment loads successfully

Loading the data... This could take a minute.
Done!
Done!

Input: there are two parts to the data, market data and news data; we explore each in turn.

(market_train_df, news_train_df) = env.get_training_data()

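Market data

This is a very interesting dataset containing the stock prices of many companies over more than a decade. Looking at the data itself, we can see long-term trends, companies emerging and declining, and much more.
Input: print the dimensions of the market data

print(f'{market_train_df.shape[0]} samples and {market_train_df.shape[1]} features in the training market dataset.')

Output: number of samples and number of features

4072956 samples and 16 features in the training market dataset.

Input: look at the first five rows

market_train_df.head()

Output: a dataframe with the first five rows

Input: randomly pick 10 assets and plot their closing prices over time.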

data = []
for asset in np.random.choice(market_train_df['assetName'].unique(), 10):
    asset_df = market_train_df[(market_train_df['assetName'] == asset)]

    data.append(go.Scatter(
        x = asset_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = asset_df['close'].values,
        name = asset
    ))
layout = go.Layout(dict(title = "Closing prices of 10 random assets",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: closing-price curves over time for the 10 randomly chosen assets

Input: trends of the closing-price quantiles over time

data = []
#market_train_df['close'] = market_train_df['close'] / 20
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['close'].quantile(i).reset_index()

    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['close'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of closing prices by quantiles",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"))
py.iplot(dict(data=data, layout=layout), filename='basic-line')

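Analysis: it is exciting to see how the market falls and then rises again. Around the serious price drops, notice that the higher-quantile prices increase over time while the lower-quantile prices decrease. Maybe the gap between rich and poor is widening...

Input: look at the price drops in more detail. Compute the difference between each day's close and open prices, then average the daily standard deviation of that difference over all days.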

market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby('time').agg({'price_diff': ['std', 'min']}).reset_index()

print(f"Average standard deviation of price change within a day in {grouped['price_diff']['std'].mean():.4f}.")

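Output: the average daily standard deviation is 1.0335

Average standard deviation of price change within a day in 1.0335.

Input: plot the ten dates with the largest standard deviation (the plot title says "months", but the data is grouped by the daily time value)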

g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * g['price_diff']['min']).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode='markers',
    marker=dict(
        size = g['price_diff']['std'].values,
        color = g['price_diff']['std'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = g['min_text'].values
    #text = f"Maximum price drop: {g['price_diff']['min'].values}"
    #g['time'].dt.strftime(date_format='%Y-%m-%d').values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Top 10 months by standard deviation of price change within a day',
    hovermode= 'closest',
    yaxis=dict(
        title= 'price_diff',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

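Output: one point has a very large deviation. A possible explanation is that prices fluctuate wildly when the market crashes, but that does not hold up here: there was no market crash in January 2010. The deviation is more likely caused by outliers, which need to be handled next.

Input: look at the 10 rows with the largest intraday price drops (the lowest price_diff)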

market_train_df.sort_values('price_diff')[:10]

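Output:

Analysis: the price of "Towers Watson & Co" is almost 10000, which is very likely the kind of outlier we are looking for. But what about Bank of New York Mellon Corp? Checking against Yahoo's data shows that its figures agree with the competition data.

Archrock Inc has a price of 999, which also looks suspicious. Comparing with Yahoo's data for Archrock Inc confirms that this is another outlier.

Input: count the rows where the price moved by 20% or more within a day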

market_train_df['close_to_open'] =  np.abs(market_train_df['close'] / market_train_df['open'])
print(f"In {(market_train_df['close_to_open'] >= 1.2).sum()} lines price increased by 20% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.8).sum()} lines price decreased by 20% or more.")

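Output:

In 1211 lines price increased by 20% or more.
In 778 lines price decreased by 20% or more.

Input: keep digging for strange cases and count the rows where the price at least doubled or at least halved within a day

print(f"In {(market_train_df['close_to_open'] >= 2).sum()} lines price increased by 100% or more.")
print(f"In {(market_train_df['close_to_open'] <= 0.5).sum()} lines price decreased by 50% or more.")

Output:

In 38 lines price increased by 100% or more.
In 16 lines price decreased by 50% or more.

Input: let us assume that rows where the price at least doubled or halved within a day are outliers that need replacing. A quick fix is to replace the outlying value in those rows with the company's mean open or close price (the median or mode would also work).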

market_train_df['assetName_mean_open'] = market_train_df.groupby('assetName')['open'].transform('mean')
market_train_df['assetName_mean_close'] = market_train_df.groupby('assetName')['close'].transform('mean')

# Rows where the price at least doubled or at least halved within a day.
outlier_mask = (market_train_df['close_to_open'] >= 2) | (market_train_df['close_to_open'] <= 0.5)

# If the open price is further from this company's mean open price than the close price is
# from its mean close price, replace the open price. Otherwise replace the close price.
# Label-based .at is used so the assignment does not depend on column positions.
for i, row in market_train_df.loc[outlier_mask].iterrows():
    if np.abs(row['assetName_mean_open'] - row['open']) > np.abs(row['assetName_mean_close'] - row['close']):
        market_train_df.at[i, 'open'] = row['assetName_mean_open']
    else:
        market_train_df.at[i, 'close'] = row['assetName_mean_close']
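
The same replacement rule can also be written without a row-by-row loop. The sketch below is an illustration on made-up data, not the kernel's code: the function name `replace_ratio_outliers` and its `low`/`high` parameters are hypothetical, and it works on a copy instead of modifying the frame in place. As in the kernel, the per-company mean still includes the outlier itself, which is one reason the median mentioned above may be the more robust choice.

```python
import pandas as pd


def replace_ratio_outliers(df: pd.DataFrame, low: float = 0.5, high: float = 2.0) -> pd.DataFrame:
    """Replace the more suspicious of open/close with the per-company mean."""
    out = df.copy()
    ratio = (out["close"] / out["open"]).abs()
    mean_open = out.groupby("assetName")["open"].transform("mean")
    mean_close = out.groupby("assetName")["close"].transform("mean")

    # Rows where the price at least doubled or at least halved within a day.
    suspicious = (ratio >= high) | (ratio <= low)
    # Which of the two prices is further from its company mean?
    open_is_worse = (mean_open - out["open"]).abs() > (mean_close - out["close"]).abs()

    # Series assignment through .loc aligns on the index, so only masked rows change.
    out.loc[suspicious & open_is_worse, "open"] = mean_open
    out.loc[suspicious & ~open_is_worse, "close"] = mean_close
    return out


# Toy data (made up): one company with a single corrupted close price.
toy = pd.DataFrame({
    "assetName": ["X"] * 4,
    "open": [10.0, 10.0, 10.0, 10.0],
    "close": [10.0, 10.0, 9999.0, 10.0],
})
cleaned = replace_ratio_outliers(toy)
```

In the toy frame the corrupted close is replaced by 2507.25, the company mean that still contains the 9999 value, which shows why a median would pull the replacement closer to the typical price.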

Input: plot the deviation again

market_train_df['price_diff'] = market_train_df['close'] - market_train_df['open']
grouped = market_train_df.groupby(['time']).agg({'price_diff': ['std', 'min']}).reset_index()
g = grouped.sort_values(('price_diff', 'std'), ascending=False)[:10]
g['min_text'] = 'Maximum price drop: ' + (-1 * np.round(g['price_diff']['min'], 2)).astype(str)
trace = go.Scatter(
    x = g['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = g['price_diff']['std'].values,
    mode='markers',
    marker=dict(
        size = g['price_diff']['std'].values * 5,
        color = g['price_diff']['std'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = g['min_text'].values
    #text = f"Maximum price drop: {g['price_diff']['min'].values}"
    #g['time'].dt.strftime(date_format='%Y-%m-%d').values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Top 10 months by standard deviation of price change within a day',
    hovermode= 'closest',
    yaxis=dict(
        title= 'price_diff',
        ticklen= 5,
        gridwidth= 2,
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

Output: this looks normal now

Input: look at the target variable

data = []
for i in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].quantile(i).reset_index()

    data.append(go.Scatter(
        x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = price_df['returnsOpenNextMktres10'].values,
        name = f'{i} quantile'
    ))
layout = go.Layout(dict(title = "Trends of returnsOpenNextMktres10 by quantiles",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Analysis: the quantiles deviate strongly, but the mean changes little.

Input: keep only the data from 2010 onward and look at the mean of the target variable.

data = []
market_train_df = market_train_df.loc[market_train_df['time'] >= '2010-01-01 22:00:00+0000']

price_df = market_train_df.groupby('time')['returnsOpenNextMktres10'].mean().reset_index()

data.append(go.Scatter(
    x = price_df['time'].dt.strftime(date_format='%Y-%m-%d').values,
    y = price_df['returnsOpenNextMktres10'].values,
    name = 'mean'
))
layout = go.Layout(dict(title = "Trend of returnsOpenNextMktres10 mean",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the fluctuations look large, but they all stay below 8% and resemble random noise...

Input: look at the variables prefixed with returns

data = []
for col in ['returnsClosePrevRaw1', 'returnsOpenPrevRaw1',
       'returnsClosePrevMktres1', 'returnsOpenPrevMktres1',
       'returnsClosePrevRaw10', 'returnsOpenPrevRaw10',
       'returnsClosePrevMktres10', 'returnsOpenPrevMktres10',
       'returnsOpenNextMktres10']:
    df = market_train_df.groupby('time')[col].mean().reset_index()
    data.append(go.Scatter(
        x = df['time'].dt.strftime(date_format='%Y-%m-%d').values,
        y = df[col].values,
        name = col
    ))
    
layout = go.Layout(dict(title = "Trend of mean values",
                  xaxis = dict(title = 'Month'),
                  yaxis = dict(title = 'Price (USD)'),
                  ),legend=dict(
                orientation="h"),)
py.iplot(dict(data=data, layout=layout), filename='basic-line')

Output: the returns over the previous 10 days appear to fluctuate the most.

News data

news_train_df.head()
print(f'{news_train_df.shape[0]} samples and {news_train_df.shape[1]} features in the training news dataset.')

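Output:

9328827 samples and 35 features in the training news dataset.

Input: the file is too large to process all the text directly, so start with a word cloud built from the last 1,000,000 headlines (the slice used in the code).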

text = ' '.join(news_train_df['headline'].str.lower().values[-1000000:])
wordcloud = WordCloud(max_font_size=None, stopwords=stop, background_color='white',
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud)
plt.title('Top words in headline')
plt.axis("off")
plt.show()

Output:

Input: the urgency column

# Let's also limit the time period
news_train_df = news_train_df.loc[news_train_df['time'] >= '2010-01-01 22:00:00+0000']
(news_train_df['urgency'].value_counts() / 1000000).plot.bar();
plt.xticks(rotation=30);
plt.title('Urgency counts (mln)');

Output: there are almost no rows with urgency 2.

Input: words per sentence

news_train_df['sentence_word_count'] =  news_train_df['wordCount'] / news_train_df['sentenceCount']
plt.boxplot(news_train_df['sentence_word_count'][news_train_df['sentence_word_count'] < 40]);

Output: no obvious outliers; most sentences have 15-25 words.

Input: the most frequent news providers

news_train_df['provider'].value_counts().head(10)

Output: Reuters is by far the largest provider.

RTRS    5517624
PRN      503267
BSW      472612
GNW      145309
MKW      129621
LSE       64250
HIIS      56489
RNS       39833
CNW       30779
ONE       25233
Name: provider, dtype: int64

Input: headline tag types

(news_train_df['headlineTag'].value_counts() / 1000)[:10].plot.barh();
plt.title('headlineTag counts (thousands)');

Output: most headline tags are missing

Input: sentiment analysis

for i, j in zip([-1, 0, 1], ['negative', 'neutral', 'positive']):
    df_sentiment = news_train_df.loc[news_train_df['sentimentClass'] == i, 'assetName']
    print(f'Top mentioned companies for {j} sentiment are:')
    print(df_sentiment.value_counts().head(5))
    print('')

Output: Apple ranks first for both positive and negative sentiment.

Top mentioned companies for negative sentiment are:
Apple Inc                  22518
JPMorgan Chase & Co        20647
BP PLC                     19328
Goldman Sachs Group Inc    17955
Bank of America Corp       17704
Name: assetName, dtype: int64

Top mentioned companies for neutral sentiment are:
HSBC Holdings PLC    19462
Credit Suisse AG     14632
Deutsche Bank AG     12959
Barclays PLC         12414
Apple Inc            10994
Name: assetName, dtype: int64

Top mentioned companies for positive sentiment are:
Apple Inc                19020
Barclays PLC             18051
Royal Dutch Shell PLC    15484
General Electric Co      14163
Boeing Co                14080
Name: assetName, dtype: int64

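Modeling

Input: add some features that may help the model train better,
for example the daily price change (the ratio of close to open price),
and normalize the features (to reduce the influence of absolute magnitudes on the result).

#%%time
# code mostly taken from this kernel: https://www.kaggle.com/ashishpatel26/bird-eye-view-of-two-sigma-xgb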

def data_prep(market_df,news_df):
    market_df['time'] = market_df.time.dt.date
    market_df['returnsOpenPrevRaw1_to_volume'] = market_df['returnsOpenPrevRaw1'] / market_df['volume']
    market_df['close_to_open'] = market_df['close'] / market_df['open']
    market_df['volume_to_mean'] = market_df['volume'] / market_df['volume'].mean()
    news_df['sentence_word_count'] =  news_df['wordCount'] / news_df['sentenceCount']
    news_df['time'] = news_df.time.dt.hour
    news_df['sourceTimestamp']= news_df.sourceTimestamp.dt.hour
    news_df['firstCreated'] = news_df.firstCreated.dt.date
    # assetCodes holds a string that evaluates to a collection of codes:
    # count the codes first, then keep a single code per news row for the join.
    news_df['assetCodesLen'] = news_df['assetCodes'].map(lambda x: len(eval(x)))
    news_df['assetCodes'] = news_df['assetCodes'].map(lambda x: list(eval(x))[0])
    news_df['headlineLen'] = news_df['headline'].apply(lambda x: len(x))
    news_df['asset_sentiment_count'] = news_df.groupby(['assetName', 'sentimentClass'])['time'].transform('count')
    news_df['asset_sentence_mean'] = news_df.groupby(['assetName', 'sentenceCount'])['time'].transform('mean')
    lbl = {k: v for v, k in enumerate(news_df['headlineTag'].unique())}
    news_df['headlineTagT'] = news_df['headlineTag'].map(lbl)
    kcol = ['firstCreated', 'assetCodes']
    news_df = news_df.groupby(kcol, as_index=False).mean()

    market_df = pd.merge(market_df, news_df, how='left', left_on=['time', 'assetCode'], 
                            right_on=['firstCreated', 'assetCodes'])

    lbl = {k: v for v, k in enumerate(market_df['assetCode'].unique())}
    market_df['assetCodeT'] = market_df['assetCode'].map(lbl)
    
    market_df = market_df.dropna(axis=0)
    
    return market_df

market_train_df.drop(['price_diff', 'assetName_mean_open', 'assetName_mean_close'], axis=1, inplace=True)
market_train = data_prep(market_train_df, news_train_df)
print(market_train.shape)
up = market_train.returnsOpenNextMktres10 >= 0

fcol = [c for c in market_train.columns if c not in ['assetCode', 'assetCodes', 'assetCodesLen', 'assetName', 'assetCodeT',
                                             'firstCreated', 'headline', 'headlineTag', 'marketCommentary', 'provider',
                                             'returnsOpenNextMktres10', 'sourceId', 'subjects', 'time', 'time_x', 'universe','sourceTimestamp']]

X = market_train[fcol].values
up = up.values
r = market_train.returnsOpenNextMktres10.values

# Scaling of X values
mins = np.min(X, axis=0)
maxs = np.max(X, axis=0)
rng = maxs - mins
X = 1 - ((maxs - X) / rng)

Output:

(611261, 54)
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:

invalid value encountered in subtract

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:51: RuntimeWarning:

invalid value encountered in true_divide
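
The RuntimeWarnings above most likely come from non-finite values or constant columns entering the min-max formula (for example a ratio feature divided by a zero volume); that diagnosis is an assumption, since the kernel does not investigate it. The sketch below shows one hedged way to guard the scaling. The helper names `fit_min_max` and `apply_min_max` are hypothetical, and turning non-finite entries into NaN is a design choice made here, not something the kernel does.

```python
import numpy as np


def fit_min_max(X):
    """Per-column minimum and range, ignoring non-finite entries."""
    X = np.asarray(X, dtype=float)
    finite = np.where(np.isfinite(X), X, np.nan)
    mins = np.nanmin(finite, axis=0)
    rng = np.nanmax(finite, axis=0) - mins
    rng[rng == 0] = 1.0  # constant column: avoid dividing 0 by 0
    return mins, rng


def apply_min_max(X, mins, rng):
    """Scale with previously fitted statistics; non-finite entries become NaN."""
    X = np.asarray(X, dtype=float)
    X = np.where(np.isfinite(X), X, np.nan)
    return (X - mins) / rng


# Toy matrix (made up): a normal column, a constant column, a column with an inf.
X_toy = np.array([
    [1.0, 5.0, np.inf],
    [3.0, 5.0, 2.0],
    [2.0, 5.0, 4.0],
])
mins_toy, rng_toy = fit_min_max(X_toy)
scaled = apply_min_max(X_toy, mins_toy, rng_toy)
```

Algebraically, `(X - mins) / rng` equals the kernel's `1 - ((maxs - X) / rng)`. Fitting on the training data and reusing the same statistics at prediction time mirrors how the kernel reuses `maxs` and `rng` in its submission loop.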

Input: train LightGBM (a later post can cover the parameters of the various boosting algorithms in detail)

X_train, X_test, up_train, up_test, r_train, r_test = model_selection.train_test_split(X, up, r, test_size=0.1, random_state=99)

# xgb_up = XGBClassifier(n_jobs=4,
#                        n_estimators=300,
#                        max_depth=3,
#                        eta=0.15,
#                        random_state=42)
params = {'learning_rate': 0.01, 'max_depth': 12, 'boosting': 'gbdt', 'objective': 'binary', 'metric': 'auc', 'is_training_metric': True, 'seed': 42}
model = lgb.train(params, train_set=lgb.Dataset(X_train, label=up_train), num_boost_round=2000,
                  valid_sets=[lgb.Dataset(X_train, label=up_train), lgb.Dataset(X_test, label=up_test)],
                  verbose_eval=100, early_stopping_rounds=100)

Training log: watch the AUC

Training until validation scores don't improve for 100 rounds.
[100]   valid_0's auc: 0.570258 valid_1's auc: 0.566332
[200]   valid_0's auc: 0.573703 valid_1's auc: 0.567868
[300]   valid_0's auc: 0.577024 valid_1's auc: 0.568927
[400]   valid_0's auc: 0.580109 valid_1's auc: 0.569985
[500]   valid_0's auc: 0.582933 valid_1's auc: 0.570694
[600]   valid_0's auc: 0.585372 valid_1's auc: 0.571191
[700]   valid_0's auc: 0.58784  valid_1's auc: 0.571578
[800]   valid_0's auc: 0.590147 valid_1's auc: 0.571726
[900]   valid_0's auc: 0.592448 valid_1's auc: 0.571908
[1000]  valid_0's auc: 0.594658 valid_1's auc: 0.57203
[1100]  valid_0's auc: 0.596887 valid_1's auc: 0.572259
[1200]  valid_0's auc: 0.598918 valid_1's auc: 0.572422
[1300]  valid_0's auc: 0.601052 valid_1's auc: 0.572563
[1400]  valid_0's auc: 0.603196 valid_1's auc: 0.57269
[1500]  valid_0's auc: 0.605227 valid_1's auc: 0.572756
[1600]  valid_0's auc: 0.60723  valid_1's auc: 0.572837
[1700]  valid_0's auc: 0.609211 valid_1's auc: 0.572897
[1800]  valid_0's auc: 0.611181 valid_1's auc: 0.573038
[1900]  valid_0's auc: 0.613095 valid_1's auc: 0.573162
[2000]  valid_0's auc: 0.615015 valid_1's auc: 0.573307
Did not meet early stopping. Best iteration is:
[2000]  valid_0's auc: 0.615015 valid_1's auc: 0.573307

Input: look at feature importance

def generate_color():
    color = '#{:02x}{:02x}{:02x}'.format(*map(lambda x: np.random.randint(0, 255), range(3)))
    return color

df = pd.DataFrame({'imp': model.feature_importance(), 'col':fcol})
df = df.sort_values(['imp','col'], ascending=[True, False])
data = [df]
for dd in data:  
    colors = []
    for i in range(len(dd)):
         colors.append(generate_color())

    data = [
        go.Bar(
        orientation = 'h',
        x=dd.imp,
        y=dd.col,
        name='Features',
        textfont=dict(size=20),
            marker=dict(
            color= colors,
            line=dict(
                color='#000000',
                width=0.5
            ),
            opacity = 0.87
        )
    )
    ]
    layout= go.Layout(
        title= 'Feature Importance of LGB',
        xaxis= dict(title='Importance', ticklen=5, zeroline=False, gridwidth=2),
        yaxis=dict(title='Feature', ticklen=5, gridwidth=2),
        showlegend=True
    )

    py.iplot(dict(data=data,layout=layout), filename='horizontal-bar')

Input: submission

days = env.get_prediction_days()
import time

n_days = 0
prep_time = 0
prediction_time = 0
packaging_time = 0
for (market_obs_df, news_obs_df, predictions_template_df) in days:
    n_days +=1
    if n_days % 50 == 0:
        print(n_days,end=' ')
    
    t = time.time()
    market_obs_df = data_prep(market_obs_df, news_obs_df)
    market_obs_df = market_obs_df[market_obs_df.assetCode.isin(predictions_template_df.assetCode)]
    X_live = market_obs_df[fcol].values
    X_live = 1 - ((maxs - X_live) / rng)
    prep_time += time.time() - t
    
    t = time.time()
    lp = model.predict(X_live)
    prediction_time += time.time() -t
    
    t = time.time()
    confidence = 2 * lp -1
    preds = pd.DataFrame({'assetCode':market_obs_df['assetCode'],'confidence':confidence})
    predictions_template_df = predictions_template_df.merge(preds,how='left').drop('confidenceValue',axis=1).fillna(0).rename(columns={'confidence':'confidenceValue'})
    env.predict(predictions_template_df)
    packaging_time += time.time() - t
    
env.write_submission_file()

Output:

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:

invalid value encountered in true_divide

50 100 150 
/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning:

invalid value encountered in subtract

200 250 300 350 400 450 500 550 600 Your submission file has been saved. Once you `Commit` your Kernel and it finishes running, you can submit the file to the competition from the Kernel Viewer `Output` tab.
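
The key step in the submission loop is turning the classifier's probability of a positive return into a signed confidence with `2 * p - 1`, and giving assets without a prediction a confidence of 0. A minimal sketch on made-up data follows; the helper name `to_confidence` and the toy frames are illustrative, and the clipping step is an extra safeguard added here, not part of the kernel.

```python
import numpy as np
import pandas as pd


def to_confidence(p_up):
    """Map a probability of a positive return in [0, 1] to a confidence in [-1, 1]."""
    p_up = np.clip(np.asarray(p_up, dtype=float), 0.0, 1.0)
    return 2.0 * p_up - 1.0


# Toy prediction template (made up): three assets expected by the environment.
template = pd.DataFrame({
    "assetCode": ["A", "B", "C"],
    "confidenceValue": [0.0, 0.0, 0.0],
})
# The model only produced probabilities for A and C.
preds = pd.DataFrame({
    "assetCode": ["A", "C"],
    "confidenceValue": to_confidence([0.9, 0.25]),
})

# Left join on the template; assets without a prediction get a neutral 0.
filled = (
    template[["assetCode"]]
    .merge(preds, on="assetCode", how="left")
    .fillna({"confidenceValue": 0.0})
)
```

A probability of 0.5 maps to a confidence of 0, so an undecided model contributes nothing to the daily \(x_t\).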

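Summary

This kernel spends most of its time on EDA: it handles outliers carefully and analyzes the data thoroughly, and high-quality data does a great deal to improve the final result. The modeling part simply calls LightGBM (perhaps as a trade-off between efficiency and effectiveness). A future post may discuss how the parameters of the various Boosting implementations follow from the theory and how they ultimately affect model performance.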


Original: https://www.cnblogs.com/AIKaggle/p/11556108.html
