Time to relax: let's look at how to tell which country a dish belongs to from its recipe. This is an introductory post, showing how to analyze high-dimensional sparse vectors.
Imagine you are wandering through a food market: could you predict which country's cuisine you are looking at just from the dishes? In California, you might stroll past a grocer's stall stocked with deep-purple kale and pink beets. In Korea, fiery kimchi is a local favorite, and since the country is coastal, Korean dramas often show the leads eating marinated crab or live octopus. Indian markets are the most colorful of all, filled with the aroma of spices: turmeric, star anise, poppy seeds, curry, and more.
Local cuisine is closely tied to a region's geography and history. This Kaggle competition is more playful than most: Yummly invites you to play gourmet and judge a dish's national origin from its recipe.

Each recipe consists of an id, the cuisine label (the target), and the list of ingredients. The data is stored in JSON format. An example recipe from train.json:

{
    "id": 24717,
    "cuisine": "indian",
    "ingredients": [
        "tumeric",
        "vegetable stock",
        "tomatoes",
        "garam masala",
        "naan",
        "red lentils",
        "red chili peppers",
        "onions",
        "spinach",
        "sweet potatoes"
    ]
},
In test.json, recipes follow the same format as in train.json, except that the cuisine field is absent, since it is the target variable to be predicted. The files are:

train.json: the training set, containing recipe id, cuisine, and the ingredients list
test.json: the test set, containing recipe id and the ingredients list
sample_submission.csv: a sample submission file in the correct format, for example:

id,cuisine
35203,italian
17600,italian
35200,italian
17602,italian
...
etc.

Looking at the data, the only feature is ingredients, and ingredients is a list of foodstuffs, so the usual approach is to one-hot encode its elements. This produces a high-dimensional sparse feature matrix; to reduce the dimensionality and make the feature matrix denser, an embedding step is usually applied. How will that be done here? Read on.
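As a quick illustration of the sparsity claim, here is a minimal sketch (mine, not from the original post) that one-hot encodes a few toy recipes with scikit-learn's MultiLabelBinarizer; on the real training set the resulting matrix has thousands of columns, almost all zero in any given row:

# Minimal sketch: one-hot encoding ingredient lists yields a wide, sparse matrix
from sklearn.preprocessing import MultiLabelBinarizer

recipes = [
    ["tumeric", "red lentils", "onions", "spinach"],
    ["tomatoes", "basil", "olive oil", "onions"],
    ["kimchi", "rice", "sesame oil"],
]
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(recipes)   # shape: (n_recipes, n_distinct_ingredients)
print(X.shape)                   # (3, 10) -- one column per distinct ingredient
print(1 - X.mean())              # fraction of zeros; it grows with the vocabulary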
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
['sample_submission.csv', 'test.json', 'train.json']
# Load the competition data; train_df is used throughout this notebook
train_df = pd.read_json('../input/train.json')
test_df = pd.read_json('../input/test.json')
print('Maximum Number of Ingredients in a Dish: ', train_df['ingredients'].str.len().max())
print('Minimum Number of Ingredients in a Dish: ', train_df['ingredients'].str.len().min())
Maximum Number of Ingredients in a Dish: 65
Minimum Number of Ingredients in a Dish: 1
Histogram of the distribution of the number of ingredients per dish:
plt.hist(train_df['ingredients'].str.len(), bins=max(train_df['ingredients'].str.len()), edgecolor='b')
plt.gcf().set_size_inches(16,8)
plt.title('Ingredients in a Dish Distribution')

sns.countplot(y='cuisine', data=train_df,palette=sns.color_palette('inferno',15))
plt.gcf().set_size_inches(15,10)
plt.title('Cuisine Distribution',size=20)

Count the overall frequency of each ingredient:
from sklearn.feature_extraction.text import CountVectorizer
# Join each ingredient list into a single comma-separated string;
# this 'seperated_ingredients' column is reused later for the n-gram analysis
train_df['seperated_ingredients'] = train_df['ingredients'].apply(','.join)
vec = CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)
counts = vec.fit_transform(train_df['seperated_ingredients'])
count=dict(zip(vec.get_feature_names(), counts.sum(axis=0).tolist()[0]))
count=pd.DataFrame(list(count.items()),columns=['Ingredient','Count'])
count.set_index('Ingredient').sort_values('Count',ascending=False)[:15].plot.barh(width=0.9)
plt.gcf().set_size_inches(10,10)
plt.gca().invert_yaxis()
plt.title('Top 15 Ingredients')

One-hot encode the ingredients list:
ingreList = []
for index, row in train_df.iterrows():
    ingre = row['ingredients']
    for i in ingre:
        if i not in ingreList:
            ingreList.append(i)

def binary(ingre_list):
    # 1 if the recipe contains the ingredient, 0 otherwise, for every known ingredient
    binaryList = []
    for item in ingreList:
        if item in ingre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList

train_df['bin ingredients'] = train_df['ingredients'].apply(lambda x: binary(x))
Similarity is measured with the cosine distance, whose value lies between 0 and 1. Intuitively it is based on the cosine of the angle between two vectors: the smaller the angle, the more similar we consider the vectors to be. There are other ways to express similarity as well: Euclidean distance (the L2 norm), Manhattan distance (the L1 norm), Minkowski distance (the Lp norm: p=1 gives Manhattan distance, p=2 Euclidean distance, and p=∞ Chebyshev distance), Jaccard similarity (intersection over union), and the Pearson correlation coefficient.

from scipy import spatial
def Similarity(Id1, Id2):
    # cosine distance between the one-hot ingredient vectors of two recipes
    a = train_df.iloc[Id1]
    b = train_df.iloc[Id2]
    A = a['bin ingredients']
    B = b['bin ingredients']
    distance = spatial.distance.cosine(A, B)
    return distance, Id2

food = []
for i in train_df.index:
    food.append(Similarity(1, i))
# the nine recipes closest to recipe 1 (position 0 is recipe 1 itself)
common_ingredients = sorted(food, key=lambda x: x[0])[1:10]
indexes = []
for i in range(len(common_ingredients)):
    indexes.append(common_ingredients[i][1])
train_df.iloc[indexes]
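To make the list of metrics above concrete, here is a small sketch (mine, not from the original post) that evaluates several of them on two toy binary ingredient vectors; all of these functions exist in scipy.spatial.distance:

from scipy.spatial import distance

# Two toy one-hot ingredient vectors (1 = recipe contains that ingredient)
A = [1, 1, 0, 1, 0]
B = [1, 0, 0, 1, 1]

print(distance.cosine(A, B))     # cosine distance = 1 - cosine similarity
print(distance.euclidean(A, B))  # L2 norm of A - B
print(distance.cityblock(A, B))  # Manhattan distance (L1 norm)
print(distance.chebyshev(A, B))  # L-infinity, i.e. Minkowski with p = infinity
print(distance.jaccard(A, B))    # 1 - intersection-over-union for binary vectors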
nltk is a natural language toolkit for English text; here we use it to find the most common ingredient bigrams per cuisine.
import nltk
from collections import Counter
train_df['for ngrams']=train_df['seperated_ingredients'].str.replace(',',' ')
f,ax=plt.subplots(2,2,figsize=(20,20))
def ingre_cusine(cuisine):
    # top 15 word bigrams in the ingredient text of a given cuisine
    # (nltk.word_tokenize may require nltk.download('punkt') outside Kaggle)
    frame = train_df[train_df['cuisine'] == cuisine]
    common = list(nltk.bigrams(nltk.word_tokenize(" ".join(frame['for ngrams']))))
    return pd.DataFrame(Counter(common), index=['count']).T.sort_values('count', ascending=False)[:15]
ingre_cusine('mexican').plot.barh(ax=ax[0,0],width=0.9,color='#45ff45')
ax[0,0].set_title('Mexican Cuisine')
ingre_cusine('indian').plot.barh(ax=ax[0,1],width=0.9,color='#df6dfd')
ax[0,1].set_title('Indian Cuisine')
ingre_cusine('italian').plot.barh(ax=ax[1,0],width=0.9,color='#fbca5f')
ax[1,0].set_title('Italian Cuisine')
ingre_cusine('chinese').plot.barh(ax=ax[1,1],width=0.9,color='#ffff00')
ax[1,1].set_title('Chinese Cuisine')
plt.subplots_adjust(wspace=0.5)
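The next two cells call a net_diagram helper whose definition did not survive into this excerpt; in the original kernel it draws a network linking two cuisines through their ingredients. A minimal sketch of what such a helper might look like (the body below is my assumption, built with networkx, not the original kernel's code):

import networkx as nx

def net_diagram(cuisine1, cuisine2, top_n=10):
    # Hypothetical reconstruction: connect each cuisine to its top-N most
    # frequent ingredients; shared ingredients get edges to both cuisines
    G = nx.Graph()
    for cuisine in (cuisine1, cuisine2):
        frame = train_df[train_df['cuisine'] == cuisine]
        top = Counter(i for row in frame['ingredients'] for i in row).most_common(top_n)
        for ingredient, _ in top:
            G.add_edge(cuisine, ingredient)
    plt.figure(figsize=(10, 8))
    nx.draw_networkx(G, node_color='#ffcc66', font_size=10)
    plt.axis('off')
    plt.show()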

net_diagram('french','mexican')

net_diagram('thai','chinese')

That wraps up the exploratory analysis. The next post will cover how to handle these high-dimensional sparse vectors and build a simple NN model with Keras. Stay tuned!
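As a small preview (my sketch under stated assumptions, not the next post's actual model), a minimal Keras network over the one-hot matrix built above might look like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Assumes the 'bin ingredients' column from the one-hot step above
X = np.array(train_df['bin ingredients'].tolist(), dtype='float32')
y = pd.get_dummies(train_df['cuisine']).values  # one-hot encoded cuisine labels

model = Sequential([
    Dense(256, activation='relu', input_shape=(X.shape[1],)),
    Dropout(0.3),
    Dense(y.shape[1], activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.1)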

Original post: https://www.cnblogs.com/AIKaggle/p/11600942.html