Time to relax: let's look at how to tell which country a dish belongs to from its recipe. This is an introductory post, showing how to analyze high-dimensional sparse vectors.
Imagine you are wandering through a food market: could you predict which country's cuisine you are looking at just from the dishes? In California, you might stroll past a grocer's stall stocked with deep-purple kale and pink beets. In Korea, fiery kimchi is a local favorite, and since the country is coastal, Korean dramas often show the leads eating marinated crab or live octopus. Indian markets are the most colorful of all, filled with the aroma of spices: turmeric, star anise, poppy seeds, curry, and more.
Local cuisine is closely tied to a region's geography and history. This Kaggle competition is more playful than most: Yummly invites you to play gourmet and judge a dish's national origin from its recipe.

Each recipe consists of an id, the cuisine label (the target), and the list of ingredients. The data is stored in JSON format. An example recipe from train.json:

{
    "id": 24717,
    "cuisine": "indian",
    "ingredients": [
        "tumeric",
        "vegetable stock",
        "tomatoes",
        "garam masala",
        "naan",
        "red lentils",
        "red chili peppers",
        "onions",
        "spinach",
        "sweet potatoes"
    ]
},
In test.json, recipes follow the same format as in train.json, except that the cuisine field is absent, since it is the target variable to be predicted. The files are:

train.json: the training set, containing recipe id, cuisine, and the ingredients list
test.json: the test set, containing recipe id and the ingredients list
sample_submission.csv: a sample submission file in the correct format, for example:

id,cuisine
35203,italian
17600,italian
35200,italian
17602,italian
...
etc.

Looking at the data, the only feature is ingredients, and ingredients is a list of foodstuffs, so the usual approach is to one-hot encode its elements. This produces a high-dimensional sparse feature matrix; to reduce the dimensionality and make the feature matrix denser, an embedding step is usually applied. How will that be done here? Read on.
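As a quick illustration of the sparsity claim, here is a minimal sketch (mine, not from the original post) that one-hot encodes a few toy recipes with scikit-learn's MultiLabelBinarizer; on the real training set the resulting matrix has thousands of columns, almost all zero in any given row:

# Minimal sketch: one-hot encoding ingredient lists yields a wide, sparse matrix
from sklearn.preprocessing import MultiLabelBinarizer

recipes = [
    ["tumeric", "red lentils", "onions", "spinach"],
    ["tomatoes", "basil", "olive oil", "onions"],
    ["kimchi", "rice", "sesame oil"],
]
mlb = MultiLabelBinarizer()
X = mlb.fit_transform(recipes)   # shape: (n_recipes, n_distinct_ingredients)
print(X.shape)                   # (3, 10) -- one column per distinct ingredient
print(1 - X.mean())              # fraction of zeros; it grows with the vocabulary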
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import os
print(os.listdir("../input"))
# Any results you write to the current directory are saved as output.
['sample_submission.csv', 'test.json', 'train.json']
# Load the competition data; train_df is used throughout this notebook
train_df = pd.read_json('../input/train.json')
test_df = pd.read_json('../input/test.json')
print('Maximum Number of Ingredients in a Dish: ', train_df['ingredients'].str.len().max())
print('Minimum Number of Ingredients in a Dish: ', train_df['ingredients'].str.len().min())
Maximum Number of Ingredients in a Dish: 65
Minimum Number of Ingredients in a Dish: 1
Histogram of the distribution of the number of ingredients per dish:
plt.hist(train_df['ingredients'].str.len(), bins=max(train_df['ingredients'].str.len()), edgecolor='b')
plt.gcf().set_size_inches(16,8)
plt.title('Ingredients in a Dish Distribution')

sns.countplot(y='cuisine', data=train_df,palette=sns.color_palette('inferno',15))
plt.gcf().set_size_inches(15,10)
plt.title('Cuisine Distribution',size=20)

Count the overall frequency of each ingredient:
from sklearn.feature_extraction.text import CountVectorizer
# Join each ingredient list into a single comma-separated string;
# this 'seperated_ingredients' column is reused later for the n-gram analysis
train_df['seperated_ingredients'] = train_df['ingredients'].apply(','.join)
vec = CountVectorizer(tokenizer=lambda x: [i.strip() for i in x.split(',')], lowercase=False)
counts = vec.fit_transform(train_df['seperated_ingredients'])
count=dict(zip(vec.get_feature_names(), counts.sum(axis=0).tolist()[0]))
count=pd.DataFrame(list(count.items()),columns=['Ingredient','Count'])
count.set_index('Ingredient').sort_values('Count',ascending=False)[:15].plot.barh(width=0.9)
plt.gcf().set_size_inches(10,10)
plt.gca().invert_yaxis()
plt.title('Top 15 Ingredients')

One-hot encode the ingredients list:
ingreList = []
for index, row in train_df.iterrows():
    ingre = row['ingredients']
    for i in ingre:
        if i not in ingreList:
            ingreList.append(i)

def binary(ingre_list):
    # 1 if the recipe contains the ingredient, 0 otherwise, for every known ingredient
    binaryList = []
    for item in ingreList:
        if item in ingre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList

train_df['bin ingredients'] = train_df['ingredients'].apply(lambda x: binary(x))
Similarity is measured with the cosine distance, whose value lies between 0 and 1. Intuitively it is based on the cosine of the angle between two vectors: the smaller the angle, the more similar we consider the vectors to be. There are other ways to express similarity as well: Euclidean distance (the L2 norm), Manhattan distance (the L1 norm), Minkowski distance (the Lp norm: p=1 gives Manhattan distance, p=2 Euclidean distance, and p=∞ Chebyshev distance), Jaccard similarity (intersection over union), and the Pearson correlation coefficient.

from scipy import spatial
def Similarity(Id1, Id2):
    # cosine distance between the one-hot ingredient vectors of two recipes
    a = train_df.iloc[Id1]
    b = train_df.iloc[Id2]
    A = a['bin ingredients']
    B = b['bin ingredients']
    distance = spatial.distance.cosine(A, B)
    return distance, Id2

food = []
for i in train_df.index:
    food.append(Similarity(1, i))
# the nine recipes closest to recipe 1 (position 0 is recipe 1 itself)
common_ingredients = sorted(food, key=lambda x: x[0])[1:10]
indexes = []
for i in range(len(common_ingredients)):
    indexes.append(common_ingredients[i][1])
train_df.iloc[indexes]
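To make the list of metrics above concrete, here is a small sketch (mine, not from the original post) that evaluates several of them on two toy binary ingredient vectors; all of these functions exist in scipy.spatial.distance:

from scipy.spatial import distance

# Two toy one-hot ingredient vectors (1 = recipe contains that ingredient)
A = [1, 1, 0, 1, 0]
B = [1, 0, 0, 1, 1]

print(distance.cosine(A, B))     # cosine distance = 1 - cosine similarity
print(distance.euclidean(A, B))  # L2 norm of A - B
print(distance.cityblock(A, B))  # Manhattan distance (L1 norm)
print(distance.chebyshev(A, B))  # L-infinity, i.e. Minkowski with p = infinity
print(distance.jaccard(A, B))    # 1 - intersection-over-union for binary vectors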
nltk is a natural language toolkit for English text; here we use it to find the most common ingredient bigrams per cuisine.
import nltk
from collections import Counter
train_df['for ngrams']=train_df['seperated_ingredients'].str.replace(',',' ')
f,ax=plt.subplots(2,2,figsize=(20,20))
def ingre_cusine(cuisine):
    # top 15 word bigrams in the ingredient text of a given cuisine
    # (nltk.word_tokenize may require nltk.download('punkt') outside Kaggle)
    frame = train_df[train_df['cuisine'] == cuisine]
    common = list(nltk.bigrams(nltk.word_tokenize(" ".join(frame['for ngrams']))))
    return pd.DataFrame(Counter(common), index=['count']).T.sort_values('count', ascending=False)[:15]
ingre_cusine('mexican').plot.barh(ax=ax[0,0],width=0.9,color='#45ff45')
ax[0,0].set_title('Mexican Cuisine')
ingre_cusine('indian').plot.barh(ax=ax[0,1],width=0.9,color='#df6dfd')
ax[0,1].set_title('Indian Cuisine')
ingre_cusine('italian').plot.barh(ax=ax[1,0],width=0.9,color='#fbca5f')
ax[1,0].set_title('Italian Cuisine')
ingre_cusine('chinese').plot.barh(ax=ax[1,1],width=0.9,color='#ffff00')
ax[1,1].set_title('Chinese Cuisine')
plt.subplots_adjust(wspace=0.5)
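The next two cells call a net_diagram helper whose definition did not survive into this excerpt; in the original kernel it draws a network linking two cuisines through their ingredients. A minimal sketch of what such a helper might look like (the body below is my assumption, built with networkx, not the original kernel's code):

import networkx as nx

def net_diagram(cuisine1, cuisine2, top_n=10):
    # Hypothetical reconstruction: connect each cuisine to its top-N most
    # frequent ingredients; shared ingredients get edges to both cuisines
    G = nx.Graph()
    for cuisine in (cuisine1, cuisine2):
        frame = train_df[train_df['cuisine'] == cuisine]
        top = Counter(i for row in frame['ingredients'] for i in row).most_common(top_n)
        for ingredient, _ in top:
            G.add_edge(cuisine, ingredient)
    plt.figure(figsize=(10, 8))
    nx.draw_networkx(G, node_color='#ffcc66', font_size=10)
    plt.axis('off')
    plt.show()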

net_diagram('french','mexican')

net_diagram('thai','chinese')

That wraps up the exploratory analysis. The next post will cover how to handle these high-dimensional sparse vectors and build a simple NN model with Keras. Stay tuned!
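As a small preview (my sketch under stated assumptions, not the next post's actual model), a minimal Keras network over the one-hot matrix built above might look like this:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

# Assumes the 'bin ingredients' column from the one-hot step above
X = np.array(train_df['bin ingredients'].tolist(), dtype='float32')
y = pd.get_dummies(train_df['cuisine']).values  # one-hot encoded cuisine labels

model = Sequential([
    Dense(256, activation='relu', input_shape=(X.shape[1],)),
    Dropout(0.3),
    Dense(y.shape[1], activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.1)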

Original post: https://www.cnblogs.com/AIKaggle/p/11600942.html