In sklearn, k-nearest-neighbor classification is handled by the sklearn.neighbors.KNeighborsClassifier class.
# Generate labeled data
from sklearn.datasets import make_blobs
# Generate the samples
centers = [[-2, 2], [2, 2], [0, 4]]
X, y = make_blobs(n_samples=60, centers=centers,
                  random_state=0, cluster_std=0.6)
Notes: n_samples is the number of training samples, centers specifies the cluster center positions, and cluster_std controls how spread out the generated points are (the standard deviation). The training samples are stored in X and their class labels in y.
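As a quick sanity check (a sketch, not part of the original text), the shapes make_blobs returns for these parameters can be verified directly:

```python
# Sketch: inspect what make_blobs returns for the parameters used above
from sklearn.datasets import make_blobs

centers = [[-2, 2], [2, 2], [0, 4]]
X, y = make_blobs(n_samples=60, centers=centers,
                  random_state=0, cluster_std=0.6)
print(X.shape)          # (60, 2): 60 samples with 2 coordinates each
print(y.shape)          # (60,): one integer label per sample
print(sorted(set(y)))   # [0, 1, 2]: one class per center
```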
import matplotlib.pyplot as plt
import numpy as np
# Plot the data
plt.figure(figsize=(16, 10), dpi=144)
c = np.array(centers)
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='cool')         # samples
plt.scatter(c[:, 0], c[:, 1], s=100, marker='^', c='orange')   # cluster centers
[Figure: scatter plot of the three sample clusters and their centers]
Train the classifier with KNeighborsClassifier, choosing k=5:
from sklearn.neighbors import KNeighborsClassifier
# Train the model
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X, y)
KNeighborsClassifier()
To predict a new sample, here [0, 2], use the kneighbors() method to fetch the 5 points closest to the sample; what it returns are the indices of those points in the training set X.
# Make a prediction
X_sample = np.array([0, 2]).reshape(1, -1)
y_sample = clf.predict(X_sample)
neighbors = clf.kneighbors(X_sample, return_distance=False)
X_sample
array([[0, 2]])
y_sample
array([0])
neighbors
array([[16, 20, 48, 6, 23]], dtype=int64)
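The prediction can be reproduced by hand from the returned indices: with the default uniform weights, KNeighborsClassifier takes a majority vote among the labels of the k nearest training samples. A sketch (re-creating the same data so the snippet is self-contained):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

centers = [[-2, 2], [2, 2], [0, 4]]
X, y = make_blobs(n_samples=60, centers=centers,
                  random_state=0, cluster_std=0.6)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

X_sample = np.array([[0, 2]])
neighbors = clf.kneighbors(X_sample, return_distance=False)
# Majority vote over the labels of the 5 nearest neighbors
vote = np.bincount(y[neighbors[0]]).argmax()
print(vote == clf.predict(X_sample)[0])   # the vote reproduces predict()
```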
Mark the 5 nearest points and the sample to be predicted:
# Draw the illustration
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(X[:, 0], X[:, 1], c=y, s=100, cmap='cool')    # samples
plt.scatter(c[:, 0], c[:, 1], s=100, marker='^', c='k')   # cluster centers
plt.scatter(X_sample[0][0], X_sample[0][1], marker='x',
            c=y_sample, s=100, cmap='cool')               # sample to predict
for i in neighbors[0]:
    plt.plot([X[i][0], X_sample[0][0]], [X[i][1], X_sample[0][1]],
             'k--', linewidth=0.6)   # lines from the sample to its 5 nearest neighbors
[Figure: the sample at (0, 2) connected by dashed lines to its 5 nearest neighbors]
The k-nearest-neighbor algorithm can also predict values over a continuous interval, i.e. perform regression fitting. In scikit-learn, k-nearest-neighbor regression is handled by the sklearn.neighbors.KNeighborsRegressor class.
# Generate a dataset: a cosine curve with added noise
import numpy as np
n_dots = 40
X = 5 * np.random.rand(n_dots, 1)
y = np.cos(X).ravel()
# Add some noise
y += 0.2 * np.random.rand(n_dots) - 0.1
# Train the model with KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor
k = 5
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X, y)
KNeighborsRegressor()
The regression-fitting process: generate enough points over a chosen interval on the X axis, predict a value for each of these densely spaced points with the trained model, then connect all the predictions to obtain the fitted curve.
# Generate densely spaced points for prediction (np.newaxis inserts a new axis)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y_pred = knn.predict(T)
knn.score(X, y)
0.9756908320045331
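For a regressor, score() is the R² coefficient of determination, so the 0.97 above means the fit explains most of the variance. A sketch (with a fixed seed, an assumption not in the original, which calls np.random unseeded) showing that score() matches sklearn.metrics.r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)   # fixed seed for reproducibility (an assumption)
X = 5 * rng.rand(40, 1)
y = np.cos(X).ravel() + 0.2 * rng.rand(40) - 0.1

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
# The regressor's score() is exactly R^2 on the given data
print(np.isclose(knn.score(X, y), r2_score(y, knn.predict(X))))
```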
# Plot the fitted curve
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(X, y, c='g', label='data', s=100)          # training samples
plt.plot(T, y_pred, c='k', label='prediction', lw=4)   # fitted curve
plt.axis('tight')
plt.title('KNeighborsRegressor (k = %i)' % k)
plt.show()
[Figure: noisy cosine samples and the kNN regression curve]
Next, use the k-nearest-neighbor algorithm and its variants to predict diabetes among the Pima Indians.
# Load the data
import pandas as pd
data = pd.read_csv('diabetes.csv')
print('dataset shape {}'.format(data.shape))
data.head()
dataset shape (768, 9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
The first eight columns are the feature values and Outcome is the target. Count the samples in each class:
data.groupby("Outcome").size()
Outcome
0 500
1 268
dtype: int64
# Separate the 8 feature columns as the training data and the Outcome column as the target.
# Then split the data into a training set and a test set.
X = data.iloc[:, 0:8]
Y = data.iloc[:, 8]
print('shape of X {}; shape of Y {}'.format(X.shape, Y.shape))
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
shape of X (768, 8); shape of Y (768,)
Fit the dataset with the plain k-nearest-neighbor algorithm, the distance-weighted variant, and the radius-based variant, and compute a score for each:
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier
# Build 3 models
models = []
models.append(("KNN", KNeighborsClassifier(n_neighbors=2)))
models.append(("KNN with weights", KNeighborsClassifier(
    n_neighbors=2, weights="distance")))
models.append(("Radius Neighbors", RadiusNeighborsClassifier(
    radius=500.0)))
print(models)
# Train the 3 models and compute their scores
results = []
for name, model in models:
    model.fit(X_train, Y_train)
    results.append((name, model.score(X_test, Y_test)))
for i in range(len(results)):
    print("name: {};score: {}".format(results[i][0], results[i][1]))
[('KNN', KNeighborsClassifier(n_neighbors=2)), ('KNN with weights', KNeighborsClassifier(n_neighbors=2, weights='distance')), ('Radius Neighbors', RadiusNeighborsClassifier(radius=500.0))]
name: KNN;score: 0.7792207792207793
name: KNN with weights;score: 0.7077922077922078
name: Radius Neighbors;score: 0.6883116883116883
Notes: the radius of the RadiusNeighborsClassifier model was set to 500. How can the algorithms' accuracy be compared more reliably? Randomly split the data into training and cross-validation sets many times, then average the models' accuracy scores. scikit-learn provides KFold and the cross_val_score() function for exactly this:
A detailed walkthrough of cross-validation: https://blog.csdn.net/qq_36523839/article/details/80707678
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
results = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_result = cross_val_score(model, X, Y, cv=kfold)
    results.append((name, cv_result))
for i in range(len(results)):
    print("name: {};cross val score: {}".format(
        results[i][0], results[i][1].mean()))
name: KNN;cross val score: 0.7147641831852358
name: KNN with weights;cross val score: 0.6770505809979495
name: Radius Neighbors;cross val score: 0.6497265892002735
The code above uses KFold to split the dataset into 10 parts: 1 part serves as the cross-validation set for scoring the model, and the remaining 9 parts serve as the training set. cross_val_score() thus computes 10 accuracy scores, one for each train/validation combination.
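The split sizes can be checked directly; for the 768-sample dataset, each validation fold holds roughly one tenth of the data (a sketch):

```python
import numpy as np
from sklearn.model_selection import KFold

n = 768                          # dataset size from above
kfold = KFold(n_splits=10)
splits = list(kfold.split(np.arange(n)))
print(len(splits))               # 10 train/validation index pairs
fold_sizes = [len(val) for _, val in splits]
print(fold_sizes)                # first 8 folds get 77 samples, last 2 get 76
```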
In summary, the plain k-nearest-neighbor algorithm performs best. Next, train a model with it and examine how well it fits the training samples and how accurately it predicts the test samples.
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X_train,Y_train)
train_score = knn.score(X_train,Y_train)
test_score = knn.score(X_test,Y_test)
print("train score: {};test score: {}".format(train_score,test_score))
train score: 0.8159609120521173;test score: 0.7792207792207793
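Here n_neighbors=2 was fixed by hand; a cross-validated grid search is one common way to choose k instead. A sketch (not from the original text) on synthetic stand-in data, used only because diabetes.csv may not be at hand:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 300 samples, 8 features, binary labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': list(range(1, 16))}, cv=5)
grid.fit(X_demo, y_demo)
print(grid.best_params_)   # the k with the best mean cross-validation score
```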
The results above show that even the training-set score is not high. Plot the learning curve for a closer look:
from sklearn.model_selection import ShuffleSplit
from common.utils import plot_learning_curve   # custom helper module, not part of scikit-learn
knn = KNeighborsClassifier(n_neighbors=2)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plt.figure(figsize=(10, 6), dpi=200)
plot_learning_curve(plt, knn, "Learn Curve for KNN Diabetes",
                    X, Y, ylim=(0.0, 1.01), cv=cv)
[Figure: learning curve for the KNN model on the diabetes dataset]
As the figure shows, the training-sample score is low and there is a large gap between the test and training curves: a typical case of underfitting. Is there an intuitive way to see why k-nearest neighbors is not a good model for this problem? scikit-learn offers a rich set of feature-selection methods in the sklearn.feature_selection package; here SelectKBest picks the two most relevant features:
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k = 2)
X_new = selector.fit_transform(X,Y)
X_new[0:5]
array([[148. , 33.6],
[ 85. , 26.6],
[183. , 23.3],
[ 89. , 28.1],
[137. , 43.1]])
Check the accuracy using only these two features:
results = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_result = cross_val_score(model, X_new, Y, cv=kfold)
    results.append((name, cv_result))
for i in range(len(results)):
    print("name: {};cross val score: {}".format(
        results[i][0], results[i][1].mean()))
name: KNN;cross val score: 0.725205058099795
name: KNN with weights;cross val score: 0.6900375939849623
name: Radius Neighbors;cross val score: 0.6510252904989747
The two selected features score about as well as all eight, which indirectly confirms the quality of SelectKBest's choice. (Comparing X_new with the table above, the two columns kept are Glucose and BMI.)
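Which columns SelectKBest kept can be read off with its get_support() mask. A self-contained sketch on synthetic data (used here because diabetes.csv may not be at hand); the same call on X and Y above would mark the retained diabetes columns:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest

rng = np.random.RandomState(0)
n = 200
y = rng.randint(0, 2, n)
df = pd.DataFrame({
    'noise1': rng.rand(n),
    'signal': y + 0.1 * rng.rand(n),   # strongly tied to the label
    'noise2': rng.rand(n),
})
selector = SelectKBest(k=1).fit(df, y)
# get_support() is a boolean mask over the input columns
print(df.columns[selector.get_support()].tolist())   # ['signal']
```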
# Plot all training samples on the two features to inspect their distribution
plt.figure(figsize=(10, 6), dpi=200)
plt.ylabel("BMI")
plt.xlabel("Glucose")
# Negative samples (Y == 0)
plt.scatter(X_new[Y==0][:, 0], X_new[Y==0][:, 1], c='r', s=20, marker='o')
# Positive samples (Y == 1)
plt.scatter(X_new[Y==1][:, 0], X_new[Y==1][:, 1], c='g', s=20, marker='^')
[Figure: Glucose vs. BMI scatter plot of negative (circles) and positive (triangles) samples]
Because the negative and positive samples are not clearly separated along these two features, the diabetes problem is inherently hard to predict, and no model can reach very high accuracy on it.
Source: https://www.cnblogs.com/MurasameLory-chenyulong/p/15091629.html