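In machine learning there are many criteria for judging how good a model is; common ones include accuracy, recall, and AUC. This post introduces AUC and how to compute it.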
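AUC is commonly used to evaluate binary classifiers. A binary classifier has four possible prediction outcomes. Taking hypertension diagnosis as an example: a true positive (TP) is a patient with hypertension who is predicted to have it, a false negative (FN) is a patient with hypertension who is predicted not to have it, a false positive (FP) is a person without hypertension who is predicted to have it, and a true negative (TN) is a person without hypertension who is predicted not to have it.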
From these counts we get two rates, the true positive rate and the false positive rate:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
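As a minimal sketch of these two formulas, the snippet below counts the four outcomes at a single threshold and derives TPR and FPR from them. The helper name `tpr_fpr`, the threshold of 0.5, and the sample scores and labels are assumptions made for illustration (label 1 stands for "has hypertension" here).

```python
def tpr_fpr(scores, labels, threshold, pos_label=1):
    """Return (TPR, FPR) when every score >= threshold is predicted positive.

    Assumes both classes are present, otherwise a denominator is zero.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_pos = score >= threshold
        actual_pos = label == pos_label
        if predicted_pos and actual_pos:
            tp += 1
        elif predicted_pos:
            fp += 1
        elif actual_pos:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Hypothetical scores and labels, for illustration only.
scores = [0.9, 0.7, 0.6, 0.3, 0.2]
labels = [1, 0, 1, 0, 1]
print(tpr_fpr(scores, labels, threshold=0.5))
```

With this made-up data, two of the three positives and one of the two negatives score at least 0.5, so TPR is 2/3 and FPR is 1/2.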
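A classifier usually outputs a score, and a threshold turns that score into a positive or negative prediction, so each threshold yields one (FPR, TPR) point. Taking many different thresholds gives a series of (FPR, TPR) points, and connecting them gives a curve called the ROC (Receiver Operating Characteristic) curve. The area between this curve and the horizontal axis is the AUC. We can therefore compute AUC as follows: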
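#!/usr/bin/env python3
import sys


def get_auc(arr_score, arr_label, pos_label):
    """Compute AUC by sweeping the threshold from the highest score down."""
    score_label_list = [(float(s), int(l)) for s, l in zip(arr_score, arr_label)]
    # Sort by score, highest first, so each step lowers the threshold.
    score_label_list_sorted = sorted(score_label_list, key=lambda pair: pair[0], reverse=True)
    fp, tp = 0, 0
    lastfp, lasttp = 0, 0
    A = 0.0
    lastscore = None
    for score, label in score_label_list_sorted:
        # Add a trapezoid only when the score changes, so that samples
        # with tied scores are treated as a single threshold.
        if score != lastscore:
            A += trapezoid_area(fp, lastfp, tp, lasttp)
            lastscore = score
            lastfp, lasttp = fp, tp
        if label == pos_label:
            tp += 1
        else:
            fp += 1
    A += trapezoid_area(fp, lastfp, tp, lasttp)
    if fp == 0 or tp == 0:
        raise ValueError("AUC is undefined unless both classes are present")
    # The area was accumulated in raw counts; dividing by
    # (number of negatives * number of positives) rescales it to [0, 1].
    A /= fp * tp
    return A


def trapezoid_area(x1, x2, y1, y2):
    """Area of the trapezoid between x1 and x2 with heights y1 and y2."""
    delta = abs(x2 - x1)
    return delta * 0.5 * (y1 + y2)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Error!\n%s pred_model_file" % sys.argv[0])
        sys.exit(-1)
    arr_score, arr_label = [], []
    # Each input line has the form: score<TAB>label
    with open(sys.argv[1]) as pred_file:
        for line in pred_file:
            fields = line.strip().split('\t')
            if len(fields) < 2:
                continue
            arr_score.append(fields[0])
            arr_label.append(fields[1])
    print(arr_score)
    print(arr_label)
    print("AUC = %s" % get_auc(arr_score, arr_label, 2))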
F:\python_workspace\offline_evaluation>python model_evaluation.py pred_model_file.txt
['0.1', '0.4', '0.35', '0.8']
['1', '1', '2', '2']
AUC = 0.75
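To see where the 0.75 comes from, the ROC points for this example can be listed explicitly. The sketch below is a slower but more literal version of the same idea: it uses every distinct score as a threshold, records the (FPR, TPR) point, and sums the trapezoids between consecutive points. The function names `roc_points` and `area_under` are assumptions for illustration, not part of the script above.

```python
def roc_points(scores, labels, pos_label):
    """Return the (FPR, TPR) points obtained by using each distinct score as a threshold."""
    n_pos = sum(1 for label in labels if label == pos_label)
    n_neg = len(labels) - n_pos
    # A threshold above every score predicts nothing as positive: the point (0, 0).
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == pos_label)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l != pos_label)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Sum the trapezoids between consecutive (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


points = roc_points([0.1, 0.4, 0.35, 0.8], [1, 1, 2, 2], pos_label=2)
print(points)              # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area_under(points))  # 0.75
```

The curve is a staircase: the two horizontal steps contribute 0.5 × 0.5 = 0.25 and 0.5 × 1.0 = 0.5, which add up to 0.75.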
The same AUC value can also be obtained with sklearn; see http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> pred = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr, tpr)
0.75
Another example:
F:\python_workspace\offline_evaluation>python model_evaluation.py tmp.txt
['0.1', '0.2', '0.4', '0.5', '0.35', '0.8', '0.9', '0.95']
['1', '2', '1', '1', '2', '2', '2', '1']
AUC = 0.5
>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1,2,1,1,2,2,2,1])
>>> pred = np.array([0.1,0.2,0.4,0.5,0.35,0.8,0.9,0.95])
>>> fpr, tpr, ths = metrics.roc_curve(y, pred, pos_label=2)
>>> metrics.auc(fpr,tpr)
0.5
In both examples, sklearn returns the same AUC value as the hand-written code above.
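A further cross-check is possible under the standard interpretation of AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below is a brute-force version of that idea, run on the two examples above; the function name `pairwise_auc` is an assumption for illustration, and the nested loop is only practical for small inputs.

```python
def pairwise_auc(scores, labels, pos_label):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == pos_label]
    neg = [s for s, l in zip(scores, labels) if l != pos_label]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    # Assumes both classes are present, otherwise this divides by zero.
    return wins / (len(pos) * len(neg))


print(pairwise_auc([0.1, 0.4, 0.35, 0.8], [1, 1, 2, 2], 2))  # 0.75
print(pairwise_auc([0.1, 0.2, 0.4, 0.5, 0.35, 0.8, 0.9, 0.95],
                   [1, 2, 1, 1, 2, 2, 2, 1], 2))             # 0.5
```

In the first example 3 of the 4 pairs are ranked correctly, and in the second 8 of the 16, matching the 0.75 and 0.5 obtained above.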
Original post: http://blog.csdn.net/lming_08/article/details/44284155