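Reference: http://scikit-learn.org/stable/modules/computational_performance.html
For some applications, the computational performance of estimators (mainly latency and throughput when predicting new samples) is critical. Training performance matters too, but because training can be done offline, we focus more on performance at prediction time.
Prediction latency: the elapsed time necessary to make a prediction for a new sample.
Prediction throughput: the number of predictions the software can deliver in a given amount of time.
Improving computational performance often comes at the cost of prediction accuracy (simple models do run fast, but they cannot take into account as many properties of the data as complex models can). We review the orders of magnitude of the computational performance of some estimators and give some ways to relieve computational bottlenecks.
Since nobody likes reading introductions and everybody likes methods, the tips and tricks come first.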
3. Tips and Tricks
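1) Linear algebra libraries:
Pay attention to the versions of NumPy/SciPy and of the linear algebra libraries, and make sure they are built using an optimized BLAS / LAPACK library.
Not all operations benefit, however: the inner loops of (randomized) decision trees and kernel SVMs (SVC, SVR, NuSVC, NuSVR) are unaffected, whereas linear models (via numpy.dot) improve greatly.
Commands that show the BLAS / LAPACK libraries used by your NumPy / SciPy / scikit-learn install: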
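A minimal sketch of such commands, assuming a reasonably recent install where `numpy.show_config()`, `scipy.show_config()`, and `sklearn.show_versions()` are available (the exact output format varies between versions):

```python
import numpy as np
import scipy
import sklearn

# Print the build configuration, including the BLAS / LAPACK libraries
# that NumPy and SciPy were compiled against.
np.show_config()
scipy.show_config()

# Print scikit-learn's view of the system and its dependency versions.
sklearn.show_versions()
```

If the output names an optimized library such as OpenBLAS or MKL, the linear algebra routines are probably not falling back to a slow reference implementation.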
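2) Model compression
Only linear models are considered here. Model compression means controlling the sparsity of the model (the number of non-zero coefficients), ideally so that both the model and the input data are sparse.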
Here is sample code that illustrates the use of the sparsify() method:
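A minimal sketch of such code, assuming synthetic regression data from `make_regression` (the dataset, the split, and the parameter values are illustrative assumptions, not tuned settings):

```python
from scipy import sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic data: 50 features, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       random_state=0)
X_train, X_test = X[:150], X[150:]
y_train = y[:150]

# Fit with an elastic-net penalty, then convert coef_ to a sparse matrix.
clf = SGDRegressor(penalty="elasticnet", l1_ratio=0.25, random_state=0)
clf.fit(X_train, y_train).sparsify()

# Predict on sparse input so that both model and data are sparse.
pred = clf.predict(sparse.csr_matrix(X_test))
print(type(clf.coef_), pred.shape)
```

Note that `sparsify()` only pays off when many coefficients are actually zero; for a mostly dense `coef_` it may increase memory usage.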
In this example we prefer the elasticnet penalty as it is often a good compromise between model compactness and prediction power. One can also further tune the l1_ratio parameter (in combination with the regularization strength alpha) to control this tradeoff.
3) Model reshaping
Model reshaping consists in selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input.
I will skip the benefits, because for now this trick can only be performed manually in scikit-learn.
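Doing it manually could look like the following sketch. It assumes a Lasso model on synthetic data; the helper name `predict_reduced` is hypothetical and not part of scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# The L1 penalty drives the coefficients of unused features to zero.
full = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(full.coef_)  # indices of features the model uses

def predict_reduced(X_new):
    """Predict from the kept columns only (hypothetical helper)."""
    return X_new[:, kept] @ full.coef_[kept] + full.intercept_

print(f"kept {kept.size} of {X.shape[1]} features")
print(np.allclose(predict_reduced(X), full.predict(X)))
```

Once `kept` is known, upstream code only needs to compute and pass those columns, which is where the savings come from.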
4) Left for the experts
1. Prediction latency
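1) Bulk versus atomic mode
Bulk prediction (many samples at once) is generally faster than atomic prediction (one sample at a time), for some estimators by 1 to 2 orders of magnitude. The reasons include branching predictability, CPU cache, and linear algebra library optimizations. See the two comparison plots at http://scikit-learn.org/stable/modules/computational_performance.html.
The good news: to benchmark different estimators for your case you can simply change the n_features parameter in this example: Prediction Latency. This should give you an estimate of the order of magnitude of the prediction latency.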
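For a quick comparison of the two modes on your own machine, a rough sketch such as the following may help. It assumes a Ridge model on random data; the sizes are arbitrary and the timings depend on the hardware:

```python
import time
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(1000, 100)
y = rng.rand(1000)
model = Ridge().fit(X, y)

# Bulk mode: one call for all samples.
start = time.perf_counter()
bulk = model.predict(X)
bulk_time = time.perf_counter() - start

# Atomic mode: one call per sample.
start = time.perf_counter()
atomic = np.array([model.predict(x.reshape(1, -1))[0] for x in X])
atomic_time = time.perf_counter() - start

print(f"bulk: {bulk_time:.6f}s  atomic: {atomic_time:.6f}s")
```

Both modes return the same predictions; only the per-call overhead differs.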
2) Number of features
For the plot, see: http://scikit-learn.org/stable/modules/computational_performance.html.
Overall you can expect the prediction time to increase at least linearly with the number of features (non-linear cases can happen depending on the global memory footprint and estimator).
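To see the trend for a given estimator, one can time bulk prediction at several feature counts. The sketch below assumes a Ridge model on random data; `bulk_latency` is a hypothetical helper, and a single timing per size is noisy:

```python
import time
import numpy as np
from sklearn.linear_model import Ridge

def bulk_latency(n_features, n_samples=2000, seed=0):
    """Time one bulk predict call for a model with n_features inputs."""
    rng = np.random.RandomState(seed)
    X = rng.rand(n_samples, n_features)
    y = rng.rand(n_samples)
    model = Ridge().fit(X, y)
    start = time.perf_counter()
    model.predict(X)
    return time.perf_counter() - start

timings = {n: bulk_latency(n) for n in (10, 100, 1000)}
for n, t in timings.items():
    print(f"n_features={n:5d}  predict time={t:.6f}s")
```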
3) Input data representation and sparsity
This part mainly covers the difference between sparse and dense representations.
The benefit of sparse: if you have 100 non-zeros in a 1e6-dimensional space, you only need 100 multiply-and-add operations instead of 1e6.
When to use a sparse format: when at most 10% of the elements are non-zero. If the sparsity ratio is greater than 90% you can probably benefit from sparse formats.
Check Scipy’s sparse matrix formats documentation for more information on how to build (or convert your data to) sparse matrix formats. Most of the time the CSR and CSC formats work best.
To spell out what CSR and CSC are:
| csc_matrix(arg1[, shape, dtype, copy]) | Compressed Sparse Column matrix | 
| csr_matrix(arg1[, shape, dtype, copy]) | Compressed Sparse Row matrix | 
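As an illustration of these two constructors, the sketch below builds a vector with 100 non-zeros in a 1e6-dimensional space and takes its dot product with a weight vector. The data is random and purely illustrative:

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
n_features = 1_000_000

# 100 non-zero values at distinct random column positions of a single row.
cols = rng.choice(n_features, size=100, replace=False)
vals = rng.rand(100)
rows = np.zeros(100, dtype=int)

x_csr = sparse.csr_matrix((vals, (rows, cols)), shape=(1, n_features))
x_csc = x_csr.tocsc()  # same data, column-compressed layout

w = rng.rand(n_features)
sparse_result = x_csr.dot(w)                # touches only the stored entries
dense_result = x_csr.toarray().ravel() @ w  # touches all 1e6 entries

print(x_csr.nnz, x_csc.format, np.allclose(sparse_result[0], dense_result))
```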
A function to check the sparsity of the data:
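One possible version of such a function, sketched with NumPy for a dense 2-D array (the small test array is an assumption for illustration):

```python
import numpy as np

def sparsity_ratio(X):
    """Fraction of entries in a dense 2-D array that are exactly zero."""
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])

X = np.zeros((100, 100))
X[0, :50] = 1.0  # 50 non-zeros out of 10,000 entries
ratio = sparsity_ratio(X)
print("input sparsity ratio:", ratio)
```

With a ratio above 0.9, as here, converting `X` to a sparse format is probably worthwhile.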
4) Model complexity
Generally speaking, when model complexity increases, predictive power and latency are supposed to increase.
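For sklearn.linear_model (e.g. Lasso, ElasticNet, SGDClassifier/Regressor, Ridge & RidgeClassifier, PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression...), the decision function applied at prediction time is the same (a dot product between the coefficients and the input features), so the latency should be equivalent regardless of which of these estimators is used.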
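The shared decision function can be checked directly. The sketch below assumes three regressors with default parameters on synthetic data and compares a manual dot product with `predict`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=20, random_state=0)

matches = {}
for Estimator in (Ridge, Lasso, ElasticNet):
    model = Estimator().fit(X, y)
    # Prediction is a dot product with coef_ plus the intercept.
    manual = X @ model.coef_ + model.intercept_
    matches[Estimator.__name__] = np.allclose(manual, model.predict(X))

print(matches)
```

The estimators differ in how `coef_` is learned, not in how it is applied at prediction time.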
For other models, see the comparison plots at http://scikit-learn.org/stable/modules/computational_performance.html for the experimental results.
5) Feature extraction
(In some applications feature extraction takes a lot of time. Well, as just mentioned, that part can be done offline.) In many real world applications the feature extraction process (i.e. turning raw data like database rows or network packets into numpy arrays) governs the overall prediction time. For example on the Reuters text classification task the whole preparation (reading and parsing SGML files, tokenizing the text and hashing it into a common vector space) takes 100 to 500 times more time than the actual prediction code, depending on the chosen model. For the plot, see: http://scikit-learn.org/stable/modules/computational_performance.html.
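To measure this split on your own pipeline, the two stages can be timed separately. The sketch below uses a tiny made-up corpus (not the Reuters data) with `HashingVectorizer` and `SGDClassifier`, so the ratio it prints is not comparable to the figures above:

```python
import time
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus, repeated to get measurable timings (illustrative only).
docs = ["grain prices rose sharply", "oil exports fell",
        "wheat harvest was strong", "crude oil output grew"] * 250
labels = [0, 1, 0, 1] * 250

vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0).fit(vectorizer.transform(docs), labels)

# Stage 1: feature extraction (tokenizing and hashing the raw text).
start = time.perf_counter()
X_new = vectorizer.transform(docs)
extract_time = time.perf_counter() - start

# Stage 2: the actual prediction.
start = time.perf_counter()
pred = clf.predict(X_new)
predict_time = time.perf_counter() - start

print(f"extraction: {extract_time:.6f}s  prediction: {predict_time:.6f}s")
```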
2. Prediction throughput
For the plots, see: http://scikit-learn.org/stable/modules/computational_performance.html.
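Throughput as defined at the top of this post can be estimated by counting predictions over a fixed time window. The sketch below assumes a Ridge model on random data; `predictions_per_second` is a hypothetical helper and the window length is arbitrary:

```python
import time
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(500, 50)
y = rng.rand(500)
model = Ridge().fit(X, y)

def predictions_per_second(model, X, duration=0.2):
    """Count single-sample predictions completed within `duration` seconds."""
    n_predictions = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        # X[[i]] keeps the 2-D shape (1, n_features) that predict expects.
        model.predict(X[[n_predictions % X.shape[0]]])
        n_predictions += 1
    elapsed = time.perf_counter() - start
    return n_predictions / elapsed

rate = predictions_per_second(model, X)
print(f"{rate:.0f} predictions per second")
```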
scikit-learn: 7. Computational Performance (latency and throughput)
Original post: http://blog.csdn.net/mmc2015/article/details/47089813