[Original] How xgboost Computes Its Feature Scores
xgboost is an algorithm that improves on GBDT; it is efficient and its training can be parallelized. During training it can also produce a score for each feature, indicating how important that feature is to the model.
This post does not walk through how the scoring function is invoked; it focuses on how the score is actually computed. The source of the function get_fscore is listed below,
taken from the installed package: xgboost/python-package/xgboost/core.py
As the source below shows, a feature's score (the 'weight' importance type) is simply the number of times that feature is used to split the data across all trees. Note that this differs from the formula
in section 10.13.1 of The Elements of Statistical Learning (《统计学习基础——数据挖掘、推理与预测》), so be careful here.
Note: the two take different perspectives, so their calculation methods differ slightly.
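For comparison, the ESL definition (eq. 10.42/10.43 in the English edition, quoted here from memory, so treat the notation as approximate) is a gain-style measure: it sums the squared split improvements of the nodes that use a variable, then averages over trees:

\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\big(v(t) = \ell\big), \qquad
\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m)

Here \hat{\imath}_t^2 is the improvement achieved by the split at internal node t, v(t) is the splitting variable at that node, J is the number of leaves, and M is the number of trees. xgboost's 'weight' just counts splits, while its 'gain' importance is closer in spirit to this definition.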
def get_fscore(self, fmap=''):
    """Get feature importance of each feature.

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file
    """
    return self.get_score(fmap, importance_type='weight')

def get_score(self, fmap='', importance_type='weight'):
    """Get feature importance of each feature.
    Importance type can be defined as:
        'weight' - the number of times a feature is used to split the data across all trees.
        'gain' - the average gain of the feature when it is used in trees
        'cover' - the average coverage of the feature when it is used in trees

    Parameters
    ----------
    fmap: str (optional)
        The name of feature map file
    """
    if importance_type not in ['weight', 'gain', 'cover']:
        msg = "importance_type mismatch, got '{}', expected 'weight', 'gain', or 'cover'"
        raise ValueError(msg.format(importance_type))

    # if it's weight, then omap stores the number of missing values
    if importance_type == 'weight':
        # do a simpler tree dump to save time
        trees = self.get_dump(fmap, with_stats=False)

        fmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # extract feature name from string between []
                fid = arr[1].split(']')[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                else:
                    fmap[fid] += 1

        return fmap

    else:
        trees = self.get_dump(fmap, with_stats=True)

        importance_type += '='
        fmap = {}
        gmap = {}
        for tree in trees:
            for line in tree.split('\n'):
                # look for the opening square bracket
                arr = line.split('[')
                # if no opening bracket (leaf node), ignore this line
                if len(arr) == 1:
                    continue

                # look for the closing bracket, extract only info within that bracket
                fid = arr[1].split(']')

                # extract gain or cover from string after closing bracket
                g = float(fid[1].split(importance_type)[1].split(',')[0])

                # extract feature name from string before closing bracket
                fid = fid[0].split('<')[0]

                if fid not in fmap:
                    # if the feature hasn't been seen yet
                    fmap[fid] = 1
                    gmap[fid] = g
                else:
                    fmap[fid] += 1
                    gmap[fid] += g

        # calculate average value (gain/cover) for each feature
        for fid in gmap:
            gmap[fid] = gmap[fid] / fmap[fid]

        return gmap
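A minimal usage sketch of these functions (the toy data, parameter values, and the example numbers in the comments are made up for illustration; only get_fscore and get_score themselves come from the source above):

import numpy as np
import xgboost as xgb

# toy regression data: 3 features, 100 rows; feature 0 dominates the target
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 3 * X[:, 0] + 0.1 * rng.rand(100)

dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({'max_depth': 3, 'eta': 0.3}, dtrain, num_boost_round=10)

# 'weight': how many times each feature is chosen as a split variable
print(bst.get_fscore())                            # e.g. {'f0': 57, 'f1': 21, 'f2': 19}
# 'gain' / 'cover': totals divided by the split counts, i.e. per-split averages
print(bst.get_score(importance_type='gain'))
print(bst.get_score(importance_type='cover'))

With a plain numpy array the features show up under the default names f0, f1, f2; passing feature_names to DMatrix (or a feature map file via fmap) gives human-readable keys instead.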
The links below explain the principle behind GBDT feature-score calculation:
1. http://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
A more detailed, code-level walkthrough can be reached from the link above, or directly at:
http://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting