Reading the sklearn GBDT Source Code (A Macro View)
2017/01/09 22:05 V0.1: this first version does not dig into source-level details; it aims at an overall grasp of the code. Later versions will fill in the specifics.
2017/01/11 01:25 V0.2: same focus as V0.1, the overall structure rather than source-level details; specifics will follow in later versions.
I have been playing with data mining recently and used GBDT a bit, so I wanted to see how the source code implements it.
When we train a GBDT model:
ensemble.GradientBoostingClassifier(params)
So we go to the code in the corresponding folder:
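For context, here is a minimal, self-contained usage sketch (toy data and parameter values of my own choosing, not from the sklearn source) showing the call pattern end to end:

from sklearn import ensemble
from sklearn.datasets import make_classification

# toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

params = {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3}
clf = ensemble.GradientBoostingClassifier(**params)
clf = clf.fit(X, y)
print(clf.score(X, y))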
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):

    _SUPPORTED_LOSS = ('deviance', 'exponential')

    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, init=None, random_state=None,
                 max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto'):

        super(GradientBoostingClassifier, self).__init__(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, init=init, subsample=subsample,
            max_features=max_features,
            random_state=random_state, verbose=verbose,
            max_leaf_nodes=max_leaf_nodes, warm_start=warm_start,
            presort=presort)
We can see that GradientBoostingClassifier inherits from a parent class, BaseGradientBoosting, and GradientBoostingRegressor inherits from the same parent. In that parent class we find the following snippet:
class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
                                              _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, monitor=None):
        """Fit the gradient boosting model."""
        ...
Obviously, this fit function is exactly what gets called when we train the model:
clf = clf.fit(train[predictors],train[target])
Now let's dig into this fit code and see how GBDT actually trains the model. Things such as warm_start, check_X_y and check_random_state are clearly just basic validity checks on the input data, so we skip them and go straight to the most important snippet:
        # fit the boosting stages
        n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state,
                                    begin_at_stage, monitor, X_idx_sorted)

        # change shape of arrays after fit (early-stopping or additional ests)
        if n_stages != self.estimators_.shape[0]:
            self.estimators_ = self.estimators_[:n_stages]
            if hasattr(self, 'oob_improvement_'):
                self.oob_improvement_ = self.oob_improvement_[:n_stages]

        return self
A quick explanation of the snippet above: _fit_stages trains the model, and n_stages is the number of boosting iterations. If you open def _fit_stages and scroll to the bottom you will see return i + 1, so n_stages is indeed the iteration count. Note that return i + 1 also implies the loop body must run serially, which matches what we know about GBDT: building the t-th tree depends on the result of the previous t-1 trees.
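A conceptual sketch of that control flow (my own toy code, not the sklearn implementation) makes the return i + 1 point clearer: stages are fitted one after another, and if a monitor requests early stopping the function reports how many stages were actually built.

def fit_stages_sketch(n_estimators, begin_at_stage=0, monitor=None):
    i = begin_at_stage
    for i in range(begin_at_stage, n_estimators):
        # ... fit stage i here (elided) ...
        if monitor is not None and monitor(i):
            break                       # early stopping requested
    return i + 1                        # number of stages actually fitted

print(fit_stages_sketch(100, monitor=lambda i: i >= 4))   # -> 5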
Inside the fit code we find:
    def _fit_stages(self, X, y, y_pred, sample_weight, random_state,
                    begin_at_stage=0, monitor=None, X_idx_sorted=None):
        ...

        # perform boosting iterations
        i = begin_at_stage
        for i in range(begin_at_stage, self.n_estimators):

            # subsampling
            if do_oob:
                ...

            # fit next stage of trees
            y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                                     sample_mask, random_state, X_idx_sorted,
                                     X_csc, X_csr)

            if do_oob:
                ...
            else:
                # no need to fancy index w/ no subsampling
                ...
The actual boosting iterations happen inside this for loop. Here you will see do_oob; readers who are not familiar with it can look up out-of-bag, which plenty of blog posts explain in detail, so I will not repeat it here. The key line is y_pred = self._fit_stage(...): unlike _fit_stages, this call fits the next single stage of trees. Trees are built serially, not in parallel. XGBoost also builds its trees serially; its parallelism is fine-grained over features while building one tree, not coarse-grained over trees. Before reading _fit_stage itself (shown right after the sketch below), a tiny hand-rolled example may help make the stage-wise idea concrete.
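Here is a minimal hand-rolled residual-boosting loop on toy data (my own sketch for the squared loss, not the sklearn code): each stage fits a small regression tree to the current residual, and the accumulated y_pred from stage t-1 is what stage t starts from.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(6 * X.ravel()) + 0.1 * rng.randn(200)

learning_rate = 0.1
y_pred = np.zeros_like(y)                 # stage 0: start from a constant 0
for i in range(100):                      # stages are built strictly one after another
    residual = y - y_pred                 # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    y_pred += learning_rate * tree.predict(X)   # stage i+1 depends on stage i's y_pred

print(np.mean((y - y_pred) ** 2))         # training MSE shrinks as stages accumulate

With that picture in mind, here is the real _fit_stage: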
    def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
                   random_state, X_idx_sorted, X_csc=None, X_csr=None):
        """Fit another stage of ``n_classes_`` trees to the boosting model. """

        assert sample_mask.dtype == np.bool

        loss = self.loss_
        original_y = y

        for k in range(loss.K):
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)

            if self.subsample < 1.0:
                # no inplace multiplication!
                sample_weight = sample_weight * sample_mask.astype(np.float64)

            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)

            # update tree leaves
            if X_csr is not None:
                loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)
            else:
                loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)

            # add tree to ensemble
            self.estimators_[i, k] = tree

        return y_pred
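One detail worth noting in the code above: the for k in range(loss.K) loop means that for a K-class problem each boosting stage fits one regression tree per class, so estimators_ ends up with shape (n_estimators, K), where K is 1 for binary classification and regression. A quick check on toy data (my own example):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.estimators_.shape)   # -> (10, 3): one tree per class per stage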
Why do we say that GBDT is made of regression trees? And why do we say each tree is built with the goal of reducing the residual? I think this part of the snippet above answers both questions:
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)
Notice the negative_gradient() function; the name speaks for itself and answers the question nicely. The regression tree is then fitted to this residual, so that adding the tree's predictions drives the residual down; this is the idea behind residual regression:
            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
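A small sanity check of the "residual" terminology (my own sketch, not from the post): for the least-squares loss the negative gradient is literally y - pred, while for the classifier's default deviance loss it is a pseudo-residual. Assuming the internal module layout of the scikit-learn 0.17.x release browsed here (later releases moved these internals), the loss object can be exercised directly:

import numpy as np
from sklearn.ensemble.gradient_boosting import LeastSquaresError  # internal; path valid in 0.17.x

y = np.array([3.0, -1.0, 2.0, 0.5])
pred = np.array([[2.5], [0.0], [2.0], [1.0]])   # GBDT keeps predictions as a 2-D column

loss = LeastSquaresError(1)                      # a single "class" for regression
print(loss.negative_gradient(y, pred))           # -> [ 0.5 -1.   0.  -0.5]
print(y - pred.ravel())                          # identical: the plain residual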
OK, let's first find where this tree.fit function lives. If you are developing on Windows, I suggest installing Linux first.
#from .. import DecisionTreeRegressor
cd ..
ls
cd anaconda2
find -name sklearn
cd anaconda2/pkgs/scikit-learn-0.17.1-np111py27_2/lib/python2.7/site-packages/sklearn
cd ensemble
ls
vi gradient_boosting.py
That is how we find gradient_boosting.py.
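Incidentally, if you would rather not hunt through the filesystem by hand (and this also works on Windows), you can ask Python itself where the modules live. A small sketch, assuming the module layout of the scikit-learn 0.17.x release used in this post:

import inspect
from sklearn.ensemble import gradient_boosting      # module was renamed in much later releases
from sklearn.tree import DecisionTreeRegressor

print(inspect.getsourcefile(gradient_boosting))      # .../sklearn/ensemble/gradient_boosting.py
print(inspect.getsourcefile(DecisionTreeRegressor))  # .../sklearn/tree/tree.py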
cd ..
ls          # locate the tree folder
cd tree
vi tree.py  # open tree.py and find DecisionTreeRegressor
Following the same steps we find DecisionTreeRegressor:
class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
    """A decision tree regressor.
    ...
    """

    def __init__(self,
                 criterion="mse",
                 splitter="best",
                 max_depth=None,
                 min_samples_split=2,
                 min_samples_leaf=1,
                 min_weight_fraction_leaf=0.,
                 max_features=None,
                 random_state=None,
                 max_leaf_nodes=None,
                 presort=False):
        super(DecisionTreeRegressor, self).__init__(
            criterion=criterion,
            splitter=splitter,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_features=max_features,
            max_leaf_nodes=max_leaf_nodes,
            random_state=random_state,
            presort=presort)
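Note that the parameters GBDT passes in earlier (criterion='friedman_mse', splitter='best', max_depth, ...) go straight into this constructor, so you can build the same kind of tree yourself. A small sketch on toy data (my own example, fitting a made-up residual vector the way a boosting stage would):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
residual = rng.randn(100)                    # stand-in for a negative gradient

tree = DecisionTreeRegressor(criterion='friedman_mse', splitter='best',
                             max_depth=3, random_state=0)
tree.fit(X, residual, sample_weight=np.ones(100))
print(tree.tree_.node_count)                 # the fitted tree GBDT would store in estimators_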
In the same way, we find the parent class it inherits from, BaseDecisionTree:
class BaseDecisionTree(six.with_metaclass(ABCMeta, BaseEstimator,
                                          _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, check_input=True,
            X_idx_sorted=None):
        """Build a decision tree from the training set (X, y)."""
        ...
At this point we have reached the tree.fit function from before. We know that GBDT is also built on decision trees (the ID3/C4.5/CART family; here it is a CART-style regression tree). RF and GBDT are two different strategies: RF uses bagging, whereas GBDT fits regression trees to drive down the residual of the label. Both kinds of forest rest on the same basic decision tree. For now we are not concerned with small implementation details, for example the code below:
        if is_classification:
            ...
        else:
            self.classes_ = [None] * self.n_outputs_
            self.n_classes_ = [1] * self.n_outputs_
Clearly this is just parameter assignment; we can come back and study it carefully once we have walked through the GBDT code as a whole. For now we focus only on how the tree itself is built.