Reading the sklearn GBDT Source Code (A Macro View)
2017/01/09 22:05 V0.1: this first version does not dig into source-level details; it aims at an overall grasp of the code. Later versions will fill in the specifics.
2017/01/11 01:25 V0.2: same focus as V0.1, the overall structure rather than source-level details; specifics will follow in later versions.
I have been playing with data mining recently and used GBDT a bit, so I wanted to see how the source code implements it.
When we train a GBDT model:
ensemble.GradientBoostingClassifier(params)
So we go to the code in the corresponding folder:
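For context, here is a minimal, self-contained usage sketch (toy data and parameter values of my own choosing, not from the sklearn source) showing the call pattern end to end:

from sklearn import ensemble
from sklearn.datasets import make_classification

# toy data purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

params = {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 3}
clf = ensemble.GradientBoostingClassifier(**params)
clf = clf.fit(X, y)
print(clf.score(X, y))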
class GradientBoostingClassifier(BaseGradientBoosting, ClassifierMixin):

    _SUPPORTED_LOSS = ('deviance', 'exponential')

    def __init__(self, loss='deviance', learning_rate=0.1, n_estimators=100,
                 subsample=1.0, min_samples_split=2,
                 min_samples_leaf=1, min_weight_fraction_leaf=0.,
                 max_depth=3, init=None, random_state=None,
                 max_features=None, verbose=0,
                 max_leaf_nodes=None, warm_start=False,
                 presort='auto'):

        super(GradientBoostingClassifier, self).__init__(
            loss=loss, learning_rate=learning_rate, n_estimators=n_estimators,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_depth=max_depth, init=init, subsample=subsample,
            max_features=max_features,
            random_state=random_state, verbose=verbose,
            max_leaf_nodes=max_leaf_nodes, warm_start=warm_start,
            presort=presort)
We can see that GradientBoostingClassifier inherits from a parent class, BaseGradientBoosting, and GradientBoostingRegressor inherits from the same parent. In that parent class we find the following snippet:
class BaseGradientBoosting(six.with_metaclass(ABCMeta, BaseEnsemble,
                                              _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, monitor=None):
        """Fit the gradient boosting model."""
        ...
Obviously, this fit function is exactly what gets called when we train the model:
clf = clf.fit(train[predictors],train[target])
Now let's dig into this fit code and see how GBDT actually trains the model. Things such as warm_start, check_X_y and check_random_state are clearly just basic validity checks on the input data, so we skip them and go straight to the most important snippet:
        # fit the boosting stages
        n_stages = self._fit_stages(X, y, y_pred, sample_weight, random_state,
                                    begin_at_stage, monitor, X_idx_sorted)

        # change shape of arrays after fit (early-stopping or additional ests)
        if n_stages != self.estimators_.shape[0]:
            self.estimators_ = self.estimators_[:n_stages]
            if hasattr(self, 'oob_improvement_'):
                self.oob_improvement_ = self.oob_improvement_[:n_stages]

        return self
A quick explanation of the snippet above: _fit_stages trains the model, and n_stages is the number of boosting iterations. If you open def _fit_stages and scroll to the bottom you will see return i + 1, so n_stages is indeed the iteration count. Note that return i + 1 also implies the loop body must run serially, which matches what we know about GBDT: building the t-th tree depends on the result of the previous t-1 trees.
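A conceptual sketch of that control flow (my own toy code, not the sklearn implementation) makes the return i + 1 point clearer: stages are fitted one after another, and if a monitor requests early stopping the function reports how many stages were actually built.

def fit_stages_sketch(n_estimators, begin_at_stage=0, monitor=None):
    i = begin_at_stage
    for i in range(begin_at_stage, n_estimators):
        # ... fit stage i here (elided) ...
        if monitor is not None and monitor(i):
            break                       # early stopping requested
    return i + 1                        # number of stages actually fitted

print(fit_stages_sketch(100, monitor=lambda i: i >= 4))   # -> 5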
Inside the fit code we find:
    def _fit_stages(self, X, y, y_pred, sample_weight, random_state,
                    begin_at_stage=0, monitor=None, X_idx_sorted=None):
        ...

        # perform boosting iterations
        i = begin_at_stage
        for i in range(begin_at_stage, self.n_estimators):

            # subsampling
            if do_oob:
                ...

            # fit next stage of trees
            y_pred = self._fit_stage(i, X, y, y_pred, sample_weight,
                                     sample_mask, random_state, X_idx_sorted,
                                     X_csc, X_csr)

            if do_oob:
                ...
            else:
                # no need to fancy index w/ no subsampling
                ...
The actual boosting iterations happen inside this for loop. Here you will see do_oob; readers who are not familiar with it can look up out-of-bag, which plenty of blog posts explain in detail, so I will not repeat it here. The key line is y_pred = self._fit_stage(...): unlike _fit_stages, this call fits the next single stage of trees. Trees are built serially, not in parallel. XGBoost also builds its trees serially; its parallelism is fine-grained over features while building one tree, not coarse-grained over trees. Before reading _fit_stage itself (shown right after the sketch below), a tiny hand-rolled example may help make the stage-wise idea concrete.
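Here is a minimal hand-rolled residual-boosting loop on toy data (my own sketch for the squared loss, not the sklearn code): each stage fits a small regression tree to the current residual, and the accumulated y_pred from stage t-1 is what stage t starts from.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(6 * X.ravel()) + 0.1 * rng.randn(200)

learning_rate = 0.1
y_pred = np.zeros_like(y)                 # stage 0: start from a constant 0
for i in range(100):                      # stages are built strictly one after another
    residual = y - y_pred                 # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    y_pred += learning_rate * tree.predict(X)   # stage i+1 depends on stage i's y_pred

print(np.mean((y - y_pred) ** 2))         # training MSE shrinks as stages accumulate

With that picture in mind, here is the real _fit_stage: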
    def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
                   random_state, X_idx_sorted, X_csc=None, X_csr=None):
        """Fit another stage of ``n_classes_`` trees to the boosting model. """

        assert sample_mask.dtype == np.bool

        loss = self.loss_
        original_y = y

        for k in range(loss.K):
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)

            if self.subsample < 1.0:
                # no inplace multiplication!
                sample_weight = sample_weight * sample_mask.astype(np.float64)

            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)

            # update tree leaves
            if X_csr is not None:
                loss.update_terminal_regions(tree.tree_, X_csr, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)
            else:
                loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
                                             sample_weight, sample_mask,
                                             self.learning_rate, k=k)

            # add tree to ensemble
            self.estimators_[i, k] = tree

        return y_pred
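One detail worth noting in the code above: the for k in range(loss.K) loop means that for a K-class problem each boosting stage fits one regression tree per class, so estimators_ ends up with shape (n_estimators, K), where K is 1 for binary classification and regression. A quick check on toy data (my own example):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
print(clf.estimators_.shape)   # -> (10, 3): one tree per class per stage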
Why do we say that GBDT is made of regression trees? And why do we say each tree is built with the goal of reducing the residual? I think this part of the snippet above answers both questions:
            if loss.is_multi_class:
                y = np.array(original_y == k, dtype=np.float64)

            residual = loss.negative_gradient(y, y_pred, k=k,
                                              sample_weight=sample_weight)

            # induce regression tree on residuals
            tree = DecisionTreeRegressor(
                criterion='friedman_mse',
                splitter='best',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                min_samples_leaf=self.min_samples_leaf,
                min_weight_fraction_leaf=self.min_weight_fraction_leaf,
                max_features=self.max_features,
                max_leaf_nodes=self.max_leaf_nodes,
                random_state=random_state,
                presort=self.presort)
Notice the negative_gradient() function; the name speaks for itself and answers the question nicely. The regression tree is then fitted to this residual, so that adding the tree's predictions drives the residual down; this is the idea behind residual regression:
            if X_csc is not None:
                tree.fit(X_csc, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
            else:
                tree.fit(X, residual, sample_weight=sample_weight,
                         check_input=False, X_idx_sorted=X_idx_sorted)
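A small sanity check of the "residual" terminology (my own sketch, not from the post): for the least-squares loss the negative gradient is literally y - pred, while for the classifier's default deviance loss it is a pseudo-residual. Assuming the internal module layout of the scikit-learn 0.17.x release browsed here (later releases moved these internals), the loss object can be exercised directly:

import numpy as np
from sklearn.ensemble.gradient_boosting import LeastSquaresError  # internal; path valid in 0.17.x

y = np.array([3.0, -1.0, 2.0, 0.5])
pred = np.array([[2.5], [0.0], [2.0], [1.0]])   # GBDT keeps predictions as a 2-D column

loss = LeastSquaresError(1)                      # a single "class" for regression
print(loss.negative_gradient(y, pred))           # -> [ 0.5 -1.   0.  -0.5]
print(y - pred.ravel())                          # identical: the plain residual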
OK, let's first find where this tree.fit function lives. If you are developing on Windows, I suggest installing Linux first.
#from .. import DecisionTreeRegressor
cd ..
ls
cd anaconda2
find -name sklearn
cd anaconda2/pkgs/scikit-learn-0.17.1-np111py27_2/lib/python2.7/site-packages/sklearn
cd ensemble
ls
vi gradient_boosting.py
That is how we find gradient_boosting.py.
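Incidentally, if you would rather not hunt through the filesystem by hand (and this also works on Windows), you can ask Python itself where the modules live. A small sketch, assuming the module layout of the scikit-learn 0.17.x release used in this post:

import inspect
from sklearn.ensemble import gradient_boosting      # module was renamed in much later releases
from sklearn.tree import DecisionTreeRegressor

print(inspect.getsourcefile(gradient_boosting))      # .../sklearn/ensemble/gradient_boosting.py
print(inspect.getsourcefile(DecisionTreeRegressor))  # .../sklearn/tree/tree.py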
cd ..
ls          # locate the tree folder
cd tree
vi tree.py  # open tree.py and find DecisionTreeRegressor
Following the same steps we find DecisionTreeRegressor:
class DecisionTreeRegressor(BaseDecisionTree, RegressorMixin):
    """A decision tree regressor.
    ...
    """

    def __init__(self,
                 criterion="mse",
                 splitter="best",
                 max_depth=None,
                 min_samples_split=2,
                 min_samples_leaf=1,
                 min_weight_fraction_leaf=0.,
                 max_features=None,
                 random_state=None,
                 max_leaf_nodes=None,
                 presort=False):
        super(DecisionTreeRegressor, self).__init__(
            criterion=criterion,
            splitter=splitter,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            min_weight_fraction_leaf=min_weight_fraction_leaf,
            max_features=max_features,
            max_leaf_nodes=max_leaf_nodes,
            random_state=random_state,
            presort=presort)
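Note that the parameters GBDT passes in earlier (criterion='friedman_mse', splitter='best', max_depth, ...) go straight into this constructor, so you can build the same kind of tree yourself. A small sketch on toy data (my own example, fitting a made-up residual vector the way a boosting stage would):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
residual = rng.randn(100)                    # stand-in for a negative gradient

tree = DecisionTreeRegressor(criterion='friedman_mse', splitter='best',
                             max_depth=3, random_state=0)
tree.fit(X, residual, sample_weight=np.ones(100))
print(tree.tree_.node_count)                 # the fitted tree GBDT would store in estimators_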
In the same way, we find the parent class it inherits from, BaseDecisionTree:
class BaseDecisionTree(six.with_metaclass(ABCMeta, BaseEstimator,
                                          _LearntSelectorMixin)):
    ...

    def fit(self, X, y, sample_weight=None, check_input=True,
            X_idx_sorted=None):
        """Build a decision tree from the training set (X, y)."""
        ...
At this point we have reached the tree.fit function from before. We know that GBDT is also built on decision trees (the ID3/C4.5/CART family; here it is a CART-style regression tree). RF and GBDT are two different strategies: RF uses bagging, whereas GBDT fits regression trees to drive down the residual of the label. Both kinds of forest rest on the same basic decision tree. For now we are not concerned with small implementation details, for example the code below:
        if is_classification:
            ...
        else:
            self.classes_ = [None] * self.n_outputs_
            self.n_classes_ = [1] * self.n_outputs_
Clearly this is just parameter assignment; we can come back and study it carefully once we have walked through the GBDT code as a whole. For now we focus only on how the tree itself is built.