Suppressing Overfitting in Random Forest Regression

Introduction

Random forests can be used for both classification and regression. This post covers regression.

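The annoying thing about random forests is that, being built from decision trees, they overfit badly if you are not careful. This is a problem in classification too, but it is even more of one in regression. So in this post I try out a countermeasure against overfitting.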

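For the classification case, see this article:
ランダムフォレストで分類するときの過学習対策の検討 (Examining overfitting countermeasures when classifying with random forests) - 静かなる名辞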

What we will do

In the classification article, I tried parameters such as

max_depth
min_samples_leaf
min_samples_split

and concluded that they all give roughly the same results, but that min_samples_leaf is probably the easiest one to understand.

This time I skip the detailed comparison and only look at how much changes when min_samples_leaf alone is adjusted. That is partly because of the earlier results, but also because the official documentation itself mentions the effect:

min_samples_leaf : int, float, optional (default=1)
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
3.2.4.3.2. sklearn.ensemble.RandomForestRegressor - scikit-learn 0.21.3 documentation

(Bold emphasis is mine.)
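To make the documentation's wording concrete, here is a minimal sketch that inspects the fitted trees and reports the smallest leaf in each one. The helper name smallest_leaf_sizes, the synthetic sine data, and the choice of bootstrap=False (so that every tree sees each training sample exactly once and the counts are easy to read) are my own assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def smallest_leaf_sizes(min_samples_leaf, seed=0):
    """Fit a small forest and return the smallest leaf size found in each tree."""
    rng = np.random.RandomState(seed)
    # Hypothetical toy data: one noisy sine wave on a single feature
    X = rng.uniform(0, 15, size=(300, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.4, size=300)

    forest = RandomForestRegressor(
        n_estimators=5,
        min_samples_leaf=min_samples_leaf,
        bootstrap=False,  # assumption: no resampling, to keep the counts simple
        random_state=seed,
    )
    forest.fit(X, y)

    sizes = []
    for estimator in forest.estimators_:
        tree = estimator.tree_
        # Leaf nodes are the ones without a left child (marked with -1)
        is_leaf = tree.children_left == -1
        sizes.append(int(tree.n_node_samples[is_leaf].min()))
    return sizes


if __name__ == "__main__":
    for leaf in [1, 4, 16]:
        print(leaf, smallest_leaf_sizes(leaf))
```

Each leaf predicts the mean of the training targets that land in it, so forcing more samples into every leaf averages over more noisy points. That is one way to read the "smoothing" the documentation refers to.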

First, sine-wave-like data

I experiment on data made by mixing sine waves of different periods and amplitudes and then sprinkling normally distributed noise on top.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

def main():
    # Training data: two mixed sine waves plus Gaussian noise
    X = np.arange(0, 15, 0.025).reshape(-1, 1)
    Y = np.sin(X) + 0.3*np.sin(5*X) + np.random.normal(scale=0.4, size=X.shape)
    # Dense grid and the noise-free curve, used for plotting
    X_test = np.arange(0, 15, 0.0005).reshape(-1, 1)
    Y_true = np.sin(X_test) + 0.3*np.sin(5*X_test)

    # First panel: the true curve. Remaining five panels: one fit per setting.
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 8))
    axes[0,0].plot(X_test.ravel(), Y_true.ravel(), c="b")
    axes[0,0].set_title("original(true)")
    for ax, min_samples_leaf in zip(axes.ravel()[1:], [1, 2, 4, 8, 16]):
        rfr = RandomForestRegressor(
            n_estimators=100, min_samples_leaf=min_samples_leaf)
        rfr.fit(X, Y.ravel())
        predict = rfr.predict(X_test)
        # Noisy training points in green, forest prediction in blue
        ax.scatter(X.ravel(), Y.ravel(), c="g")
        ax.plot(X_test.ravel(), predict, c="b")
        ax.set_title("min_samples_leaf\n={}".format(min_samples_leaf))
    plt.tight_layout()
    plt.savefig("result.png")

if __name__ == "__main__":
    main()

　
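As min_samples_leaf grows, the model loses expressive power and becomes less affected by the noise. For this data, around min_samples_leaf=8 looks good (this is of course data-dependent, so the value has to be chosen for each dataset).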
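Judging the fit by eye is a bit subjective, so here is a hedged sketch that puts a number on it: the mean squared error between each forest's prediction and the noise-free curve. The function name error_vs_true_curve, the fixed seed, and the coarser test grid (step 0.005) are my own choices for this illustration and not part of the original experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def true_curve(x):
    """Noise-free target: two sine waves with different period and amplitude."""
    return (np.sin(x) + 0.3 * np.sin(5 * x)).ravel()


def error_vs_true_curve(leaf_sizes=(1, 2, 4, 8, 16), seed=0):
    """Return {min_samples_leaf: MSE against the noise-free curve}."""
    rng = np.random.RandomState(seed)
    X = np.arange(0, 15, 0.025).reshape(-1, 1)
    y_noisy = true_curve(X) + rng.normal(scale=0.4, size=X.shape[0])

    # Assumption: a coarser grid than the plotting code, just to keep this quick
    X_test = np.arange(0, 15, 0.005).reshape(-1, 1)
    y_true = true_curve(X_test)

    errors = {}
    for leaf in leaf_sizes:
        rfr = RandomForestRegressor(
            n_estimators=100, min_samples_leaf=leaf, random_state=seed)
        rfr.fit(X, y_noisy)
        errors[leaf] = mean_squared_error(y_true, rfr.predict(X_test))
    return errors


if __name__ == "__main__":
    for leaf, mse in error_vs_true_curve().items():
        print("min_samples_leaf={} mse={:.4f}".format(leaf, mse))
```

This only works because the true curve is known for synthetic data. On real data there is no such reference, and a held-out split or cross-validation has to stand in for it.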

Trying scikit-learn datasets

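Next, I try datasets bundled with scikit-learn. Note that boston is a regression dataset, whereas breast cancer is really a binary classification dataset; here its 0/1 labels are simply treated as numeric regression targets.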

# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this script needs an older scikit-learn release to run as written.
from sklearn.datasets import load_breast_cancer, load_boston
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

def main():
    breast_cancer = load_breast_cancer()
    boston = load_boston()

    for name, dataset in zip(["breast cancer", "boston"],
                             [breast_cancer, boston]):
        print(name)
        X, y = dataset.data, dataset.target
        # Hold out a test split (default 25%) with a fixed seed
        X_train, X_test, y_train, y_test \
            = train_test_split(X, y, random_state=0)

        for min_samples_leaf in [1, 2, 4, 8]:
            rfr = RandomForestRegressor(
                n_estimators=500, min_samples_leaf=min_samples_leaf,
                random_state=0)
            rfr.fit(X_train, y_train)
            # score() returns R^2 on the held-out data
            score = rfr.score(X_test, y_test)
            print(("min_samples_leaf={} "
                   "score={:.4f}").format(
                       min_samples_leaf, score))

if __name__ == "__main__":
    main()

Results

breast cancer
min_samples_leaf=1 score=0.8723
min_samples_leaf=2 score=0.8901
min_samples_leaf=4 score=0.8896
min_samples_leaf=8 score=0.8797
boston
min_samples_leaf=1 score=0.7988
min_samples_leaf=2 score=0.7748
min_samples_leaf=4 score=0.7389
min_samples_leaf=8 score=0.7278

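On the breast cancer dataset, min_samples_leaf=2 gives the best score (R^2 on the test split), so suppressing overfitting paid off in generalization performance. On boston it does not help: the score drops steadily as min_samples_leaf grows, which is the pattern where higher expressive power wins even if the model is somewhat overfit.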
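One way to see how much a given setting is overfitting is to put the training R^2 next to the test R^2. The sketch below does that under a few assumptions of mine: it uses load_diabetes as a stand-in regression dataset (not one used in this post, chosen only because it ships with current scikit-learn), 100 trees instead of 500, and a hypothetical helper name train_test_scores.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def train_test_scores(leaf_sizes=(1, 2, 4, 8), seed=0):
    """Return a list of (min_samples_leaf, train R^2, test R^2) tuples."""
    # Assumption: diabetes is a substitute dataset, not one from this post
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed)

    rows = []
    for leaf in leaf_sizes:
        rfr = RandomForestRegressor(
            n_estimators=100, min_samples_leaf=leaf, random_state=seed)
        rfr.fit(X_train, y_train)
        rows.append((leaf,
                     rfr.score(X_train, y_train),
                     rfr.score(X_test, y_test)))
    return rows


if __name__ == "__main__":
    for leaf, train_r2, test_r2 in train_test_scores():
        print("min_samples_leaf={} train={:.4f} test={:.4f}".format(
            leaf, train_r2, test_r2))
```

A large gap between the two columns suggests the forest is fitting noise in the training set. Whether closing that gap also raises the test score is a separate question, as the boston numbers above show.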

As you can see, no general conclusion comes out of this, so the parameter has to be tuned for each dataset. You could say that is just how machine learning goes.
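For the per-dataset tuning itself, a cross-validated grid search is one common approach. What follows is only a sketch under assumptions of mine: load_diabetes stands in for "your data", the candidate grid and 5-fold setting are arbitrary, and tune_min_samples_leaf is a hypothetical helper name.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def tune_min_samples_leaf(candidates=(1, 2, 4, 8, 16), seed=0):
    """Pick min_samples_leaf by 5-fold cross-validated R^2."""
    # Assumption: diabetes is a placeholder for whatever data you actually have
    X, y = load_diabetes(return_X_y=True)
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=50, random_state=seed),
        param_grid={"min_samples_leaf": list(candidates)},
        cv=5,
        scoring="r2",
    )
    search.fit(X, y)
    return search.best_params_["min_samples_leaf"], search.best_score_


if __name__ == "__main__":
    best_leaf, best_r2 = tune_min_samples_leaf()
    print("best min_samples_leaf={} cv r2={:.4f}".format(best_leaf, best_r2))
```

Compared with a single train/test split, cross-validation makes the choice less sensitive to which samples happen to land in the test set, at the cost of fitting the forest several times per candidate.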

Summary

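In regression, just as in classification, overfitting can be suppressed through parameter tuning. Whether that actually improves performance, though, is hard to say in general.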