In computer science, a tree is a data structure made of nodes and edges. It starts from a root node, which branches out to other nodes connected by edges; those nodes can branch out further, and a node that no longer branches is called a leaf node. As illustrated below:

In machine learning, such a tree can itself be a model: at each non-leaf node we test whether some feature satisfies some condition, follow one branch if it does and the other branch if it does not, and repeat until we reach a leaf node. There are many tree-based models, and the more complex ones even combine the results of multiple trees — for example random forest, XGBoost, LightGBM, CatBoost, and so on. This article demonstrates XGBoost.
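The traversal described above can be sketched in a few lines of Python (the node layout, feature indices, and threshold values here are hypothetical, purely for illustration):

```python
# A node is either a leaf holding a prediction, or an internal node that
# tests one feature against a threshold and routes the sample left/right.
def predict(node, sample):
    while 'leaf' not in node:
        branch = 'left' if sample[node['feature']] <= node['threshold'] else 'right'
        node = node[branch]
    return node['leaf']

# A tiny hand-built tree: split on feature 0 first, then feature 1.
tree = {
    'feature': 0, 'threshold': 0.5,
    'left': {'leaf': 'A'},
    'right': {
        'feature': 1, 'threshold': 2.0,
        'left': {'leaf': 'B'},
        'right': {'leaf': 'C'},
    },
}

print(predict(tree, [0.3, 9.9]))  # feature 0 <= 0.5, so leaf 'A'
print(predict(tree, [0.8, 1.0]))  # go right, then feature 1 <= 2.0, so leaf 'B'
```

A real gradient-boosted tree works the same way, except the splits are learned from data and the leaves hold numeric scores rather than labels.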

XGBoost works for both classification and regression problems. The following example starts with a classification problem:

import numpy as np
from sklearn.datasets import make_moons
from xgboost import XGBClassifier

x_train, y_train = make_moons(n_samples=800, shuffle=True, noise=0.1)
x_test, y_test = make_moons(n_samples=200, shuffle=True, noise=0.1)

param = {
	'n_estimators': 5,
	'max_depth': 3,
	'learning_rate': 0.01,
}

model = XGBClassifier(**param)
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(f'Accuracy: {np.mean(pred == y_test)*100:.2f}%')

In the example above, n_estimators is the number of boosted trees, max_depth limits the depth of each tree, and learning_rate scales how much each new tree contributes to the final prediction.

If you want to track how the loss changes during training, pass the relevant datasets to fit, as follows:

import numpy as np
from sklearn.datasets import make_moons
from xgboost import XGBClassifier

x_train, y_train = make_moons(n_samples=800, shuffle=True, noise=0.1)
x_val, y_val = make_moons(n_samples=200, shuffle=True, noise=0.1)
x_test, y_test = make_moons(n_samples=200, shuffle=True, noise=0.1)

param = {
	'n_estimators': 5,
	'max_depth': 3,
	'learning_rate': 0.01,
}

model = XGBClassifier(**param)
model.fit(
	x_train,
	y_train,
	eval_set=[
		(x_val, y_val),
		(x_train, y_train),
	],
)
pred = model.predict(x_test)
print(f'Accuracy: {np.mean(pred == y_test)*100:.2f}%')

Scikit-learn classifiers usually do not let you define your own loss function, but XGBoost does. The following regression example shows how to write one:

import numpy as np
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

x_train, y_train = make_regression(n_samples=1600, shuffle=True, noise=0.01)
x_val, y_val = make_regression(n_samples=400, shuffle=True, noise=0.01)
x_test, y_test = make_regression(n_samples=400, shuffle=True, noise=0.01)

param = {
	'n_estimators': 50,
	'max_depth': 5,
	'learning_rate': 0.01,
}

model = XGBRegressor(**param)
model.fit(
	x_train,
	y_train,
	eval_set=[
		(x_val, y_val),
		(x_train, y_train),
	],
)
pred = model.predict(x_test)
print(f'RMSE: {np.mean((pred - y_test) ** 2) ** 0.5:.4f}')


def my_loss(label, pred):
	"""Squared error: loss = (label - pred) ** 2."""
	grad = -2 * (label - pred)  # first derivative of the loss w.r.t. pred
	hess = 2 * np.ones_like(pred)  # second derivative of the loss w.r.t. pred
	return grad, hess


param['objective'] = my_loss
model = XGBRegressor(**param)
model.fit(
	x_train,
	y_train,
	eval_set=[
		(x_val, y_val),
		(x_train, y_train),
	],
)
pred = model.predict(x_test)
print(f'RMSE: {np.mean((pred - y_test) ** 2) ** 0.5:.4f}')

In the example above, my_loss receives the labels and the current predictions, and returns the gradient and Hessian of the loss with respect to the predictions; assigning it to the objective key makes XGBoost minimize this custom loss in place of the default squared error.
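Before handing a custom objective to XGBoost, it is worth sanity-checking the analytic gradient against a finite-difference estimate of the loss. A standalone sketch for the squared-error my_loss above (the sample values are arbitrary):

```python
import numpy as np

def my_loss(label, pred):
	"""Squared error: loss = (label - pred) ** 2."""
	grad = -2 * (label - pred)
	hess = 2 * np.ones_like(pred)
	return grad, hess

rng = np.random.default_rng(0)
label = rng.normal(size=5)
pred = rng.normal(size=5)
eps = 1e-6

# Central finite difference of the loss with respect to pred.
loss = lambda p: (label - p) ** 2
numeric_grad = (loss(pred + eps) - loss(pred - eps)) / (2 * eps)

grad, hess = my_loss(label, pred)
print(np.max(np.abs(grad - numeric_grad)))  # should be tiny
```

If the analytic and numeric gradients disagree, XGBoost will still run, but the boosting steps will head in the wrong direction, so this one-line check is cheap insurance.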

Note that tree models used for regression are unsuitable when extrapolation is needed: if the test data contains values outside the range of the training data, a tree model will perform poorly. This is because a tree model routes each data point to some leaf based on its feature values, and when used for regression, a leaf's prediction is the mean of the training samples that fall in that leaf. The model can therefore only output values within the range of the training data, and can never predict anything larger or smaller than what it has seen. An example:

import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X_train = np.linspace(0, 10, 50).reshape(-1, 1)
y_train = 2 * X_train.ravel() + 1 + np.random.randn(50) * 0.1

X_test = np.linspace(-5, 15, 200).reshape(-1, 1)
y_test = 2 * X_test.ravel() + 1 + np.random.randn(200) * 0.1

model_xgb = XGBRegressor(n_estimators=50, max_depth=5)
model_lr = LinearRegression()

model_xgb.fit(X_train, y_train)
model_lr.fit(X_train, y_train)

pred_in_xgb = model_xgb.predict(X_train)
pred_in_lr = model_lr.predict(X_train)

pred_out_xgb = model_xgb.predict(X_test)
pred_out_lr = model_lr.predict(X_test)

plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data', alpha=0.3)
plt.plot(X_train, pred_in_xgb, label='XGB inside test')
plt.plot(X_test, pred_out_xgb, label='XGB outside test')
plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
plt.axvline(x=10, color='black', linestyle='--', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('XGBoost Regression')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data', alpha=0.3)
plt.plot(X_train, pred_in_lr, label='LR inside test')
plt.plot(X_test, pred_out_lr, label='LR outside test')
plt.axvline(x=0, color='black', linestyle='--', alpha=0.5)
plt.axvline(x=10, color='black', linestyle='--', alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Linear Regression')
plt.legend()

plt.show()

In XGBoost, feature importance is a powerful analysis tool: it ranks how much each feature contributes to the predictions. With it we can not only see which key variables drive the model's decisions, but also use the ranking as a basis for subsequent feature engineering and dimensionality reduction:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, plot_importance

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
print('Shapes:', X.shape, y.shape)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param = {
	'n_estimators': 50,  
	'max_depth': 3,
	'learning_rate': 0.1,
}

model = XGBClassifier(**param)
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(f'Accuracy: {np.mean(pred == y_test)*100:.2f}%')

fea_dim = model.feature_importances_.shape[0]
"""
Weight(預設):該特徵在所有樹中被用來分割節點的「總次數」。
Gain:該特徵在分割時帶來的「平均資訊增益」,通常最能反映特徵對模型的貢獻。
Cover:受該特徵分割影響的「樣本數量」。
"""
importance = model.get_booster().get_score(importance_type='gain')
importance = np.array([importance.get('f'+str(d), 0) for d in range(fea_dim)]).astype('float32')
for imp, name in zip(importance, feature_names):
	print(name, imp)

Feature importance can only tell us which features matter; SHAP (SHapley Additive exPlanations) can additionally tell us how a feature influences the prediction for a single case. SHAP is not limited to tree models, but we will demonstrate it with one here:

import numpy as np
import matplotlib.pyplot as plt
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
print('Shapes:', X.shape, y.shape)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param = {
    'n_estimators': 50,  
    'max_depth': 3,
    'learning_rate': 0.1,
}

model = XGBClassifier(**param)
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(f'Accuracy: {np.mean(pred == y_test)*100:.2f}%')

explainer = shap.TreeExplainer(model)
shap_values = explainer(x_test)
shap_values.feature_names = feature_names
shap.summary_plot(shap_values, x_test)