Generally, we split the entire dataset into training, validation, and test sets. With cross-validation, the dataset is instead split into only training and test sets; the training set is then divided into k folds, and in each of k rounds one fold is used for validation while the remaining k - 1 folds are used for training. The advantage of cross-validation is that it uses the data more efficiently, giving the training process a chance to select hyperparameters (and the like) against different portions of the data. In practice, scikit-learn's model_selection.cross_validate can run cross-validation for you (and model_selection.GridSearchCV combines it with a hyperparameter search), but to explain the details of cross-validation itself, this article does not rely on those utilities.
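
For reference, a minimal sketch of that library-based route might look like the following (the remaining examples in this article perform the splits by hand instead):

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
model = KNeighborsClassifier(n_neighbors=1)

# 3-fold cross-validation; scores['test_score'] holds one accuracy value per fold.
scores = cross_validate(model, dataset.data, dataset.target, cv=3)
print('Average acc:', 100 * np.mean(scores['test_score']))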

The following is a cross-validation example. It uses 80% of the wine dataset as the training set and splits that training set into three folds for cross-validation. Try changing the model's hyperparameters and see whether the average accuracy changes:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
total_data_num = dataset.data.shape[0]
print('Data shapes:', dataset.data.shape, dataset.target.shape)

# Samples whose index is not a multiple of 5 form the training set (about 80% of the data).
idx_tr = np.where(np.arange(total_data_num) % 5 != 0)[0]
x_train = dataset.data[idx_tr]
y_train = dataset.target[idx_tr]
tr_data_num = x_train.shape[0]
print('Training data shapes:', x_train.shape, y_train.shape)

N_FOLD = 3
accuracies = []
# In fold i, samples whose index % N_FOLD == i are held out for validation;
# the remaining samples are used for training in that round.
for i in range(N_FOLD):
	idx_fold_tr = np.where(np.arange(tr_data_num) % N_FOLD != i)[0]
	idx_fold_va = np.where(np.arange(tr_data_num) % N_FOLD == i)[0]
	model = KNeighborsClassifier(n_neighbors=1)
	model.fit(x_train[idx_fold_tr], y_train[idx_fold_tr])
	pred = model.predict(x_train[idx_fold_va])
	acc = 100 * np.mean(pred == y_train[idx_fold_va])
	accuracies.append(acc)
print('Average acc:', np.mean(accuracies))
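
If you would rather not edit and rerun the script by hand, you can wrap the fold loop above in an outer loop over candidate hyperparameter values. A minimal sketch follows; the candidate values for n_neighbors are only an illustrative assumption:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
total_data_num = dataset.data.shape[0]

idx_tr = np.where(np.arange(total_data_num) % 5 != 0)[0]
x_train = dataset.data[idx_tr]
y_train = dataset.target[idx_tr]
tr_data_num = x_train.shape[0]

N_FOLD = 3
for k in [1, 3, 5, 7, 9]:  # Candidate values for n_neighbors, chosen arbitrarily for illustration.
	accuracies = []
	for i in range(N_FOLD):
		idx_fold_tr = np.where(np.arange(tr_data_num) % N_FOLD != i)[0]
		idx_fold_va = np.where(np.arange(tr_data_num) % N_FOLD == i)[0]
		model = KNeighborsClassifier(n_neighbors=k)
		model.fit(x_train[idx_fold_tr], y_train[idx_fold_tr])
		pred = model.predict(x_train[idx_fold_va])
		accuracies.append(100 * np.mean(pred == y_train[idx_fold_va]))
	print('n_neighbors =', k, 'average acc:', np.mean(accuracies))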

In the example above, the hyperparameter search is done by manually editing and rerunning the program; you can also brute-force the search with a loop, as in the sketch above, or simply use scikit-learn's own utilities. Either way, once cross-validation has identified the hyperparameters that achieve the best accuracy (or whatever evaluation metric you care about), you can retrain a model on all of the training data with those hyperparameters and see how it performs on the test set. The following example assumes that n_neighbors = 5 turned out to be the best:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
total_data_num = dataset.data.shape[0]
print('Data shapes:', dataset.data.shape, dataset.target.shape)

idx_tr = np.where(np.arange(total_data_num) % 5 != 0)[0]
x_train = dataset.data[idx_tr]
y_train = dataset.target[idx_tr]
print('Training data shapes:', x_train.shape, y_train.shape)

# The held-out samples (index divisible by 5) form the test set (the remaining 20% of the data).
idx_te = np.where(np.arange(total_data_num) % 5 == 0)[0]
x_test = dataset.data[idx_te]
y_test = dataset.target[idx_te]
print('Test data shapes:', x_test.shape, y_test.shape)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)
pred = model.predict(x_test)
acc = 100 * np.mean(pred == y_test)
print('Test acc:', acc)

In fact, cross-validation lets you vary more than just the hyperparameters; even the choice of model itself can change. Furthermore, instead of retraining a single model on all of the training data with the best hyperparameters you found, you can simply keep every model trained during cross-validation, have all of them predict on the test set, and aggregate the results with a statistic such as the mean or median. When the best hyperparameters differ noticeably from fold to fold, this can yield more stable predictions. A simple example follows:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
total_data_num = dataset.data.shape[0]
print('Data shapes:', dataset.data.shape, dataset.target.shape)

idx_tr = np.where(np.arange(total_data_num) % 5 != 0)[0]
x_train = dataset.data[idx_tr]
y_train = dataset.target[idx_tr]
tr_data_num = x_train.shape[0]
print('Training data shapes:', x_train.shape, y_train.shape)

idx_te = np.where(np.arange(total_data_num) % 5 == 0)[0]
x_test = dataset.data[idx_te]
y_test = dataset.target[idx_te]
print('Test data shapes:', x_test.shape, y_test.shape)

N_FOLD = 3
models = []
accuracies = []
for i in range(N_FOLD):
	idx_fold_tr = np.where(np.arange(tr_data_num) % N_FOLD != i)[0]
	idx_fold_va = np.where(np.arange(tr_data_num) % N_FOLD == i)[0]
	model = KNeighborsClassifier(n_neighbors=1)
	model.fit(x_train[idx_fold_tr], y_train[idx_fold_tr])
	pred = model.predict(x_train[idx_fold_va])
	acc = 100 * np.mean(pred == y_train[idx_fold_va])
	models.append(model)
	accuracies.append(acc)
print('Average acc:', np.mean(accuracies))

pred_all = []
for m in models:
	pred = m.predict(x_test)
	pred_all.append(pred)

pred_all = np.array(pred_all)  # Shape: (N_FOLD, data_num)
# For integer class labels, taking the median across the three fold models acts as a
# simple vote: it returns the majority label whenever at least two models agree.
pred_all = np.median(pred_all, axis=0)  # Shape: (data_num,)
print('Test acc:', 100 * np.mean(pred_all == y_test))

If you apply the examples above in a competition or research setting, you may want to save the cross-validated models first, then have a separate script load only the models that validated well and use them to predict on the test set, so that you do not waste time predicting with models that performed poorly. To save and load models, you can use Python's built-in pickle library.
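
A minimal sketch of saving and reloading the fold models with pickle might look like the following; the file name cv_models.pkl is just a placeholder, and you could store the accuracies list alongside the models to help pick the better folds later:

import pickle

# After cross-validation: save the list of fold models to disk.
with open('cv_models.pkl', 'wb') as f:
	pickle.dump(models, f)

# In a separate script: load the models back and use them to predict on the test set.
with open('cv_models.pkl', 'rb') as f:
	models = pickle.load(f)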