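The goal of feature selection is to pick, from the original features, a subset that is likely to be more discriminative, in the hope of improving recognition performance. Because the cost of finding the optimal subset grows exponentially with the number of features, a commonly used approximation is sequential forward selection, which proceeds as follows: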

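  1. Select the single feature that gives the best recognition performance.
  2. From the remaining features, select the one that gives the best recognition performance when used together with the features already selected.
  3. Repeat step 2 until the target number of features is reached, or until recognition performance drops instead of improving.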
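
The three steps above can be sketched from scratch as follows. This is a minimal illustration, not the implementation used by any library: the helper name forward_select, the use of 5-fold cross-validated accuracy as the measure of "recognition performance", and the 3-nearest-neighbor classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_select(estimator, X, y, max_features, cv=5):
    """Greedy sequential forward selection (hypothetical helper for illustration)."""
    selected = []                        # feature indices chosen so far, in order
    remaining = list(range(X.shape[1]))  # feature indices not yet chosen
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        # Steps 1 and 2: score every candidate together with the selected set.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        candidate = max(scores, key=scores.get)
        # Step 3: stop early if even the best candidate lowers the score.
        if scores[candidate] < best_score:
            break
        best_score = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best_score


X, y = load_wine(return_X_y=True)
chosen, score = forward_select(
    KNeighborsClassifier(n_neighbors=3), X, y, max_features=5
)
print('Selected feature indices:', chosen)
print('Cross-validated accuracy:', round(score, 4))
```

Each round evaluates every remaining feature once, so selecting k out of d features costs on the order of k times d model evaluations instead of one evaluation per possible subset.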

We use scikit-learn's feature_selection.SequentialFeatureSelector to run the sequential forward selection algorithm on the Wine Data Set:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

dataset = load_wine()
print('Data shapes:', dataset.data.shape, dataset.target.shape)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)

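# A 3-nearest-neighbor classifier scores each candidate feature subset
model = KNeighborsClassifier(n_neighbors=3)
# direction='forward' is the default; it is spelled out here for clarity
sfs = SequentialFeatureSelector(model, n_features_to_select=5, direction='forward')

sfs.fit(X_train, y_train)

# Boolean mask marking which features were selected
print('Is selected:', sfs.get_support())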

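In the example above, KNeighborsClassifier(n_neighbors=3) is the estimator used to score candidate feature subsets, n_features_to_select=5 ends the search once five features have been chosen, and get_support() returns a boolean mask marking the selected features. SequentialFeatureSelector scores each candidate by cross-validation on the training data, so the test set plays no part in the selection.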
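
To see whether the selected subset actually helps, one option is to compare test accuracy with and without selection. The sketch below repeats the setup from the example so that it runs on its own; the variable names acc_all and acc_sel are introduced here for illustration, and the outcome depends on the split, so no particular result is implied.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

model = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(model, n_features_to_select=5, direction='forward')
sfs.fit(X_train, y_train)

# Names of the features kept by the selector
mask = sfs.get_support()
selected_names = [name for name, keep in zip(dataset.feature_names, mask) if keep]
print('Selected features:', selected_names)

# Baseline: 3-NN trained on all features
acc_all = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test)

# 3-NN trained on the selected features only
X_train_sel = sfs.transform(X_train)
X_test_sel = sfs.transform(X_test)
acc_sel = KNeighborsClassifier(n_neighbors=3).fit(X_train_sel, y_train).score(X_test_sel, y_test)

print(f'Test accuracy, all features:      {acc_all:.4f}')
print(f'Test accuracy, selected features: {acc_sel:.4f}')
```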

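Some feature selection methods do not depend on any particular model, for example variance thresholding, correlation coefficients, and the chi-squared test. Below we look at variance thresholding, whose goal is to remove features with low variance: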

import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)

# Keep only features whose training-set variance exceeds 0.5
selector = VarianceThreshold(threshold=0.5)
X_train_selected = selector.fit_transform(X_train)
# Apply the mask learned on the training set to the test set
X_test_selected = selector.transform(X_test)

print("Selected feature dim:", X_train_selected.shape[1])
print("Is selected:", selector.get_support())
print("Selected feature name:", [dataset.feature_names[i] for i in np.where(selector.get_support())[0]])

# Print the variance of every feature next to the selector's decision
variances = np.var(X_train, axis=0)
for i, (name, var) in enumerate(zip(dataset.feature_names, variances)):
    status = "Kept" if selector.get_support()[i] else "Removed"
    print(f"{status:8s} {name:30s} var={var:.4f}")
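
The chi-squared test mentioned earlier can be applied in a similar way. The sketch below assumes scikit-learn's SelectKBest with the chi2 score function; k=5 is an arbitrary value chosen for illustration. The chi2 score is meant for non-negative features such as counts or frequencies, so using it on the continuous Wine measurements only demonstrates the API and is not a recommendation.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

# chi2 requires non-negative feature values
assert (X_train >= 0).all()

# k=5 is an arbitrary choice for illustration
selector = SelectKBest(score_func=chi2, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Print the chi-squared score of every feature next to the selector's decision
for name, score, keep in zip(dataset.feature_names, selector.scores_, selector.get_support()):
    status = "Kept" if keep else "Removed"
    print(f"{status:8s} {name:30s} chi2={score:.2f}")
```

Unlike variance thresholding, this method uses the class labels, so fit_transform takes y_train as well.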