The goal of feature selection is to pick out, from the original features, a subset that is likely to be more discriminative, in order to improve recognition performance. However, since the complexity of finding the optimal subset grows exponentially with the number of features, a commonly used approximate method is sequential forward selection, whose steps are as follows:
- First select the single feature that achieves the best recognition performance.
- From the remaining features, select the one that, used together with the already-selected features, achieves the best recognition performance.
- Repeat step 2 until the target number of selected features is reached, or performance starts to degrade instead of improve.
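The steps above can be sketched directly with a greedy loop. This is a minimal illustration, not scikit-learn's implementation: it uses 5-fold cross-validated accuracy as the selection criterion and stops when accuracy no longer improves (both choices are assumptions for the sake of the example).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

selected = []                       # indices of chosen features
remaining = list(range(X.shape[1]))
best_score = -np.inf

# Greedily add the feature that most improves cross-validated accuracy
while remaining:
    scores = [
        (cross_val_score(model, X[:, selected + [j]], y, cv=5).mean(), j)
        for j in remaining
    ]
    score, j = max(scores)
    if score <= best_score:         # stop when accuracy no longer improves
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print("Selected feature indices:", selected)
print("Best CV accuracy:", round(best_score, 4))
```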
We use scikit-learn's feature_selection.SequentialFeatureSelector to run the sequential forward selection algorithm, with the Wine Data Set as the dataset:
```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

dataset = load_wine()
print('Data shapes:', dataset.data.shape, dataset.target.shape)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)

model = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
sfs.fit(X_train, y_train)
print('Is selected:', sfs.get_support())
```

In the example above:
- The classifier that SequentialFeatureSelector uses to evaluate candidate subsets is up to you. In general, a relatively simple classifier is chosen to keep the computational cost down.
- In practice, always split off the training and test sets first, as in the example, and hand only the training set to SequentialFeatureSelector. Otherwise the selection process sees the test data, and the subsequent evaluation results will be overly optimistic.
- For finer-grained control, see the official documentation.
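Once fitted, the selector's transform method reduces both the training and test sets to the chosen columns, after which the classifier can be trained and evaluated as usual. A sketch continuing the setup above (the accuracy comparison is illustrative only, not a claim that selection always helps):

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

model = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(model, n_features_to_select=5)
sfs.fit(X_train, y_train)  # selection sees only the training data

# Reduce both sets with the fitted selector, then train and evaluate
model.fit(sfs.transform(X_train), y_train)
acc_selected = model.score(sfs.transform(X_test), y_test)

# Baseline: same classifier trained on all 13 features
model.fit(X_train, y_train)
acc_all = model.score(X_test, y_test)

print("Test accuracy with 5 selected features:", acc_selected)
print("Test accuracy with all features:", acc_all)
```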
There are also feature selection methods that do not depend on a particular model, such as variance thresholding, correlation coefficients, and the chi-square test. Below we introduce variance thresholding, whose goal is to remove features with low variance:
```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)

selector = VarianceThreshold(threshold=0.5)
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)

print("Selected feature dim:", X_train_selected.shape[1])
print("Is selected:", selector.get_support())
print("Selected feature name:",
      [dataset.feature_names[i] for i in np.where(selector.get_support())[0]])

variances = np.var(X_train, axis=0)
for i, (name, var) in enumerate(zip(dataset.feature_names, variances)):
    status = "Kept" if selector.get_support()[i] else "Removed"
    print(f"{status:8s} {name:30s} var={var:.4f}")
```
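The chi-square test mentioned above can be run in the same fashion through scikit-learn's SelectKBest with the chi2 scoring function. A minimal sketch, where k=5 is an arbitrary choice for illustration (note that chi2 requires non-negative feature values, which holds for this dataset):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

dataset = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

# Score each feature against the labels with the chi-square statistic,
# then keep the k highest-scoring features
selector = SelectKBest(chi2, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

print("Selected feature dim:", X_train_selected.shape[1])
print("Selected features:",
      [dataset.feature_names[i] for i in np.where(selector.get_support())[0]])
```

Unlike SequentialFeatureSelector, this scores each feature independently against the labels, so no classifier is involved in the selection.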