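PCA stands for principal components analysis (usually rendered as 主成分分析 in Chinese) and is one of the standard methods for reducing the dimensionality of data. The basic idea is to first shift the data so that its mean is 0, then take the direction of largest variance as the first new axis; among the directions perpendicular to the first axis, take the one with the largest variance as the second axis; and so on. Mathematically, if the centered data is X (with shape n_samples x n_features), this is equivalent to computing the eigenvectors of X^T X, ordering them by decreasing eigenvalue, and multiplying the centered data by them.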
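As a small illustration of the description above, the following sketch runs the three steps (center, eigen-decompose X^T X, project) on made-up 2-D data. The toy data and all variable names are assumptions for illustration only, and np.linalg.eigh is used here because X^T X is symmetric; it is not necessarily how any particular library implements PCA.

```python
import numpy as np

# Hypothetical toy data: 200 correlated 2-D points (not from the original text).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])

# Step 1: shift the data so that every feature has mean 0.
Xc = X - X.mean(axis=0)

# Step 2: eigen-decompose X^T X. eigh is meant for symmetric matrices and
# returns eigenvalues in ascending order, so reverse to put the largest first.
eig_vals, eig_vecs = np.linalg.eigh(Xc.T @ Xc)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]

# Step 3: multiply the centered data by the eigenvectors to get new coordinates.
Z = Xc @ eig_vecs

# The variance along each new axis equals eigenvalue / (n_samples - 1),
# so the first axis carries the largest variance.
axis_variances = Z.var(axis=0, ddof=1)
print('Variance along new axes:', axis_variances)
print('Eigenvalues / (n - 1):  ', eig_vals / (len(X) - 1))

# The off-diagonal entries of Z^T Z are ~0: the new coordinates are uncorrelated.
print('Z^T Z:')
print(np.round(Z.T @ Z, 6))
```

The printed variances should match the scaled eigenvalues, which is the numerical counterpart of "the direction of largest variance is the leading eigenvector".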

The following example computes PCA with both sklearn and np.linalg.eig, to show how to use the ready-made function as well as how you can do the computation yourself:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

dataset = load_wine()
print('Data shapes:', dataset.data.shape, dataset.target.shape)

X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target)
num_class = np.max(y_train) + 1
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)
print('Num classes:', num_class)

model = PCA(n_components=2)
X_train_skl = model.fit_transform(X_train)
X_test_skl = model.transform(X_test)
print('Transformed shapes (sklearn):', X_train_skl.shape, X_test_skl.shape)

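# Center both sets with the mean of the training data only.
X_train_mean = np.mean(X_train, axis=0)
X_train = X_train - X_train_mean[np.newaxis, :]
X_test = X_test - X_train_mean[np.newaxis, :]
# np.linalg.eig does not guarantee any ordering, so sort by decreasing eigenvalue.
eig_vals, eig_vecs = np.linalg.eig(X_train.T @ X_train)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
# The sign of an eigenvector is arbitrary; negating it only mirrors the plot.
X_train_np = X_train @ -eig_vecs
X_test_np = X_test @ -eig_vecs
print('Eigenvalues:', eig_vals)
print('Transformed shapes (np.linalg.eig):', X_train_np.shape, X_test_np.shape)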

plt.subplot(2, 2, 1)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_skl[idx, 0], X_train_skl[idx, 1], '.', label='Class {}'.format(i))
plt.legend()
plt.title('TR (sklearn)')

plt.subplot(2, 2, 2)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_skl[idx, 0], X_test_skl[idx, 1], '.')
plt.title('TE (sklearn)')

plt.subplot(2, 2, 3)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_np[idx, 0], X_train_np[idx, 1], '.')
plt.title('TR (np.linalg.eig)')

plt.subplot(2, 2, 4)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_np[idx, 0], X_test_np[idx, 1], '.')
plt.title('TE (np.linalg.eig)')

plt.show()

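In the example above, the mean is computed from the training data only and reused to center the test data. sklearn keeps just the first two components, whereas the np.linalg.eig version keeps every eigenvector and only the first two columns are plotted. The eigenvectors are negated before projecting, presumably so that the plots line up with the sklearn ones; the sign of an eigenvector is arbitrary, so either choice is a valid PCA result.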
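To check numerically that the two approaches agree, a minimal sketch along the following lines can be used. It skips the train/test split for brevity and uses its own variable names (X_skl, X_np, same_up_to_sign, same_variance), which are assumptions and do not appear in the example above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

X = load_wine().data
n_samples = X.shape[0]

# sklearn: centers the data internally and keeps the first two components.
model = PCA(n_components=2)
X_skl = model.fit_transform(X)

# Manual: center, eigen-decompose X^T X, sort by decreasing eigenvalue, project.
Xc = X - X.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eig(Xc.T @ Xc)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
X_np = Xc @ eig_vecs[:, :2]

# Each column may differ by a factor of -1, so compare absolute values.
same_up_to_sign = np.allclose(np.abs(X_skl), np.abs(X_np), atol=1e-6)

# sklearn's explained_variance_ is the eigenvalue divided by (n_samples - 1).
same_variance = np.allclose(model.explained_variance_,
                            eig_vals[:2] / (n_samples - 1))

print('Projections match up to sign:', same_up_to_sign)
print('Explained variance matches eigenvalues:', same_variance)
```

Comparing absolute values is a deliberate choice here: flipping the sign of a component mirrors the scatter plot but does not change the variance along that axis.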

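For image data, we can plot the directions that PCA projects onto in order to inspect them. The following example takes face images of 50 x 37 = 1850 dimensions, finds the first 150 projection directions, and plots the first 12 original images alongside the first 12 projection directions. (This example downloads its data over the network, so it may fail to run if your connection is unstable or the server has problems.)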

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

lfw_people = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
print('Data shapes:', lfw_people.images.shape, lfw_people.data.shape)

model = PCA(n_components=150)
model.fit(lfw_people.data)
print(model.components_.shape)
eigenfaces = model.components_.reshape(
	(150, lfw_people.images.shape[1], lfw_people.images.shape[2])
)

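# First 12 original face images.
plt.figure()
for i in range(12):
	plt.subplot(3, 4, i+1)
	plt.imshow(lfw_people.images[i], cmap='gray')

# First 12 projection directions, each reshaped back to the image size.
plt.figure()
for i in range(12):
	plt.subplot(3, 4, i+1)
	plt.imshow(eigenfaces[i], cmap='gray')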

plt.show()

In the example above, model.components_ holds 150 vectors in the 1850-dimensional space, all mutually perpendicular, so any one of them can be reshaped to 50 x 37 and displayed as an image.
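The perpendicularity claim can be checked directly. The sketch below is only an illustration under a stated assumption: it swaps in sklearn's small bundled digits dataset (8 x 8 images) in place of the face data so that it runs without a network connection, and the choice of 16 components is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): 1797 images of 8 x 8 pixels, flattened to 64 features.
digits = load_digits()

model = PCA(n_components=16)
model.fit(digits.data)

# The rows of components_ are unit-length and mutually perpendicular,
# so components_ @ components_.T is numerically the identity matrix.
gram = model.components_ @ model.components_.T
is_orthonormal = np.allclose(gram, np.eye(16), atol=1e-8)
print('Components are orthonormal:', is_orthonormal)

# Any row can be reshaped back to the image size and passed to plt.imshow.
first_direction = model.components_[0].reshape(digits.images.shape[1:])
print('Shape of one direction as an image:', first_direction.shape)
```

The same check should apply to the face example by replacing the data with lfw_people.data and the identity size with 150.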