線上教材：音樂資訊分析

PCA 的全名是 principal components analysis，中文一般稱為「主成分分析」，是降低資料維度的方法之一。它的基本概念是先把資料平移到平均為 0 的位置，再找出變異數最大的方向為新的第一個軸；再找出與第一個軸垂直的方向當中，變異數最大的為第二個軸；依此類推。在數學上，也相當於令資料為 X（形狀為 n_samples * n_features）時，先求出 X^TX 的 eigenvectors，再拿去跟原始的資料相乘。

以下範例將分別使用 sklearn 和 np.linalg.eig 計算 PCA 的結果，以同時說明完全現成函式的使用，以及你可以如何自己計算：
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

dataset = load_wine()
print('Data shapes:', dataset.data.shape, dataset.target.shape)

X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target)
num_class = np.max(y_train) + 1
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)
print('Num classes:', num_class)

model = PCA(n_components=2)
X_train_skl = model.fit_transform(X_train)
X_test_skl = model.transform(X_test)
print('Transformed shapes (sklearn):', X_train_skl.shape, X_test_skl.shape)

X_train_mean = np.mean(X_train, axis=0)
X_train = X_train - X_train_mean[np.newaxis, :]
X_test = X_test - X_train_mean[np.newaxis, :]
eig_vals, eig_vecs = np.linalg.eig(X_train.T @ X_train)
X_train_np = X_train @ -eig_vecs
X_test_np = X_test @ -eig_vecs
print('eigen values:', eig_vals)
print('Transformed shapes (np.linalg.eig):', X_train_np.shape, X_test_np.shape)

plt.subplot(2, 2, 1)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_skl[idx, 0], X_train_skl[idx, 1], '.', label='Class {}'.format(i))
plt.legend()
plt.title('TR (sklearn)')

plt.subplot(2, 2, 2)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_skl[idx, 0], X_test_skl[idx, 1], '.')
plt.title('TE (sklearn)')

plt.subplot(2, 2, 3)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_np[idx, 0], X_train_np[idx, 1], '.')
plt.title('TR (np.linalg.eig)')

plt.subplot(2, 2, 4)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_np[idx, 0], X_test_np[idx, 1], '.')
plt.title('TE (np.linalg.eig)')

plt.show()
在上述範例中：

PCA 的運算相當依賴資料分布，而 train_test_split 會亂數切分資料，因此你每次看到的結果，通常會不太一樣。

降到二維是為了方便畫圖顯示，實際運用上可以視其他需求，來決定降到幾維。

使用 np.linalg.eig 計算的 eigenvectors 會乘上負號，是因為其內部實作細節與 sklearn 不同，導致 eigenvectors 的方向可能在大多數情況下恰好相反，而範例中為了圖片呈現上的一致，所以乘上負號。

對於影像類型的資料，我們可以用繪製的方式，來觀察看看 PCA 投影到的方向。以下範例是以 50 x 37 = 1850 維的人臉影像資料，找出前 150 個投影的方向，並繪製前 12 張原始圖片，以及前 12 個投影方向（此範例需要經由網路連線下載資料，若你的網路不順暢，或者伺服器有問題的話，可能會無法執行此範例）：
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

lfw_people = fetch_lfw_people(min_faces_per_person=50, resize=0.4)
print('Data shapes:', lfw_people.images.shape, lfw_people.data.shape)

model = PCA(n_components=150)
model.fit(lfw_people.data)
print(model.components_.shape)
eigenfaces = model.components_.reshape(
	(150, lfw_people.images.shape[1], lfw_people.images.shape[2])
)

plt.figure()
for i in range(12):
	plt.subplot(3, 4, i+1)
	plt.imshow(lfw_people.images[i])

plt.figure()
for i in range(12):
	plt.subplot(3, 4, i+1)
	plt.imshow(eigenfaces[i])

plt.show()
在上述範例中，model.components_ 是 150 個位於 1850 維空間中的向量（並且皆互相垂直），因此我們可以把任意一個向量 reshape 成 50 x 37，以當作影像來顯示。