LDA stands for linear discriminant analysis, and it is one of the standard methods for reducing the dimensionality of data. Because LDA takes class labels into account, it can also be used for classification, but since this post is about dimensionality reduction, we will only cover how to use LDA for that purpose. The basic idea is to find projection directions that maximize the separation between classes while minimizing the spread within each class. Mathematically, this amounts to defining a between-class scatter matrix Σbetween and a (summed) within-class scatter matrix Σwithin, computing the eigenvectors of Σwithin⁻¹Σbetween, and then multiplying the original data by those eigenvectors.
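In symbols, with m_k samples in class k, class mean μ_k, overall mean μ, and classes indexed by k, these two matrices are conventionally defined as:

```latex
S_{\mathrm{within}} = \sum_{k} \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^\top,
\qquad
S_{\mathrm{between}} = \sum_{k} m_k \, (\mu_k - \mu)(\mu_k - \mu)^\top
```

and the projection directions are the eigenvectors w with the largest eigenvalues λ of S_within⁻¹ S_between w = λ w.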

The following example computes the LDA result with both sklearn and numpy, to show how to use the off-the-shelf implementation as well as how you can do the computation yourself:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split

dataset = load_wine()
print('Data shapes:', dataset.data.shape, dataset.target.shape)

X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target)
num_class = np.max(y_train) + 1
print('Training data shapes:', X_train.shape, y_train.shape)
print('Test data shapes:', X_test.shape, y_test.shape)
print('Num classes:', num_class)

model = LDA(n_components=2)
X_train_skl = model.fit_transform(X_train, y_train)
X_test_skl = model.transform(X_test)
print('Transformed shapes (sklearn):', X_train_skl.shape, X_test_skl.shape)

# Shape: (1, n_feature)
mu = np.mean(X_train, axis=0, keepdims=True)

# Accumulate sum_k m_k * mu_k^T mu_k, which appears in both scatter matrices below.
mk_muk_mukt = 0
for k in range(num_class):
	m_k = np.sum(y_train == k)
	mu_k = np.mean(X_train[np.where(y_train == k)[0]], axis=0, keepdims=True)
	mk_muk_mukt += m_k * mu_k.T @ mu_k

# S_within = X^T X - sum_k m_k mu_k^T mu_k; S_between = sum_k m_k mu_k^T mu_k - N mu^T mu.
sigma_within_inv = np.linalg.inv(X_train.T @ X_train - mk_muk_mukt)
sigma_between = mk_muk_mukt - X_train.shape[0] * mu.T @ mu

eig_vals, eig_vecs = np.linalg.eig(sigma_within_inv @ sigma_between)
# The matrix above is not symmetric, so np.linalg.eig may return the
# eigenpairs in arbitrary order and with tiny imaginary parts; keep the
# real part and sort by eigenvalue in descending order, then project onto
# the top two eigenvectors to match n_components=2 above.
order = np.argsort(eig_vals.real)[::-1]
eig_vals = eig_vals.real[order]
eig_vecs = eig_vecs.real[:, order]

X_train_np = X_train @ eig_vecs[:, :2]
X_test_np = X_test @ eig_vecs[:, :2]
print('eigen values:', eig_vals)
print('Transformed shapes (np.linalg.eig):', X_train_np.shape, X_test_np.shape)

plt.subplot(2, 2, 1)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_skl[idx, 0], X_train_skl[idx, 1], '.', label='Class {}'.format(i))
plt.legend()
plt.title('TR (sklearn)')

plt.subplot(2, 2, 2)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_skl[idx, 0], X_test_skl[idx, 1], '.')
plt.title('TE (sklearn)')

plt.subplot(2, 2, 3)
for i in range(num_class):
	idx = np.where(y_train == i)[0]
	plt.plot(X_train_np[idx, 0], X_train_np[idx, 1], '.')
plt.title('TR (np.linalg.eig)')

plt.subplot(2, 2, 4)
for i in range(num_class):
	idx = np.where(y_test == i)[0]
	plt.plot(X_test_np[idx, 0], X_test_np[idx, 1], '.')
plt.title('TE (np.linalg.eig)')

plt.show()
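The shortcut used above to build the within-class scatter matrix (X^T X minus the accumulated m_k μ_kᵀμ_k term) may look opaque at first. A minimal sketch on toy data (the array shapes and labels here are made up for illustration) checks it against the direct definition, a sum of centered outer products per class:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))     # toy data: 30 samples, 4 features
y = np.repeat(np.arange(3), 10)  # 3 classes, 10 samples each

# Direct definition: sum over classes of centered outer products.
s_within_direct = np.zeros((4, 4))
for k in range(3):
	X_k = X[y == k]
	X_c = X_k - X_k.mean(axis=0)
	s_within_direct += X_c.T @ X_c

# Shortcut used in the example: X^T X - sum_k m_k mu_k^T mu_k.
mk_muk_mukt = np.zeros((4, 4))
for k in range(3):
	X_k = X[y == k]
	mu_k = X_k.mean(axis=0, keepdims=True)
	mk_muk_mukt += len(X_k) * mu_k.T @ mu_k
s_within_shortcut = X.T @ X - mk_muk_mukt

print(np.allclose(s_within_direct, s_within_shortcut))  # prints True
```

The two agree because, within each class, Σ(x − μ_k)(x − μ_k)ᵀ expands to Σ x xᵀ − m_k μ_k μ_kᵀ, and summing over classes turns the first term into XᵀX.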

In the example above: