貝氏定理的公式如下,其中 x 代表模型輸入(也就是 features),y 代表模型輸出:
P(y|x) = P(x|y)P(y) / p(x)
而在這之中,只要特徵固定了,p(x) 就是固定的常數,因此可以將公式變更為以下:
P(y|x) ∝ P(x|y)P(y)
而如果你的特徵有多個維度,並假設它們之間是獨立的,則又可以改寫為如下的樣子:
P(y|x) ∝ P(y) Πi P(xi|y)
因此,模型的預測結果,也就是能讓 P(y|x) 有最大值的 y,可以寫成:
ŷ = argmaxy P(y) Πi P(xi|y)
而高斯貝氏分類器,就是將公式中的 P(xi|y) 以高斯分布來表示的分類器。我們使用 scikit-learn 的 naive_bayes.GaussianNB 來幫我們執行高斯貝氏分類演算法,使用的資料集是 Wine Data Set,該資料集在 scikit-learn 有內建一版,你可以直接透過 import 相關函式來使用,不必自己下載:
import numpy as np from sklearn.datasets import load_wine from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB dataset = load_wine() print('Data shapes:', dataset.data.shape, dataset.target.shape) X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target) classes, counts = np.unique(y_train, return_counts=True) print('Training data info:', X_train.shape, y_train.shape, classes, counts) print('Test data shapes:', X_test.shape, y_test.shape) model = GaussianNB() model.fit(X_train, y_train) pred = model.predict(X_test) print(f'Accuracy (sklearn): {100*np.mean(pred==y_test):.2f}%') n_class = classes.size mean_train = np.zeros((n_class, X_train.shape[1])) sigma_train = np.zeros((n_class, X_train.shape[1])) log_class_probs = np.zeros(n_class) for c_idx in range(n_class): idx = np.where(y_train == c_idx)[0] mean_train[c_idx] = np.mean(X_train[idx], axis=0) sigma_train[c_idx] = np.var(X_train[idx], axis=0) + 1e-9 log_class_probs[c_idx] = np.log(np.mean(y_train == c_idx) + 1e-9) predictions = np.zeros((y_test.shape[0], n_class)) for i, p in enumerate(X_test): for c_idx in range(n_class): log_p_x_given_y = np.sum( -0.5 * np.log(2 * np.pi * sigma_train[c_idx]) - \ (p - mean_train[c_idx]) ** 2 / (2 * sigma_train[c_idx]) ) predictions[i][c_idx] = log_class_probs[c_idx] + log_p_x_given_y predictions = np.argmax(predictions, axis=1) print(f'Accuracy (self-implemented): {100*np.mean(predictions == y_test):.2f}%') print(f'Are two sets of predictions equal: {100*np.mean(predictions == pred)}%')在上述範例中:
- 訓練完畢後,model.theta_ 的內容會是每個類別的每個特徵的平均,model.var_ 或 model.sigma_ 則是對應的 variance。
- 由於範例中的資料是亂數切分,所以每次執行所得到的結果會不太一樣。
- 除了使用 sklearn 以外,也同時示範了如何依照公式來實作,以便讓各位更容易了解內部的演算方式。
我們可以把上述範例,改成只使用前兩維(或一維)的資料,來進行高斯貝氏分類器的效果視覺化。你可以看到等高線形成的橢圓,其長短軸都平行於座標軸,就是因為高斯貝氏分類器假設特徵之間獨立:
import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import load_wine from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from scipy.stats import multivariate_normal dataset = load_wine() print('Data shapes:', dataset.data.shape, dataset.target.shape) X = dataset.data[:, :2] X_train, X_test, y_train, y_test = train_test_split(X, dataset.target) classes, counts = np.unique(y_train, return_counts=True) print('Training data info:', X_train.shape, y_train.shape, classes, counts) print('Test data shapes:', X_test.shape, y_test.shape) model = GaussianNB() model.fit(X_train, y_train) pred = model.predict(X_test) print(f'Accuracy (sklearn): {100*np.mean(pred==y_test):.2f}%') x_min, x_max = X_train[:,0].min()-1, X_train[:,0].max()+1 y_min, y_max = X_train[:,1].min()-1, X_train[:,1].max()+1 xx, yy = np.mgrid[x_min:x_max:0.01, y_min:y_max:0.01] pos = np.dstack((xx, yy)) for i, c in enumerate(classes): mu = model.theta_[i] var = model.var_[i] Sigma = np.diag(var) rv = multivariate_normal(mu, Sigma) z = rv.pdf(pos) plt.contour(xx, yy, z) plt.scatter(X_train[:,0], X_train[:,1], c=y_train, s=10) plt.title("GaussianNB class contours (first 2 features)") plt.show()在前面的範例中,雖然我們沒有檢驗過特徵之間是否獨立,但仍然直接使用了貝氏分類器,而且效果看似還不錯,這是因為對於分類任務來說,我們並不在意機率值是否精準,而只要相對大小的關係正確即可;以音樂分類的狀況來說,節奏快慢跟音量大小之間可能具有高度相關性,但我們想要找出搖滾樂時,則只要 只要 P(搖滾|{快, 大聲}) 大於 P(古典|{快,大聲}),模型的分類結果就是正確的。
當然,我們還是有機會碰到高斯貝氏分類器遇上非獨立特徵就不靈光的狀況;這時候,你可能就必須考慮更換分類器等做法,例如以下範例:
import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis from scipy.stats import multivariate_normal np.random.seed(89) mu1 = [0, 0] mu2 = [3, 0] Sigma = [ [3, 2], [2, 3] ] X1 = np.random.multivariate_normal(mu1, Sigma, 200) X2 = np.random.multivariate_normal(mu2, Sigma, 200) X = np.vstack((X1, X2)) y = np.array([0]*200 + [1]*200) print('Data shapes:', X.shape, y.shape) X_train, X_test, y_train, y_test = train_test_split(X, y) classes, counts = np.unique(y_train, return_counts=True) print('Training data info:', X_train.shape, y_train.shape, classes, counts) print('Test data shapes:', X_test.shape, y_test.shape) # ===== GNB ===== gnb = GaussianNB() gnb.fit(X_train, y_train) pred = gnb.predict(X_test) print(f'Accuracy (GNB): {100*np.mean(pred==y_test):.2f}%') # ===== QDA ===== qda = QuadraticDiscriminantAnalysis(store_covariance=True) qda.fit(X_train, y_train) pred = qda.predict(X_test) print(f'Accuracy (QDA): {100*np.mean(pred==y_test):.2f}%') # ===== contour ===== x_min, x_max = X_train[:,0].min()-1, X_train[:,0].max()+1 y_min, y_max = X_train[:,1].min()-1, X_train[:,1].max()+1 xx, yy = np.mgrid[x_min:x_max:0.01, y_min:y_max:0.01] pos = np.dstack((xx, yy)) # ===== GNB ===== plt.figure() for i, c in enumerate(classes): mu = gnb.theta_[i] var = gnb.var_[i] Sigma_gnb = np.diag(var) rv = multivariate_normal(mu, Sigma_gnb) z = rv.pdf(pos) plt.contour(xx, yy, z, colors='blue', linestyles='dashed') plt.scatter(X_train[:,0], X_train[:,1], c=y_train, s=10) plt.title("GNB contours") # ===== QDA ===== plt.figure() for i, c in enumerate(classes): mu = qda.means_[i] Sigma_qda = qda.covariance_[i] rv = multivariate_normal(mu, Sigma_qda) z = rv.pdf(pos) plt.contour(xx, yy, z, colors='red') plt.scatter(X_train[:,0], X_train[:,1], c=y_train, s=10) plt.title("QDA contours") plt.show()然而,仍需注意的是,因為我們只把每類資料(的每個維度)用一個高斯分布來表示,因此如果你的資料中,每個類別會各自分成很多小群的話,則高斯貝氏分類就不適用。以下的範例中,在二維平面上,第一和第三象限的點是一個類別,第二和第四象限的點是另外一個類別,每個類別各自分成了兩群,因此套用高斯貝氏分類的效果也會很差:
import numpy as np import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split from sklearn.naive_bayes import GaussianNB X = np.random.randn(1000, 2) y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0).astype(int) X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) model = GaussianNB() model.fit(X_train, y_train) pred = model.predict(X_test) print(f'Accuracy (sklearn): {100*np.mean(pred==y_test):.2f}%')