In real-world data, many phenomena unfold in a temporal order rather than as independent observations: the daily closing price of a stock, hourly temperature readings from a weather station, per-second vibration values reported by factory sensors, monthly sales on an e-commerce platform, and so on. Data arranged in time order like this is called a time series. The core challenge of time series data is that adjacent observations are usually not independent but correlated to some degree; yesterday's stock price influences today's, and last month's sales may already reflect this month's demand trend. Whether the goal is forecasting future values, detecting anomalous time points, or understanding the periodic patterns behind the data, we therefore need analysis methods designed specifically for this temporal dependence. This article starts from the basic components and properties of time series, then introduces commonly used analysis and forecasting models, along with practical considerations.
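The temporal dependence described above can be made concrete with a small sketch (the synthetic data here is purely illustrative): a random walk, where each value builds on the previous one, has a lag-1 autocorrelation close to 1, while shuffling the very same values destroys the dependence:

```python
import numpy as np

# A synthetic series with strong dependence between adjacent values:
# a random walk, where y[t] = y[t-1] + noise.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0, 1, 200))

# Lag-1 autocorrelation: correlation between y[t-1] and y[t].
lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")  # close to 1

# Shuffling keeps the same values but destroys the temporal order,
# and the lag-1 correlation collapses toward 0.
shuffled = rng.permutation(y)
lag1_shuffled = np.corrcoef(shuffled[:-1], shuffled[1:])[0, 1]
print(f"after shuffling:       {lag1_shuffled:.3f}")
```

This is the property that makes ordinary i.i.d. assumptions, and naive random train/test splits, inappropriate for time series.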

Before analyzing or forecasting a time series, it is often useful to first decompose it into the sum of three components, trend, seasonality, and residual, so that the behavior of each part can be examined separately. The following example uses the yearly sunspot activity data to demonstrate this decomposition, comparing the result from the statsmodels library against one computed manually with a moving average:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import sunspots
from statsmodels.tsa.seasonal import seasonal_decompose

data = sunspots.load_pandas().data
y = data["SUNACTIVITY"].to_numpy()
years = data["YEAR"].to_numpy()
period = 11  # the sunspot cycle is roughly 11 years

result = seasonal_decompose(y, model="additive", period=period)

# Manual version: the trend is a centered moving average over one full
# period, so the first and last `pad` points have no complete window
# and are dropped.
pad = period // 2
trend_manual = np.array([
	np.mean(y[i - pad: i + pad + 1]) for i in range(pad, len(y) - pad)
])
years_valid = years[pad: len(y) - pad]
detrended = y[pad: len(y) - pad] - trend_manual

# The seasonal component averages the detrended values at each phase of
# the cycle, centered to mean zero as statsmodels does.
seasonal_manual = np.zeros(len(detrended))
for phase in range(period):
	idx = np.arange(phase, len(detrended), period)
	seasonal_manual[idx] = np.mean(detrended[idx])
seasonal_manual -= seasonal_manual.mean()
residual_manual = detrended - seasonal_manual

fig, axes = plt.subplots(4, 2)
titles_left = ["Original (statsmodels)", "Trend", "Seasonal", "Residual"]
titles_right = ["Original (manual)", "Trend", "Seasonal", "Residual"]

data_left = [y, result.trend, result.seasonal, result.resid]
data_right = [y, trend_manual, seasonal_manual, residual_manual]
x_left = [years, years, years, years]
x_right = [years, years_valid, years_valid, years_valid]

for i in range(4):
	axes[i, 0].plot(x_left[i], data_left[i])
	axes[i, 1].plot(x_right[i], data_right[i])
	axes[i, 0].set_title(titles_left[i])
	axes[i, 1].set_title(titles_right[i])

plt.show()

In the example above:

- seasonal_decompose with model="additive" treats the series as the sum of trend, seasonal, and residual components, and period=11 reflects the roughly 11-year sunspot cycle.
- The manual trend is a centered moving average over one full period, so the first and last period // 2 points have no complete window; the manual components are therefore only defined on the interior range (years_valid).
- The manual seasonal component averages the detrended values at each phase of the cycle, and the residual is whatever remains after both trend and seasonality are removed.

The additive model above assumes that the seasonal amplitude does not change with the trend. In some data, however, the amplitude grows as the overall level rises: the airline passenger data from 1949 to 1960 covers the industry's era of rapid growth, so passenger counts show not only a clear upward trend but also seasonal swings that widen along with it. In such cases a multiplicative model is more appropriate, decomposing the data into the product of trend, seasonal, and residual components. The following example performs a multiplicative decomposition of the airline passenger data, again comparing the statsmodels library with a manual computation:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from statsmodels.tsa.seasonal import seasonal_decompose

data = get_rdataset("AirPassengers").data
y = data["value"].to_numpy().astype(float)
period = 12  # monthly data with a yearly cycle

result = seasonal_decompose(y, model="multiplicative", period=period)

# Manual version: a simple 13-point centered moving average. Note that
# for an even period statsmodels uses a 2x12 weighted average instead,
# so the two trends differ slightly.
pad = period // 2
trend_manual = np.array([
	np.mean(y[i - pad: i + pad + 1]) for i in range(pad, len(y) - pad)
])
x_valid = np.arange(pad, len(y) - pad)
detrended = y[pad: len(y) - pad] / trend_manual

# Seasonal indices: the average detrended ratio at each month,
# normalized to mean one as statsmodels does.
seasonal_manual = np.zeros(len(detrended))
for phase in range(period):
	idx = np.arange(phase, len(detrended), period)
	seasonal_manual[idx] = np.mean(detrended[idx])
seasonal_manual /= seasonal_manual.mean()
residual_manual = detrended / seasonal_manual

fig, axes = plt.subplots(4, 2)
titles_left = ["Original (statsmodels)", "Trend", "Seasonal", "Residual"]
titles_right = ["Original (manual)", "Trend", "Seasonal", "Residual"]

x = np.arange(len(y))
data_left = [y, result.trend, result.seasonal, result.resid]
data_right = [y, trend_manual, seasonal_manual, residual_manual]
x_left = [x, x, x, x]
x_right = [x, x_valid, x_valid, x_valid]

for i in range(4):
	axes[i, 0].plot(x_left[i],  data_left[i])
	axes[i, 1].plot(x_right[i], data_right[i])
	axes[i, 0].set_title(titles_left[i])
	axes[i, 1].set_title(titles_right[i])

plt.show()

In the example above:

- model="multiplicative" treats the series as the product of trend, seasonal, and residual components, so detrending divides by the trend and the residual divides out the seasonal indices, rather than subtracting them.
- The seasonal indices hover around 1: values above 1 mark months that run above trend, values below 1 mark months below trend.
- The manual trend is a simple 13-point centered moving average, while statsmodels uses a 2x12 weighted average for even periods, so small differences between the two columns are expected.
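The case for the multiplicative model can also be checked numerically: grouping the airline data by year shows that the within-year swing widens sharply as the yearly level rises, which is the signature of seasonal amplitude scaling with the trend:

```python
import numpy as np
from statsmodels.datasets import get_rdataset

y = get_rdataset("AirPassengers").data["value"].to_numpy().astype(float)
by_year = y.reshape(-1, 12)  # 144 monthly values -> 12 rows of one year each

# Within-year swing (max - min) widens as the yearly level rises.
swing = by_year.max(axis=1) - by_year.min(axis=1)
level = by_year.mean(axis=1)
print(swing[0], swing[-1])   # the swing grows severalfold over the 12 years
print(swing / level)         # relative to the level, it grows far less
```

If the swing stayed constant while the level rose, the additive model would suffice; because it grows roughly in proportion to the level, the multiplicative model fits better.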

If you want to forecast time series data, many models are available. We begin with a classic statistical model, ARIMA (AutoRegressive Integrated Moving Average). ARIMA is a widely used forecasting model suited to univariate series that have a trend but no pronounced seasonality; when the data also contains seasonality, it can be extended to SARIMA. The following example uses the airline passenger data to demonstrate train/test splitting, rolling (walk-forward) forecasting, and the computation of common evaluation metrics:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import get_rdataset
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error

data = get_rdataset("AirPassengers").data
y = np.log(data["value"].to_numpy().astype(float))  # log transform tames the growing seasonal amplitude
n = len(y)

# Hold out the last two years (24 months) for evaluation.
test_size = 24
train = y[:n - test_size]
test = y[n - test_size:]

# Walk-forward forecasting: refit on all data seen so far, forecast one
# step ahead, then append the true observed value to the history.
predictions = []
history = list(train)
for i in range(test_size):
	model = ARIMA(history, order=(2, 1, 2))  # (p, d, q): 2 AR terms, 1 difference, 2 MA terms
	result = model.fit()
	pred = result.forecast(steps=1)[0]
	predictions.append(pred)
	history.append(test[i])

predictions = np.array(predictions)

# Transform back to the original passenger scale before computing metrics.
y_true = np.exp(test)
y_pred = np.exp(predictions)

mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")

x_all = np.arange(n)
x_train = np.arange(n - test_size)
x_test = np.arange(n - test_size, n)

plt.plot(x_train, np.exp(train), label="Train")
plt.plot(x_test, y_true, label="Test (ground truth)")
plt.plot(x_test, y_pred, label="Prediction")
plt.legend()
plt.show()

In the example above:

- The series is log-transformed before fitting so that the growing seasonal amplitude becomes roughly constant, and predictions are mapped back with np.exp before scoring on the original scale.
- order=(2, 1, 2) specifies 2 autoregressive terms, 1 order of differencing (to remove the trend), and 2 moving-average terms.
- The loop performs walk-forward validation: each iteration refits the model on all data observed so far, forecasts one step ahead, and then appends the true test value to the history.
- MAE and RMSE are in units of passengers, while MAPE expresses the average error as a percentage of the true value.

Beyond classical statistical models, deep learning models and tree-based models can also be used to forecast time series. The following example uses the sunspot dataset to show how to apply XGBoost to time series data:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.datasets import sunspots
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

data = sunspots.load_pandas().data
y = data["SUNACTIVITY"].to_numpy().astype(float)
n = len(y)

test_size = 44
train = y[:n-test_size]
test  = y[n-test_size:]


def make_features(series, n_lags):
	# Turn the series into a supervised problem: the features of each
	# sample are the previous n_lags values, the target is the current one.
	X, y_out = [], []
	for i in range(n_lags, len(series)):
		X.append(series[i - n_lags: i])
		y_out.append(series[i])
	return np.array(X), np.array(y_out)


n_lags = 11  # one lag window covering the ~11-year sunspot cycle
X_train, y_train = make_features(train, n_lags)

model = XGBRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Walk-forward prediction: forecast one step from the latest n_lags
# observations, then append the true test value to the history.
predictions = []
history = list(train)
for i in range(test_size):
	x = np.array(history[-n_lags:]).reshape(1, -1)
	pred = model.predict(x)[0]
	predictions.append(pred)
	history.append(test[i])
predictions = np.array(predictions)

mae = mean_absolute_error(test, predictions)
rmse = mean_squared_error(test, predictions) ** 0.5
mape = np.mean(np.abs((test - predictions) / (test + 1e-8))) * 100  # epsilon guards against zero activity
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAPE: {mape:.2f}%")

years = data["YEAR"].astype(int)
plt.plot(years[:n-test_size], train, label="Train")
plt.plot(years[n-test_size:], test, label="Test (ground truth)")
plt.plot(years[n-test_size:], predictions, label="Prediction")
plt.xlabel("Year")
plt.legend()
plt.show()

In the example above:

- make_features converts the series into a supervised learning problem: each sample's features are the previous n_lags values and its target is the next value, which is how tree models such as XGBoost can consume time series.
- n_lags=11 gives the model one full sunspot cycle of context.
- The prediction loop is walk-forward: each step forecasts one value from the latest 11 observations and then appends the true test value to the history.
- A small epsilon (1e-8) is added to the MAPE denominator because sunspot activity can be exactly zero.

In addition, time series data calls for some special handling in practice. When the data contains missing values, adjacent time points are usually continuous, so filling a gap from the neighbouring valid values tends to be more appropriate than filling it with a global statistic. And when data comes from different sources, or is sampled at different frequencies, time alignment and resampling may be needed, for example aggregating daily data into monthly totals, before further analysis becomes straightforward.

In recent years, so-called Time Series Foundation Models (TSFM) have also emerged. The idea parallels large language models in NLP: pretrain on large amounts of time series data from many different domains, after which the model can make zero-shot forecasts on data it has never seen, without retraining for a specific dataset. Representative options currently include Google's TimesFM, Amazon's Chronos-2, and Salesforce's Moirai-2; their rankings on the major benchmarks trade places, and all three are still being actively updated. If your scenario suits this kind of model, consider using one as a quick starting baseline for forecasting, and fine-tune later as needed.
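As a minimal sketch of the gap-filling and resampling points above (the dates and values here are made up purely for illustration), pandas covers both with interpolate and resample:

```python
import numpy as np
import pandas as pd

# A hypothetical daily series with a few missing observations.
idx = pd.date_range("2024-01-01", periods=90, freq="D")
s = pd.Series(np.sin(np.arange(90) / 7.0) * 10 + 50, index=idx)
s.iloc[[10, 11, 40]] = np.nan

# Time-based interpolation fills each gap from its neighbouring points,
# rather than from a global statistic such as the overall mean.
filled = s.interpolate(method="time")

# Resample the daily data into monthly totals for coarser analysis.
monthly = filled.resample("MS").sum()
print(filled.isna().sum())   # 0: all interior gaps filled
print(monthly.index[:3])
```

For gaps at the very start or end of a series, interpolation has no neighbour on one side, so a forward/backward fill or domain knowledge is needed instead.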