Gausian Process Regression with Covariance Product and Sum#

The atmospheric CO2 concentration readings in parts per million (ppm) by volume from air samples are collected continuously by the Mauno Loa observatory in Hawaii. The data are freely available at ftp: // ftp.cmdl.noaa.gov/ccg/co2/trends/. http://scrippsco2.ucsd.edu/data/atmospheric_co2/mlo

This example was first presented in the book of Rasmussen & Williams [1] and serves as a tutorial for kernel composition, see for instance - https://docs.pymc.io/notebooks/GP-MaunaLoa.html - https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html - http://dfm.io/george/current/user/hyper/

The modelling and the SDE representation of Arno Solin and Simo Särkkä [2], which differs from the book of Rasmussen & Williams [3], is used here.

[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from pysip.regressors import Regressor
from pysip.statespace import Matern32, Matern52, Periodic

Load and prepare the Mauna Loa CO2 data set#

[2]:
df = pd.read_csv(
    "../../../data/mauna_loa/monthly_in_situ_co2_mlo.csv",
    comment='"',
    header=0,
    skipinitialspace=True,
)

df = df[["Date.1", "CO2"]].set_index("Date.1")
df.index.name = "Date"
df = df[df.CO2 != -99.99]

# Use data until 2010 for the fit and the rest for the prediction
C02_mean = df["CO2"].mean()
df["CO2_fit"] = df["CO2"] - C02_mean
df["CO2_pred"] = df["CO2_fit"]
df["CO2_pred"][df.index >= 2010] = np.nan

Initialize parameters and models#

[3]:
# Matérn 5/2 for slow rising trend (long-term effects)
p1 = [
    dict(name="mscale", value=2.295e03, transform="log"),
    dict(name="lscale", value=4.864e02, transform="log"),
    dict(name="sigv", value=2.210e-01, transform="log"),
]

# Periodic * Matérn 3/2 for quasi-periodic variations
p2 = [
    dict(name="period", value=1.0, transform="fixed"),
    dict(name="mscale", value=2.833e00, transform="log"),
    dict(name="lscale", value=1.341e00, transform="log"),
    dict(name="sigv", value=0.0, transform="fixed"),
]

p3 = [
    dict(name="mscale", value=1.0, transform="fixed"),
    dict(name="lscale", value=2.607e02, transform="log"),
    dict(name="sigv", value=0.0, transform="fixed"),
]

# Matérn 3/2 for short-term effects
p4 = [
    dict(name="mscale", value=4.595e-01, transform="log"),
    dict(name="lscale", value=6.359e-01, transform="log"),
    dict(name="sigv", value=0.0, transform="fixed"),
]

k1 = Matern52(p1, name="k1")
k2 = Periodic(p2, name="k2")
k3 = Matern32(p3, name="k3")
k4 = Matern32(p4, name="k4")

# Compose covariance function
K = k1 + k2 * k3 + k4
K
[3]:
GPSum(hold_order=0, method='mfd', name='k1__+__k2__x__k3__+__k4')

Initialize the regressor#

[4]:
reg = Regressor(K, outputs="CO2_fit")

Do the frequentist fit#

[5]:
fit_summary, corr_matrix, opt_summary = reg.fit(df=df)
fit_summary
[ ]:
tnew = np.arange(1958, 2030, 0.01)
ds = reg.predict(df=df, smooth=True, tnew=tnew)
[ ]:
# Plot output mean and 95% credible intervals
ym = ds["y_mean"].sel(outputs="CO2_fit")
ysd = ds["y_std"].sel(outputs="CO2_fit")


plt.plot(tnew, C02_mean + ym, color="navy", label="mean")
plt.fill_between(
    tnew,
    C02_mean + ym - 2 * ysd,
    C02_mean + ym + 2 * ysd,
    color="darkblue",
    alpha=0.2,
    label=r"95% CI",
)
plt.plot(df.index, df["CO2"], color="darkred", label="data")
plt.tight_layout()
plt.legend(loc="best", fancybox=True, framealpha=0.5)
ax = plt.gca()
ax.set_xlabel("year")
ax.set_ylabel("CO2 [ppm]")
Text(38.347222222222214, 0.5, 'CO2 [ppm]')
../../_images/examples_gaussian_process_Mauna_Loa_CO2_11_1.png