Hands-on 07: Autoencoders for anomaly detection#

This week, we will look at autoencoders for anomaly detection.

The goal is to train an autoencoder to reconstruct QCD (background) jets, which are plentiful at the LHC. Then we will apply it to top quark (signal) jets to see if the reconstruction error is larger. The reconstruction error can then be used as an anomaly score in real data.

This autoencoder architecture used is inspired by this paper: https://arxiv.org/abs/1808.08992. Autoencoder

You may need to install the JetNet library if you don’t already have it.

pip install --user jetnet

Download the dataset#

We will use a validation dataset of 400k jets, which is plenty for our purposes. The full dataset is available at https://doi.org/10.5281/zenodo.2603255.

import jetnet

im_size = 16
jet_r = 0.8
max_jets = 50000

# download the validation data (400k jets, which is plenty for our purposes)
# full dataset is available here: https://doi.org/10.5281/zenodo.2603255
data = jetnet.datasets.TopTagging(
    jet_type="all",
    particle_features=["E", "px", "py", "pz"],
    jet_features=["type"],
    split="valid",
    data_dir="data/",
    particle_transform=jetnet.utils.cartesian_to_relEtaPhiPt,
)
/home/runner/miniconda3/envs/phys139/lib/python3.10/site-packages/jetnet/utils/utils.py:7: FutureWarning: In version 2024.7.0 (target date: 2024-06-30 11:59:59-05:00), this will be an error.
To raise these warnings as errors (and get stack traces to find out where they're called), run
    import warnings
    warnings.filterwarnings("error", module="coffea.*")
after the first `import coffea` or use `@pytest.mark.filterwarnings("error:::coffea.*")` in pytest.
Issue: coffea.nanoevents.methods.vector will be removed and replaced with scikit-hep vector. Nanoevents schemas internal to coffea will be migrated. Otherwise please consider using that package!.
  from coffea.nanoevents.methods import vector
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 9
      5 max_jets = 50000
      7 # download the validation data (400k jets, which is plenty for our purposes)
      8 # full dataset is available here: https://doi.org/10.5281/zenodo.2603255
----> 9 data = jetnet.datasets.TopTagging(
     10     jet_type="all",
     11     particle_features=["E", "px", "py", "pz"],
     12     jet_features=["type"],
     13     split="valid",
     14     data_dir="data/",
     15     particle_transform=jetnet.utils.cartesian_to_relEtaPhiPt,
     16 )

File ~/miniconda3/envs/phys139/lib/python3.10/site-packages/jetnet/datasets/toptagging.py:80, in TopTagging.__init__(self, jet_type, data_dir, particle_features, jet_features, particle_normalisation, jet_normalisation, particle_transform, jet_transform, num_particles, split, download)
     77 if jet_features == "all":
     78     jet_features = copy(self.ALL_JET_FEATURES)
---> 80 self.particle_data, self.jet_data = self.getData(
     81     jet_type, data_dir, particle_features, jet_features, num_particles, split, download
     82 )
     84 super().__init__(
     85     data_dir=data_dir,
     86     particle_features=particle_features,
   (...)
     92     num_particles=num_particles,
     93 )
     95 self.jet_type = jet_type

File ~/miniconda3/envs/phys139/lib/python3.10/site-packages/jetnet/datasets/toptagging.py:155, in TopTagging.getData(cls, jet_type, data_dir, particle_features, jet_features, num_particles, split, download)
    152 jet_data = []
    154 for s in split:
--> 155     hdf5_file = checkDownloadZenodoDataset(
    156         data_dir,
    157         dataset_name=cls._SPLIT_KEY_MAPPING[s],
    158         record_id=cls._ZENODO_RECORD_ID,
    159         key=f"{cls._SPLIT_KEY_MAPPING[s]}.h5",
    160         download=download,
    161     )
    163     data = np.array(pd.read_hdf(hdf5_file, key="table"))
    165     # select only specified types of jets (qcd or top or both)

File ~/miniconda3/envs/phys139/lib/python3.10/site-packages/jetnet/datasets/utils.py:128, in checkDownloadZenodoDataset(data_dir, dataset_name, record_id, key, download)
    125         download_progress_bar(file_url, file_path)
    127 if not file_path.is_file():
--> 128     raise RuntimeError(
    129         f"Dataset {dataset_name} not found at {file_path}, "
    130         "you can use download=True to download it."
    131     )
    133 match_md5, fmd5 = _check_md5(file_path, md5)
    134 if not match_md5:

RuntimeError: Dataset val not found at data/val.h5, you can use download=True to download it.

Transform and split the data#

The data is originally up to 200 particles per jet (zero-padded), and the features are the standard 4-vectors \((E, p_x, p_y, p_z)\). We can assume the particles are massless so \(E=\sqrt{p_x^2+p_y^2+p_z^2}\) and there are only 3 degrees of freedom.

We will transform to relative coordinates centered on the jet using the function jetnet.utils.cartesian_to_relEtaPhiPt:

\begin{align} \eta^\mathrm{rel} &=\eta^\mathrm{particle} - \eta^\mathrm{jet}\ \phi^\mathrm{rel} &=\phi^\mathrm{particle} - \phi^\mathrm{jet} \pmod{2\pi}\ p_\mathrm{T}^\mathrm{rel} &= p_\mathrm{T}^\mathrm{particle}/p_\mathrm{T}^\mathrm{jet} \end{align}

import numpy as np

indices = np.random.permutation(np.arange(len(data)))[:max_jets]
# transform the data
transformed_particle_data = data.particle_transform(data.particle_data[indices])
# split qcd background and top quark signal
qcd_data = transformed_particle_data[data.jet_data[indices][:, 0] == 0]
top_data = transformed_particle_data[data.jet_data[indices][:, 0] == 1]
from sklearn.model_selection import train_test_split

qcd_train, qcd_test = train_test_split(qcd_data, test_size=0.2, random_state=42)
top_train, top_test = train_test_split(top_data, test_size=0.2, random_state=42)

Convert data to jet images#

import numpy as np

#  convert full dataset
qcd_train_images = np.expand_dims(jetnet.utils.to_image(qcd_train, im_size=im_size, maxR=jet_r), axis=-1)
qcd_test_images = np.expand_dims(jetnet.utils.to_image(qcd_test, im_size=im_size, maxR=jet_r), axis=-1)
top_test_images = np.expand_dims(jetnet.utils.to_image(top_test, im_size=im_size, maxR=jet_r), axis=-1)

# rescale so sum is 1 (it should be close already)
qcd_train_images = qcd_train_images / np.sum(qcd_train_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)
qcd_test_images = qcd_test_images / np.sum(qcd_test_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)
top_test_images = top_test_images / np.sum(top_test_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)

Visualize the jet images#

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm


def plot_jet_images(images, titles, filename="jet_image.pdf"):

    n_images = len(images)
    plt.figure(figsize=(5 * n_images, 5))

    for i, (image, title) in enumerate(zip(images, titles)):
        plt.subplot(1, n_images, i + 1)
        plt.title(title)
        plt.imshow(image, origin="lower", norm=LogNorm(vmin=1e-3, vmax=1))
        cbar = plt.colorbar()
        plt.xlabel(r"$\Delta\eta$ cell", fontsize=15)
        plt.ylabel(r"$\Delta\phi$ cell", fontsize=15)
        cbar.set_label(r"$p_T/p_T^{jet}$", fontsize=15)

    plt.tight_layout()
    plt.savefig(filename)
plot_jet_images([qcd_test_images[0], top_test_images[0]], ["QCD jet image", "Top quark jet image"])

Define the autoencoder model#

from tensorflow.keras.models import Model
from tensorflow.keras.layers import (
    Dense,
    Input,
    Conv2D,
    Conv2DTranspose,
    Reshape,
    Flatten,
    Softmax,
)

x_in = Input(shape=(im_size, im_size, 1))
x = Conv2D(128, kernel_size=(3, 3), strides=(2, 2), activation="relu", padding="same")(x_in)
x = Conv2D(128, kernel_size=(3, 3), strides=(2, 2), activation="relu", padding="same")(x)
x = Flatten()(x)

x_enc = Dense(2, name="bottleneck")(x)

x = Dense(int(im_size * im_size / 16) * 128, activation="relu")(x_enc)
x = Reshape((int(im_size / 4), int(im_size / 4), 128))(x)
x = Conv2DTranspose(128, kernel_size=(3, 3), strides=(2, 2), activation="relu", padding="same")(x)
x = Conv2DTranspose(1, kernel_size=(3, 3), strides=(2, 2), activation="linear", padding="same")(x)
x_out = Softmax(name="softmax", axis=[-2, -3])(x)
model = Model(inputs=x_in, outputs=x_out, name="autoencoder")

model.compile(loss="mse", optimizer="adam")
model.summary()

# save the encoder-only model for easy access to latent space
encoder = Model(inputs=x_in, outputs=x_enc, name="encoder")

Train the autoencoder model#

history = model.fit(
    qcd_train_images,
    qcd_train_images,
    batch_size=32,
    epochs=10,
    verbose=0,
    validation_data=(qcd_test_images, qcd_test_images),
)

Reconstruction performance#

qcd_reco_image = model.predict(qcd_test_images[0:1], verbose=0).reshape(im_size, im_size)
plot_jet_images([qcd_test_images[0], qcd_reco_image], ["Input QCD jet image", "Reconstructed QCD jet image"])
top_reco_image = model.predict(top_test_images[0:1], verbose=0).reshape(im_size, im_size)
plot_jet_images([top_test_images[0], top_reco_image], ["Input top quark jet image", "Reconstructed top quark jet image"])

Anomaly detection performance#

qcd_reco_images = model.predict(qcd_test_images, verbose=0)
top_reco_images = model.predict(top_test_images, verbose=0)
diff_qcd = np.power((qcd_reco_images - qcd_test_images), 2)
loss_qcd = np.mean(diff_qcd.reshape(-1, im_size * im_size), axis=-1)

diff_top = np.power((top_reco_images - top_test_images), 2)
loss_top = np.mean(diff_top.reshape(-1, im_size * im_size), axis=-1)

loss_all = np.concatenate([loss_qcd, loss_top])
plt.figure()
bins = np.arange(0, np.max(loss_all), np.max(loss_all) / 100)
plt.hist(loss_qcd, label="QCD jets", bins=bins, alpha=0.8)
plt.hist(loss_top, label="Top quark jets", bins=bins, alpha=0.8)
plt.legend(title="Autoencoder")
plt.xlabel("MSE loss")
plt.ylabel("Jets")
plt.xlim(0, np.max(loss_all))
plt.show()

Exercises#

  1. Plot the ROC curve for the MSE loss of the autoencoder on the merged testing sample of QCD and top quark jets, assuming a label of 1 for top quark jets and a label of 0 for QCD jets. Report the AUC.

  2. Perform a PCA on only the QCD training images using sklearn.decomposition.PCA with 2 components. Note you will have to reshape the image tensors so that they are 2D instead of 4D (as required by the autoencoder), e.g. qcd_test_images.reshape(-1, im_size * im_size)). Plot the distribution of the reconstruction losses for top quark jets and QCD jets separately. Hint: review https://rittikghosh.com/autoencoder.html.

  3. Plot the PCA ROC curve similar to part 1. Report the AUC.

  4. Plot the 2D latent space for the QCD and top quark test images for both the autoencoder and the PCA.