{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hands-on 07: Autoencoders for anomaly detection\n",
    "\n",
    "This week, we will look at autoencoders for anomaly detection.\n",
    "\n",
    "The goal is to train an autoencoder to reconstruct QCD (background) jets, which are plentiful at the LHC.\n",
    "Then we will apply it to top quark (signal) jets to see if the reconstruction error is larger.\n",
    "The reconstruction error can then be used as an *anomaly score* in real data.\n",
    "\n",
    "This autoencoder architecture used is inspired by this paper: https://arxiv.org/abs/1808.08992.\n",
    "![Autoencoder](https://inspirehep.net/files/70ddba8443dc13268cf2f0521c302338)\n",
    "\n",
    "You may need to install the [JetNet](https://jetnet.readthedocs.io/en/latest/pages/contents.html) library if you don't already have it.\n",
    "```bash\n",
    "pip install --user jetnet\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download the dataset\n",
    "\n",
    "We will use a validation dataset of 400k jets, which is plenty for our purposes.\n",
    "The full dataset is available at https://doi.org/10.5281/zenodo.2603255."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jetnet\n",
    "\n",
    "im_size = 16\n",
    "jet_r = 0.8\n",
    "max_jets = 50000\n",
    "\n",
    "# download the validation data (400k jets, which is plenty for our purposes)\n",
    "# full dataset is available here: https://doi.org/10.5281/zenodo.2603255\n",
    "data = jetnet.datasets.TopTagging(\n",
    "    jet_type=\"all\",\n",
    "    particle_features=[\"E\", \"px\", \"py\", \"pz\"],\n",
    "    jet_features=[\"type\"],\n",
    "    split=\"valid\",\n",
    "    data_dir=\"data/\",\n",
    "    particle_transform=jetnet.utils.cartesian_to_relEtaPhiPt,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform and split the data\n",
    "\n",
    "The data is originally up to 200 particles per jet (zero-padded), and the features are the standard 4-vectors $(E, p_x, p_y, p_z)$. \n",
    "We can assume the particles are massless so $E=\\sqrt{p_x^2+p_y^2+p_z^2}$ and there are only 3 degrees of freedom.\n",
    "\n",
    "We will transform to relative coordinates centered on the jet using the function [`jetnet.utils.cartesian_to_relEtaPhiPt`](https://jetnet.readthedocs.io/en/latest/pages/utils.html#jetnet.utils.cartesian_to_relEtaPhiPt):\n",
    "\n",
    "\\begin{align}\n",
    "\\eta^\\mathrm{rel} &=\\eta^\\mathrm{particle} - \\eta^\\mathrm{jet}\\\\\n",
    "\\phi^\\mathrm{rel} &=\\phi^\\mathrm{particle} - \\phi^\\mathrm{jet} \\pmod{2\\pi}\\\\\n",
    "p_\\mathrm{T}^\\mathrm{rel} &= p_\\mathrm{T}^\\mathrm{particle}/p_\\mathrm{T}^\\mathrm{jet}\n",
    "\\end{align}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "indices = np.random.permutation(np.arange(len(data)))[:max_jets]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the data\n",
    "transformed_particle_data = data.particle_transform(data.particle_data[indices])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split qcd background and top quark signal\n",
    "qcd_data = transformed_particle_data[data.jet_data[indices][:, 0] == 0]\n",
    "top_data = transformed_particle_data[data.jet_data[indices][:, 0] == 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "qcd_train, qcd_test = train_test_split(qcd_data, test_size=0.2, random_state=42)\n",
    "top_train, top_test = train_test_split(top_data, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Convert data to jet images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "#  convert full dataset\n",
    "qcd_train_images = np.expand_dims(jetnet.utils.to_image(qcd_train, im_size=im_size, maxR=jet_r), axis=-1)\n",
    "qcd_test_images = np.expand_dims(jetnet.utils.to_image(qcd_test, im_size=im_size, maxR=jet_r), axis=-1)\n",
    "top_test_images = np.expand_dims(jetnet.utils.to_image(top_test, im_size=im_size, maxR=jet_r), axis=-1)\n",
    "\n",
    "# rescale so sum is 1 (it should be close already)\n",
    "qcd_train_images = qcd_train_images / np.sum(qcd_train_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)\n",
    "qcd_test_images = qcd_test_images / np.sum(qcd_test_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)\n",
    "top_test_images = top_test_images / np.sum(top_test_images.reshape(-1, 1, 1, 1, im_size * im_size), axis=-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize the jet images"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import LogNorm\n",
    "\n",
    "\n",
    "def plot_jet_images(images, titles, filename=\"jet_image.pdf\"):\n",
    "\n",
    "    n_images = len(images)\n",
    "    plt.figure(figsize=(5 * n_images, 5))\n",
    "\n",
    "    for i, (image, title) in enumerate(zip(images, titles)):\n",
    "        plt.subplot(1, n_images, i + 1)\n",
    "        plt.title(title)\n",
    "        plt.imshow(image, origin=\"lower\", norm=LogNorm(vmin=1e-3, vmax=1))\n",
    "        cbar = plt.colorbar()\n",
    "        plt.xlabel(r\"$\\Delta\\eta$ cell\", fontsize=15)\n",
    "        plt.ylabel(r\"$\\Delta\\phi$ cell\", fontsize=15)\n",
    "        cbar.set_label(r\"$p_T/p_T^{jet}$\", fontsize=15)\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.savefig(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_jet_images([qcd_test_images[0], top_test_images[0]], [\"QCD jet image\", \"Top quark jet image\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define the autoencoder model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.models import Model\n",
    "from tensorflow.keras.layers import (\n",
    "    Dense,\n",
    "    Input,\n",
    "    Conv2D,\n",
    "    Conv2DTranspose,\n",
    "    Reshape,\n",
    "    Flatten,\n",
    "    Softmax,\n",
    ")\n",
    "\n",
    "x_in = Input(shape=(im_size, im_size, 1))\n",
    "x = Conv2D(128, kernel_size=(3, 3), strides=(2, 2), activation=\"relu\", padding=\"same\")(x_in)\n",
    "x = Conv2D(128, kernel_size=(3, 3), strides=(2, 2), activation=\"relu\", padding=\"same\")(x)\n",
    "x = Flatten()(x)\n",
    "\n",
    "x_enc = Dense(2, name=\"bottleneck\")(x)\n",
    "\n",
    "x = Dense(int(im_size * im_size / 16) * 128, activation=\"relu\")(x_enc)\n",
    "x = Reshape((int(im_size / 4), int(im_size / 4), 128))(x)\n",
    "x = Conv2DTranspose(128, kernel_size=(3, 3), strides=(2, 2), activation=\"relu\", padding=\"same\")(x)\n",
    "x = Conv2DTranspose(1, kernel_size=(3, 3), strides=(2, 2), activation=\"linear\", padding=\"same\")(x)\n",
    "x_out = Softmax(name=\"softmax\", axis=[-2, -3])(x)\n",
    "model = Model(inputs=x_in, outputs=x_out, name=\"autoencoder\")\n",
    "\n",
    "model.compile(loss=\"mse\", optimizer=\"adam\")\n",
    "model.summary()\n",
    "\n",
    "# save the encoder-only model for easy access to latent space\n",
    "encoder = Model(inputs=x_in, outputs=x_enc, name=\"encoder\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the autoencoder model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = model.fit(\n",
    "    qcd_train_images,\n",
    "    qcd_train_images,\n",
    "    batch_size=32,\n",
    "    epochs=10,\n",
    "    verbose=0,\n",
    "    validation_data=(qcd_test_images, qcd_test_images),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reconstruction performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qcd_reco_image = model.predict(qcd_test_images[0:1], verbose=0).reshape(im_size, im_size)\n",
    "plot_jet_images([qcd_test_images[0], qcd_reco_image], [\"Input QCD jet image\", \"Reconstructed QCD jet image\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_reco_image = model.predict(top_test_images[0:1], verbose=0).reshape(im_size, im_size)\n",
    "plot_jet_images([top_test_images[0], top_reco_image], [\"Input top quark jet image\", \"Reconstructed top quark jet image\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Anomaly detection performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "qcd_reco_images = model.predict(qcd_test_images, verbose=0)\n",
    "top_reco_images = model.predict(top_test_images, verbose=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff_qcd = np.power((qcd_reco_images - qcd_test_images), 2)\n",
    "loss_qcd = np.mean(diff_qcd.reshape(-1, im_size * im_size), axis=-1)\n",
    "\n",
    "diff_top = np.power((top_reco_images - top_test_images), 2)\n",
    "loss_top = np.mean(diff_top.reshape(-1, im_size * im_size), axis=-1)\n",
    "\n",
    "loss_all = np.concatenate([loss_qcd, loss_top])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "bins = np.arange(0, np.max(loss_all), np.max(loss_all) / 100)\n",
    "plt.hist(loss_qcd, label=\"QCD jets\", bins=bins, alpha=0.8)\n",
    "plt.hist(loss_top, label=\"Top quark jets\", bins=bins, alpha=0.8)\n",
    "plt.legend(title=\"Autoencoder\")\n",
    "plt.xlabel(\"MSE loss\")\n",
    "plt.ylabel(\"Jets\")\n",
    "plt.xlim(0, np.max(loss_all))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "1. Plot the ROC curve for the MSE loss of the autoencoder on the merged testing sample of QCD and top quark jets, assuming a label of 1 for top quark jets and a label of 0 for QCD jets. Report the AUC.\n",
    "2. Perform a PCA on only the QCD training images using `sklearn.decomposition.PCA` with 2 components. Note you will have to reshape the image tensors so that they are 2D instead of 4D (as required by the autoencoder), e.g. `qcd_test_images.reshape(-1, im_size * im_size))`. Plot the distribution of the reconstruction losses for top quark jets and QCD jets separately. Hint: review https://rittikghosh.com/autoencoder.html.\n",
    "3. Plot the PCA ROC curve similar to part 1. Report the AUC.\n",
    "6. Plot the 2D latent space for the QCD and top quark test images for both the autoencoder and the PCA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "vscode": {
   "interpreter": {
    "hash": "d0ea348b636367bcdf67fd2d6d24251712b38670f61fdee14f28eb58fe74f081"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}