{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "uMaZzc8tHpPZ" }, "source": [ "# Hands-on 03: Tabular data and NNs: Classifying particle jets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "o3Diqc0XHpPe" }, "outputs": [], "source": [ "from tensorflow.keras.utils import to_categorical\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", "\n", "import numpy as np\n", "import tensorflow as tf\n", "\n", "%matplotlib inline\n", "seed = 42\n", "np.random.seed(seed)\n", "tf.random.set_seed(seed)" ] }, { "cell_type": "markdown", "metadata": { "id": "KPFfRhZbHpPf" }, "source": [ "## Fetch the jet tagging dataset from Open ML" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g1a6LkWuHpPg" }, "outputs": [], "source": [ "data = fetch_openml(\"hls4ml_lhc_jets_hlf\", parser=\"auto\")\n", "X, y = data[\"data\"], data[\"target\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "l_qe5eOAHpPg" }, "source": [ "### Let's print some information about the dataset\n", "Print the feature names and the dataset shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3S94__7-HpPh", "outputId": "100537e6-9512-4e93-9593-3f818a021823", "scrolled": false }, "outputs": [], "source": [ "print(f\"Feature names: {data['feature_names']}\")\n", "print(f\"Target names: {y.dtype.categories.to_list()}\")\n", "print(f\"Shapes: {X.shape}, {y.shape}\")\n", "print(f\"Inputs: {X}\")\n", "print(f\"Targets: {y}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LlPhEpHpHpPi" }, "source": [ "As you see above, the `y` target is an array of strings, e.g. `[\"g\", \"w\", ...]` etc.\n", "These correspond to different source particles for the jets.\n", "You will notice that except for quark- and gluon-initiated jets (`\"g\"`), all other jets in the dataset have at least one \"prong.\"\n", "\n", "\"jet_classes\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "4fPxF-40HpPj" }, "source": [ "### Lets see what the jet variables look like\n", "\n", "Many of these variables are energy correlation functions $N$, $M$, $C$, and $D$ ([1305.0007](https://arxiv.org/pdf/1305.0007.pdf), [1609.07483](https://arxiv.org/pdf/1609.07483.pdf)). \n", "The others are the jet mass (computed with modified mass drop) $m_\\textrm{mMDT}$, $\\sum z\\log z$ where the sum is over the particles in the jet and $z$ is the fraction of jet momentum carried by a given particle, and the overall multiplicity of particles in the jet." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "0KAWH454HpPk", "outputId": "2553452b-6f7d-4ecd-cefd-8f5e68b34b69", "scrolled": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, axs = plt.subplots(4, 4, figsize=(40, 40))\n", "\n", "for ix, ax in enumerate(axs.reshape(-1)):\n", " feat = data[\"feature_names\"][ix]\n", " bins = np.linspace(np.min(X[:][feat]), np.max(X[:][feat]), 20)\n", " for c in y.dtype.categories:\n", " ax.hist(X[y == c][feat], bins=bins, histtype=\"step\", label=c, lw=2)\n", " ax.set_xlabel(feat, fontsize=20)\n", " ax.set_ylabel(\"Jets\", fontsize=20)\n", " ax.tick_params(axis=\"both\", which=\"major\", labelsize=20)\n", " ax.legend(fontsize=20, loc=\"best\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "IHokDPh7HpPl" }, "source": [ "Because the `y` target is an array of strings, e.g. `[\"g\", \"w\", ...]`, we need to make this a \"one-hot\" encoding for the training.\n", "Then, split the dataset into training and validation sets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VQJoRhIFHpPl", "outputId": "9db84600-6c6e-469b-ab91-0c9843da5e23" }, "outputs": [], "source": [ "le = LabelEncoder()\n", "y_onehot = le.fit_transform(y)\n", "y_onehot = to_categorical(y_onehot, 5)\n", "X_train_val, X_test, y_train_val, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)\n", "print(y[:5])\n", "print(y_onehot[:5])" ] }, { "cell_type": "markdown", "metadata": { "id": "PwswlgcSHpPn" }, "source": [ "## Now construct a simple neural network\n", "We'll use 3 hidden layers with 64, then 32, then 32 neurons. Each layer will use `relu` activation.\n", "Add an output layer with 5 neurons (one for each class), then finish with Softmax activation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "91U8q9ayHpPn" }, "outputs": [], "source": [ "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Dense, Activation, BatchNormalization\n", "from tensorflow.keras.optimizers import Adam\n", "from tensorflow.keras.regularizers import l1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qf8J9KdPHpPn" }, "outputs": [], "source": [ "model = Sequential(name=\"sequantial1\")\n", "model.add(Dense(64, input_shape=(16,), name=\"fc1\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu1\"))\n", "model.add(Dense(32, name=\"fc2\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu2\"))\n", "model.add(Dense(32, name=\"fc3\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu3\"))\n", "model.add(Dense(5, name=\"fc4\"))\n", "model.add(Activation(activation=\"softmax\", name=\"softmax\"))\n", "model.summary()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "GkixbkJzHpPn" }, "source": [ "## Train the model\n", "We'll use SGD optimizer with categorical crossentropy loss.\n", "The model isn't very complex, so this should just take a few minutes even on the CPU." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hnUhUlJtHpPo", "outputId": "a79d0ede-3379-4626-99ba-b1e27f62dbb2" }, "outputs": [], "source": [ "model.compile(optimizer=\"sgd\", loss=[\"categorical_crossentropy\"], metrics=[\"accuracy\"])\n", "history = model.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plotting import plot_model_history\n", "\n", "plot_model_history(history)" ] }, { "cell_type": "markdown", "metadata": { "id": "Mq_a8hlPHpPo" }, "source": [ "## Check performance\n", "Check the accuracy and make a ROC curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 593 }, "id": "rtLqV3ZHHpPo", "outputId": "a13d2537-7cb3-4953-88f0-2684fbd7ae85" }, "outputs": [], "source": [ "from plotting import make_roc, plot_confusion_matrix\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "y_keras = model.predict(X_test, batch_size=1024, verbose=0)\n", "print(f\"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(5, 5))\n", "plot_confusion_matrix(y_test, y_keras, classes=le.classes_, normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(5, 5))\n", "make_roc(y_test, y_keras, le.classes_)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "1. Apply a standard scaler to the inputs. How does the performance of the model change?\n", "\n", "```python\n", "scaler = StandardScaler()\n", "X_train_val = scaler.fit_transform(X_train_val)\n", "X_test = scaler.transform(X_test)\n", "```\n", "\n", "2. Apply L1 regularization. How does the performance of the model change? How do the distribution of the weight values change?\n", "\n", "```python\n", "model.add(Dense(64, input_shape=(16,), name=\"fc1\", kernel_regularizer=l1(0.01)))\n", "```\n", "\n", "3. How do the loss curves change if we use a smaller learning rate (say `1e-5`) or a larger one (say `0.1`)?\n", "\n", "4. How does the loss curve change and the performance of the model change if we use Adam as the optimizer instead of SGD?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "phys139", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "d0ea348b636367bcdf67fd2d6d24251712b38670f61fdee14f28eb58fe74f081" } } }, "nbformat": 4, "nbformat_minor": 4 }