{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "uMaZzc8tHpPZ" }, "source": [ "# Hands-on 03: Tabular data and NNs: Classifying particle jets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "o3Diqc0XHpPe" }, "outputs": [], "source": [ "from tensorflow.keras.utils import to_categorical\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", "\n", "import numpy as np\n", "import tensorflow as tf\n", "\n", "%matplotlib inline\n", "seed = 42\n", "np.random.seed(seed)\n", "tf.random.set_seed(seed)" ] }, { "cell_type": "markdown", "metadata": { "id": "KPFfRhZbHpPf" }, "source": [ "## Fetch the jet tagging dataset from Open ML" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g1a6LkWuHpPg" }, "outputs": [], "source": [ "data = fetch_openml(\"hls4ml_lhc_jets_hlf\", parser=\"auto\")\n", "X, y = data[\"data\"], data[\"target\"]" ] }, { "cell_type": "markdown", "metadata": { "id": "l_qe5eOAHpPg" }, "source": [ "### Let's print some information about the dataset\n", "Print the feature names and the dataset shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "3S94__7-HpPh", "outputId": "100537e6-9512-4e93-9593-3f818a021823", "scrolled": false }, "outputs": [], "source": [ "print(f\"Feature names: {data['feature_names']}\")\n", "print(f\"Target names: {y.dtype.categories.to_list()}\")\n", "print(f\"Shapes: {X.shape}, {y.shape}\")\n", "print(f\"Inputs: {X}\")\n", "print(f\"Targets: {y}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "LlPhEpHpHpPi" }, "source": [ "As you see above, the `y` target is an array of strings, e.g. `[\"g\", \"w\", ...]` etc.\n", "These correspond to different source particles for the jets.\n", "You will notice that except for quark- and gluon-initiated jets (`\"g\"`), all other jets in the dataset have at least one \"prong.\"\n", "\n", "\"jet_classes\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "4fPxF-40HpPj" }, "source": [ "### Lets see what the jet variables look like\n", "\n", "Many of these variables are energy correlation functions $N$, $M$, $C$, and $D$ ([1305.0007](https://arxiv.org/pdf/1305.0007.pdf), [1609.07483](https://arxiv.org/pdf/1609.07483.pdf)). \n", "The others are the jet mass (computed with modified mass drop) $m_\\textrm{mMDT}$, $\\sum z\\log z$ where the sum is over the particles in the jet and $z$ is the fraction of jet momentum carried by a given particle, and the overall multiplicity of particles in the jet." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "0KAWH454HpPk", "outputId": "2553452b-6f7d-4ecd-cefd-8f5e68b34b69", "scrolled": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "fig, axs = plt.subplots(4, 4, figsize=(40, 40))\n", "\n", "for ix, ax in enumerate(axs.reshape(-1)):\n", " feat = data[\"feature_names\"][ix]\n", " bins = np.linspace(np.min(X[:][feat]), np.max(X[:][feat]), 20)\n", " for c in y.dtype.categories:\n", " ax.hist(X[y == c][feat], bins=bins, histtype=\"step\", label=c, lw=2)\n", " ax.set_xlabel(feat, fontsize=20)\n", " ax.set_ylabel(\"Jets\", fontsize=20)\n", " ax.tick_params(axis=\"both\", which=\"major\", labelsize=20)\n", " ax.legend(fontsize=20, loc=\"best\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "IHokDPh7HpPl" }, "source": [ "Because the `y` target is an array of strings, e.g. `[\"g\", \"w\", ...]`, we need to make this a \"one-hot\" encoding for the training.\n", "Then, split the dataset into training and validation sets" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "VQJoRhIFHpPl", "outputId": "9db84600-6c6e-469b-ab91-0c9843da5e23" }, "outputs": [], "source": [ "le = LabelEncoder()\n", "y_onehot = le.fit_transform(y)\n", "y_onehot = to_categorical(y_onehot, 5)\n", "X_train_val, X_test, y_train_val, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)\n", "print(y[:5])\n", "print(y_onehot[:5])" ] }, { "cell_type": "markdown", "metadata": { "id": "PwswlgcSHpPn" }, "source": [ "## Now construct a simple neural network\n", "We'll use 3 hidden layers with 64, then 32, then 32 neurons. Each layer will use `relu` activation.\n", "Add an output layer with 5 neurons (one for each class), then finish with Softmax activation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "91U8q9ayHpPn" }, "outputs": [], "source": [ "from tensorflow.keras.models import Sequential\n", "from tensorflow.keras.layers import Dense, Activation, BatchNormalization\n", "from tensorflow.keras.optimizers import Adam\n", "from tensorflow.keras.regularizers import l1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qf8J9KdPHpPn" }, "outputs": [], "source": [ "model = Sequential(name=\"sequantial1\")\n", "model.add(Dense(64, input_shape=(16,), name=\"fc1\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu1\"))\n", "model.add(Dense(32, name=\"fc2\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu2\"))\n", "model.add(Dense(32, name=\"fc3\"))\n", "model.add(Activation(activation=\"relu\", name=\"relu3\"))\n", "model.add(Dense(5, name=\"fc4\"))\n", "model.add(Activation(activation=\"softmax\", name=\"softmax\"))\n", "model.summary()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "GkixbkJzHpPn" }, "source": [ "## Train the model\n", "We'll use SGD optimizer with categorical crossentropy loss.\n", "The model isn't very complex, so this should just take a few minutes even on the CPU." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hnUhUlJtHpPo", "outputId": "a79d0ede-3379-4626-99ba-b1e27f62dbb2" }, "outputs": [], "source": [ "model.compile(optimizer=\"sgd\", loss=[\"categorical_crossentropy\"], metrics=[\"accuracy\"])\n", "history = model.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from plotting import plot_model_history\n", "\n", "plot_model_history(history)" ] }, { "cell_type": "markdown", "metadata": { "id": "Mq_a8hlPHpPo" }, "source": [ "## Check performance\n", "Check the accuracy and make a ROC curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 593 }, "id": "rtLqV3ZHHpPo", "outputId": "a13d2537-7cb3-4953-88f0-2684fbd7ae85" }, "outputs": [], "source": [ "from plotting import make_roc, plot_confusion_matrix\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "y_keras = model.predict(X_test, batch_size=1024, verbose=0)\n", "print(f\"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(5, 5))\n", "plot_confusion_matrix(y_test, y_keras, classes=le.classes_, normalize=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(5, 5))\n", "make_roc(y_test, y_keras, le.classes_)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "1. Apply a standard scaler to the inputs. How does the performance of the model change?\n", "\n", "```python\n", "scaler = StandardScaler()\n", "X_train_val = scaler.fit_transform(X_train_val)\n", "X_test = scaler.transform(X_test)\n", "```\n", "\n", "2. Apply L1 regularization. How does the performance of the model change? How do the distribution of the weight values change?\n", "\n", "```python\n", "model.add(Dense(64, input_shape=(16,), name=\"fc1\", kernel_regularizer=l1(0.01)))\n", "```\n", "\n", "3. How do the loss curves change if we use a smaller learning rate (say `1e-5`) or a larger one (say `0.1`)?\n", "\n", "4. How does the loss curve change and the performance of the model change if we use Adam as the optimizer instead of SGD?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "phys139", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "d0ea348b636367bcdf67fd2d6d24251712b38670f61fdee14f28eb58fe74f081" } } }, "nbformat": 4, "nbformat_minor": 4 }