{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Week 5 Notebook: Building a Deep Learning Model\n", "===============================================================\n", "\n", "Now, we'll look at a deep learning model based on low-level track features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import tensorflow.keras as keras\n", "import numpy as np\n", "from sklearn.metrics import roc_curve, auc\n", "import matplotlib.pyplot as plt\n", "import uproot\n", "import tensorflow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import yaml\n", "\n", "with open('definitions.yml') as file:\n", " # The FullLoader parameter handles the conversion from YAML\n", " # scalar values to Python the dictionary format\n", " definitions = yaml.load(file, Loader=yaml.FullLoader)\n", " \n", "features = definitions['features']\n", "spectators = definitions['spectators']\n", "labels = definitions['labels']\n", "\n", "nfeatures = definitions['nfeatures']\n", "nspectators = definitions['nspectators']\n", "nlabels = definitions['nlabels']\n", "ntracks = definitions['ntracks']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Generators\n", "\n", "A quick aside on data generators. As training on large datasets is a key component of many deep learning approaches (and especially in high energy physics), and these datasets no longer fit in memory, it is imporatant to write a data generator which can automatically fetch data.\n", "\n", "Here we modify one from: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from DataGenerator import DataGenerator\n", "help(DataGenerator)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load training and validation generators \n", "train_files = ['root://eospublic.cern.ch//eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/train/ntuple_merged_10.root']\n", "val_files = ['root://eospublic.cern.ch//eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/train/ntuple_merged_11.root']\n", "\n", "\n", "train_generator = DataGenerator(train_files, features, labels, spectators, batch_size=1024, n_dim=ntracks, \n", " remove_mass_pt_window=False, \n", " remove_unlabeled=True, max_entry=8000)\n", "\n", "val_generator = DataGenerator(val_files, features, labels, spectators, batch_size=1024, n_dim=ntracks, \n", " remove_mass_pt_window=False, \n", " remove_unlabeled=True, max_entry=2000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test Data Generator\n", "Note that the track array has a different \"shape.\" There are also less than the requested `batch_size=1024` because we remove unlabeled samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X, y = train_generator[1]\n", "print(X.shape)\n", "print(y.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note this generator can be optimized further (storing the data file locally, etc.). It's important to note that I/O is often a bottleneck for training big networks." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fully Connected Neural Network Classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.models import Model\n", "from tensorflow.keras.layers import Input, Dense, BatchNormalization, Flatten\n", "import tensorflow.keras.backend as K\n", "\n", "# define dense keras model\n", "inputs = Input(shape=(ntracks, nfeatures,), name='input') \n", "x = BatchNormalization(name='bn_1')(inputs)\n", "x = Flatten(name='flatten_1')(x)\n", "x = Dense(64, name='dense_1', activation='relu')(x)\n", "x = Dense(32, name='dense_2', activation='relu')(x)\n", "x = Dense(32, name='dense_3', activation='relu')(x)\n", "outputs = Dense(nlabels, name='output', activation='softmax')(x)\n", "keras_model_dense = Model(inputs=inputs, outputs=outputs)\n", "keras_model_dense.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n", "print(keras_model_dense.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define callbacks\n", "from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau\n", "\n", "early_stopping = EarlyStopping(monitor='val_loss', patience=5)\n", "reduce_lr = ReduceLROnPlateau(patience=5, factor=0.5)\n", "model_checkpoint = ModelCheckpoint('keras_model_dense_best.h5', monitor='val_loss', save_best_only=True)\n", "callbacks = [early_stopping, model_checkpoint, reduce_lr]\n", "\n", "# fit keras model\n", "history_dense = keras_model_dense.fit(train_generator,\n", " validation_data=val_generator,\n", " steps_per_epoch=len(train_generator),\n", " validation_steps=len(val_generator),\n", " max_queue_size=5,\n", " epochs=20,\n", " shuffle=False,\n", " callbacks=callbacks,\n", " verbose=0)\n", "# reload best weights\n", "keras_model_dense.load_weights('keras_model_dense_best.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.plot(history_dense.history['loss'], label='Loss')\n", "plt.plot(history_dense.history['val_loss'], label='Val. loss')\n", "plt.xlabel('Epoch')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Sets Classifier\n", "\n", "This model uses the `Dense` layer of Keras, but really it's more like the Deep Sets architecture applied to jets, the so-caled Particle-flow network approach{cite:p}`Komiske:2018cqr,NIPS2017_6931`.\n", "We are applying the same fully connected neural network to each track. \n", "Then the `GlobalAveragePooling1D` layer sums over the tracks (actually it takes the mean). 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow.keras.models import Model\n", "from tensorflow.keras.layers import Input, Dense, BatchNormalization, GlobalAveragePooling1D\n", "import tensorflow.keras.backend as K\n", "\n", "# define Deep Sets model with Dense Keras layer\n", "inputs = Input(shape=(ntracks, nfeatures,), name='input') \n", "x = BatchNormalization(name='bn_1')(inputs)\n", "x = Dense(64, name='dense_1', activation='relu')(x)\n", "x = Dense(32, name='dense_2', activation='relu')(x)\n", "x = Dense(32, name='dense_3', activation='relu')(x)\n", "# sum over tracks\n", "x = GlobalAveragePooling1D(name='pool_1')(x)\n", "x = Dense(100, name='dense_4', activation='relu')(x)\n", "outputs = Dense(nlabels, name='output', activation='softmax')(x)\n", "keras_model_deepset = Model(inputs=inputs, outputs=outputs)\n", "keras_model_deepset.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])\n", "print(keras_model_deepset.summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define callbacks\n", "from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau\n", "\n", "early_stopping = EarlyStopping(monitor='val_loss', patience=5)\n", "reduce_lr = ReduceLROnPlateau(patience=5, factor=0.5)\n", "model_checkpoint = ModelCheckpoint('keras_model_deepset_best.h5', monitor='val_loss', save_best_only=True)\n", "callbacks = [early_stopping, model_checkpoint, reduce_lr]\n", "\n", "# fit keras model\n", "history_deepset = keras_model_deepset.fit(train_generator, \n", " validation_data=val_generator, \n", " steps_per_epoch=len(train_generator), \n", " validation_steps=len(val_generator),\n", " max_queue_size=5,\n", " epochs=20, \n", " shuffle=False,\n", " callbacks=callbacks, \n", " verbose=0)\n", "# reload best weights\n", "keras_model_deepset.load_weights('keras_model_deepset_best.h5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.plot(history_deepset.history['loss'], label='Loss')\n", "plt.plot(history_deepset.history['val_loss'], label='Val. 
loss')\n", "plt.xlabel('Epoch')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load testing file\n", "test_files = ['root://eospublic.cern.ch//eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/test/ntuple_merged_0.root']\n", "test_generator = DataGenerator(test_files, features, labels, spectators, batch_size=1024, n_dim=ntracks, \n", " remove_mass_pt_window=True, \n", " remove_unlabeled=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# run model inference on test data set\n", "predict_array_dense = []\n", "predict_array_deepset = []\n", "label_array_test = []\n", "\n", "for t in test_generator:\n", " label_array_test.append(t[1])\n", " predict_array_dense.append(keras_model_dense.predict(t[0]))\n", " predict_array_deepset.append(keras_model_deepset.predict(t[0]))\n", " \n", " \n", "predict_array_dense = np.concatenate(predict_array_dense, axis=0)\n", "predict_array_deepset = np.concatenate(predict_array_deepset, axis=0)\n", "label_array_test = np.concatenate(label_array_test, axis=0)\n", "\n", "\n", "# create ROC curves\n", "fpr_dense, tpr_dense, threshold_dense = roc_curve(label_array_test[:,1], predict_array_dense[:,1])\n", "fpr_deepset, tpr_deepset, threshold_deepset = roc_curve(label_array_test[:,1], predict_array_deepset[:,1])\n", " \n", "# plot ROC curves\n", "plt.figure()\n", "plt.plot(tpr_dense, fpr_dense, lw=2.5, label=\"Dense, AUC = {:.1f}%\".format(auc(fpr_dense, tpr_dense)*100))\n", "plt.plot(tpr_deepset, fpr_deepset, lw=2.5, label=\"Deep Sets, AUC = {:.1f}%\".format(auc(fpr_deepset, tpr_deepset)*100))\n", "plt.xlabel(r'True positive rate')\n", "plt.ylabel(r'False positive rate')\n", "plt.semilogy()\n", "plt.ylim(0.001, 1)\n", "plt.xlim(0, 1)\n", "plt.grid(True)\n", "plt.legend(loc='upper left')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see the more structurally-aware Deep Sets model does better than a simple fully conneted neural network appraoch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 2 }