Hands-on 03: Tabular data and NNs: Classifying particle jets#

from tensorflow.keras.utils import to_categorical
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

import numpy as np
import tensorflow as tf

%matplotlib inline
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)

Fetch the jet tagging dataset from Open ML#

data = fetch_openml("hls4ml_lhc_jets_hlf", parser="auto")
X, y = data["data"], data["target"]

Let’s print some information about the dataset#

Print the feature names and the dataset shape

print(f"Feature names: {data['feature_names']}")
print(f"Target names: {y.dtype.categories.to_list()}")
print(f"Shapes: {X.shape}, {y.shape}")
print(f"Inputs: {X}")
print(f"Targets: {y}")

As you can see above, the y target is an array of strings, e.g. ["g", "w", ...]. These correspond to the different source particles of the jets. You will notice that, except for quark- and gluon-initiated jets ("q" and "g"), all other jets in the dataset have at least one “prong.”

Figure: illustration of the jet classes in the dataset.
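
As a quick sanity check (a small sketch; it relies on y being the pandas categorical Series returned by fetch_openml above), you can count how many jets belong to each class:

# Number of jets per class
print(y.value_counts())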

Let’s see what the jet variables look like#

Many of these variables are energy correlation functions \(N\), \(M\), \(C\), and \(D\) (arXiv:1305.0007, arXiv:1609.07483). The others are the jet mass computed with the modified mass drop tagger, \(m_\textrm{mMDT}\); \(\sum z\log z\), where the sum runs over the particles in the jet and \(z\) is the fraction of the jet momentum carried by a given particle; and the overall multiplicity of particles in the jet.
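
To make the \(\sum z\log z\) observable concrete, here is a toy illustration (the particle momenta below are made up for demonstration; the dataset itself only provides the precomputed high-level features):

# Toy example: compute sum_i z_i * log(z_i) for a jet with five particles
pt = np.array([120.0, 45.0, 20.0, 10.0, 5.0])  # hypothetical particle pT values (GeV)
z = pt / pt.sum()                              # momentum fraction carried by each particle
print(np.sum(z * np.log(z)))                   # always <= 0; more negative for softer, more "democratic" jets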

import matplotlib.pyplot as plt

fig, axs = plt.subplots(4, 4, figsize=(40, 40))

# One panel per input feature, overlaying the distributions of the jet classes
for ix, ax in enumerate(axs.reshape(-1)):
    feat = data["feature_names"][ix]
    bins = np.linspace(np.min(X[feat]), np.max(X[feat]), 20)
    for c in y.dtype.categories:
        ax.hist(X[y == c][feat], bins=bins, histtype="step", label=c, lw=2)
    ax.set_xlabel(feat, fontsize=20)
    ax.set_ylabel("Jets", fontsize=20)
    ax.tick_params(axis="both", which="major", labelsize=20)
    ax.legend(fontsize=20, loc="best")
plt.tight_layout()
plt.show()

Because the y target is an array of strings, e.g. ["g", "w", ...], we need to convert it to a “one-hot” encoding for the training. Then, split the dataset into a training(+validation) set and a test set.

le = LabelEncoder()
y_onehot = le.fit_transform(y)  # strings -> integer class labels
y_onehot = to_categorical(y_onehot, 5)  # integer labels -> one-hot vectors
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)
print(y[:5])
print(y_onehot[:5])

Now construct a simple neural network#

We’ll use 3 hidden layers with 64, then 32, then 32 neurons. Each hidden layer will use a ReLU activation. Add an output layer with 5 neurons (one for each class) and finish with a softmax activation.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1
model = Sequential(name="sequential1")
model.add(Dense(64, input_shape=(16,), name="fc1"))
model.add(Activation(activation="relu", name="relu1"))
model.add(Dense(32, name="fc2"))
model.add(Activation(activation="relu", name="relu2"))
model.add(Dense(32, name="fc3"))
model.add(Activation(activation="relu", name="relu3"))
model.add(Dense(5, name="fc4"))
model.add(Activation(activation="softmax", name="softmax"))
model.summary()

Train the model#

We’ll use the SGD optimizer with categorical cross-entropy loss. The model isn’t very complex, so this should only take a few minutes even on a CPU.

model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
from plotting import plot_model_history

plot_model_history(history)

Check performance#

Check the accuracy and make a ROC curve

from plotting import make_roc, plot_confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix

y_keras = model.predict(X_test, batch_size=1024, verbose=0)
print(f"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))}")
plt.figure(figsize=(5, 5))
plot_confusion_matrix(y_test, y_keras, classes=le.classes_, normalize=True)
plt.figure(figsize=(5, 5))
make_roc(y_test, y_keras, le.classes_)

Exercises#

  1. Apply a standard scaler to the inputs. How does the performance of the model change?

scaler = StandardScaler()
X_train_val = scaler.fit_transform(X_train_val)
X_test = scaler.transform(X_test)
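
After scaling, re-train the same architecture on the scaled inputs and compare the accuracy to the unscaled baseline. A minimal sketch, assuming the cells above have already been run (clone_model gives a fresh copy of the architecture with re-initialized weights; the variable names are illustrative):

model_scaled = tf.keras.models.clone_model(model)  # same architecture, new weights
model_scaled.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model_scaled.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
y_scaled = model_scaled.predict(X_test, batch_size=1024, verbose=0)
print(f"Accuracy (scaled inputs): {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_scaled, axis=1))}")
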
  2. Apply L1 regularization. How does the performance of the model change? How does the distribution of the weight values change?

model.add(Dense(64, input_shape=(16,), name="fc1", kernel_regularizer=l1(0.01)))
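
One way to carry this exercise through (a sketch, not the only approach; the variable names below are illustrative): rebuild the full model with kernel_regularizer=l1(0.01) on every Dense layer, retrain it as before, and then histogram the trained weights next to those of the unregularized model.

model_l1 = Sequential(name="sequential_l1")
model_l1.add(Dense(64, input_shape=(16,), name="fc1", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu1"))
model_l1.add(Dense(32, name="fc2", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu2"))
model_l1.add(Dense(32, name="fc3", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu3"))
model_l1.add(Dense(5, name="fc4", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="softmax", name="softmax"))
model_l1.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model_l1.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)

# Compare the kernel weight distributions (get_weights() alternates kernel, bias for the Dense layers)
weights_plain = np.concatenate([w.flatten() for w in model.get_weights()[::2]])
weights_l1 = np.concatenate([w.flatten() for w in model_l1.get_weights()[::2]])
plt.figure(figsize=(5, 5))
bins = np.linspace(-1, 1, 50)
plt.hist(weights_plain, bins=bins, histtype="step", label="no regularization")
plt.hist(weights_l1, bins=bins, histtype="step", label="L1 (0.01)")
plt.semilogy()
plt.xlabel("Weight value")
plt.ylabel("Number of weights")
plt.legend(loc="best")
plt.show()
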
  3. How do the loss curves change if we use a smaller learning rate (say 1e-5) or a larger one (say 0.1)?

  4. How do the loss curve and the performance of the model change if we use Adam as the optimizer instead of SGD? (A sketch for varying the optimizer settings follows below.)
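
For these last two exercises, a minimal sketch of how you might vary the optimizer (remember to rebuild or re-initialize the model before each run, e.g. with tf.keras.models.clone_model, so the comparisons start from fresh weights):

from tensorflow.keras.optimizers import SGD

# Exercise 3: SGD with an explicit learning rate (try 1e-5 and 0.1)
model.compile(optimizer=SGD(learning_rate=1e-5), loss="categorical_crossentropy", metrics=["accuracy"])

# Exercise 4: Adam instead of SGD (Adam was already imported above; its default learning rate is 1e-3)
model.compile(optimizer=Adam(), loss="categorical_crossentropy", metrics=["accuracy"])

Then re-run the model.fit(...) and plot_model_history(...) cells to compare the loss curves.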