Hands-on 03: Tabular data and NNs: Classifying particle jets#
from tensorflow.keras.utils import to_categorical
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
import tensorflow as tf
%matplotlib inline
seed = 42
np.random.seed(seed)
tf.random.set_seed(seed)
Fetch the jet tagging dataset from Open ML#
data = fetch_openml("hls4ml_lhc_jets_hlf", parser="auto")
X, y = data["data"], data["target"]
Let’s print some information about the dataset#
Print the feature names and the dataset shape
print(f"Feature names: {data['feature_names']}")
print(f"Target names: {y.dtype.categories.to_list()}")
print(f"Shapes: {X.shape}, {y.shape}")
print(f"Inputs: {X}")
print(f"Targets: {y}")
As you can see above, the y target is an array of strings, e.g. ["g", "w", ...]. These correspond to the different source particles of the jets. You will notice that, except for quark- and gluon-initiated jets ("q" and "g"), all other jets in the dataset have at least one “prong.”
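As a quick sanity check, you can count how many jets of each class the dataset contains. This is a minimal sketch, assuming y is the pandas categorical series returned by fetch_openml above.
# Count the number of jets in each class
print(y.value_counts())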
Let’s see what the jet variables look like#
Many of these variables are energy correlation functions \(N\), \(M\), \(C\), and \(D\) (arXiv:1305.0007, arXiv:1609.07483). The others are the jet mass (computed with the modified mass drop tagger) \(m_\textrm{mMDT}\), \(\sum z\log z\) where the sum runs over the particles in the jet and \(z\) is the fraction of jet momentum carried by a given particle, and the overall multiplicity of particles in the jet.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(4, 4, figsize=(40, 40))
for ix, ax in enumerate(axs.reshape(-1)):
    feat = data["feature_names"][ix]
    bins = np.linspace(np.min(X[feat]), np.max(X[feat]), 20)
    for c in y.dtype.categories:
        ax.hist(X[y == c][feat], bins=bins, histtype="step", label=c, lw=2)
    ax.set_xlabel(feat, fontsize=20)
    ax.set_ylabel("Jets", fontsize=20)
    ax.tick_params(axis="both", which="major", labelsize=20)
    ax.legend(fontsize=20, loc="best")
plt.tight_layout()
plt.show()
Because the y target is an array of strings, e.g. ["g", "w", ...], we need to convert it to a “one-hot” encoding for the training.
Then, split the dataset into training and validation sets
le = LabelEncoder()
y_onehot = le.fit_transform(y)
y_onehot = to_categorical(y_onehot, 5)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)
print(y[:5])
print(y_onehot[:5])
Now construct a simple neural network#
We’ll use 3 hidden layers with 64, then 32, then 32 neurons. Each layer will use relu activation.
Add an output layer with 5 neurons (one for each class), then finish with softmax activation.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1
model = Sequential(name="sequential1")
model.add(Dense(64, input_shape=(16,), name="fc1"))
model.add(Activation(activation="relu", name="relu1"))
model.add(Dense(32, name="fc2"))
model.add(Activation(activation="relu", name="relu2"))
model.add(Dense(32, name="fc3"))
model.add(Activation(activation="relu", name="relu3"))
model.add(Dense(5, name="fc4"))
model.add(Activation(activation="softmax", name="softmax"))
model.summary()
Train the model#
We’ll use the SGD optimizer with categorical crossentropy loss. The model isn’t very complex, so this should only take a few minutes, even on a CPU.
model.compile(optimizer="sgd", loss=["categorical_crossentropy"], metrics=["accuracy"])
history = model.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
from plotting import plot_model_history
plot_model_history(history)
Check performance#
Check the accuracy and make a ROC curve
from plotting import make_roc, plot_confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix
y_keras = model.predict(X_test, batch_size=1024, verbose=0)
print(f"Accuracy: {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))}")
plt.figure(figsize=(5, 5))
plot_confusion_matrix(y_test, y_keras, classes=le.classes_, normalize=True)
plt.figure(figsize=(5, 5))
make_roc(y_test, y_keras, le.classes_)
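If you also want a single summary number per class, you can compute the area under each ROC curve directly with scikit-learn. This is a minimal sketch, assuming y_test and y_keras from the cells above (each column corresponds to one class in le.classes_).
from sklearn.metrics import roc_auc_score
# One-vs-rest AUC per class: compare the one-hot truth column to the predicted probability column
for i, c in enumerate(le.classes_):
    print(f"AUC for {c}: {roc_auc_score(y_test[:, i], y_keras[:, i]):.3f}")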
Exercises#
Apply a standard scaler to the inputs. How does the performance of the model change?
scaler = StandardScaler()
X_train_val = scaler.fit_transform(X_train_val)
X_test = scaler.transform(X_test)
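To quantify the change, you can retrain a fresh copy of the same architecture on the scaled inputs and compare the test accuracy. This is a minimal sketch, assuming the cells above have been run so X_train_val and X_test now hold the scaled arrays; clone_model gives a copy of the architecture with newly initialized weights.
from tensorflow.keras.models import clone_model
scaled_model = clone_model(model)  # same architecture, freshly initialized weights
scaled_model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
scaled_model.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
y_scaled = scaled_model.predict(X_test, batch_size=1024, verbose=0)
print(f"Accuracy with scaled inputs: {accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_scaled, axis=1))}")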
Apply L1 regularization. How does the performance of the model change? How do the distribution of the weight values change?
model.add(Dense(64, input_shape=(16,), name="fc1", kernel_regularizer=l1(0.01)))
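The starter line above only regularizes the first layer. Below is a minimal sketch of a fully L1-regularized model, followed by a comparison of the trained kernel-weight distributions against the unregularized model; the histogram part is one possible way to inspect the weights, not part of the original starter code.
model_l1 = Sequential(name="sequential_l1")
model_l1.add(Dense(64, input_shape=(16,), name="fc1", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu1"))
model_l1.add(Dense(32, name="fc2", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu2"))
model_l1.add(Dense(32, name="fc3", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="relu", name="relu3"))
model_l1.add(Dense(5, name="fc4", kernel_regularizer=l1(0.01)))
model_l1.add(Activation(activation="softmax", name="softmax"))
model_l1.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model_l1.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
# Dense kernels sit at the even indices of get_weights() (kernel, bias, kernel, bias, ...)
weights_l1 = np.concatenate([w.flatten() for w in model_l1.get_weights()[::2]])
weights_nol1 = np.concatenate([w.flatten() for w in model.get_weights()[::2]])
bins = np.linspace(-1, 1, 51)
plt.figure(figsize=(5, 5))
plt.hist(weights_nol1, bins=bins, histtype="step", label="no L1", lw=2)
plt.hist(weights_l1, bins=bins, histtype="step", label="L1", lw=2)
plt.xlabel("Weight value")
plt.ylabel("Number of weights")
plt.legend(loc="best")
plt.show()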
How do the loss curves change if we use a smaller learning rate (say 1e-5) or a larger one (say 0.1)?
How do the loss curves and the performance of the model change if we use Adam as the optimizer instead of SGD? A starting-point sketch for both questions is given below.
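A minimal sketch for both questions, retraining a fresh copy of the architecture with each optimizer setting (Adam and plot_model_history are already imported above; the learning rates are just the values suggested in the exercise).
from tensorflow.keras.models import clone_model
from tensorflow.keras.optimizers import SGD
for opt in [SGD(learning_rate=1e-5), SGD(learning_rate=0.1), Adam()]:
    model_opt = clone_model(model)  # fresh, untrained copy of the architecture above
    model_opt.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    history_opt = model_opt.fit(X_train_val, y_train_val, batch_size=1024, epochs=50, validation_split=0.25, shuffle=True, verbose=0)
    plot_model_history(history_opt)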