Week 3 Notebook: Feature Engineering
Many different kinds of features have been specially engineered to perform b-tagging. We will consider several of these today.
b hadrons
b hadrons contain bottom quarks. These hadrons are unstable: they spontaneously decay to hadrons containing lighter quarks after a characteristic time, described by a known mean lifetime.
Because the bottom quark can only decay through the weak interaction to lighter quarks of a different generation, which is a suppressed transition, the lifetime of b hadrons tends to be relatively long: around \(10^{-12}\) s. Given that they are traveling near the speed of light, this means they can travel \(\mathcal{O}(\mathrm{mm})\) in the detector before decaying. The particles produced in this decay also tend to have higher energy.
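To make the millimeter estimate concrete, here is a small back-of-the-envelope sketch. It uses the mean lab-frame decay length \(L = \beta\gamma c\tau\) with \(\beta\gamma = p/m\). The numerical inputs (a mass of about 5.28 GeV, a lifetime of about \(1.5 \times 10^{-12}\) s, and a momentum of 50 GeV) are illustrative assumptions, not values taken from the dataset, and the function name is hypothetical.

```python
# Rough estimate of the mean flight distance of a b hadron in the lab frame.
# All numerical inputs below are illustrative assumptions.
C_MM_PER_S = 2.998e11  # speed of light in mm/s


def mean_flight_distance_mm(p_gev, mass_gev, lifetime_s):
    """Mean decay length L = beta*gamma*c*tau, using beta*gamma = p/m."""
    beta_gamma = p_gev / mass_gev
    return beta_gamma * C_MM_PER_S * lifetime_s


# Assumed: B0-like hadron with m ~ 5.28 GeV, tau ~ 1.5e-12 s, p = 50 GeV
distance = mean_flight_distance_mm(p_gev=50.0, mass_gev=5.28, lifetime_s=1.5e-12)
print(f"mean flight distance: {distance:.1f} mm")
```

With these assumed inputs the result is a few millimeters, and it scales linearly with the hadron momentum.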
Therefore, we can consider the properties of displaced tracks and secondary vertices (points of origin of collections of displaced tracks). In particular, we quantify how displaced a track is in terms of its impact parameter: the distance of closest approach between the track and the primary vertex.
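As a geometric illustration of the impact parameter, the sketch below computes a signed transverse impact parameter for a track approximated as a straight line in the transverse plane (real tracks curve in the magnetic field, which is ignored here). The sign convention, positive when the point of closest approach lies in the hemisphere of the jet direction, is a common one in b-tagging but is stated here as an assumption; the function name and inputs are hypothetical.

```python
import math


def signed_ip_2d(pv, point, direction, jet_dir):
    """Signed transverse impact parameter of a straight-line track.

    pv        : (x, y) of the primary vertex
    point     : (x, y) of any point on the track
    direction : (dx, dy) along the track (need not be normalized)
    jet_dir   : (jx, jy) jet axis in the transverse plane
    """
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    # vector from the PV to the reference point on the track
    rx, ry = point[0] - pv[0], point[1] - pv[1]
    # remove the component along the track to get the point of closest
    # approach, expressed relative to the PV
    along = rx * ux + ry * uy
    cx, cy = rx - along * ux, ry - along * uy
    d0 = math.hypot(cx, cy)
    # assumed convention: positive if the closest-approach point lies in
    # the hemisphere of the jet direction, negative otherwise
    sign = 1.0 if (cx * jet_dir[0] + cy * jet_dir[1]) >= 0 else -1.0
    return sign * d0


jet_axis = (1.0, 0.0)
# track emerging from a vertex displaced 2 mm along the jet direction
displaced = signed_ip_2d((0.0, 0.0), (2.0, 0.0), (1.0, 1.0), jet_axis)
# track passing exactly through the primary vertex
prompt = signed_ip_2d((0.0, 0.0), (0.0, 0.0), (1.0, 1.0), jet_axis)
print(f"displaced: {displaced:.3f} mm, prompt: {prompt:.3f} mm")
```

Tracks from b hadron decays tend to have positive values under this convention, while mismeasured prompt tracks scatter symmetrically around zero.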
Single b-tagging
For single b-tagging, consider the following image:
For more information, read [5].
Double b-tagging
For double b-tagging, consider the following image:
For this, we can also take advantage of the so-called jet substructure. For more information, read [2, 6].
import uproot
f = uproot.open('root://eospublic.cern.ch//eos/opendata/cms/datascience/HiggsToBBNtupleProducerTool/HiggsToBBNTuple_HiggsToBB_QCD_RunII_13TeV_MC/train/ntuple_merged_10.root')
tree = f['deepntuplizer/tree']
labels = tree.arrays(['label_QCD_b',
'label_QCD_bb',
'label_QCD_c',
'label_QCD_cc',
'label_QCD_others',
'label_H_bb',
'sample_isQCD'],
entry_stop=20000,
library='np')
# label QCD: require the sample to be QCD and any of the QCD flavors
label_QCD = labels['sample_isQCD'] * (labels['label_QCD_b'] + \
labels['label_QCD_bb'] + \
labels['label_QCD_c'] + \
labels['label_QCD_cc'] + \
labels['label_QCD_others'])
# label Hbb
label_Hbb = labels['label_H_bb']
Let’s load a sampling of track, secondary vertex, and jet features.
track_features = tree.arrays(['track_pt',
'track_dxy',
'track_dxysig',
'track_dz',
'track_dzsig',
'trackBTag_Sip2dSig',
'trackBTag_Sip2dVal',
'trackBTag_Sip3dSig',
'trackBTag_Sip3dVal',
'trackBTag_PtRatio',
'trackBTag_PParRatio'],
entry_stop=20000,
library='ak')
sv_features = tree.arrays(['sv_pt',
'sv_mass'],
entry_stop=20000,
library='ak')
jet_features = tree.arrays(['fj_pt',
'fj_sdmass',
'fj_mass',
'fj_tau21',
'fj_jetNTracks',
'fj_trackSipdSig_0',
'fj_trackSipdSig_1'],
entry_stop=20000,
library='np')
Visualize Separation of Track Features
Let’s visualize the separation by plotting the signal and background distributions of several track features that may be important.
Number of tracks
Maximum relative track \(p_T\)
Maximum signed 3D impact parameter value
Maximum signed 3D impact parameter significance: the value above divided by the estimated uncertainty of the measurement. This tells us how “significant” the value is, i.e., how far it is from an impact parameter of 0 in units of its uncertainty.
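The definition of the significance can be illustrated with a minimal sketch. The two tracks below are hypothetical, with made-up values in cm; they share the same impact parameter but differ in measurement uncertainty, so only one of them is convincingly displaced.

```python
def ip_significance(value, uncertainty):
    """Impact parameter significance: measured value over its estimated uncertainty."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    return value / uncertainty


# Two hypothetical tracks with the same impact parameter (in cm)
# but different measurement uncertainties.
well_measured = ip_significance(0.02, 0.002)
poorly_measured = ip_significance(0.02, 0.02)
print(f"well measured: {well_measured:.1f}, poorly measured: {poorly_measured:.1f}")
```

The first track is about ten standard deviations away from zero, while the second is compatible with a prompt track, even though the raw impact parameter values are identical.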
import matplotlib.pyplot as plt
import numpy as np
import awkward as ak
# number of tracks
plt.figure()
plt.hist(ak.num(track_features['track_pt']), weights=label_QCD, bins=np.linspace(0,80,81), density=True, alpha=0.7, label='QCD')
plt.hist(ak.num(track_features['track_pt']), weights=label_Hbb, bins=np.linspace(0,80,81), density=True, alpha=0.7, label='H(bb)')
plt.xlabel('Number of tracks')
plt.ylabel('Fraction of jets')
plt.legend()
# max. relative track pt
plt.figure()
plt.hist(ak.max(track_features['track_pt'], axis=-1)/jet_features['fj_pt'], weights=label_QCD, bins=np.linspace(0,0.5,51), density=True,alpha=0.7,label='QCD')
plt.hist(ak.max(track_features['track_pt'], axis=-1)/jet_features['fj_pt'], weights=label_Hbb, bins=np.linspace(0,0.5,51), density=True,alpha=0.7,label='H(bb)')
plt.xlabel(r'Maximum relative track $p_{T}$')
plt.ylabel('Fraction of jets')
plt.legend()
# maximum signed 3D impact parameter value
plt.figure()
plt.hist(ak.max(track_features['trackBTag_Sip3dVal'], axis=-1), weights=label_QCD, bins=np.linspace(-2,40,51), density=True, alpha=0.7, label='QCD')
plt.hist(ak.max(track_features['trackBTag_Sip3dVal'], axis=-1), weights=label_Hbb, bins=np.linspace(-2,40,51), density=True, alpha=0.7, label='H(bb)')
plt.xlabel('Maximum signed 3D impact parameter value')
plt.ylabel('Fraction of jets')
plt.legend()
# maximum signed 3D impact parameter significance
plt.figure()
plt.hist(ak.max(track_features['trackBTag_Sip3dSig'], axis=-1), weights=label_QCD, bins=np.linspace(-2,200,51), density=True, alpha=0.7, label='QCD')
plt.hist(ak.max(track_features['trackBTag_Sip3dSig'], axis=-1), weights=label_Hbb, bins=np.linspace(-2,200,51), density=True, alpha=0.7, label='H(bb)')
plt.xlabel('Maximum signed 3D impact parameter significance')
plt.ylabel('Fraction of jets')
plt.legend()
plt.show()




Visualize Separation of SV Features
Let’s visualize the separation by plotting the signal and background distributions of several secondary vertex (SV) features that may be important.
Number of secondary vertices
Maximum relative secondary vertex \(p_T\)
Maximum relative secondary vertex mass
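The three SV quantities listed above can be sketched on toy data as follows. This is a dependency-free illustration using plain nested lists in place of the jagged arrays, with made-up values. It assumes that “relative” means divided by the jet \(p_T\) and the jet mass respectively, by analogy with the relative track \(p_T\) plotted earlier; the function name is hypothetical.

```python
def sv_summary(sv_pt, sv_mass, jet_pt, jet_mass):
    """Per-jet SV summaries from jagged inputs (one inner list per jet)."""
    n_sv = [len(pts) for pts in sv_pt]
    # jets without any SV have no maximum; use None as a placeholder
    max_rel_pt = [max(pts) / jpt if pts else None
                  for pts, jpt in zip(sv_pt, jet_pt)]
    max_rel_mass = [max(ms) / jm if ms else None
                    for ms, jm in zip(sv_mass, jet_mass)]
    return n_sv, max_rel_pt, max_rel_mass


# Toy inputs (hypothetical values in GeV): three jets with 2, 0, and 1 SVs
toy_sv_pt = [[40.0, 25.0], [], [10.0]]
toy_sv_mass = [[3.0, 1.5], [], [0.8]]
toy_jet_pt = [400.0, 350.0, 500.0]
toy_jet_mass = [120.0, 60.0, 80.0]

n_sv, max_rel_pt, max_rel_mass = sv_summary(
    toy_sv_pt, toy_sv_mass, toy_jet_pt, toy_jet_mass)
print(n_sv, max_rel_pt, max_rel_mass)
```

On the arrays loaded earlier, the analogous operations would presumably follow the track plots: `ak.num(sv_features['sv_pt'])` for the SV count and `ak.max(sv_features['sv_pt'], axis=-1) / jet_features['fj_pt']` for the maximum relative \(p_T\), histogrammed with `label_QCD` and `label_Hbb` as weights.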
ROC Curves
ROC curves can tell us how well each of these features discriminates between signal and background.
from sklearn.metrics import roc_curve, auc
disc = np.nan_to_num(ak.num(track_features['track_pt'], axis=-1).to_numpy(allow_missing=True))
fpr, tpr, threshold = roc_curve(label_Hbb, disc)
# plot ROC curve
plt.figure()
plt.plot(fpr, tpr, lw=2.5, label="AUC = {:.1f}%".format(auc(fpr,tpr)*100))
plt.xlabel(r'False positive rate')
plt.ylabel(r'True positive rate')
#plt.semilogy()
plt.ylim(0,1)
plt.xlim(0,1)
plt.plot([0, 1], [0, 1], lw=2.5, label='Random, AUC = 50.0%')
plt.grid(True)
plt.legend(loc='upper left')
plt.show()
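To see what `roc_curve` and `auc` compute, here is a minimal from-scratch sketch on toy data. It is not scikit-learn's implementation, just the underlying idea: sort the jets by decreasing discriminant value, lower a threshold one distinct score at a time, and record the fraction of background (false positive rate) and signal (true positive rate) passing each threshold. The function names and toy values are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering a threshold over the scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # sort jets by decreasing discriminant value
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # only emit a point once all jets tied at this score are counted
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def area_under_curve(x, y):
    """Trapezoidal area under the piecewise-linear curve through (x, y)."""
    return sum((x[k + 1] - x[k]) * (y[k + 1] + y[k]) / 2
               for k in range(len(x) - 1))


# Toy example: 1 = signal, 0 = background
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = area_under_curve(toy_fpr, toy_tpr)
print(f"AUC = {toy_auc * 100:.1f}%")
```

In this toy example, 8 of the 9 signal-background pairs are ordered correctly by the score, which matches the area of 8/9. An AUC of 50% corresponds to a discriminant with no separation power.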

Other engineered features
For a full description of the features, see http://opendata.cern.ch/record/12102
Here is a selection:
| Data variable | Type | Description |
|---|---|---|
| | UInt_t | Event number |
| | Float_t | Number of reconstructed primary vertices (PVs) |
| | Float_t | True mean number of the Poisson distribution for this event from which the number of interactions in each bunch crossing has been sampled |
| | Float_t | Median density (in GeV/A) of pile-up contamination per event; computed from all PF candidates of the event |
| | Int_t | Boolean that is 1 if the simulated sample corresponds to QCD multijet production |
| | Int_t | Boolean that is 1 if a Higgs boson is matched and at least two b quarks are found within the AK8 jet |
| | Int_t | Boolean that is 1 if no resonances are matched and at least two b quarks are found within the AK8 jet |
| | Int_t | Boolean that is 1 if no resonances are matched and only one b quark is found within the AK8 jet |
| | Int_t | Boolean that is 1 if no resonances are matched and at least two c quarks are found within the AK8 jet |
| | Int_t | Boolean that is 1 if no resonances are matched and only one c quark is found within the AK8 jet |
| | Int_t | Boolean that is 1 if no resonances are matched and no b or c quarks are found within the AK8 jet |
| | Float_t | Double-b tagging discriminant based on a boosted decision tree calculated for the AK8 jet (see CMS-BTV-16-002) |
| | Float_t | Pseudorapidity η of the AK8 jet |
| | Float_t | Pseudorapidity η of the generator-level, matched heavy particle: H, W, Z, top, etc. (default = -999) |
| | Float_t | Transverse momentum of the generator-level, geometrically matched heavy particle: H, W, Z, t, etc. (default = -999) |
| | Int_t | Boolean that is 1 if two or more b hadrons are clustered within the AK8 jet (see SWGuideBTagMCTools) |
| | Int_t | Boolean that is 1 if fewer than two b hadrons are clustered within the AK8 jet (see SWGuideBTagMCTools) |
| | Int_t | Number of b hadrons that are clustered within the AK8 jet (see SWGuideBTagMCTools) |
| | Int_t | Number of c hadrons that are clustered within the AK8 jet (see SWGuideBTagMCTools) |
| | Int_t | Boolean that is 1 if a generator-level Higgs boson and its daughters are geometrically matched to the AK8 jet |
| | Int_t | Boolean that is 1 if a generator-level top quark and its daughters are geometrically matched to the AK8 jet |
| | Int_t | Boolean that is 1 if a generator-level W boson and its daughters are geometrically matched to the AK8 jet |
| | Int_t | Boolean that is 1 if a generator-level Z boson and its daughters are geometrically matched to the AK8 jet |
| | Int_t | Boolean that is 1 if none of the above matching criteria are satisfied (H, top, W, Z) |
| | Int_t | Integer label: |
| | Int_t | Alternative integer label from the CMS Jet/MET and Resolution (JMAR) group: |
| | Int_t | Alternative (legacy) integer label: |
| | Float_t | Number of tracks associated with the AK8 jet |
| | Float_t | Number of SVs associated with the AK8 jet (∆R < 0.7) |
| | Float_t | Number of soft drop subjets in the AK8 jet (up to 2) |
| | Float_t | Ungroomed mass of the AK8 jet |
| | Float_t | Azimuthal angle ϕ of the AK8 jet |
| | Float_t | Transverse momentum of the AK8 jet |
| | Float_t | N-subjettiness variable for a 1-prong jet hypothesis |
| | Float_t | N-subjettiness variable for a 2-prong jet hypothesis |
| | Float_t | N-subjettiness variable for a 3-prong jet hypothesis |
| | Float_t | N-subjettiness variable for 2-prong vs 1-prong jet discrimination (\(\tau_{21} = \tau_2/\tau_1\)) |
| | Float_t | N-subjettiness variable for 3-prong vs 2-prong jet discrimination (\(\tau_{32} = \tau_3/\tau_2\)) |
| | Float_t | Soft drop mass of the AK8 jet |
| | Float_t | Transverse momentum times the ∆R between the two soft drop subjets |
| | Float_t | Absolute relative difference between the transverse momenta of the two soft drop subjets |
| | Float_t | Fraction of second subjet transverse momentum times ∆R squared |
| | Float_t | First axis of the first subjet |
| | Float_t | Second axis of the first subjet |
| | Float_t | Combined secondary vertex (CSV) b-tagging discriminant for the first subjet |
| | Float_t | Pseudorapidity η of the first subjet |
| | Float_t | Mass of the first subjet |
| | Float_t | Particle multiplicity of the first subjet |
| | Float_t | Azimuthal angle ϕ of the first subjet |
| | Float_t | Transverse momentum of the first subjet |
| | Float_t | ptD variable, defined as the square root of the sum of squares of the transverse momenta of the subjet constituents divided by the scalar sum of their transverse momenta, for the first subjet (see CMS-PAS-JME-13-002) |
| | Float_t | First axis of the second subjet |
| | Float_t | Second axis of the second subjet |
| | Float_t | Combined secondary vertex (CSV) b-tagging discriminant for the second subjet |
| | Float_t | Pseudorapidity η of the second subjet |
| | Float_t | Mass of the second subjet |
| | Float_t | Particle multiplicity of the second subjet |
| | Float_t | Azimuthal angle ϕ of the second subjet |
| | Float_t | Transverse momentum of the second subjet |
| | Float_t | ptD variable, defined as the square root of the sum of squares of the transverse momenta of the subjet constituents divided by the scalar sum of their transverse momenta, for the second subjet (see CMS-PAS-JME-13-002) |
| | Float_t | z ratio variable as defined in CMS-BTV-16-002 |
| | Float_t | First largest track 3D signed impact parameter significance (see CMS-BTV-16-002) |
| | Float_t | Second largest track 3D signed impact parameter significance (see CMS-BTV-16-002) |
| | Float_t | Third largest track 3D signed impact parameter significance (see CMS-BTV-16-002) |
| | Float_t | Fourth largest track 3D signed impact parameter significance (see CMS-BTV-16-002) |
| | Float_t | First largest track 3D signed impact parameter significance associated to the first N-subjettiness axis |
| | Float_t | Second largest track 3D signed impact parameter significance associated to the first N-subjettiness axis |
| | Float_t | First largest track 3D signed impact parameter significance associated to the second N-subjettiness axis |
| | Float_t | Second largest track 3D signed impact parameter significance associated to the second N-subjettiness axis |
| | Float_t | Track 2D signed impact parameter significance of the first track lifting the combined invariant mass of the tracks above the c hadron threshold mass (1.5 GeV) |
| | Float_t | Track 2D signed impact parameter significance of the first track lifting the combined invariant mass of the tracks above the b hadron threshold mass (5.2 GeV) |
| | Float_t | Track 2D signed impact parameter significance of the second track lifting the combined invariant mass of the tracks above the b hadron threshold mass (5.2 GeV) |
| | Float_t | Smallest track pseudorapidity ∆η, relative to the jet axis, associated to the first N-subjettiness axis |
| | Float_t | Second smallest track pseudorapidity ∆η, relative to the jet axis, associated to the first N-subjettiness axis |
| | Float_t | Third smallest track pseudorapidity ∆η, relative to the jet axis, associated to the first N-subjettiness axis |
| | Float_t | Smallest track pseudorapidity ∆η, relative to the jet axis, associated to the second N-subjettiness axis |
| | Float_t | Second smallest track pseudorapidity ∆η, relative to the jet axis, associated to the second N-subjettiness axis |
| | Float_t | Third smallest track pseudorapidity ∆η, relative to the jet axis, associated to the second N-subjettiness axis |
| | Float_t | Total SV mass for the first N-subjettiness axis, defined as the invariant mass of all tracks from SVs associated with the first N-subjettiness axis |
| | Float_t | Total SV mass for the second N-subjettiness axis, defined as the invariant mass of all tracks from SVs associated with the second N-subjettiness axis |
| | Float_t | SV energy ratio for the first N-subjettiness axis, defined as the total energy of all SVs associated with the first N-subjettiness axis divided by the total energy of all the tracks associated with the AK8 jet that are consistent with the PV |
| | Float_t | SV energy ratio for the second N-subjettiness axis, defined as the total energy of all SVs associated with the second N-subjettiness axis divided by the total energy of all the tracks associated with the AK8 jet that are consistent with the PV |
| | Float_t | Transverse (2D) flight distance significance between the PV and the SV with the smallest uncertainty on the 3D flight distance associated to the first N-subjettiness axis |
| | Float_t | Transverse (2D) flight distance significance between the PV and the SV with the smallest uncertainty on the 3D flight distance associated to the second N-subjettiness axis |
| | Float_t | Pseudoangular distance ∆R between the first N-subjettiness axis and SV direction |
| | Int_t | Number of particle flow (PF) candidates associated to the AK8 jet with transverse momentum greater than 0.95 GeV |
| | Float_t | Number of particle flow (PF) candidates associated to the AK8 jet with transverse momentum greater than 0.95 GeV |
| | Int_t | PV association quality for the PF candidate: |
| | Float_t | Electric charge of the PF candidate |
| | Float_t | Pseudoangular distance ∆R between the PF candidate and the AK8 jet axis |
| | Float_t | Minimum pseudoangular distance ∆R between the associated SVs and the PF candidate |
| | Float_t | Pseudoangular distance ∆R between the PF candidate and the first soft drop subjet |
| | Float_t | Pseudoangular distance ∆R between the PF candidate and the second soft drop subjet |
| | Float_t | Transverse (2D) impact parameter of the PF candidate, defined as the distance of closest approach of the PF candidate trajectory to the beam line in the transverse plane to the beam |
| | Float_t | Transverse (2D) impact parameter significance of the PF candidate |
| | Float_t | Longitudinal impact parameter, defined as the distance of closest approach of the PF candidate trajectory to the PV projected on to the z direction |
| | Float_t | Longitudinal impact parameter significance of the PF candidate |
| | Float_t | Energy of the PF candidate divided by the energy of the AK8 jet |
| | Float_t | Pseudorapidity of the PF candidate relative to the AK8 jet axis |
| | Float_t | Azimuthal angular distance ∆ϕ between the PF candidate and the AK8 jet axis |
| | Float_t | Transverse momentum of the PF candidate divided by the transverse momentum of the AK8 jet |
| | Float_t | Integer indicating whether the PF candidate is consistent with the PV: |
| | Float_t | Fraction of energy of the PF candidate deposited in the hadron calorimeter |
| | Float_t | Boolean that is 1 if the PF candidate is classified as a charged hadron |
| | Float_t | Boolean that is 1 if the PF candidate is classified as an electron |
| | Float_t | Boolean that is 1 if the PF candidate is classified as a photon |
| | Float_t | Boolean that is 1 if the PF candidate is classified as a muon |
| | Float_t | Boolean that is 1 if the PF candidate is classified as a neutral hadron |
| | Float_t | Integer with information related to inner silicon tracker hits for the PF candidate: |
| | Float_t | Mass of the PF candidate |
| | Float_t | Pileup per-particle identification (PUPPI) weight indicating whether the PF candidate is pileup-like (0) or not (1) |
| | Int_t | Number of tracks associated with the AK8 jet |
| | Float_t | Number of tracks associated with the AK8 jet |
| | Float_t | Pseudoangular distance ∆R between the track and the AK8 jet axis |
| | Float_t | Pseudorapidity η of the track |
| | Float_t | Pseudorapidity ∆η of the track relative to the AK8 jet axis |
| | Float_t | Minimum track approach distance to the AK8 jet axis |
| | Float_t | Momentum of the track |
| | Float_t | Component of track momentum parallel to the AK8 jet axis |
| | Float_t | Component of track momentum parallel to the AK8 jet axis, normalized to the track momentum |
| | Float_t | Component of track momentum perpendicular to the AK8 jet axis, normalized to the track momentum |
| | Float_t | Component of track momentum perpendicular to the AK8 jet axis |
| | Float_t | Transverse (2D) signed impact parameter of the track |
| | Float_t | Transverse (2D) signed impact parameter significance of the track |
| | Float_t | 3D signed impact parameter significance of the track |
| | Float_t | 3D signed impact parameter of the track |
| | Float_t | PV association quality for the track: |
| | Float_t | Electric charge of the charged PF candidate |
| | Float_t | Pseudoangular distance (∆R) between the charged PF candidate and the AK8 jet axis |
| | Float_t | Track covariance matrix entry (eta, eta) |
| | Float_t | Track covariance matrix entry (lambda, dz) |
| | Float_t | Track covariance matrix entry (phi, phi) |
| | Float_t | Track covariance matrix entry (phi, xy) |
| | Float_t | Track covariance matrix entry (pT, pT) |
| | Float_t | Track covariance matrix entry (dxy, dxy) |
| | Float_t | Track covariance matrix entry (dxy, dz) |
| | Float_t | Track covariance matrix entry (dz, dz) |
| | Float_t | Minimum pseudoangular distance ∆R between the associated SVs and the charged PF candidate |
| | Float_t | Pseudoangular distance ∆R between the charged PF candidate and the first soft drop subjet |
| | Float_t | Pseudoangular distance ∆R between the charged PF candidate and the second soft drop subjet |
| | Float_t | Transverse (2D) impact parameter of the track, defined as the distance of closest approach of the track trajectory to the beam line in the transverse plane to the beam |
| | Float_t | Transverse (2D) impact parameter significance of the track |
| | Float_t | Longitudinal impact parameter, defined as the distance of closest approach of the track trajectory to the PV projected on to the z direction |
| | Float_t | Longitudinal impact parameter significance of the track |
| | Float_t | Energy of the charged PF candidate divided by the energy of the AK8 jet |
| | Float_t | Pseudorapidity ∆η of the track relative to the jet axis |
| | Float_t | Integer indicating whether the charged PF candidate is consistent with the PV: |
| | Float_t | Boolean that is 1 if the charged PF candidate is classified as a charged hadron |
| | Float_t | Boolean that is 1 if the charged PF candidate is classified as an electron |
| | Float_t | Boolean that is 1 if the charged PF candidate is classified as a muon |
| | Float_t | Integer with information related to inner silicon tracker hits for the track: |
| | Float_t | Mass of the charged PF candidate |
| | Float_t | Normalized χ² of the track fit |
| | Float_t | Azimuthal angular distance ∆ϕ between the charged PF candidate and the AK8 jet axis |
| | Float_t | Transverse momentum of the charged PF candidate |
| | Float_t | Transverse momentum of the charged PF candidate divided by the transverse momentum of the AK8 jet |
| | Float_t | Pileup per-particle identification (PUPPI) weight indicating whether the PF candidate is pileup-like (0) or not (1) |
| | Float_t | Track quality: |
| | Int_t | Number of secondary vertices (SV) associated with the AK8 jet (∆R < 0.8) |
| | Float_t | Number of secondary vertices (SV) associated with the AK8 jet (∆R < 0.8) |
| | Float_t | χ² of the vertex fit |
| | Float_t | Number of degrees of freedom of the vertex fit |
| | Float_t | χ² divided by the number of degrees of freedom for the vertex fit |
| | Float_t | Cosine of the angle cos(θ) between the SV and the PV |
| | Float_t | 3D flight distance of the SV |
| | Float_t | 3D flight distance uncertainty of the SV |
| | Float_t | 3D flight distance significance of the SV |
| | Float_t | Transverse (2D) flight distance of the SV |
| | Float_t | Transverse (2D) flight distance uncertainty of the SV |
| | Float_t | Transverse (2D) flight distance significance of the SV |
| | Float_t | Pseudoangular distance ∆R between the SV and the AK8 jet |
| | Float_t | Energy of the SV divided by the energy of the AK8 jet |
| | Float_t | Pseudorapidity ∆η of the SV relative to the AK8 jet axis |
| | Float_t | Mass of the SV |
| | Float_t | Number of tracks associated with the SV |
| | Float_t | Azimuthal angular distance ∆ϕ of the SV relative to the jet axis |
| | Float_t | Transverse momentum of the SV |
| | Float_t | Transverse momentum of the SV divided by the transverse momentum of the AK8 jet |
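Two of the engineered jet-shape quantities described in the table are simple enough to sketch directly: the ptD variable (square root of the sum of squared constituent transverse momenta divided by their scalar sum) and the N-subjettiness ratio for 2-prong vs 1-prong discrimination. The snippet below is only an illustration on made-up numbers; the function names are hypothetical, and this is not the code used to produce the ntuples.

```python
import math


def pt_d(constituent_pts):
    """ptD = sqrt(sum of pT^2) / (sum of pT) over the constituents."""
    total = sum(constituent_pts)
    if total == 0:
        return 0.0
    return math.sqrt(sum(pt * pt for pt in constituent_pts)) / total


def tau_ratio(tau_numerator, tau_denominator):
    """N-subjettiness ratio, e.g. tau2 / tau1 for 2-prong vs 1-prong discrimination."""
    return tau_numerator / tau_denominator


# One hard constituent carries all the momentum: ptD is 1
single = pt_d([10.0])
# Momentum shared equally among four constituents: ptD is 1/sqrt(4)
shared = pt_d([5.0, 5.0, 5.0, 5.0])
# Hypothetical N-subjettiness values for a 2-prong-like jet
tau21 = tau_ratio(0.1, 0.4)
print(single, shared, tau21)
```

ptD approaches 1 when a single constituent dominates and decreases as the momentum is shared among many constituents. A small ratio of the 2-prong to 1-prong N-subjettiness indicates a jet that is better described by two prongs than by one, as expected for H(bb) jets.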