{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hands-on 02: Tabular data and BDTs: Classifying LHC collisions\n", "\n", "### Goal\n", "\n", "Discriminate $H\\to\\tau\\tau$ (signal) from background such as $t\\bar{t}$.\n", "\n", "Signal:\n", "\n", "![](images/Htautau.png)\n", "\n", "Example background:\n", "\n", "![](images/ttbar.png)\n", "\n", "In a real detector, the signal looks like:\n", "\n", "\n", "\n", "### Boosted decision tree library: XGBoost\n", "We'll use the python API for the [XGBoost (eXtreme Gradient Boosting) library](https://github.com/dmlc/xgboost).\n", "\n", "### Data\n", "[ATLAS](https://home.cern/about/experiments/atlas) hosted a [Kaggle](https://www.kaggle.com/) competition for identifying $H\\to\\tau\\tau$ events, [the Higgs Boson Machine Learning Challenge](https://www.kaggle.com/c/higgs-boson/data). \n", "The training data for this event contains 250,000 labeled, simulated ATLAS events in `csv` format described [here](https://www.kaggle.com/c/higgs-boson/data) and [here](https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf).\n", "You can download it yourself, but we will only play with a small subset (10k events).\n", "\n", "### Data handling\n", "We'll use [Pandas](http://pandas.pydata.org/).\n", "\n", "### Installing XGBoost\n", "Assuming you have Python, NumPy, Matplotlib, and Pandas installed, you may need to install XGBoost if it's not already installed.\n", "```bash\n", "pip install xgboost --user\n", "```\n", "\n", "### Links\n", "A lot of this was borrowed from other sources. \n", "These sources and other good places for information about XGBoost and BDTs in general are here:\n", " * XGBoost demo: [Example of how to use XGBoost Python module to run Kaggle Higgs competition](https://github.com/dmlc/xgboost/tree/master/demo/kaggle-higgs)\n", " * Blog post by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do?](https://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)\n", " * XGBoost Kaggle Higgs solution: https://github.com/hetong007/higgsml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XGBoost Tutorial" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import xgboost as xgb\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data\n", "#### Load data\n", "First, load in the data and look at it. \n", "We will download a 10k event subsample of the Kaggle training data. \n", "Then we'll put it in the right format for XGBoost." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! 
mkdir -p data && wget https://raw.githubusercontent.com/k-woodruff/bdt-tutorial/master/data/training_10k.csv -O data/training_10k.csv" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"data/training_10k.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what the data looks like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Size of data: {}\".format(data.shape))\n", "print(\"Number of events: {}\".format(data.shape[0]))\n", "print(\"Number of columns: {}\".format(data.shape[1]))\n", "\n", "print(\"\\nList of features in dataset:\")\n", "for col in data.columns:\n", "    print(col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detailed description of features\n", "\n", "The prefix-less variables `EventId`, `Weight`, and `Label` have a special role and should not be used as input to the classifier. \n", "The variables prefixed with PRI (for PRImitives) are \"raw\" quantities about the bunch collision as measured by the detector, essentially the momenta of particles. \n", "Variables prefixed with DER (for DERived) are quantities computed from the primitive features. \n", "These quantities were selected by the physicists of ATLAS in the reference document either to select regions of interest or as features for the Boosted Decision Trees used in this analysis. \n", "In addition:\n", " * Variables are floating point unless specified otherwise.\n", " * All azimuthal $\\phi$ angles are in radians, in the $[-\\pi, +\\pi]$ range.\n", " * Energy, mass, and momentum are all in GeV.\n", " * All other variables are unitless.\n", " * Variables are indicated as \"may be undefined\" when they are meaningless or cannot be computed; in this case, their value is -999.0, which is outside the normal range of all variables.\n", " * The mass of particles has not been provided, as it can safely be neglected for the Challenge.\n", "\n", "Features (parenthetical numbers refer to equations in the [official documentation](https://higgsml.lal.in2p3.fr/files/2014/04/documentation_v1.8.pdf)):\n", "- `EventId`: A unique integer identifier of the event. 
\n", " Not to be used as a feature.\n", "- `DER_mass_MMC`: The estimated mass $m_H$ of the Higgs boson candidate, obtained through a probabilistic phase space integration (may be undefined if the topology of the event is too far from the expected topology)\n", "- `DER_mass_transverse_met_lep`: The transverse mass (21) between the missing transverse energy and the lepton.\n", "- `DER_mass_vis`: The invariant mass (20) of the hadronic tau and the lepton.\n", "- `DER_pt_h`: The modulus (19) of the vector sum of the transverse momentum of the hadronic tau, the lepton, and the missing transverse energy vector.\n", "- `DER_deltaeta_jet_jet`: The absolute value of the pseudorapidity separation (22) between the\n", "two jets (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `DER_mass_jet_jet`: The invariant mass (20) of the two jets (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `DER_prodeta_jet_jet`: The product of the pseudorapidities of the two jets (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `DER_deltar_tau_lep`: The R separation (23) between the hadronic tau and the lepton.\n", "- `DER_pt_tot`: The modulus (19) of the vector sum of the missing transverse momenta and the transverse momenta of the hadronic tau, the lepton, the leading jet (if `PRI_jet_num` $\\geq 1$) and the subleading jet (if `PRI_jet_num` $= 2$) (but not of any additional jets).\n", "- `DER_sum_pt`: The sum of the moduli (19) of the transverse momenta of the hadronic tau, the lepton, the leading jet (if `PRI_jet_num` $\\geq 1$) and the subleading jet (if `PRI_jet_num` $= 2$) and the other jets (if `PRI_jet_num` $= 3$).\n", "- `DER_pt_ratio_lep_tau`: The ratio of the transverse momenta of the lepton and the hadronic tau.\n", "- `DER_met_phi_centrality`: The centrality of the azimuthal angle of the missing transverse energy vector w.r.t. the hadronic tau and the lepton $C=\\frac{A+B}{A^2 + B^2}$ where $A = \\sin(\\phi_\\mathrm{met} − \\phi_\\mathrm{lep})$, $B = \\sin(\\phi_\\mathrm{had} − \\phi_\\mathrm{met})$, and $\\phi_\\mathrm{met}$, $\\phi_\\mathrm{lep}$, and $\\phi_\\mathrm{had}$ are the azimuthal angles of the missing transverse energy vector, the lepton, and the hadronic tau, respectively. \n", " The centrality is $\\sqrt{2}$ if the missing transverse energy vector $\\vec{E}_\\mathrm{T}^\\mathrm{miss}$ is on the bisector of the transverse momenta of the lepton and the hadronic tau. \n", " It decreases to 1 if $\\vec{E}_\\mathrm{T}^\\mathrm{miss}$ is collinear with one of these vectors and it decreases further to $-\\sqrt{2}$ when $\\vec{E}_\\mathrm{T}^\\mathrm{miss}$ is exactly opposite to the bisector.\n", "- `DER_lep_eta_centrality`: The centrality of the pseudorapidity of the lepton w.r.t. the two jets\n", "(undefined if `PRI_jet_num` $\\leq 1$) $\\exp\\left [ \\frac{-4}{(\\eta_1 - \\eta_2)^2} \\left ( \\eta_\\mathrm{lep} - \\frac{\\eta_1+\\eta_2}{2} \\right)^2 \\right]$ where $\\eta_\\mathrm{lep}$ is the pseudorapidity of the lepton and $\\eta_1$ and $\\eta_2$ are the pseudorapidities of the two jets. \n", " The centrality is 1 when the lepton is on the bisector of the two jets, decreases to $1/e$ when it is collinear to one of the jets, and decreases further to zero at infinity.\n", "- `PRI_tau_pt`: The transverse momentum $\\sqrt{p_x^2 + p_y^2}$ of the hadronic tau. 
\n", "- `PRI_tau_eta`: The pseudorapidity $\\eta$ of the hadronic tau.\n", "- `PRI_tau_phi`: The azimuth angle $\\phi$ of the hadronic tau.\n", "- `PRI_lep_pt`: The transverse momentum $\\sqrt{p_x^2 + p_y^2}$ of the lepton (electron or muon).\n", "- `PRI_lep_eta`: The pseudorapidity $\\eta$ of the lepton.\n", "- `PRI_ep_phi`: The azimuth angle $\\phi$ of the lepton.\n", "- `PRI_met`: The missing transverse energy $E_\\mathrm{T}^\\mathrm{miss}$\n", "- `PRI_met_phi` The azimuth angle $\\phi$ of the missing transverse energy.\n", "- `PRI_met_sumet`: The total transverse energy in the detector.\n", "- `PRI_jet_num`: The number of jets (integer with value of 0, 1, 2 or 3; possible larger values have been capped at 3).\n", "- `PRI_jet_leading_pt`: The transverse momentum $\\sqrt{p_x^2 + p_y^2}$ of the leading jet, that is the jet with largest transverse momentum (undefined if `PRI_jet_num` $= 0$).\n", "- `PRI_jet_leading_eta`: The pseudorapidity $\\eta$ of the leading jet (undefined if `PRI_jet_num` $= 0$).\n", "- `PRI_jet_leading_phi`: The azimuth angle $\\phi$ of the leading jet (undefined if `PRI_jet_num` $= 0$). \n", "- `PRI_jet_subleading_pt`: The transverse momentum $\\sqrt{p_x^2 + p_y^2}$ of the leading jet, that is, the jet with second largest transverse momentum (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `PRI_jet_subleading_eta`: The pseudorapidity $\\eta$ of the subleading jet (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `PRI_jet_subleading_phi`: The azimuth angle $\\phi$ of the subleading jet (undefined if `PRI_jet_num` $\\leq 1$).\n", "- `PRI_jet_all_pt`: The scalar sum of the transverse momentum of all the jets of the events. \n", "- `Weight`: The event weight $w_i$, explained in Section 3.3. \n", " Not to be used as a feature. \n", " Not available in the test sample.\n", "- `Label`: The event label (string) $y_i \\in \\{s, b\\}$ ($s$ for signal, $b$ for background).\n", " Not to be used as a feature. \n", " Not available in the test sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set has 10,000 events with 33 columns each. \n", "The first column is an identifier, and should not be used as a feature. 
\n", "The last two columns `Weight` and `Label`, are the weights and labels from the simulation, and also should not be used as features (this information is all contained in the documentation).\n", "\n", "Now we can look at how many events are signal and background:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# look at column labels --- notice last one is \"Label\" and first is \"EventId\" also \"Weight\"\n", "print(f\"Number of signal events: {len(data[data.Label == 's'])}\")\n", "print(f\"Number of background events: {len(data[data.Label == 'b'])}\")\n", "print(f\"Fraction signal: {len(data[data.Label == 's'])/(len(data[data.Label == 's']) + len(data[data.Label == 'b']))}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Visualize the features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "\n", "\n", "fig, axs = plt.subplots(8, 4, figsize=(40, 80))\n", "\n", "for ix, ax in enumerate(axs.reshape(-1)):\n", " col = data.columns[ix + 1]\n", " if col == \"Weight\" or col == \"Label\":\n", " continue\n", " signal = data[col][data.Label == \"s\"].to_numpy()\n", " mask_signal = signal > -999\n", " background = data[col][data.Label == \"b\"].to_numpy()\n", " mask_background = background > -999\n", " xmin = min(np.min(background[mask_background]), np.min(signal[mask_signal]))\n", " xmax = max(np.max(background[mask_background]), np.max(signal[mask_signal]))\n", "\n", " ax.hist(signal[mask_signal], bins=np.linspace(xmin, xmax, 51), alpha=0.5, label=\"signal\", density=True)\n", " ax.hist(background[mask_background], bins=np.linspace(xmin, xmax, 51), alpha=0.5, label=\"background\", density=True)\n", "\n", " ax.set_xlabel(col, fontsize=40)\n", " ax.set_xlabel(col, fontsize=40)\n", " ax.tick_params(axis=\"both\", which=\"major\", labelsize=40)\n", " ax.legend(fontsize=40)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Format data:\n", "Now we should get the data into an XGBoost-friendly format. \n", "We can create DMatrix objects that will be used to train the BDT model. \n", "For now, we'll use all 30 of the features for training.\n", "\n", "First, we'll slice up the data into training and testing sets. \n", "Here, we take 20% for the test set, which is arbitrary.\n", "\n", "In this file, all samples are independent and ordered randomly, so we can just grab a chunk.\n", "Check out [Scikit-learn Cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) for dividing up samples responsibly.\n", "\n", "We can also change the data type of the `Label` column to the Pandas type `category` for easier use later." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"Label\"] = data.Label.astype(\"category\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train = data[:8000]\n", "data_test = data[8000:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check to make sure we did it right:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Number of training samples: {len(data_train)}\")\n", "print(f\"Number of testing samples: {len(data_test)}\")\n", "print()\n", "print(f\"Number of signal events in training set: {len(data_train[data_train.Label == 's'])}\")\n", "print(f\"Number of background events in training set: {len(data_train[data_train.Label == 'b'])}\")\n", "print(\n", " f\"Fraction signal: {len(data_train[data_train.Label == 's'])/(len(data_train[data_train.Label == 's']) + len(data_train[data_train.Label == 'b']))}\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `DMatrix` object takes as arguments:\n", "- `data`: the features\n", "- `label`: `1`/`0` or `True`/`False` for binary data (we have to convert our label to boolean from string `\"s\"`/`\"b\"`)\n", "- `missing`: how missing values are represented (here as `-999.0`)\n", "- `feature_names`: the names of all the features (optional)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_names = list(data.columns[1:-2]) # we skip the first and last two columns because they are the ID, weight, and label\n", "\n", "print(len(feature_names))\n", "\n", "train = xgb.DMatrix(\n", " data=data_train[feature_names], label=data_train.Label.cat.codes, missing=-999.0, feature_names=feature_names\n", ")\n", "test = xgb.DMatrix(\n", " data=data_test[feature_names], label=data_test.Label.cat.codes, missing=-999.0, feature_names=feature_names\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check if we did it right:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Number of training samples: {train.num_row()}\")\n", "print(f\"Number of testing samples: {test.num_row()}\")\n", "print()\n", "print(f\"Number of signal events in training set: {len(np.where(train.get_label())[0])}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Make the model\n", "#### Set hyperparameters:\n", "The XGBoost hyperparameters are defined [here](https://github.com/dmlc/xgboost/blob/master/doc/parameter.md). \n", "For a nice description of what they all mean, and tips on tuning them, see [this guide](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/).\n", "\n", "In general, the tunable parameters in XGBoost are the ones you would see in other gradient boosting libraries. Here, they fall into three categories:\n", "1. General parameters: e.g., which booster to use, number of threads. We won't mess with these here.\n", "2. Booster parameters: Tune the actual boosting, e.g., learning rate. These are the ones to optimize.\n", "3. Learning task parameters: Define the objective function and the evaluation metrics.\n", "\n", "Here, we will use the defaults for most parameters and just set a few to see how it's done. 
\n", "The parameters are passed in as a dictionary or list of pairs.\n", "\n", "Make the parameter dictionary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = {}\n", "\n", "param[\"seed\"] = 42 # set seed for reproducibility\n", "\n", "# Booster parameters\n", "param[\"eta\"] = 0.1 # learning rate\n", "param[\"max_depth\"] = 10 # maximum depth of a tree\n", "param[\"subsample\"] = 0.8 # fraction of events to train tree on\n", "param[\"colsample_bytree\"] = 0.8 # fraction of features to train tree on\n", "\n", "# Learning task parameters\n", "param[\"objective\"] = \"binary:logistic\" # objective function\n", "param[\"eval_metric\"] = \"error\" # evaluation metric for cross validation, note: last one is used for early stopping\n", "param = list(param.items())\n", "\n", "num_trees = 100 # number of trees to make" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we set the booster parameters. \n", "Again, we just chose a few here to experiment with. \n", "These are the parameters to tune to optimize your model. \n", "Generally, there is a trade off between speed and accuracy.\n", "1. `eta` is the learning rate. It determines how much to change the data weights after each boosting iteration. The default is 0.3.\n", "2. `max_depth` is the maximum depth of any tree. The default is 6.\n", "3. `subsample` is the fraction of events used to train each new tree. These events are randomly sampled each iteration from the whole sample set. The default is 1 (use every event for each tree).\n", "4. `colsample_bytree` is the fraction of features available to train each new tree. These features are randomly sampled each iteration from the whole feature set. The default is 1.\n", "\n", "Next, we set the learning objective to `binary:logistic`. So, we have two classes that we want to score from 0 to 1. \n", "The `eval_metric` parameters set what we want to monitor when doing cross validation. \n", "(We aren't doing cross validation in this example, but we should be!) \n", "If you want to watch more than one metric, ```param``` must be a list of pairs, instead of a dict. \n", "Otherwise, we would just keep resetting the same parameter.\n", "\n", "Last, we set the number of trees to 100. \n", "Usually, you would set this number high, and choose a cut off point based on the cross validation. \n", "The number of trees is the same as the number of iterations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now train!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "booster = xgb.train(param, train, num_boost_round=num_trees)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a trained model. The next step is to look at it's performance and try to improve the model if we need to. We can try to improve it by improving/adding features, adding more training data, using more boosting iterations, or tuning the hyperparameters (ideally in that order).\n", "\n", "#### Evaluate:\n", "First, let's look at how it does on the test set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(booster.eval(test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the evaluation metrics that we stored in the parameter set.\n", "\n", "It's pretty hard to interpret the performance of a classifier from a few number. So, let's look at the predictions for the entire test set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = booster.predict(test)\n", "labels = test.get_label().astype(bool)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# plot signal and background predictions, separately\n", "plt.figure()\n", "plt.hist(predictions[labels], bins=np.linspace(0, 1, 50), histtype=\"step\", color=\"firebrick\", label=\"Signal\")\n", "plt.hist(predictions[~labels], bins=np.linspace(0, 1, 50), histtype=\"step\", color=\"midnightblue\", label=\"Background\")\n", "# make the plot readable\n", "plt.xlabel(\"BDT prediction\", fontsize=12)\n", "plt.ylabel(\"Events\", fontsize=12)\n", "plt.legend(frameon=False)\n", "plt.xlim(0, 1)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_curve, auc, accuracy_score\n", "\n", "fpr, tpr, _ = roc_curve(labels, predictions)\n", "auc_score = auc(fpr, tpr)\n", "acc_score = accuracy_score(labels, predictions > 0.5)\n", "\n", "# plot TPR vs. FPR (ROC curve)\n", "plt.figure()\n", "plt.plot(fpr, tpr, color=\"blueviolet\", label=f\"AUC = {auc_score*100:.2f}%, acc. = {acc_score*100:.2f}%\")\n", "# make the plot readable\n", "plt.xlabel(\"False positive rate\", fontsize=12)\n", "plt.ylabel(\"True positive rate\", fontsize=12)\n", "plt.legend(frameon=False)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's also very informative to look at the importance of each feature. \n", "The \"F score\" is the number of times each feature is used to split the data over all the trees (times the weight of that tree).\n", "\n", "There is a built-in function in the XGBoost Python API to easily plot this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb.plot_importance(booster, grid=False)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature that was used the most was `DER_mass_MMC`.\n", "\n", "We can plot how this feature is distributed for the signal and background." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.hist(\n", " data_test.DER_mass_MMC[data_test.Label == \"s\"],\n", " bins=np.linspace(0, 400, 50),\n", " histtype=\"step\",\n", " color=\"firebrick\",\n", " label=\"Signal\",\n", ")\n", "plt.hist(\n", " data_test.DER_mass_MMC[data_test.Label == \"b\"],\n", " bins=np.linspace(0, 400, 50),\n", " histtype=\"step\",\n", " color=\"midnightblue\",\n", " label=\"Background\",\n", ")\n", "\n", "plt.xlim(0, 400)\n", "plt.xlabel(\"DER_mass_MMC\", fontsize=12)\n", "plt.ylabel(\"Events\", fontsize=12)\n", "plt.legend(frameon=False)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This variable is physically significant because it represents an estimate of the Higgs boson mass. \n", "For signal, it is expected to peak at 125 GeV.\n", "We can also plot it with one of the next most important features `DER_mass_transverse_met_lep`.\n", "Note: the exact ranking of features can depend on the random seed and other hyperparameters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "\n", "mask_b = np.array(data_test.Label == \"b\")\n", "mask_s = np.array(data_test.Label == \"s\")\n", "\n", "DER_mass_MMC = np.array(data_test.DER_mass_MMC)\n", "DER_mass_transverse_met_lep = np.array(data_test.DER_mass_transverse_met_lep)\n", "\n", "plt.plot(\n", " DER_mass_MMC[mask_b],\n", " DER_mass_transverse_met_lep[mask_b],\n", " \"o\",\n", " markersize=2,\n", " color=\"midnightblue\",\n", " markeredgewidth=0,\n", " alpha=0.8,\n", " label=\"Background\",\n", ")\n", "plt.plot(\n", " DER_mass_MMC[mask_s],\n", " DER_mass_transverse_met_lep[mask_s],\n", " \"o\",\n", " markersize=2,\n", " color=\"firebrick\",\n", " markeredgewidth=0,\n", " alpha=0.8,\n", " label=\"Signal\",\n", ")\n", "plt.xlim(0, 400)\n", "plt.ylim(0, 150)\n", "plt.xlabel(\"DER_mass_MMC\", fontsize=12)\n", "plt.ylabel(\"DER_mass_transverse_met_lep\", fontsize=12)\n", "plt.legend(frameon=False, numpoints=1, markerscale=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise: Retrain BDT with 2 features and plot decision boundary\n", "Repeat the steps above but manually set the features to just two of the most \"important\" ones: `feature_names = [\"DER_mass_MMC\", \"DER_mass_transverse_met_lep\"]`\n", "\n", "Then use the code below to plot the decision boundary in 2D. \n", "1) What do you notice about the shape of the decision boundary?\n", "2) Do you see any evidence of overfitting? How can you prove it? (Hint: consider the training data)\n", "\n", "```python\n", "from matplotlib.colors import ListedColormap\n", "\n", "# first get a mesh grid\n", "x_grid, y_grid = np.meshgrid(np.linspace(0, 400, 1000), np.linspace(0, 150, 1000))\n", "# convert grid into DMatrix\n", "matrix_grid = xgb.DMatrix(\n", " data=np.c_[x_grid.ravel(), y_grid.ravel()], missing=-999.0, feature_names=feature_names\n", ")\n", "# run prediction for every value in grid\n", "z_grid = booster.predict(matrix_grid)\n", "# reshape\n", "z_grid = z_grid.reshape(x_grid.shape)\n", "\n", "\n", "plt.figure()\n", "# plot decision boundary\n", "ax = plt.subplot(111)\n", "cm = ListedColormap([\"midnightblue\", \"firebrick\"])\n", "plt.contourf(x_grid, y_grid, z_grid, levels=[0, 0.5, 1], cmap=cm, alpha=0.25)\n", "# overlaid with test data points\n", "plt.plot(\n", " DER_mass_MMC[mask_b],\n", " DER_mass_transverse_met_lep[mask_b],\n", " \"o\",\n", " markersize=2,\n", " color=\"midnightblue\",\n", " markeredgewidth=0,\n", " alpha=0.8,\n", " label=\"Background\",\n", ")\n", "plt.plot(\n", " DER_mass_MMC[mask_s],\n", " DER_mass_transverse_met_lep[mask_s],\n", " \"o\",\n", " markersize=2,\n", " color=\"firebrick\",\n", " markeredgewidth=0,\n", " alpha=0.8,\n", " label=\"Signal\",\n", ")\n", "ax.set_ylim(0,150)\n", "ax.set_xlim(0,400)\n", "plt.xlabel(\"DER_mass_MMC\", fontsize=12)\n", "plt.ylabel(\"DER_mass_transverse_met_lep\", fontsize=12)\n", "plt.legend(frameon=False, numpoints=1, markerscale=2)\n", "plt.show()\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" }, "vscode": { "interpreter": { "hash": 
"d0ea348b636367bcdf67fd2d6d24251712b38670f61fdee14f28eb58fe74f081" } } }, "nbformat": 4, "nbformat_minor": 1 }