Esercizi-MLN/Labs/Lab 8/lab_8.ipynb
2024-11-28 23:01:35 +01:00

{
"cells": [
{
"cell_type": "markdown",
"id": "0f2ab399",
"metadata": {},
"source": [
"<center><b><font size=6>Lab-8</font></b></center>\n",
"<center><b><font size=6>Supervised learning with model selection, validation and hyper-parameter tuning</font></b></center>"
]
},
{
"cell_type": "markdown",
"id": "cce64de2",
"metadata": {},
"source": [
"### Objective: Applying the following techniques for supervised learning\n",
"1. **Validation curve** plots a performance score as a single hyper-parameter varies. Because the score must be measured on data not used for training, we need to perform model validation, with a validation set or with cross-validation. \n",
"2. **Grid search** analyzes the combination of different hyper-parameters. It traverses all the possible combinations of selected values for selected hyper-parameters, training a hypothesis and examining the corresponding performance, and finally outputs the best parameter setting. We need to perform model validation, with a validation set or with cross-validation. \n",
"3. **Learning curve** shows how performance depends on the number of training samples, and hence the minimum amount needed to produce acceptable performance. This helps to understand whether more data would likely improve the results, or whether less data is already enough. We need to perform model validation for the learning curve as well."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "936e1268",
"metadata": {},
"outputs": [],
"source": [
"# import needed python libraries\n",
"\n",
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"import numpy as np\n",
"import random\n",
"import math\n",
"import copy\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics import classification_report, accuracy_score, confusion_matrix\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.naive_bayes import GaussianNB\n",
"\n",
"from sklearn.model_selection import cross_validate, StratifiedShuffleSplit"
]
},
{
"cell_type": "markdown",
"id": "cdb08d90",
"metadata": {},
"source": [
"### 1. Tutorial - advanced classification"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f654c7f0",
"metadata": {},
"outputs": [],
"source": [
"# load dataset of IRIS flowers\n",
"from sklearn import datasets\n",
"iris_data = datasets.load_iris()\n",
"features_iris = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n",
"df_iris = pd.DataFrame(iris_data.data, columns = features_iris)\n",
"df_iris['type'] = 'setosa'\n",
"df_iris.loc[50:99, 'type'] = 'versicolor'\n",
"df_iris.loc[100:149, 'type'] = 'virginica'\n",
"\n",
"# note that we need to convert the type of flowers to numerical labels\n",
"df_iris['label'] = pd.Categorical(df_iris['type']).codes\n",
"\n",
"# data segmentation\n",
"# we still split the dataset into training and test sets with portions of 70% and 30% for now\n",
"# but in the later stage, you need to further split the training set into training and validation sets\n",
"# therefore, we do not name training set as X_train and y_train, but X and y\n",
"X, X_test, y, y_test = train_test_split(\n",
" df_iris[features_iris], \n",
" df_iris['label'], \n",
" stratify = df_iris['label'], \n",
" train_size = 0.7, \n",
" random_state = 15\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2149c055",
"metadata": {},
"source": [
"#### Two approaches for model validation\n",
"The test set is always excluded during model development, so to support model selection and tuning we need a validation set that measures performance as an intermediate outcome. However, ML models are sensitive to the data, so the outcome may depend too much on one particular segmentation if you always stick to the same training and validation sets. Therefore, to better evaluate the model, we can repeat the process multiple times using different data to train and validate the model, and then average the metrics to obtain an overall estimate of performance that depends less on any single split (but notice that the hypothesis obtained in each repetition might be different).\n",
"\n",
"There are multiple ways, and here we introduce two of them.\n",
"\n",
"1. k-fold cross-validation: \n",
"    - After setting apart the test set, split the remaining data into k folds (k blocks). \n",
"    - Select one fold for validation and train on all the remaining folds. \n",
"    - Repeat the process k times, each time with a different validation fold. \n",
"    - In each trial, record the metrics; at the end, compute their min, max and mean to evaluate the model performance.\n",
"2. Random stratified sampling:\n",
"    - After setting apart the test set, randomly split the remaining data in a stratified way (usually according to the label) into training and validation sets with a fixed proportion (e.g., 70% and 30%).\n",
"    - Repeat the random split multiple times. Each trial yields a somewhat different data segmentation. \n",
"    - In each trial, record the metrics; at the end, compute their min, max and mean to evaluate the model performance.\n",
"\n",
"<center><img src=\"validation.png\" width=\"vp\"/></center>"
]
},
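{
"cell_type": "markdown",
"id": "sketch-validation-approaches-md",
"metadata": {},
"source": [
"The two approaches above can be sketched with ``cross_validate`` by swapping the splitter passed as ``cv``. This is a minimal sketch rather than part of the original lab: the decision tree, the use of ``StratifiedKFold`` (a stratified variant of plain k-fold), the 10 repetitions of the random split and the fixed seeds are all illustrative assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-validation-approaches-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not part of the original lab): both validation approaches via cross_validate.\n",
"# Assumptions: decision tree model, StratifiedKFold as the k-fold variant, fixed seeds.\n",
"from sklearn.model_selection import StratifiedKFold\n",
"\n",
"dt_clf = DecisionTreeClassifier(random_state=0)\n",
"splitters = {\n",
"    'k-fold (k=5)': StratifiedKFold(n_splits=5, shuffle=True, random_state=0),\n",
"    'random stratified (10 x 70/30)': StratifiedShuffleSplit(n_splits=10, train_size=0.7, random_state=0),\n",
"}\n",
"for name, cv in splitters.items():\n",
"    # X and y are the 70% development split created above; the test set stays untouched\n",
"    val_scores = cross_validate(dt_clf, X, y, cv=cv, scoring='accuracy')['test_score']\n",
"    print(f'{name}: min={val_scores.min():.3f}, mean={val_scores.mean():.3f}, max={val_scores.max():.3f}')"
]
},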
{
"cell_type": "markdown",
"id": "2c9ec483",
"metadata": {},
"source": [
"### 1.1 Validation curve\n",
"One way to find the best value of a hyper-parameter of a model (keeping the other ones constant) is the validation curve. It evaluates the model performance on the validation set as we change the value of that hyper-parameter. Note that you can also simplify the work by using ``validation_curve()`` (<a href=\"https://scikit-learn.org/stable/modules/learning_curve.html\">documentation</a>) provided by sklearn, which does everything in one shot without a manually written for loop.\n",
"\n",
"Here we look for the best value of max_depth in a RandomForestClassifier between 2 and 9, using cross-validation with 5 folds and accuracy as the metric.\n",
"\n",
"Note: the function ``cross_validate`` automatically handles the data segmentation into training and validation sets, but it is possible to do it manually."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "5cae7d7e",
"metadata": {},
"outputs": [],
"source": [
"acc_train_means = []\n",
"acc_train_maxs = []\n",
"acc_train_mins = []\n",
"acc_val_means = []\n",
"acc_val_maxs = []\n",
"acc_val_mins = []\n",
"\n",
"# we choose the model RandomForestClassifier and change its hyper-parameter max_depth, using integer values from 2 to 9\n",
"for n in range(2, 10):\n",
"    rf_clf = RandomForestClassifier(max_depth=n)\n",
"    scores = cross_validate(rf_clf, X, y, cv=5, scoring='accuracy', return_train_score=True)\n",
"    acc_train_means.append(scores['train_score'].mean())\n",
"    acc_train_maxs.append(scores['train_score'].max())\n",
"    acc_train_mins.append(scores['train_score'].min())\n",
"    acc_val_means.append(scores['test_score'].mean())\n",
"    acc_val_maxs.append(scores['test_score'].max())\n",
"    acc_val_mins.append(scores['test_score'].min())"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "06528dcd",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure()\n",
"x = list(range(2, 10))\n",
"plt.plot(x, acc_train_means, color='tab:blue', label='train')\n",
"plt.fill_between(x, acc_train_mins, acc_train_maxs, alpha=0.5, color='tab:blue')\n",
"plt.plot(x, acc_val_means, color='tab:red', label='val')\n",
"plt.fill_between(x, acc_val_mins, acc_val_maxs, alpha=0.5, color='tab:red')\n",
"plt.xlabel('max depth')\n",
"plt.ylabel('accuracy')\n",
"plt.legend()\n",
"plt.title(\"Validation curve for max depth in RF\")\n",
"plt.ylim(0.7,1)\n",
"plt.show()"
]
},
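{
"cell_type": "markdown",
"id": "sketch-validation-curve-md",
"metadata": {},
"source": [
"For comparison, the same sweep can be written with sklearn's ``validation_curve()``. This is a minimal sketch under stated assumptions: the fixed ``random_state=0`` is added only for reproducibility and is not used in the manual loop above, so the numbers will not match it exactly."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-validation-curve-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not part of the original lab): max_depth sweep with validation_curve.\n",
"# Assumption: random_state=0 is an illustrative choice for reproducibility.\n",
"from sklearn.model_selection import validation_curve\n",
"\n",
"depths = list(range(2, 10))\n",
"train_scores, val_scores = validation_curve(\n",
"    RandomForestClassifier(random_state=0), X, y,\n",
"    param_name='max_depth', param_range=depths,\n",
"    cv=5, scoring='accuracy'\n",
")\n",
"# both arrays have shape (len(depths), 5): one row per depth, one column per fold\n",
"for depth, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):\n",
"    print(f'max depth: {depth}, mean accuracy: train - {tr:.3f} | val - {va:.3f}')"
]
},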
{
"cell_type": "markdown",
"id": "542195cd",
"metadata": {},
"source": [
"### 1.2 Grid search\n",
"Grid search exhaustively evaluates all combinations of candidates from a grid of hyper-parameter values. It is used to identify the best combination of ML hyper-parameters for a task, and can be used for either supervised or unsupervised learning problems. First, you select the hyper-parameters that you want to optimize and, for each of them, determine its range of values. Then, the grid search iterates over all the possible combinations of hyper-parameter values, building and evaluating a model on each of them, and finally chooses the one with the best performance. Note that, for simplicity, we do not use cross-validation here, but you can also use ``GridSearchCV`` (<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html\">documentation</a>) provided by sklearn, which runs a cross-validated grid search with very little code.\n",
"\n",
"Here we look for the best combination of values for max_depth and n_estimators in a RandomForestClassifier, using a single validation set. We use the accuracy metric to compare the performance.\n",
"\n",
"To use a single validation set, we further split the previous ``X`` and ``y`` into training and validation sets, so that they account for 50% and 20% of the original dataset, respectively."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "89b0827e",
"metadata": {},
"outputs": [],
"source": [
"# note that here we do not perform standardization for simplicity\n",
"X_train, X_val, y_train, y_val = train_test_split(\n",
" X, \n",
" y, \n",
" stratify = y, \n",
" train_size = 0.5/0.7, \n",
" random_state = 15\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b674f0b9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"#trees: 10, max depth: 2, accuracy: train - 0.9866666666666667 | val - 0.9\n",
"#trees: 10, max depth: 3, accuracy: train - 0.9866666666666667 | val - 0.9\n",
"#trees: 10, max depth: 4, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 10, max depth: 5, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 10, max depth: None, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 20, max depth: 2, accuracy: train - 0.9866666666666667 | val - 0.9\n",
"#trees: 20, max depth: 3, accuracy: train - 0.9866666666666667 | val - 0.9\n",
"#trees: 20, max depth: 4, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 20, max depth: 5, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 20, max depth: None, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 30, max depth: 2, accuracy: train - 0.9866666666666667 | val - 0.9333333333333333\n",
"#trees: 30, max depth: 3, accuracy: train - 0.9866666666666667 | val - 0.9333333333333333\n",
"#trees: 30, max depth: 4, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 30, max depth: 5, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 30, max depth: None, accuracy: train - 1.0 | val - 0.9\n",
"#trees: 40, max depth: 2, accuracy: train - 0.9866666666666667 | val - 0.9333333333333333\n",
"#trees: 40, max depth: 3, accuracy: train - 0.9866666666666667 | val - 0.9333333333333333\n",
"#trees: 40, max depth: 4, accuracy: train - 1.0 | val - 0.9333333333333333\n",
"#trees: 40, max depth: 5, accuracy: train - 1.0 | val - 0.9333333333333333\n",
"#trees: 40, max depth: None, accuracy: train - 1.0 | val - 0.9333333333333333\n"
]
}
],
"source": [
"# e.g., RF - parameters to be optimized:\n",
" # n_estimators: number of trees\n",
" # max_depth: the maximum depth of trees\n",
"\n",
"# iterate over all possible combinations\n",
"for n_estimators in range(10, 50, 10):\n",
"    for max_depth in [2, 3, 4, 5, None]:\n",
"        # initialize and fit a model per pair of parameters\n",
"        rf_tmp = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=12)\n",
"        rf_tmp.fit(X_train, y_train)\n",
"        # here we only refer to accuracy for simplicity\n",
"        acc_train = accuracy_score(y_train, rf_tmp.predict(X_train))\n",
"        acc_val = accuracy_score(y_val, rf_tmp.predict(X_val))\n",
"        print(f'#trees: {n_estimators}, max depth: {max_depth}, accuracy: train - {acc_train} | val - {acc_val}')"
]
},
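{
"cell_type": "markdown",
"id": "sketch-grid-search-cv-md",
"metadata": {},
"source": [
"As a point of comparison, the same grid can be explored with ``GridSearchCV``, which replaces the single validation set with cross-validation. This is a minimal sketch under stated assumptions: it uses 5 folds on ``X`` and ``y`` (the full 70% development split) rather than ``X_train`` and ``X_val``, so the scores are not directly comparable with the printout above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-grid-search-cv-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not part of the original lab): cross-validated grid search over the same grid.\n",
"# Assumption: cv=5 on X, y is an illustrative choice.\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid = {\n",
"    'n_estimators': [10, 20, 30, 40],\n",
"    'max_depth': [2, 3, 4, 5, None],\n",
"}\n",
"search = GridSearchCV(\n",
"    RandomForestClassifier(random_state=12),\n",
"    param_grid, cv=5, scoring='accuracy'\n",
")\n",
"search.fit(X, y)\n",
"# best_score_ is the mean validation accuracy of the best combination across the folds\n",
"print(f'best parameters: {search.best_params_}, mean val accuracy: {search.best_score_:.3f}')"
]
},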
{
"cell_type": "markdown",
"id": "534c3c92",
"metadata": {},
"source": [
"### 1.3 Learning curves\n",
"The learning curve shows how performance changes as we vary (i.e., reduce) the amount of available data (e.g., sample size, number of features).\n",
"\n",
"Normally, we need a sampling strategy to create subsets of the data, and these subsets should be representative. Available sampling strategies include arithmetic sampling, geometric sampling, random sampling, etc. \n",
"\n",
"You can start with a small portion of the original data to train the model and check the performance, and then keep increasing the portion, observing the performance with respect to the amount of data. Note that, for a given portion, the sampling selects data at random, which can affect performance. To take this into account, we can sample the data with multiple random states and compute the statistics of the performance (mean, max, min). \n",
"\n",
"Here we use a Decision Tree classifier with default hyper-parameters. Start from a subset of the training set containing 10% of the data and go up to 90%. Use random stratified sampling and repeat the computation 10 times for each subset size with a different random state. Use accuracy as the metric."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "307ba80b",
"metadata": {},
"outputs": [],
"source": [
"# here we only refer to accuracy for simplicity\n",
"info_accuracy_train = []\n",
"info_accuracy_val = []\n",
"\n",
"for train_size in np.arange(0.1, 1, 0.1):\n",
"    # select 10 distinct random states\n",
"    random_states = set()\n",
"    while len(random_states) < 10:\n",
"        n = random.randint(0, 1000000)\n",
"        random_states.add(n)\n",
"\n",
"    accuracies_train_tmp = []\n",
"    accuracies_val_tmp = []\n",
"\n",
"    # iterate over all random states with the same training size to obtain 10 performance values\n",
"    for random_state in random_states:\n",
"        # do a stratified random sampling from the training set\n",
"        sss = StratifiedShuffleSplit(n_splits=1, train_size=train_size, random_state=random_state)\n",
"        index_selected = list(sss.split(X_train, y_train))[0][0]\n",
"        X_train_selected = X_train.iloc[index_selected]\n",
"        y_train_selected = y_train.iloc[index_selected]\n",
"\n",
"        # initialize and fit a new DT model each time\n",
"        dt_tmp = DecisionTreeClassifier(random_state=0)\n",
"        dt_tmp.fit(X_train_selected, y_train_selected)\n",
"        accuracies_train_tmp.append(accuracy_score(y_train_selected, dt_tmp.predict(X_train_selected)))\n",
"        accuracies_val_tmp.append(accuracy_score(y_val, dt_tmp.predict(X_val)))\n",
"\n",
"    # calculate the statistics to generate overall performance\n",
"    accuracies_tmp = np.array(accuracies_train_tmp)\n",
"    mean_acc = accuracies_tmp.mean()\n",
"    min_acc = accuracies_tmp.min()\n",
"    max_acc = accuracies_tmp.max()\n",
"    info_accuracy_train.append((mean_acc, min_acc, max_acc))\n",
"\n",
"    accuracies_tmp = np.array(accuracies_val_tmp)\n",
"    mean_acc = accuracies_tmp.mean()\n",
"    min_acc = accuracies_tmp.min()\n",
"    max_acc = accuracies_tmp.max()\n",
"    info_accuracy_val.append((mean_acc, min_acc, max_acc))"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "1ff4a1d0",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "'yerr' must not contain negative values",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[16], line 12\u001b[0m\n\u001b[1;32m 1\u001b[0m plt\u001b[38;5;241m.\u001b[39mfigure()\n\u001b[1;32m 2\u001b[0m plt\u001b[38;5;241m.\u001b[39merrorbar(\n\u001b[1;32m 3\u001b[0m [train_size \u001b[38;5;28;01mfor\u001b[39;00m train_size \u001b[38;5;129;01min\u001b[39;00m np\u001b[38;5;241m.\u001b[39marange(\u001b[38;5;241m0.1\u001b[39m, \u001b[38;5;241m1.0\u001b[39m, \u001b[38;5;241m0.1\u001b[39m)], \u001b[38;5;66;03m# x-location of each error bar\u001b[39;00m\n\u001b[1;32m 4\u001b[0m [info[\u001b[38;5;241m0\u001b[39m] \u001b[38;5;28;01mfor\u001b[39;00m info \u001b[38;5;129;01min\u001b[39;00m info_accuracy_train], \u001b[38;5;66;03m# y-location of each error bar\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 10\u001b[0m color\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtab:blue\u001b[39m\u001b[38;5;124m'\u001b[39m, label\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mtrain\u001b[39m\u001b[38;5;124m'\u001b[39m\n\u001b[1;32m 11\u001b[0m )\n\u001b[0;32m---> 12\u001b[0m \u001b[43mplt\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43merrorbar\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 13\u001b[0m \u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[43mtrain_size\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mtrain_size\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mnp\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43marange\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m0.1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m1.0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m0.1\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# x-location of each error bar\u001b[39;49;00m\n\u001b[1;32m 14\u001b[0m \u001b[43m 
\u001b[49m\u001b[43m[\u001b[49m\u001b[43minfo\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo_accuracy_val\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;66;43;03m# y-location of each error bar\u001b[39;49;00m\n\u001b[1;32m 15\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;66;43;03m# the size of each error bar\u001b[39;49;00m\n\u001b[1;32m 16\u001b[0m \u001b[43m \u001b[49m\u001b[43myerr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43m[\u001b[49m\n\u001b[1;32m 17\u001b[0m \u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[43minfo\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241;43m-\u001b[39;49m\u001b[43minfo\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo_accuracy_val\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\n\u001b[1;32m 18\u001b[0m \u001b[43m \u001b[49m\u001b[43m[\u001b[49m\u001b[43minfo\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241;43m-\u001b[39;49m\u001b[43minfo\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m0\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43minfo_accuracy_val\u001b[49m\u001b[43m]\u001b[49m\n\u001b[1;32m 19\u001b[0m \u001b[43m \u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\n\u001b[1;32m 20\u001b[0m 
\u001b[43m \u001b[49m\u001b[43mcolor\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mtab:red\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mlabel\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mval\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\n\u001b[1;32m 21\u001b[0m \u001b[43m)\u001b[49m\n\u001b[1;32m 22\u001b[0m plt\u001b[38;5;241m.\u001b[39mgrid()\n\u001b[1;32m 23\u001b[0m plt\u001b[38;5;241m.\u001b[39mxlabel(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mPortion of training set\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
"File \u001b[0;32m/usr/lib64/python3.13/site-packages/matplotlib/pyplot.py:3246\u001b[0m, in \u001b[0;36merrorbar\u001b[0;34m(x, y, yerr, xerr, fmt, ecolor, elinewidth, capsize, barsabove, lolims, uplims, xlolims, xuplims, errorevery, capthick, data, **kwargs)\u001b[0m\n\u001b[1;32m 3225\u001b[0m \u001b[38;5;129m@_copy_docstring_and_deprecators\u001b[39m(Axes\u001b[38;5;241m.\u001b[39merrorbar)\n\u001b[1;32m 3226\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21merrorbar\u001b[39m(\n\u001b[1;32m 3227\u001b[0m x: \u001b[38;5;28mfloat\u001b[39m \u001b[38;5;241m|\u001b[39m ArrayLike,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 3244\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs,\n\u001b[1;32m 3245\u001b[0m ) \u001b[38;5;241m-\u001b[39m\u001b[38;5;241m>\u001b[39m ErrorbarContainer:\n\u001b[0;32m-> 3246\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mgca\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43merrorbar\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 3247\u001b[0m \u001b[43m \u001b[49m\u001b[43mx\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3248\u001b[0m \u001b[43m \u001b[49m\u001b[43my\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3249\u001b[0m \u001b[43m \u001b[49m\u001b[43myerr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43myerr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3250\u001b[0m \u001b[43m \u001b[49m\u001b[43mxerr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mxerr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3251\u001b[0m \u001b[43m \u001b[49m\u001b[43mfmt\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mfmt\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3252\u001b[0m \u001b[43m \u001b[49m\u001b[43mecolor\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mecolor\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3253\u001b[0m \u001b[43m 
\u001b[49m\u001b[43melinewidth\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43melinewidth\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3254\u001b[0m \u001b[43m \u001b[49m\u001b[43mcapsize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcapsize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3255\u001b[0m \u001b[43m \u001b[49m\u001b[43mbarsabove\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbarsabove\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3256\u001b[0m \u001b[43m \u001b[49m\u001b[43mlolims\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlolims\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3257\u001b[0m \u001b[43m \u001b[49m\u001b[43muplims\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muplims\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3258\u001b[0m \u001b[43m \u001b[49m\u001b[43mxlolims\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mxlolims\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3259\u001b[0m \u001b[43m \u001b[49m\u001b[43mxuplims\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mxuplims\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3260\u001b[0m \u001b[43m \u001b[49m\u001b[43merrorevery\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43merrorevery\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3261\u001b[0m \u001b[43m \u001b[49m\u001b[43mcapthick\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcapthick\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3262\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m{\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mdata\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m}\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mif\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mdata\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mis\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;129;43;01mnot\u001b[39;49;00m\u001b[43m 
\u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[38;5;28;43;01melse\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43m{\u001b[49m\u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3263\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 3264\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n",
"File \u001b[0;32m/usr/lib64/python3.13/site-packages/matplotlib/__init__.py:1476\u001b[0m, in \u001b[0;36m_preprocess_data.<locals>.inner\u001b[0;34m(ax, data, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1473\u001b[0m \u001b[38;5;129m@functools\u001b[39m\u001b[38;5;241m.\u001b[39mwraps(func)\n\u001b[1;32m 1474\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21minner\u001b[39m(ax, \u001b[38;5;241m*\u001b[39margs, data\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mNone\u001b[39;00m, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs):\n\u001b[1;32m 1475\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m data \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m-> 1476\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 1477\u001b[0m \u001b[43m \u001b[49m\u001b[43max\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1478\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;28;43mmap\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43msanitize_sequence\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43margs\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 1479\u001b[0m \u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43m{\u001b[49m\u001b[43mk\u001b[49m\u001b[43m:\u001b[49m\u001b[43m \u001b[49m\u001b[43msanitize_sequence\u001b[49m\u001b[43m(\u001b[49m\u001b[43mv\u001b[49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mfor\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mk\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mv\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;129;43;01min\u001b[39;49;00m\u001b[43m \u001b[49m\u001b[43mkwargs\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mitems\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m}\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1481\u001b[0m bound \u001b[38;5;241m=\u001b[39m 
new_sig\u001b[38;5;241m.\u001b[39mbind(ax, \u001b[38;5;241m*\u001b[39margs, \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mkwargs)\n\u001b[1;32m 1482\u001b[0m auto_label \u001b[38;5;241m=\u001b[39m (bound\u001b[38;5;241m.\u001b[39marguments\u001b[38;5;241m.\u001b[39mget(label_namer)\n\u001b[1;32m 1483\u001b[0m \u001b[38;5;129;01mor\u001b[39;00m bound\u001b[38;5;241m.\u001b[39mkwargs\u001b[38;5;241m.\u001b[39mget(label_namer))\n",
"File \u001b[0;32m/usr/lib64/python3.13/site-packages/matplotlib/axes/_axes.py:3743\u001b[0m, in \u001b[0;36mAxes.errorbar\u001b[0;34m(self, x, y, yerr, xerr, fmt, ecolor, elinewidth, capsize, barsabove, lolims, uplims, xlolims, xuplims, errorevery, capthick, **kwargs)\u001b[0m\n\u001b[1;32m 3740\u001b[0m res \u001b[38;5;241m=\u001b[39m np\u001b[38;5;241m.\u001b[39mzeros(err\u001b[38;5;241m.\u001b[39mshape, dtype\u001b[38;5;241m=\u001b[39m\u001b[38;5;28mbool\u001b[39m) \u001b[38;5;66;03m# Default in case of nan\u001b[39;00m\n\u001b[1;32m 3741\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m np\u001b[38;5;241m.\u001b[39many(np\u001b[38;5;241m.\u001b[39mless(err, \u001b[38;5;241m-\u001b[39merr, out\u001b[38;5;241m=\u001b[39mres, where\u001b[38;5;241m=\u001b[39m(err \u001b[38;5;241m==\u001b[39m err))):\n\u001b[1;32m 3742\u001b[0m \u001b[38;5;66;03m# like err<0, but also works for timedelta and nan.\u001b[39;00m\n\u001b[0;32m-> 3743\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 3744\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mdep_axis\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124merr\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m must not contain negative values\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3745\u001b[0m \u001b[38;5;66;03m# This is like\u001b[39;00m\n\u001b[1;32m 3746\u001b[0m \u001b[38;5;66;03m# elow, ehigh = np.broadcast_to(...)\u001b[39;00m\n\u001b[1;32m 3747\u001b[0m \u001b[38;5;66;03m# return dep - elow * ~lolims, dep + ehigh * ~uplims\u001b[39;00m\n\u001b[1;32m 3748\u001b[0m \u001b[38;5;66;03m# except that broadcast_to would strip units.\u001b[39;00m\n\u001b[1;32m 3749\u001b[0m low, high \u001b[38;5;241m=\u001b[39m dep \u001b[38;5;241m+\u001b[39m np\u001b[38;5;241m.\u001b[39mvstack([\u001b[38;5;241m-\u001b[39m(\u001b[38;5;241m1\u001b[39m \u001b[38;5;241m-\u001b[39m lolims), \u001b[38;5;241m1\u001b[39m 
\u001b[38;5;241m-\u001b[39m uplims]) \u001b[38;5;241m*\u001b[39m err\n",
"\u001b[0;31mValueError\u001b[0m: 'yerr' must not contain negative values"
]
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.figure()\n",
"plt.errorbar(\n",
"    [train_size for train_size in np.arange(0.1, 1.0, 0.1)], # x-location of each error bar\n",
"    [info[0] for info in info_accuracy_train], # y-location of each error bar\n",
"    # the size of each error bar, clamped at 0 because floating-point rounding\n",
"    # can make mean-min or max-mean slightly negative, which errorbar rejects\n",
"    yerr=[\n",
"        [max(info[0]-info[1], 0) for info in info_accuracy_train], \n",
"        [max(info[2]-info[0], 0) for info in info_accuracy_train]\n",
"    ], \n",
"    color='tab:blue', label='train'\n",
")\n",
"plt.errorbar(\n",
"    [train_size for train_size in np.arange(0.1, 1.0, 0.1)], # x-location of each error bar\n",
"    [info[0] for info in info_accuracy_val], # y-location of each error bar\n",
"    # the size of each error bar, clamped at 0 for the same reason as above\n",
"    yerr=[\n",
"        [max(info[0]-info[1], 0) for info in info_accuracy_val], \n",
"        [max(info[2]-info[0], 0) for info in info_accuracy_val]\n",
"    ], \n",
"    color='tab:red', label='val'\n",
")\n",
"plt.grid()\n",
"plt.xlabel('Portion of training set')\n",
"plt.ylabel('Accuracy')\n",
"plt.title('Learning curve for DT')\n",
"plt.legend()\n",
"plt.show()"
]
},
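{
"cell_type": "markdown",
"id": "sketch-learning-curve-md",
"metadata": {},
"source": [
"The manual procedure above has a counterpart in sklearn's ``learning_curve()``. This is a minimal sketch under stated assumptions: it resamples ``X`` and ``y`` with 10 stratified 70/30 splits instead of reusing the fixed ``X_val``, and the seed is an illustrative choice, so the curve will differ from the manual one. Without shuffling, ``learning_curve`` takes the first samples of each training split, so the smallest subsets are not guaranteed to be stratified."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-learning-curve-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (not part of the original lab): learning curve for a decision tree with learning_curve.\n",
"# Assumptions: 10 stratified 70/30 resamplings of X, y and a fixed seed.\n",
"from sklearn.model_selection import learning_curve\n",
"\n",
"sizes, train_scores, val_scores = learning_curve(\n",
"    DecisionTreeClassifier(random_state=0), X, y,\n",
"    train_sizes=np.linspace(0.1, 1.0, 10),\n",
"    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0),\n",
"    scoring='accuracy'\n",
")\n",
"# sizes holds absolute sample counts; each score array has one row per size, one column per split\n",
"for size, va in zip(sizes, val_scores):\n",
"    print(f'{size} samples: val accuracy min={va.min():.3f}, mean={va.mean():.3f}, max={va.max():.3f}')"
]
},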
{
"cell_type": "markdown",
"id": "ab5122d0",
"metadata": {},
"source": [
"### 2. Exercise - RTP dataset\n",
"In this exercise, we employ the same RTP dataset as in the previous laboratory, but we focus on multi-class classification instead of the binary one. The classes are **Audio, FEC-Audio, High Quality Video, Medium Quality Video, Low Quality Video, FEC-Video, and Screen Sharing**. \n",
"\n",
"You will:\n",
"- Load the dataset\n",
"- Perform necessary data processing\n",
"- Perform validation curve for k-NN classifier\n",
"- Perform grid search for decision tree classifier\n",
"- Choose one of the models with best hyper-parameters\n",
"- With this model and hyper-parameters, draw a learning curve"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "44ec27b7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>interarrival_std</th>\n",
" <th>interarrival_mean</th>\n",
" <th>interarrival_min</th>\n",
" <th>interarrival_max</th>\n",
" <th>interarrival_max_min_diff</th>\n",
" <th>interarrival_p10</th>\n",
" <th>interarrival_p20</th>\n",
" <th>interarrival_p25</th>\n",
" <th>interarrival_p30</th>\n",
" <th>interarrival_p40</th>\n",
" <th>...</th>\n",
" <th>rtp_interarrival_max_min_R</th>\n",
" <th>rtp_interarrival_kurtosis</th>\n",
" <th>rtp_interarrival_skew</th>\n",
" <th>rtp_interarrival_moment3</th>\n",
" <th>rtp_interarrival_moment4</th>\n",
" <th>rtp_interarrival_len_unique_percent</th>\n",
" <th>rtp_interarrival_max_value_count_percent</th>\n",
" <th>rtp_interarrival_min_max_R</th>\n",
" <th>rtp_marker_sum_check</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.001927</td>\n",
" <td>0.010000</td>\n",
" <td>0.004951</td>\n",
" <td>0.014423</td>\n",
" <td>0.009472</td>\n",
" <td>7.619953e-05</td>\n",
" <td>8.045912e-05</td>\n",
" <td>8.572698e-05</td>\n",
" <td>9.030223e-05</td>\n",
" <td>9.799051e-05</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.010000</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>Audio</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.000515</td>\n",
" <td>0.020009</td>\n",
" <td>0.019227</td>\n",
" <td>0.021251</td>\n",
" <td>0.002024</td>\n",
" <td>1.931565e-04</td>\n",
" <td>1.953020e-04</td>\n",
" <td>1.958430e-04</td>\n",
" <td>1.965890e-04</td>\n",
" <td>1.985469e-04</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.020000</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>Audio</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.041315</td>\n",
" <td>0.019994</td>\n",
" <td>0.000000</td>\n",
" <td>0.143393</td>\n",
" <td>0.143393</td>\n",
" <td>9.536743e-09</td>\n",
" <td>9.536743e-09</td>\n",
" <td>9.536743e-09</td>\n",
" <td>1.907349e-08</td>\n",
" <td>4.053116e-08</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.019231</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>Audio</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.008119</td>\n",
" <td>0.019954</td>\n",
" <td>0.000873</td>\n",
" <td>0.044432</td>\n",
" <td>0.043559</td>\n",
" <td>9.701633e-05</td>\n",
" <td>1.477895e-04</td>\n",
" <td>1.699674e-04</td>\n",
" <td>1.779909e-04</td>\n",
" <td>1.895509e-04</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.020000</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>Audio</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.018683</td>\n",
" <td>0.020117</td>\n",
" <td>0.000001</td>\n",
" <td>0.121093</td>\n",
" <td>0.121092</td>\n",
" <td>1.023531e-05</td>\n",
" <td>7.453918e-05</td>\n",
" <td>1.209468e-04</td>\n",
" <td>1.324451e-04</td>\n",
" <td>1.531601e-04</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.021739</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0</td>\n",
" <td>Audio</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139995</th>\n",
" <td>0.000799</td>\n",
" <td>0.337698</td>\n",
" <td>0.336812</td>\n",
" <td>0.338365</td>\n",
" <td>0.001553</td>\n",
" <td>3.370330e-03</td>\n",
" <td>3.372540e-03</td>\n",
" <td>3.373646e-03</td>\n",
" <td>3.374751e-03</td>\n",
" <td>3.376961e-03</td>\n",
" <td>...</td>\n",
" <td>0.511905</td>\n",
" <td>-1.500000</td>\n",
" <td>-0.707107</td>\n",
" <td>-2.211840e+08</td>\n",
" <td>3.185050e+11</td>\n",
" <td>0.666667</td>\n",
" <td>0.666667</td>\n",
" <td>0.488095</td>\n",
" <td>3</td>\n",
" <td>ScreenSharing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139996</th>\n",
" <td>0.159892</td>\n",
" <td>0.239946</td>\n",
" <td>0.000108</td>\n",
" <td>0.320163</td>\n",
" <td>0.320055</td>\n",
" <td>9.596729e-04</td>\n",
" <td>1.918266e-03</td>\n",
" <td>2.397562e-03</td>\n",
" <td>2.876859e-03</td>\n",
" <td>3.196862e-03</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>-0.671026</td>\n",
" <td>-1.148811</td>\n",
" <td>-2.524719e+12</td>\n",
" <td>6.654528e+16</td>\n",
" <td>1.000000</td>\n",
" <td>0.250000</td>\n",
" <td>0.000000</td>\n",
" <td>3</td>\n",
" <td>ScreenSharing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139997</th>\n",
" <td>0.045574</td>\n",
" <td>0.040176</td>\n",
" <td>0.000012</td>\n",
" <td>0.151814</td>\n",
" <td>0.151802</td>\n",
" <td>1.705837e-05</td>\n",
" <td>3.843689e-05</td>\n",
" <td>6.171942e-05</td>\n",
" <td>1.125135e-04</td>\n",
" <td>2.727780e-04</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.043478</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>23</td>\n",
" <td>ScreenSharing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139998</th>\n",
" <td>0.028728</td>\n",
" <td>0.325410</td>\n",
" <td>0.299745</td>\n",
" <td>0.356444</td>\n",
" <td>0.056699</td>\n",
" <td>3.038041e-03</td>\n",
" <td>3.078630e-03</td>\n",
" <td>3.098925e-03</td>\n",
" <td>3.119220e-03</td>\n",
" <td>3.159810e-03</td>\n",
" <td>...</td>\n",
" <td>0.511144</td>\n",
" <td>-1.500000</td>\n",
" <td>-0.695813</td>\n",
" <td>-1.628640e+08</td>\n",
" <td>2.163721e+11</td>\n",
" <td>1.000000</td>\n",
" <td>0.333333</td>\n",
" <td>0.488856</td>\n",
" <td>3</td>\n",
" <td>ScreenSharing</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139999</th>\n",
" <td>0.004189</td>\n",
" <td>0.040222</td>\n",
" <td>0.032511</td>\n",
" <td>0.049401</td>\n",
" <td>0.016890</td>\n",
" <td>3.474479e-04</td>\n",
" <td>3.678946e-04</td>\n",
" <td>3.826904e-04</td>\n",
" <td>3.873811e-04</td>\n",
" <td>3.936524e-04</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>-3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.000000e+00</td>\n",
" <td>0.040000</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>25</td>\n",
" <td>ScreenSharing</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>140000 rows × 96 columns</p>\n",
"</div>"
],
"text/plain": [
" interarrival_std interarrival_mean interarrival_min \\\n",
"0 0.001927 0.010000 0.004951 \n",
"1 0.000515 0.020009 0.019227 \n",
"2 0.041315 0.019994 0.000000 \n",
"3 0.008119 0.019954 0.000873 \n",
"4 0.018683 0.020117 0.000001 \n",
"... ... ... ... \n",
"139995 0.000799 0.337698 0.336812 \n",
"139996 0.159892 0.239946 0.000108 \n",
"139997 0.045574 0.040176 0.000012 \n",
"139998 0.028728 0.325410 0.299745 \n",
"139999 0.004189 0.040222 0.032511 \n",
"\n",
" interarrival_max interarrival_max_min_diff interarrival_p10 \\\n",
"0 0.014423 0.009472 7.619953e-05 \n",
"1 0.021251 0.002024 1.931565e-04 \n",
"2 0.143393 0.143393 9.536743e-09 \n",
"3 0.044432 0.043559 9.701633e-05 \n",
"4 0.121093 0.121092 1.023531e-05 \n",
"... ... ... ... \n",
"139995 0.338365 0.001553 3.370330e-03 \n",
"139996 0.320163 0.320055 9.596729e-04 \n",
"139997 0.151814 0.151802 1.705837e-05 \n",
"139998 0.356444 0.056699 3.038041e-03 \n",
"139999 0.049401 0.016890 3.474479e-04 \n",
"\n",
" interarrival_p20 interarrival_p25 interarrival_p30 \\\n",
"0 8.045912e-05 8.572698e-05 9.030223e-05 \n",
"1 1.953020e-04 1.958430e-04 1.965890e-04 \n",
"2 9.536743e-09 9.536743e-09 1.907349e-08 \n",
"3 1.477895e-04 1.699674e-04 1.779909e-04 \n",
"4 7.453918e-05 1.209468e-04 1.324451e-04 \n",
"... ... ... ... \n",
"139995 3.372540e-03 3.373646e-03 3.374751e-03 \n",
"139996 1.918266e-03 2.397562e-03 2.876859e-03 \n",
"139997 3.843689e-05 6.171942e-05 1.125135e-04 \n",
"139998 3.078630e-03 3.098925e-03 3.119220e-03 \n",
"139999 3.678946e-04 3.826904e-04 3.873811e-04 \n",
"\n",
" interarrival_p40 ... rtp_interarrival_max_min_R \\\n",
"0 9.799051e-05 ... 0.500000 \n",
"1 1.985469e-04 ... 0.500000 \n",
"2 4.053116e-08 ... 0.500000 \n",
"3 1.895509e-04 ... 0.500000 \n",
"4 1.531601e-04 ... 0.500000 \n",
"... ... ... ... \n",
"139995 3.376961e-03 ... 0.511905 \n",
"139996 3.196862e-03 ... 1.000000 \n",
"139997 2.727780e-04 ... 0.500000 \n",
"139998 3.159810e-03 ... 0.511144 \n",
"139999 3.936524e-04 ... 0.500000 \n",
"\n",
" rtp_interarrival_kurtosis rtp_interarrival_skew \\\n",
"0 -3.000000 0.000000 \n",
"1 -3.000000 0.000000 \n",
"2 -3.000000 0.000000 \n",
"3 -3.000000 0.000000 \n",
"4 -3.000000 0.000000 \n",
"... ... ... \n",
"139995 -1.500000 -0.707107 \n",
"139996 -0.671026 -1.148811 \n",
"139997 -3.000000 0.000000 \n",
"139998 -1.500000 -0.695813 \n",
"139999 -3.000000 0.000000 \n",
"\n",
" rtp_interarrival_moment3 rtp_interarrival_moment4 \\\n",
"0 0.000000e+00 0.000000e+00 \n",
"1 0.000000e+00 0.000000e+00 \n",
"2 0.000000e+00 0.000000e+00 \n",
"3 0.000000e+00 0.000000e+00 \n",
"4 0.000000e+00 0.000000e+00 \n",
"... ... ... \n",
"139995 -2.211840e+08 3.185050e+11 \n",
"139996 -2.524719e+12 6.654528e+16 \n",
"139997 0.000000e+00 0.000000e+00 \n",
"139998 -1.628640e+08 2.163721e+11 \n",
"139999 0.000000e+00 0.000000e+00 \n",
"\n",
" rtp_interarrival_len_unique_percent \\\n",
"0 0.010000 \n",
"1 0.020000 \n",
"2 0.019231 \n",
"3 0.020000 \n",
"4 0.021739 \n",
"... ... \n",
"139995 0.666667 \n",
"139996 1.000000 \n",
"139997 0.043478 \n",
"139998 1.000000 \n",
"139999 0.040000 \n",
"\n",
" rtp_interarrival_max_value_count_percent rtp_interarrival_min_max_R \\\n",
"0 1.000000 0.500000 \n",
"1 1.000000 0.500000 \n",
"2 1.000000 0.500000 \n",
"3 1.000000 0.500000 \n",
"4 1.000000 0.500000 \n",
"... ... ... \n",
"139995 0.666667 0.488095 \n",
"139996 0.250000 0.000000 \n",
"139997 1.000000 0.500000 \n",
"139998 0.333333 0.488856 \n",
"139999 1.000000 0.500000 \n",
"\n",
" rtp_marker_sum_check label \n",
"0 0 Audio \n",
"1 0 Audio \n",
"2 0 Audio \n",
"3 0 Audio \n",
"4 0 Audio \n",
"... ... ... \n",
"139995 3 ScreenSharing \n",
"139996 3 ScreenSharing \n",
"139997 23 ScreenSharing \n",
"139998 3 ScreenSharing \n",
"139999 25 ScreenSharing \n",
"\n",
"[140000 rows x 96 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# load the dataset\n",
"df = pd.read_csv(\"RTP_dataset.csv\")\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "09ce9f59",
"metadata": {},
"source": [
"### 2.1 Dataset processing\n",
"1. Do a single stratified split based on the labels to segment the dataset into training and test sets with portions of 70% and 30% (the validation set will be needed later on, and in that case, you will have to further split the training set).\n",
"2. Standardize the dataset by fitting the scaler on training set and then transforming the entire dataset.\n",
"3. Repeat the process of removing correlated features, as in the previous lab, examining the correlation in training set and then removing the correlated ones from all datasets."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "8caca196",
"metadata": {},
"outputs": [],
"source": [
"# This part is already provided\n",
"# Here, it is basically the same that you have done during the previous lab except that you are dealing with multiple classes\n",
"\n",
"# Prepare the dataset extracting Features (X) and Labels (Y) \n",
"# Stratify the dataset by having 70% of the data in the training set and 30% in the test set\n",
"df_copy = df.copy()\n",
"df_copy['label'] = pd.Categorical(df_copy['label']).codes # transform to numerical labels\n",
"X = df_copy.drop(columns=['label']).to_numpy()\n",
"y = df_copy[['label']].to_numpy()\n",
"\n",
"# Run stratified training-test splitting\n",
"# Herem X and y are needed for further split when validation set is needed, while test set is withheld to be evaluate at the very last\n",
"X, X_test, y, y_test = train_test_split(X, y, stratify=y, train_size=0.7, random_state=15)\n",
"y, y_test = np.ravel(y), np.ravel(y_test)\n",
"\n",
"# Standardize data\n",
"scaler = StandardScaler()\n",
"scaler.fit(X)\n",
"X_s, X_test_s = scaler.transform(X), scaler.transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "533a1c5a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"59 features are removed\n"
]
}
],
"source": [
"# compute the correlation matrix\n",
"columns= [i for i in range(X_s.shape[1])]\n",
"df_tmp = pd.DataFrame(X_s, columns=columns)\n",
"correlation_matrix = df_tmp.corr().abs()\n",
"\n",
"# display the heatmap\n",
"plt.figure()\n",
"sns.heatmap(correlation_matrix, cmap='Blues', vmin=0.8, vmax=1, cbar_kws={'label':'Correlation'})\n",
"plt.xlabel('Feature')\n",
"plt.ylabel('Feature')\n",
"plt.show()\n",
"\n",
"# extract features having a correlation > 0.8\n",
"c = correlation_matrix[correlation_matrix>0.8]\n",
"s = c.unstack()\n",
"so = s.sort_values(ascending=False).reset_index()\n",
"\n",
"# get strongly correlated features removing pairs having correlation = 1 because of the diagonal, i.e., correlation between one feature and itself\n",
"so = so[(so[0].isnull()==False) & (so[\"level_0\"] != so[\"level_1\"])]\n",
"\n",
"to_be_deleted = []\n",
"candidates = list(so[\"level_0\"])\n",
"\n",
"# get the unique set of features to be deleted. Notice that we discard one feature per time considering the case where a feature is strongly correlated with multiple features\n",
"subset_so = so\n",
"for candidate in candidates:\n",
" if (candidate in list(subset_so[\"level_0\"])): \n",
" to_be_deleted.append(candidate)\n",
" subset_so = subset_so[(subset_so[\"level_0\"] != candidate) & (subset_so[\"level_1\"] != candidate)]\n",
"\n",
"# to_be_deleted contains the index of columns that you need to remove from both training and test sets\n",
"print(len(to_be_deleted), 'features are removed')\n",
"\n",
"# remove the correlated features from bot sets\n",
"\n",
"# Create a mask for the columns to keep\n",
"columns_to_keep = np.ones(X_s.shape[1], dtype=bool)\n",
"columns_to_keep[to_be_deleted] = False\n",
"\n",
"# Use the mask to select only the columns to keep\n",
"X_s = X_s[:, columns_to_keep]\n",
"X_test_s = X_test_s[:, columns_to_keep]"
]
},
{
"cell_type": "markdown",
"id": "5b786a31",
"metadata": {},
"source": [
"### 2.2 Validation curve\n",
"1. Use GaussianNB classifier. For the hyper-parameter `var_smoothing` (<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html\">documentation</a>) , define a range of possible values, for example ``np.logspace(-4,0,5)``.\n",
"2. For each value of `var_smoothing`, for validation use k-fold cross-validation or stratified random sampling for k times. **Note that, no matter which one do you choose, you need to stick to it for the following questions. Choose a low value for k, for example 5**\n",
"3. Record the max, min, and mean for each value of the hyper-parameter, and finally plot the performance.\n",
" - What is the best value of `var_smoothing`?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8a65e9e",
"metadata": {},
"outputs": [],
"source": [
"acc_train_means = []\n",
"acc_train_maxs = []\n",
"acc_train_mins = []\n",
"acc_val_means = []\n",
"acc_val_maxs = []\n",
"acc_val_mins = []\n",
"\n",
"hyper_vals = np.logspace(-4,0,5)\n",
"\n",
"#we choose the model RandomForestClassifier and change its hyper-parameter max_depth, using integer values from 2 to 9\n",
"for n in hyper_vals:\n",
" gnb_clf = GaussianNB(var_smoothing=n)\n",
" scores = cross_validate(gnb_clf, X_s, y, cv=5, scoring='accuracy', return_train_score=True)\n",
" acc_train_means.append(scores['train_score'].mean())\n",
" acc_train_maxs.append(scores['train_score'].max())\n",
" acc_train_mins.append(scores['train_score'].min())\n",
" acc_val_means.append(scores['test_score'].mean())\n",
" acc_val_maxs.append(scores['test_score'].max())\n",
" acc_val_mins.append(scores['test_score'].min())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a76a68c3",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"plt.plot(hyper_vals, acc_train_means, color='tab:blue', label='train')\n",
"plt.xscale('log')\n",
"plt.fill_between(hyper_vals, acc_train_mins, acc_train_maxs, alpha=0.5, color='tab:blue')\n",
"plt.plot(hyper_vals, acc_val_means, color='tab:red', label='val')\n",
"plt.fill_between(hyper_vals, acc_val_mins, acc_val_maxs, alpha=0.5, color='tab:red')\n",
"plt.xlabel('max depth')\n",
"plt.ylabel('accuracy')\n",
"plt.legend()\n",
"plt.title(\"Validation curve for max depth in GaussianNB\")\n",
"plt.ylim(0.65,0.85)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "c5f309f2",
"metadata": {},
"source": [
"### 2.3 Grid search for decision tree\n",
"\n",
"1. Define the parameters' range (<a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier\">documentation</a>) as following:\n",
" - ``criterion`` - ``['gini', 'entropy']``,\n",
" - ``max_depth`` - ``[4, 8, None]``,\n",
" - ``min_samples_split`` - ``[2, 4]``,\n",
"2. Perform the grid search among all the possible combinations of the defined parameters' range, and for each combination of configurations:\n",
" - Refer to the approach (k-fold cross-validation or randomly data segmentation for k times) you select in the previous question.\n",
" - Initialize and train the model with the choosing parameters, and evaluate the performance on training and validation set. Record the accuracy of for both sets.\n",
" - After the process, plot the accuracy for training and validation sets with respect to different configuration of parameters (on y-axis, output accuracy with either error bar or filled region between min and max, and on x-axis, specify the configuration index). In the figure, specify the best configuration by mark the highest mean accuracy on validation set.\n",
"3. At last, answer the following questions:\n",
" - What is the best combination of parameters?\n",
" - Try to explain why such parameters generate the best outcome. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "942ba83b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import itertools\n",
"\n",
"criterion = ['gini', 'entropy']\n",
"max_depth = [4, 8, None]\n",
"min_samples_split = [2, 4]\n",
"\n",
"acc_train_means = []\n",
"acc_train_maxs = []\n",
"acc_train_mins = []\n",
"acc_val_means = []\n",
"acc_val_maxs = []\n",
"acc_val_mins = []\n",
"\n",
"# iterate over all possible combinations\n",
"for (c, md, mss) in itertools.product(criterion, max_depth, min_samples_split):\n",
" # initialize and fit a model per pair of parameters\n",
" dtc_tmp = DecisionTreeClassifier(criterion=c, max_depth=md, min_samples_split=mss, random_state=69)\n",
" dtc_tmp.fit(X_s, y)\n",
" # here we only refer to accuracy for simplicity\n",
" acc_train = accuracy_score(y, dtc_tmp.predict(X_s))\n",
" acc_val = accuracy_score(y_test, dtc_tmp.predict(X_test_s))\n",
" print(f'#criterion: {c},\\t max depth: {md},\\t min samples split {mss},\\t accuracy: train - {acc_train} | val - {acc_val}')\n",
"\n",
" scores = cross_validate(dtc_tmp, X_s, y, cv=5, scoring='accuracy', return_train_score=True)\n",
" acc_train_means.append(scores['train_score'].mean())\n",
" acc_train_maxs.append(scores['train_score'].max())\n",
" acc_train_mins.append(scores['train_score'].min())\n",
" acc_val_means.append(scores['test_score'].mean())\n",
" acc_val_maxs.append(scores['test_score'].max())\n",
" acc_val_mins.append(scores['test_score'].min())"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10ccdedc",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"x = list(range(12))\n",
"plt.plot(x, acc_train_means, color='tab:blue', label='train')\n",
"plt.fill_between(x, acc_train_mins, acc_train_maxs, alpha=0.5, color='tab:blue')\n",
"plt.plot(x, acc_val_means, color='tab:red', label='val')\n",
"plt.fill_between(x, acc_val_mins, acc_val_maxs, alpha=0.5, color='tab:red')\n",
"plt.xlabel('max depth')\n",
"plt.ylabel('combination number')\n",
"plt.legend()\n",
"plt.title(\"Validation curve for max depth in DT\")\n",
"plt.ylim(0.87,1)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "6d2a9d7e",
"metadata": {},
"source": [
"### 2.4 Learning Curve\n",
"Once we find the best features and configurations, we can check the number of training samples required to create a reliable model by checking learning curve.\n",
"1. Select the best one of the previous models with the best configuration. Define the range of portion of training sample. Note that this is the portion of training samples out of the dataset used for model training (test set is excluded). The goal is to find the proper portion that can generate an accuracy on validation greater than a certain threshold, e.g., 96%, and to see whether it is needed to gather more data.\n",
"2. For each of the percentage in the range, refer to your previous selected approach (k-fold cross-validation or random stratified sampling for k times) and train your model with different portions of training set.\n",
"3. At each trial, record the accuracy, calculating the max, min and mean among the k repetitions.\n",
"4. Visualize the performance evolution with error bar as you increase the portion of training data.\n",
" - Which portion do you think is proper to derive a reliable result?\n",
" - Do you think whether we need more data?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8378f8c2",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# here we only refer to accuracy for simplicity\n",
"info_accuracy_train = []\n",
"info_accuracy_val = []\n",
"\n",
"for train_size in np.arange(0.1, 1, 0.1):\n",
" # select 10 random states\n",
" random_states = set()\n",
" while len(random_states) < 10:\n",
" n = random.randint(0,1000000)\n",
" random_states.add(n)\n",
"\n",
" accuraries_train_tmp = []\n",
" accuraries_val_tmp = []\n",
" \n",
" # iterate over all random states with same training size to derive the 10 different performance\n",
" for random_state in random_states:\n",
" # do a stratified random sampling from the training set \n",
" sss = StratifiedShuffleSplit(n_splits=1, train_size=train_size, random_state=random_state)\n",
" index_selected = list(sss.split(X_s, y))[0][0]\n",
" X_train_selected = X_s[index_selected]\n",
" y_train_selected = y[index_selected]\n",
" \n",
" dtc_tmp = DecisionTreeClassifier(criterion=\"entropy\", max_depth=None, min_samples_split=4, random_state=69)\n",
" \n",
" # initialize and fit a DT model per time\n",
" dtc_tmp.fit(X_train_selected, y_train_selected)\n",
" accuraries_train_tmp.append(accuracy_score(y_train_selected, dtc_tmp.predict(X_train_selected)))\n",
" accuraries_val_tmp.append(accuracy_score(y_test, dtc_tmp.predict(X_test_s)))\n",
"\n",
" # calculate the statistics to generate overall performance\n",
" accuraries_tmp = np.array(accuraries_train_tmp)\n",
" mean_acc = accuraries_tmp.mean()\n",
" min_acc = accuraries_tmp.min()\n",
" max_acc = accuraries_tmp.max()\n",
" info_accuracy_train.append((mean_acc, min_acc, max_acc))\n",
" \n",
" accuraries_tmp = np.array(accuraries_val_tmp)\n",
" mean_acc = accuraries_tmp.mean()\n",
" min_acc = accuraries_tmp.min()\n",
" max_acc = accuraries_tmp.max()\n",
" info_accuracy_val.append((mean_acc, min_acc, max_acc))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "394e5480",
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"plt.errorbar(\n",
" [train_size for train_size in np.arange(0.1, 1.0, 0.1)], # x-location of each error bar\n",
" [info[0] for info in info_accuracy_train], # y-location of each error bar\n",
" # the size of each error bar\n",
" yerr=[\n",
" [info[0]-info[1] for info in info_accuracy_train], \n",
" [info[2]-info[0] for info in info_accuracy_train]\n",
" ], \n",
" color='tab:blue', label='train'\n",
")\n",
"plt.errorbar(\n",
" [train_size for train_size in np.arange(0.1, 1.0, 0.1)], # x-location of each error bar\n",
" [info[0] for info in info_accuracy_val], # y-location of each error bar\n",
" # the size of each error bar\n",
" yerr=[\n",
" [info[0]-info[1] for info in info_accuracy_val], \n",
" [info[2]-info[0] for info in info_accuracy_val]\n",
" ], \n",
" color='tab:red', label='val'\n",
")\n",
"plt.grid()\n",
"plt.xlabel('Portion of training set')\n",
"plt.ylabel('Accuracy')\n",
"plt.title('Learning curve for DT')\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "97c53dc0",
"metadata": {},
"source": [
"### 2.5 Test your best model on unseen data\n",
"Based on the previous considerations, retrain your model with all of the aforementioned best selections on all data except test set. Output the final performance in terms of classification report and confusion matrix on the test set.\n",
"- Which classes have good performance?\n",
"- Which classes have poor performance?\n",
"- Can you find some peculiar behavior for the performance of certain classes? Try to explain why."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bde5bee",
"metadata": {},
"outputs": [],
"source": [
"# you answer here\n",
"model = dt_tmp = DecisionTreeClassifier(\n",
" criterion=\"entropy\", max_depth=None, min_samples_split=4, random_state=60\n",
")\n",
"\n",
"model.fit(X_s, y)\n",
"y_pred = model.predict(X_test_s)\n",
"print(classification_report(y_test, y_pred))\n",
"confusion_test = confusion_matrix(y_test, y_pred)\n",
"\n",
"# visualize the confusion matrix\n",
"plt.figure(figsize=(10, 10))\n",
"sns.heatmap(confusion_test, cmap=\"Blues\", annot=True, cbar_kws={\"label\": \"Occurrences\"})\n",
"plt.xlabel(\"Prediction\")\n",
"plt.ylabel(\"True\")\n",
"plt.title(\"Confusion matrix\")\n",
"plt.show()\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}