{ "cells": [ { "cell_type": "markdown", "id": "c5aaedf3", "metadata": {}, "source": [ "
Lab-7 Supervised Learning
" ] }, { "cell_type": "markdown", "id": "1af90b7a", "metadata": {}, "source": [ "### Objective: build a basic ML pipeline for a classification problem:\n", "1. **Data segmentation** is used to split the whole dataset into different portions for different purposes - training, validation, and test. Useful link: Wiki, sklearn.\n", "2. **Usage of classification algorithms using Scikit-learn (sklearn)**: we will use classification algorithms already implemented in the sklearn library. Useful link: Wiki, examples in sklearn.\n", "3. **Performance evaluation**: we will also use the metrics provided by sklearn to evaluate the performance of the trained model. Useful link: Wiki, sklearn." ] }, { "cell_type": "code", "execution_count": 3, "id": "6d1fd7be", "metadata": {}, "outputs": [], "source": [ "# import needed python libraries\n", "\n", "%matplotlib inline\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.metrics import classification_report, confusion_matrix\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", "from sklearn.tree import DecisionTreeRegressor" ] }, { "cell_type": "markdown", "id": "6a8c41f0", "metadata": {}, "source": [ "### 1. Tutorial - Classification\n", "Classification is a supervised learning task: you use labeled data to build, tune, and evaluate ML models that classify or predict future samples, helping to make decisions, forecast conditions, identify patterns, etc. 
\n", "\n", "\n", "\n", "Here we use the IRIS dataset as an example to show how you can perform classification. " ] }, { "cell_type": "code", "execution_count": 4, "id": "1c0556ae", "metadata": {}, "outputs": [], "source": [ "# load dataset of IRIS flowers\n", "\n", "from sklearn import datasets\n", "iris_data = datasets.load_iris()\n", "features_iris = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']\n", "df_iris = pd.DataFrame(iris_data.data, columns = features_iris)\n", "df_iris['type'] = 'setosa'\n", "df_iris.loc[50:99, 'type'] = 'versicolor'\n", "df_iris.loc[100:149, 'type'] = 'virginica'" ] }, { "cell_type": "markdown", "id": "fb59ce2a", "metadata": {}, "source": [ "### 1.1 Data segmentation\n", "\n", "Once you have labeled data, you need to split it into training, validation (if needed), and test datasets. The training and validation datasets are both involved in model training: the training set is used to build the model (learn the parameters), while the validation set is used to evaluate the model performance during/after training, which helps to detect overfitting/underfitting and to choose hyper-parameters. The test set is kept out of the training phase and is used to derive the final model performance. \n", "\n", "For now, we only need training and test datasets. We split the data into training and test sets, accounting for 70% and 30% of the samples respectively. Here, we will use **stratified sampling**, which splits the data into two (or more) parts, each having the same proportion of class labels." ] }, { "cell_type": "code", "execution_count": 6, "id": "ae0ac3d9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
df_iris preview (HTML rendering, duplicated by the text/plain output): 150 rows × 6 columns (sepal_length, sepal_width, petal_length, petal_width, type, label)
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width type label\n", "0 5.1 3.5 1.4 0.2 setosa 0\n", "1 4.9 3.0 1.4 0.2 setosa 0\n", "2 4.7 3.2 1.3 0.2 setosa 0\n", "3 4.6 3.1 1.5 0.2 setosa 0\n", "4 5.0 3.6 1.4 0.2 setosa 0\n", ".. ... ... ... ... ... ...\n", "145 6.7 3.0 5.2 2.3 virginica 2\n", "146 6.3 2.5 5.0 1.9 virginica 2\n", "147 6.5 3.0 5.2 2.0 virginica 2\n", "148 6.2 3.4 5.4 2.3 virginica 2\n", "149 5.9 3.0 5.1 1.8 virginica 2\n", "\n", "[150 rows x 6 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# first, we need to convert the type of flowers to numerical labels\n", "df_iris['label'] = pd.Categorical(df_iris['type']).codes\n", "df_iris" ] }, { "cell_type": "code", "execution_count": 21, "id": "7fdc9034", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", "    df_iris[features_iris], # X\n", "    df_iris['label'], # y\n", "    stratify = df_iris['label'], # stratify the dataset based on class labels\n", "    train_size = 0.7, # fraction of samples in the training set\n", "    random_state = 15\n", ")" ] }, { "cell_type": "markdown", "id": "07590f25", "metadata": {}, "source": [ "### 1.2 ML model usage\n", "The way you develop ML models using sklearn is similar for different algorithms, as follows:\n", "```python\n", "# load the model from sklearn\n", "from sklearn.xxx import MODEL\n", "\n", "# initialize the ML model (includes model and loss)\n", "model = MODEL() \n", "\n", "# train the model by fitting the algorithm based on training set\n", "model.fit(X_train, y_train) \n", "\n", "# use the trained model to make predictions for train and test set\n", "preds_train = model.predict(X_train) \n", "preds_test = model.predict(X_test) \n", "```" ] }, { "cell_type": "code", "execution_count": 22, "id": "e60ee760", "metadata": {}, "outputs": [], "source": [ "# here we use Gaussian Naive Bayes classifier as an example\n", "\n", "gnb = GaussianNB()\n", 
"gnb.fit(X_train, y_train)\n", "y_train_pred = gnb.predict(X_train)\n", "y_test_pred = gnb.predict(X_test)" ] }, { "cell_type": "markdown", "id": "ffc920de", "metadata": {}, "source": [ "### 1.3 Performance evaluation metrics\n", "Instead of building our own functions to evaluate the model performance, we will use the sklearn library. In addition, we can use a heatmap to visualize the confusion matrix. These metrics can be evaluated on both the training and test sets." ] }, { "cell_type": "code", "execution_count": 23, "id": "143d37ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 15\n", " 1 0.83 1.00 0.91 15\n", " 2 1.00 0.80 0.89 15\n", "\n", " accuracy 0.93 45\n", " macro avg 0.94 0.93 0.93 45\n", "weighted avg 0.94 0.93 0.93 45\n", "\n" ] } ], "source": [ "# Classification report includes\n", "# - accuracy\n", "# - precision, recall and F1-score for each class (in this case 3 classes)\n", "# - averages of precision, recall, and f1-score\n", "# - number of samples for each class and in total (support)\n", "\n", "# Here we show results for test data\n", "print(classification_report(y_test, y_test_pred))" ] }, { "cell_type": "code", "execution_count": 24, "id": "b622be52", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# get the confusion matrix of test set\n", "confusion_test = confusion_matrix(y_test, y_test_pred)\n", "\n", "# visualize the confusion matrix\n", "plt.figure(figsize=(5,4))\n", "sns.heatmap(confusion_test, cmap='Blues', annot=True, cbar_kws={'label':'Occurrences'})\n", "plt.xlabel('Prediction')\n", "plt.ylabel('True')\n", "plt.title('Confusion matrix')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "bbe08206", "metadata": {}, "source": [ "### 2. Exercise - RTP dataset\n", "In this exercise, we employ a different dataset, which contains RTP traffic information. The Real-time Transport Protocol (RTP) is a network protocol for delivering real-time audio and video over IP networks. RTP is used in communication and entertainment systems that involve streaming media, such as telephony, video teleconference applications including WebRTC, television services and web-based push-to-talk features. In this laboratory, you will work on traces of Webex conference calls to perform a classification task. Specifically, the traffic was collected on the client side during video conferencing. The traffic consists of traces (records) of RTP packets in chronological order. We then define successive time windows and aggregate the packets in each window, computing statistics that serve as a statistical representation of the traffic in that window.\n", "\n", "![](video_conference.png)" ] }, { "cell_type": "markdown", "id": "f9e3948e", "metadata": {}, "source": [ "### 2.1 Loading the dataset\n", "Unzip the RTP_dataset.csv.zip to get the csv dataset describing RTP traffic. Each record describes 1 second of traffic (packet aggregation) carrying a particular class of data. 
Each record reports 95 features including statistics on:\n", "- Packet size\n", "- Interarrival time*\n", "- RTP interarrival time*\n", "- Interlength*\n", "- Label describing the class of data carried by the flow.\n", "
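The starred (*) inter-packet statistics in the list above are based on gaps between consecutive packets. As a minimal sketch of how such statistics could be derived for one time window, the snippet below differences packet arrival timestamps with NumPy and summarizes the gaps. The helper name `interarrival_stats`, the sample timestamps, and the use of the population standard deviation are assumptions for illustration; the extraction code that actually produced the dataset is not shown in this notebook.

```python
import numpy as np

def interarrival_stats(arrival_times):
    # Hypothetical helper: summarize gaps between consecutive packet arrivals.
    # Sort defensively so each gap is (current packet) - (previous packet).
    times = np.sort(np.asarray(arrival_times, dtype=float))
    gaps = np.diff(times)
    if gaps.size == 0:
        # fewer than two packets: no inter-packet gap exists in this window
        return {}
    return {
        'interarrival_mean': float(gaps.mean()),
        'interarrival_std': float(gaps.std()),  # population std (assumption)
        'interarrival_min': float(gaps.min()),
        'interarrival_max': float(gaps.max()),
        'interarrival_p25': float(np.percentile(gaps, 25)),
    }

# illustrative arrival timestamps in seconds, not real traffic
stats = interarrival_stats([30.0, 31.0, 31.5])
print(stats)
```

The same differencing idea would apply to the other starred quantities (for example, packet lengths instead of timestamps), with the per-window summaries becoming one row of the dataset.
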
*inter statistics are computed from the difference between the current and the previous packet. For example, if packet 1 is received at 30s and packet 2 is received at 31s, the interarrival time between the two is 1s.\n", "\n", "Each record (row) belongs to a single class. There are 3 main classes: **Audio, Video, Screen Sharing**. In particular, the audio class consists of two sub-classes, **Audio and FEC-Audio**, and the video class can be further split into 4 sub-classes: **High Quality (>=720p), Medium Quality (360p\n", "\n", "
df preview (HTML rendering, duplicated by the text/plain output): 140000 rows × 96 columns, from interarrival_std, interarrival_mean, interarrival_min, ... to rtp_marker_sum_check and label
\n", "" ], "text/plain": [ " interarrival_std interarrival_mean interarrival_min \\\n", "0 0.001927 0.010000 0.004951 \n", "1 0.000515 0.020009 0.019227 \n", "2 0.041315 0.019994 0.000000 \n", "3 0.008119 0.019954 0.000873 \n", "4 0.018683 0.020117 0.000001 \n", "... ... ... ... \n", "139995 0.000799 0.337698 0.336812 \n", "139996 0.159892 0.239946 0.000108 \n", "139997 0.045574 0.040176 0.000012 \n", "139998 0.028728 0.325410 0.299745 \n", "139999 0.004189 0.040222 0.032511 \n", "\n", " interarrival_max interarrival_max_min_diff interarrival_p10 \\\n", "0 0.014423 0.009472 7.619953e-05 \n", "1 0.021251 0.002024 1.931565e-04 \n", "2 0.143393 0.143393 9.536743e-09 \n", "3 0.044432 0.043559 9.701633e-05 \n", "4 0.121093 0.121092 1.023531e-05 \n", "... ... ... ... \n", "139995 0.338365 0.001553 3.370330e-03 \n", "139996 0.320163 0.320055 9.596729e-04 \n", "139997 0.151814 0.151802 1.705837e-05 \n", "139998 0.356444 0.056699 3.038041e-03 \n", "139999 0.049401 0.016890 3.474479e-04 \n", "\n", " interarrival_p20 interarrival_p25 interarrival_p30 \\\n", "0 8.045912e-05 8.572698e-05 9.030223e-05 \n", "1 1.953020e-04 1.958430e-04 1.965890e-04 \n", "2 9.536743e-09 9.536743e-09 1.907349e-08 \n", "3 1.477895e-04 1.699674e-04 1.779909e-04 \n", "4 7.453918e-05 1.209468e-04 1.324451e-04 \n", "... ... ... ... \n", "139995 3.372540e-03 3.373646e-03 3.374751e-03 \n", "139996 1.918266e-03 2.397562e-03 2.876859e-03 \n", "139997 3.843689e-05 6.171942e-05 1.125135e-04 \n", "139998 3.078630e-03 3.098925e-03 3.119220e-03 \n", "139999 3.678946e-04 3.826904e-04 3.873811e-04 \n", "\n", " interarrival_p40 ... rtp_interarrival_max_min_R \\\n", "0 9.799051e-05 ... 0.500000 \n", "1 1.985469e-04 ... 0.500000 \n", "2 4.053116e-08 ... 0.500000 \n", "3 1.895509e-04 ... 0.500000 \n", "4 1.531601e-04 ... 0.500000 \n", "... ... ... ... \n", "139995 3.376961e-03 ... 0.511905 \n", "139996 3.196862e-03 ... 1.000000 \n", "139997 2.727780e-04 ... 0.500000 \n", "139998 3.159810e-03 ... 
0.511144 \n", "139999 3.936524e-04 ... 0.500000 \n", "\n", " rtp_interarrival_kurtosis rtp_interarrival_skew \\\n", "0 -3.000000 0.000000 \n", "1 -3.000000 0.000000 \n", "2 -3.000000 0.000000 \n", "3 -3.000000 0.000000 \n", "4 -3.000000 0.000000 \n", "... ... ... \n", "139995 -1.500000 -0.707107 \n", "139996 -0.671026 -1.148811 \n", "139997 -3.000000 0.000000 \n", "139998 -1.500000 -0.695813 \n", "139999 -3.000000 0.000000 \n", "\n", " rtp_interarrival_moment3 rtp_interarrival_moment4 \\\n", "0 0.000000e+00 0.000000e+00 \n", "1 0.000000e+00 0.000000e+00 \n", "2 0.000000e+00 0.000000e+00 \n", "3 0.000000e+00 0.000000e+00 \n", "4 0.000000e+00 0.000000e+00 \n", "... ... ... \n", "139995 -2.211840e+08 3.185050e+11 \n", "139996 -2.524719e+12 6.654528e+16 \n", "139997 0.000000e+00 0.000000e+00 \n", "139998 -1.628640e+08 2.163721e+11 \n", "139999 0.000000e+00 0.000000e+00 \n", "\n", " rtp_interarrival_len_unique_percent \\\n", "0 0.010000 \n", "1 0.020000 \n", "2 0.019231 \n", "3 0.020000 \n", "4 0.021739 \n", "... ... \n", "139995 0.666667 \n", "139996 1.000000 \n", "139997 0.043478 \n", "139998 1.000000 \n", "139999 0.040000 \n", "\n", " rtp_interarrival_max_value_count_percent rtp_interarrival_min_max_R \\\n", "0 1.000000 0.500000 \n", "1 1.000000 0.500000 \n", "2 1.000000 0.500000 \n", "3 1.000000 0.500000 \n", "4 1.000000 0.500000 \n", "... ... ... \n", "139995 0.666667 0.488095 \n", "139996 0.250000 0.000000 \n", "139997 1.000000 0.500000 \n", "139998 0.333333 0.488856 \n", "139999 1.000000 0.500000 \n", "\n", " rtp_marker_sum_check label \n", "0 0 Audio \n", "1 0 Audio \n", "2 0 Audio \n", "3 0 Audio \n", "4 0 Audio \n", "... ... ... 
\n", "139995 3 ScreenSharing \n", "139996 3 ScreenSharing \n", "139997 23 ScreenSharing \n", "139998 3 ScreenSharing \n", "139999 25 ScreenSharing \n", "\n", "[140000 rows x 96 columns]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"RTP_dataset.csv\")\n", "df" ] }, { "cell_type": "markdown", "id": "0ac3650a", "metadata": {}, "source": [ "### 2.2 Binary classification\n", "From now on, we focus on two major classes, **Video** and **Audio**, and you need to develop an ML pipeline to classify the traffic based on statistical features. Specifically, you will perform the following steps:\n", "- Data preprocessing\n", "- Model development (perform ERM with an algorithm)\n", "- Performance evaluation" ] }, { "cell_type": "markdown", "id": "bb0eab26", "metadata": {}, "source": [ "### 2.2.1 Dataset preprocessing - Data split and standardization\n", "- Extract only the data associated with the aforementioned classes.\n", "- Assign each class a numerical label (0 to Video and 1 to Audio).\n", "- Split the whole dataset into training and test sets. Stratify the split, keeping the 70/30 proportion (i.e., the training set contains 70% of the samples of each label, and the test set contains the remaining 30%).\n", "- After the split, standardize the data (features). Fit the StandardScaler only on the training set and then transform both the training and test sets. From now on, you will use the same standardized datasets for all the experiments." 
] }, { "cell_type": "code", "execution_count": null, "id": "9ff3ad91", "metadata": {}, "outputs": [], "source": [ "# This part is provided\n", "# You can simply run this cell\n", "\n", "# extract data from Video and Audio\n", "# we have to perform a copy of the dataset otherwise we will modify the original dataset\n", "\n", "video = ['FEC-Video', 'HighQ', 'LowQ', 'MediumQ']\n", "audio = ['Audio', 'FEC-Audio']\n", "screen = ['ScreenSharing']\n", "\n", "video_data = df[df[\"label\"].isin(video)].copy()\n", "audio_data = df[df[\"label\"].isin(audio)].copy()\n", "\n", "video_data[\"binary_label\"]=0\n", "audio_data[\"binary_label\"]=1\n", "\n", "video_data = video_data.drop(\"label\",axis=1)\n", "audio_data = audio_data.drop(\"label\",axis=1)\n", "\n", "binary_dataset = pd.concat([video_data, audio_data])\n", "\n", "# prepare the new dataset\n", "# get the X and y from the dataset\n", "X = binary_dataset.drop(columns=['binary_label']).to_numpy()\n", "y = binary_dataset[['binary_label']].to_numpy()" ] }, { "cell_type": "code", "execution_count": null, "id": "7e1cf5e9", "metadata": {}, "outputs": [], "source": [ "# your answer here\n", "\n", "# run stratified training-test splitting using train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X,\n", " y,\n", " stratify = y, # stratify the dataset based on class labels\n", " train_size = 0.7, # percentage of training set\n", " random_state = 42\n", ")\n", "\n", "# standardize data using StandardScaler\n", "scaler = StandardScaler()\n", "X_train_s = scaler.fit_transform(X_train, y_train)\n", "X_test_s = scaler.transform(X_test)" ] }, { "cell_type": "markdown", "id": "2ff6a273", "metadata": {}, "source": [ "### 2.2.2 Dataset preprocessing - Removal of correlated features\n", "- For the training set, compute and display the correlation matrix between the features (refer to lab 2 for details).\n", "- Remove strongly correlated features from both training and test sets, i.e., features having a 
correlation > 0.8. Note that a feature may be strongly correlated with many others.\n", "  - How many correlated features do you have to remove?" ] }, { "cell_type": "code", "execution_count": 56, "id": "280530a5", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Here everything is provided\n", "# The input here is the output required from previous steps\n", "# Just execute this and the following cells to check correlation matrix and remove correlated features\n", "\n", "# compute the correlation matrix\n", "columns= [i for i in range(X_train_s.shape[1])]\n", "df_tmp = pd.DataFrame(X_train_s, columns=columns)\n", "correlation_matrix = df_tmp.corr().abs()\n", "\n", "# display the heatmap\n", "plt.figure()\n", "sns.heatmap(correlation_matrix, cmap='Blues', vmin=0.8, vmax=1, cbar_kws={'label':'Correlation'})\n", "plt.xlabel('Feature')\n", "plt.ylabel('Feature')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 57, "id": "918ec8f4", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "58 features are removed\n" ] } ], "source": [ "# extract features having a correlation > 0.8\n", "c = correlation_matrix[correlation_matrix>0.8]\n", "s = c.unstack()\n", "so = s.sort_values(ascending=False).reset_index()\n", "\n", "# get strongly correlated feature pairs, dropping the diagonal entries (a feature paired with itself, correlation = 1)\n", "so = so[(so[0].isnull()==False) & (so[\"level_0\"] != so[\"level_1\"])]\n", "\n", "to_be_deleted = []\n", "candidates = list(so[\"level_0\"])\n", "\n", "# get the unique set of features to be deleted. 
Notice that we discard one feature at a time, to handle the case where a feature is strongly correlated with multiple features\n", "subset_so = so\n", "for candidate in candidates:\n", "    if (candidate in list(subset_so[\"level_0\"])):\n", "        to_be_deleted.append(candidate)\n", "        subset_so = subset_so[(subset_so[\"level_0\"] != candidate) & (subset_so[\"level_1\"] != candidate)]\n", "\n", "# to_be_deleted contains the index of columns that you need to remove from both training and test sets\n", "print(len(to_be_deleted), 'features are removed')\n", "\n", "# remove the correlated features from both sets\n", "\n", "# Create a mask for the columns to keep\n", "columns_to_keep = np.ones(X_train_s.shape[1], dtype=bool)\n", "columns_to_keep[to_be_deleted] = False\n", "\n", "# Use the mask to select only the columns to keep\n", "X_train_s = X_train_s[:, columns_to_keep]\n", "X_test_s = X_test_s[:, columns_to_keep]" ] }, { "cell_type": "markdown", "id": "a0d2ea79", "metadata": {}, "source": [ "### 2.2.3 Model development\n", "- Refer to the following 3 algorithms and train the models with predefined parameters:\n", "  - k-Nearest Neighbors (k-NN) (sklearn), with ``n_neighbors=3``.\n", "  - Logistic Regression (LR) (sklearn), with ``max_iter=150``.\n", "  - Random Forest (RF) (sklearn), with ``n_estimators=30``.\n", " \n", "  Explain how each parameter affects the algorithm.\n", "- After training, use each fitted model (hypothesis) to obtain predictions for both the test set and the training set." ] }, { "cell_type": "code", "execution_count": 85, "id": "3d79fc71", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/lib64/python3.13/site-packages/sklearn/neighbors/_classification.py:238: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples,), for example using ravel().\n", " return self._fit(X, y)\n", "/usr/lib64/python3.13/site-packages/sklearn/utils/validation.py:1300: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n", "/usr/lib64/python3.13/site-packages/sklearn/base.py:1474: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().\n", " return fit_method(estimator, *args, **kwargs)\n" ] }, { "data": { "text/html": [ "
RandomForestClassifier(n_estimators=30)
" ], "text/plain": [ "RandomForestClassifier(n_estimators=30)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "knn = KNeighborsClassifier(n_neighbors=3)\n", "knn.fit(X_train_s, y_train)\n", "\n", "lr = LogisticRegression(max_iter=150)\n", "lr.fit(X_train_s, y_train)\n", "\n", "rf = RandomForestClassifier(n_estimators=30)\n", "rf.fit(X_train_s, y_train)" ] }, { "cell_type": "markdown", "id": "8404d16e", "metadata": {}, "source": [ "### 2.2.4 Performance evaluation\n", "- You should now have predictions for both the training and test sets from each of the 3 models. Evaluate all 6 sets of predictions, computing and displaying numerical metrics (classification report) and the confusion matrix.\n", "- Answer the following questions:\n", "  - Which model produces the best performance? Why do you think so?\n", "  - For each model, which class is better classified? Does this differ among models?\n", "  - For each model, do you observe overfitting or underfitting? Why do you think so?" 
] }, { "cell_type": "code", "execution_count": 86, "id": "4439fe5a", "metadata": {}, "outputs": [], "source": [ "y_knn_test = knn.predict(X_test_s)\n", "y_lr_test = lr.predict(X_test_s)\n", "y_rf_test = rf.predict(X_test_s)\n", "\n", "y_knn_train = knn.predict(X_train_s)\n", "y_lr_train = lr.predict(X_train_s)\n", "y_rf_train = rf.predict(X_train_s)" ] }, { "cell_type": "code", "execution_count": null, "id": "d1483ecb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "KNN test\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 24000\n", " 1 1.00 1.00 1.00 12000\n", "\n", " accuracy 1.00 36000\n", " macro avg 1.00 1.00 1.00 36000\n", "weighted avg 1.00 1.00 1.00 36000\n", "\n", "KNN train\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 56000\n", " 1 1.00 1.00 1.00 28000\n", "\n", " accuracy 1.00 84000\n", " macro avg 1.00 1.00 1.00 84000\n", "weighted avg 1.00 1.00 1.00 84000\n", "\n" ] } ], "source": [ "print(\"KNN test\")\n", "print(classification_report(y_test, y_knn_test))\n", "print(\"KNN train\")\n", "print(classification_report(y_train, y_knn_train))" ] }, { "cell_type": "code", "execution_count": 83, "id": "3ae4595d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LR test\n", " precision recall f1-score support\n", "\n", " 0 0.99 0.83 0.90 24000\n", " 1 0.74 0.99 0.85 12000\n", "\n", " accuracy 0.88 36000\n", " macro avg 0.87 0.91 0.88 36000\n", "weighted avg 0.91 0.88 0.89 36000\n", "\n", "LR train\n", " precision recall f1-score support\n", "\n", " 0 0.99 0.84 0.91 56000\n", " 1 0.75 0.99 0.85 28000\n", "\n", " accuracy 0.89 84000\n", " macro avg 0.87 0.91 0.88 84000\n", "weighted avg 0.91 0.89 0.89 84000\n", "\n" ] } ], "source": [ "print(\"LR test\")\n", "print(classification_report(y_test, y_lr_test))\n", "print(\"LR train\")\n", "print(classification_report(y_train, y_lr_train))" ] }, { "cell_type": "code", "execution_count": 69, "id": 
"f3d981fa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RF test\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 24000\n", " 1 1.00 1.00 1.00 12000\n", "\n", " accuracy 1.00 36000\n", " macro avg 1.00 1.00 1.00 36000\n", "weighted avg 1.00 1.00 1.00 36000\n", "\n", "RF train\n", " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 56000\n", " 1 1.00 1.00 1.00 28000\n", "\n", " accuracy 1.00 84000\n", " macro avg 1.00 1.00 1.00 84000\n", "weighted avg 1.00 1.00 1.00 84000\n", "\n" ] } ], "source": [ "print(\"RF test\")\n", "print(classification_report(y_test, y_rf_test))\n", "print(\"RF train\")\n", "print(classification_report(y_train, y_rf_train))" ] }, { "cell_type": "code", "execution_count": 91, "id": "000c9a7a", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions\n", "confusion_test_knn = confusion_matrix(y_test, y_knn_test)\n", "confusion_test_lr = confusion_matrix(y_test, y_lr_test)\n", "confusion_test_rf = confusion_matrix(y_test, y_rf_test)\n", "\n", "for name, conf in [(\"KNN\", confusion_test_knn), (\"LR\", confusion_test_lr), (\"RF\", confusion_test_rf)]:\n", "    # visualize the confusion matrix\n", "    plt.figure(figsize=(5,4))\n", "    sns.heatmap(conf, cmap='Blues', annot=True, cbar_kws={'label':'Occurrences'})\n", "    plt.xlabel('Prediction')\n", "    plt.ylabel('True')\n", "    plt.title(f'Confusion matrix {name}')\n", "    plt.show()" ] }, { "cell_type": "markdown", "id": "abe29141", "metadata": {}, "source": [ "### 2.3 Regression analysis\n", "Here we investigate whether a certain feature can be represented through the other features by performing a regression analysis. In other words, we look for a relationship between one feature and the others. Specifically, we focus on `interarrival_std`, and you need to do the following:\n", "- Refer to the original dataset and remove the correlated features as you have done previously.\n", "- Randomly split the dataset into training and test sets (70/30, no stratification needed), and standardize the dataset as you have done previously.\n", "- Train a linear regression and a decision tree regressor, using the scikit-learn models with their default configuration. Documentation: Linear regression and Decision tree regressor.\n", "- Make predictions for both the training and test sets.\n", "- Output performance metrics for both sets and for both models by calculating the Mean Squared Error (MSE) and the Mean Absolute Error (MAE). scikit-learn also provides functions for these metrics. Answer the following:\n", "  - Do you observe overfitting or underfitting?\n", "  - Which model performs better?" 
] }, { "cell_type": "code", "execution_count": 96, "id": "3b3c4c1f", "metadata": {}, "outputs": [], "source": [ "# run a random (non-stratified) training-test split using train_test_split\n", "# NOTE: as written, X and y are still the arrays built in section 2.2.1 (binary label as target);\n", "# the target requested in 2.3 is interarrival_std\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X,\n", "    y,\n", "    train_size = 0.7, # fraction of samples in the training set\n", "    random_state = 69\n", ")\n", "\n", "# standardize data using StandardScaler\n", "scaler = StandardScaler()\n", "X_train_s = scaler.fit_transform(X_train, y_train)\n", "X_test_s = scaler.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 97, "id": "7954095c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
DecisionTreeRegressor()
" ], "text/plain": [ "DecisionTreeRegressor()" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lr = LinearRegression()\n", "lr.fit(X_train_s, y_train)\n", "\n", "dtr = DecisionTreeRegressor()\n", "dtr.fit(X_train_s, y_train)" ] }, { "cell_type": "code", "execution_count": 100, "id": "bb4eda39", "metadata": {}, "outputs": [], "source": [ "y_lr_test = lr.predict(X_test_s)\n", "y_dtr_test = dtr.predict(X_test_s)" ] }, { "cell_type": "code", "execution_count": 102, "id": "9087622d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MSE\n", "LR: 0.015691082997851952\n", "DTR: 0.0003333333333333333\n", "MAE\n", "LR: 0.08678245172383924\n", "DTR: 0.0003333333333333333\n" ] } ], "source": [ "print(\"MSE\")\n", "print(f\"LR: {mean_squared_error(y_test, y_lr_test)}\")\n", "print(f\"DTR: {mean_squared_error(y_test, y_dtr_test)}\")\n", "\n", "print(\"MAE\")\n", "print(f\"LR: {mean_absolute_error(y_test, y_lr_test)}\")\n", "print(f\"DTR: {mean_absolute_error(y_test, y_dtr_test)}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 5 }