Compare commits


7 Commits

Author SHA1 Message Date
d5b768962c labs: Add eighth lab 2024-11-28 23:01:35 +01:00
25846ac643 labs: Add seventh base code 2024-11-21 17:52:19 +01:00
ae84532d96 labs: Add partial sixth lab 2024-11-21 16:21:30 +01:00
50287674fd labs: Add fifth lab 2024-11-07 18:10:04 +01:00
b805c7b53f labs: Add third lab (partial) 2024-10-31 16:23:09 +01:00
2c93bc6d68 labs: Add second lab (partial) 2024-10-31 16:22:50 +01:00
829e235442 labs: Add second lab (partial) 2024-10-31 16:22:11 +01:00
15 changed files with 808717 additions and 0 deletions

500001
Labs/Lab 5/darknet_traces.csv Normal file

File diff suppressed because it is too large

2184
Labs/Lab 5/lab_5.ipynb Normal file

File diff suppressed because one or more lines are too long

919
Labs/Lab 6/lab_6.ipynb Normal file

@@ -0,0 +1,919 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "426a8016",
"metadata": {},
"source": [
"<center><b><font size=6>Lab-6 A classifier from scratch<b><center>"
]
},
{
"cell_type": "markdown",
"id": "a39139f5",
"metadata": {},
"source": [
"### Objective: Implement, use and evaluate a classifier (without using specific libraries such as sklearn)\n",
"1. **Logistic regression** is a binary classification method that maps a linear combination of parameters and variables into two possible classes. Here, you will implement the logistic regression from scratch to better understand how an ML algorithm works. Useful link: <a href=\"https://en.wikipedia.org/wiki/Logistic_regression\">Wiki</a>.\n",
"2. **Performance evaluation metrics** are needed to evaluate the outcome of prediction with respect to true labels. Here, you will implement confusion matrix, accuracy, precision, recall and F-measure. Useful link: <a href=\"https://en.wikipedia.org/wiki/Confusion_matrix\">Wiki</a>."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "b6bf32f9",
"metadata": {},
"outputs": [],
"source": [
"# import needed python libraries\n",
"\n",
"%matplotlib inline\n",
"\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import random"
]
},
{
"cell_type": "markdown",
"id": "c0959af0",
"metadata": {},
"source": [
"### 1. Dataset - TCP logs\n",
"The dataset contains traffic information generated by an open-source passive network monitoring tool, namely **tstat**. It automates the collection of packet statistics of traffic aggregates, using real-time monitoring features. Being a passive tool, the typical usage scenario is live monitoring of Internet links, in which all transmitted packets are observed. In case of TCP, Tstat identifies a new flow start when it observes a TCP three-way handshake. Similarly, it identifies a TCP flow end either when it sees the TCP connection teardown, or when it doesnt observe packets for some time (idle time). A flow is defined by a unique link between the sender and receiver, e.g., a tuple of <em>(IP_Protocol_Type, IP_Source_Address, Source_Port, IP_Destination_Address, Destination_Port)</em>. For a specific flow, tstat calculates a number of statistics of all the packets transmitted over this flow, and then generate a log for such flow with multiple attributes (statistics). A log file is arranged as a simple table where each column is associated to specific information and each row reports the flow during a connection. The log information is a summary of the flow properties. For instance, in the TCP log we can find columns like the starting time of a TCP connection, its duration, the number of sent and received packets, the observed Round Trip Time.\n",
"![](tstat.png)\n",
"\n",
"In this lab, since the focus is on the development of logistic regression from scratch, we only consider a portion of the dataset for simplicity. The data can be found in `log_tcp_part.csv`, in which there are multiple columns, the last one is the class label, indicating the flow is from either **google** or **youtube**, and the rest are features. Your job is a binary classification task to classify the domain of each flow (row) **from scratch**, including:\n",
"- Build a logistic regression model,\n",
"- Evaluate the performance."
]
},
{
"cell_type": "markdown",
"id": "8fc1d837",
"metadata": {},
"source": [
"1. Load the dataset.\n",
"2. Get the list of features (columns 1 to 10).\n",
"3. Add a new column and assign numerical class labels of -1 and 1 to google and youtube.\n",
"4. Answering the following questions:\n",
" - How many features do we have?\n",
" - How many samples do we have in total?\n",
" - How many samples do we have for each class? Are they similar?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "70294ef9",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_226018/230400442.py:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" df_tcp.replace({\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>c_msgsize_count</th>\n",
" <th>c_pktsize6</th>\n",
" <th>c_msgsize4</th>\n",
" <th>s_msgsize4</th>\n",
" <th>s_pktsize2</th>\n",
" <th>s_rtt_cnt</th>\n",
" <th>s_rtt_std</th>\n",
" <th>s_msgsize5</th>\n",
" <th>c_msgsize6</th>\n",
" <th>c_sit3</th>\n",
" <th>class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1418</td>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0.466732</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>0.413304</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1418</td>\n",
" <td>1</td>\n",
" <td>0.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1418</td>\n",
" <td>0</td>\n",
" <td>0.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>-1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19995</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>37</td>\n",
" <td>0</td>\n",
" <td>1418</td>\n",
" <td>3</td>\n",
" <td>22.224528</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.334</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19996</th>\n",
" <td>6</td>\n",
" <td>45</td>\n",
" <td>45</td>\n",
" <td>57</td>\n",
" <td>1418</td>\n",
" <td>2</td>\n",
" <td>0.000000</td>\n",
" <td>45</td>\n",
" <td>45</td>\n",
" <td>1.252</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19997</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>1205</td>\n",
" <td>0</td>\n",
" <td>531</td>\n",
" <td>4</td>\n",
" <td>15.323660</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>4975.694</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19998</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>690</td>\n",
" <td>0</td>\n",
" <td>767</td>\n",
" <td>4</td>\n",
" <td>17.997651</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1719.125</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19999</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0.000</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>20000 rows × 11 columns</p>\n",
"</div>"
],
"text/plain": [
" c_msgsize_count c_pktsize6 c_msgsize4 s_msgsize4 s_pktsize2 \\\n",
"0 1 0 0 0 1418 \n",
"1 1 0 0 0 0 \n",
"2 1 0 0 0 0 \n",
"3 1 0 0 0 1418 \n",
"4 1 0 0 0 1418 \n",
"... ... ... ... ... ... \n",
"19995 4 0 37 0 1418 \n",
"19996 6 45 45 57 1418 \n",
"19997 4 0 1205 0 531 \n",
"19998 4 0 690 0 767 \n",
"19999 1 0 0 0 0 \n",
"\n",
" s_rtt_cnt s_rtt_std s_msgsize5 c_msgsize6 c_sit3 class \n",
"0 0 0.000000 0 0 0.000 -1 \n",
"1 3 0.466732 0 0 0.000 -1 \n",
"2 3 0.413304 0 0 0.000 -1 \n",
"3 1 0.000000 0 0 0.000 -1 \n",
"4 0 0.000000 0 0 0.000 -1 \n",
"... ... ... ... ... ... ... \n",
"19995 3 22.224528 0 0 3.334 1 \n",
"19996 2 0.000000 45 45 1.252 1 \n",
"19997 4 15.323660 0 0 4975.694 1 \n",
"19998 4 17.997651 0 0 1719.125 1 \n",
"19999 1 0.000000 0 0 0.000 1 \n",
"\n",
"[20000 rows x 11 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tcp = pd.read_csv('log_tcp_part.csv')\n",
"features = df_tcp.columns[:-1] # Remove class\n",
"df_tcp.replace({\n",
" \"class\": {\n",
" \"google\": -1,\n",
" \"youtube\": 1,\n",
" }\n",
"}, inplace=True)\n",
"\n",
"df_tcp"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "48d85d94",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of features: 10\n",
"Number of samples: 20000\n",
"Number of samples of google: 10000\n",
"Number of samples of youtube: 10000\n"
]
}
],
"source": [
"num_features = features.size\n",
"num_samples = len(df_tcp)\n",
"num_google = len(df_tcp.loc[df_tcp[\"class\"] == -1])\n",
"num_youtube = len(df_tcp.loc[df_tcp[\"class\"] == 1])\n",
"\n",
"print(f\"Number of features: {num_features}\")\n",
"print(f\"Number of samples: {num_samples}\")\n",
"print(f\"Number of samples of google: {num_google}\")\n",
"print(f\"Number of samples of youtube: {num_youtube}\")"
]
},
{
"cell_type": "markdown",
"id": "c1c8cc80",
"metadata": {},
"source": [
"### 2. Implement your logistic regression learning algorithm\n",
"Here you will need to construct a class in which you need to define two functions besides the class initialization:\n",
"- `fit`. In this method you will perform ERM. Learn the parameters of the model (i.e., the hypothesis h) from training with gradient descent\n",
"- `predict`. In this method given one sample x (or more) you will perform the inference $sign(h(x))$ to obtain class labels.\n",
"\n",
"Hints:\n",
"\n",
"- The linear function used in the logistic regression is the following: $h(x)=w^T x +b $, where b is a scalar bias.\n",
"- Logistic loss: $L((x,y),h)=\\log(1+e^{-y h(x)})$\n",
"- ERM: $\\min_{w,b} f(w,b)=\\frac{1}{m}\\sum_{i=1}^{m} \\log(1+e^{-y^{(i)} h(x^{(i)})})$\n",
"- Gradient for weight: $\\nabla_w f(w,b) = \\frac{1}{m} \\sum_i \\frac{-y^{(i)}x^{(i)}}{(1+e^{y^{(i)}h(x^{(i)})})}$\n",
"- Gradient for bias: $\\nabla_b f(w,b)= \\frac{1}{m} \\sum_i \\frac{-y^{(i)}}{(1+e^{y^{(i)}h(x^{(i)})})}$\n",
"- Update the parameters: $w \\leftarrow w - \\alpha \\nabla w$, $b \\leftarrow b - \\alpha \\nabla b$\n",
"\n",
"Notice that the sigmoid function $f(z) = \\frac{1}{1 + e^{-z}}$ appears multiple times. You can write also a method for the sigmoid function to help you in the computation. By considering f(z), the gradients rewrite as:\n",
"\n",
"- Gradient for weight: $\\nabla_w f(w,b) = \\frac{1}{m} \\sum_i ({f(h(x^{(i)})) - y^{(i)}})x^{(i)}$\n",
"- Gradient for bias: $\\nabla_b f(w,b) = \\frac{1}{m} \\sum_i ({f(h(x^{(i)})) - y^{(i)}})$"
]
},
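{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"As a quick sanity check (not part of the original lab), the analytic gradients above can be compared against finite differences on a tiny synthetic batch. The helper names below (`loss`, `grad_w`) are illustrative, and the labels are taken in $\\{0,1\\}$ to match the rewritten gradient form."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d5",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: compare the analytic gradient with finite differences.\n",
"rng = np.random.default_rng(0)\n",
"Xc = rng.normal(size=(5, 3))    # 5 samples, 3 features\n",
"yc = np.array([1, 0, 1, 0, 1])  # labels in {0, 1} for this gradient form\n",
"\n",
"def loss(w, b):\n",
"    # average cross-entropy of the logistic model on the toy batch\n",
"    s = 1 / (1 + np.exp(-(Xc @ w + b)))\n",
"    return -np.mean(yc * np.log(s) + (1 - yc) * np.log(1 - s))\n",
"\n",
"def grad_w(w, b):\n",
"    # analytic gradient from the rewritten form above\n",
"    s = 1 / (1 + np.exp(-(Xc @ w + b)))\n",
"    return Xc.T @ (s - yc) / len(yc)\n",
"\n",
"w0, b0, eps = np.zeros(3), 0.0, 1e-6\n",
"num = np.array([(loss(w0 + eps * np.eye(3)[j], b0) - loss(w0, b0)) / eps for j in range(3)])\n",
"print(np.allclose(num, grad_w(w0, b0), atol=1e-4))  # expected: True\n"
]
},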
{
"cell_type": "code",
"execution_count": 176,
"id": "90a02f52",
"metadata": {},
"outputs": [],
"source": [
"def sigmoid(z):\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"\n",
"class LogisticRegression:\n",
" def __init__(self, learning_rate, num_iterations):\n",
" self.learning_rate = learning_rate\n",
" self.num_iterations = num_iterations\n",
"\n",
"\n",
" def h(self, X):\n",
" return np.dot(X, self.w) + self.b\n",
" \n",
" \n",
" def gradient_step_w(self, m, X, y):\n",
" h = self.h(X)\n",
" f = sigmoid(h)\n",
" s = np.dot(X.T, np.subtract(f, y))\n",
"\n",
" return s/m\n",
" \n",
"\n",
" def gradient_step_b(self, m, X, y):\n",
" h = self.h(X)\n",
" f = sigmoid(h)\n",
" s = np.subtract(f, y).sum()\n",
" \n",
" return s/m\n",
"\n",
"\n",
" def fit(self, X, y):\n",
" self.w = np.zeros((X.shape[1]))\n",
" self.b = 0\n",
" m = len(X)\n",
" \n",
" for i in range(self.num_iterations):\n",
" w_step = self.gradient_step_w(m, X, y)\n",
" b_step = self.gradient_step_b(m, X, y)\n",
"\n",
" self.w -= self.learning_rate*w_step\n",
" self.b -= self.learning_rate*b_step\n",
"\n",
" y_predict = np.transpose(self.predict(X))==y\n",
" correct_predictions = np.count_nonzero(y_predict == True)\n",
" accuracy = correct_predictions/len(y)\n",
" print(accuracy)\n",
" \n",
"\n",
" def predict(self, X):\n",
" if self.w is None or self.b is None:\n",
" raise ValueError\n",
" \n",
" p = self.h(X)\n",
" return np.sign(p)"
]
},
{
"cell_type": "markdown",
"id": "cc478b78",
"metadata": {},
"source": [
"### 3. Use the model\n",
"- Initialize your model with predefined learning rate of `0.1` and iterations of `100`.\n",
"- Fit your model with features and targets.\n",
"- Get the prediction with features."
]
},
{
"cell_type": "code",
"execution_count": 177,
"id": "af5a590d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5768\n",
"0.5845333333333333\n",
"0.5632\n",
"0.6148\n",
"0.5904666666666667\n",
"0.617\n",
"0.5955333333333334\n",
"0.5915333333333334\n",
"0.6082666666666666\n",
"0.5925333333333334\n",
"0.6115333333333334\n",
"0.5924666666666667\n",
"0.6012666666666666\n",
"0.5922666666666667\n",
"0.6109333333333333\n",
"0.5952\n",
"0.5922\n",
"0.5996666666666667\n",
"0.5904666666666667\n",
"0.6062\n",
"0.5915333333333334\n",
"0.5988\n",
"0.5916\n",
"0.5979333333333333\n",
"0.5917333333333333\n",
"0.5962\n",
"0.5933333333333334\n",
"0.5955333333333334\n",
"0.5945333333333334\n",
"0.5947333333333333\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5947333333333333\n",
"0.5946666666666667\n",
"0.5946666666666667\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5945333333333334\n",
"0.5946666666666667\n",
"0.5945333333333334\n",
"0.5944666666666667\n",
"0.5946\n",
"0.5945333333333334\n",
"0.5946\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5945333333333334\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5946\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5945333333333334\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5945333333333334\n",
"0.5946666666666667\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5947333333333333\n",
"0.5946666666666667\n",
"0.5947333333333333\n",
"0.5946\n",
"0.5947333333333333\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5946\n",
"0.5946\n",
"0.5946\n",
"0.5946\n",
"0.5946\n",
"0.5946666666666667\n",
"0.5947333333333333\n",
"0.5946666666666667\n",
"0.5947333333333333\n",
"0.5948\n",
"0.5948\n",
"0.5948\n",
"0.5948\n",
"0.5948666666666667\n",
"0.5947333333333333\n",
"0.5948\n",
"0.5948666666666667\n",
"0.5947333333333333\n",
"0.5948666666666667\n",
"0.5947333333333333\n",
"0.5947333333333333\n",
"0.5947333333333333\n",
"0.5947333333333333\n",
"0.5948\n",
"0.5946666666666667\n",
"0.5948\n",
"0.5947333333333333\n",
"0.5946666666666667\n",
"0.5948\n",
"0.5947333333333333\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n",
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
" return 1/(1+np.exp(np.negative(z)))\n"
]
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(df_tcp.drop(columns=[\"class\"], inplace=False), df_tcp[\"class\"])\n",
"\n",
"lr = LogisticRegression(0.1, 100)\n",
"lr.fit(x_train, y_train.values)"
]
},
{
"cell_type": "code",
"execution_count": 174,
"id": "beda67a9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.6056"
]
},
"execution_count": 174,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_predict = np.transpose(lr.predict(x_test))==y_test.values\n",
"correct_predictions = np.count_nonzero(y_predict == True)\n",
"accuracy = correct_predictions/len(y_test)\n",
"\n",
"accuracy"
]
},
{
"cell_type": "code",
"execution_count": 175,
"id": "8db63dad",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5965333333333334"
]
},
"execution_count": 175,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_predict = np.transpose(lr.predict(x_train))==y_train.values\n",
"correct_predictions = np.count_nonzero(y_predict == True)\n",
"accuracy = correct_predictions/len(y_train)\n",
"\n",
"accuracy"
]
},
{
"cell_type": "markdown",
"id": "bc5ad9e7",
"metadata": {},
"source": [
"### 4. Model evaluation\n",
"With predicted class labels and ground truths, we now evaluate the model performance through confusion matrix and numerical metrics. Specifically, you need to derive the following:\n",
"- Confusion matrix - Note that, you should indicate the corresponding quantity of each element in the table. Here positive is class 1 and negative is class -1:\n",
"\\begin{array}{|c|c|c|}\n",
"\\hline\n",
" & \\textbf{Predicted Positive} & \\textbf{Predicted Negative} \\\\\n",
"\\hline\n",
"\\textbf{Actual Positive} & \\text{True Positive (TP)} & \\text{False Negative (FN)} \\\\\n",
"\\hline\n",
"\\textbf{Actual Negative} & \\text{False Positive (FP)} & \\text{True Negative (TN)} \\\\\n",
"\\hline\n",
"\\end{array}\n",
"- Precision of each class and the average value:\n",
"$\\frac{\\text{True Positive (TP)}}{\\text{True Positive (TP) + False Positive (FP)}}$\n",
"- Recall of each class and the average value:\n",
"$\\frac{\\text{True Positive (TP)}}{\\text{True Positive (TP) + False Negative (FN)}}$\n",
"- F1-score of each class and the average value:\n",
"$F_1 = \\frac{2 \\times \\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}$\n",
"- Accuracy:\n",
"$\\frac{\\text{True Positive (TP) + True Negative (TN)}}{\\text{True Positive (TP) + True Negative (TN) + False Positive (FP) + False Negative (FN)}}$\n",
"- Answering the following questions:\n",
" - Do you have same performance between classes? If not, which one performs better?\n",
" - Change the parameters of learning rate or number of iterations. Do you have same performance? Better or Worse? Why?"
]
},
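{
"cell_type": "markdown",
"id": "b2c3d4e5",
"metadata": {},
"source": [
"A minimal from-scratch sketch of the requested metrics (the helper names are illustrative, not the official solution). It assumes `y_true` and `y_pred` are arrays with labels in {-1, 1}, e.g. `y_test.values` and `lr.predict(x_test)`, and takes class 1 as the positive class."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e6",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: confusion matrix and derived metrics for labels in {-1, 1}.\n",
"def confusion_matrix(y_true, y_pred):\n",
"    tp = np.sum((y_true == 1) & (y_pred == 1))\n",
"    fn = np.sum((y_true == 1) & (y_pred == -1))\n",
"    fp = np.sum((y_true == -1) & (y_pred == 1))\n",
"    tn = np.sum((y_true == -1) & (y_pred == -1))\n",
"    return tp, fn, fp, tn\n",
"\n",
"def metrics(y_true, y_pred):\n",
"    tp, fn, fp, tn = confusion_matrix(y_true, y_pred)\n",
"    precision = tp / (tp + fp)\n",
"    recall = tp / (tp + fn)\n",
"    f1 = 2 * precision * recall / (precision + recall)\n",
"    accuracy = (tp + tn) / (tp + fn + fp + tn)\n",
"    return precision, recall, f1, accuracy\n",
"\n",
"# example usage: metrics(y_test.values, lr.predict(x_test))\n"
]
},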
{
"cell_type": "code",
"execution_count": null,
"id": "15b74982",
"metadata": {},
"outputs": [],
"source": [
"# your answers here"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.2"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 5
}

20001
Labs/Lab 6/log_tcp_part.csv Normal file

File diff suppressed because it is too large

BIN
Labs/Lab 6/tstat.png Normal file

Binary file not shown (new file, 132 KiB).

140001
Labs/Lab 7/RTP_dataset.csv Normal file

File diff suppressed because it is too large

2265
Labs/Lab 7/lab_7.ipynb Normal file

File diff suppressed because one or more lines are too long

Binary file not shown (new file, 130 KiB).

140001
Labs/Lab 8/RTP_dataset.csv Normal file

File diff suppressed because it is too large

Binary file not shown.

1303
Labs/Lab 8/lab_8.ipynb Normal file

File diff suppressed because one or more lines are too long

BIN
Labs/Lab 8/validation.png Normal file

Binary file not shown (new file, 101 KiB).

765
Labs/Lab2 - Numpy.ipynb Normal file

@@ -0,0 +1,765 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "b1ea060a47a6211d",
"metadata": {},
"source": [
"# LAB #2: Numpy\n",
"\n",
"## Introduction\n",
"In this laboratory, you will perform some operation with NumPy arrays in such a way to build your first Machine Learning model. \n",
"In particular, you will build a NumPy-based version of the K-Nearest Neighbors algorithm (a.k.a. KNN).\n",
"\n",
"## 0 Preliminary steps\n",
"### 0.1 NumPy\n",
"Make sure you have the NumPy library installed, its use is strongly recommended for this laboratory.\n",
"NumPy is the fundamental package for scientific computing with Python. You can read more about it on\n",
"the official documentation.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9246699975edf562",
"metadata": {},
"outputs": [],
"source": [
"! pip install numpy"
]
},
{
"cell_type": "markdown",
"id": "ad497ed1d0092203",
"metadata": {},
"source": [
"### 0.2 Iris dataset download \n",
"For this lab, you will need two of the datasets you have already met: Iris and MNIST. Please refer to\n",
"Laboratory 1 for a complete description of the datasets.\n",
"Iris. You can download it from:\n",
"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a838a5ed77a24051",
"metadata": {},
"outputs": [],
"source": [
"# linux users\n",
"# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O iris.csv\n",
"# windows users\n",
"! pip install wget\n",
"import wget\n",
"wget.download(\"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\", \"iris.csv\")"
]
},
{
"cell_type": "markdown",
"id": "ef169d9060adb9a7",
"metadata": {},
"source": [
"## 1 Exercises \n",
"Note that exercises marked with a ($\\star$) are optional, you should focus on completing the other ones first."
]
},
{
"cell_type": "markdown",
"id": "a820274dc6b6f678",
"metadata": {},
"source": [
"## 1.1 Iris Analysis with Numpy\n",
"As you might remember from Lab. 1, the Iris dataset collects the measurements of different Iris flowers,\n",
"and each data point is characterized by 4 **features** (sepal length, sepal width, petal length, petal width) and is associated to 1 **label** (i.e. an Iris species - Setosa, Versicolor, or Virginica) which in this case is the last element of the row (last column of the csv file). "
]
},
{
"cell_type": "markdown",
"id": "46864c46cf9f9387",
"metadata": {},
"source": [
"1. Load the Iris dataset. You can use the `csv` library that we saw in the last laboratory or read it with the standard `open(filename, strategy)`. \n",
"In the second case remember to split correctly the different fields, and avoid new line characters. In any case check for empty lines. \n",
"This time remember to store the 4 features in a numpy array `x` of shape (n_sample, 4) and the labels in a different array `y` of shape (n_sample,) converting the 3 different species to a corresponding numerical value. E.g.,\n",
" - Iris-setosa: 0\n",
" - Iris-versicolor: 1\n",
" - Iris-virginica: 2\n",
"\n",
"In order to check you have correctly loaded the data, print the shape of the two arrays: you should find\n",
"(150, 4) for `x` and (150,) for `y`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a977ccc88ef2ca39",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150, 4)\n",
"(150,)\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"def type_mapper(type):\n",
" match type:\n",
" case b\"Iris-setosa\":\n",
" return 0\n",
" case b\"Iris-versicolor\":\n",
" return 1\n",
" case b\"Iris-virginica\":\n",
" return 2\n",
" \n",
" return -1\n",
"\n",
"raw_csv = np.loadtxt(\"iris.csv\",\n",
" delimiter=\",\", dtype=float, converters={4:type_mapper})\n",
"\n",
"x = raw_csv[:,0:4]\n",
"y = raw_csv[:,4]\n",
"\n",
"print(x.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "markdown",
"id": "5050d162966956ce",
"metadata": {},
"source": [
"2. Compute again the mean and standard deviation for each class by means of the numpy functions"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "33bfaed602d4bc3e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Metrics for specie 0\n",
"Sepal length for mean: 5.006, std_dev: 0.3489469873777391\n",
"Sepal width mean: 3.418, std_dev: 0.37719490982779713\n",
"Petal length mean: 1.464, std_dev: 0.17176728442867112\n",
"Petal width mean: 0.244, std_dev: 0.10613199329137281\n",
"\n",
"Metrics for specie 1\n",
"Sepal length for mean: 5.936, std_dev: 0.5109833656783751\n",
"Sepal width mean: 2.7700000000000005, std_dev: 0.31064449134018135\n",
"Petal length mean: 4.26, std_dev: 0.4651881339845203\n",
"Petal width mean: 1.3259999999999998, std_dev: 0.19576516544063705\n",
"\n",
"Metrics for specie 2\n",
"Sepal length for mean: 6.587999999999998, std_dev: 0.6294886813914926\n",
"Sepal width mean: 2.974, std_dev: 0.3192553836664309\n",
"Petal length mean: 5.5520000000000005, std_dev: 0.546347874526844\n",
"Petal width mean: 2.0260000000000002, std_dev: 0.2718896835115301\n",
"\n"
]
}
],
"source": [
"for i in range(3):\n",
" iris = x[np.ma.masked_where(y, y==i)]\n",
"\n",
" print(f\"Metrics for specie {i}\")\n",
" print(f\"Sepal length for mean: {iris[:,0].mean()}, std_dev: {iris[:,0].std()}\")\n",
" print(f\"Sepal width mean: {iris[:,1].mean()}, std_dev: {iris[:,1].std()}\")\n",
" print(f\"Petal length mean: {iris[:,2].mean()}, std_dev: {iris[:,2].std()}\")\n",
" print(f\"Petal width mean: {iris[:,3].mean()}, std_dev: {iris[:,3].std()}\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "1f84beb708797ba9",
"metadata": {},
"source": [
"3. Compute the distances among two samples (e.g., the $36^{th}$ and the $81^{th}$, the $13^{th}$ and the $15^{th}$) \n",
"by means of the `np.linalg.norm(a-b)` function which computes the norm of `a-b`, i.e., the euclidean distance between the feature of the `a` and of the `b` samples. \n",
" - Can you guess if the two couples of samples belong to the same species?\n",
" - From the mean and standard deviations computed before can you guess which species? "
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "4a47fb722be07fb4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.7892651361962706\n",
"1.4317821063276353\n"
]
}
],
"source": [
"print(np.linalg.norm(x[35]-x[81]))\n",
"print(np.linalg.norm(x[12]-x[14]))"
]
},
{
"cell_type": "markdown",
"id": "9dc024bce0c0dd04",
"metadata": {
"collapsed": false
},
"source": [
"TODO: write your comment here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd802b47b8519bb3",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
" "
]
},
{
"cell_type": "markdown",
"id": "f3fa448bd7bc9d94",
"metadata": {
"collapsed": false
},
"source": [
"TODO: write your comment here"
]
},
{
"cell_type": "markdown",
"id": "dcceaccd4a1a7526",
"metadata": {
"collapsed": false
},
"source": [
"4. Find the k nearest neighbors of a sample in the dataset.\n",
" - Define a function `k_nearest_neighbors(x, x_set, k)` that takes as input a sample `x` and a set of sample (i.e., a matrix) `x_set` and returns the indices of the `k` nearest neighbors of `x` in `x_set`.\n",
" - Reuse the `euclidean_distance` function that you defined before to do so. \n",
" - Remember that the `x_set` is a matrix of shape ($N_{samples}, N_{features}$), so you have to compute the distance between `x` and each row of `x_set`. \n",
" - In order to find the indices of the `k` nearest neighbors, you can use the `argsort` function that returns the indices that would sort an array\n",
" - Apply the function to the $36^{th}$ sample of the dataset with $k=5$.\n",
" - Print the indices of the $5$ nearest neighbors.\n",
" - Print the labels of the $5$ nearest neighbors. Can you guess the label of the $36^{th}$ sample?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "b93f94748b3841e3",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Label of 0 nearest neighbor: 0.0\n",
"Label of 1 nearest neighbor: 0.0\n",
"Label of 2 nearest neighbor: 0.0\n",
"Label of 3 nearest neighbor: 0.0\n",
"Label of 4 nearest neighbor: 0.0\n",
"Real label: 0.0\n"
]
}
],
"source": [
"def k_nearest_neighbors(x: np.ndarray, x_set: np.ndarray, k: int):\n",
" distances = np.linalg.norm(x-x_set, axis=1)\n",
" distances_sorted = np.argsort(distances)\n",
"\n",
" return distances_sorted[0:k]\n",
"\n",
"indices = k_nearest_neighbors(x[35], x, 5)\n",
"for i, k in enumerate(indices):\n",
" print(f\"Label of {i} nearest neighbor: {y[k]}\")\n",
"\n",
"print(f\"Real label: {y[35]}\")"
]
},
{
"cell_type": "markdown",
"id": "4de2b1c8798fc98e",
"metadata": {},
"source": [
"TODO: write your comment here"
]
},
{
"cell_type": "markdown",
"id": "9dd1f94b256663e8",
"metadata": {},
"source": [
"## 1.2 KNN design and implementation\n",
"In this exercise, you will implement your own version of the K-Nearest Neighbors (KNN) algorithm, and you will use it to assign an\n",
"Iris species (i.e. a label) to flowers whose species is unknown.\n",
"\n",
"The KNN algorithm is straightforward. Suppose that some measurements (e.g., the iris features) and their\n",
"relative label (e.g., the iris species) of a set of samples are known in advance. \n",
"\n",
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img2.png\" width=\"800\">\n",
"\n",
"Then, whenever we want to label a new sample, we look at the K most similar points (a.k.a. neighbors) and assign a label accordingly. \n",
"\n",
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img1-1.png\" width=\"800\">\n",
"\n",
"\n",
"The simplest solution is using a majority voting scheme: if the majority of the neighbors votes for a label, we will go for it. \n",
"This approach is naive only at first sight: the local similarity assumed by KNN happens to be roughly true, as you have seen in the previous exercises.\n",
"Even though this reasoning does not generalize well, the KNN provides a valid baseline for your tasks.\n"
]
},
{
"cell_type": "markdown",
"id": "5d185976071690ce",
"metadata": {},
"source": [
"1. Lets identify a portion of our data for which we will try to guess the species. Randomly select 20%\n",
"of the records and store the first four columns (i.e. the features representing each flower) into a\n",
"two-dimensional numpy array of shape ($N_{test}, 4$), you can call it `X_test` and $N_{test}$ is the 20% of the total number of samples.\n",
"For the same records, store the test label column (i.e. the one with the species values) into another array, namely `y_test`. \n",
"This is the data that will be used to test the accuracy of your KNN implementation and its correct functioning (i.e. the testing data)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "a642f03b563650e8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1. 0. 0. 2. 2. 0. 2. 1. 1. 1. 1. 1. 2. 2. 2. 1. 1. 2. 2. 2. 1. 2. 2. 2.\n",
" 0. 2. 1. 0. 2. 1.]\n"
]
}
],
"source": [
"test_subset_indices = np.random.choice(len(y), size=int(len(y)*0.2), replace=False)\n",
"X_test = x[test_subset_indices]\n",
"Y_test = y[test_subset_indices]\n",
"\n",
"x[test_subset_indices]\n",
"\n",
"print(Y_test)"
]
},
{
"cell_type": "markdown",
"id": "192e5663358e8e82",
"metadata": {},
"source": [
"2. Store the remaining 80% of the records in the same way. In this case, use the names X_train andy_train for the arrays.\n",
"This is the data that your model will use as ground-truth knowledge (i.e. the training data, from which we extract the knowledge and that we will use for comparison).\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "b9f1639cc7fe3b53",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
" 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.\n",
" 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n",
" 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.\n",
" 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]\n"
]
}
],
"source": [
"train_subset_indices = [i not in test_subset_indices for i in range(len(y))]\n",
"X_train = x[train_subset_indices]\n",
"Y_train = y[train_subset_indices]\n",
"\n",
"print(Y_train)"
]
},
{
"cell_type": "markdown",
"id": "dbbc62af2fef1d5c",
"metadata": {},
"source": [
"3. Focus now on the KNN technique. \n",
"From the next month, you will use the `scikit-learn` package. Many of its functionalities\n",
"are exposed via an object-oriented interface. With this paradigm in mind, implement now the KNN\n",
"algorithm and expose it as a Python class. The bare skeleton of your class should look like this (you\n",
"are free to add other methods if you want to).\n",
"\n",
"```\n",
"class KNearestNeighbors:\n",
" def __init__(self, k):\n",
" \"\"\"\n",
" Store the value of k in a attribute of the class and initialize other attributes.\n",
" :param k : int, number of neighbors to consider.\n",
" \"\"\"\n",
" pass # TODO: implement it!\n",
" def fit(self, X, y):\n",
" \"\"\"\n",
" Store the 'prior knowledge' of you model that will be used\n",
" to predict new labels.\n",
" :param X : input data points, ndarray, shape = (R,C).\n",
" :param y : input labels, ndarray, shape = (R,).\n",
" \"\"\"\n",
" pass # TODO: implement it!\n",
" \n",
" def predict(self, X):\n",
" \"\"\"Run the KNN classification on X.\n",
" :param X: input data points, ndarray, shape = (N,C).\n",
" :return: labels : ndarray, shape = (N,).\n",
" \"\"\"\n",
" pass # TODO: implement it!\n",
"\n",
"```\n",
"\n",
"\n",
"Implement the `__init__` and `fit` methods first. \n",
"- In the `__init__` method, you should store the value of `k` in a private attribute of the class.\n",
"- In the `fit` method you should only store the training data in private attributes of the class."
]
},
{
"cell_type": "code",
"execution_count": 104,
"id": "b5de6a78df7f8585",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-10T12:53:39.426246Z",
"start_time": "2024-10-10T12:53:39.420295Z"
}
},
"outputs": [],
"source": [
"class KNearestNeighbors:\n",
" def __init__(self, k):\n",
" \"\"\"\n",
" Store the value of k in a attribute of the class and initialize other attributes.\n",
" :param k : int, number of neighbors to consider.\n",
" \"\"\"\n",
" self.k = k\n",
"\n",
" def fit(self, X, y):\n",
" \"\"\"\n",
" Store the 'prior knowledge' of you model that will be used\n",
" to predict new labels.\n",
" :param X : input data points, ndarray, shape = (R,C).\n",
" :param y : input labels, ndarray, shape = (R,).\n",
" \"\"\"\n",
" self.X = x\n",
" self.y = y\n",
"\n",
" def vote(self, labels: np.ndarray):\n",
" voting = np.unique(labels, return_counts=True)\n",
" return voting[0][voting[1].argmax()]\n",
"\n",
" \n",
" def predict(self, X):\n",
" \"\"\"Run the KNN classification on X.\n",
" :param X: input data points, ndarray, shape = (N,C).\n",
" :return: labels : ndarray, shape = (N,).\n",
" \"\"\"\n",
" distances = [np.linalg.norm(x-self.X, axis=1) for x in X]\n",
" distances_sorted = np.argsort(distances)\n",
" nearest_neighbors_labels = y[distances_sorted[:,0:self.k]]\n",
"\n",
" return np.apply_along_axis(self.vote, 1, nearest_neighbors_labels)"
]
},
{
"cell_type": "markdown",
"id": "6ad6f4fc7071bff0",
"metadata": {},
"source": [
"4. Implement the `predict` method. The function receives as input a numpy array with N rows and C\n",
"columns, corresponding to N flowers. The method assigns to each row one of the three Iris species \n",
"using the KNN algorithm, and returns the predicted species as a numpy array. \n",
"\n",
" - For finding nearest neighbours, you can either re-use the previously defined `k_nearest_neighbors` function or \n",
"implement a new one exploiting the numpy broadcasting capabilities in order to avoid iterating over the sample matrix `X`.\n",
" - Then, assign the *predicted label* to each sample using a majority voting scheme, i.e., the label that appears most frequently among the k nearest neighbors. To do so you can use the `np.unique(neighbours_labels, return_count=True)` function that returns the unique labels and their counts. \n",
" - Finally, return the predicted labels as a numpy array."
]
},
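{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"A minimal sketch of the broadcasting idea mentioned above, on toy arrays: adding singleton axes lets NumPy compute all pairwise distances in one shot, without a Python loop over the query samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c3d4e5f7",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: pairwise Euclidean distances via broadcasting.\n",
"# A has shape (N, C), B has shape (R, C); the result has shape (N, R).\n",
"A = np.arange(6.0).reshape(2, 3)   # 2 query points\n",
"B = np.arange(12.0).reshape(4, 3)  # 4 reference points\n",
"dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)\n",
"print(dists.shape)  # (2, 4)\n"
]
},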
{
"cell_type": "code",
"execution_count": 13,
"id": "c227627e47cc7253",
"metadata": {
"ExecuteTime": {
"end_time": "2024-10-10T13:03:44.621187Z",
"start_time": "2024-10-10T13:03:44.609767Z"
}
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4cbd1131d3ba785d",
"metadata": {},
"source": [
"5. Now lets fit the KNN model with the X_train and y_train data. Then, try to use your KNN model\n",
"to predict the species for each record in X_test and store them in a nupy array called y_pred.\n",
"As we did in the previous lab, check how many Iris species in the array y_pred have been guessed correctly computing with respect to the ones in y_test computing the accuracy. \n",
" - A prediction is correct if `y_pred[i] == y_test[i]`. To get the accuracy then compute the ratio between the number of correct guesses and the total number of guesses is known. \n",
" - If all labels are assigned correctly ((y_pred == y_test).all() == True), the accuracy of the model is 100%. \n",
" - Instead, if none of the guessed species corresponds to the real one ((y_pred == y_test).any() == False), the accuracy is 0%\n"
]
},
{
"cell_type": "code",
"execution_count": 112,
"id": "ca4f0b4bbe44c9fe",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8666666666666667\n"
]
}
],
"source": [
"knn = KNearestNeighbors(5)\n",
"knn.fit(X_train, Y_train)\n",
"predictions = knn.predict(X_test)\n",
"correct_guesses = predictions == Y_test\n",
"accuracy = np.count_nonzero(correct_guesses == True) / len(correct_guesses)\n",
"print(accuracy)"
]
},
{
"cell_type": "markdown",
"id": "7514fc82de74b729",
"metadata": {},
"source": [
"6. ($\\star$) As a software developer, you might want to increase the functionalities of your product and\n",
"publish newer versions over time. The better your code is structured and organized, the lower is the\n",
"effort to release updates.\n",
"As such, extend your KNN implementation adding the parameter `distance`. This has to be one among:\n",
" - Euclidean distance: $ euclidean(p,q) = \\sqrt{\\sum_{i=1}^{n} (p_i _- q_i)^2} $\n",
" - Manhattan distance: $ manhattan(p,q) = \\sum_{i=1}^n |p_i - q_i|$\n",
" - Cosine distance: $ cosine(p, q) = 1 - \\frac{\\sum_{i=1}^n p_i q_i}{ \\sqrt{\\sum^n_{i=1} p^2_i} \\cdot \\sqrt{\\sum^n_{i=1} q_i^2}}$\n",
"\n",
"If any of this distance is not already implemented in `numpy` implement it yourself"
]
},
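{
"cell_type": "markdown",
"id": "d4e5f607",
"metadata": {},
"source": [
"A minimal sketch of the three metrics (the helper names are my own, not required by the lab): each takes a single point `p` of shape (C,) and a matrix `Q` of shape (R, C), and returns the R distances."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f608",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: the three distances, vectorized over the rows of Q.\n",
"def euclidean(p, Q):\n",
"    return np.sqrt(((Q - p) ** 2).sum(axis=1))\n",
"\n",
"def manhattan(p, Q):\n",
"    return np.abs(Q - p).sum(axis=1)\n",
"\n",
"def cosine(p, Q):\n",
"    num = (Q * p).sum(axis=1)\n",
"    den = np.linalg.norm(p) * np.linalg.norm(Q, axis=1)\n",
"    return 1 - num / den\n"
]
},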
{
"cell_type": "code",
"execution_count": null,
"id": "436c6395a2f3d853",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "24c76d735fe65dbd",
"metadata": {},
"source": [
"\n",
"7. ($\\star$) Again, extend now your KNN implementation by adding the parameter `weights` to the constructor,\n",
"as shown below:\n",
"\n",
"```\n",
"class KNearestNeighbors:\n",
" def __init__(self, k, distance_metric=\"euclidean\", weights=\"uniform\"):\n",
" self.k = k\n",
" self.distance_metric = distance_metric\n",
" self.weights = weights\n",
"```\n",
"\n",
"Change your KNN implementation to accept a new weighting scheme for the labels. If weights=\n",
"\"distance\", weight neighbor votes by the inverse of their distance (for the distance, again, use\n",
"distance_metric). The weight for a neighbor of the point p is:\n",
"\n",
"$\n",
"w(p, n) = \\frac{1}{distance\\_metric(p, n)}\n",
"$\n",
"\n",
"Instead, if the default is chosen (weights=\"uniform\"), use the majority voting you already implemented\n",
"in Exercise 6.\n",
"\n",
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img5.png\">\n"
]
},
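{
"cell_type": "markdown",
"id": "e5f60718",
"metadata": {},
"source": [
"One possible shape for the weighted vote (an illustrative sketch, not the required implementation): for a single query point, sum the inverse-distance weights per label and return the label with the largest total."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f60719",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: inverse-distance weighted voting for one query point.\n",
"# labels, distances: arrays of shape (k,) for the k nearest neighbors.\n",
"def weighted_vote(labels, distances, eps=1e-9):\n",
"    weights = 1.0 / (distances + eps)  # eps guards against zero distance\n",
"    classes = np.unique(labels)\n",
"    totals = np.array([weights[labels == c].sum() for c in classes])\n",
"    return classes[totals.argmax()]\n"
]
},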
{
"cell_type": "code",
"execution_count": null,
"id": "a84262b9fd13d9f1",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "54f1e2a662695741",
"metadata": {},
"source": [
"8. ($\\star$) Test the modularity of the implementation applying it on a different dataset. Ideally, you should\n",
"not change the code of your KNN python class.\n",
"- Download the MNIST dataset and retain only 100 samples per digit. You will end up with a dataset of 1000 samples.\n",
"- Define again four numpy arrays as you did in Exercises 2 and 3.\n",
"- Apply your KNN as you did for the Iris dataset.\n",
"- Evaluate the accuracy on MNISTs y_test."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b720ef714195eb68",
"metadata": {},
"outputs": [],
"source": [
"# download MNIST dataset\n",
"\n",
"# linux users\n",
"#! wget https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv -O mnist.csv\n",
"\n",
"# windows users\n",
"! pip install wget\n",
"import wget\n",
"wget.download(\"https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv\", \"mnist.csv\")\n"
]
},
{
"cell_type": "code",
"execution_count": 158,
"id": "77afcee410ef94ac",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 0 ... 0 0 0]\n",
" [0 0 0 ... 0 0 0]\n",
" [0 0 0 ... 0 0 0]\n",
" ...\n",
" [9 0 0 ... 0 0 0]\n",
" [9 0 0 ... 0 0 0]\n",
" [9 0 0 ... 0 0 0]]\n",
"(1000, 784)\n",
"(1000,)\n"
]
}
],
"source": [
"# extracting MNIST dataset\n",
"import numpy as np\n",
"\n",
"raw_csv = np.loadtxt(\"mnist.csv\",\n",
" delimiter=\",\", dtype=int, converters={4:type_mapper})\n",
"\n",
"dataset_reduced = np.ndarray((0,785),dtype=int)\n",
"\n",
"for i in range(10):\n",
" items_with_digit = raw_csv[raw_csv[:,0] == i]\n",
" dataset_reduced = np.concatenate((dataset_reduced, items_with_digit[0:100,:]))\n",
"\n",
"print(dataset_reduced)\n",
"\n",
"x = dataset_reduced[:,1:]\n",
"y = dataset_reduced[:,0]\n",
"\n",
"print(x.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "code",
"execution_count": 160,
"id": "d1a0834dd8885a2b",
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# define four numpy arrays x_train, y_train, x_test, y_test\n",
"test_subset_indices = np.random.choice(len(y), size=int(len(y)*0.2), replace=False)\n",
"X_test = x[test_subset_indices]\n",
"Y_test = y[test_subset_indices]\n",
"\n",
"x[test_subset_indices]\n",
"\n",
"train_subset_indices = [i not in test_subset_indices for i in range(len(y))]\n",
"X_train = x[train_subset_indices]\n",
"Y_train = y[train_subset_indices]\n"
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "c03d2add840c1531",
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.885\n"
]
}
],
"source": [
"# Apply KNN on MNIST\n",
"knn = KNearestNeighbors(5)\n",
"knn.fit(X_train, Y_train)\n",
"predictions = knn.predict(X_test)\n",
"correct_guesses = predictions == Y_test\n",
"accuracy = np.count_nonzero(correct_guesses == True) / len(correct_guesses)\n",
"print(accuracy)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because one or more lines are too long

BIN
Labs/New_York_City_Map.PNG Normal file

Binary file not shown (new file, 85 KiB).