Compare commits
7 Commits
b659c6a237
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| | d5b768962c | | |
| | 25846ac643 | | |
| | ae84532d96 | | |
| | 50287674fd | | |
| | b805c7b53f | | |
| | 2c93bc6d68 | | |
| | 829e235442 | | |
500001
Labs/Lab 5/darknet_traces.csv
Normal file
File diff suppressed because it is too large
2184
Labs/Lab 5/lab_5.ipynb
Normal file
File diff suppressed because one or more lines are too long
919
Labs/Lab 6/lab_6.ipynb
Normal file
@@ -0,0 +1,919 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "426a8016",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"<center><b><font size=6>Lab-6 A classifier from scratch</font></b></center>"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a39139f5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Objective: Implement, use and evaluate a classifier (without using specific libraries such as sklearn)\n",
|
||||
"1. **Logistic regression** is a binary classification method that maps a linear combination of parameters and variables to one of two possible classes. Here, you will implement logistic regression from scratch to better understand how an ML algorithm works. Useful link: <a href=\"https://en.wikipedia.org/wiki/Logistic_regression\">Wiki</a>.\n",
|
||||
"2. **Performance evaluation metrics** are needed to compare predictions against the true labels. Here, you will implement the confusion matrix, accuracy, precision, recall and F-measure. Useful link: <a href=\"https://en.wikipedia.org/wiki/Confusion_matrix\">Wiki</a>."
|
||||
]
|
||||
},
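The evaluation metrics named in the objective can be sketched without any ML library. The snippet below is a minimal, standard-library-only illustration, assuming labels in {-1, +1} with +1 treated as the positive class; the function names `confusion_matrix` and `classification_metrics` and the toy labels are hypothetical and not part of the lab.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1  # predicted positive, truly positive
            else:
                fp += 1  # predicted positive, truly negative
        else:
            if t == positive:
                fn += 1  # predicted negative, truly positive
            else:
                tn += 1  # predicted negative, truly negative
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion matrix."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (illustrative only): -1 could stand for google, +1 for youtube.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]
metrics = classification_metrics(y_true, y_pred)
print(confusion_matrix(y_true, y_pred))
print(metrics)
```

Returning 0.0 when a denominator is zero is one possible convention for the degenerate cases (for example, no positive predictions); other conventions exist.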
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "b6bf32f9",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# import needed python libraries\n",
|
||||
"\n",
|
||||
"%matplotlib inline\n",
|
||||
"\n",
|
||||
"import pandas as pd\n",
|
||||
"import seaborn as sns\n",
|
||||
"import numpy as np\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"import random"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c0959af0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 1. Dataset - TCP logs\n",
|
||||
"The dataset contains traffic information generated by **tstat**, an open-source passive network monitoring tool. It automates the collection of packet statistics of traffic aggregates, using real-time monitoring features. Being a passive tool, its typical usage scenario is live monitoring of Internet links, in which all transmitted packets are observed. In the case of TCP, tstat identifies the start of a new flow when it observes a TCP three-way handshake. Similarly, it identifies the end of a TCP flow either when it sees the TCP connection teardown, or when it doesn’t observe packets for some time (idle time). A flow is defined by a unique link between the sender and receiver, e.g., a tuple of <em>(IP_Protocol_Type, IP_Source_Address, Source_Port, IP_Destination_Address, Destination_Port)</em>. For each flow, tstat computes a number of statistics over all the packets transmitted in that flow, and then generates a log entry for the flow with multiple attributes (statistics). A log file is arranged as a simple table where each column is associated with a specific piece of information and each row reports one flow. The log information is a summary of the flow properties. For instance, in the TCP log we can find columns such as the starting time of a TCP connection, its duration, the number of sent and received packets, and the observed Round Trip Time.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"In this lab, since the focus is on developing logistic regression from scratch, we only consider a portion of the dataset for simplicity. The data can be found in `log_tcp_part.csv`. The file has multiple columns: the last one is the class label, indicating whether the flow is from **google** or **youtube**, and the rest are features. Your job is to solve a binary classification task **from scratch**, classifying the domain of each flow (row). This includes:\n",
|
||||
"- Build a logistic regression model,\n",
|
||||
"- Evaluate the performance."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "8fc1d837",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Load the dataset.\n",
|
||||
"2. Get the list of features (columns 1 to 10).\n",
|
||||
"3. Add a new column and assign numerical class labels of -1 and 1 to google and youtube.\n",
|
||||
"4. Answer the following questions:\n",
|
||||
" - How many features do we have?\n",
|
||||
" - How many samples do we have in total?\n",
|
||||
" - How many samples do we have for each class? Are they similar?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"id": "70294ef9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/tmp/ipykernel_226018/230400442.py:3: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
||||
" df_tcp.replace({\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>c_msgsize_count</th>\n",
|
||||
" <th>c_pktsize6</th>\n",
|
||||
" <th>c_msgsize4</th>\n",
|
||||
" <th>s_msgsize4</th>\n",
|
||||
" <th>s_pktsize2</th>\n",
|
||||
" <th>s_rtt_cnt</th>\n",
|
||||
" <th>s_rtt_std</th>\n",
|
||||
" <th>s_msgsize5</th>\n",
|
||||
" <th>c_msgsize6</th>\n",
|
||||
" <th>c_sit3</th>\n",
|
||||
" <th>class</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1418</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>-1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>0.466732</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>-1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>0.413304</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>-1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1418</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>-1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1418</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>-1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>19995</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>37</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1418</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>22.224528</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3.334</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>19996</th>\n",
|
||||
" <td>6</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>57</td>\n",
|
||||
" <td>1418</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>45</td>\n",
|
||||
" <td>1.252</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>19997</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1205</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>531</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>15.323660</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>4975.694</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>19998</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>690</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>767</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>17.997651</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1719.125</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>19999</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0.000000</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0.000</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>20000 rows × 11 columns</p>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" c_msgsize_count c_pktsize6 c_msgsize4 s_msgsize4 s_pktsize2 \\\n",
|
||||
"0 1 0 0 0 1418 \n",
|
||||
"1 1 0 0 0 0 \n",
|
||||
"2 1 0 0 0 0 \n",
|
||||
"3 1 0 0 0 1418 \n",
|
||||
"4 1 0 0 0 1418 \n",
|
||||
"... ... ... ... ... ... \n",
|
||||
"19995 4 0 37 0 1418 \n",
|
||||
"19996 6 45 45 57 1418 \n",
|
||||
"19997 4 0 1205 0 531 \n",
|
||||
"19998 4 0 690 0 767 \n",
|
||||
"19999 1 0 0 0 0 \n",
|
||||
"\n",
|
||||
" s_rtt_cnt s_rtt_std s_msgsize5 c_msgsize6 c_sit3 class \n",
|
||||
"0 0 0.000000 0 0 0.000 -1 \n",
|
||||
"1 3 0.466732 0 0 0.000 -1 \n",
|
||||
"2 3 0.413304 0 0 0.000 -1 \n",
|
||||
"3 1 0.000000 0 0 0.000 -1 \n",
|
||||
"4 0 0.000000 0 0 0.000 -1 \n",
|
||||
"... ... ... ... ... ... ... \n",
|
||||
"19995 3 22.224528 0 0 3.334 1 \n",
|
||||
"19996 2 0.000000 45 45 1.252 1 \n",
|
||||
"19997 4 15.323660 0 0 4975.694 1 \n",
|
||||
"19998 4 17.997651 0 0 1719.125 1 \n",
|
||||
"19999 1 0.000000 0 0 0.000 1 \n",
|
||||
"\n",
|
||||
"[20000 rows x 11 columns]"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_tcp = pd.read_csv('log_tcp_part.csv')\n",
|
||||
"features = df_tcp.columns[:-1] # Remove class\n",
|
||||
"df_tcp.replace({\n",
|
||||
" \"class\": {\n",
|
||||
" \"google\": -1,\n",
|
||||
" \"youtube\": 1,\n",
|
||||
" }\n",
|
||||
"}, inplace=True)\n",
|
||||
"\n",
|
||||
"df_tcp"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"id": "48d85d94",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Number of features: 10\n",
|
||||
"Number of samples: 20000\n",
|
||||
"Number of samples of google: 10000\n",
|
||||
"Number of samples of youtube: 10000\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"num_features = features.size\n",
|
||||
"num_samples = len(df_tcp)\n",
|
||||
"num_google = len(df_tcp.loc[df_tcp[\"class\"] == -1])\n",
|
||||
"num_youtube = len(df_tcp.loc[df_tcp[\"class\"] == 1])\n",
|
||||
"\n",
|
||||
"print(f\"Number of features: {num_features}\")\n",
|
||||
"print(f\"Number of samples: {num_samples}\")\n",
|
||||
"print(f\"Number of samples of google: {num_google}\")\n",
|
||||
"print(f\"Number of samples of youtube: {num_youtube}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c1c8cc80",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 2. Implement your logistic regression learning algorithm\n",
|
||||
"Here you will construct a class that defines two methods besides the class initialization:\n",
|
||||
"- `fit`. In this method you will perform empirical risk minimization (ERM): learn the parameters of the model (i.e., the hypothesis h) from the training data with gradient descent.\n",
|
||||
"- `predict`. In this method given one sample x (or more) you will perform the inference $sign(h(x))$ to obtain class labels.\n",
|
||||
"\n",
|
||||
"Hints:\n",
|
||||
"\n",
|
||||
"- The linear function used in the logistic regression is the following: $h(x)=w^T x +b $, where b is a scalar bias.\n",
|
||||
"- Logistic loss: $L((x,y),h)=\\log(1+e^{-y h(x)})$\n",
|
||||
"- ERM: $\\min_{w,b} f(w,b)=\\frac{1}{m}\\sum_{i=1}^{m} \\log(1+e^{-y^{(i)} h(x^{(i)})})$\n",
|
||||
"- Gradient for weight: $\\nabla_w f(w,b) = \\frac{1}{m} \\sum_i \\frac{-y^{(i)}x^{(i)}}{(1+e^{y^{(i)}h(x^{(i)})})}$\n",
|
||||
"- Gradient for bias: $\\nabla_b f(w,b)= \\frac{1}{m} \\sum_i \\frac{-y^{(i)}}{(1+e^{y^{(i)}h(x^{(i)})})}$\n",
|
||||
"- Update the parameters: $w \\leftarrow w - \\alpha \\nabla_w f(w,b)$, $b \\leftarrow b - \\alpha \\nabla_b f(w,b)$, where $\\alpha$ is the learning rate.\n",
|
||||
"\n",
|
||||
"Notice that the sigmoid function $\\sigma(z) = \\frac{1}{1 + e^{-z}}$ appears in both gradients, since $\\frac{1}{1+e^{y h(x)}} = \\sigma(-y h(x))$. You can also write a helper method for the sigmoid function to simplify the computation. With labels $y^{(i)} \\in \\{-1, 1\\}$, the gradients rewrite as:\n",
|
||||
"\n",
|
||||
"- Gradient for weight: $\\nabla_w f(w,b) = \\frac{1}{m} \\sum_i -y^{(i)} \\sigma(-y^{(i)} h(x^{(i)})) x^{(i)}$\n",
|
||||
"- Gradient for bias: $\\nabla_b f(w,b) = \\frac{1}{m} \\sum_i -y^{(i)} \\sigma(-y^{(i)} h(x^{(i)}))$"
|
||||
]
|
||||
},
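The gradient hints in their first form (with $1+e^{y h}$ in the denominator) can be sanity-checked against finite differences of the logistic loss. The following is a minimal sketch using only the standard library, assuming labels in {-1, +1}; the helper names, the toy data and the step size `eps` are illustrative assumptions, not part of the lab.

```python
import math


def h(w, b, x):
    """Linear score h(x) = w^T x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def loss(w, b, X, y):
    """Mean logistic loss (1/m) * sum_i log(1 + exp(-y_i * h(x_i)))."""
    m = len(X)
    return sum(math.log1p(math.exp(-yi * h(w, b, xi)))
               for xi, yi in zip(X, y)) / m


def gradients(w, b, X, y):
    """Analytic gradients: (1/m) * sum_i -y_i * x_i / (1 + exp(y_i * h(x_i)))."""
    m = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        coeff = -yi / (1.0 + math.exp(yi * h(w, b, xi)))
        for j, xj in enumerate(xi):
            grad_w[j] += coeff * xj
        grad_b += coeff
    return [g / m for g in grad_w], grad_b / m


# Small hypothetical dataset with labels in {-1, +1}.
X = [[0.5, -1.0], [1.5, 2.0], [-1.0, 0.3], [-2.0, -0.7]]
y = [1, 1, -1, -1]
w, b = [0.1, -0.2], 0.05

grad_w, grad_b = gradients(w, b, X, y)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
num_grad_b = (loss(w, b + eps, X, y) - loss(w, b - eps, X, y)) / (2 * eps)
num_grad_w = []
for j in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += eps
    w_minus[j] -= eps
    num_grad_w.append((loss(w_plus, b, X, y) - loss(w_minus, b, X, y)) / (2 * eps))

print(grad_w, num_grad_w)
print(grad_b, num_grad_b)
```

If the analytic and numerical values disagree noticeably, the usual suspects are a sign error or a label convention mismatch (for example, {0, 1} labels plugged into a formula derived for {-1, +1}).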
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 176,
|
||||
"id": "90a02f52",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def sigmoid(z):\n",
|
||||
" return 1/(1+np.exp(np.negative(z)))\n",
|
||||
"\n",
|
||||
"class LogisticRegression:\n",
|
||||
" def __init__(self, learning_rate, num_iterations):\n",
|
||||
" self.learning_rate = learning_rate\n",
|
||||
"        self.num_iterations = num_iterations\n",
|
||||
"        self.w = None  # weights, set by fit()\n",
|
||||
"        self.b = None  # bias, set by fit()\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def h(self, X):\n",
|
||||
" return np.dot(X, self.w) + self.b\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" def gradient_step_w(self, m, X, y):\n",
|
||||
" h = self.h(X)\n",
|
||||
"        f = sigmoid(-y * h)  # sigma(-y*h); labels y are in {-1, +1}\n",
|
||||
"        s = np.dot(X.T, -y * f)  # sum_i -y_i * sigma(-y_i*h_i) * x_i\n",
|
||||
"\n",
|
||||
"        return s/m\n",
|
||||
"    \n",
|
||||
"\n",
|
||||
"    def gradient_step_b(self, m, X, y):\n",
|
||||
"        h = self.h(X)\n",
|
||||
"        f = sigmoid(-y * h)  # sigma(-y*h); labels y are in {-1, +1}\n",
|
||||
"        s = (-y * f).sum()  # sum_i -y_i * sigma(-y_i*h_i)\n",
|
||||
" \n",
|
||||
" return s/m\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def fit(self, X, y):\n",
|
||||
" self.w = np.zeros((X.shape[1]))\n",
|
||||
" self.b = 0\n",
|
||||
" m = len(X)\n",
|
||||
" \n",
|
||||
" for i in range(self.num_iterations):\n",
|
||||
" w_step = self.gradient_step_w(m, X, y)\n",
|
||||
" b_step = self.gradient_step_b(m, X, y)\n",
|
||||
"\n",
|
||||
" self.w -= self.learning_rate*w_step\n",
|
||||
" self.b -= self.learning_rate*b_step\n",
|
||||
"\n",
|
||||
" y_predict = np.transpose(self.predict(X))==y\n",
|
||||
" correct_predictions = np.count_nonzero(y_predict == True)\n",
|
||||
" accuracy = correct_predictions/len(y)\n",
|
||||
" print(accuracy)\n",
|
||||
" \n",
|
||||
"\n",
|
||||
" def predict(self, X):\n",
|
||||
" if self.w is None or self.b is None:\n",
|
||||
" raise ValueError\n",
|
||||
" \n",
|
||||
" p = self.h(X)\n",
|
||||
" return np.sign(p)"
|
||||
]
|
||||
},
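A sigmoid written directly as `1/(1+exp(-z))` evaluates `exp` on a large positive argument whenever `z` is very negative, which is where overflow can occur. Two common overflow-free formulations are sketched below for scalar inputs using only the standard library; the function names are hypothetical, and an array version would need to apply the same idea elementwise.

```python
import math


def sigmoid_stable(z):
    """Logistic function that only ever exponentiates a non-positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)


def sigmoid_tanh(z):
    """Equivalent form via the identity sigma(z) = (1 + tanh(z / 2)) / 2."""
    return 0.5 * (1.0 + math.tanh(0.5 * z))


# Both forms stay finite for inputs of very large magnitude.
for z in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(z, sigmoid_stable(z), sigmoid_tanh(z))
```

The branching version mirrors the textbook definition most closely, while the `tanh` version is a single expression, which can be convenient when vectorizing.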
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cc478b78",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 3. Use the model\n",
|
||||
"- Initialize your model with a learning rate of `0.1` and `100` iterations.\n",
|
||||
"- Fit your model with features and targets.\n",
|
||||
"- Get the prediction with features."
|
||||
]
|
||||
},
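The three steps above (initialize, fit, predict) can be illustrated end to end on a tiny dataset. This is a hedged, standard-library-only sketch of batch gradient descent on the logistic loss, assuming labels in {-1, +1}; the free functions `fit` and `predict` and the toy data are hypothetical stand-ins for the class used in the lab, and ties at a score of exactly zero are arbitrarily assigned to +1.

```python
import math


def fit(X, y, learning_rate=0.1, num_iterations=100):
    """Batch gradient descent on the mean logistic loss; returns (w, b)."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(num_iterations):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            coeff = -yi / (1.0 + math.exp(yi * score))  # per-sample gradient factor
            for j in range(n):
                grad_w[j] += coeff * xi[j]
            grad_b += coeff
        # Step against the averaged gradient.
        w = [wj - learning_rate * gj / m for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b / m
    return w, b


def predict(w, b, X):
    """Class labels from the sign of the linear score (0 is mapped to +1)."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
            for xi in X]


# Linearly separable toy data (illustrative only).
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 0.5],
     [-1.0, -2.0], [-2.0, -1.5], [-0.5, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = fit(X, y, learning_rate=0.1, num_iterations=100)
y_pred = predict(w, b, X)
print(w, b)
print(y_pred)
```

On real traffic features with very different scales, the same learning rate may behave quite differently, so the toy result here says nothing about accuracy on `log_tcp_part.csv`.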
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 177,
|
||||
"id": "af5a590d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.5768\n",
|
||||
"0.5845333333333333\n",
|
||||
"0.5632\n",
|
||||
"0.6148\n",
|
||||
"0.5904666666666667\n",
|
||||
"0.617\n",
|
||||
"0.5955333333333334\n",
|
||||
"0.5915333333333334\n",
|
||||
"0.6082666666666666\n",
|
||||
"0.5925333333333334\n",
|
||||
"0.6115333333333334\n",
|
||||
"0.5924666666666667\n",
|
||||
"0.6012666666666666\n",
|
||||
"0.5922666666666667\n",
|
||||
"0.6109333333333333\n",
|
||||
"0.5952\n",
|
||||
"0.5922\n",
|
||||
"0.5996666666666667\n",
|
||||
"0.5904666666666667\n",
|
||||
"0.6062\n",
|
||||
"0.5915333333333334\n",
|
||||
"0.5988\n",
|
||||
"0.5916\n",
|
||||
"0.5979333333333333\n",
|
||||
"0.5917333333333333\n",
|
||||
"0.5962\n",
|
||||
"0.5933333333333334\n",
|
||||
"0.5955333333333334\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5944666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5945333333333334\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5948\n",
|
||||
"0.5948\n",
|
||||
"0.5948\n",
|
||||
"0.5948\n",
|
||||
"0.5948666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5948\n",
|
||||
"0.5948666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5948666666666667\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5948\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5948\n",
|
||||
"0.5947333333333333\n",
|
||||
"0.5946666666666667\n",
|
||||
"0.5948\n",
|
||||
"0.5947333333333333\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
|
||||
" return 1/(1+np.exp(np.negative(z)))\n",
|
||||
"/tmp/ipykernel_226018/1018497028.py:2: RuntimeWarning: overflow encountered in exp\n",
|
||||
" return 1/(1+np.exp(np.negative(z)))\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"\n",
|
||||
"x_train, x_test, y_train, y_test = train_test_split(df_tcp.drop(columns=[\"class\"], inplace=False), df_tcp[\"class\"])\n",
|
||||
"\n",
|
||||
"lr = LogisticRegression(0.1, 100)\n",
|
||||
"lr.fit(x_train, y_train.values)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 174,
|
||||
"id": "beda67a9",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.6056"
|
||||
]
|
||||
},
|
||||
"execution_count": 174,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"y_predict = np.transpose(lr.predict(x_test))==y_test.values\n",
|
||||
"correct_predictions = np.count_nonzero(y_predict == True)\n",
|
||||
"accuracy = correct_predictions/len(y_test)\n",
|
||||
"\n",
|
||||
"accuracy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 175,
|
||||
"id": "8db63dad",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.5965333333333334"
|
||||
]
|
||||
},
|
||||
"execution_count": 175,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"y_predict = np.transpose(lr.predict(x_train))==y_train.values\n",
|
||||
"correct_predictions = np.count_nonzero(y_predict == True)\n",
|
||||
"accuracy = correct_predictions/len(y_train)\n",
|
||||
"\n",
|
||||
"accuracy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "bc5ad9e7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 4. Model evaluation\n",
|
||||
"With predicted class labels and ground truths, we now evaluate the model performance through confusion matrix and numerical metrics. Specifically, you need to derive the following:\n",
|
||||
"- Confusion matrix - Note that, you should indicate the corresponding quantity of each element in the table. Here positive is class 1 and negative is class -1:\n",
|
||||
"\\begin{array}{|c|c|c|}\n",
|
||||
"\\hline\n",
|
||||
" & \\textbf{Predicted Positive} & \\textbf{Predicted Negative} \\\\\n",
|
||||
"\\hline\n",
|
||||
"\\textbf{Actual Positive} & \\text{True Positive (TP)} & \\text{False Negative (FN)} \\\\\n",
|
||||
"\\hline\n",
|
||||
"\\textbf{Actual Negative} & \\text{False Positive (FP)} & \\text{True Negative (TN)} \\\\\n",
|
||||
"\\hline\n",
|
||||
"\\end{array}\n",
|
||||
"- Precision of each class and the average value:\n",
|
||||
"$\\frac{\\text{True Positive (TP)}}{\\text{True Positive (TP) + False Positive (FP)}}$\n",
|
||||
"- Recall of each class and the average value:\n",
|
||||
"$\\frac{\\text{True Positive (TP)}}{\\text{True Positive (TP) + False Negative (FN)}}$\n",
|
||||
"- F1-score of each class and the average value:\n",
|
||||
"$F_1 = \\frac{2 \\times \\text{Precision} \\times \\text{Recall}}{\\text{Precision} + \\text{Recall}}$\n",
|
||||
"- Accuracy:\n",
|
||||
"$\\frac{\\text{True Positive (TP) + True Negative (TN)}}{\\text{True Positive (TP) + True Negative (TN) + False Positive (FP) + False Negative (FN)}}$\n",
|
||||
"- Answering the following questions:\n",
|
||||
" - Do you have same performance between classes? If not, which one performs better?\n",
|
||||
" - Change the parameters of learning rate or number of iterations. Do you have same performance? Better or Worse? Why?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "15b74982",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# your answers here"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.2"
|
||||
},
|
||||
"varInspector": {
|
||||
"cols": {
|
||||
"lenName": 16,
|
||||
"lenType": 16,
|
||||
"lenVar": 40
|
||||
},
|
||||
"kernels_config": {
|
||||
"python": {
|
||||
"delete_cmd_postfix": "",
|
||||
"delete_cmd_prefix": "del ",
|
||||
"library": "var_list.py",
|
||||
"varRefreshCmd": "print(var_dic_list())"
|
||||
},
|
||||
"r": {
|
||||
"delete_cmd_postfix": ") ",
|
||||
"delete_cmd_prefix": "rm(",
|
||||
"library": "var_list.r",
|
||||
"varRefreshCmd": "cat(var_dic_list()) "
|
||||
}
|
||||
},
|
||||
"types_to_exclude": [
|
||||
"module",
|
||||
"function",
|
||||
"builtin_function_or_method",
|
||||
"instance",
|
||||
"_Feature"
|
||||
],
|
||||
"window_display": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
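The metrics requested in the "4. Model evaluation" cell of `lab_6.ipynb` above can be sketched with plain NumPy as follows. This is a minimal sketch, not the notebook author's solution: the helper names `confusion_counts`, `safe_div` and `class_metrics` are hypothetical, and labels are assumed to be 1 (positive) and -1 (negative), as stated in that cell.

```python
import numpy as np


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN), treating `positive` as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    return tp, fn, fp, tn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def class_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 for one class, plus overall accuracy."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels, made up purely for illustration.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1]

per_class = {c: class_metrics(y_true, y_pred, positive=c) for c in (1, -1)}
# Unweighted (macro) average of each per-class metric.
macro = {name: float(np.mean([per_class[c][name] for c in per_class]))
         for name in ("precision", "recall", "f1")}
print(per_class)
print(macro)
```

Calling `class_metrics` once per class and averaging gives the "each class and the average value" figures the cell asks for; with the notebook's own variables, the inputs would be `y_test.values` and the output of `lr.predict(x_test)`, flattened to one dimension.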
|
||||
20001
Labs/Lab 6/log_tcp_part.csv
Normal file
20001
Labs/Lab 6/log_tcp_part.csv
Normal file
File diff suppressed because it is too large
Load Diff
BIN
Labs/Lab 6/tstat.png
Normal file
BIN
Labs/Lab 6/tstat.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 132 KiB |
140001
Labs/Lab 7/RTP_dataset.csv
Normal file
140001
Labs/Lab 7/RTP_dataset.csv
Normal file
File diff suppressed because it is too large
Load Diff
2265
Labs/Lab 7/lab_7.ipynb
Normal file
2265
Labs/Lab 7/lab_7.ipynb
Normal file
File diff suppressed because one or more lines are too long
BIN
Labs/Lab 7/video_conference.png
Normal file
BIN
Labs/Lab 7/video_conference.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 130 KiB |
140001
Labs/Lab 8/RTP_dataset.csv
Normal file
140001
Labs/Lab 8/RTP_dataset.csv
Normal file
File diff suppressed because it is too large
Load Diff
BIN
Labs/Lab 8/RTP_dataset.csv.zip
Normal file
BIN
Labs/Lab 8/RTP_dataset.csv.zip
Normal file
Binary file not shown.
1303
Labs/Lab 8/lab_8.ipynb
Normal file
1303
Labs/Lab 8/lab_8.ipynb
Normal file
File diff suppressed because one or more lines are too long
BIN
Labs/Lab 8/validation.png
Normal file
BIN
Labs/Lab 8/validation.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 101 KiB |
765
Labs/Lab2 - Numpy.ipynb
Normal file
765
Labs/Lab2 - Numpy.ipynb
Normal file
@@ -0,0 +1,765 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "b1ea060a47a6211d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# LAB #2: Numpy\n",
|
||||
"\n",
|
||||
"## Introduction\n",
|
||||
"In this laboratory, you will perform some operation with NumPy arrays in such a way to build your first Machine Learning model. \n",
|
||||
"In particular, you will build a NumPy-based version of the K-Nearest Neighbors algorithm (a.k.a. KNN).\n",
|
||||
"\n",
|
||||
"## 0 Preliminary steps\n",
|
||||
"### 0.1 NumPy\n",
|
||||
"Make sure you have the NumPy library installed, its use is strongly recommended for this laboratory.\n",
|
||||
"NumPy is the fundamental package for scientific computing with Python. You can read more about it on\n",
|
||||
"the official documentation.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "9246699975edf562",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"! pip install numpy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ad497ed1d0092203",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### 0.2 Iris dataset download \n",
|
||||
"For this lab, you will need two of the datasets you have already met: Iris and MNIST. Please refer to\n",
|
||||
"Laboratory 1 for a complete description of the datasets.\n",
|
||||
"Iris. You can download it from:\n",
|
||||
"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a838a5ed77a24051",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# linux users\n",
|
||||
"# !wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O iris.csv\n",
|
||||
"# windows users\n",
|
||||
"! pip install wget\n",
|
||||
"import wget\n",
|
||||
"wget.download(\"https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data\", \"iris.csv\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ef169d9060adb9a7",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1 Exercises \n",
|
||||
"Note that exercises marked with a ($\\star$) are optional, you should focus on completing the other ones first."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "a820274dc6b6f678",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1.1 Iris Analysis with Numpy\n",
|
||||
"As you might remember from Lab. 1, the Iris dataset collects the measurements of different Iris flowers,\n",
|
||||
"and each data point is characterized by 4 **features** (sepal length, sepal width, petal length, petal width) and is associated to 1 **label** (i.e. an Iris species - Setosa, Versicolor, or Virginica) which in this case is the last element of the row (last column of the csv file). "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "46864c46cf9f9387",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Load the Iris dataset. You can use the `csv` library that we saw in the last laboratory or read it with the standard `open(filename, strategy)`. \n",
|
||||
"In the second case remember to split correctly the different fields, and avoid new line characters. In any case check for empty lines. \n",
|
||||
"This time remember to store the 4 features in a numpy array `x` of shape (n_sample, 4) and the labels in a different array `y` of shape (n_sample,) converting the 3 different species to a corresponding numerical value. E.g.,\n",
|
||||
" - Iris-setosa: 0\n",
|
||||
" - Iris-versicolor: 1\n",
|
||||
" - Iris-virginica: 2\n",
|
||||
"\n",
|
||||
"In order to check you have correctly loaded the data, print the shape of the two arrays: you should find\n",
|
||||
"(150, 4) for `x` and (150,) for `y`."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "a977ccc88ef2ca39",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(150, 4)\n",
|
||||
"(150,)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"def type_mapper(type):\n",
|
||||
" match type:\n",
|
||||
" case b\"Iris-setosa\":\n",
|
||||
" return 0\n",
|
||||
" case b\"Iris-versicolor\":\n",
|
||||
" return 1\n",
|
||||
" case b\"Iris-virginica\":\n",
|
||||
" return 2\n",
|
||||
" \n",
|
||||
" return -1\n",
|
||||
"\n",
|
||||
"raw_csv = np.loadtxt(\"iris.csv\",\n",
|
||||
" delimiter=\",\", dtype=float, converters={4:type_mapper})\n",
|
||||
"\n",
|
||||
"x = raw_csv[:,0:4]\n",
|
||||
"y = raw_csv[:,4]\n",
|
||||
"\n",
|
||||
"print(x.shape)\n",
|
||||
"print(y.shape)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5050d162966956ce",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. Compute again the mean and standard deviation for each class by means of the numpy functions"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"id": "33bfaed602d4bc3e",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Metrics for specie 0\n",
|
||||
"Sepal length for mean: 5.006, std_dev: 0.3489469873777391\n",
|
||||
"Sepal width mean: 3.418, std_dev: 0.37719490982779713\n",
|
||||
"Petal length mean: 1.464, std_dev: 0.17176728442867112\n",
|
||||
"Petal width mean: 0.244, std_dev: 0.10613199329137281\n",
|
||||
"\n",
|
||||
"Metrics for specie 1\n",
|
||||
"Sepal length for mean: 5.936, std_dev: 0.5109833656783751\n",
|
||||
"Sepal width mean: 2.7700000000000005, std_dev: 0.31064449134018135\n",
|
||||
"Petal length mean: 4.26, std_dev: 0.4651881339845203\n",
|
||||
"Petal width mean: 1.3259999999999998, std_dev: 0.19576516544063705\n",
|
||||
"\n",
|
||||
"Metrics for specie 2\n",
|
||||
"Sepal length for mean: 6.587999999999998, std_dev: 0.6294886813914926\n",
|
||||
"Sepal width mean: 2.974, std_dev: 0.3192553836664309\n",
|
||||
"Petal length mean: 5.5520000000000005, std_dev: 0.546347874526844\n",
|
||||
"Petal width mean: 2.0260000000000002, std_dev: 0.2718896835115301\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i in range(3):\n",
|
||||
" iris = x[np.ma.masked_where(y, y==i)]\n",
|
||||
"\n",
|
||||
" print(f\"Metrics for specie {i}\")\n",
|
||||
" print(f\"Sepal length for mean: {iris[:,0].mean()}, std_dev: {iris[:,0].std()}\")\n",
|
||||
" print(f\"Sepal width mean: {iris[:,1].mean()}, std_dev: {iris[:,1].std()}\")\n",
|
||||
" print(f\"Petal length mean: {iris[:,2].mean()}, std_dev: {iris[:,2].std()}\")\n",
|
||||
" print(f\"Petal width mean: {iris[:,3].mean()}, std_dev: {iris[:,3].std()}\")\n",
|
||||
" print()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1f84beb708797ba9",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"3. Compute the distances among two samples (e.g., the $36^{th}$ and the $81^{th}$, the $13^{th}$ and the $15^{th}$) \n",
|
||||
"by means of the `np.linalg.norm(a-b)` function which computes the norm of `a-b`, i.e., the euclidean distance between the feature of the `a` and of the `b` samples. \n",
|
||||
" - Can you guess if the two couples of samples belong to the same species?\n",
|
||||
" - From the mean and standard deviations computed before can you guess which species? "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 32,
|
||||
"id": "4a47fb722be07fb4",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"2.7892651361962706\n",
|
||||
"1.4317821063276353\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(np.linalg.norm(x[35]-x[81]))\n",
|
||||
"print(np.linalg.norm(x[12]-x[14]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9dc024bce0c0dd04",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"TODO: write your comment here"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "fd802b47b8519bb3",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f3fa448bd7bc9d94",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"TODO: write your comment here"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dcceaccd4a1a7526",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"source": [
|
||||
"4. Find the k nearest neighbors of a sample in the dataset.\n",
|
||||
" - Define a function `k_nearest_neighbors(x, x_set, k)` that takes as input a sample `x` and a set of sample (i.e., a matrix) `x_set` and returns the indices of the `k` nearest neighbors of `x` in `x_set`.\n",
|
||||
" - Reuse the `euclidean_distance` function that you defined before to do so. \n",
|
||||
" - Remember that the `x_set` is a matrix of shape ($N_{samples}, N_{features}$), so you have to compute the distance between `x` and each row of `x_set`. \n",
|
||||
" - In order to find the indices of the `k` nearest neighbors, you can use the `argsort` function that returns the indices that would sort an array\n",
|
||||
" - Apply the function to the $36^{th}$ sample of the dataset with $k=5$.\n",
|
||||
" - Print the indices of the $5$ nearest neighbors.\n",
|
||||
" - Print the labels of the $5$ nearest neighbors. Can you guess the label of the $36^{th}$ sample?"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"id": "b93f94748b3841e3",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Label of 0 nearest neighbor: 0.0\n",
|
||||
"Label of 1 nearest neighbor: 0.0\n",
|
||||
"Label of 2 nearest neighbor: 0.0\n",
|
||||
"Label of 3 nearest neighbor: 0.0\n",
|
||||
"Label of 4 nearest neighbor: 0.0\n",
|
||||
"Real label: 0.0\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def k_nearest_neighbors(x: np.ndarray, x_set: np.ndarray, k: int):\n",
|
||||
" distances = np.linalg.norm(x-x_set, axis=1)\n",
|
||||
" distances_sorted = np.argsort(distances)\n",
|
||||
"\n",
|
||||
" return distances_sorted[0:k]\n",
|
||||
"\n",
|
||||
"indices = k_nearest_neighbors(x[35], x, 5)\n",
|
||||
"for i, k in enumerate(indices):\n",
|
||||
" print(f\"Label of {i} nearest neighbor: {y[k]}\")\n",
|
||||
"\n",
|
||||
"print(f\"Real label: {y[35]}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4de2b1c8798fc98e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"TODO: write your comment here"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9dd1f94b256663e8",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## 1.2 KNN design and implementation\n",
|
||||
"In this exercise, you will implement your own version of the K-Nearest Neighbors (KNN) algorithm, and you will use it to assign an\n",
|
||||
"Iris species (i.e. a label) to flowers whose species is unknown.\n",
|
||||
"\n",
|
||||
"The KNN algorithm is straightforward. Suppose that some measurements (e.g., the iris features) and their\n",
|
||||
"relative label (e.g., the iris species) of a set of samples are known in advance. \n",
|
||||
"\n",
|
||||
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img2.png\" width=\"800\">\n",
|
||||
"\n",
|
||||
"Then, whenever we want to label a new sample, we look at the K most similar points (a.k.a. neighbors) and assign a label accordingly. \n",
|
||||
"\n",
|
||||
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img1-1.png\" width=\"800\">\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The simplest solution is using a majority voting scheme: if the majority of the neighbors votes for a label, we will go for it. \n",
|
||||
"This approach is naive only at first sight: the local similarity assumed by KNN happens to be roughly true, as you have seen in the previous exercises.\n",
|
||||
"Even though this reasoning does not generalize well, the KNN provides a valid baseline for your tasks.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "5d185976071690ce",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. Let’s identify a portion of our data for which we will try to guess the species. Randomly select 20%\n",
|
||||
"of the records and store the first four columns (i.e. the features representing each flower) into a\n",
|
||||
"two-dimensional numpy array of shape ($N_{test}, 4$), you can call it `X_test` and $N_{test}$ is the 20% of the total number of samples.\n",
|
||||
"For the same records, store the test label column (i.e. the one with the species values) into another array, namely `y_test`. \n",
|
||||
"This is the data that will be used to test the accuracy of your KNN implementation and its correct functioning (i.e. the testing data)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"id": "a642f03b563650e8",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[1. 0. 0. 2. 2. 0. 2. 1. 1. 1. 1. 1. 2. 2. 2. 1. 1. 2. 2. 2. 1. 2. 2. 2.\n",
|
||||
" 0. 2. 1. 0. 2. 1.]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"test_subset_indices = np.random.choice(len(y), size=int(len(y)*0.2), replace=False)\n",
|
||||
"X_test = x[test_subset_indices]\n",
|
||||
"Y_test = y[test_subset_indices]\n",
|
||||
"\n",
|
||||
"x[test_subset_indices]\n",
|
||||
"\n",
|
||||
"print(Y_test)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "192e5663358e8e82",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"2. Store the remaining 80% of the records in the same way. In this case, use the names X_train andy_train for the arrays.\n",
|
||||
"This is the data that your model will use as ground-truth knowledge (i.e. the training data, from which we extract the knowledge and that we will use for comparison).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"id": "b9f1639cc7fe3b53",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
|
||||
" 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1.\n",
|
||||
" 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.\n",
|
||||
" 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.\n",
|
||||
" 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"train_subset_indices = [i not in test_subset_indices for i in range(len(y))]\n",
|
||||
"X_train = x[train_subset_indices]\n",
|
||||
"Y_train = y[train_subset_indices]\n",
|
||||
"\n",
|
||||
"print(Y_train)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "dbbc62af2fef1d5c",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"3. Focus now on the KNN technique. \n",
|
||||
"From the next month, you will use the `scikit-learn` package. Many of its functionalities\n",
|
||||
"are exposed via an object-oriented interface. With this paradigm in mind, implement now the KNN\n",
|
||||
"algorithm and expose it as a Python class. The bare skeleton of your class should look like this (you\n",
|
||||
"are free to add other methods if you want to).\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"class KNearestNeighbors:\n",
|
||||
" def __init__(self, k):\n",
|
||||
" \"\"\"\n",
|
||||
" Store the value of k in a attribute of the class and initialize other attributes.\n",
|
||||
" :param k : int, number of neighbors to consider.\n",
|
||||
" \"\"\"\n",
|
||||
" pass # TODO: implement it!\n",
|
||||
" def fit(self, X, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Store the 'prior knowledge' of you model that will be used\n",
|
||||
" to predict new labels.\n",
|
||||
" :param X : input data points, ndarray, shape = (R,C).\n",
|
||||
" :param y : input labels, ndarray, shape = (R,).\n",
|
||||
" \"\"\"\n",
|
||||
" pass # TODO: implement it!\n",
|
||||
" \n",
|
||||
" def predict(self, X):\n",
|
||||
" \"\"\"Run the KNN classification on X.\n",
|
||||
" :param X: input data points, ndarray, shape = (N,C).\n",
|
||||
" :return: labels : ndarray, shape = (N,).\n",
|
||||
" \"\"\"\n",
|
||||
" pass # TODO: implement it!\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Implement the `__init__` and `fit` methods first. \n",
|
||||
"- In the `__init__` method, you should store the value of `k` in a private attribute of the class.\n",
|
||||
"- In the `fit` method you should only store the training data in private attributes of the class."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 104,
|
||||
"id": "b5de6a78df7f8585",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-10-10T12:53:39.426246Z",
|
||||
"start_time": "2024-10-10T12:53:39.420295Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"class KNearestNeighbors:\n",
|
||||
" def __init__(self, k):\n",
|
||||
" \"\"\"\n",
|
||||
" Store the value of k in a attribute of the class and initialize other attributes.\n",
|
||||
" :param k : int, number of neighbors to consider.\n",
|
||||
" \"\"\"\n",
|
||||
" self.k = k\n",
|
||||
"\n",
|
||||
" def fit(self, X, y):\n",
|
||||
" \"\"\"\n",
|
||||
" Store the 'prior knowledge' of you model that will be used\n",
|
||||
" to predict new labels.\n",
|
||||
" :param X : input data points, ndarray, shape = (R,C).\n",
|
||||
" :param y : input labels, ndarray, shape = (R,).\n",
|
||||
" \"\"\"\n",
|
||||
" self.X = x\n",
|
||||
" self.y = y\n",
|
||||
"\n",
|
||||
" def vote(self, labels: np.ndarray):\n",
|
||||
" voting = np.unique(labels, return_counts=True)\n",
|
||||
" return voting[0][voting[1].argmax()]\n",
|
||||
"\n",
|
||||
" \n",
|
||||
" def predict(self, X):\n",
|
||||
" \"\"\"Run the KNN classification on X.\n",
|
||||
" :param X: input data points, ndarray, shape = (N,C).\n",
|
||||
" :return: labels : ndarray, shape = (N,).\n",
|
||||
" \"\"\"\n",
|
||||
" distances = [np.linalg.norm(x-self.X, axis=1) for x in X]\n",
|
||||
" distances_sorted = np.argsort(distances)\n",
|
||||
" nearest_neighbors_labels = y[distances_sorted[:,0:self.k]]\n",
|
||||
"\n",
|
||||
" return np.apply_along_axis(self.vote, 1, nearest_neighbors_labels)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6ad6f4fc7071bff0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"4. Implement the `predict` method. The function receives as input a numpy array with N rows and C\n",
|
||||
"columns, corresponding to N flowers. The method assigns to each row one of the three Iris species \n",
|
||||
"using the KNN algorithm, and returns the predicted species as a numpy array. \n",
|
||||
"\n",
|
||||
" - For finding nearest neighbours, you can either re-use the previously defined `k_nearest_neighbors` function or \n",
|
||||
"implement a new one exploiting the numpy broadcasting capabilities in order to avoid iterating over the sample matrix `X`.\n",
|
||||
" - Then, assign the *predicted label* to each sample using a majority voting scheme, i.e., the label that appears most frequently among the k nearest neighbors. To do so you can use the `np.unique(neighbours_labels, return_count=True)` function that returns the unique labels and their counts. \n",
|
||||
" - Finally, return the predicted labels as a numpy array."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"id": "c227627e47cc7253",
|
||||
"metadata": {
|
||||
"ExecuteTime": {
|
||||
"end_time": "2024-10-10T13:03:44.621187Z",
|
||||
"start_time": "2024-10-10T13:03:44.609767Z"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4cbd1131d3ba785d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"5. Now let’s fit the KNN model with the X_train and y_train data. Then, try to use your KNN model\n",
|
||||
"to predict the species for each record in X_test and store them in a nupy array called y_pred.\n",
|
||||
"As we did in the previous lab, check how many Iris species in the array y_pred have been guessed correctly computing with respect to the ones in y_test computing the accuracy. \n",
|
||||
" - A prediction is correct if `y_pred[i] == y_test[i]`. To get the accuracy then compute the ratio between the number of correct guesses and the total number of guesses is known. \n",
|
||||
" - If all labels are assigned correctly ((y_pred == y_test).all() == True), the accuracy of the model is 100%. \n",
|
||||
" - Instead, if none of the guessed species corresponds to the real one ((y_pred == y_test).any() == False), the accuracy is 0%\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 112,
|
||||
"id": "ca4f0b4bbe44c9fe",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.8666666666666667\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"knn = KNearestNeighbors(5)\n",
|
||||
"knn.fit(X_train, Y_train)\n",
|
||||
"predictions = knn.predict(X_test)\n",
|
||||
"correct_guesses = predictions == Y_test\n",
|
||||
"accuracy = np.count_nonzero(correct_guesses == True) / len(correct_guesses)\n",
|
||||
"print(accuracy)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "7514fc82de74b729",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"6. ($\\star$) As a software developer, you might want to increase the functionalities of your product and\n",
|
||||
"publish newer versions over time. The better your code is structured and organized, the lower is the\n",
|
||||
"effort to release updates.\n",
|
||||
"As such, extend your KNN implementation adding the parameter `distance`. This has to be one among:\n",
|
||||
" - Euclidean distance: $ euclidean(p,q) = \\sqrt{\\sum_{i=1}^{n} (p_i _- q_i)^2} $\n",
|
||||
" - Manhattan distance: $ manhattan(p,q) = \\sum_{i=1}^n |p_i - q_i|$\n",
|
||||
" - Cosine distance: $ cosine(p, q) = 1 - \\frac{\\sum_{i=1}^n p_i q_i}{ \\sqrt{\\sum^n_{i=1} p^2_i} \\cdot \\sqrt{\\sum^n_{i=1} q_i^2}}$\n",
|
||||
"\n",
|
||||
"If any of this distance is not already implemented in `numpy` implement it yourself"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "436c6395a2f3d853",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "24c76d735fe65dbd",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"7. ($\\star$) Again, extend now your KNN implementation by adding the parameter `weights` to the constructor,\n",
|
||||
"as shown below:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"class KNearestNeighbors:\n",
|
||||
" def __init__(self, k, distance_metric=\"euclidean\", weights=\"uniform\"):\n",
|
||||
" self.k = k\n",
|
||||
" self.distance_metric = distance_metric\n",
|
||||
" self.weights = weights\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Change your KNN implementation to accept a new weighting scheme for the labels. If weights=\n",
|
||||
"\"distance\", weight neighbor votes by the inverse of their distance (for the distance, again, use\n",
|
||||
"distance_metric). The weight for a neighbor of the point p is:\n",
|
||||
"\n",
|
||||
"$\n",
|
||||
"w(p, n) = \\frac{1}{distance\\_metric(p, n)}\n",
|
||||
"$\n",
|
||||
"\n",
|
||||
"Instead, if the default is chosen (weights=\"uniform\"), use the majority voting you already implemented\n",
|
||||
"in Exercise 6.\n",
|
||||
"\n",
|
||||
"<img src=\"https://mlarchive.com/wp-content/uploads/2022/09/img5.png\">\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a84262b9fd13d9f1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "54f1e2a662695741",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"8. ($\\star$) Test the modularity of your implementation by applying it to a different dataset. Ideally, you should\n",
|
||||
"not change the code of your KNN python class.\n",
|
||||
"- Download the MNIST dataset and retain only 100 samples per digit. You will end up with a dataset of 1000 samples.\n",
|
||||
"- Define again four numpy arrays as you did in Exercises 2 and 3.\n",
|
||||
"- Apply your KNN as you did for the Iris dataset.\n",
|
||||
"- Evaluate the accuracy on MNIST’s y_test."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b720ef714195eb68",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# download MNIST dataset\n",
|
||||
"\n",
|
||||
"# linux users\n",
|
||||
"#! wget https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv -O mnist.csv\n",
|
||||
"\n",
|
||||
"# windows users\n",
|
||||
"! pip install wget\n",
|
||||
"import wget\n",
|
||||
"wget.download(\"https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv\", \"mnist.csv\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 158,
|
||||
"id": "77afcee410ef94ac",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[0 0 0 ... 0 0 0]\n",
|
||||
" [0 0 0 ... 0 0 0]\n",
|
||||
" [0 0 0 ... 0 0 0]\n",
|
||||
" ...\n",
|
||||
" [9 0 0 ... 0 0 0]\n",
|
||||
" [9 0 0 ... 0 0 0]\n",
|
||||
" [9 0 0 ... 0 0 0]]\n",
|
||||
"(1000, 784)\n",
|
||||
"(1000,)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# extracting MNIST dataset\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"raw_csv = np.loadtxt(\"mnist.csv\",\n",
|
||||
" delimiter=\",\", dtype=int)\n",
|
||||
"\n",
|
||||
"dataset_reduced = np.empty((0, 785), dtype=int)\n",
|
||||
"\n",
|
||||
"for i in range(10):\n",
|
||||
" items_with_digit = raw_csv[raw_csv[:,0] == i]\n",
|
||||
" dataset_reduced = np.concatenate((dataset_reduced, items_with_digit[0:100,:]))\n",
|
||||
"\n",
|
||||
"print(dataset_reduced)\n",
|
||||
"\n",
|
||||
"x = dataset_reduced[:,1:]\n",
|
||||
"y = dataset_reduced[:,0]\n",
|
||||
"\n",
|
||||
"print(x.shape)\n",
|
||||
"print(y.shape)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 160,
|
||||
"id": "d1a0834dd8885a2b",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# define four numpy arrays X_train, Y_train, X_test, Y_test (random 80/20 split)\n",
|
||||
"test_subset_indices = np.random.choice(len(y), size=int(len(y)*0.2), replace=False)\n",
|
||||
"X_test = x[test_subset_indices]\n",
|
||||
"Y_test = y[test_subset_indices]\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"train_subset_indices = np.setdiff1d(np.arange(len(y)), test_subset_indices)\n",
|
||||
"X_train = x[train_subset_indices]\n",
|
||||
"Y_train = y[train_subset_indices]\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 171,
|
||||
"id": "c03d2add840c1531",
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.885\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Apply KNN on MNIST\n",
|
||||
"knn = KNearestNeighbors(5)\n",
|
||||
"knn.fit(X_train, Y_train)\n",
|
||||
"predictions = knn.predict(X_test)\n",
|
||||
"correct_guesses = predictions == Y_test\n",
|
||||
"accuracy = np.mean(correct_guesses)\n",
|
||||
"print(accuracy)"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.12.6"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
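The code cells for Exercises 6 and 7 in `lab_6.ipynb` are empty in this diff, so the following is only a minimal sketch of how the `distance_metric` and `weights` parameters might be wired together, not the author's implementation. The class name, constructor signature, and the `fit`/`predict` calls come from the notebook. The `_distances` helper, the attribute names `X_train`/`y_train`, and the small `eps` that guards against division by zero are assumptions made for illustration. The cosine branch also assumes no all-zero feature vectors.

```python
import numpy as np


class KNearestNeighbors:
    def __init__(self, k, distance_metric="euclidean", weights="uniform"):
        self.k = k
        self.distance_metric = distance_metric
        self.weights = weights

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)

    def _distances(self, x):
        """Distance from one sample x to every training sample (hypothetical helper)."""
        diff = self.X_train - x
        if self.distance_metric == "euclidean":
            return np.sqrt((diff ** 2).sum(axis=1))
        if self.distance_metric == "manhattan":
            return np.abs(diff).sum(axis=1)
        if self.distance_metric == "cosine":
            # Assumes neither x nor any training row is the zero vector.
            norms = np.linalg.norm(self.X_train, axis=1) * np.linalg.norm(x)
            return 1.0 - (self.X_train @ x) / norms
        raise ValueError(f"unknown distance metric: {self.distance_metric!r}")

    def predict(self, X, eps=1e-12):
        predictions = []
        for x in np.asarray(X, dtype=float):
            dist = self._distances(x)
            nearest = np.argsort(dist)[: self.k]  # indices of the k closest points
            labels = self.y_train[nearest]
            if self.weights == "distance":
                # Inverse-distance votes; eps (an assumption) avoids 1/0 on exact matches.
                votes = 1.0 / (dist[nearest] + eps)
            elif self.weights == "uniform":
                votes = np.ones(len(nearest))  # plain majority voting
            else:
                raise ValueError(f"unknown weighting scheme: {self.weights!r}")
            # Sum the votes per class and pick the class with the largest total.
            classes, inverse = np.unique(labels, return_inverse=True)
            totals = np.bincount(inverse, weights=votes)
            predictions.append(classes[np.argmax(totals)])
        return np.array(predictions)
```

With this shape, the evaluation cells in the notebook would stay unchanged: only the constructor call differs, for example `KNearestNeighbors(5, distance_metric="manhattan", weights="distance")`. In this sketch, ties are broken in favour of the smallest class label, because `np.argmax` returns the first maximum.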
1277
Labs/Lab3 - Pandas and Numpy.ipynb
Normal file
File diff suppressed because one or more lines are too long
BIN
Labs/New_York_City_Map.PNG
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 85 KiB |