# Lab #1

The purpose of this laboratory is to get you acquainted with Python. 
More specifically, you will learn how to:
- read different types of datasets (CSV and JSON). 
- extract some useful information (mean and standard deviation) from the datasets using only basic Python.
- create a simple rule-based classifier that is already capable of performing some classification.


## Preliminaries
### Python availability
Make sure that Python 3 is installed on your device with the command `python --version`. The version should be in the form `3.x.x`.

In [2]:
! python --version

Python 3.12.6
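The same check can also be made from inside Python. The snippet below is a small sketch using the standard `sys` module; accepting any Python 3 release is an assumption made only for illustration.

```python
import sys

# sys.version_info is a named tuple such as (3, 12, 6, 'final', 0).
major, minor, micro = sys.version_info[:3]
print(f"Running Python {major}.{minor}.{micro}")

# Assumption for illustration: any Python 3 release is acceptable.
is_python3 = sys.version_info >= (3, 0)
print("Python 3 detected" if is_python3 else "Please install Python 3")
```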


### Dataset Download
For this lab, three different datasets will be used. Here, you will learn more about them and how to retrieve
them.

#### Iris
Iris is a particularly famous *toy dataset* (i.e. a dataset with a small number of rows and columns, mostly
used for initial small-scale tests and proofs of concept). 
This specific dataset contains information about the **Iris**, a genus that includes 260-300 species of plants. 
The Iris dataset contains measurements for 150 Iris flowers, each belonging to one of three species (50 flowers each): 

Iris Virginica             |  Iris Versicolor          |   Iris Setosa  |
:-------------------------:|:-------------------------:|:---------------|
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/1200px-Iris_virginica_2.jpg" alt="Iris Virginica" width="200" /> | <img src="https://www.waternursery.it/document/img_prodotti/616/1646318149.jpeg" alt="Iris Versicolor" width="200" /> |<img src="https://d2j6dbq0eux0bg.cloudfront.net/images/28296135/2323483832.jpg" alt="Iris Setosa" width="200" />|

Each of the 150 flowers contained in the Iris dataset is represented by 5 values:
- sepal length, in cm
- sepal width, in cm
- petal length, in cm
- petal width, in cm
- Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica (the label)

Each row of the dataset represents a distinct flower (as such, the dataset will have 150 rows). Each
row then contains 5 values (4 measurements and a species label).
The dataset is described in more detail on the [UCI Machine Learning Repository website](https://archive.ics.uci.edu/dataset/53/iris). The dataset
can either be downloaded directly from there (iris.data file), or from a terminal, using the `wget` tool. The
following command downloads the dataset from the original URL and stores it in a file named iris.csv.

`wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv`

The dataset is available as a Comma-Separated Values (CSV) file. These files are typically used to
represent tabular data. 
- Each row is represented on one of the lines. 
- Each of the rows contains a fixed number of columns. 
- Each of the columns (in each row) is separated by a comma (,).

To read CSV files, Python offers a module called `csv` (see the official [documentation](https://docs.python.org/3/library/csv.html)). This module provides `csv.reader()`, which
reads a file row by row. For each row, it returns a list of columns that can be processed as needed. 


Let's download the dataset and print the first five rows.




In [3]:
! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv

print("Reading first lines of IRIS dataset")
import csv 
with open("iris.csv") as f:
    for i, cols in enumerate(csv.reader(f)):
        print(cols)
        if i >= 4:
            break



['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']


Note that `csv.reader` returns every field as a string (`str`). 
If you want to treat the fields as numbers, remember to cast them explicitly!
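As a small illustration of the cast, the sketch below takes one row exactly as it was printed above and converts the four measurements to `float`. The variable names are chosen here for illustration only.

```python
# One row exactly as csv.reader returns it: every field is a str.
row = ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']

# Cast the four measurements to float and keep the label as a string.
measurements = [float(value) for value in row[:4]]
label = row[4]

print(measurements)  # [5.1, 3.5, 1.4, 0.2]
print(label)         # Iris-setosa

# Without the cast, "+" concatenates the strings instead of adding numbers.
print(row[0] + row[1])  # 5.13.5
```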

#### MNIST
The MNIST dataset is another particularly famous dataset. It contains tens of thousands of hand-written
digits (0 to 9). 
- Each hand-written digit is stored as a $28 \times 28$ 8-bit grayscale image. 
- This means that each digit has $784$ ($28^2$) pixels.
- Each pixel has a value that ranges from 0 (black) to 255 (white).

<img src="https://machinelearningmastery.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png" alt="MNIST images" width="500" />

The dataset can be downloaded from the following link:

[https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv](https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv)

In this case, MNIST is represented as a CSV file. Similarly to the Iris dataset, each row of the MNIST
dataset represents one sample, here the pixels of the image of a digit. For the sake of simplicity, this dataset contains only a small fraction (10,000
digits out of 70,000) of the real MNIST dataset. 

For each digit, 785 values are available: 
- the first one is the numerical value depicted in the image (e.g. for Figure 2 it would be 5). 
- the following 784 columns represent the grayscale image in row-major order (for more information about row- and column-major order of matrices, see [Wikipedia](https://en.wikipedia.org/wiki/Row-_and_column-major_order)).
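Row-major order means that the 784 pixel values are stored one image row after another. The sketch below shows the resulting index arithmetic, where the flat index counts only the pixel columns (i.e. after the label has been removed). The helper names `flat_to_grid` and `grid_to_flat` are hypothetical and not part of the lab material.

```python
WIDTH = 28  # each MNIST image is 28 pixels wide

def flat_to_grid(i, width=WIDTH):
    """Return the (row, column) of the pixel stored at flat index i."""
    return i // width, i % width

def grid_to_flat(row, col, width=WIDTH):
    """Return the flat index of the pixel at (row, column)."""
    return row * width + col

print(flat_to_grid(0))       # (0, 0): top-left pixel
print(flat_to_grid(27))      # (0, 27): last pixel of the first row
print(flat_to_grid(28))      # (1, 0): first pixel of the second row
print(flat_to_grid(783))     # (27, 27): bottom-right pixel
print(grid_to_flat(14, 14))  # 406
```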

The MNIST dataset in CSV format can be read with the same approach used for Iris, keeping in mind
that, in this case, the digit label (i.e. the first column) is an integer from 0 to 9, while the following 784
values are integers between 0 and 255.
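The parsing step described above can be sketched on a tiny in-memory file. The two rows below are made up for illustration and have only 4 pixels each (a real row has 1 label followed by 784 pixels); `io.StringIO` stands in for the opened file.

```python
import csv
import io

# Hypothetical two-line CSV: a label followed by 4 pixel values per row.
fake_csv = io.StringIO("7,0,128,255,64\n1,0,0,200,13\n")

samples = []
for cols in csv.reader(fake_csv):
    label = int(cols[0])                 # first column: the digit
    pixels = [int(v) for v in cols[1:]]  # remaining columns: grayscale values
    samples.append((label, pixels))

print(samples[0])  # (7, [0, 128, 255, 64])
```

Storing each sample as a `(label, pixels)` pair keeps the label separate from the image data, which is convenient when computing distances between images.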

## Exercises
Note that exercises marked with a (*) are optional; you should focus on completing the other ones first.
### Iris analysis
1. Load the previously downloaded Iris dataset as a list of lists (each of the 150 lists should have 5 elements). You can make use of the `csv` module presented above.

In [108]:
import csv

iris_list = []

with open("iris.csv") as f:
    for cols in csv.reader(f):
        if len(cols) != 5:
            continue
        iris_list.append(cols)

print(f"Dataset loaded. Number of lines: {len(iris_list)}")


Dataset loaded. Number of lines: 150


2. Compute and print the mean and the standard deviation for each of the 4 measurement columns (i.e. sepal length and width, petal length and width). Remember that, for a given list of $n$ values $x = (x_1, x_2, ..., x_n)$, the mean $\mu$ and the standard deviation $\sigma$ are defined respectively as:
$$\mu = {1 \over n} \sum_{i=1}^n x_i $$

$$ \sigma = \sqrt{ {1 \over n} \sum_{i=1}^n (x_i - \mu)^2} $$
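As a quick sanity check of the two formulas, here is a worked example on a small list of made-up values. Note that the formula divides by $n$ (population standard deviation). The call to `statistics.pstdev` is only a cross-check, since the exercise itself asks for basic Python.

```python
from math import sqrt
from statistics import pstdev

# Made-up values, chosen so that the results are round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

mu = sum(values) / n                                  # 40 / 8 = 5.0
sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)  # sqrt(32 / 8) = 2.0

print(mu, sigma)  # 5.0 2.0

# Cross-check against the standard library's population standard deviation.
print(abs(pstdev(values) - sigma) < 1e-12)  # True
```

By contrast, `statistics.stdev` divides by $n - 1$ (sample standard deviation) and would give a slightly larger value.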

In [25]:
from math import sqrt

def mean(items):
    return sum(items)/len(items)

def std_dev(items, mu = None):
    if mu is None:
        mu = mean(items)

    return sqrt(mean([(x-mu)**2 for x in items]))

sepal_length = []
sepal_width = []
petal_length = []
petal_width = []

for iris in iris_list:
    sepal_length.append(float(iris[0]))
    sepal_width.append(float(iris[1]))
    petal_length.append(float(iris[2]))
    petal_width.append(float(iris[3]))

sepal_length_metrics = (mean(sepal_length), std_dev(sepal_length))
sepal_width_metrics = (mean(sepal_width), std_dev(sepal_width))
petal_length_metrics = (mean(petal_length), std_dev(petal_length))
petal_width_metrics = (mean(petal_width), std_dev(petal_width))

print(f"Sepal length mean: {sepal_length_metrics[0]}, std_dev: {sepal_length_metrics[1]}")
print(f"Sepal width mean: {sepal_width_metrics[0]}, std_dev: {sepal_width_metrics[1]}")
print(f"Petal length mean: {petal_length_metrics[0]}, std_dev: {petal_length_metrics[1]}")
print(f"Petal width mean: {petal_width_metrics[0]}, std_dev: {petal_width_metrics[1]}")

Sepal length mean: 5.843333333333334, std_dev: 0.8253012917851409
Sepal width mean: 3.0540000000000003, std_dev: 0.43214658007054346
Petal length mean: 3.758666666666666, std_dev: 1.758529183405521
Petal width mean: 1.1986666666666668, std_dev: 0.7606126185881716



3. Compute and print the mean and the standard deviation for each of the 4 measurement columns, separately for each of the three Iris species (versicolor, virginica and setosa).

In [32]:
metrics_species = {}
species = set([iris[4] for iris in iris_list])
print(f"Species found: {species}")

for specie in species:
    irises_filtered = filter(lambda s: s[4] == specie, iris_list)

    sepal_length = []
    sepal_width = []
    petal_length = []
    petal_width = []

    for iris in irises_filtered:
        sepal_length.append(float(iris[0]))
        sepal_width.append(float(iris[1]))
        petal_length.append(float(iris[2]))
        petal_width.append(float(iris[3]))

    metrics_species[specie] = {}

    metrics_species[specie]["sepal_length_metrics"] = (mean(sepal_length), std_dev(sepal_length))
    metrics_species[specie]["sepal_width_metrics"] = (mean(sepal_width), std_dev(sepal_width))
    metrics_species[specie]["petal_length_metrics"] = (mean(petal_length), std_dev(petal_length))
    metrics_species[specie]["petal_width_metrics"] = (mean(petal_width), std_dev(petal_width))

    print(f"Metrics for specie {specie}")
    print(f"Sepal length mean: {metrics_species[specie]['sepal_length_metrics'][0]}, std_dev: {metrics_species[specie]['sepal_length_metrics'][1]}")
    print(f"Sepal width mean: {metrics_species[specie]['sepal_width_metrics'][0]}, std_dev: {metrics_species[specie]['sepal_width_metrics'][1]}")
    print(f"Petal length mean: {metrics_species[specie]['petal_length_metrics'][0]}, std_dev: {metrics_species[specie]['petal_length_metrics'][1]}")
    print(f"Petal width mean: {metrics_species[specie]['petal_width_metrics'][0]}, std_dev: {metrics_species[specie]['petal_width_metrics'][1]}")
    print()

Species found: {'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'}
Metrics for specie Iris-versicolor
Sepal length mean: 5.936, std_dev: 0.5109833656783751
Sepal width mean: 2.77, std_dev: 0.31064449134018135
Petal length mean: 4.26, std_dev: 0.4651881339845203
Petal width mean: 1.3259999999999998, std_dev: 0.19576516544063705

Metrics for specie Iris-setosa
Sepal length mean: 5.006, std_dev: 0.3489469873777391
Sepal width mean: 3.418, std_dev: 0.37719490982779713
Petal length mean: 1.464, std_dev: 0.17176728442867112
Petal width mean: 0.24400000000000002, std_dev: 0.10613199329137281

Metrics for specie Iris-virginica
Sepal length mean: 6.587999999999999, std_dev: 0.6294886813914926
Sepal width mean: 2.9739999999999998, std_dev: 0.3192553836664309
Petal length mean: 5.5520000000000005, std_dev: 0.546347874526844
Petal width mean: 2.026, std_dev: 0.2718896835115301




4. Based on the results of exercises 2 and 3, which of the 4 measurements would you consider the most characteristic of the three species? (In other words, which measurement would you consider “best”, if you were to guess the Iris species based only on those four values?)

In [38]:
for specie, content in metrics_species.items():
    print(f"Metrics for specie {specie}")

    for metric, values in content.items():
        print(f"Range for {metric}: [{values[0]-values[1]}, {values[0]+values[1]}]")
    print()

# The best index seems to be the petal_length

Metrics for specie Iris-versicolor
Range for sepal_length_metrics: [5.425016634321625, 6.446983365678375]
Range for sepal_width_metrics: [2.4593555086598187, 3.0806444913401814]
Range for petal_length_metrics: [3.7948118660154795, 4.7251881339845205]
Range for petal_width_metrics: [1.1302348345593627, 1.521765165440637]

Metrics for specie Iris-setosa
Range for sepal_length_metrics: [4.657053012622261, 5.3549469873777396]
Range for sepal_width_metrics: [3.040805090172203, 3.7951949098277975]
Range for petal_length_metrics: [1.292232715571329, 1.635767284428671]
Range for petal_width_metrics: [0.1378680067086272, 0.35013199329137284]

Metrics for specie Iris-virginica
Range for sepal_length_metrics: [5.958511318608506, 7.217488681391492]
Range for sepal_width_metrics: [2.654744616333569, 3.2932553836664304]
Range for petal_length_metrics: [5.005652125473157, 6.098347874526844]
Range for petal_width_metrics: [1.7541103164884697, 2.2978896835115297]
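One possible way to read the ranges above is to check, for a given measurement, whether the mean ± std intervals of two species overlap. The sketch below hard-codes two of the four measurements with values rounded from the printed output; `overlaps` is a hypothetical helper, and interval overlap is only one heuristic among many.

```python
from itertools import combinations

# Intervals [mean - std, mean + std], rounded from the output above.
ranges = {
    "petal_length": {
        "Iris-setosa": (1.29, 1.64),
        "Iris-versicolor": (3.79, 4.73),
        "Iris-virginica": (5.01, 6.10),
    },
    "sepal_width": {
        "Iris-setosa": (3.04, 3.80),
        "Iris-versicolor": (2.46, 3.08),
        "Iris-virginica": (2.65, 3.29),
    },
}

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

overlap_counts = {}
for measurement, per_species in ranges.items():
    pairs = combinations(sorted(per_species), 2)
    overlap_counts[measurement] = sum(
        overlaps(per_species[s1], per_species[s2]) for s1, s2 in pairs
    )

print(overlap_counts)  # {'petal_length': 0, 'sepal_width': 3}
```

With these numbers, no pair of species overlaps on petal length, while all three pairs overlap on sepal width.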




5. Based on the considerations of Exercise 3, assign the flowers with the following measurements to what you consider would be the most likely species.
````
5.2, 3.1, 4.0, 1.2: versicolor
4.9, 2.5, 5.6, 2.0: virginica
5.4, 3.2, 1.9, 0.4: setosa
````
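The assignments above can be reproduced with a nearest-mean rule on petal length alone. This is a sketch under that assumption: the means are copied (rounded) from the Exercise 3 output, and `nearest_species` is a hypothetical helper name.

```python
# Petal-length means per species, copied from the Exercise 3 output.
petal_length_means = {
    "Iris-versicolor": 4.26,
    "Iris-setosa": 1.464,
    "Iris-virginica": 5.552,
}

def nearest_species(petal_length):
    """Return the species whose mean petal length is closest."""
    return min(petal_length_means,
               key=lambda s: abs(petal_length_means[s] - petal_length))

samples = [
    (5.2, 3.1, 4.0, 1.2),
    (4.9, 2.5, 5.6, 2.0),
    (5.4, 3.2, 1.9, 0.4),
]
guesses = [nearest_species(sample[2]) for sample in samples]
print(guesses)  # ['Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
```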


6. (*) Create a rule-based classifier similar to the one seen in class. This classifier will receive some rules and will classify each sample into one of the three species.

In [116]:
def classify_iris(row):
    petal_length = float(row[2])
    diffs = {}

    for specie in species:
        diffs[specie] = abs(metrics_species[specie]["petal_length_metrics"][0]-petal_length)
    
    min_val = min(diffs.values())

    for k in diffs:
        if diffs[k] == min_val:
            return k

    return None

classify_iris(iris_list[120])

'Iris-virginica'

7. (*) Compute the predictions for all the elements in the dataset and store them in a list. Then, compute the accuracy of the classifier that you created. Remember that the accuracy metric is:

$$ {\text{number of correct predictions (TP + TN)} \over \text{total number of predictions (TP+TN+FP+FN)}} $$

You can check whether a prediction is correct by comparing it with the label of the sample ($5^{th}$ column).
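For a multi-class problem such as Iris, the formula reduces to the fraction of predictions that equal the true label. A minimal sketch on made-up predictions (the lists `y_pred` and `y_true` are hypothetical):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical predictions and labels, for illustration only.
y_pred = ["setosa", "setosa", "virginica", "versicolor"]
y_true = ["setosa", "versicolor", "virginica", "versicolor"]

print(accuracy(y_pred, y_true))  # 0.75
```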

In [120]:
correct_predictions = 0.0

for row in iris_list:
    guess = classify_iris(row)
    if guess == row[4]:
        correct_predictions += 1

print(f"Accuracy is {correct_predictions / len(iris_list)}")


Accuracy is 0.9466666666666667


### MNIST Analysis

1. Load the previously downloaded MNIST dataset. You can make use of the csv module already presented.

In [15]:
# ! wget https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv -O mnist.csv

import csv

mnist_dataset = []

with open("mnist.csv") as f:
    for cols in csv.reader(f):
        mnist_dataset.append((int(cols[0]), [int(value) for value in cols[1:]]))

print(f"Dataset loaded. Number of lines: {len(mnist_dataset)}")


Dataset loaded. Number of lines: 10000


2. Create a function that, given a position $1 \le k \le 10{,}000$, prints the $k^{th}$ sample of the dataset (i.e. the $k^{th}$ row of the csv file) as a grid of $28 \times 28$ characters. More specifically, you should map each range of pixel values to the following characters:
    - [0; 64) &rarr; " "
    - [64; 128) &rarr; "."
    - [128; 192) &rarr; "*"
    - [192; 256) &rarr; "#"

So, for example, you should map the sequence `0, 72, 192, 138, 250` to the string `" .#*#"` (note the leading space, since 0 maps to " ").
*Note*: Remember to start a new line every time you have printed 28 characters.

Example of output: 
```
         .#      **
        .##..*#####
       #########*.
      #####***.
     ##*
    *##
    ##
   .##
    ###*
    .#####.
        *###*
           *###*
              ###
              .##
              ###
            .###
      .    *###.
     .#  .*###*
     .######.
      *##*.
```
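Since the four ranges all have the same width (64), the mapping can be sketched with an integer division. This is one possible approach, assuming integer pixel values in $[0, 255]$; `pixel_to_char` is a hypothetical helper name.

```python
def pixel_to_char(value):
    """Map a grayscale value in [0, 255] to one of four characters."""
    chars = " .*#"             # one character per band of 64 values
    return chars[value // 64]  # 0-63 -> " ", 64-127 -> ".", 128-191 -> "*", 192-255 -> "#"

pixels = [0, 72, 192, 138, 250]
line = "".join(pixel_to_char(p) for p in pixels)
print(repr(line))  # ' .#*#'
```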


In [12]:
def print_k_element(dataset, k):
    assert 1<=k<=10000
    digit = dataset[k-1][1]

    for i, c in enumerate(digit):
        newline = ""
        if i % 28 == 27:
            newline = "\n"
        
        if 64 <= c < 128:
            printable_char = "."
        elif 128 <= c < 192:
            printable_char = "*"
        elif 192 <= c < 256:
            printable_char = "#"
        else:
            printable_char = " "

        print(printable_char, end=newline)

print_k_element(mnist_dataset, 8)

                            
                            
                            
                            
                            
                            
            *#              
          .###              
          ####*             
         *######.           
         ###*####           
        .##  .####          
        .#.   *##*          
        *#*   ###*          
        .## .*####          
         #####* ##*         
         .###*  *##         
          .**    *#*        
                 .##        
                  *#.       
                   ##       
                   .#*      
                    ##      
                     #*     
                     *#     
                      #.    
                            
                            


3. Compute the Euclidean distance between each pair of the 784-dimensional vectors of the digits at
the following positions: $26^{th}$, $30^{th}$, $32^{nd}$, $35^{th}$.

*Note*: Remember that Python lists are indexed from 0, so the $k^{th}$ value will be at position $k-1$.
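As a reminder, the Euclidean distance between two vectors is the square root of the sum of the squared coordinate differences. The sketch below checks a hand-written version on two made-up 3-dimensional vectors against `math.dist` from the standard library (available from Python 3.8).

```python
from math import dist, sqrt

def euclidean(v1, v2):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Made-up vectors: the differences are (-4, 3, 0), so the distance is 5.
p = [0, 3, 0]
q = [4, 0, 0]

print(euclidean(p, q))  # 5.0
print(dist(p, q))       # 5.0, same result from the standard library
```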

In [7]:
from math import sqrt

def euclidean_distance(v1, v2):
    # Square root of the sum of squared coordinate differences.
    return sqrt(sum((a - b)**2 for a, b in zip(v1, v2)))

# 0-based indices of the 26th, 30th, 32nd and 35th samples.
values_index = [25, 29, 31, 34]

for i in values_index:
    for j in values_index:
        if i >= j:
            continue  # visit each unordered pair only once

        print(f"Distance between item {i} and {j}: {euclidean_distance(mnist_dataset[i][1], mnist_dataset[j][1])}")

4. Based on the distances computed in the previous step and knowing that the digits listed in Exercise 3 are (not necessarily in this order) $0, 1, 1, 7$, can you assign the correct label to each of the digits of Exercise 3?

Items 29 and 31 (0-based indices) have the lowest distance, so they are probably the same digit ("1"). The distances 29-34 and 31-34 are similar to each other, and both are lower than the distances 29-25 and 31-25. Since the digit 1 resembles the digit 7, item 34 is probably the 7, and item 25 is therefore the 0.

5. There are 1,135 images representing 1’s and 980 images representing 0’s in the dataset. For all 0’s and 1’s separately, count the number of times each of the 784 pixels is black (use 128 as the threshold value). You can do this by building a list `Z` and a list `O`, each containing 784 elements, containing respectively the counts for the 0’s and the 1’s. `Z[i]` and `O[i]` contain the number of times the $i^{th}$ pixel was black for either class. For each value i, compute `abs(Z[i] - O[i])`. The $i$ with the highest value represents the pixel that best separates the digits “0” and “1” (i.e. the pixel that is most often black for one class and white for the other). Where is this pixel located within the grid? Why is it?
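The counting procedure can be tried first on a toy example. The sketch below uses four made-up $2 \times 2$ "images" instead of MNIST and treats a pixel as counted when its value is above the threshold; both the data and that convention are assumptions for illustration.

```python
# Hypothetical 2x2 "images" flattened to 4 pixels: (label, pixels).
toy = [
    (0, [200, 0, 0, 200]),
    (0, [255, 10, 0, 180]),
    (1, [0, 220, 0, 190]),
    (1, [5, 100, 0, 0]),
]
WIDTH = 2
THRESHOLD = 128

Z = [0] * 4  # per-pixel counts for the 0's
O = [0] * 4  # per-pixel counts for the 1's

for label, pixels in toy:
    counts = Z if label == 0 else O
    for i, value in enumerate(pixels):
        if value > THRESHOLD:
            counts[i] += 1

diffs = [abs(z - o) for z, o in zip(Z, O)]
best = max(range(len(diffs)), key=diffs.__getitem__)  # index of the largest difference

print(Z, O, diffs)                  # [2, 0, 0, 2] [0, 1, 0, 1] [2, 1, 0, 1]
print(best // WIDTH, best % WIDTH)  # 0 0
```

Here the top-left pixel separates the two toy classes best: it is above the threshold in both 0's and in none of the 1's.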

In [20]:
Z = [0]*784
O = [0]*784

for digit in mnist_dataset:
    if digit[0] != 0:
        continue

    for i, value in enumerate(digit[1]):
        if value > 128:
            Z[i] += 1

for digit in mnist_dataset:
    if digit[0] != 1:
        continue

    for i, value in enumerate(digit[1]):
        if value > 128:
            O[i] += 1

max_val = 0
max_pos = -1

for i in range(784):
    endline = "\t"
    if i % 28 == 27:
        endline = "\n"
    
    diff = abs(Z[i]-O[i])
    if diff > max_val:
        max_val = diff
        max_pos = i

    print(diff, end=endline)

print(f"The pixel located at {max_pos//28}:{max_pos%28} has the maximum value of {max_val}")

0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	0	1	3	2	1	0	0	0	0	1	1	0	0	0	0	0	0	0
0	0	0	0	0	0	0	0	0	0	2	1	0	5	9	17	25	14	9	5	5	1	3	2	0	0	0	0
0	0	0	0	0	0	1	1	1	7	20	34	50	59	19	70	82	47	24	14	12	0	4	3	0	0	0	0
0	0	0	0	0	0	2	3	9	30	72	136	185	181	161	140	181	174	145	81	61	32	18	2	5	2	0	0
0	0	0	0	0	3	4	15	40	98	192	295	353	325	225	185	212	260	298	258	203	135	74	18	5	2	0	0
0	0	0	0	1	3	11	32	92	188	327	449	466	353	199	113	145	222	346	395	362	305	170	54	8	0	0	0
0	0	0	0	4	5	19	75	169	309	467	560	541	355	34	113	47	94	315	486	513	411	287	120	18	1	0	0
0	0	0	0	3	8	46	140	251	436	570	633	540	218	196	414	313	64	242	495	590	531	376	184	35	1	0	0
0	0	0	0	4	21	89	210	366	532	661	646	432	23	467	696	497	173	167	458	603	571	466	244	66	1	0	0
0	0	0	0	4	44	150	308	490	646	670	566	330	181	748	901	552	214	125	403	583	600	511	294	101	2	0	0
0	0	0	1	7	78	234	407	590	670	617	455	184	440	995	979	534	129	139	373	579	615	528	328	121	

6. (*) Extract a subset of the MNIST dataset composed of only 0 and 1 digits. Create a rule-based classifier that takes as input the rule that you discovered in Exercise 5. As before, compute the predictions of this classifier on all the samples in the subset.

In [25]:
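def classify_number(pixels):
    # Flat index 406 is row 14, column 14 of the 28x28 grid.
    # A value above 128 there is read as a "1", otherwise as a "0".
    return 1 if pixels[406] > 128 else 0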


correct_predictions = 0.0
dataset_len = 0

for row in mnist_dataset:
    if row[0] != 0 and row[0] != 1:
        continue

    dataset_len += 1
    guess = classify_number(row[1])
    if guess == row[0]:
        correct_predictions += 1

print(f"Accuracy is {correct_predictions / dataset_len}")

Accuracy is 0.9895981087470449
