Loading the Dataset

Learn data, and you can tell stories that most people don’t even know about yet but are eager to hear.

Nathan Yau - Statistician and data visualization expert

In the following lessons, we will be using the iris flower dataset, widely known as the “hello world” dataset of machine learning and statistics.

The iris flower dataset has 150 observations of iris flowers. The first four columns are measurements in centimeters, and the fifth column is the species of the flower. Every observed flower belongs to one of three species.
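If you would like to check these figures yourself, here is a minimal sketch. It assumes the copy of the iris data that ships with scikit-learn, not the CSV file this lesson loads, so column names and labels differ: the species is stored as a numeric "target" column.

```python
from sklearn.datasets import load_iris

# Assumption: use the iris copy bundled with scikit-learn (no download needed).
iris = load_iris(as_frame=True)
frame = iris.frame  # four measurement columns plus a numeric "target" column

print(frame.shape)                # (rows, columns)
print(list(iris.feature_names))   # measurement columns, given in centimeters
print(list(iris.target_names))    # the species names
print(frame["target"].value_counts().sort_index())  # observations per species
```

The printed shape covers the four measurements plus the species column, and the last line shows how the observations are distributed across the species.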

In this step, we are going to load the iris data from a CSV file at a URL.

Instruction Video

Watch the video and do the same in Google Colab. You will find the code and written instructions below.

Import Libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Load Dataset

The iris data is hosted in the UCI Machine Learning Repository; here we load a CSV copy of it directly from a URL on GitHub. We use pandas, a Python library for data manipulation and analysis, to load the data, and we will use it later to explore the data with both descriptive statistics and data visualization.

The CSV file has no header row, so we specify the name of each column when loading the data. This will help later when we explore the data.

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
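
To see what the names argument does without needing a network connection, here is a small sketch that parses a few rows in the same layout from an in-memory string. The three sample rows and the variable names sample_csv and sample are only for illustration; they are not part of the lesson's code.

```python
from io import StringIO
from pandas import read_csv

# A few rows in the same headerless layout as the iris CSV (illustrative sample).
sample_csv = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# names= supplies the column labels, so the first line is read as data, not as a header.
sample = read_csv(sample_csv, names=names)

print(sample.shape)             # (rows, columns) of the parsed sample
print(sample.columns.tolist())  # the labels passed in through names
print(sample.head())
```

Run on the full dataset loaded above, dataset.shape should report 150 rows and 5 columns, matching the description at the start of this lesson.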