We will continue working with the iris flower dataset introduced in the previous lesson. In this step we will look at that data in a few different ways. The commands involved are simple, and they will be useful in almost any future project:
Dimensions of the dataset.
Peek at the data itself.
Statistical summary of all attributes.
Breakdown of the data by the class variable.
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
# shape
print(dataset.shape)
(150, 5)
You should see 150 instances and 5 attributes, as in the output above.
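Because shape is an ordinary (rows, columns) tuple, it can also be unpacked into two variables if you need the counts later in a script. The short sketch below is just one way to do this; the variable names n_rows and n_cols are illustrative, and the dataset is reloaded so the snippet runs on its own.
# a small sketch: unpack the (rows, columns) tuple returned by .shape
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# shape is a plain Python tuple, so it can be unpacked directly
n_rows, n_cols = dataset.shape
print(n_rows, "instances,", n_cols, "attributes")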
Keeping an eye on your data is always a good idea. The following command displays the first 20 rows of the dataset.
# head
print(dataset.head(20))
sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
You should see the first 20 rows of the dataset, as in the output above.
Next, we can look at a statistical summary of each attribute.
This includes the count, mean, standard deviation, the min and max values, as well as some percentiles.
# descriptions
print(dataset.describe())
sepal-length sepal-width petal-length petal-width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
All of the numerical attributes use the same scale (centimeters) and have similar ranges, between 0 and 8 centimeters.
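If you want to check that claim directly, one option is to pull just the min and max rows out of the describe() summary. The sketch below is a minimal, self-contained way to do that; selecting the rows with .loc is simply one convenient approach, not part of the original script.
# check the value ranges of the numeric attributes
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# pick only the 'min' and 'max' rows from the describe() summary table
print(dataset.describe().loc[['min', 'max']])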
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
# class distribution
print(dataset.groupby('class').size())
class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
We can see that each class has the same number of instances (50 or 33% of the dataset).
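If you would rather see that distribution as percentages instead of counts, pandas can normalize the class counts for you. The following sketch is one possible way to do it; value_counts(normalize=True) returns fractions, which are multiplied by 100 here for readability.
# class distribution as a percentage of the dataset
from pandas import read_csv
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# value_counts(normalize=True) returns the fraction of rows in each class
print(dataset['class'].value_counts(normalize=True) * 100)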
For reference, we can tie all of the previous elements together into a single script.
The complete example should now look like this
# summarize the data
# (of the imports below, only read_csv is used in this step)
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# shape
print(dataset.shape)
# head
print(dataset.head(20))
# descriptions
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())