Visualizing the Dataset

When there’s a large number of pie-charts in a report or a presentation, there is something wrong in the organization, and it’s not the pie. A pie chart is a potential symptom of a lack of data analysis skills that have to be resolved.

Jorge Camoes - Data visualization consultant and trainer

Data Visualization

We now have a basic idea of the data. Let's extend it with some visualizations.

We are going to look at two types of plots:

Univariate plots to better understand each attribute.
Multivariate plots to better understand the relationships between attributes.

Instruction video

Watch the video and follow along in Google Colab. You will find the code and written instructions below.

Univariate Plots

We start with some univariate plots, that is, plots of every individual variable.

Given that the input variables are numeric, we can make box and whisker plots of each.

# box and whisker plots: one subplot per variable in a 2x2 grid, each with its own axes
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

Box and Whisker Plots for Each Input Variable for the Iris Flowers Dataset

A box plot (or boxplot) graphically depicts numerical data through its quartiles: the box spans the first to the third quartile, and a line inside it marks the median. Lines extending from the box (whiskers) indicate variability outside the upper and lower quartiles, hence the term box and whisker plot. This gives us a much clearer idea of the distribution of the input attributes.
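To make the parts of a box plot concrete, the following minimal sketch computes the numbers behind one box by hand. The sample values are made up for illustration and are not taken from the Iris dataset. The 1.5 × IQR rule used here is Matplotlib's default for where the whiskers may reach; other tools may use a different rule.

```python
import pandas as pd

# Hypothetical sample values (made up for illustration, not the Iris data)
values = pd.Series([4.3, 4.8, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.9, 9.5])

q1 = values.quantile(0.25)      # lower edge of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # interquartile range = height of the box

# By default, whiskers extend to the most extreme data points that still
# lie within 1.5 * IQR of the box; points beyond that are drawn individually.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"Q1={q1:.3f}, median={median:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print("Points drawn as outliers:", outliers.tolist())
```

Running the same calculation on a column of your own dataset shows exactly which values end up inside the box, on the whiskers, or as separate points.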

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
dataset.hist()
pyplot.show()

Histogram Plots for Each Input Variable for the Iris Flowers Dataset

A histogram groups the values of a variable into bins (intervals) and displays the number of observations in each bin as a bar. In this exercise, each histogram shows how often the measurements of one flower part fall within each size range.
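The binning step can be tried out on its own. The sketch below uses NumPy to count how many values fall into each bin, which is the calculation a histogram performs before anything is drawn. The measurements, the number of bins, and the range are assumptions chosen to keep the example small; pandas uses 10 bins by default.

```python
import numpy as np

# Hypothetical measurements in cm (made up for illustration)
values = np.array([1.0, 1.2, 1.4, 1.5, 1.9, 2.1, 2.4, 3.0, 3.3, 3.9])

# Split the range 1.0 to 4.0 into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3, range=(1.0, 4.0))

# Each bar of the histogram corresponds to one line printed here
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} cm: {count} values")
```

Changing `bins` shows how the shape of a histogram depends on the bin width, so it is worth trying a few values before drawing conclusions from the plot.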

It looks like perhaps two of the input variables have a Gaussian distribution. (The Gaussian, or normal, distribution is very common in statistics and appears as a symmetric, bell-shaped curve centered on the mean.) This is useful to note because some algorithms assume this distribution and can exploit it.
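A visual impression of "roughly Gaussian" can be cross-checked with a couple of simple numbers. The sketch below is only a rough sanity check, not a formal normality test. It uses synthetic data generated for illustration, with a made-up mean and standard deviation, and relies on two properties of a normal distribution: the skewness is close to 0, and about 68% of the values lie within one standard deviation of the mean.

```python
import numpy as np
import pandas as pd

# Synthetic, normally distributed sample (illustration only, not the Iris data)
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.normal(loc=3.0, scale=0.4, size=10_000))

mean = sample.mean()
std = sample.std()

# Skewness near 0 means the distribution is roughly symmetric
skewness = sample.skew()

# Share of values that lie within one standard deviation of the mean
within_one_std = ((sample - mean).abs() <= std).mean()

print(f"skewness: {skewness:.3f}")
print(f"share within one standard deviation: {within_one_std:.3f}")
```

To apply the same check to a real variable, replace `sample` with one numeric column of your dataset. With a small dataset the numbers will fluctuate more, so treat them as a hint rather than proof.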

Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This will help us to spot structured relationships between input variables.

# scatter plot matrix (scatter_matrix comes from pandas.plotting)
scatter_matrix(dataset)
pyplot.show()

Scatter Matrix Plot for Each Input Variable for the Iris Flowers Dataset

Note the diagonal grouping of points for some pairs of attributes. This suggests a high correlation and a predictable relationship between those attributes.
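The correlation that a scatter plot hints at can also be expressed as a number. The following sketch computes pairwise Pearson correlation coefficients with pandas. The small table and its column names are hypothetical and only serve as an illustration: one pair of columns rises together, and the other pair shows no linear trend.

```python
import pandas as pd

# Hypothetical measurements (made up for illustration, not the Iris data)
df = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width":  [0.5, 1.1, 1.4, 2.1, 2.4],  # rises together with length
    "other":  [3.0, 1.0, 4.0, 1.0, 3.0],  # no linear trend against length
})

# Pairwise Pearson correlation: values near 1 or -1 correspond to points
# lying close to a diagonal line, values near 0 to a shapeless cloud
corr = df.corr()
print(corr.round(2))
```

On your own data, calling `corr()` on the numeric columns gives one coefficient for each panel of the scatter plot matrix, which makes it easier to compare pairs of attributes.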

Learn data, and you can tell stories that most people don’t even know about yet but are eager to hear.

Nathan Yau - Statistician and data visualization expert
