Algorithms and Datasets

Our intelligence is what makes us human, and AI is an extension of that quality

Yann LeCun - Chief AI Scientist at Facebook

Algorithms

Now we jump right into the interesting world of algorithms. First, we prepare our dataset. Then we try out six different algorithms and see how good their predictions are. Finally, we pick the winner.

A simple 4-step process to follow in this lesson:

  1. Dataset preparation (validation data and test harness)

  2. Testing algorithms

  3. Selecting the best model

  4. Visualization of results (plotting)


Watch the video and do the same in Google Colab. You will find the code and written instructions below.

Dataset Preparation

Validation Data Split

We need to prepare a test for our future model to ensure that it works. Later, we will use statistical methods to estimate the accuracy of the models. We also want a more concrete estimate of the accuracy of the best model by evaluating it on actual unseen data.

We need to hold back some data that the algorithms will not get to see, and we will use this data to get an independent idea of how accurate the best model is.

Let's split the loaded dataset in two: we will use 80% to train and select among our models, and hold back 20% as a validation dataset.

# Split-out validation dataset
from sklearn.model_selection import train_test_split

# 'dataset' is the pandas DataFrame loaded in the previous lesson
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

We now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later for validation.
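As a quick sanity check, you can print the shapes of the splits. The sketch below assumes the Iris dataset (150 rows, 4 features) and loads it via scikit-learn's built-in `load_iris` as a stand-in for the DataFrame from the previous lesson; an 80/20 split then gives 120 training rows and 30 validation rows:

```python
# Sanity-check the 80/20 split (using the built-in Iris dataset as a stand-in)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1)

# 150 rows split 80/20 -> 120 for training, 30 held back for validation
print(X_train.shape, X_validation.shape)  # (120, 4) (30, 4)
```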

Test Harness

We will use stratified 10-fold cross-validation to estimate model accuracy. This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations.

Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
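You can see stratification in action with a small sketch. Using the built-in Iris data (an assumption standing in for the course dataset), which has exactly 50 examples of each of the 3 classes, every 15-row test fold of a stratified 10-fold split should contain 5 examples of each class:

```python
# Show that StratifiedKFold keeps class proportions in every fold
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 rows, 50 per class
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
for _, test_idx in kfold.split(X, y):
    # count how many examples of each class land in this test fold
    print(np.bincount(y[test_idx]))  # [5 5 5] for every fold
```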


I imagine a world in which AI is going to make us work more productively, live longer, and have cleaner energy.

Fei-Fei Li - Sequoia Capital Professor of Computer Science at Stanford University

Testing Algorithms

We don't know which algorithms will work well on this problem or which configurations to use. The plots gave us an idea that some of the classes are partially linearly separable in some dimensions, so we expect good results.

Let's test 6 different algorithms:

  1. Logistic Regression (LR)

  2. Linear Discriminant Analysis (LDA)

  3. K-Nearest Neighbors (KNN)

  4. Classification and Regression Trees (CART)

  5. Gaussian Naive Bayes (NB)

  6. Support Vector Machines (SVM)

These are a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, SVM) algorithms.

You can test algorithms by running this code:

# Spot Check Algorithms
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))


Select the Best Model

We now have 6 models and an accuracy estimate for each. We need to compare them to each other and select the most accurate one.

Running the example above, we will get the following results:

LR: 0.941667 (0.065085)
LDA: 0.975000 (0.038188)
KNN: 0.958333 (0.041667)
CART: 0.950000 (0.055277)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)

Your results may vary given the stochastic nature of the algorithm or evaluation procedure. In this case, we can see that Support Vector Machines (SVM) has the best-estimated accuracy (0.98 or 98%).
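Rather than reading the winner off the printout by eye, you can also select it programmatically by comparing the mean cross-validation scores. A minimal sketch, using the mean scores from the example output above as literal values:

```python
# Pick the model with the highest mean cross-validation accuracy
# (mean scores taken from the example output above)
mean_scores = {
    'LR': 0.941667, 'LDA': 0.975000, 'KNN': 0.958333,
    'CART': 0.950000, 'NB': 0.950000, 'SVM': 0.983333,
}
best = max(mean_scores, key=mean_scores.get)
print(best, mean_scores[best])  # SVM 0.983333
```

In your own notebook you can build the same dictionary directly from the loop, e.g. `dict(zip(names, (r.mean() for r in results)))`.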

Visualization of Results

We will also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There are 10 accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).

A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.

# Compare Algorithms
from matplotlib import pyplot

pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()

As we can now see, the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy and some pushing down into the high 80% accuracies.


Box and Whisker Plot Comparing Machine Learning Algorithms on the Iris Flowers Dataset


Celebrate! You have conquered the algorithms!