Now we jump right into the interesting world of algorithms. First, we prepare our dataset. Then we try out 6 different algorithms and see how good their predictions are. Finally, we pick the winner.
A simple 4-step process to follow in this lesson:
1. Dataset preparation (validation data and test harness)
2. Testing algorithms
3. Selecting the best model
4. Visualization of results (plotting)
We need a way to test our future model to make sure that it works. Later, we will use statistical methods to estimate the accuracy of the models. We also want a more concrete estimate of the best model's accuracy by evaluating it on actual unseen data.
That means we need to hold back some data that the algorithms will not get to see. We will use this data to get an independent idea of how accurate the best model is.
Let's split the loaded dataset into two parts: 80% that we will use to train and select among our models, and 20% that we will hold back as a validation dataset.
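# Split-out validation dataset
from sklearn.model_selection import train_test_split

array = dataset.values
X = array[:, 0:4]  # the four feature columns
y = array[:, 4]    # the class label column
# Hold back 20% of the rows; random_state makes the split repeatable
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)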
We now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later for validation.
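To see what the split produces, the sketch below repeats it on scikit-learn's bundled copy of the Iris data and prints the shapes of the resulting arrays. Using load_iris here is an assumption made only to keep the example self-contained; the lesson's own dataset variable was loaded earlier.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the lesson's `dataset`.
X, y = load_iris(return_X_y=True)

# Same 80/20 split as in the lesson.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# 150 rows in total: 120 go to training, 30 are held back.
print(X_train.shape, X_validation.shape)
```

With 150 rows, the training part holds 120 rows and the validation part holds 30.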
We will use stratified 10-fold cross-validation to estimate model accuracy. This splits the training data into 10 parts, trains on 9 and tests on 1, and repeats this for all combinations of train-test splits.
Stratified means that each fold or split of the dataset aims to have the same distribution of examples by class as exists in the whole training dataset.
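As a rough illustration of what stratification does, the sketch below splits a small toy label array into 10 folds and counts the classes in each test fold. The names X_toy and y_toy and the perfectly balanced labels are hypothetical and chosen only for this example; the real training split will not be exactly balanced.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 3 classes with 40 examples each (120 rows in total).
y_toy = np.repeat([0, 1, 2], 40)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_counts = []
for train_idx, test_idx in kfold.split(X_toy, y_toy):
    # Count how many examples of each class landed in this test fold.
    counts = np.bincount(y_toy[test_idx], minlength=3)
    fold_counts.append(counts.tolist())
    print(len(train_idx), len(test_idx), counts)
```

Each test fold ends up with 12 rows, 4 from every class, so the class proportions in each fold match those of the full toy dataset.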
We don't know which algorithms would be good to use on this problem or what configurations to use. The plots suggest that some of the classes are partially linearly separable in some dimensions, so we expect generally good results.
Let's test 6 different algorithms:
Logistic Regression (LR)
Linear Discriminant Analysis (LDA)
K-Nearest Neighbors (KNN)
Classification and Regression Trees (CART)
Gaussian Naive Bayes (NB)
Support Vector Machines (SVM)
This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, SVM) algorithms.
You can test algorithms by running this code:
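# Spot Check Algorithms
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = []
# Note: multi_class is deprecated since scikit-learn 1.5; newer versions may warn about it
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # the same seeded 10-fold split is used for every model
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    # mean accuracy across the folds, with the standard deviation in parentheses
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))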
We now have 6 models and an accuracy estimate for each. We need to compare the models to each other and select the most accurate.
Running the example above gives the following results (mean accuracy, with the standard deviation in parentheses):
LR: 0.941667 (0.065085)
LDA: 0.975000 (0.038188)
KNN: 0.958333 (0.041667)
CART: 0.950000 (0.055277)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)
Your results may vary given the stochastic nature of the algorithms or the evaluation procedure. In this case, we can see that Support Vector Machines (SVM) has the highest estimated accuracy (about 0.98 or 98%).
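The comparison can also be done in code. The sketch below is one minimal way to pick the winner, using the mean accuracies from the results listed above; the names mean_accuracy and best_name are introduced here for illustration only.

```python
# Mean accuracies copied from the results listed in this lesson.
mean_accuracy = {
    'LR': 0.941667,
    'LDA': 0.975000,
    'KNN': 0.958333,
    'CART': 0.950000,
    'NB': 0.950000,
    'SVM': 0.983333,
}

# Pick the model name whose mean accuracy is highest.
best_name = max(mean_accuracy, key=mean_accuracy.get)
print('%s: %f' % (best_name, mean_accuracy[best_name]))
```

The mean alone does not tell the whole story: the standard deviations show that the differences between the top models are small compared with the spread across folds.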
We will also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a set of 10 accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).
A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
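# Compare Algorithms
from matplotlib import pyplot

# One box per algorithm, built from its 10 cross-validation scores
# Note: Matplotlib 3.9+ renames the labels parameter to tick_labels
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()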
Box and Whisker Plot Comparing Machine Learning Algorithms on the Iris Flowers Dataset