Photo by Markus Winkler on Unsplash

The barriers to getting into machine learning have never been lower: Go do ML

The need to code has never been so important. Why we don’t teach our kids to code at the earliest stage possible is beyond me.

The barrier to coding has never been so low. If you access repl.it, you can use virtually every programming language you want, and have it running in the Cloud. And now you can set up teams, where you can add as many students as you want. And if you want to get serious about your coding, you can go and buy a Raspberry Pi (R-PI), and you have a device which fits in your pocket and is ready to code on.

The need to learn machine learning has never been higher. As a world we are increasingly swamped by data, and in areas such as cybersecurity, the need to process masses of data increases by the day.

The barrier to machine learning has never been lower. With Python, a machine learning programme can be created in minutes.

So what’s holding you back? In this article, I will introduce the key metrics used to define success in machine learning, and show how we plot the ROC (Receiver Operating Characteristic) curve.

And so here is the code:

  • Code 1 (Show ROC (Receiver Operating Characteristic) Curve): [Here]
  • Code 2 (Confusion Matrix for Handwriting numbers): [Here]
  • Code 3 (Creating a data set with two clusters): [Here]
  • Code 4 (Random Forest Classifier for numeric predictions with ROC curve and AUC (Area Under Curve)): [Here]
  • Code 5 (Numeric prediction with R2 metric): [Here]
  • Code 6 (Numeric prediction and metrics): [Here]
  • Code 7 (Category prediction and metrics): [Here]

And here is how Splunk implements machine learning in the methods related to cybersecurity analysis:

Image for post

Coding examples

Example Code 1: [Here]

Example Code 2: [Here]

Example Code 3: [Here]

Example Code 4: [Here]

Example Code 5: [Here]

Example Code 6 (Numeric prediction): [Here]

Example Code 7 (Cluster prediction and metrics): [Here]

Tutorial

1. We want to differentiate Eve from Bob. In monitoring Eve’s accesses to email on a daily basis, we find daily accesses of 20, 25, 16, 42 and 22, and then monitor Bob’s accesses as 50, 41, 60, 54 and 39. With these values, we aim to distinguish Bob from Eve and plot the ROC curve. Use the following code to determine the ROC curve and the AUC value:

Listing:
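A minimal sketch of such a listing, assuming scikit-learn's roc_curve and auc functions, with each daily access count used directly as the score and Bob treated as the positive class. The helper names rates_at and plot_roc are illustrative and not taken from the original code:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Daily email accesses from the exercise text.
eve = [20, 25, 16, 42, 22]
bob = [50, 41, 60, 54, 39]

# Assumption: the access count itself is the score; Bob is the positive class (1).
scores = np.array(eve + bob)
labels = np.array([0] * len(eve) + [1] * len(bob))

fpr, tpr, thresholds = roc_curve(labels, scores, pos_label=1)
roc_auc = auc(fpr, tpr)

print("AUC:", roc_auc)
for t, f, p in zip(thresholds, fpr, tpr):
    print(f"threshold={t}  FPR={f:.2f}  TPR={p:.2f}")


def rates_at(threshold, scores, labels):
    """FPR and TPR when every count >= threshold is classed as Bob."""
    predicted = scores >= threshold
    tpr_value = np.mean(predicted[labels == 1])
    fpr_value = np.mean(predicted[labels == 0])
    return fpr_value, tpr_value


def plot_roc(fpr, tpr, roc_auc):
    """Plot the ROC curve (matplotlib is only imported when plotting)."""
    import matplotlib.pyplot as plt

    plt.plot(fpr, tpr, label=f"ROC (AUC = {roc_auc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()
```

Calling rates_at(42, scores, labels) gives the FPR and TPR for a fixed threshold, and plot_roc(fpr, tpr, roc_auc) draws the curve. Swapping in new lists for eve and bob reruns the same analysis on other data.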

What is the AUC?

What are the thresholds used?

For each threshold, what is the FPR and what is the TPR?

If we set a threshold of 42, what is the FPR and what is the TPR?

Bob’s daily accesses for email are now monitored as 50, 55, 43, 90, 110 and 66, and Eve has accesses of 14, 32, 19, 46, 21, 48 and 50. Use the program to determine the new ROC curve:

What is the AUC?

What are the thresholds used?

For each threshold, what is the FPR and what is the TPR?

If we set a threshold of 42, what is the FPR and what is the TPR?

2. Now we can add Alice, who has accesses of 13, 23, 32, 40, 11 and 14, and determine the following:

What is the AUC?
What are the thresholds used?
For each threshold, what is the FPR and what is the TPR?
If we set a threshold of 42, what is the FPR and what is the TPR?

3. Now change the positive label to Alice, and determine the following:

What is the AUC?

What are the thresholds used?

For each threshold, what is the FPR and what is the TPR?

If we set a threshold of 42, what is the FPR and what is the TPR?
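With three people, one way to compute a ROC curve is to treat one person as the positive label and everyone else as negative. The sketch below does this with scikit-learn's roc_curve, assuming the second set of values for Bob and Eve is the one used alongside Alice (the text does not say which). The function name roc_for is illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Assumption: Bob and Eve use their second set of access counts.
accesses = {
    "Bob": [50, 55, 43, 90, 110, 66],
    "Eve": [14, 32, 19, 46, 21, 48, 50],
    "Alice": [13, 23, 32, 40, 11, 14],
}

# Flatten into one score array and one matching array of names.
scores = np.array([v for values in accesses.values() for v in values])
names = np.array([n for n, values in accesses.items() for _ in values])


def roc_for(positive, scores, names):
    """ROC curve and AUC with one person as the positive label, the rest negative."""
    y_true = (names == positive).astype(int)
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return fpr, tpr, thresholds, auc(fpr, tpr)


results = {person: roc_for(person, scores, names)[3] for person in accesses}
for person, area in results.items():
    print(f"Positive label = {person}: AUC = {area:.3f}")

# Low counts point towards Alice, so the raw count ranks her last.
# Negating the score reverses the ranking.
alice_flipped = roc_for("Alice", -scores, names)[3]
print(f"Alice with negated scores: AUC = {alice_flipped:.3f}")
```

An AUC below 0.5 indicates that the score is ranking the positive class in the wrong direction, which is what happens here when Alice is the positive label and the raw count is the score.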

4. In the following example, we will load a dataset for a machine learning model that differentiates handwritten digits. Run the following code and determine the confusion matrix:

What are the TP for the character ‘0’?

What are the FP for the character ‘0’?

What are the FN for the character ‘0’?

We are using an SVM (Support Vector Machine), which uses a gamma parameter. Vary the gamma value with 0.1, 0.2, 0.3 and so on, up to 1.0, and observe how the confusion matrix changes:
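A minimal sketch of this exercise, assuming scikit-learn's built-in load_digits dataset and an SVC classifier with its default RBF kernel. The 50/50 unshuffled split, the starting gamma of 0.001 and the helper names are assumptions rather than details from the original listing:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Load the 8x8 handwritten digit images and flatten each one to 64 features.
digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

# Assumption: an even, unshuffled train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=False
)


def digit_confusion(gamma):
    """Train an RBF-kernel SVM with the given gamma; return the 10x10 confusion matrix."""
    model = svm.SVC(gamma=gamma)
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    return confusion_matrix(y_test, predicted, labels=np.arange(10))


def counts_for_digit(cm, digit):
    """TP, FP and FN for one digit (rows = true label, columns = predicted label)."""
    tp = cm[digit, digit]
    fp = cm[:, digit].sum() - tp
    fn = cm[digit, :].sum() - tp
    return tp, fp, fn


cm = digit_confusion(0.001)  # starting gamma is an assumption
print(cm)
print("Digit 0 (TP, FP, FN):", counts_for_digit(cm, 0))

# Vary gamma from 0.1 to 1.0 and watch the counts for digit 0 change.
for gamma in np.linspace(0.1, 1.0, 10):
    print(f"gamma={gamma:.1f}", counts_for_digit(digit_confusion(gamma), 0))
```

In scikit-learn's confusion matrix, row i holds the samples whose true label is i and column j holds the samples predicted as j, so the false positives for ‘0’ sit in column 0 outside the diagonal and the false negatives sit in row 0 outside the diagonal.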

5. The following code generates a dataset which has two clusters, and then labels each element of the dataset with the cluster it came from. Run the program several times and observe the creation of the clusters:

Modify the code so that it now generates 250 points.
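One way to generate such a dataset is scikit-learn's make_blobs function. The sketch below assumes two features per point and wraps the call in an illustrative helper, make_two_clusters, so that the number of points is a single parameter:

```python
from sklearn.datasets import make_blobs


def make_two_clusters(n_points=100, seed=None):
    """Return (X, y): n_points 2-D points and the cluster (0 or 1) each came from."""
    X, y = make_blobs(
        n_samples=n_points, centers=2, n_features=2, random_state=seed
    )
    return X, y


# seed=None gives different clusters on every run; pass a number to repeat a run.
X, y = make_two_clusters(250)
for label in (0, 1):
    members = X[y == label]
    print(f"cluster {label}: {len(members)} points, centre near {members.mean(axis=0)}")
```

A scatter plot of X[:, 0] against X[:, 1], coloured by y, shows the two clusters directly.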

6. We will now use this method of cluster generation, and then split the data into 70% training data and 30% test data, in order to train a RandomForestClassifier model to predict the results:

For 10 points, what is the AUC?
For 100 points, what is the AUC?
For 250 points, what is the AUC?
For 1000 points, what is the AUC?
For the different number of points, how does the shape of ROC Curve change?
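The experiment can be sketched as follows, assuming make_blobs for the cluster generation and a stratified 70/30 split so that both clusters appear in the test set even with only 10 points. The cluster_std value, the seed and the function name forest_auc are assumptions chosen for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split


def forest_auc(n_points, seed=0):
    """Train a random forest on two generated clusters; return AUC, FPR and TPR."""
    # Assumption: a wide cluster_std so that the two clusters overlap.
    X, y = make_blobs(
        n_samples=n_points, centers=2, n_features=2,
        cluster_std=3.0, random_state=seed,
    )
    # 70% training data, 30% test data, keeping both clusters in each part.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    # Probability of cluster 1 is used as the score for the ROC curve.
    prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    return roc_auc_score(y_test, prob), fpr, tpr


for n in (10, 100, 250, 1000):
    area, fpr, tpr = forest_auc(n)
    print(f"{n} points: AUC = {area:.3f}, ROC curve has {len(fpr)} points")
```

With very few test samples the ROC curve can only take a handful of steps, which is worth keeping in mind when comparing its shape across the different dataset sizes.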

7. There are a few ensemble methods for machine learning in scikit-learn, including BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier, and GradientBoostingRegressor. Modify the code given in Q.6 to support each of the different models:

For each of the methods, what is the AUC, and which method is the best performing?

  • ExtraTreesClassifier AUC:
  • AdaBoostClassifier AUC:
  • GradientBoostingRegressor AUC:
  • BaggingClassifier AUC:
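A sketch of how the models might be compared on the same data, again assuming make_blobs clusters and a 70/30 split. The classifiers expose predict_proba, while GradientBoostingRegressor is a regressor, so its predict output is used directly as the score. The dataset size, cluster_std, seed and the helper name score_for are assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    ExtraTreesClassifier,
    GradientBoostingRegressor,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: 250 overlapping points, fixed seed for a repeatable comparison.
X, y = make_blobs(n_samples=250, centers=2, cluster_std=3.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "BaggingClassifier": BaggingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}


def score_for(model, X_test):
    """Score for class 1: a probability for classifiers, raw output for a regressor."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X_test)[:, 1]
    return model.predict(X_test)


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = roc_auc_score(y_test, score_for(model, X_test))
    print(f"{name} AUC: {results[name]:.3f}")

print("Best performing:", max(results, key=results.get))
```

The ranking can change with the seed and the amount of overlap between the clusters, so it is worth rerunning with several seeds before naming a best method.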

8. You have been asked to identify if there is a linkage between gun ownership and population density in US states, and whether there is a link to the number of murders per 100K of the population. An outline of the code is given here:

By examining the R2 value, is the machine learning implementation a good model?

Can you create a better model for predicting murders per 100K of the population, using only two features?
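To compare feature choices, it helps to have a small helper that fits a model on a chosen set of columns and reports R2 on held-out data. The sketch below assumes a LinearRegression model, which may differ from the original code, and uses synthetic placeholder numbers rather than the US state figures. The helper name r2_for_features is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def r2_for_features(X, y, columns, seed=0):
    """Fit a linear model on the chosen columns; return R2 on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, columns], y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


# Synthetic placeholder data, NOT the US state figures:
# the target depends on columns 0 and 1 only; columns 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=50)

r2_informative = r2_for_features(X, y, [0, 1])
r2_noise = r2_for_features(X, y, [2, 3])
print(f"R2 with informative features: {r2_informative:.3f}")
print(f"R2 with noise features:       {r2_noise:.3f}")
```

An R2 close to 1 means the model explains most of the variance in the target, while a value near 0 (or negative) means it does no better than always predicting the mean. Replacing X and y with the real state data and trying different pairs of columns gives a direct way to search for a better two-feature model.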

We have a new MSc in planning, related to Cyber&Data. Here’s a little taster:

https://asecuritysite.com/cyberdata

Written by

Professor of Cryptography. Serial innovator. Believer in fairness, justice & freedom. EU Citizen. Auld Reekie native. Old World Breaker. New World Creator.
