Confusion Matrix Related to CyberCrimes

Ajmal Muhammed
5 min read · Jun 6, 2021


We all know that the confusion matrix is a common term in machine learning. But how does it relate to cybercrime?

Before discussing the confusion matrix, let's look at what cybercrime is.

What is Cybercrime?

Cybercrime is a criminal activity that either targets or uses a computer, a computer network, or a networked device. Most, but not all, cybercrime is committed by cybercriminals or hackers who want to make money. Cybercrime is carried out by individuals or organizations.

Some cybercriminals are organized, use advanced techniques, and are highly technically skilled. Others are novice hackers. Rarely, cybercrime aims to damage computers for reasons other than profit. These could be political or personal.

Types of cybercrime

Here are some specific examples of the different types of cybercrime:

  • Email and internet fraud.
  • Identity fraud (where personal information is stolen and used).
  • Theft of financial or card payment data.
  • Theft and sale of corporate data.
  • Cyberextortion (demanding money to prevent a threatened attack).
  • Ransomware attacks (a type of cyber extortion).
  • Cryptojacking (where hackers mine cryptocurrency using resources they do not own).
  • Cyberespionage (where hackers access government or company data).

Most cybercrime falls under two main categories:

  • Criminal activity that targets computers.
  • Criminal activity that uses computers to commit other crimes.

Machine learning in cybersecurity

Machine learning has become a vital technology for cybersecurity. It helps detect cyber threats early and strengthens security infrastructure through pattern detection, real-time cybercrime mapping, and thorough penetration testing.

What is a confusion matrix?

In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa.

The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

Internet usage has grown rapidly, particularly in the last decade. As the Internet becomes part of day-to-day activities, cybercrime is also on the rise. According to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021. Cybercriminals use networked computing devices as their primary means of reaching victims’ devices, and they profit financially, gain publicity, or obtain other benefits by exploiting vulnerabilities in the system.

Manual methods of evaluating cybercrime attacks and providing protective measures, whether existing technical approaches or investigations, have often failed to control these attacks. The existing literature on cybercrime offenses also lacks computational methods for predicting cybercrime, especially from unstructured data. One study therefore proposes a flexible computational tool that uses machine learning techniques to analyze state-wise cybercrime rates in a country and to help classify cybercrimes. Security analytics, combined with data analytics approaches, helps analyze and classify offenses from India-based integrated data, which may be either structured or unstructured.

Let’s suppose we were working on a binary classification problem to detect whether or not a transaction is fraudulent. Our model uses characteristics of the user and transaction and returns 1 if the transaction is predicted to be fraudulent and 0 if not.

Given that machine learning models are rarely 100% accurate, there is going to be a level of risk in deploying this model. If we incorrectly classify a non-fraudulent transaction as fraud, we may well lose that transaction, and possibly even the customer's future business. On the other hand, if we incorrectly classify a fraudulent transaction as non-fraudulent, we stand to lose the value of that transaction.

The confusion matrix essentially places the resulting predictions into four groups. They are as follows:

True positive (TP): the model predicts fraud and the transaction is indeed fraudulent.

False positive (FP): the model predicts fraud but the transaction is not fraudulent.

True negative (TN): the model predicts not fraud and the transaction is not fraudulent.

False negative (FN): the model predicts not fraud but the transaction is in fact fraudulent.
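The four groups can be tallied with a few lines of Python. This is only a minimal sketch: the `outcome` helper and the two label lists are made up for illustration (1 = fraud, 0 = not fraud) and do not come from any real dataset.

```python
def outcome(actual: int, predicted: int) -> str:
    """Name the confusion-matrix group for one transaction (1 = fraud, 0 = not fraud)."""
    if predicted == 1:
        return "TP" if actual == 1 else "FP"
    return "FN" if actual == 1 else "TN"


# Hypothetical labels, invented purely for illustration.
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 0]

# Count how many transactions fall into each of the four groups.
counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for a, p in zip(actual, predicted):
    counts[outcome(a, p)] += 1

print(counts)
```

Every prediction lands in exactly one group, so the four counts always add up to the number of transactions.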

Also, there are two types of errors in the predictions:

False Positive (FP): Type 1 error

  • The actual value was negative but the model predicted a positive value.
  • In the fraud example, a legitimate transaction is flagged as fraudulent.

False Negative (FN): Type 2 error

  • The actual value was positive but the model predicted a negative value.
  • In the fraud example, a fraudulent transaction is passed as legitimate.

Example Case Study

Let’s pretend we have a two-class classification problem of predicting whether a photograph contains a man or a woman.

We have a test dataset of 10 records with expected outcomes and a set of predictions from our classification algorithm.

Expected,  Predicted
1) man, woman
2) man, man
3) woman, woman
4) man, man
5) woman, man
6) woman, woman
7) woman, woman
8) man, man
9) man, woman
10) woman, woman

Let’s start off and calculate the classification accuracy for this set of predictions.

The algorithm got 7 of the 10 predictions correct, giving an accuracy of 70%.

accuracy = total correct predictions / total predictions made * 100
accuracy = 7 / 10 * 100 = 70
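The same calculation can be sketched in Python. The `records` list simply transcribes the ten expected/predicted pairs from the table above; the variable names are illustrative choices.

```python
# The ten (expected, predicted) records from the test dataset above.
records = [
    ("man", "woman"), ("man", "man"), ("woman", "woman"), ("man", "man"),
    ("woman", "man"), ("woman", "woman"), ("woman", "woman"), ("man", "man"),
    ("man", "woman"), ("woman", "woman"),
]

# A prediction is correct when it matches the expected label.
correct = sum(1 for expected, predicted in records if expected == predicted)
accuracy = correct / len(records) * 100

print(f"{correct} of {len(records)} correct, accuracy = {accuracy:.0f}%")
```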

But what type of errors were made?

Let’s turn our results into a confusion matrix.

First, we must calculate the number of correct predictions for each class.

men classified as men: 3
women classified as women: 4

Now, we can calculate the number of incorrect predictions for each class, organized by the predicted value.

men classified as women: 2
women classified as men: 1
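These four counts can be reproduced by counting each (expected, predicted) pair. The sketch below uses `collections.Counter` from the Python standard library; the `records` list again transcribes the ten rows of the test dataset.

```python
from collections import Counter

# The ten (expected, predicted) records from the test dataset above.
records = [
    ("man", "woman"), ("man", "man"), ("woman", "woman"), ("man", "man"),
    ("woman", "man"), ("woman", "woman"), ("woman", "woman"), ("man", "man"),
    ("man", "woman"), ("woman", "woman"),
]

# Each key is an (expected, predicted) pair; each value is how often it occurs.
pair_counts = Counter(records)

print("men classified as men:", pair_counts[("man", "man")])
print("women classified as women:", pair_counts[("woman", "woman")])
print("men classified as women:", pair_counts[("man", "woman")])
print("women classified as men:", pair_counts[("woman", "man")])
```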

We can now arrange these values into the 2-class confusion matrix:

                   actual men   actual women
predicted men          3             1
predicted women        2             4

We can learn a lot from this table.

  • The total number of actual men in the dataset is the sum of the values in the men column (3 + 2).
  • The total number of actual women in the dataset is the sum of the values in the women column (1 + 4).
  • The correct predictions lie on the diagonal from the top left to the bottom right of the matrix (3 + 4).
  • More errors were made by predicting men as women than by predicting women as men.
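The observations above can be checked with a few lines of Python. This sketch assumes that rows are predicted classes and columns are actual classes, which is what the column sums in the list imply; the variable names are illustrative.

```python
# Rows are predicted classes, columns are actual classes, both in the order: men, women.
matrix = [
    [3, 1],  # predicted men:   3 actual men, 1 actual woman
    [2, 4],  # predicted women: 2 actual men, 4 actual women
]

# Column sums give the number of actual members of each class.
actual_men = matrix[0][0] + matrix[1][0]
actual_women = matrix[0][1] + matrix[1][1]

# The diagonal holds the correct predictions.
correct = matrix[0][0] + matrix[1][1]

# The off-diagonal cells hold the two kinds of error.
men_as_women = matrix[1][0]
women_as_men = matrix[0][1]

print("actual men:", actual_men, "actual women:", actual_women)
print("correct predictions:", correct)
print("men predicted as women:", men_as_women, "women predicted as men:", women_as_men)
```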
