Keystroke dynamics, the behavioral biometric used in this article's machine learning models for user recognition, leverages the distinctive way each person types to verify their identity. It works by analyzing the two events that make up a keystroke on a computer keyboard, the Key-Press and the Key-Release, to extract typing patterns. This article examines how these patterns can be applied to create three machine learning models for user recognition.

The goal of this article can be split into two parts: three machine learning models (1. SVM, 2. Random Forest, 3. XGBoost) and a live single-endpoint API able to predict the user based on 5 input parameters: the ML model and 4 keystroke times.
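As a preview of the second part, here is a minimal sketch of what such a prediction request could carry. The field names and the choice of the four keystroke times are illustrative assumptions, not the API's actual schema:

```python
# Hypothetical payload for the prediction endpoint. The field names and the
# use of the four timing features defined later in this article (HT, PPT,
# RRT, RPT) are assumptions for illustration only.
payload = {
    "model": "SVM",   # which of the three trained models to use
    "ht": 0.12,       # Hold Time, in seconds
    "ppt": 0.25,      # Press-Press Time
    "rrt": 0.24,      # Release-Release Time
    "rpt": 0.13,      # Release-Press Time
}

# the five input parameters: the ML model plus 4 keystroke times
print(len(payload))
```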

Source: https://www.rootstrap.com/blog/a-primer-into-keystroke-recognition-technology

## The problem

The goal of this part is to build ML models for user recognition based on keystroke data. Keystroke dynamics is a behavioral biometric that uses the unique way an individual types to confirm that person's identity.

Typing patterns are predominantly extracted from computer keyboards. The patterns used in keystroke dynamics are derived mainly from the two events that make up a keystroke: the Key-Press and the Key-Release.

The Key-Press event occurs when a key is first pushed down, and the Key-Release occurs when that key is subsequently released.

In this step, a dataset of users' keystroke data is given, with the following details:

- keystroke.csv: this dataset contains the keystroke data collected from 110 users.
- Each user was asked to type a fixed 13-character string 8 times, and the keystroke data (key-press time and key-release time for every key) were collected.
- The dataset comprises 880 rows and 27 columns.
- The first column indicates the UserID, and the rest show the press and release times for the first through the thirteenth character.
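To make that layout concrete, here is a small synthetic stand-in for keystroke.csv. The column names (`user`, `press-0` … `press-12`, `release-0` … `release-12`) are an assumption taken from the parsing code shown later, not a guarantee about the real file:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: 1 user, 2 repetitions of a 13-character string.
# 1 user column + 13 press columns + 13 release columns = 27 columns,
# matching the shape described for keystroke.csv.
cols = (['user']
        + [f'press-{j}' for j in range(13)]
        + [f'release-{j}' for j in range(13)])

rows = []
for rep in range(2):
    presses = np.sort(rng.uniform(0, 5, size=13))        # key-press times
    releases = presses + rng.uniform(0.05, 0.2, size=13)  # each release after its press
    rows.append(['user_1', *presses, *releases])

df = pd.DataFrame(rows, columns=cols)
print(df.shape)  # (2, 27)
```

The real dataset would have 880 such rows (110 users, 8 repetitions each).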

## You need to do the following steps:

1. Often, the raw data is not informative enough, and it is necessary to extract informative features from it.

In this regard, 4 features:

- Hold Time “HT”,
- Press-Press time “PPT”,
- Release-Release Time “RRT”,
- Release-Press time “RPT”

are introduced, and the definition of each of them is given below.
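For a pair of consecutive keys, these four features reduce to simple differences between the press and release timestamps, mirroring the formulas used in `process_csv()` below. A small sketch with made-up timestamps:

```python
# Timestamps (in seconds) for two consecutive keystrokes; values are made up.
press_1, release_1 = 0.00, 0.11   # first key
press_2, release_2 = 0.25, 0.38   # second key

ht = release_1 - press_1    # Hold Time: how long the first key is held down
ppt = press_2 - press_1     # Press-Press Time: between consecutive presses
rrt = release_2 - release_1 # Release-Release Time: between consecutive releases
rpt = press_2 - release_1   # Release-Press Time: gap from release to next press

print(ht, ppt, rrt, rpt)
```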

2. For every row in keystroke.csv, you must calculate these four features for every two consecutive keys.

3. After completing the previous step, you must calculate the mean and standard deviation of each feature per row. Consequently, you must have 8 features (4 means and 4 standard deviations) per row. → `process_csv()`

```python
def calculate_mean_and_standard_deviation(feature_list):
    from math import sqrt

    # calculate the mean
    mean = sum(feature_list) / len(feature_list)

    # calculate the squared differences from the mean
    squared_diffs = [(x - mean) ** 2 for x in feature_list]

    # calculate the sum of the squared differences
    sum_squared_diffs = sum(squared_diffs)

    # calculate the (sample) variance
    variance = sum_squared_diffs / (len(feature_list) - 1)

    # calculate the standard deviation
    std_dev = sqrt(variance)

    return mean, std_dev
```
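The helper uses the sample (n - 1) variance, so as a quick sanity check its output can be compared against Python's built-in `statistics` module, which uses the same definition:

```python
from statistics import mean, stdev

# example hold-time values (made up, in seconds)
feature_list = [0.11, 0.25, 0.27, 0.14, 0.20]

# statistics.mean / statistics.stdev use the same sample (n - 1) definition
# as the helper above, so both should agree on these inputs
m = mean(feature_list)
s = stdev(feature_list)
print(round(m, 4), round(s, 4))
```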

```python
import pandas as pd


def process_csv(df_input_csv_data):
    data = {
        'user': [],
        'ht_mean': [], 'ht_std_dev': [],
        'ppt_mean': [], 'ppt_std_dev': [],
        'rrt_mean': [], 'rrt_std_dev': [],
        'rpt_mean': [], 'rpt_std_dev': [],
    }

    # iterate over each row in the dataframe
    for i, row in df_input_csv_data.iterrows():
        # lists of hold, press-press, release-release and release-press times
        ht_list = []
        ppt_list = []
        rrt_list = []
        rpt_list = []

        # the IF selects only the expected rows of the csv
        if i < 885:
            # iterate over each pair of consecutive presses and releases
            for j in range(12):
                # calculate the hold time: release[j] - press[j]
                ht = row[f"release-{j}"] - row[f"press-{j}"]
                ht_list.append(ht)

                # calculate the press-press time: press[j+1] - press[j]
                if j < 11:
                    ppt = row[f"press-{j + 1}"] - row[f"press-{j}"]
                    ppt_list.append(ppt)

                # calculate the release-release time: release[j+1] - release[j]
                if j < 11:
                    rrt = row[f"release-{j + 1}"] - row[f"release-{j}"]
                    rrt_list.append(rrt)

                # calculate the release-press time: press[j+1] - release[j]
                if j < 11:  # was j < 10, which dropped the last key pair
                    rpt = row[f"press-{j + 1}"] - row[f"release-{j}"]
                    rpt_list.append(rpt)

            # ht_list, ppt_list, rrt_list, rpt_list each hold the calculated
            # values for one feature -> feature_list
            ht_mean, ht_std_dev = calculate_mean_and_standard_deviation(ht_list)
            ppt_mean, ppt_std_dev = calculate_mean_and_standard_deviation(ppt_list)
            rrt_mean, rrt_std_dev = calculate_mean_and_standard_deviation(rrt_list)
            rpt_mean, rpt_std_dev = calculate_mean_and_standard_deviation(rpt_list)

            data['user'].append(row['user'])
            data['ht_mean'].append(ht_mean)
            data['ht_std_dev'].append(ht_std_dev)
            data['ppt_mean'].append(ppt_mean)
            data['ppt_std_dev'].append(ppt_std_dev)
            data['rrt_mean'].append(rrt_mean)
            data['rrt_std_dev'].append(rrt_std_dev)
            data['rpt_mean'].append(rpt_mean)
            data['rpt_std_dev'].append(rpt_std_dev)
        else:
            break

    data_df = pd.DataFrame(data)
    return data_df
```

## All of the code can be found on my GitHub, in the KeystrokeDynamics repository:

Now that we have parsed the data, we can start building the models and training them.

## Support Vector Machine

```python
def train_svm(training_data, features):
    """
    SVM stands for Support Vector Machine, a type of machine learning algorithm
    used for classification and regression analysis.
    The SVM algorithm aims to find a hyperplane in an n-dimensional space that
    separates the data into two classes. The hyperplane is chosen in such a way
    that it maximizes the margin between the two classes, making the
    classification more robust and accurate.
    In addition, SVM can also handle non-linearly separable data by mapping the
    original features to a higher-dimensional space, where a linear hyperplane
    can be used for classification.

    :param training_data:
    :param features:
    :return: ML Trained model
    """
    import joblib
    from sklearn.svm import SVC

    # Split the data into features and labels
    X = training_data[features]
    y = training_data['user']

    # Train an SVM model on the data
    svm_model = SVC()
    svm_model.fit(X, y)

    # Save the trained model to disk
    svm_model_name = 'models/svm_model.joblib'
    joblib.dump(svm_model, svm_model_name)
```
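Once saved, the model can be loaded back with `joblib.load()` and used for prediction. A self-contained sketch on synthetic data (the real pipeline would use the 8 features produced by `process_csv()`; the two fake users and their timing distributions are made up):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Two fake "users" with clearly different typing-time distributions;
# the 8 columns stand in for the 4 means and 4 standard deviations.
X = np.vstack([rng.normal(0.2, 0.02, size=(20, 8)),
               rng.normal(0.5, 0.02, size=(20, 8))])
y = np.array(['user_1'] * 20 + ['user_2'] * 20)

svm_model = SVC()
svm_model.fit(X, y)

# Round-trip through joblib, as train_svm() does with 'models/svm_model.joblib'
path = os.path.join(tempfile.mkdtemp(), 'svm_model.joblib')
joblib.dump(svm_model, path)
loaded = joblib.load(path)

print(loaded.predict(X[:1])[0])  # a sample drawn from user_1
```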


```python
def train_random_forest(training_data, features):
    """
    Random Forest is a type of machine learning algorithm that belongs to the
    family of ensemble learning methods. It is used for classification,
    regression, and other tasks that involve predicting an output value based
    on a set of input features.
    The algorithm works by creating multiple decision trees, where each tree is
    built using a random subset of the input features and a random subset of
    the training data. Each tree is trained independently, and the final output
    is obtained by combining the outputs of all the trees in some way, such as
    taking the average (for regression) or the majority vote (for classification).

    :param training_data:
    :param features:
    :return: ML Trained model
    """
    import joblib
    from sklearn.ensemble import RandomForestClassifier

    # Split the data into features and labels
    X = training_data[features]
    y = training_data['user']

    # Train a Random Forest model on the data
    rf_model = RandomForestClassifier()
    rf_model.fit(X, y)

    # Save the trained model to disk
    rf_model_name = 'models/rf_model.joblib'
    joblib.dump(rf_model, rf_model_name)
```
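One practical perk of Random Forest for this task is that a fitted model reports how much each of the 8 keystroke features contributes to its decisions, via `feature_importances_`. A sketch on synthetic data where only one feature actually separates the users:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

feature_names = ['ht_mean', 'ht_std_dev', 'ppt_mean', 'ppt_std_dev',
                 'rrt_mean', 'rrt_std_dev', 'rpt_mean', 'rpt_std_dev']

# Synthetic data: 8 noisy features, but only ht_mean (column 0) differs
# between the two fake users.
X = rng.normal(0.3, 0.05, size=(60, 8))
y = np.array(['user_1'] * 30 + ['user_2'] * 30)
X[30:, 0] += 0.5  # shift ht_mean for the second user

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; the informative feature dominates
top = feature_names[int(np.argmax(rf.feature_importances_))]
print(top)  # ht_mean
```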


```python
def train_xgboost(training_data, features):
    """
    XGBoost stands for Extreme Gradient Boosting, a type of gradient boosting
    algorithm used for classification and regression analysis.
    XGBoost is an ensemble learning method that combines multiple decision
    trees to create a more powerful model. Each tree is built using a gradient
    boosting algorithm, which iteratively improves the model by minimizing a
    loss function.
    XGBoost has several benefits over other boosting algorithms, including its
    speed, scalability, and ability to handle missing values.

    :param training_data:
    :param features:
    :return: ML Trained model
    """
    import joblib
    import xgboost as xgb
    from sklearn.preprocessing import LabelEncoder

    # Split the data into features and labels
    X = training_data[features]

    # XGBoost needs integer class labels, so encode the user IDs
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(training_data['user'])

    # Train an XGBoost model on the data
    xgb_model = xgb.XGBClassifier()
    xgb_model.fit(X, y)

    # Save the trained model to disk
    xgb_model_name = 'models/xgb_model.joblib'
    joblib.dump(xgb_model, xgb_model_name)
```
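Note that, unlike the SVM and Random Forest models, this model predicts encoded integer labels, so the fitted `LabelEncoder` is needed to map predictions back to user IDs. A minimal sketch of that round trip (scikit-learn only, no xgboost required; the user IDs are made up):

```python
from sklearn.preprocessing import LabelEncoder

users = ['user_7', 'user_2', 'user_7', 'user_13']

# fit_transform maps each distinct user ID to an integer class,
# exactly as done on training_data['user'] in train_xgboost()
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(users)
print(list(y))

# Integer outputs of xgb_model.predict() are mapped back like this:
decoded = label_encoder.inverse_transform(y)
print(list(decoded))
```

In practice this means the fitted encoder (or at least its `classes_` list) has to be persisted alongside the model, since `train_xgboost()` above only saves the model itself.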
