Skip to content

ankiteciitkgp/keystrokeDynamics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Keystroke Dynamics

1.1 Introduction

Keystroke dynamics is the study of whether people can be distinguished by their typing rhythms, much like handwriting is used to identify the author of a written text. Possible applications include acting as an electronic fingerprint, or in an access-control mechanism. A digital fingerprint would tie a person to a computer-based crime in the same manner that a physical fingerprint ties a person to the scene of a physical crime. Access control could incorporate keystroke dynamics both by requiring a legitimate user to type a password with the correct rhythm, and by continually authenticating that user while they type on the keyboard. Keystroke dynamics is a behavioral biometric, this means that the biometric factor is ’something you do’.

1.2 Methods

The methods for collecting the data, extracting the features and building the classifier are described in this section.

1.2.1 Data Acquisition

A keylogger was used to collect the keystroke data for all the group members. This was collected in the form of :

• ”Key Event” - whether a button was pressed or released • ”Key Code” - which button was pressed • ”Shift”,”Alt”,”Control” - if these buttons were pressed or not • ”Timestamp” - the exact timestamp of the event

1.2.2 Feature Extraction

A custom-built feature extractor was used in the pipeline for extracting a set of 18 features from the keylogger data. These features are characteristic of an user’s typing pattern. The keyboard was divided into two parts as per the convention in (Figure 1.1).

keyboard finger positon

The extracted features can be classified into four different categories : • Latency: It is defined as the time interval between ”KEY DOWN” event (pressing) of two consecutive key strokes • Hold Time: It is the time interval for which a particular key is pressed • Counts per Character: It is a fraction of the number of times a key is pressed over the total number of keys pressed in a keystroke data • Characters per Minute: It is the average number of keys pressed in a time interval of one minute

The difference between timestamps aided in providing latency and hold times that were used in the training data. This processed data was further used to create bins of typing sessions of the user. Each bin represents a typing session of the user. A bin is created when there is no ”Key Event” recorded for a period of 10 seconds or when the user types uninterruptedly for a period of 1 minute without a break lasting more than 10 seconds. Each bin was analysed for the latency and holds times for certain combination of keys. The latency and hold times have a threshold of 1.5 seconds above which they are ignored. The mean values of the latency and hold times of a bin for the combination of keys are considered as an observation vector. A bin is considered as a valid bin if it contains more than 100 events within it, barring which it is discarded and the next bin is constructed. Additional features considered were the no. of backspaces per character of the user. This gives a sense of how prone to typing mistakes a user is and reflects mood of the user, although testing for this feature gives inconsistent results as this feature may also reflect the current nature of the text being typed and varied widely for the same user. Another feature used is the cpm(characters per minute), which reflects the typing speed of a user and which gives a distinction among several users. The cpm feature was considered in only those bins where the user typed more than 50 valid characters so as to give a true sense of a user’s typing speed. The extracted features are shown in the Table 1.1.

Table 1.1

1.2.3 Feature Engineering

First, the unnecessary keys captured by the Keylogger, such as keys used for navigation, were filtered out as they did not have any significance in determining an user’s typing pattern. Any missing values of an attribute for an observation bin are filled in by the mean of the corresponding features from the other bins. (*Filling in the missing values with the median results in a lower training and test accuracy) The data collected in this format gives us an observation vector wherein the user continuously types a stream of text, as opposed to the method of collecting latency data of character combinations at different time stamps and including them in a single obser- vation vector. The former proposed method which gives a more accurate and intuitive sense of the typing pattern of an user.

1.3 Classifier

A Mixture of Gaussians model (GMM) was used for classifying the obtained training data from the feature extraction step. The GMM Estimator provided by the scikit-learn library in Python was used in this project. The means of all the attributes per output label class, was provided as an initialization parameter for the model. This mean, along with the covariances and the weights, were improved over subsequent iterations. Evaluation was done using 4 different types of covariances - Spherical, Diagonal, Full and Tied.

1. Results

The accuracy in predictions by the Mixture of Gaussian models has been shown in the following Table:

Dataset Tied(%) Diagnol(%) Spherical(%) Full (%)
Training 86.29 79.44 43.15 42.83
User-1 70.37 85.19 92.59 88.89
User-2 94.74 94.74 78.95 100.00
User-3 77.78 100.00 11.11 -
User-4 89.29 53.57 25.00 -
User-4 60.00 60.00 - -

In the table above, it can be observed that the Diagonal and Tied type Covariance Matrix performs better than the other types. For the test datasets, 4 out of 5 users were predicted with an accuracy of over 70%. Each individual test datasets were of the length of a standard news article.

Dataset 1 2 3 4
Training 87.65 76.29 89.41 90.27
User-1 - 70.37 88.89 -
User-2 - 94.74 100 100.00
User-3 - 77.78 - 22.22
User-4 7.14 89.29 92.86 92.86
User-4 20.00 60.00 20.00 -

In the above table, the final model trained used mean to fill up missing values in bins, didn’t use no.of backspace per character as a feature and considered bins with more than 100 events and latency and hold time thresholds of 1.5 s.

  1. Using median to fill up missing values of bins
  2. Final Model
  3. Considering no. of backspaces per character as a feature
  4. Considering bins with any number of events and latency and hold time thresholds of 2s
  5. Using no. of backspaces per character as a feature and 4 additional features: lR,rL,rR,lL

Releases

No releases published

Packages

No packages published

Languages