Generally, in real-time network traffic the payload features are long and of different data types. It becomes extremely difficult for any intelligent system, such as a machine learning system, to handle such long payload features. In the present work a large number of experiments were carried out on the ISCX data set. In the literature, researchers have used different IDS data sets for testing their models. However, in this paper the ISCX 2012 intrusion detection data set is used for better comparison of results, because some of the recent works [16] used this same data set. This data set was generated by the Information Security Centre of Excellence (ISCX) at the University of New Brunswick in 2012 [1].
The data set consists of real traces analysed to create profiles for agents that generate real traffic for FTP, HTTP, SMTP, SSH, IMAP, POP3, etc. [1]. The generated data set contains various features that include full packet payloads in addition to other relevant features such as the total number of bytes sent and/or received. This ISCX data set consists of different types of features, such as numeric, alphanumeric, date, time, categorical and string.
Usually the packet header information is represented by a combination of the above types, but the payload features are usually represented by very long string values, which makes them difficult for any machine learning algorithm to deal with. To address this problem, encoding schemes have been chosen to encode these features using bigram and trigram techniques. Fig. 1 illustrates the main steps of the feature extraction process that is employed to extract features using the proposed scheme.
The present approach and algorithms are very similar to the procedures in [16], but in the present study, along with the bigram scheme, a trigram approach is also studied and experimented with on the same data set. These bigram/trigram techniques are used with payload features to investigate whether the payload features contain informative features or not. This was chosen because most research ignores these features due to their long strings, which make them difficult to utilize in machine learning.
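The following is an added example, not code from the paper or from [16], of how a variable-length payload string can be turned into a fixed-width bigram/trigram feature row of the kind described above. It assumes character-level n-grams and the hashing trick (each n-gram hashed into one of `width` buckets) as the encoding; the HTTP request line used as input is constructed here and is not taken from the ISCX data set.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def char_ngrams(payload: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a payload string.

    A payload shorter than n has no n-grams, so an empty list is returned
    instead of raising; real traces contain many such short payloads.
    """
    if n < 1 or len(payload) < n:
        return []
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def ngram_counts(payload: str, n: int) -> Dict[str, int]:
    """Raw n-gram frequency table (unbounded vocabulary, for inspection)."""
    return dict(Counter(char_ngrams(payload, n)))


def hashed_ngram_vector(payload: str, n: int, width: int = 256) -> List[int]:
    """Map n-gram counts into a fixed-width vector via hashing (assumed scheme).

    The vocabulary of a long payload is open-ended, so instead of a dictionary
    each n-gram is hashed into one of `width` buckets and the bucket is
    incremented. Collisions are possible but the vector length is constant,
    which is what a classifier needs.
    """
    vec = [0] * width
    for gram in char_ngrams(payload, n):
        digest = hashlib.blake2b(gram.encode("utf-8", "surrogateescape"),
                                 digest_size=4).digest()
        vec[int.from_bytes(digest, "big") % width] += 1
    return vec


def encode_payload(payload: str, width: int = 256) -> List[int]:
    """Concatenate the bigram and trigram vectors into one feature row."""
    return (hashed_ngram_vector(payload, 2, width)
            + hashed_ngram_vector(payload, 3, width))


# Constructed sample payload (not from ISCX): a minimal HTTP request.
SAMPLE_PAYLOAD = "GET /index.html HTTP/1.1\r\nHost: example.test\r\n"

if __name__ == "__main__":
    row = encode_payload(SAMPLE_PAYLOAD)
    print(len(row), "features;", sum(row), "n-grams counted")
```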