From time to time I browse the projects listed on freshmeat.net. I came accross Sally, a tool for mapping strings into vector spaces. This mapping allows one to apply machine learning techniques and data mining to the analysis of string data. I looked at two of the examples given; Sally can be used to map documents to a vector space and build a classifier to categorize text documents using Support Vector Machines and map DNA sequences to a vector space, then build a classifier to detect the start of DNA sequences.
I wondered whether this tool could be applied to automated pattern generation for network intrusion detection or network protocol identification. Patterns are used by various network intrusion detection systems (NIDS) and packet identifiers like Linux's Netfilter. For example, to detect bittorrent traffic, packet payloads are searched to find a string of characters that uniquely identify the bittorrent protocol. L7 (http://l7-filter.sourceforge.net/protocols) is a good source for patterns. L7 lists "bittorrent protocol" as one of the patterns that could be used for identifying bittorrent on your network. The idea therefore, would be to generate or discover patterns like "bittorrent protocol" automatically using Sally. This would be a little different to the Sally examples I looked at where a classifier was genereted by a supervised learner. In my push for an automated pattern generation tool, I will be using Sally together with an unsupervised machine learning technique, called clustering, to discover patterns in network traffic for any particular protocol. The idea would be to map bittorrent packets to a vector space using Sally, and then cluster those vector spaces. The generated clusters would each represent one or more possible patterns.
But first, how does Sally's mapping work ? I did'nt get it unitl I looked at the source code. It calculates the hash for every n-gram in the string and uses that hash value as a feature. The value for that feature is the total number of times that hash value is found in a string. If you chose to work with a byte 8-gram then you would have exactly 12 non-zero hash-value features for the string containing "bittorrent protocol", one each for "bittorre", "ittorren", "ttorrent", "torrent ", "orrent p", "rrent pr", "rent pro", "ent prot", "nt proto", "t protoc", " protoco" and "protocol". The hash values calculated for these n-grams with their corresponding counts look like this:
313166:1 575006:1 667871:1 1104474:1 1213512:1 1355539:1 1382445:1 1532866:1 1559069:1 2132221:1 2874899:1 2985843:1
A sparse vector of count values is generated and not all clustering algorithms can readily deal with a high-dimensional sparse feature set. The Cluto clustering toolkit can. It's not open-source but they do provide a binary for non-comercial use. Next time, I will be atempting to cluster the feature set Sally generates and hopefully use the clusters to discover patterns in network traffic.
Wednesday, March 16, 2011
Automated Pattern Discovery From Network Traffic
Subscribe to:
Post Comments (Atom)
0 comments:
Post a Comment