Tuesday, March 22, 2011

Automated Pattern Discovery From Network Traffic (2)

Last time, I described a way to find pattern strings in network traffic using machine-learning tools and techniques and it is the goal of this post to describe the method and results of applying these techniques to real network traffic. As an experiment or proof-of-concept, I looked for patterns in Bit-torrent traffic. The results look very promising. We will see how I uncovered a couple of patterns that could be used and probably are used by NIDS and DPI products to identify Bit-torrent traffic.

I will not go into any detail whatsoever on how to capture Bit-torrent traffic using Wireshark because I believe it is incidental to what I really want to show; how to apply machine-learning techniques using Sally and Cluto for pattern discovery. However, it is important to say that I will only be looking for patterns in the first packet of each flow or stream. I copied the contents of the first packet of each flow to a separate file. These files provide the input data to Sally. As I described in my last post, Sally maps strings into a vector space which we then use as input to Cluto, a toolkit for clustering.
One problem I did run into when I tried to combine these two tools is that Cluto's expected input file format is different to the format Sally provides. I hacked Sally's source code slightly so that its output matched Cluto's expected format. You can download the patch here: sally_patch. If you wish to reproduce my results, you can also download Sally's configuration file I used: sally_configuration. The script I used to glue together Sally and Cluto can be found here: glue_sally_cluto.py. This being a proof-of-concept, do not expect to find production ready code.

The glue_sally_cluto.py script first runs Sally, then Cluto, the results of which are then copied into a separate directory called "clusters". The directory holds the contents of each cluster Cluto generated and each cluster contains the packets Cluto lumped together in the clustering process. In my experiment, I used the tool chain to separate 750 packets into 10 clusters. As it turns out, a simple visual inspection of the clusters gave me the patterns. I found two: "BitTorrent protocol" and "d1:ad2:id20".

As an example, I looked at the contents of each packet in cluster "0":

dimitri@dimitri-laptop:/tmp/sig_analysis/clusters/0$ find -exec more {} \; 


BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol


The contents of each packet in cluster "5" indicate that an overwhelming number of packets start with the pattern "d1:ad2:id20":

dimitri@dimitri-laptop:/tmp/sig_analysis/clusters/5$ find -exec more {} \; 


L�RY � �O� 3�s!p�������� ��F� ;!7�����,��T������
�[��P ��7�ݞ �ڍQȝ�' ާ�                           ��V���v�� G�v=_/��@���G I���3 ]lV�˗|�� �t�f���Ż����ܐB32( ���>� �b C���#<y胀Y�:��v5��P�_��6[�5K󤜌�F `S
                     +�yp
d1:ad2:id20:
� h�B=A�~���lv� ,e1:q4:ping1:t4:�~
L6�J�A�\%<�c�e1:q4:ping1:t4:
d1:ad2:id20:Y�]{]� Pg=CA�,�B��0}e1:q4:ping1:t4:�6
d1:ad2:id20:{aB�Dǹ< d 0
                        A��e1:q4:ping1:t4:<b
d1:ad2:id20:hh ��Jb�νd��W%��|�e1:q4:ping1:t4:�(
d1:ad2:id20:���� �Y y���fc�] ���e1:q4:ping1:t4:�
d1:ad2:id20:<@˩
d1:ad2:id20:O�؇� ��"BG􃢅�e1:q4:ping1:t4:�l
d1:ad2:id20:CgC ���u ȡ��N�p���e1:q4:ping1:t4:rB
d1:ad2:id20:�T_}8 S��:�*�4���قe1:q4:ping1:t4:��
d1:ad2:id20:f$� �F0ik�D �_I k��e1:q4:ping1:t4:L�
�e1:q4:ping1:t4:e� 2~ג�=��ƷZ
d1:ad2:id20:�|IM��� ���r��E��e1:q4:ping1:t4:�m
d1:ad2:id20:JЅN�#�\bN�.|��}��e1:q4:ping1:t4:�+
d1:ad2:id20:��a�6ދO��˞ \��~ ��e1:q4:ping1:t4:}�
d1:ad2:id20: Z
d1:ad2:id20: ��� ^L!?b
d1:ad2:id20:
            D����{ �jUh�D



All the other clusters shared the same patterns; either "BitTorrent protocol" or "d1:ad2:id20".



Wednesday, March 16, 2011

Automated Pattern Discovery From Network Traffic

From time to time I browse the projects listed on freshmeat.net. I came accross Sally, a tool for mapping strings into vector spaces. This mapping allows one to apply machine learning techniques and data mining to the analysis of string data. I looked at two of the examples given; Sally can be used to map documents to a vector space and build a classifier to categorize text documents using Support Vector Machines and map DNA sequences to a vector space, then build a classifier to detect the start of DNA sequences.


I wondered whether this tool could be applied to automated pattern generation for network intrusion detection or network protocol identification. Patterns are used by various network intrusion detection systems (NIDS) and packet identifiers like Linux's Netfilter. For example, to detect bittorrent traffic, packet payloads are searched to find a string of characters that uniquely identify the bittorrent protocol. L7 (http://l7-filter.sourceforge.net/protocols) is a good source for patterns. L7 lists "bittorrent protocol" as one of the patterns that could be used for identifying bittorrent on your network. The idea therefore, would be to generate or discover patterns like "bittorrent protocol" automatically using Sally. This would be a little different to the Sally examples I looked at where a classifier was genereted by a supervised learner. In my push for an automated pattern generation tool, I will be using Sally together with an unsupervised machine learning technique, called clustering, to discover patterns in network traffic for any particular protocol. The idea would be to map bittorrent packets to a vector space using Sally, and then cluster those vector spaces. The generated clusters would each represent one or more possible patterns.


But first, how does Sally's mapping work ? I did'nt get it unitl I looked at the source code. It calculates the hash for every n-gram in the string and uses that hash value as a feature. The value for that feature is the total number of times that hash value is found in a string. If you chose to work with a byte 8-gram then you would have exactly 12 non-zero hash-value features for the string containing "bittorrent protocol", one each for "bittorre", "ittorren", "ttorrent", "torrent ", "orrent p", "rrent pr", "rent pro", "ent prot", "nt proto", "t protoc", " protoco" and "protocol". The hash values calculated for these n-grams with their corresponding counts look like this:


313166:1 575006:1 667871:1 1104474:1 1213512:1 1355539:1 1382445:1 1532866:1 1559069:1 2132221:1 2874899:1 2985843:1


A sparse vector of count values is generated and not all clustering algorithms can readily deal with a high-dimensional sparse feature set. The Cluto clustering toolkit can. It's not open-source but they do provide a binary for non-comercial use. Next time, I will be atempting to cluster the feature set Sally generates and hopefully use the clusters to discover patterns in network traffic.

Wednesday, March 02, 2011

Test Code Coverage with Objdump and Valgrind

I was working on a C++ project recently where gcov inexplicitly failed to report correct results, especially for template files. Some templated code was incorrectly being reported as not-covered by my tests. Instead of doing the right thing by finding out why gcov was not working with my source code, I decided to write a python script that combined the outputs of objdump and valgrind into a test coverage report. It worked well enough and I thought I would share it.  Here is an extract of a test coverage report:

---------------------------------
File: HTTPmsg.h
Code not executed in: HTTPmsg<TCPfragmsg>::getContentEncoding() const
line no: 110

Code not executed in: HTTPmsg<TCPfragmsg>::DJBHash(char const*, char const*)
line no: 271

Code not executed in: HTTPmsg<TCPmsg>::isCandidateRequest() const
line no: 215

Code not executed in: HTTPmsg<TCPfragmsg>::isCandidateResponse() const
line no: 236 238 239 242 248 251 252

Code not executed in: HTTPmsg<TCPfragmsg>::getContentDirection() const
line no: 105

Compiled lines: 82, Not executed: 11
Code coverage:  86%
---------------------------------

The script can be found here: coverage

Tested on objdump (GNU Binutils for Ubuntu) 2.20.1 and valgrind-3.6.0.SVN-Debian.


Below, you can see the output of objdump and valgrind's for the 'isCandidateRequest' member function. The objdump output describes what code was compiled and the output of callgrind describes what code was executed. The script works by checking for common line numbers between them. Line numbers that appear in the object dump and not in the valgrind output are marked as non-executed source code.

objdump:  'isCandidateRequest' member function:

080e9498 <HTTPmsg<TCPmsg>::isCandidateRequest() const>:
_ZNK7HTTPmsgI6TCPmsgE18isCandidateRequestEv():
/projects/conduits/conduits/HTTPmsg.h:203
 80e9498:       55                       push   %ebp
 80e9499:       89 e                    mov    %esp,%ebp
 80e949b:       83 e28                sub    $0x28,%esp
/projects/conduits/conduits/HTTPmsg.h:205
 80e949e:       8b 45 08              mov    0x8(%ebp),%eax
 80e94a1:       8b 80 e0 00         mov    0xe0(%eax),%eax
 80e94a7:       85 c0                   test   %eax,%eax
 80e94a9:       75 26                   jne    80e94d1 <HTTPmsg<TCPmsg>::isCandidateRequest() const+0x39>
/projects/conduits/conduits/HTTPmsg.h:206
 80e94ab:       b9 00 00 00 00    mov    $0x0,%ecx
 80e94b0:       a1 e8 7e 26 08    mov    0x8267ee8,%eax
 80e94b5:       8b 15 ec 7e 26    mov    0x8267eec,%edx
 80e94bb:       83 c0 01              add    $0x1,%eax
 80e94be:       83 d2 00              adc    $0x0,%edx
/projects/conduits/conduits/HTTPmsg.h:215
 80e9624:       b9 00 00 00 00    mov    $0x0,%ecx
 80e9629:       a1 20 7f 26 08     mov    0x8267f20,%eax
 80e962e:       8b 15 24 7f 26     mov    0x8267f24,%edx
 80e9634:       83 c0 01              add    $0x1,%eax
 80e9637:       83 d2 00              adc    $0x0,%edx
 80e963a:       a3 20 7f 26 08     mov    %eax,0x8267f20


Callgrind: 'isCandidateRequest' member function:

fn=HTTPmsg<TCPmsg>::isCandidateRequest() const
203 87
205 116
jcnd=29/29 206
205
206 174