Tuesday, March 22, 2011

Automated Pattern Discovery From Network Traffic (2)

Last time, I described a way to find pattern strings in network traffic using machine-learning tools and techniques. The goal of this post is to describe the method and the results of applying these techniques to real network traffic. As a proof of concept, I looked for patterns in BitTorrent traffic. The results look very promising: I uncovered a couple of patterns that could be used, and probably are used, by NIDS and DPI products to identify BitTorrent traffic.

I will not go into detail on how to capture BitTorrent traffic using Wireshark, because it is incidental to what I really want to show: how to apply machine-learning techniques using Sally and Cluto for pattern discovery. However, it is important to say that I only look for patterns in the first packet of each flow or stream. I copied the contents of the first packet of each flow to a separate file, and these files provide the input data to Sally. As I described in my last post, Sally maps strings into a vector space, which we then use as input to Cluto, a toolkit for clustering.

One problem I ran into when combining these two tools is that Cluto's expected input file format differs from the format Sally produces. I patched Sally's source code slightly so that its output matches Cluto's expected format. You can download the patch here: sally_patch. If you wish to reproduce my results, you can also download the Sally configuration file I used: sally_configuration. The script I used to glue Sally and Cluto together can be found here: glue_sally_cluto.py. This is a proof of concept, so do not expect production-ready code.
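
To make the format question concrete, here is a minimal Python sketch of the general idea: turn each payload into a vector of byte n-gram counts, then print those vectors in a sparse matrix layout of the kind Cluto reads (a header line with rows, columns and non-zero entries, followed by one line per row of 1-based "column value" pairs). This is a simplified stand-in, not Sally's embedding and not the patch mentioned above. The function names, the choice of n = 3 and the use of raw counts are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(payload, n=3):
    """Count every byte n-gram in one payload (hypothetical helper)."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


def to_cluto_sparse(payloads, n=3):
    """Render n-gram count vectors as a Cluto-style sparse matrix string."""
    rows = [ngram_counts(p, n) for p in payloads]
    # Give each distinct n-gram a 1-based column index, in sorted order.
    vocab = {g: i + 1 for i, g in enumerate(sorted(set().union(*rows)))}
    nnz = sum(len(r) for r in rows)
    # Header line: number of rows, number of columns, number of non-zeros.
    lines = [f"{len(rows)} {len(vocab)} {nnz}"]
    for r in rows:
        cells = sorted((vocab[g], count) for g, count in r.items())
        lines.append(" ".join(f"{col} {count}" for col, count in cells))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_cluto_sparse([b"abcd", b"bcd"]), end="")
```

Each payload becomes one row, so the row order must be remembered in order to map cluster labels back to packet files later.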

The glue_sally_cluto.py script runs Sally first, then Cluto, and copies the results into a separate directory called "clusters". This directory holds one subdirectory per cluster Cluto generated, and each subdirectory contains the packets Cluto grouped together during clustering. In my experiment, I used the toolchain to separate 750 packets into 10 clusters. As it turns out, a simple visual inspection of the clusters gave me the patterns. I found two: "BitTorrent protocol" and "d1:ad2:id20".
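
The copying step can be sketched roughly as follows. This is not the code from glue_sally_cluto.py. It assumes the cluster labels have already been read into a list with one label per packet file, in the same order as the rows given to the clustering tool. The function name and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def copy_into_clusters(packet_files, labels, out_dir="clusters"):
    """Copy each packet file into out_dir/<label>/ and return the layout."""
    if len(packet_files) != len(labels):
        raise ValueError("need exactly one label per packet file")
    layout = {}
    for src, label in zip(packet_files, labels):
        dest = Path(out_dir) / str(label)
        # Create the per-cluster directory on first use.
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dest / Path(src).name)
        layout.setdefault(label, []).append(Path(src).name)
    return layout
```

With the files grouped this way, every cluster can be inspected on its own by listing or paging through a single directory.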

As an example, I looked at the contents of each packet in cluster "0":

dimitri@dimitri-laptop:/tmp/sig_analysis/clusters/0$ find -exec more {} \; 


BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol
BitTorrent protocol



The contents of the packets in cluster "5" show that the overwhelming majority contain the pattern "d1:ad2:id20", usually at the very start:

dimitri@dimitri-laptop:/tmp/sig_analysis/clusters/5$ find -exec more {} \; 


L�RY � �O� 3�s!p�������� ��F� ;!7�����,��T������
�[��P ��7�ݞ �ڍQȝ�' ާ�                           ��V���v�� G�v=_/��@���G I���3 ]lV�˗|�� �t�f���Ż����ܐB32( ���>� �b C���#<y胀Y�:��v5��P�_��6[�5K󤜌�F `S
                     +�yp
d1:ad2:id20:
� h�B=A�~���lv� ,e1:q4:ping1:t4:�~
L6�J�A�\%<�c�e1:q4:ping1:t4:
d1:ad2:id20:Y�]{]� Pg=CA�,�B��0}e1:q4:ping1:t4:�6
d1:ad2:id20:{aB�Dǹ< d 0
                        A��e1:q4:ping1:t4:<b
d1:ad2:id20:hh ��Jb�νd��W%��|�e1:q4:ping1:t4:�(
d1:ad2:id20:���� �Y y���fc�] ���e1:q4:ping1:t4:�
d1:ad2:id20:<@˩
d1:ad2:id20:O�؇� ��"BG􃢅�e1:q4:ping1:t4:�l
d1:ad2:id20:CgC ���u ȡ��N�p���e1:q4:ping1:t4:rB
d1:ad2:id20:�T_}8 S��:�*�4���قe1:q4:ping1:t4:��
d1:ad2:id20:f$� �F0ik�D �_I k��e1:q4:ping1:t4:L�
�e1:q4:ping1:t4:e� 2~ג�=��ƷZ
d1:ad2:id20:�|IM��� ���r��E��e1:q4:ping1:t4:�m
d1:ad2:id20:JЅN�#�\bN�.|��}��e1:q4:ping1:t4:�+
d1:ad2:id20:��a�6ދO��˞ \��~ ��e1:q4:ping1:t4:}�
d1:ad2:id20: Z
d1:ad2:id20: ��� ^L!?b
d1:ad2:id20:
            D����{ �jUh�D



All the other clusters shared the same two patterns: either "BitTorrent protocol" or "d1:ad2:id20".
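
As a rough illustration of how such patterns could serve as signatures, the Python sketch below checks a first-packet payload against the two strings found above. The function names are hypothetical. Matching a substring anywhere in the payload is a simplifying assumption: a real NIDS or DPI rule would likely also constrain offsets, ports or protocol to reduce false positives.

```python
# The two patterns discovered by inspecting the clusters.
PATTERNS = (b"BitTorrent protocol", b"d1:ad2:id20")


def match_pattern(payload):
    """Return the first known pattern found in the payload, or None."""
    for pattern in PATTERNS:
        if pattern in payload:
            return pattern
    return None


def looks_like_bittorrent(payload):
    """True if the payload contains either discovered pattern."""
    return match_pattern(payload) is not None


if __name__ == "__main__":
    samples = [
        b"\x13BitTorrent protocol" + bytes(8),
        b"d1:ad2:id20:" + bytes(20) + b"e1:q4:ping1:t4:ab",
        b"GET / HTTP/1.1\r\n",
    ]
    for payload in samples:
        print(looks_like_bittorrent(payload))
```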


