Inferring Protocol State Machine from Real-World Trace

Yipeng Wang (1,2), Zhibin Zhang (1), and Li Guo (1)

(1) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
(2) Graduate University, Chinese Academy of Sciences, Beijing, China
[email protected]

Abstract. Application-level protocol specifications are helpful for network security management, including intrusion detection, intrusion prevention, and malicious code detection. However, current methods for obtaining unknown protocol specifications rely heavily on manual operations, such as reverse engineering. This poster provides a novel insight into inferring a protocol state machine from the real-world trace of an application. The chief feature of our method is that it requires no a priori knowledge of the protocol format; our technique is based on the statistical nature of protocol specifications. We evaluate our approach with text and binary protocols, and our experimental results demonstrate that the proposed method performs well in practice.


Introduction and System Architecture

Finding protocol specifications is a crucial issue in network security: detailed knowledge of a protocol specification is helpful in many network security applications, such as intrusion detection systems and vulnerability discovery. Within the task of extracting protocol specifications, inferring the protocol state machine plays a particularly important role in practice. ScriptGen [1] is an attempt to infer a protocol state machine from network traffic; however, the technique is limited by its lack of generalization. This poster provides a novel insight into inferring a protocol state machine from the real-world packet trace of an application. Moreover, we propose a system that automatically extracts the protocol state machine of a stateful network protocol from Internet traffic. The input to our system is a real-world trace of a specific application, and the output is the protocol state machine of that application. Our system has the following features: (a) it requires no knowledge of the protocol format, (b) it is appropriate for both text and binary protocols, and (c) the inferred protocol state machine is of good quality.

The objective of our system is to infer the specification of a protocol that is used for communication between different hosts. To this end, our system proceeds in four phases, as follows.

Network data collection. In this phase, the network traffic of a specific application (such as SMTP or DNS) is collected. In this poster, we collect the packets observed on a specific transport-layer port.

S. Jha, R. Sommer, and C. Kreibich (Eds.): RAID 2010, LNCS 6307, pp. 498–499, 2010. © Springer-Verlag Berlin Heidelberg 2010

Fig. 1. The protocol state machines of SMTP (left) and XUNLEI (right)

Packet analysis. In this phase, we first look for high-frequency units in the off-line application-layer packet headers obtained during network data collection. Then, we employ the Kolmogorov-Smirnov (K-S) test to determine the optimal number of units. Finally, we replay each application-layer packet header and construct protocol format messages from the selected units.

Message clustering. In this phase, we extract a feature from each protocol format message. The feature is used to measure the similarity between messages. Then, the partitioning around medoids (PAM) clustering algorithm is applied to group similar messages into clusters. Finally, the medoid message of each cluster becomes a protocol state message.

State machine inference. To infer the protocol state machine, we need to know the packet state sequence of each flow. To label the state of a packet, we first find the medoid message nearest to the packet and assign that medoid's label type to the packet. Then, by finding the relationships between the different state types, a protocol state machine is constructed. After state machine minimization, we obtain the final protocol state machine.
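The message clustering and state machine inference phases can be illustrated with a minimal Python sketch. It is not the authors' implementation. The poster does not specify the message feature or the PAM initialization, so the sketch assumes a Jaccard distance over the set of units in each message and starts from the first k messages; the names jaccard_distance, pam, label, and infer_transitions are hypothetical. State machine minimization is omitted, and k must not exceed the number of messages.

```python
from itertools import chain


def jaccard_distance(a, b):
    """Distance between two messages viewed as sets of units (assumed feature)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def total_cost(items, medoids, dist):
    """Sum of the distances from every item to its nearest medoid."""
    return sum(min(dist(x, items[m]) for m in medoids) for x in items)


def pam(items, k, dist):
    """Greedy swap phase of PAM; returns the indices of the k medoids."""
    medoids = list(range(k))  # assumption: naive start from the first k items
    best = total_cost(items, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(items)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(items, trial, dist)
                if cost < best:  # keep any swap that strictly lowers the cost
                    medoids, best, improved = trial, cost, True
    return medoids


def label(message, items, medoids, dist):
    """State label of a message: the position of its nearest medoid."""
    return min(range(len(medoids)),
               key=lambda j: dist(message, items[medoids[j]]))


def infer_transitions(flows, k, dist=jaccard_distance):
    """Cluster all messages, label each flow, and collect state transitions."""
    items = list(chain.from_iterable(flows))
    medoids = pam(items, k, dist)
    transitions = set()
    for flow in flows:
        state = "START"  # hypothetical initial state shared by all flows
        for message in flow:
            nxt = label(message, items, medoids, dist)
            transitions.add((state, nxt))
            state = nxt
    return transitions


# Two toy SMTP-like flows; each message is a tuple of units.
flows = [
    [("HELO", "a"), ("MAIL", "FROM", "a"), ("QUIT",)],
    [("HELO", "b"), ("MAIL", "FROM", "b"), ("QUIT",)],
]
print(sorted(infer_transitions(flows, k=3), key=str))
```

Each pair in the result is an edge of the unminimized state machine, with states named by cluster position. The exhaustive swap search recomputes the full cost for every candidate, which is acceptable for a toy example but would need caching of distances on a real trace.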



We use SMTP (a text protocol) and XUNLEI (a binary protocol) to test and verify our method. The inferred protocol state machine of SMTP is shown in Fig. 1 (left), and that of XUNLEI in Fig. 1 (right). Moreover, our evaluation experiments show that our system is capable of parsing about 86% of the SMTP flows and about 90% of the XUNLEI flows.
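One way to read such a parsing rate is as the fraction of labeled flows that the inferred state machine accepts. The poster does not define the measure, so the following Python sketch rests on an assumption: a flow counts as parsed when every step of its label sequence follows an edge of the machine, starting from a hypothetical START state. The names accepts and coverage are illustrative only.

```python
def accepts(transitions, labels, start="START"):
    """True if every consecutive step of the label sequence is a known edge."""
    state = start
    for nxt in labels:
        if (state, nxt) not in transitions:
            return False
        state = nxt
    return True


def coverage(transitions, flows):
    """Fraction of flows whose label sequences the state machine accepts."""
    if not flows:
        return 0.0
    return sum(accepts(transitions, f) for f in flows) / len(flows)


# Toy machine and flows with made-up labels, for illustration only.
machine = {("START", "HELO"), ("HELO", "MAIL"), ("MAIL", "QUIT")}
test_flows = [
    ["HELO", "MAIL", "QUIT"],  # follows the machine
    ["HELO", "QUIT"],          # HELO -> QUIT is not an edge
    ["HELO", "MAIL"],          # prefix of a valid path, accepted here
    ["MAIL"],                  # START -> MAIL is not an edge
]
print(coverage(machine, test_flows))
```

The sketch accepts any valid prefix because it has no notion of final states; a stricter measure would also require the flow to end in an accepting state.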

Reference

1. Leita, C., Mermoud, K., Dacier, M.: ScriptGen: an automated script generation tool for honeyd. In: Annual Computer Security Applications Conference (2005)

