A Steganographic Approach to Localizing Botmasters


2014 28th International Conference on Advanced Information Networking and Applications Workshops

Julian L. Rrushi
Center for Cybersecurity
British Columbia Institute of Technology
3700 Willingdon Avenue, Burnaby, BC, V5G 3H2, Canada
[email protected]

enforcement to localize botmasters. We have devised a novel application of steganography [4], which enables law enforcement to insert a hidden watermark into a decoy document. That decoy document is leaked via a honeynet, and thus is subject to network tracking conducted by collaborating Internet Service Providers (ISPs). The proposed approach exploits storage of the marked decoy document on a botmaster's machine. The steganography that underlies that approach is such that it enables law enforcement to leverage unauthorized possession of the marked decoy document into evidence of involvement in botnet operation. As pointed out by Bowen et al. [2], in practice, the construction of a perfect decoy document might be unachievable. The main contribution of our research consists in addressing that specific challenge. Our research generates a marked decoy document that is virtually identical to a genuine document. A botmaster is given no clues of any hidden bits within the marked decoy document. In this paper, we explain our research as applied to Microsoft Word documents. The remainder of this paper is organized as follows. In Section II we provide a formal definition of the problem domain that we address with this research. In Section III we discuss related research. Section IV describes the novel steganography used to generate and hide a watermark within a honeytoken. In Section V we discuss network tracking of a leaked honeytoken to a botmaster's machine. In Section VI we summarize our contribution, and thus conclude the paper.

Abstract—Law enforcement employs an investigative approach based on marked money bills to track illegal drug dealers. In this paper we discuss research that aims at providing law enforcement with the cyber counterpart of that approach in order to track perpetrators that operate botnets. We have devised a novel steganographic approach that generates a watermark hidden within a honeytoken, i.e. a decoy Word document. The covert bits that comprise the watermark are carried via secret interpretation of object properties in the honeytoken. The encoding and decoding of object properties into covert bits follow a scheme based on bijective functions generated via a chaotic logistic map. The watermark is retrievable via a secret cryptographic key, which is generated and held by law enforcement. The honeytoken is leaked to a botmaster via a honeynet. In the paper, we elaborate on possible means by which law enforcement can track the leaked honeytoken to the IP address of a botmaster's machine. Keywords-botnets; steganography; computer security

I. INTRODUCTION

Network operation of botnets represents a major cybercrime vector at the present time. While various botnet mitigation solutions have been proposed both in academia and in industry, localizing botmasters and hence holding them accountable for their actions may be the ultimate solution to the botnet crime phenomenon. The wide distribution of botnets over large geographic areas, along with their decentralized architecture in the majority of their contemporary forms, i.e. peer-to-peer (P2P), makes identification of botmasters via traditional network forensics quite challenging. We investigated whether any of the effective investigative techniques devised for use by law enforcement could be adapted to cyberspace in order to localize active botmasters. The presence of a parallel between the two worlds, along with the similarity in their respective perpetrator tracking challenges, made the idea appealing. Law enforcement agencies have special instruments and technical knowledge that enable them to mark money bills in a way that is undetectable by any other parties. Those marked money bills are used by undercover agents to buy illegal drugs from illegal drug dealers. If a suspect is found in possession of large amounts of marked money bills, then those bills are used as evidence in a formal charge against the suspect [20]. In this paper we discuss a cyber counterpart of that technique, and also consider it in conjunction with network tracking. Our approach can be used by law

978-1-4799-2652-7/14 $31.00 © 2014 IEEE DOI 10.1109/WAINA.2014.152

II. PROBLEM STATEMENT

In a botnet, specific botnet member hosts serve as repositories of sensitive data that bots extract from infected hosts. In order for a botmaster to acquire or access the sensitive data, he/she has to use his/her own machine, i.e. a machine the botmaster has physical access to, to connect to those repositories. In most botnets that have a centralized command and control architecture, bots send the sensitive data to a command and control server. Thus, in those cases it is a command and control server that acts as a repository of sensitive data. In botnets that have a peer-to-peer (P2P) architecture [25], any botnet member host may serve as a repository of sensitive data for specific periods of time. P2P botnets establish an overlay network, which bots employ to communicate with each other [16]. Sensitive data are propagated through peer bots over such overlay network until reaching the peer bots currently serving as repositories of sensitive data.


supposed to be accessed. Any access to a honeytoken of that kind is deemed by the defender to be indicative of a security breach. Yuill et al. [26] instrument a network file system to support creation and monitoring of honeytokens, which the authors refer to as honeyfiles. Honeyfiles are bait files that are intended for an attacker to access. Honeyfiles are designed such as to be no different than ordinary files in the file system of a computer system. For an attacker to obtain the sensitive data stored in a honeyfile, he/she would have to access that honeyfile, which will result in an intrusion alert being raised by the instrumented file system. Bowen et al. [2] propose an approach that employs honeytokens to detect insider threats in an organization, whom the authors categorize as masqueraders or traitors. Masqueraders impersonate other inside users in an organization, while traitors act by using their own legitimate credentials. Bowen et al. developed a web application that automatically generates honeytokens, which registered users store in their machines. A host-based application is employed to monitor possible accesses to such honeytokens. The registered user will receive an alert from the host-based application in case the honeytoken stored in his/her machine is accessed. Shue proposes a similar approach to detect insider threats in an organization [21]. Genuine sensitive documents are automatically scanned to identify segments of sensitive data. The data in those segments are replaced with fake but plausible data. The altered documents are used as honeytokens, and thus are stored on semipublic resources to look like accidental leaks. Any possible access to any of such honeytokens is reported to the organization's security personnel. Our approach designs and leverages a honeytoken in a different manner. Our approach exploits storage of a honeytoken on a botmaster's machine rather than file system access to that honeytoken. 
The design of a honeytoken as engineered in the previously discussed related research would not work in our problem domain. That design does not meet the operational requirements discussed in the previous section. Because of that reason, our approach goes beyond placing bogus sensitive data on a file by proposing a novel steganographic scheme, which enables the defender to hide a watermark within the file in a way that meets the requirements in question. Catlett and He [3] propose an approach that generates and employs honeytokens for the purpose of localizing criminal organizations involved in financial cybercrimes. A honeytoken is prepared as in the related works discussed previously in this section, i.e. by placing bogus data on a file. The bogus data may comprise mathematically valid credit card numbers, social security numbers, bank account numbers, personal identification numbers (PINs), e-banking login credentials, account information generated randomly from census data of common names and street addresses, etc. Again, all of those data are fake, and thus do not correspond to a concrete individual. Catlett and He also work on the consistency of the bogus data by putting in place realistic correlations between the different data fields in the honeytoken. For example, the authors ensure that an account PIN matches the birthday of the fictitious individual that

Connections of a botmaster's machine to botnet member hosts acting as repositories of sensitive data, either directly or through intermediary hosts, may represent an opportunity for law enforcement to probe that botmaster for the purpose of inferring his/her profile. In this paper we investigate how to employ honeytokens in a way that provably reveals the geographic location from which a botmaster operates, and hence possibly localizes that botmaster. A honeytoken is a digital or information system resource whose value lies in the unauthorized use of that resource [18]. Examples of a honeytoken include a credit card number, an Excel or PowerPoint file, a database table or view, a bogus login account, etc. [22]. In this paper we argue that not only unauthorized use, but also unauthorized possession, of a honeytoken can be of value to a defender. We note from [22] the flexibility in what can be used as a honeytoken and how that honeytoken can be leveraged. It is quite feasible to leak honeytokens to botmasters in various ways. Honeynets [18], for instance, are a practical tool that can be employed for that purpose. Finding a way to track network propagation of a leaked honeytoken from a honeypot to a machine that serves as a repository of sensitive data in a botnet, and from there possibly to a botmaster's machine, has potential for revealing the Internet Protocol (IP) address that was used by the botmaster's machine to connect to the botnet. Various techniques such as IP2Geo [17] exist for mapping an IP address to a geographic location. Thus, the derivative outcome of honeytoken tracking provides law enforcement with a search area that can be used to prioritize leads and better allocate resources. A key objective of such an investigative process is to induce a botmaster to store one or more honeytokens on any of his/her computer devices. Legally authorized physical inspection of those devices would reveal unauthorized possession of the honeytokens.
A fundamental research challenge lies in honeytoken generation, which represents much of the focus of this paper. A honeytoken in this research should meet specific requirements for the overall investigative process to be effective. A honeytoken should be indistinguishable from its genuine counterparts. A honeytoken should be designed and implemented in a way that does not present or leak any indicators that suggest it is a decoy. A honeytoken should be designed and implemented so as to enable law enforcement to prove in a court of law that the honeytoken found on the computer device of an individual constitutes unauthorized possession of that honeytoken. A possible way for law enforcement to deliver that proof consists in unequivocally demonstrating that they, i.e. law enforcement, have generated the honeytoken in question, and thus are the originators of that honeytoken. We strictly treat the problem domain discussed and addressed in this paper as pure academic research, and thus we ourselves do not conduct or imply any operational application of this research.

III. RELATED WORK

Various related research efforts explore the idea of monitoring possible accesses to deceptive files serving as honeytokens. Such honeytokens have no production value, and thus are not


or a gmail server would be handled by the proxy on behalf of the perpetrator's machine. The perpetrator's machine would only connect to the compromised machine that is hosting the proxy, while it would be the latter to connect to a machine hosting the banking application or the gmail server. The banking application or the gmail server has visibility only on the IP address of the compromised machine where the proxy is running. The IP address of the perpetrator’s machine is not visible to the banking application or to the gmail server. It is quite common for perpetrators involved in cybercrimes to have gained access to vulnerable machines in Internet, which they could use as intermediary points during online use of a target account. A botmaster, for example, has remote control over the compromised hosts that comprise a botnet, each of which may be leveraged to host the proxy. An example of evasion technique that renders the IP address of the perpetrator's machine unavailable is the case in which the perpetrator makes use of onion routing [6], i.e. legitimate networking infrastructure deployed in Internet that uses encryption to hide the source IP address of network traffic. In that case the banking application or gmail server do not have access to the IP address in question, therefore the overall tracking approach is evaded. To our understanding, there are techniques that avoid tracking based on physical addresses or other data specified during access to trap accounts. Nevertheless, that discussion lies outside the scope of both this research and our area of expertise, therefore we do not treat it in this paper.

account is associated with. The authors deem that random and unrelated data fields may be less realistic in certain cases. Unlike most of related research, Catlett and He integrate the bogus data they store in honeytokens with genuine data stored in genuine computer resources. For example, the approach proposed by the authors includes creation of an actual account associated with the bogus data about a fictitious individual. That additional effort helps the defender make the bogus data in a honeytoken even more consistent. A sophisticated criminal organization may have the ability to validate the existence of an account before taking any actual steps to use it. After generating such honeytokens, the authors propose to leak or disseminate those honeytokens to criminal organizations. Online accesses using data from the honeytokens in question are detected by a banking web application, which logs session data about the transaction. Those session data include the IP address of the host being used by a perpetrator to access a trap account, any physical addresses specified during the transaction, other customer account numbers involved in the transaction, etc. The logged session data is then analyzed to track the criminal organization entrapped by the approach in question. Bowen et al. [2] propose a similar exploitation of honeytokens for insider threat detection. The bogus data that Bowen et al. propose to store in honeytokens include computer logins that provide no access to valuable resources, e-banking Internet login credentials, credit card numbers, etc. The honeytokens are stored on production hosts, which if accessed by malicious insiders will leak the bogus data to them. As in [3], the trap accounts are created in practice. Those trap accounts are periodically monitored by a series of custom scripts. If a trap account is accessed, those custom scripts will retrieve the connection data that pertains to that access. 
A concrete example of a trap login is a pair of username and password associated with a gmail account. The custom scripts access that gmail account to retrieve account activity data, namely the IP addresses for the previous five account accesses along with the specific times at which those accesses were made. If a malicious insider accesses the gmail trap account, the IP address of the host he/she uses to connect along with the actual time of access will become available to the defender. The IP address represents key data on tracking perpetrators as it pinpoints a specific host within an organization's information technology (IT) infrastructure or a geographic location which it maps to. Nevertheless, we deem that the authenticity, and in some cases the availability, of a logical binding between a perpetrator and the IP address logged from an access to a trap account is questionable. The barrier lies in various techniques that allow for evading IP address tracking. Let us look into a concrete example of such techniques. A perpetrator could install in a compromised machine in Internet a proxy of the protocol carrying the access to a target account. In the case of research discussed in [2] and [3], the installed proxy would be a hypertext transfer protocol (HTTP) proxy. The perpetrator then would access the target account via the proxy. The HTTP requests towards a banking application

IV. HIDING DATA IN A HONEYTOKEN

We now discuss the engineering of a honeytoken. Although we describe our approach as applied to the use of an Office Word file as honeytoken, our discussion applies similarly to files with any other binary format, such as Excel, PowerPoint, Portable Document Format (PDF), etc. We have devised a novel steganographic approach, which we describe later on in this section, to hide a specific bitstream in the form of a watermark within a Word binary file that we use as honeytoken. The honeytoken is prepared so as to contain bogus sensitive data, such as a bogus credit card number, a nonexistent bank account number associated with the former, bogus online banking credentials, etc. The watermark is the keyed-hash message authentication code (HMAC) [9] of those bogus sensitive data. The defender generates a secret cryptographic key, which he/she uses in conjunction with a hash function such as the Message-Digest algorithm 5 (MD5) to generate the HMAC in question. The defender stores such a steganographically marked honeytoken on the hard disk drive of a honeypot in a honeynet. Infection of that honeypot by a botnet exposes the honeytoken to a possible theft performed by bot code, which will possibly send the honeytoken along with other stolen files over the Internet to a botmaster. Finding the honeytoken on the computer of an individual is indicative of active involvement of that individual in botnet crimes, including data theft and unauthorized access to a computer system, to the same extent as marked money bills found in an individual's possession are indicative of active involvement


question does not require any changes to the algorithms involved in processing FIB substructures and hence work with objects. In a honeytoken generated according to this research, each property of an object takes a value from a set of possible values, which is defined specifically for that property and object in [15], identically to any other Word binary file.

of that individual in crimes related to illegal drugs. The proposed steganographic approach allows the defender to retrieve the hidden HMAC of the aforementioned bogus sensitive data from the honeytoken, which proves that the honeytoken is marked. The defender's knowledge of the secret cryptographic key, which in conjunction with a hash function generates exactly the HMAC in question, proves that the honeytoken was generated by the defender. The idea of hiding an HMAC within a honeytoken, and thus involving a secret cryptographic key such as to prove authorship of that honeytoken, is a cyber counterpart of the techniques used by law enforcement to prove existence and authorship of a mark hidden in money bills used to make an illegal drugs buy arranged by undercover agents. In both of those cases, the covertly hidden data identify their source, i.e. the defender, and thus incriminate the actual possessor. An assumption that underlies this research is that botnets send a botmaster stolen files that have a binary format, such as Office Word files, in addition to stolen data extracted from files with a text format, network packets sniffed from the network, keystrokes intercepted via keyboard logging, etc. In defense of that assumption, we can mention a recent botnet, Kneber, which has been known to steal sensitive documents from infected Government computers. Kneber's code base comprises a Perl script that searches the hard drive of a compromised host for Word, Excel, and PDF files, and thus sends those files to a server located in some country. As another example, GhostNet is reported to have stolen sensitive documents such as Word files from compromised hosts, which it uploaded to a remote server.
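The watermark computation described in this section can be sketched with Python's standard hmac and hashlib modules. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the serialization of the bogus sensitive data as a UTF-8 string, and the sample record are hypothetical. The sample key reuses the example byte stream that the paper gives for a defender's secret cryptographic key.

```python
import hashlib
import hmac


def watermark_bits(secret_key: bytes, bogus_data: bytes) -> str:
    """Return the HMAC-MD5 of the bogus data as a string of 128 '0'/'1' bits."""
    digest = hmac.new(secret_key, bogus_data, hashlib.md5).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def proves_authorship(secret_key: bytes, bogus_data: bytes, extracted_bits: str) -> bool:
    """Check that bits extracted from a honeytoken match the defender's HMAC."""
    expected = watermark_bits(secret_key, bogus_data)
    return hmac.compare_digest(expected, extracted_bits)


# Example key from the paper; the record below is a hypothetical bogus entry.
key = bytes.fromhex("D1F0A85CB7D569264B773540E23083C9")
record = "card=4111111111111111;account=000123456".encode("utf-8")
bits = watermark_bits(key, record)
print(len(bits))  # number of watermark bits to hide
```

Only a party that holds the key can recompute the same 128 bits, which is what ties a recovered watermark back to the defender.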

TABLE I. SECRETLY INTERPRETABLE OBJECT PROPERTIES

Text     Character     Paragraph          Numbering
Font     Spacing       Alignment          None
Style    Position      Outline level      1.
Size     By (float)    Indentation left   I.

These characteristics, overall, make the proposed data hiding technique feasibly applicable and transparent. Examples of object properties related to a block of text that can be secretly interpreted to carry hidden bits are given in Table I. As an interpretation instance, a specific value of text font such as Times New Roman can be interpreted as bit 1, while another specific value of text font such as Arial can be interpreted as bit 0. Similarly, a text size 12 can be interpreted as bit 1, and a text size 14 can be interpreted as bit 0. And so on. Although we find it straightforward to generate automatically a honeytoken that contains HMAC bits hidden according to the proposed data hiding technique, our experience with this research suggests the necessity of a visual verification of such honeytoken by a human. The reason is that not all combinations of object properties in a Word binary file would be found in real-world Word documents. Some of those combinations of object properties generate difficult to read or distorted content, and thus are unlikely to be used by end users despite being compliant with the Word binary file format specifications provided in [15]. For example, it is quite unlikely to see a financial statement issued by a bank as a Word document that has text with size 72, centered, with paragraph indentation left of 10 cm. We deem that object properties may be automatically selected such as to carry hidden HMAC bits while avoiding unusual formatting, although that is a task that we did not implement in this research. Nevertheless, as we wrote previously, human verification is necessary as algorithmic selection of object properties does not equate human eye. At the end, it is a botmaster, i.e. a human, that views and possibly inspects a leaked honeytoken. Extraction of hidden HMAC bits, i.e. decoding of object properties, is safe to be conducted automatically. 
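The interpretation instance above (Times New Roman as bit 1, Arial as bit 0, size 12 as bit 1, size 14 as bit 0) can be illustrated with a small lookup table. This is only a sketch: the fixed table, the property names, and the function name are assumptions made for illustration, and a fixed table is a simplification of the keyed scheme that the paper builds in its encoding subsection.

```python
# Hypothetical interpretation table built from the font and size examples in the text.
INTERPRETATION = {
    ("font", "Times New Roman"): "1",
    ("font", "Arial"): "0",
    ("size", 12): "1",
    ("size", 14): "0",
}


def decode_properties(properties):
    """Map a sequence of (property name, value) pairs to a bit string."""
    bits = []
    for name, value in properties:
        if (name, value) not in INTERPRETATION:
            raise ValueError(f"no secret interpretation for {name}={value!r}")
        bits.append(INTERPRETATION[(name, value)])
    return "".join(bits)


# Properties as they might be read, in order, from blocks of text in a document.
observed = [("font", "Arial"), ("size", 12), ("font", "Times New Roman"), ("size", 14)]
print(decode_properties(observed))  # prints 0110
```

Nothing in the document changes when it is decoded this way; the bits exist only in the reader's interpretation of properties that are already there.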
Up to the point of determining the properties of objects present in a honeytoken, the decoding process proceeds similarly to an application such as Microsoft Office Word. Namely it scans the various substructures in the FIB to determine location and size of each one of those objects, whose properties are stored as lists of differences from preliminarily defined default values. Once the properties of each object are retrieved, those properties are then interpreted, i.e. decoded, into 0 and 1 bits. The association between object properties and 0 or 1 bits is applied

A. Carriers of Hidden Bits

In this research we carry hidden bits via secret interpretation of specific elements of the Word binary file format [15]. Thus, we do not introduce concrete HMAC bits into the Word binary file that we use as honeytoken. Specific elements of the Word binary file format are interpreted as 1 or 0 bits according to an encoding scheme that we discuss in the next subsection. According to [15], the Word binary file format consists of a series of records that specify objects such as characters, paragraphs, sections, tables, pictures, etc. The first record is referred to as a master record and is named the File Information Block (FIB). The FIB comprises pairs of integers organized as substructures that indicate the locations of objects that are present in a Word binary file. In each one of those substructures, the first integer specifies the location of an object, while the other integer specifies the size of that object. The various substructures in the FIB allow an application such as Microsoft Office Word or Open Office Word to locate all objects in the file and determine the properties of those objects. Those properties are precisely the elements that in this research are employed as carriers of hidden bits. Clearly a secret interpretation of object properties in a Word binary file as a stream of 0 or 1 bits does not interfere with their original semantic or function. In other words, that secret interpretation does not require any changes to object properties, nor does it require any changes to FIB substructures. Consequently, the secret interpretation in


sequentially to all objects present in a honeytoken, regardless of their type. That way the defender is given the ability to encode the whole sequence of HMAC bits. The defender does so by employing as many objects as necessary, for example by creating a table nested within another table. A typical number of bits that we needed to encode in this research was 128, which is the length of the HMAC of bogus sensitive data generated via the MD5 hash function. In most cases the number of objects required to encode the HMAC of bogus sensitive data corresponds to an overall number of object properties that exceeds the number of bits in that HMAC. Clearly we need a means of telling where exactly the secret interpretation of object properties ends, so that the object properties that come after that point are not interpreted. A possible way of addressing that requirement is to introduce a marker in the set of symbols that object properties are mapped to. The marker indicates that the object properties that follow are not to be interpreted. Thus, each object property would be secretly encoded as bit 0, bit 1, or the marker. An additional possibility that is available to the defender consists in distributing the secret interpretation of object properties throughout the honeytoken. In that case we would need to determine the beginning and the end of each segment of the interpretation in question. Let us use two delimiters for each such segment, one interpreted as the beginning and the other, w, as the end. Each object property would be secretly encoded as bit 0, bit 1, the begin delimiter, or w. In a honeytoken, those object properties that lie between an object property encoded as the begin delimiter and an object property encoded as w are to be interpreted as bit 0 or bit 1. All the other properties of objects present in the honeytoken are not made subject to any secret interpretation.
Clearly such a fragmentation of the overall sequence of object properties in a honeytoken requires employment of more objects to carry the HMAC of bogus sensitive data, compared to the scheme described previously. That comes natural as not all object properties are used to carry hidden bits. Nevertheless, distributing the secret interpretation of object properties contributes significantly to avoiding unusual formatting of the honeytoken.
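The segment-based interpretation described above can be sketched as a small state machine over the symbols that object properties decode to. The labels "B" and "E" for the begin and end delimiters, and the function name, are hypothetical choices for this illustration; the paper uses its own symbols.

```python
BEGIN, END = "B", "E"  # hypothetical labels for the begin and end delimiters


def extract_bits(symbols):
    """Collect the bits that lie between a begin delimiter and an end delimiter."""
    bits = []
    inside = False
    for symbol in symbols:
        if symbol == BEGIN:
            inside = True
        elif symbol == END:
            inside = False
        elif inside:
            if symbol not in ("0", "1"):
                raise ValueError(f"unexpected symbol inside a segment: {symbol!r}")
            bits.append(symbol)
        # Symbols outside any segment carry no hidden bits and are skipped.
    return "".join(bits)


symbols = ["1", "B", "0", "1", "E", "0", "0", "B", "1", "E", "1"]
print(extract_bits(symbols))  # prints 011
```

The properties outside the segments are free to take whatever values make the document look ordinary, which is the benefit the paragraph above attributes to distributing the interpretation.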

those specific probabilities after the step in question completes. Notice that state i can be the same as state j, i.e. the Markov chain process can move from a state to that very same state. The stochastic matrix below is an example that shows the common form of a one-time matrix of transition probabilities used in this research. Both rows and columns are labeled by the three random states of the Markov chain process. Each transition probability pij denotes the likelihood of transition from state i to state j. For instance, with reference to the matrix below, if the Markov chain process is in state 0, then the likelihood of a transition to that very same state, i.e. state 0, is 0.3951. The likelihood of a transition to state 1 is 0.5703, and the likelihood of a transition to the marker state is 0.0346. Those transition probabilities sum to 1.0, as it is certain that the Markov chain process will take a step from state 0 to one of its three states.

When the Markov chain process is in state i, the transition probabilities that pertain to its next transition are those found on the specific row that corresponds to state i. The state reached by that transition is the one with the highest likelihood, i.e. the state j whose associated transition probability pij in the row in question has the highest value compared to the other transition probabilities in that row. In the example given above, if the Markov chain process is in state 0, then the transition probabilities that affect the next transition are 0.3951, 0.5703, and 0.0346. That transition will take the Markov chain process to state 1, as state 1 has the highest likelihood, namely 0.5703, compared to the other two states. As we wrote earlier in this section, each row of the matrix of transition probabilities is only used once. After use, each row is entirely updated with new values. Referring to our example, it is the first row that is consulted to determine the next transition, namely from state 0 to state 1. After that specific transition takes place, the transition probabilities 0.3951, 0.5703, and 0.0346 in that row are updated with new values, say 0.0526, 0.3083, and 0.6391, respectively. The other two rows that are not involved in the aforementioned transition remain unchanged. The matrix of transition probabilities at that point is ready to be used in relation to the succeeding transition. The overall discussion that centers around this example holds similarly for the cases in which the Markov chain process is in state 1 or in the marker state. In this research we generate transition probabilities that are random-like and hence unpredictable, yet easily reproducible. Due to such randomness in the matrix of transition probabilities, the Markov chain process performs a random walk through its possible states. In this research, the random sequence of states visited by the Markov chain process throughout that random walk is coupled with a list of object properties.
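The stepping rule above (consult the row of the current state, move to the state with the highest probability, then replace that row) can be sketched as follows. The first row and its replacement come from the example in the text; the label "M" for the third state, the other rows, and the function name are hypothetical values chosen for illustration.

```python
STATES = ("0", "1", "M")  # "M" is a hypothetical label for the third (marker) state


def random_walk(start, matrix, fresh_rows, steps):
    """Walk the chain, using each row once and then replacing it with a fresh one."""
    matrix = {state: list(row) for state, row in matrix.items()}  # work on a copy
    fresh_rows = iter(fresh_rows)
    state = start
    visited = []
    for _ in range(steps):
        row = matrix[state]
        # The next state is the one with the highest probability in the current row.
        next_state = STATES[max(range(len(row)), key=row.__getitem__)]
        # A row is only used once: overwrite it before the next step.
        matrix[state] = list(next(fresh_rows))
        visited.append(next_state)
        state = next_state
    return visited


matrix = {
    "0": [0.3951, 0.5703, 0.0346],  # row from the example in the text
    "1": [0.2000, 0.3000, 0.5000],  # hypothetical row
    "M": [0.6000, 0.3000, 0.1000],  # hypothetical row
}
fresh = [
    [0.0526, 0.3083, 0.6391],       # replacement row from the example in the text
    [0.1000, 0.8000, 0.1000],       # hypothetical replacement
    [0.3000, 0.3000, 0.4000],       # hypothetical replacement
]
print(random_walk("0", matrix, fresh, 3))  # prints ['1', 'M', '0']
```

Because every step is a deterministic choice over the current row, anyone who can regenerate the same rows in the same order reproduces the same walk.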
We identify the sequence of objects present in a honeytoken, and thus organize their properties in a list according to the order of

B. Hidden Encoding / Decoding

Let us define a Markov chain process [8] with three random states, namely 0, 1, and a marker state. The Markov chain process starts from the state that equals the most significant bit of the defender's secret cryptographic key, i.e. the cryptographic key that the defender employs to generate the HMAC of bogus sensitive data placed in the honeytoken. Thus, the Markov chain process starts from state 0 or from state 1. For any two of the three states i and j, the Markov chain process takes a step from state i to state j with a one-time transition probability pij. Thus, the transition probabilities pij for all possible i's and j's form a 3 × 3 matrix of transition probabilities, which we associate with each step to determine the next transition. In this research, the transition probabilities that pertain to a step from any state i to any state j are employed only once. The matrix of transition probabilities is updated with new values for


xn is a pseudo-random number generated by the logistic map at iteration n. An exception is x0, which is provided as part of the input to the logistic map. If m is the number of steps that comprise the random walk performed by the Markov chain process, then the overall number of transition probabilities that are needed to control that random walk is 3m. That is because each step consumes a row of three transition probabilities. Thus, n takes an initial value of 0, which is incremented by 1 at each iteration until reaching a value of 3m. The other parameter of the logistic map is a bifurcation parameter, which along with x0 represents the initial conditions of the logistic map. For values of the bifurcation parameter between 3.57 and 4, the logistic map exhibits chaotic behavior [10].
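The generation of 3m values for an m-step walk can be sketched with the standard form of the logistic map, x(n+1) = r * x(n) * (1 - x(n)). The symbol r for the bifurcation parameter and both function names are assumptions of this sketch. The sketch also leaves each group of three values as generated: the text here does not say how a row is made to sum to 1.0, and the highest entry of a row is the same with or without rescaling.

```python
def logistic_map(r, x0, count):
    """Return x1..x_count from the logistic map x(n+1) = r * x(n) * (1 - x(n))."""
    x = x0
    values = []
    for _ in range(count):
        x = r * x * (1.0 - x)
        values.append(x)
    return values


def transition_rows(r, x0, steps):
    """Group 3 * steps logistic-map values into one row of three values per step."""
    values = logistic_map(r, x0, 3 * steps)
    return [values[i:i + 3] for i in range(0, len(values), 3)]


# Hypothetical initial conditions inside the range discussed in the text.
rows = transition_rows(3.99997, 0.3, 5)
for row in rows:
    print(row)
```

Double-precision floats are used here for brevity; initial conditions with more digits than a float can hold would need arbitrary-precision arithmetic to be reproduced exactly.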

appearance of those objects in the honeytoken. Let m denote the number of elements in that list, i.e. the overall number of object properties that we have available for carrying hidden bits. We have the Markov chain process perform a random walk of m steps. The states reached by those steps, in the order in which the steps are taken, are associated with the object properties in the aforementioned list, in the order in which those properties appear in that list. That way we obtain a bijective function f that maps each state traversed by the random walk to a distinct object property in our list. Thus, the bijective function f represents a secret encoding scheme, while its inverse $f^{-1}$ represents a secret decoding scheme. A part of an example bijective function that is usable for encoding and decoding of object properties in a honeytoken is given in Table II. The specific property referenced in that table is the font of a block of text. Only a small subset of the possible values of the font in question is given due to space limitations.

We now elaborate on how we generate the transition probabilities that control the random walk of the Markov chain process. In this research we draw on chaos theory; namely, we exploit the high sensitivity of chaotic dynamical systems to their initial conditions. That high sensitivity allows for generating chaotic signals that are robust from a randomness perspective. The possibility of using chaotic signals to carry information was first proposed in [7], while the use of chaos for random number generation was investigated in [1, 23].

TABLE II.

For $3.99996 \le \mu < 4$, the logistic map is suitable for pseudo-random number generation [11]. Each $x_n$ generated is a real number between 0.0 and 1.0, and therefore fits the definition of a probability. Thus, we employ the $x_n$ values as transition probabilities in relation to the Markov chain process. The initial conditions of the logistic map, namely $\mu$ and $x_0$, act as a secret key with respect to the transition probabilities generated by that logistic map. Any party in possession of those two parameters would be able to reconstruct the $x_n$ values, and thus gain access to the schemes used for secret encoding / decoding of object properties in a honeytoken. In this research the initial conditions in question are kept secret by the defender, which prevents any unauthorized party, including a targeted botmaster, from revealing an HMAC hidden within a honeytoken. We derive those initial conditions from the secret cryptographic key that the defender employs to generate the HMAC. Thus, the defender is required to maintain only one secret key for the overall approach rather than multiple secret keys. The typical length of a cryptographic secret key that is used in conjunction with the MD5 hash function is 128 bits. The value of $\mu$ is created by appending to 3.99996 the decimal representation of each one of the 8 high order bytes of the defender's secret cryptographic key. Thus, 8 natural numbers between 0 and 255 are appended to 3.99996 to create a real number that we employ as $\mu$ in a logistic map. The value of $x_0$ is created by appending to 0. the decimal representation of each one of the 8 low order bytes of the defender's secret cryptographic key. The process of creating the values of $\mu$ and $x_0$ is illustrated in Table III. The byte stream 0xD1 0xF0 0xA8 0x5C 0xB7 0xD5 0x69 0x26 0x4B 0x77 0x35 0x40 0xE2 0x30 0x83 0xC9 is an example of a defender's secret cryptographic key in hexadecimal representation.
The initial conditions that correspond to that key are $\mu$ = 3.999962092401689218321310538 and $x_0$ = 0.75119536422648131201. Guessing or brute forcing such initial conditions is equivalent to guessing or brute forcing the 128 bits that comprise the defender's secret cryptographic key, which is not computationally feasible.
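The construction of the initial conditions from the example key can be reproduced with a short Python sketch. The function name `derive_initial_conditions`, the variable name `mu` for the bifurcation parameter, and the choice of the `decimal` module to keep every appended digit are assumptions of this sketch, not part of the described approach.

```python
from decimal import Decimal


def derive_initial_conditions(key):
    """Build the two initial conditions from a 16-byte key by digit appending."""
    if len(key) != 16:
        raise ValueError("expected a 128-bit (16-byte) key")
    high, low = key[:8], key[8:]
    # Append the decimal representation of each high order byte to 3.99996.
    mu_text = "3.99996" + "".join(str(b) for b in high)
    # Append the decimal representation of each low order byte to "0.".
    x0_text = "0." + "".join(str(b) for b in low)
    return Decimal(mu_text), Decimal(x0_text)


key = bytes.fromhex("D1F0A85CB7D569264B773540E23083C9")
mu, x0 = derive_initial_conditions(key)
print(mu)  # 3.999962092401689218321310538
print(x0)  # 0.75119536422648131201
```

The printed values match the initial conditions quoted in the text for the example key.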

EXCERPT FROM AN INSTANCE OF HIDDEN ENCODING / DECODING

Font Property: Abadi MT Condensed, Adobe Minion Web, Agency FB, Aharoni, Algerian, Almanac MT, American Uncial, Andale Mono, Andalus, Andy, Angsana New, AngsanaUPC, Aparajita, Arabic Transparent, Arabic Typesetting

Encoding/Decoding: 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0
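One way to picture the pairing between property values and random-walk states is the sketch below. It is a hypothetical reading of the scheme, not the author's implementation: the function name `build_scheme`, the walk values, the choice of fonts, and the use of 2 as a placeholder for the third state are all made up for illustration.

```python
def build_scheme(property_values, walk_states):
    """Pair the k-th property value with the state reached at step k of the walk."""
    if len(property_values) != len(walk_states):
        raise ValueError("the walk must take one step per property value")
    if len(set(property_values)) != len(property_values):
        raise ValueError("property values must be distinct")
    return dict(zip(property_values, walk_states))


fonts = ["Agency FB", "Aharoni", "Algerian", "Andalus"]
walk = [1, 0, 2, 1]  # made-up walk; 2 is a placeholder for the third state
scheme = build_scheme(fonts, walk)

# Decoding direction: look up the state assigned to each font seen in a document.
observed = ["Aharoni", "Andalus", "Agency FB"]
hidden = [scheme[font] for font in observed]
print(hidden)  # [0, 1, 1]
```

Because the walk depends on the secret initial conditions, a party without them cannot rebuild the lookup table and so cannot tell which state a given font stands for.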

In this research we employ a logistic map [10] to randomly generate transition probabilities for use with the Markov chain process. While most polynomial maps that exhibit chaotic behavior would perhaps perform well in that regard, the discrete time domain and the real space domain of a logistic map make it particularly interesting to us. Furthermore, the viability of a logistic map as a source of pseudo-random numbers was shown by a study conducted by Ulam and von Neumann, which they discussed in [24]. Let us recall that a logistic map is formulated as follows: $x_{n+1} = \mu x_n (1 - x_n)$. n denotes the iteration, while

V. HONEYTOKEN TRACKING

We now discuss tracking of a leaked honeytoken over the Internet in order to reveal the IP address of a botmaster's machine. Related research such as the work of da Silva


written in VBScript, which can be added to that file via a plug-in referred to as Microsoft Script Editor (MSE). Similarly, it is possible to embed and run JavaScript code in a PDF file. Both web bugs and document bugs are, at least from a technical perspective, usable as means of revealing the IP address of a machine used by an individual involved in cybercrime. The research in [14] employs web bugs to track a phisher, while the research in [2] employs document bugs to track a malicious insider. In both of those works, the embedded bugs report back by downloading a remote image. Also, the research in [21] mentions honeytokens which, when opened with the corresponding application, report the access to security personnel over the network. Although the author of [21] does not elaborate on how such a document reports the access over the network, we deem it likely that an embedded script performs that action. In this research a steganographically marked honeytoken is a Microsoft Word file, and as such it can host a report-back script, i.e. a document bug, embedded via the MSE plug-in. Upon opening of the honeytoken, the report-back script would attempt to download an image from a web server set up by the defender.
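On the defender's side, the web server that receives such a report-back request only needs to note where the request came from. The following Python sketch, built on the standard `http.server` module, shows one possible shape of that logging. The names `record_sighting`, `BeaconHandler`, and `serve`, the in-memory list, the port, and the empty 204 reply are assumptions chosen for brevity rather than details from the paper.

```python
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer

sightings = []  # in-memory log; a real deployment would persist this


def record_sighting(client_ip, path):
    """Store the source address and requested path of one report-back request."""
    entry = {
        "ip": client_ip,
        "path": path,
        "time": datetime.now(timezone.utc).isoformat(),
    }
    sightings.append(entry)
    return entry


class BeaconHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # client_address is the (host, port) pair of the requesting machine.
        record_sighting(self.client_address[0], self.path)
        self.send_response(204)  # empty reply; a real server might return an image
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default logging to stderr


def serve(host="127.0.0.1", port=8080):
    """Run the logging server until interrupted (hypothetical host and port)."""
    HTTPServer((host, port), BeaconHandler).serve_forever()
```

The address recorded this way is that of whichever machine opened the document with network access and script execution enabled.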

Mendo in [5] has explored document flow tracking for detecting leakage of sensitive documents in an enterprise network. Nonetheless, that approach relies on installation of software agents on each host of interest, and therefore is inapplicable to our problem domain. We see two main ways by which a leaked honeytoken can be tracked to the IP address of a botmaster's machine, namely report-back script code embedded within a leaked honeytoken, and network monitoring conducted at the ISP level. The former draws on the concept of a web bug [13], i.e. HTML code embedded within a web page, intended to go unnoticed by the end user, and inserted to spy on the end user. Web bugs have commonly been used by unethical individuals such as phishers and spammers, and most commonly come in the form of image definitions. The actual image is downloaded from an external web server, which is usually controlled by the creator of the web bug. The HTTP request that fetches the image file carries surveillance data, which are logged by the web server in question. A concept similar to web bugs is implementable in various binary file formats such as Office formats, including Word, and PDF. Those document bugs consist of scripts that run from within the file in which they are embedded [12]. For example, a Microsoft Word file can host embedded scripts

TABLE III. CONSTRUCTION OF THE INITIAL CONDITIONS OF A CHAOTIC LOGISTIC MAP

High Order   Decimal   $\mu$
-            -         3.99996
0xD1         209       3.99996209
0xF0         240       3.99996209240
0xA8         168       3.99996209240168
0x5C         92        3.9999620924016892
0xB7         183       3.9999620924016892183
0xD5         213       3.9999620924016892183213
0x69         105       3.9999620924016892183213105
0x26         38        3.999962092401689218321310538

Low Order    Decimal   $x_0$
-            -         0.0
0x4B         75        0.75
0x77         119       0.75119
0x35         53        0.7511953
0x40         64        0.751195364
0xE2         226       0.751195364226
0x30         48        0.75119536422648
0x83         131       0.75119536422648131
0xC9         201       0.75119536422648131201

thus is made available to the defender. The logged IP addresses form a chain that represents the network path followed by the leaked honeytoken. In the case of botnets that use public key cryptography, honeytoken tracking is much more challenging. In that case bots use a public key to encrypt the stolen sensitive data that they send to botmasters. The private key associated with that public key is known only to the botmaster; consequently, the defender would normally not be able to decrypt any ciphertext sent to botmasters. We have found a way of interfering with the encryption process conducted by bots so as to enable an ISP to track the leaked honeytoken. Our approach is applicable when bots make use of the cryptographic primitives provided by the operating system of the compromised host. Asymmetric key algorithms are slow compared to symmetric key algorithms, and thus are not used directly to encrypt large amounts of data. Various public-key cryptographic applications in operating systems such as Windows and Linux generate a pseudo-random bitstream,

The network packets carrying that HTTP GET request would reveal the IP address of the machine on which the honeytoken was opened. A major drawback of this tracking technique is the ease with which it can be bypassed. Simple actions like removing network connectivity from the machine before opening a Word or PDF file, or disabling script execution in such files, would prevent the leaked honeytoken from requesting the remote image. In technical terms, network monitoring via a packet processor that intercepts and recognizes the IP data packets that transfer a leaked honeytoken is more effective. That process is implementable by ISPs located within the area of jurisdiction of the law enforcement agency that is following the case. In the case of botnets that transfer stolen sensitive data in the clear, a Snort rule that recognizes the bogus sensitive data stored in the leaked honeytoken would suffice to perform the tracking. Upon passage of the leaked honeytoken through an ISP, the destination IP address in the header of network packets, whose payloads carry the leaked honeytoken, is logged and


which is used as a secret key in conjunction with a symmetric key algorithm to encrypt a file. That secret key is then encrypted via an asymmetric key algorithm using a public key. The ciphertext that corresponds to the encrypted secret key is appended to the end of the encrypted file. The receiver's public-key cryptographic application extracts the encrypted secret key from the end of the encrypted file; decrypts that secret key using the private key associated with the aforementioned public key; and finally uses the decrypted secret key to decrypt the encrypted file. We introduce a decoy cipher on a honeypot, which is a trojanized version of the public-key cryptographic application running on that honeypot. The decoy cipher logs the secret key generated by a bot on a compromised host, which in this case is a honeypot, before that bot encrypts the secret key in question with a botmaster's public key and appends it to the end of a sensitive file that is about to be sent to the botmaster. That file in this case is the steganographically marked honeytoken. Given that the bot encrypts the honeytoken symmetrically with the secret key in question rather than asymmetrically with a botmaster's public key, the defender and hence an ISP can decrypt the honeytoken despite not knowing the botmaster's private key.
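The effect of logging the session key can be illustrated with a toy Python model of the hybrid scheme described above. Everything in it is a stand-in chosen so that the sketch runs with the standard library alone: `keystream_xor` is a simple SHA-256 counter-mode XOR and not a vetted cipher, `opaque_wrap` merely imitates the public-key encryption of the session key with a hash, and the class and function names are hypothetical.

```python
import hashlib
import secrets


def keystream_xor(key, data):
    """Toy symmetric cipher (SHA-256 counter-mode XOR); a stand-in, not secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class DecoyCipher:
    """Hybrid encryption that keeps a copy of every session key it generates."""

    def __init__(self):
        self.logged_keys = []

    def encrypt_file(self, plaintext, wrap_key):
        session_key = secrets.token_bytes(16)
        self.logged_keys.append(session_key)  # logged before it is wrapped
        body = keystream_xor(session_key, plaintext)
        # The wrapped session key is appended to the end of the encrypted file.
        return body + wrap_key(session_key)


def opaque_wrap(session_key):
    """Placeholder for public-key encryption of the session key (32 bytes)."""
    return hashlib.sha256(b"botmaster-public-key" + session_key).digest()


cipher = DecoyCipher()
blob = cipher.encrypt_file(b"bogus sensitive data", opaque_wrap)
body = blob[:-32]  # strip the wrapped key from the end
recovered = keystream_xor(cipher.logged_keys[-1], body)
print(recovered)  # b'bogus sensitive data'
```

The defender never touches the wrapped key at the end of the file; the logged session key alone is enough to recover the file body, which is the property the decoy cipher is meant to provide.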


VI. CONCLUSION

In this paper we discussed a novel application of steganography that enables law enforcement to provably reveal the geographic location from which a botmaster operates. The proposed approach is the cyber counterpart of an investigative technique that tracks perpetrators via marked money bills used to make an illegal drug buy arranged by undercover agents. The main idea is to generate a watermark hidden within a decoy document, i.e. a honeytoken, and then leak the honeytoken to a botmaster via a honeynet. The scheme used for encoding and decoding hidden bits draws on chaos theory. That scheme leverages secret interpretation of object properties in a Word file, which in this research are the actual carriers of hidden bits. The overall approach exploits tracking of a leaked honeytoken along with its storage on a botmaster's machine.

REFERENCES

[1] T. Addabbo, A. Fort, S. Rocchi, and V. Vignoli. “Chaos based generation of true random bits”. Studies in Computational Intelligence, vol. 184, pp. 355-377, 2009.
[2] B. M. Bowen, S. Hershkop, A. D. Keromytis, and S. J. Stolfo. “Baiting inside attackers using decoy documents”. Security and Privacy in Communication Networks, vol. 19, pp. 51-70, 2009.
[3] S. K. Catlett and X. He. “Fraud detection using honeytoken data tracking”, April 2009.
[4] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker. “Digital Watermarking and Steganography”. Morgan Kaufmann, 2nd edition, November 2007.
[5] T. G. da Silva Mendo. “Document flow tracking within corporate networks”. Master's thesis, Universidade de Lisboa, November 2009.
[6] R. Dingledine, N. Mathewson, and P. Syverson. “Tor: The second-generation onion router”. In Proceedings of the 13th USENIX Security Symposium, pp. 303-320, August 2004.
[7] S. Hayes, C. Grebogi, and E. Ott. “Communicating with chaos”. Physical Review Letters, 70(20), pp. 3031-3034, May 1993.
[8] R. A. Howard. “Dynamic Probabilistic Systems”, vol. I: Markov Models. Dover Publications, June 2007.
[9] Information Technology Laboratory, National Institute of Standards and Technology. “The keyed-hash message authentication code (HMAC)”. Federal Information Processing Standards Publication FIPS PUB 198, National Institute of Standards and Technology, March 2002.
[10] A. Kanso and N. Smaoui. “Logistic chaotic maps for binary numbers generations”. Chaos, Solitons & Fractals, 40(5), pp. 2557-2568, June 2009.
[11] L. Kocarev and G. Jakimoski. “Logistic map as a block encryption algorithm”. Physics Letters A, 289(4-5), pp. 199-206, 2001.
[12] W.-J. Li, S. Stolfo, A. Stavrou, E. Androulaki, and A. D. Keromytis. “A study of malcode-bearing documents”. In Proceedings of the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Lucerne, Switzerland, July 2007.
[13] D. Martin, H. Wu, and A. Alsaid. “Hidden surveillance by web sites: Web bugs in contemporary use”. Communications of the ACM - Mobile Computing Opportunities and Challenges, 46(12), pp. 258-264, December 2003.
[14] C. M. McRae and R. B. Vaughn. “Phighting the phisher: Using web bugs and honeytokens to investigate the source of phishing attacks”. In Proceedings of the 40th Annual Hawaii International Conference on System Sciences, vol. 9, pp. 4558-4564, Waikoloa, Hawaii, USA, January 2007.
[15] Microsoft Corporation. Microsoft office binary (doc, xls, ppt) file formats, February 2008. [Online; accessed January 26th, 2014].
[16] S. Nagaraja, P. Mittal, C.-Y. Hong, M. Caesar, and N. Borisov. “Botgrep: Finding bots with structured graph analysis”. In Proceedings of the 19th USENIX Security Symposium, pp. 95-110, California, USA, August 2010.
[17] V. N. Padmanabhan and L. Subramanian. “An investigation of geographic mapping techniques for Internet hosts”. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, New York, USA, October 2001.
[18] F. Pouget, M. Dacier, and H. Debar. “Honeypot, honeynet, honeytoken: Terminological issues”. Research Report RR-03-081, Institut Eurecom, September 2003.
[19] Public Safety Canada. “Government of Canada introduces legislation to fight crime in the 21st century”, June 2009. [Online; accessed January 26th, 2014].
[20] S. Schneider. “Iced: The Story of Organized Crime in Canada”. Wiley, 1st edition, December 2009.
[21] C. A. Shue. “Automated honey token generation with redaction”, January 2010. [Online; accessed February 14th, 2011].
[22] L. Spitzner. Honeytokens: The other honeypot, July 2003. [Online; accessed January 26th, 2014].
[23] T. Stojanovski and L. Kocarev. “Chaos-based random number generators - part I: Analysis”. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 48(3), pp. 281-288, March 2001.
[24] S. Ulam and J. von Neumann. “On combination of stochastic and deterministic processes”. Bulletin of the American Mathematical Society, 53(11), p. 1120, November 1947.
[25] P. Wang, S. Sparks, and C. C. Zou. “An advanced hybrid peer-to-peer botnet”. IEEE Transactions on Dependable and Secure Computing, 7(2), pp. 113-127, April-June 2010.
[26] J. Yuill, M. Zappe, D. Denning, and F. Feer. “Honeyfiles: Deceptive files for intrusion detection”. In Proceedings of the 5th IEEE Workshop on Information Assurance, pp. 116-122, West Point, New York, USA, June 2004.
