Web Application Model Recovery for User Input Validation Testing

Nuo Li, Ji Wu, Mao-zhong Jin, Chao Liu
Software Engineering Institute, School of Computer Science and Engineering, Beihang University, China
seraphic, wuji, jmz, [email protected]

Abstract

Unvalidated input is one of the most critical Web application security flaws. However, testing user input validation functions is an intellectually demanding and labor-intensive task. We are developing a model-driven framework that helps testers accomplish this job in a visual view with guidance. This paper reports our ongoing work. A meta-model of Web applications for user input validation testing is defined. Based on the meta-model, a lightweight method that analyzes HTML files is given to create the model. Our evaluation shows that the proposed method can comprehensively model Web applications and accurately identify the purpose of input points, which is very important for future test case generation.

Keywords: Model-driven testing, Web application, user input validation

1. Introduction

In the Internet era, Web applications are becoming the core business in many areas. Meanwhile, the number of attacks on Web applications is increasing rapidly. Current technologies such as anti-virus software and network firewalls offer comparatively secure protection at the host and network levels, but not at the application level [1]. According to the Open Web Application Security Project’s assessment [2], the most critical Web application security flaw is unvalidated input. Therefore, to develop a secure Web application, data from Web requests must be validated before being used. Recently, more attention has been paid to this problem [1][3][4][5]. However, Web application developers often omit the validation of user input data, and the validation functions are usually not clearly identified and defined in the requirements.

Weber [6], a senior security consultant, took Cross Site Scripting as an example to show how to test Web applications for such vulnerabilities in practice. The first step is to obtain automatic tools that intercept HTTP requests. Second, map out the site and its functionality by talking with developers and project managers. Third, identify and list every point of user-supplied input. Then the testers should think through and list the test cases manually. Finally, start testing and watch the output in order to adjust the test cases. These steps are laborious.

By adopting the model-driven testing methodology, we can help testers test a Web Application’s User Input Validation (WA-UIV) functions visually and thoroughly. The model-driven testing approach attempts to offer a suite of visualized facilities to define, execute and analyze testing [7]. With a proper method, the Web application can be presented visually and all test-related information can be discovered automatically. Furthermore, validation rules can be associated with the context information in the Web application model and presented to testers in a visual view with guidance. These patterns will be a great help in generating WA-UIV test cases.

This paper reports our ongoing work on how to create the System Under Test (SUT) model for WA-UIV testing, especially how to identify the description text of the input points. Section 2 introduces the SUT model of WA-UIV testing and the method of model generation. Section 3 describes the prototype of the modeling framework. Experimental results are presented in Section 4. Section 5 discusses related work. Section 6 concludes the paper and sketches future work on WA-UIV test case generation.

This research is based on the work supported by the National High Technology Research and Development Program of China (Grant No. 2006AA01Z176) and the National Natural Science Foundation of China (Grant No. 60603039).

2. The SUT model of WA-UIV Testing

The test model is the core concept of model-driven testing, and models can be constructed from different views in different phases of testing [8]. The SUT model is the basis of test case generation and test execution.

2.1 Definition of the SUT model

Web applications provide services to users and obtain user input through navigation and other HTML components, so the navigation model is chosen as the SUT model. On the client side, although the components that accept user input vary in appearance, they can be classified into three types: url, cookie and form, where a form may be visible or hidden. Furthermore, a form can contain select, textarea, input and button tags.

Di Lucca et al. [9] and Ricca and Tonella [10] proposed how to describe the navigation model of Web applications with UML models. We extended their definitions to define the SUT model of WA-UIV testing (shown in Figure 1, which was drawn with Rational Rose). The UML 2.0 Testing Profile [12] explains that the SUT is exercised via its public interface operations and signals by the test components during test execution. For WA-UIV testing, the ‘public interface’ is the ‘input point’ that accepts the data offered by the user.

Figure 1. Meta-model of the SUT model of WA-UIV testing

Figure 1 presents the meta-model of the SUT model of WA-UIV testing. It depicts the navigation relations among the client pages, the relationships between input points, the basic information of input points and some description information about them. HTML elements are modeled as classes generalizing the Class class from UML2, and the relations between the elements generalize the Association class from UML2. The attribute named ‘descText’ of input, textarea and select is the description text of the text box or selection list. How to identify the ‘descText’ of these kinds of input points is explained in detail in the next section. Once the description text is identified, the purpose of the input is derived from a topic model in which each topic is associated with two sets: one contains the words that may describe the topic, i.e. the purpose of an input point; the other contains values that are valid input data for an input point with this topic.

2.2 Description-Text Identification

The SUT model is generated by analyzing HTML files. Most of the model elements defined in the meta-model can be extracted from the HTML files easily. However, identifying the ‘descText’ is difficult, because input tags, text tags and format tags are often intermixed in HTML code.

Figure 2. DOM tree of an HTML page

Figure 2 shows an example of the DOM tree of an HTML page. In order to identify the ‘descText’ of an input tag, the sub-tree that contains the tag is pruned from the DOM tree, and then the tags around the input tag in that sub-tree are analyzed to identify which text should be assigned to its ‘descText’. In Figure 2, the ‘descText’ of the input tag should be the text (‘search’) of its sibling node. Sometimes the ‘descText’ of a tag is in a select box before it, or in the text node behind it. All these cases need special attention during automatic discovery. Based on an analysis of more than 50 publicly accessible websites that contain registration pages or search engines, the priorities of text in different instances are listed in Table 1 for different types of input points.

Table 1. The priority of text in different instances

Input point (IP) types (columns): select list, radio/checkbox, input box, text area

1st text node before IP: 2, 1, 5, 5
1st text node behind IP: 2, 3, 3
type=’file/password/hidden/submit’: 6
1st select box before IP: 4, 4
default value: 1, 1, 1
1st button behind IP: 2, 2
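As a rough illustration of the lookup described in this section, the following Python sketch flattens a page into text nodes and input points and then picks the neighbouring text with the highest weight. It is a sketch under stated assumptions, not the input description abstractor itself: the names `TokenCollector`, `desc_texts` and `WEIGHTS` are hypothetical, the weights are illustrative placeholders rather than the values of Table 1, a higher weight is assumed to win, and only the first text node before or behind an input point is considered.

```python
from html.parser import HTMLParser

# Hypothetical weights (assumption: higher wins); NOT the values of Table 1.
WEIGHTS = {
    "input": {"text_before": 5, "text_behind": 3},
    "textarea": {"text_before": 5, "text_behind": 3},
    "select": {"text_before": 2, "text_behind": 1},
}


class TokenCollector(HTMLParser):
    """Flatten a page into ('text', str) and ('ip', tag) tokens.

    Format tags such as <b> or <font> are not recorded, so text nested
    inside them still shows up as a plain text token.
    """

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.skip_depth = 0  # > 0 while inside <select>/<textarea> content

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.tokens.append(("ip", tag))
            if tag != "input":  # option texts and defaults are not labels
                self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("select", "textarea") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.tokens.append(("text", text))


def desc_texts(html):
    """Return (tag, descText or None) for every input point in the page."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    tokens = collector.tokens
    results = []
    for i, (kind, tag) in enumerate(tokens):
        if kind != "ip":
            continue
        candidates = {}
        if i > 0 and tokens[i - 1][0] == "text":
            candidates["text_before"] = tokens[i - 1][1]
        if i + 1 < len(tokens) and tokens[i + 1][0] == "text":
            candidates["text_behind"] = tokens[i + 1][1]
        # Pick the candidate position with the highest weight, if any.
        best = max(candidates, key=WEIGHTS[tag].get, default=None)
        results.append((tag, candidates.get(best)))
    return results


if __name__ == "__main__":
    page = ('<form><b><font size="2">search</font></b>'
            '<input name="q"><input type="submit"></form>')
    print(desc_texts(page))
```

Because only text and input-point tokens are kept, font and layout tags wrapped around a label are skipped without explicit recursion; a fuller version would also consider select boxes, buttons and default values as candidates.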

2.3 Form Combination

Since many client pages contain the same form, generating input data repeatedly for identical forms wastes space. To improve the efficiency and consistency of SUT model recovery for later test case generation, identical forms should be combined. Two forms in a Web application can be exactly the same, inclusive, basically the same, or different. Following the definition in Figure 1, consider two forms f1 and f2 with f1.action=a1, f1.method=m1, f1.enctype=e1 and f2.action=a2, f2.method=m2, f2.enctype=e2. The input points belonging to f1, including inputs, selects, textareas and buttons, form the set IP1, while the input points belonging to f2 form the set IP2.

f1 and f2 are exactly the same iff (a1=a2)∧(m1=m2)∧(e1=e2)∧(IP1=IP2);
f1 includes f2 iff (a1=a2)∧(m1=m2)∧(e1=e2)∧(IP2⊂IP1);
f1 and f2 are basically the same iff (a1=a2)∧(m1=m2)∧(e1=e2)∧(IP1≠IP2);
f1 and f2 are different iff (a1≠a2)∨(m1≠m2)∨(e1≠e2).

During navigation model generation, the relation between forms should be identified, and the ‘cStatus’ attribute of each form is assigned one of: exclusive, same as fi, included in fi, or basically same as fi. Forms whose ‘cStatus’ is exclusive or basically same as fi should be submitted with input data when expanding the SUT model. Furthermore, before generating test cases for the WA-UIV, the included forms and the basically same forms should be combined: if they have the same action, method and enctype, they are merged into one form whose input points are the union of their input points, and test case generation then focuses only on the combined forms. Take the three forms in Figure 3 as an example: both box1 and box2 include box0, while box1 is exactly the same as box2. Therefore box0, box1 and box2 should be combined into one form before generating test cases. In this case, the combined form could be either box1 or box2.

Figure 3. Examples of HTML forms
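The form relations and the merge step of this section can be sketched in Python as follows. This is a minimal illustration under assumptions: the `Form` class, the `relation` and `combine` functions, and the representation of input points as a set of names are all hypothetical, and inclusion is checked before "basically same" because the condition IP1≠IP2 would otherwise also cover the inclusive case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    action: str
    method: str
    enctype: str
    input_points: frozenset  # assumed: names of inputs, selects, textareas, buttons


def relation(f1, f2):
    """Classify two forms following the four cases of Section 2.3."""
    if (f1.action, f1.method, f1.enctype) != (f2.action, f2.method, f2.enctype):
        return "different"
    if f1.input_points == f2.input_points:
        return "exactly same"
    if f2.input_points < f1.input_points:  # proper subset
        return "f1 includes f2"
    if f1.input_points < f2.input_points:
        return "f2 includes f1"
    return "basically same"


def combine(forms):
    """Merge forms sharing action, method and enctype into one form each.

    The merged form carries the union of the input points of its members.
    """
    merged = {}
    for f in forms:
        key = (f.action, f.method, f.enctype)
        merged[key] = merged.get(key, frozenset()) | f.input_points
    return [Form(a, m, e, ips) for (a, m, e), ips in merged.items()]


if __name__ == "__main__":
    # Hypothetical stand-ins for box0, box1 and box2 of Figure 3.
    box0 = Form("/search", "get", "", frozenset({"q"}))
    box1 = Form("/search", "get", "", frozenset({"q", "lang"}))
    box2 = Form("/search", "get", "", frozenset({"q", "lang"}))
    print(relation(box1, box0), "|", relation(box1, box2))
    print(combine([box0, box1, box2]))
```

With these stand-in forms the three boxes collapse into a single form whose input points equal those of box1, which matches the outcome described for Figure 3.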

3. Implementation

The model-driven WA-UIV testing framework is implemented on top of the Eclipse Modeling Framework, which helps turn models rapidly into efficient and customizable Java code. This lets us focus on the definition of the meta-model and on how to collect model information from the HTML source code. First, the HTML page is crawled by its URL, and the HTML analyzer generates a DOM tree for it. Then the model elements defined in Figure 1 are extracted and sent to the SUT model generator, which is based on the Eclipse Modeling Framework. At the same time, the form filter extracts the ‘new input points’ and the nodes surrounding them from the DOM tree. The input description abstractor identifies the ‘descText’ of each input point based on the priority weights given in Table 1. Figure 4 presents the pseudocode of how the input description abstractor identifies the ‘descText’ before an input point. Since font information may be inserted in the text around an input point, the process of text identification is recursive. Once the ‘descText’ values are identified by the input description abstractor, they are compared with the pre-stored texts in a topic base, which implements the topic model, to identify the purpose of the input points.

The ‘descText’ of an input point may match several topics. In this situation, the topic is chosen in the following order: the topic matching the longest string in the ‘descText’; the topic appearing the most times in the ‘descText’; the topic with the maximum weight. Furthermore, if no topic can be obtained from the ‘descText’, or the input description abstractor fails to get the ‘descText’, the topic is identified from other attributes of the input point, such as its name, id and value. Once the purposes of the input points are identified, input data are chosen for them and submitted to the Web application to obtain the returned pages, which are used to expand the navigation model. Meanwhile, the topic base is expanded: the option texts contained in a select list are added under the topic of that select list.
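The topic selection order described above can be sketched as a lexicographic comparison. The following Python fragment is a minimal illustration under assumptions: the `Topic` class and the functions `choose_topic` and `identify_purpose` are hypothetical names, and matching is assumed to be case-insensitive substring matching, which the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    name: str
    words: tuple     # words or phrases that may describe the topic
    weight: int = 0  # tie-breaker of last resort


def choose_topic(desc_text, topics):
    """Pick a topic by: longest matched string, then match count, then weight."""
    text = desc_text.lower()
    best, best_key = None, None
    for topic in topics:
        hits = [w.lower() for w in topic.words if w.lower() in text]
        if not hits:
            continue
        key = (
            max(len(w) for w in hits),        # longest matching string
            sum(text.count(w) for w in hits), # number of occurrences
            topic.weight,                     # maximum weight
        )
        if best_key is None or key > best_key:
            best, best_key = topic, key
    return best


def identify_purpose(desc_text, attributes, topics):
    """Try the descText first, then the name, id and value attributes."""
    sources = (desc_text, attributes.get("name"),
               attributes.get("id"), attributes.get("value"))
    for source in sources:
        if source:
            topic = choose_topic(source, topics)
            if topic is not None:
                return topic.name
    return None


if __name__ == "__main__":
    topic_base = [
        Topic("email", ("email", "e-mail address"), weight=1),
        Topic("address", ("address",), weight=2),
    ]
    print(identify_purpose("E-mail address:", {}, topic_base))
    print(identify_purpose("", {"name": "user_email"}, topic_base))
```

Comparing tuples keeps the three criteria in strict priority order: a later criterion only matters when all earlier ones tie.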

A snapshot of our framework generating the SUT model is presented in Figure 5. The navigation model of the SUT is depicted in the main view, while an outline view is on the right sidebar. The basic and context information of input points is presented in the WA-UIV test view. In the future, the heuristic information for WA-UIV test case generation will also be shown in this view; automatically generated test cases will be displayed in another view, and users can edit them as sequence models.

Figure 4. Pseudocode of the input description abstractor

Figure 5. A snapshot of the model-driven WA-UIV testing framework at work

4. Experimental Results

To evaluate the proposed approach to model information collection, we conducted a number of experiments on the seven websites listed in Table 2. The first column of Table 2 lists the websites we tested. The third column, input points, counts input boxes, radio buttons, check boxes, select lists, text areas and buttons. Their description texts appear at varying visual positions: left, top, right, bottom, or even inside the input point. The fourth column, text boxes, counts input boxes and text areas. Because these input points are the ones whose purposes actually need to be identified, we list them separately. The fifth column is the ratio of correctly extracted input points to the total number of input points. The sixth column, the percentage of identified text of text boxes, is the ratio of the number of text boxes whose text is correctly identified to the total number of text boxes. The last column lists the ratio of text boxes whose purpose is correctly identified to the number of text boxes whose texts are correctly identified.

Table 2. Input point information extraction statistics

Website                  Tested  Input   Text   Extracted input  Identified text of  Right
                         forms   points  boxes  points (%)       text boxes (%)      purpose (%)
www.baidu.com            10      112     17     100              90.00               100
www.taobao.com           34      569     62     100              97.51               100
www.edu.cn               23      127     41     100              100                 100
lib.buaa.edu.cn          28      174     40     97.75            100                 99.26
www.olympic.org          23      147     25     87.83            100                 100
milktea12345.bokee.com   11      95      30     97.50            97.14               98.96
MyForum (in our lab)     14      201     36     100              90.91               99.31
Total/average            143     1425    251    97.58            96.51               99.65

After form combination, we concentrate on the pages with different forms. The numbers of input points and text boxes in these forms were collected through Firefox. In total, 143 forms with 1425 input points were analyzed. The percentages below are averages over the seven websites: 97.58% of the input points are correctly extracted with their basic HTML information. The input points whose extraction failed are written in client scripts. There are 251 text boxes in the tested forms, and 96.51% of their ‘descText’ values are successfully identified. Finally, 99.65% of the text boxes with a correct description text are assigned the right topic. These statistics show that our lightweight method of description text identification for input points performs well in practice, and that the topic base is effective for topic generation. In fact, the more pages the framework parses, the more accurately the purposes of text boxes can be recognized, because the topic base is expanded during modeling. Most of the text boxes whose ‘descText’ could not be obtained either have no description text around them, or have description texts on both sides at the same distance.

During the experimental study, we found that input tags of type hidden appeared most frequently. Such input points are easily overlooked in manual testing, yet important system variables are often transferred through them, so they are likely targets for attackers. The model-driven WA-UIV testing framework can detect these input points completely, and they should be given more attention during test case generation than they generally receive.
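The Total/average row of Table 2 is consistent with column sums for the three count columns and unweighted means of the per-site percentages. The short Python sketch below recomputes it from the seven data rows; the variable names are illustrative only.

```python
# Rows of Table 2: (website, tested forms, input points, text boxes,
#                   extracted input points %, identified text %, right purpose %)
rows = [
    ("www.baidu.com",          10, 112, 17, 100.00,  90.00, 100.00),
    ("www.taobao.com",         34, 569, 62, 100.00,  97.51, 100.00),
    ("www.edu.cn",             23, 127, 41, 100.00, 100.00, 100.00),
    ("lib.buaa.edu.cn",        28, 174, 40,  97.75, 100.00,  99.26),
    ("www.olympic.org",        23, 147, 25,  87.83, 100.00, 100.00),
    ("milktea12345.bokee.com", 11,  95, 30,  97.50,  97.14,  98.96),
    ("MyForum (in our lab)",   14, 201, 36, 100.00,  90.91,  99.31),
]

# Count columns are summed over the websites.
totals = [sum(row[i] for row in rows) for i in (1, 2, 3)]

# Percentage columns are averaged per website (unweighted mean).
means = [round(sum(row[i] for row in rows) / len(rows), 2) for i in (4, 5, 6)]

print(totals)  # [143, 1425, 251]
print(means)   # [97.58, 96.51, 99.65]
```

Because the means are unweighted, a small site such as www.baidu.com influences the average as much as a large one such as www.taobao.com.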

5. Related Work

Since user input validation testing is essential for any software that deals with input from its external environment [3], several techniques specifically for WA-UIV testing have been proposed. Liu and Tan [3] tried to verify and generate test cases for the WA-UIV from program source code. In their case studies, the prototype system found that 34 of 150 programs (22.67%) did not have an input validation feature. Such experimental results are encouraging; however, many technical problems remain before the approach can be applied in practice. For example, it is not easy to generate the control flow diagram of a Web application automatically, because the control flow is usually implemented in several classes that reside in different files or even different packages, and in different languages such as scripts on Web pages.

Offutt et al. [5] developed a strategy called bypass testing to create client-side tests for Web applications that intentionally violate explicit and implicit checks on user inputs. However, they only described the bypass testing strategy and defined specific rules and adequacy criteria for tests; they did not give an approach to purpose identification. To apply their rules to test the WA-UIV automatically, the purpose of input points must be identified first.

Raghavan and Garcia-Molina [11] proposed a layout-based information extraction technique and demonstrated its use in automatically extracting semantic information from search forms and response pages. Compared with the layout-based information extraction technique, our method of description text extraction is more lightweight. Furthermore, unlike theirs, our goal is not to dig out the tremendous amount of content ‘hidden’ behind search forms, but to correctly identify all data entry points and their purposes.

Since user input validation functions are distributed all over Web applications, testing them is easily overlooked and laborious. Introducing the model-driven testing methodology to WA-UIV testing assists testers in recovering input points and presents the structure of the Web application visually, reduces testing time by reusing common test functions, separates the testing logic from the test implementation, and enables developers to focus on designing good tests specific to their applications while relying on the toolset's test execution environment to solve problems related to test execution.

Recently many researchers have studied how to assure the quality of Web applications with model-based methods. Most of them focus on development modeling [13][14][15], while some others study test models in particular [9][16][10]. For example, Di Lucca et al. [9][17] exploit a UML model of a Web application as a test model and propose a definition of the unit level for testing the Web application. Based on the model, they propose a method for testing the single units of a Web application and for integration testing. Ricca and Tonella [10] also propose a UML model of Web applications for their high-level representation. Such a model drives Web application testing, in that it can be exploited to define white-box testing criteria. Compared with the model defined by Di Lucca et al., their model aims at explicitly representing user navigation.

Compared with previous studies of Web application modeling, which capture both structural and behavioral test artifacts of Web applications, the definition of our SUT model is more concerned with WA-UIV functions and details the information relevant to WA-UIV. In addition, such a model is easier to implement in projects, because its definition satisfies the requirements of building the SUT model from the client side.

6. Conclusions and Future Work

This paper proposed a lightweight and effective method to create the SUT model for WA-UIV testing. The meta-model is given and, compared with other models defined for Web application testing, it focuses more on user-input-related information. In order to build such a model automatically and efficiently, we explained in detail how to identify the purpose of each input point and how to avoid submitting the same forms repeatedly. The experimental results show that most HTML input points can be identified together with their purposes. According to the experience of Huang et al. [1], combining a JavaScript parser with heuristic data allows more input points to be identified. Analysis of JavaScript can also help to identify cookie information. Future effort will go into an effective JavaScript parser.

In addition to the SUT model, a validation rule base, a test data pool and an oracle are needed to generate test cases. With these artifacts, test cases will be generated for WA-UIV testing at three levels: test each input point with invalid input data; violate the constraints among different parameters of a form; submit different forms in invalid sequences. Because the SUT model not only contains the basic information of input points but also indicates their purposes, input domains and valid inputs, some test data can be generated with the help of the association between the SUT model and the validation rule base.

In the future, once the model-driven WA-UIV testing framework is completed, we shall validate its efficiency by testing Web applications with input vulnerabilities. Testers will be able to use the framework to analyze Web applications and extract their navigation models. While guiding testers to where input vulnerabilities probably exist and to which security test patterns could be instantiated, the framework will automatically generate test cases at the three levels described above. Furthermore, the framework supports the execution of automatic or customized test cases.

References

[1] Y. W. Huang, F. Yu, C. Hang, C. H. Tsai, et al. “Securing Web Application Code by Static Analysis and Runtime Protection”, Proceedings of the 13th International Conference on World Wide Web, ACM Press, New York, NY, USA, 2004, pp. 40-52.
[2] OWASP Top Ten Project. http://www.owasp.org/index.php/. November 21, 2006/January 11, 2007.
[3] H. Liu, H. B. K. Tan. “Automated Verification and Test Case Generation for Input Validation”, Proceedings of the 1st International Workshop on Automation of Software Test, at 28th International Conference on Software Engineering, ACM Press, Shanghai, China, 2006, pp. 9-14.
[4] Y. W. Huang, S. K. Huang, T. P. Lin, C. H. Tsai. “Web Application Security Assessment by Fault Injection and Behavior Monitoring”, Proceedings of the 12th International Conference on World Wide Web, ACM Press, Budapest, Hungary, 2003, pp. 148-159.
[5] J. Offutt, Y. Wu, X. C. Du, H. Huang. “Bypass Testing of Web Applications”, Proceedings of the 15th International Symposium on Software Reliability Engineering, IEEE Computer Society, Saint-Malo, Bretagne, France, 2004, pp. 187-197.
[6] C. Weber. Testing Your Web Applications for Cross Site Scripting Vulnerabilities. http://www.microsoft.com/technet/community/columns/secmvp/sv0505.mspx. May 6, 2005/January 11, 2007.
[7] Model Driven Testing Tools Whitepaper. IBM Haifa Research Laboratory. 2003-07-31. http://www.haifa.ibm.com/projects/verification/mdt/papers/MDTWhitePaper.pdf
[8] N. Li, Q. Q. Ma, J. Wu, M. Z. Jin, C. Liu. “A Framework of Model-Driven Web Application Testing”, Proceedings of the 30th Annual International Computer Software and Applications Conference, Volume 2, IEEE Computer Society, Washington, DC, USA, 2006, pp. 157-162.
[9] G. A. Di Lucca, A. R. Fasolino, F. Faralli, and U. De Carlini. “Testing Web Applications”, Proceedings of the 18th International Conference on Software Maintenance, Maintaining Distributed Heterogeneous Systems, IEEE Computer Society, Montreal, 2002, pp. 310-319.
[10] F. Ricca and P. Tonella. “Analysis and Testing of Web Applications”, Proceedings of the 23rd IEEE International Conference on Software Engineering, IEEE Computer Society, Toronto, Ontario, 2001, pp. 25-34.
[11] S. Raghavan, H. Garcia-Molina. “Crawling the Hidden Web”, Proceedings of the 27th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 129-138.
[12] UML 2.0 Testing Profile Specification. OMG Adopted Specification. 2003-08-03.
[13] J. Conallen. “Modeling Web Application Architectures with UML”, Communications of the ACM, Vol. 42, No. 10, 1999, pp. 63-70.
[14] L. Baresi, F. Garzotto, and P. Paolini. “Extending UML for Modeling Web Applications”, Proceedings of the 34th Annual Hawaii International Conference on System Sciences, IEEE Computer Society, Maui, Hawaii, 2001.
[15] J. Li, J. Chen, and P. Chen. “Modeling Web Application Architecture with UML”, Proceedings of the 36th International Conference on Technology of Object-Oriented Languages and Systems, IEEE Computer Society, Xi’an, 2000, pp. 265-274.
[16] D. C. Kung, C. H. Liu, P. Hsia. “An object-oriented web test model for testing Web applications”, Proceedings of the 1st Asia-Pacific Conference on Quality Software, IEEE Computer Society, Hong Kong, 2000, pp. 111-120.
[17] G. A. Di Lucca, A. R. Fasolino, F. Faralli, and U. De Carlini. “WARE: a tool for the Reverse Engineering of Web Applications”, Proceedings of the Sixth European Conference on Software Maintenance and Reengineering, IEEE Computer Society, Budapest, 2002, pp. 241-250.
