A Solution for Comparison based Conversion of XML Documents

Agustinus Tedja
Telematik Department, Technical University Hamburg, Germany

Muhammad Farhat Kaleem
Telematik Department, Technical University Hamburg, Germany

Abstract - Business communities have paid ample attention to the use of XML in information exchange. The XML specification leaves the interpretation of the data to the applications that read it. As a result, each company or institution can define its own XML structure for its data, even though it may need to share that data with other companies. In certain business scenarios, information exchange between companies doing business together therefore requires a converter that can understand different XML structures and convert from one to the other. This paper proposes a generic converter that can cope with a generic XML structure and efficiently produce a conversion mapping. The paper describes the creation of such a converter based on the XML comparison principle. The conversion mapping is realized by generating an XSLT script that is executed by an XSLT processor. The development challenges, the capabilities and limitations of the converter, and the assumptions behind the principles applied are described as well.

Keywords: XML, comparison, conversion, generator, XSLT

1 Introduction

The interaction between business communities on the World Wide Web requires mutual information exchange. The simplicity of XML [15] makes it attractive for application to a business solution [1]. The XML specification defines only the general syntax, and it is up to the user to express the business needs and the constraints in the XML format. Although many companies operate in the same business field, each may have a different XML structure definition due to the liberal extensibility of XML (as implied by its name). A further issue occurs when several business entities with similar activities establish mutual cooperation and exchange information using XML messages. The extensibility property of XML [14] can become a problem when two or more companies exchange the same information but use different XML structures. Structure difference in this context covers differences in tag names, hierarchy and position, as well as in language.

An example is XML messaging between automotive companies. One company, let us say company A, from England, establishes a cooperation and sends information about a car type in a structure like:

<carType>BMW</carType>

to company B in Germany, which uses the German language in its XML messages. Company B may use the following tags to provide the same information:

<AutoTyp>BMW</AutoTyp>

Although both companies want to express the same information, they use different XML structures. The tag names depend on the language prevalent in these countries. However, can the machine conclude that the element <carType> is the same as <AutoTyp>? XML specifies neither semantics nor a tag set. Since there is no predefined tag set, we cannot assume preconceived semantics. All of the semantics of an XML document are defined either by the applications that process it or by style sheets. XML does not deliver any semantics description at all [13].

This paper describes research work inspired by a real case study of the Europcar online reservation system in Hamburg, Germany [9]. A client company, i:FAO, has its system connected to Europcar. i:FAO acts as a direct mediator to the user, with Europcar providing the main database of cars. However, Europcar plans to extend its system to different client companies in the future. The data exchange is intended to be in XML. This would require significant effort to create new conversion code to and from new client companies that have their own XML definitions. It would be practical to have a smart converter that can recognize incoming XML messages that were previously not known to the Europcar system. The converter should know which information items it needs to translate and should also be able to detect any structure changes in the XML message. Thus it may semi-automatically generate the mapping structure for that conversion.

2 Approach

This paper does not cover all issues of the case study; instead it discusses the approach used for building an XML converter that addresses the problem defined above. As mentioned before, XML specifies neither semantics nor a tag set. Each business may define its own tag names and structure. However, for businesses carrying out the same activity, we may assume that their XML messages convey the same information content. Since there is no predefined tag set, there cannot be any preconceived semantics. The question now is how to deliver the semantics information in different XML messages. Each time an XML message comes in, a mediator must convert the structure and move the information content accordingly. Since an XML document is a collection of text in natural language form, the best mediator that can understand the semantics is a human being. Another simpler and more practical way to deliver the semantics is to use XML content comparison [12].

An XML document consists of two main parts: structure and content. Both of these can be the object of comparison. The following three semantics of a comparison are conceivable. Two XML documents are considered equal if:

1. They do not differ in both structure and content.
   <carType>BMW</carType> = <carType>BMW</carType>

2. They do not differ in structure, although their contents are different.
   <carType>BMW</carType> = <carType>Audi</carType>

3. They do not differ in content, although their structures are different.
   <carType>BMW</carType> = <AutoTyp>BMW</AutoTyp>

The first category is a very strict equality principle. It is suitable for checking the correctness of an XML document, e.g. a security configuration file. The second category is useful for checking the XML format and validity, regardless of the content value. The third category is suitable for the requirements we have previously identified. It allows the system to know, based on the equality of the content value, that different tag names actually express the same meaning. It is the third category that is used for the solution described in this paper.

As a side note, it has been shown [6] that XML comparison also requires caution. White space, entity references and the position of data in an element or attribute are several aspects that must be taken care of.
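The three comparison semantics can be illustrated with a minimal Java sketch for a single leaf item. The class `ComparisonSemantics`, the type `Item` and the three method names are hypothetical and chosen only for illustration; the sketch reduces an item to a tag name (structure) and a text value (content), which is far simpler than a full document comparison.

```java
// Sketch only: Item and the method names below are illustrative assumptions.
final class ComparisonSemantics {

    /** A leaf information item reduced to its tag name (structure) and text (content). */
    static final class Item {
        final String tag;
        final String content;

        Item(String tag, String content) {
            this.tag = tag;
            this.content = content;
        }
    }

    /** Category 1: neither structure nor content differs. */
    static boolean strictlyEqual(Item a, Item b) {
        return a.tag.equals(b.tag) && a.content.equals(b.content);
    }

    /** Category 2: the structure is the same, the content is ignored. */
    static boolean structurallyEqual(Item a, Item b) {
        return a.tag.equals(b.tag);
    }

    /** Category 3: the content is the same, the structure is ignored. */
    static boolean contentEqual(Item a, Item b) {
        return a.content.equals(b.content);
    }

    public static void main(String[] args) {
        Item english = new Item("carType", "BMW");
        Item german = new Item("AutoTyp", "BMW");
        System.out.println(strictlyEqual(english, german));     // false
        System.out.println(structurallyEqual(english, german)); // false
        System.out.println(contentEqual(english, german));      // true
    }
}
```

Only the third predicate treats the English and the German item as equal, which is why it is the one suited to recognizing differently named tags that carry the same information.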

3 The Architecture

The components used in the proposed solution are described in this section. The first task is to parse the XML document that needs to be converted. This XML document is provided by the source company. The source company is the one that sends a request to obtain certain information. In the Europcar case study, i:FAO is the source company. Hereafter its document is called the XML source. The company that receives the converted messages is called the destination company and its XML document is called the XML destination.

For a certain relationship (e.g. a certain request) between XML source and XML destination, we need both XML documents as references. The XML references represent their respective XML structures. The nodes to be converted must contain the same content value in both references. The reference documents are used only in the recognition phase, which happens once for each unique relationship. For example in Figure 1, the XML source reference contains car information in English and the destination reference in German. The element information item carType would be translated into AutoTyp, carSeries into AutoModel and the attribute number into zahl. The respective words in source and destination have the same meaning semantically.

The third criterion of XML comparison used here is based on equality of content. Two different tag names are considered equal, in other words they have the same meaning, if the contents between the tags are the same. The destination company publishes its XML reference to tell the source companies how their tag names map to the ones required by the destination. This mapping is determined by the content value. The XML references at both source and destination declare the message structure (the tag names and node hierarchy) as well as the content. For the sake of simplicity and quick readability, the content value may be defined to be the same as the tag name but written in upper case. For example, in the element <AutoTyp>, the content value is AUTOTYP. The source company can look at this published reference and decide which elements or attributes must be converted. In addition, to avoid tag name clashes among many XML applications, namespaces may be declared as well.

Figure 1. XML References

Figure 2. Overview of conversion process using XML comparison

When the XML references from source and destination are available, the recognition phase can start. The diagram in Figure 2 shows the overall process. First we need to read the XML source reference. We choose the SAX 2.0 API [18] to parse XML because it provides instant processing of the nodes we are interested in. In most uses, the data in XML is stored only in attributes or elements. At this time, no entity is allowed to be present, because entity handling needs additional effort to check the external reference. Since we concentrate on attributes and elements only, the appropriate SAX event can be called back as soon as it occurs and the content can be extracted immediately. The parser reads the XML source reference and evaluates certain XML constructs, such as the start of the document, the start of an element, the character data within an element, the end of an element and the end of the document. When an attribute value or an element content is found, it is recorded along with its position expressed in XPath style.

When the parser has finished parsing the entire source reference, the XML destination reference can be parsed. The SAX API is used here as well. Shortly before this parsing, however, an empty XSLT [11, 16] tree is built using the JDOM API [4, 7]. The SAX API traverses the XML destination structure, checking the elements and attributes one by one. If there is an element without a text value or with white space only, the element is added as such to the XSLT tree. If SAX finds an attribute value or element content, it is compared to the source information. If the content is equal, an XSLT copy instruction is inserted into the XSLT tree. This way the XSLT document is built and modified on the fly. The process continues until the end of the document. Finally we have a complete XSLT document, which may be saved persistently, written to a stream or kept as a JDOM tree in memory.

Once an XSLT document has been built for a unique relationship, it can be used repeatedly as long as the XML structures of both source and destination do not change. This usage is called the routine phase. For example in Figure 3, the source XML message contains information about a Z8 BMW car in English. With the help of the corresponding XSLT document, the processor converts it into the destination XML structure in German. When the source or destination XML structure changes, the recognition process must be repeated to generate a new XSLT document.
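The recognition phase can be sketched with two SAX handlers from the JDK. This is a simplified illustration under stated assumptions, not the converter described in this paper: it assembles the XSLT as text instead of a JDOM tree, it assumes `xsl:value-of` and attribute value templates as the copy instructions, it drops unmatched destination attributes and text, and it ignores enumerated items and namespaces. The class and method names and the root element `Auto` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RecognitionSketch {

    /** Pass 1: record each attribute value and element text with its XPath-style location. */
    static final class SourceRecorder extends DefaultHandler {
        final Map<String, String> pathByContent = new HashMap<>();
        private final List<String> open = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            open.add(qName);
            text.setLength(0);
            for (int i = 0; i < atts.getLength(); i++) {
                if (!atts.getValue(i).isEmpty()) {
                    pathByContent.put(atts.getValue(i), "/" + String.join("/", open) + "/@" + atts.getQName(i));
                }
            }
        }
        @Override public void characters(char[] ch, int start, int length) { text.append(ch, start, length); }
        @Override public void endElement(String uri, String local, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                pathByContent.put(value, "/" + String.join("/", open));
            }
            text.setLength(0);
            open.remove(open.size() - 1);
        }
    }

    /** Pass 2: mirror the destination structure; where content matches, select the source path. */
    static final class XsltBuilder extends DefaultHandler {
        final StringBuilder xslt = new StringBuilder();
        private final Map<String, String> source;
        private final StringBuilder text = new StringBuilder();

        XsltBuilder(Map<String, String> source) { this.source = source; }

        @Override public void startDocument() {
            xslt.append("<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">")
                .append("<xsl:template match=\"/\">");
        }
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0);
            xslt.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                String from = source.get(atts.getValue(i));
                if (from != null) { // matched attribute becomes an attribute value template
                    xslt.append(' ').append(atts.getQName(i)).append("=\"{").append(from).append("}\"");
                }
            }
            xslt.append('>');
        }
        @Override public void characters(char[] ch, int start, int length) { text.append(ch, start, length); }
        @Override public void endElement(String uri, String local, String qName) {
            String from = source.get(text.toString().trim());
            if (from != null) { // matched element content is copied from the source location
                xslt.append("<xsl:value-of select=\"").append(from).append("\"/>");
            }
            text.setLength(0);
            xslt.append("</").append(qName).append('>');
        }
        @Override public void endDocument() { xslt.append("</xsl:template></xsl:stylesheet>"); }
    }

    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML reference", e);
        }
    }

    static String generateXslt(String sourceReference, String destinationReference) {
        SourceRecorder recorder = new SourceRecorder();
        parse(sourceReference, recorder);
        XsltBuilder builder = new XsltBuilder(recorder.pathByContent);
        parse(destinationReference, builder);
        return builder.xslt.toString();
    }

    public static void main(String[] args) {
        String src = "<car number=\"ZAHL\"><carType>AUTOTYP</carType><carSeries>AUTOMODEL</carSeries></car>";
        String dst = "<Auto zahl=\"ZAHL\"><AutoTyp>AUTOTYP</AutoTyp><AutoModel>AUTOMODEL</AutoModel></Auto>";
        System.out.println(generateXslt(src, dst));
    }
}
```

The two reference strings follow the convention that the content value is the destination tag name in upper case. Because the map is keyed by content value, two source nodes with the same content would overwrite each other, which is one reason the references need distinct content values for distinct locations.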

Figure 3. XML conversion with XSLT

A problem that may occur silently is that the XSLT processor does not throw an exception when the XML source structure is wrong. The converter continues to work and generates an empty converted XML document. A structure checker can be added before the XSLT processor to catch this problem. If the incoming XML structure conforms to the previously known source information, it is passed on to the XSLT processor. If, however, the structure is not as expected, the application halts and asks for a new XML source reference in order to generate modified XSLT instructions. This XML source checker simply compares the XPath of each attribute value or element content found in the XML source with the path locations stored previously.
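A structure check placed in front of the XSLT processor can be sketched as follows, using only JDK classes. The names `CheckedConversion`, `contentPaths` and `convert` are hypothetical, as is the root element `Auto` in the hard-coded stylesheet. The sketch assumes that "conforms" means the incoming message has exactly the same set of content paths as the source reference; a more tolerant rule is equally conceivable.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CheckedConversion {

    /** Collects the XPath-style location of every attribute and every non-blank element text. */
    static Set<String> contentPaths(String xml) {
        final Set<String> paths = new TreeSet<>();
        final List<String> open = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                open.add(qName);
                text.setLength(0);
                for (int i = 0; i < atts.getLength(); i++) {
                    paths.add("/" + String.join("/", open) + "/@" + atts.getQName(i));
                }
            }
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (!text.toString().trim().isEmpty()) {
                    paths.add("/" + String.join("/", open));
                }
                text.setLength(0);
                open.remove(open.size() - 1);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Message is not well-formed XML", e);
        }
        return paths;
    }

    /** Runs the XSLT only if the message has exactly the content paths of the source reference. */
    static String convert(Set<String> knownPaths, String xslt, String message) {
        if (!contentPaths(message).equals(knownPaths)) {
            throw new IllegalStateException("Source structure changed; a new XML source reference is required");
        }
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer(new StreamSource(new StringReader(xslt)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(message)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT processing failed", e);
        }
    }

    public static void main(String[] args) {
        String reference = "<car number=\"ZAHL\"><carType>AUTOTYP</carType></car>";
        String xslt = "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:template match=\"/\"><Auto zahl=\"{/car/@number}\"><AutoTyp>"
            + "<xsl:value-of select=\"/car/carType\"/></AutoTyp></Auto></xsl:template></xsl:stylesheet>";
        Set<String> known = contentPaths(reference);
        System.out.println(convert(known, xslt, "<car number=\"1\"><carType>BMW</carType></car>"));
        try {
            convert(known, xslt, "<vehicle><type>BMW</type></vehicle>");
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Without the check, the second message would pass through the stylesheet and yield destination elements with empty values, which is the silent failure described above.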

Figure 4. XML conversion with XSLT plus structure checker

4 Capabilities

• The converter described in the previous section can convert data in the following formats (from the position in the source document to the position in the destination document):
  1. From attribute to attribute
  2. From element to element (element content BMW ⇒ element content BMW)
  3. From attribute to element (attribute value ⇒ element content 1)
  4. From element to attribute (element content BMW ⇒ attribute value)
• The converter can detect any change in the incoming XML document that does not conform to the previously stored structure.
• A number of international encodings are supported by the parser Xerces-J, which has been used in this case. It can recognize international characters such as ä, ü and ö in the German language, as long as the encoding is declared in the XML document.
• Single item and enumerated items conversion.
  Single item: BMW ⇒ BMW
  Enumerated items (one level of nesting only and no attributes): BMW Mercedes Audi ⇒ BMW Mercedes Audi
• Namespace support (SAX 2.0 and DOM 2.0) for single element items, enumerated items and attributes.

5 Assumptions and Limitations

There are several assumptions for the converter to be valid:

• XML documents must be well-formed; otherwise the parser throws a SAX exception.
• All XML content is simple text and is therefore treated as a Java String object. There are neither numeric types (integer, float, etc.) nor date/time types.
• There is no concatenation or change of values, e.g. concatenation of several values into a single new value, change of date format (yyyymmdd into dd.mm.yyyy) or currency change (from DM to Euro).
• The XML reference must be exhaustive. The converter does not need a DTD, so the constraints are specified in the XML reference itself. The main issue in the conversion is the transfer of semantics. Although a DTD is useful for validating XML, it does not deliver the semantics information.
• The common content between source and destination XML must be exactly the same. This includes capitalization, given that XML is case sensitive.
• The names of elements and attributes must be unique, except where they express enumerated items. To specify enumerated items, the desired element is written more than once in the XML source reference. Whether the corresponding element in the XML destination reference is written once or more often depends on the enumeration process performed by the converter. However, when an element is written twice or more in the XML destination reference, the source company must agree to keep the elements as an enumeration.
• The same attribute name can be used multiple times in different elements, but the content value (used for the comparison process) must be different to indicate different location paths. For example:

  …

  The second number attribute is considered illegal, because in XSLT the path still refers to the first number attribute. To indicate that the second one is different from the first one, the content must be different too. For example:

  …

  The first number attribute now refers to car/@number and the second one refers to featureList/@number. Each now refers to a different location path.
• The application requires Java 1.2 because the application and JDOM rely on the Java collections framework.
• The path syntax follows the XPath rules because we use XSLT for the actual transformation.
• The default namespace definition is not allowed, so as to avoid confusion with the namespaces used in the XSLT document.

Several limitations exist which may prevent the application from working properly:

• No entity references are allowed, because they need additional treatment to check the external references.
• No DTD is needed; consequently there is no DTD validation. Instead, validation is done in the XML structure check.
• An element structure using the OR occurrence indicator (e.g. a|b|c) as declared in a DTD cannot be implemented, because the XML reference must be exhaustive.
• Mixed content elements are not allowed because they will not be processed. It is assumed that most XML documents used in the messaging scenario do not contain mixed content. Once an element has a text value, no more siblings are allowed, and once an element has a child element, there must not be a text value as a sibling. Therefore, the position of the element content (in the XML reference) must have no white space. The position in the XML message during the routine phase may include white space.

  FÄHIGKEIT

  is correct but

  FÄHIGKEIT

  is considered mixed content and is not processed.
• For XML enumerated data, only a single level of nesting is allowed.
• No attribute is allowed inside the enumerated elements; otherwise it is not processed by the XSLT for-each instruction.
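The mixed content restriction can be verified mechanically before a reference is accepted. The following Java sketch uses the JDK DOM parser; the class `MixedContentCheck` and its method names are hypothetical, and the check is an illustration of the rule stated above (an element may not have both a text value and child elements), not part of the converter described in this paper.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MixedContentCheck {

    /** True if this element, or any descendant, has both non-blank text and child elements. */
    static boolean elementHasMixedContent(Node element) {
        boolean hasText = false;
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                hasChildElement = true;
                if (elementHasMixedContent(child)) {
                    return true;
                }
            } else if ((type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE)
                    && !child.getNodeValue().trim().isEmpty()) {
                hasText = true;
            }
        }
        return hasText && hasChildElement;
    }

    /** Parses the document and checks it from the root element downwards. */
    static boolean hasMixedContent(String xml) {
        try {
            return elementHasMixedContent(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Document is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasMixedContent("<car><carType>BMW</carType></car>"));       // false
        System.out.println(hasMixedContent("<car>sports<carType>BMW</carType></car>")); // true
    }
}
```

White space between child elements is ignored by the check, so indentation alone does not make a document count as mixed content.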

6 Related Work

There have been several efforts to convert one XML structure into another. One of them was described by Hiroshi Maruyama, Kent Tamura and Naohiko Uramoto from the IBM Research Laboratory in Tokyo, Japan [10]. It is called the LMX (Language for Mapping XML) processor. LMX is an XML transformation package written in Java that works by DOM tree manipulation. It describes a mapping between two sets of XML documents that are logically similar but syntactically different. The main limitation is that it can only translate data stored as element content, not as attribute values. Nevertheless, it can handle enumerated items.

Another solution has been developed by AlphaWorks Community Exchange, also supported by IBM. It is called XML Translator Generator (XtransGen) [2]. The entire conversion process is performed in Java. The basic concept of this program is XML comparison. According to the overview page of the downloaded package, translation of repeated/enumerated elements and support for XSLT are still planned. XSLT is considered for future support because it allows more types of translation than XtransGen. The limitation of this tool is that it cannot translate enumerated items; however, it can handle mixed content XML.

7 Conclusion

The current version can convert information items [8] in XML 1.0 that are positioned in attributes and elements, as single items or enumerated items. For enumerated items, only one level of nesting is allowed. The enumerated items are defined by the source XML reference and also by the destination reference. One XSLT document is valid for only one particular relationship between source and destination XML. A further expansion of this project may include a package organization that allows it to be integrated into a web application (e.g. in a servlet), and the inclusion of entity references.

8 References

1. Alan Kotok, Making XML Work in Business, O’Reilly website “XML from inside out”, January 2, 2002 (http://www.xml.com/pub/a/2002/01/02/bizvalue.html)

2. AlphaWorks Community Exchange IBM, XML Translator Generator, May 21, 1999 (http://www.alphaworks.ibm.com/tech/xmltranslatorgenerator)

3. Apache group, Apache Xalan (http://xml.apache.org/xalan-j/index.html)

4. Apache group, Apache Xerces-J (http://xml.apache.org/xerces-j/index.html)

5. Brett McLaughlin, Java and XML, 2nd edition, O’Reilly & Associates, Inc., August 2001, ISBN: 0-596-00197-5

6. Brett McLaughlin, What’s the diff? IBM developerWorks website May 2001 (http://www-106.ibm.com/developerworks/xml/library/xdiff/?dwzone=xml)

7. Brett McLaughlin, Jason Hunter, JDOM (http://www.jdom.org)

8. Don Box, Aaron Skonnard, John Lam, Essential XML, Beyond Markup, Addison-Wesley, September 2000, ISBN: 0-201-70914-7

9. Europcar team (anonym), Technical Note Europcar Online Reservation Services – XML Reservation Servlet (XRS), July 2001

10. Hiroshi Maruyama, Kent Tamura, Naohiko Uramoto, XML and Java, Developing Web Applications, Addison-Wesley, May 1999, ISBN: 0-201-48543-5

11. Khun Yee Fung, XSLT Working with XML and HTML, Addison-Wesley, December 2000, ISBN: 0-201-71103-6

12. Michael Kraus, Dan Olteanu, An XML Toolkit for Transformation and Comparison of Schemas and Instances, July 4, 2001 (http://www.pms.informatik.uni-muenchen.de/forschung/xml/xml-toolkit.html)

13. Norman Walsh, A Technical Introduction to XML, O’Reilly website “XML from inside out” (www.xml.com) October 3, 1998 (http://www.xml.com/pub/a/98/10/guide0.html)

14. W3C Communications Team and Bert Bos, XML in 10 Points March 27, 1999 (http://www.w3.org/XML/1999/XML-in-10-points)

15. W3C XML Working Group, W3C Recommendation on XML version 1.0 (2nd edition) (http://www.w3.org/TR/2000/REC-xml-20001006)

16. W3C XSL Working Group, W3C Recommendation on XSL Transformations (XSLT) version 1.0 (http://www.w3.org/TR/xslt)

17. W3C XSL Working Group and XML Linking Working Group, W3C Recommendation on XML Path Language (XPath) version 1.0 (http://www.w3.org/TR/xpath)

18. SourceForge project, SAX 2.0, (http://sax.sourceforge.net)
