Making URIs published on Data Web RDF dereferencable

Jing Mei, Guotong Xie, Shengping Liu, Hanyu Li, Yuan Ni, Yue Pan

IBM China Research Lab, Zhongguancun Software Park, Beijing 100193, China

ABSTRACT

Nowadays, more and more URIs reside on the Data Web as they are published for linked open data, and dereferencing URIs challenges the current Web to embrace the Semantic Web. Although quite a few practical recipes for publishing URIs have been provided to make URIs dereferencable, we believe a fundamental investigation of publishing and dereferencing URIs would contribute forward compatibility with the RDF and OWL upper layers in the Semantic Web architecture. In this paper, we propose to make URIs published on the Data Web RDF dereferencable, and we formalize this requirement in an RDF-compatible semantics. The dereferencing operation is also defined in an abstract URI syntax, such that URIs interpreted as described resources are RDF dereferencable by default. Accompanied by a live demonstration, the poster demo explanation discusses and addresses issues concerning Data Web URIs that have so far been taken for granted. Additionally, as a case study, we explore the Metadata Web, a Data Web of enterprise-wide models. The URIs on the Metadata Web are published as RDF dereferencable. Such an implementation of universal metadata management across the enterprise enables metadata federation, so that global query, search and analysis can be conducted on top of the Metadata Web.

1. INTRODUCTION

RDF (Resource Description Framework) is intended to provide a simple way to make statements about Web resources. A typical example, expressible as a single RDF triple, states that http://www.example.org/index.html has a creator whose value is John Smith, identified by the staff ID 85740.
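The creator statement from the opening example can be written down as a triple. The sketch below is illustrative only: the predicate URI and the staff URI are assumptions modelled on the well-known W3C RDF Primer example and do not come from this paper, and `Triple` and `to_ntriples` are hypothetical helper names.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """An RDF statement whose three terms are all URIs."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


example = Triple(
    subject="http://www.example.org/index.html",
    predicate="http://purl.org/dc/elements/1.1/creator",  # assumed predicate
    obj="http://www.example.org/staffid/85740",           # assumed staff URI
)

print(to_ntriples(example))
```

The subject of the triple is the web page URI itself, which is exactly the URI whose dereferencing behaviour is under discussion.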

To respect the Web architecture [1], a Web resource (a web page) is identified by the URI http://www.example.org/index.html. However, in the Data Web context, a big challenge is how to guarantee that the above RDF triple is always retrieved when dereferencing this URI. We call this property RDF dereferencable; its formal definition is given in Section 2. Since web pages are information resources that can be dereferenced directly, dereferencing the above URI generally retrieves the HTML web page rather than any RDF triple. The practical recipes in [5] and [3], as well as Cool URIs [7], offer some guidance, such as using a 303 redirect and content negotiation to dereference a URI that identifies a non-information resource. In this way, a GET on http://www.example.org/index.html with an Accept: application/rdf+xml header is redirected to another URI such as http://www.example.org/index.html/data, and a second GET on http://www.example.org/index.html/data retrieves the RDF triples.

Again, being a URI, http://www.example.org/index.html/data identifies a Web resource, and we are allowed to make statements about it. An example is a statement about this URI with the literal value "Please publish me and try to dereferencing me", which is most likely not an authoritative description if it is made by someone other than the owner of the URI. Interestingly, RDF dereferencing the first URI triggers dereferencing the second URI, which retrieves the first RDF triple. In other words, directly dereferencing the second URI does not retrieve anything about the second URI itself, and practical recipes such as a 303 redirect and content negotiation have to be applied once more to retrieve the second RDF triple. Consequently, RDF dereferencability not only needs to be well defined, but also needs to guarantee that RDF triples are what is actually delivered on the Data Web. Otherwise, a failure to consume RDF triples hurts the RDF data providers and, vice versa, a failure to provide RDF triples drives away the RDF data consumers.

Nowadays, various URIs reside on the Data Web, as the W3C SWEO Linking Open Data community project (http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) proudly announced, especially those using the recipes in [5, 7], which introduce at least three URIs to describe a resource. Since not all published URIs are RDF dereferencable, it is time for a fundamental investigation of publishing and dereferencing URIs on the Data Web. Below, we propose a necessary condition and a sufficient condition for making URIs published on the Data Web RDF dereferencable.

• Necessary Condition: If a URI published on the Data Web is RDF dereferencable, then dereferencing this URI retrieves RDF triples whose subject is this URI.
• Sufficient Condition: A URI published on the Data Web is RDF dereferencable if publishing this URI delivers RDF triples whose subject is this URI.

Recall that the RDF Semantics makes no assumption of any particular relationship between the denotation and the use of a URI, and that such a requirement can be added as a semantic extension [2]. To some extent, satisfying the above necessary and sufficient conditions is such a required relationship, where publishing a URI is the denotation and dereferencing a URI is the use. We believe that the URI, being a cornerstone of the Semantic Web, needs forward compatibility with the RDF and OWL upper layers. In this paper, we contribute an RDF-compatible semantics for making URIs published on the Data Web RDF dereferencable. The dereferencing operation is also defined in an abstract URI syntax, such that URIs interpreted as described resources are RDF dereferencable by default.
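The 303-redirect behaviour and the necessary condition discussed above can be made concrete with a small in-memory simulation. This is a minimal sketch under stated assumptions, not a real HTTP client or server: `http_get`, `dereference` and `satisfies_necessary_condition` are hypothetical names, the toy server knows only the two example URIs, and the predicate and staff URIs are assumed as in the RDF Primer example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

PAGE = "http://www.example.org/index.html"
DATA = PAGE + "/data"
CREATOR = "http://purl.org/dc/elements/1.1/creator"  # assumed predicate
STAFF = "http://www.example.org/staffid/85740"       # assumed staff URI

# The RDF document at DATA describes PAGE; nothing describes DATA itself.
RDF_DOCUMENTS: Dict[str, List[Triple]] = {DATA: [(PAGE, CREATOR, STAFF)]}
REDIRECTS: Dict[str, str] = {PAGE: DATA}  # targets of "303 See Other"


def http_get(uri: str, accept: str) -> Tuple[int, object]:
    """Return (status, payload) for a simulated GET request."""
    if accept == "application/rdf+xml" and uri in REDIRECTS:
        return 303, REDIRECTS[uri]
    if uri in RDF_DOCUMENTS:
        return 200, RDF_DOCUMENTS[uri]
    return 200, "<html>...</html>"  # an ordinary web page, no triples


def dereference(uri: str) -> List[Triple]:
    """Follow at most one 303 redirect and return the triples retrieved."""
    status, payload = http_get(uri, "application/rdf+xml")
    if status == 303:
        status, payload = http_get(str(payload), "application/rdf+xml")
    return payload if isinstance(payload, list) else []


def satisfies_necessary_condition(uri: str) -> bool:
    """True if dereferencing `uri` retrieves a triple whose subject is `uri`."""
    return any(subject == uri for subject, _, _ in dereference(uri))


print(satisfies_necessary_condition(PAGE))  # the redirect leads to triples about PAGE
print(satisfies_necessary_condition(DATA))  # DATA only returns triples about PAGE
```

In this toy setting the first URI passes the check only because of the redirect, while the second URI fails it, which mirrors the observation that each extra URI introduced by the recipes needs the same treatment again.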

2. RDF DEREFERENCABLE

By convention in the RDF Semantics [2], a set of names is referred to as a vocabulary, and a name is a URI reference or a literal. As specified in the generic URI syntax [4], a URI reference is either a URI or a relative reference.

Definition 1: A des-interpretation of a vocabulary V is a simple interpretation I of V (we direct readers to the RDF Semantics [2] for the definition of a simple interpretation), extended with:
(1) a set IR_d ⊆ IR of described resources;
(2) a mapping IDES : IR_d → 2^(V×V×V), the resource description mapping, such that <s p o> ∈ IDES(I(s)) for any s ∈ {u ∈ V | I(u) ∈ IR_d}, p ∈ V and o ∈ V.
A URI u ∈ V is defined as RDF dereferencable if I(u) ∈ IR_d, i.e., if u is interpreted as a described resource.

Similar to rdf-interpretations and rdfs-interpretations, every des-interpretation is also a simple interpretation. The 'extra' description structure does not prevent it from acting in the simpler role. Given a URI published on the Data Web, if it is RDF dereferencable, then dereferencing this URI should retrieve RDF triples whose subject is this URI. Conversely, if publishing this URI has delivered RDF triples whose subject is this URI, then it is RDF dereferencable. We call the former a necessary condition and the latter a sufficient condition for making URIs published on the Data Web RDF dereferencable.

Next, we formalize resource representation by defining the dereferencing operation. First, we recall the generic URI syntax [4], which defines a grammar that is a superset of all valid URIs, consisting of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment. On the Data Web, conventionally, only HTTP URIs are used, avoiding other URI schemes such as URNs and DOIs [5]. Below is the generic syntax of an HTTP URI; HTTP URIs are called query URIs if they contain a "?" [6].

http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

In abstract syntax, we denote the set of all valid HTTP URIs by U_0. A query URI v ∈ U_0 has the form u?q, where u is the non-query part of v and q is the query part. The query part q consists of parameters (key/value pairs) separated by &, viz. k_1=v_1 & ··· & k_m=v_m, where k_i is the parameter name and v_i is the parameter value, 1 ≤ i ≤ m.

Definition 2: Let U_0 be the set of URIs, G the set of RDF graphs, F the set of representation formats, and S the set of byte streams, with G ⊆ S. A dereferencing operation is a mapping λ : U_0 → S. Likewise, a format transformation is a mapping τ : G × F → S.

Taking advantage of parameters in query URIs, we propose to publish URIs for described resources in the non-query form, so that they are RDF dereferencable by default, while other URIs with a suffix of parameters are dereferenced in the usual way. That is, resource description is formalized by IDES and resource representation by λ, such that, given a URI u ∈ U_0, if u is interpreted as a described resource I(u) ∈ IR_d, then λ(u) = IDES(I(u)). Any other query URI is dereferenced with a format transformation, i.e., λ(u?k=v) = τ(IDES(I(u)), v). This strategy also benefits the implementation of paging. As noted in [5] and [8], retrieving a huge stream of bytes strains the bandwidth. With parameters, pages become configurable in URIs, so that a specific page can be retrieved, as in http://mdw.com/resource/Beijing?format=html&page=3.
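The dereferencing operation λ and the format transformation τ can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the sample triple, the opaque resource label `res:Beijing`, the choice of N-Triples as the default byte stream, the two supported formats, the `format` and `page` parameter handling, and the page size are not specified in the paper.

```python
import html
from typing import Dict, FrozenSet, Iterable, Tuple
from urllib.parse import parse_qs, urlsplit, urlunsplit

Triple = Tuple[str, str, str]
PAGE_SIZE = 50  # assumed number of triples per page

BEIJING = "http://mdw.com/resource/Beijing"
# I maps names to resources (opaque labels here); IDES maps described
# resources to their descriptions. Both hold hypothetical sample data.
I: Dict[str, str] = {BEIJING: "res:Beijing"}
IDES: Dict[str, FrozenSet[Triple]] = {
    "res:Beijing": frozenset(
        {(BEIJING, "http://www.w3.org/2000/01/rdf-schema#label", '"Beijing"')}
    )
}


def is_rdf_dereferencable(u: str) -> bool:
    """u is RDF dereferencable if I(u) is a described resource."""
    return I.get(u) in IDES


def term(x: str) -> str:
    """Render a literal (already quoted) as is, and a URI in angle brackets."""
    return x if x.startswith('"') else f"<{x}>"


def tau(graph: Iterable[Triple], fmt: str) -> bytes:
    """Format transformation: RDF graph x format -> byte stream."""
    lines = [" ".join(term(x) for x in t) + " ." for t in sorted(graph)]
    if fmt == "ntriples":
        return "\n".join(lines).encode("utf-8")
    if fmt == "html":
        items = "".join(f"<li>{html.escape(line)}</li>" for line in lines)
        return f"<ul>{items}</ul>".encode("utf-8")
    raise ValueError(f"unsupported format: {fmt}")


def lam(uri: str) -> bytes:
    """Dereferencing operation: URI -> byte stream."""
    parts = urlsplit(uri)
    u = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    if not is_rdf_dereferencable(u):
        return b""  # ordinary dereferencing is not modelled in this sketch
    graph = IDES[I[u]]
    if not parts.query:
        return tau(graph, "ntriples")  # non-query form: RDF by default
    params = parse_qs(parts.query)
    fmt = params.get("format", ["ntriples"])[0]
    page = int(params.get("page", ["1"])[0])
    selected = sorted(graph)[(page - 1) * PAGE_SIZE : page * PAGE_SIZE]
    return tau(selected, fmt)


print(lam(BEIJING).decode("utf-8"))
print(lam(BEIJING + "?format=html").decode("utf-8"))
```

The non-query URI returns the description directly, while the query URI routes the same description through τ, so the choice of format and page never changes which resource is described.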

3. REFERENCES

[1] Architecture of the World Wide Web, December 2004. http://www.w3.org/TR/webarch/.
[2] RDF Semantics, February 2004. http://www.w3.org/TR/rdf-mt/.
[3] Best Practice Recipes for Publishing RDF Vocabularies, January 2008. http://www.w3.org/TR/swbp-vocab-pub/.
[4] T. Berners-Lee, R. T. Fielding, and L. Masinter. Uniform Resource Identifier (URI): Generic Syntax, January 2005. http://gbiv.com/protocols/uri/rev2002/rfc2396bis.html.
[5] C. Bizer, R. Cyganiak, and T. Heath. How to Publish Linked Data on the Web, July 2007. http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/.
[6] R. T. Fielding, J. Gettys, J. Mogul, H. F. Nielsen, L. Masinter, P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol – HTTP/1.1, June 1999. http://www.ietf.org/rfc/rfc2616.txt.
[7] L. Sauermann, R. Cyganiak, and M. Völkel. Cool URIs for the Semantic Web, August 2007. http://www.dfki.uni-kl.de/~sauermann/2006/11/cooluris/.
[8] F.-P. Servant. Linking Enterprise Data. In WWW 2008 Workshop: Linked Data on the Web, April 2008.

Poster Demo Explanation

By publishing URIs on the Metadata Web (MDW, a Data Web of enterprise-wide models) as RDF dereferencable, we demonstrate that an implementation of universal metadata management across the enterprise enables metadata federation and supports global query, search and analysis on metadata.

[Figure: Metadata Web architecture. Labels recoverable from the figure: Existing Client; Web Browser; Web 2.0 App (Bookmark, Tag, RSS feed); Data Browsers; URI + RDF/JSON; Search Engine; Analytic Archive; Glossary; Existing Metadata Repositories; Crawler; Wrapper Framework; WMS; WSDL; WSRR; UML; RAM; Search Index.]

The figure above shows the Metadata Web architecture, where a wrapper framework serves as the primary provider of Metadata Web URIs, while the crawler and the analytic archive are consumers of Metadata Web URIs. As a preliminary implementation, there are three running prototypes on the Metadata Web, namely MDW WBG (WebSphere Business Glossary), MDW WSRR (WebSphere Service Registry and Repository) and MDW RAM (Rational Asset Management). In the wrapper framework, existing metadata repositories are wrapped, and URIs are published for enterprise-wide models and fine-grained model elements. In particular, we remark that issues which have so far been taken for granted now need to be discussed in detail and seriously addressed.

First is the repository URI. Although various RDF datasets are available on the Data Web, we observe that a repository URI has seldom been published as RDF dereferencable, with statements about the repository itself, e.g., describing the repository's capabilities. Moreover, a repository URI is the entry point for an RDF crawler: dereferencing a repository URI lets the RDF crawler retrieve all other URIs that have been published from this repository host, followed by retrieval of the RDF triples about those URIs.

Second is the document URI. At first glance, a document has a name, and a Web document has a URI. However, this is not the whole story on the Data Web. We are actually more interested in statements about documents, e.g., in which repository a document is contained, and what the document contains. Considering that RDFS/OWL documents used for ontologies and WSDL/XSD documents used for applications are all XML documents, our objective is to publish XML documents. Recalling the Web architecture, a QName, a pair of namespace and local name, is often used for naming XML elements and attributes, where the namespace itself is a URI but QNames are not URIs.
Even if a mapping from QNames to URIs is provided, for example by appending the local name to the namespace, the mapped URIs are syntactically valid but most likely not (RDF) dereferencable. Moreover, we observe that XML element and attribute URIs, even if valid and dereferencable, might be hash URIs, formed by appending a fragment identifier to the document URI. Dereferencing such XML element and attribute URIs strips off the part after the hash and then retrieves the whole document content. This solution does not capture the semantic relationship between a document and an element (or an attribute) that is originally defined in the document. However, real-world applications often do care about these statements.

Third is the ontology URI. Companies do have large and standardized vocabularies, while heterogeneous metadata registries and repositories have been developed by various vendors (such as IBM, SAP, BEA, Microsoft), as well as by customers with their roll-their-own repositories. Taking web service description as an example, there are the WSDL 1.1 and WSDL 2.0 versions. Although WSDL 2.0 has become a W3C Recommendation, most enterprise applications still reside in the WSDL 1.1 era. Additionally, WSDL meta models have been developed within the enterprise; for example, WSRR has its own WSDL meta model. Since these three vocabularies, namely mdw_wsrr, mdw_wsdl11 and mdw_wsdl20, coexist and serve together for describing web services, we expect such vocabularies to be described in ontologies, but their native URIs need to evolve. For example, a concrete web service interface like OpenAccount might be typed as http://schemas.xmlsoap.org/wsdl#portType in WSDL 1.1; however, this URI has never been defined as an RDF class. Alternatively, http://www.w3.org/ns/wsdl-rdf#Interface in WSDL 2.0 is well defined, but we cannot overlook the meta model mismatches between it and WSDLPortType in WSRR, which cause (sub)components to be restructured.

As a result, all URIs published on the Metadata Web are RDF dereferencable. Using a web browser is of course a common way to consume Metadata Web URIs, and the RDF dereferencing result shows statements about the described resources for the benefit of human users.
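The QName mapping and the hash-URI behaviour described for document URIs can be sketched in a few lines. The helper names `qname_to_uri` and `requested_uri` and the example namespace http://www.example.org/bank/ are hypothetical; the WSDL 1.1 URI is the one quoted in the text.

```python
from urllib.parse import urldefrag


def qname_to_uri(namespace: str, local_name: str) -> str:
    """Map a QName to a URI by appending the local name to the namespace."""
    return namespace + local_name


def requested_uri(hash_uri: str) -> str:
    """Return what a client actually requests: the URI without its fragment."""
    return urldefrag(hash_uri).url


# A syntactically valid URI; nothing guarantees it can be dereferenced.
print(qname_to_uri("http://www.example.org/bank/", "OpenAccount"))

# The fragment is stripped, so the whole document is what gets retrieved.
print(requested_uri("http://schemas.xmlsoap.org/wsdl#portType"))
```

Because the fragment never reaches the server, a hash URI by itself cannot yield statements that relate a single element to its containing document, which is the gap noted above.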
Software agents are also consumers; they retrieve the RDF triples they need in order to build Metadata Web crawlers, search engines, analytic archives, etc. Technically, RDF dereferencing a repository URI retrieves all URIs contained in the repository, so that crawlers can collect RDF data. With all published metadata URIs across the enterprise collected by RDF dereferencing them, the search engine builds indices for keyword search and structural query. Because ontology URIs are published on the Metadata Web as RDF dereferencable, structural information is now available by retrieving statements about classes and properties. Otherwise, the search engine would struggle to understand the various meta models across the enterprise. Finally, the analytic archive provides functionalities such as data aggregation and impact analysis over the collected metadata descriptions, in addition to the links among them. Statements about links are retrieved by RDF dereferencing link URIs, and they instruct the analytic archive what to do. For instance, the link type mdw:identicalWith indicates that data consolidation should be performed, i.e., aggregating RDF triples. So far, by consuming Metadata Web URIs that are published as RDF dereferencable, metadata federation, global search and query, as well as impact analysis, are moving towards an incremental implementation across the enterprise.
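The crawler and the consolidation step described above can be sketched over an in-memory stand-in for the Metadata Web. This is a simplified illustration under several assumptions: the namespace http://mdw.com/ns#, the predicate `contains`, the sample URIs and the sample triples are all hypothetical, and the link is modelled as a plain predicate rather than as a separate link URI that is itself dereferenced. Only the link type name identicalWith comes from the text.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

MDW = "http://mdw.com/ns#"              # hypothetical namespace
CONTAINS = MDW + "contains"             # hypothetical predicate
IDENTICAL_WITH = MDW + "identicalWith"  # link type named in the text

REPO = "http://mdw.com/repository/wsrr"
A = "http://mdw.com/resource/OpenAccount"
B = "http://mdw.com/resource/OpenAccountV2"

# What RDF dereferencing would return for each URI (sample data).
DESCRIPTIONS: Dict[str, List[Triple]] = {
    REPO: [(REPO, CONTAINS, A), (REPO, CONTAINS, B)],
    A: [(A, MDW + "name", '"OpenAccount"'), (A, IDENTICAL_WITH, B)],
    B: [(B, MDW + "version", '"2"')],
}


def dereference(uri: str) -> List[Triple]:
    """Return the triples published for `uri`, or nothing if unknown."""
    return DESCRIPTIONS.get(uri, [])


def crawl(repository_uri: str) -> List[Triple]:
    """Start at the repository URI and collect triples of contained URIs."""
    seen: Set[str] = {repository_uri}
    queue = deque([repository_uri])
    collected: List[Triple] = []
    while queue:
        uri = queue.popleft()
        for s, p, o in dereference(uri):
            collected.append((s, p, o))
            if p == CONTAINS and o not in seen:
                seen.add(o)
                queue.append(o)
    return collected


def consolidate(triples: List[Triple]) -> Set[Triple]:
    """Aggregate descriptions of identical resources under one subject.

    Only direct identicalWith links are followed; chains are not resolved.
    """
    canonical = {o: s for s, p, o in triples if p == IDENTICAL_WITH}
    return {
        (canonical.get(s, s), p, o)
        for s, p, o in triples
        if p != IDENTICAL_WITH
    }


collected = crawl(REPO)
for triple in sorted(consolidate(collected)):
    print(triple)
```

After consolidation, the version statement originally published for the second resource appears under the first one, which is one simple reading of "aggregating RDF triples".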
