3Store: Efficient Bulk RDF Storage

Stephen Harris, Nicholas Gibbins
University of Southampton

Presented by Hyun Joon Jung
5/3/10
CS395T class presentation


Contents
 •  Introduction
 •  History and Background
 •  Requirements
 •  Related Work
 •  The Design of the 3store RDF knowledge base
 •  Performance
 •  Conclusions


Introduction
 •  This paper
     –  describes the 3store RDF storage and query engine
     –  discusses the design rationale and optimizations behind it, which make the RDF knowledge base efficient


3store
 •  RDF storage and query engine
 •  developed within the Advanced Knowledge Technologies project


1. History and Background


What is AKT?
 •  The Advanced Knowledge Technologies project (AKT), led by Nigel Shadbolt from Southampton, is concerned with the management of the knowledge life cycle.


What's the problem?
What they said:
 •  Many Semantic Web applications require large quantities of RDF triples to perform their reasoning over.
 •  Current RDF database technology only scales to hundreds of thousands or maybe a few million triples, and from our experience in the AKT project we know that many interesting Knowledge Bases can quickly grow to tens of millions of triples before they become truly interesting.


Towards a Solution
 •  3store is a core C library that uses MySQL to store its raw RDF data and caches.
 •  The library offers OKBC and RDQL query interfaces, over HTTP (via an Apache web server module) or directly through the C library.


RDF
 •  Resource Description Framework (RDF) is a framework for describing and interchanging metadata (data describing web resources).
 •  RDF provides machine-understandable semantics for metadata. This leads to:
     –  better precision in resource discovery than full-text search,
     –  assistance for applications as schemas evolve,
     –  interoperability of metadata.

RDFS
It resembles object-oriented programming.
 •  RDF Schema is an extension of Resource Description Framework.
 •  RDF Schema provides a higher level of abstraction than RDF: specific classes of resources, specific properties, and the relationships between these properties and other resources can be described.
 •  RDFS allows specific resources to be described as instances of more general classes.
 •  RDFS provides mechanisms where custom RDF vocabulary can be developed.
 •  Also, RDFS provides important semantic capabilities that are used by enhanced semantic languages like DAML, OIL and OWL.

RDFS
[Figure: an RDF Schema layer (classes Literal, Human, Major, Employer, Company, Homepage, Address; properties age, majorsIn, owns, hasHomepage; arcs labelled domain, range, and subClassOf) above an RDF instance layer (Jim, 34, Management, OntologyTech, http://www.ontologytech.com/~ont; arcs labelled age, majorsIn, owns, and hasHomepage).]

OKBC
 •  OKBC (Open Knowledge Base Connectivity) is a protocol for accessing knowledge bases (KBs) stored in Knowledge Representation Systems (KRSs).
 •  The goal of OKBC is to serve as an interface to many different KRSs.


RDQL
 •  RDQL (RDF Data Query Language) has been implemented in a number of RDF systems for extracting information from RDF graphs.
 •  An RDF model is a graph, often expressed as a set of triples. An RDQL query consists of a graph pattern, expressed as a list of triple patterns. Each triple pattern is composed of named variables and RDF values (URIs and literals).
 •  An RDQL query can additionally have a set of constraints on the values of those variables, and a list of the variables required in the answer set.


Example of RDQL
    SELECT ?family , ?given
    WHERE  (?vcard vcard:FN "John Smith")
           (?vcard vcard:N ?name)
           (?name vcard:Family ?family)
           (?name vcard:Given ?given)
    USING  vcard FOR

http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/
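To make the idea of a graph pattern concrete, here is a minimal sketch in Python of how a list of triple patterns like the one above could be matched against a set of triples. It is not how 3store evaluates queries (3store translates RDQL to SQL). The sample data, the blank-node labels, and the function names `match_pattern` and `evaluate` are all assumptions made for illustration.

```python
# Hypothetical in-memory triples; the data is made up for illustration.
TRIPLES = {
    ("_:card1", "vcard:FN", "John Smith"),
    ("_:card1", "vcard:N", "_:name1"),
    ("_:name1", "vcard:Family", "Smith"),
    ("_:name1", "vcard:Given", "John"),
    ("_:card2", "vcard:FN", "Jane Doe"),
}


def match_pattern(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):          # named variable
            if result.setdefault(term, value) != value:
                return None               # conflicts with an earlier binding
        elif term != value:               # constant (URI or literal)
            return None
    return result


def evaluate(patterns, select):
    """Match every triple pattern in turn, then project the selected variables."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in TRIPLES
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return sorted(tuple(b[var] for var in select) for b in bindings)


QUERY = [
    ("?vcard", "vcard:FN", "John Smith"),
    ("?vcard", "vcard:N", "?name"),
    ("?name", "vcard:Family", "?family"),
    ("?name", "vcard:Given", "?given"),
]
print(evaluate(QUERY, ["?family", "?given"]))  # prints [('Smith', 'John')]
```

A variable that appears in several patterns, such as `?name`, must take the same value in each; that shared binding is what joins the patterns into one graph pattern.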


RDQL and SPARQL
 •  RDQL predates SPARQL - in fact, RDQL's design predates the current RDF specifications and some of the design decisions in RDQL are a reflection of that.
 •  The biggest of these is that RDF didn't have any data-typing, so RDQL handles tests on, say, integers without checking the data-type (if it looks like an integer, it can be tested as an integer).
 •  SPARQL has all the features of RDQL and more:
     –  ability to add optional information to query results
     –  disjunction of graph patterns
     –  more expression testing (date-time support, for example)
     –  named graphs
     –  sorting
 •  But, above all, it is more tightly specified, so queries in one implementation should behave the same in all other implementations.
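The "if it looks like an integer, it can be tested as an integer" behaviour can be sketched as follows. This is an illustrative Python sketch, not RDQL's actual implementation; the function names are hypothetical, and the fallback to plain string comparison is an assumption of this sketch.

```python
def looks_like_integer(lexical):
    """True when the lexical form of a literal parses as an integer."""
    try:
        int(lexical)
        return True
    except ValueError:
        return False


def untyped_less_than(left, right):
    """Compare two untyped literals, numerically when both look like integers."""
    if looks_like_integer(left) and looks_like_integer(right):
        return int(left) < int(right)
    # Assumption of this sketch: otherwise fall back to string comparison.
    return left < right
```

With this rule, `untyped_less_than("9", "34")` is true even though the string `"9"` sorts after `"34"`; no declared data type is consulted at any point.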


2. Requirements


2.1. Scale
 •  Scale
     –  It must be able
         •  to store the hyphen.info data
         •  to import and replace RDF data within an acceptable time
     –  Hyphen.info
         •  The ontologies that make up the AKT Reference Ontology and other RDF data
         •  consists of around 5 million RDF triples when serialized
         •  200 classes and 150 properties
     –  From this, the base scale requirement was set as the ability to handle at least 20 million triples and 5000 classes and properties.


2.2. Interfaces
 •  Interfaces
     –  To ensure backwards compatibility with the previous version, it supports an RDF-based dialect of OKBC which uses HTTP as its transport layer.
     –  They felt the need to implement a more natural RDF query interface based on the RDQL query language, with
         •  an HTTP interface that returns the results in an XML format
         •  a database-style C API that queries the knowledge base directly.


2.3. Inferential Capability
 •  Inferential Capability
     aaa = any uriref
     uuu = any uriref or literal node
     xxx, yyy, and zzz = any node in the graph


2.4. Efficiency
 •  Efficiency: evaluating queries and asserting new knowledge
     –  The efficiency of evaluating queries
         •  interactive-level performance
         •  web-based interface, queries with 4 to 12 triple patterns in the WHERE clause, returning a few hundred result rows
         •  response time for the queries used must be kept to the order of a few milliseconds


2.4. Efficiency (cont'd)
 •  The time taken to assert new knowledge
     •  The knowledge sources that the AKT project uses are gathered on a variety of schedules ranging from daily to monthly.
     •  Maintaining the integrity of data while it is being updated is an important concern.
     •  To address this, the time during which the knowledge base is inconsistent or incomplete should be kept to a minimum.


3. Related Work


3.1. Jena
 –  A Java toolkit for manipulating RDF models
 –  Developed by HP Labs
 –  Offers excellent support for RDQL queries
 –  Does not provide an OKBC interface
 –  Asserting new knowledge takes a very long time (after 24 hours, it had not completed the import)


3.2. Sesame
 –  An architecture for inference and querying of RDF and RDF Schema
 –  The query response time for DBMS-backed, moderately sized knowledge bases is in excess of 60 ms
 –  It is not fast enough for the interactive user interface


3.3. Redland
 –  Capable of storing large RDF graphs
 –  Currently has no graph-matching query facility
 –  Raptor, the RDF parser from Redland, which provides a C API for extracting the triples from RDF/XML documents, was chosen for use in this project


4. The Design of the 3store RDF Knowledge Base


4.1. Platform
 –  The software is developed for POSIX-compliant Unix environments in ANSI C for scalability and portability reasons
 –  The back-end storage is provided by an SQL engine. MySQL was chosen
     •  Open source, and portable to many POSIX operating systems
     •  Picking a single back-end allows tighter integration and more optimization work to be done


4.2. 3 Layer Model
 –  The 3 layer optimization model of HYWIBAS's KB:
     Knowledge Base System
     Object Data Management System
     Relational Database System


4.2. 3 Layer Model
 •  Use an off-the-shelf RDF parser, namely the Free Software Raptor toolkit

[Figure: RDF/XML sources -> Raptor (RDF Parser), parsing -> RDF/XML triples -> rdfsql library, translating -> SQL INSERT statements -> RDBMS]


4.3. Database Structure
The arrangement of tables in the SQL database schema used in the current version of 3store is shown in this figure.


4.3. Database Structure
 •  Hashing
     –  64-bit hash (the largest integer format of commodity servers), MD5 function
     –  The probability of a hash collision: around 1:10^-10
     –  By the addition of a flag which indicates the type of the object of a triple, collisions between resource hashes and literal hashes are avoided.
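The hashing scheme can be sketched in Python as below. This is a minimal illustration under stated assumptions: the slide does not say which 64 bits of the MD5 digest are kept, so taking the first 8 bytes is a guess, and the flag values and the names `hash64` and `object_key` are hypothetical.

```python
import hashlib

RESOURCE, LITERAL = 0, 1  # hypothetical flag values for the object type


def hash64(value):
    """Reduce a URI or literal string to a 64-bit integer via MD5."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Assumption: keep the first 8 of the 16 digest bytes.
    return int.from_bytes(digest[:8], "big")


def object_key(value, is_literal):
    """Pair the hash with a type flag so resources and literals never clash."""
    return (hash64(value), LITERAL if is_literal else RESOURCE)
```

A URI and a literal with the same text hash to the same integer, but `object_key` still tells them apart because the flag differs. That is the role the slide gives to the object-type flag.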


4.4. Query Execution
 –  The 3store RDQL engine transforms an RDQL query into an SQL query over the underlying RDBMS representation of the RDF data

[Figure: RDQL syntax; RDF graph]


4.4. Query Execution
 •  Query Execution
     –  RDQL also allows an optional constraints clause.
         •  ==, !=, <=, >=, <


4.4. Query Execution
 •  Transforming the RDQL triple expressions into relational calculus
     –  Each triple pattern in the expression is assigned an existentially quantified variable.
     –  A constraint is then added to the scope of the existential function for each node appearing in more than one triple.
     –  Constraints are also added for every object's resource/literal constraints.
     –  A free tuple variable is included for any requested attributes.
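The steps above can be sketched as a translation from triple patterns to a single SQL self-join. This is an illustrative Python sketch, not 3store's actual translator: the table `triples` with text columns `subject`, `predicate`, and `object` is an assumption (3store stores hashes rather than strings), the resource/literal constraints are omitted, and `sqlite3` stands in for MySQL.

```python
import sqlite3


def rdql_to_sql(patterns, select):
    """Build one SQL self-join from a list of (s, p, o) triple patterns."""
    columns = ("subject", "predicate", "object")
    first_seen = {}              # variable -> first column it was bound to
    where, params = [], []
    for i, pattern in enumerate(patterns):   # one table alias per triple pattern
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")   # constant: compare to a parameter
                params.append(term)
            elif term in first_seen:
                where.append(f"{ref} = {first_seen[term]}")  # shared node: join
            else:
                first_seen[term] = ref
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    projection = ", ".join(first_seen[var] for var in select)
    sql = f"SELECT {projection} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Hypothetical sample data for the demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
db.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("_:card1", "vcard:FN", "John Smith"),
        ("_:card1", "vcard:N", "_:name1"),
        ("_:name1", "vcard:Family", "Smith"),
        ("_:name1", "vcard:Given", "John"),
    ],
)
patterns = [
    ("?vcard", "vcard:FN", "John Smith"),
    ("?vcard", "vcard:N", "?name"),
    ("?name", "vcard:Family", "?family"),
    ("?name", "vcard:Given", "?given"),
]
sql, params = rdql_to_sql(patterns, ["?family", "?given"])
rows = db.execute(sql, params).fetchall()
```

Each table alias plays the part of one existentially quantified variable, each equality between aliases is the constraint for a node that appears in more than one triple, and the projected columns correspond to the free tuple variable for the requested attributes.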


4.4. Query Execution


4.4. Query Execution
 •  The resource/literal type of any free variable in the query can often be determined by inspection
     –  when it appears as the subject or predicate of a triple expression
     –  when it is explicitly bound to a URI constant in the constraints section of the RDQL
     –  when it appears as the value of a predicate whose range is known.
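Two of the three inspection rules above can be sketched in a few lines of Python. This is a hypothetical illustration: the function name, the `known_ranges` mapping, and the example predicates are assumptions, and the second rule (binding to a URI constant in the constraints section) is left out for brevity.

```python
def infer_variable_kinds(patterns, known_ranges=None):
    """Return {variable: "resource" or "literal"} where the kind is evident."""
    known_ranges = known_ranges or {}
    kinds = {}
    for subject, predicate, obj in patterns:
        # Rule 1: subjects and predicates can never be literals.
        for term in (subject, predicate):
            if term.startswith("?"):
                kinds[term] = "resource"
        # Rule 3: an object variable takes its kind from a known predicate range.
        if obj.startswith("?") and predicate in known_ranges:
            kinds.setdefault(obj, known_ranges[predicate])
    return kinds


patterns = [("?x", "ex:age", "?a"), ("?x", "ex:knows", "?y")]
kinds = infer_variable_kinds(patterns, {"ex:age": "literal"})
```

Here `?x` is a resource because it is a subject and `?a` is a literal because the range of `ex:age` is known, while `?y` stays undetermined.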


4.4. Query Execution
 •  A problem arises when the type of the requested attribute is unknown, as in the following example
     –  SELECT ?a WHERE (, , ?a)


4.6. Hybrid eager / lazy production
 •  The model for RDF/RDFS contains a number of entailments which should be generated by an RDF inference engine.
 •  The most common approach to supporting this is a production rule system
     –  which generates the required entailments by either forward chaining from asserted facts or backward chaining from queries that are presented to the system.


Entailments
 •  A entails B if
     –  whenever A is true, then B must be true, and
     –  if B is false, then A is false
 •  Example
     –  A1: Carmen stole the Mona Lisa this morning
     –  Entailment?
         •  Carmen stole something. (YES)
         •  Something was stolen this morning. (YES)
         •  Carmen believes the Mona Lisa is valuable. (NO)


4.6. Hybrid eager / lazy production
 •  Forward chaining (Sesame)
     –  It applies the entailment rules in the model theory exhaustively to the asserted facts in order to generate the RDF closure of those facts.
     –  Pros: reduces the processing cost of evaluating queries (a considerable advantage for interactive applications)
     –  Cons: increases the space needed to store the RDF closure
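Eager production can be sketched as a fixed-point loop over the asserted triples. The Python sketch below applies a single rule, the subproperty rule that the slides label rdfs6 (if `xxx aaa yyy` and `aaa rdfs:subPropertyOf bbb`, then `xxx bbb yyy`). The function name and the `ex:` sample data are assumptions, and a real engine would apply the full RDFS rule set.

```python
SUBPROP = "rdfs:subPropertyOf"


def forward_chain(asserted):
    """Apply the subproperty rule until no new triples appear."""
    closure = set(asserted)
    while True:
        inferred = {
            (s, parent, o)
            for (s, p, o) in closure
            for (child, link, parent) in closure
            if link == SUBPROP and child == p
        }
        if inferred <= closure:      # fixed point: nothing new was produced
            return closure
        closure |= inferred


facts = {
    ("ex:jim", "ex:owns", "ex:company"),
    ("ex:owns", SUBPROP, "ex:relatedTo"),
}
closure = forward_chain(facts)
```

The closure holds three triples instead of two: the entailed `("ex:jim", "ex:relatedTo", "ex:company")` is now stored, so a query for it is a plain lookup, at the price of the extra row.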


4.6. Hybrid eager / lazy production
 •  Backward chaining
     –  It evaluates the entailments matched by a query at query processing time.
     –  Pros: reduces the cost of storing the entailments
     –  Cons: lazy production may incur a time penalty that makes its use in interactive applications unworkable.


4.6. Hybrid eager / lazy production
 •  A Solution
     –  adopt a hybrid approach
         •  Some entailment rules, which generate fewer entailments, are evaluated eagerly using forward chaining rules when new facts are asserted.
         •  The other entailment rules, which need greater storage and have a lower evaluation cost, are evaluated as required at query time by a combination of backward chaining and query rewriting.


4.6. Hybrid eager / lazy production
 •  By producing some of the entailments lazily, it is possible to trade triple table size and assertion time complexity for query time complexity.
 •  Example: rdfs6
     –  xxx aaa yyy
     –  aaa rdfs:subPropertyOf bbb  =>  xxx bbb yyy
     –  (a, b, c)  =>  (a, ?p, c) (?p, rdfs:subPropertyOf, b)
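The query rewriting in the rdfs6 example can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name and the `ex:` sample data are hypothetical, only one level of `rdfs:subPropertyOf` is followed (chains of subproperties are the subject of section 4.7), and the direct match on the asked predicate is added alongside the rewritten pattern.

```python
SUBPROP = "rdfs:subPropertyOf"


def lazy_match(triples, subject, predicate, obj):
    """Answer (subject, predicate, obj) without storing entailed triples."""
    # Rewrite (a, b, c) as (a, ?p, c) (?p, rdfs:subPropertyOf, b):
    # collect every ?p declared as a subproperty of the asked predicate.
    candidates = {predicate} | {
        s for (s, p, o) in triples if p == SUBPROP and o == predicate
    }
    return any((subject, p, obj) in triples for p in candidates)


facts = {
    ("ex:jim", "ex:owns", "ex:company"),
    ("ex:owns", SUBPROP, "ex:relatedTo"),
}
found = lazy_match(facts, "ex:jim", "ex:relatedTo", "ex:company")
```

The triple `("ex:jim", "ex:relatedTo", "ex:company")` is never written to the store; the query does slightly more work instead, which is the trade of table size for query time that the slide describes.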


4.7. Transitive Entailment
 •  Transitive Entailment
     –  rdfs5a expresses the transitive nature of some properties in the RDFS vocabularies.
         aaa rdfs:subPropertyOf bbb
         bbb rdfs:subPropertyOf ccc
         =>  aaa rdfs:subPropertyOf ccc
     –  These transitive entailments are implemented using an optimized implementation of the TRANSITIVE-CLOSURE function
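For illustration, a transitive closure over `rdfs:subPropertyOf` pairs can be computed with a simple reachability search. The Python sketch below is a plain, unoptimized version and is not the optimized TRANSITIVE-CLOSURE implementation the slide refers to; the function name and the sample pairs are assumptions.

```python
from collections import defaultdict


def transitive_closure(pairs):
    """Return every (ancestor-reachable) pair implied by chaining the input pairs."""
    successors = defaultdict(set)
    for child, parent in pairs:
        successors[child].add(parent)
    closure = set()
    for start in list(successors):
        stack, seen = list(successors[start]), set()
        while stack:                      # depth-first walk from `start`
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors.get(node, ()))
        closure |= {(start, node) for node in seen}
    return closure


sub_property_of = {("aaa", "bbb"), ("bbb", "ccc")}
closed = transitive_closure(sub_property_of)
```

The result contains the two asserted pairs plus the entailed `("aaa", "ccc")`, which matches the rdfs5a rule above.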


5. Performance
 •  On commodity hardware,
     –  3store can assert between 100 and 300 triples/second depending on the size of the knowledge base.
 •  Some optimizations
     –  optionally locking the tables to prevent index rebuilding
     –  grouping SQL INSERT statements
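The second optimization, grouping INSERT statements, can be sketched as batched inserts. This Python sketch uses the standard-library `sqlite3` module as a stand-in for MySQL, so it shows the batching idea only; the table layout, the function name `bulk_assert`, and the batch size are assumptions, and table locking is not shown.

```python
import sqlite3


def bulk_assert(db, triples, batch_size=1000):
    """Insert triples in groups, committing once per group."""
    rows = list(triples)
    for start in range(0, len(rows), batch_size):
        with db:  # one transaction per batch instead of one per row
            db.executemany(
                "INSERT INTO triples VALUES (?, ?, ?)",
                rows[start:start + batch_size],
            )
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
sample = [(f"ex:s{i}", "ex:p", f"value {i}") for i in range(2500)]
inserted = bulk_assert(db, sample)
```

Grouping amortizes the per-statement overhead across many rows, which is the kind of saving the slide is pointing at.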


6. Conclusion
 •  3store is an RDF engine that efficiently supports the RDF and RDFS entailments over relatively large RDF knowledge bases using a relational database back-end.
 •  At that time, they claimed that there was a lack of standardized performance benchmarks for RDF knowledge bases.


Question and Answer

