3Store: Efficient Bulk RDF Storage
Stephen Harris, Nicholas Gibbins
University of Southampton
Presented by Hyun Joon Jung
5/3/10, CS395T class presentation

Contents
• Introduction
• History and Background
• Requirements
• Related Work
• The Design of the 3store RDF Knowledge Base
• Performance
• Conclusions

Introduction
• This paper
  – describes the 3store RDF storage and query engine
  – discusses the design rationale and optimizations behind it, which make the RDF knowledge base efficient

3store
• RDF storage and query engine
• Developed within the Advanced Knowledge Technologies project

1. History and Background

What is AKT?
• The Advanced Knowledge Technologies project (AKT), led by Nigel Shadbolt from Southampton, is concerned with the management of the knowledge life cycle.

What’s the problem?
What they said:
• Many Semantic Web applications require large quantities of RDF triples to perform their reasoning over.
• Current RDF database technology only scales to hundreds of thousands or maybe a few million triples, and from our experience in the AKT project we know that many interesting Knowledge Bases can quickly grow to tens of millions of triples before they become truly interesting.

Towards a Solution
• 3store is a core C library that uses MySQL to store its raw RDF data and caches.
• The library offers OKBC and RDQL query interfaces, over HTTP (via an Apache web server module) or directly through the C library.

RDF
• Resource Description Framework (RDF) is a framework for describing and interchanging metadata (data describing web resources).
• RDF provides machine-understandable semantics for metadata. This leads to:
  – better precision in resource discovery than full-text search,
  – assistance for applications as schemas evolve,
  – interoperability of metadata.

RDFS
• RDF Schema (RDFS) is an extension of the Resource Description Framework and provides a higher level of abstraction than RDF.
• Specific classes of resources, specific properties, and the relationships between these properties and other resources can be described.
• RDFS allows specific resources to be described as instances of more general classes. In this respect it resembles object-oriented programming.
• RDFS provides mechanisms with which a custom RDF vocabulary can be developed.
• RDFS also provides important semantic capabilities that are used by richer semantic languages such as DAML, OIL and OWL.

RDFS
[Figure: a two-layer diagram. RDF Schema layer: nodes Literal, Human, Major, Employer, Company, Homepage and Address, with edges labelled subClassOf, domain and range, and the properties age, majorsIn, owns and hasHomepage. RDF layer: nodes Jim, 34, Management, OntologyTech and http://www.ontologytech.com/~ont, with edges labelled age, majorsIn, owns and hasHomepage.]

OKBC
• OKBC (Open Knowledge Base Connectivity) is a protocol for accessing knowledge bases (KBs) stored in Knowledge Representation Systems (KRSs).
• The goal of OKBC is to serve as an interface to many different KRSs.

RDQL
• RDQL (RDF Data Query Language) has been implemented in a number of RDF systems for extracting information from RDF graphs.
• An RDF model is a graph, often expressed as a set of triples. An RDQL query consists of a graph pattern, expressed as a list of triple patterns. Each triple pattern is composed of named variables and RDF values (URIs and literals).
• An RDQL query can additionally have a set of constraints on the values of those variables, and a list of the variables required in the answer set.

Example of RDQL
SELECT ?family, ?given
WHERE (?vcard vcard:FN "John Smith")
      (?vcard vcard:N ?name)
      (?name vcard:Family ?family)
      (?name vcard:Given ?given)
USING vcard FOR
http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/

RDQL and SPARQL
• RDQL predates SPARQL. In fact, the RDQL design predates the current RDF specifications, and some of the design decisions in RDQL reflect that.
• The biggest of these is that RDF didn't have any data-typing, so RDQL handles tests on, say, integers without checking the data-type (if it looks like an integer, it can be tested as an integer).
• SPARQL has all the features of RDQL and more:
  – ability to add optional information to query results
  – disjunction of graph patterns
  – more expression testing (date-time support, for example)
  – named graphs
  – sorting
• But, above all, it is more tightly specified, so queries in one implementation should behave the same in all other implementations.

2. Requirements

2.1. Scale
• Scale
  – It must be able
    • to store the hyphen.info data
    • to import and replace RDF data in an acceptable time
  – Hyphen.info
    • comprises the ontologies that make up the AKT Reference Ontology and other RDF data
    • consists of around 5 million RDF triples when serialized
    • has 200 classes and 150 properties
  – From this, the base scale requirement was set as the ability to handle at least 20 million triples and 5000 classes and properties.

2.2. Interfaces
• Interfaces
  – To ensure backwards compatibility with the previous version, it supports an RDF-based dialect of OKBC which uses HTTP as its transport layer.
  – They also wanted a more natural RDF query interface, based on the RDQL query language, offered in two forms:
    • an HTTP interface that returns the results in an XML format
    • a database-style C API that queries the knowledge base directly

2.3. Inferential Capability
• Inferential Capability
  – Notation used in the entailment rules:
    • aaa = any uriref
    • uuu = any uriref or literal node
    • xxx, yyy, and zzz = any node in the graph

2.4. Efficiency
• Efficiency
  – Two aspects: evaluating queries and asserting new knowledge
  – The efficiency of evaluating queries
    • interactive-level performance is required
    • typical load: a web-based interface issuing queries with 4 to 12 triple patterns in the WHERE clause and returning a few hundred result rows
    • response time for these queries must be kept to the order of a few milliseconds

2.4. Efficiency (cont’d)
  – The time taken to assert new knowledge
    • The knowledge sources that the AKT project uses are gathered on a variety of schedules ranging from daily to monthly.
    • Maintaining the integrity of the data while it is being updated is an important concern.
    • To address this, the time during which the knowledge base is inconsistent or incomplete should be kept to a minimum.

3. Related Work

3.1. Jena
  – A Java toolkit for manipulating RDF models
  – Developed by HP Labs
  – Provides excellent support for RDQL queries
  – Does not provide an OKBC interface
  – Asserting new knowledge takes a considerable time (after 24 hours, it had not completed the import)

3.2. Sesame
  – An architecture for inference and querying of RDF and RDF Schema
  – The query response time for DBMS-backed, moderately sized knowledge bases is in excess of 60 ms
  – This is not fast enough for the interactive user interface

3.3. Redland
  – Designed to be capable of storing large RDF graphs
  – Currently has no graph-matching query facility
  – Raptor, the RDF parser from Redland, provides a C API for extracting triples from RDF/XML documents and was chosen for use in this project

4. The Design of the 3store RDF Knowledge Base

4.1. Platform
  – The software is developed in ANSI C for POSIX-compliant Unix environments, for scalability and portability reasons
  – The back-end storage is provided by an SQL engine. MySQL was chosen:
    • it is open source and portable to many POSIX operating systems
    • picking a single back-end allows tighter integration and more optimization work to be done

4.2. 3 Layer Model
  – The 3 layer optimization model of HYWIBAS’s KB:
    • Knowledge Base System
    • Object Data Management System
    • Relational Database System

4.2. 3 Layer Model (cont’d)
• Use an off-the-shelf RDF parser, namely the Free Software Raptor toolkit
[Figure: import pipeline. RDF/XML sources -> Raptor (RDF parser), parsing -> RDF/XML triples -> rdfsql library, translating -> SQL INSERT statements -> RDBMS]

4.3. Database Structure
The arrangement of tables in the SQL database schema used in the current version of 3store is shown in this figure.

4.3. Database Structure (cont’d)
• Hashing
  – 64-bit hash (the largest integer format of commodity servers), taken from the MD5 function
  – The probability of a hash collision: around $1 \times 10^{-10}$
  – By adding a flag which indicates the type of the object of a triple, collisions between resource hashes and literal hashes are avoided.

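The hashing scheme on this slide can be sketched roughly as follows. This is only an illustration, not 3store's actual code: the function names `node_hash` and `object_key`, the choice of the first 8 bytes of the MD5 digest, and the 0/1 encoding of the type flag are all assumptions made for the example.

```python
import hashlib

def node_hash(lexical_form):
    """Return an unsigned 64-bit integer taken from the MD5 digest."""
    digest = hashlib.md5(lexical_form.encode("utf-8")).digest()
    # Assumption: keep the first 8 of the 16 digest bytes.
    return int.from_bytes(digest[:8], "big")

def object_key(lexical_form, is_literal):
    """Pair the hash with a type flag (0 = resource, 1 = literal)."""
    return (node_hash(lexical_form), 1 if is_literal else 0)

# The same string used as a URI and as a literal hashes to the same
# 64-bit value, but the flag keeps the two keys distinct.
uri_key = object_key("http://example.org/jim", is_literal=False)
lit_key = object_key("http://example.org/jim", is_literal=True)
```

Keeping the flag next to the hash, rather than folding it into the hashed string, means a resource and a literal with identical text can never be mistaken for each other.
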
4.4. Query Execution
  – The 3store RDQL engine transforms an RDQL query into an SQL query over the underlying RDBMS representation of the RDF data
[Figure: RDQL syntax and the corresponding RDF graph]

4.4. Query Execution (cont’d)
• Query Execution
  – RDQL also allows an optional constraints clause.
    • ==, !=, <=, >=, <

4.4. Query Execution (cont’d)
• Transforming the RDQL triple expressions into relational calculus
  – Each triple pattern in the expression is assigned an existentially quantified variable.
  – A constraint is then added to the scope of the existential quantifier for each node appearing in more than one triple.
  – Constraints are also added for every object’s resource/literal constraints.
  – A free tuple variable is included for any requested attributes.

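The transformation steps above can be illustrated with a small sketch. It is not the 3store translator: it assumes a single hypothetical `triples` table with plain-text `subject`, `predicate` and `object` columns (3store stores hashes), uses SQLite through Python's `sqlite3` module in place of MySQL, and ignores the resource/literal constraints. The function name `patterns_to_sql` is invented for the example. Each triple pattern gets its own table alias (the tuple variable), a shared variable becomes an equality between aliases, and the requested variables become the SELECT list.

```python
import sqlite3

def patterns_to_sql(select_vars, patterns):
    """Translate triple patterns into one SQL self-join over `triples`.

    Terms starting with '?' are variables; anything else is a constant.
    Returns (sql, params).
    """
    columns = ("subject", "predicate", "object")
    first_seen = {}          # variable -> column where it first appeared
    where, params = [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")       # constant in the pattern
                params.append(term)
            elif term in first_seen:
                # Node shared by several patterns: join constraint.
                where.append(f"{ref} = {first_seen[term]}")
            else:
                first_seen[term] = ref
    select = ", ".join(first_seen[v] for v in select_vars)
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT {select} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("_:card", "vcard:FN", "John Smith"),
    ("_:card", "vcard:N", "_:name"),
    ("_:name", "vcard:Family", "Smith"),
    ("_:name", "vcard:Given", "John"),
])
sql, params = patterns_to_sql(
    ["?family", "?given"],
    [("?vcard", "vcard:FN", "John Smith"),
     ("?vcard", "vcard:N", "?name"),
     ("?name", "vcard:Family", "?family"),
     ("?name", "vcard:Given", "?given")],
)
rows = conn.execute(sql, params).fetchall()
```

With the four sample triples, the query from the earlier RDQL example slide becomes a four-way self-join and `rows` holds the single pair `("Smith", "John")`.
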
4.4. Query Execution (cont’d)
• The resource/literal type of any free variable in the query can often be determined by inspection:
  – when it appears as the subject or predicate of a triple expression
  – when it is explicitly bound to a URI constant in the constraints section of the RDQL query
  – when it appears as the value of a predicate whose range is known

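A rough sketch of this inspection, under stated assumptions: the function name `infer_variable_types` and the `known_ranges` mapping are hypothetical, only the first and third cases are handled (bindings in the constraints section are left out), and variables whose type cannot be determined are reported as "unknown".

```python
def infer_variable_types(patterns, known_ranges):
    """Map each variable to 'resource', 'literal' or 'unknown'.

    known_ranges maps a predicate to 'resource' or 'literal'.
    """
    types = {}
    for subject, predicate, obj in patterns:
        # A variable in subject or predicate position must be a resource.
        for term in (subject, predicate):
            if term.startswith("?"):
                types[term] = "resource"
        if obj.startswith("?"):
            # An object variable takes its type from the predicate's range,
            # unless an earlier pattern already determined it.
            if types.get(obj, "unknown") == "unknown":
                types[obj] = known_ranges.get(predicate, "unknown")
    return types

types = infer_variable_types(
    [("?person", "ex:age", "?age"),
     ("?person", "ex:owns", "?company"),
     ("?s", "?p", "?o")],
    {"ex:age": "literal", "ex:owns": "resource"},
)
```

Here `?age` is typed as a literal and `?company` as a resource from the assumed ranges, while `?o` stays "unknown" because its predicate is itself a variable.
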
4.4. Query Execution (cont’d)
• A problem arises when the type of the requested attribute is unknown, as in the following example:
  – SELECT ?a WHERE (, , ?a)

4.6. Hybrid eager / lazy production
• The model for RDF/RDFS contains a number of entailments which should be generated by an RDF inference engine.
• The most common approach to supporting this is a production rule system, which
  – generates the required entailments either by forward chaining from asserted facts or by backward chaining from queries that are presented to the system.

Entailments
• A entails B if
  – whenever A is true, then B must be true, and
  – if B is false, then A is false
• Example
  – A1: Carmen stole the Mona Lisa this morning
  – Entailment?
    • Carmen stole something. (YES)
    • Something was stolen this morning. (YES)
    • Carmen believes the Mona Lisa is valuable. (NO)

4.6. Hybrid eager / lazy production (cont’d)
• Forward chaining (Sesame)
  – It applies the entailment rules in the model theory exhaustively to the asserted facts in order to generate the RDF closure of those facts.
  – Pros: reduces the processing cost of evaluating queries (a considerable advantage for interactive applications)
  – Cons: increases the space needed to store the RDF closure

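Forward chaining can be sketched as a fixpoint loop: keep applying rules to the known triples until nothing new is produced. The code below is an illustration only, not Sesame's or 3store's implementation. The rule shown is the sub-property rule that these slides call rdfs6, and the property names `controls` and `relatedTo` are invented for the demo.

```python
SUBPROP = "rdfs:subPropertyOf"

def rdfs6(triples):
    """xxx aaa yyy, aaa rdfs:subPropertyOf bbb  =>  xxx bbb yyy."""
    supers = {}
    for s, p, o in triples:
        if p == SUBPROP:
            supers.setdefault(s, set()).add(o)
    return {(s, b, o) for s, p, o in triples for b in supers.get(p, ())}

def forward_closure(asserted, rules):
    """Apply every rule until no new triple appears (a fixpoint)."""
    closure = set(asserted)
    while True:
        new = set()
        for rule in rules:
            new |= rule(closure) - closure
        if not new:
            return closure
        closure |= new

asserted = {
    ("jim", "owns", "acme"),
    ("owns", SUBPROP, "controls"),
    ("controls", SUBPROP, "relatedTo"),
}
closure = forward_closure(asserted, [rdfs6])
```

The three asserted triples grow to five stored triples, which shows the trade-off named above: queries become simple lookups, at the cost of storing every derived fact.
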
4.6. Hybrid eager / lazy production (cont’d)
• Backward chaining
  – It evaluates the entailments matched by a query at query processing time.
  – Pros: reduces the cost of storing the entailments
  – Cons: lazy production may incur a time penalty that makes its use in interactive applications unworkable

4.6. Hybrid eager / lazy production (cont’d)
• A Solution
  – adopt a hybrid approach
    • Entailment rules which generate fewer entailments are evaluated eagerly, using forward chaining when new facts are asserted.
    • The other entailment rules, which would need more storage and have a lower evaluation cost, are evaluated as required at query time by a combination of backward chaining and query rewriting.

4.6. Hybrid eager / lazy production (cont’d)
• By producing some of the entailments lazily, it is possible to trade triple table size and assertion time complexity for query time complexity.
• Example: rdfs6
  – Premises: xxx aaa yyy and aaa rdfs:subPropertyOf bbb
  – Conclusion: xxx bbb yyy
  – Query rewriting: the pattern (a, b, c) becomes (a, ?p, c) (?p, rdfs:subPropertyOf, b)

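The rewriting in the rdfs6 example can be sketched as follows. It is a toy matcher over an in-memory set of triples, not 3store's SQL rewriting, and the names `match` and `ask_rdfs6` are invented for the example. The direct lookup is kept alongside the rewritten query so that triples asserted with property b itself are still found.

```python
SUBPROP = "rdfs:subPropertyOf"

def match(triples, pattern, binding):
    """Yield bindings that extend `binding` so `pattern` matches a triple."""
    for triple in triples:
        extended = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier value.
                if extended.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield extended

def ask_rdfs6(triples, a, b, c):
    """True if (a, b, c) is asserted or entailed by one rdfs6 step."""
    if (a, b, c) in triples:
        return True
    # Rewritten query: (a, ?p, c) (?p, rdfs:subPropertyOf, b)
    for binding in match(triples, (a, "?p", c), {}):
        if any(match(triples, ("?p", SUBPROP, b), binding)):
            return True
    return False

stored = {("jim", "owns", "acme"), ("owns", SUBPROP, "controls")}
entailed = ask_rdfs6(stored, "jim", "controls", "acme")
not_entailed = ask_rdfs6(stored, "jim", "likes", "acme")
```

Nothing extra is stored: the triple (jim, controls, acme) is never written, yet the query finds it. The sketch follows only one sub-property step, so longer chains are found only if the transitive sub-property links are already present in the store.
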
4.7. Transitive Entailment
• Transitive Entailment
  – rdfs5a expresses the transitive nature of some properties in the RDFS vocabularies:
    • Premises: aaa rdfs:subPropertyOf bbb and bbb rdfs:subPropertyOf ccc
    • Conclusion: aaa rdfs:subPropertyOf ccc
  – These transitive entailments are implemented using an optimized implementation of the TRANSITIVE-CLOSURE function

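For reference, a plain (unoptimized) transitive closure can be sketched as a graph search from every node. This is not the optimized TRANSITIVE-CLOSURE implementation the slide refers to, whose details are not given here; the function name and the sample property names are assumptions.

```python
def transitive_closure(pairs):
    """Return all (a, c) reachable through one or more (a, b) links."""
    successors = {}
    for a, b in pairs:
        successors.setdefault(a, set()).add(b)
    closure = set()
    for start in successors:
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue          # already reached from `start`
            seen.add(node)
            closure.add((start, node))
            stack.extend(successors.get(node, ()))
    return closure

# Each pair (a, b) stands for: a rdfs:subPropertyOf b
sub_property_of = {("a", "b"), ("b", "c"), ("c", "d")}
closure = transitive_closure(sub_property_of)
```

For the three-link chain in the example, the closure contains six pairs, including the derived pair `("a", "d")`.
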
5. Performance
• On commodity hardware,
  – 3store can assert between 100 and 300 triples/second, depending on the size of the knowledge base.
• Some optimizations
  – optionally locking the tables to prevent index rebuilding
  – grouping SQL INSERT statements

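The second optimization, grouping INSERT statements, can be illustrated with a short sketch. It uses SQLite through Python's `sqlite3` module as a stand-in for MySQL, with a hypothetical `triples` table and an arbitrary batch size of 100. Table locking is not shown. The function name `insert_grouped` is an assumption.

```python
import sqlite3

def insert_grouped(conn, triples, batch_size=100):
    """Insert triples using one multi-row INSERT per batch."""
    statements = 0
    for start in range(0, len(triples), batch_size):
        batch = triples[start:start + batch_size]
        placeholders = ", ".join(["(?, ?, ?)"] * len(batch))
        flat = [value for triple in batch for value in triple]
        conn.execute(f"INSERT INTO triples VALUES {placeholders}", flat)
        statements += 1
    conn.commit()
    return statements

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
data = [(f"s{i}", "p", f"o{i}") for i in range(250)]
sent = insert_grouped(conn, data)
stored = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

In this run 250 triples are written with 3 statements instead of 250, which cuts per-statement overhead; the actual gain depends on the database engine and on the batch size chosen.
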
6. Conclusion
• 3store is an RDF engine which efficiently supports the RDF and RDFS entailments over relatively large RDF knowledge bases, using a relational database back-end.
• At the time, the authors noted a lack of standardized performance benchmarks for RDF knowledge bases.

Question and Answer