Materializing and Querying Learned Knowledge

Volker Tresp, Yi Huang (Siemens, Corporate Technology)

Markus Bundschus (University of Munich)

Achim Rettinger (Technical University of Munich)

Deductive and Inductive Reasoning

• In many Semantic Web (SW) domains, a tremendous number of statements (expressed as triples) might be true, but only a small number of them are known to be true or can be inferred to be true
• There are regularities in the data, but they cannot be captured by axioms. Example: large people tend to have a higher weight

The goal of this work:
• Estimate the truth values of statements by exploring regularities in the SW data with machine learning
• Store the probabilistic triples in the SW-KB
• Make those triples available for querying

IRMLES 2009

A Regular SPARQL Query

Query (including deductive inference): Find all actors that act in movies that are filmed in an Italian city

A SPARQL Query with Learned Probabilities

Query (including deduction and induction): Find all actors that are likely to act in movies that are filmed in an Italian city

[Figure: SPARQL query annotated with "learn!", "integrate in query", and "sort by probability"]

Result: Ronald Reagan, George W. Bush, … Damn … should have excluded US presidents
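The "sort by probability" idea can be sketched in a few lines of Python. This is only an illustration: the in-memory list stands in for a triple store, and the actor names, movie names, probabilities, and the `threshold` parameter are all hypothetical and not taken from the talk.

```python
# Hypothetical probabilistic triples: (subject, predicate, object, probability).
# All names and numbers are invented for illustration only.
learned = [
    ("actorA", "actsIn", "movie1", 0.91),
    ("actorB", "actsIn", "movie1", 0.35),
    ("actorC", "actsIn", "movie2", 0.78),
    ("actorD", "actsIn", "movie3", 0.66),
]
# Stand-in for the deductive part: movies filmed in an Italian city.
filmed_in_italian_city = {"movie1", "movie2"}


def likely_actors(triples, movies, threshold=0.5):
    """Return (actor, probability) pairs above the threshold, most probable first."""
    hits = [
        (s, prob)
        for (s, pred, o, prob) in triples
        if pred == "actsIn" and o in movies and prob >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


ranking = likely_actors(learned, filmed_in_italian_city)
print(ranking)
```

The deductive condition (filmed in an Italian city) acts as a hard filter, while the learned probability only orders and thresholds the candidates.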

Requirements

• Machine learning should be “push-button”, requiring a minimum of user intervention
• Learning time should scale well with the size of the SW
• The statements and their probabilities, which are predicted from machine learning, should easily be integrated into SPARQL-type querying
• Machine learning should be suitable for the data situation on the SW, with sparse data (e.g., only a small number of persons are friends) and missing information (e.g., some people don't reveal private information)

The Key Steps
• User defines the key entity (person)
• User defines the population (person that is an actor)
• LarKC defines the sample (subset of the population)
• LarKC finds all triples in which the key entity is either subject or property value
• Aggregated features are calculated
• The (sparse, incomplete) data matrix is generated (including deduced triples)
• Pruning: columns with few ones are removed
• Learning by matrix completion methods
• The learned model makes predictions in the sample
• The learned model is applied to the population
• A subset of the estimated (probabilistic) triples is written into the triple store
• Queries can be formulated
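The matrix generation and pruning steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it only handles triples where the key entity is the subject, the toy triples are hypothetical, and the `min_ones` pruning threshold is an invented parameter, not a value from the talk.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); not data from the talk.
triples = [
    ("Joe", "knows", "Jack"),
    ("Joe", "knows", "Mary"),
    ("Jack", "knows", "Mary"),
    ("Mary", "knows", "Jack"),
    ("Joe", "residence", "Boston"),
    ("Jack", "hasImage", "img1"),
]
sample = ["Joe", "Jack", "Mary"]  # key entities in the sample


def build_matrix(triples, entities, min_ones=2):
    """Build a binary entity-by-feature matrix and prune sparse columns."""
    rows = defaultdict(set)
    for s, p, o in triples:
        if s in entities:
            rows[s].add(f"{p}:{o}")  # one column per (property, value) pair
    # Count the ones in each column.
    counts = defaultdict(int)
    for feats in rows.values():
        for feat in feats:
            counts[feat] += 1
    # Pruning: keep only columns with at least `min_ones` ones.
    columns = sorted(f for f, c in counts.items() if c >= min_ones)
    matrix = [[1 if c in rows[e] else 0 for c in columns] for e in entities]
    return columns, matrix


columns, matrix = build_matrix(triples, sample)
print(columns)
for name, row in zip(sample, matrix):
    print(name, row)
```

A zero in this matrix means "not known to be true" rather than "false", which is why the subsequent learning step is framed as matrix completion.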

FOAF Experiment

[Figure: example FOAF graph around the persons Joe, Jack, and Mary. Recoverable node labels: Ivy League, Harvard, Boston, NE-US, OnlineChatAccount, image, 1980, ThirtySomething. Recoverable edge labels: knows, attends, residence, locatedIn, inRegion, dateOfBirth, posted, #ofBlogs, holds, has, type, ageGroup. Some edges are marked "From subclass relations".]

RULE: If born between 1979 and 1989 then in ageGroup ThirtySomething
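A rule like this one can be materialized as deduced triples before the data matrix is built. The sketch below is a hedged illustration: the inclusive bounds are an assumption (the slide does not say), and the birth years passed in are hypothetical.

```python
def age_group(birth_year):
    """Apply the slide's rule: born between 1979 and 1989 -> ThirtySomething."""
    # Inclusive bounds are an assumption; the slide does not specify them.
    if 1979 <= birth_year <= 1989:
        return "ThirtySomething"
    return None


def deduce_age_triples(birth_years):
    """Materialize (person, 'ageGroup', group) triples for matching persons."""
    deduced = []
    for person, year in birth_years.items():
        group = age_group(year)
        if group is not None:
            deduced.append((person, "ageGroup", group))
    return deduced


# Hypothetical birth years, for illustration only.
deduced = deduce_age_triples({"Joe": 1980, "Mary": 1975})
print(deduced)
```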

Data Matrix (FOAF)

[Table: binary data matrix. Rows: Joe, Jack, Mary, … Columns: ageGroup thirtySomething; knows ageGroup thirtySomething; residence Boston; residence NYC; residence inRegion NE-US; holds OnlineChatAccount; hasImage; knows Joe; knows Jack; knows Mary. The cell values are not recoverable from the transcript.]

FOAF Experiment Statistics
• We selected 636 persons with "dense" friendship information
• On average, a given person has 18 friends
• Numerical values such as date of birth or the number of blog posts were discretized
• The resulting data matrix, after pruning columns with few ones, has 636 persons (rows) and 491 columns
• 462 of the 491 columns (friendship attributes) refer to the property knows
• The remaining columns (general attributes) refer to general information about age, location, number of blog posts, attended school, etc.
• We can then answer queries such as: Who would likely want to be Jack's friend? Which female persons in the north-east US would likely want to be Jack's friends?

Learning Approaches

• SVD based: $X = U D V^T$, $\hat{X} = U D^{(r)} V^T$
• NNMF: $X = A B^T$, with $a_{i,j} \ge 0$ and $b_{i,j} \ge 0$
• LDA: $X = A B^T$, with $a_{i,k} = P(\mathrm{attr} = i \mid z = k)$ and $b_{i,k} = P(\mathrm{KE} = i \mid z = k)$
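To make the NNMF factorization $X = A B^T$ with non-negative factors concrete, here is a minimal pure-Python sketch. It uses the standard multiplicative update rules for the squared error, which is an assumption: the slides do not say which NNMF algorithm was used. The toy matrix, the helper names, and the parameters `k`, `iters`, `seed`, and `eps` are all hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frob_error(x, a, b):
    """Squared Frobenius error between X and A B^T."""
    approx = matmul(a, transpose(b))
    return sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))


def nnmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor X ~ A B^T with non-negative A (n x k) and B (m x k)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    a = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    b = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    xt = transpose(x)
    for _ in range(iters):
        # A <- A * (X B) / (A B^T B), element-wise
        num, den = matmul(x, b), matmul(a, matmul(transpose(b), b))
        a = [[a[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
        # B <- B * (X^T A) / (B A^T A), element-wise
        num, den = matmul(xt, a), matmul(b, matmul(transpose(a), a))
        b = [[b[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return a, b


# Toy binary matrix: rows are key entities, columns are attributes (hypothetical).
x = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
a, b = nnmf(x, k=2)
x_hat = matmul(a, transpose(b))  # entries of x_hat serve as scores for unknown triples
```

In the setting of the talk, the entries of the reconstructed matrix at positions where `x` is zero would be read as scores for statements that are not yet known to be true.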

Experimental Results

[Figure: NDCG score for different learning approaches]
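For readers unfamiliar with the metric, NDCG (normalized discounted cumulative gain) rewards rankings that place relevant items near the top. The sketch below implements one common definition with a log2 rank discount; the slides do not say which variant was used, so treat this as an assumption.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(ranked_relevances):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Relevance of each returned item in ranked order (1 = true friend, 0 = not).
print(ndcg([1, 1, 0]))
print(ndcg([0, 1]))
```

A ranking that already lists all relevant items first scores 1.0, and any misordering lowers the score.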

Persisting Probabilistic Triples

• quadruple:
  PersonA foaf:knows PersonB 0.758

• reification (simplest but high memory cost):
  _:node rdf:type Statement
  _:node rdf:subject PersonA
  _:node rdf:predicate foaf:knows
  _:node rdf:object PersonB
  _:node prob 0.758

• blank node:
  PersonA kp _:node
  _:node foaf:knows PersonB
  _:node prob 0.758
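The three storage layouts can be compared by generating each of them for the same probabilistic statement. The following Python sketch is an interpretation of the slide's diagrams, not the authors' implementation: the function names are hypothetical, and the tuples are plain Python values rather than statements in a real triple store.

```python
def as_quadruple(s, p, o, prob):
    """One statement with the probability as a fourth element."""
    return [(s, p, o, prob)]


def as_reification(s, p, o, prob, node="_:node"):
    """Reification: a node describing the statement, plus its probability."""
    return [
        (node, "rdf:type", "Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
        (node, "prob", prob),
    ]


def as_blank_node(s, p, o, prob, node="_:node"):
    """Blank-node layout: the subject links to a node carrying the relation and probability."""
    return [
        (s, "kp", node),
        (node, p, o),
        (node, "prob", prob),
    ]


statement = ("PersonA", "foaf:knows", "PersonB", 0.758)
for layout in (as_quadruple, as_reification, as_blank_node):
    print(layout.__name__, len(layout(*statement)))
```

Counting the generated tuples makes the storage trade-off visible: one quadruple, three triples for the blank-node layout, and five triples for reification.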

Results: Who wants to be Trelena’s Friend?

Conclusion and Outlook
• We presented a novel generic learning approach for deriving probabilistic SW statements and demonstrated how these can be integrated into an extended SPARQL query
• The approach is suitable for typical situations with sparse/missing data
• The learning process is to a large degree autonomous (goal!)
• Generalization from the sample to the population is linear in the size of the population (matrix!)
• Learned statements are materialized for fast querying
• LDA showed the best performance (Bayesian averaging)
• Part of EU-FP7: LarKC
