Improving Wikipedia with DBpedia
Diego Torres, Pascal Molli, Hala Skaf-Molli and Alicia Diaz
University of La Plata, Argentina
University of Nantes, France
Semantic Web Collaborative Spaces, SWCS2012
April 17th, 2012, Lyon
Context
• The Semantic Web is growing fast. It is mainly built by extracting information from the Social Web.
  – DBpedia is extracted from Wikipedia infoboxes.
  – It is possible to make semantic queries and deduce new data.
• How can DBpedia help improve Wikipedia?
• More generally, how can the Semantic Web help improve the Social Web?
DBpedia Query Support
Retrieve the list of people together with the city where they were born.
  SELECT ?city, ?person
  WHERE { ?person a Person .
          ?city a City .
          ?person birthplace ?city }
• (Paris, Pierre_Curie)
• (Rosario, Lionel_Messi)
• (Boston, Robin_Moore)
...
• 119097 results!
Checking the same results in Wikipedia
Navigation from city to person:
• (Paris, Pierre_Curie) OK
• (Rosario, Lionel_Messi) OK
• (Boston, Robin_Moore) X
...
It is not possible to navigate from Boston to Robin Moore, but Robin Moore was born in Boston!
In 49899 of the 119097 cases, navigation from the city to the person is NOT possible. In only 65200 of the 119097 cases is it possible to navigate from the city to the person.
General Issues
• There is an information gap between DBpedia and Wikipedia.
• If I want to repair it, which Wikipedia conventions do I have to follow?
  – If I am a good Wikipedian and carefully read the Wikipedia conventions [1], I have to follow this path:
  – Boston/Category:Boston/Category:People_from_Boston/Robin_Moore
• Is it possible to discover this automatically?
[1] http://en.wikipedia.org/wiki/Wikipedia:Categorization_of_people
Approach
• Learn from the good cases in Wikipedia.
  SELECT ?city, ?person
  WHERE { ?person a Person .
          ?city a City .
          ?person birthplace ?city }
  (Paris, Pierre_Curie)    OK
  (Rosario, Lionel_Messi)  OK
  (Boston, Robin_Moore)    X
  ...
• For example, for people and their birthplaces, we can learn from the 65200 of 119097 cases where navigation is possible.
Approach
• Learn?
  – If Wikipedia were a database, which Wikipedia query would best approximate the results obtained by a DBpedia query?
• But Wikipedia is not a database. The idea:
  – Index the relevant fragment of Wikipedia as a path database.
  – Next, find the path query that best approximates the DBpedia query.
Wikipedia as a graph DB with Path Queries
• Considering Wikipedia as a graph DB, I want to ask path queries [Abiteboul97]:
  – Retrieve all people p from a given city c.
  – PQ1(c,p) = c/Category:c/Category:People_from_c/p
• We hypothesize that the shortest path query that maximally contains the DBpedia results is the best expression of this semantic relation in Wikipedia.
[Abiteboul97] Path Queries, Abiteboul & Vianu, SIGMOD 97
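One possible reading of this hypothesis, sketched in Python: among all candidate path queries, prefer the one that covers the most DBpedia result pairs, and break ties by taking the one with the fewest steps. The function name `best_path_query` and the shape of `index` are assumptions made for illustration; the slides do not specify how the selection is implemented.

```python
from collections import defaultdict


def best_path_query(index):
    """Pick the path query covering the most pairs; ties go to the shortest.

    `index` is a hypothetical structure mapping each (source, target) pair
    returned by DBpedia to the set of path queries connecting it in Wikipedia.
    """
    coverage = defaultdict(set)
    for pair, queries in index.items():
        for query in queries:
            coverage[query].add(pair)
    # Sort key: largest coverage first, then fewest path steps,
    # then the query text itself so the choice is deterministic.
    return min(coverage, key=lambda q: (-len(coverage[q]), q.count("/"), q))


index = {
    ("Paris", "Pierre_Curie"): {
        "#from/Category:#from/Category:People_from_#from/#to",
    },
    ("Rosario", "Lionel_Messi"): {
        "#from/Category:#from/Category:People_from_#from/#to",
        "#from/#to",
    },
}
print(best_path_query(index))
```

In this toy index the category path covers both pairs while the direct link covers only one, so the longer query is selected despite its length.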
Path Indexing
• Index a subgraph of Wikipedia given the nodes resulting from the DBpedia query.
• Collect all the paths that link source and target with a DFS algorithm.
• Reduce the alphabet by "wildcarding" the properties linked to source and target.
• Example:
  – Boston/Category:Boston/Category:People_from_Boston/Robin_Moore
  – #from/Category:#from/Category:People_from_#from/#to
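The indexing steps can be sketched in Python on a toy graph. This is an illustration under stated assumptions, not the authors' implementation: the names `find_paths`, `to_path_query` and `max_len` are hypothetical, `max_len` is assumed to bound the number of nodes in a path, and wildcarding is done by plain string replacement of the source and target names.

```python
def find_paths(graph, source, target, max_len=5):
    """Collect every simple path from source to target with DFS.

    `graph` maps a page or category name to the names it links to.
    Paths with more than `max_len` nodes are not explored (assumed meaning).
    """
    paths = []

    def dfs(node, path):
        if node == target:
            paths.append(list(path))
            return
        if len(path) >= max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:  # keep paths simple, avoid cycles
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(source, [source])
    return paths


def to_path_query(path):
    """Wildcard a concrete path: source name -> #from, target name -> #to."""
    source, target = path[0], path[-1]
    middle = [n.replace(source, "#from").replace(target, "#to") for n in path[1:-1]]
    return "/".join(["#from", *middle, "#to"])


# Toy fragment of the link graph, taken from the example on the slide.
graph = {
    "Boston": ["Category:Boston"],
    "Category:Boston": ["Category:People_from_Boston"],
    "Category:People_from_Boston": ["Robin_Moore"],
}
paths = find_paths(graph, "Boston", "Robin_Moore")
print([to_path_query(p) for p in paths])
```

Replacing names by substring is the simplest choice and would misbehave when one name is contained in the other; a real index would need a more careful substitution.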
RSq(d,r) | Path | Path Query
(Paris, Pierre_Curie) | Paris/Category:Paris/Category:People_from_Paris/Pierre_Curie | #from/Category:#from/Category:People_from_#from/#to
(Rosario, Lionel_Messi) | Rosario/Category:Rosario/Category:People_from_Rosario/Lionel_Messi | #from/Category:#from/Category:People_from_#from/#to
(Rosario, Lionel_Messi) | Rosario/Lionel_Messi | #from/#to
Path indexing results 65200
Evaluation
• Run 6 queries in DBpedia and compute the path query index by running PIA (maxL=5).
  – Compute Found, Not Found and Errors.
  – Compute the ratio between the number of paths generated with DFS up to maxL and the number of path queries.
• Run the SCMPQ on Wikipedia.
  – Analyze the path query as a function of the returned values.
  – Compute precision and recall.
• Community validation
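As a reminder of what the precision and recall step measures, here is a minimal Python sketch assuming the standard set-based definitions, with the DBpedia result pairs as the reference set and the pairs returned by a path query on Wikipedia as the retrieved set. The slides do not spell out the exact formulas used, and the pairs with `Some_Other_Person` and `Another_Person` are invented placeholders.

```python
def precision_recall(retrieved, relevant):
    """Standard set-based precision and recall over (source, target) pairs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Reference pairs, as returned by the DBpedia query.
dbpedia_pairs = {
    ("Paris", "Pierre_Curie"),
    ("Rosario", "Lionel_Messi"),
    ("Boston", "Robin_Moore"),
}
# Pairs reached by a path query in Wikipedia (two of them are placeholders).
wikipedia_pairs = {
    ("Paris", "Pierre_Curie"),
    ("Rosario", "Lionel_Messi"),
    ("Paris", "Some_Other_Person"),
    ("Rosario", "Another_Person"),
}
precision, recall = precision_recall(wikipedia_pairs, dbpedia_pairs)
print(precision, recall)
```

A path query that returns many pairs unknown to DBpedia lowers precision without necessarily being wrong, since those pairs may simply be missing from DBpedia.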
Evaluation – Queries in DBpedia
#EQ1: Cities and people born there.
  SELECT ?city, ?person
  WHERE { ?person a Person . ?city a City . ?person birthplace ?city }
#EQ2: Cities and philosophers born there.
  SELECT ?city, ?philosopher
  WHERE { ?philosopher a Philosopher . ?city a City . ?philosopher birthplace ?city }
#EQ3: Philosophers born in France.
  SELECT France, ?philosopher
  WHERE { ?philosopher a Philosopher . ?philosopher birthplace France }
#EQ4: Books and their authors.
  SELECT ?book, ?author
  WHERE { ?book a Book . ?book author ?author }
#EQ5: Works and their music composers.
  SELECT ?musician, ?work
  WHERE { ?work a Work . ?work musicBy ?musician }
#EQ6: Cities and their universities.
  SELECT ?city, ?university
  WHERE { ?university a University . ?city a City . ?university city ?city }
Query Results
Query | Domain | Range | Number of pairs in DBpedia | Pairs where a WP path exists | Pairs where NO WP path exists | Synchronization errors
EQ1 | City | Person | 119097 | 65200 | 49899 | 3998
EQ2 | City | Philosopher | 171 | 103 | 61 | 7
EQ3 | France | Philosopher | 21 | 21 | 0 | 0
EQ4 | Book | Author | 24185 | 20328 | 3689 | 168
EQ5 | Work | Musician | 1204 | 836 | 367 | 1
EQ6 | City | University | 14094 | 9497 | 4404 | 193
Query | Path | #
Q1: Cities-people | #from/Cat:#from/Cat:People from #from/#to | 34008
Q1: Cities-people | #from/#to | 3188
Q2: Cities-philo | #from/Cat:#from/Cat:People from #from/#to | 60
Q2: Cities-philo | #from/Cat:Capitals in Europe/Cat:#from/People from #from/#to | 15
Q3: Philo-France | #from/Cat:#from/Cat:French people/Cat:French people by occupation/French philosophers/#to | 21
Q3: Philo-France | #from/Cat:#from/Cat:French people/Cat:French people by occupation/French sociologists/#to | 3
Q4: Book-authors | #from/#to | 19863
Q4: Book-authors | #from/Cat:#to/#to | 119
Q5: Musician-work | #from/#to | 811
Q5: Musician-work | #from/Cat:Tony Award winners/Cat:Tony Award winning musicals/#to | 26
Q6: Cities-universities | #from/#to | 6031
Q6: Cities-universities | #from/Cat:#from/#to | 15
Query | Prop | Precision | Recall | Contrib | Rejected
Q1: Cities-people | #from/Cat:#from/Cat:People from #from/#to | 0.415 | 0.52 | 78 | 12
Q2: Cities-philo | #from/Cat:#from/Cat:People from #from/#to | 0.003 | 0.58 | |
Q3: Philo-France | #from/Cat:#from/Cat:French people/Cat:French people by occupation/French philosophers/#to | 0.099 | 1 | |
Q4: Book-authors | #from/#to | 0.22 | 0.97 | |
Q5: Musician-work | #from/#to | 0.027 | 0.97 | 36 | 0
Q6: Cities-universities | #from/#to | 0.014 | 0.63 | 17 | 1
Notes shown on the slide:
• People from Edinburgh -> Sportspeople from Edinburgh
• "Dayton-Kentucky is a small community" and "this category contain one article and have little possibility for growth"
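The reported recall values appear consistent with a simple ratio: the number of pairs covered by the selected path query divided by the number of pairs for which some Wikipedia path exists, truncated to two decimals. This is an observation drawn from the figures in the tables, not a formula stated on the slides. The Python sketch below recomputes the column; `truncate2` is a helper introduced here for illustration.

```python
import math

# (pairs covered by the selected path query, pairs where a WP path exists),
# both taken from the result tables of the six queries.
counts = {
    "Q1": (34008, 65200),
    "Q2": (60, 103),
    "Q3": (21, 21),
    "Q4": (19863, 20328),
    "Q5": (811, 836),
    "Q6": (6031, 9497),
}
# Recall values as reported in the precision and recall table.
reported = {"Q1": 0.52, "Q2": 0.58, "Q3": 1.0, "Q4": 0.97, "Q5": 0.97, "Q6": 0.63}


def truncate2(x):
    """Truncate (not round) a non-negative value to two decimals."""
    return math.floor(x * 100) / 100


recomputed = {q: truncate2(covered / with_path) for q, (covered, with_path) in counts.items()}
for q in counts:
    print(q, recomputed[q], reported[q])
```

Truncation rather than rounding is assumed because Q4 gives 19863/20328 ≈ 0.977, which is reported as 0.97.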
Conclusions and Further Work
• Which query in Wikipedia best approximates the results obtained by a DBpedia query?
  – The shortest path query that maximally contains the semantic relation expressed by the semantic query?
• Preliminary evaluations with real data are encouraging.
  – Is precision pertinent and meaningful?
  – Containment is pertinent, but there are other factors…
Future Work
  – Continue social validation and learn from social feedback.
  – Reduce the alphabet with better use of properties, which will reduce the index size.
  – Use overlap and containment between query results.
  – Improve computation time.
  – Extend to all relations in DBpedia.
Obtained Path Index
#EQ1: Cities and people born there.
  SELECT ?city, ?person
  WHERE { ?person a Person . ?city a City . ?person birthplace ?city }
207165 paths, 8118 path queries
Obtained Path Index
#EQ2: Cities and philosophers born there.
  SELECT ?city, ?philosopher
  WHERE { ?philosopher a Philosopher . ?city a City . ?philosopher birthplace ?city }
267 paths, 200 path queries
Obtained Path Index
#EQ3: Philosophers born in France.
  SELECT France, ?philosopher
  WHERE { ?philosopher a Philosopher . ?philosopher birthplace France }
391 paths, 191 path queries
Obtained Path Index
#EQ4: Books and their authors.
  SELECT ?book, ?author
  WHERE { ?book a Book . ?book author ?author }
5634 paths, 1801 path queries
Obtained Path Index
#EQ5: Works and their music composers.
  SELECT ?musician, ?work
  WHERE { ?work a Work . ?work musicBy ?musician }
183 paths, 41 path queries
Obtained Path Index
#EQ6: Cities and their universities.
  SELECT ?city, ?university
  WHERE { ?university a University . ?city a City . ?university city ?city }
26175 paths, 6701 path queries