Similarity-Aware Indexing for Real-Time Entity Resolution

Peter Christen

Ross Gayler†

Introduction

Entity resolution is the task of identifying and matching records that refer to the same entities from different databases. Traditionally, this task is applied in batch mode and on static databases, for example to find records that relate to the same patient in different health databases, or duplicate records in bibliographic databases. Most research in entity resolution has concentrated on improving the matching quality and scalability to large databases, or reducing the manual effort required in the entity resolution process. Many organisations are, however, increasingly faced with the challenge of having large databases containing entities that need to be matched in real time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include identity verification for online services and benefits, national security databases, and digital libraries.

The aim of real-time entity resolution is to match query records containing entities as fast as possible to one or several databases that contain records about existing entities. The approach must facilitate approximate matching; scale efficiently to large databases that contain many millions of records; and generate a match score that indicates the likelihood that a matched record in the database refers to the same entity as the query record.

Indexing for Real-time Entity Resolution

The idea of our similarity-aware index is to calculate similarities between unique attribute values once during the build phase, so that they do not need to be calculated for every query record. Unlike other approximate query matching techniques, our approach allows any similarity comparison function and any 'blocking' (encoding) function, both possibly domain specific, to be used.

Record ID   Surname   Soundex encoding
r1          smith     s530
r2          miller    m460
r3          peter     p360
r4          myler     m460
r5          smyth     s530
r6          millar    m460
r7          smith     s530
r8          miller    m460
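The Soundex column of the example table can be reproduced with a short sketch of standard (American) Soundex. This is only an illustration of one possible encoding function: the function name `soundex`, the lower-case output format and the `surnames` dictionary are assumptions made here to match the table, not the authors' implementation.

```python
# Letter-to-digit groups of standard (American) Soundex.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return a four-character, lower-case Soundex code (hypothetical helper)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")  # code of the first letter, if it has one
    for ch in letters[1:]:
        code = _CODES.get(ch, "")      # vowels, h, w and y have no code
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":             # h and w do not separate equal codes
            prev = code
    # Keep the first letter, pad with zeros, truncate to four characters.
    return (letters[0] + "".join(digits) + "000")[:4]


# Surnames from the example table, keyed by record identifier.
surnames = {"r1": "smith", "r2": "miller", "r3": "peter", "r4": "myler",
            "r5": "smyth", "r6": "millar", "r7": "smith", "r8": "miller"}

# Group the unique surnames into blocks by their encoding.
blocks = {}
for rec_id, surname in surnames.items():
    blocks.setdefault(soundex(surname), set()).add(surname)
```

Any other encoding could be swapped in for `soundex`, since the approach does not depend on a particular blocking function.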


For each query record, the matching process returns a ranked list of potential matches and their similarities with the query record. A true match is achieved if one of the top ranked records refers to the same entity as the query record.

Experimental Evaluation

We compared the similarity-aware index approach with standard blocking (a basic inverted index) as traditionally used for entity resolution. Both were implemented in Python, and experiments were conducted on an otherwise idle Linux server with two quad-core 2.33 GHz 64-bit CPUs and 8 GBytes of memory. The experiments used a data set of nearly 7 million records containing surnames, postcodes and suburb (town) names sourced from an Australian telephone directory from 2002 (footnote 1). A given name attribute was added to this data set based on a list of around 80,000 unique given names and their frequencies. To evaluate scalability, we created test data sets of four different sizes containing 10%, 40%, 70% and 100% of the full data set (691,751, 2,767,006, 4,842,260 and 6,917,514 records). Both index approaches were queried with five sets of query records containing zero to four manual modifications per record.

[Figure 2 panels: build time and average query time (log seconds) against the number of records in the data set, each comparing Standard Blocking with the Sim-Aware Index; memory usage; accuracy for the data set with 6,917,514 records.]

[Figure 1, query example] Example query record: surname miller, Soundex encoding m460.
(1) Use the encoding to get values from the same block: millar, miller, myler.

Query records can either refer to an entity stored in the index or to a new, unknown entity. It is assumed that query records can contain variations, errors, out-of-date and missing values. For values not stored in the index, the similarities between attribute values need to be calculated at query time.

(2) Get all relevant pre-calculated similarities. In Figure 1 the similarity index SI holds values of 0.9, 0.8 and 0.7 for the surname pairs in block m460, and 0.9 for smith and smyth.
(3) Accumulate similarities for matching record identifiers.
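The three query steps of Figure 1 can be sketched on a hand-written copy of the example index. Everything named here is an assumption for illustration: the dictionaries `BI`, `SI` and `RI` are plain Python stand-ins for the block, similarity and record identifier indexes, the similarity values are illustrative numbers loosely following the figure, `difflib.SequenceMatcher` is a stand-in for whatever comparison function is actually used, and an exact match is given a similarity of 1.0.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hand-written example index (similarity values are illustrative only).
BI = {"m460": ["millar", "miller", "myler"],
      "p360": ["peter"],
      "s530": ["smith", "smyth"]}
SI = {"millar": {"miller": 0.9, "myler": 0.7},
      "miller": {"millar": 0.9, "myler": 0.8},
      "myler": {"millar": 0.7, "miller": 0.8},
      "peter": {},
      "smith": {"smyth": 0.9},
      "smyth": {"smith": 0.9}}
RI = {"millar": ["r6"], "miller": ["r2", "r8"], "myler": ["r4"],
      "peter": ["r3"], "smith": ["r1", "r7"], "smyth": ["r5"]}


def query(value, code):
    """Return (record id, score) pairs, best match first (hypothetical helper)."""
    # (1) Use the encoding to get the values from the same block.
    block = BI.get(code, [])
    # (2) Get the pre-calculated similarities, or calculate them at query
    #     time when the query value is not stored in the index.
    if value in SI:
        sims = dict(SI[value])
        sims[value] = 1.0  # assumption: an exact match scores 1.0
    else:
        sims = {other: SequenceMatcher(None, value, other).ratio()
                for other in block}
    # (3) Accumulate similarities for the matching record identifiers.
    scores = defaultdict(float)
    for other, sim in sims.items():
        for rec_id in RI[other]:
            scores[rec_id] += sim
    # Rank by descending score; ties are broken by record identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


known = query("miller", "m460")   # value stored in the index
unseen = query("mylar", "m460")   # value not stored in the index
```

With a single attribute the accumulation step is trivial; with several attributes per record, the same accumulator would presumably be updated once per attribute, so records that match on more attributes rise in the ranking.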

The index consists of three data structures: the block index BI contains encodings and their corresponding attribute values (known as 'blocks' in record linkage); the similarity index SI holds the pre-calculated similarities within each block; and the record identifier index RI contains attribute values and their record identifiers.
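A minimal sketch of the build phase, assuming one attribute per record and using the surnames and Soundex codes from the example table. The function `build_index`, the `RECORDS` list and the use of `difflib.SequenceMatcher` as the comparison function are assumptions made for illustration; any similarity function returning a value between 0 and 1 could take its place.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# (record id, surname, Soundex encoding) taken from the example table.
RECORDS = [("r1", "smith", "s530"), ("r2", "miller", "m460"),
           ("r3", "peter", "p360"), ("r4", "myler", "m460"),
           ("r5", "smyth", "s530"), ("r6", "millar", "m460"),
           ("r7", "smith", "s530"), ("r8", "miller", "m460")]


def similarity(a, b):
    """Stand-in comparison function returning a value in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def build_index(records, sim=similarity):
    """Build the three index structures for a single attribute."""
    bi = defaultdict(set)   # BI: encoding -> attribute values in that block
    si = defaultdict(dict)  # SI: value -> {other value in block: similarity}
    ri = defaultdict(set)   # RI: attribute value -> record identifiers
    for rec_id, value, code in records:
        ri[value].add(rec_id)
        if value in bi[code]:
            continue  # value already indexed, similarities are stored
        # Compare the new value once with every value already in its block
        # and store the result in both directions.
        for other in bi[code]:
            score = sim(value, other)
            si[value][other] = score
            si[other][value] = score
        bi[code].add(value)
    return bi, si, ri


bi, si, ri = build_index(RECORDS)
```

Because each pair of unique values within a block is compared exactly once during the build, a query for a value that is already indexed needs only dictionary look-ups.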

David Hawking‡

Figure 2: Summary of experimental results. Memory usage is plotted in log MBytes against the number of records in the data set; accuracy is plotted against the number of modifications per record (0 to 4).

Figure 1: Example database and corresponding similarity-aware index. The record identifier index RI maps millar to r6, miller to r2 and r8, myler to r4, peter to r3, smith to r1 and r7, and smyth to r5.

Conclusion and Future Work

We presented a novel index approach for real-time entity resolution, and evaluated it experimentally on a real-world data set. The experiments showed that this approach can match query records more than two orders of magnitude faster than a basic index approach traditionally used for entity resolution. Future work includes improving the accuracy of the proposed approach, a proper analysis of its time and space complexity, improving scalability and query matching time, and conducting experiments on various other large data sets.

School of Computer Science, The Australian National University, Canberra ACT 0200, Australia; [email protected]
† Scoring Solutions, Veda Advantage, Melbourne VIC 3000, Australia; [email protected]
‡ Funnelback Pty Ltd, Dickson ACT 2601, Australia; [email protected]

Footnote 1: http://www.australiaondisc.com
