Similarity-Aware Indexing for Real-Time Entity Resolution


Peter Christen*, Ross Gayler†

Introduction

Entity resolution is the task of identifying and matching records that refer to the same entities from different databases. Traditionally, this task is applied in batch-mode and on static databases, for example to find records that relate to the same patient in different health databases, or duplicate records in bibliographic databases. Most research in entity resolution has concentrated on improving the matching quality and scalability to large databases, or reducing the manual efforts required in the entity resolution process.

Many organisations are however increasingly faced with the challenge of having large databases containing entities that need to be matched in real-time with a stream of query records also containing entities, such that the best matching records are retrieved. Example applications include identity verification for online services and benefits, national security databases, and digital libraries. The aim of real-time entity resolution is to match query records containing entities as fast as possible to one or several databases that contain records about existing entities. The approach must facilitate approximate matching; scale efficiently to large databases that contain many million records; and generate a match score that indicates the likelihood that a matched record in the database refers to the same entity as the one of the query record.

Indexing for Real-time Entity Resolution

The idea of our similarity-aware index is to calculate similarities between unique attribute values once during the build phase, so they don't need to be calculated for every query record. Different to other approximate query matching techniques, our approach allows any similarity comparison function, and any 'blocking' (encoding) function, both possibly domain specific, to be used.

Record ID   Surname   Soundex encoding
r1          smith     s530
r2          miller    m460
r3          peter     p360
r4          myler     m460
r5          smyth     s530
r6          millar    m460
r7          smith     s530
r8          miller    m460
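The encodings in the example table can be reproduced with a short Soundex routine. The following is a minimal sketch of the common American Soundex rules, not the authors' implementation; the function name `soundex` and the lower-case output format are assumptions chosen to match the table.

```python
def soundex(name: str) -> str:
    """Return a lower-case, four-character Soundex code (sketch only)."""
    # Map each consonant group to its Soundex digit.
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""

    encoded = name[0]
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        # Skip letters without a code and repeats of the previous code.
        if digit and digit != previous:
            encoded += digit
        # 'h' and 'w' do not separate equal codes; vowels do.
        if ch not in "hw":
            previous = digit
    # Pad with zeros and truncate to four characters.
    return (encoded + "000")[:4]


surnames = ["smith", "miller", "peter", "myler", "smyth", "millar"]
encodings = {surname: soundex(surname) for surname in surnames}
```

Any other encoding function could be substituted here, since the approach places no restriction on the blocking function.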


For each query record, the matching process returns a ranked list of potential matches and their similarities with the query record. A true match is achieved if one of the top ranked records refers to the same entity as the query record.
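The ranking step can be sketched as an accumulator that adds up per-attribute similarities for each record identifier and then sorts by total. This is an illustration under stated assumptions: the `given_name` values, all similarity numbers, the function name `rank_matches`, and the use of a plain sum as the match score are hypothetical and not taken from the poster.

```python
from collections import defaultdict


def rank_matches(per_attribute_sims, record_ids_by_value, top_k=3):
    """Sum attribute similarities per record and return the top_k records."""
    accumulator = defaultdict(float)
    for attribute, sims in per_attribute_sims.items():
        for value, sim in sims.items():
            # Every record holding this attribute value receives its similarity.
            for rec_id in record_ids_by_value[attribute].get(value, ()):
                accumulator[rec_id] += sim
    # Highest accumulated score first; ties broken by record identifier.
    ranked = sorted(accumulator.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Surname identifiers follow the example database; given names are made up.
record_ids_by_value = {
    "surname": {"millar": ["r6"], "miller": ["r2", "r8"], "myler": ["r4"]},
    "given_name": {"anne": ["r2", "r4"], "ann": ["r6"], "tom": ["r8"]},
}
# Hypothetical similarities between the query values and indexed values.
per_attribute_sims = {
    "surname": {"miller": 1.0, "millar": 0.9, "myler": 0.8},
    "given_name": {"anne": 1.0, "ann": 0.8},
}

ranked = rank_matches(per_attribute_sims, record_ids_by_value)
ranked_ids = [rec_id for rec_id, _ in ranked]
```

With these made-up inputs, record r2 matches both query values exactly and is therefore ranked first.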

Experimental Evaluation

We compared the similarity-aware index approach with standard blocking (a basic inverted index) as traditionally used for entity resolution. Both were implemented in Python, and experiments were conducted on an idle Linux server with two quad-core 2.33 GHz 64-bit CPUs and 8 GBytes of memory. The experiments were conducted using a data set of nearly 7 million records containing surnames, postcodes and suburb (town) names¹ sourced from an Australian telephone directory from 2002. A given name attribute was added to this data set based on a list of around 80,000 unique given names and their frequencies. To evaluate scalability, we created test data sets of four different sizes containing 10%, 40%, 70% and 100% of the full data set. Both index approaches were queried with five sets of query records containing zero to four manual modifications per record.

[Figure 2 panels, each comparing Standard Blocking with the Sim-Aware Index: "Build time" and "Average query time", both in seconds on a log scale against the number of records in the data set (691,751; 2,767,006; 4,842,260; 6,917,514); "Memory usage"; and "Accuracy for data set with 6,917,514 records".]

[Figure 1, query example: a query record with surname miller and Soundex encoding m460. Step (1): use the encoding to get the values from the same block (millar, miller, myler).]

Query records can either refer to an entity stored in the index or to a new, unknown entity. It is assumed that query records can contain variations, errors, out-of-date and missing values. For values not stored in the index, the similarities between attribute values need to be calculated at query time.
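The two query cases can be sketched as follows: a value already in the index uses the stored similarities, while an unseen value is compared against its block at query time. This is a minimal single-attribute sketch; `difflib.SequenceMatcher` is only a stand-in similarity function, and the names `similarity` and `query` are assumptions. The caller is assumed to supply the encoding of the query value.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in comparison function; any similarity function could be used.
    return SequenceMatcher(None, a, b).ratio()


# Index contents for the surname attribute of the example database.
BI = {"m460": ["millar", "miller", "myler"],
      "p360": ["peter"],
      "s530": ["smith", "smyth"]}
RI = {"millar": ["r6"], "miller": ["r2", "r8"], "myler": ["r4"],
      "peter": ["r3"], "smith": ["r1", "r7"], "smyth": ["r5"]}
# Similarities within each block, computed once at build time.
SI = {value: {other: similarity(value, other)
              for other in block if other != value}
      for block in BI.values() for value in block}


def query(value: str, encoding: str):
    """Return (record id, similarity) pairs, best match first."""
    if value in SI:
        # Known value: reuse the pre-calculated similarities.
        sims = dict(SI[value])
        sims[value] = 1.0
    else:
        # Unknown value: compare against its block at query time.
        sims = {other: similarity(value, other)
                for other in BI.get(encoding, [])}
    matches = [(rec_id, sim)
               for other, sim in sims.items() for rec_id in RI[other]]
    return sorted(matches, key=lambda match: (-match[1], match[0]))


known = query("miller", "m460")   # served from SI
unseen = query("miler", "m460")   # similarities computed on the fly
```

Only the second call pays the cost of similarity calculations, which is why values missing from the index make a query slower.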

[Figure 1, query example, continued. Step (2): get all relevant pre-calculated similarities. Step (3): accumulate similarities for matching record identifiers. The similarity values shown in the figure range from 0.7 to 0.9.]

The index consists of three data structures: the block index BI contains encodings and their corresponding attribute values (known as 'blocks' in record linkage); the similarity index SI holds the precalculated similarities within each block; and the record identifier index RI contains attribute values and their identifiers.
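The three data structures can be sketched for the surname attribute of the example database as plain Python dictionaries. This is an illustrative sketch, not the authors' implementation: `build_index` and `similarity` are hypothetical names, the encodings are copied from the example table rather than computed, and `difflib.SequenceMatcher` stands in for whichever similarity function is chosen.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# (record id, surname, Soundex encoding) as listed in the example database.
RECORDS = [
    ("r1", "smith", "s530"), ("r2", "miller", "m460"),
    ("r3", "peter", "p360"), ("r4", "myler", "m460"),
    ("r5", "smyth", "s530"), ("r6", "millar", "m460"),
    ("r7", "smith", "s530"), ("r8", "miller", "m460"),
]


def similarity(a: str, b: str) -> float:
    # Stand-in comparison function (an assumption, not from the poster).
    return SequenceMatcher(None, a, b).ratio()


def build_index(records):
    """Build the block index BI, similarity index SI and record index RI."""
    ri = defaultdict(list)   # attribute value -> record identifiers
    bi = defaultdict(set)    # encoding -> attribute values in that block
    for rec_id, value, encoding in records:
        ri[value].append(rec_id)
        bi[encoding].add(value)

    # Compare each unique value once with the other values in its block.
    si = {}
    for block in bi.values():
        for value in block:
            si[value] = {other: similarity(value, other)
                         for other in block if other != value}
    return dict(bi), si, dict(ri)


BI, SI, RI = build_index(RECORDS)
```

Because similarities are stored per unique value, the two records with surname miller share a single SI entry.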

David Hawking‡

[Figure 2 axes: memory usage in MBytes on a log scale against the number of records in the data set (691,751; 2,767,006; 4,842,260; 6,917,514); accuracy against the number of modifications per record (0 to 4).]

Figure 2: Summary of experimental results.

[Figure 1, record identifier index RI: millar: r6; miller: r2, r8; myler: r4; peter: r3; smith: r1, r7; smyth: r5.]

Figure 1: Example database and corresponding similarity-aware index.

Conclusion and Future Work

We presented a novel index approach for real-time entity resolution, and evaluated it experimentally on a real-world data set. The experiments showed that this approach can match query records more than two orders of magnitude faster than a basic index approach traditionally used for entity resolution. Future work includes improving the accuracy of the proposed approach, a proper analysis of its time and space complexity, improving scalability and query matching time, and conducting experiments on various other large data sets.

¹ http://www.australiaondisc.com

* School of Computer Science, The Australian National University, Canberra ACT 0200, Australia; [email protected]
† Scoring Solutions, Veda Advantage, Melbourne VIC 3000, Australia; [email protected]
‡ Funnelback Pty Ltd, Dickson ACT 2601, Australia; [email protected]
