Department of Computer Science

Named Entities in the Digital Humanities
This presentation: http://j.mp/nerdh
Eetu Mäkelä (http://www.seco.tkk.fi/u/jiemakel/)

CKCC - An example of a DH project utilizing NER

Recogito - An example of a Named Entity reconciliation tool

Particularities of NER in the Digital Humanities
● Humanities materials are complex: a single document may contain multiple languages, language may be old or change through time in a corpus, …

Particularities of NER in the Digital Humanities
● Humanities scholars are:
  ● extremely thorough in verifying information
  ● used to huge amounts of manual work
→ NER is a part of a much larger process
→ Assume someone is going to manually go through your NER results
→ Recall is much more important than precision
→ Important to discover named entity occurrences, but not, e.g., to derive entity types
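
The trade-off between recall and precision can be made concrete with a small sketch. The gold mentions and the two tagger outputs below are invented for illustration and do not come from any project mentioned here; the function name `precision_recall` is likewise hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of entity mentions."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold standard and two hypothetical tagger outputs.
gold = {"Paris", "Amsterdam", "Rumi", "Arlincourt"}
cautious = {"Paris", "Amsterdam"}  # few, safe guesses
generous = {"Paris", "Amsterdam", "Rumi", "Arlincourt", "Monday", "Sir"}

print(precision_recall(cautious, gold))  # misses half of the entities
print(precision_recall(generous, gold))  # finds all, with some noise
```

If a scholar reviews every result by hand anyway, the noise in the generous output costs a little reviewing time, whereas the entities missed by the cautious output are never seen at all.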

Particularities of NER in the Digital Humanities
● It is important to go beyond locating named entity surface forms to strongly identify the individuals beyond them
→ Coreference resolution, use of databases of identities and name variants
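
A minimal sketch of mapping surface forms to identities through a table of name variants. The identifier `person:1`, the `VARIANTS` table, and the helper names are assumptions made up for this example; a real setup would load variants from an authority file rather than hard-code them.

```python
import unicodedata


def fold(name):
    """Lowercase, strip accents, and treat hyphens as spaces for matching."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.replace("-", " ").lower().split())


# Hypothetical identity database: identifier -> known name variants.
VARIANTS = {
    "person:1": [
        "Charles-Victor Prévost d'Arlincourt",
        "Charles Victor Prevot d'Arlincourt",
    ],
}

# Index from folded variant to identifier.
INDEX = {fold(v): ident for ident, names in VARIANTS.items() for v in names}


def identify(surface_form):
    """Return the identifier for a surface form, or None if it is unknown."""
    return INDEX.get(fold(surface_form))


print(identify("Charles Victor Prévôt d'Arlincourt"))
print(identify("Victor Hugo"))
```

Folding case, accents, and hyphens before lookup trades some precision for recall, which suits a workflow where matches are checked manually afterwards.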

Further examples of Research Questions in the Digital Humanities
● Ancient Name Dropping:
  ● Co-citation graph of mythical and real authorities in ancient Greek scientific texts
● Contextual reader:
  ● First World War primary sources
  ● Ancient texts (2)
  ● Finnish law
● Corpus of Early English Correspondence:
  ● How much do highly educated people use the word "happiness" compared with less educated people?
● Bibliothèque nationale de France:
  ● Which places published a disproportionate amount of philosophy in French in the 18th century?

Data sources for named entity information

Virtual International Authority File
● http://viaf.org/viaf/98930150/
● Joins together authority files of 45 national libraries and other institutions
● “Anyone who has ever published anything that is in any of the catalogues of the participating libraries”
● People and organizations
● 2014/02: 50 million names for 19 million entities
● 2015/05: 274 million names for 79 million entities
● Some birth/death date information

Problems for NER
● Automatic conversion from “Lastname, Firstname” to “Firstname Lastname” does not always work due to bad data

Charles-Victor Prévost d'Arlincourt
Charles Victor Prévôt ˜d'œ Arlincourt
Charles Victor Prevot d' Arlincourt
Arlincourt
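
A minimal sketch of such a conversion, assuming headings with exactly one comma are safe to invert and everything else is set aside for manual review. The function name `invert_name` and the input headings are hypothetical.

```python
def invert_name(heading):
    """Turn 'Lastname, Firstname' into 'Firstname Lastname'.

    Returns None unless the heading has exactly one comma with text on
    both sides, so that doubtful cases can be reviewed by hand.
    """
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    last, first = parts
    return f"{first} {last}"


print(invert_name("Prévost d'Arlincourt, Charles-Victor"))
print(invert_name("Arlincourt, Charles Victor Prevot d'"))
print(invert_name("Arlincourt"))
```

For the second heading, mechanical inversion leaves a stray space after the particle (`d' Arlincourt`), the same shape as one of the variants listed above. The third heading has no comma, so the sketch returns None instead of guessing.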

Problems for NER
● Different forms of encoding, typos
(Paris,) (Paris) (Paris.)

Paris A Paris [A Paris]

[Paris,] À Paris

[Paris] (Paris

Amsterdam. - et Paris Amsterdam ; et Paris Amsterdam. - et à Paris Amsterdam [Paris] (Paris. - Amsterdam A Amsterdam [i. e. Paris]. M. DCC. LXX.
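
Some of the variation above can be reduced with simple string clean-up before lookup. The following is a rough sketch under stated assumptions: that a leading "A", "À", or "à" is a preposition rather than part of the name, and that ";", ". -", and "et" separate places. The function name `candidate_places` is hypothetical.

```python
import re

# Assumption: these leading words are prepositions, not part of the name.
LEADING_WORDS = {"a", "à"}


def candidate_places(imprint):
    """Return a list of place-name candidates from one imprint string."""
    # Drop brackets and parentheses, balanced or not.
    text = re.sub(r"[\[\]()]", " ", imprint)
    # Split on separators seen in the examples: ';', '. -', and 'et'.
    pieces = re.split(r";|\.\s*-|\bet\b", text)
    places = []
    for piece in pieces:
        words = piece.strip(" .,").split()
        # Remove leading prepositions such as "A" or "À".
        while words and words[0].lower() in LEADING_WORDS:
            words = words[1:]
        if words:
            places.append(" ".join(words))
    return places


for raw in ["[Paris,]", "À Paris", "(Paris", "Amsterdam. - et à Paris"]:
    print(raw, "->", candidate_places(raw))
```

Strings that mix a false imprint, a correction, and a date, such as "A Amsterdam [i. e. Paris]. M. DCC. LXX.", are not handled by this sketch and would still need manual review.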

Getty Union List of Artists’ Names
● http://www.getty.edu/vow/ULANFullDisplay?find=rumi&role=&nation=&prev_page=1&subjectid=500337998
● Names, birth/death dates, education, occupation, relationships
● 2011: 600 000 names for 200 000 people

Consortium of European Research Libraries Thesaurus
● Place names and personal names in Europe in the period of hand press printing (1450 - c. 1830)
● http://thesaurus.cerl.org/cgi-bin/record.pl?rid=cnp01317268
● 20,000 place names, 900,000 names for people
● Names, biographical dates, activities, publications

Wikidata
● https://www.wikidata.org/wiki/Q43347
● Structured information on 14 million Wikipedia entities

DBpedia
● http://dbpedia.org/page/Rumi
● Structured information extracted from Wikipedia infoboxes

Publication information sources
● DNB: Deutsche Nationalbibliografie
● BNF: Bibliographie nationale française
● BNB: British National Bibliography
● EEBO: Early English Books Online (1475-1700)
● ECCO: Eighteenth Century Collections Online
● OCLC WorldCat: 305 million books from OCLC member libraries

Structured data sources for places
● Getty Thesaurus of Geographic Names - 2 million names for 1.4 million modern and historical places
● http://www.getty.edu/vow/TGNFullDisplay?find=rome&place=&nation=&prev_page=1&english=Y&subjectid=7000874
● GeoNames - 10 million names for 9 million places
● Pleiades - 35,000 ancient places
● National gazetteers
  ● Historical Gazetteer of England’s Place-Names
  ● PNR
● DBpedia, Wikidata
● Place names in other datasets (BNF, BNB, ...)

Structured data sources for other entities
● Getty Cultural Objects Name Authority (CONA)
● Gallery and museum databases, e.g. British Museum, Finnish National Gallery, Europeana, Digital Public Library of America
● Wikidata, DBpedia
● Domain-specific vocabularies such as WW1LOD
