Language technology in the service of the humanities

Eetu Mäkelä, D.Sc.
Assistant Professor in Digital Humanities / University of Helsinki
Docent (Adjunct Professor) in Computer Science / Aalto University

What’s it about: Contextual reader CORE with Thea Lindquist, University of Colorado Boulder

A Contextual Reader for WWI Primary Sources

A Contextual Reader for Finnish Law

Needs from language technology
● (Language identification)
● NER against configurable vocabularies
  ○ Recall more important than precision
  ○ Need for strong identification of found entities
  ○ Limited domain alleviates need for disambiguation
● (Inflected form generation)
● (Synonym generation)
● Configurability!
● Evaluation: is interesting context highlighted, links relevant?
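A minimal sketch of recall-oriented NER against a configurable vocabulary, assuming the vocabulary is a plain mapping from labels to identifiers. The labels, the example.org identifiers and the function name are hypothetical and are not taken from the CORE reader itself.

```python
import re

# Hypothetical vocabulary: label -> entity identifier (made-up URIs).
VOCABULARY = {
    "ypres": "http://example.org/place/ypres",
    "red cross": "http://example.org/org/red-cross",
}

def find_entities(text, vocabulary):
    """Return (start, end, matched text, identifier) for every vocabulary hit."""
    hits = []
    for label, identifier in vocabulary.items():
        # Ignore case, allow any whitespace between label words, and accept
        # trailing word characters: this trades precision for recall.
        pattern = r"\b" + r"\s+".join(map(re.escape, label.split())) + r"\w*"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), identifier))
    return sorted(hits)

hits = find_entities("Nurses of the Red  Cross arrived at Ypres.", VOCABULARY)
```

Each hit carries an identifier, so a found entity can be linked directly to its context. Swapping the vocabulary is the only configuration step in this sketch.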

A Contextual Reader for Finnish WWII Magazines

Place disambiguation
● Is one of the candidates in the focus area?
● Which is the largest place?
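One possible reading of these two questions is a cascade: prefer candidates inside the focus area, then pick the largest. The sketch below assumes a bounding box for the focus area and population as the measure of size. Both are assumptions, and the `Place` type and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    lat: float
    lon: float
    population: int  # assumed proxy for "largest"

def in_focus_area(place, bbox):
    """bbox is (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= place.lat <= max_lat and min_lon <= place.lon <= max_lon

def disambiguate_place(candidates, focus_bbox):
    """Prefer candidates in the focus area; among the rest, take the largest."""
    focused = [c for c in candidates if in_focus_area(c, focus_bbox)]
    pool = focused or candidates  # fall back to all candidates if none is in focus
    return max(pool, key=lambda c: c.population)

# Illustrative, made-up candidates sharing one name.
small_near = Place("Example A", 10.0, 10.0, 1_000)
large_far = Place("Example B", 50.0, 50.0, 100_000)
```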

Person disambiguation: Karl or Klaus Oesch?
● K. Oesch → Karl
● K. Oesch after 1.10.1944 → Klaus
● Privates Talvela, Oesch and Walden & after 8.7.1942 → Klaus
● K. Oesch when 2nd platoon mentioned → Klaus
● Mannerheim cross knight private Oesch → Klaus
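Rules of this kind can be written as an ordered cascade with the default last. The sketch below is a rough approximation: it uses English keyword checks on the context, although the magazines are in Finnish, and it simplifies the "Privates Talvela, Oesch and Walden" rule to any mention of the rank. The function name and its arguments are hypothetical.

```python
from datetime import date

def resolve_oesch(context, published):
    """Guess which Oesch is meant from the surrounding text and issue date."""
    ctx = context.lower()
    if published > date(1944, 10, 1):          # K. Oesch after 1.10.1944
        return "Klaus Oesch"
    if "private" in ctx and published > date(1942, 7, 8):  # rank + after 8.7.1942
        return "Klaus Oesch"
    if "2nd platoon" in ctx:                   # unit mentioned nearby
        return "Klaus Oesch"
    if "mannerheim cross" in ctx and "private" in ctx:
        return "Klaus Oesch"
    return "Karl Oesch"                        # default reading of "K. Oesch"
```

The default comes last, so adding a new special case only means inserting one more check above it.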

Needs from language technology
● (Language identification)
● NER against configurable vocabularies
  ○ Recall more important than precision
  ○ Need for strong identification of found entities
  ○ Need for disambiguation depends on scope and domain
● (Inflected form generation)
● (Synonym generation)
● Configurability! HACK HACK HACK TWEAK
● Evaluation: is interesting context highlighted, links relevant?

Current projects 1: Sociohistorical Language Change with Tanja Säily, University of Helsinki

Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
  ○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
  ○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
  ○ 6800 candidate words, 400 000 appearances
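A simplified sketch of this kind of candidate search. The pattern below is not the regular expression from the slide: it ignores the corpus markup characters (=, ~, ') and only covers the listed endings plus a plural or possessive. The names are hypothetical.

```python
import re

# Endings er(e), ar(e), or(e), our(e), owr(e), ur(e), optionally followed
# by a plural or possessive marker. Deliberately loose, to favour recall.
AGENT_SUFFIX = re.compile(
    r"^[a-z]+?(?:er|ar|or|our|owr|ur)e?(?:'?[sz]|es)?$", re.IGNORECASE
)

def candidates(text):
    """Return every word in text that ends like an -er/-or derivation."""
    return [w for w in re.findall(r"[A-Za-z']+", text) if AGENT_SUFFIX.match(w)]

found = candidates("The driver and the governour met there")
```

The false positive "there" in the result shows why the candidate list still needs manual filtering.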

FiCa

Needs from language technology
● Support for manual perusal of results, context
● Support for grouping spelling variants
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
  ○ 11768 individual uses
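Grouping spelling variants can be approximated by mapping each word to a normalised key and collecting the words that share a key. The normalisation rules below are illustrative assumptions about early modern spelling. They are not the rules used in the project, and a real tool would leave the final grouping to the researcher.

```python
import re
from collections import defaultdict

def variant_key(word):
    """Map a spelling to a rough normalised key (assumed rules)."""
    key = word.lower()
    key = key.replace("v", "u").replace("y", "i")  # treat u/v and i/y alike
    key = re.sub(r"(.)\1", r"\1", key)             # collapse doubled letters
    key = re.sub(r"(?:our|owr)e?$", "or", key)     # -our(e), -owr(e) -> -or
    key = re.sub(r"re$", "r", key)                 # drop a final -e after r
    return key

def group_variants(words):
    """Group words whose normalised keys coincide."""
    groups = defaultdict(list)
    for word in words:
        groups[variant_key(word)].append(word)
    return dict(groups)

groups = group_variants(["governor", "gouernour", "Governour", "driver", "dryver"])
```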

Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
  • The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
  • Structured information in the Oxford English Dictionary
• Needs from language technology:
  • Handling spelling variation
  • Handling OCR errors

Current projects 2: Analysing public communication in 18th century Britain and 19th century Finland with the COMHIS collective, University of Helsinki

Linguistic fingerprint, either of works or of a word's neighbourhood

Temporal/geographical perspective

Close reading

Robust tracking of particular discourses through term vectors

Charting conceptual distance: human nature vs benevolence

Subcorpora formulation
“Human nature” → “Human nature” or “frailty” or “misanthropy” or “pravity” in
1) books between 100 and 200 pages long that do not have the word “sermon” or “christ” in their PREFACE or TITLE and that also contain the word “philosophy”, against
2) books that DO have the word “christ” or “sermon” in their PREFACE or TITLE
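The two subcorpora above can be expressed as predicates over book metadata. This is a sketch under assumptions: the `Book` record and its fields are hypothetical, and word matching is done on lowercase alphabetic tokens.

```python
import re
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    preface: str
    text: str
    pages: int

def words(s):
    """Set of lowercase alphabetic tokens in s."""
    return set(re.findall(r"[a-z]+", s.lower()))

def front_matter_is_religious(book):
    """True if TITLE or PREFACE contains the word 'sermon' or 'christ'."""
    return bool(words(book.title + " " + book.preface) & {"sermon", "christ"})

def in_subcorpus_1(book):
    """100 to 200 pages, no 'sermon'/'christ' up front, mentions 'philosophy'."""
    return (100 <= book.pages <= 200
            and not front_matter_is_religious(book)
            and "philosophy" in words(book.text))

def in_subcorpus_2(book):
    """Books that do have 'christ' or 'sermon' in PREFACE or TITLE."""
    return front_matter_is_religious(book)

# Made-up records for illustration only.
essay = Book("An Essay on Morals", "To the reader", "moral philosophy and human nature", 150)
sermon = Book("A Sermon on Frailty", "", "on the frailty of man", 150)
```

Keeping each condition as a small function makes it cheap to vary one parameter, such as the page range, and recompute the comparison.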

Needs from language technology
● Support for fluid term vector operations in terms of subcorpora, parameters
● OCR error handling
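As a rough illustration of a term vector operation, the sketch below builds a co-occurrence vector for a term from a token list and compares two such vectors with cosine similarity. It assumes single-token terms and a fixed context window. The window size and function names are hypothetical parameters, not those of the COMHIS tooling.

```python
import math
from collections import Counter

def context_vector(tokens, term, window=2):
    """Count the tokens occurring within `window` positions of each `term`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec

def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

tokens = "the frailty of man and the benevolence of man".split()
similarity = cosine(context_vector(tokens, "frailty"),
                    context_vector(tokens, "benevolence"))
```

Running the same functions on the token list of a different subcorpus, or with a different `window`, gives the kind of parameterised comparison the slide asks for.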

Future project?: Analysing newspaper fiction literature genres in 19th century Finland
Academy proposal with people from the Digital Humanities Hackathon 2017

Objectives
• Literary scholarship: discover and analyze the world of fiction published on the pages of newspapers in Finland in the 19th century
• Needs from language technology:
  • Robust genre dissection, summarization and comparisons
    ‒ fact/fiction/poetry/drama → ads, patriotic poetry, religious texts, socialist realism, experimental fiction, …

[email protected] http://j.mp/s-makela This presentation: http://j.mp/lt-hums
