Language technology in the service of the humanities
Eetu Mäkelä, D.Sc.
Assistant Professor in Digital Humanities / University of Helsinki
Docent (Adjunct Professor) in Computer Science / Aalto University
What’s it about: Contextual reader CORE with Thea Lindquist, University of Colorado Boulder
A Contextual Reader for WWI Primary Sources
A Contextual Reader for Finnish Law
Needs from language technology
● (Language identification)
● NER against configurable vocabularies
  ○ Recall more important than precision
  ○ Need for strong identification of found entities
  ○ Limited domain alleviates need for disambiguation
● (Inflected form generation)
● (Synonym generation)
● Configurability!
● Evaluation: is interesting context highlighted, are links relevant?
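A minimal sketch of what "NER against configurable vocabularies" could look like as a plain gazetteer lookup. The vocabulary entries and identifiers below are invented for illustration, and the function name `find_entities` is hypothetical; this is not the matcher used in the contextual reader itself.

```python
import re

# Hypothetical vocabulary: surface form -> entity identifier (both invented).
VOCABULARY = {
    "ypres": "place:ypres",
    "western front": "event:western_front",
}


def find_entities(text, vocabulary):
    """Return (start, end, surface, identifier) for every vocabulary hit.

    Matching is case-insensitive and keeps every hit, which favours
    recall over precision. Each hit carries an identifier so that the
    found entity can be linked to its context.
    """
    hits = []
    for surface, identifier in vocabulary.items():
        # Word boundaries avoid matching inside longer words.
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(0), identifier))
    return sorted(hits)


hits = find_entities("Fighting near Ypres on the Western Front.", VOCABULARY)
```

Swapping in another vocabulary only means passing a different dictionary, which is one simple reading of "configurability". Inflected forms and synonyms would have to be added to the dictionary as extra surface forms.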
A Contextual Reader for Finnish WWII Magazines
Place disambiguation
● Is one of the candidates in the focus area?
● Which is the largest place?
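The two questions above can be read as a two-step heuristic: prefer candidates inside the focus area, then pick the largest. The sketch below assumes that the focus area is a latitude/longitude bounding box and that "largest" means highest population; both are assumptions, and the `Place` records are invented.

```python
from dataclasses import dataclass


@dataclass
class Place:
    name: str
    lat: float
    lon: float
    population: int


def in_focus_area(place, bbox):
    """bbox is (south, west, north, east) in decimal degrees."""
    south, west, north, east = bbox
    return south <= place.lat <= north and west <= place.lon <= east


def disambiguate_place(candidates, bbox):
    """Prefer candidates in the focus area, then take the largest one."""
    if not candidates:
        return None
    focused = [p for p in candidates if in_focus_area(p, bbox)]
    pool = focused or candidates  # fall back to all candidates
    return max(pool, key=lambda p: p.population)


# Invented example: two places sharing a name.
focus = (59.0, 20.0, 70.0, 32.0)
small_inside = Place("Alpha", 61.0, 25.0, 5000)
large_outside = Place("Alpha", 45.0, 10.0, 900000)
chosen = disambiguate_place([small_inside, large_outside], focus)
```

With this ordering, a small place inside the focus area wins over a large one outside it; size only decides between candidates that are equally plausible geographically.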
Person disambiguation: Karl or Klaus Oesch?
● K. Oesch → Karl
● K. Oesch after 1.10.1944 → Klaus
● Privates Talvela, Oesch and Walden & after 8.7.1942 → Klaus
● K. Oesch when 2nd platoon mentioned → Klaus
● Mannerheim cross knight private Oesch → Klaus
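The rules above amount to an ordered list of hand-written checks, with the most specific ones first and "Karl" as the default. A rough sketch follows, assuming the dates are in day.month.year order and using English trigger phrases as stand-ins for whatever the Finnish source text actually contains; the function name is hypothetical.

```python
from datetime import date


def disambiguate_oesch(text, published):
    """Apply the slide's rules in order; the first matching rule wins."""
    lowered = text.lower()
    # Rule: Mannerheim cross knight private Oesch -> Klaus
    if "mannerheim cross" in lowered and "private oesch" in lowered:
        return "Klaus"
    # Rule: K. Oesch when 2nd platoon mentioned -> Klaus
    if "k. oesch" in lowered and "2nd platoon" in lowered:
        return "Klaus"
    # Rule: Privates Talvela, Oesch and Walden & after 8.7.1942 -> Klaus
    names = ("talvela", "oesch", "walden")
    if (published > date(1942, 7, 8) and "privates" in lowered
            and all(name in lowered for name in names)):
        return "Klaus"
    # Rule: K. Oesch after 1.10.1944 -> Klaus
    if "k. oesch" in lowered and published > date(1944, 10, 1):
        return "Klaus"
    # Default rule: K. Oesch -> Karl
    if "k. oesch" in lowered:
        return "Karl"
    return None  # no rule applies


early = disambiguate_oesch("K. Oesch visited the front.", date(1941, 8, 1))
late = disambiguate_oesch("K. Oesch visited the front.", date(1944, 11, 1))
```

Rules like these are specific to one name in one collection, so every new ambiguous person needs its own hand-tuned set.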
Needs from language technology
● (Language identification)
● NER against configurable vocabularies
  ○ Recall more important than precision
  ○ Need for strong identification of found entities
  ○ Need for disambiguation depends on scope and domain
● (Inflected form generation)
● (Synonym generation)
● Configurability! HACK HACK HACK TWEAK
● Evaluation: is interesting context highlighted, are links relevant?
Current projects 1: Sociohistorical Language Change with Tanja Säily, University of Helsinki
Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
  ○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
  ○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
  ○ 6800 candidate words, 400 000 appearances
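To show how such a pattern casts a wide net, here is a small sketch that runs it over an invented sentence. The pattern is taken from the slide, with its final group read as a negative lookahead `(?!...)`, which is an assumption about the intended expression; the helper `candidate_words` is hypothetical.

```python
import re

# Pattern from the slide; the last group is assumed to be a negative
# lookahead that stops a match from ending in the middle of a word.
CANDIDATE = re.compile(
    r"\S*(([rR]|[eEoO]~)"
    r"(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))"
    r"(?![a-zA-Z'~=+])"
)


def candidate_words(text):
    """Return every token that ends in an r-like suffix form."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


# Invented sample sentence, not corpus data.
sample = "the driver and the governors sent letters to my house"
candidates = candidate_words(sample)
# -> ['driver', 'governors', 'letters']
```

The pattern only looks at the word ending, so non-agentive words such as "letters" or "there" are matched as well. That is the source of the false positives that have to be weeded out by hand.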
FiCa
Needs from language technology
● Support for manual perusal of results, context
● Support for grouping spelling variants
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
  ○ 11768 individual uses
Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
  • The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
  • Structured information in the Oxford English Dictionary
• Needs from language technology:
  • Handling spelling variation
  • Handling OCR errors
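One very simple way to approach both spelling variation and OCR errors is approximate string matching against a list of known forms. The sketch below uses Python's standard `difflib`; the word list, the cutoff of 0.8 and the helper name are illustrative assumptions, not part of the project's actual pipeline.

```python
import difflib

# Invented list of modern reference forms.
KNOWN_FORMS = ["governor", "driver", "filler"]


def closest_known_forms(token, known_forms, cutoff=0.8):
    """Return up to three known forms that resemble `token`.

    `cutoff` is a similarity ratio between 0 and 1; lowering it catches
    more variants at the cost of more spurious matches.
    """
    return difflib.get_close_matches(token.lower(), known_forms, n=3, cutoff=cutoff)


matches = closest_known_forms("gouernor", KNOWN_FORMS)
```

Character-level similarity cannot tell a historical spelling from an OCR misreading, so in practice the two problems may need separate treatment.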
Current projects 2: Analysing public communication in 18th century Britain and 19th century Finland with the COMHIS collective, University of Helsinki
Linguistic fingerprint, either of works or of a word's neighbourhood
Temporal/geographical perspective
Close reading
Robust tracking of particular discourses through term vectors
Charting conceptual distance: human nature vs benevolence
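As a rough illustration of conceptual distance, two terms can be compared through the words that occur around them: build a co-occurrence vector for each term and take one minus the cosine similarity. The toy documents, window size and function names below are assumptions made for the sketch; the slides do not specify how the vectors were built.

```python
import math
from collections import Counter


def context_vector(phrase, documents, window=3):
    """Count the words within `window` tokens of each occurrence of `phrase`."""
    target = phrase.lower().split()
    n = len(target)
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + n:i + n + window]
                counts.update(left + right)
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def conceptual_distance(a, b):
    return 1.0 - cosine_similarity(a, b)


# Invented toy documents, lowercased and without punctuation.
docs = [
    "human nature is weak and frail",
    "benevolence is kind and generous",
    "human nature is kind",
]
human_nature = context_vector("human nature", docs)
benevolence = context_vector("benevolence", docs)
distance = conceptual_distance(human_nature, benevolence)
```

Restricting `documents` to a subcorpus, for example a decade or a place of publication, would give the temporal or geographical view of the same distance.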
Subcorpora formulation
“Human nature” → “Human nature” or “frailty” or “misanthropy” or “pravity” in
1) books between 100 and 200 pages long that do not have the word “sermon” or “christ” in their PREFACE or TITLE and that also contain the word “philosophy”, against
2) books that DO have the word “christ” or “sermon” in their PREFACE or TITLE
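The subcorpus definition above can be written as two predicates over book metadata. The sketch assumes a hypothetical `Book` record with title, preface, body and page count, and uses plain substring matching, so “christ” also matches “Christian”; the sample books are invented.

```python
from dataclasses import dataclass

# Query expansion from the slide.
QUERY_TERMS = ["human nature", "frailty", "misanthropy", "pravity"]
RELIGIOUS_MARKERS = ("sermon", "christ")


@dataclass
class Book:  # hypothetical record layout
    title: str
    preface: str
    body: str
    pages: int


def has_religious_marker(book):
    """True if the TITLE or PREFACE mentions 'sermon' or 'christ'."""
    front_matter = (book.title + " " + book.preface).lower()
    return any(marker in front_matter for marker in RELIGIOUS_MARKERS)


def in_subcorpus_1(book):
    """100 to 200 pages, no religious marker, mentions 'philosophy'."""
    return (100 <= book.pages <= 200
            and not has_religious_marker(book)
            and "philosophy" in book.body.lower())


def in_subcorpus_2(book):
    return has_religious_marker(book)


def count_query_hits(book):
    """Total occurrences of the expanded query terms in the body text."""
    body = book.body.lower()
    return sum(body.count(term) for term in QUERY_TERMS)


# Invented sample books.
essay = Book("An Essay", "To the reader", "On philosophy and human nature", 150)
sermons = Book("Sermons", "Preached in 1750", "On the frailty of man", 120)
first = [b for b in (essay, sermons) if in_subcorpus_1(b)]
second = [b for b in (essay, sermons) if in_subcorpus_2(b)]
```

Keeping each condition as its own small function makes it cheap to change a parameter, such as the page range, and rerun the comparison.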
Needs from language technology
● Support for fluid term vector operations in terms of subcorpora, parameters
● OCR error handling
Future project?: Analysing newspaper fiction literature genres in 19th century Finland
Academy proposal with people from the Digital Humanities Hackathon 2017
Objectives
• Literary scholarship: discover and analyze the world of fiction published on the pages of newspapers in Finland in the 19th century
• Needs from language technology:
  • Robust genre dissection, summarization and comparisons
    ‒ fact/fiction/poetry/drama → ads, patriotic poetry, religious texts, socialist realism, experimental fiction, …
[email protected] http://j.mp/s-makela This presentation: http://j.mp/lt-hums