Text and data mining eighteenth century based on ESTC & ECCO COMHIS Collective BSECS Conference 2017, Oxford http://j.mp/comhis-bsecs
Tool testing with own laptop
http://j.mp/comhis-tools-bsecs
I Introduction
Data mining options from intellectual historian’s perspective
Text mining of large corpora
• Objective: understanding conceptual change, uses of language
• Sources: full-text databases (ECCO, EEBO, Finnish Newspapers etc.)
• Potential: Theoretically great, the future?
• In practice: raw data almost never openly available; if it is, tied to limited interfaces
• Scalability with open research data: data-driven approach
• Methodological perspective: Data preprocessing very important. Messy to study historical sources, intellectual input not guaranteed. Excellent for borrowing best practices from other scientific fields
Metadata as a quantitative tool
• Objective: Quantitative study of material objects
• Sources: World is full of different metadata collections
• Potential: Greatly underestimated (even by librarians)
• In practice: difficulties with open access to raw data and supporting data sources, but not impossible
• Scalability with open research data: fantastic
• Methodological perspective: Data preprocessing very important. Excellent for borrowing best practices from other scientific fields. Quality of catalogues varies.
Digital humanities research process
[Diagram: raw data → cleaning up data (80% of work) → understanding data → exploratory tools → results → research articles]
80% of your time for data cleanup, another 80% for algorithms, …
Leverage collaboration, open science workflows to reduce individual workload
[Diagram: raw data → cleaning up data (80% of work) → understanding data → exploratory tools → results → research articles]
collaborate, share these, speed up research for everyone + reproducibility
The role of open science in general
- Open data (ESTC, ECCO, Auxiliary data..?)
- Open source methods
- Open research outputs (initial results, preprints, final publications..)
- Open collaboration & crowdsourcing
- Transparency, Reproducibility, Reuse
[Diagram: project workflow]
Initial data: ESTC, ECCO
Evolving set of analysis and processing tools: APIs, Applications, Scripts, Etc...
Research questions: ...
Project goals: Research publications; Public tools, APIs, code; Refined data
ESTC metadata
ESTC metadata: statistical summaries and data analysis - work in progress
http://github.com/rOpenGov/estc/
~ 483,344 documents (1473-1880)
~ 336,227 documents (18th century / 70% of the data)
- Title
- Language (47 unique languages)
- Publication year
- Publication place (publisher, city, country; 1000+ cities/towns)
- Author information (name, gender, life years..; 50,000+ authors)
- Physical dimensions (height, width, obl, gatherings, page count..)
- Etc.
Research opportunities provided by metadata alone
● Quantitative frame for qualitative research
● Understanding knowledge production
● Beyond counting titles (facts about volume)
● Analysis of ”history”, ”philosophy”, ”religion” etc.
● Publishers and their networks, visualizations
● Publication places, cultural transfer
● Individual authors and comparisons
● Histoire Croisée
Example: proportion of female authors over time
https://github.com/rOpenGov/estc/blob/master/inst/examples/gender.md
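A minimal sketch of how such a proportion could be computed, not the code behind the linked example. The record layout (publication year plus an author gender field) and the sample rows are assumptions made up for illustration; documents with unknown gender are left out of the denominator here, which is one possible choice among several.

```python
from collections import defaultdict

# Hypothetical records: (publication_year, author_gender).
# Real ESTC-derived data would come from the preprocessed metadata tables.
records = [
    (1701, "male"), (1705, "female"), (1712, "male"),
    (1718, "male"), (1723, "female"), (1729, "female"),
    (1745, None),  # gender unknown: ignored in this sketch
]


def female_share_by_decade(rows):
    """Return {decade: share of female-authored documents among known genders}."""
    totals = defaultdict(int)
    females = defaultdict(int)
    for year, gender in rows:
        if gender not in ("male", "female"):
            continue  # skip documents without a resolved author gender
        decade = year // 10 * 10
        totals[decade] += 1
        if gender == "female":
            females[decade] += 1
    return {decade: females[decade] / totals[decade] for decade in sorted(totals)}


shares = female_share_by_decade(records)
for decade, share in shares.items():
    print(decade, f"{share:.0%}")
```

Aggregating by decade rather than by year smooths out years with very few documents; the bin width is an arbitrary choice in this sketch.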
Raw data for a single document (MARC): Uk-ES20070905112000.0031119s1787 enk|||| 00| | eng c(CU-RivES)N72335CU-RivESCU-RivESCU-RivESBible.English.Authorised.The Christian's new and complete family Bible;or, universal library of divine knowledge. Containing the whole of the sacred text of the Holy Bible, as contained in the Old and New Testaments, with the Apocrypha, at large. Illustrated with annotations and commentaries .. By the Rev. Thomas Bankes, ...London :printed for J. Cooke; and sold by the booksellers of Bath, Br[istol,] Birmingham, Canterbury, Cambridge, [and 9 other towns in England] and by all ot[her bo]oksellers in Great-Britain,[1787?][No pagination provided] ;2⁰.The Authorised version.Revelation ends on 9P2, followed by indexes, tables, and a list of subscribers ending on e3.The N.T. has a separate titlepage.With plates 'Engraved for the Revd. T. Bankes's Christian's family Bible'.'There are at least three varieties of this edition [sic]' (Darlow & Moule).Darlow & Moule,1341BibleCommentaries.1701-1800localBankes, Thomas,1744-Great BritainEnglandLondon.
Cleaning up ESTC
● 80 % of statistical analysis is tidying up of the data. Often neglected yet implicitly assumed by many tools.
● Open science approach & reusability: no need to reinvent the wheel for the same (or similar) datasets; possibilities for reuse are great
● Open science, transparency & collaboration: the research tools and data can be corrected and perfected on a continuous basis
● Quality control: open source; unit tests; automated summaries; conversion check lists; manual analyses..
Open harmonization for synonymous publication places https://github.com/rOpenGov/bibliographica/blob/master/inst/extdata/PublicationPlaceSynonymes.csv
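The idea of a shared synonym table can be sketched as a simple lookup. The column names (`variant`, `canonical`) and the sample rows below are hypothetical and do not reproduce the layout of the linked CSV file; the normalization step (trimming brackets and punctuation, lowercasing) is likewise an assumption.

```python
import csv
import io

# Hypothetical synonym table; the real file may use different columns.
SYNONYMS_CSV = """variant,canonical
londini,London
london,London
edinburgi,Edinburgh
edinburgh,Edinburgh
"""


def load_synonyms(text):
    """Parse the synonym table into a {variant: canonical name} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["variant"]: row["canonical"] for row in reader}


def harmonize_place(raw, synonyms):
    """Map a raw place string to its canonical name, or None if unknown."""
    key = raw.strip().strip(" .,:;[]?").lower()
    return synonyms.get(key)


synonyms = load_synonyms(SYNONYMS_CSV)
print(harmonize_place("London :", synonyms))    # London
print(harmonize_place("[Londini]", synonyms))   # London
print(harmonize_place("Oxford", synonyms))      # None
```

Returning `None` for unmatched strings keeps unknown places visible, so they can be reviewed and added to the shared table instead of being guessed.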
Summary: metadata preprocessing and analysis
- Very central for successful research
- Open source: transparent & reproducible
- Automated & customized
- Collaborative
Adorned with cuts
ESTC: "with cuts" / "adorn['e]d with" / "with ([^ ]+ )+cuts", with plates
[Figure: absolute hits; hits relative to all documents]
ESTC: "with cuts", "adorn['e]d with" or "with ([^ ]+ )+cuts", no plates
[Figure: absolute hits; hits relative to all documents]
ESTC: Adorned with
[Figure: with plates; without plates]
Publishers: Newbery “adorned with”
[Figure: absolute; relative to all documents]
Printed by (“adorned with”)
[Figure: absolute; relative to all documents]
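The three search patterns named on the slides above can be combined into one regular expression and counted per year. This is a sketch only: the sample titles are invented, case-insensitive matching is an assumption, and treating "plates" as a plain word match in the same title is a guess at how the with/without plates split was made.

```python
import re

# The three patterns from the slides, joined with alternation.
CUTS = re.compile(r"with cuts|adorn['e]d with|with ([^ ]+ )+cuts", re.IGNORECASE)
# Assumption: a record counts as "with plates" if the word appears in the title.
PLATES = re.compile(r"\bplates\b", re.IGNORECASE)

# Hypothetical (year, title) pairs standing in for ESTC records.
titles = [
    (1750, "The history of Tom Thumb. Adorned with cuts"),
    (1750, "A sermon preached at Oxford"),
    (1760, "Fables, adorn'd with plates"),
    (1760, "Tales for children, with thirty curious cuts"),
]


def count_hits(rows):
    """Count matching titles per year, split by whether plates are mentioned."""
    stats = {}
    for year, title in rows:
        entry = stats.setdefault(year, {"all": 0, "with_plates": 0, "no_plates": 0})
        entry["all"] += 1
        if CUTS.search(title):
            key = "with_plates" if PLATES.search(title) else "no_plates"
            entry[key] += 1
    return stats


stats = count_hits(titles)
for year, entry in sorted(stats.items()):
    hits = entry["with_plates"] + entry["no_plates"]
    # Absolute hits, then hits relative to all documents of that year.
    print(year, hits, hits / entry["all"])
```

Dividing by the number of all documents in the same year gives the relative curve, which separates a genuine trend from the general growth in publishing volume.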
Combining ESTC metadata with ECCO full texts
ECCO contains only a subset of the works in ESTC
• ECCO1 titles refer to 40% of the 18th century English records in ESTC (~53% after ECCO2)
• (3,279 titles in ECCO1 are not from the 18th century?)
• (10,389 titles in ECCO1 are not in English)
Adorned with (ECCO & ESTC)
ECCO contains only a subset of the works in ESTC
ECCO and ESTC catalogue different things
• Multi-volume works and periodicals may have a single record in ESTC
• ~6000 two-volume works, ~1000 each of three- and four-volume works
• 189 works with 10 or more issues, 38 with 20 or more
Page counts from ECCO can be used to evaluate accuracy of ESTC page counting
Metadata enrichment and cross-checking
• Number of words from ECCO sometimes better measure of length than number of pages
• ECCO has information on illustrations, charts and images contained in the works
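One way to picture the page count cross-check is to flag records where the two sources disagree by more than a tolerance. The identifiers, the counts and the 10% threshold below are all invented for illustration, and ECCO page counts may include pages that ESTC pagination statements do not, so a flag is a prompt for manual inspection and not proof of an error.

```python
def flag_page_count_mismatches(estc_pages, ecco_pages, tolerance=0.1):
    """Return (id, estc, ecco) for shared records differing by more than tolerance."""
    flagged = []
    # Only records present in both sources can be compared.
    for doc_id in sorted(estc_pages.keys() & ecco_pages.keys()):
        estc, ecco = estc_pages[doc_id], ecco_pages[doc_id]
        if abs(estc - ecco) > tolerance * ecco:
            flagged.append((doc_id, estc, ecco))
    return flagged


# Hypothetical page counts keyed by a shared record identifier.
estc_pages = {"T000001": 120, "T000002": 48, "T000003": 300}
ecco_pages = {"T000001": 124, "T000002": 96, "T000004": 10}

mismatches = flag_page_count_mismatches(estc_pages, ecco_pages)
print(mismatches)
```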
Projecting the results of full text searches into ESTC metadata
OCR quality issues
● The British Library’s 19th century collection has an estimated word accuracy of 78 %
● The estimated word accuracy (word recognition rate) of Finnish newspaper collection is about 70-75 %
● These are quite low figures but realistic for OCRed historical collections
→ The point isn’t necessarily to correct all OCR mistakes, but to develop tools that can deal with them!
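As one illustration of a tool that tolerates OCR noise instead of correcting it, a query term can be expanded to similar-looking tokens found in the corpus vocabulary. The sketch uses `difflib.get_close_matches` from the Python standard library; the vocabulary and the 0.8 similarity cutoff are assumptions, and this is not a description of how the COMHIS tools handle OCR errors.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of tokens as they appear in OCRed text,
# including a long-s misreading ("politenefs").
vocabulary = ["politeness", "politenefs", "nature", "teazed", "teized", "sermon"]


def ocr_variants(term, vocab, cutoff=0.8):
    """Return vocabulary tokens similar enough to term to be likely OCR variants."""
    return get_close_matches(term, vocab, n=10, cutoff=cutoff)


variants = ocr_variants("politeness", vocabulary)
print(variants)
```

A lower cutoff catches more damaged tokens but also pulls in unrelated words, so in practice the threshold would need checking against a sample of real hits.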
Human nature timeline count
Political economy timeline count
Different genres: “Human nature”, top-list of authors, all editions with hits included
Human nature, min. 10 hits / title, top-authors timeline
Polysemy and different genres: “Human nature”, top titles, all printed editions included
Human nature (publishers with most hits)
From simple keywords to topic formulation
“Human nature” –>
“Human nature” and (“benevolence” or “pride” or “sympathy”)
“Human nature” or “frailty” or “misanthropy” or “pravity”
Subcorpora formulation
“Human nature” –> “Human nature” or “frailty” or “misanthropy” or “pravity” in
1) books between 100 and 200 pages long, that do not have the word “sermon” or “christ” in their PREFACE or TITLE and that also contain the word “philosophy”, against
2) books that DO have the word “christ” or “sermon” in their PREFACE or TITLE
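The two subcorpora described above can be sketched as filters over document records. The field names (`pages`, `title`, `preface`, `text`) and the sample documents are hypothetical, matching is done on lowercase substrings for brevity (so “christ” also matches “Christian”), and “philosophy” is assumed to be searched in the full text.

```python
TOPIC_TERMS = ("human nature", "frailty", "misanthropy", "pravity")
RELIGIOUS_TERMS = ("sermon", "christ")


def mentions_any(text, terms):
    """True if any term occurs in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def is_religious(doc):
    """Condition 2: 'christ' or 'sermon' in the PREFACE or TITLE."""
    return mentions_any(doc["title"] + " " + doc["preface"], RELIGIOUS_TERMS)


def split_subcorpora(docs):
    """Return (philosophical, religious) lists of documents with topic hits."""
    hits = [d for d in docs if mentions_any(d["text"], TOPIC_TERMS)]
    philosophical = [
        d for d in hits
        if 100 <= d["pages"] <= 200
        and not is_religious(d)
        and "philosophy" in d["text"].lower()
    ]
    religious = [d for d in hits if is_religious(d)]
    return philosophical, religious


# Hypothetical documents for illustration only.
docs = [
    {"id": "A", "pages": 150, "title": "An essay on morals",
     "preface": "To the reader", "text": "Of human nature and philosophy"},
    {"id": "B", "pages": 40, "title": "A sermon on frailty",
     "preface": "", "text": "The frailty of man"},
    {"id": "C", "pages": 150, "title": "Travels",
     "preface": "", "text": "A voyage to the Levant"},
]

philosophical, religious = split_subcorpora(docs)
print([d["id"] for d in philosophical], [d["id"] for d in religious])
```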
Topic and subcorpora formulation
Exploring the similarity/distance of collocations between each other allows one to:
1. Define a particular meaning for a polysemous term: “consciousness” AND (“eternal” OR “holy” OR “soul”)
2. Define different subcorpora: A) “christ” OR “sermon” OR “soul”, B) “fortune” OR ...,
consciousness
Collocations for politeness
"paragraph level collocations" : {
  "quality" : 0.002, "young" : 0.002, "women" : 0.002, "taste" : 0.005, "society" : 0.002,
  "rank" : 0.004, "superior" : 0.002, "virtues" : 0.003, "sentiments" : 0.003, "youth" : 0.002,
  "usual" : 0.002, "visit" : 0.004, "qualities" : 0.004, "rules" : 0.002, "temper" : 0.003,
  "treated" : 0.004, "pride" : 0.002, "seemed" : 0.002, "received" : 0.002, "utmost" : 0.003
}
"document level collocations" : {
  "teazed" : 0.557, "sauntering" : 0.532, "screamed" : 0.551, "volubility" : 0.555, "ungraceful" : 0.554,
  "urbanity" : 0.526, "regretting" : 0.535, "sprightliness" : 0.586, "vociferation" : 0.567, "prudery" : 0.597,
  "rusticity" : 0.596, "politest" : 0.636, "refpeaful" : 0.522, "sobbed" : 0.598, "vacity" : 0.6079955580233204,
  "teized" : 0.5443425076452599, "regaled" : 0.5386012715712988, "unamiable" : 0.6061320754716981,
  "vulgarity" : 0.5483619344773791, "unpolite" : 0.6029601029601029
}
Levels of processing
• Paragraph
  – “Uniform-size” chunk of highly related content
  – ECCO2 does not contain paragraph segmentation!
• Section
  – Non-uniform, but useful for constraining, e.g. PREFACE
• Document part
  – TITLEPAGE, FRONTMATTER, BODY, BACKMATTER, TOC, INDEX
• Complete Work
  – Not uniform size (unless so constrained)
Support for topic formulation –> exploration
• Terms with similar collocations to a given term can also be explicitly queried. Here, the notion is “what is talked about similarly” to what I’m asking about
• With flexible constraining and sufficient work in figuring out the right constraints, you can ask for example “what is talked about at the end of the 18th century in the way that soul was talked about at the beginning of the 18th century in non-religious texts?”
Support for exploration
• One can also directly graph conceptual distances, so e.g. discover who talks about human nature similarly/dissimilarly to David Hume
• There is also functionality to compare two queries to each other
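A common way to quantify “talked about similarly” is cosine similarity between collocation vectors. Whether the tools shown here use this measure is an assumption; the terms and weights below are invented and are not taken from ECCO.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {collocate: weight} vectors."""
    shared = a.keys() & b.keys()
    dot = sum(a[term] * b[term] for term in shared)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical collocation profiles.
profiles = {
    "soul": {"eternal": 0.5, "holy": 0.4, "body": 0.3},
    "spirit": {"eternal": 0.4, "holy": 0.5, "body": 0.1},
    "commerce": {"trade": 0.6, "money": 0.5, "body": 0.1},
}


def rank_similar(term, all_profiles):
    """Rank the other terms by similarity of their collocations to term."""
    target = all_profiles[term]
    others = [name for name in all_profiles if name != term]
    return sorted(
        others,
        key=lambda name: cosine_similarity(target, all_profiles[name]),
        reverse=True,
    )


ranked = rank_similar("soul", profiles)
print(ranked)
```

Comparing profiles built from different subcorpora (for example early versus late 18th century) in the same way would correspond to the time-shifted question quoted on the slide.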
Human nature vs. dignity by paragraph
Human nature vs. self-love by paragraph (context vector analysis)
Tool testing with own laptop
- Queries in ECCO API1 with ESTC metadata: /estc-turin/estc-ecco-shinyapp/
- Queries of ESTC metadata, ability to search various combinations of ESTC & ECCO: /estc-turin/estc-shinyapp/
- Trend comparison app, mainly ECCO API2 on paragraph level (also cumulative graph, can be used e.g. for checking term overlap): /estc-turin/trend-comparison-shinyapp/
- Query inspector: /estc-turin/api-query-inspector-shinyapp/
What else?
Thanks!
- COMHIS Collective: Mikko Tolonen, Eetu Mäkelä, Leo Lahti, Ville Vaara, Hege Roivainen, Antti Kanner, Jani Marjanen..
- University of Helsinki & Turku
- Finnish National Library
- British Library
- Academy of Finland