Text and data mining eighteenth century based on ESTC & ECCO
COMHIS Collective
BSECS Conference 2017, Oxford
http://j.mp/comhis-bsecs

Tool testing with own laptop

http://j.mp/comhis-tools-bsecs

I Introduction

Data mining options from an intellectual historian’s perspective

Text mining of large corpora
• Objective: understanding conceptual change, uses of language
• Sources: full-text databases (ECCO, EEBO, Finnish Newspapers etc.)
• Potential: Theoretically great, the future?
• In practice: raw data almost never openly available; if it is, tied to limited interfaces
• Scalability with open research data: data-driven approach
• Methodological perspective: Data preprocessing very important. Messy to study historical sources, intellectual input not guaranteed. Excellent for borrowing best practices from other scientific fields

Metadata as a quantitative tool
• Objective: Quantitative study of material objects
• Sources: World is full of different metadata collections
• Potential: Greatly underestimated (even by librarians)
• In practice: difficulties with open access to raw data and supporting data sources, but not impossible
• Scalability with open research data: fantastic
• Methodological perspective: Data preprocessing very important. Excellent for borrowing best practices from other scientific fields. Quality of catalogues varies.

Digital humanities research process

[Diagram: raw data; cleaning up data (80% of work); understanding data; exploratory tools; results; research articles]

80% of your time for data cleanup, another 80% for algorithms, …

Leverage collaboration, open science workflows to reduce individual workload

[Diagram: raw data; cleaning up data (80% of work); exploratory tools; understanding data; results; research articles. Annotation: collaborate, share these, speed up research for everyone + reproducibility]

The role of open science in general
- Open data (ESTC, ECCO, Auxiliary data..?)
- Open source methods
- Open research outputs (initial results, preprints, final publications..)
- Open collaboration & crowdsourcing
- Transparency, Reproducibility, Reuse

[Diagram]
- Initial data: ESTC, ECCO
- Evolving set of analysis and processing tools: APIs, applications, scripts, etc.
- Research questions: ...
- Project goals: research publications; public tools, APIs, code; refined data

ESTC metadata

ESTC metadata: statistical summaries and data analysis - work in progress
http://github.com/rOpenGov/estc/

~ 483,344 documents (1473-1880)
~ 336,227 documents (18th century / 70% of the data)
- Title
- Language (47 unique languages)
- Publication year
- Publication place (publisher, city, country; 1000+ cities/towns)
- Author information (name, gender, life years..; 50,000+ authors)
- Physical dimensions (height, width, obl, gatherings, page count..)
- Etc.

Research opportunities provided by metadata alone
● Quantitative frame for qualitative research
● Understanding knowledge production
● Beyond counting titles (facts about volume)
● Analysis of ”history”, ”philosophy”, ”religion” etc.
● Publishers and their networks, visualizations
● Publication places, cultural transfer
● Individual authors and comparisons
● Histoire Croisée

Example: proportion of female authors over time

https://github.com/rOpenGov/estc/blob/master/inst/examples/gender.md

Raw data for a single document (MARC): Uk-ES20070905112000.0031119s1787 enk|||| 00| | eng c(CU-RivES)N72335CU-RivESCU-RivESCU-RivESBible.English.Authorised.The Christian's new and complete family Bible;or, universal library of divine knowledge. Containing the whole of the sacred text of the Holy Bible, as contained in the Old and New Testaments, with the Apocrypha, at large. Illustrated with annotations and commentaries .. By the Rev. Thomas Bankes, ...London :printed for J. Cooke; and sold by the booksellers of Bath, Br[istol,] Birmingham, Canterbury, Cambridge, [and 9 other towns in England] and by all ot[her bo]oksellers in Great-Britain,[1787?][No pagination provided] ;2⁰.The Authorised version.Revelation ends on 9P2, followed by indexes, tables, and a list of subscribers ending on e3.The N.T. has a separate titlepage.With plates 'Engraved for the Revd. T. Bankes's Christian's family Bible'.'There are at least three varieties of this edition [sic]' (Darlow & Moule).Darlow & Moule,1341BibleCommentaries.1701-1800localBankes, Thomas,1744-Great BritainEnglandLondon.

Cleaning up ESTC
● 80 % of statistical analysis is tidying up of the data. Often neglected yet implicitly assumed by many tools.
● Open science approach & reusability: no need to reinvent the wheel for the same (or similar) datasets; possibilities for reuse are great
● Open science, transparency & collaboration: the research tools and data can be corrected and perfected on a continuous basis
● Quality control: open source; unit tests; automated summaries; conversion check lists; manual analyses..

Open harmonization for synonymous publication places https://github.com/rOpenGov/bibliographica/blob/master/inst/extdata/PublicationPlaceSynonymes.csv
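As a rough illustration of what such a synonym table makes possible, the sketch below maps raw place strings to a harmonized name. The column names (`synonyme`, `name`), the semicolon delimiter and the sample rows are assumptions made for this example and may not match the actual CSV file; the cleaning rule is likewise only illustrative.

```python
import csv
import io

# Hypothetical excerpt of a synonym table; the real file's columns
# and contents may differ.
SYNONYMS_CSV = """synonyme;name
londini;London
london;London
edinburgi;Edinburgh
edinburgh;Edinburgh
"""

def load_synonyms(text):
    """Read the synonym table into a dict: raw variant -> harmonized name."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return {row["synonyme"]: row["name"] for row in reader}

def normalize(raw):
    """Lowercase and strip the brackets and punctuation common in imprints."""
    return raw.lower().strip(" .,:;[]?")

def harmonize_place(raw, table):
    """Return the harmonized place name, or None if the variant is unknown."""
    return table.get(normalize(raw))

table = load_synonyms(SYNONYMS_CSV)
print(harmonize_place("Londini :", table))
print(harmonize_place("[Edinburgh?]", table))
```

Returning `None` for unknown variants keeps unmatched places visible, so they can be reviewed and added to the shared table rather than silently guessed.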

Summary: metadata preprocessing and analysis
- Very central for successful research
- Open source: transparent & reproducible
- Automated & customized
- Collaborative

Adorned with cuts

[Figure] ESTC: "with cuts", "adorn['e]d with" or "with ([^ ]+ )+cuts", with plates. Panels: absolute hits; hits relative to all documents.

[Figure] ESTC: "with cuts", "adorn['e]d with" or "with ([^ ]+ )+cuts", no plates. Panels: absolute hits; hits relative to all documents.

[Figure] ESTC: “Adorned with”, with plates vs. without plates.

[Figure] Publishers: Newbery, “adorned with”. Panels: absolute; relative to all documents.

[Figure] Printed by (“adorned with”). Panels: absolute; relative to all documents.
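The searches behind these figures can be approximated with the three patterns quoted on the slides. The sketch below is only an illustration: the sample records are invented, matching is done on lowercased title text, and the "plates" test (a plain substring search) is an assumption, since the slides do not say how records with and without plates were separated.

```python
import re
from collections import Counter

# The three patterns quoted on the slides.
CUTS_PATTERNS = [
    re.compile(r"with cuts"),
    re.compile(r"adorn['e]d with"),
    re.compile(r"with ([^ ]+ )+cuts"),
]

def mentions_cuts(title):
    """True if any of the three patterns matches the lowercased title."""
    text = title.lower()
    return any(p.search(text) for p in CUTS_PATTERNS)

def classify(title):
    """None for non-hits; otherwise split hits by an assumed 'plates' test."""
    if not mentions_cuts(title):
        return None
    return "with plates" if "plates" in title.lower() else "without plates"

def hits_by_decade(records):
    """Map decade -> (absolute hits, hits relative to all documents)."""
    totals, hits = Counter(), Counter()
    for year, title in records:
        decade = year // 10 * 10
        totals[decade] += 1
        if mentions_cuts(title):
            hits[decade] += 1
    return {d: (hits[d], hits[d] / totals[d]) for d in sorted(totals)}

# Invented sample records: (publication year, title)
records = [
    (1741, "A pretty book, adorn'd with cuts"),
    (1745, "Sermons on several occasions"),
    (1752, "Fables with sixty copper cuts"),
    (1758, "Fables, adorned with plates"),
]
print(hits_by_decade(records))
```

Reporting both numbers per decade mirrors the two panels: absolute hits grow with overall print output, while the relative share shows whether the phrase itself became more common.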

Combining ESTC metadata with ECCO full texts

ECCO contains only a subset of the works in ESTC
• ECCO1 titles refer to 40% of the 18th century English records in ESTC (~53% after ECCO2)
• (3,279 titles in ECCO1 are not from the 18th century?)
• (10,389 titles in ECCO1 are not in English)

Adorned with (ECCO & ESTC)


ECCO and ESTC catalogue different things
• Multi-volume works and periodicals may have a single record in ESTC
• ~6000 two-volume works, ~1000 each of three- and four-volume works
• 189 works with 10 or more issues, 38 with 20 or more

Page counts from ECCO can be used to evaluate accuracy of ESTC page counting
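One simple way to run such a cross-check is to flag records where the two page counts disagree by more than some tolerance. In the sketch below the record identifiers, the counts and the 10% tolerance are all invented for illustration.

```python
def page_count_mismatches(estc_pages, ecco_pages, tolerance=0.1):
    """Return records whose ESTC page count deviates from the ECCO count
    by more than `tolerance` (relative to the ECCO count)."""
    mismatches = {}
    for doc_id in estc_pages.keys() & ecco_pages.keys():
        estc, ecco = estc_pages[doc_id], ecco_pages[doc_id]
        if ecco and abs(estc - ecco) / ecco > tolerance:
            mismatches[doc_id] = (estc, ecco)
    return mismatches

# Invented identifiers and counts
estc_pages = {"T1": 100, "T2": 48, "T3": 200}
ecco_pages = {"T1": 104, "T2": 96, "T4": 10}
print(page_count_mismatches(estc_pages, ecco_pages))
```

Only records present in both sources are compared; a large mismatch can point to an error in the parsed ESTC page count or to a multi-volume work catalogued under a single ESTC record.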

Metadata enrichment and cross-checking
• Number of words from ECCO sometimes better measure of length than number of pages
• ECCO has information on illustrations, charts and images contained in the works

Projecting the results of full text searches into ESTC metadata

OCR quality issues
● The British Library’s 19th century collection has an estimated word accuracy of 78 %
● The estimated word accuracy (word recognition rate) of the Finnish newspaper collection is about 70-75 %
● These are quite low figures but realistic for OCRed historical collections
→ The point isn’t necessarily to correct all OCR mistakes, but to develop tools that can deal with them!
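One way a tool can tolerate OCR errors instead of correcting them is approximate keyword matching. The sketch below uses Python's standard `difflib` to accept tokens that are merely close to the keyword, for instance a long s misread as "f". The 0.8 threshold and the sample sentence are assumptions for illustration, not the approach used in the project's tools.

```python
from difflib import SequenceMatcher

def fuzzy_hits(text, keyword, threshold=0.8):
    """Return tokens whose similarity to `keyword` is at least `threshold`."""
    hits = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if SequenceMatcher(None, token, keyword).ratio() >= threshold:
            hits.append(token)
    return hits

# "refpectful" stands in for an OCR misreading of "respectful"
print(fuzzy_hits("The polite and refpectful gueft", "respectful"))
```

A lower threshold catches more damaged spellings but also more unrelated words, so the value would need tuning against a manually checked sample.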

Human nature timeline count

Political economy timeline count

Different genres: “Human nature”, top-list of authors, all editions with hits included

Human nature, min. 10 hits / title, top-authors timeline

Polysemy and different genres: “Human nature”, top titles, all printed editions included

Human nature (publishers with most hits)

From simple keywords to topic formulation
“Human nature” –>
“Human nature” and (“benevolence” or “pride” or “sympathy”)
“Human nature” or “frailty” or “misanthropy” or “pravity”

Subcorpora formulation
“Human nature” –> “Human nature” or “frailty” or “misanthropy” or “pravity” in
1) books between 100 and 200 pages long, that do not have the word “sermon” or “christ” in their PREFACE or TITLE and that also contain the word “philosophy”,
against
2) books that DO have the word “christ” or “sermon” in their PREFACE or TITLE
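The two subcorpora described above can be written as simple predicates over document records. The sketch below is a minimal illustration: the `Doc` fields, the sample documents and the use of plain substring matching are all assumptions, and the real queries run against the ECCO API rather than in-memory records.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    # Hypothetical record layout for illustration only
    title: str
    preface: str
    body: str
    pages: int

TOPIC_TERMS = ("human nature", "frailty", "misanthropy", "pravity")
RELIGIOUS_TERMS = ("sermon", "christ")

def contains_any(text, terms):
    """Case-insensitive substring test ("christ" also matches "christian")."""
    text = text.lower()
    return any(term in text for term in terms)

def is_religious(doc):
    """Subcorpus 2: "christ" or "sermon" in the title or preface."""
    return contains_any(doc.title + " " + doc.preface, RELIGIOUS_TERMS)

def is_philosophical(doc):
    """Subcorpus 1: 100-200 pages, not religious, mentions "philosophy"."""
    return (100 <= doc.pages <= 200
            and not is_religious(doc)
            and "philosophy" in doc.body.lower())

def topic_hits(docs, in_subcorpus):
    """Titles in the given subcorpus whose body matches the topic query."""
    return [d.title for d in docs
            if in_subcorpus(d) and contains_any(d.body, TOPIC_TERMS)]

docs = [
    Doc("An essay on morals", "To the reader", "Philosophy shows the frailty of man.", 150),
    Doc("A sermon on charity", "", "Human nature is fallen.", 40),
    Doc("A long treatise", "", "Philosophy and human nature.", 600),
]
print(topic_hits(docs, is_philosophical))
print(topic_hits(docs, is_religious))
```

Keeping the topic query and the subcorpus definition as separate functions makes it easy to hold one fixed while varying the other.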

Topic and subcorpora formulation
Exploring the similarity/distance of collocations between each other allows one to:
1. Define a particular meaning for a polysemous term: “consciousness” AND (“eternal” OR “holy” OR “soul”)
2. Define different subcorpora: A) “christ” OR “sermon” OR “soul”, B) “fortune” OR ...,

consciousness

Collocations for politeness

"paragraph level collocations" : {
  "quality" : 0.002, "young" : 0.002, "women" : 0.002, "taste" : 0.005,
  "society" : 0.002, "rank" : 0.004, "superior" : 0.002, "virtues" : 0.003,
  "sentiments" : 0.003, "youth" : 0.002, "usual" : 0.002, "visit" : 0.004,
  "qualities" : 0.004, "rules" : 0.002, "temper" : 0.003, "treated" : 0.004,
  "pride" : 0.002, "seemed" : 0.002, "received" : 0.002, "utmost" : 0.003
},
"document level collocations" : {
  "teazed" : 0.557, "sauntering" : 0.532, "screamed" : 0.551, "volubility" : 0.555,
  "ungraceful" : 0.554, "urbanity" : 0.526, "regretting" : 0.535, "sprightliness" : 0.586,
  "vociferation" : 0.567, "prudery" : 0.597, "rusticity" : 0.596, "politest" : 0.636,
  "refpeaful" : 0.522, "sobbed" : 0.598, "vacity" : 0.6079955580233204,
  "teized" : 0.5443425076452599, "regaled" : 0.5386012715712988,
  "unamiable" : 0.6061320754716981, "vulgarity" : 0.5483619344773791,
  "unpolite" : 0.6029601029601029
}
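The slides do not say how these collocation scores are computed. Purely as an illustration of the idea of paragraph-level collocation, the sketch below scores each word by the share of target-containing paragraphs in which it also occurs; this measure, the whitespace tokenization and the sample paragraphs are assumptions and will not reproduce the figures above.

```python
from collections import Counter

def paragraph_collocations(paragraphs, target, stopwords=frozenset()):
    """Share of paragraphs containing `target` in which each other word occurs."""
    counts = Counter()
    n_target = 0
    for para in paragraphs:
        tokens = set(para.lower().split())
        if target in tokens:
            n_target += 1
            counts.update(tokens - {target} - stopwords)
    if n_target == 0:
        return {}
    return {word: count / n_target for word, count in counts.items()}

# Invented sample paragraphs
paragraphs = [
    "politeness and taste",
    "taste in society",
    "politeness requires taste and rank",
]
scores = paragraph_collocations(paragraphs, "politeness", stopwords=frozenset({"and"}))
print(sorted(scores.items()))
```

The same function applied to whole documents instead of paragraphs would give a document-level variant, which tends to surface looser, genre-level associations.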


Levels of processing
• Paragraph
  – “Uniform-size” chunk of highly related content
  – ECCO2 does not contain paragraph segmentation!
• Section
  – Non-uniform, but useful for constraining, e.g. PREFACE
• Document part
  – TITLEPAGE, FRONTMATTER, BODY, BACKMATTER, TOC, INDEX
• Complete Work
  – Not uniform size (unless so constrained)

Support for topic formulation –> exploration
• Terms with similar collocations to a given term can also be explicitly queried. Here, the notion is “what is talked about similarly” to what I’m asking about
• With flexible constraining and sufficient work in figuring out the right constraints, you can ask for example “what is talked about similarly in the end of the 18th century as soul was in the beginning of the 18th century in non-religious texts.”

Support for exploration
• One can also directly graph conceptual distances, so e.g. discover who talks about human nature similarly/dissimilarly to David Hume
• There is also functionality to compare two queries to each other
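A common way to turn "talked about similarly" into a number is the cosine similarity between collocation vectors. The sketch below assumes each term (or author, or query) is represented as a dict of collocate weights; the vectors are invented and the choice of cosine similarity is an assumption, not a description of the project's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target, vectors):
    """Rank all other entries by similarity to `target`, most similar first."""
    scored = [(cosine_similarity(vectors[target], vec), name)
              for name, vec in vectors.items() if name != target]
    return sorted(scored, reverse=True)

# Invented collocation vectors
vectors = {
    "soul": {"eternal": 1.0, "holy": 1.0},
    "spirit": {"eternal": 1.0, "holy": 1.0},
    "fortune": {"money": 1.0},
}
print(most_similar("soul", vectors))
```

Conceptual distance can then be taken as one minus the similarity, which is what a distance graph between authors or terms would plot.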

Human nature vs. dignity by paragraph

Human nature vs. self-love by paragraph (context vector analysis)

Tool testing with own laptop
- Queries in ECCO API1 with ESTC metadata: /estc-turin/estc-ecco-shinyapp/
- Queries of ESTC metadata, ability to search various combinations of ESTC & ECCO: /estc-turin/estc-shinyapp/
- Trend comparison app, mainly ECCO API2 on paragraph level (also cumulative graph, can be used e.g. for checking term overlap): /estc-turin/trend-comparison-shinyapp/
- Query inspector: /estc-turin/api-query-inspector-shinyapp/

What else?

Thanks!
- COMHIS Collective: Mikko Tolonen, Eetu Mäkelä, Leo Lahti, Ville Vaara, Hege Roivainen, Antti Kanner, Jani Marjanen..
- University of Helsinki & Turku
- Finnish National Library
- British Library
- Academy of Finland
