Imagining a CIF-based XAFS data exchange standard James Hester, Bragg Institute
Outline • More detail on a CIF solution • What needs to be decided
Some points from Madrid talk • Once the community have adopted a standard, it will be very hard to move away from it – so take the time to get it right • Flexibility is important – things change • Complexity can only be moved around, not eliminated: CIF maintains a simple syntax by moving most complexity into textual “Dictionaries”. • Framework, not format • CIF syntax is adequate for expressing scientific data
Fun facts on the Crystallographic Information Framework (CIF) • Until 2005, CIF stood for “Crystallographic Information File” • Canonical information on CIF contained in International Tables Vol G (594 pages, also accessible at http://it.iucr.org/g) • This year is the 20th anniversary of CIF, with the original publication in 1991 (Hall, Allen and Brown (1991) Acta Cryst. A47, 655-685) • Invented before HTML or XML or the World Wide Web
Constructing meaning
• “Interpret” – the hard part
How CIF constructs meaning • Diagram 4
CIF syntax • A CIF file is pure ASCII text • All items are white-space separated (i.e. free format) • A CIF file is divided into data blocks, which start with the “data_name” token • A CIF dataname is followed by the value taken by that dataname (tag-value format) • Tabular data is represented by “loops”, which are laid out like tables in publications – first a header, then columns of data. Columns and rows can appear in any order • Loops and tag-value pairs can appear in any order within data blocks, which can themselves appear in any order • Standard uncertainties (s.u.'s) are expressed according to IUCr recommendations, i.e. in parentheses after the value, as in 0.813(2)
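The parenthesised standard-uncertainty notation used in CIF, as in 0.813(2), can be unpacked with a few lines of Python. This is a minimal sketch: the function name parse_su is hypothetical, and it assumes plain decimal numbers without exponents.

```python
import re

# A number optionally followed by a parenthesised uncertainty, e.g. 0.813(2).
# Assumption: no exponent notation such as 1.2(3)e-4.
_SU_PATTERN = re.compile(r"^([+-]?\d*\.?\d+)(?:\((\d+)\))?$")


def parse_su(text):
    """Return (value, su); su is None when no uncertainty is given."""
    match = _SU_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a number with optional s.u.: {text!r}")
    number, su_digits = match.groups()
    value = float(number)
    if su_digits is None:
        return value, None
    # The s.u. digits apply to the last decimal places of the value.
    decimals = len(number.split(".")[1]) if "." in number else 0
    return value, int(su_digits) * 10 ** (-decimals)
```

For example, parse_su("0.813(2)") gives a value of 0.813 with an uncertainty of about 0.002, and parse_su("5248.52108") gives the value with None as the uncertainty.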
#
# An Example CIF-format file for XAFS
#
data_v2o5_nanotube
_xafs_absorber.atom                  V
_xafs_absorber.edge                  K
_xafs_source.identification          'KEK-PF BL20B'
_xafs_source.location                'Tsukuba, Japan'
_xafs_collection.date_time_initiated '2008-05-26T15:35:33'
loop_
  _xafs_detectors.label
  _xafs_detectors.position
  _xafs_detectors.type
  _xafs_detectors.special_details
  monitor      monitor   ionisation    .
  fl-detector  detector  fluorescence  '36-element PAD detector'
  io-detector  detector  ionisation    .
  foil         foil      ionisation    .
loop_
  _xafs_ionisation_detector.label
  _xafs_ionisation_detector.gas_pressure
  _xafs_ionisation_detector.length
  _xafs_ionisation_detector.amplifier
  _xafs_ionisation_detector.amplifier_gain
  monitor      1  10  'Keithley'  10
  io-detector  1  20  'Keithley'  10
  foil         1   5  'Keithley'  11
loop_
  _xafs_reduced.energy
  _xafs_reduced.absorbance
  5248.52108  0.813(2)
  5258.29435  0.798(2)
  5268.26606  0.781(2)
  5278.27878  0.764(2)
  5288.28697  0.748(2)
  5298.19834  0.731(19)
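To show how little machinery the syntax needs, here is a rough Python sketch that reads a restricted subset of CIF: data blocks, tag-value pairs, loops, quoted strings and comments. The names read_simple_cif and EXAMPLE are hypothetical, the standard-library shlex tokenizer is swapped in for a real CIF lexer, and semicolon-delimited text fields and save frames are assumed not to occur. Because the tokenizer strips quotes, a quoted value that begins with an underscore would be misread as a dataname; a real CIF parser does not have that limitation.

```python
import shlex

# A shortened copy of the example datafile.
EXAMPLE = """
# An Example CIF-format file for XAFS
data_v2o5_nanotube
_xafs_absorber.atom          V
_xafs_source.identification  'KEK-PF BL20B'
loop_
  _xafs_detectors.label
  _xafs_detectors.position
  _xafs_detectors.type
  _xafs_detectors.special_details
  monitor      monitor   ionisation    .
  fl-detector  detector  fluorescence  '36-element PAD detector'
loop_
  _xafs_reduced.energy
  _xafs_reduced.absorbance
  5248.52108  0.813(2)
  5258.29435  0.798(2)
"""


def _is_keyword(token):
    """True for datanames, loop_ and data_ headers (anything that ends a value list)."""
    lowered = token.lower()
    return token.startswith("_") or lowered == "loop_" or lowered.startswith("data_")


def read_simple_cif(text):
    """Return {block_name: {"items": {tag: value}, "loops": [[row_dict, ...], ...]}}."""
    lexer = shlex.shlex(text, posix=True)  # handles quotes and '#' comments
    lexer.whitespace_split = True          # CIF items are white-space separated
    tokens = list(lexer)
    blocks, current, i = {}, None, 0
    while i < len(tokens):
        token = tokens[i]
        if token.lower().startswith("data_"):
            current = blocks.setdefault(token[5:], {"items": {}, "loops": []})
            i += 1
        elif current is None:
            raise ValueError(f"{token!r} appears before any data block")
        elif token.lower() == "loop_":
            i += 1
            headers = []
            while i < len(tokens) and tokens[i].startswith("_"):
                headers.append(tokens[i])
                i += 1
            values = []
            while i < len(tokens) and not _is_keyword(tokens[i]):
                values.append(tokens[i])
                i += 1
            if not headers or len(values) % len(headers) != 0:
                raise ValueError("loop values do not fill whole rows")
            width = len(headers)
            rows = [dict(zip(headers, values[j:j + width]))
                    for j in range(0, len(values), width)]
            current["loops"].append(rows)
        elif token.startswith("_"):
            if i + 1 >= len(tokens):
                raise ValueError(f"dataname {token!r} has no value")
            current["items"][token] = tokens[i + 1]
            i += 2
        else:
            raise ValueError(f"unexpected value {token!r}")
    return blocks


blocks = read_simple_cif(EXAMPLE)
```

Here blocks["v2o5_nanotube"]["items"] holds the unlooped datanames, and each entry of blocks["v2o5_nanotube"]["loops"] is a list of rows keyed by dataname.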
CIF dictionaries • A collection of definitions also in CIF-like format • Tags drawn from restricted vocabulary of around 50 possible tags • About half of these tags are for human consumption only (e.g. item_description) • Almost all the rest are for validation • Three variants: DDL1/2/m
Comparing the DDLs • DDL1: the first DDL – Basic relational database descriptors – Simplest, 27 tags in total • DDL2: development driven by macromolecular database (PDB) – Detailed data types – Excellent relational database match – 60 tags • DDLm (draft): brings DDL1 and 2 together, addresses deficiencies – Excellent relational database descriptors – Vectors and matrices – Algorithms for describing relationships between datanames – Dictionaries can be assembled out of reusable chunks
The DDL Dictionary “datamodel” A dictionary language will refer to a datamodel, which must be compatible with the syntactical datamodel. For all CIF DDLs, this datamodel is isomorphic to a relational database. In particular, note the following equivalences: • Category = table description • Category key = table key
Not implied by the grammar!
• Datafile loop = a filled-in table in the database • A loop row = a table record • Loop headers = table columns • Unlooped datanames = values taken by columns in a single-row table
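These equivalences can be made concrete by loading two loops from the example datafile into an in-memory SQLite database. This is only a sketch: the table and column names are derived from the datanames by hand, the choice of label as primary key follows the category key, and mapping the CIF '.' (inapplicable) value to NULL is an assumption.

```python
import sqlite3

# Category = table description; category key = table key.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE xafs_detectors (
           label TEXT PRIMARY KEY,
           position TEXT,
           type TEXT,
           special_details TEXT)"""
)
conn.execute(
    """CREATE TABLE xafs_ionisation_detector (
           label TEXT PRIMARY KEY,
           length REAL)"""
)

# A loop row = a table record ('.' is stored as NULL here by assumption).
detector_rows = [
    ("monitor", "monitor", "ionisation", None),
    ("fl-detector", "detector", "fluorescence", "36-element PAD detector"),
    ("io-detector", "detector", "ionisation", None),
    ("foil", "foil", "ionisation", None),
]
ionisation_rows = [("monitor", 10), ("io-detector", 20), ("foil", 5)]
conn.executemany("INSERT INTO xafs_detectors VALUES (?, ?, ?, ?)", detector_rows)
conn.executemany("INSERT INTO xafs_ionisation_detector VALUES (?, ?)", ionisation_rows)

# The shared label lets the two categories be joined like any two tables.
joined = conn.execute(
    """SELECT d.label, d.position, i.length
       FROM xafs_detectors AS d
       JOIN xafs_ionisation_detector AS i ON i.label = d.label
       ORDER BY d.label"""
).fetchall()
```

The fluorescence detector drops out of the join because it has no row in the ionisation-detector category, which is the behaviour one would expect from separate tables sharing a key.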
Definitions • Human-readable material, including examples and descriptions • Relational database ready: keys are definable • Tags for validation, e.g. mandatory or not?
save_XAFS_DETECTORS
    _category.description
;
    Data items in the XAFS_DETECTORS category record details about
    the layout and type of detectors used in an XAFS experiment.
    Further details about particular aspects of the detectors used
    are recorded in the separate categories XAFS_DETECTORS_IONISATION
    and XAFS_DETECTORS_FLUORESCENCE.
;
    _category.id               xafs_detectors
    _category.mandatory_code   no
    _category_key.name         '_xafs_detectors.label'
    loop_
    _category_examples.detail
    _category_examples.case
;
    EXAMPLE 1: A simple three-ionisation counter setup for absorption
    measurements
;
;
    loop_
    _xafs_detectors.label
    _xafs_detectors.position
    _xafs_detectors.type
    monitor   monitor   ionisation
    detector  detector  ionisation
    foil      foil      ionisation
;
;
    EXAMPLE 2: A fluorescence detector as well as 3 ionisation chambers
;
;
    loop_
    _xafs_detectors.label
    _xafs_detectors.position
    _xafs_detectors.type
    _xafs_detectors.special_details
    monitor      monitor   ionisation    .
    fl-detector  detector  fluorescence  '36-element Ge PAD detector'
    io-detector  detector  ionisation    .
An item definition • Enumerated values allow validation • The relevant category is identified • Correct data value construction is specified
save__xafs_detectors.type
    _item_description.description
;
    The type of detector used for detecting photons
;
    _item.name             '_xafs_detectors.type'
    _item.category_id      'xafs_detectors'
    _item.mandatory_code   yes
    _item_type.code        string
    loop_
    _item_enumeration.value
    _item_enumeration.detail
    ionisation    'An ionisation chamber'
    fluorescence  'A pixelated fluorescence detector'
    Lytle         'A Lytle detector'
save_
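As an illustration of how enumerated values allow validation, the sketch below checks detector types against the enumeration in this definition. The names ALLOWED_DETECTOR_TYPES and check_enumeration are hypothetical, and the comparison is assumed to be case-sensitive.

```python
# Enumeration copied from the _xafs_detectors.type definition.
ALLOWED_DETECTOR_TYPES = {
    "ionisation": "An ionisation chamber",
    "fluorescence": "A pixelated fluorescence detector",
    "Lytle": "A Lytle detector",
}


def check_enumeration(values, allowed):
    """Return the values that are not in the enumeration (assumed case-sensitive)."""
    return [value for value in values if value not in allowed]


# The last entry is deliberately not part of the enumeration.
rejected = check_enumeration(
    ["ionisation", "fluorescence", "ionisation", "scintillator"],
    ALLOWED_DETECTOR_TYPES,
)
```

A datafile whose _xafs_detectors.type column contained "scintillator" would therefore be flagged without any knowledge of XAFS being built into the checking software.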
Another item definition • Units can be specified
save__xafs_reduced.energy
    _item_description.description
;
    The energy at which a single measurement of absorbance was taken,
    after all beamline-dependent corrections have been applied.
;
    _item.name             '_xafs_reduced.energy'
    _item.category_id      'xafs_reduced'
    _item.mandatory_code   yes
    _item_type.code        float
    _item_units.code       electron_volts
save_
Issues • Each data tag can only appear once in a data block, so if 'energy' appears in multiple tables, it must have multiple names • Data items in the same category must always be tabulated together. For example, k and χ(k) • Only 2D tables are possible. • No complex data structures (vectors, matrices)
Management • As the field develops, dictionaries will need updating. How is this going to be managed? • Where are the canonical copies kept? • IUCr: – permanent committee (COMCIFS) – Dictionary management groups report to COMCIFS – COMCIFS monitor developments in relevant IUCr commissions (computing, nomenclature) – IUCr maintains web-accessible register of dictionaries, the dictionaries themselves and CIF documentation
A protected standard • CIF is trademarked for protective purposes (cf 'Linux') and the standards themselves are copyrighted by the IUCr • Statement of policy http://ww1.iucr.org/ipr.html • All software claiming to read or write a CIF-format file must actually be able to do so • IUCr is interested in promoting the standard for use in structural science
CIF services from COMCIFS • Verification of syntax (datafile and dictionary) • Check that XAFS ontology matches IUCr-maintained ontologies (if required) • Advice on dictionary construction
Decisions, decisions • Agreement on definitions • Agreement on purpose(s) – For databases – Data transfer – Publication supplementary material • Agreement on minimum information for each purpose • Datafile format – Text/binary, simple/complex, established/new • Dictionary format, if any • Dictionary language – Which set of tags? DDL1/2/m/custom
• Management – Custodian – Update mechanism