OXFORD COMMON FILESYSTEM LAYOUT

Version 0.1 Alpha

Storing and describing files are central to the functionality of institutional repositories. Unlike catalogues, where an electronic record exists to point to a physical object, an institutional repository itself contains and manages the electronic objects as well as the cataloguing data. There are currently no agreed-upon practices, however, for the low-level filesystem structures that institutional repository systems adopt to store these objects on disk. Some systems delegate this responsibility to third-party libraries, treating the storage layer as a 'black box' [e.g., ModeShape]. Others implement their own software-specific filesystem hierarchy [EPrints, DSpace?]. Yet others take no particular view on this, leaving the filesystem hierarchy up to the individual institutions to implement according to local practices. Across these implementations, there is no common approach to storing both file data and metadata on disk. This can have significant implications for the long-term viability of the data, especially in systems that are built as "fire and forget" (that is, static collections that 'just work' until they do not).

This document proposes a common approach to filesystem layout for institutional repositories, providing recommendations for how IR systems should structure and store files on disk. It is developed under the name "Oxford Common Filesystem Layout" (OCFL) because the impetus for this effort grew out of discussions at the Fedora / Samvera Camp held at the University of Oxford in September 2017. It follows the convention of naming an effort after the place where it originated (see: Dublin Core, Portland Common Data Model).

The goals for this effort include:

1. Better support of decoupled microservices. A common filesystem layout will provide an expected platform on which many services can act on the IR filesystem, independent of any single 'managerial' system. Services involved in delivery (e.g., a IIIF-compatible image server) can use the underlying filesystem directly, without needing to go through an intermediate retrieval system. Preservation and auditing systems can operate on the underlying data directly.

2. Data migration and "rebuildability." A common approach to filesystems, and a mandate to provide the ability to 'rebuild' an institutional repository from the filesystem, can help obviate challenges in migrating from one system to another. (It could be argued that a large part of the time and effort involved in migrating systems goes into designing and building the process of translating from one filesystem structure to another.) Put another way, there should be no absolute requirement on a particular piece of software to use or make sense of the objects on the filesystem. A user (or software implementer) should be able to understand the repository with just the files (and possibly the OCFL spec for convenience).

3. Common object versioning model. Digital object versioning has been implemented in several different ways, each with different impacts on storage capacity and performance. A common approach will provide an expected filesystem layout that follows best practices but, perhaps even more importantly, can document the decision process and discussions around these tradeoffs.

4. Storage systems best practices and recommendations. Discussion around filesystem layout should not dig too deeply into specific implementations; however, there is a need for high-level discussions at the intersection of implementation and layout. For example, if object versioning relies on symbolic links, how does this translate to cloud-based systems like Amazon S3? What are the recommendations for storage systems that can span multiple devices or protocols? Are there any filesystem design practices that can have a negative impact when implemented "at scale"?

5. Backup. By storing the institutional repository data as 'plain' filesystem objects, and requiring 'rebuildability' from the filesystem, the system presents a single interface to repository duplication. Backing up the entire repository is simply backing up the filesystem, without the need to rely on external processes for database exports.
(Of course, database exports can still form part of a backup strategy, but an "apocalyptic scenario" disaster recovery strategy does not rely on having these files.)

6. Validation. Part of the efforts around OCFL should be the creation of a filesystem layout validation tool which can flag both errors and warnings against a given filesystem and its conformance to the OCFL specification.

Some preliminary challenges and considerations might include:

1. The 'rug' problem. If many systems are expected to operate on a unified set of files, how do we prevent any one system from "having the rug pulled out from under it", that is, from being affected by operations that change the underlying data in ways it was not expecting?

2. The 'common data model' problem. Institutions implement metadata in many different ways that are not immediately amenable to a standardised storage approach. At a minimum, what is required to create a standardised filesystem layout while recognising that the exact contents of the objects being stored can vary in structure?

3. The border between layout and specific technologies. While many systems implement filesystem layout in the standard 'file-and-folder' paradigm, others might use specific technologies, like Apache Cassandra or HDFS, that require different approaches in their implementation.

4. The 'here or there' problem. Some systems use both local and remote storage to store the underlying data. For example, small files might be stored locally, while large files might be stored in a cheaper cloud storage location and simply 'pointed' to locally. How do we resolve the problem of having files that are there, but not?

5. The consistency problem. Depending on the nature of the files, the systems being used to provide them, and the location to which they are being written, it may not be possible to provide a synchronous guarantee that data that is provided has been written. Should a guarantee of 'eventual consistency' be enough, or should there be an absolute requirement on atomic and synchronous write operations within the IR?
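To make the validation goal more concrete, the following is a minimal sketch of a tool that reports errors and warnings separately. This proposal does not yet define any layout rules, so the rules used here are invented placeholders: the required file name `metadata.json`, the one-directory-per-object structure, and the names `Report` and `validate_root` are all assumptions made for illustration.

```python
import os
import tempfile
from dataclasses import dataclass, field

# Hypothetical rule: the proposal does not define a required file.
REQUIRED_FILE = "metadata.json"


@dataclass
class Report:
    """Collects findings, keeping hard errors apart from soft warnings."""
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def validate_root(root):
    """Check each object directory directly under `root`.

    Assumed (hypothetical) rules:
      - every entry in the root should be an object directory (else warning)
      - every object directory must contain REQUIRED_FILE (else error)
      - empty files are suspicious but not fatal (warning)
    """
    report = Report()
    for name in sorted(os.listdir(root)):
        obj_dir = os.path.join(root, name)
        if not os.path.isdir(obj_dir):
            report.warnings.append(f"{name}: stray file in storage root")
            continue
        if not os.path.isfile(os.path.join(obj_dir, REQUIRED_FILE)):
            report.errors.append(f"{name}: missing {REQUIRED_FILE}")
        for entry in sorted(os.listdir(obj_dir)):
            path = os.path.join(obj_dir, entry)
            if os.path.isfile(path) and os.path.getsize(path) == 0:
                report.warnings.append(f"{name}/{entry}: empty file")
    return report
```

A caller might treat a non-empty `errors` list as non-conformance while surfacing `warnings` for review. Because the check reads only the files themselves, it also illustrates the 'rebuildability' goal: no database or managing application is consulted.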

Previous art

The Unix Filesystem Hierarchy Standard [1] describes a common filesystem hierarchy expected of compatible systems. It allows both users and software to predict the likely locations of operating system components, such as '/lib' for libraries, '/usr/include' for headers, or '/etc' for configuration files. The OCFL aims to describe a filesystem hierarchy for the same reasons: to promote a common approach for both users and systems to understand and work with the filesystems underlying their institutional repositories.

There is a family of specifications oriented around filesystem layout developed as part of the California Digital Library efforts [2]. These include PairTree, ReDD (Reverse Directory Deltas), D-Flat, and several other associated specifications. These specifications may be a useful starting point for our discussions.

The MOAB system at Stanford University Libraries [3] builds on the work at the California Digital Library for object versioning. Its designers considered implementation details such as the choice between forward and reverse versioning and the relative complexity of each.
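As an illustration of the kind of mapping PairTree describes, the sketch below splits an identifier into two-character segments to form a nested directory path. It is a simplification under stated assumptions: the full specification also defines how to encode characters that are unsafe in file names, which is omitted here, and the function name `pair_path` is hypothetical.

```python
def pair_path(identifier, segment_length=2):
    """Map an identifier to a nested path of short directory names.

    Assumes the identifier is already filesystem-safe; the character
    encoding step of the real specification is deliberately left out.
    """
    if not identifier:
        raise ValueError("identifier must not be empty")
    segments = [
        identifier[i:i + segment_length]
        for i in range(0, len(identifier), segment_length)
    ]
    return "/".join(segments)


print(pair_path("abcdefg"))  # prints "ab/cd/ef/g"
```

Spreading objects across short nested directories in this way keeps any single directory from holding a very large number of entries, which is one of the "at scale" concerns raised in the goals.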

Notes

While the OCFL is intended to serve as the underlying storage layout, it makes no assumptions about the delivery mechanisms that sit above it. Implementers should be free to implement services that promote faster access to the underlying objects, such as relational databases, triple-stores, or key-value stores. For write operations, consistency with the OCFL store can follow the 'eventual consistency' model, while caching layers might provide a faster synchronous storage layer.
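One way to picture this arrangement is a small write-behind cache: writes are acknowledged once they reach the fast layer and are copied to the filesystem later. The sketch below is illustrative only; the class name `WriteBehindStore`, its methods, and the flat one-file-per-name layout are assumptions, not part of this proposal.

```python
import os
import tempfile


class WriteBehindStore:
    """Hypothetical caching layer in front of a plain directory."""

    def __init__(self, root):
        self.root = root
        self.cache = {}    # name -> bytes; serves reads immediately
        self.pending = []  # names accepted but not yet written to disk

    def put(self, name, data):
        """Accept a write synchronously; the disk copy comes later."""
        self.cache[name] = data
        if name not in self.pending:
            self.pending.append(name)

    def get(self, name):
        """Read from the cache first, then fall back to the filesystem."""
        if name in self.cache:
            return self.cache[name]
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()

    def flush(self):
        """Bring the filesystem up to date with all pending writes."""
        while self.pending:
            name = self.pending.pop(0)
            final_path = os.path.join(self.root, name)
            temp_path = final_path + ".tmp"
            with open(temp_path, "wb") as handle:
                handle.write(self.cache[name])
            # Rename into place so readers never see a half-written file.
            os.replace(temp_path, final_path)
```

Between `put` and `flush` the filesystem lags behind the cache, which is the window that an 'eventual consistency' guarantee would have to tolerate. A real deployment would also need to decide what happens if the process stops before `flush` runs.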

Document versions

0.1 Alpha: Initial proposal; Authors: Andrew Hankinson

1. https://wiki.linuxfoundation.org/lsb/fhs
2. https://confluence.ucop.edu/display/Curation/Microservices
3. http://journal.code4lib.org/articles/8482
