Working with HDF Files
Aleksandar Jelenak
The HDF Group
February 4, 2015
www.hdfgroup.org
Acknowledgements
Many thanks to my HDF colleagues for generously providing material included in this presentation.
Outline
• Brief HDF introduction
• HDF5 data model
• Possible new developments
• HDF5 and data wrangling
• HDF5 as a foundation for sustainable software and data
• Storing and accessing HDF5 data from Python-based tools
Brief History of HDF
• 1987: At NCSA (University of Illinois), a task force formed to create an architecture-independent format and library: AEHOO (All Encompassing Hierarchical Object Oriented format). It became HDF.
• Early 1990s: NASA adopted HDF for the Earth Observing System project.
• 1996: DOE’s ASC (Advanced Simulation and Computing) Project began collaborating with the HDF group (NCSA) to create “Big HDF”. (The increase in computing power of DOE systems at LLNL, LANL, and Sandia National Labs required bigger, more complex data files.) “Big HDF” became HDF5.
• 1998: HDF5 was released with support from DOE Labs, NASA, and NCSA.
• 2006: The HDF Group spun off from the University of Illinois as a non-profit corporation.
The HDF Group
• Established in 1988
  • 18 years at University of Illinois’ National Center for Supercomputing Applications
  • 8 years as independent non-profit company, “The HDF Group”
• HDF4 is the first HDF
  • Originally called HDF; last major release was version 4
• HDF5 benefits from lessons learned with HDF4
  • Changes to file format, software, and data model
  • HDF5 and HDF4 are different
• The HDF Group owns HDF4 and HDF5
  • HDF4 & HDF5 formats, libraries, and tools are open source and freely available with BSD-style license
The HDF Group Mission
To ensure long-term accessibility of HDF data through sustainable development and support of HDF technologies.
What is HDF5?
• A versatile data model that can represent very complex data objects and a wide variety of metadata.
• A completely portable file format with no limit on the number or size of data objects stored.
• An open source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
• A rich set of integrated performance features that allow for access time and storage space optimizations.
• Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
HDF5 has characteristics of…
HDF5 Technology Platform
• HDF5 Abstract Data Model
  • Defines the “building blocks” for data organization and specification
  • Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces
• HDF5 Software
  • Tools
  • Language Interfaces
  • HDF5 Library
• HDF5 Binary File Format
  • Bit-level organization of HDF5 file
  • Defined by HDF5 File Format Specification
HDF5 DATA MODEL
HDF5 Data Model
• Groups – structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes – for metadata
• Links – connect objects
  • Have names
Structures to organize objects
[Figure: “Groups” (“/” (root) and “/TestData”) organize “Datasets”: a 3-D array, a 2-D array, two raster images, a palette, and a table.]

Table:
lat | lon | temp
----|-----|-----
12  | 23  | 3.1
15  | 24  | 4.2
17  | 21  | 3.6
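HDF5 Links (Paths)
• The path to an object defines it
• Objects can be shared: /A/k and /B/m are the same
[Figure: root group “/” with groups A, B, and C; link k under A and link m under B; datasets labeled temp. The legend distinguishes groups from datasets.]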
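The idea that two paths can lead to one shared object can be illustrated with a toy model in plain Python. This is only a sketch: nested dictionaries stand in for groups and a list stands in for a dataset. It does not use the HDF5 library, and the `resolve` helper and the sample values are hypothetical.

```python
# Toy model: groups are dicts that map link names to objects.
# A "dataset" here is just a Python list (hypothetical sample values).
temp = [3.1, 4.2, 3.6]

root = {
    "A": {"k": temp},   # link named k in group A
    "B": {"m": temp},   # link named m in group B, same object
    "C": {},            # an empty group
}


def resolve(group, path):
    """Follow a slash-separated path of link names from the root group."""
    obj = group
    for name in path.strip("/").split("/"):
        obj = obj[name]
    return obj


# Both paths reach the very same object, so a change made through
# one path is visible through the other.
resolve(root, "/A/k").append(2.9)
shared = resolve(root, "/A/k") is resolve(root, "/B/m")
print(shared, resolve(root, "/B/m"))
```

In this model a link is just a name stored in a group, so the object has no single "true" path.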
HDF5 Dataset
Data: multi-dimensional array of identically typed data elements
Metadata:
• Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: Integer
• Properties: Chunked, Compressed
• (optional) Attributes: Time = 32.4, Pressure = 987
HDF5 datasets organize and contain “raw data values”.
HDF5 Dataspace
[Figure: an HDF5 dataspace (specifications for array dimensions: Rank = 3, Dimensions including Dim_3 = 7) shown beside a multi-dimensional array of identically typed data elements.]
• HDF5 datasets organize and contain “raw data values”.
• HDF5 dataspaces describe the logical layout of the data elements.
HDF5 Dataspaces
Describe the logical layout of the elements in an HDF5 dataset
• NULL - no elements
• Scalar - single element
• Simple array (most common) - multiple elements organized in a rectilinear array:
  • Rank = number of dimensions
  • Dimension size = number of elements in each dimension
  • Maximum number of elements in each dimension can be fixed or unlimited
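As a rough illustration of rank, dimension sizes, and maximum sizes, here is a small plain-Python sketch. It is a toy model, not the HDF5 API: the `describe` helper, the `UNLIMITED` marker, and the sample shape are all hypothetical.

```python
from math import prod

UNLIMITED = None  # stand-in marker for an unlimited maximum size


def describe(dims, maxdims):
    """Return rank, element count, and which dimensions may still grow."""
    rank = len(dims)
    n_elements = prod(dims)
    growable = [m is UNLIMITED or m > d for d, m in zip(dims, maxdims)]
    return rank, n_elements, growable


# Simple array: 4 x 5 x 7, first dimension unlimited, the others fixed.
simple = describe(dims=(4, 5, 7), maxdims=(UNLIMITED, 5, 7))

# Scalar: rank 0, exactly one element.
scalar = describe(dims=(), maxdims=())

print(simple, scalar)
```

A NULL dataspace (no elements) has no counterpart in this sketch, because an empty shape already means "scalar" here.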
HDF5 Dataspaces
Two roles:
• Dataspace contains spatial information (logical layout) about a dataset stored in a file
  • Rank and dimensions
  • Permanent part of dataset definition
  [Figure: Rank = 2, Dimensions = 4x6]
• Partial I/O: dataspaces describe applications’ data buffers and data elements participating in I/O
  [Figure: Rank = 1, Dimension = 10]
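The two roles can be sketched with plain Python lists: a rank-2, 4 x 6 array plays the dataset in the file, and a rank-1 buffer of 10 elements plays the application's memory. This is a toy model, not the HDF5 API; the `read_block` helper and the chosen selection are hypothetical.

```python
# Toy model of partial I/O, not the HDF5 API.
# "File" dataset: rank 2, dimensions 4 x 6, stored as nested lists.
ROWS, COLS = 4, 6
dataset = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]


def read_block(data, start, count):
    """Copy a rectangular selection into a flat (rank-1) buffer.

    start: (row, col) of the first selected element
    count: (rows, cols) of the selected block
    """
    r0, c0 = start
    nrows, ncols = count
    buffer = []
    for r in range(r0, r0 + nrows):
        buffer.extend(data[r][c0:c0 + ncols])
    return buffer


# Select a 2 x 5 block (10 elements) and land it in a rank-1 buffer.
buf = read_block(dataset, start=(1, 1), count=(2, 5))
print(len(buf), buf)
```

The selection and the buffer differ in rank but agree on the number of elements, and that count is the only thing the two descriptions have to share in this sketch.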
HDF5 Datatypes
• Describe individual data elements in an HDF5 dataset
• Wide range of datatypes supported
  • Integer (signed and unsigned, 32 and 64-bit, etc.)
  • Float
  • Variable-length sequence types (e.g., strings)
  • Compound (similar to C structs)
  • User-defined (e.g., 13-bit integer)
  • Nested types
  • Pretty much any type!
Example: Compound Datatype
[Figure: a 5 x 3 array of elements V, each of the compound datatype.]
Compound Datatype: int16, char, int32, 2x3x2 array of float32
Dataspace: Rank = 2, Dimensions = 5 x 3
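A rough analogue of this element layout can be built with Python's standard `struct` module, much as one would write a C struct. This is a sketch under stated assumptions: the packed little-endian layout without padding and the field values are chosen for illustration only, and nothing here uses the HDF5 library.

```python
import struct

# One element: int16, char, int32, then a 2x3x2 array of float32
# (12 values). "<" selects little-endian with no padding; this packed
# layout is an assumption made for the illustration.
ELEMENT = struct.Struct("<hci12f")

# Hypothetical field values for a single element.
floats = [float(i) for i in range(2 * 3 * 2)]
packed = ELEMENT.pack(7, b"V", 100000, *floats)

# A 5 x 3 dataspace holds 15 such elements.
n_elements = 5 * 3
total_bytes = n_elements * ELEMENT.size

fields = ELEMENT.unpack(packed)
print(ELEMENT.size, total_bytes, fields[:3])
```

The datatype describes one element and the dataspace says how many elements there are, so the raw size is their product.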
HDF5 Property Lists
• Property lists allow you to configure or control the behavior of the library.
• They provide fine-grained control when creating or accessing objects, for example how datasets are stored or how performance is tuned.
• There are default values associated with property lists.
Dataset Storage Properties
• Contiguous (default)
• Chunked
• Chunked & Compressed: improves storage efficiency, transmission speed
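To make chunked layout more concrete, the following plain-Python sketch maps an element's index to the chunk that would hold it. It is a toy calculation, not the HDF5 library; the dataset shape, the chunk shape, and the `locate` helper are hypothetical.

```python
import math

# Hypothetical dataset and chunk shapes for the illustration.
dims = (4, 5, 7)
chunk = (2, 5, 4)

# Number of chunks along each dimension (edge chunks may be partial).
chunks_per_dim = tuple(math.ceil(d / c) for d, c in zip(dims, chunk))


def locate(index):
    """Return (chunk coordinates, offset inside that chunk) for an element."""
    chunk_coords = tuple(i // c for i, c in zip(index, chunk))
    offset = tuple(i % c for i, c in zip(index, chunk))
    return chunk_coords, offset


where = locate((3, 2, 6))
print(chunks_per_dim, where)
```

Under this layout, reading a small region needs only the chunks that overlap it, and each chunk can be compressed independently of the others.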
HDF5 Attributes
• Typically contain user metadata
• Have a name and a value
• Attributes “decorate” HDF5 objects
• Value is described by a datatype and a dataspace.
• Analogous to datasets, but attributes do not support partial I/O operations, and they cannot be compressed or extended.
HDF5 Software Layers
[Figure: layered architecture of the HDF5 software, summarized from top to bottom.]
• Tools and High Level APIs
  • Visualization tools
  • Python, Java, R, etc.
• HDF5 Library
  • API
    • Language interfaces: C, Fortran, C++
    • HDF5 data model objects: Groups, Datasets, Attributes, …
    • Tunable properties: Chunk Size, I/O Driver, …
  • Internals: Memory Mgmt, Datatype Conversion, Filters, Chunked Storage, Version Compatibility, and so on…
  • Virtual File Layer
    • I/O drivers: Posix I/O, Split Files, MPI I/O, Custom
• Storage (HDF5 File Format)
  • File
  • Split Files
  • File on Parallel Filesystem
Why use HDF5?
• Challenging data:
  • Application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats.
• Software solutions:
  • For very large datasets, very fast access requirements, or very complex datasets.
  • Applications can be written in different programming languages.
• Enabling long-term preservation of data.
Who uses HDF5?
• Examples of HDF5 user communities
  • Astrophysics
  • Plasma physics
  • Particle physics
  • NASA Earth Science Enterprise
  • Dept. of Energy Labs
  • National Oceanic and Atmospheric Administration (NOAA)
  • Supercomputing centers in US, Europe and Asia
  • Financial institutions
  • Manufacturing industries
  • Many others
• For a more detailed list, visit
  • http://www.hdfgroup.org/HDF5/users5.html
NEW DEVELOPMENTS
Virtual Object Layer
• The VFL (Virtual File Layer) is implemented below the HDF5 abstract model:
  • deals with blocks of bytes in the storage container
  • does not recognize HDF5 objects nor abstract operations on those objects.
• The VOL would be layered right below the API to capture the HDF5 model.
HDF5 REST API
HDF Server
• REST-based service for HDF5 data
• Reference Implementation for REST API
• Developed in Python using Tornado Framework
• Supports Read/Write operations
• Clients can be Python/C/Fortran or Web Page
HDF Server Architecture
HDF Compass
• “Simple” HDF5 Viewer application
• Cross platform (Windows/Mac/Linux)
• Native look and feel
• Can display extremely large HDF5 files
• View HDF5 files and OPeNDAP resources
• Plugin model enables different file formats/remote resources to be supported
• Community-based development model
Compass Architecture
Codename “HEXAD”
• HDF5 Excel Add-in
• Lets you do the usual things including:
  • Display content (file structure, detailed object info)
  • Create/read/write datasets
  • Create/read/update attributes
• Plenty of ideas for bells and whistles, e.g., HDF5 image & PyTables support
• Send in* your Must Have/Nice To Have feature list!
• Stay tuned for the beta program
* [email protected]
How HDF5 Makes Data Wrangling Easier
• Portability
  • Little/big endian, row/column-major ordering
  • Read and write files on just about any computing platform
• Per-dataset filters
  • Compression, scaling, shuffling, third-party
• Self-description
  • Attributes allow description of the data (metadata)
• Object store (when using groups)
• Introspection API
• Many open-source and commercial tools understand HDF5
• Committed datatypes: like your special datatype? Save it.
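The portability items concern differences that a self-describing format has to record and reconcile. The sketch below uses only Python's standard `struct` module to show what actually differs: the byte order of one integer, and the linear position of one array element under row-major versus column-major ordering. The `flat_index` helper and the sample values are hypothetical, and nothing here calls the HDF5 library.

```python
import struct

value = 1000  # 0x000003E8

little = struct.pack("<i", value)  # least significant byte first
big = struct.pack(">i", value)     # most significant byte first


def flat_index(index, dims, order):
    """Linear offset of a 2-D index in row-major ("C") or
    column-major ("F") order."""
    i, j = index
    nrows, ncols = dims
    if order == "C":
        return i * ncols + j
    return j * nrows + i


row_major = flat_index((1, 2), (4, 6), "C")
col_major = flat_index((1, 2), (4, 6), "F")
print(little.hex(), big.hex(), row_major, col_major)
```

The same value and the same logical element end up at different byte positions, which is why a file that describes its own layout is easier to move between platforms and languages.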
HDF5 AS A FOUNDATION FOR SUSTAINABLE SOFTWARE AND DATA
Crossing the Chasm
[Figure labels: Core Product; Additional Software Tools; Training & Support; Adoption Processes; System Integration; Conventions (Standards); ?]
Whole Product Partners IMPACT
Tools
Community
Recent application
Trillion Particle Simulation on NERSC’s Hopper system
• Vector Particle-In-Cell plasma physics simulation with 100,000 nodes on Hopper
• “Using their tools, the researchers wrote each 32 TB file to disk in about 20 minutes, at a sustained rate of 27 gigabytes per second (GB/s). By applying an enhanced version of the FastQuery tool, the team indexed this massive dataset in about 10 minutes, then queried the dataset in three seconds for interesting features to visualize.”
• http://1.usa.gov/Le0JF8
DEMONSTRATION