The  HDF  Group  

Working with HDF Files
Aleksandar Jelenak
The HDF Group

February  4,  2015  


www.hdfgroup.org  

Acknowledgements  

Many thanks to my HDF colleagues for generously providing material included in this presentation.


Outline
• Brief HDF introduction
• HDF5 data model
• Possible new developments
• HDF5 and data wrangling
• HDF5 as a foundation for sustainable software and data
• Storing and accessing HDF5 data from Python-based tools


Brief History of HDF
• 1987: At NCSA (University of Illinois), a task force formed to create an architecture-independent format and library: AEHOO (All Encompassing Hierarchical Object Oriented format). Became HDF.
• Early 1990s: NASA adopted HDF for the Earth Observing System project.
• 1996: DOE's ASC (Advanced Simulation and Computing) Project began collaborating with the HDF group (NCSA) to create "Big HDF". (The increase in computing power of DOE systems at LLNL, LANL and Sandia National labs required bigger, more complex data files.) "Big HDF" became HDF5.
• 1998: HDF5 was released with support from DOE Labs, NASA, NCSA.
• 2006: The HDF Group spun off from University of Illinois as a non-profit corporation.


The HDF Group
• Established in 1988
  • 18 years at University of Illinois' National Center for Supercomputing Applications
  • 8 years as independent non-profit company, "The HDF Group"
• HDF4 is the first HDF
  • Originally called HDF; last major release was version 4
• HDF5 benefits from lessons learned with HDF4
  • Changes to file format, software, and data model
  • HDF5 and HDF4 are different
• The HDF Group owns HDF4 and HDF5
  • HDF4 & HDF5 formats, libraries, and tools are open source and freely available with BSD-style license


The  HDF  Group  

The  HDF  Group  Mission    

 

To  ensure  long-­‐term  accessibility  of   HDF  data  through  sustainable   development  and  support  of  HDF   technologies.  


What is HDF5?
• A versatile data model that can represent very complex data objects and a wide variety of metadata.
• A completely portable file format with no limit on the number or size of data objects stored.
• An open source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
• A rich set of integrated performance features that allow for access time and storage space optimizations.
• Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.


HDF5 has characteristics of…


HDF5 Technology Platform
• HDF5 Abstract Data Model
  • Defines the "building blocks" for data organization and specification
  • Files, Groups, Links, Datasets, Attributes, Datatypes, Dataspaces
• HDF5 Software
  • Tools
  • Language Interfaces
  • HDF5 Library
• HDF5 Binary File Format
  • Bit-level organization of HDF5 file
  • Defined by HDF5 File Format Specification


HDF5  DATA  MODEL  


HDF5 Data Model
• Groups: structure among objects
• Datasets: where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes: for metadata
• Links: connect objects
  • Have names
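The building blocks above can be sketched with the h5py Python package. This is a minimal illustration, not material from the slides: the file name, the object names and the attribute are hypothetical, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver without a backing store); the name is hypothetical.
with h5py.File("data_model_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: gives structure among objects.
    grp = f.create_group("TestData")

    # Dataset: where the primary data goes (here a 3 x 4 array of float32).
    temp = grp.create_dataset("temp", data=np.zeros((3, 4), dtype="float32"))

    # Attribute: metadata attached to an object.
    temp.attrs["units"] = "degC"

    # Links have names; the names along the way form the path to the object.
    top_level = sorted(f.keys())
    has_temp = "temp" in grp
    path = temp.name
    units = temp.attrs["units"]
    shape = temp.shape
```

Here `path` is the absolute path of the dataset, built from the link names `TestData` and `temp`.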


Structures to organize objects
Figure: "Groups" (the root group "/" and "/TestData") organize "Datasets": a 3-D array, a 2-D array, two raster images, a palette, and a table.

Table:
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6


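HDF5 Links (Paths)
• The path to an object defines it
• Objects can be shared: /A/k and /B/m are the same
Figure: a tree rooted at "/" with members A, B and C, links named k and m, and objects named temp; the legend distinguishes groups from datasets.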
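A shared object like the one in this slide can be sketched in h5py by adding a second hard link to an existing dataset. The group and link names follow the slide; the file name and the data are assumptions made for illustration.

```python
import h5py
import numpy as np

with h5py.File("links_demo.h5", "w", driver="core", backing_store=False) as f:
    a = f.create_group("A")
    b = f.create_group("B")
    f.create_group("C")

    # Create the dataset once, reachable as /A/k ...
    a.create_dataset("k", data=np.arange(5))
    # ... then add a second hard link so that /B/m names the same object.
    b["m"] = a["k"]

    # A write through one path is visible through the other.
    f["/B/m"][0] = 42
    first_via_a = int(f["/A/k"][0])

    # h5py objects compare equal when they refer to the same HDF5 object.
    shared = f["/A/k"] == f["/B/m"]
```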


HDF5 Dataset
HDF5 datasets organize and contain "raw data values".
• Data: multi-dimensional array of identically typed data elements
• Metadata:
  • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: Integer
  • Properties: Chunked, Compressed
  • (optional) Attributes: Time = 32.4, Pressure = 987
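The dataset described on this slide could be created with h5py roughly as follows. The dataset name, the 32-bit integer width, the chunk shape and the gzip filter are assumptions; the slide only says "Integer", "Chunked" and "Compressed".

```python
import h5py
import numpy as np

with h5py.File("dataset_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "data",               # hypothetical dataset name
        shape=(4, 5, 7),      # dataspace: rank 3, Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
        dtype="int32",        # datatype: integer (width assumed)
        chunks=(2, 5, 7),     # property: chunked (chunk shape assumed)
        compression="gzip",   # property: compressed (filter assumed)
    )
    # Optional attributes from the slide.
    dset.attrs["Time"] = 32.4
    dset.attrs["Pressure"] = 987

    # The raw data values.
    dset[...] = np.arange(4 * 5 * 7, dtype="int32").reshape(4, 5, 7)

    rank = len(dset.shape)
    dims = dset.shape
    type_name = str(dset.dtype)
    chunk_shape = dset.chunks
    filter_name = dset.compression
    time_attr = float(dset.attrs["Time"])
    pressure_attr = int(dset.attrs["Pressure"])
```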


HDF5 Dataspace
• HDF5 datasets organize and contain "raw data values".
• HDF5 dataspaces describe the logical layout of the data elements.
Figure: the dataspace of a dataset (rank 3; the label Dim_3 = 7 is visible) gives the specifications for the array dimensions of a multi-dimensional array of identically typed data elements.


HDF5 Dataspaces
Describe the logical layout of the elements in an HDF5 dataset
• NULL: no elements
• Scalar: single element
• Simple array (most common): multiple elements organized in a rectilinear array
  • Rank = number of dimensions
  • Dimension size = number of elements in each dimension
  • Maximum number of elements in each dimension can be fixed or unlimited
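The three kinds of dataspace can be illustrated with h5py, where a NULL dataspace shows up as a shape of `None` and an unlimited dimension as `None` in `maxshape`. The names and sizes below are illustrative assumptions.

```python
import h5py

with h5py.File("dataspace_demo.h5", "w", driver="core", backing_store=False) as f:
    # NULL dataspace: a datatype but no elements.
    null_ds = f.create_dataset("null", data=h5py.Empty("f"))

    # Scalar dataspace: a single element.
    scalar_ds = f.create_dataset("scalar", data=3.14)

    # Simple dataspace: rank 2, current size 4 x 6,
    # first dimension unlimited, second fixed at 6.
    simple_ds = f.create_dataset(
        "simple", shape=(4, 6), maxshape=(None, 6), dtype="float64"
    )
    # An unlimited dimension lets the dataset grow later.
    simple_ds.resize((10, 6))

    null_shape = null_ds.shape
    scalar_shape = scalar_ds.shape
    simple_shape = simple_ds.shape
    simple_max = simple_ds.maxshape
```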


HDF5 Dataspaces
Two roles:
• Dataspace contains spatial information (logical layout) about a dataset stored in a file
  • Rank and dimensions
  • Permanent part of dataset definition
  • Example: Rank = 2, Dimensions = 4x6
• Partial I/O: dataspaces describe applications' data buffers and data elements participating in I/O
  • Example: Rank = 1, Dimension = 10
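Both roles can be seen in a small h5py sketch that uses the sizes from the slide: a 4x6 dataset in the file and a 10-element, rank-1 buffer in memory. The dataset name, the values and the chosen selections are assumptions for illustration.

```python
import h5py
import numpy as np

with h5py.File("partial_io_demo.h5", "w", driver="core", backing_store=False) as f:
    # File dataspace: rank 2, dimensions 4 x 6 (part of the dataset definition).
    dset = f.create_dataset("grid", data=np.arange(24).reshape(4, 6))

    # Partial read: only the selected 2 x 3 block is transferred.
    block = dset[1:3, 2:5]

    # Memory dataspace: a rank-1 application buffer with 10 elements.
    buf = np.zeros(10, dtype=dset.dtype)
    # Read row 1 of the file dataset (6 elements) into buffer positions 2..7.
    dset.read_direct(buf, source_sel=np.s_[1, 0:6], dest_sel=np.s_[2:8])

    block_list = block.tolist()
    buf_list = buf.tolist()
```

The selection on the file side and the selection on the memory side must contain the same number of elements, but they do not need the same rank or shape.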


HDF5 Datatypes
• Describe individual data elements in an HDF5 dataset
• Wide range of datatypes supported
  • Integer (signed and unsigned, 32 and 64-bit, etc.)
  • Float
  • Variable-length sequence types (e.g., strings)
  • Compound (similar to C structs)
  • User-defined (e.g., 13-bit integer)
  • Nested types
  • Pretty much any type!
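A few of these datatypes, as they might be expressed through h5py and NumPy. The dataset names and values are made up for the example, and user-defined types such as a 13-bit integer are not covered by this sketch.

```python
import h5py
import numpy as np

with h5py.File("datatype_demo.h5", "w", driver="core", backing_store=False) as f:
    # Fixed-size integer and floating-point types map directly to NumPy dtypes.
    ints = f.create_dataset("ints", data=np.array([1, 2, 3], dtype="uint16"))
    floats = f.create_dataset("floats", data=np.array([0.5, 1.5], dtype="float32"))

    # Variable-length strings.
    str_dt = h5py.string_dtype(encoding="utf-8")
    names = f.create_dataset("names", shape=(2,), dtype=str_dt)
    names[0] = "short"
    names[1] = "a considerably longer string"

    # Variable-length sequences of 32-bit integers.
    vlen_dt = h5py.vlen_dtype(np.dtype("int32"))
    ragged = f.create_dataset("ragged", shape=(2,), dtype=vlen_dt)
    ragged[0] = [1, 2, 3]
    ragged[1] = [4]

    int_type = str(ints.dtype)
    float_type = str(floats.dtype)
    is_vlen_string = h5py.check_string_dtype(names.dtype) is not None
    ragged_lengths = [len(row) for row in ragged[...]]
```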


Example: Compound Datatype
• Compound datatype with members: int16, char, int32, 2x3x2 array of float32
• Dataspace: Rank = 2, Dimensions = 5 x 3
Figure: a 5 x 3 grid in which every element V has the compound datatype.
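The compound datatype in this example could be written as a NumPy structured dtype and stored through h5py. The member names `a` to `d` are hypothetical (the slide gives only the member types), and `char` is modelled here as a one-byte string.

```python
import h5py
import numpy as np

# Member names are hypothetical; the member types follow the slide.
compound = np.dtype([
    ("a", np.int16),
    ("b", "S1"),                   # char, modelled as a 1-byte string
    ("c", np.int32),
    ("d", np.float32, (2, 3, 2)),  # 2x3x2 array of float32
])

# Dataspace: rank 2, dimensions 5 x 3.
data = np.zeros((5, 3), dtype=compound)
data["a"] = np.arange(15).reshape(5, 3)
data["b"] = b"x"
data["d"] = 1.0

with h5py.File("compound_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("records", data=data)
    back = dset[...]

    member_names = dset.dtype.names
    dims = dset.shape
    d_shape = back["d"].shape
    last_a = int(back["a"][4, 2])
```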


HDF5 Property Lists
• Property lists allow you to configure or control the behavior of the library.
• They provide fine-grained control when creating or accessing objects, for example how datasets are stored, performance tuning…
• There are default values associated with property lists.
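In h5py, keyword arguments such as `chunks` are translated into property list settings, and the resulting dataset creation property list can be inspected through the low-level API. The sketch below assumes hypothetical dataset names and an arbitrary chunk shape; it compares a dataset created with default properties to one created with chunking requested.

```python
import h5py

with h5py.File("plist_demo.h5", "w", driver="core", backing_store=False) as f:
    # No storage-related keywords: the default property values apply.
    default_ds = f.create_dataset("default", shape=(100, 100), dtype="float32")
    # Chunking requested: a non-default dataset creation property.
    tuned_ds = f.create_dataset(
        "tuned", shape=(100, 100), dtype="float32", chunks=(10, 10)
    )

    # Retrieve the dataset creation property lists via the low-level API.
    default_plist = default_ds.id.get_create_plist()
    tuned_plist = tuned_ds.id.get_create_plist()

    default_is_contiguous = default_plist.get_layout() == h5py.h5d.CONTIGUOUS
    tuned_is_chunked = tuned_plist.get_layout() == h5py.h5d.CHUNKED
    tuned_chunk = tuple(tuned_plist.get_chunk())
```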


Dataset Storage Properties
• Contiguous (default)
• Chunked
• Chunked & Compressed: improves storage efficiency, transmission speed
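The three storage layouts can be compared with h5py by writing the same array three times and asking the library how much space each dataset occupies. This is a sketch under stated assumptions: the array of zeros is chosen because it compresses well, and the chunk shape and gzip filter are arbitrary choices. Real data will compress less.

```python
import os
import tempfile

import h5py
import numpy as np

# Highly compressible sample data (an assumption made for the demonstration).
data = np.zeros((100, 100), dtype="float64")

path = os.path.join(tempfile.mkdtemp(), "storage_demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("contiguous", data=data)               # default layout
    f.create_dataset("chunked", data=data, chunks=(10, 10))
    f.create_dataset("compressed", data=data, chunks=(10, 10), compression="gzip")

# Reopen the file and inspect what was stored.
with h5py.File(path, "r") as f:
    sizes = {name: f[name].id.get_storage_size() for name in f}
    chunk_shapes = {name: f[name].chunks for name in f}
    filters = {name: f[name].compression for name in f}
```

Comparing `sizes["compressed"]` with `sizes["chunked"]` shows the effect of the compression filter on this particular array.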


HDF5 Attributes
• Typically contain user metadata
• Have a name and a value
• Attributes "decorate" HDF5 objects
• Value is described by a datatype and a dataspace. Analogous to a dataset, but attributes do not support partial I/O operations; nor can they be compressed or extended.
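A short h5py sketch of attributes decorating a dataset and the file's root group. All names and values are hypothetical.

```python
import h5py
import numpy as np

with h5py.File("attrs_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("temp", data=np.zeros((4, 6)))

    # Each attribute has a name and a value; the value has its own
    # datatype and dataspace (a scalar string and a 2-element float array here).
    dset.attrs["units"] = "K"
    dset.attrs["valid_range"] = np.array([180.0, 330.0])

    # Groups, including the root group, can be decorated too.
    f.attrs["title"] = "Attribute demo"

    attr_names = sorted(dset.attrs.keys())
    # An attribute is always read whole; indexing happens on the in-memory copy.
    valid_max = float(dset.attrs["valid_range"][1])
    title = f.attrs["title"]
```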


HDF5 Software Layers
Figure: layered architecture of the HDF5 software, from top to bottom.
• Tools and API: visualization tools, high-level APIs, Python, Java, R, etc.
• HDF5 Library
  • Language interfaces (C, Fortran, C++) to the HDF5 data model objects: groups, datasets, attributes, …
  • Tunable properties: chunk size, I/O driver, …
  • Internals: memory management, datatype conversion, filters, chunked storage, version compatibility, and so on…
  • Virtual File Layer with I/O drivers: POSIX I/O, split files, MPI I/O, custom
• Storage (HDF5 file format): file, split files, file on parallel filesystem


Why use HDF5?
• Challenging data:
  • Application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats.
• Software solutions:
  • For very large datasets, very fast access requirements, or very complex datasets.
  • Applications can be written in different programming languages.
• Enabling long-term preservation of data.


Who uses HDF5?
• Examples of HDF5 user communities
  • Astrophysics
  • Plasma physics
  • Particle physics
  • NASA Earth Science Enterprise
  • Dept. of Energy Labs
  • National Oceanic and Atmospheric Administration (NOAA)
  • Supercomputing centers in US, Europe and Asia
  • Financial institutions
  • Manufacturing industries
  • Many others
• For a more detailed list, visit
  • http://www.hdfgroup.org/HDF5/users5.html


NEW  DEVELOPMENTS  


Virtual Object Layer
• The VFL is implemented below the HDF5 abstract model:
  • deals with blocks of bytes in the storage container
  • does not recognize HDF5 objects nor abstract operations on those objects.
• The VOL would be layered right below the API to capture the HDF5 model.


HDF5  REST  API  


HDF Server
• REST-based service for HDF5 data
• Reference implementation for REST API
• Developed in Python using Tornado framework
• Supports read/write operations
• Clients can be Python/C/Fortran or web page


HDF  Server  Architecture  


HDF Compass
• "Simple" HDF5 viewer application
• Cross-platform (Windows/Mac/Linux)
• Native look and feel
• Can display extremely large HDF5 files
• View HDF5 files and OPeNDAP resources
• Plugin model enables different file formats/remote resources to be supported
• Community-based development model


Compass  Architecture  


Codename "HEXAD"
• HDF5 Excel Add-in
• Lets you do the usual things including:
  • Display content (file structure, detailed object info)
  • Create/read/write datasets
  • Create/read/update attributes
• Plenty of ideas for bells and whistles, e.g., HDF5 image & PyTables support
• Send in* your Must Have/Nice To Have feature list!
• Stay tuned for the beta program

* [email protected]


How HDF5 Makes Data Wrangling Easier
• Portability
  • Little/big endian, row/column-major ordering
  • Read and write files on just about any computing platform
• Per-dataset filters
  • Compression, scaling, shuffling, third-party
• Self-description
  • Attributes allow description of the data (metadata)
  • Object store (when using groups)
  • Introspection API
  • Many open-source and commercial tools understand HDF5
• Committed datatypes: like your special datatype? Save it.
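Several of these points (per-dataset filters, self-description, introspection, committed datatypes) can be tried out in one h5py sketch. The object names, the attribute text and the filter choices are assumptions for illustration.

```python
import h5py
import numpy as np

with h5py.File("wrangling_demo.h5", "w", driver="core", backing_store=False) as f:
    # Committed datatype: save a datatype in the file under a name.
    f["sample_type"] = np.dtype("float32")

    grp = f.create_group("run1")
    # Per-dataset filters: byte shuffling followed by gzip compression.
    signal = grp.create_dataset(
        "signal",
        shape=(1000,),
        dtype=f["sample_type"],   # reuse the committed datatype
        chunks=(100,),
        shuffle=True,
        compression="gzip",
    )
    # Self-description through attributes.
    signal.attrs["description"] = "hypothetical detector signal"

    # Introspection: walk the file and record every object's path and kind.
    found = []

    def record(name, obj):
        found.append((name, type(obj).__name__))

    f.visititems(record)
    found.sort()

    uses_shuffle = signal.shuffle
    signal_type = str(signal.dtype)
```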


HDF5 AS A FOUNDATION FOR SUSTAINABLE SOFTWARE AND DATA


Crossing the Chasm
Figure: the core product surrounded by the other parts of a whole product: additional software, tools, training & support, adoption processes, system integration, conventions (standards), and an open slot marked "?".


Whole  Product  Partners   IMPACT  


Tools  


Community  


Recent application
Trillion Particle Simulation on NERSC's hopper system
• Vector Particle-In-Cell plasma physics simulation with 100,000 nodes on hopper
• "Using their tools, the researchers wrote each 32 TB file to disk in about 20 minutes, at a sustained rate of 27 gigabytes per second (GB/s). By applying an enhanced version of the FastQuery tool, the team indexed this massive dataset in about 10 minutes, then queried the dataset in three seconds for interesting features to visualize."
• http://1.usa.gov/Le0JF8


DEMONSTRATION  

