A Framework for Access Methods for Versioned Data B. Salzberg, L. Jiang, D. Lomet, M. Barrena, J. Shan & E. Kanoulas
Outline
• Motivation
• Introducing versions through examples
• Versions and version ranges
• Data pages
– Page splitting and consolidation
– Efficiency guarantee
• Index pages
• Conclusions
Motivation • Historical archives need to be retained – Medical, banking, …
• Different historical versions created along different branches must be reconstructed – Software libraries, design, …
• Access methods for versioned data have been proposed
Motivation
• We present a framework for constructing and understanding versioned access methods
• Central point: the study of version splitting of units of data storage (disk pages)
• Main goal: to make the stabbing query efficient: “Find all data alive at this version”
The two-version example
• Record format: <set of versions for which the record does not change, key, data>
[Figure: the records at version v1 and at version v2.]
<{v1,v2},k2,d2> <{v1,v2},k3,d3>
• Redundancy: what if k2 does not change for a large number of versions?
• Could we use start and end version labels?
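The three-version example
[Figure: version tree with root v1, version v2 on branch b1 and version v3 on branch b2. Two timelines over the key space (k1, k2, k3), one per branch, each running from v1 to now. On branch b1, k1 changes from d1 to d’1 at v2 while k2 and k3 keep d2 and d3. On branch b2, k2 changes from d2 to d’2 at v3 while k1 and k3 keep d1 and d3.]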
The three-version example
[Figure: timelines for branches b1 and b2 over the key space (k1, k2, k3), from v1 to now.]
<{v1, v3},k1,d1> <{v1, v2},k2,d2> <{v1, v2, v3},k3,d3>
• What if k3 is never updated in branch b1? Should we keep updating the end version for k3 as new versions appear on b1?
• When there is branching, we cannot express a unique end version for a set of versions
• We might keep the start and the end version on each branch
Versions
• The initial version set: V = {v1}
• New versions are obtained by updating, inserting or deleting records from old versions of V
• V can be represented by a tree: the version tree
• There is a partial order on the nodes of the version tree
– Ancestors: anc(v) = {a ∈ V | a < v}
– Descendants: desc(v) = {d ∈ V | d > v}
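The version tree and its partial order can be sketched in a few lines. This is only an illustration, not the authors' implementation: the class name `VersionTree`, its methods and the child-to-parent map are assumptions made for this example, and the tree built at the end is the one from the three-version example (v2 on branch b1, v3 on branch b2).

```python
class VersionTree:
    """Hypothetical version tree stored as a child -> parent map."""

    def __init__(self, root):
        self.parent = {root: None}

    def add_version(self, new, derived_from):
        """Create version `new` from an existing version `derived_from`."""
        if derived_from not in self.parent:
            raise KeyError(f"unknown version: {derived_from}")
        if new in self.parent:
            raise ValueError(f"version already exists: {new}")
        self.parent[new] = derived_from

    def anc(self, v):
        """Strict ancestors of v: {a in V | a < v}."""
        result = set()
        p = self.parent[v]
        while p is not None:
            result.add(p)
            p = self.parent[p]
        return result

    def desc(self, v):
        """Strict descendants of v: {d in V | d > v}."""
        return {d for d in self.parent if v in self.anc(d)}

    def leq(self, a, b):
        """Partial order: a <= b iff a == b or a is an ancestor of b."""
        return a == b or a in self.anc(b)


tree = VersionTree("v1")
tree.add_version("v2", "v1")  # branch b1
tree.add_version("v3", "v1")  # branch b2
```

Versions on different branches, such as v2 and v3 here, are not comparable, which is why the order is only partial.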
Version Ranges
• Records correspond to sets of versions over which they do not change
• Such a set forms a subtree called the version range. We have:
– One start version: the root of the subtree
– A set of end versions: the leaves of the subtree (one on each branch)
• The main objection:
– End versions would have to be updated for every new version in which the record does not change
• The solution:
– Keep the end versions outside the version range
Version Ranges
[Figure: version tree with versions v1, v2, v3, v4 and v5.]
• Consider a record R inserted at version v1. Suppose that R remains unchanged at v2 and v3
• Assume now that R is updated at v4. We could say v3 is an end version for R
• A new version v5 then appears, but R is not touched by v5
Version Ranges
[Figure: version tree with versions v1, v2, v3, v4, v5 and v6.]
• Now version v3 is no longer an end version for R: R remains unchanged at {v1, v2, v3, v5}
• We choose the end version for R to be a “stop sign” along a branch. The end version does not belong to the version range for R
• Later, any number of descendants in VR could be created. If these versions do not change R, the VR expands automatically
Version Ranges
• The version range vr = (start(vr), end(vr)), where:
– start(vr) is an individual version
– end(vr) is the minimal set of versions ev with the property: v ∈ vr iff
1. start(vr) ≤ v
2. ∀ ev ∈ end(vr): ¬(ev ≤ v)
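The three-version example revisited
• Records with explicit version sets:
<{v1, v3},k1,d1> <{v1, v2},k2,d2> <{v1, v2, v3},k3,d3>
• Records with (start, end set) version ranges, including the updated records d’1 and d’2:
<(v1, {v2}),k1,d1> <(v2, { }),k1,d’1> <(v1, {v3}),k2,d2> <(v3, { }),k2,d’2> <(v1, { }),k3,d3>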
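The step from explicit version sets to (start, end set) pairs can be checked mechanically. A minimal sketch, assuming the version tree of the example is given as a child-to-parent map; the name `to_compact` is hypothetical.

```python
# Assumed encoding of the example's version tree: v2 and v3 both derive from v1.
PARENT = {"v1": None, "v2": "v1", "v3": "v1"}


def to_compact(versions, parent=PARENT):
    """Convert an explicit version set (a subtree) into (start, end set)."""
    versions = set(versions)
    # The start version is the only member whose parent is outside the set.
    roots = [v for v in versions if parent[v] not in versions]
    if len(roots) != 1:
        raise ValueError("version set is not a single subtree")
    # End versions are the first excluded versions below the subtree:
    # their parent is in the set but they are not.
    ends = {v for v, p in parent.items() if p in versions and v not in versions}
    return roots[0], ends


k1 = to_compact({"v1", "v3"})        # record <{v1, v3}, k1, d1>
k2 = to_compact({"v1", "v2"})        # record <{v1, v2}, k2, d2>
k3 = to_compact({"v1", "v2", "v3"})  # record <{v1, v2, v3}, k3, d3>
```

The three results correspond to the ranges (v1, {v2}), (v1, {v3}) and (v1, { }) listed above for d1, d2 and d3.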
Data pages
• A data page P delimits one version range (vr) and one key range (kr)
• We define KVR(P) = (kr(P), vr(P))
– A data page with KVR(P) = (kr, vr) stores all records <vr’, k, d> such that:
1. k ∈ kr and
2. vr ∩ vr’ ≠ ∅
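The storage condition for a data page can be sketched as a predicate. Several choices here are assumptions of this example and not taken from the slides: key ranges are half-open intervals [lo, hi) over comparable keys, version ranges are non-empty (start, end set) pairs, and the names `PARENT`, `ranges_intersect` and `stored_in_page` are hypothetical.

```python
# Assumed version tree of the three-version example.
PARENT = {"v1": None, "v2": "v1", "v3": "v1"}


def leq(a, b):
    """a <= b in the version tree."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False


def in_range(v, vr):
    start, ends = vr
    return leq(start, v) and not any(leq(ev, v) for ev in ends)


def ranges_intersect(vr_a, vr_b):
    # In a tree, two non-empty ranges share a version exactly when the
    # start version of one of them lies inside the other.
    return in_range(vr_a[0], vr_b) or in_range(vr_b[0], vr_a)


def stored_in_page(record, page_kvr):
    """Record <vr', k, d> belongs to the page iff k in kr and vr, vr' intersect."""
    rec_vr, key, _data = record
    (lo, hi), page_vr = page_kvr
    return lo <= key < hi and ranges_intersect(page_vr, rec_vr)


# Hypothetical page: keys k1..k2, versions from v2 onwards.
page = (("k1", "k3"), ("v2", set()))
records = [
    (("v1", {"v2"}), "k1", "d1"),
    (("v2", set()), "k1", "d1'"),
    (("v1", {"v3"}), "k2", "d2"),
    (("v3", set()), "k2", "d2'"),
    (("v1", set()), "k3", "d3"),
]
stored = [r for r in records if stored_in_page(r, page)]
```

Only the records alive somewhere in the page's version range and inside its key range are kept, here d1' for k1 and d2 for k2.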
Compact record representation
• To store records in data pages we use the compact record representation
<(v1, {v2}),k1,d1> <(v2, { }),k1,d’1> <(v1, {v3}),k2,d2> <(v3, { }),k2,d’2> <(v1, { }),k3,d3>
• Deletions do not cause loss of content; they are recorded by means of compact null records
Looking for efficiency
• To make the stabbing query efficient, a substantial percentage of the records in an accessed page must be alive at the queried version v
• The page splitting policy
– When a page P gets full, a version split of P is performed (here at the current version vn)
– A new page P’ is allocated with VR(P’) = (vn, ∅)
– Records from P can be moved or copied to P’
Page splitting policy
• Non-null records created by vn are moved from P to P’
• Non-null records whose version range intersects both VR(P) and VR(P’) are copied to P’
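The move and copy rules of a version split can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: records are tuples (start, end set, key, data) with `data = None` standing for a null record, the history is a linear chain v1, v2, v3, "alive at vn" is used as the test for a record that must appear in both pages, and every record that is not moved simply stays in P. The names `alive_at` and `version_split` are hypothetical.

```python
# Assumed linear version history for this sketch: v1 -> v2 -> v3.
PARENT = {"v1": None, "v2": "v1", "v3": "v2"}


def leq(a, b):
    """a <= b in the version tree."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False


def alive_at(v, rec):
    start, ends, _key, _data = rec
    return leq(start, v) and not any(leq(ev, v) for ev in ends)


def version_split(page, vn):
    """Split `page` at version vn; return (records left in P, records in P')."""
    stay, new = [], []
    for rec in page:
        start, _ends, _key, data = rec
        if data is not None and start == vn:
            new.append(rec)       # created by vn and not null: moved to P'
        else:
            stay.append(rec)      # everything else remains in P
            if data is not None and alive_at(vn, rec):
                new.append(rec)   # still alive at vn and not null: copied to P'
    return stay, new


rec1 = ("v1", {"v3"}, "k1", "d1")   # k1 was updated at v3, so d1 ends there
rec2 = ("v3", set(), "k1", "d1'")   # new value of k1, created by v3
rec3 = ("v1", set(), "k2", "d2")    # k2 never changed
rec4 = ("v2", set(), "k3", None)    # null record: k3 was deleted at v2
old_page, new_page = version_split([rec1, rec2, rec3, rec4], "v3")
```

After the split, every record in the new page is alive at vn, which is the property the stabbing query relies on.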
Page splitting policy
• Some kinds of key splits are allowed in our framework (similar to B-tree page splits)
– After a version split, if the new page holds more records than a certain threshold Tk (we call this a version-and-key split)
– When a full page has version range (current_version, ∅) (we call this a restricted-key split)
• Pure key splits cannot guarantee a minimum number of records alive for a given version
Consolidation
• Delete operations may degrade stabbing query efficiency
• When the number of records alive in P at vn falls below a threshold Tc, a consolidation process is triggered
• A sparse page and a suitable sibling are both current-version split, and the results are combined into one page
• Transactions with a large number of deletions may generate ghost pages
Efficiency guarantee
• We start with a page D at version v1 having n alive records
• Our framework guarantees a minimum number of alive records in a data page D when answering a stabbing query (v ∈ VR(D)) under different scenarios
Efficiency guarantee
Assertions
1. No deletes and only version splits: at least n
2. No deletes and only current-version, version-and-key or restricted-key splits: at least min(n, Tk/2)
3. Any kind of transactions, with version splits, version-and-key splits, restricted-key splits and node consolidation: at least min(Tc, n)
Index pages
• Index pages + data pages form a DAG
• Index pages also correspond to key-version ranges
• Index page entries contain for every child C:
• Index page splits and consolidations follow the same policy as for data pages
• Additional details about properties and treatment of index pages can be found in the paper
Conclusions
• Versioned data are not trivial to deal with
• Our framework
– helps to understand the implications of managing and retrieving versioned data
– gives clear guidance for representing this kind of data in a compact and robust way
– supports realistic assumptions on transactions