Background

Currently, Greenplum DB has several known drawbacks:

1. The cluster has to restart when adding more segments, because the hash algorithm mapping tuples to segments is not monotonic: we have to change each table's distribution policy, re-distribute all of its data (which locks the table in Access Exclusive mode), and then change the distribution policy back. This is very expensive.
2. Data is stored on every segment. Although we have added support for replicated tables, we could still make data distribution more flexible, e.g. distributing a table's data only on user-specified segments.
3. Gang size is always 1 or N (the number of segments). This may lead to two problems:
   a. We might spawn too many processes in the system, which can overload it.
   b. We are not able to spawn more processes for computation to improve performance.
Goals

We want to make data distribution in Greenplum DB more flexible, so that we get:

1. Elastic expansion: the database stays available while the GPDB cluster is being expanded. This will benefit Greenplum on Cloud.
2. Elastic storage: data in a table does not need to be distributed to every segment.
3. Dynamic gang size:
   a. High concurrency: we do not need to spawn as many processes when data is distributed on only a few segments and the system's workload is high.
   b. High performance: we can gain computing performance by spawning more processes in a gang.
4. Cloud support: with the elastic capabilities above, it becomes much easier and better to run Greenplum on a cloud platform.
Important Features and the Implementation

Region: New Data Distribution

We introduce the concept of a "region" to make data distribution more flexible, and show that we can achieve the above goals by implementing regions correctly and efficiently.
A region is a set of segments. Users can `create/drop` regions to manage them. Users can create a table and specify the region the table binds to, so that data inserted into the table is stored only on the segments within that region.
Manage Region DDL

-- This statement creates a region named r1 with 4 segments: seg1, seg2, seg3, seg9.
create region r1 with segments='1-3,9';

-- This statement drops the region named r1.
-- If any tables are bound to region r1, this statement will fail.
drop region r1;

One open design question: should regions be cluster-level or database-level? Currently, we prefer to implement regions at the cluster level. We also do not allow an `alter region` statement, because it would be so heavy that we would have to change every table bound to the region. Such a statement might hang for a long time, and what if it crashes just before finishing?
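The segment-list syntax above (`'1-3,9'`) must be parsed into a concrete segment set. A minimal sketch in Python; the function name and the exact syntax rules (comma-separated ids and inclusive `lo-hi` ranges) are assumptions for illustration, not part of any Greenplum API:

```python
def parse_segment_spec(spec: str) -> set[int]:
    """Parse a segment spec like '1-3,9' into {1, 2, 3, 9}.

    Each comma-separated item is either a single segment id
    or an inclusive range 'lo-hi'. (Illustrative sketch only.)
    """
    segments: set[int] = set()
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            lo_s, hi_s = item.split("-", 1)
            lo, hi = int(lo_s), int(hi_s)
            if lo > hi:
                raise ValueError(f"invalid range: {item!r}")
            segments.update(range(lo, hi + 1))
        else:
            segments.add(int(item))
    return segments
```

For example, `parse_segment_spec('1-3,9')` yields the four segments used by region r1 above.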
Bind a Table to a Region

-- This statement creates a table and binds it to region r1 (segments 1-3,9),
-- so that data of t1 is distributed on r1's segments by hashing column c.
create table t1 (c int) with region=r1 distributed by (c);

-- Insert data into t1: the four tuples (1),(2),(3),(4) will be sent to
-- seg1, seg2, seg3, seg9, hash-distributed across these 4 segments.
insert into t1 values (1), (2), (3), (4);

-- This statement changes t1's binding region to r2 (suppose r2's segments are
-- seg1, seg2, seg3, seg9, seg10 -- r1 plus one more segment).
-- This causes data movement: t1's data will be re-distributed on the new region.
-- We should implement the movement very efficiently, and table t1 should stay
-- available (at least for reads) throughout the process.
-- We discuss an efficient implementation later.
alter table t1 set region to r2;
-- A DBA should easily know a table's binding region:
select region_oid from gp_distribution_policy where rel_oid = table_t_oid;

-- A DBA should easily know which tables are in a region:
select rel_oid from gp_distribution_policy where region_oid = r_oid;
Internal Design and Implementation

The statement that causes data movement is the most complex, so let's discuss it in more detail. If t1 is distributed randomly, movement is quite easy; we focus on a hash-distributed table's data movement.

alter table t1 set region to r2; -- at the beginning, t1 is on r1

1. r2 is a proper superset of r1. For example, r1 is (seg0, seg1, seg2) and r2 is (seg0, seg1, seg2, seg3). In this case, we should make sure data movement only happens from seg0-2 to seg3; that is, our hash-distribution algorithm must be monotonic. We should also try our best to minimize the amount of data that needs to move. The process can be implemented as the following plan:

Insert on t (slice2 on r2)
  -> Redistribute Motion r1 to r2
       Hash Key: public.t.c1
       -> Delete on t (slice1 on r1)
            -> Seq Scan on t (slice1 on r1)
The key point to pay attention to is that the Insert node and the Delete node are executed on different segment sets.

2. r2 is a proper subset of r1. We can still use the above plan, as long as the Delete node and the Insert node run on different segment sets.

3. Otherwise (r1 and r2 overlap arbitrarily). Again, the above plan works as long as the Delete node and the Insert node run on different segment sets.

The whole process should run in a single transaction. We should lock the table in Exclusive mode at the beginning of the transaction; thanks to MVCC, other transactions will not see the region change before we commit.

Let's see an example of a query between two regions.
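To see why monotonicity matters in case 1, here is a small Python sketch counting how many tuples change segment under the current non-monotonic modulo scheme when r1 (3 segments) grows to r2 (4 segments). The key itself stands in for the tuple's hash value; this is an illustration, not the actual hashing code:

```python
def placement(key: int, segments: list[int]) -> int:
    """Naive placement: take the hash value (here the key itself stands
    in for it) modulo the number of segments -- this mirrors the current
    non-monotonic modulo scheme."""
    return segments[key % len(segments)]

r1 = [0, 1, 2]        # region r1: seg0, seg1, seg2
r2 = [0, 1, 2, 3]     # region r2: r1 plus seg3

keys = range(10_000)
moved = sum(1 for k in keys if placement(k, r1) != placement(k, r2))
# Under plain modulo, about 3/4 of the tuples change segment, even though
# only ~1/4 of the data actually needs to land on the new seg3 -- exactly
# why the new distribution algorithm must be monotonic.
```

A monotonic algorithm would move only the tuples destined for seg3 and leave everything else in place.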
create region r1 with segments='6,7,8';
create region r2 with segments='8,9,10,11';
create table t1 (c int) distributed by (c) on region r1;
create table t2 (c int) distributed by (c) on region r2;
-- then insert some data into t1 and t2
-- ########################################################

select * from t1;
-- this creates a gang with 3 processes, on seg6, seg7, seg8

select * from t2;
-- this creates a gang with 4 processes, on seg8, seg9, seg10, seg11

select * from t1 join t2 on t1.c = t2.c;
-- Attention! In current GPDB there would be no motion here. But with regions,
-- t1 and t2 are on different regions, so we have to move data.
-- This creates two gangs:
--   * gang1: table scan on t1, plus a redistribute motion to gang2
--   * gang2: table scan on t2, receives gang1's data via the redistribute
--     motion, then performs the join

From this example we can draw two conclusions:
1. the processes in a gang are determined by the region
2. a motion can transfer data between regions, i.e. the sender and receiver can be on different regions
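The two conclusions can be sketched as a tiny planner decision in Python. The names (`gang_size`, `join_needs_motion`) are illustrative, and the co-location check is simplified to "same segment set" (a real planner would also compare distribution keys):

```python
# Region definitions from the example above.
REGIONS = {
    "r1": [6, 7, 8],
    "r2": [8, 9, 10, 11],
}

def gang_size(region: str) -> int:
    """A scan gang needs one process per segment in the table's region."""
    return len(REGIONS[region])

def join_needs_motion(region_a: str, region_b: str) -> bool:
    """Simplified co-location check: two hash-distributed tables can be
    joined without a motion only if they live on the same segment set;
    otherwise a redistribute motion connects the two gangs."""
    return set(REGIONS[region_a]) != set(REGIONS[region_b])
```

Here `gang_size("r1")` is 3, `gang_size("r2")` is 4, and `join_needs_motion("r1", "r2")` is true, matching the example.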
Expanding the Cluster Online

With regions we can expand a Greenplum cluster online. The workflow looks like:
1. Get the new segments ready (binaries, directories, ssh keys, ...).
2. Add them into the cluster:
   a. initdb on the new segments
   b. make the QD connect to the segments and track their states
   c. update the QD's segment configuration
3. Create a new region that contains all the segments.
4. Alter some tables' regions to move data, now or later. (The database remains available even if we skip this step.)
Underlying Storage Management

The current hash algorithm is very simple: hash the tuple to a value, then take it modulo the segment count. Its biggest drawback is that after adding one segment to the cluster, we have to re-arrange all the data among all the segments. (Only if we exactly double the segment count can we get away with moving just part of the data from the old segments to the new ones, in a single direction -- and doubling the cluster is itself a high cost.) To solve this problem we have to design a new hash algorithm.

We could use consistent hashing with virtual nodes, or we could introduce the concept of a bucket. Compared to consistent hashing, buckets make balanced data distribution easier, and the distribution logic is clearer, so we might also use it to optimize queries in the future. Let's discuss the bucket method here.

Each tuple is first mapped to a hash value using the current method. Then a second, simple map takes this hash value to a bucket id. For each table, we maintain its bucket-id-to-segment-id map. With the bucket layer, when we add one segment to the cluster, the database keeps working correctly as long as we have not yet updated the bucket-to-segment mapping. Furthermore, with the bucket layer we do not need to re-arrange all the data: we just move some buckets from several segments to the new one, which leads to much less data movement. One implementation problem remains: how do we find all the tuples in a bucket efficiently?
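A minimal sketch of the bucket layer in Python. The bucket count (64), the round-robin initial placement, and the rebalancing heuristic are all illustrative assumptions, not the actual design or catalog structures:

```python
NUM_BUCKETS = 64  # fixed bucket count, chosen here only for illustration

def bucket_of(hash_value: int) -> int:
    """Second-level map: tuple hash value -> bucket id."""
    return hash_value % NUM_BUCKETS

def initial_map(segments: list[int]) -> dict[int, int]:
    """Per-table map: bucket id -> segment id, round-robin at creation."""
    return {b: segments[b % len(segments)] for b in range(NUM_BUCKETS)}

def expand(bucket_to_segment: dict[int, int], new_segment: int) -> dict[int, int]:
    """After adding one segment, move just enough buckets onto it so that
    no segment owns more than its fair share; every other bucket (and its
    data) stays put -- this is the source of the reduced data movement."""
    m = dict(bucket_to_segment)
    n_segments = len(set(m.values())) + 1
    target = NUM_BUCKETS // n_segments
    moved = 0
    for b in sorted(m):
        if moved >= target:
            break
        load = sum(1 for s in m.values() if s == m[b])
        if load > target:       # donor segment is still above its fair share
            m[b] = new_segment
            moved += 1
    return m
```

With 3 old segments and 64 buckets, adding a fourth segment moves only 16 of the 64 buckets, all of them onto the new segment; under plain modulo, most tuples would have changed segment.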
Catalog changes:
1. Add a catalog pg_region to maintain region information.
2. Add columns to gp_distribution_policy to maintain table-region relationships and bucket-to-segment mapping information.
Challenges

Some foreseeable challenges for us:
1. Redesign gang management, including gang creation and gang reuse, to support variable-size gangs.
2. Involve only the segments a query needs: e.g., INSERT/UPDATE on a table should involve only a subset of the segments in its region, and two-phase-commit commands should not be dispatched to all segments.
3. What does "begin" do after introducing dynamic gang size?
4. Generate correct parallel plans.
5. How to maintain the primary writer gang?
6. Metadata copy (catalog, ...) during online expansion.
7. ORCA support.
8. Modify gpdb tools as needed: gpexpand, pg_dump, ...
9. Enhance the upcoming upgrade tool to be downward-compatible, so that existing clusters/databases continue to work without major changes.
Use Cases

With regions, there are many good use cases, e.g.:

1. Very large clusters. Regions remove the overhead of adding more and more segments to a large cluster, so users can keep adding nodes/segments to build very large clusters that support their growing data size and workload.

2. TP+AP hybrid scenarios. Users can dedicate one segment to host TP tables while the remaining segments serve AP tables. Given this, we can optimize the TP workload very efficiently, and users can still access both TP and AP tables in real time within one big database.

3. Multi-tenant scenarios. For multiple tenants, a DBA can create separate regions for different tenants, so that small tenants store data on only a few segments/nodes while big tenants can store data on the entire cluster. This provides good vertical resource isolation: in most cases a tenant's workload runs only on a limited set of segments/nodes.

4. Enhanced high availability. We could raise cluster HA to a new level. Today, when both a primary and its mirror are down, the whole cluster is out of service. With regions, we might continue to serve DML statements as long as the regions of the related tables are available.