Adaptive refinement recovery after fault simulation
Linda Stals (1)
February 11, 2016
(1) Mathematical Sciences Institute, Australian National University, Canberra ACT 0200, Australia.
[email protected]
Resilience
With the advent of high performance machines containing an increasingly large number of processors and system components, a fault becomes more likely to occur. Achieving resilience is expensive since it inevitably requires redundancy, and thus more system resources and additional energy. Alternatives that exploit specific features of the algorithms may offer substantial savings.
Multilevel adaptive grids
We study parallel algorithms with automatic adaptive mesh refinement. The grid structure itself therefore requires nontrivial distributed and dynamic data structures that store a hierarchy of locally refined finite element meshes. When a fault occurs, a subdomain loses not only the state of the solution within the iterative process but also the information about the adaptively refined mesh structures themselves. The recovery process must take into account both the intra-grid and the inter-grid data dependencies.
Parallel implementation
[Figure: example grid with numbered nodes distributed over two processors. Panels: Processor 1, Processor 2.]
Inter-grid connections
Observe that the algebraic connections include the inter-grid connections defined by the interpolation and restriction operators.
Figure: The ghost nodes are used to complete the intra-grid and inter-grid connections. The full nodes are drawn as dark circles while the ghost nodes are drawn as open circles. (Panels: Level 2, Level 1, Level 0.)
Inter-grid connections
The inter-grid connections add an extra degree of complexity to the data dependencies. For example, when a node is moved to another processor, the neighbour node list must be updated on the current grid level as well as on the previous and next grid levels.
The use of ghost nodes as a communication buffer, or as a way of storing updates from neighbouring processors, is standard practice. However, we have extended their application. For example, during refinement the communication pattern has to be updated when new nodes are added, and we will show that by exploiting the relationship between the ghost nodes and full nodes the communication pattern may be updated independently across the processors.
Unfortunately, it will turn out that this is not true during the fault recovery process. The load balancing routine breaks many of the assumptions or rules that are relied upon during the standard refinement procedure.
Triangle subdivision
There are two ways to subdivide a triangle: bisection or quadrisection. We use the bisection method as it lends itself more easily to adaptive refinement and the general ideas extend more readily to tetrahedral grids. The grids are refined using the newest node bisection method, in which the triangles are bisected along the edges that sit opposite the newest nodes. It can be shown (Mitchell references a proof by Sewell) that if the angles in the initial triangulation are bounded away from 0 and π then the angles in the refined grid will also be bounded away from 0 and π. Indeed, only a finite number of distinct triangle shapes (up to similarity) arise.
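The bisection step can be illustrated with a minimal sketch. It assumes each triangle is stored as a tuple (a, b, newest), so that the base edge a-b sits opposite the newest node; the names bisect and refine are hypothetical and not taken from the talk. The sketch refines every triangle uniformly and ignores compatibility with neighbouring triangles, which a real implementation has to handle.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed storage convention: (a, b, newest), base edge a-b opposite the newest node.
Triangle = Tuple[Point, Point, Point]


def bisect(tri: Triangle) -> Tuple[Triangle, Triangle]:
    """Bisect a triangle along the edge opposite its newest node."""
    a, b, newest = tri
    mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    # The midpoint becomes the newest node of both children, so each
    # child's base edge is the edge opposite the midpoint.
    return (a, newest, mid), (newest, b, mid)


def refine(tris: List[Triangle], levels: int) -> List[Triangle]:
    """Apply `levels` sweeps of bisection to every triangle (uniform refinement)."""
    for _ in range(levels):
        tris = [child for t in tris for child in bisect(t)]
    return tris
```

Each sweep doubles the number of triangles, so three sweeps applied to a single triangle give eight triangles.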
Newest node bisection
Figure: Initial triangulation. The outlined circles represent the newest nodes; the label B marks the base edges.
Parallel Implementation
The parallel implementation of this method is equivalent to the steps outlined above except that it must be extended to handle the addition of new nodes. Note that the grid sitting on processor p is given by
$M_p^m = M_p^m\{F_p^m, G_p^m, E_p^m, C_p^m, Q_p^m\}$,
where $F_p^m$ is the set of full nodes, $G_p^m$ is the set of ghost nodes, $E_p^m$ is the set of edges, $C_p^m$ is the set of algebraic connections and $Q_p^m$ holds the neighbour node tables associated with processor p.
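One possible in-memory layout for the five components of the local grid is sketched below. The class name LocalGrid, the field names and the choice to key the neighbour node tables by neighbouring processor number are all assumptions made for illustration; the talk does not specify the storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

NodeId = int
Edge = Tuple[NodeId, NodeId]


@dataclass
class LocalGrid:
    """Hypothetical container for the level-m grid held by processor p."""
    full_nodes: Set[NodeId] = field(default_factory=set)    # F_p^m
    ghost_nodes: Set[NodeId] = field(default_factory=set)   # G_p^m
    edges: Set[Edge] = field(default_factory=set)           # E_p^m
    connections: Set[Edge] = field(default_factory=set)     # C_p^m
    # Q_p^m, assumed here to map a neighbouring processor to the shared nodes.
    neighbour_tables: Dict[int, Set[NodeId]] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        """A node is either full or ghost, and every edge joins stored nodes."""
        nodes = self.full_nodes | self.ghost_nodes
        return self.full_nodes.isdisjoint(self.ghost_nodes) and all(
            a in nodes and b in nodes for a, b in self.edges
        )
```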
New nodes
When the edge between $N_i$ and $N_j$ is bisected, the new midpoint node $N_d$ is assigned as follows.
If both $N_i$ and $N_j$ are ghost nodes we do not know if the processor contains enough of the grid to complete all of the connections. Consequently $N_d$ is assigned as a ghost node.
If $N_i$ and $N_j$ both belong to the set of full nodes for processor p then the new node is added to the full node table of p.
If $N_i \in F_p$ and $N_j \in F_q$ where $p \neq q$, then $N_d$ can be added as a full node to either p or q. The midpoint $N_d$ is added to the processor that contains the smaller number of full nodes according to the population table. If processors p and q have the same number of full nodes according to the population table then the global I.D. is used to decide.
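The ownership rules for a new midpoint can be sketched as follows. The function names and the representation of the population table as a dictionary are hypothetical. The slides do not say how the global I.D. breaks a tie, so the sketch substitutes a placeholder rule (the lower processor number wins) purely so that the example is deterministic.

```python
from typing import Dict


def midpoint_owner(owner_i: int, owner_j: int, population: Dict[int, int]) -> int:
    """Processor that stores the midpoint N_d as a full node.

    owner_i and owner_j are the processors that hold N_i and N_j as full nodes.
    """
    if owner_i == owner_j:
        return owner_i
    if population[owner_i] != population[owner_j]:
        # The processor with fewer full nodes receives the new node.
        return owner_i if population[owner_i] < population[owner_j] else owner_j
    # Placeholder tie-break (assumption): the source only says that the
    # global I.D. is used, without giving the rule.
    return min(owner_i, owner_j)


def midpoint_kind(proc: int, owner_i: int, owner_j: int,
                  population: Dict[int, int]) -> str:
    """How processor `proc` stores N_d: 'full' or 'ghost'."""
    if owner_i != proc and owner_j != proc:
        return "ghost"  # both endpoints are ghost nodes on this processor
    return "full" if midpoint_owner(owner_i, owner_j, population) == proc else "ghost"
```

Because every processor evaluates the same rule on the same population table, two processors sharing an edge reach the same decision without exchanging messages, provided their copies of the table agree.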
Parallel refinement
The program loops through the edges table and bisects the triangles along the base edges. We can bisect the triangles independently across the processors.
Figure: The base edges, marked by a B, show which triangles need to be bisected. The full nodes are dark circles and the ghost nodes are open circles. (Panels: Processor 1, Processor 2.)
Trim
After each level of refinement a trim routine is called to prune any ghost nodes no longer connected to a full node and to remove any full nodes from $Q_{F_p}^{m+1}$ that are no longer connected to a ghost node. Care must be taken to check both the intra- and inter-grid connections.
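A minimal sketch of such a trim step is given below. It assumes the intra-grid and inter-grid connections have already been merged into a single set of node pairs, and that the full neighbour node table can be treated as a set of full node identifiers; both are simplifications for illustration, and the function name trim is hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def trim(full: Set[int], ghost: Set[int], connections: Set[Pair],
         full_table: Set[int]) -> Tuple[Set[int], Set[Pair], Set[int]]:
    """Return the pruned ghost nodes, connections and full neighbour node table."""

    def neighbours(n: int) -> Set[int]:
        return ({b for a, b in connections if a == n}
                | {a for a, b in connections if b == n})

    # Keep only the ghost nodes that still touch at least one full node.
    kept_ghost = {g for g in ghost if neighbours(g) & full}
    stored = full | kept_ghost
    kept_connections = {(a, b) for a, b in connections if a in stored and b in stored}
    # Keep only the table entries that still touch at least one ghost node.
    kept_table = {f for f in full_table if neighbours(f) & kept_ghost}
    return kept_ghost, kept_connections, kept_table
```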
Interface base edges
Figure: Example triangulation with interface-base edges I* and base edges B*. a) The base edge B7 should be refined before the interface-base edge I1. b) Result of bisecting base edge B7. Note that the interface-base edge I1 has been updated to a base edge B1. c) The edge B1 is now bisected to give the final grid.
Split edges
To determine the edges that must be bisected we use a recursive routine that follows the interface-base edges down the refinement levels until it reaches a base edge.
Figure: Follow the interface-base edges down the coarse triangles until a base edge is found. (Edges shown: I1, I2, I3, I4 and B1.)
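The recursion can be sketched as below, assuming the dependency between edges is available as a dictionary that maps each interface-base edge to the edge that has to be bisected first. The dictionary, the function name and the second example chain are hypothetical; the sketch also assumes every chain ends at a base edge.

```python
from typing import Dict, List, Optional


def edges_to_bisect(edge: str, depends_on: Dict[str, str],
                    order: Optional[List[str]] = None) -> List[str]:
    """Return the edges in the order they must be bisected, ending with `edge`."""
    if order is None:
        order = []
    prerequisite = depends_on.get(edge)  # None means `edge` is already a base edge
    if prerequisite is not None:
        # Follow the interface-base edge down to the edge it is waiting on.
        edges_to_bisect(prerequisite, depends_on, order)
    if edge not in order:
        order.append(edge)
    return order


# Example taken from the interface base edge figure: B7 is refined before I1.
print(edges_to_bisect("I1", {"I1": "B7"}))
# Hypothetical longer chain of interface-base edges ending at base edge B1.
print(edges_to_bisect("I1", {"I1": "I2", "I2": "I3", "I3": "B1"}))
```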
Parallel implementation
Figure: The edges B1, I3 and I2 in Processor 2 need to be bisected before edges I2 and I1 in Processor 1.
Load balancing
After refinement, particularly after adaptive refinement, we may find that the number of nodes per processor is poorly balanced. Consequently the code uses a load balancing routine to redistribute the load. The load balancing routine works solely on the node-edge table; it does not take the shape of the finite elements into account. If a full node is moved from one processor to another on the finest grid level, then any corresponding node on any of the coarser grids will also be moved. In other words, the load balancing routine ensures that the full nodes on a given processor are nested. The same is not necessarily true for the ghost nodes.
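The nesting property can be illustrated with a small sketch. It assumes a node keeps the same identifier on every grid level on which it exists, and that ownership is stored as one dictionary per level mapping node to processor; these conventions and both function names are hypothetical.

```python
from typing import Dict, List

Owners = Dict[int, int]  # node id -> processor that stores it as a full node


def move_full_node(node: int, src: int, dst: int, owners_by_level: List[Owners]) -> None:
    """Move `node` from src to dst on every level where src holds it."""
    for owners in owners_by_level:
        if owners.get(node) == src:
            owners[node] = dst


def full_nodes_nested(proc: int, owners_by_level: List[Owners]) -> bool:
    """True if the full nodes of `proc` on each level also appear on the next finer level."""
    sets = [{n for n, p in owners.items() if p == proc} for owners in owners_by_level]
    return all(coarse <= fine for coarse, fine in zip(sets, sets[1:]))
```

Moving the node on the finest level only would leave a coarse full node on the source processor with no fine counterpart, which is exactly the situation the routine avoids.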
Grid reconstruction
We next develop techniques to reconstruct the grid levels after a fault has occurred. We simulate a fault by removing all of the grid levels from a given processor, except the coarsest level. In all of the approaches we discuss here we assume that the coarsest grid can be recovered (e.g. read in again from file).
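The fault simulation itself amounts to discarding data. A minimal sketch, assuming each processor holds its grid levels in a list ordered from coarsest to finest (the function name and the layout are hypothetical):

```python
from typing import Dict, List


def simulate_fault(levels_by_proc: Dict[int, List[object]], faulty: int) -> List[object]:
    """Drop every grid level except the coarsest on the faulty processor.

    Returns the discarded levels so that a test can compare them with the
    grids produced by the recovery routine.
    """
    lost = levels_by_proc[faulty][1:]
    levels_by_proc[faulty] = levels_by_proc[faulty][:1]
    return lost
```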
Load balancing
The biggest challenge is taking into account the fact that the grids are being reconstructed after load balancing.
Figure: Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor. (Panels: Level 2, Level 1, Level 0.)
Neighbouring grids
Unlike the original refinement routine, communication is needed to recover the grids. The purpose of the communication calls is to build two grids: $\widehat{M}_G^{m+1}$ and $\widehat{M}_F^{m+1}$.
Neighbouring, healthy processors send copies of their ghost nodes to be stored in $\widehat{M}_F^{m+1}$.
Neighbouring, healthy processors send copies of their full nodes to be stored in $\widehat{M}_G^{m+1}$.
Neighbouring processors also send the edges and connections joined to the full nodes to be stored in $\widehat{M}_G^{m+1}$.
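The assembly of the two auxiliary grids on the faulty processor can be sketched as follows. The message format (a dictionary per healthy neighbour with the keys full, ghost, edges and connections), the class RecoveryGrid and the function name are assumptions for illustration; the actual communication calls are not shown in the slides and are replaced here by a plain list of already received messages.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[int, int]


@dataclass
class RecoveryGrid:
    nodes: Set[int] = field(default_factory=set)
    edges: Set[Pair] = field(default_factory=set)
    connections: Set[Pair] = field(default_factory=set)


def build_recovery_grids(messages: Iterable[Dict[str, set]]) -> Tuple[RecoveryGrid, RecoveryGrid]:
    """Combine level m+1 data from healthy neighbours into (M_hat_G, M_hat_F)."""
    m_hat_g, m_hat_f = RecoveryGrid(), RecoveryGrid()
    for msg in messages:
        full, ghost = msg["full"], msg["ghost"]
        m_hat_f.nodes |= ghost   # neighbours' ghost node copies go to M_hat_F
        m_hat_g.nodes |= full    # neighbours' full node copies go to M_hat_G
        # Only the edges and connections joined to a full node are sent.
        m_hat_g.edges |= {e for e in msg["edges"] if e[0] in full or e[1] in full}
        m_hat_g.connections |= {c for c in msg["connections"]
                                if c[0] in full or c[1] in full}
    return m_hat_g, m_hat_f
```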
Ghost vs full nodes
In our fault recovery routine the process of bisecting a triangle is exactly the same as described previously, with only one exception. Rather than relying on a set of rules to determine which processor contains a full node copy, we use the information from the neighbouring processors given in $\widehat{M}_G^{m+1}$ and $\widehat{M}_F^{m+1}$.
Missing information
The edge between Node 1 and Node 2 in Level 0 is not stored in the processor, so the refinement routine will not know that the edge needs to be bisected when building Level 1. The extra information stored in $\widehat{M}_G^{m+1}$ addresses that issue.
Figure: Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor. (Panels: Level 2, Level 1, Level 0.)
Adjust full
Figure: Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor. (Panels: Level 2, Level 1, Level 0.)
Information stored in $\widehat{M}_F$ is used to correct the full neighbour node table in processor p.
Adaptive refinement
Figure: The nodes in the healthy processors can be used to guide the refinement in the faulty processor. The square represents a node that exists in a healthy processor. If that node sits within a triangle in the faulty processor, that triangle must be bisected until the node is added to the faulty processor.
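The idea of bisecting until a known node appears can be shown with a one-dimensional analogue, where intervals stand in for triangles. This is a simplification chosen for brevity, not the two-dimensional routine from the talk; the function name and the use of exact fractions for coordinates are assumptions.

```python
from fractions import Fraction
from typing import List


def refine_until_present(nodes: List[Fraction], target: Fraction,
                         max_depth: int = 60) -> List[Fraction]:
    """Bisect the interval containing `target` until `target` is a node.

    `nodes` are the node coordinates on the faulty processor and `target`
    is the coordinate of a node that exists on a healthy processor.
    """
    nodes = sorted(nodes)
    if not nodes[0] <= target <= nodes[-1]:
        raise ValueError("target lies outside the faulty processor's grid")
    for _ in range(max_depth):
        if target in nodes:
            return nodes
        # Index of the first node to the right of the target.
        right = next(i for i, x in enumerate(nodes) if x > target)
        mid = (nodes[right - 1] + nodes[right]) / 2
        nodes.insert(right, mid)
    raise ValueError("target was not reached by repeated bisection")


print(refine_until_present([Fraction(0), Fraction(1)], Fraction(3, 8)))
```

Starting from the interval [0, 1], the node at 3/8 is reached after three bisections, which add the nodes 1/2, 1/4 and 3/8 in turn.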
Results
With non-adaptive refinement we are able to reconstruct the original grid. Once the system of equations has been built, the computations can continue as usual and the results have the same accuracy that would have been achieved if no fault had occurred.
With adaptive refinement we do not fully reconstruct the grid. We do, however, reconstruct enough that the computations may continue, albeit with the final result having a reduced degree of accuracy.
Non-adaptive 1D
The model problem used in this case is $u_{xx} = -4\pi \sin(2\pi x)$ on the line $0 \le x \le 1$. The grid was divided amongst three processors and six levels of non-adaptive refinement were carried out.
Figure: Example multilevel 1D grid recovered after fault simulation. The nodes joined by a line are the full nodes, while the nodes enclosed in a circle are the ghost nodes. (Plot title: 1D multilevel grid reconstruction; axes: x against Level.)
Non-adaptive 1D
Figure: Solution and error of 1D model problem with fault simulation. (Plot title: 1D model problem - fault; panels: solution and error against x, with the regions of Proc. 0, Proc. 1 and Proc. 2 marked.)
Non-adaptive 1D
We used the same model problem, $u_{xx} = -4\pi \sin(2\pi x)$, and increased the grid size by using eighteen levels of refinement, giving 1048577 nodes on the finest level. Runs were carried out on 1 to 32 processors. In all cases the original grid is recovered in the (simulated) faulty processor. To ensure that the original grid is indeed being recovered we built and solved the system of equations and then calculated and checked the error norms of the solution. The expected convergence rate was observed even when a fault occurred.
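The kind of check described above can be sketched on a single processor. The sketch swaps in a second-order finite difference discretisation and a direct tridiagonal solve in place of the finite element multigrid code used in the talk, and it assumes zero Dirichlet boundary conditions, under which $u = \sin(2\pi x)/\pi$ satisfies the model equation. All function names are hypothetical.

```python
import math
from typing import List, Tuple


def solve_model_problem(n: int) -> Tuple[List[float], List[float]]:
    """Solve u_xx = -4*pi*sin(2*pi*x), u(0) = u(1) = 0 (assumed), on n intervals."""
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    m = n - 1
    rhs = [h * h * (-4.0 * math.pi * math.sin(2.0 * math.pi * x)) for x in xs[1:-1]]
    # Thomas algorithm for the tridiagonal system with stencil (1, -2, 1).
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = 1.0 / -2.0, rhs[0] / -2.0
    for i in range(1, m):
        denom = -2.0 - c[i - 1]
        c[i] = 1.0 / denom
        d[i] = (rhs[i] - d[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d[i] - c[i] * u[i + 1]
    return xs, [0.0] + u + [0.0]


def error_norms(n: int) -> Tuple[float, float]:
    """Maximum norm and discrete l2 norm of the error on n intervals."""
    xs, u = solve_model_problem(n)
    err = [ui - math.sin(2.0 * math.pi * x) / math.pi for x, ui in zip(xs, u)]
    return max(abs(e) for e in err), math.sqrt(sum(e * e for e in err) / n)


def observed_orders(n: int) -> Tuple[float, float]:
    """Convergence orders estimated from the grids with n and 2n intervals."""
    max_c, l2_c = error_norms(n)
    max_f, l2_f = error_norms(2 * n)
    return math.log2(max_c / max_f), math.log2(l2_c / l2_f)
```

For this discretisation both estimated orders should be close to 2; a recovered grid that differed from the original would show up as a change in these norms.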
Non-adaptive 2D
Figure: Example 2D grid divided over three processors (before fault simulation).
Non-adaptive 2D
Figure: Example 2D grid recovered after fault simulation.
Non-adaptive 2D
To test the recovery process in two dimensions we use the model problem $\Delta u = \sin(\pi x)\sin(\pi y)$ on the square domain $0 \le x, y \le 1$. The refinement routine was used to construct 10 levels of grids, with the finest grid consisting of 1050625 nodes. Runs were carried out on 1 to 32 processors. The error norm was again checked to ensure that the solution is the same irrespective of whether a fault occurred.
Adaptive 1D
With adaptive refinement the original grid is not fully reconstructed. This is as expected since the recovery procedure only ensures that the communication pattern is recovered.
Adaptive 1D
Laplace's equation with Dirichlet boundary conditions where the exact solution is given by $\exp(-50(0.5 - x)^2)$.
Figure: Solution and error of 1D model problem on adaptive grid without fault simulation. (Plot title: 1D adaptive model problem; panels: solution and error against x, with the regions of Proc. 0, Proc. 1 and Proc. 2 marked.)
Adaptive 1D
Figure: Solution and error of 1D model problem on adaptive grid with fault simulation. (Plot title: 1D adaptive model problem - fault; panels: solution and error against x, with the regions of Proc. 0, Proc. 1 and Proc. 2 marked.)
Adaptive 1D
Figure: Example multilevel 1D grid after two levels of uniform refinement followed by four levels of adaptive refinement. (Plot title: 1D adaptive grid; axes: x against Level.)
Adaptive 1D
The grid that is recovered after a simulated fault. Clearly the original grid is not fully recovered.
Figure: Example multilevel 1D adaptive grid recovered after fault simulation. The nodes joined by a line are the full nodes, while the nodes enclosed in a circle are the ghost nodes. (Plot title: 1D adaptive grid reconstruction; axes: x against Level.)
Adaptive 2D
L-shaped domain, Poisson equation with zero Dirichlet boundary conditions.
Figure: Example 2D adaptively refined grid distributed over three processors.
Adaptive 2D
Figure: A close-up of the part of the grid stored in Processor 0.
Adaptive 2D
Figure: The part of the grid stored in Processor 0 recovered after fault simulation.
Adaptive 2D
Figure: Example multilevel 1D grid recovered after fault simulation.
Adaptive 2D
We ran a test problem consisting of two levels of uniform refinement followed by thirteen levels of adaptive refinement, resulting in a fine grid with 176081 nodes. The tests were carried out on 1 to 32 processors. After each fault recovery the solver was called to ensure the grid had been recovered correctly.
Recover missing data
We present some initial attempts to recover the missing data. The fault recovery routine reestablishes the communication pattern, so the adaptive refinement routine can be called again to fill in the region in the interior of the faulty domain. Currently we apply the refinement routine to the whole domain, but it should be possible to modify the refinement routine to work only on the faulty processor.
Error at different fault scenarios
Figure: The maximum and discrete $l_2$ norm of the error for the 1D model problem. (Axes: error against number of nodes; curves: max and L2 norms for the no fault, fault and recover cases, with $O(h)$ and $O(h^2)$ reference lines.)
MG convergence
Figure: The discrete $l_2$ norm of the residual for the 1D model problem. (Plot title: MG Convergence; axes: residual norm against iteration; curves: no fault, fault, refine, coarse.)
Conclusion
We presented an algorithm-based fault recovery routine for adaptively refined multigrids. The algorithm does not fully recover the initial grid; rather, it recovers enough of the data structures to ensure that the communication pattern has been reestablished. Once the data structures are again consistent the computations may continue, but potentially with reduced accuracy. Applying additional refinement routines can recover lost information.