Adaptive refinement recovery after fault simulation


Linda Stals

Mathematical Sciences Institute, Australian National University, Canberra ACT 0200, Australia. [email protected]

February 11, 2016

Resilience

With the advent of high performance machines containing an ever larger number of processors and system components, the chance of a fault occurring increases. Achieving resilience is expensive since it inevitably requires redundancy, and thus more system resources and additional energy. Alternatives that exploit specific features of the algorithms may offer substantial savings.


Multilevel adaptive grids

We study parallel algorithms with automatic adaptive mesh refinement. Thus the grid structure itself requires nontrivial distributed and dynamic data structures that store a hierarchy of locally refined finite element meshes. In case of a fault, not only is the state of the solution within the iterative process lost in a subdomain, but so is the information about the adaptively refined mesh structures themselves. The recovery process must take into account both the intra-grid data dependencies and the inter-grid dependencies.


Parallel implementation

[Figure: example grid with numbered nodes distributed across Processor 1 and Processor 2.]

Inter-grid connections

Observe that the algebraic connections include the inter-grid connections defined by the interpolation and restriction operators.

Figure (Level 0, Level 1 and Level 2): The ghost nodes are used to complete the intra-grid and inter-grid connections. The full nodes are drawn as dark circles while the ghost nodes are drawn as open circles.


Inter-grid connections

The inter-grid connections add an extra degree of complexity to the data dependencies. For example, when a node is moved to another processor, the neighbour node list must be updated on the current grid level as well as on the previous grid level and the next grid level.

The use of ghost nodes as a communication buffer or as a way of storing updates from neighbouring processors is standard practice. However, we have extended their application. For example, during refinement the communication pattern has to be updated when new nodes are added, and we will show that by exploiting the relationship between the ghost nodes and full nodes the communication pattern may be updated independently across the processors.

Unfortunately, it will turn out that this is not true during the fault recovery process. The load balancing routine breaks many of the assumptions or rules that are relied upon during the standard refinement procedure.

Triangle subdivision

There are two ways to subdivide a triangle: bisection or quadrisection. We will use the bisection method as it lends itself more easily to adaptive refinement and the general ideas extend more readily to tetrahedral grids. The grids are refined by using the newest node bisection method. In this method the triangles are bisected along the edges that sit opposite the newest nodes. It can be shown (Mitchell references a proof by Sewell) that if the angles in the initial triangulation are bounded away from 0 and π then the angles in the refined grid will also be bounded away from 0 and π. Indeed, only a finite number of similar shapes arise.
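The bisection rule can be illustrated with a minimal serial sketch. The representation is an assumption made for illustration only: a triangle is stored as a tuple `(a, b, peak)` in which `peak` is the newest node and `a`-`b` is the base edge opposite it. The helper names `bisect`, `refine`, `area` and `shape_key` are hypothetical and are not taken from the author's code.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
# Assumed layout: (a, b, peak). `peak` is the newest node and the
# edge a-b opposite the peak is the base edge.
Triangle = Tuple[Point, Point, Point]


def bisect(tri: Triangle) -> Tuple[Triangle, Triangle]:
    """Bisect along the base edge; the midpoint becomes the newest node."""
    a, b, peak = tri
    m = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    # Each child keeps one end of the old base edge and the old peak.
    # Its own base edge is the edge opposite the new node m.
    return (a, peak, m), (peak, b, m)


def refine(tris: List[Triangle], levels: int) -> List[Triangle]:
    """Apply `levels` rounds of uniform newest node bisection."""
    for _ in range(levels):
        tris = [child for t in tris for child in bisect(t)]
    return tris


def area(tri: Triangle) -> float:
    (x1, y1), (x2, y2), (x3, y3) = tri
    return abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)) / 2.0


def shape_key(tri: Triangle) -> Tuple[float, ...]:
    """Sorted side lengths scaled by the longest side (a similarity signature)."""
    a, b, c = tri
    sides = sorted([math.dist(a, b), math.dist(b, c), math.dist(c, a)])
    return tuple(round(s / sides[-1], 9) for s in sides)


if __name__ == "__main__":
    t0 = ((0.0, 0.0), (1.0, 0.0), (0.25, 0.75))
    shapes = {shape_key(t) for k in range(7) for t in refine([t0], k)}
    print(len(refine([t0], 6)), "triangles,", len(shapes), "distinct shapes")
```

Counting the distinct values of `shape_key` over several levels is one quick way to see that only a small, fixed set of shapes appears, which is the property the angle bound relies on.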


Newest node bisection

Figure (edges labelled B): Initial Triangulation. The outlined circles represent the newest nodes.


Parallel Implementation

The parallel implementation of this method is equivalent to the steps outlined above except that it must be extended to handle the addition of new nodes. Note that the grid sitting on processor p is given by

$M_p^m = M_p^m\{F_p^m, G_p^m, E_p^m, C_p^m, Q_p^m\}$,

where $F_p^m$ is the set of full nodes, $G_p^m$ is the set of ghost nodes, $E_p^m$ is the set of edges, $C_p^m$ is the set of algebraic connections and $Q_p^m$ denotes the neighbour node tables associated with processor p.
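One way to picture the five components is as a small container per processor and grid level. The sketch below is an assumption about layout, not the author's data structure: nodes are integer identifiers, edges and connections are unordered pairs, and the neighbour node table is guessed to map a node to the ranks of the other processors holding a copy. The class name `ProcessorGrid` and the method `is_consistent` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set

NodeId = int
Pair = FrozenSet[NodeId]  # unordered pair of node identifiers


@dataclass
class ProcessorGrid:
    """Hypothetical container for the part of one grid level held by processor p."""
    level: int
    full: Set[NodeId] = field(default_factory=set)         # full nodes F
    ghost: Set[NodeId] = field(default_factory=set)        # ghost nodes G
    edges: Set[Pair] = field(default_factory=set)          # edges E
    connections: Set[Pair] = field(default_factory=set)    # algebraic connections C
    # Neighbour node tables Q. Assumed form: node -> ranks of processors
    # that hold another copy of that node.
    neighbours: Dict[NodeId, Set[int]] = field(default_factory=dict)

    def is_consistent(self) -> bool:
        """Basic sanity checks on the stored sets."""
        stored = self.full | self.ghost
        return (
            not (self.full & self.ghost)                    # a node is full or ghost, not both
            and all(e <= stored for e in self.edges)        # edge endpoints are stored
            and all(c <= stored for c in self.connections)  # connection endpoints are stored
            and set(self.neighbours) <= stored              # table only lists stored nodes
        )
```

A check such as `is_consistent` is the kind of invariant that has to hold again once the data structures have been rebuilt after a fault.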


New nodes

If both $N_i$ and $N_j$ are ghost nodes we do not know if the processor contains enough of the grid to complete all of the connections. Consequently $N_d$ is assigned as a ghost node.

If $N_i$ and $N_j$ belong to the set of full nodes for processor p then the new node must be added to the full node table.

If $N_i \in F_p$ and $N_j \in F_q$ where $p \neq q$, then $N_d$ can be added as a full node to either p or q. The midpoint, $N_d$, is then added to the processor that contains the smallest number of full nodes according to the population table. If processors p and q have the same number of full nodes according to the population table then the global I.D. is used to decide.
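The three cases can be sketched as a single decision function evaluated locally on processor p. The function name `classify_midpoint` and its arguments are hypothetical. In particular, the slides only say that the global I.D. is used when the populations are equal, so the parity rule in the tie-break below is an invented placeholder and not the author's rule.

```python
from typing import Dict


def classify_midpoint(p: int, owner_i: int, owner_j: int,
                      population: Dict[int, int], global_id: int) -> str:
    """Return "full" or "ghost" for the new midpoint as seen by processor p.

    owner_i and owner_j are the ranks holding the full copies of the two
    end points; population maps a rank to its number of full nodes.
    """
    if owner_i != p and owner_j != p:
        return "ghost"            # both end points are ghost nodes on p
    if owner_i == p and owner_j == p:
        return "full"             # both end points are full nodes on p
    q = owner_j if owner_i == p else owner_i
    if population[p] != population[q]:
        winner = p if population[p] < population[q] else q
    else:
        # ASSUMPTION: placeholder tie-break based on the parity of the
        # global I.D.; the source does not state the actual rule.
        winner = min(p, q) if global_id % 2 == 0 else max(p, q)
    return "full" if winner == p else "ghost"
```

Because both processors evaluate the same rule on the same population table, exactly one of them stores the midpoint as a full node, and no message needs to be exchanged to agree on it.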


Parallel refinement

The program loops through the edges table and bisects the triangles along the base edges. We can bisect the triangles independently across the processors.

Figure (Processor 1 and Processor 2): The base edges, marked by a B, show which triangles need to be bisected. The full nodes are dark circles and the ghost nodes are open circles.

Trim

Thus, after each level of refinement a trim routine is called to prune any ghost nodes no longer connected to a full node and to remove any full nodes from $Q^{m+1}_{F_p}$ that are no longer connected to a ghost node. Care must be taken to check both the intra- and inter-grid connections.
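A minimal sketch of such a trim step is given below. It assumes that the intra-grid and inter-grid connections have already been merged into one set of unordered node pairs and that the full neighbour node table is a dictionary keyed by node. The names `linked` and `trim` are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Pair = FrozenSet[int]


def linked(node: int, others: Set[int], connections: Set[Pair]) -> bool:
    """True if `node` shares a connection with at least one node in `others`."""
    return any(node in c and (c - {node}) & others for c in connections)


def trim(full: Set[int], ghost: Set[int], connections: Set[Pair],
         full_table: Dict[int, Set[int]]
         ) -> Tuple[Set[int], Set[Pair], Dict[int, Set[int]]]:
    """Prune unused ghost nodes and stale entries of the full node table.

    `connections` is assumed to hold both the intra-grid and the
    inter-grid pairs, so that both kinds are checked.
    """
    # Keep only ghost nodes that are still joined to a full node.
    kept_ghost = {g for g in ghost if linked(g, full, connections)}
    # Drop connections that refer to a node which is no longer stored.
    kept_conn = {c for c in connections if c <= (full | kept_ghost)}
    # Keep only full nodes that are still joined to a surviving ghost node.
    kept_table = {n: procs for n, procs in full_table.items()
                  if n in full and linked(n, kept_ghost, kept_conn)}
    return kept_ghost, kept_conn, kept_table
```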


Interface base edges

Figure: Example triangulation with interface-base edges I* and base edges B*. a) The base edge B7 should be refined before the interface-base edge I1. b) Result of bisecting base edge B7. Note that the interface-base edge I1 has been updated to a base edge B1. c) The edge B1 is now bisected to give the final grid.


Split edges

To determine the edges that must be bisected we use a recursive routine that follows the interface-base edges down the refinement levels until it reaches a base edge.

Figure (edges I1 to I4 and B1): Follow the interface-base edges down the coarse triangles until a base edge is found.
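The recursion described above can be sketched as follows. The dictionary `blocked_by`, which maps an interface-base edge to the edge that must be bisected before it, is a hypothetical stand-in for the information held in the edge tables, and the function name `bisection_order` is illustrative only. The sketch assumes that every chain ends at a base edge, so no cycle detection is included.

```python
from typing import Dict, List, Optional


def bisection_order(edge: str, blocked_by: Dict[str, str],
                    order: Optional[List[str]] = None) -> List[str]:
    """Return the edges to bisect, in order, so that `edge` can be bisected.

    An edge that does not appear as a key of `blocked_by` is treated as a
    base edge and can be bisected straight away.
    """
    if order is None:
        order = []
    if edge in order:
        return order                      # already scheduled
    blocker = blocked_by.get(edge)
    if blocker is not None:
        # Follow the chain down first; the blocker is bisected before `edge`.
        bisection_order(blocker, blocked_by, order)
    order.append(edge)
    return order


if __name__ == "__main__":
    # One-step case from the interface base edges slide: B7 before I1.
    print(bisection_order("I1", {"I1": "B7"}))
```

When a chain crosses a processor boundary, the same ordering constraint has to be honoured by the neighbouring processor, which is why some edges on one processor must be bisected before edges on another.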


Parallel implementation


Figure: The edges B1, I3 and I2 in Processor 2 need to be bisected before edges I2 and I1 in Processor 1.


Load balancing

After refinement, particularly after adaptive refinement, we may find that the number of nodes per processor is poorly balanced. Consequently the code uses a load balancing routine to redistribute the load. The load balancing routine works solely on the node-edge table; it does not take the shape of the finite elements into account. If a full node is moved from one processor to another on the finest grid level, then any corresponding node on any of the coarser grids will also be moved. In other words, the load balancing routine ensures that the full nodes on a given processor will be nested. The same is not necessarily true for the ghost nodes.
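The nesting property can be made concrete with a small sketch. It assumes that a node keeps the same identifier on every level on which it exists and that ownership is stored as one dictionary per level, ordered from coarsest to finest. The names `move_full_node` and `is_nested` are hypothetical.

```python
from typing import Dict, List

# owners[k][node] = rank of the processor holding the full copy of
# `node` on level k (level 0 is the coarsest).
Owners = List[Dict[int, int]]


def move_full_node(owners: Owners, node: int, dest: int) -> None:
    """Move a full node to processor `dest` on every level where it exists."""
    for level_owners in owners:
        if node in level_owners:
            level_owners[node] = dest


def is_nested(owners: Owners, proc: int) -> bool:
    """True if the full nodes of `proc` on each level also belong to it on the next finer level."""
    per_level = [{n for n, p in lvl.items() if p == proc} for lvl in owners]
    return all(coarse <= fine for coarse, fine in zip(per_level, per_level[1:]))
```

Updating only the finest level would leave a coarse copy of the node behind on the old processor, and `is_nested` would then report that the full nodes are no longer nested.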



Grid reconstruction

We next develop techniques to reconstruct the grid levels after a fault has occurred. We simulate a fault by removing all of the grid levels from a given processor, except the coarsest level. In all of the approaches we discuss here we assume that the coarsest grid can be recovered (e.g. read in again from file).


Load balancing

The biggest challenge is taking into account the fact that the grids are being reconstructed after load balancing.

Figure (Level 0, Level 1 and Level 2): Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor.


Neighbouring grids

Unlike the original refinement routine, communication is needed to recover the grids. The purpose of the communication calls is to build two grids: $\widehat{M}_G^{m+1}$ and $\widehat{M}_F^{m+1}$.

Neighbouring, healthy, processors send copies of their ghost nodes to be stored in $\widehat{M}_F^{m+1}$.

Neighbouring, healthy, processors send copies of their full nodes to be stored in $\widehat{M}_G^{m+1}$.

Neighbouring processors also send the edges and connections joined to the full nodes to be stored in $\widehat{M}_G^{m+1}$.
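The exchange can be mimicked serially to show which piece of information ends up in which grid. Everything about the layout below is an assumption: each healthy processor is represented by a dictionary with its full nodes, its ghost nodes (mapped to the owning rank) and its edges, and only the ghost nodes owned by the faulty processor, together with the full nodes joined to them, are sent. The slides do not spell out this filtering, and the function name `build_recovery_grids` is hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Pair = FrozenSet[int]


def build_recovery_grids(faulty: int, healthy: Dict[int, dict]
                         ) -> Tuple[Set[int], Set[int], Set[Pair]]:
    """Collect the node sets of the two hatted grids and the edges sent with them.

    healthy[rank] is assumed to look like
    {"full": set of nodes, "ghost": {node: owner rank}, "edges": set of pairs}.
    """
    hat_f_nodes: Set[int] = set()   # nodes for the hatted F grid
    hat_g_nodes: Set[int] = set()   # nodes for the hatted G grid
    hat_g_edges: Set[Pair] = set()  # edges stored alongside the hatted G grid
    for grid in healthy.values():
        # Ghost copies on a neighbour that belong to the faulty processor
        # are full nodes of the faulty processor.
        sent_ghosts = {n for n, owner in grid["ghost"].items() if owner == faulty}
        hat_f_nodes |= sent_ghosts
        # Full nodes of the neighbour joined to those ghosts become ghost
        # nodes of the faulty processor, together with the joining edges.
        for edge in grid["edges"]:
            touching_full = edge & grid["full"]
            if edge & sent_ghosts and touching_full:
                hat_g_nodes |= touching_full
                hat_g_edges.add(edge)
    return hat_f_nodes, hat_g_nodes, hat_g_edges
```

In a distributed code each healthy processor would assemble its own contribution and send it to the faulty processor; the loop over `healthy` stands in for those messages.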


Ghost vs full nodes

In our fault recovery routine the process of bisecting a triangle is exactly the same as described previously with only one exception. Rather than relying on a set of rules to determine which processor contains a full node copy, we use the information from the neighbouring processors given in $\widehat{M}_G^{m+1}$ and $\widehat{M}_F^{m+1}$.


Missing information

The edge between Node 1 and Node 2 in Level 0 is not stored in the processor, so the refinement routine will not know that the edge needs to be bisected when building Level 1. The extra information stored in $\widehat{M}_G^{m+1}$ addresses that issue.

Figure (Level 0, Level 1 and Level 2): Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor.

Adjust full

Figure (Level 0, Level 1 and Level 2): Example grid refinement. The filled blue circles represent full nodes, the open blue circles are ghost nodes and the green squares represent parts of the grid that will not be stored in the processor.

Information stored in $\widehat{M}_F$ is used to correct the full neighbour node table in processor p.

Adaptive refinement

Figure: The nodes in the healthy processors can be used to guide the refinement in the faulty processor. The square represents a node that exists in a healthy processor. If that node sits within a triangle in the faulty processor, that triangle must be bisected until the node is added to the faulty processor.
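The idea in the caption can be sketched for a single triangle. The code assumes the same `(a, b, peak)` layout as newest node bisection, uses exact comparison of coordinates (adequate for the dyadic midpoints produced by bisection), and ignores the compatibility bisections that a real conforming grid would also require. The names `contains`, `bisect` and `refine_towards` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]  # assumed layout: (a, b, peak), base edge a-b


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def contains(tri: Triangle, pt: Point, eps: float = 1e-12) -> bool:
    """True if pt lies inside the triangle or on its boundary."""
    a, b, c = tri
    d = (cross(a, b, pt), cross(b, c, pt), cross(c, a, pt))
    return not (min(d) < -eps and max(d) > eps)


def bisect(tri: Triangle) -> Tuple[Triangle, Triangle]:
    a, b, peak = tri
    m = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    return (a, peak, m), (peak, b, m)


def refine_towards(tri: Triangle, target: Point, max_depth: int = 20) -> List[Triangle]:
    """Bisect every triangle that contains `target` until `target` is a vertex."""
    if target in tri or max_depth == 0 or not contains(tri, target):
        return [tri]
    left, right = bisect(tri)
    return (refine_towards(left, target, max_depth - 1)
            + refine_towards(right, target, max_depth - 1))


if __name__ == "__main__":
    coarse = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
    for t in refine_towards(coarse, (0.5, 0.5)):
        print(t)
```

The `max_depth` guard is only a safety net for a target that bisection can never reach; a node taken from a healthy processor was itself created by bisection, so the recursion is expected to stop once that node has been added.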


Results

With non-adaptive refinement we are able to reconstruct the original grid. Once the system of equations has been built, the computations can continue as usual and the results have the same accuracy that would have been achieved if no fault had occurred. With adaptive refinement we do not fully reconstruct the grid. We do however reconstruct enough so that the computations may continue, albeit with the final result having a reduced degree of accuracy.


Non-adaptive 1D

The model problem used in this case is $u_{xx} = -4\pi \sin(2\pi x)$ on the line $0 \le x \le 1$. The grid was divided amongst three processors and six levels of non-adaptive refinement were carried out.

[Plot: 1D multilevel grid reconstruction; axes: Level against x.]

Figure: Example multilevel 1D grid recovered after fault simulation. The nodes joined by a line are the full nodes, while the nodes enclosed in a circle are the ghost nodes.


Non-adaptive 1D

[Plot: 1D model problem - fault. Two panels, solution and error, each with regions labelled Proc. 0, Proc. 1 and Proc. 2.]

Figure: Solution and error of 1D model problem with fault simulation.


Non-adaptive 1D

We again used the model problem $u_{xx} = -4\pi \sin(2\pi x)$ and increased the grid size by using eighteen levels of refinement, giving 1048577 nodes on the finest level. Runs were carried out on 1 to 32 processors. In all cases the original grid is recovered in the (simulated) faulty processor. To ensure that the original grid is indeed being recovered we built and solved the system of equations and then calculated and checked the error norms of the solution. The expected convergence rate was observed even when a fault occurred.
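A convergence check of this kind can be sketched on a single processor. The code below is a stand-in and not the author's finite element solver: it uses a second-order finite difference discretisation on a uniform grid, and it assumes homogeneous Dirichlet conditions $u(0) = u(1) = 0$, for which the equation as written has the exact solution $\sin(2\pi x)/\pi$. The function names are hypothetical.

```python
import math
from typing import List


def solve_model_problem(n: int) -> List[float]:
    """Solve u_xx = -4*pi*sin(2*pi*x), u(0) = u(1) = 0, on n equal intervals (n >= 3)."""
    h = 1.0 / n
    size = n - 1  # number of interior nodes
    rhs = [h * h * (-4.0 * math.pi * math.sin(2.0 * math.pi * i * h))
           for i in range(1, n)]
    # Thomas algorithm for the tridiagonal system with stencil (1, -2, 1).
    c_prime = [0.0] * size
    d_prime = [0.0] * size
    c_prime[0] = 1.0 / -2.0
    d_prime[0] = rhs[0] / -2.0
    for i in range(1, size):
        denom = -2.0 - c_prime[i - 1]
        c_prime[i] = 1.0 / denom
        d_prime[i] = (rhs[i] - d_prime[i - 1]) / denom
    u = [0.0] * size
    u[-1] = d_prime[-1]
    for i in range(size - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return [0.0] + u + [0.0]


def max_error(n: int) -> float:
    """Maximum nodal error against the assumed exact solution sin(2*pi*x)/pi."""
    u = solve_model_problem(n)
    return max(abs(u[i] - math.sin(2.0 * math.pi * i / n) / math.pi)
               for i in range(n + 1))


def observed_rate(n: int) -> float:
    """Estimate the convergence order from the errors on n and 2n intervals."""
    return math.log2(max_error(n) / max_error(2 * n))


if __name__ == "__main__":
    print("observed rate:", observed_rate(64))
```

Comparing the observed rate with the expected order, once without a fault and once after recovery, is the kind of test the slide describes.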


Non-adaptive 2D

Figure: Example 2D grid divided over three processors (before fault simulation).

Non-adaptive 2D

Figure: Example 2D grid recovered after fault simulation.

Non-adaptive 2D

To test the recovery process in two dimensions we used the model problem $\Delta u = \sin(\pi x)\sin(\pi y)$ on the square domain $0 \le x, y \le 1$. The refinement routine was used to construct 10 levels of grids, with the finest grid consisting of 1050625 nodes. Runs were carried out on 1 to 32 processors. The error norm was again checked to ensure that the solution is the same irrespective of whether a fault occurred.


Adaptive 1D

With adaptive refinement the original grid is not fully reconstructed. This is as expected since the recovery procedure only ensures that the communication pattern is recovered.


Adaptive 1D

Laplace’s equation with Dirichlet boundary conditions where the exact solution is given by $\exp(-50(0.5 - x)^2)$.

[Plot: 1D adaptive model problem. Two panels, solution and error, each with regions labelled Proc. 0, Proc. 1 and Proc. 2.]

Figure: Solution and error of 1D model problem on adaptive grid without fault simulation.


Adaptive 1D

[Plot: 1D adaptive model problem - fault. Two panels, solution and error, each with regions labelled Proc. 0, Proc. 1 and Proc. 2.]

Figure: Solution and error of 1D model problem on adaptive grid with fault simulation.


Adaptive 1D

[Plot: 1D adaptive grid; axes: Level against x.]

Figure: Example multilevel 1D grid after two levels of uniform refinement followed by four levels of adaptive refinement.


Adaptive 1D

This is the grid that is recovered after a simulated fault. Clearly the original grid is not fully recovered.

[Plot: 1D adaptive grid reconstruction; axes: Level against x.]

Figure: Example multilevel 1D adaptive grid recovered after fault simulation. The nodes joined by a line are the full nodes, while the nodes enclosed in a circle are the ghost nodes


Adaptive 2D

L-shaped domain, Poisson equation with zero Dirichlet boundary conditions.

Figure: Example 2D adaptively refined grid distributed over three processors.

Adaptive 2D

Figure: A close-up of the part of the grid stored in Processor 0.

Adaptive 2D

Figure: The part of the grid stored in Processor 0 recovered after fault simulation.

Adaptive 2D

Figure: Example multilevel 1D grid recovered after fault simulation.

Adaptive 2D

We ran a test problem consisting of two levels of uniform refinement followed by thirteen levels of adaptive refinement, resulting in a fine grid with 176081 nodes. The tests were carried out on 1 to 32 processors. After each fault recovery the solver was called to ensure the grid had been recovered correctly.


Recover missing data

We present some initial attempts to recover the missing data. The fault recovery routine reestablishes the communication pattern, so we can call the adaptive refinement routine again to fill in the region in the interior of the faulty domain. Currently we apply the refinement routine to the whole domain, but it should be possible to modify the refinement routine to only work on the faulty processor.


Error at different fault scenarios

[Plot: error against number of nodes. Legend: max - no fault, L2 - no fault, max - fault, L2 - fault, max - recover, L2 - recover, O(h), O(h^2).]

Figure: The maximum and discrete l2 norm of the error for the 1D model problem.


MG convergence

[Plot: MG convergence; residual norm against iteration. Legend: no fault, fault, refine, coarse.]

Figure: The discrete l2 norm of the residual for the 1D model problem.

Conclusion

We presented an algorithm-based fault recovery routine for adaptively refined multigrids. The algorithm does not fully recover the initial grid; rather, it recovers enough of the data structures to ensure that the communication pattern has been reestablished. Once the data structures are again consistent the computations may continue, but potentially with reduced accuracy. Applying additional refinement routines can recover lost information.

