Computer-Aided Civil and Infrastructure Engineering 17 (2002) 423–438

Integration of General Sparse Matrix and Parallel Computing Technologies for Large-Scale Structural Analysis

Shang-Hsien Hsieh,* Yuan-Sen Yang & Po-Yao Hsu

Department of Civil Engineering, National Taiwan University, Taipei 10617, Taiwan, Republic of China

Abstract: Both general sparse matrix and parallel computing technologies are integrated in this study for the finite element solution of large-scale structural problems in a PC cluster environment. The general sparse matrix technique is first employed to reduce the execution time and storage requirements for solving the simultaneous equilibrium equations in finite element analysis. To further reduce the time required for large-scale structural analyses, two parallel processing approaches for sharing computational workloads among collaborating processors are then investigated. One approach adopts a publicly available parallel equation solver, called SPOOLES, to directly solve the sparse finite element equations, while the other employs a parallel substructure method for the finite element solution. This work focuses more on integrating the general sparse matrix technique and the parallel substructure method for large-scale finite element solutions. Additionally, numerical studies have been conducted on several large-scale structural analyses using a PC cluster to investigate the effectiveness of the general sparse matrix and parallel computing technologies in reducing the time and storage requirements of large-scale finite element structural analyses.

1 INTRODUCTION

Large-scale structural analyses typically demand considerable execution time and memory. Two approaches can be employed to increase the scale and complexity of the engineering problems that can be solved numerically. One approach is to develop better solution strategies, which are more computationally efficient and require less memory, allowing larger-scale engineering problems to be solved with limited computing resources.

*To whom correspondence should be addressed. E-mail: [email protected].

The alternative approach is to fully exploit advanced computer hardware technology and utilize a more powerful computer system (with higher speed and larger memory capacity) to analyze larger engineering problems. Obviously, these two approaches are not mutually exclusive and can be applied simultaneously to solve even larger engineering problems.

The approach of adopting better solution strategies has generally focused on the strategy for solving the equilibrium equations in matrix form (e.g., [K]{u} = {f}), which is essential in both linear and nonlinear, static and implicit dynamic finite element analysis. The direct method (namely, factorizing the matrix [K] into the form [L][L]^T or [L][D][L]^T) is usually preferred over the iterative method for its robustness and reliability on ill-conditioned problems, which may occur in nonlinear structural analyses. When solving a linear system using the direct method, it is often computationally expensive (in terms of time and storage requirements) to factorize the stiffness matrix [K] in the equilibrium equations, especially when analyzing a large-scale structural problem.

To reduce the time and storage requirements for factorizing a large system matrix [K], the SKyline Matrix (SKM) technique is widely adopted by structural analysis packages. The SKM approach stores and computes only the matrix items within the skyline of the matrix. Nevertheless, a considerable number of zero items are still stored within the skyline. Therefore, to make more efficient use of computational resources, the general sparse matrix (GSM) technique was developed (George and Liu, 1981; Duff et al., 1986). This technique, which stores only the nonzero matrix items that are required in matrix factorization, will be explained in more detail in Section 3.1.
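For reference, the direct method mentioned above splits the solution of the equilibrium equations into one expensive factorization followed by inexpensive triangular solves (standard material, not specific to this paper):

$$[K]\{u\} = \{f\}, \qquad [K] = [L][D][L]^T,$$

$$[L]\{y\} = \{f\} \ \text{(forward substitution)}, \qquad [D]\{z\} = \{y\}, \qquad [L]^T\{u\} = \{z\} \ \text{(back substitution)}.$$

Because the factorization dominates the cost and can be reused for multiple right-hand sides, reducing the time and storage of factorizing [K] is the central concern of the solution strategies discussed here.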



The GSM technique is studied and employed herein for solving equilibrium equations in large-scale structural engineering problems. The storage requirements and computational efficiency of using the GSM approach to store and factorize the matrix [K] are also compared with those of the SKM approach in a series of numerical experiments.

The approach of using more powerful computers focuses on exploiting computational parallelism. Extensive research (Noor, 1987; Farhat and Roux, 1994; Adeli and Soegiarso, 1999; Adeli and Kumar, 1999; Farhat et al., 2001) has examined parallel finite element structural analysis for improving computational speed-up and efficiency on various parallel computers and for applications in structural analysis, design, and optimization. Recent progress in parallel finite element analysis has been reviewed by Adeli (2000). Two parallel computing approaches are investigated herein, and both are integrated with the GSM technique for large-scale structural analyses.

The first approach is named the GSM-based parallel equation solver method in this work. This approach employs a publicly available parallel equation solver, called SPOOLES (Ashcraft et al., 1999), to directly solve the sparse finite element equations in parallel. The second approach is named the GSM-based parallel substructure method in this work. It employs a parallel direct substructure method (Smith et al., 1996) that first performs substructure condensation in parallel and then sequentially solves the condensed set of interface system equations. This work focuses on the GSM-based parallel substructure method and investigates the GSM-based parallel equation solver method mainly for the purpose of comparison.

Most previous studies on the parallel substructure method have focused on the SKM technique. This work applies the GSM technique to both the sequential and parallel substructure methods. Issues related to the parallel efficiency of the present GSM-based parallel substructure method are thoroughly investigated. Furthermore, the efficiency of the GSM-based parallel substructure method is compared with that of the GSM-based parallel equation solver method using a number of large-scale structural problems on a PC cluster comprising up to four personal computers.

2 SOFTWARE AND HARDWARE ENVIRONMENTS

An object-oriented finite element package, called FE2000 (Yang, 2000), is used as a testbed herein to implement and experiment with the GSM and parallel computing technologies for large-scale structural analyses.

The design of FE2000 mostly follows that of FE++, an object-oriented finite element framework developed by Lu (1994). However, FE2000 simplifies the object framework of FE++ to improve efficiency. For example, FE++ uses an abstract Node class with hierarchical derived classes, while FE2000 uses a single concrete feNode class with no derived classes to handle node-related data and operations. FE2000 also simplifies the multi-level hierarchies of the Element classes in FE++ and instead uses a single-level hierarchy of feElement classes to handle element properties and associated operations. The derived classes presently available in FE2000 include feB8Element (8-noded brick solid element), feB20Element (20-noded brick solid element), feTrussElement (3D truss element), and feBCElement (3D beam-column element). Similar to the Assemblage class in FE++, FE2000 also implements the feAssemblage class to provide access interfaces to the global system of the finite element analysis, including pointers to the feNode and feElement objects, and to assemble the global stiffness matrices and force vectors. To facilitate the implementation of both sequential and parallel substructure analyses, FE2000 adopts the idea of the SuperElement class from an object-oriented parallel finite element program, called PFE++ (Mukunda et al., 1998), and derives an feSuperElement class from its feElement class to handle substructure-related data and operations. Additionally, an object-oriented MPI (Message Passing Interface Forum, 1994) based class library, called PPI++ (Hsieh and Sotelino, 1997), is used by FE2000 to handle all message-passing tasks among processors.

Originally, FE2000 used only the SKM approach for matrix computations and storage in both sequential and parallel analyses. The GSM approach has been implemented in FE2000 as an alternative for matrix operations and storage in both sequential and parallel analyses. For solving the equilibrium equations using the GSM approach, either SPARSPAK (George and Liu, 1981) or SPOOLES (Ashcraft et al., 1999) can be used in FE2000.

A Pentium II-350 based PC with 384 MB of memory is used herein for sequential analyses. Meanwhile, a PC cluster comprising four Pentium II-350 based PCs (two with 256 MB of memory and two with 128 MB of memory), running the Red Hat Linux operating system, is employed for parallel finite element analysis. The speed of the network system used in the PC cluster is 100 Mbps. The MPICH message-passing library (Gropp et al., 1996), which follows the MPI standard, is used for passing messages among concurrent processors. Additionally, all of the computer programs are compiled using the GNU g++ compiler, version 2.91.66 (Stallman, 1998).
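The class structure described above might look roughly as follows. This is an illustrative sketch only: the class names (feNode, feElement, feSuperElement, feAssemblage, and the element subclasses) come from the paper, but every member function and data member shown here is an assumption, not FE2000's actual interface.

```cpp
#include <memory>
#include <vector>

class feNode {                       // single concrete node class (no hierarchy)
public:
    double coord[3];                 // nodal coordinates (assumed layout)
    int    dofIds[6];                // global equation numbers of the 6 DOFs
};

class feElement {                    // root of the single-level element hierarchy
public:
    virtual ~feElement() = default;
    // assumed interface: each element contributes its stiffness to [K]
    virtual void formStiffness(std::vector<double>& ke) const = 0;
    std::vector<feNode*> nodes;      // element connectivity
};

class feTrussElement : public feElement {     // 3D truss element
public:
    void formStiffness(std::vector<double>& ke) const override { /* ... */ }
};

class feSuperElement : public feElement {     // a substructure treated as an element
public:
    // assumed interface: condense out the internal DOFs (see Section 4.1)
    void condense() { /* ... */ }
    void formStiffness(std::vector<double>& ke) const override { /* ... */ }
};

class feAssemblage {                 // access to the global system
public:
    std::vector<std::unique_ptr<feElement>> elements;
    std::vector<feNode> nodesList;
    void assembleGlobalStiffness() { /* loop elements, scatter ke into [K] */ }
};
```

Treating a condensed substructure as just another element (feSuperElement) is what lets the same assembly code serve both the plain and the substructured analyses.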


3 GENERAL SPARSE MATRIX TECHNIQUE

The GSM technique has been available for more than two decades, and its performance has been demonstrated to usually exceed that of the SKM technique. However, much of the existing research on parallel finite element structural analysis employs the SKM technique. The present work employs the GSM technique to solve the equilibrium equations encountered in large-scale structural engineering problems. This section briefly introduces the GSM technique, and then the SKM and GSM approaches are numerically compared.

3.1 The GSM approach

The GSM technique stores only the nonzero items, which can be arbitrarily distributed in the coefficient matrix of a linear system, [K]{u} = {f}. However, the distribution pattern of the nonzero items in matrix [K] differs from that in the factorized matrix [L]. Some zero items in matrix [K] become nonzero (also known as "fill-ins") during factorization. Therefore, the number of nonzero items in matrix [L] always exceeds that in matrix [K]. To factorize matrix [K] correctly, sufficient memory should be allocated based on the storage requirements of matrix [L], not just matrix [K]. Symbolic factorization is required in the GSM approach for predicting the distribution of nonzero items in matrix [L] before matrix [K] is numerically factorized. A simple way to perform the symbolic factorization is to explicitly simulate the process of matrix factorization (e.g., Sarma and Adeli, 1996). Owing to the availability of fast algorithms (e.g., SPARSPAK uses an implicit representation of a quotient elimination graph), symbolic factorization often costs little execution time, especially for large-scale problems, if numerical pivoting is not required (Kumar et al., 1994). The SKM approach does not require such symbolic factorization because matrix [L] and matrix [K] have the same skyline.

Similar to the SKM approach, the GSM approach also requires matrix reordering to reduce computational and storage requirements. The matrix ordering must be performed before the symbolic factorization because the distribution of the nonzero items in matrix [L] depends on the ordering of the matrix. GSM ordering attempts to reduce the number of nonzero items in matrix [L]; generally, the fewer the nonzero items, the less time and storage are required for factorizing matrix [K]. Several algorithms, such as the Minimum-Degree algorithm (George and Liu, 1981), the Nested-Dissection algorithm (George, 1973), and their modifications (Liu, 1985; Bui and Jones, 1993) or hybrid schemes (Liu, 1989; Ashcraft and Liu, 1998), are generally used for matrix ordering in the GSM approach.
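A minimal textbook-style example (not from the paper) illustrates both fill-in and why ordering matters. Consider a symmetric "arrowhead" pattern in which one degree of freedom is coupled to all of the others:

$$[K] = \begin{bmatrix} \times & \times & \times & \times \\ \times & \times & 0 & 0 \\ \times & 0 & \times & 0 \\ \times & 0 & 0 & \times \end{bmatrix}$$

Eliminating the first (fully coupled) degree of freedom first fills the entire remaining triangle of [L] with nonzeros, whereas reordering that degree of freedom last produces no fill-in at all. Fill-reducing orderings such as Minimum-Degree automatically defer such highly connected degrees of freedom.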


Active research efforts have recently sought to improve the GSM technique. For example, Karypis and Kumar (1998) studied enhanced ordering algorithms to further reduce the number of fill-ins; Ashcraft et al. (1998) investigated pivoting techniques for better numerical stability; and Ashcraft et al. (1999), Joshi et al. (1999), and Amestoy et al. (2000) implemented parallel sparse equation solvers to exploit parallel computers. Furthermore, the GSM technique has also been applied to iterative solvers, because some sparse iterative methods can be seen as incomplete matrix factorizations with iterations (Saad, 1996; Ashcraft et al., 1999).

3.2 Comparative numerical studies of the GSM and SKM approaches

Ashcraft et al. (1987) numerically compared the GSM and SKM approaches on a Cray X-MP vector supercomputer using eleven structural engineering examples from the Harwell-Boeing sparse matrix collection. The comparison showed that the GSM approach is significantly superior to the SKM approach in both computational operations and storage requirements. After using both variable-band (namely, skyline) and sparse solvers for several structural analyses on the CONVEX C220 and CRAY-2 vector computers, Poole et al. (1992) demonstrated that the skyline solver is the most effective at exploiting vector computers but has greater operational and memory requirements than the sparse solver. Yang et al. (1998) analyzed several finite element models with different characteristics on a Pentium II-233 PC and numerically compared the GSM and SKM approaches. The analysis showed that the GSM technique can significantly reduce time and memory storage in finite element analyses, especially when the finite element model is large-scale and irregularly shaped.

Both the GSM and SKM approaches are implemented in FE2000 and employed herein to solve the equilibrium equations in several structural analyses using finite element models with different characteristics. The time and storage requirements for matrix factorization are measured and compared between the GSM and SKM approaches. In the numerical studies, the measured factorization time excludes the time for matrix ordering and symbolic factorization, to allow a direct comparison of matrix factorization time between the two approaches. Nevertheless, the time costs of matrix ordering and symbolic factorization in the GSM approach are reported. Although these costs cannot simply be neglected in many cases, they are often not significant for large-scale problems requiring long matrix factorization times, and they become even less significant in nonlinear structural analyses, where they are usually performed just once at the very beginning of the analysis (though the stiffness matrix [K] may be updated and factorized several times).


The measured storage requirement refers to the memory used to store the entire matrix, including the storage of nonzero items and information relating to their distribution. A sparse matrix package called SPARSPAK (George and Liu, 1981) is used herein for matrix factorization. The SPARSPAK package provides both SKM and GSM computations for solving a linear system. The GENRCM subprogram of the SPARSPAK package, which implements the Reverse Cuthill-McKee ordering method (Liu, 1976), is used for the SKM approach. Meanwhile, the orderViaBestOfNDandMS subprogram of the SPOOLES package (Ashcraft et al., 1999), which chooses the better of the Nested-Dissection (ND) and Multi-Section (MS) ordering results (Ashcraft and Liu, 1998), is used for the GSM approach. All numerical tests were performed on a Pentium II-350 personal computer (with 384 MB of memory) running the Linux operating system.

Table 1 summarizes the results of the comparative numerical studies. R_T and R_M denote the ratios of the time and storage requirements of the GSM approach to those of the SKM approach. In all of the examples studied herein, the GSM approach is superior to the SKM approach in both time and storage requirements for matrix factorization in the finite element analyses. Below are more detailed descriptions of the comparative tests and discussions of the differences in performance between the GSM and SKM approaches.

Two finite element models of frame structures (15STORY-1 and 15STORY-2, see Figure 1) are used to test the influence of refining a finite element model on the GSM and SKM approaches. Both the 15STORY-1 and 15STORY-2 models have the same geometric shape. The 15STORY-1 model uses only one beam-column element for each of its beams and columns, while the 15STORY-2 model uses four beam-column elements for each.

Fig. 1. The 15STORY-1 model.

As listed in Table 1, the GSM approach achieves about 40% time savings and 46% memory savings compared to the SKM approach when applied to the 15STORY-1 model. Because the number of degrees of freedom of the 15STORY-2 model is nine times greater than that of the 15STORY-1 model, the time and storage requirements increase rapidly with the SKM approach (by about 56 and 20 times, respectively). However, the corresponding increases are considerably smaller for the GSM approach. Therefore, the time and storage requirements of the GSM approach are only about 2% and 6%, respectively, of those of the SKM approach for the 15STORY-2 model. These results indicate that the GSM approach has considerable advantages in dealing with refined finite element models of framed structures.

The H1-12, H5-12, and O4-12 models (as shown in Figures 2–4) are used here to investigate how the existence of appendages and/or holes in a finite element model influences the GSM and SKM approaches.

Table 1 Comparisons between the GSM and SKM approaches for time and storage requirements

                        Factorization time (sec)                  Memory requirements (MB)
Finite element model    GSM             SKM      R_T (GSM/SKM)    GSM      SKM      R_M (GSM/SKM)
15STORY-1               5.4 (0.34)*     9.4      57.4%            8.4      15.7     53.5%
15STORY-2               9.2 (1.86)      527.1    1.7%             18.5     318.3    5.8%
H1-12                   212.6 (1.57)    304.2    69.9%            76.4     162.9    46.9%
H5-12                   24.0 (1.47)     206.4    11.6%            35.7     135.5    26.3%
O4-12                   70.3 (1.49)     230.6    30.5%            51.6     142.4    36.2%
M12BD                   2.4 (1.42)      5.6      42.9%            9.5      17.8     53.4%
M12BD-2                 174.7 (15.6)    —        —                161.7    405.7    39.9%

*The number in parentheses indicates the time required for matrix ordering and symbolic factorization (in seconds).
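As a note on reading Table 1 (a derivation from the tabulated values, not new data), the percentage savings quoted in the discussion follow directly from the ratios, since a ratio R implies a saving of 100% − R. For example:

$$\text{H5-12 time saving} = 100\% - R_T = 100\% - 11.6\% = 88.4\%$$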


Fig. 2. The H1-12 model.

The three models are designed to have the same number of stories, nodes, and degrees of freedom, but their geometric shapes differ significantly. The H1-12 model has a regular mesh, the H5-12 model has a mesh with many appendages, and the O4-12 model has a mesh with holes. As listed in Table 1, the GSM approach spends about 30% less time and 53% less memory than the SKM approach on the H1-12 model. On the H5-12 and O4-12 models, the time and storage savings from using the GSM approach reach around 88% and 74%, respectively. These results indicate that the GSM approach is superior to the SKM approach in taking advantage of the appendages and/or holes present in finite element models.

Additionally, two finite element meshes (as shown in Figures 5 and 6) comprising 20-noded brick (or solid) elements are tested herein. The M12BD-2 model is a refined mesh of the M12BD model. As presented in Table 1, the GSM approach spends around 57% less time (32% if the time for matrix ordering and symbolic factorization is included) and 46% less memory than the SKM approach when applied to the M12BD model. For the M12BD-2 model, the time consumption of the SKM approach was not measured because at least 405 MB of memory are required for in-core analysis, exceeding the core memory of the computer used, and memory swapping slowed the analysis considerably. However, the GSM approach still performed well in this case, requiring 60% less memory than the SKM approach. These results demonstrate that the GSM approach also performs better than the SKM approach on finite element meshes with higher order elements (such as the 20-noded solid elements used in this case).

The comparative numerical studies herein have confirmed that the GSM approach performs significantly better than the SKM approach in terms of time and storage requirements, particularly for large-scale structural analyses on finite element meshes with appendages and holes.

Fig. 3. The H5-12 model.


Fig. 4. The O4-12 model.

4 PARALLEL FINITE ELEMENT ANALYSIS

As mentioned earlier, two parallel finite element approaches are investigated in this work. The GSM-based parallel substructure method employs the parallel processing and substructuring techniques, while the GSM-based parallel equation solver method employs the parallel processing technique in the solution of the equilibrium equations with the SPOOLES library (no substructuring). This section first describes these two approaches and their implementations in FE2000, and then discusses the similarities and differences between them.

Fig. 5. The M12BD model (Hsieh and Abel, 1995).

Fig. 6. The M12BD-2 model.


4.1 GSM-based parallel substructure method

The parallel substructure method is one of the most widespread approaches for parallel finite element computations (Noor, 1987; Adeli and Kamal, 1993). Based upon the direct method, the parallel substructure method is suitable for both linear and nonlinear structural analysis, owing to its robustness when applied to ill-conditioned problems. Furthermore, the substructure method is commonly used in finite element analyses with local nonlinearities and adaptive mesh refinement, and it allows different substructures to be studied simultaneously by different design teams.

The parallel substructure method partitions the structure into a number of nonoverlapping substructures and then assigns each substructure to a separate processor, as indicated in Figures 7a-b. Each processor then forms the substructure equilibrium equations

$$[K]_k \{u\}_k = \begin{bmatrix} K_{II} & K_{IE} \\ K_{EI} & K_{EE} \end{bmatrix}_k \begin{Bmatrix} u_I \\ u_E \end{Bmatrix}_k = \begin{Bmatrix} f_I \\ f_E \end{Bmatrix}_k \tag{1}$$

in which [K] denotes the stiffness matrix; {u} represents the displacement vector; {f} is the external force vector; subscript k denotes the kth substructure; and subscripts I and E represent the internal and interface degrees of freedom, respectively. Static condensation of the substructure interior degrees of freedom (as shown in Figure 7c) is then performed independently and concurrently within each substructure, without interprocess communication among processors, as expressed by the following two equations:

$$[\bar{K}]_k = [K_{EE}]_k - [K_{EI}]_k [K_{II}]_k^{-1} [K_{IE}]_k \tag{2}$$

$$\{\bar{f}_E\}_k = \{f_E\}_k - [K_{EI}]_k [K_{II}]_k^{-1} \{f_I\}_k \tag{3}$$

The condensed system of each substructure is then assembled to form a set of global equations for the unknowns along the substructure interfaces (as presented in Figure 7d), and the global system of equations is solved as follows:

$$[\bar{K}_{EE}]_g = \sum_{k=1}^{N_p} [P]_k [\bar{K}_{EE}]_k [P]_k^T \tag{4}$$

$$\{\bar{f}_E\}_g = \sum_{k=1}^{N_p} [P]_k \{\bar{f}_E\}_k \tag{5}$$

$$\{u_E\}_g = [\bar{K}_{EE}]_g^{-1} \{\bar{f}_E\}_g \tag{6}$$

where subscript g denotes the global system, and the matrix [P]_k represents the permutation matrix for converting the degrees of freedom of substructure k to the global system. Finally, the internal degrees of freedom of each substructure are solved concurrently within each processor as follows (as shown in Figure 7e):

$$\{u_E\}_k = [P]_k^T \{u_E\}_g \tag{7}$$

$$\{u_I\}_k = [K_{II}]_k^{-1} \left( \{f_I\}_k - [K_{IE}]_k \{u_E\}_k \right) \tag{8}$$
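A minimal dense-matrix sketch of the condensation and recovery steps (Equations 2, 3, and 8), written here with the Eigen C++ library purely for illustration; the paper's actual implementation instead modifies the sparse SPARSPAK routines, as described below:

```cpp
#include <Eigen/Dense>

// Static condensation of one substructure (Equations 2 and 3) and
// recovery of its internal DOFs (Equation 8), sketched with dense
// matrices. The block sizes (internal vs. interface) are assumed to
// follow from the substructure's DOF numbering.
struct Substructure {
    Eigen::MatrixXd KII, KIE, KEI, KEE;  // blocks of [K]_k
    Eigen::VectorXd fI, fE;              // blocks of {f}_k

    // Equations (2) and (3): condense out the internal DOFs.
    void condense(Eigen::MatrixXd& Kbar, Eigen::VectorXd& fbar) const {
        Eigen::LLT<Eigen::MatrixXd> llt(KII);   // factor [K_II]_k once
        Kbar = KEE - KEI * llt.solve(KIE);      // Eq. (2)
        fbar = fE  - KEI * llt.solve(fI);       // Eq. (3)
    }

    // Equation (8): recover internal DOFs from the interface solution.
    Eigen::VectorXd recover(const Eigen::VectorXd& uE) const {
        return Eigen::LLT<Eigen::MatrixXd>(KII).solve(fI - KIE * uE);
    }
};
```

In practice [K_II]_k is sparse and is factored only once, with the factor reused both for the condensation solves and for the later recovery of the internal degrees of freedom.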

Figure 8 shows the flowchart of the GSM-based parallel substructure method implemented in FE2000. This work does not solve the condensed set of equations associated with the interface degrees of freedom in parallel; instead, it performs sequential matrix factorization (using SPARSPAK) on a single processor. This choice was made because the sequential approach is still the fastest in FE2000, although some effort has been made to parallelize this part of the computations using a parallel equation solver, such as SPOOLES. The experiments conducted by Yang (2000) showed that sequential solutions of the condensed interface system using SPARSPAK are faster than parallel solutions using SPOOLES in most of the finite element examples studied. This is probably because the size of the condensed interface system is too small, and the condensed interface coefficient matrix too dense, for SPOOLES to achieve good parallel efficiency.

Fig. 7. Sketch of the procedures in the GSM-based parallel substructure method: (a) finite element mesh of a structure; (b) mesh partitioning; (c) substructure condensation in parallel; (d) sequential assembly and solution of the interface system; (e) solution of substructure internal degrees of freedom in parallel.


Fig. 8. Flowchart for the GSM-based parallel substructure method.

In this work, the parallel substructure method employs only up to four substructures, and the number of interface degrees of freedom is generally a very small portion of the total degrees of freedom (often below 5%). However, because the solution of the interface system may occupy a significant portion of the entire finite element computation (for example, 50% of the time for the O4-12 model), an efficient parallel implementation for solving the interface system is still being sought.

The GSM approach has already been demonstrated to be superior to the SKM approach in both time and storage requirements. Consequently, the parallel substructure method herein employs only the GSM approach. To facilitate the sparse implementation of the substructure method, a slightly extended version of the SPARSPAK library is employed. The modified decomposition algorithm developed by Han and Abel (1984) is used for the static condensation of the substructure interior degrees of freedom. Owing to the similarity between the solution of simultaneous equations and substructure condensation, the routines in SPARSPAK can be easily modified to support substructure condensation.

The static condensation of [K] and {f} can be considered a part of matrix factorization and forward substitution, respectively, while solving the internal degrees of freedom can be considered a part of backward substitution. The static condensation subprograms are thus easy to build by modifying the looping indices in the SPARSPAK library (Yang, 2000).

Substructure matrix ordering is important for enhancing the efficiency and lowering the storage requirements of the substructure method. Because substructure interior degrees of freedom must be ordered before interface ones, substructure matrix ordering is a constrained optimization problem. Liu (1989) proposed a constrained minimum degree (CMD) ordering method, which allows users to assign a set of degrees of freedom to be ordered after the others and which is an important part of the multi-section ordering method. The CMD method is thus suitable for substructure matrix ordering in the substructure method. Besides the CMD method, various alternative ordering methods can also be employed to order all of the internal and interface degrees of freedom together, and then move the interface degrees of freedom behind the internal degrees of freedom while preserving the relative order. The second approach, with the "BestOfNDandMS" ordering method, is used herein for substructure matrix ordering because it outperforms the first approach in the numerical experiments conducted by Yang (2000).


The BestOfNDandMS ordering method chooses the better result from the nested-dissection (ND) (George, 1973) and multi-section (MS) (Ashcraft and Liu, 1998) methods and is available from SPOOLES (Ashcraft et al., 1999).

Mesh partitioning of a finite element model also significantly influences the efficiency of parallel finite element computations, including parallel substructure analyses. Most mesh partitioning algorithms adopt the following two criteria to optimize their partitions for parallel analyses: (1) balance of workload among processors (often in terms of the number of elements), and (2) minimization of the overall size of the interface degrees of freedom (or nodes) among substructures, to reduce interprocess communication (Hsieh et al., 1997). Numerous partitioning packages have been made publicly available, including JOSTLE (Walshaw, 2000), METIS (Karypis and Kumar, 1998), and TOP/DOMDEC (Farhat et al., 1995). The recursive spectral two-way (RST) method (Hsieh et al., 1995) is employed herein for mesh partitioning. Based on the recursive spectral bisection method (Simon, 1991), the RST method is effective in reducing the total number of interface nodes (Hsieh et al., 1995). The RST method also partitions a finite element mesh into an arbitrary number of submeshes, each with almost the same number of elements. Furthermore, the present RST implementation uses the fast multilevel approach proposed by Barnard and Simon (1993).

4.2 GSM-based parallel equation solver method

In most finite element structural analyses, solving the equilibrium equations is typically the most computationally intensive part of the analysis. Successful parallelization of this part of the computations thus contributes significantly to shortening analysis time. Additionally, incorporating a parallel equation solver in a finite element program is probably the most direct way of enabling parallel finite element computations (Jones, 2000). Extensive research has been conducted on solving sparse linear systems in parallel (Heath et al., 1991; Gupta et al., 1997; Amestoy et al., 2000; Gupta, 2001), and several parallel equation solver packages that use the GSM technique have been made publicly available to facilitate parallel finite element analyses. For example, PSPASES (Joshi et al., 1999) provides a direct method for solving symmetric positive definite linear systems. MUMPS (Amestoy et al., 2000) can be used to solve symmetric or nonsymmetric linear systems using direct methods. SPOOLES (Ashcraft et al., 1999) supports the solution of symmetric, nonsymmetric, or Hermitian linear equations using direct or iterative methods. PETSc (Balay et al., 2000) can be used to solve symmetric, nonsymmetric, or Hermitian linear equations using iterative methods.

These packages allow researchers and engineers to easily and efficiently solve the large-scale linear systems encountered in various scientific and engineering problems. Comparisons of the performance of some parallel equation solvers have recently been conducted by Gupta (2000).

The SPOOLES parallel sparse linear equation solver has been incorporated into FE2000 for the parallel solution of the equilibrium equations herein. One of the major capabilities of SPOOLES is to solve sparse linear equations of the form [K]{u} = {f} either sequentially or in multi-threaded or message-passing parallel modes. The matrix [K] can be either real or complex, and either symmetric, Hermitian, or nonsymmetric. Pivoting (for numerical stability) and approximation techniques (for iterative methods) are optional in SPOOLES. Additionally, SPOOLES provides a collection of class objects that can be used independently to perform some of the subtasks involved in the solution of [K]{u} = {f}, for example, reordering the matrix [K] by the various methods supported in SPOOLES.

Figure 9 presents the flowchart for the GSM-based parallel equation solver method in FE2000. The stiffness matrix [K] (symmetric) is formed and assembled sequentially on one of the processors, while the equilibrium equations are solved by SPOOLES in parallel. Because this approach is a straightforward one that can easily turn a sequential finite element program into a parallel one, by simply replacing the sequential equation solver with a parallel one (in this case, SPOOLES), it is adopted in this work for comparison with the GSM-based parallel substructure method. The BestOfNDandMS ordering method is employed herein. Additionally, the pivoting option is deactivated to prevent it from influencing parallel performance and to permit a fair comparison with the GSM-based parallel substructure method, which does not perform pivoting. Notably, SPOOLES allows each processor to assemble different parts of the stiffness matrix and load vector, allowing the computations for forming the element stiffness matrices and assembling them into the global stiffness matrix [K] to be parallelized as well. Hsu (2000) has explored this possibility and has shown that better parallel speed-ups than those reported in Section 5 of this work can be achieved. Some numerical results from Hsu (2000) are quoted in Section 5 to illustrate the benefit of performing stiffness forming and assembly in parallel. A minimal sketch of the flow just described is given below.
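The sketch uses MPI-based C++. It is schematic only: parallelSparseSolve is a hypothetical placeholder standing in for the actual solver library calls; the real SPOOLES calling sequence is considerably more involved and is documented in its reference manual.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical placeholder for the parallel sparse solution of
// [K]{u} = {f}; a real implementation would call into the solver
// library (e.g., SPOOLES) here.
void parallelSparseSolve(const std::vector<double>& K,
                         const std::vector<double>& f,
                         std::vector<double>& u, MPI_Comm comm) {
    // ordering, symbolic factorization, numeric factorization, and
    // the triangular solves all run in parallel inside the solver
    (void)K; (void)f; (void)u; (void)comm;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n = 0;                        // number of equations
    std::vector<double> K, f, u;
    if (rank == 0) {
        // form and assemble [K] and {f} sequentially on one processor
        // (element loop omitted); n, K, and f are filled here
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  // share problem size
    // ... broadcast or scatter the entries of K and f similarly ...

    u.assign(n, 0.0);
    parallelSparseSolve(K, f, u, MPI_COMM_WORLD);  // solve in parallel

    MPI_Finalize();
    return 0;
}
```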


Fig. 9. Flowchart for the GSM-based parallel equation solver method.

4.3 Parallel substructure method versus the parallel equation solver method

Previous investigations have shown that the substructure method closely resembles the procedure for solving a sparse linear system (Smith et al., 1996). Similarly, the parallel substructure method closely resembles the parallel equation solver method. This section discusses the similarities and differences between these two approaches.

The first step in solving a sparse linear system is to reorder the degrees of freedom. Some ordering methods, such as the nested-dissection and multi-section methods, employ the concept of "dissection" to obtain a set of degrees of freedom (called the separator) that separates the remaining degrees of freedom into two or more groups, such that any two degrees of freedom in different groups, say i and j, are disconnected (i.e., [K]_ij = 0). The degrees of freedom belonging to the separator are ordered after those belonging to the remaining groups; different methods then apply different approaches to order the remaining degrees of freedom. Because the degrees of freedom within the groups are disconnected from one another, they can be factorized concurrently and independently.

The substructure method and the idea of dissection are very similar. The interface degrees of freedom in the substructure method can be considered a separator, which separates the remaining degrees of freedom into several groups, and each group represents the internal degrees of freedom of a substructure. The factorization of the degrees of freedom of each group resembles static condensation in the substructure method. From this perspective, the parallel substructure method closely parallels the parallel equation solver method: finding a separator (mesh partitioning), assigning each group of degrees of freedom to a processor for factorization (static condensation), and then factoring the degrees of freedom associated with the separator (solving the interface system). The block structure sketched below illustrates this correspondence.
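As a small illustration (standard sparse-matrix material, not from the paper), ordering the two disconnected groups first and the separator S last gives the permuted stiffness matrix a bordered block structure:

$$[P][K][P]^T = \begin{bmatrix} K_{11} & 0 & K_{1S} \\ 0 & K_{22} & K_{2S} \\ K_{S1} & K_{S2} & K_{SS} \end{bmatrix}$$

The diagonal blocks K_11 and K_22 can be factored independently (in the substructure view, condensed in parallel), after which only the comparatively small separator block remains to be solved.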

The parallel substructure method and the parallel equation solver method differ mainly in that the former approaches the solution from a physical viewpoint, while the latter approaches it from an equation viewpoint. In the parallel substructure method, the separator (the interface degrees of freedom) depends on the partitioning of the physical domain (or mesh), that is, on how elements are distributed among substructures. Meanwhile, in the parallel equation solver method, the matrix ordering method determines the separator.

The parallel substructure method normally requires less communication among processors than the parallel equation solver method. In the parallel substructure method, once the elements are assigned to processors, the computations needed to form and assemble the element stiffness matrices and to recover the element forces for each substructure can be performed independently and concurrently without interprocess communication; only the condensed matrix of each substructure needs to be exchanged among processors for solving the interface degrees of freedom. In the parallel equation solver method, however, it may be necessary to exchange entries of the global stiffness matrix [K] and of the factored matrix [L] among processors when solving the equilibrium equations.

The parallel equation solver method and the parallel substructure method also use different matrix ordering schemes herein. The parallel equation solver method simply orders the entire system of equations by the BestOfNDandMS method, while the parallel substructure method first orders all degrees of freedom of each substructure by the BestOfNDandMS method and then moves the interface degrees of freedom behind the internal degrees of freedom. The numerical computations involved in these two methods therefore differ, because the computations depend heavily on the matrix ordering. The experiments in Section 5 of this work demonstrate that the numerical computations required by the parallel substructure method usually exceed those required by the parallel equation solver method. Additionally, different matrix computation packages are employed by the two methods herein: the GSM-based parallel substructure method employs the extended SPARSPAK library (see Section 4.1) for static condensation of each substructure (in parallel) and for the solution of the interface degrees of freedom (sequentially), while the GSM-based parallel equation solver method employs SPOOLES for solving the equilibrium equations of the entire system in parallel. A sketch of the interface-last reordering used by the substructure method follows.
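The following is a minimal sketch (an assumed helper, not FE2000 code) of the second ordering approach described above: take a fill-reducing ordering of all of a substructure's degrees of freedom, then stably move the interface DOFs behind the internal ones so that condensation can stop at the interface block.

```cpp
#include <algorithm>
#include <vector>

// 'ordering' lists DOF ids in elimination order; 'isInterface' flags
// interface DOFs. stable_partition keeps the relative (fill-reducing)
// order within each group while pushing interface DOFs to the end.
std::vector<int> interfaceLastOrdering(std::vector<int> ordering,
                                       const std::vector<bool>& isInterface) {
    std::stable_partition(ordering.begin(), ordering.end(),
                          [&](int dof) { return !isInterface[dof]; });
    return ordering;
}
```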

5 NUMERICAL STUDIES ON PARALLEL FINITE ELEMENT ANALYSIS

Several finite element examples are used herein to investigate the effectiveness of the GSM-based parallel equation solver method and the GSM-based parallel substructure method. These examples include the O1-12, O4-12, B20P6416, and M12BD-2 models, as shown in Figures 10, 4, 11, and 6, respectively. As already discussed, SPOOLES is employed for solving the entire set of system equilibrium equations in the GSM-based parallel equation solver method, while the extended SPARSPAK is used for substructure matrix condensation and the interface solution in the GSM-based parallel substructure method. The total elapsed time and storage requirements are measured for both sequential and parallel finite element analyses.

Two factors are used herein to evaluate the effectiveness of parallel finite element analyses: the speed-up factor for computational efficiency and the drop-off factor for storage savings. The speed-up factor, S, is defined as

$$S = \frac{\text{elapsed time of the "fastest" sequential analysis on a single processor}}{\text{elapsed time of the parallel analysis on } N_p \text{ processors}} \tag{9}$$

where N_p is the number of processors used. Three different implementations of sequential analysis are investigated, and the one with the fastest execution time is used for calculating the speed-up factors in Equation (9). The first two implementations use SPARSPAK and SPOOLES (parallel version with N_p = 1), respectively, for directly solving the entire system of equations. The third implementation is the sequential version of the substructure method.

The storage drop-off factor, D, is defined as

$$D = 100\% - \frac{\text{maximum storage requirement of any processor in the parallel analysis}}{\text{storage requirement of the sequential analysis}} \tag{10}$$

The sequential implementation with the lowest storage requirement in each of the examples tested herein is used for calculating the storage drop-off factors in Equation (10). Together, the storage drop-off factor and the speed-up factor provide a good indication of the benefits of parallel analysis in terms of time and storage requirements, particularly when the available computer hardware lacks sufficient core memory to accommodate the sequential analysis.

Table 2 lists the elapsed times of three different GSM-based sequential finite element analyses. Because sequential SPARSPAK (column A in Table 2) and SPOOLES (column B in Table 2) use the same matrix ordering method (namely, orderViaBestOfNDandMS, as described earlier), they require the same number of nonzero entries in the factorized matrix [L] and the same number of numerical operations for solving the linear equations. However, the SPOOLES implementation runs faster than the SPARSPAK implementation in all test examples. Consequently, the results of the SPOOLES implementation are used for calculating the speed-up factor of Equation (9) (see Table 3). The time measured for the GSM-based sequential substructure method includes the time spent in mesh partitioning using the RST method (although mesh partitioning costs only about one second in the largest examples).


Fig. 10. The O1-12 model.

There is often inherent computational overhead associated with the substructure algorithm (compared with the direct solution method). As shown in Table 2, the overheads associated with the O1-12 and O4-12 framed structures are quite significant. Based on our investigations, the inherent computational overhead of the substructure method comes mainly from

Fig. 11. The B20P6416 model.

the extra numerical computations induced by the constrained substructure matrix ordering (see Section 4.1), which can differ significantly from the matrix ordering in the corresponding sequential analysis. The parallel analyses below demonstrate that these high overheads significantly influence the parallel efficiency of the parallel substructure method. However, this kind of overhead does not occur for the B20P6416 and M12BD-2 models, probably because the substructure method in these two cases happens to obtain relatively better matrix ordering results than the direct solution method.

Table 3 lists the elapsed times of the parallel finite element analyses using four processors. Only the speed-up value for the GSM-based parallel substructure analysis of the O4-12 model is below one. For the O1-12, B20P6416, and M12BD-2 models, the GSM-based parallel substructure method achieves better speed-ups than the GSM-based parallel equation solver method using SPOOLES, while for the O4-12 model the trend is the opposite. Generally, there is no guarantee that parallel finite element analyses will yield speed-up factors greater than one. Finite element models that are not large enough to exploit numerous processors, imbalanced workload distributions among processors, and excessive interprocess communication (relative to the computational workload) often cause inefficient parallel analyses. Because of the inherent overheads associated with the substructure algorithm itself (especially in the framed structures), the parallel analysis of the O4-12 model using the GSM-based parallel substructure method does not achieve a speed-up greater than one when N_p = 4. However, for the M12BD-2 model, the GSM-based parallel substructure method obtains a speed-up of 3.08 (for N_p = 4).


Table 2 Elapsed time (in seconds) of GSM-based sequential finite element analyses*

FE model    SPARSPAK (A)    SPOOLES (N_p = 1) (B)    Substructure method (N_s = 4) (C)    (B)/(A)    Overhead, (C)/(A) − 100%
O1-12       71.2            69.6                     99.6                                 0.98       40%
O4-12       76.1            46.6                     220.4                                0.61       190%
B20P6416    142.9           121.7                    134.8                                0.94       −6%
M12BD-2     219.7           209.8                    228.5                                0.95       −4%

*N_p denotes the number of processors, and N_s denotes the number of substructures.

Table 3 Elapsed time (in seconds) of GSM-based parallel finite element analyses (N_p = 4)*

FE model    Parallel substructure method (A)    Parallel equation solver (B)    Speed-up of (A)    Speed-up of (B)
O1-12       43.3                                70.1                            1.61               0.99
O4-12       134.8                               29.9                            0.35               1.56
B20P6416    49.6                                78.2                            2.45               1.56
M12BD-2     68.2                                115.5                           3.08               1.82

*N_p denotes the number of processors.

If the elapsed times of the parallel substructure analyses (Table 3) are compared with those of the GSM-based sequential substructure analyses (Table 2), the speed-ups (when N_p = 4) of the parallel substructure method relative to its sequential counterpart for the four examples can be calculated as 2.30, 1.64, 2.72, and 3.35, respectively. These results indicate that the parallelization of the GSM-based substructure method is actually considerably more effective than the speed-ups listed in Table 3 suggest.

In addition to the problem of inherent computational overhead, load imbalance and the sequential solution of the interface system are also bottlenecks for achieving high parallel speed-ups.

For example, the four substructures of the O4-12 model have almost the same number of elements (2922 elements each) and nodes (1231, 1232, 1223, and 1299 nodes, respectively), but they have very different numbers of interface nodes (180, 188, 144, and 302 nodes, respectively) and therefore require different condensation times (27.3, 30.9, 13.5, and 43.5 sec, respectively). Moreover, the sequential solution of the interface system costs 66.8 sec (even though there are only 405 interface nodes), which is almost the same as the factorization time in the sequential direct solution of the entire system (see Table 1). Furthermore, if the computations for forming the element stiffness matrices and assembling them into the global stiffness matrix [K] are performed in parallel in the GSM-based parallel equation solver method, the speed-up values (when N_p = 4) for the O1-12, O4-12, and M12BD-2 models become 1.07, 1.56, and 2.05, respectively (Hsu, 2000).

Tables 4 and 5 list the storage requirements of the GSM-based sequential and parallel finite element analyses, respectively. These requirements are measured using the monitoring tool "top" provided by the Linux operating system. In the GSM-based parallel finite element analyses, the storage requirement of each processor is measured, and the maximum is reported here. Table 4 reveals that SPOOLES generally requires more storage than SPARSPAK, probably because SPARSPAK uses a column-based data structure for storing matrices [K] and [L], while SPOOLES uses a submatrix-based data structure, which often requires significantly more memory storage (Ashcraft, 1999, personal communication).

Table 4 Storage requirements of GSM-based sequential finite element analyses (MB)*

FE model    SPARSPAK (A)    SPOOLES (N_p = 1) (B)    Substructure method (N_s = 4) (C)    (B)/(A)    Overhead, (C)/(A) − 100%
O1-12       123             248                      119                                  2.02       −3%
O4-12       62              96                       107                                  1.55       73%
B20P6416    129             254                      134                                  1.97       4%
M12BD-2     201             448                      218                                  2.23       8%

*N_p denotes the number of processors, and N_s denotes the number of substructures.


Table 5 Storage requirements of GSM-based parallel finite element analyses (N_p = 4, MB)*

FE model    Parallel substructure method (A)    Parallel equation solver (B)    Storage drop-off of (A)    Storage drop-off of (B)
O1-12       63                                  150                             0.49                       −0.22
O4-12       71                                  50                              −0.15                      0.19
B20P6416    51                                  162                             0.60                       −0.26
M12BD-2     73                                  267                             0.64                       −0.33

*N_p denotes the number of processors.

For the O4-12 model, the sequential substructure method requires significantly more storage for the substructure matrices than the sequential SPARSPAK implementation (73% more, as shown in Table 4), while the differences are rather small (less than 10%) for the other three models. This is mainly because the O4-12 model has a significant imbalance of substructure storage among the four substructures, while the other three models all have well-balanced substructure storage. Notably, the sequential finite element analysis of the M12BD-2 model using SPOOLES requires 448 MB of storage, exceeding the system core memory; if the computer system had a memory capacity exceeding 448 MB, the sequential analysis time might be further reduced. Table 5 reveals that the storage drop-off factors of the parallel substructure analyses are generally better than those of the parallel equation solver method.
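As a worked instance of Equation (10) using the tabulated values: for the M12BD-2 model, the lowest sequential storage is 201 MB (SPARSPAK, Table 4) and the maximum per-processor storage of the parallel substructure analysis is 73 MB (Table 5), so

$$D = 100\% - \frac{73}{201} \approx 0.64,$$

matching the drop-off factor reported in Table 5.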

6 CONCLUSIONS

This study has integrated the general sparse matrix and parallel computing techniques in an object-oriented finite element program for large-scale structural analyses. Several finite element models with different characteristics have been analyzed using a PC cluster comprising up to four processors to investigate the effectiveness of the general sparse matrix and parallel computing technologies implemented. The general sparse matrix technique has been shown to significantly reduce time and storage requirements in large-scale structural analyses. As the scale and complexity of engineering problems increase with time, the general sparse matrix technique is expected to become increasingly important in large-scale structural analyses.

Two parallel computing methods have been carefully studied herein, with particular emphasis on the GSM-based parallel substructure method.

Although there is no guarantee that parallel processing can always accelerate an analysis (especially when compared with the "fastest" sequential analysis), it is demonstrated herein that, for large-scale structural analyses, parallel computing techniques can frequently shorten the time required for analysis and/or reduce the storage requirements of a single computer. In particular, the GSM-based parallel substructure method can often reduce the storage requirements of a single computer. The GSM-based parallel substructure method also possesses good inherent parallelism in its algorithm, but this is often counteracted by the inherent computational overheads associated with the algorithm itself. Consequently, when compared to the "fastest" sequential method, it may be difficult for the GSM-based parallel substructure method to achieve good speed-up, particularly in cases where memory requirements are not a concern, and more research is still needed to enhance the parallel efficiency of the method. Notably, however, the "fastest" algorithm is usually hard to identify, and hardware limitations on storage are often an issue in large-scale structural analyses.

In addition, load balancing among processors significantly influences the efficiency of the GSM-based parallel substructure method. The RST mesh partitioning method employed herein is quite effective in reducing the total number of interface nodes and always produces substructures with almost the same number of elements. However, a balanced number of elements among substructures does not necessarily lead to balanced substructure workloads in the GSM-based parallel substructure method; for example, the partitions of the O4-12 model in this work have the same number of elements but very different computational workloads. Research on better mesh partitioning algorithms is still needed for the GSM-based parallel substructure method.

Meanwhile, the GSM-based parallel equation solver method is a straightforward way to exploit parallel computing technologies. Satisfactory speed-up results have been obtained herein for this approach, particularly given that the current implementation parallelizes only the solution of the system equations. Moreover, Hsu (2000) has shown that the parallel efficiency of this approach can be further enhanced if the computations for forming the element stiffness matrices and assembling them into the global stiffness matrix are also performed in parallel. Additionally, it is found that the SPOOLES library used herein requires significantly more memory for computations than the SPARSPAK library but executes faster. In future work, other parallel solvers, such as MUMPS (Amestoy et al., 2000) and PSPASES (Joshi et al., 1999), should be investigated for the GSM-based parallel equation solver method.


Based on the experiments herein, we cannot conclude that either the GSM-based parallel substructure method or the GSM-based parallel equation solver method is better for parallel finite element analysis. Generally, the parallel equation solver method can be more easily implemented in a finite element program because its analytical process is almost identical to that of the sequential method, and several publicly available parallel equation solvers exist. However, the parallel equation solver method generally requires more communication among processors than the parallel substructure method (see Section 4.3). Therefore, the efficiency of the parallel equation solver method depends more on the performance of the communication network of the parallel computing environment.

Although only up to four processors are employed in this work for parallel computations, the two GSM-based parallel solution methods presented here are not limited to a particular number of processors. However, if more processors are used while the problem size is fixed (i.e., not scaled up accordingly), the speed-up may increase only until the number of processors reaches an "optimal" value, and the efficiency usually decreases as the number of processors increases. For the GSM-based parallel substructure method, using more processors (or substructures) often leads to more computational overhead and a larger interface system; for the GSM-based parallel equation solver method, using more processors usually means more interprocessor communication. Future research is still needed to investigate the ability of the two GSM-based parallel solution methods, especially the GSM-based parallel substructure method, to maintain a constant efficiency when the number of processors and the problem size increase simultaneously.

ACKNOWLEDGMENTS

The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under contract nos. NSC 87-2211-E-002-034 and NSC 88-2211-E-002-018. The authors also thank Dr. Cleve Ashcraft of the Boeing Shared Services Group for providing the SPOOLES library and for helpful comments about it.

REFERENCES

Adeli, H. (2000), High-performance computing for large-scale analysis, optimization, and control, Journal of Aerospace Engineering, ASCE, 13(1), 1–10.

Adeli, H. & Kamal, O. (1993), Parallel Processing in Structural Engineering, Elsevier Applied Science, London.


Adeli, H. & Kumar, S. (1999), Distributed Computer-Aided Engineering for Analysis, Design, and Visualization, CRC Press, Boca Raton, FL.

Adeli, H. & Soegiarso, R. (1999), High-Performance Computing in Structural Engineering, CRC Press, Boca Raton, FL.

Amestoy, P. R., Duff, I. S. & L'Excellent, J. Y. (2000), Multifrontal parallel distributed symmetric and unsymmetric solvers, Computer Methods in Applied Mechanics and Engineering, 184(2–4), 501–20.

Ashcraft, C. & Liu, J. W. H. (1998), Robust ordering of sparse matrices using multisection, SIAM Journal on Matrix Analysis and Applications, 19(3), 816–32.

Ashcraft, C., Grimes, R. G. & Lewis, J. G. (1998), Accurate symmetric indefinite linear equation solvers, SIAM Journal on Matrix Analysis and Applications, 20(2), 513–61.

Ashcraft, C., Pierce, D., Wah, D. K. & Wu, J. (1999), The Reference Manual for SPOOLES, Release 2.2: An Object Oriented Software Library for Solving Sparse Linear Systems of Equations, Boeing Shared Services Group, USA.

Ashcraft, C. C., Grimes, R. G., Peyton, B. W. & Simon, H. D. (1987), Progress in sparse matrix methods for large linear systems on vector supercomputers, The International Journal of Supercomputer Applications, 1(4), 10–30.

Balay, S., Gropp, W., McInnes, L. C. & Smith, B. (2000), PETSc 2.0 Users Manual, ANL-95/11—Revision 2.0.29, Argonne National Laboratory, USA.

Barnard, S. T. & Simon, H. (1993), A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems, in R. F. Sincovec et al. (eds.), Parallel Processing for Scientific Computing, SIAM, 711–18.

Bui, T. & Jones, C. (1993), A heuristic for reducing fill-in in sparse matrix factorization, Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 445–52.

Duff, I. S., Erisman, A. M. & Reid, J. K. (1986), Direct Methods for Sparse Matrices, Oxford University Press, New York.

Farhat, C. & Roux, F. X. (1994), Implicit parallel processing in structural mechanics, Computational Mechanics, 2, 1–124.

Farhat, C., Lanteri, S. & Simon, H. (1995), TOP/DOMDEC—A software tool for mesh partitioning and parallel processing, Computing Systems in Engineering, 6(1), 13–26.

Farhat, C., Lesoinne, M., LeTallec, P., Pierson, K. & Rixen, D. (2001), FETI-DP: a dual-primal unified FETI method—Part I: a faster alternative to the two-level FETI method, International Journal for Numerical Methods in Engineering, 50, 1523–44.

George, A. (1973), Nested dissection of a regular finite element mesh, SIAM Journal on Numerical Analysis, 10, 345–63.

George, A. & Liu, J. W. H. (1981), Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, Englewood Cliffs, NJ.

Gropp, W., Lusk, E., Doss, N. & Skjellum, A. (1996), High-performance, portable implementation of the MPI message passing interface standard, Parallel Computing, 22(6), 789–828.

Gupta, A. (2000), An experimental comparison of some direct sparse solver packages, IBM Research Report, RC 21862 (98393), Computer Science/Mathematics, IBM Research Division, USA.

Gupta, A. (2001), Recent advances in direct methods for solving unsymmetric sparse systems of linear equations, IBM Research Report, RC 22039 (98933), Computer Science/Mathematics, IBM Research Division, USA.

Gupta, A., Karypis, G. & Kumar, V. (1997), Highly scalable parallel algorithms for sparse matrix factorization, IEEE Transactions on Parallel and Distributed Systems, 8(5), 502–20.

438

Hsieh, Yang & Hsu

Transactions on Parallel and Distributed System, 8(5), 502– 20. Han, T. Y. & Abel, J. F. (1984), Substructure condensation using modified decomposition, International Journal for Numerical Methods in Engineering, 20, 1959–64. Heath, M. T., Ng, E. & Peyton, B. W. (1991), Parallel algorithms for sparse linear systems, SIAM Review, 33(3), 420– 60. Hsieh, S. H. & Abel, J. F. (1995), Comparison of two finite element approaches for analysis of rotating bladed-disk assemblies, Journal of Sound and Vibration, 182(1), 91– 107. Hsieh, S. H., Paulino, G. H. & Abel, J. F. (1995), Recursive spectral algorithms for automatic domain partitioning in parallel finite element analysis, Computer Methods in Applied Mechanics and Engineering, 121, 137–62. Hsieh, S. H. & Sotelino, E. D. (1997), A message passing class library in C++ for portable parallel programming, Engineering with Computers, 13, 20–34. Hsieh, S. H., Paulino, G. H. & Abel, J. F. (1997), Evaluation of automatic domain partitioning algorithms for parallel finite element analysis, International Journal for Numerical Methods in Engineering, 40(6), 1025–51. Hsu, P. Y. (2000), Finite element structural analysis using parallel sparse solver approach, Master’s Thesis, Department of Civil Engineering, National Taiwan University, Taipei, Taiwan, ROC. (in Chinese) Jones, M. T. (2000), Unstructured mesh computations on networks of workstations, Computer-Aided Civil and Infrastructure Engineering, 15, 196–208. Joshi, M., Karypis, G. & Kumar, V. (1999), PSPASES: Scalable Parallel Direct Solver Library for Sparse Symmetric Positive Definite Linear Systems User’s Manual (version 1.0.3), Department of Computer Science, University of Minnesota, MN, USA. Karypis, G. & Kumar, V. (1998), METIS A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices Version 4.0, Technical report, Department of Computer Science/Army HPC Research Center, MN. Kumar, V., Grama, A., Gupta, A. & Karypis, G. (1994), Introduction to Parallel Computing Design and Analysis of Algorithms, Benjamin Cummings, Redwood City, CA. Liu, J. W. H. (1976), Comparative analysis of the CuthillMcKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices, SIAM Journal on Numerical Analysis, 13(2), 198–213. Liu, J. W. H. (1985), Modification of the minimum-degree algo-

rithm by multiple elimination, ACM Transactions on Mathematical Software, 11(2), 141–53. Liu, J. W. H. (1989), On the minimum degree ordering with constraints, SIAM Journal on Scientific Computing, 10, 1136– 45. Lu, J. (1994), FE++: An Object-Oriented Application Framework for Finite Element Programming, Proceedings of the 2nd Annual Object-Oriented Numeric Conference, Sunriver, Oregon, April 24–27, 438–47. Message Passing Interface Forum (1994), MPI: a messagepassing interface standard, International Journal of Supercomputer Applications, 8(3/4), 159–416. Mukunda, G. R., Sotelino, E. D. & Hsieh, S. H. (1998), Distributed finite element computations using object-oriented techniques, Engineering with Computers, 14(1), 59–72. Noor, A. K. (ed.) (1987), Parallel Computations and Their Impact on Mechanics, The American Society of Mechanical Engineers, New York. Poole, E. L., Knight, N. F., Jr. & Davis, D. D., Jr. (1992), Highperformance equation solvers and their impact on finite element analysis, International Journal for Numerical Methods in Engineering, 33, 855–68. Saad, Y. (1996), Iterative Methods for Sparse Linear Systems, PWS Publishing Company, Boston. Sarma, K. C. & Adeli, H. (1996), Sparse matrix algorithm for minimum weight design of large structures, Engineering Optimization, 27(1), 65–85. Simon. H. D. (1991), Partitioning of unstructured problems for parallel processing, Computing Systems in Engineering, 2(2/3), 135–48. Smith, B., Bjorstad, P. & Gropp, W. (1996), Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations, Cambridge University Press, New York. Stallman, R. M. (1998), The C preprocessor, online document, available from http://www.gnu.org/. Walshaw, C. (2000), Parallel Jostle Library Interface: Version 1.2.1, School of Computing & Mathematical Sciences, University of Greenwich, London, UK. Yang, Y. S., Hsieh, S. H., Chou, K. W. & Tsai, I. C. (1998), Large-scale structural analysis using general sparse matrix technique, Proceedings of the Eighth KKNN Seminar on Civil Engineering, Singapore, November 30–December 1, 260–65. Yang, Y. S. (2000), Parallel computing for nonlinear dynamic finite element structural analysis with general sparse matrix technology, Ph.D. Dissertation, Department of Civil Engineering, National Taiwan University, Taipei, Taiwan, Republic of China.
