In this study, we examine accurate matrix multiplication in floating-point arithmetic. We demonstrate error-free transformations of matrix multiplication using high-performance basic linear algebra subprograms. These transformations can be applied to accurately compute the product of two matrices with floating-point entries. A key technique for this computation is the error-free splitting of floating-point matrices. In this study, we improve upon our previous method by a posteriori validation using floating-point exceptions. The method exploits the occurrence of overflow to detect whether a rounding error has occurred: if overflow occurs, the result contains exceptional values such as ∞ and NaN, and the method necessarily fails; otherwise, we can confirm that no rounding error occurs in the process. Therefore, reducing the possibility of overflow is important. The numerical results suggest that the proposed algorithm provides more accurate results compared with the original algorithm. Moreover, for the product of *n* × *n* matrices, for sufficiently large *n*, the new algorithm reduces the computing time for the error-free transformation by an average of 20% and up to 30% compared with the original algorithm. Furthermore, the new algorithm can be used when matrix multiplication is performed using divide-and-conquer methods. Copyright © 2016 John Wiley & Sons, Ltd.
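The error-free splitting that underpins such transformations can be illustrated with a short sketch (not the authors' implementation; it follows the classical extraction idea, and the split point `beta` is an illustrative assumption):

```python
import numpy as np

def split_matrix(A, beta):
    # Error-free splitting by extraction: each entry of A is cut at a fixed
    # bit position so that A == A_hi + A_lo holds exactly in IEEE 754 double
    # precision, provided max|A| < 2**beta.
    sigma = np.full_like(A, 2.0 ** beta)
    A_hi = (A + sigma) - sigma    # keeps only the leading bits of each entry
    A_lo = A - A_hi               # exact: the discarded trailing bits
    return A_hi, A_lo

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A_hi, A_lo = split_matrix(A, 30)
assert np.array_equal(A_hi + A_lo, A)   # the splitting is error-free
```

Products of the high parts can then be computed exactly in floating point, which is the basis of the error-free transformation of the matrix product.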

This paper presents a class of limited memory preconditioners (LMP) for solving linear systems of equations with symmetric indefinite matrices and multiple right-hand sides. These preconditioners, based on limited-memory quasi-Newton formulas, require a small number *k* of linearly independent vectors and may be used to improve an existing first-level preconditioner. The contributions of the paper are threefold. First, we derive a formula to characterize the spectrum of the preconditioned operator. A spectral analysis of the preconditioned matrix shows that the eigenvalues are all real and that the LMP class is able to cluster at least *k* eigenvalues at 1. Second, we show that the eigenvalues of the preconditioned matrix enjoy interlacing properties with respect to the eigenvalues of the original matrix, provided that the *k* linearly independent vectors have first been projected onto the invariant subspaces associated with the eigenvalues of the original matrix in the open right and left half-plane, respectively. Third, we focus on theoretical properties of the Ritz-LMP variant, where Ritz information is used to determine the *k* vectors. Finally, we illustrate the numerical behaviour of the Ritz limited memory preconditioners on realistic applications in structural mechanics that require the solution of sequences of large-scale symmetric saddle-point systems. Numerical experiments show the relevance of the proposed preconditioner, leading to a significant decrease in computational operations when solving such sequences of linear systems. A saving of up to 43% in terms of computational effort is obtained on one of these applications. Copyright © 2016 John Wiley & Sons, Ltd.
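The clustering property can be checked on a small example. The sketch below builds a quasi-Newton preconditioner of the usual limited-memory form (the concrete formula is an assumption based on the standard LMP construction, not taken from the paper) and verifies that at least *k* eigenvalues of the preconditioned matrix land at 1:

```python
import numpy as np

def lmp(A, S):
    # A standard limited-memory preconditioner (quasi-Newton form):
    # H = (I - W A)(I - A W) + W,  with  W = S (S^T A S)^{-1} S^T.
    n = A.shape[0]
    M = S.T @ A @ S
    W = S @ np.linalg.solve(M, S.T)
    I = np.eye(n)
    return (I - W @ A) @ (I - A @ W) + W

rng = np.random.default_rng(1)
n, k = 12, 3
F = rng.standard_normal((n, n))
A = F @ F.T + n * np.eye(n)          # SPD test matrix
S = rng.standard_normal((n, k))      # k linearly independent vectors
H = lmp(A, S)
ev = np.linalg.eigvals(H @ A)
# the LMP clusters at least k eigenvalues of the preconditioned matrix at 1
assert np.sum(np.isclose(ev.real, 1.0) & np.isclose(ev.imag, 0.0)) >= k
```

Indeed, H A s = s holds exactly for every column s of S, so the columns of S span a k-dimensional eigenspace of the preconditioned matrix with eigenvalue 1.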

An algorithm to generate samples with approximate first-order, second-order, third-order, and fourth-order moments is presented by extending the Cholesky matrix decomposition to a Cholesky tensor decomposition of an arbitrary order. The tensor decomposition of the first-order, second-order, third-order, and fourth-order objective moments generates a non-linear system of equations. The algorithm solves these equations by numerical methods. The results show that the optimization algorithm delivers samples with an approximate residual error of less than 10^{−16} between the components of the objective and the sample moments. The algorithm is extended to an *n*-th-order approximate tensor moment version, and simulations of non-normal samples replicated from distributions with asymmetries and heavy tails are presented. An application for sensitivity analysis of portfolio risk assessment with Value-at-Risk (*VaR*) is provided. A comparison with previous methods available in the literature suggests that the methodology proposed reduces the error of the objective moments in the generated samples. Copyright © 2016 John Wiley & Sons, Ltd.
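The second-order (covariance) building block that the tensor decomposition generalizes can be sketched as follows (illustrative only; the algorithm above additionally matches third- and fourth-order moments):

```python
import numpy as np

# Second-order moment matching via Cholesky: whiten a raw sample so its
# sample covariance is exactly the identity, then colour it with the
# Cholesky factor of the target covariance.
rng = np.random.default_rng(2)
n, d = 500, 3
C = np.array([[2.0, 0.5, 0.2],
              [0.5, 1.0, 0.3],
              [0.2, 0.3, 1.5]])              # target covariance
Z = rng.standard_normal((n, d))
Z -= Z.mean(axis=0)                          # exact zero mean
W = np.linalg.cholesky(np.cov(Z, rowvar=False))
Z = Z @ np.linalg.inv(W).T                   # whitened: sample cov = I
Y = Z @ np.linalg.cholesky(C).T              # coloured: sample cov = C
assert np.allclose(np.cov(Y, rowvar=False), C)
```

The tensor version replaces the triangular matrix factor with higher-order Cholesky-like tensor factors, which is what turns the problem into a non-linear system of equations.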

We consider the parallel factorization of sparse finite element matrices on distributed memory machines. Our method is based on a nested dissection approach combined with a cyclic re-distribution of the interface Schur complements. We present a detailed definition of the parallel method, and the well-posedness and the complexity of the algorithm are analyzed. A lean and transparent functional interface to existing finite element software is defined, and the performance is demonstrated for several representative examples. Copyright © 2016 John Wiley & Sons, Ltd.

This paper deals with the problem of recovering an unknown low-rank matrix from a sampling of its entries. For its solution, we consider a nonconvex approach based on the minimization of a nonconvex functional that is the sum of a convex fidelity term and a nonconvex, nonsmooth relaxation of the rank function. We show that, with a suitable choice of this nonconvex penalty, it is possible, under mild assumptions, to apply the iterative forward–backward splitting method in this matrix setting as well. Specifically, we propose the use of certain parameter-dependent nonconvex penalties that, for a suitable choice of the parameter value, allow us to solve a convex minimization problem in the backward step, and we exploit this result to prove the convergence of the iterative forward–backward splitting algorithm. Based on the theoretical results, we develop the efficient iterative improved matrix completion forward–backward algorithm for the solution of the matrix completion problem, which exhibits lower computing times and improved recovery performance when compared with the best state-of-the-art algorithms for matrix completion. Copyright © 2016 John Wiley & Sons, Ltd.
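A minimal sketch of the forward–backward iteration for matrix completion, using the convex nuclear-norm penalty in place of the paper's nonconvex relaxation (so the backward step reduces to singular-value soft thresholding):

```python
import numpy as np

def svt(X, tau):
    # Proximal (backward) step for the nuclear norm: soft-threshold
    # the singular values of X by tau.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(3)
n, r = 20, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 target
mask = rng.random((n, n)) < 0.6                                # observed entries
X, lam = np.zeros((n, n)), 0.1
for _ in range(300):
    grad = mask * (X - M)        # forward step: gradient of the fidelity term
    X = svt(X - grad, lam)       # backward step: proximal map of the penalty
# the data misfit on the observed entries has decreased from its initial value
assert np.linalg.norm(mask * (X - M)) < np.linalg.norm(mask * M)
```

In the paper's setting the soft-thresholding rule is replaced by the proximal map of the nonconvex penalty, which by the parameter choice discussed above still amounts to a convex subproblem.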

We consider symmetrized Karush–Kuhn–Tucker systems arising in the solution of convex quadratic programming problems in standard form by Interior Point methods. Their coefficient matrices usually have 3 × 3 block structure, and under suitable conditions on both the quadratic programming problem and the solution, they are nonsingular in the limit. We present new spectral estimates for these matrices: the new bounds are established for the unpreconditioned matrices and for the matrices preconditioned by symmetric positive definite augmented preconditioners. Some of the obtained results complete the analysis recently given by Greif, Moulding, and Orban in [*SIAM J. Optim., 24 (2014), pp. 49-83*]. The sharpness of the new estimates is illustrated by numerical experiments. Copyright © 2016 John Wiley & Sons, Ltd.

Large-scale scientific computing models are needed for the simulation of wave propagation especially for multiple frequency and high-frequency models in complex heterogeneous media. Multigrid methods provide efficient iterative solvers for many large sign-definite systems of equations resulting from physical models. Time-harmonic wave propagation models lead to sign-indefinite systems with eigenvalues in the left half of the complex plane. Thus standard multigrid approaches applied in conjunction with a low-order finite difference or finite element method are not sufficient. In this work, we describe a high-order finite element method model for multiple (low to high) frequency time-harmonic acoustic wave propagation on general curved, non-convex, and non-smooth domains with heterogeneous media using a multigrid approximation of the shifted Laplacian operator as a preconditioner. We implement the model using an efficient geometric multigrid approach with parallel grid transfer operator calculations to simulate the model using the BiCGStab iterative solver. We demonstrate the efficiency and parallel performance of the computational model with multiple low (5 wavelength) to high-frequency (100 wavelength) input incident waves. Copyright © 2016 John Wiley & Sons, Ltd.

The incompressible Stokes equations are a widely used model of viscous or tightly confined flow in which convection effects are negligible. In order to strongly enforce the conservation of mass at the element scale, special discretization techniques must be employed. In this paper, we consider a discontinuous Galerkin approximation in which the velocity field is *H*(div,Ω)-conforming and divergence-free, based on the Brezzi, Douglas, and Marini finite-element space, with complementary space (*P*_{0}) for the pressure. Because of the saddle-point structure and the nature of the resulting variational formulation, the linear systems can be difficult to solve. Therefore, specialized preconditioning strategies are required in order to efficiently solve these systems. We compare the effectiveness of two families of preconditioners for saddle-point systems when applied to the resulting matrix problem. Specifically, we consider block-factorization techniques, in which the velocity block is preconditioned using geometric multigrid, as well as fully coupled monolithic multigrid methods. We present parameter study data and a serial timing comparison, and we show that a monolithic multigrid preconditioner using Braess–Sarazin style relaxation provides the fastest time to solution for the test problem considered. Copyright © 2016 John Wiley & Sons, Ltd.
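As background on why block-factorization preconditioners are attractive for saddle-point systems, the classical Murphy–Golub–Wathen result states that a block-diagonal preconditioner built from the exact Schur complement clusters the spectrum onto three values, 1 and (1 ± √5)/2. A small numpy check (illustrative only; it is not tied to the Stokes discretization above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 10, 4
F = rng.standard_normal((n, n))
A = F @ F.T + n * np.eye(n)               # SPD (1,1) block
B = rng.standard_normal((m, n))           # full-rank constraint block
K = np.block([[A, B.T], [B, np.zeros((m, m))]])   # saddle-point matrix
S = B @ np.linalg.solve(A, B.T)           # exact Schur complement
P = np.block([[A, np.zeros((n, m))],
              [np.zeros((m, n)), S]])     # block-diagonal preconditioner
eigs = np.linalg.eigvals(np.linalg.solve(P, K))
targets = np.array([1.0, (1 + np.sqrt(5)) / 2, (1 - np.sqrt(5)) / 2])
# every eigenvalue of the preconditioned matrix sits at one of three values
assert all(np.min(np.abs(t - targets)) < 1e-6 for t in eigs)
```

In practice the Schur complement is only approximated (e.g. by a pressure mass matrix for Stokes), so the eigenvalues cluster near, rather than exactly at, these values.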

This paper proposes a new, low-communication algorithm for solving PDEs on massively parallel computers. The range decomposition (RD) algorithm exposes coarse-grain parallelism by applying nested iteration and adaptive mesh refinement locally before performing a global communication step. Just a few such steps are observed to be sufficient to obtain accuracy within a small multiple of discretization error. The target applications are petascale and exascale machines, where hierarchical parallelism is required and traditional parallel numerical PDE communication patterns are costly because of message latency. The RD algorithm uses a partition of unity to equally distribute the error, and thus, the work. The computational advantages of this approach are that the decomposed problems can be solved in parallel without any communication until the partitioned solutions are summed. This offers potential advantages in the paradigm of expensive communication but very cheap computation. This paper introduces the method and explains the details of the communication step. Two performance models are developed, showing that the latency cost associated with a traditional parallel implementation of nested iteration is proportional to log(*P*)^{2}, whereas the RD method reduces the communication latency to log(*P*), while maintaining similar bandwidth costs. Numerical results for two problems, Laplace and advection diffusion, demonstrate the enhanced performance, and a heuristic argument explains why the method converges quickly. Copyright © 2016 John Wiley & Sons, Ltd.

This paper discusses the application of a multigrid-in-time scheme to Least Squares Shadowing (LSS), a novel sensitivity analysis method for chaotic dynamical systems. While traditional sensitivity analysis methods break down for chaotic dynamical systems, LSS is able to compute accurate gradients. Multigrid is used because LSS requires solving a very large Karush–Kuhn–Tucker system constructed from the solution of the dynamical system over the entire time interval of interest. Several different multigrid-in-time schemes are examined, and a number of factors are found to heavily influence the convergence rate of multigrid-in-time for LSS. These include the iterative method used for the smoother, how the coarse grid system is formed, and how the least squares objective function at the center of LSS is weighted. Copyright © 2014 John Wiley & Sons, Ltd.

No abstract is available for this article.

Several iterative methods for maximal correlation problems (MCPs) have been proposed in the literature. This paper deals with the convergence of these iterations and contains three contributions. Firstly, a unified and concise proof of the monotone convergence of these iterative methods is presented. Secondly, a starting point strategy is analysed. Thirdly, some error estimates are presented to test the quality of a computed solution. Both theoretical results and numerical tests suggest that, combined with this starting point strategy, these methods converge rapidly and are more likely to converge to a global maximizer of the MCP. Copyright © 2016 John Wiley & Sons, Ltd.

We consider the nonlinear eigenvalue problem *M*(*λ*)*x* = 0, where *M*(*λ*) is a large parameter-dependent matrix. In several applications, *M*(*λ*) has a structure where the higher-order terms of its Taylor expansion have a particular low-rank structure. We propose a new Arnoldi-based algorithm that can exploit this structure. More precisely, the proposed algorithm is equivalent to Arnoldi's method applied to an operator whose reciprocal eigenvalues are solutions to the nonlinear eigenvalue problem. The iterates in the algorithm are functions represented in a particular structured vector-valued polynomial basis similar to the construction in the infinite Arnoldi method [Jarlebring, Michiels, and Meerbergen, Numer. Math., 122 (2012), pp. 169–195]. In this paper, the low-rank structure is exploited by applying an additional operator and by using a more compact representation of the functions. This reduces the computational cost associated with orthogonalization, as well as the required memory resources. The structure exploitation also provides a natural way of carrying out implicit restarting and locking without the need to impose structure in every restart. The efficiency and properties of the algorithm are illustrated with two large-scale problems. Copyright © 2016 John Wiley & Sons, Ltd.

Some modulus-based matrix splitting iteration methods for a class of implicit complementarity problems are presented, and their convergence analysis is given. Numerical experiments confirm the theoretical analysis and show that the proposed methods are efficient. Copyright © 2016 John Wiley & Sons, Ltd.

The triangular truncation operator is a linear transformation that maps a given matrix to its strictly lower triangular part. The operator norm (with respect to the matrix spectral norm) of the triangular truncation is known to have logarithmic dependence on the dimension, and such dependence is usually illustrated by a specific Toeplitz matrix. However, the precise value of this operator norm, as well as the matrices on which it is attained, is still unclear. In this article, we describe a simple way of constructing matrices whose strictly lower triangular part has a logarithmically larger spectral norm. The construction also leads to a sharp estimate that is very close to the actual operator norm of the triangular truncation. This research is directly motivated by our studies on the convergence theory of Kaczmarz-type methods (or equivalently, Gauß–Seidel-type methods), and the corresponding application is also included. Copyright © 2016 John Wiley & Sons, Ltd.
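The logarithmic growth can be observed directly on the classical Toeplitz illustration with off-diagonal entries 1/(i − j) (a sketch of the standard example, not of the paper's new construction): the full matrix has uniformly bounded spectral norm, while the norm of its strictly lower triangular part keeps growing with the dimension.

```python
import numpy as np

def toeplitz_example(n):
    # The classical example: a_ij = 1/(i - j) off the diagonal, 0 on it.
    i, j = np.indices((n, n))
    A = np.zeros((n, n))
    off = i != j
    A[off] = 1.0 / (i - j)[off]
    return A

norms = []
for n in (16, 64, 256):
    A = toeplitz_example(n)
    L = np.tril(A, -1)                         # triangular truncation of A
    norms.append(np.linalg.norm(L, 2))
    assert np.linalg.norm(A, 2) < np.pi + 1e-6  # full matrix: norm bounded by pi
assert norms[0] < norms[1] < norms[2]           # truncation norm grows (~ log n)
```

The bound π on the full matrix comes from the supremum of the symbol of the underlying Toeplitz operator, while the truncated part behaves like a finite Hilbert transform whose norm grows like (1/π) log *n*.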

In this paper, two accelerated divide-and-conquer (ADC) algorithms are proposed for the symmetric tridiagonal eigenvalue problem, which cost *O*(*N*^{2}*r*) flops in the worst case, where *N* is the dimension of the matrix and *r* is a modest number depending on the distribution of eigenvalues. Both of these algorithms use hierarchically semiseparable (HSS) matrices to approximate some intermediate eigenvector matrices, which are Cauchy-like matrices and are off-diagonally low-rank. The difference of these two versions lies in using different HSS construction algorithms, one (denoted by ADC1) uses a structured low-rank approximation method and the other (ADC2) uses a randomized HSS construction algorithm. For the ADC2 algorithm, a method is proposed to estimate the off-diagonal rank. Numerous experiments have been carried out to show their stability and efficiency. These algorithms are implemented in parallel in a shared memory environment, and some parallel implementation details are included. Comparing the ADCs with highly optimized multithreaded libraries such as Intel MKL, we find that ADCs could be more than six times faster for some large matrices with few deflations. Copyright © 2016 John Wiley & Sons, Ltd.

Estimating the number of eigenvalues located in a given interval of a large sparse Hermitian matrix is an important problem in certain applications, and it is a prerequisite of eigensolvers based on a divide-and-conquer paradigm. Often, an exact count is not necessary, and methods based on stochastic estimates can be utilized to yield rough approximations. This paper examines a number of techniques tailored to this specific task. It reviews standard approaches and explores new ones based on polynomial and rational approximation filtering combined with a stochastic procedure. We also discuss how the latter method is particularly well-suited for the FEAST eigensolver. Copyright © 2016 John Wiley & Sons, Ltd.
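The stochastic idea can be isolated in a toy sketch: here the spectral projector is formed explicitly (which the filtering methods above deliberately avoid on large matrices) and its trace, which equals the eigenvalue count in the interval, is estimated with Rademacher probe vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
Q0, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q0 @ np.diag(np.linspace(-1.0, 1.0, n)) @ Q0.T   # Hermitian, known spectrum
a, b = -0.25, 0.25                                    # count eigenvalues in [a, b]

# Exact count via the spectral projector onto the eigenspace for [a, b].
lam, Q = np.linalg.eigh(A)
inside = (lam >= a) & (lam <= b)
P = Q[:, inside] @ Q[:, inside].T
exact = int(inside.sum())

# Hutchinson estimator: E[v^T P v] = trace(P) for Rademacher vectors v.
m = 400
V = rng.choice([-1.0, 1.0], size=(n, m))
est = np.einsum("ij,ij->j", V, P @ V).mean()
assert abs(est - exact) < 5.0   # rough stochastic count, near the exact one
```

In the methods discussed above, the product `P @ V` is replaced by a polynomial or rational filter applied to the probe vectors, so the projector never needs to be formed.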

We present an analysis for minimizing the condition number of nonsingular parameter-dependent 2 × 2 block-structured saddle-point matrices with a maximally rank-deficient (1,1) block. The matrices arise from an augmented Lagrangian approach. Using quasidirect sums, we show that a decomposition akin to simultaneous diagonalization leads to an optimization based on the extremal nonzero eigenvalues and singular values of the associated block matrices. Bounds on the condition number of the parameter-dependent matrix are obtained, and we demonstrate their tightness on some numerical examples. Copyright © 2016 John Wiley & Sons, Ltd.

This paper introduces a robust preconditioner for general sparse matrices based on low-rank approximations of the Schur complement in a Domain Decomposition framework. In this ‘Schur Low Rank’ (SLR) preconditioning approach, the coefficient matrix is first decoupled by a graph partitioner, and then a low-rank correction is exploited to compute an approximate inverse of the Schur complement associated with the interface unknowns. The method avoids explicit formation of the Schur complement. We show the feasibility of this strategy for a model problem and conduct a detailed spectral analysis of the relation between the low-rank correction and the quality of the preconditioner. We first introduce the SLR preconditioner for symmetric positive definite matrices, and for symmetric indefinite matrices whose interface matrices are symmetric positive definite. Extensions to general symmetric indefinite matrices as well as to nonsymmetric matrices are also discussed. Numerical experiments on general matrices illustrate the robustness and efficiency of the proposed approach. Copyright © 2016 John Wiley & Sons, Ltd.

Methods for the polynomial eigenvalue problem sometimes need to be followed by an iterative refinement process to improve the accuracy of the computed solutions. This can be accomplished by means of a Newton iteration tailored to matrix polynomials. The computational cost of this step is usually higher than the cost of computing the initial approximations, due to the need to solve multiple linear systems of equations with a bordered coefficient matrix. An effective parallelization is thus important, and we propose different approaches for the message-passing scenario. Some schemes use a subcommunicator strategy in order to improve the scalability whenever direct linear solvers are used. We show performance results for the various alternatives implemented in the context of SLEPc, the Scalable Library for Eigenvalue Problem Computations. Copyright © 2016 John Wiley & Sons, Ltd.
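A serial sketch of one such refinement (illustrative only, not the SLEPc implementation): Newton's method on the augmented system P(λ)x = 0, cᴴx = 1, for a quadratic matrix polynomial, where each Newton step solves one bordered linear system of the kind mentioned above.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
M, C, K = np.eye(n), rng.standard_normal((n, n)), rng.standard_normal((n, n))
Pfun = lambda lam: lam**2 * M + lam * C + K      # quadratic matrix polynomial
Pder = lambda lam: 2 * lam * M + C               # its derivative

# Rough eigenpair from the companion linearization of P (M = I).
L = np.block([[np.zeros((n, n)), np.eye(n)], [-K, -C]])
w, X = np.linalg.eig(L)
lam, x = w[0], X[:n, 0]
c = x.conj() / (x.conj() @ x)                    # fixed normalization vector

# Perturb the pair, then refine it with Newton iterations.
lam = lam + 1e-4
x = x + 1e-4 * rng.standard_normal(n)
for _ in range(10):
    # Bordered Jacobian of F(x, lam) = [P(lam) x ; c^T x - 1].
    J = np.block([[Pfun(lam), (Pder(lam) @ x)[:, None]],
                  [c[None, :], np.zeros((1, 1))]])
    rhs = -np.concatenate([Pfun(lam) @ x, [c @ x - 1.0]])
    d = np.linalg.solve(J, rhs)
    x, lam = x + d[:n], lam + d[n]
assert np.linalg.norm(Pfun(lam) @ x) < 1e-8      # residual driven down by Newton
```

The parallel schemes in the paper distribute exactly these bordered solves (and the underlying direct factorizations) over MPI processes.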

In this paper, we consider the convergence rate of a smoothed aggregation algebraic multigrid method, which uses a simple polynomial (1 − *t*)^{ν} or an optimal Chebyshev-like polynomial to construct the smoother and prolongation operator. The result is purely algebraic, whereas a required main weak approximation property of the tentative interpolation operator is verified for a spectral element agglomeration version of the method. More specifically, we prove that, for partial differential equations (PDEs), the two-grid method converges uniformly without any regularity assumptions. Moreover, the convergence rate improves uniformly when the degree of the polynomials used for the smoother and the prolongation increases. Such a result, as is well-known, would imply uniform convergence of the multilevel W-cycle version of the algorithm. Numerical results for both PDE and non-PDE (graph Laplacian) problems are presented to illustrate the theoretical findings. Published 2016. This article is a U.S. Government work and is in the public domain in the USA.
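A two-grid sketch in the spirit of smoothed aggregation for the 1D Laplacian, with one step of the simple polynomial (degree ν = 1) applied to the tentative prolongation (all parameters here are illustrative assumptions, not the paper's setup):

```python
import numpy as np

n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian
Pt = np.zeros((n, n // 2))
for j in range(n // 2):
    Pt[2*j:2*j + 2, j] = 1.0 / np.sqrt(2.0)            # tentative aggregates of size 2
tau = 0.25                                             # ~ 1/lambda_max, lambda_max < 4
P = (np.eye(n) - tau * A) @ Pt                         # smoothed prolongation
Ac = P.T @ A @ P                                       # Galerkin coarse operator

def two_grid_error(e):
    # Error propagation of one cycle: Richardson smoothing, exact coarse
    # correction with the smoothed-aggregation hierarchy, smoothing again.
    for _ in range(2):
        e = e - tau * (A @ e)
    e = e - P @ np.linalg.solve(Ac, P.T @ (A @ e))
    for _ in range(2):
        e = e - tau * (A @ e)
    return e

rng = np.random.default_rng(8)
e0 = rng.standard_normal(n)
e1 = two_grid_error(e0)
anorm = lambda v: np.sqrt(v @ A @ v)
assert anorm(e1) < 0.5 * anorm(e0)   # one cycle contracts the error in the A-norm
```

Raising the polynomial degree used in both the smoother and the prolongation smoothing improves the contraction factor, which is the uniform-improvement phenomenon analyzed above.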