\documentclass[11pt]{amsart}
\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
\geometry{letterpaper} % ... or a4paper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
\usepackage{subfig}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage{amsmath}
% load package with ``framed'' and ``numbered'' option.
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{multirow}
\usepackage[framed,numbered,autolinebreaks,useliterate]{mcode}
\DeclareGraphicsRule{.tif}{png}{.png}{`convert #1 `dirname #1`/`basename #1 .tif`.png}

\title{Improvements to Hessenberg Reduction}
\author{Shankar, Yang, Hao}
%\date{} % Activate to display a given date or no date

\begin{document}
\maketitle

\section{Abstract}
Matrix reduction is critical to solving linear algebra problems such as the standard eigenvalue problem efficiently. Such problems arise in real-world applications like structural mechanics, computational fluid dynamics, and search-engine ranking. A triangular matrix is probably the most convenient form in which to solve these problems, but the computational effort required to transform a dense matrix into a triangular one is substantial. An alternative form, the Hessenberg matrix, can be produced with considerably less effort: an upper Hessenberg matrix is an upper triangular matrix with one additional subdiagonal. By studying the transformation techniques that convert a general matrix into Hessenberg form (Hessenberg reduction), we see that a computationally efficient way to attack problems such as the eigenvalue problem is to first reduce the general matrix to Hessenberg form.

\section{Introduction}
Real-world problems that require mathematical modeling, whether fluid computations, weather forecasting, earthquake modeling, or structural integrity analysis, typically lead to huge sparse matrices, most of which are non-symmetric. Solving them usually amounts to computing the eigenvalues of these large non-symmetric matrices, which in general are complex, and complex arithmetic is considerably slower than real arithmetic. It therefore makes sense to delay complex computations as far down the computational chain as possible. Traditional approaches to eigenvalue problems and linear solves rely on a transformation to triangular form, from which the required results are easy to extract. It turns out, however, that a nearly triangular form can be obtained without resorting to complex arithmetic.

Following the discussion above, one might hope to use similarity transforms to reduce a matrix to triangular form and read off its eigenvalues. Since the eigenvalues are the roots of the characteristic polynomial, Abel's theorem (there is no general closed-form solution for polynomial roots of degree five or higher) implies that no algorithm can compute such a transformation exactly in a finite number of steps. What can be done in a finite number of steps is to bring the matrix close to triangular form. The resulting form is called a Hessenberg matrix, and the process of reaching it through similarity transforms, which is the first step towards simplifying an eigenvalue problem, is called Hessenberg reduction. An $n \times n$ matrix $A$ is called upper Hessenberg if
\begin{equation}
a_{ij}=0 \text{ whenever } i>j+1, \text{ i.e.}
\begin{bmatrix} * & * & * & * & * \\ * & * & * & * & * \\ & * & * & * & * \\ & & * & * & * \\ & & & * & * \end{bmatrix}
\end{equation}
If the matrix is symmetric (Hermitian in the complex case), the reduction results in the tridiagonal form shown below.
\begin{center}$\begin{bmatrix} * & * & & & \\ * & * & * & & \\ & * & * & * & \\ & & * & * & * \\ & & & * & * \end{bmatrix}$\end{center}

\subsection{Hessenberg Reduction}
In step 1 of the Hessenberg reduction process, we use orthogonal transformations (reflections or rotations) to zero out the necessary elements in each column. If $\hat P$ is an orthogonal transformation, the product $\hat P A \hat P$ is not in general a similarity transform; we must use $\hat P^T A \hat P$, which preserves the eigenvalues because the transpose of an orthogonal matrix is its inverse. Moreover, the transformation has to be chosen so that the zeros created by the multiplication on the left are not destroyed by the multiplication on the right; as shown in the Algorithm section, this is achieved by leaving the first row and column of the transformation equal to those of the identity. The next step would be to remove the remaining subdiagonal of the Hessenberg matrix to make it triangular; however, the focus of this project is step 1, namely the reduction of a general matrix to its Hessenberg form.

\subsection{Literature Review}
Various implementation notes from LAPACK were reviewed to find good implementations of the Hessenberg reduction using Householder reflectors, Givens rotations and block Householder reflectors. The column Householder reflector and the Givens rotation algorithms were taken from ``Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited,'' LAPACK Working Note \#208 [1]. The block Hessenberg reduction method was taken from [2]. Background information was obtained from ``Fundamentals of Matrix Computations'' by David Watkins.

\section{Algorithm}
The algorithm to reduce a general matrix to upper Hessenberg form is simple. In the simplest implementation, for an $n \times n$ matrix, column $j$ is taken and the elements below row $j+1$ are zeroed out, for $j = 1, \dots, n-2$. Assume a starting matrix $A=$ $\begin{bmatrix} a_{11} & \dots & \dots & \dots & \dots \\ \vdots & * & * & * & * \\ \vdots & * & * & * & * \\ \vdots & * & * & * & * \\ \vdots & * & * & * & * \end{bmatrix}$ and let $b$ denote the part of the first column of $A$ below $a_{11}$. Let $\hat Q_1$ be a reflector such that $\hat Q_1b=[-\tau_1, 0, \dots,0]^T, (\vert \tau _1 \vert = \Vert b \Vert_2)$, and let $Q_1 = \begin{bmatrix} 1 & 0^T \\ 0 & \hat Q_1 \end{bmatrix}$. Writing $c^T$ for the first row of $A$ without $a_{11}$ and $\hat A$ for the trailing $(n-1)\times(n-1)$ submatrix, the transformation $A_{1/2}=Q_1A=\left[ \begin{array}{c|cccc} a_{11} & \multicolumn{4}{c} {c^T} \\ \hline -\tau _1 & \multicolumn{4}{c}{\multirow{4}{1in}{$\hat Q_1\hat A$}}\\ 0 & \multicolumn{4}{c} {} \\ \vdots & \multicolumn{4}{c} {} \\ 0 & \multicolumn{4}{c} {} \end{array} \right]$ produces zeros in the required rows of the first column. Now applying $A_1=A_{1/2}Q_1=\left[ \begin{array}{c|cccc} a_{11} & \multicolumn{4}{c} {c^T\hat Q_1} \\ \hline -\tau _1 & \multicolumn{4}{c}{\multirow{4}{1in}{$\hat Q_1\hat A\hat Q_1$}}\\ 0 & \multicolumn{4}{c} {} \\ \vdots & \multicolumn{4}{c} {} \\ 0 & \multicolumn{4}{c} {} \end{array} \right] =\left[ \begin{array}{c|cccc} a_{11} & * & \dots & \dots & * \\ \hline -\tau _1 & \multicolumn{4}{c}{\multirow{4}{1in}{$\hat A_1$}}\\ 0 & \multicolumn{4}{c} {} \\ \vdots & \multicolumn{4}{c} {} \\ 0 & \multicolumn{4}{c} {} \end{array} \right]$, we see that, because of the structure of $Q_1$, the zeros in the first column are preserved. The second step is to create a reflector $\hat Q_2$ for $A_1$ instead of $A$, and so on.
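To make the first step concrete, the following is a minimal MATLAB sketch (an illustration only, not the project code listed later) that builds $\hat Q_1$ from the first column of a random matrix, applies the similarity transform $Q_1 A Q_1$, and checks that the zeros below the subdiagonal survive and that the eigenvalues are unchanged. The variable names (\mcode{b}, \mcode{v}, \mcode{Q1hat}, \mcode{Q1}) follow the notation above.
\begin{lstlisting}
% Minimal numerical check of the first reduction step (illustration only).
n  = 5;
A  = randn(n);                       % a general dense matrix
b  = A(2:n,1);                       % part of the first column below a_11
v  = b;  v(1) = v(1) + sign(b(1))*norm(b);
v  = v / norm(v);                    % Householder vector for b
Q1hat = eye(n-1) - 2*(v*v');         % reflector: Q1hat*b = [-tau_1 0 ... 0]'
Q1 = blkdiag(1, Q1hat);              % leaves the first row and column alone
A1 = Q1*A*Q1;                        % Q1 is symmetric and orthogonal, so this is a similarity transform
disp(norm(A1(3:n,1)))                        % ~0: zeros below the subdiagonal are preserved
disp(norm(sort(eig(A1)) - sort(eig(A))))     % ~0: eigenvalues are unchanged
\end{lstlisting}
In Algorithm \ref{alg:hessen} below, the reflector is not formed as an explicit matrix at all; the updates are applied using the vector $v$ directly.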
Note that the reflector $\hat Q$ can be replaced by a rotator $\hat R$ instead. The objective is to zero out elements in each column so as to reduce the matrix to Hessenberg form without changing its eigenvalues; this is possible through orthogonal transforms such as reflections and rotations, applied as similarity transforms.

\subsection{Householder Reflections}
Shown in Algorithm \ref{alg:house} is the algorithm for generating the Householder reflector vector. It takes a matrix $A$ and returns the Householder vector $v$ operating on its first column.
\algsetup{indent=2em}
\begin{algorithm}
\caption{$house(A)$}\label{alg:house}
\begin{algorithmic}[1]
\REQUIRE A matrix $A$.
\ENSURE The Householder reflector vector $v$.
\medskip
\STATE $n \leftarrow$ number of rows of $A$
\STATE $x \leftarrow A_{1:n,1}$
\STATE $e_1 \leftarrow [1, 0, \dots, 0]^T$ (of length $n$)
\STATE $v \leftarrow \mathrm{sign}(x_1)\,\Vert x \Vert_2\, e_1 + x$
\STATE $v \leftarrow v/\Vert v \Vert_2$
\end{algorithmic}
\end{algorithm}

\subsection{Givens Rotations}
Shown in Algorithm \ref{alg:givens} is the algorithm for generating the Givens rotator matrix. It takes a matrix $X$, the row $j$ to be zeroed and the row $i$ affected by the rotation, and returns a Givens rotator matrix $G$; applying $G^T$ from the left zeros out the element in the first column of the $j^{th}$ row, modifying only rows $i$ and $j$ and leaving the other rows intact.
\begin{algorithm}
\caption{$givens(X, j, i)$}\label{alg:givens}
\begin{algorithmic}[1]
\REQUIRE A matrix $X$, row $j$ to be zeroed, row $i$ affected by the rotation.
\ENSURE The Givens rotator matrix $G$ such that $G^TX$ has a zero in element $(j, 1)$ and differs from $X$ only in rows $i$ and $j$.
\medskip
\STATE $G \leftarrow I_n$, where $n$ is the number of rows of $X$
\STATE $x_i \leftarrow X(i,1)$
\STATE $x_j \leftarrow X(j,1)$
\STATE $r \leftarrow \sqrt{x_i^2+x_j^2}$
\STATE $\cos \theta \leftarrow x_i/r$
\STATE $\sin \theta \leftarrow x_j/r$
\STATE $G(i,i) \leftarrow G(j,j) \leftarrow \cos \theta$
\STATE $G(i,j) \leftarrow -\sin \theta$
\STATE $G(j,i) \leftarrow \sin \theta$
\end{algorithmic}
\end{algorithm}

\subsection{Column Householder Reduction}
Shown in Algorithm \ref{alg:hessen} is the algorithm to reduce a general matrix to upper Hessenberg form using column Householder reduction. It takes a matrix $A$ and returns the matrix in upper Hessenberg form.
\begin{algorithm}
\caption{$hessen(A)$}\label{alg:hessen}
\begin{algorithmic}[1]
\REQUIRE A matrix $A$.
\ENSURE The matrix $A$ overwritten by its upper Hessenberg form.
\medskip
\FOR{$j=1$ to $n-2$}
\STATE $x \leftarrow A_{j+1:n,j}$
\STATE $v_j \leftarrow \mathrm{sign}(x_1)\Vert x \Vert_2 e_1+x$
\STATE $v_j \leftarrow v_j/ \Vert v_j \Vert_2$
\STATE $A_{j+1:n, j:n} \leftarrow A_{j+1:n, j:n}-2v_j(v_j^*A_{j+1:n, j:n})$
\STATE $A_{1:n, j+1:n} \leftarrow A_{1:n, j+1:n}-2(A_{1:n, j+1:n}v_j)v_j^*$
\ENDFOR
\end{algorithmic}
\end{algorithm}
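As a companion to Algorithm \ref{alg:hessen}, here is a minimal MATLAB sketch of the column Householder reduction, with the reflector vector generated inline as in the pseudocode. It is illustrative only; the code actually used for the experiments is \mcode{hessen.m}, listed in the Code section.
\begin{lstlisting}
function A = hessen_sketch(A)
% Illustrative column Householder reduction to upper Hessenberg form
% (sketch of Algorithm 3; not the hessen.m used in the experiments).
n = size(A,1);
for j = 1:n-2
    x = A(j+1:n, j);
    if norm(x) == 0, continue; end      % nothing to eliminate in this column
    s = sign(x(1));  if s == 0, s = 1; end
    v = x;  v(1) = v(1) + s*norm(x);    % reflector vector for column j
    v = v / norm(v);
    % left update: zeros out A(j+2:n, j)
    A(j+1:n, j:n) = A(j+1:n, j:n) - 2*v*(v'*A(j+1:n, j:n));
    % right update: keeps the transformation a similarity transform
    A(1:n, j+1:n) = A(1:n, j+1:n) - 2*(A(1:n, j+1:n)*v)*v';
end
end
\end{lstlisting}
Overwriting $A$ in place mirrors the pseudocode; the orthogonal factor is not needed for the eigenvalue comparisons in the Results section, so it is not accumulated here.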
\subsection{Givens Hessenberg Reduction}
Shown in Algorithm \ref{alg:hessen1} is the algorithm to reduce a general matrix to upper Hessenberg form using Givens rotations. It takes a matrix $A$ and returns the matrix in upper Hessenberg form.
\begin{algorithm}
\caption{$hessen1(A)$}\label{alg:hessen1}
\begin{algorithmic}[1]
\REQUIRE A matrix $A$.
\ENSURE The matrix $A$ overwritten by its upper Hessenberg form, and the accumulated orthogonal matrix $G$ such that the result equals $G^TA_0G$, where $A_0$ is the input matrix.
\medskip
\STATE $G \leftarrow I_n$
\FOR{$j=1$ to $n-2$}
\FOR{$i=n, n-1, \dots, j+2$}
\STATE Build the local rotation $g_{i-1,i}$ (Algorithm \ref{alg:givens}) such that $(g_{i-1,i}^T A)(i,j)=0$
\STATE Accumulate $G \leftarrow G\, g_{i-1,i}$
\STATE Update $A \leftarrow g_{i-1,i}^T A$
\STATE Update $A \leftarrow A\, g_{i-1,i}$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
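The following is a minimal MATLAB sketch of Algorithm \ref{alg:hessen1}; it is an illustration only, the code used for the experiments is \mcode{hessen1.m} in the Code section. For brevity the sketch folds the $g^T$ of the pseudocode into the $2\times 2$ rotation it applies, works on two rows and columns at a time instead of forming an $n\times n$ rotator, and does not accumulate $G$.
\begin{lstlisting}
function A = hessen1_sketch(A)
% Illustrative Givens-based reduction to upper Hessenberg form
% (sketch of Algorithm 4; not the hessen1.m used in the experiments).
n = size(A,1);
for j = 1:n-2
    for i = n:-1:j+2
        r = hypot(A(i-1,j), A(i,j));
        if r == 0, continue; end        % element already zero
        c = A(i-1,j)/r;
        s = A(i,j)/r;
        R = [c s; -s c];                % 2x2 rotation: R*[A(i-1,j); A(i,j)] = [r; 0]
        A([i-1 i], :) = R*A([i-1 i], :);     % left update creates the zero at (i,j)
        A(:, [i-1 i]) = A(:, [i-1 i])*R';    % right update keeps it a similarity transform
    end
end
end
\end{lstlisting}
Each rotation touches only two rows and two columns, so the zeros already created in earlier columns are preserved.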
\subsection{Block Hessenberg Reduction}
The block Hessenberg reduction reduces a general matrix to upper Hessenberg form using block Householder reflectors. It takes a matrix $A$ and returns the matrix in upper Hessenberg form. The algorithm works by splitting the matrix into sets of columns (panels) and performing column Householder reductions on each of them. Each panel's Householder vectors are accumulated into a matrix, and the update of the remainder of the matrix is deferred and carried out between panels using Level 3 operations. The implementation is \mcode{BlockHess.m}, listed in the Code section.

\section{Code}
\subsection{Householder Reflections}
Here is the code for the \mcode{house()} function, which returns the Householder reflector vector for a given matrix.
\lstinputlisting[label=lst:house, caption={\mcode{house.m}}]{../house.m}
\subsection{Givens Rotations}
Here is the code for the \mcode{givens()} function, which generates the rotator matrix given the input matrix and the row whose first-column element is to be zeroed out.
\lstinputlisting[label=lst:givens, caption={\mcode{givens.m}}]{../givens.m}
\subsection{Column Hessenberg Reduction}
Here is the code for Hessenberg reduction using column Householder reflectors.
\lstinputlisting[label=lst:hessen, caption={\mcode{hessen.m}}]{../hessen.m}
\subsection{Givens Hessenberg Reduction}
Here is the code for Hessenberg reduction using Givens rotators.
\lstinputlisting[label=lst:hessen1, caption={\mcode{hessen1.m}}]{../hessen1.m}
\subsection{Block Hessenberg Reduction}
Here is the code for block Hessenberg reduction.
\lstinputlisting[label=lst:BlockHess, caption={\mcode{BlockHess.m}}]{../BlockHess.m}
\subsection{Testing}
Here is the code for the complete test function.
\lstinputlisting[label=lst:hr_test, caption={\mcode{hr_test.m}}]{../hr_test.m}

\section{Results}
This section lists the results obtained from the experiments done so far on the various reduction methods. The test code first compares the norm of the reduced matrix with that of the built-in \mcode{hess} command, then compares the eigenvalues obtained through each of these methods with the eigenvalues of the original matrix. A timing run is also performed to benchmark each of these functions against each other and against the built-in \mcode{hess} command.

\subsection{Hessenberg Reduction using Householder Reflectors}
The output of Hessenberg reduction using Householder reflectors is shown below.
\begin{figure}[htb]
\centering
\subfloat{\label{fig:hessen_1}\includegraphics[width=0.3\textwidth]{../matplot_hessen_1.png}} ~
\subfloat{\label{fig:hessen_2}\includegraphics[width=0.3\textwidth]{../matplot_hessen_2.png}} ~
\subfloat{\label{fig:hessen_3}\includegraphics[width=0.3\textwidth]{../matplot_hessen_3.png}}

\subfloat{\label{fig:hessen_4}\includegraphics[width=0.3\textwidth]{../matplot_hessen_4.png}} ~
\subfloat{\label{fig:hessen_5}\includegraphics[width=0.3\textwidth]{../matplot_hessen_5.png}} ~
\subfloat{\label{fig:hessen_6}\includegraphics[width=0.3\textwidth]{../matplot_hessen_6.png}}
\caption{Matrix plots of Hessenberg reduction using Householder reflectors at each iteration, $N=8$}
\label{fig:matplot_hessen}
\end{figure}
\begin{figure}[htb]
\centering
\includegraphics[width=0.3\textwidth]{../matplot_house.png}
\caption{Matrix plot of a Hessenberg-reduced matrix using Householder reflectors, $N=64$}
\label{fig:mplot_house}
\end{figure}
Relative norm with respect to the default \mcode{hess()}: 6.743127020622670e-12.
Relative norm of the eigenvalues compared to the original matrix's eigenvalues: 2.179818782631296e-13.

\subsection{Hessenberg Reduction using Givens Rotations}
The output of Hessenberg reduction using Givens rotations is shown below.
\begin{figure}[htb]
\centering
\subfloat{\label{fig:hessen1_1}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_1.png}} ~
\subfloat{\label{fig:hessen1_2}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_2.png}} ~
\subfloat{\label{fig:hessen1_3}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_3.png}}

\subfloat{\label{fig:hessen1_4}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_4.png}} ~
\subfloat{\label{fig:hessen1_5}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_5.png}} ~
\subfloat{\label{fig:hessen1_6}\includegraphics[width=0.3\textwidth]{../matplot_hessen1_6.png}}
\caption{Matrix plots of Hessenberg reduction using Givens rotations at each iteration, $N=8$}
\label{fig:matplot_hessen1}
\end{figure}
\begin{figure}[htb]
\centering
\includegraphics[width=0.3\textwidth]{../matplot_givens.png}
\caption{Matrix plot of a Hessenberg-reduced matrix using Givens rotators}
\label{fig:mplot_givens}
\end{figure}
Relative norm with respect to the default \mcode{hess()}: 2.195456923554638e+02.
Relative norm of the eigenvalues compared to the original matrix's eigenvalues: 3.991088526709233e-13.

Note that the relative norm with respect to \mcode{hess()} is large for the Givens variant. This is acceptable: inspecting the matrices visually shows that the magnitudes of the entries are identical, but several elements of the rotated matrix have inverted signs compared with the reflected matrices. Since the eigenvalues match those of the original matrix, this is not a concern; there is no requirement that the reduced matrix be identical across methods, as the Hessenberg form produced by orthogonal similarity is only unique up to such sign changes.

\subsection{Timing}
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{../results.png}
\caption{A log-log plot of the timing of the various Hessenberg reduction functions (column Householder, Givens rotation and the default \mcode{hess()})}
\label{fig:mplot_results}
\end{figure}
\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth]{../PastedGraphic-1.pdf}
\caption{A log-log plot of the timing of the various Hessenberg reduction functions (column Householder, Givens rotation and the default \mcode{hess()})}
\label{fig:mplot_results1}
\end{figure}
Figures \ref{fig:mplot_results} and \ref{fig:mplot_results1} show log-log plots obtained while benchmarking each Hessenberg reduction variant as the size of the input matrix is varied.
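For reference, the following is a minimal sketch of the kind of correctness and timing checks described above; the variable names and the exact form of the relative norms are illustrative, not necessarily those of the actual \mcode{hr_test.m}.
\begin{lstlisting}
% Illustrative sketch of the correctness and timing checks (not the exact hr_test.m).
N = 64;
A = randn(N);
H_ref = hess(A);                                  % built-in reference reduction
H     = hessen(A);                                % e.g. the column Householder variant
rel_hess = norm(H - H_ref)/norm(H_ref);           % agreement with the built-in hess()
rel_eig  = norm(sort(eig(H)) - sort(eig(A)))/norm(eig(A));   % eigenvalue agreement
tic; hessen(A); t = toc;                          % timing for one variant and one size
fprintf('rel. norm vs hess(): %g, rel. eig norm: %g, time: %g s\n', rel_hess, rel_eig, t);
\end{lstlisting}
Repeating the timing over a range of $N$ for each variant produces the data behind the log-log plots above.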
\section{Future Work}
LAPACK suggests several improvements to the Hessenberg reduction: additional blocking methods, a variant using fast Givens rotations, and GPU versions of these. It would be quite interesting to benchmark the GPU variant against the CPU variant for various matrix sizes $N$ to determine the break-even point at which the GPU starts to outperform the CPU. Lastly, parallelized algorithms for Hessenberg reduction would help in understanding hardware implementations of the algorithm for a possible realization in an ASIC/FPGA.

\section{References}
\noindent [1] ``Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited,'' LAPACK Working Note \#208.

\noindent [2] David S. Watkins, \emph{Fundamentals of Matrix Computations}.

\section{Appendix}
\subsection{Matrix Plot Code}
The following is the matrix plot code by Nathan Childress from MATLAB Central. It was used to plot the matrices of the various Hessenberg reduction forms.
\lstinputlisting[label=lst:matrixplot, caption={\mcode{matrixplot.m}}]{../matrixplot.m}

\subsection{Performance Improvement Criteria}
To improve the performance of numerical linear algebra computations on a given computing environment, a few factors need to be taken into account. Knowing these factors helps mathematicians and engineers appreciate proper algorithmic design and use techniques that can improve computational performance manyfold.

Traditional computing architectures had a CPU (central processing unit) whose role was to fetch instructions and data sequentially and execute them one by one. This slowly evolved into SIMD (Single Instruction, Multiple Data), i.e. vectorized computation on the processor. Modern computing environments have multiple cores in the CPU and also use the GPU (Graphics Processing Unit) to perform massively parallel floating point computations.

All computing environments have a small, fast intermediate storage area called a cache, whose purpose is to hold chunks of data fetched from primary or secondary memory. The initial fetch of a chunk may be slow, but once it is in the cache, any data within the chunk can be accessed extremely fast. Cache locality means that keeping data close together in memory helps the cache bring an entire block in from primary memory in one go. However, since the size of the cache is very limited, a complete matrix or vector may not fit into it. The general rule is to improve cache locality as much as possible by storing consecutive elements as close to each other as possible.

Another factor to consider is memory bandwidth: the higher the memory bandwidth, the faster data flows into and out of the CPU and the faster the computation. Since there is a finite limit to this bandwidth, the barrier is usually pushed back by using faster memories, faster buses, and so on. Organizing the calculations performed on a given set of data also improves performance, for example when pipeline bubbles can be avoided by grouping instructions together or by branch prediction in the pipeline. The size of the data also matters: the larger the data fed to the processor, the longer it takes, and the cache is flushed every time a new chunk of data is brought in. And of course, on desktops, the pre-emptive multitasking operating system also steals CPU cycles in between computations. In the case of a GPU, the algorithms can be parallelized to utilize the massively parallel computing elements it offers.
A GPU has a large initial latency overhead, but this is compensated for once the computation starts, provided the problem is sufficiently large.

As far as cache locality is concerned, the general rule is that vector-vector (Level 1) operations are slower than matrix-vector (Level 2) operations, which in turn are slower than matrix-matrix (Level 3) operations in terms of the performance they can achieve. So blocking, that is, grouping vector operations into matrix operations wherever possible, increases the performance of the system.

\end{document}