Over the past 40 years, the computing industry has benefited from persistent growth largely attributed to improvements in processor performance. Many significant advances have been made in areas such as processor architecture, on-chip and off-chip memory, as well as data transfer interfaces. The ability to transfer data through high-speed interfaces, and access memory within a few tens of cycles has contributed to the overall improvement of computing platforms, but it is the throughput of the processing architectures, when presented a set of instructions, which plays the most significant role in the overall performance of the system. Traditionally, application code has been programmed and compiled with the underlying architecture and memory system in mind. Masking memory latencies, predicting cache misses, and leveraging specific vector instructions became the norm for squeezing performance from a given system. In the days before wide-scale parallelism, where sequential processors dominated, much reliance existed on the hardware vendors to continue to scale performance as transistor densities improved. The ability to obtain a speedup of legacy software by simply waiting 18 months for newer faster hardware to emerge became the ‘free lunch’ of the computing industry, but also its Achilles heel. The lack of preparation from the software community to adopt and define parallel software coding practices and standards has seen the development of compilers and supporting languages for parallelism lagging behind.
Today, great advances have been made; however the tenet of ‘design first, figure out how to program later’ still lingers in the corridors of Silicon Valley. The focus of this study is however not on making a contribution to compilers or software development, nor on determining an efficient generic parallel processing architecture for all classes of computing. Instead, this study adopts a different design approach, where a class of computing is first selected and analyzed, before determining a suitable hardware structure which can be tailored to the class being considered. The class of computing under investigation in this work is Spectral Methods, which by its very nature, has its own processing and data communication requirements. The purpose of this study is to investigate the processing and data handling requirements of the Spectral Methods class, and to design a suitable framework to support this class. The approach is different from past traditions – the hardware framework is based on software requirements, and in a sense is designed for the processing required, rather that the other way around. The underlying computation in Spectral Methods is the Fourier transform, which in turn can be analyzed to reveal the underpinning of complex arithmetic. Further consideration of the class shows that other operations possible in Fourier space are also reliant on complex arithmetic.
The goal of achieving a high-throughput system through parallelism with minimal data dependencies steered the study to identify a parallel technique to compute the Fourier transform, while making it possible to use the underlying hardware to support additional Fourier space operations. Investigations revealed that parallel scalable algorithms to compute the Fourier transform do exist, and with suitable modifications could be made to perform both forward and reverse Fourier transforms with error correction abilities for finite wordlength arithmetic.
To perform the Fourier transform this study implemented a recursive technique which allowed the updating of the output on a sample-by-sample basis. Techniques such as the Fast Fourier Transform (FFT) by contrast, operate in a block-data fashion where a set of samples are first required before the computation can be performed. The ability to compute a Fourier update sample-by-sample opens up new opportunity for increased time-frequency resolution, while still maintaining the ability to compute either a forward or reverse transform on blocks of data if required. The sample-by-sample updating requires the recursion of results which in turn introduced the negative side-effect of arithmetic error accumulation for finite-bit arithmetic.
To bound the error growth, this study further develops an error computation and correction technique, making it possible to determine the error at any time instant for use in correcting the respective Fourier output. The results showed that it is possible to error correct and limit the error in any output for finite-bit arithmetic, and achieve bounded square error results in the order of 10−6 for 36-bit word lengths using 26-bit precision (root mean square error of the order 10−3 after ≈ 106 iterations). Furthermore, performance results indicate that giga-operations per second with throughputs exceeding 29GBPS are possible for a DFT of length 64 operating at 82MHz on Virtex 5 FPGA hardware. This design included automatic error computation and correction to ensure bounded errors, and provides suitable competition for FFT implementations on multi-core CPU’s and multi-threaded GPU’s.
On reviewing the study objectives, hypotheses, and research questions, it can be concluded that it is possible to obtain scalable performance using a many processing element framework for the fundamental computations of the Spectral Method computing class. This was shown in both the software and hardware simulations and tests, and was made possible through the adoption and implementation of algorithms which support concurrency. It also became clear throughout the study that checks and balances are required, as the algorithm selected due to its advantages of sample-by-sample updating and ease of implementation carried an error cost which needed to be rectified. Jointly considering the research questions, a number of operations were identified for the class (Fourier transform being the dominant one), and through careful evaluation and selection of the algorithm, it was shown that operations such as the Fourier transform can be executed with scalable linear performance using many architecturally simple processing elements. This study has been successful in the objectives stated, and has showed that the design paradigm of ‘assess first, design later ’ in fact can be beneficial for at least one class of computing.