# Design and Implementation of AGU based FFT Pipeline Architecture

G. Prasanna Kumar<sup>1</sup>, Maturi Sarath Chandra<sup>2</sup>, K Shiva Prasanna<sup>3</sup>, M Mahesh<sup>4</sup> {prasanna4600@gmail.com<sup>1</sup>, sarathchandra\_ece@mvsrec.edu.in<sup>2</sup>, shivaprasanna441@gmail.com<sup>3</sup>, maheshm@sreenidhi.edu.in<sup>4</sup>}

Associate professor, ECE department, Malla Reddy Engineering College (Autonomous), Hyderabad, Telangana<sup>1</sup>, Assistant Professor, ECE department, Maturi Venkata Subba Rao Engineering College, Hyderabad, Telangana<sup>2</sup>, Assistant professor, ECE department, Teegala Krishna Reddy engineering College, Telangana<sup>3</sup>, Assistant professor, ECE department, Sreenidhi institute of Science and technology, Telangana<sup>4</sup>

**Abstract.** Present it is most needful task to get various applications with parallel computations by using a Fast Fourier Transform (FFT) and the derived outputs should be in regular format. This can be achieved by using an advanced technique called Multipath delay commutator (MDC) Pipelining FFT processor and this processor will be capable to perform the computation of a different data streams at a time. In this paper the design and implementation of AGU based Pipelined FFT architecture is done. The proposed instructions calculate a butterfly within two cycles. The proposed architecture employs a Data Processing Unit (DPU) supporting the instructions and an FFT Address Generation Unit (FAGU) automatically calculating the butterfly input and output data addresses. The proposed DPU has a smaller area than commercial DSP chips. Moreover, the number of FFT computation cycles is reduced by the proposed FAGU. The design of FFT architecture will consists of real data paths. The FFT is mapped to pipelined architecture with different FFT sizes, different level of parallelism and various radixes. The most attractive feature of the pipelined FFT architecture is it consists of bit reversal operation so it requires little number of registers and better throughput.

**Keywords:** Fast Fourier transforms (FFT), parallel computation, pipelining, real data path, bit reversal, data stream and twiddle factor.

#### 1 Introduction

FFT (Fast Fourier Transform) is utilized precisely as its name infers - to create indicative test designs ready to recognize fault types or issue areas. Its will likely recognize a greatest number of issues utilizing a negligible number of indicative test designs. Various FFT approaches have been proposed [1]. A large portion of these techniques depend on adjusted traditional FFT. At the point when the quantity of conceivable numerous issue competitors is enormous, it gets infeasible to unequivocally attempt all the potential mixes of various faults utilizing conventional methodologies.

Thus, existing FFT(Fast Fourier Transform) methods are basically utilized for single issue analysis. Another test to existing analytic test design age strategies is demonstrating unrecognized capacity. Issue analysis is a methodology to foresee the areas of imperfections in bombing chips. The data acquired through this strategy can be utilized to control the damaging Physical Failure Analysis (PFA) interaction to arrive at the genuine imperfection areas and discover the reasons for the deformities this is a basic technique for IC (Integrated Chip) assembling organizations to increase the yield of new cycle innovation. One significant advance of deficiency finding is to produce a reasonable arrangement of examples that can be applied to the chips through Automatic Test Equipment (ATE) [3-4] with the end goal that appropriate conclusion data of the chips can be gathered and broke down.

To improve the determination effectiveness of the volume conclusion strategies, a few example age methods have been proposed to create extra examples with high recognize capacity. In this paper this will call these extra examples as fourier transform while the first transformation. As a rule, a decent finding design age technique ought to accomplish three destinations: 1) high recognize capacity to recognize most nonequivalent deficiencies, 2) little analysis design volume, and 3) short conclusion design age time.

The essential thought of this methodology is to adjust the information circuit model by adding some analysis purposed logic into the first circuit with the end goal that the issue of recognizing two deficiency applicants is changed to the issue of distinguishing a solitary stuck to fault, consequently this doesn't requires uncommon ATPG instruments. In this paper it additionally adds this methodology.

In a Circuit Under Test (CUT), when all the conceivable info blends are tried, more force will be naturally burned-through because of the flipping of the data sources. Along these lines, there is a requirement for decrease in the quantity of test designs that will be utilized on the CUT. There are numerous strategies that have conceived to decrease power scattering in the CUT. Vector filtering elemets out a portion of the successions that don't distinguish the issues in the circuit.

# 2 Multipath Delay Commutator (MDC) FFT

FFT calculations are for the most part worked in stages where in each stage information peruse and compose activities are performed by getting to memory . The memory based structures are ordered into single memory engineering and double memory design. In single memory engineering, the handling component is associated with a single memory unit of at any rate N words by a bidirectional transport as demonstrated. Information trades are occurred between the processor (Proc) and memory at each stage utilizing this bidirectional information transport. double memory engineering, where two recollections of size N each are associated with the preparing component with two separate bidirectional information transports. Information contributions from one memory are gone through the preparing component to another memory and the other way around till the change is finished.

Information rate may not be equivalent to the FFT processor recurrence, so the information data sources should go through three stages: input buffering, calculation and yield reordering. Information to be changed is first put away in the information cradle till the N tests

are gathered. At that point this memory is utilized as the computational memory, which is gotten to by the processor. Simultaneously another memory block turns into the info support to store another arrangement of information. The processor sets aside some effort to finish the calculations and stores the halfway information in computational memory. When the change is finished, the computational memory fills in as the yield cushion.

Memory based designs for processing a N point radix-r FFT require  $((N/r) \log r N)$  memory gets to where r words are perused from and kept in touch with memory at each entrance. The clock recurrence ought to be logr (N/r) times the recurrence of information inputs in light of the fact that just one preparing component is dealing with the calculations.

#### 3 Radix-2 Pipelined FFT Architecture

Pipelined FFT models are quick and high throughput structures with parallelism and pipelining. Despite the fact that the equipment intricacy is high and less adaptable contrasted with different models, they offer high throughput and energy productive executions. Here we present some usually utilized pipelined structures, for example, Multi-way Delay Commutator (MDC) and Single-way Delay Feedback (SDF).

In this sort of designs, input succession is first isolated into different equal information streams by commutator and afterward, butterfly activity followed by fidget factor duplication is performed with legitimate deferrals to every information stream. In radix-2 MDC (R2MDC), input information stream is isolated into two equal ways as demonstrated in fig. 1 for N equivalent to 16. Absolutely, (log2N - 1) complex multipliers, log2N butterfly units and (3N/2 - 2) defer cradles are required.

All the butterfly units and multipliers can be used at 100% with legitimate information buffering. Fig. 6 shows the radix-4 MDC (R4MDC) engineering, in which four equal streams are prepared immediately. An aggregate of  $(3\log 4N - 1)$  complex multipliers, log 4N radix-4 butterfly units and (5N/2 - 4) expressions of memory is required.



#### Fig. 1. Idea of the proposed method



Fig. 2. Proposed 16-point radix-2 FFT architecture with outputs in natural order

In single way defer input models, a solitary information stream goes through multiplier in each stage. The defer units are more productively used by dividing a similar stockpiling among the sources of info and yields of the butterfly. Higher Radix MDC structures are not favored on the grounds that of their gigantic prerequisite of planning cushions though the higher radix SDF structures are favored on the grounds that the quantity of complex multipliers is decreased, equipment usage is upgraded furthermore, needs less memory. A cautious execution of higher radix preparing components is required on the grounds that the intricacy of adders may increment if not carried out by a few fell radix-2 butterfly units.

Among these structures, memory based designs and Pipelined structures are broadly embraced. Table 2 shows the examination between pipelined SDF design and memory based engineering for radix-r N point FFT execution. Both the structures have practically similar capacity necessities where in memory based design, principle memory is divided into r banks for concurrent access and in pipelined SDF, memory is appropriated into log2N banks. Force utilization can be diminished in pipelined SDF design with productive executions of consecutive cradles due to its successive access while in memory based engineering, to accomplish a contention free memory access, arbitrary tending to is important. Memory based design necessities to drive its clock logr (N/r) times the processor recurrence to accomplish a similar exhibition as pipelined SDF design.

Thus, pipelined designs are favored when execution and power are the primary sources than the intricacy. Then again memory based designs are acceptable decision where intricacy is of primary concern.

## 4 AGU Based Fft Pipelined

Any single FFT calculation itself isn't the best appropriate for all sorts of equipment stages. In this way, for better enhancement, a best reasonable calculation ought to be chosen for given equipment. Low power utilization, less territory, consistency in design and high execution are the primary worries of FFT enhancements. In the writing, diverse FFT calculations are received in various FFT structures for accomplishing required objectives in explicit applications. Here, we present the latest things in FFT engineering enhancement furthermore, best in class.

|         | 1st stage                                                                                        | 2nd stage      | 3rd stage     | -4th stage |               |
|---------|--------------------------------------------------------------------------------------------------|----------------|---------------|------------|---------------|
| ×(0)9   | , <sup>3</sup> , <sup>1</sup>                                                                    |                |               |            | 4 10 ×(0)     |
| x(4)q   | 130,                                                                                             | +3 5 W         |               |            | \$ 10, X(8)   |
| x(1) 0  | 1/32,                                                                                            |                | V. a at       | X,®        | 411,Xr(4)     |
| x(5)0   | 11///30,                                                                                         | -36, W         | 1.8 × 8.8. "  |            | 4 11,X,(4)    |
| ×(2)0   | $\left  \left  \right  \right  \right  \left  \left  \left  \frac{1}{3} \right  \right  \right $ |                | 11.0 30-      |            | 5 12 × (2)    |
| ~(6)0   | $(XX)/f_{3}$                                                                                     | 3 (T) WO       | 12 39 1       | 0 12       | 5 12, X, (10) |
| ~(2)0   |                                                                                                  |                |               | ΞX         | 5(13) X (2)   |
| ~(5)~   |                                                                                                  | 4 B W°         | (B) 4 10 W    | 2 (12      | 513 X (10)    |
| ×(/)9   |                                                                                                  |                | 1.9 .50       | 13         | \$ 14.X(1)    |
| X(8)6   |                                                                                                  |                | 1 9 5 13 14   | 0 14       | \$ 14 X.(9)   |
| ×(12)   | ///////////////////////////////////////                                                          |                | 2 10 / 6 12   |            | . £15 X(1)    |
| ×(9) 8  |                                                                                                  |                |               | 0 14       | 6 (15 X (9)   |
| x(13)   |                                                                                                  |                |               | (15)       | 7 16 7 (5)    |
| ×(10)   | S/// 11183 .                                                                                     | $W^2$          |               | 0 16       |               |
| ×(14) d | $\frac{1}{8}$                                                                                    |                |               | - V 10     | 7 (13)        |
| x(11)d  | 5/ \ <sup>3</sup> <sup>4</sup> <sup>4</sup>                                                      | w <sup>3</sup> | **** \*** W   | 4 40 >>    | X,(5)         |
| x(15)d  | $\overset{{}_{4}}{\longrightarrow} \overset{{}_{6}}{\longrightarrow}$                            | →o (12)        | 4 4 8 →8 14 → |            | -6 0 X (13)   |

Fig. 3. Regular flow graph corresponding to Pipelined FFT

In this figure, genuine and nonexistent parts have been isolated Twiddle factors are moved to resulting stages depending on the situation, with the end goal that the changed stream chart contains one segment of BFs followed by one segment of Wk-blocks, and every segment in the stream diagram contains a sum of N genuine and fanciful examples. First and last stages needn't bother with Wk-blocks. Fig. 3 shows the standard design acquired after this change.



Fig. 4. Proposed two-parallel pipelined VLSI architecture for a 16-point DIF RFFT

Any single FFT calculation itself isn't the best appropriate for all sorts of equipment stages. Along these lines, for better streamlining, a best reasonable calculation ought to be chosen for given equipment. Low power utilization, less territory, consistency in construction and high execution are the primary sources of FFT advancements. In the writing, distinctive FFT calculations are embraced in various FFT models for accomplishing required objectives in explicit applications.



Fig. 5: Agu Based Fft Pipelined Architecture

The above figure (5) shows the block diagram of AGU based FFT pipelined architecture. The proposed FAGU consists of several counters, registers, etc. The yield locations of the FAGU are stacked into the balance address registers in the AGU. At that point the stacked addresses are utilized for counter balance locations of base locations. For instance, if a base location register is R0 and a counterbalance register is N0, the determined location which is the yield address of the AGU.

By utilizing set up FFT calculations and shared memory cradles, the memory utilization in memory based models can be incredibly diminished. A blended radix calculation is utilized with set up calculation system for struggle free memory access. The design comprises of just one butterfly unit prepared to do playing out a radix-4 activity or two radix-2 tasks. With this engineering, the region, computational clock recurrence and force utilizations are diminished. As of late, in the design introduced, the equivalent set up calculation system is utilized and an altered radix-2 calculation to stay away from excess tasks is received for genuine esteemed signs.

Scientists have proposed various changes to pattern designs for enhancing the handling components and postponement cradles to accomplish low force and high throughput. In memory based models, double port SRAMs are utilized for expanding the speed and decreasing the force utilization. To accomplish high throughput and decrease the increase intricacies in pipelined designs, FFT calculations with higher radices are received in light of radix-2k. With this methodology the routineness of construction is kept up without expanded intricacy.

# **5** Results

The below figure (6) shows the comparison of accuracy for FFT and AGU based FFT. The accuracy of AGU based FFT is very high.



COMPARISON OF ACCURACY

The below figure (7) shows the comparison of delay for FFT and AGU based FFT. The delay of AGU based FFT is reduced very effectively.



### COMPARISON OF DELAY

Fig. 7. Comparison of Delay

## **6** Conclusion

Hence in this paper the design and implementation of AGU based FFT pipelined architecture was implemented. This brief has presented a normal i/o order radix-2 multipath delay commutator fft pipelined architecture for real-valued fft for the computation of the fast Fourier transform (FFT) of real signals and inverse FFT of Hermitian-symmetric signals using only real datapaths whose outputs are generated in the natural order. The bit reversal circuit present in prior designs is eliminated by integrating two FFT processors and the registers, which are present in the architecture, are reused for bit reversal. The real FFT structure is transformed by transferring twiddle factors to subsequent stages, such that each stage in the proposed flow graph contains one column of butterfly units and one column of twiddle factor blocks, and each column of the flow graph contains only *N* samples. These attributes make the proposed FFT processor superior in sense of hardware complexity and performance. Moreover, the proposed architecture provides throughput higher than the prior architectures.

#### References

- Z. Wang, X. Liu, B. He, and F. Yu, "A combined SDC-SDF architecture for normal I/O pipelined radix-2 FFT," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 5, pp. 973–977, May 2015.
- [2] A. X. Glittas and G. Lakshminarayanan, "Pipelined FFT architectures for real-time signal processing and wireless communication applications," in Proc. 18th Int. Symp. VLSI Design Test, Jul. 2014, pp. 1–2.
- [3] P. P. Boopal, M. Garrido, and O. Gustafsson, "A reconfigurable FFT architecture for variablelength and multi-streaming OFDM standards," in Proc. IEEE ISCAS, May 2013, pp. 2066– 2070.
- [4] K.-J. Yang, S.-H. Tsai, and G. C. H. Chuang, "MDC FFT/IFFT processor with variable length for MIMO-OFDM systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 4, pp. 720–731, Apr. 2013.
- [5] M. Ayinala, M. Brown, and K. K. Parhi, "Pipelined parallel FFT architectures via folding transformation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 6, pp. 1068– 1081, Jun. 2012.
- [6] S.-N. Tang, C.-H. Liao, and T.-Y. Chang, "An area- and energyefficient multimode FFT processor for WPAN/WLAN/WMAN systems," IEEE J. Solid-State Circuits, vol. 47, no. 6, pp. 1419–1435, Jul. 2012
- [7] M. Garrido, J. Grajal, and O. Gustafsson, "Optimum circuits for bit reversal," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 657–661, Oct. 2011.
- [8] Y. Voronenko and M. Püschel, "Algebraic signal processing theory: Cooley-Tukey type algorithms for real DFTs," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 205–222, Jan. 2009.
- [9] .Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, "A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems," IEEE J. Solid-State Circuits, vol. 43, no. 5, pp. 1260–1273, May 2008.
- [10] Y.-N. Chang, "An efficient VLSI architecture for normal I/O order pipeline FFT design," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 12, pp. 1234–1238, Dec. 2008.
- [11] Y.-W. Lin and C.-Y. Lee, "Design of an FFT/IFFT processor for MIMO OFDM systems," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 4, pp. 807–815, Apr. 2007.
- [12] H. F. Chi and Z. H. Lai, "A cost-effective memory-based real-valued FFT and Hermitian symmetric IFFT processor for DMT-based wire-line transmission systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2005, vol. 6, pp. 6006–6009.
- [13] Y. W. Lin, H. Y. Liu, and C. Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. Solid-State Circuits, vol. 40, no. 8, pp. 1726–1735, Aug. 2005.
- [14] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing., 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1998.
- [15] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Proc. IEEE Custom Integr. Circuits Conf., Santa Clara, CA, USA, May 1998, pp. 131–134.
- [16] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proc. 10th Int. Parallel Process. Symp., 1996, pp. 766–770