## An Efficient Discrete Wavelet Transform Architecture with Low Power and Multiplier-Less Structure for Pervasive Biomedical Image Processing Application

Maram Anantha Guptha<sup>1\*</sup>, Surampudi Srinivasa Rao<sup>2</sup>, Ravindrakumar Selvaraj<sup>3</sup>

<sup>1</sup>Research Scholar, ECE Department, Sri Satya Sai University of Technology and Medical Science, Madhya Pradesh, India
<sup>2</sup>Principal, Malla Reddy College of Engineering and Technology, Telangana, India 500014.
<sup>3</sup>Associate Professor, Biomedical Engineering, Sri Shakthi Institute of Engineering and Technology, Coimbatore, Tamilnadu, India 641062.

## Abstract

INTRODUCTION: Over the past several years, image analysis has moved from large systems to pervasive portable devices. For example, in pervasive biomedical systems such as the Picture Archiving and Communication System (PACS), computing is the main element. Image processing applications for biomedical diagnosis need efficient and fast algorithms and architectures. Future pervasive systems designed for biomedical applications should provide computational efficiency and portability. The discrete wavelet transform (DWT), implemented on-chip, has been used in several applications such as data processing, audio signal processing and machine learning.

OBJECTIVES: The conventional convolution-based scheme is easy to implement but requires more memory and power and incurs a longer delay. The conventional lifting-based architecture has multiplier blocks, which increase the critical path delay. Designing the wavelet transform without multipliers is an effective approach, especially for 2-D image analysis. A multiplier-less implementation of the Daubechies wavelet in the forward and inverse transforms may prove efficient. The objective of this work is to obtain a low-power, low-delay architecture.

METHODS: The proposed lifting scheme for the two-dimensional architecture reduces the critical path through a multiplier-less structure and provides low power, low area and high throughput. The proposed shift-and-add multiplication is delay efficient.

RESULTS: The architecture is multiplier-less in the predict and update stages. It was implemented on an FPGA using Quartus II 9.1, and power consumption was found to be reduced by approximately 56%. Delay is also reduced owing to the multiplier-less architecture.

CONCLUSION: The multiplier-less architecture provides lower delay and low power. The observed power is in the milliwatt range, and the low critical path delay makes the design suitable for high-speed applications.

Keywords: CMOS, Power Efficient, Multiplier-Less, DWT Architecture, FPGA, Lifting Based.

Received on 18 September 2021, accepted on 27 December 2022, published on 10 January 2023

Copyright © 2023 Maram Anantha Guptha *et al.*, licensed to EAI. This is an open access article distributed under the terms of the <u>Creative Commons Attribution license</u>, which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.

doi: 10.4108/eetpht.v9i1.3176

#### 1. Introduction

In recent years, image analysis has moved from large systems to portable devices. For example, in biomedical systems such as the Picture Archiving and Communication System (PACS), computing is the main element. The speed and density of electronic components have increased exponentially to meet these demands. The implementation of methods for analysing biomedical signals and images has grown strongly in recent years. Among them, the DWT plays a crucial part in image and signal processing. Compared to other transforms, wavelet methods are flexible in design and can be implemented easily in programmable arrays. The convolution-based and lifting-based architectures each have their own advantages and disadvantages. Multipliers are the basic building blocks; they occupy more area, consume more power and lead to longer latency.



<sup>\*</sup>Corresponding author. Email: ananthaguptha402@gmail.com

This limits the density and computing power of integrated circuits.


#### 2. Literature Survey

In the literature, several authors have presented algorithms and implementations of wavelets for image processing applications (Chaitali, 1995). These architectures were used for 1-D and 2-D analysis (Lewis et al, 1991). Generic architectures based on the biorthogonal wavelet transform were popular in the past. These systems were implemented using generic structures and scalable architectures (Shahidmasud and John, 2001), in which the word lengths are parameterized. The multiplier-less implementation can be extended to a systolic array architecture that includes multipliers (Nayak S, 2005). The array architecture uses a single clock cycle for all filter coefficients, and its advantage is a higher utilization of a smaller number of registers. Patrick Longa et al (2007) proposed a LUT-based integrated decimation unit using distributed structures. The throughput achieved is nominal and partitioning was reduced. Using polyphase FIR filter structures, Jose and Thomas (2008) designed a 1-D DWT architecture. The design utilized the features of digital processing elements. Convolution-based architectures are easy to implement but require more elements, especially multipliers, while the lifting-based scheme reduces the number of multipliers and pipelining stages (Wei Zhang et al, 2012; Mohanty et al, 2011). Several architectures using different multipliers are presented in the literature (Senthilkumar, 2019, 2018); most architectures need memory to save intermediate results. Some architectures for the 2-D transform used multilevel analysis with scalable structures (Yusong and Ching, 2013), a folded architecture or a digit-serial architecture (Keshab and shri 2009). These architectures were suitable for one-dimensional and two-dimensional analysis. To speed up the wavelet transform operation, Chengjun et al (2009) presented a clock-reduced pipeline architecture to perform the DWT. A similar 2-D discrete wavelet transform architecture was proposed by Zhang C et al (2012).

Designing the wavelet transform without multipliers is an effective approach, especially for the 2-D Daubechies wavelet transform used in image analysis. A multiplier-less implementation of the Daubechies wavelet in the forward and inverse transforms may prove efficient. Various filters such as the 9/7 and 5/3 filters have been used, with bit shifts employed to reduce adder counts (Pramod Kumar Mehar et al, 2015). Pipelining reduces the power and delay.

# 2.1 Problem Statement and Objective

The objective is to design a multiplier-free DWT architecture that is efficient in power, area or computation. Since in large-scale integration power depends directly on area, speed and supply voltage, the optimization can be carried out in only one parameter at a time. Several methods have been implemented in the literature, as discussed above. These methods are effective, but certain aspects can be improved, such as the critical path delay or the use of new device approaches in place of CMOS. In particular, the critical path delay caused by the multipliers is an important parameter that should be addressed. The processing elements in the signal processing block, consisting of adders, multipliers and so on, give rise to the critical delay, and the major part of it is contributed by the multiplier unit. This work addresses the problems caused by the multiplier. The paper presents a detailed investigation of the power and delay attributable to the multiplier, and a multiplier-less architecture using CMOS is presented. The main aim is to design a multiplier-less lifting-based DWT architecture using CMOS circuits on an FPGA.

## 3. Background Methodology

The previous section surveyed the different methodologies adopted in the literature for the design of DWT architectures.



# 3.1. Discrete Wavelet Transform (DWT)

Unlike the Fast Fourier Transform (FFT), which is extensively used to analyze stationary quantities, the DWT is suited to non-stationary signal analysis (Chakrabarti, Viswanath, 1996). The analyzed signal is band-pass filtered into the frequency bands of the DWT decomposition (C. Chakrabarti et al, 1993). The time information of the signal is preserved by the wavelet, and the filtering is carried out by the transform itself. The bands associated with the signal are determined using various levels of decomposition (P. P. Vaidyanathan, 1987). Moreover, the transform analysis differs for a sinusoidal signal, a superposition of sinusoidal signals and a concatenation of signals. Conventional methodologies used software packages and hardware such as DSP processors, integrated circuits, application-specific integrated circuits and FPGAs. The simplicity and low computational burden of the DWT helped researchers and engineers to develop hardware for DWT architectures (Oliver, Malumbres, 2008), (Das et al, 2010).

#### 3.1.1. Lifting scheme

The lifting scheme is an alternative to the convolution-based architecture when high throughput is required (Darji, A et al, 2011), (Mohanty, B.K et al, 2012). The wavelet is constructed, in 1-D or 2-D, through prediction and update steps. The lifting scheme factors the polyphase matrix into triangular matrices using the Euclidean algorithm, so that the filter implementation is divided into multiplications by banded matrices (M. Vetterli, 1991). The pipelined processor is implemented as a cascade of the triangular matrices followed by a scaling matrix. Integer computation, symmetry and in-place computation are the advantages of the lifting-based architecture (C.-H. Hsia et al, 2013).
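The predict and update steps described above can be sketched in a few lines. The example below is only an illustration: it uses the reversible integer 5/3 filter (mentioned in the literature survey) rather than the 9/7 filter of this paper, and the function names and the mirror boundary handling are assumptions made for the sketch. It shows integer-only computation and exact reconstruction by running the same steps in reverse.

```python
def lifting_53_forward(x):
    """One level of integer 5/3 lifting on an even-length list of ints."""
    assert len(x) % 2 == 0 and len(x) >= 2
    s = x[0::2]  # even samples (become the low-pass output)
    d = x[1::2]  # odd samples (become the high-pass output)
    half = len(s)
    # Predict: subtract from each odd sample the floor-average of its
    # two even neighbours (mirror extension at the right boundary).
    for i in range(half):
        right = s[i + 1] if i + 1 < half else s[i]
        d[i] -= (s[i] + right) >> 1
    # Update: add to each even sample a rounded quarter of the two
    # neighbouring detail values (mirror extension at the left boundary).
    for i in range(half):
        left = d[i - 1] if i > 0 else d[i]
        s[i] += (left + d[i] + 2) >> 2
    return s, d


def lifting_53_inverse(s, d):
    """Undo lifting_53_forward by reversing the steps and their signs."""
    s, d = list(s), list(d)
    half = len(s)
    for i in range(half):  # undo update
        left = d[i - 1] if i > 0 else d[i]
        s[i] -= (left + d[i] + 2) >> 2
    for i in range(half):  # undo predict
        right = s[i + 1] if i + 1 < half else s[i]
        d[i] += (s[i] + right) >> 1
    x = [0] * (2 * half)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because each step only adds a function of the other half of the samples, the inverse needs no division or matrix inversion, which is what makes the scheme attractive for hardware.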

## 3.1.2. Architecture

The lifting-based DWT architecture with two predict and two update stages is shown in Figure 1. The input is first split into odd and even samples, and the even samples are applied to the predict step. Next, the even samples are updated with the help of the newly calculated odd samples so that the necessary properties are maintained. All samples are transformed by applying these steps repeatedly to the samples of the input signal. When designing the processor core, the controller should choose the proper lifting coefficient for each clock cycle. The registers in each half of the structure should be driven by the even and odd clocks. For example, the pixels in an image processing application are split into even and odd since they arrive serially at a rate of one pixel per clock cycle. The DWT architecture gives low-pass and high-pass coefficients at the output for each pair of pixel values. The lifting coefficients corresponding to the filter coefficients are supplied to each stage. The update and predict blocks contain multipliers, so the execution time of the DSP algorithm can be minimized by using a high-speed multiplier (Tung T. H, et al, 2010).
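A floating-point software model can make the two-predict, two-update structure concrete. The sketch below is an assumption-laden reference model, not the hardware of Figure 1: the coefficient values are the commonly quoted 9/7 lifting constants, the scaling convention (low-pass multiplied by K, high-pass divided by K) is one of several in use, and the helper names and mirror boundary handling are chosen for illustration.

```python
# Commonly quoted 9/7 lifting constants (assumed here, not taken from the paper)
ALPHA = -1.586134342   # predict 1
BETA = -0.05298011854  # update 1
GAMMA = 0.8829110762   # predict 2
DELTA = 0.4435068522   # update 2
K = 1.149604398        # scaling


def _predict(s, d, c):
    """d[i] += c * (s[i] + s[i+1]) with mirror extension on the right."""
    h = len(s)
    return [d[i] + c * (s[i] + (s[i + 1] if i + 1 < h else s[i]))
            for i in range(h)]


def _update(s, d, c):
    """s[i] += c * (d[i-1] + d[i]) with mirror extension on the left."""
    return [s[i] + c * ((d[i - 1] if i > 0 else d[i]) + d[i])
            for i in range(len(s))]


def dwt97_forward(x):
    """One level of 9/7 lifting on an even-length sequence."""
    s = [float(v) for v in x[0::2]]  # even samples
    d = [float(v) for v in x[1::2]]  # odd samples
    d = _predict(s, d, ALPHA)
    s = _update(s, d, BETA)
    d = _predict(s, d, GAMMA)
    s = _update(s, d, DELTA)
    return [v * K for v in s], [v / K for v in d]


def dwt97_inverse(low, high):
    """Reverse the four lifting steps with negated coefficients."""
    s = [v / K for v in low]
    d = [v * K for v in high]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2] = s
    x[1::2] = d
    return x
```

Each `_predict` and `_update` call contains exactly one multiplication by a constant, which is the operation the multiplier-less design replaces with shifts and adds.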



Figure 1. Architecture of Lifting Scheme.

#### 3.1.3. Architecture for 2D Analysis

For image processing, the DWT structure should have a 1-D row processor (RP) and a column processor (CP) operating on the image matrix. An internal memory such as an SRAM stores the intermediate 1-D row-processed coefficients. The Z-scanning method, commonly known as the dual line scan based structure, reduces the latency. This scanning is independent of the image size. In addition, fewer registers are used than in conventional line-based architectures (Huang et al, 2000). Hardware architectures can be optimized to accommodate different scaling factors.

In this way the area, power and latency are reduced. Pipelining reduces the critical path due to the multiplier (Dillen et al 2003), at the cost of a minor increase in latency.
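The row-processor and column-processor arrangement can be illustrated with a small software model. The sketch below is a simplification under stated assumptions: a 1-D integer Haar lifting step stands in for the 9/7 processing element, the intermediate list plays the role of the internal memory, and all function names are hypothetical.

```python
def haar_lift(row):
    """1-D integer Haar lifting; returns the low-pass half then the high-pass half."""
    even, odd = row[0::2], row[1::2]
    high = [o - e for e, o in zip(even, odd)]         # predict
    low = [e + (h >> 1) for e, h in zip(even, high)]  # update
    return low + high


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def dwt2d_one_level(image):
    """Apply the 1-D transform to every row, then to every column."""
    row_processed = [haar_lift(r) for r in image]  # row processor (RP)
    # row_processed models the intermediate coefficients held in memory
    col_processed = [haar_lift(c) for c in transpose(row_processed)]  # column processor (CP)
    return transpose(col_processed)
```

For an even-sized square image the result holds the four sub-bands in its quadrants, with the low-low band in the top-left corner.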

## 3.1.4. Z-scanning method

The 2-D DWT architecture is implemented in a direct mode if the direct scan methodology is used. In the direct scan method, a row-wise 1-D DWT is performed first, followed by a column-wise 1-D DWT, and the results are stored in a memory of size $N^2$. Performing one level of 2-D DWT in this way requires a complete external memory of size $2N^2$, which may increase energy consumption (Acharyya, A., et al, 2009). The line-based method uses internal line buffers to store the intermediate coefficients, and the line buffer size depends on the input frame size. In the block-based approach, the inputs are divided into individual blocks and supplied to a parallel architecture. This is advantageous with regard to its internal cache, but the addressing makes it inefficient for streaming applications.

During the scanning process, the pixels in the first row of the input frame are read first, the next two pixels are read during the next clock cycle, and this process is repeated for every row in the input frame. Here the RP performs the row transform of adjacent rows and starts the 1-D column processing with boundary treatment. For a frame of size $N \times N$, $N^2/2$ clock cycles are used for the read operation in the scanning process. Z-scanning generally reduces the memory used for transposition. To handle the row processing and boundary treatment at the frame boundary, the registers D1, D2, D3 and D4 are loaded with appropriate values; here, the registers are initialized with zeros for boundary treatment. Figure 2 illustrates the Z-scanning process.



Figure 2. Z scanning method.
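One plausible reading of the Z-scanning order is sketched below. The exact pixel sequence of Figure 2 cannot be recovered from the text, so the order used here (two pixels per clock, alternating between two adjacent rows) is an assumption; it is chosen because it reproduces the stated read time of $N^2/2$ clock cycles. The function name is hypothetical.

```python
def z_scan_order(n):
    """Yield, per clock cycle, the two (row, col) pixels read from an n x n frame."""
    assert n % 2 == 0, "sketch assumes an even frame size"
    for r in range(0, n, 2):      # take the rows two at a time
        for c in range(0, n, 2):  # move across in steps of two pixels
            yield [(r, c), (r, c + 1)]          # pair from the upper row
            yield [(r + 1, c), (r + 1, c + 1)]  # pair from the lower row


if __name__ == "__main__":
    for clock, pixels in enumerate(z_scan_order(4)):
        print(clock, pixels)
```

Because two adjacent rows advance together, column neighbours become available within a few cycles of each other, which is why only a handful of registers are needed instead of a full line buffer.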

#### 4. DWT Architecture with Multiplier

A pipelined lifting-based DWT architecture with a reduced number of multipliers is illustrated in Figure 3. The critical path to the adder is reduced by the use of shift-and-add multipliers. The DWT is used in a variety of applications such as signal processing, machine learning, signal coding, data compression, data hiding, data interpretation, geophysics, motion tracking and meteorology. It is used in situations where scalability and tolerable degradation are necessary.



Figure 3. DWT with Multiplier Architecture.

Similar to the discrete cosine transform, the DWT architecture is good in its compression ratio; in addition, it has no blocking artefacts, excellent localisation in the frequency and time domains, inbuilt scaling and greater flexibility. The implementation is done in CMOS technology.

## 5. Proposed Method: DWT without Multiplier

The data flow of the 9/7 lifting steps in our proposed two-input/two-output 1-D row processing element is as follows. In the first lifting stage, the predict module receives the even and odd inputs at its first clock cycle. The previous even input and the present even input ($s^0_i$, $s^0_{i+1}$) are added together. Next, the multiplication is performed with shift-and-add arithmetic. Then, during the fourth clock cycle, the multiplication result is added to the odd input sample ($d^0_i$) to calculate the first predict coefficient ($d^1_i$). The first update value is estimated from the past and present predict values and the input during the fifth clock cycle. In the data flow graph only adders are used for computation, so the critical path is reduced to one adder delay. Pipelining increases the speed of operation. The RP and CP stages consist of 2D/N delay registers, which are utilised for the row and column processors. In our proposed method the first lifting stage is constructed with seven adders/subtractors and four shifters. Similarly, ten adders and eight shifters are used for the second lifting stage. The predict and update stages of the 1-D row and column processing elements in the four-stage pipelined design are fully pipelined with pipeline registers, denoted by vertical dashed lines. The input of the RP needs five delay registers, two for retaining even inputs and the other three for the predict module. To compensate for the three- and two-clock-cycle delays that occur in the pipeline stages of the predict 1 and update 1 modules, the predict 1 stage a/e of the RP and CP are delayed by six clock cycles. Likewise, delays of six and seven clock cycles are introduced in the update 1 output and predict 2 respectively. For column processing, the outputs from the row predict 2 and update 2 processes are connected to the CP.
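The shift-and-add arithmetic used in place of a multiplier can be sketched as follows. The shift pattern shown is an illustrative approximation of the first 9/7 predict coefficient (about 1.586 in magnitude) chosen for this example; the actual decomposition used in the proposed hardware is the one in the figures and is not reproduced here. The function and constant names are hypothetical.

```python
def shift_add_multiply(x, shifts, negate=False):
    """Approximate x * sum(2**-k for k in shifts) using only shifts and adds."""
    acc = 0
    for k in shifts:
        acc += x >> k  # each term is x divided by 2**k, truncated toward -infinity
    return -acc if negate else acc


# Assumed pattern: 1 + 1/2 + 1/16 + 1/64 + 1/128 = 1.5859375
ALPHA_SHIFTS = (0, 1, 4, 6, 7)


def predict1(s_prev, s_next, d_odd):
    """First predict step for integer samples: d1 = d0 - 1.586 * (s_i + s_{i+1}), approximately."""
    return d_odd + shift_add_multiply(s_prev + s_next, ALPHA_SHIFTS, negate=True)
```

With this pattern a constant multiplication costs four additions and no multiplier; each shift is a fixed wiring choice in hardware, so the delay is set by the adders alone.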



Figure 4(a). Schematic block diagram of Four stage pipelined architecture of 1D row processor.





Figure 4(b). Schematic block diagram of four stage pipelined architecture of 1D column processor.



Figure 4(c). Scaling1 of Four stage pipelined architecture of 1D row and column processing elements.



Figure 4(d). Scaling2 of Four stage pipelined architecture of 1D row and column processing elements.

In each clock cycle of our proposed architecture, pixels are acquired and filter coefficients are calculated. The output from the RP is sent to the transposing unit (TU). The TU transposes the incoming coefficients and feeds them to the input of the CP. The CP is designed to have double the throughput. Owing to the Z-scanning method, our proposed design uses only five registers, whereas a conventional line-based 2-D DWT needs a transpose buffer of size 1.5N. Intermediate stages are used to hold the previous predict and update values. The architecture is illustrated in Figures 4(a) to 4(d).

The multiplier-less architecture is shown in Figures 5 and 6.



Figure 5. Architecture of Predict 1 and Update 1 stage without multiplier.



Figure 6. Architecture of Predict 2 and update 2 stage without multiplier.

## 5.1. Transposing unit

The overall 2-D DWT operation is made possible by the TU switching mechanism, which feeds two alternate 1-D row coefficients to the CP. It uses five registers and two $2 \times 1$ multiplexers and is therefore considered efficient. Owing to the use of Z-scanning, our proposed method is power and area efficient and does not depend on the size of the image. With Z-scanning, the column processing is synchronized with the row processing in consecutive cycles. In a line-based architecture, the CP has to hold for some time until an entire row is available.

## 5.1.2. D Flip Flop

In digital circuits, data is generally stored as groups of bits representing numbers and codes. It is therefore convenient to take the data on parallel lines and store it in a linear array of flip-flops. Registers, the general multi-bit storage elements that hold several bits of data, are constructed by connecting D flip-flops. The same clock input is given to every flip-flop, while each flip-flop is connected to a separate data input. On a positive clock edge, each flip-flop stores the data present at its D input.

#### 5.1.3. Data Transfer

D flip-flops connected in cascade and driven by the same clock signal form a shift register, which is widely used in data transfer applications. The data is shifted, or transferred, each time a clock pulse is applied. Shift registers are employed for temporary data storage and are widely used for serial-to-parallel and parallel-to-serial data conversion. They are also used in pulse extenders, delay circuits and similar applications.
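The behaviour of a cascade of D flip-flops can be modelled in a few lines. The class below is a behavioural sketch only, with hypothetical names; it captures the delay-line use of a shift register, such as the delay registers of the RP and CP, and ignores setup, hold and reset details.

```python
class ShiftRegister:
    """Cascade of D flip-flops sharing one clock; models an N-cycle delay line."""

    def __init__(self, stages, init=0):
        self.q = [init] * stages  # Q output of each flip-flop, input side first

    def clock(self, d_in):
        """Apply one rising clock edge and return the value shifted out."""
        out = self.q[-1]
        # Every flip-flop captures the output of its predecessor simultaneously
        self.q = [d_in] + self.q[:-1]
        return out


if __name__ == "__main__":
    sr = ShiftRegister(3)
    print([sr.clock(v) for v in [1, 2, 3, 4, 5]])
```

A three-stage register returns each input three clock edges after it was applied, which is the property used to align data between pipeline stages.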

#### 6. Simulation and Results

The RTL views of the lifting-based DWT architecture with and without multiplier are shown in Figures 7 and 9. The power analysis output of the lifting-based DWT without multiplier is shown in Figure 8.



Figure 7. RTL view for Lifting 9/7 filters with multiplier.

| Parameter                              | Value                                          |
|----------------------------------------|------------------------------------------------|
| PowerPlay Power Analyzer Status        | Successful - Sat Mar 18 11:40:36 2017          |
| Quartus II Version                     | 9.1 Build 222 10/21/2009 SJ Web Edition        |
| Revision Name                          | liftmain                                       |
| Top-level Entity Name                  | liftmain                                       |
| Family                                 | Cyclone II                                     |
| Device                                 | EP2C20F484C7                                   |
| Power Models                           | Final                                          |
| Total Thermal Power Dissipation        | 255.13 mW                                      |
| Core Dynamic Thermal Power Dissipation | 13.19 mW                                       |
| Core Static Thermal Power Dissipation  | 47.69 mW                                       |
| I/O Thermal Power Dissipation          | 194.24 mW                                      |
| Power Estimation Confidence            | Low: user provided insufficient toggle rate da |

Figure 8. Power analysis for Lifting 9/7 filters without multiplier.



Figure 9. RTL view for Lifting 9/7 filters without multiplier.

Tables 1 and 2 show the power analysis of the various blocks of the lifting-based architecture for different FPGA families. LUT utilization, power dissipation and delay are taken into account for the analysis and comparison. The analysis shows the area optimization and delay reduction achieved by the proposed method.

Table 1. LUT Utilization in % (P1), Total Thermal Power Dissipation in mW (P2) and Delay in ns (P3) Parameter Analysis of different modules in DWT without multiplier.

| FPGA Family | Parameters                           | Predict1 | Update1 | Predict2 | Update2 |
|-------------|--------------------------------------|----------|---------|----------|---------|
| Cyclone II  | LUT Utilization (%)                  | 2        | <1      | <1       | <1      |
| Cyclone II  | Total Thermal Power Dissipation (mW) | 255.82   | 250.27  | 252.57   | 252.57  |
| Cyclone II  | Delay (ns)                           | 14.286   | 6.192   | 6.123    | 6.123   |
| Cyclone III | LUT Utilization (%)                  | 2        | <1      | <1       | <1      |
| Cyclone III | Total Thermal Power Dissipation (mW) | 167.22   | 161.45  | 161.12   | 161.12  |
| Cyclone III | Delay (ns)                           | 10.783   | 3.782   | 3.949    | 3.949   |
| Stratix II  | LUT Utilization (%)                  | 2        | <1      | <1       | <1      |
| Stratix II  | Total Thermal Power Dissipation (mW) |          |         | 497.72   | 497.72  |
| Stratix II  | Delay (ns)                           | 7.688    | 4.107   | 4.538    | 4.538   |
| Stratix III | LUT Utilization (%)                  | <1       | <1      | <1       | <1      |
| Stratix III | Total Thermal Power Dissipation (mW) | 511.74   | 498.89  | 499.47   | 499.47  |
| Stratix III | Delay (ns)                           | 5.991    | 3.285   | 3.754    | 3.754   |

Table 2. LUT Utilization in % (P1), Total Thermal Power Dissipation in mW (P2) and Delay in ns (P3) Parameter Analysis for lifter with Multiplier and without Multiplier.

| FPGA Family | Parameters                           | Lifter with Multiplier | Lifter without Multiplier |
|-------------|--------------------------------------|------------------------|---------------------------|
| Cyclone II  | LUT Utilization (%)                  | 43                     | 2                         |
| Cyclone II  | Total Thermal Power Dissipation (mW) | 120.51                 | 255.82                    |
| Cyclone II  | Delay (ns)                           | 69.383                 | 14.286                    |
| Cyclone III | LUT Utilization (%)                  | 40                     | 2                         |
| Cyclone III | Total Thermal Power Dissipation (mW) | 104.31                 | 167.22                    |
| Cyclone III | Delay (ns)                           | 55.065                 | 10.783                    |
| Stratix II  | LUT Utilization (%)                  | 48                     | 2                         |
| Stratix II  | Total Thermal Power Dissipation (mW) | 359.13                 | 503.67                    |
| Stratix II  | Delay (ns)                           | 45.398                 | 7.688                     |
| Stratix III | LUT Utilization (%)                  | 16                     | <1                        |
| Stratix III | Total Thermal Power Dissipation (mW) | 441.67                 | 511.74                    |
| Stratix III | Delay (ns)                           | 35.068                 | 5.991                     |
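The delay figures in Table 2 can be turned into relative reductions with a short calculation. The snippet below simply copies the delay values from the table and applies the usual percentage-change formula; the dictionary and function names are chosen for illustration.

```python
# Critical path delay in ns from Table 2: (with multiplier, without multiplier)
DELAY_NS = {
    "Cyclone II": (69.383, 14.286),
    "Cyclone III": (55.065, 10.783),
    "Stratix II": (45.398, 7.688),
    "Stratix III": (35.068, 5.991),
}


def percent_reduction(before, after):
    """Relative reduction of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before


def delay_reductions(table=DELAY_NS):
    """Delay reduction per FPGA family for the multiplier-less lifter."""
    return {family: percent_reduction(with_mult, without_mult)
            for family, (with_mult, without_mult) in table.items()}


if __name__ == "__main__":
    for family, pct in delay_reductions().items():
        print(f"{family}: {pct:.1f}% lower delay without multiplier")
```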

## Conclusion

The area, power and critical path are reduced by the use of the new multiplier-less predict and update model for the lifting-based DWT architecture. The traditional method suffered from its use of multipliers; this is overcome in our proposed architecture by using adders and shift registers in place of multipliers. The multiplier-less architecture of our proposed design is found to be effective. As a future enhancement, a pipelined architecture can be used in the proposed design, which will increase the speed of the overall device. The power consumed is reduced by 56%. In future, the DWT architecture can be applied to image processing areas.

#### References

- [1] A. Acharyya, K. Maharatna, B. Al-Hashimi, S. Gunn, "Memory reduction methodology for distributed-arithmetic-based DWT/IDWT exploiting data symmetry", IEEE Trans. Circuits Syst. II Express Briefs, Vol. 56, no. 4, pp. 285–289, 2009.
- [2] C. Chakrabarti, M. Vishwanath, R.M. Owens, "Architectures for wavelet transforms", in Proc. VLSI Signal Processing Workshop, 1993.
- [3] C. Chakrabarti, M. Viswanath, "Efficient realizations of discrete and continuous wavelet transforms: From single chip implementations to mapping on SIMD array computers", IEEE Trans. Signal Process, Vol. 43, No. 3, pp. 759–771, 1995.
- [4] C. Chakrabarti, M. Viswanath, "Architectures for wavelet transforms a survey", J. VLSI Signal Process, Vol. 14, pp. 171–192, 1996.
- [5] Z. Chengjun, W. Chunyan, A.M. Omair, "Pipeline VLSI Architecture for High-Speed Computation of the 1-D Discrete Wavelet Transform", In IEEE Transactions on Circuits and Systems, Vol. 57, no. 10, pp. 2729 – 2740, 2010.
- [6] A. Darji, S. Merchant, A. Chandorkar, "Efficient pipelined VLSI architecture with dual scanning method for 2-D lifting-based discrete wavelet transform", Proc. Int. Symp. Integrated Circuits (ISIC), pp. 329–331, 2011.

- [7] A. Das, A. Hazra, S. Banerjee, "An efficient architecture for 3-D discrete wavelet transforms", IEEE Trans. Circuits Syst. Video Technol, Vol. 20, no. 2, pp. 286–296, 2010.
- [8] G. Dillen, B. Georis, J. Legat, O. Cantineau, "Combined line-based architecture for the 5-3 and 9-7 wavelet transform of JPEG 2000", IEEE Trans. Circuits Syst. Video Technol, Vol. 13, no. 9, pp. 944–950, 2003.
- [9] C.H. Hsia, J.S. Chiang, J.M Guo, "Memory-efficient hardware architecture of 2-D dual-mode lifting-based discrete wavelet transform", IEEE Trans. Circuits Syst. Video Technol., Vol. 23 no. 4, 671–683, 2013.
- [10] Q. Huang, R. Zhou, Z. Hong, "Low memory and low complexity VLSI implementation of JPEG 2000 codec", IEEE Trans. Consum. Electron, Vol. 50, pp. 638–646, 2004.
- [11] C. Jose, L. Thomas, "Hardware Implementation of 1D Wavelet Transform on an FPGA for Infrasound Signal Classification", In IEEE Transactions on Nuclear science, Vol. 55, no. 1, pp. 9 – 13, 2008.
- [12] A. Lewis, G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers", Electron. Lett, Vol. 27, no. 2, pp. 171–173, 1991.
- [13] J.C. Limqueco, M.A. Bayoumi, "A VLSI architecture for separable 2-D discrete wavelet transforms", J. VLSI Signal Process, Vol. 18, pp. 125–140, 1998.
- [14] M. Vetterli, "Wavelets and filter banks for discrete time signal processing," in Wavelets and their Applications (R. Coifman et al. Eds.), Jones and Bartlett, 1991.
- [15] B.K. Mohanty, A. Mahajan, P.K. Meher, "Area- and power-efficient architecture for high-throughput implementation of lifting 2-D DWT", IEEE Trans. Circuits Syst. II, Vol. 59, no. 7, pp. 434–438, 2012.
- [16] B.K. Mohanty, P.K. Meher, "Memory efficient modular VLSI architecture for high-throughput and low-latency implementation of multilevel lifting 2-D DWT", IEEE Trans. Signal Process, Vol. 59, no. 4, pp. 2072–2084, 2011.
- [17] S.Nayak, "Bit-level systolic implementation of 1-D and 2-D discrete Wavelet Transform", In IEEE proceedings circuits Devices Systems, Vol. 152, no. 1, pp. 25-32, 2005.
- [18] J. Oliver, M.P. Malumbres, "Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform", IEEE Trans. Circuits Syst. Video Technol, Vol. 18, no. 2, pp. 237–248, 2008.
- [19] P. P. Vaidyanathan, "Quadrature mirror filter banks, M-band extensions and perfect-reconstruction techniques," IEEE ASSP Mag. 420, Jul. 1987.
- [20] K. Parhi, T. Nishitani, "VLSI architectures for discrete wavelet transforms", IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 1, no. 2, pp. 191–202, 1993.
- [21] L. Patrick, M. Ali, B. Miodrag, "A Flexible Design of Filter bank Architectures for Discrete Wavelet Transforms", In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 07), pp. 1441 – 1444, 2007.
- [22] K M. Pramod, K M. Basant, M N S. Swamy, "Low-area and Low Power Reconfigurable Architecture for Convolution – Based 1-D DWT using 9/7 and 5/3 Filters", In 28th International Conference on VLSI Design and 14th International conference on Embedded Systems, IEEE computer Society, pp. 327 – 332, 2015.
- [23] V.M. Senthilkumar, S. Ravindrakumar, D. Nithya, N.V. Kousik, "A vedic mathematics based processor core for discrete wavelet transform using FinFET and CNTFET technology for biomedical signal processing", Microprocess. Microsyst, Vol. **71**, 102875, 2019.

- [24] V.M. Senthilkumar, S. Ravindrakumar, "A low power and area efficient FinFET based approximate multiplier in 32 nm technology", in Springer-International Conference on Soft Computing and Signal Processing, 2018.
- [25] M. Shahid, M, V. John, "Design of Silicon IP Cores for Biorthogonal Wavelet Transforms", In Journal of VLSI signal processing for signal, Image and video technology, Vol. 29 no. 3, pp. 179 – 196, 2001.
- [26] G. Shi, W. Liu, L. Zhang F. Li, "An efficient folded architecture for lifting-based discrete wavelet transform", IEEE Trans. Circuits Syst. II, Vol. 56, no. 4, pp. 290–294, 2009.
- [27] T. H. Tung, S. Magnus, L.E. Per, "A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit", In IEEE Transactions on Circuits and Systems-I, Vol. 57, no. 12, pp. 3073 – 3081, 2010.
- [28] Z. Wei, J. Zhe, G. Zhiyu, L. Yanyan, "An efficient VLSI architecture for Lifting –Based Discrete Wavelet Transform", In IEEE Transactions on Circuits and Systems–II, Vol. 59 no. 3, pp. 158 – 162, 2012.
- [29] H.Yusong, C. J. Ching, "A Memory-Efficient High-Throughput Architecture for Lifting-Based Multi-Level 2-D DWT", In IEEE Transactions on Signal Processing, Vol. 61, no. 20, pp. 4975-4987, 2013.
- [30] C. Zhang, C. Wang, MO. Ahmad, "A Pipeline VLSI architecture for the fast computation of the 2-D Discrete Wavelet Transform", In IEEE Transactions on Circuits and Systems, Vol. 59, no. 8, pp. 1775 – 1785, 2012.

