
An Evolutionary Normalization Algorithm for Signed Floating-Point Multiply-Accumulate Operation

2022-08-24 12:57:48 Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha
Computers, Materials & Continua, 2022, Issue 7

Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha

1 Department of Electrical & Electronics Engineering, Faculty of Engineering & Technology, Jain (Deemed-to-be-University), Ramanagar, 562112, Karnataka, India

2 Symbiosis Institute of Technology, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

3 Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

Abstract: In digital signal processing applications such as graphics and computation systems, multiply-accumulate (MAC) is one of the prime operations. A MAC unit is a vital component of digital systems that implement Fast Fourier Transform (FFT) algorithms, convolution, image processing algorithms, and similar workloads. Normalization architectures are widely used in the domain of digital signal processing. The main objective of normalization is to perform comparison and shift operations. In this research paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logic blocks such as multiplexers and adders. The proposed normalization algorithm is further used in designing an 8×8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC accepts an 8-bit significand and a 3-bit exponent, the input to the architecture can lie between -(7.96875)_10 and +(7.96875)_10. The proposed architecture is designed and implemented in Cadence Virtuoso using 90 and 130 nm technologies (Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used rigorously. According to the analysis done in Cadence, the proposed architecture consumes the least power among the architectures it is compared with.

Keywords: Data normalization; Cadence Virtuoso; signed floating-point MAC; evolutionary optimized algorithm; block enabling; clock gating

1 Introduction to Multiply & Accumulate (MAC) Architecture

In digital signal processing, the MAC operation is a critical operation. Digital Signal Processing (DSP) algorithms execute many mathematical calculations repeatedly and rapidly on various data sets. Most operating systems and general-purpose microprocessors can execute DSP algorithms effectively. Unfortunately, DSP algorithms raise energy-efficiency issues on portable devices such as Personal Digital Assistants (PDAs) and mobile phones. The exponential growth of portable electronics has therefore imposed a major delay and power optimization challenge on Very Large-Scale Integration (VLSI) design engineers. A MAC unit is a vital component of any digital system that implements, for example, FFT algorithms or convolution. The MAC block is not limited to the fixed-point number system: audio and image processing applications need a floating-point MAC architecture. The basic MAC operation multiplies two variables (X_i and Y_i) and adds the product to the previous cycle’s output. Therefore, the MAC architecture includes the key operational blocks of a multiplier, an adder, and a register/accumulator [1-14]. The multiplier multiplies the two input operands; the adder adds the multiplier’s output to the previous cycle’s result; and the register or accumulator holds the final addition output. Fig.1 shows the generalized block diagram of an N×N bit MAC.
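The multiply-then-accumulate cycle described above can be sketched in a few lines of Python. This is only a behavioural illustration on plain integers; the function name `mac` and its arguments are hypothetical and do not model the bit widths of the hardware blocks.

```python
def mac(xs, ys, acc=0):
    """Accumulate the products of paired inputs, one pair per cycle."""
    for x, y in zip(xs, ys):
        product = x * y      # multiplier stage
        acc = acc + product  # adder stage: add to the previous cycle's output
    return acc               # value held in the accumulator register


# Three cycles: 1*4 + 2*5 + 3*6
print(mac([1, 2, 3], [4, 5, 6]))  # prints 32
```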

The popularity of portable devices and the need to limit power consumption (and therefore heat dissipation) in very dense VLSI chips have resulted in rapid advances in low-power design over the past few years. Mobile applications that require low power dissipation and high throughput, such as notebook Personal Computers (PCs), mobile communication devices, and PDAs, are the driving forces behind these innovations. In most cases, low power consumption requirements need to be met along with equally challenging targets of high chip density and high speed. Therefore, low-power IC design has emerged as an active and fast-developing area of Complementary Metal Oxide Semiconductor (CMOS) circuit design. Usually, the restricted battery life places very stringent demands on the portable system’s overall power requirements. New types of rechargeable batteries, such as “Nickel-Metal Hydride (NiMH)”, are being produced with better energy storage capacity than the traditional “Nickel-Cadmium (NiCd)” batteries. Still, there is no prospect of a significant increase in energy capacity in the foreseeable future. The energy density (the energy stored per unit weight) provided by newer technologies such as NiMH is approximately 30 Watt-hour/pound, which is quite low considering the growing applications of portable systems. Scaling down the energy dissipation of Integrated Circuits (ICs) while improving functionality is, therefore, a significant task in developing portable devices.

In high-performance digital systems, such as microprocessors, microcontrollers, and DSPs, the need for low-power circuit development is also becoming a significant concern. Targeting higher chip density and higher processing speed leads to very complex circuits running at high clock rates. If the chip’s clock speed rises, so does the chip’s energy dissipation, and the temperature increases accordingly. As the dissipated heat has to be removed efficiently to keep the chip’s temperature at an acceptable level, the cost of packaging, cooling, and heat extraction becomes an important factor. Several high-end microprocessors designed in the mid-1990s (such as the Intel Pentium, Digital Equipment Corporation (DEC) Alpha, and PowerPC) operate at frequencies of 100-300 MHz with a total average power of 20-50 W. VLSI reliability is one more critical factor for design engineers, as it adds to the demand for energy-efficient design. There is a close connection between the peak power dissipation of electronic circuits and reliability concerns such as electromigration and carrier-induced degradation. Additionally, the thermal stress caused by chip heat dissipation is a significant reliability issue. As a consequence, reducing power consumption is also critical for improving reliability. The techniques used in digital systems to achieve low power consumption range from the device and technology level to the algorithm level. Device characteristics (such as threshold voltage), device dimensions, and interconnect properties are essential factors in reducing power consumption. Circuit-level approaches such as a careful selection of the circuit design logic family, a reduction in the total number of voltage transitions, and clocking strategies can be used to minimize transistor-level energy dissipation. Measures at the architecture level include intelligent power management of different system components, the use of pipelining and concurrency, and bus layout design.

In recent years, researchers have reported several works in this area [2-3,5-21]. Reference [22] (2007) proposes a high-throughput MAC architecture that promises an optimized area; to maximize speed, it employs 4:2 compressor circuits. Reference [23] (2012) suggests a novel multiplier architecture. Reference [12] (2013) proposes a novel 64-bit-compatible architecture based on a modified “Wallace tree multiplier”. Reference [24] (2013) uses an updated Braun multiplier to create a MAC unit, implemented with NCSim and RTL Compiler. Reference [9] (2014) proposes a pipelined “low-power Baugh-Wooley multiplier-based MAC” unit. Reference [25] (2009) explains a split MAC architecture; to increase the speed of operation further, it proposes a strategy to compact the “partial product using interleaved adders” and a “modified hybrid partial product reduction tree (PPRT)” scheme. A double carry-save addition algorithm is proposed in [26], where its prototype is also verified on a six-input Look-up Table (LUT) based Field Programmable Gate Array (FPGA). In 2016, an “embedded logic full adder (PRO-FA)” was presented in [14], which improves on the basic design constraints. In 2019, a “low-complexity asynchronous pipelined adder” that promises significant savings in energy and latency was proposed [27]. In the same year, a Pro-LA architecture that targets error-tolerant applications was proposed in [28]. In a separate line of work, Reference [29] proposes an optimization approach for a “gripper mechanism” using bi-algorithms. An optimization technique for a “dragonfly-inspired compliant joint” is proposed in [30], whereas Reference [31] proposes an optimization technique for a “linear compliant mechanism of nanoindentation tester”.

As shown in Fig.1, the multiplier block takes two N-bit inputs and multiplies them into a 2N-bit output, which is passed on to the register/accumulator unit. The register temporarily stores the data and sends it to the adder as an input. The adder sums the register output with the accumulated value from the previous cycle. Thus, the MAC unit’s overall output is taken from the accumulator register output. Hence, the MAC architecture consists of an “N-bit multiplier”, a “2N-bit register”, a “(2N+1)-bit adder”, and two “(2N+1)-bit accumulators/registers” (one for storing the output value and the other for reading the previous output). The conventional MAC architecture shown in Fig.1 can perform the MAC operation on unsigned fixed-point numbers only, whereas today’s digital systems demand signed floating-point operation. In floating-point arithmetic, the conventional adder/subtractor or multiplier algorithms cannot be applied directly because of the presence of the decimal point in the inputs. Therefore, normalization is essential to standardize the floating-point inputs. Normalization here means fixing the location of the decimal point in the mantissa and varying the exponent within a particular range according to how far the decimal point is shifted. This paper proposes a multiplexer-based normalization architecture that enables MAC operations on signed floating-point inputs. To do so, a unique input data format is created that accepts 9-bit binary data and a 4-bit exponent. As a result, the new input data format is 13 bits wide (including the MSBs reserved as the sign bits of the mantissa and the exponent). The Exponent-Comparator-Circuit (ECC) and the Exponent-Shifter-Circuit (ESC) are the two main blocks of the proposed normalization architecture.

The rest of this manuscript is organized as follows: Section 2 explains the Exponent-Comparator-Circuit (ECC) and its operation. Section 3 describes the Exponent-Shifter-Circuit (ESC) and its operation. Section 4 describes the proposed SFMAC architecture built from the ECC and ESC blocks and compares it with existing MAC architectures. Finally, Section 5 presents the conclusions.

2 ECC Block

The product of the input exponents and the previous cycle’s output exponent are used as inputs to the ECC (Exponent-Comparator-Circuit). The most important point here is that, if both ECC inputs have the same sign, the difference between them is calculated as an arithmetic difference. If the two inputs have different signs, the difference between them equals the arithmetic sum of their magnitudes. Fig.2 shows the flowchart of the ECC block.

Figure 2: ECC flowchart

Multiplexers are used in the architecture to compare the inputs. The ECC operation generates a 5-bit output that is used to execute binary shifts. The MUX-based architecture of the ECC block is shown in Fig.3, and its design is as follows:

Figure 3: MUX based ECC architecture

i) The ECC’s inputs are expressed in 2’s complement form depending on the input sign bits.

ii) The operation of the ECC is further segregated based on the sign bits of the inputs as follows:

a. If the two sign bits are different, add the inputs of the ECC to produce a 4-bit output (i.e., discard the carry bit), and set the 5th bit to ‘1’ if the product of the exponents of the inputs is negative but the previous exponent is positive. Set the 5th bit to ‘0’ in all other circumstances.

b. If the sign bits of both ECC inputs are the same, determine which of the two inputs is higher and find the difference between the inputs using the following procedure:

• To find the higher number, compare the two numbers bit by bit, from MSB to LSB, as shown in Fig.4.

Figure 4: MUX based ECC with same sign bit

• To find the difference, use the 2’s complement approach. The difference produces a 4-bit output (i.e., discard the borrow bit), and the 5th bit is set to ‘0’ if the product of the exponents of the inputs is higher than the previous cycle’s exponent. Set the 5th bit to ‘1’ in all other circumstances.

• In this architecture, multiplexers are used to compare the inputs.

iii) This method yields a 5-bit output that is utilized to do binary shifts in the ESC block.
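The ECC rules above can be summarized as a small behavioural sketch. It is not the authors’ MUX-based circuit: the function name `ecc`, the use of plain signed Python integers for the two exponents, and the packing of the flag into bit 4 are assumptions made for illustration, and "higher" is interpreted as algebraically greater.

```python
def ecc(e_prod: int, e_prev: int) -> int:
    """Return a 5-bit word: bit 4 is the flag, bits 3..0 the shift amount.

    e_prod: exponent term of the current product (signed integer)
    e_prev: exponent of the previous cycle's output (signed integer)
    """
    prod_neg = e_prod < 0
    prev_neg = e_prev < 0
    if prod_neg != prev_neg:
        # Case (a): different signs, so the difference is the sum of magnitudes.
        amount = (abs(e_prod) + abs(e_prev)) & 0xF  # keep 4 bits, drop the carry
        flag = 1 if (prod_neg and not prev_neg) else 0
    else:
        # Case (b): same sign, so subtract the smaller from the larger.
        amount = abs(e_prod - e_prev) & 0xF         # keep 4 bits, drop the borrow
        flag = 0 if e_prod > e_prev else 1
    return (flag << 4) | amount


print(format(ecc(-2, 3), "05b"))  # prints 10101: flag 1, shift amount 5
print(format(ecc(3, 1), "05b"))   # prints 00010: flag 0, shift amount 2
```

In both branches the flag ends up as 1 whenever the product’s exponent is not the larger of the two, which is the condition under which the product is the operand that has to be shifted.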

3 ESC Block

The ESC (Exponent-Shifter-Circuit) block shifts the smaller number by the difference between the exponent of the product of the 8-bit inputs and the exponent of the previous cycle’s MAC output (the preceding output). The ESC block’s inputs are the ECC block’s 5-bit output, the 16-bit product of the inputs, and the previous cycle’s 16-bit output. The multiplexer-based design of the ESC block is shown in Fig.5. The step-by-step procedure is as follows:

Figure 5: MUX based ESC architecture

1. The smaller number is identified from the 5-bit ECC result. If the MSB of the ECC block output is 1, the product of the inputs is shifted to the right by the decimal value of the remaining 4 bits of the ECC block output. If the MSB of the ECC block output is 0, the preceding output is shifted to the right by the decimal value of the remaining 4 bits of the ECC block output.

2. The MSB of the ECC block output also identifies the ESC input that does not need shifting. If the MSB of the ECC block output is 1, the previous output is retained (not shifted). If, on the other hand, the MSB of the ECC block output is 0, the product of the inputs is passed through unchanged (not shifted).
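The two steps above amount to a select-and-shift operation, sketched below. The function name `esc` and the treatment of the 16-bit operands as non-negative Python integers are assumptions for illustration; the actual block is built from multiplexers and operates on the architecture’s internal number format.

```python
def esc(ecc_out: int, product: int, previous: int):
    """Shift whichever operand the ECC flag selects; pass the other unchanged.

    ecc_out:  5-bit ECC word (bit 4 = flag, bits 3..0 = shift amount)
    product:  16-bit product of the current inputs (non-negative here)
    previous: 16-bit output of the previous cycle (non-negative here)
    Returns (aligned_product, aligned_previous).
    """
    flag = (ecc_out >> 4) & 1
    amount = ecc_out & 0xF
    if flag:
        # MSB = 1: the product is shifted right, the previous output is retained.
        return product >> amount, previous
    # MSB = 0: the previous output is shifted right, the product is retained.
    return product, previous >> amount


print(esc(0b10010, 0x8000, 0x4000))  # prints (8192, 16384)
print(esc(0b00011, 0x8000, 0x4000))  # prints (32768, 2048)
```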

4 SFMAC Architecture

To represent positive and negative numbers, the architecture employs sign-magnitude and 2’s complement representations.Signed magnitude form is used to describe SFMAC input-output, but these inputs are converted to 2’s complement form for the internal calculations.The proposed MAC architecture’s final output (MAC output) has 17 bits, including one sign bit.
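The conversion between the sign-magnitude form used at the interface and the 2’s complement form used internally can be sketched as follows. The helper names and the `width` parameter are hypothetical; the paper does not specify how the conversion is implemented in hardware.

```python
def sm_to_twos(sign: int, magnitude: int, width: int) -> int:
    """Sign bit plus magnitude -> width-bit 2's complement bit pattern."""
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)  # wrap into `width` bits


def twos_to_sm(pattern: int, width: int):
    """width-bit 2's complement bit pattern -> (sign, magnitude)."""
    if pattern & (1 << (width - 1)):       # top bit set: negative value
        return 1, (1 << width) - pattern
    return 0, pattern


# -5 as a 9-bit pattern (1 sign bit + 8 magnitude bits)
print(format(sm_to_twos(1, 5, 9), "09b"))  # prints 111111011
print(twos_to_sm(0b111111011, 9))          # prints (1, 5)
```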

The SFMAC’s inputs are two 8-bit binary numbers formatted as shown in Fig.6. Each SFMAC input is 13 bits long, with two bits set aside as the sign bits of the number and of the exponent. Each sign bit is 0 or 1 depending on whether the corresponding value is positive or negative. The remaining eleven bits hold the 8-bit binary number and a 3-bit binary exponent. Note that the 3rd bit of the exponent is set to 0 by default, since a 2-bit binary value takes 3 bits to be represented in 2’s complement form.

Figure 6: Input format representation of SFMAC

As a result, the exponent term in this architecture varies from ‘-4’ to ‘+3’. The input numbers range from -(0.11111111)_2 × 2^{+3} to +(0.11111111)_2 × 2^{+3}; hence the SFMAC architecture’s inputs range from -(7.96875)_10 to +(7.96875)_10. Furthermore, the SFMAC architecture’s inputs can only be entered as fractions. For instance, the numbers (001)_2 and (010)_2 should be entered as (0.00100000)_2 × 2^{+3} and (0.01000000)_2 × 2^{+3}, respectively. Similarly, (101)_2 and (10)_2 should be represented as (0.10100000)_2 × 2^{+3} and (0.10000000)_2 × 2^{+2}, respectively, to be processed by the SFMAC. Apart from the Exponent Comparator Circuit (ECC) and Exponent Shifter Circuit (ESC) explained earlier, the main building blocks of the SFMAC architecture are the 8-bit multiplier, 16-bit register, 16-bit adder, 2:1/4:1 multiplexers of various sizes, and the Exponential Adder. SFMAC’s overall architecture is depicted in Fig.7.
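The fractional input convention can be checked numerically with a short sketch. The function `sfmac_value` and its three separate arguments are hypothetical; the exact bit ordering of the 13-bit word is given in Fig.6 and is not reproduced here.

```python
def sfmac_value(sign: int, fraction: int, exponent: int) -> float:
    """Decode (sign, 8-bit fraction, exponent) into the real value it stands for.

    fraction is the 8 bits after the binary point, i.e. (0.b7...b0)_2.
    exponent is a signed integer in the range -4..+3.
    """
    if not 0 <= fraction <= 0xFF:
        raise ValueError("fraction must fit in 8 bits")
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must be in -4..+3")
    magnitude = (fraction / 256) * 2.0 ** exponent
    return -magnitude if sign else magnitude


print(sfmac_value(0, 0b11111111, 3))  # prints 7.96875, the largest magnitude
print(sfmac_value(0, 0b10100000, 3))  # prints 5.0, i.e. (101)_2
print(sfmac_value(0, 0b10000000, 2))  # prints 2.0, i.e. (10)_2
```

The largest magnitude follows directly from 255/256 × 2^3.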

Figure 7: SFMAC architecture using ECC & ESC blocks
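To show how the blocks fit together, the following is a simplified behavioural sketch of one MAC cycle. It handles unsigned operands only, represents the accumulator as a (16-bit mantissa, exponent) pair, and keeps the larger exponent for the result; all of these choices, and the names `mac_step` and `to_float`, are assumptions for illustration rather than a description of the circuit in Fig.7.

```python
def mac_step(acc, x, y):
    """One MAC cycle on unsigned (fraction, exponent) operands.

    acc is (mantissa16, exponent); x and y are (fraction8, exponent).
    A mantissa m with exponent e stands for m / 2**16 * 2**e.
    """
    (fx, ex), (fy, ey) = x, y
    prod, e_prod = fx * fy, ex + ey          # 8x8 multiplier and exponent adder
    prev, e_prev = acc
    shift = abs(e_prod - e_prev)             # ECC: exponent difference
    if e_prod > e_prev:                      # ESC: shift the smaller-exponent operand
        prev >>= shift
    else:
        prod >>= shift
    return prod + prev, max(e_prod, e_prev)  # adder; keep the larger exponent


def to_float(acc):
    """Real value represented by an accumulator pair."""
    mantissa, exponent = acc
    return mantissa / 2 ** 16 * 2.0 ** exponent


acc = (0, 0)
acc = mac_step(acc, (128, 1), (128, 2))  # 1.0 * 2.0
acc = mac_step(acc, (128, 0), (128, 0))  # 0.5 * 0.5
print(to_float(acc))                     # prints 2.25
```

Right shifts discard low-order bits, so results in this sketch are truncated in the same way an alignment shifter without rounding would truncate them.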

CMOS technologies are used to develop and implement the overall SFMAC architecture. A thorough study is carried out using Cadence Virtuoso. To limit the power consumption, the architecture employs a “clock gating scheme” and a pipeline mechanism. The clock pulse pipeline system is ensured by triggering successive blocks after a predetermined period.

The SFMAC architecture is implemented in 90 and 130 nm CMOS technology (GPDK and TSMC, respectively). Tab.1 compares the performance of the SFMAC architecture in the two CMOS technologies for a particular input vector. The Cadence Spectre tool is used to measure the power consumption of the implemented designs. The average power (P_Average) is calculated over a simulation time (T_sim) of 40 ns at a clock frequency (f_clk) of 83.33 MHz, while the static power is evaluated for a 2 V supply voltage (V_DD). Since the transistor sizing is greater in 130 nm technology, which increases the load capacitance C_load, the average power (P_Average) consumption in 130 nm (TSMC) is higher than in 90 nm (GPDK). In the same way, device geometry affects static power consumption: a circuit with larger device dimensions can consume more static power. If α_T is the activity factor, then the CMOS dynamic power is calculated as Eq.(1):

$$P_{Dynamic} = \alpha_T \, C_{load} \, V_{DD}^{2} \, f_{clk} \qquad (1)$$
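The dependence of dynamic power on load capacitance can be illustrated with the usual first-order relation (activity factor × load capacitance × supply voltage squared × clock frequency). In the sketch below, only the 2 V supply and the 83.33 MHz clock come from the text; the activity factor of 0.5 and the 1 pF load are hypothetical placeholders, not measured values from the design.

```python
def dynamic_power(alpha: float, c_load: float, v_dd: float, f_clk: float) -> float:
    """First-order CMOS dynamic power in watts."""
    return alpha * c_load * v_dd ** 2 * f_clk


V_DD = 2.0        # supply voltage from the text, in volts
F_CLK = 83.33e6   # clock frequency from the text, in hertz
ALPHA = 0.5       # hypothetical activity factor
C_LOAD = 1e-12    # hypothetical load capacitance, in farads

p = dynamic_power(ALPHA, C_LOAD, V_DD, F_CLK)
print(f"{p * 1e6:.2f} uW")  # prints 166.66 uW

# Doubling the load capacitance doubles the dynamic power.
print(dynamic_power(ALPHA, 2 * C_LOAD, V_DD, F_CLK) / p)
```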

Tab.2 compares the proposed SFMAC architecture with existing MAC architectures in terms of power consumption. Because most architectures in the literature use an HDL-based approach, whereas the proposed architecture is implemented in Cadence Virtuoso in 90 and 130 nm technologies, a direct comparison is difficult. Furthermore, almost none of the architectures described in the literature support signed operations or floating-point designs.

Table 1: Performance of SFMAC at 90 and 130 nm CMOS technologies (GPDK and TSMC respectively)

Although there are architectures that use clock signals just for data accumulation (in the register or accumulator), most of the architectures in the literature do not use any clocking signals. Asynchronous circuits do not have real-time applicability. As a result, the functional applicability of these architectures must be further investigated. The architecture shown in [32] is designed for signed floating-point operation, whereas most of the reported architectures discussed in Tab.2 are dedicated to unsigned fixed-point Multiply-Accumulate operation.


Tab.2 reveals that the architectures in [12,33,34] consume considerably more static and average power (in mW) than the proposed SFMAC architecture. The architectures in [35,36] are examined for 16-bit operations at 1 V and 8-bit operations at 1.8 V in 90 and 180 nm technologies. The existing work described in [35,36] requires less power than the proposed SFMAC, but its performance analysis is done with a supply voltage below 2 V, while the SFMAC uses a supply voltage of 2 V. Moreover, these two implementations can only execute MAC operations on unsigned fixed-point numbers, so the MAC architectures in [35,36] have a restricted scope. Although the architecture defined in [37] is implemented in 180 nm technology with a 1.8 V supply voltage for 16-bit MAC operation, it consumes substantially more power than the SFMAC architecture. The architecture listed in [38] is implemented for 1-bit unsigned fixed-point MAC operation in 32 nm CMOS and CNTFET technology, so a comparison with an 8-bit SFMAC is not meaningful. The architecture described in [32] is the only existing MAC architecture capable of signed floating-point operation, and a comparative study shows that the proposed SFMAC is considerably more power-efficient.

Table 2: Proposed SFMAC vs. already reported architectures


5 Conclusion

A novel approach for performing normalization is explained in this paper. The proposed normalization operation is divided into the Exponent-Comparator-Circuit (ECC) and the Exponent-Shifter-Circuit (ESC). The ECC block compares the exponents, while the ESC shifts the smaller number by the difference between the exponents of the inputs. Further, a signed floating-point MAC architecture is proposed using the novel normalization architecture. For design and implementation, the Cadence Spectre tool is used with GPDK 90 nm and TSMC 130 nm CMOS technologies. The results show that the proposed SFMAC architecture consumes less power than its recent counterpart and is therefore applicable to low-power DSP architectures.

Funding Statement: This work was supported by the Research Support Fund (RSF) of Symbiosis International (Deemed University), Pune, India.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
