
An Evolutionary Normalization Algorithm for Signed Floating-Point Multiply-Accumulate Operation

2022-08-24 12:57:48 Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha
Computers, Materials & Continua, 2022, Issue 7

Rajkumar Sarma, Cherry Bhargava and Ketan Kotecha

1Department of Electrical & Electronics Engineering, Faculty of Engineering & Technology, Jain (Deemed-to-be-University), Ramanagar, 562112, Karnataka, India

2Symbiosis Institute of Technology, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

3Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International (Deemed University), Lavale, Pune, 412115, India

Abstract: In digital signal processing, as in graphics and computation systems, multiplication-accumulation is one of the prime operations. A MAC unit is a vital component of digital systems that implement Fast Fourier Transform (FFT) algorithms, convolution, image processing algorithms, and similar workloads. In the domain of digital signal processing, normalization architectures are used very widely. The main objective of normalization is to perform comparison and shift operations. In this research paper, an evolutionary approach for designing an optimized normalization algorithm is proposed using basic logical blocks such as multiplexers and adders. The proposed normalization algorithm is further used in designing an 8×8 bit Signed Floating-Point Multiply-Accumulate (SFMAC) architecture. Since the SFMAC accepts an 8-bit significand and a 3-bit exponent, the input to the architecture can lie anywhere between $-(7.96872)_{10}$ and $+(7.96872)_{10}$. The proposed architecture is designed and implemented in Cadence Virtuoso using 90 and 130 nm technologies (Generic Process Design Kit (GPDK) and Taiwan Semiconductor Manufacturing Company (TSMC), respectively). To reduce the power consumption of the proposed normalization architecture, techniques such as “block enabling” and “clock gating” are used rigorously. According to the analysis done in Cadence, the proposed architecture uses the least power compared with its recent predecessors.

Keywords: Data normalization; cadence virtuoso; signed-floating-point MAC; evolutionary optimized algorithm; block enabling; clock gating

1 Introduction to Multiply & Accumulate (MAC) Architecture

In digital signal processing, the MAC operation is considered a significant and critical operation. Digital Signal Processing (DSP) algorithms execute many mathematical calculations repeatedly and rapidly on various data sets. DSP algorithms can be executed effectively by the majority of operating systems and general-purpose microprocessors. Unfortunately, DSP algorithms have energy efficiency issues on portable devices such as Personal Digital Assistants (PDAs) and mobile phones. With respect to delay and power optimization, the exponential growth of portable electronics has imposed a major challenge on Very Large-Scale Integration (VLSI) design engineers. A MAC unit is a vital component of any digital system that implements FFT algorithms, convolution, and similar operations. The MAC block is not limited to the fixed-point number system: for audio and image processing applications, a floating-point MAC architecture is much needed. The basic MAC operation multiplies two variables ($X_i$ and $Y_i$) and adds the product to the last cycle’s output. Therefore, the MAC architecture includes the key operational blocks of a multiplier, an adder, and a register/accumulator [1-14]. The multiplier multiplies the two input operands; the adder adds the multiplier’s output to the previous cycle’s result; and the register or accumulator preserves the final addition output. Fig. 1 shows the generalized block diagram of an N×N bit MAC.
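The multiply-then-accumulate loop described above can be sketched in a few lines of Python. This is only a behavioural illustration on plain integers; the function name `mac` and its arguments are assumptions made for this example and do not come from the paper.

```python
def mac(xs, ys, acc=0):
    """Behavioural multiply-accumulate over two equal-length sequences.

    Hypothetical helper for illustration: 'acc' plays the role of the
    accumulator register that holds the previous cycle's output.
    """
    for x, y in zip(xs, ys):
        product = x * y        # multiplier stage
        acc = acc + product    # adder stage; result is stored back in the accumulator
    return acc


# Three MAC cycles: 1*4 + 2*5 + 3*6
print(mac([1, 2, 3], [4, 5, 6]))  # 32
```

Each loop iteration corresponds to one clock cycle of the hardware block: one multiplication, one addition, and one register update.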

The popularity of portable devices and the need to limit power consumption (and therefore heat dissipation) in dense VLSI chips have resulted in rapid advances in low-power design over the past few years. Mobile applications that require low power dissipation and high throughput, such as notebook Personal Computers (PCs), mobile communication devices, and PDAs, are the driving forces behind these innovations. In most cases, low power consumption requirements must be met along with equally challenging targets of high chip density and high speed. Therefore, low-power IC design has surfaced as a beneficial and fast-developing area of Complementary Metal Oxide Semiconductor (CMOS) circuit design. Usually, the restricted battery life places very stringent demands on the portable system’s overall power requirements. New types of rechargeable batteries, such as “Nickel-Metal Hydride (NiMH)”, are being produced with better energy storage capacity than the traditional “Nickel-Cadmium (NiCd)” batteries. Still, there is no prospect of a significant increase in energy capacity in the foreseeable future. The energy density (the energy stored per unit weight) provided by newer technologies such as NiMH is approximately 30 Watt-hour/pound, which is quite low considering the growing applications of portable systems. Scaling down the energy dissipation of Integrated Circuits (ICs) while improving functionality is, therefore, a significant task in developing portable devices.

In high-performance digital systems, such as microprocessors, microcontrollers, and DSPs, the need for low-power circuit development is also becoming a significant concern. Targeting higher chip density and higher processing speed leads to very complex circuits running at high clock rates. If the chip’s clock speed rises, the chip’s energy dissipation rises with it, and the temperature therefore increases linearly. As the dissipated heat has to be removed efficiently to maintain the chip’s temperature at an acceptable level, packaging cost, cooling, and heat extraction become important aspects. A few high-end microprocessors designed in the mid-1990s (such as the Intel Pentium, Digital Equipment Corporation (DEC) Alpha, and PowerPC) operate at frequencies of 100-300 MHz with a total average power of 20-50 W. VLSI reliability is one more critical factor for design engineers, as it adds to the demand for energy-efficient design. There is a close connection between the maximum power dissipation of an electronic circuit and reliability concerns such as electromigration and carrier-induced degradation. Additionally, the thermal stress caused by chip heat dissipation is a significant reliability issue. As a consequence, reducing power consumption is also critical for improving reliability. The procedures used in digital systems to achieve low power consumption vary across the device, technology, and algorithm levels. The standard device features (such as threshold voltage), device dimensions, and interconnection properties are essential factors in reducing power consumption. Circuit-level approaches, such as a careful selection of the circuit design logic family, a reduction in the total number of voltage transitions, and clocking approaches, can be used to minimize transistor-level energy dissipation. Measures at the architecture level include intelligent power management of different system components, pipelining and concurrency, and bus layout design.

In recent years, researchers have reported several related works [2-3,5-21]. Reference [22] (2007) proposes a high-throughput MAC architecture that promises an optimized area; to maximize speed, it employs 4:2 compressor circuits. Reference [23] (2012) suggests a novel multiplier architecture. Reference [12] (2013) proposes a novel 64-bit compatible architecture based on a transformed “Wallace tree multiplier”. Reference [24] (2013) uses an updated Braun multiplier to create a MAC unit, implemented with NCSim and RTL Compiler. In 2014, reference [9] proposes a pipelined “low-power Baugh-Wooley multiplier-based MAC” unit. Reference [25] (2009) explains a split MAC architecture; to increase the speed of operation even further, it proposes a strategy to compact the “partial product using interleaved adders” together with a “modified hybrid partial product reduction tree (PPRT)” scheme. A double carry-save addition algorithm is proposed in [26], where its prototype is also verified on a six-input Look-up Table (LUT) based Field Programmable Gate Array (FPGA). In 2016, an “embedded logic full adder (PRO-FA)” was presented in [14], which offers improvements on the basic design constraints. In 2019, a “low-complexity asynchronous pipelined adder” that guarantees significant savings in energy and latency is proposed [27]. At the same time, a Pro-LA architecture that targets error-tolerant applications is proposed in [28]. In a separate line of work, reference [29] proposes an optimization approach for a “gripper mechanism” using appropriate bi-algorithms. An optimization technique for a “dragonfly-inspired compliant joint” is proposed in [30], whereas reference [31] proposes an optimization technique for a “linear compliant mechanism of nanoindentation tester”.

As shown in Fig. 1, the multiplier block multiplies two N-bit inputs and produces a 2N-bit output, which is passed on to the register/accumulator unit. The register-cum-accumulator temporarily stores the data and sends it to the adder as an input. The adder sums the register unit output with the accumulated value resulting from the previous cycle. Thus, the MAC unit’s overall output is taken from the accumulator register output. Hence, the MAC architecture consists of an “N-bit multiplier”, a “2N-bit register”, a “(2N+1)-bit adder”, and two “(2N+1)-bit accumulators/registers” (one for storing the output value and the other for reading the previous output). The conventional MAC architecture of Fig. 1 can perform the MAC operation on unsigned fixed-point numbers only, whereas today’s digital systems demand signed floating-point operation. In floating-point arithmetic, the conventional adder/subtractor or multiplier algorithms cannot be applied directly because of the decimal point in the inputs. Therefore, normalization operations are essential to standardize the floating-point inputs. Normalization here means standardization: the decimal point location of the mantissa is fixed, and the exponent value is varied within a particular range according to the shifting of the decimal point. This paper proposes a multiplexer-based normalization architecture that can execute MAC operations on signed floating-point inputs. To do so, a unique input data format is created that accepts 9-bit binary data and a 4-bit exponent input. As a result, the new input data format is 13 bits wide (including the MSBs reserved as sign bits for the mantissa and the exponent). The Exponent-Comparator-Circuit (ECC) and the Exponent-Shifter-Circuit (ESC) are the two main blocks of the proposed normalization architecture.
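The bit widths quoted above for the conventional unsigned MAC (a 2N-bit product and a (2N+1)-bit adder) can be checked numerically. The sketch below is a minimal check, assuming unsigned operands and a single accumulation step; the helper name `mac_widths` is hypothetical.

```python
def mac_widths(n):
    """Return (product_bits, sum_bits) for one unsigned N x N MAC step.

    Hypothetical helper: it evaluates the worst case, where both operands
    and the previously stored 2N-bit value are at their maximum.
    """
    max_operand = (1 << n) - 1                    # largest N-bit unsigned value
    max_product = max_operand * max_operand       # largest multiplier output
    max_stored = (1 << (2 * n)) - 1               # largest 2N-bit register value
    max_sum = max_product + max_stored            # worst-case adder output
    return max_product.bit_length(), max_sum.bit_length()


print(mac_widths(8))  # (16, 17)
```

For N = 8 the product fits in 16 bits and one addition needs 17 bits, which agrees with the 2N and (2N+1) widths stated in the text.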

This manuscript is divided into six subsections: Section 2 explains the Exponent-Comparator-Circuit (ECC) and its operation. Section 3 describes the Exponent-Shifter-Circuit (ESC) and its operation. Section 4 describes the proposed SFMAC architecture using the ECC and ESC architectures. Section 5 compares the proposed SFMAC with existing architectures. Finally, the conclusions and future work are explained in Section 6.

2 ECC Block

The product of the input exponents and the previous cycle’s output exponent are used as inputs to the ECC (Exponent-Comparator-Circuit). The most important point is that the difference between the two ECC inputs is calculated as an arithmetic difference if both inputs have the same sign. If the two inputs have different signs, the difference between them equals the arithmetic sum of the two input magnitudes. Fig. 2 shows the flowchart of the ECC block.

Figure 2: ECC flowchart

Multiplexers are used in the architecture to compare the inputs. The ECC operation generates a 5-bit output that is used to execute binary shifts. The MUX-based architecture of the ECC block is shown in Fig. 3, and its design is as follows:

Figure 3: MUX based ECC architecture

i) The ECC’s inputs are expressed in 2’s complement form depending on the input sign bits.

ii) The operation of the ECC is further segregated based on the sign bits of the inputs as follows:

a. If the two sign bits are different, add the inputs of the ECC to produce a 4-bit output (i.e., discard the carry bit), and set the 5th bit to ‘1’ if the product of the exponents of the inputs is negative while the previous exponent is positive. Set the 5th bit to ‘0’ in all other circumstances.

b. If the two sign bits of the ECC inputs are the same, find which of the two inputs is higher and compute the difference between the inputs using the following procedure:

• To find the higher number, compare the two numbers bit by bit, starting from the MSB and moving to the LSB, as shown in Fig. 4.

Figure 4: MUX based ECC with same sign bit

• To find the difference, use the 2’s complement approach. The difference produces a 4-bit output (i.e., discard the borrow bit), and the 5th bit is set to ‘0’ if the product of the exponents of the inputs is higher than the previous cycle’s exponent. Set the 5th bit to ‘1’ in all other circumstances.

• In this architecture, multiplexers are used to compare the inputs.

iii) This method yields a 5-bit output that is utilized to do binary shifts in the ESC block.
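The ECC rules listed above can be sketched behaviourally in Python. This is an illustration on signed integers rather than a model of the multiplexer network; the function name `ecc`, its argument names, and the choice of a signed comparison in the same-sign case are assumptions made for this example.

```python
def ecc(exp_product, exp_previous):
    """Behavioural sketch of the Exponent-Comparator-Circuit.

    Returns a 5-bit word: bit 4 is the flag (5th bit), bits 3..0 hold the
    exponent difference. Assumption: "higher" means larger as a signed value.
    """
    same_sign = (exp_product < 0) == (exp_previous < 0)
    if same_sign:
        # Rule b: arithmetic difference; flag is 0 only if the product exponent is higher
        diff = abs(exp_product - exp_previous)
        flag = 0 if exp_product > exp_previous else 1
    else:
        # Rule a: sum of magnitudes; flag is 1 only if the product exponent is the negative one
        diff = abs(exp_product) + abs(exp_previous)
        flag = 1 if exp_product < 0 else 0
    return (flag << 4) | (diff & 0xF)   # keep 4 difference bits, drop carry/borrow


print(format(ecc(5, 2), "05b"))   # 00011: product exponent higher, difference 3
print(format(ecc(-3, 2), "05b"))  # 10101: product exponent lower, difference 5
```

Under the signed-comparison assumption, both branches give a flag of 1 exactly when the product exponent is not higher than the previous exponent, so the flag identifies the operand that the shifter has to move.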

3 ESC Block

The ESC (Exponent-Shifter-Circuit) block is in charge of shifting the smaller number by the difference between the exponent of the product of the 8-bit inputs and the exponent of the previous cycle’s MAC output (the preceding output). The ESC block’s inputs are the ECC block’s 5-bit output, the 16-bit product of the inputs, and the previous cycle’s 16-bit output. The multiplexer-based design of the ESC block is shown in Fig. 5. The step-by-step procedure is as follows:

Figure 5: MUX based ESC architecture

1. The smaller number is identified from the 5-bit ECC result. If the MSB of the ECC block output is 1, the product of the inputs is shifted to the right by the decimal value of the remaining 4 bits of the ECC block output. If the MSB of the ECC block output is 0, the preceding output is shifted to the right by the decimal value of the remaining 4 bits.

2. The MSB of the ECC block output also identifies the ESC input that does not need shifting. If the MSB of the ECC block output is 1, the previous output is retained (not shifted). If, on the other hand, the MSB of the ECC block output is 0, the product of the inputs is passed through unchanged (not shifted).
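The two ESC steps can be sketched as a small Python function. It is a behavioural illustration only: the name `esc`, the treatment of both 16-bit words as unsigned magnitudes, and the use of a logical right shift are assumptions, since the text does not specify how sign bits are handled during the shift.

```python
WIDTH = 16                    # width of the product and of the preceding output, per the text
MASK = (1 << WIDTH) - 1


def esc(ecc_out, product, previous):
    """Behavioural sketch of the Exponent-Shifter-Circuit.

    ecc_out is the 5-bit ECC word: bit 4 selects the operand to shift,
    bits 3..0 give the shift amount. Returns (product, previous) after alignment.
    """
    flag = (ecc_out >> 4) & 1
    shift = ecc_out & 0xF
    if flag == 1:
        # MSB = 1: shift the product right, keep the preceding output as is
        return (product & MASK) >> shift, previous & MASK
    # MSB = 0: keep the product as is, shift the preceding output right
    return product & MASK, (previous & MASK) >> shift


print([hex(v) for v in esc(0b10010, 0x8000, 0x4000)])  # ['0x2000', '0x4000']
print([hex(v) for v in esc(0b00011, 0x8000, 0x4000)])  # ['0x8000', '0x800']
```

After this step both operands refer to the same exponent, so they can be passed to an ordinary fixed-point adder.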

4 SFMAC Architecture

To represent positive and negative numbers, the architecture employs both sign-magnitude and 2’s complement representations. Sign-magnitude form is used for the SFMAC inputs and outputs, but the inputs are converted to 2’s complement form for the internal calculations. The proposed MAC architecture’s final output (the MAC output) has 17 bits, including one sign bit.
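The conversion between the external sign-magnitude form and the internal 2’s complement form can be illustrated as follows. The helper names and the explicit `width` parameter are assumptions for this sketch; the paper does not give the conversion circuit at this level of detail.

```python
def sign_magnitude_to_twos(sign, magnitude, width):
    """Encode a (sign bit, magnitude) pair as a 'width'-bit 2's complement word."""
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)      # wrap negative values into 'width' bits


def twos_to_int(word, width):
    """Decode a 'width'-bit 2's complement word back to a signed integer."""
    if word & (1 << (width - 1)):          # MSB set means the value is negative
        return word - (1 << width)
    return word


word = sign_magnitude_to_twos(1, 3, 4)     # -3 in a 4-bit word
print(format(word, "04b"))                 # 1101
print(twos_to_int(word, 4))                # -3
```

The same pair of helpers covers a signed exponent field: a 3-bit 2’s complement word spans -4 to +3.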

The SFMAC’s inputs are two 8-bit binary numbers formatted as shown in Fig. 6. Each SFMAC input is 13 bits long, with two bits set aside as the sign bits of the number and of the exponent. Each sign bit is 0 or 1 depending on whether the corresponding value is positive or negative. The remaining eleven bits hold an 8-bit binary number and a 3-bit binary exponent. One important point is that the 3rd bit of the exponent in binary representation is set to 0 by default, since a 2-bit binary number takes 3 bits to be represented in 2’s complement form.

Figure 6: Input format representation of SFMAC

As a result, the exponent term in this architecture varies from ‘-4’ to ‘+3’. The input numbers range from $-(0.11111111)_2 \times 2^{+3}$ to $+(0.11111111)_2 \times 2^{+3}$, and hence the new SFMAC architecture’s inputs range from $-(7.96872)_{10}$ to $+(7.96872)_{10}$. Furthermore, the SFMAC architecture’s inputs can only be entered as fractions. For instance, the numbers $(001)_2$ and $(010)_2$ should be entered as $(0.00100000)_2 \times 2^{+3}$ and $(0.01000000)_2 \times 2^{+3}$ respectively as the inputs to the SFMAC. Similarly, $(101)_2$ and $(10)_2$ should be represented as $(0.10100000)_2 \times 2^{+3}$ and $(0.10000000)_2 \times 2^{+2}$ respectively to be processed by the SFMAC. Other than the Exponent Comparator Circuit (ECC) and Exponent Shifter Circuit (ESC) explained earlier, the main building blocks of the SFMAC architecture are the 8-bit multiplier, 16-bit register, 16-bit adder, 2:1 and 4:1 multiplexers of various sizes, and the Exponential Adder. SFMAC’s overall architecture is depicted in Fig. 7.
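The fractional input format above can be checked with a short Python sketch that evaluates a sign bit, an 8-bit fraction, and an exponent. The function name `sfmac_value` and the string representation of the fraction bits are assumptions for this example, not part of the paper’s design.

```python
def sfmac_value(sign, fraction_bits, exponent):
    """Evaluate (0.b1b2...b8)_2 x 2^exponent with an optional negative sign.

    Hypothetical helper: 'fraction_bits' is the 8-character bit string after
    the binary point; 'exponent' must lie in the stated range -4..+3.
    """
    if len(fraction_bits) != 8 or set(fraction_bits) - {"0", "1"}:
        raise ValueError("fraction_bits must be exactly 8 binary digits")
    if not -4 <= exponent <= 3:
        raise ValueError("exponent must lie between -4 and +3")
    fraction = int(fraction_bits, 2) / 2**8      # value of 0.b1...b8
    magnitude = fraction * 2.0**exponent
    return -magnitude if sign else magnitude


print(sfmac_value(0, "10100000", 3))   # 5.0, i.e. (101)_2
print(sfmac_value(0, "10000000", 2))   # 2.0, i.e. (10)_2
print(sfmac_value(0, "11111111", 3))   # 7.96875
```

Evaluated directly, the largest magnitude $(0.11111111)_2 \times 2^{+3}$ comes out as 7.96875, which matches the ±7.96872 quoted in the text up to the last digit.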

Figure 7: SFMAC architecture using ECC & ESC blocks

CMOS technologies are used to develop and implement the overall SFMAC architecture. A thorough study is carried out using Cadence Virtuoso. To limit the power consumption, the architecture employs a “clock gating scheme” and a pipeline mechanism. The clock-pulse pipelining is ensured by triggering successive blocks after a predetermined period.

The SFMAC architecture is implemented in 90 and 130 nm CMOS technology (GPDK and TSMC, respectively). Tab. 1 compares the SFMAC architecture in the two CMOS technologies for a particular input vector. The Cadence Spectre tool is used to measure the power usage of the implemented designs. The average power ($P_{Average}$) is calculated over a simulation time ($T_{sim}$) of 40 ns and at a clock frequency ($f_{clk}$) of 83.33 MHz, while the static power is evaluated for a 2 V supply voltage ($V_{DD}$). Since the transistor sizing is greater in 130 nm technology, which affects the load capacitance $C_{load}$, the average power ($P_{Average}$) consumption in 130 nm (TSMC) is higher than in 90 nm (GPDK). In the same way, device geometry affects static power consumption. As a result, a circuit with larger device dimensions can consume more static power. If $\alpha_T$ is the activity factor, then the CMOS dynamic power is calculated as Eq. (1):
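Assuming Eq. (1) takes the standard switching-power form for CMOS, it can be written with the symbols defined above as follows; the name $P_{dynamic}$ is an assumption used here for the left-hand side.

```latex
P_{dynamic} = \alpha_T \, C_{load} \, V_{DD}^{2} \, f_{clk} \qquad (1)
```

In this form the dynamic power grows linearly with the load capacitance $C_{load}$, which is consistent with the higher average power reported for the larger 130 nm devices.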

Tab. 2 shows a comparison of the proposed SFMAC architecture and existing MAC architectures in terms of power consumption. Since most of the architectures available in the literature use an HDL-based approach, whereas the proposed architecture is implemented in Cadence Virtuoso at 90 and 130 nm technologies, comparing the proposed SFMAC architecture with those already present in the literature is difficult. Furthermore, almost none of the architectures described in the literature support signed operations or floating-point designs.

Table 1: Performance of SFMAC at 90 and 130 nm CMOS technologies (GPDK and TSMC respectively)

Although there are architectures that use clock signals just for data accumulation (in the register or accumulator), most of the architectures in the literature do not use any clocking signals. Asynchronous circuits do not have real-time applicability. As a result, the functional applicability of those architectures must be further investigated. The architecture shown in [32] is designed for signed floating-point operation, whereas most of the reported architectures discussed in Tab. 2 are dedicated to implementing unsigned fixed-point Multiply-Accumulate operation.


Tab. 2 reveals that the architectures in [12,33,34] consume considerably higher static and average power (in mW) than the proposed SFMAC architecture. The architectures in [35,36] are examined for 16-bit operations at 1 V and 8-bit operations at 1.8 V in 90 and 180 nm technologies. Even though the existing work described in [35,36] requires less power than the proposed SFMAC (the existing circuits are analyzed with a supply voltage below 2 V, while the SFMAC uses a supply voltage of 2 V), these two implementations can only execute MAC operations on unsigned fixed-point numbers. As a result, the MAC architectures in [35,36] have a restricted scope. Although the architecture defined in [37] is implemented in 180 nm technology with a 1.8 V supply voltage for 16-bit MAC operation, it consumes substantially more power than the SFMAC architecture. The architecture listed in [38] is implemented for 1-bit unsigned fixed-point MAC operation in 32 nm CMOS and CNTFET technology, so a comparison with an 8-bit SFMAC is not meaningful. Although the architecture described in [32] is the only existing MAC architecture capable of signed floating-point operation, a comparative study with the proposed SFMAC reveals that the SFMAC is considerably more efficient in terms of power consumption.

Table 2: Proposed SFMAC vs. already reported architectures


5 Conclusion

A novel approach for performing normalization is explained in this paper. The proposed normalization operation is divided into the Exponent Comparator Circuit (ECC) and the Exponent Shifter Circuit (ESC). The ECC block compares the exponents, while the ESC shifts the smaller number by the difference between the exponents of the inputs. Further, a signed floating-point MAC architecture is proposed using the novel normalization architecture. For design and implementation, the Cadence Spectre tool is used at GPDK 90 nm and TSMC 130 nm CMOS technologies. The results show that the proposed SFMAC architecture uses less power than its recent counterparts and therefore has applicability in low-power DSP architectures.

Funding Statement: This work was supported by the Research Support Fund (RSF) of Symbiosis International (Deemed University), Pune, India.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
