Energy Efficient In-Memory Machine Learning for Data Intensive Image-Processing by Non-volatile Domain-Wall Memory

Hao Yu1*, Yuhao Wang1, Shuai Chen1, Wei Fei1, Chuliang Weng2, Junfeng Zhao2 and Zhulin Wei2
1 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
2 Huawei Shannon Laboratory, China
*Corresponding author: [email protected]; Tel: +65-6790-4509; Fax: +65-6793-3318

Abstract - Image processing in conventional logic-memory I/O-integrated systems incurs significant communication congestion at memory I/Os for excessive big image data at exa-scale. This paper explores an in-memory machine-learning neural-network architecture that utilizes the newly introduced domain-wall nanowire, called DW-NN. We show that all operations involved in neural-network machine learning can be mapped to a logic-in-memory architecture built from non-volatile domain-wall nanowires. Domain-wall nanowire based logic customized for machine learning is integrated within the image data storage, so that both neural-network training and processing can be performed locally within the memory. The experimental results show that the system throughput of DW-NN is improved by 11.6x and its energy efficiency by 92x when compared to a conventional image processing system.

I. Introduction
One exciting feature of future big-data storage systems is to find implicit patterns in data and extract valuable behavior behind them by big-data analytics, such as image feature extraction during image search. Instead of performing image search by calculating pixel similarity, image search by machine learning works much as human brains do. For example, feature extraction is first performed on each image to obtain its characteristics, which are then matched against key words. As such, image search becomes a traditional string-matching problem. However, to handle big image data at exa-scale, there is a memory wall: long memory access latency as well as limited memory bandwidth. Consider image search in one big-data storage system holding billions of images: performing feature extraction on any one of them leads to significant congestion at the I/Os when migrating data between memory and processor. Note that in-memory computing [1,2,3,4] is a promising future big-data solution to relieve the memory-wall issue. For example, domain-specific accelerators can be developed within memory for big-data processing such that the data is pre-processed before being read out, with the minimum number of data migrations. In this paper, big image data processing by machine learning is examined within the in-memory computing system architecture. Among numerous machine learning algorithms [5,6,7], the neural-network based approach has shown low complexity with generic adaptability. In particular, the extreme learning machine (ELM) [7,8] has one input layer, one hidden layer and one output layer; it is tuning-free, without an expensive iterative training process, which makes it suitable for low-cost hardware implementation. As such, the in-memory hardware accelerators of ELM are studied here for big image data processing.

Fig. 1. The diagram of the domain-wall nanowire device.

The proposed in-memory ELM computing system is examined with nano-scale non-volatile memory devices [9,10,11,12,13]. The domain-wall nanowire, or racetrack memory [12,13], is a newly introduced spintronic NVM device that offers not only high-density, high-performance memory storage, but also feasible in-memory computing capability. In this paper, we show the feasibility of mapping ELM to a fully domain-wall nanowire based in-memory neural-network computing system, called DW-NN. Compared to the scenario where ELM is executed on a CMOS based general purpose processor, the proposed DW-NN improves the system throughput by 11.6x and the energy efficiency by 92x. The rest of the paper is organized as follows. Section II reviews the fundamentals of domain-wall memory and the domain-wall memory based in-memory computing architecture. Section III presents the algorithm and implementation of in-memory machine learning by domain-wall nanowire devices. Experiments are reported in Section IV, with conclusions in Section V.

II. In-memory Architecture on Domain-wall Devices
A. Domain-wall Nanowire Devices
The domain-wall nanowire, also known as racetrack memory [12,13], is the third generation of spin-based NVM. As shown in Fig. 1, multiple bits of information are stored in a single ferromagnetic nanowire, separated by domain walls, with each bit represented by its magnetization direction. The domain walls can be moved left or right by applying a current through the shift ports at the two ends of the nanowire. The domain width of each bit remains unchanged, so the stored information is preserved: all bits are shifted in a tape-like manner, similar to a shift register. In a domain-wall nanowire device, a strongly magnetized ferromagnetic layer is placed along the ferromagnetic nanowire at a desired position, separated by an insulator layer.
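This tape-like access pattern can be made concrete with a minimal Python sketch that models a nanowire as a circular bit array with a single fixed access port, anticipating the MTJ-based read/write port described next. The class name, wire length and port position are illustrative assumptions, not details taken from the paper.

```python
# A minimal behavioral model of one domain-wall nanowire: a circular bit
# array shifted under a single fixed access port (the MTJ position).
# Length, port index and method names are illustrative assumptions.

class DomainWallNanowire:
    def __init__(self, length=64, port=0):
        self.bits = [0] * length   # one magnetization direction per domain
        self.port = port           # index of the MTJ access port

    def shift(self, steps):
        """Move every domain along the wire, as the shift current would."""
        n = steps % len(self.bits)
        self.bits = self.bits[n:] + self.bits[:n]

    def read(self):
        """GMR read: sense the bit currently aligned with the access port."""
        return self.bits[self.port]

    def write(self, value):
        """Spin-torque write: set the free-layer bit under the access port."""
        self.bits[self.port] = value

# Accessing bit i always means shifting it under the port first.
wire = DomainWallNanowire()
wire.write(1)       # store a '1' at the port
wire.shift(3)       # move it three domains away
wire.shift(-3)      # bring it back under the port
assert wire.read() == 1
```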

Such a sandwich-like structure forms a magnetic tunnel junction (MTJ) through which the stored information can be accessed. The formed MTJ exhibits different states depending on the alignment of the fixed layer and the free layer; this is the giant magnetoresistance (GMR) effect: the MTJ shows different resistance for parallel and anti-parallel alignments. By detecting the resistance of the MTJ, the stored bit can be read out. The write operation is achieved by altering the free-layer magnetization with an injected current: the electron spin forces the free layer to be parallel or anti-parallel to the fixed layer depending on the current direction [14]. Note that the write and read operations can only occur at the MTJ. Therefore, the bit to be operated on needs to be shifted and aligned with the fixed layer, while the shift direction and velocity are controlled by the current direction and amplitude [15].

B. Domain-wall Memory based In-Memory Computing
Conventionally, all data is maintained within a memory that is separated from the processor but connected by I/Os. Therefore, during execution, all data needs to be migrated to the processor and written back afterwards. In data-oriented applications, however, this incurs significant I/O congestion and hence greatly degrades the overall performance. In addition, significant standby power is consumed to hold the large volume of data. To overcome these two issues, the in-memory non-volatile computing architecture is introduced.

The overall architecture of the domain-wall memory based in-memory computing platform is illustrated in Fig. 2(a). In particular, domain-specific in-memory accelerators are integrated locally with the stored data in a distributed manner such that frequently used operations can be performed without much communication with the external processor. In addition, the distributed local accelerators provide great thread-level parallelism, so that throughput is improved. Fig. 2(b) shows how in-memory distributed Map-Reduce [16] data processing is performed locally within one data-array and local-logic pair; a behavioral sketch is given below. Firstly, the external processor issues commands to a specific pair to perform in-memory logic computing; the commands are received and interpreted by a controller in the accelerator. Secondly, the controller requests the related data from the data array with a read operation. The neural-network based processing in ELM, mainly the weighted sum and the sigmoid function, can then be performed in Map-Reduce fashion. Lastly, the results are written back to the data array. In this platform, the domain-wall nanowire is intensively utilized for ultra-low-power big-data processing in both memory and logic, with significantly reduced leakage and operating power. Moreover, domain-wall nanowire based adders, multipliers and look-up tables for the sigmoid function are adopted within the non-volatile memory, which further improves energy efficiency with small I/O overheads.
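As referenced above, the Map-Reduce flow of Fig. 2(b) can be sketched behaviorally in Python. This is a functional model only: the dwmul/dwadder/dwlut names mirror the figure's labels, but the software interface is hypothetical, and the sigmoid here is computed exactly rather than fetched from a lookup table.

```python
import numpy as np

def dwmul(weight_row, x):
    # Step 3 (Map): element-wise products emitted as (key, value) pairs
    return [(k, w * v) for k, (w, v) in enumerate(zip(weight_row, x))]

def dwadder(pairs):
    # Step 4 (Reduce): accumulate the partial products of one neuron
    return sum(v for _, v in pairs)

def dwlut(z):
    # Step 5: sigmoid; DW-NN fetches this from a lookup table, here exact
    return 1.0 / (1.0 + np.exp(-z))

def local_pair_compute(weights, x):
    # Steps 1-2 (task issue, data fetch) and 6 (write-back) belong to the
    # controller; only the arithmetic performed by the pair is modeled.
    return [dwlut(dwadder(dwmul(row, x))) for row in weights]

# Values from Fig. 2(b): three hidden neurons, input vector [2, 1, 3]
weights = [[0.5, 0.2, 0.6], [-0.8, 0.0, 0.3], [-0.7, 0.1, -0.4]]
print(local_pair_compute(weights, [2, 1, 3]))  # ~[0.95, 0.33, 0.08]
```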

III. Domain-wall Nanowire based Extreme Learning Machine
The in-memory ELM neural network offers two major advantages.

Fig. 2. (a) The overview of the in-memory computing architecture; (b) detailed domain-wall nanowire based machine learning platform in Map-Reduce fashion (Step 1: tasks issued by the external processor; Step 2: fetch data; Step 3: multiplication by DWMUL as the Map process; Step 4: sum by DWADDER and Step 5: sigmoid function by DWLUT as the Reduce process; Step 6: write back results).

Firstly, all domain-wall based arithmetic operations in ELM can be integrated into the memory, and these operations are performed directly on operands stored in non-volatile domain-wall memory. That is significantly different from the conventional memory-logic architecture, where data needs to be transferred from the memory to the data path and then written back to the memory after being processed. Secondly, all the computing-intensive operations in ELM are implemented by domain-wall nanowire devices, which can also be used as storage units. This provides integration compatibility between the data path and the memory used in ELM, as well as the ability to reuse peripheral circuits such as decoders and sense amplifiers.

A. Extreme Learning Machine
We first review the basics of the neural-network based ELM algorithm. Among numerous machine learning algorithms [5,6], the support vector machine (SVM) and the neural network (NN) are widely discussed. However, both algorithms have major challenges in terms of slow learning speed, trivial human intervention (parameter tuning) and poor computational scalability [5].

Fig. 3. Computation flow of extreme learning machine (ELM): the input vector is transformed through input weights iw_ij into the hidden layer with sigmoid activation 1/(1+e^-x), and the hidden outputs are combined through output weights ow_ij into the output vector.

Extreme Learning Machine (ELM) was initially proposed [7] for single-hidden-layer feed-forward neural networks (SLFNs). Compared with traditional neural networks, ELM eliminates the need for parameter tuning in the training stage and hence reduces the training time significantly. The output function of ELM is formulated as (only one output node is considered)

f_L(x) = Σ_{i=1}^{L} β_i h_i(x) = h(x)β,  (1)

where β = [β_1, ..., β_L]^T is the output weight vector storing the output weights between the hidden layer and the output node, and h(x) = [h_1(x), ..., h_L(x)] is the hidden layer output given input vector x, performing the transformation of the input vector into an L-dimensional feature space. The training process of ELM aims to obtain the output weight vector β; to enhance generalization, ELM minimizes the training error as well as the norm of the output weights:

Minimize: ||Hβ − T|| and ||β||,  (2)

where T is the target vector in the training process. β can then be solved by the minimal-norm least-squares method as

β = H†T,  (3)

where H† is the Moore-Penrose generalized inverse of matrix H.

The application of ELM for image processing in this paper is an ELM based image super-resolution (SR) algorithm [8], which learns the image features of a specific category of images and improves low-resolution figures by applying the learned knowledge. Note that ELM-SR is commonly used as a pre-processing stage to improve image quality before applying other image algorithms. It involves intensive matrix operations, such as matrix addition, matrix multiplication, as well as exponentiation on each element of a matrix. Fig. 3 illustrates the computation flow for ELM-SR, where the input vector obtained from the input image is multiplied by the input weight matrix. The result is then added with a bias vector b to generate the input of the sigmoid function. Lastly, the sigmoid function outputs are multiplied with the output weight matrix to produce the final results.
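For clarity, the following Python sketch mirrors Eqs. (1)-(3): random, untuned input weights and biases, a sigmoid hidden layer, and output weights solved in one shot through the Moore-Penrose pseudo-inverse. The toy regression task and all sizes are assumptions chosen for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, T, L):
    """X: (N, d) inputs; T: (N,) targets; L: number of hidden neurons."""
    IW = rng.standard_normal((X.shape[1], L))  # random, untuned input weights
    b = rng.standard_normal(L)                 # random, untuned biases
    H = 1.0 / (1.0 + np.exp(-(X @ IW + b)))    # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T               # Eq. (3): beta = H†T
    return IW, b, beta

def elm_predict(X, IW, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ IW + b)))
    return H @ beta                            # Eq. (1): f(x) = h(x)beta

# Toy regression: learn y = sin(x) from noisy samples, no iterative tuning.
X = np.linspace(-3, 3, 200).reshape(-1, 1)
T = np.sin(X).ravel() + 0.05 * rng.standard_normal(200)
IW, b, beta = elm_train(X, T, L=40)
print(np.mean((elm_predict(X, IW, b, beta) - T) ** 2))  # small training MSE
```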

Fig. 4. Domain-wall nanowire based full adder with SUM operation by DW-XOR logic and CARRY operation by resistor comparator.

In the following, we demonstrate how to map the fundamental addition, multiplication, and sigmoid function to domain-wall nanowires.

B. Weighted Sum by Domain-wall Adder and Multiplier
The GMR effect can be interpreted as a bit-wise XOR operation on the magnetization directions of two thin magnetic layers, where the output is denoted by a high or low resistance. In a GMR-based MTJ structure, however, the XOR logic fails because only one operand is a variable, since the magnetization of the fixed layer is constant. Nevertheless, this problem can be overcome by the unique shift operation of the domain-wall nanowire device, which enables domain-wall based XOR logic for computing.

A bitwise-XOR logic implemented by two domain-wall nanowires is shown in Fig. 4. The bitwise XOR is performed by constructing a new read-only port, where two free layers and one insulator layer are stacked. The two free layers are the size of one magnetization domain and come from the two respective nanowires. Thus the two operands, denoted by the magnetization directions in the free layers, can both be variables, with values assigned through the MTJ of the corresponding nanowire and then shifted to the operating port, where the XOR logic is performed.

Fig. 5. (a) Sigmoid function implemented by domain wall nanowire based look-up table (DW-LUT), with parallel output by distributing bits into separate nanowires accessed through word-line/bit-line decoders, column mux and sense amplifiers; (b) DW-LUT size effect on the precision of the sigmoid function, comparing the ideal logistic curve with approximations by a small and a large LUT. The larger the LUT, the smoother and more precise the curve is.


For example, A XOR B can be executed in the following steps:
• The operands A and B are loaded into the two nanowires by enabling WR1 and WR2 respectively;
• A and B are shifted from their access ports to the read-only port by enabling SHF1 and SHF2 respectively;
• By enabling RD, the bit-wise XOR result is obtained through the GMR effect.

By deploying two DW-XOR logic units, the SUM operation of a full adder can be achieved by domain-wall nanowire devices with low power consumption. In addition, to realize a full adder, the CARRY operation is also needed. A spintronics based CARRY operation is proposed in [17], where a pre-charge sensing amplifier (PCSA) is used for resistance comparison. The CARRY logic by PCSA and two branches of domain-wall nanowires is shown in Fig. 4. The three operands for the CARRY operation are denoted by the resistance of an MTJ (low for 0 and high for 1) and belong to respective domain-wall nanowires in the left branch; the right branch is made complementary to the left one. Note that the two output nodes are pre-charged high when the PCSA EN signal is low. When the circuit is enabled, the branch with the lower resistance discharges its output to "0". For example, when the left branch has no or only one MTJ in high resistance, i.e. no carry out, the right branch has three or two MTJs in high resistance, so that Cout reads "0". The complete truth table confirms the CARRY logic of this circuit.
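A bit-level behavioral sketch in Python may help here: SUM is two cascaded XORs, as the DW-XOR read port provides, and CARRY is a majority vote, which is the function the PCSA resistance comparison realizes. It also anticipates the shift/add multiplier discussed later in this section. Function names and bit widths are illustrative assumptions.

```python
def dw_xor(a, b):
    # One DW-XOR unit: GMR read of two stacked free layers
    return a ^ b

def dw_full_adder(a, b, cin):
    s = dw_xor(dw_xor(a, b), cin)          # SUM: two cascaded DW-XOR units
    cout = 1 if a + b + cin >= 2 else 0    # CARRY: PCSA branch comparison
    return s, cout                         #        acts as a majority vote

def dw_ripple_add(x, y, width=16):
    carry, out = 0, 0
    for i in range(width):
        s, carry = dw_full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= s << i
    return out

def dw_shift_add_mul(x, y, width=8):
    # Shift/add multiplier: the 'shift' maps onto the nanowire's intrinsic
    # domain shift; partial products accumulate through the DW adder.
    acc = 0
    for i in range(width):
        if (y >> i) & 1:
            acc = dw_ripple_add(acc, x << i)
    return acc

assert dw_ripple_add(100, 55) == 155
assert dw_shift_add_mul(12, 11) == 132
```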

TABLE I
Domain-wall Operations and Logic Performance

Domain wall nanowire device | Speed (cycles) | Energy (pJ)
read                        | 1              | 0.5
write                       | 1              | 0.1
shift                       | 1              | 0.3

Domain wall nanowire logic  | Speed (cycles) | Energy (pJ) | Area (um²)
8-bit full adder            | 54             | 40          | 2.6
8-bit multiplier            | 163            | 308         | 18.9
8-bit sigmoid (LUT)         | 2              | 116         | 31.8

TABLE II
System Area, Power, Throughput and Energy Efficiency Comparison between In-Memory Architecture and Conventional Architecture

Platform                         | GPP (with on-chip memory) | GPP (with off-chip memory) | DW-NN
Computational resources utilized | 1×Processor               | 1×Processor                | 1×Processor, 7714×DW-ADDER, 7714×DW-MUL, 551×DW-LUT, 1×controller
Area of computational units      | 18 mm²                    | 18 mm²                     | 18 mm² (processor) + 0.5 mm² (accelerators)
Power (Watt)                     | 12.5                      | 12.5                       | 10.1
Throughput                       | 9.3 MBytes/s              | 9.3 MBytes/s               | 108 MBytes/s
Energy efficiency (nJ/bit)       | 389                       | 642                        | 7

The domain-wall nanowires also work as the writing circuit for the operands, by writing values at one end and shifting them to the PCSA. Note that with the full adder implemented by domain-wall nanowires and the intrinsic shift ability of the domain-wall nanowire, a shift/add multiplier can be readily achieved purely by domain-wall nanowires.

C. Sigmoid Function by Domain-Wall Lookup Table
The sigmoid function involves exponentiation, division, and addition, and is a computing-intensive operation in the ELM application. In particular, the exponentiation takes many cycles to execute on a conventional processor due to the lack of a corresponding accelerator. Therefore, it is highly economical to perform the exponentiation by look-up table. A look-up table (LUT), essentially a pre-configured memory array, takes a binary address as input, finds the target cells that contain the result through decoders, and finally outputs it through sense amplifiers. A domain-wall nanowire based LUT (DW-LUT) is illustrated in Fig. 5(a). Compared with a conventional CMOS SRAM or DRAM implementation, the DW-LUT demonstrates two major advantages. Firstly, extremely high integration density can be achieved since multiple bits can be packed in one nanowire. Secondly, zero standby power can be expected, as a non-volatile device does not need to be powered to retain the stored data.

Fig. 6. (a) Original image before ELM-SR algorithm (SSIM value is 0.91); (b) image quality improved after ELM-SR algorithm by the DW-NN hardware implementation (SSIM value is 0.94); (c) image quality improved by the GPP platform (SSIM value is 0.97).

By distributing the multiple bits of the results into separate nanowires, the serial operation of the nanowire can be avoided and the lookup completed quickly. Note that the LUT size is determined by the input domain, the output range, and the required precision of the floating-point numbers. Fig. 5(b) shows the ideal logistic curve and the curves approximated by LUTs. It can be observed that the output range is bounded between 0 and 1, and although the input domain is infinite, it is only informative around 0. The LUT is, visually, the digitized logistic curve, and its granularity, i.e. precision, depends on the LUT size. For machine learning applications, precision is not as critical as in scientific computation. As a result, the LUT size for the sigmoid function can be greatly optimized, which leads to high energy efficiency for sigmoid execution.
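The size-precision trade-off of Fig. 5(b) can be illustrated with a small Python experiment: quantize the sigmoid over a clipped input domain into tables of different sizes and compare the worst-case error. The [-8, 8] domain and the table sizes are assumed values, not design parameters from the paper.

```python
import numpy as np

def build_sigmoid_lut(entries, lo=-8.0, hi=8.0):
    xs = np.linspace(lo, hi, entries)
    return xs, 1.0 / (1.0 + np.exp(-xs))

def lut_sigmoid(x, xs, table):
    # Address decode: select the entry at the insertion index, clipped to
    # the table bounds -- a step approximation of the logistic curve.
    i = np.clip(np.searchsorted(xs, x), 0, len(xs) - 1)
    return table[i]

x = np.linspace(-10, 10, 10001)
ref = 1.0 / (1.0 + np.exp(-x))
for entries in (16, 256):  # small vs. large LUT, as in Fig. 5(b)
    xs, table = build_sigmoid_lut(entries)
    err = np.max(np.abs(lut_sigmoid(x, xs, table) - ref))
    print(f"{entries:4d} entries -> max |error| = {err:.4f}")
```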

IV. Experiment Results
A. Domain-Wall Logic Performance
To evaluate the domain-wall nanowire based in-memory ELM computing system, the following design platform has been set up. Firstly, at device level, the transient simulation of MTJ read and write operations is performed within NVM-SPICE [18,19,20,21] to obtain accurate operation energy and timing for the domain-wall nanowire. The shift-operation energy is modeled as the Joule heat dissipated in the nanowire when the shift current is applied; the relationship between shift-current density and shift velocity is based on [15]. The area of one domain-wall nanowire is calculated from its dimension parameters. Specifically, the 32nm technology node is assumed, with a width of 32nm, a length of 64nm per domain, and a thickness of 2.2nm for one domain-wall nanowire; Roff is set at 2600Ω, Ron at 1000Ω, the writing current at 100μA, and the current density at 6×10^8 A/cm² for the shift operation. Secondly, at circuit level, the memory modeling tool CACTI [22] is modified, named DW-CACTI, to provide accurate power and area information for domain-wall nanowire memory peripheral circuits such as decoders and sense amplifiers (SAs). Together with the device-level performance data, the DW-ADDER as well as the DW-LUT can be evaluated at circuit level.

Table I shows the energy cost and speed of the basic domain-wall nanowire operations as well as the logic circuits. Different from CMOS based logic, domain-wall nanowire based logic circuits need multiple cycles to operate, and thus their latency is expected to be much longer than that of their CMOS counterparts. However, computation throughput outweighs latency in the context of big-data applications, where data parallelism is high; benefiting from the leakage-free and high-density properties, more domain-wall logic resources can be allocated to significantly increase system performance.

B. In-memory Throughput and Energy Efficiency
To compare the proposed in-memory DW-NN platform with the conventional general purpose processor (GPP) based platform, the ELM based super-resolution (ELM-SR) application is executed as the workload. The evaluation of ELM-SR on the GPP platform is based on gem5 [23] and McPAT [24] for the core power and area models. DW-NN is evaluated in our self-consistent simulation platform based on NVM-SPICE, DW-CACTI, and a DW-NN behavioral simulator. The processor runs at 3GHz while the accelerators run at 500MHz. The system memory capacity is set as 1GB, and the bus width is set as 128 bits. Based on [25], 3.7nJ and 6.3nJ per access are used as the on-chip and off-chip I/O overheads respectively.

Table II compares ELM-SR on the DW-NN and GPP platforms. Due to the deployment of in-memory accelerators and high data parallelism, the throughput of DW-NN improves by 11.6x compared to the GPP platform. In terms of the area used by computational resources, DW-NN is 2.7% higher than the GPP platform: an additional 0.5 mm² is used to deploy the domain-wall nanowire based accelerators. Thanks to the high integration density of domain-wall nanowires, the numerous accelerators are added with only a slight area overhead. In DW-NN, the additional power consumed by the accelerators is compensated by the saved dynamic power of the processor, since the computation is mostly performed by the in-memory logic; overall, DW-NN achieves a power reduction of 19%. The most noticeable advantage of DW-NN is its much higher energy efficiency compared to GPP: specifically, 56x and 92x better than GPP with on-chip and off-chip memory respectively. The advantage comes from three aspects: (a) the in-memory computing architecture that saves I/O overhead; (b) the non-volatile domain-wall nanowire devices that are leakage free; and (c) the application-specific accelerators.

Fig. 6 shows the image quality comparison between the proposed in-memory DW-NN hardware implementation and the conventional GPP software implementation. To measure the performance quantitatively, structural similarity (SSIM) [26] is used to assess image quality after the ELM-SR algorithm. It can be observed that the images after the ELM-SR algorithm on both platforms have higher quality than the original low-resolution image. However, due to the use of the LUT, which trades off precision against hardware complexity, the image quality in DW-NN is slightly lower than on GPP: specifically, the SSIM is 0.94 for DW-NN, 3% lower than the 0.97 for GPP.
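As a sanity check on the reported ratios, the Table II figures reproduce the headline numbers; a small Python calculation follows (the dictionary layout is ours, not an interface from the paper).

```python
# Table II figures: throughput in MBytes/s, energy efficiency in nJ/bit,
# power in Watts.
gpp_on  = {"tput": 9.3, "nj_bit": 389, "power": 12.5}
gpp_off = {"tput": 9.3, "nj_bit": 642, "power": 12.5}
dw_nn   = {"tput": 108, "nj_bit": 7,   "power": 10.1}

print(dw_nn["tput"] / gpp_on["tput"])        # ~11.6x throughput gain
print(gpp_on["nj_bit"] / dw_nn["nj_bit"])    # ~56x efficiency vs on-chip GPP
print(gpp_off["nj_bit"] / dw_nn["nj_bit"])   # ~92x efficiency vs off-chip GPP
print(1 - dw_nn["power"] / gpp_on["power"])  # ~19% power reduction
```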

V. Conclusion
With the use of the newly introduced domain-wall nanowire, this paper explores an in-memory architecture for machine learning on neural networks, called DW-NN. In the proposed DW-NN, domain-wall nanowire based logic customized for machine learning is integrated within the image data storage such that machine-learning based image processing can be performed locally within the memory. We show that all operations involved in machine learning on neural networks can be mapped to a logic-in-memory architecture by non-volatile domain-wall nanowires. The experimental results show that the I/O load in the proposed DW-NN is greatly alleviated, with an energy efficiency improvement of 92x and a throughput improvement of 11.6x compared to a conventional image processing system on a general purpose processor.

Acknowledgments
This work is sponsored by Singapore MOE TIER-2 fund MOE2010-T2-2-037 (ARC 5/11) and NRF-CRP fund NRF-CRP9-2011-01.

References
[1] S. Matsunaga et al., "Fabrication of a nonvolatile full adder based on logic-in-memory architecture using magnetic tunnel junctions," Applied Physics Express, vol. 1, p. 1301, 2008.
[2] H. Kimura et al., "Complementary ferroelectric-capacitor logic for low-power logic-in-memory VLSI," IEEE Journal of Solid-State Circuits, vol. 39, pp. 919-926, 2004.
[3] Y. Wang, H. Yu and D. Sylvester, "Energy efficient in-memory AES encryption based on nonvolatile domain-wall nanowire," ACM/IEEE Design, Automation and Test in Europe, March 2014.
[4] Y. Wang, P. Kong, and H. Yu, "Logic-in-memory based map-reduced computing by nonvolatile domain-wall nanowire devices," IEEE Non-Volatile Memory Technology Symposium, August 2013.
[5] D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," Machine Learning, vol. 3, pp. 95-99, 1988.
[6] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," Proceedings of the ninth ACM International Conference on Multimedia, pp. 107-118, Oct. 2001.
[7] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, pp. 489-501, 2006.
[8] L. An and B. Bhanu, "Image super-resolution by extreme learning machine," 19th IEEE International Conference on Image Processing, pp. 2209-2212, Sept. 2012.
[9] D. B. Strukov et al., "The missing memristor found," Nature, vol. 453, pp. 80-83, 2008.
[10] F. Bedeschi et al., "4-Mb MOSFET-selected phase-change memory experimental chip," Proceedings of the 30th European Solid-State Circuits Conference, pp. 207-210, Sept. 2004.
[11] M. Hosomi et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," IEEE International Electron Devices Meeting Technical Digest, pp. 459-462, Dec. 2005.
[12] S. S. Parkin, M. Hayashi, and L. Thomas, "Magnetic domain-wall racetrack memory," Science, vol. 320, pp. 190-194, 2008.
[13] R. Venkatesan et al., "TapeCache: a high density, energy efficient cache based on domain wall memory," ACM/IEEE International Symposium on Low Power Electronics and Design, July 2012.
[14] X. Wang et al., "Spintronic memristor through spin-torque-induced magnetization motion," IEEE Electron Device Letters, vol. 30, pp. 294-297, 2009.
[15] C. Augustine et al., "Numerical analysis of domain wall propagation for dense memory arrays," IEEE International Electron Devices Meeting (IEDM), pp. 17.6.1-17.6.4, Dec. 2011.
[16] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107-113, 2008.
[17] H.-P. Trinh, D. Ravelosona, C. Chappert, et al., "Domain wall motion based magnetic adder," Electronics Letters, vol. 48, pp. 1049-1051, 2012.
[18] W. Fei et al., "Design exploration of hybrid CMOS and memristor circuit by new modified nodal analysis," IEEE Transactions on Very Large Scale Integration Systems, vol. 20, no. 6, pp. 1012-1025, June 2012.
[19] Y. Shang, W. Fei, and H. Yu, "Analysis and modeling of internal state variables for dynamic effects of nonvolatile memory devices," IEEE Transactions on Circuits and Systems I, vol. 59, no. 9, pp. 1906-1918, September 2012.
[20] Y. Wang, H. Yu, and W. Zhang, "Nonvolatile CBRAM crossbar based 3D integrated hybrid memory for data retention," IEEE Transactions on Very Large Scale Integration Systems, 2013.
[21] Y. Wang and H. Yu, "An ultralow-power memory-based big-data computing platform by nonvolatile domain-wall nanowire devices," ACM/IEEE International Symposium on Low Power Electronics and Design, September 2013.
[22] S. J. Wilton and N. P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE Journal of Solid-State Circuits, vol. 31, pp. 677-688, 1996.
[23] N. Binkert et al., "The gem5 simulator," ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.
[24] S. Li et al., "McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures," IEEE/ACM International Symposium on Microarchitecture, pp. 469-480, 2009.
[25] J.-K. Kim et al., "A 3.6 Gb/s/pin simultaneous bidirectional (SBD) I/O interface for high-speed DRAM," ISSCC Dig. Tech. Papers, pp. 414-415, Feb. 2004.
[26] Z. Wang et al., "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, pp. 600-612, 2004.

