Accelerating String Matching Using Multi-threaded Algorithm on GPU

Cheng-Hung Lin*, Sheng-Yu Tsai**, Chen-Hsiung Liu**, Shih-Chieh Chang**, Jyuo-Min Shyu**
*National Taiwan Normal University, Taiwan
**National Tsing Hua University, Taiwan

Introduction
• Network Intrusion Detection Systems (NIDS) have been widely used to detect network attacks.
• The pattern matching engine dominates the performance of an NIDS.
• Traditional pattern matching approaches on a uniprocessor are too slow for today’s networking.
• Hardware approaches for accelerating pattern matching:
  – Logic-based
  – Memory-based
  – Multiprocessor-based

GPU for Pattern Matching
• Parallel computation on GPU is suitable for accelerating pattern matching.

[Figure: the 24-character input AAAAAAAAAAAAAAAAAAAAAAAB scanned by 1 thread takes 24 cycles; divided into 4 segments and scanned by 4 threads (Thread #1 to Thread #4), it takes 6 cycles.]
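The cycle counts on this slide can be reproduced with a short Python sketch. It assumes one character is examined per cycle and that the input is cut into contiguous, equally sized segments; the helper name `split_into_segments` is illustrative and does not come from the slides.

```python
def split_into_segments(text: str, num_threads: int) -> list[str]:
    """Split text into num_threads contiguous segments of near-equal length."""
    seg_len = -(-len(text) // num_threads)  # ceiling division
    return [text[i:i + seg_len] for i in range(0, len(text), seg_len)]


text = "A" * 23 + "B"                  # the 24-character input from the slide
segments = split_into_segments(text, 4)

# Assumption: one character per cycle, all threads running concurrently.
cycles_single = len(text)
cycles_parallel = max(len(s) for s in segments)
print(cycles_single, cycles_parallel)  # 24 6
```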

Boundary Problem
• Boundary problem
  – A pattern occurring across the boundary of adjacent segments cannot be detected.
  – This produces false negative results.

[Figure: input AAAAAAAAAAAAAAAAAABBBBBB divided among Thread #1 to Thread #4; an occurrence spanning a segment boundary is missed (false negative).]
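A minimal Python sketch of the false negative, assuming the 24-character input is cut into four 6-character segments and the pattern searched for is "AB". The function name `scan_segments` is hypothetical, and the built-in `str.find` stands in for whatever matcher each thread would run.

```python
def scan_segments(text: str, pattern: str, seg_len: int) -> list[int]:
    """Scan each segment independently and return the global match offsets."""
    hits = []
    for start in range(0, len(text), seg_len):
        segment = text[start:start + seg_len]   # a thread sees only its own segment
        pos = segment.find(pattern)
        while pos != -1:
            hits.append(start + pos)
            pos = segment.find(pattern, pos + 1)
    return hits


text = "A" * 18 + "B" * 6            # input from the slide
print(text.find("AB"))               # 17: a sequential scan finds the pattern
print(scan_segments(text, "AB", 6))  # []: the segmented scan misses it
```

The only occurrence of "AB" starts at offset 17, the last character of the third segment, so no single segment contains it.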

Overlapped Computation
• To resolve the boundary problem
  – Scan across boundaries
• Problem
  – Overhead of overlapped computation
  – Throughput reduction

[Figure: input AAAAAAAAAAAAAAAAAABBBBBB divided among Thread #1 to Thread #4; by scanning past its segment boundary, Thread #3 can identify "AB".]
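The trade-off can be sketched in Python as follows. The sketch assumes four 6-character segments, the pattern "AB", and an overlap of pattern length minus one; `scan_with_overlap` is an illustrative name, and `str.find` again stands in for the per-thread matcher.

```python
def scan_with_overlap(text: str, pattern: str, seg_len: int):
    """Each thread scans its segment plus len(pattern) - 1 extra characters."""
    overlap = len(pattern) - 1
    hits, scanned = [], 0
    for start in range(0, len(text), seg_len):
        window = text[start:start + seg_len + overlap]
        scanned += len(window)                   # total work, overlap included
        pos = window.find(pattern)
        while pos != -1:
            hits.append(start + pos)
            pos = window.find(pattern, pos + 1)
    return hits, scanned


text = "A" * 18 + "B" * 6
hits, scanned = scan_with_overlap(text, "AB", 6)
print(hits)                # [17], found by the third thread (segment starting at 12)
print(scanned, len(text))  # 27 24: three extra characters are scanned
```

The extra characters grow with the pattern length and the number of segments, which is the overhead the slide refers to.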

Aho-Corasick Algorithm
• The Aho-Corasick (AC) algorithm has been widely used for pattern matching because it matches multiple patterns in a single pass.
  – It compiles multiple patterns into a composite state machine.

Patterns: (1) AB (2) ABG (3) BEDE (4) EF

[Figure: AC state machine for the four patterns, with states 0 to 9 and transitions labeled A, B, G, E, D, F and [^ABE].]

Aho-Corasick Algorithm (cont.)
• The Aho-Corasick (AC) state machine is composed of two kinds of transitions:
  – Solid lines represent valid transitions.
  – Dotted lines represent failure transitions.
• A failure transition backtracks the state machine to recognize patterns starting at different locations.

[Figure: AC state machine (states 0 to 9) processing the input string A B E D E; patterns are recognized starting at location 1 and location 2.]
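The role of failure transitions can be illustrated with a textbook-style Aho-Corasick sketch in Python. This is not the authors' implementation: the data layout (`goto`, `fail`, `out`) and function names are assumptions, state numbers need not match the figure, and offsets are reported 0-based, whereas the slide counts locations from 1.

```python
from collections import deque


def build_ac(patterns):
    """Build goto (valid) transitions, failure transitions and output sets."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:                             # breadth-first over the trie
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches ending here
    return goto, fail, out


def ac_search(text, patterns):
    goto, fail, out = build_ac(patterns)
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:       # follow failure transitions
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i - len(p) + 1, p) for p in out[s]]
    return hits


print(ac_search("ABEDE", ["AB", "ABG", "BEDE", "EF"]))
# [(0, 'AB'), (1, 'BEDE')]
```

After matching "AB", the input "E" has no valid transition, so the machine follows the failure transition to the state for "B" and continues from there to recognize "BEDE".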

Problems of AC on GPU
• Direct implementation of AC on GPU
  – To resolve the boundary problem, each thread has a lower bound on its scanning length:
    • Constraint = segment length + overlapped length
    • Overlapped length = length of the longest pattern - 1
  – Overhead of overlapped computation

[Figure: input AAAAAAAAAAAAAAAAAABBBBBB.]
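As a worked example of the constraint, the following Python sketch evaluates it for the four example patterns used in these slides (AB, ABG, BEDE, EF) and an assumed segment length of 6. The function name is illustrative.

```python
def min_scan_length(segment_length: int, patterns: list[str]) -> int:
    """Constraint = segment length + (length of the longest pattern - 1)."""
    overlapped_length = max(len(p) for p in patterns) - 1
    return segment_length + overlapped_length


patterns = ["AB", "ABG", "BEDE", "EF"]
print(min_scan_length(6, patterns))  # 9: 6 own characters + 3 overlapped
```

With these numbers each thread scans 9 characters to cover 6, so half as much work again is spent on overlap.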

Problems of AC on GPU (cont.)


Failureless-AC State Machine
• AC state machine
  – A failure transition backtracks the state machine to recognize patterns starting at different locations.

[Figure: AC state machine (states 0 to 9) processing the input string A B E D E; patterns are recognized starting at location 1 and location 2.]

• Failureless-AC state machine
  – Remove all failure transitions.
  – Terminate when there is no valid transition.
  – Recognize patterns starting at location 1.

[Figure: Failureless-AC state machine (states 0 to 9) processing the input string A B E D E from location 1; the traversal stops when no valid transition exists.]
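A minimal Python sketch of a Failureless-AC traversal, assuming the state machine is stored as a plain trie (`goto`) with one optional matched pattern per state (`out`). These names and the function `match_from` are illustrative, not taken from the slides.

```python
def build_trie(patterns):
    """Valid transitions only; there are no failure transitions."""
    goto, out = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); out.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s] = p
    return goto, out


def match_from(text, start, goto, out):
    """Traverse from text[start]; stop at the first missing transition."""
    s, found = 0, []
    for ch in text[start:]:
        s = goto[s].get(ch)
        if s is None:          # no valid transition: terminate
            break
        if out[s] is not None:
            found.append(out[s])
    return found


goto, out = build_trie(["AB", "ABG", "BEDE", "EF"])
print(match_from("ABEDE", 0, goto, out))  # ['AB']
print(match_from("ABEDE", 1, goto, out))  # ['BEDE']
```

Starting at the first character, the traversal reports "AB" and then stops at "E"; "BEDE" is found only by a separate traversal that starts one character later.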

Parallel Failureless-AC Algorithm
• Parallel Failureless-AC (PFAC) algorithm
  – Allocate a thread to each byte of the input to traverse the Failureless-AC state machine.

[Figure: input XXXXXXXXXABEDEXXXXXXXXXX with one thread starting at each byte.]
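The one-thread-per-byte idea can be sketched on the CPU in Python, with a thread pool standing in for GPU threads. This is only an illustration under stated assumptions: the names `build_trie`, `pfac_thread` and `pfac` are hypothetical, the patterns are the four examples from the earlier slides, and a real GPU kernel would look quite different.

```python
from concurrent.futures import ThreadPoolExecutor

PATTERNS = ["AB", "ABG", "BEDE", "EF"]


def build_trie(patterns):
    goto, out = [{}], [None]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); out.append(None)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s] = p
    return goto, out


def pfac_thread(text, start, goto, out):
    """Work of one thread: traverse from its own byte, stop on the first miss."""
    s, hits = 0, []
    for ch in text[start:]:
        s = goto[s].get(ch)
        if s is None:
            break
        if out[s] is not None:
            hits.append((start, out[s]))
    return hits


def pfac(text, patterns):
    goto, out = build_trie(patterns)
    with ThreadPoolExecutor() as pool:   # one task per input byte
        per_thread = list(pool.map(lambda i: pfac_thread(text, i, goto, out),
                                   range(len(text))))
    return [hit for hits in per_thread for hit in hits]


text = "X" * 9 + "ABEDE" + "X" * 10     # input from the slide
print(pfac(text, PATTERNS))             # [(9, 'AB'), (10, 'BEDE')]
```

Every pattern occurrence is reported by the thread that starts at its first byte, so no occurrence depends on a segment boundary.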

Mechanism of PFAC

[Figure: for the input …XXXXABEDEXXX…, Thread #n and Thread #n+1 start at adjacent bytes and each traverses the Failureless-AC state machine (states 0 to 9).]

Reducing Overlapped Computation
• Direct implementation of the AC algorithm
  – Each thread has a lower bound on its scanning length.
  – Overlapped computation (overlapped length = 3)
• PFAC algorithm
  – No boundary problem.
  – Each thread has a variable scanning length.
  – Most threads terminate early.
  – Reduces overlapped computation.

[Figure: input …CCCCCCCCBCCCCCC… with labels 1, 1, 3.]
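Early termination can be made concrete by counting how many characters each PFAC thread examines. The Python sketch below assumes the four example patterns from the earlier slides (AB, ABG, BEDE, EF), the input fragment shown on this slide, and a counting convention in which the character that causes termination is included; all names are illustrative.

```python
PATTERNS = ["AB", "ABG", "BEDE", "EF"]


def build_goto(patterns):
    """Trie transitions only: goto[state][char] -> next state."""
    goto = [{}]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
    return goto


def scan_lengths(text, patterns):
    """Characters examined by the thread starting at each byte."""
    goto = build_goto(patterns)
    lengths = []
    for start in range(len(text)):
        s, n = 0, 0
        for ch in text[start:]:
            n += 1                   # the failing character is counted too
            s = goto[s].get(ch)
            if s is None:
                break
        lengths.append(n)
    return lengths


text = "C" * 8 + "B" + "C" * 6
lengths = scan_lengths(text, PATTERNS)
print(lengths)                       # 1 everywhere, 2 for the thread at "B"
print(sum(lengths), max(lengths))    # 16 2
```

Under these assumptions 15 threads examine 16 characters in total, and no thread comes close to the longest pattern length of 4.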

Experimental Environments
• CPU: Intel® Core™ i7 CPU 950 @ 3.07 GHz
  – 4 cores
  – 12 GB DDR3 memory
• GPU: NVIDIA® GeForce® GTX 480 @ 1.4 GHz
  – 480 cores
  – 1536 MB GDDR5 memory
• Patterns: string patterns of Snort V2.4
  – 1,998 rules containing 41,997 characters
  – 27,754 states in total
• Input: normal case and worst case
  – DEFCON packets

Experimental Results

Table 1: Throughput of normal case inputs

Input size   AC_CPU      AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread    8 threads   8 threads    multi-threads   to fastest
             (Gbps)      (Gbps)      (Gbps)       (Gbps)
2 MB         0.73        2.84        3.98         69.73           17.53
4 MB         0.72        2.98        4.06         80.36           19.78
8 MB         0.72        3.04        4.24         87.66           20.66
16 MB        0.72        3.06        3.26         90.13           27.63
32 MB        0.72        3.04        4.32         88.20           20.43
64 MB        0.73        3.11        4.39         94.89           21.63
128 MB       0.82        3.36        4.71         110.63          23.51
192 MB       0.86        3.43        4.69         117.04          24.94
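The "Speedup to fastest" column appears to be the PFAC throughput divided by the fastest CPU result for the same input size. That reading is an inference from the numbers rather than something the slides state; the Python sketch below checks it against two rows of Table 1.

```python
def speedup_to_fastest(ac_cpu, ac_omp, ac_pthread, pfac):
    """PFAC throughput relative to the fastest of the three CPU versions."""
    return pfac / max(ac_cpu, ac_omp, ac_pthread)


# Rows copied from Table 1 (throughput in Gbps).
s_2mb = speedup_to_fastest(0.73, 2.84, 3.98, 69.73)
s_192mb = speedup_to_fastest(0.86, 3.43, 4.69, 117.04)
print(round(s_2mb, 2), round(s_192mb, 2))
```

The results agree with the reported 17.53 and 24.94 to within about 0.02, which is consistent with the table having been computed from unrounded throughputs.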

Comparisons of Normal Case

[Figure: chart of throughput (vertical axis from 0.00 to 140.00) versus input size (2 MB to 192 MB) for AC_CPU 1 thread, AC_OMP 8 threads, AC_Pthread 8 threads, and PFAC multi-threads.]

Experimental Results

Table 2: Throughput of worst case inputs

Input size   AC_CPU      AC_OMP      AC_Pthread   PFAC            Speedup
             1 thread    8 threads   8 threads    multi-threads   to fastest
             (Gbps)      (Gbps)      (Gbps)       (Gbps)
2 MB         0.45        2.40        2.44         12.09           4.96
4 MB         0.45        2.50        2.57         12.61           4.91
8 MB         0.45        2.58        2.62         12.79           4.88
16 MB        0.45        2.61        2.63         12.98           4.93
32 MB        0.46        2.55        2.61         12.95           4.96
64 MB        0.45        2.62        2.63         13.10           4.97
128 MB       0.46        2.57        2.60         13.12           5.04
192 MB       0.46        2.63        2.64         13.10           4.97

Comparisons of Worst Case

[Figure: chart of throughput (vertical axis from 0.00 to 14.00) versus input size (2 MB to 192 MB) for AC_CPU 1 thread, AC_OMP 8 threads, AC_Pthread 8 threads, and PFAC multi-threads.]

Comparisons

Approach                          Characters    Memory (KB)   Throughput   Memory       Notes
                                  in rule set                 (Gbps)       efficiency
PFAC                              41997         27754         117.04       177.10       NVIDIA GTX 480
Huang et al. [10], Modified WM    1565          230           2.40         16.33        NVIDIA 7600 GT
Schatz et al. [11], Suffix Tree   200000        14125         2.00         28.32        NVIDIA GTX 8800
Vasiliadis et al. [12], DFA       N.A.          200000        6.40         N.A.         NVIDIA 9800 GX2
Smith et al. [13], XFA            N.A.          3000          10.40        N.A.         NVIDIA 8800 GTX

Memory efficiency = (Throughput × number of characters) / Memory
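As a check on the memory efficiency formula, the following Python sketch recomputes the three rows of the comparison for which all inputs are available (PFAC, Huang et al. [10], Schatz et al. [11]). The function name is illustrative; throughput is in Gbps and memory in KB, as in the table.

```python
def memory_efficiency(throughput_gbps, num_characters, memory_kb):
    """Memory efficiency = (throughput x number of characters) / memory."""
    return throughput_gbps * num_characters / memory_kb


pfac_eff = memory_efficiency(117.04, 41997, 27754)
huang_eff = memory_efficiency(2.40, 1565, 230)
schatz_eff = memory_efficiency(2.00, 200000, 14125)
print(round(pfac_eff, 2), round(huang_eff, 2), round(schatz_eff, 2))
# 177.1 16.33 28.32
```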

Conclusions
• We have proposed a novel parallel string matching algorithm that is well suited to GPUs and is free from the boundary detection problem.
• The proposed algorithm creates a new state machine that has lower complexity and memory usage than the traditional Aho-Corasick state machine.
• The new algorithm achieves a significant speedup over the traditional Aho-Corasick algorithm accelerated by OpenMP on CPU.
• Compared to other GPU approaches, the new algorithm is 11.6 times faster than the state-of-the-art approach.
