
Evaluating Network Processors using NetBench

Gokhan Memik

University of California, Los Angeles

and

William H. Mangione-Smith

University of California, Los Angeles

The Network Processor market is one of the fastest growing segments of the microprocessor industry today. In spite of this increasing market importance, there does not exist a common framework to compare the performance of different Network Processor designs. Our primary goal in this study is to fill this gap by creating the NetBench benchmarking suite. NetBench is designed to represent Network Processor workloads. NetBench contains 11 programs that form 18 different applications. The programs are selected from all levels of packet processing: small, low-level code fragments as well as large application-level programs are included in the suite. These applications are representative of the Network Processor applications in the market. Using the SimpleScalar simulator to model an ARM processor, we study these programs in detail and compare key characteristics such as instructions per cycle, instruction distribution, cache behavior, and branch prediction accuracy with the programs from MediaBench. Using statistical analysis, we show that the simulation results for the programs in NetBench have significantly different characteristics than programs in MediaBench. Finally, we present performance measurements from the Intel IXP1200 Network Processor to show how NetBench can be utilized.

Categories and Subject Descriptors: C.4 [Computer Systems Organization]: Performance of Systems - Measurement techniques; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems - Real-time and embedded systems; I.6 [Computing Methodologies]: Simulation and Modeling - Simulation Support Systems;

General Terms: Benchmarking; Applications; Embedded Systems; Network Processors.

1. INTRODUCTION

Emerging applications in the networking field demand increasingly higher network bandwidths. In addition, new applications and protocols require the network to do more than just deliver packets. Instead, these applications have requirements such as quality of service guarantees, secure transmission of data, and intelligent/dynamic routing and switching. These applications require large amounts of processing, which must be supplied by the processor. This set of features, coupled with higher network link speeds, puts a heavy demand on the network processing elements. Traditionally, embedded processors in networks are either custom-designed ASIC chips or variations of general-purpose processors. Both schemes have their advantages and disadvantages. ASIC chips have better performance, but they have higher manufacturing costs and lack the flexibility of programmable processors. If there is a change in the protocol or application, it is hard to reflect the change in the ASIC design. General-purpose processors, on the other hand, are not optimized for networking applications and hence do not provide satisfactory performance for most of the applications.

This paper is under review at the ACM Transactions on Embedded Computing Systems (TECS).

Authors' addresses: Gokhan Memik, Department of Electrical Engineering, University of California, Los Angeles, Los Angeles, CA 90095 {memik@ee.ucla.edu}; William H. Mangione-Smith, Department of Electrical Engineering, University of California, Los Angeles, Los Angeles, CA 90095 {billms@ee.ucla.edu}.

Network Processors (NPUs) eliminate the drawbacks of general-purpose processors and ASIC designs by combining the flexibility of general-purpose programmable processors with the performance of ASIC chips.

Soon after their introduction [17], the NPU market became one of the fastest growing segments of the microprocessor industry. In the last two years, more than 40 new vendors have announced their NPU architectures (e.g. [10, 32]). Although these processors aim at the same application domains, they vary widely in their architectural designs. Hence, there is a tremendous need to evaluate the performance of these different designs.


A designer of a product should know the type of applications, based on marketing requirements, for which the processor is optimized. Similarly, customers benefit from benchmarks by selecting the product that gives the best performance for the applications they consider important (when benchmarks are aligned with commercial workloads). In spite of the rapid increase in use of NPUs, there still does not exist a common framework or methodology for evaluating them. Our goal in this study is to fill this gap. Specifically, our contributions in this paper are

§ creating a benchmarking suite by defining a set of applications that are common for NPUs;

§ investigating several characteristics of these networking applications to understand their nature;

§ comparing the characteristics of these applications with the applications from MediaBench [14];

§ reporting results for several different cache and branch prediction configurations using an accurate StrongARM simulator [3] to guide designers in the selection of architectural parameters; and

§ demonstrating how NetBench can be utilized by providing a performance measurement of the Intel IXP1200 Network Processor [9], a representative NPU product currently available in the market.

This paper is organized as follows. In the next section, we summarize architectural characteristics of the NPUs. Section 3 discusses the related work. In Section 4, we present the applications in NetBench. Applications in NetBench are compared with the MediaBench applications in Section 5. In Section 5.C, we present experimental results for Intel IXP1200 simulations. Section 6 concludes the paper with a summary.

2. NETWORK PROCESSOR CHARACTERISTICS

This section discusses important characteristics of NPUs such as on-chip caches used in the processors, and techniques for hiding memory latency.

Figure 1. A generic NPU design

NPUs vary significantly in their design methodologies. Designs span from single-core superscalar processors (Broadcom SiByte) to system-on-a-chip designs containing more than 40 execution cores (EZChip). Their major design methodologies can be grouped into three categories: VLIW-based processors [10], SMT-based processors [32] and single-chip multiprocessor systems [5, 9]. Most of the processors contain multiple execution cores to take advantage of the data parallelism that exists in many networking applications. Most NPUs also extend RISC-like ISAs with instructions that efficiently perform operations required by networking applications. In addition, they employ special-purpose elements to improve the execution efficiency. These elements are either packet-oriented memory controllers or accelerators that perform certain operations (e.g. table lookup) that occur frequently in the applications and are not efficient to implement in the execution cores.

Figure 1 presents a generic NPU design. The processor contains a set of execution cores, accelerators and on-chip secondary memory (Level 2 cache). The processors communicate through a global shared bus. Each execution core is equipped with local level 1 instruction and data caches. The execution cores can be very simple, such as an ALU enhanced with local registers (e.g. EZChip), or they can be complex, such as a modified MIPS core (e.g. PMC-Sierra RM9000).

The most common property among different NPUs is multithreading. Almost all the NPUs available in the market today employ a variation of a multithreading technique (e.g. Clearwater CNP810SP, Intel IXP, IBM Rainier, MMC nP7510, Motorola C-5). Clearwater CNP810SP can execute instructions from 8 different threads. Intel IXP family processors, on the other hand, execute instructions from a single thread, but have hardware support for single-cycle thread switching between 4 active threads.

The size of the level 1 instruction and data caches employed in the execution cores also varies among different designs, but most of the designs employ caches of 4 to 16 KB. For example, Intel IXP 2800 has 4KB, IBM Rainier has 8KB and Lexra NetVortex has 16KB level 1 instruction caches.

3. RELATED WORK

NPUs are a class of programmable ICs based on SOC (system-on-a-chip) technology that implement communication-specific functions more efficiently than general-purpose processors. Crowley et al. [6] evaluate different design mechanisms for NPUs. They measure the performance of a VLIW-based, an SMT-based, a fine-grain multithreaded multiprocessor, and a single-chip multiprocessor. For their study, they use a subset of applications that are in NetBench. These applications are, however, not available to the public.

Benchmarks play a major role in any product design process. SPEC [25] benchmarks have been well accepted and used by several processor manufacturers and researchers to measure the effectiveness of their design. Other fields have useful benchmarking suites designed for the specific application domain: TPC [29] for database systems, SPLASH [31] for parallel machine architectures.

The need for a benchmarking suite in the NPU area has been pointed out by several researchers. Nemirovsky [18] discusses the requirements and challenges of a benchmarking suite for NPUs. He defines a set of metrics to be used with any benchmarking suite and draws the guidelines for defining a benchmark. Currently, three benchmarking suites contain applications that might be used in NPUs. EEMBC has a benchmarking suite designed for embedded processors [7], which contains three networking applications. However, these applications are control-plane tasks (such as a shortest path algorithm) and do not form a basis for measuring the effectiveness of NPUs. MediaBench [14] also contains security and communication applications that might be used by some of the NPUs. These applications perform translations between different data formats and therefore are not representative of most NPU applications. CommBench [30] is designed for Telecommunications NPUs. It contains four header processing applications, which effectively represent tasks related to traditional IPv4 routing, and four payload applications. The payload applications are jpeg, cast (CAST-128 block cipher algorithm), reed (Reed-Solomon Forward Error Correction algorithm), and zip (Lempel-Ziv (LZ77) compression algorithm). Although these applications are representative of Telecommunications NPUs, the selected applications are limited to this type of NPU and do not represent applications employed by the majority of NPUs.

4. NETBENCH PROGRAMS

In this section, we present the applications in NetBench. Any benchmarking suite should be representative of the applications in the domain the benchmark is designed for. This was the most important criterion in our selection of the applications.


NPU applications contain a large variety of tasks from traditional routing and switching tasks to much more complicated applications containing intelligent routing and switching decisions. Therefore, any benchmarking suite attempting to represent the applications on NPUs should consider all levels of a networking application. Instead of using the traditional 7-level OSI model for categorizing the applications, we have used a three level categorization. These levels are:

§ Low or Micro level routines containing operations nearest to the link or operations that are part of more complex tasks;

§ Routing level applications, which are similar to traditional IP level routing and related tasks; and

§ Application level programs, which have to parse the packet header and sometimes a portion of the payload and make intelligent decisions about the destination of the packet.

This categorization is performed by considering the complexity of the application, instead of the specific task it is performing. Hence, it is a better categorization for the designers of the NPUs than the 7-layer categorization. For example, as we will show in Section 5, applications parsing the packet data have different characteristics than the applications that only parse header information, regardless of the task they are performing. Note that these three categories cover all the levels of a traditional 7-level reference model, and hence present an inclusive characterization of all networking applications. In the following, we list the applications in NetBench according to the category to which they belong:

A. Micro-Level Programs

CRC: The CRC-32 checksum calculates a checksum based on a cyclic redundancy check as described in ISO 3309 [13]. CRC-32 is used in Ethernet and ATM Adaptation Layer 5 (AAL-5) checksum calculation. The code is available in the public domain [4].
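The checksum computation described for CRC can be sketched as a table-driven routine. This is a minimal illustration, assuming the reflected form (0xEDB88320) of the CRC-32 polynomial commonly used for Ethernet frames; the function names are hypothetical and the public-domain code [4] used in NetBench may be organized differently.

```python
def make_crc32_table(poly=0xEDB88320):
    """Build the 256-entry lookup table for the reflected CRC-32 polynomial."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial when the low bit was set.
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC32_TABLE = make_crc32_table()


def crc32(data: bytes) -> int:
    """Return the CRC-32 of data, processing one byte per table lookup."""
    crc = 0xFFFFFFFF                      # standard initial value
    for b in data:
        crc = (crc >> 8) ^ CRC32_TABLE[(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF               # standard final inversion


if __name__ == "__main__":
    print(hex(crc32(b"123456789")))
```

The inner loop performs one table lookup, one shift, and two XORs per payload byte, which is why such routines are dominated by simple ALU operations and sequential data accesses.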

TL: TL is the table lookup routine common to all routing processes. We have used a radix-tree routing table, which was used in several UNIX systems. The code segment is from the FreeBSD operating system [8].
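The kind of lookup TL performs can be illustrated with a longest-prefix match over a binary trie. This is a simplified sketch, not the FreeBSD radix-tree code: it uses one trie level per address bit, handles IPv4 only, and the class and method names are assumptions made for illustration.

```python
import ipaddress


class TrieNode:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]   # child for bit 0 and bit 1
        self.next_hop = None           # set only where a prefix ends


class RoutingTable:
    """Unibit trie keyed on IPv4 prefixes (illustrative, not a compressed radix tree)."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        addr = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (addr >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = TrieNode()
            node = node.children[bit]
        node.next_hop = next_hop

    def lookup(self, address: str):
        """Walk the trie bit by bit and remember the last prefix seen."""
        addr = int(ipaddress.ip_address(address))
        node = self.root
        best = node.next_hop
        for i in range(32):
            node = node.children[(addr >> (31 - i)) & 1]
            if node is None:
                break
            if node.next_hop is not None:
                best = node.next_hop
        return best


if __name__ == "__main__":
    table = RoutingTable()
    table.insert("0.0.0.0/0", "default")
    table.insert("10.0.0.0/8", "A")
    table.insert("10.1.0.0/16", "B")
    print(table.lookup("10.1.2.3"))
```

Each lookup follows a chain of pointers whose length depends on the matched prefix, so the routine is sensitive to data-cache behavior as the table grows.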

B. IP-Level Programs

These programs make a decision depending on the source or destination IP of the packet.

ROUTE: Route implements IPv4 routing according to RFC 1812 [1]. When a router receives a packet, it has to decide the next network hop. Route implements the table lookup along with internet checksum (for the header). It makes the necessary changes in the header (for example, the Time-To-Live value), fragments the packet if necessary and forwards it. The code is from the FreeBSD operating system [8].

DRR: Deficit-round robin (DRR) scheduling [24] is a scheduling method implemented in several switches today. In DRR, all the connections through the router have separate queues. Using these queues, the router tries to accomplish a fair scheduling by allowing the same amount of data to be passed from each queue. The implementation is based on the algorithm by Shreedhar and Varghese [24].

IPCHAINS: IPCHAINS is a firewall application that checks the IP source of each incoming packet and decides either to pass the packet through the firewall (accept), to deny the packet (deny), to modify it (masq), or to reject the packet and send information to the sender (reject). The decision is based on rules given by the user. The implementation is from Rustcorp Inc. [23].
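The deficit round-robin scheme described under DRR above can be sketched as follows. This is a minimal illustration of the general idea in [24], assuming packets are represented only by their sizes in bytes and that every queue receives the same quantum; the function name and parameters are hypothetical and do not come from the NetBench source.

```python
from collections import deque


def drr_schedule(queues, quantum, rounds):
    """Serve per-connection queues with deficit round robin.

    queues  -- list of deques holding packet sizes in bytes
    quantum -- bytes of credit added to each backlogged queue per round
    rounds  -- number of passes over all queues
    Returns the service order as (queue_index, packet_size) tuples.
    """
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficits[i] = 0          # idle queues do not accumulate credit
                continue
            deficits[i] += quantum
            # Send packets while the head packet fits in the remaining credit.
            while q and q[0] <= deficits[i]:
                size = q.popleft()
                deficits[i] -= size
                sent.append((i, size))
            if not q:
                deficits[i] = 0
    return sent


if __name__ == "__main__":
    flows = [deque([300, 300]), deque([100, 100, 100, 100])]
    print(drr_schedule(flows, quantum=200, rounds=2))
```

In this example the first queue cannot send its 300-byte packet in the first round, carries its 200 bytes of credit forward, and sends in the second round, which is how the scheme keeps the byte count per queue roughly equal despite different packet sizes.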

NAT: Network Address Translation (NAT) is a common method for IP address management. NAT operates on a router, usually connecting two networks, and translates the private (not globally unique) addresses in the internal network into legal addresses before packets are forwarded onto the public network. Hence, for any departing packet, the source IP on the packet should be changed. Similarly, the destination address on any incoming packet should also be modified. The program accomplishing this task uses several routines from the FreeBSD operating system [8].
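The address rewriting that NAT performs can be sketched with a pair of lookup tables. This is an illustrative model only: packets are plain dictionaries, the public pool and the class name are assumptions, and header checksum updates (which a real router must also perform) are omitted.

```python
class BasicNat:
    """Map private source addresses to public ones and reverse the mapping on return traffic."""

    def __init__(self, public_pool):
        self.free = list(public_pool)    # unused public addresses
        self.private_to_public = {}
        self.public_to_private = {}

    def translate_outbound(self, packet):
        """Rewrite the source address of a departing packet."""
        src = packet["src"]
        if src not in self.private_to_public:
            if not self.free:
                raise RuntimeError("public address pool exhausted")
            public = self.free.pop(0)
            self.private_to_public[src] = public
            self.public_to_private[public] = src
        return {**packet, "src": self.private_to_public[src]}

    def translate_inbound(self, packet):
        """Rewrite the destination address of an incoming packet, or return None if unbound."""
        private = self.public_to_private.get(packet["dst"])
        if private is None:
            return None
        return {**packet, "dst": private}


if __name__ == "__main__":
    nat = BasicNat(["203.0.113.1", "203.0.113.2"])
    out = nat.translate_outbound({"src": "10.0.0.5", "dst": "198.51.100.7"})
    print(out)
    print(nat.translate_inbound({"src": "198.51.100.7", "dst": out["src"]}))
```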

C. Application-Level Programs

These programs are the most time consuming applications in NetBench due to their processing requirements.

DH: Diffie-Hellman (DH) is a common public key encryption/decryption mechanism. It is the security protocol employed in several Virtual Private Networks (VPNs). The implementation is from RSA Data Security, Inc. [22].
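The arithmetic at the core of a Diffie-Hellman exchange is modular exponentiation, which can be illustrated with Python's built-in three-argument pow. This is a toy sketch with deliberately tiny, insecure parameters (p = 23, g = 5) chosen for illustration; the function names are hypothetical and the code is not the RSA Data Security implementation used in NetBench.

```python
def dh_public(g: int, p: int, private: int) -> int:
    """Public value sent to the peer: g^private mod p."""
    return pow(g, private, p)


def dh_shared(peer_public: int, p: int, private: int) -> int:
    """Shared secret derived from the peer's public value."""
    return pow(peer_public, private, p)


if __name__ == "__main__":
    p, g = 23, 5                 # toy group parameters, far too small for real use
    a_private, b_private = 6, 15
    a_public = dh_public(g, p, a_private)
    b_public = dh_public(g, p, b_private)
    # Both sides arrive at the same value without sending their private keys.
    print(dh_shared(b_public, p, a_private), dh_shared(a_public, p, b_private))
```

With realistic key sizes the exponentiations operate on integers hundreds of digits long, which is consistent with the large instruction counts reported for this kind of program.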

MD5: Message Digest algorithm (MD5) creates a signature for each outgoing packet, which is checked at the destination [20]. The signature is cryptographically secure, hence if the received packet does not match the signature, then the receiver will assume that the packet is unreliable and discard it. The implementation is from RSA Data Security, Inc. [22].
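The sign-then-check flow described for MD5 can be sketched with Python's standard hashlib module. This is an illustration of the digest comparison only, assuming the digest travels alongside the payload; the function names are hypothetical, and the NetBench program uses the RSA Data Security code [22] rather than this library.

```python
import hashlib
import hmac


def packet_digest(payload: bytes) -> bytes:
    """Compute the 16-byte MD5 digest attached to an outgoing packet."""
    return hashlib.md5(payload).digest()


def accept_packet(payload: bytes, digest: bytes) -> bool:
    """Recompute the digest at the receiver and compare it with the one received."""
    return hmac.compare_digest(packet_digest(payload), digest)


if __name__ == "__main__":
    payload = b"example payload"
    tag = packet_digest(payload)
    print(accept_packet(payload, tag))            # unmodified packet
    print(accept_packet(payload + b"!", tag))     # altered packet is rejected
```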

SNORT: Snort is an open source network intrusion detection system, capable of performing real-time traffic analysis and packet logging on IP networks. It can perform protocol analysis and content searching/matching in order to detect a variety of attacks and probes, such as buffer overflows, stealth port scans, and CGI attacks [21]. It uses a user-defined rule set that defines actions for each packet. We have used the default configuration file (f) that contains 886 rules for the snort-nids application. The logging mode (snort-l) stores information about the packets.

SSL: SSL (secure sockets layer) is the secure transmission package used in several UNIX systems. SSL is used by applications such as ssh [2] and sftp [27], which perform secure communication over insecure public networks. The ssl implementation we used is from the OpenSSL Project [28]. The application interface is modified to perform three different computations with varying strength: weak performs rc4-40 encryption without any digest, medium performs DSA authentication followed by blowfish encryption and md5 digest for each packet, and strong performs RSA authentication followed by 168-bit key 3-DES and SHA digest for each packet.

URL: URL implements URL-based destination switching, which is a commonly used content-based load balancing mechanism. In URL-based switching, all the incoming packets to a switch are parsed and forwarded according to URL. For example, all image requests might be sent to an image server. This application increases the utility of specialized servers in a server farm. The implementation is based on the description from PMC-Sierra [19].
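The forwarding decision in URL-based switching can be sketched as a match on the requested path. This is a minimal illustration, assuming the switch sees a complete HTTP request line and that rules are expressed as file-name suffixes; the rule format, server names, and function name are hypothetical and are not taken from the PMC-Sierra description [19].

```python
def pick_server(request_line: str, rules, default: str) -> str:
    """Choose a back-end server from the URL in an HTTP request line.

    rules is a list of (suffix_tuple, server_name) pairs checked in order.
    """
    parts = request_line.split()
    if len(parts) < 2:
        return default                      # malformed request: use the default server
    path = parts[1].split("?", 1)[0].lower()  # drop the query string, ignore case
    for suffixes, server in rules:
        if path.endswith(suffixes):
            return server
    return default


if __name__ == "__main__":
    rules = [((".jpg", ".png", ".gif"), "image-server"),
             ((".cgi",), "app-server")]
    print(pick_server("GET /img/logo.PNG HTTP/1.1", rules, "web-server"))
```

Unlike the header-only programs, this decision requires scanning payload bytes, which is the property that places URL in the application-level category.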

D. Discussion

NetBench contains 11 applications implemented in C or C++. Many of the available NPUs in the market today have corresponding compilers for high-level languages such as C. For such processors, the implementations can be automatically mapped onto the processor. We recommend the usage of the applications in such a framework to establish a fair comparison of the systems. Nevertheless, many NPUs do not provide such compilers and the applications have to be manually coded. Even in systems with compiler support, either the output of the compiler should be optimized or library routines should be used to activate special hardware structures (e.g. table lookup engines). The NetBench applications do not employ such special calls. However, the implementations can be easily modified to perform the necessary operations. First, the user should locate the segments of the application that perform the specific task. Once such locations are identified, the code can be modified to perform the necessary operation in the special structure. For example, consider the route application: the table lookup is performed in the rn_search procedure. If the NPU employs a table lookup engine, this procedure can be modified to activate the engine instead of performing the radix-tree lookup.
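The substitution described above, replacing a software lookup with a call into a special-purpose engine, can be sketched as a single dispatch point. Everything here is hypothetical: software_lookup is a stand-in for the radix-tree walk (a linear scan over prefixes), and FakeLookupEngine merely imitates an accelerator interface; no real NPU API is implied.

```python
import ipaddress


def software_lookup(routes, address):
    """Software path: longest-prefix match by linear scan (stand-in for rn_search)."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, hop in routes.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, hop)
    return best[1] if best else None


class FakeLookupEngine:
    """Hypothetical accelerator model: same result, but counts offloaded requests."""

    def __init__(self, routes):
        self.routes = routes
        self.calls = 0

    def lookup(self, address):
        self.calls += 1
        return software_lookup(self.routes, address)


def make_route_lookup(routes, engine=None):
    """Return the lookup routine the rest of the application should call."""
    if engine is not None:
        return engine.lookup             # activate the engine instead of the software walk
    return lambda address: software_lookup(routes, address)


if __name__ == "__main__":
    routes = {"10.0.0.0/8": "A", "10.1.0.0/16": "B"}
    lookup = make_route_lookup(routes, FakeLookupEngine(routes))
    print(lookup("10.1.2.3"))
```

Keeping the choice behind one function means the remainder of the application is unchanged whether or not the target provides such an engine.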

5. PROGRAM CHARACTERISTICS

In this section, we compare several characteristics of NetBench applications with MediaBench [14] applications. MediaBench is designed for multimedia and communication systems, which are in many ways similar to NPUs. We have selected MediaBench for comparison against NetBench because the targeted processor architectures are similar. Although these applications are intuitively similar, we show that the applications for these architectures are significantly different, thus validating the need for a separate benchmarking suite for NPUs.

In Section A, we explain the simulation environment and the applications from MediaBench. Section B summarizes the experimental results. Section C presents the results for the Intel IXP1200 simulations.


Table 1. NetBench applications and their properties: arguments are the execution arguments, # inst is the number of instructions executed, # cycle is the number of cycles required, # IL1 (DL1) acc is the number of accesses to the level 1 instruction (data) cache, # L2 acc is the number of level 2 cache accesses, bpred rate is the percentage of correct branch predictions in the base processor (not-taken strategy).

Appl.     Arguments                                        # inst  # cycle  # IL1 acc  # DL1 acc  # L2 acc  bpred rate
                                                              [M]      [M]        [M]        [M]       [M]         [%]
crc       crc 10000                                         145.8    262.0      219.0       59.8       0.6         0.1
dh        dh 5 64                                           778.3   1663.1     1009.1      364.7      38.4        21.6
drr       drr 128 10000                                      12.9     33.5       22.8        7.9       1.1         8.1
drr-l     drr 1024 10000                                     34.7     80.2       60.1       23.3       5.0         5.6
ipchains  ipchains 10 10000                                  61.7    160.2      103.9       26.2       3.6        26.1
md5       md5 10000                                         209.1    474.7      296.8       73.2      11.0        19.6
nat       nat 128 10000                                      11.4     26.7       17.3        5.6       1.2        17.9
nat-l     nat 1024 10000                                     33.2     74.2       55.0       21.1       5.1         8.5
rou       route 128 10000                                    14.2     32.0       23.3        7.1       0.9        11.1
rou-l     route 1024 10000                                   36.8     81.7       62.6       22.8       5.0         7.6
snort-l   snort -r defcon -n 10000 -dev -l ./log -b         343.0    925.6      515.0      132.2      33.4        36.7
snort-n   snort -r defcon -n 10000 -v -l ./log -c snf       545.9   1654.1      893.7      219.7      56.2        28.0
ssl-m     openssl NetBench medium 10000                    2718.2   5367.0     3260.1      989.7     142.8        35.7
ssl-s     openssl NetBench strong 10000                    3616.1   8727.5     4453.4     1383.3     426.8        36.1
ssl-w     openssl NetBench weak 10000                       329.0    832.1      441.1      152.0      31.8        50.8
tl        tl 128 10000                                        6.9     15.7       11.8        3.9       0.7         5.9
tl-l      tl 1024 10000                                      30.3     67.1       52.2       19.9       4.7         5.1
url       url small_inputs 10000                            497.0    956.7      768.9      249.1      10.0        32.8
average                                                     523.6   1190.8      681.5      209.0      43.2        19.9

Table 2. MediaBench applications and their properties.

Appl.     Arguments                                        # inst  # cycle  # IL1 acc  # DL1 acc  # L2 acc  bpred rate
                                                              [M]      [M]        [M]        [M]       [M]         [%]
adp-c     rawcaudio < clinton.pcm > out.adpcm                 8.1     10.2        9.3        1.1         0        49.6
adp-d     rawdaudio < clinton.adpcm > out.pcm                 6.5      8.5        7.7        1.1         0        49.9
epic-e    epic test_image.pgm -b 25                          60.7    108.5       89.2       18.9       0.3        45.3
epic-u    unepic test_image.pgm.E                            10.1       19       14.8        1.9       0.2        28.0
g72-e     decode -4 -l -f clinton.g721                      369.2    753.5      510.4      118.3      13.4        47.7
g72-d     encode -4 -l -f clinton.pcm                       386.9    785.9      536.8      125.0      13.7        46.7
gsm-t     toast -fpl clinton.pcm                            294.4    436.5      333.8       92.4       1.7        11.7
gsm-u     untoast -fpl clinton.pcm.gsm                      102.7    162.3        141       19.8       0.4        32.6
jpg-c     cjpeg -dct int -progressive -opt testimg.ppm         16     32.3       23.3        5.8       0.5        14.5
jpg-d     djpeg -dct int -ppm -opt testimg.jpg                4.2      7.6        5.3        1.8       0.1        25.4
mpg-e     mpeg2encode options.par out.m2v                   158.3    342.1      252.1       57.6       5.6        34.5
mpg-d     mpeg2decode -bmei16v2.m2v -r -f -o0 rec%d        1032.3   1830.8     1293.9      354.2      23.8        20.5
pegwit-e  pegwit -e my.pub pgtest.                           19.1     35.4       23.9        6.6       0.8        47.4
pegwit-d  pegwit - pegwit.dec                                38.9     72.9       48.3       13.3       1.7        46.1
rasta     rasta -A -J -S 8000 -n12 -f map_weights.dat        16.2     42.2       23.8        7.4       1.2        31.3
average                                                     168.2    309.8      220.9       55.0       4.2        35.4
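Instructions per cycle, one of the characteristics compared in this study, can be derived from the # inst and # cycle columns of Tables 1 and 2. The sketch below does this for the reported suite averages and for one sample row. Note that it takes the ratio of the averaged columns, which is not necessarily the same as an average of per-application IPC values, and the function and variable names are illustrative.

```python
def ipc(inst_millions: float, cycle_millions: float) -> float:
    """Instructions per cycle from instruction and cycle counts in the same unit."""
    return inst_millions / cycle_millions


# "average" rows of Table 1 (NetBench) and Table 2 (MediaBench), in millions.
NETBENCH_AVG = {"inst": 523.6, "cycle": 1190.8}
MEDIABENCH_AVG = {"inst": 168.2, "cycle": 309.8}

# One sample row from Table 1 (crc).
CRC_ROW = {"inst": 145.8, "cycle": 262.0}

if __name__ == "__main__":
    for name, row in [("NetBench avg", NETBENCH_AVG),
                      ("MediaBench avg", MEDIABENCH_AVG),
                      ("crc", CRC_ROW)]:
        print(f"{name}: {ipc(row['inst'], row['cycle']):.3f}")
```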
