Radix Sort For Vector Multiprocessors
Marco Zagha and Guy E. Blelloch
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
Abstract
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second.
Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms onto vector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used to optimize the radix size.
1 Introduction
Sorting is one of the most heavily studied problems in computer science. A sorting algorithm, bitonic sort [1], was one of the first algorithms designed for parallel machines, and since then hundreds of articles on parallel sorting have appeared in the literature. Although this work has developed a robust theory of parallel sorting, there has only been limited success in obtaining efficient implementations on real parallel machines. Many of the reported results are barely faster than serial sorting algorithms on fast workstations. This is due to the high communication costs of most parallel machines.

To develop the multiprocessor algorithm from the data-parallel algorithm, we use the following techniques:
Virtual Processors: Each element of a vector register is viewed as a virtual processor. For a vector-register length V on each processor, and for P processors, we therefore use an algorithm designed for a V × P processor machine (64 × 8 = 512 on the Y-MP). Viewing each vector element as its own processor forces independence among vector elements, allowing the algorithm to be fully vectorized.
Loop Raking: A new technique we call loop raking is used for vectorizing the loops. Instead of loading contiguous blocks of a vector on each vector load, as done in strip mining [16], loop raking uses a constant stride to load a set of elements evenly distributed across the input vector, as if a rake were placed over the vector. Loop raking is necessary to keep each pass of the radix sort stable (see Section 3).
Processor Memory Layout: A careful layout of the memory of each virtual processor is used to avoid bank conflicts. This layout is used for placing the buckets needed by the radix sort and guarantees that for every vector gather from or scatter to the buckets, each element of a vector register accesses a different memory bank. As compared to a naive layout of the buckets in contiguous memory locations, our layout greatly reduces the running time for some common inputs.
In addition to describing the algorithm, this paper models the running time of the radix sort in terms of the number of keys to be sorted, the size of each key, and the number of processors. The model yields the following equation for the number of 6 nsec clock cycles:
T = (b/r) (c1 n/p + c2 2^r) (1 + c3 (p − 1)/(P − 1))    (1)

where n is the number of keys, p is the number of physical processors used, P is the maximum number of physical processors (8 on the Y-MP), r is the number of bits in the radix, and b is the number of bits in the key. The form (b/r)(c1 n/p + c2 2^r) is based on the theoretical complexity, and the constants c1 and c2 are derived from empirical measurements on the Y-MP. The (1 + c3 (p − 1)/(P − 1)) term is due to memory contention and is also based on empirical measurements. Equation 1 is used to select an optimal radix size (typically between 6 and 14), and can also be used to predict running times over a range of data sizes and numbers of processors without having to run the algorithm. With n ≥ 5000, Equation 1 predicts the actual running time within 10% accuracy. The inaccuracy is due to a slight dependence of the running time on the distribution of the keys.
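To make the use of Equation 1 concrete, the following C sketch evaluates the model and searches for the radix size that minimizes the predicted time. The constants C1, C2, and C3 are hypothetical placeholders, not the paper's measured values.

```c
/* Sketch of using Equation 1 to pick a radix size. The constants C1, C2,
 * and C3 are hypothetical placeholders; the paper derives the real values
 * from empirical measurements on the Y-MP. */
#define C1 10.0   /* hypothetical: cycles per key per pass        */
#define C2 2.0    /* hypothetical: cycles per bucket per pass     */
#define C3 0.5    /* hypothetical: memory-contention coefficient  */

/* Predicted clock cycles to sort n keys of b bits with an r-bit radix
 * on p of P processors, following the form of Equation 1. */
double predicted_cycles(double n, int b, int r, int p, int P)
{
    double passes     = (double)b / (double)r;               /* b/r passes */
    double per_pass   = C1 * n / p + C2 * (double)(1L << r);
    double contention = 1.0 + C3 * (double)(p - 1) / (double)(P - 1);
    return passes * per_pass * contention;
}

/* Search the range the paper reports as typical (6-14 bits) for the
 * radix size minimizing the predicted time. */
int best_radix(double n, int b, int p, int P)
{
    int best = 6;
    for (int r = 7; r <= 14; r++)
        if (predicted_cycles(n, b, r, p, P) <
            predicted_cycles(n, b, best, p, P))
            best = r;
    return best;
}
```

As the key count grows, the n/p term dominates and a larger radix (fewer passes) wins; for small inputs the 2^r bucket cost pushes the optimum down.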
The remainder of this paper is organized as follows. Section 2 discusses the serial and parallel versions of radix sort. Section 3 discusses the use of virtual processors, loop raking, and memory layout. Section 4 describes our implementation on the CRAY Y-MP. Section 5 evaluates the performance of our implementation and discusses how to select the radix size.
2 Radix Sorting
Radix-based sorting algorithms treat keys as multidigit numbers in which each digit is an integer with a value in [0, m − 1], where m is the radix. A 32-bit integer, for example, could be treated as a 4-digit number with radix m = 2^(32/4) = 2^8 = 256. The radix is usually chosen to minimize the running time and is highly dependent on the implementation and the number of keys being sorted. The standard radix sort on which our parallel algorithm is based is perhaps the oldest form of sorting. The first implementation dates back to the 19th century and Hollerith's machines for sorting punched cards.
Radix sorting is an attractive alternative to comparison-based sorts, such as quicksort [10], since for n keys it runs in O(n) instead of O(n lg n) time. In practice, however, a disadvantage is that it accesses memory in an arbitrary order. Therefore, on a machine with a cache, many memory references will cause cache misses. Studies have shown that on serial machines, when the problem fits into physical memory but not into the cache, radix sort performs no faster, or only marginally faster, than optimized implementations of quicksort (for 32-bit and 64-bit integers) [14, Chapter 8].
Since the Y-MP has no cache (all physical memory is accessible with equal cost), and assuming radix sort can be vectorized, radix sort is likely to have better relative performance over comparison-based sorts than on cache-based machines. This seems to be true. For large data sets, our single-processor sort runs between 3 and 5 times faster than a fully vectorized version of quicksort implemented by Levin [13].
Serial Radix Sort
We first review the serial radix sort since our parallel algorithm will include the same phases (see [6, Section 9.3] for more details). Radix sort works by breaking keys into digits and sorting one digit at a time, starting with the least significant digit. Figure 2 shows an example. A counting sort (sometimes called distribution sort) is used to sort each digit. Since RADIX-SORT starts from the least significant digit, the sort works only if the ordering generated in previous passes is preserved. Each COUNTING-SORT therefore must be stable.

Figure 2: Radix sorting 3-digit numbers 1 digit at a time. A counting sort is used for each digit.
The COUNTING-SORT is implemented as follows. We assume that each digit consists of r bits: m = 2^r. The basic idea is to determine, for each input digit D[i], the number of keys with a digit less than D[i], and also the number of keys with a digit equal to D[i] appearing earlier in the input sequence. To do this, the sort uses 2^r buckets, one for each possible digit value. The following routine sorts an array K of keys, based on an array D of digits, both of size n, r bits at a time, placing the result in R. The algorithm is divided into 3 phases:
COUNTING-SORT(K, D, R, n, r)
  HISTOGRAM-KEYS:
    do v = 0 to 2^r − 1
      Bucket[v] ← 0
    do i = 0 to n − 1
      Bucket[D[i]] ← Bucket[D[i]] + 1
  SCAN-BUCKETS:
    sum ← 0
    do v = 0 to 2^r − 1
      (Bucket[v], sum) ← (sum, sum + Bucket[v])
  RANK-AND-PERMUTE:
    do i = 0 to n − 1
      R[Bucket[D[i]]] ← K[i]
      Bucket[D[i]] ← Bucket[D[i]] + 1
The first loop of HISTOGRAM-KEYS clears the buckets. The second loop, at iteration i, uses the i-th element of the digit array as an offset into the buckets, and increments the count in that bucket. At the end of HISTOGRAM-KEYS, Bucket[v] contains the number of digits having value v. SCAN-BUCKETS performs a scan (also called all-prefix-sums) operation on the buckets, returning to each element the sum of all previous elements. After the scan, Bucket[v] contains the number of digits with a value u such that u < v. This is the position in the output in which the first key with digit v belongs. In the final phase, RANK-AND-PERMUTE, each key with a digit of value v is placed in its final location by getting the offset from Bucket[v] and incrementing the bucket so the next key with digit v gets put in the next location. Since the keys are looped over in increasing order, COUNTING-SORT is stable.
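As a concrete reference, the three phases can be written out in a minimal C version of the serial algorithm. The function names and the 8-bit digit size are illustrative choices of ours, not taken from the paper's code.

```c
#include <stdint.h>
#include <string.h>

#define RBITS 8                      /* r: bits per digit */
#define RADIX (1u << RBITS)          /* 2^r buckets       */

/* One stable COUNTING-SORT pass on the digit at bit position `shift`:
 * HISTOGRAM-KEYS, SCAN-BUCKETS, RANK-AND-PERMUTE. Reads src, writes dst. */
static void counting_sort_pass(const uint32_t *src, uint32_t *dst,
                               int n, int shift)
{
    int bucket[RADIX];
    memset(bucket, 0, sizeof bucket);

    for (int i = 0; i < n; i++)                    /* HISTOGRAM-KEYS */
        bucket[(src[i] >> shift) & (RADIX - 1)]++;

    int sum = 0;                                   /* SCAN-BUCKETS */
    for (unsigned v = 0; v < RADIX; v++) {
        int count = bucket[v];
        bucket[v] = sum;           /* position of first key with digit v */
        sum += count;
    }

    for (int i = 0; i < n; i++) {                  /* RANK-AND-PERMUTE */
        unsigned v = (src[i] >> shift) & (RADIX - 1);
        dst[bucket[v]++] = src[i];                 /* stable: i increases */
    }
}

/* Radix sort, least significant digit first; tmp is scratch of size n. */
void radix_sort(uint32_t *keys, uint32_t *tmp, int n)
{
    for (int shift = 0; shift < 32; shift += 2 * RBITS) {
        counting_sort_pass(keys, tmp, n, shift);
        counting_sort_pass(tmp, keys, n, shift + RBITS);
    }
}
```

Each pair of passes ping-pongs between the key array and the scratch array, so the sorted result ends up back in `keys`.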
Parallel Radix Sort
The serial algorithm cannot be directly parallelized (or vectorized) because of loop dependencies in all three phases. If an attempt is made to parallelize over the iterations of HISTOGRAM-KEYS, several processors could try to increment the same bucket simultaneously. If the bucket is not locked by each processor to gain exclusive access, only one of the increments would take effect. On the other hand, if the bucket is locked, the bucket would act as a serial bottleneck and would greatly degrade performance if many keys had the same digit.
By using a separate set of buckets for each processor, it is possible to parallelize the algorithm without requiring a lock on the buckets. Each processor is responsible for its own subset of the keys and tabulates them in its own set of buckets. A restricted version of this algorithm was described by Cole and Vishkin as part of an optimal 2-ruling set algorithm [5]. The full algorithm was also implemented on the Connection Machine CM-2 [3]. In this parallel version of radix sort, the buckets can be viewed as a matrix:

  Bucket[j][v],   j = 0, 1, ..., P − 1;   v = 0, 1, 2, ..., 2^r − 1

where row j holds the buckets of processor j.
Each processor histograms its own section of the keys into its row, so that summing down a column gives the count for that digit over the whole input data. After SCAN-BUCKETS, we would like Bucket[j][v] to contain the position in the output where the first key from processor j with digit v belongs (see Figure 3). This can be expressed as:

  Bucket[j][v] = Σ_{u=0}^{v−1} Σ_{i=0}^{P−1} Bucket[i][u] + Σ_{i=0}^{j−1} Bucket[i][v]
That is to say, the offset is the total number of digits less than v over all the processors, plus the number of digits equal to v in processors less than j. This sum can be calculated by flattening the matrix into column-major order and executing a SCAN-BUCKETS on the flattened matrix. The SCAN-BUCKETS operation can be parallelized using a tree-summing or similar algorithm [12]. The parallel version generates the same permutation as the serial algorithm so the sort remains stable.
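The per-processor bucket matrix and the column-major scan can be demonstrated with a small serial simulation in C. The 4-processor, 4-bucket sizes mirror Figure 3; simulating the processors with a serial loop is our own simplification.

```c
#include <string.h>

#define NPROC 4    /* simulated processors, as in Figure 3 */
#define NVALS 4    /* bucket count: digit values 0-3       */

/* One stable counting-sort pass with a separate set of buckets per
 * processor. Bucket (processor j, value v) is stored at v*NPROC + j,
 * i.e. the matrix is flattened in column-major order, so one exclusive
 * scan over the flat array produces exactly the offsets described in
 * the text. n must be divisible by NPROC. */
void parallel_counting_sort(const int *keys, int *out, int n)
{
    int bucket[NVALS * NPROC];
    memset(bucket, 0, sizeof bucket);
    int per = n / NPROC;                       /* keys per processor */

    for (int j = 0; j < NPROC; j++)            /* HISTOGRAM-KEYS */
        for (int i = j * per; i < (j + 1) * per; i++)
            bucket[keys[i] * NPROC + j]++;

    int sum = 0;                               /* SCAN-BUCKETS, flattened */
    for (int k = 0; k < NVALS * NPROC; k++) {
        int c = bucket[k];
        bucket[k] = sum;
        sum += c;
    }

    for (int j = 0; j < NPROC; j++)            /* RANK-AND-PERMUTE */
        for (int i = j * per; i < (j + 1) * per; i++)
            out[bucket[keys[i] * NPROC + j]++] = keys[i];
}
```

Within one column of the matrix, lower-numbered processors receive smaller offsets, and each processor walks its section in increasing order, which is why the permutation is stable.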
Figure 3: The scan step of parallel radix sort. The algorithm is illustrated with 4 processors and 4 buckets for the values 0–3. The offsets are computed by scanning the buckets. After the scan, each processor has the starting position in the final output of keys with a particular value. For example, processor 3 will place the "0" keys starting at position 5 in the output, and the "1" keys starting at position 14, as indicated by the offsets.
3 Algorithmic Techniques
The previous section described a data-parallel radix-sort algorithm. This section describes three general techniques that can be used to map this algorithm onto vector multiprocessors: virtual processors, loop raking, and processor memory layout. We have also used these techniques in the implementation of a variety of other parallel algorithms on the CRAY Y-MP and Ardent Titan.
Figure 4: Each physical processor has a set of buckets for each vector register element (virtual processor). The figure depicts 2 processors (P = 2) using a vector length of 4 (V = 4). For a 3-bit digit (r = 3), there are 2^3 × 2 × 4 = 64 buckets. Using P processors, the keys are divided into P equal sections, which are further divided into V sections, one section for each virtual processor.
Virtual Processors: Vector multiprocessors, such as the CRAY, Convex, or Ardent Titan, offer two levels of parallelism: multiprocessor facilities and vector facilities. To take full advantage of these machines it is important to use both levels. One way is to flatten the two levels into one homogeneous level by viewing each element of a vector register as a virtual processor. We call these elements virtual processors so as not to confuse them with a full vector processor. In this view, a machine with a vector-register length V, and with P processors, will have V × P virtual processors. Since the Y-MP has a maximum vector-register length of 64, and there are up to 8 processors, the CRAY can contain up to 512 virtual processors. Section 5 discusses why using a smaller vector length, and hence number of virtual processors, can improve performance in certain cases. On machines with more flexible use of the vector registers, such as the Ardent Titan, it is up to the algorithm designer to decide how many virtual processors to use. This decision should be based on the vector startup times, the number of vector registers needed, and the cost of using additional virtual processors.
Any data-parallel algorithm can be both vectorized and parallelized by assigning each processor in the algorithm to a virtual processor on the machine. In our implementation of the parallel radix sort on the CRAY Y-MP, the keys are divided into V × P equally sized sets, and each virtual processor is responsible for one of these sets. Each virtual processor also uses its own set of buckets (see Figure 4).
Figure 5: Loop raking. Vectors are loaded with a stride of n/V. To process an entire array of length n, we move the rake down one step on each of n/V iterations.
Loop Raking: The typical way to vectorize an operation on an array uses strip mining [16]:

  do i = 0 to n − 1 by V
    process items i to i + V − 1

where V is the vector-register length. In this method an element of a vector register handles every V-th element in the array. The problem with using strip mining for vectorizing counting sort is that the sort will not be stable, as required for radix sort to be correct. This is because each virtual processor (element of the vector register) will be responsible for a strided set of keys rather than a contiguous block of keys as required by the parallel algorithm.
To generate a stable counting sort, we introduce a technique that we call loop raking. Loop raking uses a constant stride S = n/V to access a set of V elements that are evenly distributed across the input vector, as if a rake were placed over the vector and shifted on each vector load. See Figure 5. The loop structure is as follows:

  do i = 0 to S − 1
    process items i, i + S, i + 2S, ..., i + (V − 1)S

To avoid excessive bank conflicts during a vector load, the stride S is selected so that it is odd (see discussion of shape factors in [4]). To extend loop raking to multiple processors, a vector is divided into P parts, and each part is raked by one of the processors. As will be discussed in Section 4, loop raking is used in all three phases of the counting sort.

Processor Memory Layout: On the CRAY Y-MP, there are 256 separate banks of interleaved memory, each with a relatively high latency. If one access to a bank is followed directly by another access to the same bank, the second access will be delayed. A memory location a is contained in bank a mod 256, causing sequential accesses to be very efficient, since locations are striped across the memory banks [17]. But when performing data-dependent access to the buckets in the radix sort, memory accesses may refer to the same banks. Ideally, we would like to arrange the buckets so that each access would be guaranteed to use a different bank, making the sorting time independent of the distribution of the keys.
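The loop-raking access pattern can be sketched in scalar C (our illustration, not the CAL code): the inner loop over j corresponds to one strided vector load, and each virtual processor ends up owning a contiguous block, which is what makes the later permute stable.

```c
#define V 4    /* vector length: number of virtual processors */

/* Sum the block owned by each virtual processor using loop raking.
 * Iteration i touches element i of every block at constant stride
 * S = n/V; advancing i from 0 to S-1 moves the rake down the array.
 * Assumes n is divisible by V. */
void rake_sums(const int *a, int n, long sums[V])
{
    int S = n / V;                       /* rake stride = block length */
    for (int j = 0; j < V; j++)
        sums[j] = 0;
    for (int i = 0; i < S; i++)          /* one rake step per iteration */
        for (int j = 0; j < V; j++)      /* one element per virtual proc */
            sums[j] += a[i + j * S];
}
```

With strip mining, virtual processor j would instead see elements j, j+V, j+2V, ..., an interleaved rather than contiguous set.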
As discussed in Section 2, the buckets logically form a two-dimensional array: the row index identifies a set of buckets for a virtual processor and the column index selects a bucket within a set. The obvious way to lay out the buckets would be in row-major order, with the buckets for each virtual processor stored in contiguous memory locations. This approach makes addressing the buckets simple, but causes data-dependent bank conflicts. For example, with a radix of 2^8 = 256 and 256 memory banks, a virtual processor with a digit value of v will access memory bank v (assuming the buckets start at bank 0). If several virtual processors have the same value, they will access the same memory bank and introduce a delay. In fact, the worst case is quite common, since users will often sort keys in which all the keys have an identical digit.
Instead we lay out the buckets in column-major order so that the buckets used by each virtual processor are in a separate memory bank. For example, vector element 0 might only access banks 0 and 64, while vector element 1 only accesses banks 1 and 65. Hence using column-major order guarantees that each memory address in a vector gather or scatter is in a different bank. The only disadvantage to a column-major layout is that addressing bucket v requires multiplying v by the number of sets of buckets (V × P, the number of virtual processors). Fortunately, we can perform the address arithmetic with inexpensive bit shift instructions, since the number of sets of buckets is a power of 2. Furthermore, this shift can be folded in with other operations needed to extract the current digit from the keys.

4 CRAY Implementation of Radix Sort

This section describes our implementation of parallel RADIX-SORT on an 8-processor CRAY Y-MP. Each pass of RADIX-SORT consists of a call to EXTRACT-DIGIT followed by the three phases of the COUNTING-SORT algorithm: HISTOGRAM-KEYS, SCAN-BUCKETS, and RANK-AND-PERMUTE. For each routine, this section discusses vectorization and parallelization and develops an equation for estimating the execution time. This section also discusses our implementation of a more general sorting interface.
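The bank behavior of the two bucket layouts from Section 3 can be checked with a few lines of C. The constants are illustrative (the paper's bank count is 256; the virtual-processor count here is a stand-in).

```c
/* Bank selection sketch: address a falls in bank a % NBANKS. Row-major
 * stores bucket (j, v) at j*RADIX + v; column-major stores it at
 * v*NVP + j, where NVP is the number of virtual processors per CPU
 * (64 here, an illustrative value). */
#define NBANKS 256
#define RADIX  256          /* 2^8 buckets per virtual processor */
#define NVP    64

int row_major_bank(int j, int v)    { return (j * RADIX + v) % NBANKS; }
int column_major_bank(int j, int v) { return (v * NVP + j) % NBANKS; }
```

When every virtual processor holds the same digit v (the common worst case noted above), row-major sends all of them to bank v, while column-major spreads them across NVP distinct banks.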
We implemented RADIX-SORT using our own macro assembler, written in LISP. The macro assembler generates CRAY Assembly Language (CAL). In the assembler, we developed a set of macros for implementing loop raking, including a version that generates an unrolled loop to avoid register reservation effects [17] (therefore allowing better chaining), and a version that supports parallel loops by generating calls to microtasking macros ($MDO) provided by the CRAY macro package [7]. Part of the motivation for developing the macro assembler was to allow us to easily port the code to other vector machines.
Extract Digit: On each pass, RADIX-SORT extracts the current digit from the keys. In the parallel histogram operation, each processor then uses this digit to compute an index into the array of buckets. Our implementation of EXTRACT-DIGIT folds together the two operations of extracting the digit and converting it to a bucket index. This simplifies the HISTOGRAM and RANK-AND-PERMUTE routines by providing direct access to the buckets without requiring additional address computation.
Since each virtual processor will process a contiguous block of keys, EXTRACT-DIGIT uses loop raking. The digits are extracted by masking with a bitwise AND operation and shifting the result to the low-order bits. The digit must then be converted into an index into the column-major bucket array, where bucket v of virtual processor j is stored at an offset of v × V × P + j. The column offset is computed by multiplying the digit by the number of sets of buckets, V × P. The row offset is computed by having each virtual processor add its virtual processor number (which is precomputed outside the loop).
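The folded extract-and-convert step can be sketched in C. On the Y-MP this is the masked shift plus the shift-and-add described above, performed on one vector element per virtual processor; the constants here are illustrative.

```c
#include <stdint.h>

#define RBITS     8    /* r: bits per digit (illustrative)               */
#define NSETS_LOG 9    /* log2 of the number of bucket sets, V*P = 512   */

/* Extract the digit at bit position `shift` from a key and fold in the
 * column-major bucket address for virtual processor j: the digit v is
 * multiplied by the number of sets (a power of two, so a left shift)
 * and the virtual processor number is added as the row offset. */
uint64_t digit_to_bucket_index(uint64_t key, int shift, unsigned j)
{
    uint64_t v = (key >> shift) & ((1u << RBITS) - 1);  /* mask + shift */
    return (v << NSETS_LOG) + j;        /* column offset + vp number    */
}
```

Because the shift by NSETS_LOG is folded into the digit extraction, the histogram and permute loops can use the result directly as a gather/scatter index.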
The EXTRACT-DIGIT routine is quite efficient. Multiplying by V × P can be computed with a shift (since V × P is a power of two), and the mask, shift, and addition are all chained in the hardware pipeline. The time for EXTRACT-DIGIT is

  T_Extract-Digit = c · n/p

where the constant c was determined empirically. The units for all equations in this paper are expressed in clock cycles (1 clock cycle = 6 nsec).
Histogram Keys: The first step in building the histogram consists of clearing the buckets using a simple loop. There are 2^r × V × P buckets, and we achieve a speedup of p, yielding the equation:

  T_Clear-Buckets = c′ · 2^r × V × P / p

where c′ is another empirically determined constant.
To produce counts of the digits, HISTOGRAM-KEYS uses loop raking to access the digits (see Figure 4). Since each digit has already been converted to a bucket index, the counts are simply gathered from the buckets, incremented, and scattered back to the buckets.
In the pseudo-code below, upper-case variables denote vector registers. The notation A[i : j : s] identifies elements from i to j, incrementing by s. The access D[i : i + (V − 1)S : S] therefore loads one digit for each of the V virtual processors. To implement loop raking, each physical processor initializes a variable to the starting location of the keys which it will process. Then the stride S is set to the number of elements per virtual processor, since each vector register element will simulate one virtual processor.
HISTOGRAM-KEYS
  % B[0 : 2^r × V × P − 1] = Buckets
  % D[0 : n − 1] = Digits (pre-converted to bucket indices)
  doall j = 0 to P − 1
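The pseudo-code above is cut off in our copy, but the gather-increment-scatter step it sets up can be sketched serially in C (our stand-in for the vectorized CAL loop; the digits are assumed pre-converted to column-major bucket indices, and the vector length is illustrative):

```c
#define VLEN 4    /* vector length: virtual processors per CPU (illustrative) */

/* HISTOGRAM-KEYS body for one processor's section: rake over the digits
 * with stride S = n/VLEN; each rake step gathers VLEN counts, increments
 * them, and scatters them back. Because the digits were pre-converted to
 * per-virtual-processor bucket indices, the VLEN updates of one step
 * never touch the same bucket. Assumes n divisible by VLEN. */
void histogram_keys(const int *D, int n, int *B, int nbuckets)
{
    for (int k = 0; k < nbuckets; k++)     /* clear the buckets */
        B[k] = 0;
    int S = n / VLEN;                      /* elements per virtual processor */
    for (int i = 0; i < S; i++)            /* one rake step...               */
        for (int j = 0; j < VLEN; j++)     /* ...one count per virtual proc  */
            B[D[i + j * S]]++;             /* gather, increment, scatter     */
}
```

The absence of collisions within a rake step is exactly what allows the increment to be expressed as a vector gather, vector add, and vector scatter.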