Radix Sort For Vector Multiprocessors
Marco Zagha and Guy E. Blelloch
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890
Abstract
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second.
Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms onto vector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used to optimize the radix size.
1 Introduction
Sorting is one of the most heavily studied problems in computer science. A sorting algorithm, bitonic sort [1], was one of the first algorithms designed for parallel machines, and since then hundreds of articles on parallel sorting have appeared in the literature. Although this work has developed a robust theory of parallel sorting, there has only been limited success in obtaining efficient implementations on real parallel machines. Many of the reported results are barely faster than serial sorting algorithms on fast workstations. This is due to the high communication costs of most parallel machines.

To develop the multiprocessor algorithm from the data-parallel algorithm, we use the following techniques:
Virtual Processors: Each element of a vector register is viewed as a virtual processor. For a vector-register length V on each processor, and for P processors, we therefore use an algorithm designed for a V × P processor machine (64 × 8 = 512 on the Y-MP). Viewing each vector element as its own processor forces independence among vector elements, allowing the algorithm to be fully vectorized.
Loop Raking: A new technique we call loop raking is used for vectorizing the loops. Instead of loading contiguous blocks of a vector on each vector load, as done in strip mining [16], loop raking uses a constant stride to load a set of elements evenly distributed across the input vector, as if a rake were placed over the vector. Loop raking is necessary to keep each pass of the radix sort stable (see Section 3).
Processor Memory Layout: A careful layout of the memory of each virtual processor is used to avoid bank conflicts. This layout is used for placing the buckets needed by the radix sort and guarantees that for every vector gather from or scatter to the buckets, each element of a vector register accesses a different memory bank. As compared to a naive layout of the buckets in contiguous memory locations, our layout greatly reduces the running time for some common inputs.
In addition to describing the algorithm, this paper models the running time of the radix sort in terms of the number of keys to be sorted, the size of each key, and the number of processors. The model yields the following equation for the number of 6 nsec clock cycles:
T = (b/r) (c1 n/p + c2 2^r) (1 + c3 (p − 1)/(P − 1))    (1)

where n is the number of keys, p is the number of physical processors used, P is the maximum number of physical processors (8 on the Y-MP), r is the number of bits in the radix, and b is the number of bits in the key. The form (b/r)(c1 n/p + c2 2^r) is based on the theoretical complexity, and the constants c1 and c2 are derived from empirical measurements on the Y-MP. The (1 + c3 (p − 1)/(P − 1)) term is due to memory contention and is also based on empirical measurements. Equation 1 is used to select an optimal radix size (typically between 6 and 14), and can also be used to predict running times over a range of data sizes and numbers of processors without having to run the algorithm. With n ≥ 5000, Equation 1 predicts the actual running time within 10% accuracy. The inaccuracy is due to a slight dependence of the running time on the distribution of the keys.
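To make the use of Equation 1 concrete, the following C sketch evaluates the model and searches for the radix size that minimizes the predicted time. The constants C1, C2, and C3 are hypothetical placeholders, not the paper's measured values.

```c
/* Sketch of using Equation 1 to pick a radix size. The constants C1, C2,
 * and C3 are hypothetical placeholders; the paper derives the real values
 * from empirical measurements on the Y-MP. */
#define C1 10.0   /* hypothetical: cycles per key per pass        */
#define C2 2.0    /* hypothetical: cycles per bucket per pass     */
#define C3 0.5    /* hypothetical: memory-contention coefficient  */

/* Predicted clock cycles to sort n keys of b bits with an r-bit radix
 * on p of P processors, following the form of Equation 1. */
double predicted_cycles(double n, int b, int r, int p, int P)
{
    double passes     = (double)b / (double)r;               /* b/r passes */
    double per_pass   = C1 * n / p + C2 * (double)(1L << r);
    double contention = 1.0 + C3 * (double)(p - 1) / (double)(P - 1);
    return passes * per_pass * contention;
}

/* Search the range the paper reports as typical (6-14 bits) for the
 * radix size minimizing the predicted time. */
int best_radix(double n, int b, int p, int P)
{
    int best = 6;
    for (int r = 7; r <= 14; r++)
        if (predicted_cycles(n, b, r, p, P) <
            predicted_cycles(n, b, best, p, P))
            best = r;
    return best;
}
```

As the key count grows, the n/p term dominates and a larger radix (fewer passes) wins; for small inputs the 2^r bucket cost pushes the optimum down.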
The remainder of this paper is organized as follows. Section 2 discusses the serial and parallel versions of radix sort. Section 3 discusses the use of virtual processors, loop raking, and memory layout. Section 4 describes our implementation on the CRAY Y-MP. Section 5 evaluates the performance of our implementation and discusses how to select the radix size.
2 Radix Sorting
Radix-based sorting algorithms treat keys as multidigit numbers in which each digit is an integer with a value in [0, m − 1], where m is the radix. A 32-bit integer, for example, could be treated as a 4-digit number with radix m = 2^(32/4) = 2^8 = 256. The radix is usually chosen to minimize the running time and is highly dependent on the implementation and the number of keys being sorted. The standard radix sort on which our parallel algorithm is based is perhaps the oldest form of sorting. The first implementation dates back to the 19th century and Hollerith's machines for sorting punched cards.
Radix sorting is an attractive alternative to comparison-based sorts, such as quicksort [10], since for n keys it runs in O(n) instead of O(n lg n) time. In practice, however, a disadvantage is that it accesses memory in an arbitrary order. Therefore, on a machine with a cache, many memory references will cause cache misses. Studies have shown that on serial machines, when the problem fits into physical memory but not into the cache, radix sort performs no faster, or only marginally faster, than optimized implementations of quicksort (for 32-bit and 64-bit integers) [14, Chapter 8].
Since the Y-MP has no cache (all physical memory is accessible with equal cost), and assuming radix sort can be vectorized, radix sort is likely to have better relative performance over comparison-based sorts than on cache-based machines. This seems to be true. For large data sets, our single-processor sort runs between 3 and 5 times faster than a fully vectorized version of quicksort implemented by Levin [13].
Serial Radix Sort
We first review the serial radix sort since our parallel algorithm will include the same phases (see [6, Section 9.3] for more details). Radix sort works by breaking keys into digits and sorting one digit at a time, starting with the least significant digit. Figure 2 shows an example. A counting sort (sometimes called distribution sort) is used to sort each digit. Since RADIX-SORT starts from the least significant digit, the sort works only if the ordering generated in previous passes is preserved. Each COUNTING-SORT therefore must be stable.

Figure 2: Radix sorting 3-digit numbers 1 digit at a time. A counting sort is used for each digit.
The COUNTING-SORT is implemented as follows. We assume that each digit consists of r bits: m = 2^r. The basic idea is to determine, for each input digit D[i], the number of keys with a digit less than D[i], and also the number of keys with a digit equal to D[i] appearing earlier in the input sequence. To do this, the sort uses 2^r buckets, one for each possible digit value. The following routine sorts an array K of keys, based on an array D of digits, both of size n, r bits at a time, placing the result in R. The algorithm is divided into 3 phases:
COUNTING-SORT(K, D, R, n, r)
  HISTOGRAM-KEYS:
    do v = 0 to 2^r − 1
      Bucket[v] ← 0
    do i = 0 to n − 1
      Bucket[D[i]] ← Bucket[D[i]] + 1
  SCAN-BUCKETS:
    sum ← 0
    do v = 0 to 2^r − 1
      (Bucket[v], sum) ← (sum, sum + Bucket[v])
  RANK-AND-PERMUTE:
    do i = 0 to n − 1
      R[Bucket[D[i]]] ← K[i]
      Bucket[D[i]] ← Bucket[D[i]] + 1
The first loop of HISTOGRAM-KEYS clears the buckets. The second loop, at iteration i, uses the i-th element of the digit array as an offset into the buckets, and increments the count in that bucket. At the end of HISTOGRAM-KEYS, Bucket[v] contains the number of digits having value v. SCAN-BUCKETS performs a scan (also called all-prefix-sums) operation on the buckets, returning to each element the sum of all previous elements. After the scan, Bucket[v] contains the number of digits with a value u such that u < v. This is the position in the output in which the first key with digit v belongs. In the final phase, RANK-AND-PERMUTE, each key with a digit of value v is placed in its final location by getting the offset from Bucket[v] and incrementing the bucket so the next key with digit v gets put in the next location. Since the keys are looped over in increasing order, COUNTING-SORT is stable.
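As a concrete reference, the three phases can be written out in a minimal C version of the serial algorithm. The function names and the 8-bit digit size are illustrative choices of ours, not taken from the paper's code.

```c
#include <stdint.h>
#include <string.h>

#define RBITS 8                      /* r: bits per digit */
#define RADIX (1u << RBITS)          /* 2^r buckets       */

/* One stable COUNTING-SORT pass on the digit at bit position `shift`:
 * HISTOGRAM-KEYS, SCAN-BUCKETS, RANK-AND-PERMUTE. Reads src, writes dst. */
static void counting_sort_pass(const uint32_t *src, uint32_t *dst,
                               int n, int shift)
{
    int bucket[RADIX];
    memset(bucket, 0, sizeof bucket);

    for (int i = 0; i < n; i++)                    /* HISTOGRAM-KEYS */
        bucket[(src[i] >> shift) & (RADIX - 1)]++;

    int sum = 0;                                   /* SCAN-BUCKETS */
    for (unsigned v = 0; v < RADIX; v++) {
        int count = bucket[v];
        bucket[v] = sum;           /* position of first key with digit v */
        sum += count;
    }

    for (int i = 0; i < n; i++) {                  /* RANK-AND-PERMUTE */
        unsigned v = (src[i] >> shift) & (RADIX - 1);
        dst[bucket[v]++] = src[i];                 /* stable: i increases */
    }
}

/* Radix sort, least significant digit first; tmp is scratch of size n. */
void radix_sort(uint32_t *keys, uint32_t *tmp, int n)
{
    for (int shift = 0; shift < 32; shift += 2 * RBITS) {
        counting_sort_pass(keys, tmp, n, shift);
        counting_sort_pass(tmp, keys, n, shift + RBITS);
    }
}
```

Each pair of passes ping-pongs between the key array and the scratch array, so the sorted result ends up back in `keys`.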
Parallel Radix Sort
The serial algorithm cannot be directly parallelized (or vectorized) because of loop dependencies in all three phases. If an attempt is made to parallelize over the iterations of HISTOGRAM-KEYS, several processors could try to increment the same bucket simultaneously. If the bucket is not locked by each processor to gain exclusive access, only one of the increments would take effect. On the other hand, if the bucket is locked, the bucket would act as a serial bottleneck and would greatly degrade performance if many keys had the same digit.
By using a separate set of buckets for each processor, it is possible to parallelize the algorithm without requiring a lock on the buckets. Each processor is responsible for its own subset of the keys and tabulates them in its own set of buckets. A restricted version of this algorithm was described by Cole and Vishkin as part of an optimal 2-ruling set algorithm [5]. The full algorithm was also implemented on the Connection Machine CM-2 [3]. In this parallel version of radix sort, the buckets can be viewed as a matrix:

  Bucket[j][v],   j = 0, 1, ..., P − 1;   v = 0, 1, 2, ..., 2^r − 1

where row j holds the buckets of processor j.
Each processor histograms its own section of the keys into its row, so that summing down a column gives the count for that digit over the whole input data. After SCAN-BUCKETS, we would like Bucket[j][v] to contain the position in the output where the first key from processor j with digit v belongs (see Figure 3). This can be expressed as:

  Bucket[j][v] = Σ_{u=0}^{v−1} Σ_{i=0}^{P−1} Bucket[i][u] + Σ_{i=0}^{j−1} Bucket[i][v]
That is to say, the offset is the total number of digits less than v over all the processors, plus the number of digits equal to v in processors less than j. This sum can be calculated by flattening the matrix into column-major order and executing a SCAN-BUCKETS on the flattened matrix. The SCAN-BUCKETS operation can be parallelized using a tree-summing or similar algorithm [12]. The parallel version generates the same permutation as the serial algorithm so the sort remains stable.
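The per-processor bucket matrix and the column-major scan can be demonstrated with a small serial simulation in C. The 4-processor, 4-bucket sizes mirror Figure 3; simulating the processors with a serial loop is our own simplification.

```c
#include <string.h>

#define NPROC 4    /* simulated processors, as in Figure 3 */
#define NVALS 4    /* bucket count: digit values 0-3       */

/* One stable counting-sort pass with a separate set of buckets per
 * processor. Bucket (processor j, value v) is stored at v*NPROC + j,
 * i.e. the matrix is flattened in column-major order, so one exclusive
 * scan over the flat array produces exactly the offsets described in
 * the text. n must be divisible by NPROC. */
void parallel_counting_sort(const int *keys, int *out, int n)
{
    int bucket[NVALS * NPROC];
    memset(bucket, 0, sizeof bucket);
    int per = n / NPROC;                       /* keys per processor */

    for (int j = 0; j < NPROC; j++)            /* HISTOGRAM-KEYS */
        for (int i = j * per; i < (j + 1) * per; i++)
            bucket[keys[i] * NPROC + j]++;

    int sum = 0;                               /* SCAN-BUCKETS, flattened */
    for (int k = 0; k < NVALS * NPROC; k++) {
        int c = bucket[k];
        bucket[k] = sum;
        sum += c;
    }

    for (int j = 0; j < NPROC; j++)            /* RANK-AND-PERMUTE */
        for (int i = j * per; i < (j + 1) * per; i++)
            out[bucket[keys[i] * NPROC + j]++] = keys[i];
}
```

Within one column of the matrix, lower-numbered processors receive smaller offsets, and each processor walks its section in increasing order, which is why the permutation is stable.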
Figure 3: The scan step of parallel radix sort. The algorithm is illustrated with 4 processors and 4 buckets for the values 0–3. The offsets are computed by scanning the buckets. After the scan, each processor has the starting position in the final output of keys with a particular value. For example, processor 3 will place the "0" keys starting at position 5 in the output, and the "1" keys starting at position 14, as indicated by the offsets.
3 Algorithmic Techniques
The previous section described a data-parallel radix-sort algorithm. This section describes three general techniques that can be used to map this algorithm onto vector multiprocessors: virtual processors, loop raking, and processor memory layout. We have also used these techniques in the implementation of a variety of other parallel algorithms on the CRAY Y-MP and Ardent Titan.
Figure 4: Each physical processor has a set of buckets for each vector register element (virtual processor). The figure depicts 2 processors (P = 2) using a vector length of 4 (V = 4). For a 3-bit digit (r = 3), there are 2^3 × 2 × 4 = 64 buckets. Using P processors, the keys are divided into P equal sections, which are further divided into V sections, one section for each virtual processor.
Virtual Processors: Vector multiprocessors, such as the CRAY, Convex, or Ardent Titan, offer two levels of parallelism: multiprocessor facilities and vector facilities. To take full advantage of these machines it is important to use both levels. One way is to flatten the two levels into one homogeneous level by viewing each element of a vector register as a virtual processor. We call these elements virtual processors so as not to confuse them with a full vector processor. In this view, a machine with a vector-register length V, and with P processors, will have V × P virtual processors. Since the Y-MP has a maximum vector-register length of 64, and there are up to 8 processors, the CRAY can contain up to 512 virtual processors. Section 5 discusses why using a smaller vector length, and hence number of virtual processors, can improve performance in certain cases. On machines with more flexible use of the vector registers, such as the Ardent Titan, it is up to the algorithm designer to decide how many virtual processors to use. This decision should be based on the vector startup times, the number of vector registers needed, and the cost of using additional virtual processors.
Any data-parallel algorithm can be both vectorized and parallelized by assigning each processor in the algorithm to a virtual processor on the machine. In our implementation of the parallel radix sort on the CRAY Y-MP, the keys are divided into V × P equally sized sets, and each virtual processor is responsible for one of these sets. Each virtual processor also uses its own set of buckets (see Figure 4).
Figure 5: Loop raking. Vectors are loaded with a stride of n/V. To process an entire array of length n, we move the rake down one step on each of n/V iterations.
Loop Raking: The typical way to vectorize an operation on an array uses strip mining [16]:

  do i = 0 to n − 1 by V
    process items i to i + V − 1

where V is the vector-register length. In this method an element of a vector register handles every V-th element in the array. The problem with using strip mining for vectorizing counting sort is that the sort will not be stable, as required for radix sort to be correct. This is because each virtual processor (element of the vector register) will be responsible for a strided set of keys rather than a contiguous block of keys as required by the parallel algorithm.
To generate a stable counting sort, we introduce a technique that we call loop raking. Loop raking uses a constant stride S = n/V to access a set of V elements that are evenly distributed across the input vector, as if a rake were placed over the vector and shifted on each vector load. See Figure 5. The loop structure is as follows:

  do i = 0 to S − 1
    process items i, i + S, i + 2S, ..., i + (V − 1)S

To avoid excessive bank conflicts during a vector load, the stride S is selected so that it is odd (see discussion of shape factors in [4]). To extend loop raking to multiple processors, a vector is divided into P parts, and each part is raked by one of the processors. As will be discussed in Section 4, loop raking is used in all three phases of the counting sort.

Processor Memory Layout: On the CRAY Y-MP, there are 256 separate banks of interleaved memory, each with a relatively high latency. If one access to a bank is followed directly by another access to the same bank, the second access will be delayed. A memory location a is contained in bank a mod 256, causing sequential accesses to be very efficient, since locations are striped across the memory banks [17]. But when performing data-dependent access to the buckets in the radix sort, memory accesses may refer to the same banks. Ideally, we would like to arrange the buckets so that each access would be guaranteed to use a different bank, making the sorting time independent of the distribution of the keys.
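The loop-raking access pattern can be sketched in scalar C (our illustration, not the CAL code): the inner loop over j corresponds to one strided vector load, and each virtual processor ends up owning a contiguous block, which is what makes the later permute stable.

```c
#define V 4    /* vector length: number of virtual processors */

/* Sum the block owned by each virtual processor using loop raking.
 * Iteration i touches element i of every block at constant stride
 * S = n/V; advancing i from 0 to S-1 moves the rake down the array.
 * Assumes n is divisible by V. */
void rake_sums(const int *a, int n, long sums[V])
{
    int S = n / V;                       /* rake stride = block length */
    for (int j = 0; j < V; j++)
        sums[j] = 0;
    for (int i = 0; i < S; i++)          /* one rake step per iteration */
        for (int j = 0; j < V; j++)      /* one element per virtual proc */
            sums[j] += a[i + j * S];
}
```

With strip mining, virtual processor j would instead see elements j, j+V, j+2V, ..., an interleaved rather than contiguous set.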
As discussed in Section 2, the buckets logically form a two-dimensional array: the row index identifies a set of buckets for a virtual processor and the column index selects a bucket within a set. The obvious way to lay out the buckets would be in row-major order, with the buckets for each virtual processor stored in contiguous memory locations. This approach makes addressing the buckets simple, but causes data-dependent bank conflicts. For example, with a radix of 2^8 = 256 and 256 memory banks, a virtual processor with a digit value of v will access memory bank v (assuming the buckets start at bank 0). If several virtual processors have the same value, they will access the same memory bank and introduce a delay. In fact, the worst case is quite common, since users will often sort keys in which all the keys have an identical digit.
Instead we lay out the buckets in column-major order so that the buckets used by each virtual processor are in a separate memory bank. For example, vector element 0 might only access banks 0 and 64, while vector element 1 only accesses banks 1 and 65. Hence using column-major order guarantees that each memory address in a vector gather or scatter is in a different bank. The only disadvantage to a column-major layout is that addressing bucket v requires multiplying v by the number of sets of buckets (V × P, the number of virtual processors). Fortunately, we can perform the address arithmetic with inexpensive bit shift instructions, since the number of sets of buckets is a power of 2. Furthermore, this shift can be folded in with other operations needed to extract the current digit from the keys.

4 CRAY Implementation of Radix Sort

This section describes our implementation of parallel RADIX-SORT on an 8-processor CRAY Y-MP. Each pass of RADIX-SORT consists of a call to EXTRACT-DIGIT followed by the three phases of the COUNTING-SORT algorithm: HISTOGRAM-KEYS, SCAN-BUCKETS, and RANK-AND-PERMUTE. For each routine, this section discusses vectorization and parallelization and develops an equation for estimating the execution time. This section also discusses our implementation of a more general sorting interface.
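The bank behavior of the two bucket layouts from Section 3 can be checked with a few lines of C. The constants are illustrative (the paper's bank count is 256; the virtual-processor count here is a stand-in).

```c
/* Bank selection sketch: address a falls in bank a % NBANKS. Row-major
 * stores bucket (j, v) at j*RADIX + v; column-major stores it at
 * v*NVP + j, where NVP is the number of virtual processors per CPU
 * (64 here, an illustrative value). */
#define NBANKS 256
#define RADIX  256          /* 2^8 buckets per virtual processor */
#define NVP    64

int row_major_bank(int j, int v)    { return (j * RADIX + v) % NBANKS; }
int column_major_bank(int j, int v) { return (v * NVP + j) % NBANKS; }
```

When every virtual processor holds the same digit v (the common worst case noted above), row-major sends all of them to bank v, while column-major spreads them across NVP distinct banks.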
We implemented RADIX-SORT using our own macro assembler, written in LISP. The macro assembler generates CRAY Assembly Language (CAL). In the assembler, we developed a set of macros for implementing loop raking, including a version that generates an unrolled loop to avoid register reservation effects [17] (therefore allowing better chaining), and a version that supports parallel loops by generating calls to microtasking macros ($MDO) provided by the CRAY macro package [7]. Part of the motivation for developing the macro assembler was to allow us to easily port the code to other vector machines.
Extract Digit: On each pass, RADIX-SORT extracts the current digit from the keys. In the parallel histogram operation, each processor then uses this digit to compute an index into the array of buckets. Our implementation of EXTRACT-DIGIT folds together the two operations of extracting the digit and converting it to a bucket index. This simplifies the HISTOGRAM and RANK-AND-PERMUTE routines by providing direct access to the buckets without requiring additional address computation.
Since each virtual processor will process a contiguous block of keys, EXTRACT-DIGIT uses loop raking. The digits are extracted by masking with a bitwise AND operation and shifting the result to the low-order bits. The digit must then be converted into an index into the column-major bucket array, where bucket v of virtual processor j is stored at an offset of v × V × P + j. The column offset is computed by multiplying the digit by the number of sets of buckets, V × P. The row offset is computed by having each virtual processor add its virtual processor number (which is precomputed outside the loop).
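The folded extract-and-convert step can be sketched in C. On the Y-MP this is the masked shift plus the shift-and-add described above, performed on one vector element per virtual processor; the constants here are illustrative.

```c
#include <stdint.h>

#define RBITS     8    /* r: bits per digit (illustrative)               */
#define NSETS_LOG 9    /* log2 of the number of bucket sets, V*P = 512   */

/* Extract the digit at bit position `shift` from a key and fold in the
 * column-major bucket address for virtual processor j: the digit v is
 * multiplied by the number of sets (a power of two, so a left shift)
 * and the virtual processor number is added as the row offset. */
uint64_t digit_to_bucket_index(uint64_t key, int shift, unsigned j)
{
    uint64_t v = (key >> shift) & ((1u << RBITS) - 1);  /* mask + shift */
    return (v << NSETS_LOG) + j;        /* column offset + vp number    */
}
```

Because the shift by NSETS_LOG is folded into the digit extraction, the histogram and permute loops can use the result directly as a gather/scatter index.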
The EXTRACT-DIGIT routine is quite efficient. Multiplying by V × P can be computed with a shift (since V × P is a power of two), and the mask, shift, and addition are all chained in the hardware pipeline. The time for EXTRACT-DIGIT is

  T_Extract-Digit = c · n/p

where the constant c was determined empirically. The units for all equations in this paper are expressed in clock cycles (1 clock cycle = 6 nsec).
Histogram Keys: The first step in building the histogram consists of clearing the buckets using a simple loop. There are 2^r × V × P buckets, and we achieve a speedup of p, yielding the equation:

  T_Clear-Buckets = c′ · 2^r × V × P / p

where c′ is another empirically determined constant.
To produce counts of the digits, HISTOGRAM-KEYS uses loop raking to access the digits (see Figure 4). Since each digit has already been converted to a bucket index, the counts are simply gathered from the buckets, incremented, and scattered back to the buckets.
In the pseudo-code below, upper-case variables denote vector registers. The notation A[i : j : s] identifies elements from i to j, incrementing by s. The access D[i : i + (V − 1)S : S] therefore loads one digit for each of the V virtual processors. To implement loop raking, each physical processor initializes a variable to the starting location of the keys which it will process. Then the stride S is set to the number of elements per virtual processor, since each vector register element will simulate one virtual processor.
HISTOGRAM-KEYS
  % B[0 : 2^r × V × P − 1] = Buckets
  % D[0 : n − 1] = Digits (pre-converted to bucket indices)
  doall j = 0 to P − 1
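The pseudo-code above is cut off in our copy, but the gather-increment-scatter step it sets up can be sketched serially in C (our stand-in for the vectorized CAL loop; the digits are assumed pre-converted to column-major bucket indices, and the vector length is illustrative):

```c
#define VLEN 4    /* vector length: virtual processors per CPU (illustrative) */

/* HISTOGRAM-KEYS body for one processor's section: rake over the digits
 * with stride S = n/VLEN; each rake step gathers VLEN counts, increments
 * them, and scatters them back. Because the digits were pre-converted to
 * per-virtual-processor bucket indices, the VLEN updates of one step
 * never touch the same bucket. Assumes n divisible by VLEN. */
void histogram_keys(const int *D, int n, int *B, int nbuckets)
{
    for (int k = 0; k < nbuckets; k++)     /* clear the buckets */
        B[k] = 0;
    int S = n / VLEN;                      /* elements per virtual processor */
    for (int i = 0; i < S; i++)            /* one rake step...               */
        for (int j = 0; j < VLEN; j++)     /* ...one count per virtual proc  */
            B[D[i + j * S]]++;             /* gather, increment, scatter     */
}
```

The absence of collisions within a rake step is exactly what allows the increment to be expressed as a vector gather, vector add, and vector scatter.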