Engineer-to-Engineer Note
EE-255
Technical notes on using Analog Devices DSPs, processors and development tools
Or visit our on-line resources: www.analog.com/ee-notes and www.analog.com/processors
Porting PC-Based MP3 Player Software to ADSP-21262 SHARC® Processors
Contributed by Srinivas K. and Kunal Singh
Rev 1 – November 16, 2004
Copyright 2004, Analog Devices, Inc. All rights reserved.

Introduction
ADSP-21262 devices are members of the third generation of the SHARC® family of processors. ADSP-21262 processors offer a SIMD architecture and are equipped with powerful DMA engines, ensuring high-bandwidth data transfers to and from the processor. These transfers are completely transparent to the processor core. ADSP-21262 processors operate at up to 200 MIPS and provide several peripherals (e.g., SPORTs, PP, SPI, IDP) that are well suited for audio applications.
MP3 is a standard for digitally compressed music. The compression algorithm is capable of up to 10:1 compression with no noticeable loss in audio quality. MP3 is short for MPEG-1 Audio Layer 3, a standard defined by the Moving Picture Experts Group (MPEG). MP3 has become an increasingly popular way to store audio in electronic form. An MP3 decoder reads the compressed data from the storage medium and performs a series of decoding steps to recover the raw audio data. This audio data is in PCM format, which can be written back to storage or played to an audio output device (speaker) in real time.
This application note is based upon experience gained while porting pure PC-based C code for an MP3 decoder to ADSP-21262 processors using the VisualDSP++® 3.5 tools suite. The target platform was the ADSP-21262 EZ-KIT Lite® evaluation system. This application note summarizes key considerations involved in porting general PC-based C code to ADSP-21262 processors.
Data I/O - PC versus SHARC
As depicted in Figure 1, general PC-based code primarily uses file I/O for data input and output. The data is stored as files on the PC's hard drive, and file I/O is provided by the operating system running on the PC. For example, the MP3 files for an MP3 decoder may be stored on the PC's hard drive.
Figure 1. Data I/O Scheme for a PC-based System
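For reference, the input side of such a PC-based decoder can be as simple as the following sketch, which uses standard C file I/O to read the compressed bitstream (the file name, chunk size, and decode_chunk() call are illustrative placeholders only):

#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096   /* illustrative read size in bytes */

int main(void)
{
    unsigned char buffer[CHUNK_SIZE];
    size_t bytes_read;

    FILE *fp = fopen("input.mp3", "rb");   /* placeholder file name */
    if (fp == NULL)
        return EXIT_FAILURE;

    /* Read the compressed bitstream chunk by chunk; a real decoder would
       hand each chunk to its frame-parsing and decoding routines here. */
    while ((bytes_read = fread(buffer, 1, CHUNK_SIZE, fp)) > 0)
    {
        /* decode_chunk(buffer, bytes_read); -- hypothetical decoder call */
    }

    fclose(fp);
    return EXIT_SUCCESS;
}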
On an ADSP-21262-based system, in contrast, data I/O is typically handled by the processor's DMA engines. A DMA scheme is particularly suitable for real-time applications in which large amounts of data must be moved into and out of the processor in real time.
ADSP-21262 processors offer powerful DMA engines to perform data transfers between:
- Internal and external memory
- Internal memory and an external peripheral
These data transfers are completely transparent to the core.
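At the application level, such DMA-driven I/O is commonly paired with a double-buffering (ping-pong) scheme: the core decodes into one buffer while the DMA engine streams the other buffer to the audio peripheral. The sketch below shows only this structure; dma_start_transfer(), dma_wait_complete(), and decode_next_frame() are hypothetical placeholders for the device-specific DMA programming and for the decoder itself.

#include <stddef.h>

#define FRAME_SAMPLES 1152   /* samples per decoded MP3 frame, per channel */

static float out_buf[2][FRAME_SAMPLES];   /* ping-pong output buffers */

/* Hypothetical wrappers around the device-specific DMA programming. */
extern void dma_start_transfer(const float *buf, size_t len);
extern void dma_wait_complete(void);

/* Hypothetical decoder entry point producing one frame of PCM data;
   returns 0 at the end of the bitstream. */
extern int decode_next_frame(float *pcm_out);

void playback_loop(void)
{
    int active = 0;

    /* Prime the first buffer, then keep the core and the DMA engine busy
       on alternate buffers: the core decodes into one buffer while the
       DMA engine streams the other one to the audio peripheral. */
    if (decode_next_frame(out_buf[active]) == 0)
        return;

    for (;;)
    {
        dma_start_transfer(out_buf[active], FRAME_SAMPLES);

        active = 1 - active;                       /* switch buffers      */
        if (decode_next_frame(out_buf[active]) == 0)
            break;                                 /* end of bitstream    */

        dma_wait_complete();                       /* sync before reusing */
    }

    dma_wait_complete();
}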
Parallel Data Fetch and SIMD

ADSP-21262 processors can perform dual data fetches and a MAC operation in a single cycle. The internal bus architecture of the ADSP-21262 processor consists of separate PM and DM buses. Normally, the PM bus fetches instructions from program memory and the DM bus reads/writes data in data memory. When a compute instruction with a dual data fetch is executed, one operand is fetched over the PM bus and the second over the DM bus. For such an instruction to complete in a single cycle, it must already be available in the instruction cache, so that no instruction fetch is needed and the PM bus is free to access data.
Instructions involving dual operands are encountered frequently in typical signal-processing code. Some examples include FIR/IIR filter loops, DCT, FFT, and other transforms.
These routines involve MAC operations on two vectors. The operations are performed in a loop, so all of the instructions are moved into the instruction cache during the first pass of the loop and no instruction fetches are required for subsequent iterations. If the two data vectors are located in different memory blocks (PM and DM), it may be possible to perform a dual fetch in a single cycle. Another important feature of the ADSP-21262 processor is its SIMD architecture: the ADSP-21262 has two parallel compute units that can execute the same instruction on different data sets in parallel.
Consider the following multiplication loop:
float operand1[1024];
float operand2[1024];
float result;

void multiply(void)
{
    int j;

    result = 0;
    for (j = 0; j < 1024; j++)
    {
        result += operand1[j] * operand2[j];
    }
}
Listing 1. Multiplication Loop Without Optimization
In the absence of a dual data fetch, the multiplication loop in the above example requires about 2048 cycles to execute, because fetching operand1 and operand2 for each MAC instruction takes two cycles. The code can be modified so that one of the operands resides in the PM memory block. With this modification, the two operands can be fetched in a single cycle: because the multiplication is performed inside a loop, the instruction is cached after the first iteration, freeing the PM bus so that the processor can fetch both operands in one cycle.
float pm operand1[1024];   /* the pm qualifier places operand1 in program memory (PM) */
float operand2[1024];
float result;

void multiply(void)
{
    int j;

    result = 0;
    for (j = 0; j < 1024; j++)
    {
        result += operand1[j] * operand2[j];
    }
}
Listing 2. Multiplication Loop with Dual Data Fetch
As shown above, the pm qualifier instructs the compiler to place this particular variable in the PM memory block. The above loop takes approximately 1024 cycles to execute. The code can be restructured further to allow the compiler to use SIMD mode.
float pm operand1[1024];   /* the pm qualifier places operand1 in program memory (PM) */
float operand2[1024];
float result, result1, result2;

void multiply(void)
{
    int j;

    result1 = 0;
    result2 = 0;
    for (j = 0; j < 1024; j += 2)
    {
        result1 += operand1[j]     * operand2[j];
        result2 += operand1[j + 1] * operand2[j + 1];
    }
    result = result1 + result2;
}
Listing 3. Multiplication Loop with Dual Data Fetch and SIMD
With the multiplication loop re-structured in the above fashion, the compiler would enable SIMD mode and execute the instructions for result1 and result2 on different processing elements. The above loop would take approximately 512 cycles to execute.
Native Instructions
Instructions in the processor's instruction set can be executed in a single cycle. However, operations that are not native to the instruction set take multiple cycles.
Some complex operations can be performed in alternative ways that rely only on native instructions. For example, signal-processing code frequently divides values by 2, 4, 8, and so on; a general division routine takes approximately 40 cycles. Such divisions by a power of two can instead be replaced by right-shift operations, which execute in a single cycle.
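As a simple illustration (plain C, with a hypothetical routine name and buffer size), an integer scaling step written as a division can be expressed as a shift when the divisor is a known power of two:

#define NUM_SAMPLES 256   /* illustrative block length */

/* Scale a block of non-negative fixed-point samples down by 8.
   The division by a power of two is replaced with a right shift; note
   that for negative values a shift rounds differently than '/', so this
   substitution assumes non-negative data (or that the rounding
   difference is acceptable). */
void scale_down_by_8(int *samples)
{
    int i;

    for (i = 0; i < NUM_SAMPLES; i++)
    {
        /* samples[i] = samples[i] / 8;  -- may invoke a multi-cycle routine */
        samples[i] = samples[i] >> 3;    /* single-cycle shift */
    }
}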
Function Calls
Another important consideration is function calls. The C run-time environment must save and restore context information across function calls: the context is pushed onto the stack when a function is called and popped from the stack when the function returns. If frequent calls are made to a relatively small function, this overhead becomes significant. It can be eliminated by replacing such function calls with inline code.
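For example (a minimal sketch; the function name and clamping bounds are illustrative), a small helper that is called once per sample can be declared inline so that the compiler expands its body at each call site rather than generating call/return and context save/restore code:

/* Small helper called once per sample: clamp a value to a given range.
   Declaring it inline lets the compiler expand the body at the call
   site, avoiding the call/return overhead. */
static inline float clamp_sample(float x, float lo, float hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

void limit_block(float *buf, int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        buf[i] = clamp_sample(buf[i], -1.0f, 1.0f);
    }
}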
The VisualDSP++ 3.5 compiler also provides built-in versions of some C library functions. The compiler immediately recognizes them and replaces them with inline assembly code instead of a function call. Inline assembly code is faster than an average library routine, and it does not incur the calling overhead.
Processor Built-In Functions
The VisualDSP++ compiler supports intrinsic (built-in) functions that enable efficient use of hardware resources. These are different from the built-in library functions discussed above, where a library call is replaced with inline assembly; the processor built-in functions instead provide a means to use the processor's hardware directly from C.
The built-in functions can be used to:
(a) Access the system registers. Some intrinsic functions provide efficient access to registers, modes, and addresses not normally accessible from C source. This is achieved through a set of functions defined in the sysreg.h header file. Examples include reading/writing system registers and setting/clearing particular system register bits.

(b) Instruct the compiler to use circular-buffer indexing. This is important when accessing a data array with a fixed offset between consecutive accesses. Decimation, filtering, and FFTs are examples of algorithms that may use this feature. Consider the following example:
int m, jj;
float sum, COS[SIZE];
#define circindex __builtin_circindex

for (m = 0; m < N; m++)
{
    sum += COS[jj];
    jj = circindex(jj, MODIFY, SIZE);
}
Listing 4. circindex Function for Circular Buffering
In the above example, the COS array is accessed inside a loop. The circindex function instructs the compiler to access COS using circular buffering, with Mx = MODIFY and Lx = SIZE. If circindex were not used, the compiler might not implement the accesses to COS with the circular-buffer index registers; instead, it might compute the index explicitly for each access, consuming extra cycles.

Other Optimizations
C code that performs satisfactorily on a PC may not be MIPS-efficient on an embedded processor. The available MIPS on the processor are generally constrained, so further optimizations specific to the algorithms in use may be required. For example, direct DCT computations may be replaced by fast DCT algorithms.
As discussed already, great benefits may be achieved by using processor native instructions in place of complex computations. We would like to share the following example:
N = 36;
for (p = 0; p < N; p++)
{
    sum = 0.0;
    for (m = 0; m < N/2; m++)
        sum += in[m] * COS[((2*p+1+N/2)*(2*m+1)) % (4*36)];
    out[p] = sum * win[block_type][p];
}
Listing 5. An IDCT Loop
The above code section (taken from the MP3 algorithm) is used for the inverse DCT computation.
The instruction in the innermost for loop uses complex logic to calculate the index into the COS table. In this particular algorithm, that instruction executes 41,472 (2 x 32 x 36 x 18) times per audio frame. With this code in place, the algorithm consumed about 330 MIPS on the ADSP-21262 processor.
We therefore tried to use simpler logic to index the COS table in the above example. The "%" (modulo) operator is not native to the processor; it is implemented by a C run-time library function, which introduces additional cycle overhead.
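A sketch of one such simplification is shown below (an illustration of the general idea, not necessarily the exact restructuring used in the ported decoder): because the COS index in Listing 5 advances by a fixed step of (2*p+1+N/2)*2 for each increment of m, the inner-loop modulo can be replaced by an incrementing index with a conditional wrap-around. The array dimensions and the win table size are assumptions made for illustration.

#define TABLE_LEN 144                 /* 4 * 36, length of the COS table */

/* Illustrative restructuring of the IDCT loop in Listing 5: the inner-loop
   modulo is replaced by an incrementing index with a conditional wrap.
   'in', 'out', 'win', 'COS', and 'block_type' are assumed to be defined
   elsewhere as in the original algorithm (dimensions are illustrative). */
extern float in[18], out[36], COS[TABLE_LEN], win[4][36];
extern int block_type;

void idct_no_modulo(void)
{
    const int N = 36;
    int p, m;

    for (p = 0; p < N; p++)
    {
        int step = (2 * (2 * p + 1 + N / 2)) % TABLE_LEN; /* constant per p   */
        int idx  = (2 * p + 1 + N / 2) % TABLE_LEN;       /* index for m == 0 */
        float sum = 0.0f;

        for (m = 0; m < N / 2; m++)
        {
            sum += in[m] * COS[idx];
            idx += step;
            if (idx >= TABLE_LEN)      /* wrap instead of using '%' */
                idx -= TABLE_LEN;
        }
        out[p] = sum * win[block_type][p];
    }
}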
