Guess why A64 no longer has the LDM and STM instructions and uses LDP and STP instead?
I. Preface
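As we know, AArch32 has the burst-transfer instructions LDM and STM, which can transfer several values at once. How many exactly? According to the manual: load and store multiple registers; any combination of registers r0 to r15 can be transferred in ARM state.
In other words, a single instruction can move a large number of general-purpose registers.
In AArch64, however, these instructions were dropped and replaced by LDP and STP, which are fixed at no more than two registers per instruction. Why?
The rest of this article explores that question.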
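To make the contrast concrete, here is a minimal C sketch. It relies on one fact about the A32 encoding: the LDM/STM register list is a 16-bit mask in which bit n stands for rn. The helper names `reglist_count` and `pair_instructions` are made up for this illustration and come from no Arm document.

```c
#include <stdint.h>

/* Count the registers named in an A32 LDM/STM register list.
 * The list is a 16-bit mask: bit n set means rn is transferred. */
static unsigned reglist_count(uint16_t reglist)
{
    unsigned n = 0;
    while (reglist) {
        reglist &= (uint16_t)(reglist - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: number of pair-sized transfers needed to move
 * nregs registers two at a time, rounding up. In real A64 code an odd
 * leftover register would use a single LDR/STR rather than LDP/STP. */
static unsigned pair_instructions(unsigned nregs)
{
    return (nregs + 1) / 2;
}
```

For the list {r4-r7} (mask 0x00F0), `reglist_count` returns 4, so one LDM moves four registers, while the same number of registers moved two at a time takes `pair_instructions(4)`, that is, 2 instructions.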
II. Gathering references
Reference: Cortex_A57_Software_Optimization_Guide_external
This document says the benefit of using LDP for memcopy is that it makes the best possible use of the load and store pipelines:
4.5 Load/Store Throughput
The Cortex-A57 processor includes separate load and store pipelines, which allow it to execute one load μop and one store μop every cycle.
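In other words: because the A57 has separate, independent load and store pipelines, together with multi-issue it may execute two instructions (an ldr and a str) in the same cycle.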
To achieve maximum throughput for memory copy (or similar loops), one should do the following.
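That is, the guidelines for maximum memory-copy throughput are:
Unroll the loop to include multiple load and store operations per iteration, minimizing the overheads of looping. (A familiar technique: unrolling reduces loop overhead, that is, it lowers the share of loop-control instructions in the total number of instructions executed, and with it the number of branches that have to be predicted.)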
Use discrete, non-writeback forms of load and store instructions (such as LDRD and STRD), interleaving them so that one load and one store operation may be performed each cycle. Avoid load-/store-multiple instruction encodings (such as LDM and STM), which lead to separated bursts of load and store μops which may not allow concurrent utilization of both the load and store pipelines.
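In other words: use discrete, non-writeback memory instructions and interleave them, so that the dual load/store pipelines are used to the full; avoid LDM and STM, because they produce separate bursts of load and store μops and so cannot make good use of the two pipelines.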
The following example shows a recommended instruction sequence for a long memory copy in AArch32 state:
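The guide's own AArch32 listing is not reproduced in this text. As a rough C-level illustration of the same structure (not the guide's code), the sketch below unrolls the copy loop and alternates a pair of loads with a pair of stores. The function name `copy_interleaved` and the 32-byte step are assumptions made for this example; whether a compiler turns each pair into LDRD/STRD or LDP/STP depends on the target, the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes from src to dst, 32 bytes per loop iteration.
 * Inside the loop, two 64-bit loads are followed by two 64-bit stores,
 * so load and store operations alternate instead of forming long bursts. */
static void copy_interleaved(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 32) {
        uint64_t a, b;
        /* Fixed-size memcpy is the portable way to express an unaligned
         * 64-bit load or store; compilers normally inline it. */
        memcpy(&a, s, 8);      memcpy(&b, s + 8, 8);   /* load pair  */
        memcpy(d, &a, 8);      memcpy(d + 8, &b, 8);   /* store pair */
        memcpy(&a, s + 16, 8); memcpy(&b, s + 24, 8);  /* load pair  */
        memcpy(d + 16, &a, 8); memcpy(d + 24, &b, 8);  /* store pair */
        s += 32;
        d += 32;
        n -= 32;
    }
    while (n > 0) {            /* byte-wise tail for the remainder */
        *d++ = *s++;
        n--;
    }
}
```

The unrolling amortizes the loop branch over four loads and four stores, which is the first guideline; the load/store alternation mirrors the second.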
Reference: ARM® Cortex®-A Series Programmer's Guide for ARMv8-A, Version: 1.0
3.2 prefetch
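Prefetching tells the memory subsystem: I am going to need this data very soon, so please get it ready! It is a bit like pulling strings in real life: give notice ahead of time and everything is prepared before you arrive, VIP style.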
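As a small sketch of what "giving notice ahead of time" can look like in C, the loop below uses `__builtin_prefetch`, a GCC/Clang builtin (not part of standard C) that on AArch64 is typically lowered to a PRFM instruction. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are assumptions for this example; a useful distance depends on the core and the memory system and has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in elements; tune by measurement. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting that data a few elements ahead
 * will be read soon. The hint never changes the result. */
static uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0: prefetch for reading; locality = 3: keep in cache */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

A prefetch is only a hint: the hardware may ignore it, and for a simple sequential walk like this one the hardware prefetcher may already do the job, so any gain should be checked on the target.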
3.3 LDP/STP
First of all, 64-bit mode does not have the LDM and STM instructions of 32-bit mode:
