author George Steed <george.steed@arm.com> 2024-03-13 16:40:33 +0000
committer Frank Barchard <fbarchard@chromium.org> 2024-04-25 21:21:52 +0000
commit 4f52235a6719eba097ccac60d84bd2c23bad89ed
tree e5be14e85519e37b65983cf92da48ce4fb98294b
parent 53b65220da5cdc3dce0e088cd67e08bdf0a76dd6
[AArch64] Replace SHRN{,2} pair by UZP2 in DivideRow_16_NEON
Shift instructions have worse throughput than other permute
instructions on some micro-architectures, and we can avoid the need for
two separate narrowing instructions by taking the high halves of each
lane directly through use of the UZP2 instruction.

Reduction in runtime for DivideRow_16_NEON:

Cortex-A55: -6.2%
Cortex-A510: -30.0%
Cortex-A76: -11.9%
Cortex-X2: -46.8%

Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
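
For reviewers who want to convince themselves that the two sequences produce the same lanes, here is a minimal scalar sketch in portable C++. It is not libyuv code: the types `Vec4S` and `Vec8H` and the helpers `AsHalves`, `ShrnPair16`, `Uzp2`, `SameResult` and `SelfCheck` are hypothetical names introduced only for this illustration, and the model assumes the usual AArch64 register layout in which 16-bit lane 2i is the low half of 32-bit lane i.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical models of one 128-bit register viewed as .4s and as .8h.
using Vec4S = std::array<uint32_t, 4>;
using Vec8H = std::array<uint16_t, 8>;

// View four 32-bit lanes as eight 16-bit lanes:
// lane 2i holds the low half of word i, lane 2i+1 holds the high half.
Vec8H AsHalves(const Vec4S& v) {
  Vec8H h{};
  for (int i = 0; i < 4; ++i) {
    h[2 * i] = static_cast<uint16_t>(v[i] & 0xFFFFu);
    h[2 * i + 1] = static_cast<uint16_t>(v[i] >> 16);
  }
  return h;
}

// Models "shrn d.4h, a.4s, #16" followed by "shrn2 d.8h, b.4s, #16":
// the low four result lanes come from a, the high four from b.
Vec8H ShrnPair16(const Vec4S& a, const Vec4S& b) {
  Vec8H d{};
  for (int i = 0; i < 4; ++i) {
    d[i] = static_cast<uint16_t>(a[i] >> 16);
    d[i + 4] = static_cast<uint16_t>(b[i] >> 16);
  }
  return d;
}

// Models "uzp2 d.8h, a.8h, b.8h":
// the odd-numbered 16-bit lanes of a, then the odd-numbered lanes of b.
Vec8H Uzp2(const Vec8H& a, const Vec8H& b) {
  Vec8H d{};
  for (int i = 0; i < 4; ++i) {
    d[i] = a[2 * i + 1];
    d[i + 4] = b[2 * i + 1];
  }
  return d;
}

// True when the old two-instruction sequence and the single UZP2 agree.
bool SameResult(const Vec4S& a, const Vec4S& b) {
  return ShrnPair16(a, b) == Uzp2(AsHalves(a), AsHalves(b));
}

// Spot-check a few arbitrary 32-bit products, including the extremes.
void SelfCheck() {
  const Vec4S a = {0x12345678u, 0xFFFFFFFFu, 0x00000000u, 0x8000FFFFu};
  const Vec4S b = {0xDEADBEEFu, 0x0001FFFFu, 0xFFFF0000u, 0x7FFF8000u};
  assert(SameResult(a, b));
  assert(ShrnPair16(a, b)[0] == 0x1234u);  // high half of 0x12345678
  assert(ShrnPair16(a, b)[4] == 0xDEADu);  // high half of 0xDEADBEEF
}
```

Calling `SelfCheck()` from a test harness exercises the equivalence. The sketch only covers the shift amount of 16 with 32-bit source lanes; for other shift amounts a narrowing shift keeps bits that straddle the two halves, so a plain UZP2 would not be a drop-in replacement.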
-rw-r--r-- source/row_neon64.cc 6
1 files changed, 2 insertions, 4 deletions
diff --git a/source/row_neon64.cc b/source/row_neon64.cc
index 68e0d8da..ef0a82d4 100644
--- a/source/row_neon64.cc
+++ b/source/row_neon64.cc
@@ -4680,10 +4680,8 @@ void DivideRow_16_NEON(const uint16_t* src_y,
"umull v2.4s, v3.4h, v4.4h \n"
"umull2 v3.4s, v3.8h, v4.8h \n"
"prfm pldl1keep, [%0, 448] \n"
- "shrn v0.4h, v0.4s, #16 \n"
- "shrn2 v0.8h, v1.4s, #16 \n"
- "shrn v1.4h, v2.4s, #16 \n"
- "shrn2 v1.8h, v3.4s, #16 \n"
+ "uzp2 v0.8h, v0.8h, v1.8h \n"
+ "uzp2 v1.8h, v2.8h, v3.8h \n"
"stp q0, q1, [%1], #32 \n" // store 16 pixels
"subs %w2, %w2, #16 \n" // 16 src pixels per loop
"b.gt 1b \n"