
Consider the following snippet of code:

int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) { 
            return &mem[i];
        }
    }
    return nullptr;
}

GCC at -O3 compiles this to:

find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L4                  # why not .L8?
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret
.L4:
        xor     eax, eax
        ret

In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to redirect the jump to .L4 so it targets .L8 instead, and drop .L4 entirely? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. ICC and MSVC, however, each take a noticeably different approach.
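Concretely, the merged form I have in mind would look like this (hand-edited, not actual compiler output):

find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L8                  # was jle .L4
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret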

Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?

  • @MaheshAttarde If it is true, it is a bug of the low-level optimizer. – YSC Apr 18 at 11:34
  • A small discovery: setting -mno-avx drops the duplicated block on GCC (but not on clang). – Alex Reinking Apr 18 at 11:36
  • I've seen multiple rets like this quite a few times, although not exactly with xor like this; still redundant when you could just use the same ret for all code paths. – phuclv Apr 19 at 1:29
  • @AlexReinking: that's really odd; nice discovery that -mtune=skylake merges the blocks, but enabling AVX breaks it again. You could dig deeper by looking at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. But IMO just report this MCVE to gcc's bugzilla with keyword missed-optimization. – Peter Cordes Apr 19 at 7:13
  • -Os also helps. It appears that the duplication happens during expansion (gimple->rtl). – Marc Glisse Apr 19 at 8:43

This is always a missed optimization. Having both return-0 paths use the same basic block would be a pure win on all microarchitectures that current compilers care about.

But unfortunately this missed optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret already present on another path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret. Functions this small would usually get inlined in a complete program, so maybe it doesn't hurt much in real life?)
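Schematically, that pattern looks like this (a hand-written sketch, not output for any particular function):

        test    edi, edi
        je      .L2                  # conditional branch to a separate bare ret...
        ...                          # function body, no stack cleanup needed
        ret
.L2:
        ret                          # ...instead of jumping to the ret above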

Modern CPUs have a return-address predictor stack that easily predicts the branch target for ret instructions. So there's no effect where one ret instruction more often returns to one caller and another ret more often returns to another caller; separating them so they get different predictor entries doesn't help branch prediction. (It might possibly help with -mtune=pentium3 or other ancient CPUs without a return-address-stack predictor, but even then you wouldn't normally spend extra code-size just for that.)

IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of the uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line (https://agner.org/optimize/). So if anything this could be worse for SnB-family, because each return path needs a separate line of the uop cache even though each is only 2 uops total (xor-zero and ret are both single-uop instructions).

Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc

(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP, and fixed a few days later.)


Cause:

You can sort of see how gcc might arrive at two exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } when there's a possibility of the body needing to run 0 times, as gcc did here. So there's one branch that leaves the function without entering the loop at all, while the other exit is the fall-through out of the loop. Perhaps, before some things were optimized away, there was extra cleanup on one of the paths. Or those paths simply got split up when the first branch was created.
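As a source-level sketch (my reconstruction of the rotated form, not anything gcc literally emits), the transformed loop looks like this, with the two return-nullptr paths corresponding to .L4 and .L8:

int* find_ptr_rotated(int* mem, int sz, int val) {
    if (sz <= 0)
        return nullptr;          // exit #1: loop never entered (.L4 in the asm)
    int* p   = mem;
    int* end = mem + sz;         // matches the lea computing the end pointer
    do {
        if (*p == val)
            return p;            // the plain ret inside the loop
    } while (++p != end);
    return nullptr;              // exit #2: fell out of the loop (.L8)
}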

I don't know why gcc fails to notice and merge two identical basic blocks that end with ret.

Maybe it only looked for that in some GIMPLE or RTL pass where the two blocks weren't actually identical, and they only became identical during final x86 code-gen. Maybe after optimizing away the save/restore of a register that would have held some temporary it ended up not needing?

You could dig deeper by looking at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert planning to work on a patch or an idea to help gcc find this optimization, it's probably not worth your time.
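From the command line, a minimal sketch of such an invocation (given Marc Glisse's comment that the duplication happens during expansion, -fdump-rtl-expand is the interesting dump):

g++ -O3 -mavx -S find_ptr.cpp -fdump-tree-optimized -fdump-rtl-expand
# writes find_ptr.cpp.*t.optimized (final GIMPLE) and
# find_ptr.cpp.*r.expand (RTL right after expansion) next to the source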


Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration. e.g. search loops like this or memchr or strlen. So IDK why AVX even makes a difference at all.

(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, with sz=1000, as long as *mem == val. So to auto-vectorize without a guaranteed object size from something like C99's int mem[static sz], the compiler would have to align the pointer... Not that int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)
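To make that concrete, here's a runnable sketch of the guard-page scenario (assuming a POSIX system with mmap and, as a simplification, 4 KiB pages; the values 42 and sz=1000 are arbitrary):

#include <sys/mman.h>
#include <cassert>

int* find_ptr(int* mem, int sz, int val) {   // the function from the question
    for (int i = 0; i < sz; i++)
        if (mem[i] == val)
            return &mem[i];
    return nullptr;
}

int main() {
    const long page = 4096;          // assumption: 4 KiB pages
    // Map two pages read-write, then make the second one inaccessible.
    char* buf = static_cast<char*>(mmap(nullptr, 2 * page,
                                        PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    assert(buf != MAP_FAILED);
    mprotect(buf + page, page, PROT_NONE);

    // The very last int before the inaccessible page holds the match.
    int* last = reinterpret_cast<int*>(buf + page - sizeof(int));
    *last = 42;

    // No UB and no fault despite sz=1000: the search stops at mem[0]
    // before ever touching the protected page.
    assert(find_ptr(last, 1000, 42) == last);
}

Any auto-vectorization strategy that speculatively loaded mem[1..] with a plain vector load could fault here, which is exactly why the compiler would have to align the pointer first.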

  • +1 I filed it here; hopefully the devs can shed some more light on the issue. gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 – Alex Reinking Apr 19 at 8:33
  • Also, you have it backwards: -mtune=skylake breaks it, but disabling AVX merges the blocks. – Alex Reinking Apr 19 at 8:34
  • @AlexReinking: like I explained in comments on the question, -march and -mtune are not the same thing. godbolt.org/z/tkcNn- For the record for future readers, we see this bug both with -O3 (with the default -mtune=generic) and with -O3 -mtune=skylake -mavx, but not with -O3 -mtune=skylake. – Peter Cordes Apr 19 at 17:35
  • Devs seem to be treating this as a bug, so your answer is correct! Accepted. – Alex Reinking Apr 20 at 21:36
  • @AlexReinking: Oops, I did have it partially backwards with the gcc options. I was thinking that bare -O3 showed the bug, but it doesn't. It's purely -mavx that causes the problem, whether you enable it via -march=skylake or directly; tune=skylake or not is irrelevant. – Peter Cordes Apr 20 at 21:48
