UT_RELAX_CPU isn't at all relaxing

Description

See http://bugs.mysql.com/bug.php?id=74832

Patch has languished since Nov 2014.

So i'm just pasting in my original description here:

The code here has seemingly changed in 5.7.5 compared to previous versions. I'm talking about 5.7.5-m15 here.

The purpose of ut_delay() is to spin for a while before InnoDB attempts to again acquire a mutex. Optimizations include (on x86) calling the pause instruction inside the spin loop and (on POWER) setting the thread priority to low for the duration of ut_delay.

Here is the current (MySQL 5.7.5) implementation of ut_delay:
ulint
ut_delay(
/=====/
ulint delay) /*!< in: delay in microseconds on 100 MHz Pentium */
{
ulint i, j;

UT_LOW_PRIORITY_CPU();

j = 0;

for (i = 0; i < delay * 50; i++) {
j += i;
UT_RELAX_CPU();
}

if (ut_always_false) {
ut_always_false = (ibool) j;
}

UT_RESUME_PRIORITY_CPU();

return(j);
}

There are a couple of problems with this code:
1) ut_always_false could quite legitimately be compiled away by the compiler
2) j is actually unneeded and if UT_RELAX_CPU() was not implemented, then the compiler could legitimately completely optimize away the loop

But there's another problem that's a bit hidden....

In ut0ut.h we have the following:
#ifndef UNIV_HOTBACKUP

  1. if defined(HAVE_PAUSE_INSTRUCTION)
    /* According to the gcc info page, asm volatile means that the
    instruction has important side-effects and must not be removed.
    Also asm volatile may trigger a memory barrier (spilling all registers
    to memory). */

  2. ifdef __SUNPRO_CC

  3. define UT_RELAX_CPU() asm ("pause" )

  4. else

  5. define UT_RELAX_CPU() _asm_ _volatile_ ("pause")

  6. endif /* __SUNPRO_CC */

  1. elif defined(HAVE_FAKE_PAUSE_INSTRUCTION)

  2. define UT_RELAX_CPU() _asm_ _volatile_ ("rep; nop")

  3. elif defined(HAVE_ATOMIC_BUILTINS)

  4. define UT_RELAX_CPU() do { \
    volatile lint volatile_var; \
    os_compare_and_swap_lint(&volatile_var, 0, 1); \
    } while (0)

  5. elif defined(HAVE_WINDOWS_ATOMICS)
    /* In the Win32 API, the x86 PAUSE instruction is executed by calling
    the YieldProcessor macro defined in WinNT.h. It is a CPU architecture-
    independent way by using YieldProcessor. */

  6. define UT_RELAX_CPU() YieldProcessor()

  7. else

  8. define UT_RELAX_CPU() ((void)0) /* avoid warning for an empty statement */

  9. endif

Which if HAVE_PAUSE_INSTRUCTION or HAVE_FAKE_PAUSE_INSTRUCTION are defined (i.e. recent x86), you'll get the desired effect, there will be a pause instruction.

However, if you HAVE_ATOMIC_BUILTINS, then you get this:
do { \
volatile lint volatile_var; \
os_compare_and_swap_lint(&volatile_var, 0, 1); \
} while (0)

Which is anything but relaxing. So, on POWER, where we have atomics but not pause instruction, we get that instead of an empty statement.

This likely affects other platforms too (e.g. SPARC, MIPS, ARM, ia64, mips, m68k... basically everything that isn't x86).

What we really want here is instead of that, just a compiler barrier, so that it knows that it cannot optimize away the loop.

Back to ut_delay, if we look at the original PowerPC assembler for this, it's rather larger than it needs to be:

0000000000000380 <._Z8ut_delaym>:
380: fb e1 ff f8 std r31,-8(r1)
384: f8 21 ff b1 stdu r1,-80(r1)
388: 7c 3f 0b 78 mr r31,r1
38c: 7c 21 0b 78 mr r1,r1
390: 1d 03 00 32 mulli r8,r3,50
394: 38 60 00 00 li r3,0
398: 2f a8 00 00 cmpdi cr7,r8,0
39c: 41 9e 00 44 beq cr7,3e0 <._Z8ut_delaym+0x60>
3a0: 39 20 00 00 li r9,0
3a4: 38 e0 00 01 li r7,1
3a8: 60 00 00 00 nop
3ac: 60 00 00 00 nop
3b0: 7c 00 04 ac sync
3b4: 7c 63 4a 14 add r3,r3,r9
3b8: 38 df 00 30 addi r6,r31,48
3bc: 7d 40 30 a8 ldarx r10,0,r6
3c0: 2c 2a 00 00 cmpdi r10,0
3c4: 40 82 00 0c bne 3d0 <._Z8ut_delaym+0x50>
3c8: 7c e0 31 ad stdcx. r7,0,r6
3cc: 40 a2 ff ec bne 3b8 <._Z8ut_delaym+0x38>
3d0: 4c 00 01 2c isync
3d4: 39 29 00 01 addi r9,r9,1
3d8: 7f a9 40 40 cmpld cr7,r9,r8
3dc: 40 9e ff d4 bne cr7,3b0 <._Z8ut_delaym+0x30>
3e0: 3c c2 00 00 addis r6,r2,0
3e4: e9 26 00 00 ld r9,0(r6)
3e8: 2f a9 00 00 cmpdi cr7,r9,0
3ec: 41 9e 00 08 beq cr7,3f4 <._Z8ut_delaym+0x74>
3f0: f8 66 00 00 std r3,0(r6)
3f4: 7c 42 13 78 mr r2,r2
3f8: 38 3f 00 50 addi r1,r31,80
3fc: eb e1 ff f8 ld r31,-8(r1)
400: 4e 80 00 20 blr

The bits that stare at me are the sync and isync instructions. We're executing memory barriers in there! In a loop! When we're meant to be relaxing!

So, once I remove the buggy UT_RELAX_CPU() implementation and simplify ut_delay (patch attached), I end up with:

0000000000000380 <._Z8ut_delaym>:
380: fb e1 ff f8 std r31,-8(r1)
384: f8 21 ff c1 stdu r1,-64(r1)
388: 7c 3f 0b 78 mr r31,r1
38c: 7c 21 0b 78 mr r1,r1
390: 1c 63 00 32 mulli r3,r3,50
394: 7c 69 03 a6 mtctr r3
398: 2f a3 00 00 cmpdi cr7,r3,0
39c: 41 9e 00 08 beq cr7,3a4 <._Z8ut_delaym+0x24>
3a0: 42 00 00 00 bdnz 3a0 <._Z8ut_delaym+0x20>
3a4: 7c 42 13 78 mr r2,r2
3a8: 38 3f 00 40 addi r1,r31,64
3ac: eb e1 ff f8 ld r31,-8(r1)
3b0: 4e 80 00 20 blr
3b4: 00 00 00 00 .long 0x0
3b8: 00 09 00 00 .long 0x90000
3bc: 80 01 00 00 lwz r0,0(r1)

Which is exactly what we should be doing - we go into low priority (mr r1,r1), spin for a while, then resume normal priority (mr r2, r2) and return. We also avoid doing unnecessary work (which is good).

This also may have a positive performance impact on x86 as the extra math and work around there would have to be done, and IIRC modern KVM on x86 will trap the pause instruction and attempt to schedule a vcpu that may hold the lock that we're spinning for.

How to repeat:
look at profiles, or disassemble code and examine it (like I've done above)

Suggested fix:
merge my patch (attached) that fixes this.

Environment

ppc64el

Assignee

Jan Lindström

Reporter

Stewart Smith

Labels

None

Components

Sprint

None

Fix versions

Affects versions

Priority

Major
Configure