We've run into an issue with MariaDB while running the Sysbench "oltp.lua" test with 8 threads. The server daemon crashed, mostly with an assertion failure at storage/xtradb/fil/fil0fil.c:5288:
An attached debugger gave the following backtrace:
Once the daemon has crashed, we have sometimes been unable to start it again without wiping the database and re-installing it.
Having done some digging it is apparent that there is a problem in the mutex_exit code path; in particular at:
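A load-acquire is used to exit the mutex rather than a store-release. This leads to unpredictable results on architectures with a weak memory model, because stores made inside the critical section are not guaranteed to become visible to other cores before the store that clears the lock word.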
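For illustration only, the two release styles can be contrasted using GCC's legacy __sync builtins. This is a minimal sketch and not the XtraDB source: the function names and the lock word parameter are hypothetical, and I am assuming the exit path behaves like an atomic exchange. GCC documents __sync_lock_test_and_set as an acquire barrier only, and __sync_lock_release as a release barrier.

```c
/* Hypothetical names; this is a sketch, not the XtraDB implementation. */

/* Clears the lock word with an atomic exchange. GCC documents this builtin
 * as an acquire barrier only, so stores made earlier inside the critical
 * section are not ordered before the store that clears the word. */
static void exit_acquire_only(volatile unsigned char *word)
{
    (void) __sync_lock_test_and_set(word, 0);
}

/* Clears the lock word with release semantics: all earlier stores are
 * visible to other cores before the word is observed as 0. */
static void exit_with_release(volatile unsigned char *word)
{
    __sync_lock_release(word);
}
```

Both functions leave the lock word at 0, so the difference does not show up in a single-threaded test; it only matters for the order in which other cores observe the writes.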
We have the following in program order:
However, the following sequence of events can be observed by another core:
The above can lead (and on our test system has led) to severe data corruption that prevents the daemon from even restarting.
I've attached an emergency patch that re-introduces __sync_lock_release to release the mutex. This fixes the crash and data corruption issues for me, but I understand from comments in the code that there were issues with this function in the past. Could the gcc intrinsics be moved over to the __atomic_* functions? Ideally:
To acquire the lock:
To release the lock:
(which also worked on my test system).
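As a rough sketch of what an acquire/release pairing with the __atomic_* builtins could look like (not the attached patch, and not necessarily the exact calls I tested): sketch_word, sketch_lock, sketch_unlock, and run_demo are hypothetical names, and the pthread harness exists only to exercise the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names; not InnoDB/XtraDB identifiers. */
static bool sketch_word = false;   /* lock word: true means held */
static long sketch_counter = 0;    /* data protected by the lock */

/* Acquire: spin until the previous value was false. */
static void sketch_lock(void)
{
    while (__atomic_test_and_set(&sketch_word, __ATOMIC_ACQUIRE)) {
        /* busy-wait; a real mutex would back off or sleep */
    }
}

/* Release: store-release so earlier writes are visible before the clear. */
static void sketch_unlock(void)
{
    __atomic_clear(&sketch_word, __ATOMIC_RELEASE);
}

static void *sketch_worker(void *arg)
{
    long iters = *(long *) arg;
    for (long i = 0; i < iters; i++) {
        sketch_lock();
        sketch_counter++;
        sketch_unlock();
    }
    return NULL;
}

/* Runs nthreads workers (capped at 64) and returns the final counter. */
static long run_demo(int nthreads, long iters)
{
    pthread_t threads[64];
    if (nthreads > 64) {
        nthreads = 64;
    }
    sketch_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, sketch_worker, &iters);
        assert(rc == 0);
        (void) rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return sketch_counter;
}
```

With correct acquire/release ordering, run_demo(8, 100000) should return 800000 on any architecture. Build with -pthread.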
I believe this issue may affect other versions of MariaDB, but I've only tested 5.5.36.