InnoDB: Use of large externally-stored fields makes crash recovery lose data


When BLOB or TEXT fields that are too large for the configured redo logs are used, the administrator is told about it only through a rather innocuous-looking message:

InnoDB: ERROR: the age of the last checkpoint is XXX,
InnoDB: which exceeds the log group capacity YYY.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

I would have expected this to mean that InnoDB is stalling in order to make more space in its redo logs. What it actually means is that InnoDB has overwritten its most recent checkpoint in its redo logs. This compromises crash recovery and can cause data loss, or even metadata loss (for example, lost writes to data dictionary tables or to system tablespace data). This is easily reproducible using the attached test case.
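To make the failure mode concrete, here is a toy model of a circular log. It is not InnoDB code: the class name, fields, and byte counts are all made up for illustration. It only shows the arithmetic behind the message, namely that a write which pushes the checkpoint age past the log capacity has wrapped around and overwritten log that was not yet checkpointed.

```python
class ToyRedoLog:
    """Toy circular redo log (hypothetical, for illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity    # usable bytes in the log group
        self.lsn = 0                # total bytes ever written to the log
        self.checkpoint_lsn = 0     # changes before this point are in the data files

    def checkpoint_age(self):
        # Bytes of log that crash recovery would still need to replay.
        return self.lsn - self.checkpoint_lsn

    def append(self, nbytes):
        """Append one batch with no opportunity to checkpoint in between.

        Returns True if the batch overwrote log that recovery still needs.
        """
        self.lsn += nbytes
        return self.checkpoint_age() > self.capacity


log = ToyRedoLog(capacity=100)
print(log.append(60))   # False: age 60 is within capacity 100
print(log.append(60))   # True: age 120 exceeds capacity 100
```

In this model the second append corresponds to the "age of the last checkpoint ... exceeds the log group capacity" condition: nothing stalls, the write simply goes ahead.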

This appears to happen because externally-stored fields are always written to the redo logs in a single batch while the log mutex is held, which makes it impossible to checkpoint during that write. There are at least two possible solutions to this:

1. Allow flushing to "catch up" and checkpoint during large external field writes. This will involve releasing the log mutex during the write, which is likely complex.

2. Disallow (at least optionally) such large writes. Disallowing external field writes which sum to more than 10% of the total redo log space will in theory prevent this problem, because log_free_check() is called before the write of the external field, and (although it has some races) it should ensure that 10% of the log space is available before starting the write.
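As a rough sketch of the check described in option 2, assuming the total redo log space is approximated as innodb_log_file_size multiplied by innodb_log_files_in_group (the real usable capacity is somewhat smaller), the limit could be computed as below. The function names are hypothetical and do not correspond to anything in the InnoDB source.

```python
def max_external_write(log_file_size, log_files_in_group, divisor=10):
    """Largest external field write allowed under the proposed 10% rule.

    Assumption: total redo space ~= log_file_size * log_files_in_group.
    Integer division avoids floating-point rounding at the boundary.
    """
    total = log_file_size * log_files_in_group
    return total // divisor


def write_allowed(field_bytes, log_file_size, log_files_in_group):
    """Return True if the external field fits within 10% of the redo space."""
    return field_bytes <= max_external_write(log_file_size, log_files_in_group)


MiB = 1024 * 1024
# Two 5 MiB log files give 10 MiB of redo space, so the limit is 1 MiB.
print(max_external_write(5 * MiB, 2))        # 1048576
print(write_allowed(4 * MiB, 5 * MiB, 2))    # False: 4 MiB exceeds the limit
```

This is the same ratio the error message already recommends (combined log size at least 10 times the largest row), only enforced up front instead of reported after the fact.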

This issue exists in all versions of MySQL and MariaDB.






Jeremy Cole


