Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6518

Binlog is not synced to disk when binlog is changed in runtime

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 10.1.0
    • Fix Version/s: N/A
    • Component/s: None
    • Labels:
      None
    • Environment:
      Centos 6

      Description

      I've changed binlog_format in runtime from STATEMENT to MIXED and after that binlog was not synced to disk for a few days, after restart all changes was lost.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            elenst Elena Stepanova added a comment -

            Please provide error logs from master and slave, and binary logs from master. You can upload them to our ftp.askmonty.org/private.
            Please also specify correct version of MariaDB server that you use.

            Thanks.

            Show
            elenst Elena Stepanova added a comment - Please provide error logs from master and slave, and binary logs from master. You can upload them to our ftp.askmonty.org/private. Please also specify correct version of MariaDB server that you use. Thanks.
            Hide
            elenst Elena Stepanova added a comment -

            Hi Michal,

            Thanks for the logs. I have some questions about them.

            The log in the archive named "master" is in fact also a slave log. Do you have a chained replication, or multi-master replication, or what is your setup, exactly?

            The master log has several server restarts. Can you pinpoint the time when you changed the binlog format and after which restart your changes were lost?
            From the error log I can guess that you changed the binlog format at 140729 14:42:00, but if so, there is no server restart in the logs after that, only shutdown.

            The master log says at some point that a table was marked as crashed. The previous shutdowns seem to be clean, so there is no obvious reason why tables would have been damaged. Could it be that you had a disk problem or something?

            The master log (which as mentioned earlier is also a slave log) shows several replication errors at different times. Are you aware of those and sure they don't have anything to do with your data loss?

            Finally, the slave log also shows several replication errors, which could attribute for the slave not replicating from the master. Whether they are related or not depends on the timing of your observations.

            Show
            elenst Elena Stepanova added a comment - Hi Michal, Thanks for the logs. I have some questions about them. The log in the archive named "master" is in fact also a slave log. Do you have a chained replication, or multi-master replication, or what is your setup, exactly? The master log has several server restarts. Can you pinpoint the time when you changed the binlog format and after which restart your changes were lost? From the error log I can guess that you changed the binlog format at 140729 14:42:00, but if so, there is no server restart in the logs after that, only shutdown. The master log says at some point that a table was marked as crashed. The previous shutdowns seem to be clean, so there is no obvious reason why tables would have been damaged. Could it be that you had a disk problem or something? The master log (which as mentioned earlier is also a slave log) shows several replication errors at different times. Are you aware of those and sure they don't have anything to do with your data loss? Finally, the slave log also shows several replication errors, which could attribute for the slave not replicating from the master. Whether they are related or not depends on the timing of your observations.
            Hide
            michal.zubkowicz Michal Zubkowicz added a comment -

            It was a master - master replication.
            It was shutdown because i've realized that is something wrong, and i've made a copy of data dir. After start I've realized that data is lost and recovered data from last backup (backup was ok from previous day).
            One table was crashing very often in this version and 5x series. I'm using now Galera Cluster 10.0 and everything is ok. It was something related to triggers and deadlocks.

            Replication errors are not related to data loss, because all data in RAM was ok. All backups are ok, data was not synced to disk only. There was a gap in modification times of binlogs also.

            Show
            michal.zubkowicz Michal Zubkowicz added a comment - It was a master - master replication. It was shutdown because i've realized that is something wrong, and i've made a copy of data dir. After start I've realized that data is lost and recovered data from last backup (backup was ok from previous day). One table was crashing very often in this version and 5x series. I'm using now Galera Cluster 10.0 and everything is ok. It was something related to triggers and deadlocks. Replication errors are not related to data loss, because all data in RAM was ok. All backups are ok, data was not synced to disk only. There was a gap in modification times of binlogs also.
            Hide
            elenst Elena Stepanova added a comment -

            Sorry, it doesn't help much for investigation.

            Could you please describe what exactly you did and what you observed, as precisely as you can, only facts, for now without drawing any conclusions?
            Instead of "realized that is something wrong", "data in RAM was ok", "data was not synced to disk", please put it in the way "I did/checked this and saw that".

            Like,

            • I had replication from .. to ... with <binlog format>;
            • around yy:mm:dd hh:mm I checked replication status on server X by running ... and it was ... ;
            • i changed binlog format on server Y ... ;
            • i checked ... there was ...
              etc.

            Also, please attach your cnf files (but they don't replace describing the problem as above, only supplement it).

            Thanks.

            Show
            elenst Elena Stepanova added a comment - Sorry, it doesn't help much for investigation. Could you please describe what exactly you did and what you observed, as precisely as you can, only facts, for now without drawing any conclusions? Instead of "realized that is something wrong", "data in RAM was ok", "data was not synced to disk", please put it in the way "I did/checked this and saw that". Like, I had replication from .. to ... with <binlog format>; around yy:mm:dd hh:mm I checked replication status on server X by running ... and it was ... ; i changed binlog format on server Y ... ; i checked ... there was ... etc. Also, please attach your cnf files (but they don't replace describing the problem as above, only supplement it). Thanks.
            Hide
            michal.zubkowicz Michal Zubkowicz added a comment -

            I had replication master-master with default binlog format (statement)
            26.07 I've changed binlog format at runtime to MIXED on both servers.
            28.07 was done last backup of data (all data was ok)
            29.07 07:41 There was a lock in database so i had to quickly restart server because database was unusable.
            After that I've seen gap in binlog files also checked that data between two mentioned points is lost.

            Replication is checked every minute by script. So I've got a report when something is wrong. There were nothing unusual.

            Show
            michal.zubkowicz Michal Zubkowicz added a comment - I had replication master-master with default binlog format (statement) 26.07 I've changed binlog format at runtime to MIXED on both servers. 28.07 was done last backup of data (all data was ok) 29.07 07:41 There was a lock in database so i had to quickly restart server because database was unusable. After that I've seen gap in binlog files also checked that data between two mentioned points is lost. Replication is checked every minute by script. So I've got a report when something is wrong. There were nothing unusual.
            Hide
            elenst Elena Stepanova added a comment -

            Hi Michal,

            Thanks a lot, it is much easier to look into the problem this way.

            However, with the provided binlogs, I don't see the gap you are talking about in the time interval that you are giving.

            Both sets of binlogs, "master" and "slave", seem to have a continuous sequence of events for 26.07, 27.07, 28.07, 29.07. In 'master' set it is in mysql-bin-2.000023, and in 'slave' set it is in mysql-bin-2.000006. If there was a data loss, the logs need to be analyzed event by event, to see if the events responsible for the data in question are present there; but apparently the logs were synced to disk somehow.

            If you have examples of events that should have been present in the binary logs but are not, we can try to look deeper into it.

            Also, I don't quite understand the connection you make between allegedly not synced binary logs and the data loss.
            Lets take "master" for example. Even if the binlogs were not stored on disk on whatever reason, and replication to the "slave" didn't work, how would it cause the data loss on the "master" itself? Since you restarted the server nicely, and there was no recovery from the binlog, the data should have been there anyway. Same for the "slave" if it receives any direct updates. If the binlogs were not stored and replication didn't work (which I don't see any indication of so far), the data on the two servers would have been inconsistent, but not lost?

            Show
            elenst Elena Stepanova added a comment - Hi Michal, Thanks a lot, it is much easier to look into the problem this way. However, with the provided binlogs, I don't see the gap you are talking about in the time interval that you are giving. Both sets of binlogs, "master" and "slave", seem to have a continuous sequence of events for 26.07, 27.07, 28.07, 29.07. In 'master' set it is in mysql-bin-2.000023, and in 'slave' set it is in mysql-bin-2.000006. If there was a data loss, the logs need to be analyzed event by event, to see if the events responsible for the data in question are present there; but apparently the logs were synced to disk somehow. If you have examples of events that should have been present in the binary logs but are not, we can try to look deeper into it. Also, I don't quite understand the connection you make between allegedly not synced binary logs and the data loss. Lets take "master" for example. Even if the binlogs were not stored on disk on whatever reason, and replication to the "slave" didn't work, how would it cause the data loss on the "master" itself? Since you restarted the server nicely, and there was no recovery from the binlog, the data should have been there anyway. Same for the "slave" if it receives any direct updates. If the binlogs were not stored and replication didn't work (which I don't see any indication of so far), the data on the two servers would have been inconsistent, but not lost?
            Hide
            michal.zubkowicz Michal Zubkowicz added a comment -

            On both servers there is a gap in binlogs:
            mysql-bin-2.000022 07 26 14:51
            mysql-bin-2.000023 07 29 07:51
            There should be about 3-10 files per day normally not less.

            Same range of data is missing on both servers. So i assumed there is a problem with disk sync after binlog format change, because it was only change I've made, and nothing else was touched, and same gap was in binlog files and the data files.

            Maybe my conclusion was bad.

            Show
            michal.zubkowicz Michal Zubkowicz added a comment - On both servers there is a gap in binlogs: mysql-bin-2.000022 07 26 14:51 mysql-bin-2.000023 07 29 07:51 There should be about 3-10 files per day normally not less. Same range of data is missing on both servers. So i assumed there is a problem with disk sync after binlog format change, because it was only change I've made, and nothing else was touched, and same gap was in binlog files and the data files. Maybe my conclusion was bad.
            Hide
            elenst Elena Stepanova added a comment -

            Timestamps on the logs don't mean much. What's important is the contents.

            Binary logs can be rotated due to only a few reasons:

            • the log size reached max_binlog_size;
            • FLUSH LOGS is executed;
            • server is restarted.

            In your case, it is always the latter. Your binary logs are not even close to 1 Gb which is the default value for max_binlog_size and is not modified in your cnf.
            At the same time, if you check when each of your binary logs ends, and compare it with the error log which shows server restart, every time binlog rotation correlates with server shutdown/restart.
            For example, if we look in the 'master' dir, you have:
            16 logs for 2014-07-24 (and 16 restarts)
            4 logs for 2014-07-25 (and 4 restarts)
            2 logs for 2014-07-26 (and 2 restarts)
            10 logs for 2014-07-29 (and 10 restarts).

            But between 2014-07-26 and 2014-07-29, you didn't have a single server restart, so the binary log didn't rotate even once.
            The log itself contains records for all these days, and generally looks quite normal.

            Now, I don't know why your server usually restarts so frequently. If it news for you, you might want to check your environment.
            Two times it looked like the server was brutally killed, but all other times it's a normal shutdown.
            Btw the times when the server was killed can account for the crashed tables.

            All in all, I don't see anything wrong with the binary logs at all.
            There is still a question about the data loss, but if it did happen, as expected it has nothing to do with the binary logs.
            If it's the range of data that is missing on both servers, it seems more likely that something was accidentally deleted. Since you have all the binary logs, it can even be possible to find out what happened to those records if you can identify them, e.g. by table name + pk value.

            Show
            elenst Elena Stepanova added a comment - Timestamps on the logs don't mean much. What's important is the contents. Binary logs can be rotated due to only a few reasons: the log size reached max_binlog_size; FLUSH LOGS is executed; server is restarted. In your case, it is always the latter. Your binary logs are not even close to 1 Gb which is the default value for max_binlog_size and is not modified in your cnf. At the same time, if you check when each of your binary logs ends, and compare it with the error log which shows server restart, every time binlog rotation correlates with server shutdown/restart. For example, if we look in the 'master' dir, you have: 16 logs for 2014-07-24 (and 16 restarts) 4 logs for 2014-07-25 (and 4 restarts) 2 logs for 2014-07-26 (and 2 restarts) 10 logs for 2014-07-29 (and 10 restarts). But between 2014-07-26 and 2014-07-29, you didn't have a single server restart, so the binary log didn't rotate even once. The log itself contains records for all these days, and generally looks quite normal. Now, I don't know why your server usually restarts so frequently. If it news for you, you might want to check your environment. Two times it looked like the server was brutally killed, but all other times it's a normal shutdown. Btw the times when the server was killed can account for the crashed tables. All in all, I don't see anything wrong with the binary logs at all. There is still a question about the data loss, but if it did happen, as expected it has nothing to do with the binary logs. If it's the range of data that is missing on both servers, it seems more likely that something was accidentally deleted. Since you have all the binary logs, it can even be possible to find out what happened to those records if you can identify them, e.g. by table name + pk value.
            Hide
            michal.zubkowicz Michal Zubkowicz added a comment - - edited

            Thank you very much for explanation. You have right. Still don't understand how data disappeared, for sure it was not deleted accidentally because gap was in many tables. I will try to investigate it and reopen this issue if i will find anything.

            Show
            michal.zubkowicz Michal Zubkowicz added a comment - - edited Thank you very much for explanation. You have right. Still don't understand how data disappeared, for sure it was not deleted accidentally because gap was in many tables. I will try to investigate it and reopen this issue if i will find anything.

              People

              • Assignee:
                elenst Elena Stepanova
                Reporter:
                michal.zubkowicz Michal Zubkowicz
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Due:
                  Created:
                  Updated:
                  Resolved: