Mysql slave stop error
Stops the replication threads. STOP SLAVE requires the SUPER privilege. Recommended best practice is to execute STOP SLAVE on the replica before stopping the replica server (see Section 5.1.16, “The Server Shutdown Process”, for more information).
When using the row-based logging format : You should execute STOP SLAVE or STOP SLAVE SQL_THREAD on the replica prior to shutting down the replica server if you are replicating any tables that use a nontransactional storage engine (see the Note later in this section).
Like START SLAVE , this statement may be used with the IO_THREAD and SQL_THREAD options to name the thread or threads to be stopped. Note that the Group Replication applier channel ( group_replication_applier ) has no replication I/O thread, only a replication SQL thread. Using the SQL_THREAD option therefore stops this channel completely.
STOP SLAVE causes an implicit commit of an ongoing transaction. See Section 13.3.3, “Statements That Cause an Implicit Commit”.
gtid_next must be set to AUTOMATIC before issuing this statement.
You can control how long STOP SLAVE waits before timing out by setting the rpl_stop_slave_timeout system variable. This can be used to avoid deadlocks between STOP SLAVE and other SQL statements using different client connections to the replica. When the timeout value is reached, the issuing client returns an error message and stops waiting, but the STOP SLAVE instruction remains in effect. Once the replication threads are no longer busy, the STOP SLAVE statement is executed and the replica stops.
Some CHANGE MASTER TO statements are allowed while the replica is running, depending on the states of the replication SQL thread and the replication I/O thread. However, using STOP SLAVE prior to executing CHANGE MASTER TO in such cases is still supported. See Section 13.4.2.1, “CHANGE MASTER TO Statement”, and Section 16.3.7, “Switching Sources During Failover”, for more information.
The optional FOR CHANNEL channel clause enables you to name which replication channel the statement applies to. Providing a FOR CHANNEL channel clause applies the STOP SLAVE statement to a specific replication channel. If no channel is named and no extra channels exist, the statement applies to the default channel. If a STOP SLAVE statement does not name a channel when using multiple channels, this statement stops the specified threads for all channels. This statement cannot be used with the group_replication_recovery channel. See Section 16.2.2, “Replication Channels” for more information.
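Taken together, the statement forms described above can be sketched like this (the channel name ch1 is a placeholder, not taken from the manual text):

```sql
STOP SLAVE;                     -- stop both replication threads
STOP SLAVE IO_THREAD;           -- stop only the replication I/O thread
STOP SLAVE SQL_THREAD;          -- stop only the replication SQL thread
STOP SLAVE FOR CHANNEL 'ch1';   -- stop the threads of one named channel
```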
When using statement-based replication : changing the source while it has open temporary tables is potentially unsafe. This is one of the reasons why statement-based replication of temporary tables is not recommended. You can find out whether there are any temporary tables on the replica by checking the value of Slave_open_temp_tables ; when using statement-based replication, this value should be 0 before executing CHANGE MASTER TO . If there are any temporary tables open on the replica, issuing a CHANGE MASTER TO statement after issuing a STOP SLAVE causes an ER_WARN_OPEN_TEMP_TABLES_MUST_BE_ZERO warning.
When using a multithreaded replica ( slave_parallel_workers is a nonzero value), any gaps in the sequence of transactions executed from the relay log are closed as part of stopping the worker threads. If the replica is stopped unexpectedly (for example due to an error in a worker thread, or another thread issuing KILL ) while a STOP SLAVE statement is executing, the sequence of executed transactions from the relay log may become inconsistent. See Section 16.4.1.32, “Replication and Transaction Inconsistencies”, for more information.
If the current replication event group has modified one or more nontransactional tables, STOP SLAVE waits for up to 60 seconds for the event group to complete, unless you issue a KILL QUERY or KILL CONNECTION statement for the replication SQL thread. If the event group remains incomplete after the timeout, an error message is logged.
Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs
Quite recently, while setting up MySQL replication, starting the slave produced the error: Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs
Below I will explain its cause and how to fix it quickly.
Environment: Ubuntu 18.04, Oracle MySQL 5.7.33;
Task: fix the slave startup problem — The slave I/O thread stops because master and slave have equal MySQL server UUIDs
At work, colleagues installed MySQL on their own for the needs of their specific software, then cloned the virtual machine and tried to set up Master -> Slave replication, but it did not work.
When executing the start slave command they got the error "Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs".
It could be seen by executing show slave status\G on the slave server; see the output below:
We can see that the I/O thread failed to start with error 1593 — Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs; these UUIDs must be different for replication to work.
The meaning of the error is very simple: the slave server has the same UUID as the master. But how? Because the VM was a full clone of the master.
It is easy to confirm that the UUIDs are identical; run the command below on both the master and the slave server:
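The command itself did not survive in this copy of the article; one standard way to read the server UUID is the server_uuid system variable:

```sql
mysql> SELECT @@GLOBAL.server_uuid;
```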
Both will show one and the same value, "54460172-c5d3-11e9-b76c-000c298bc5b7".
How do we fix the UUID?
It's simple: Oracle MySQL stores it in the file /var/lib/mysql/auto.cnf
Let's check; run the following on the slave server:
Now stop MySQL on the slave server, delete the file /var/lib/mysql/auto.cnf and start MySQL again:
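The exact commands were not preserved in this copy of the article; on Ubuntu with the default datadir they would look roughly like this (the service name and the path are assumptions, adjust them to your setup):

```shell
# Assumes the Ubuntu service name "mysql" and the default datadir
# /var/lib/mysql - adjust both if your setup differs.
systemctl stop mysql
rm /var/lib/mysql/auto.cnf   # mysqld regenerates it with a new server_uuid
systemctl start mysql
```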
Check the UUID on the slave server again:
We can see that the UUID has changed. Great. Now let's check the replication status:
We can see that Slave_IO_Running is now Yes and the error is gone, i.e. replication is working. We can also see that the Master_UUID row now shows the familiar UUID of the master server, which we did not change.
That's all for now. If you have questions or would like me to help you, you can always contact me.
mariadb / mysql slave cannot stop, then crashes on startup with segmentation fault or abort
Let's show you what the following means:
“That’s specifics of InnoDB data storage.”
Here is what this specific case will teach you:
- basic debugging with strace
- MySQL could hang in an infinite loop during a shutdown (which can be seen with strace)
- InnoDB files could get corrupted without a hardware issue, for example by a kill -KILL (or possibly the files were already corrupted before the shutdown and the kill?)
- how to list the opened files (and their IDs = file descriptors) of a mysql process
- how to start a mysql slave without starting the replication on startup, using “skip-slave-start”
- how to remove a corrupted InnoDB file that crashes the database (mysql process) and leaves you with no database at all – replace the InnoDB file, start the server, then drop the InnoDB table with an SQL command and recreate it to continue with a healthy table
- 2 rows (4 INTEGER columns) could eat up a lot of space – 12G and probably growing without bound! === “That’s specifics of InnoDB data storage.”
This table file is 12 Gigabytes in size!

MariaDB [mysql]> select * from gtid_slave_pos;
+-----------+----------+-----------+-------------+
| domain_id | sub_id   | server_id | seq_no      |
+-----------+----------+-----------+-------------+
|         0 | 16983943 |       101 | 45790450502 |
|         0 | 16983944 |       101 | 45790450503 |
+-----------+----------+-----------+-------------+
2 rows in set (0.00 sec)
This is a specific case involving replication, but you may not be using replication at all; your problem table could be a different one (does your mariadb/mysql server crash on startup, on selecting from a specific table, or on shutdown?). Find the problem table and remove it – here we show you how to do it! BACKUP the mysql datadir BEFORE any intervention!
Here is the situation: a mariadb (mysql) server running as a slave to a really busy master server, so you could expect some 10 000 update/insert/delete queries. Everything was working until we wanted to shut down the mysql process, which turned out to be impossible.
STEP 1) We tried everything from “systemctl stop mysql” to kill -TERM multiple times
For 5 hours the mysql process kept running with 2000 open file descriptors pointing to multiple table files. The strace showed this:

[pid 14824] <... io_getevents resumed> []{0, 500000000}) = 0
[pid 14815] <... io_getevents resumed> []{0, 500000000}) = 0
[pid 14824] io_getevents(139723876253696, 1, 256, <unfinished ...>
[pid 14815] io_getevents(139723876356096, 1, 256, <unfinished ...>
[pid 14825] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 14825] futex(0x55d16885deb0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 14825] futex(0x55d16885dedc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3576261, {1525268239, 312230000}, ffffffff <unfinished ...>
[pid 14852] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 14852] futex(0x55d16885dd30, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 14852] futex(0x55d16885dd5c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 2775717, {1525268240, 308225000}, ffffffff <unfinished ...>
[pid 14825] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 14825] futex(0x55d16885deb0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 14825] futex(0x55d16885dedc, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3576263, {1525268240, 304889000}, ffffffff <unfinished ...>
[pid 14862] <... nanosleep resumed> NULL) = 0
[pid 14862] nanosleep({1, 0}, <unfinished ...>
[pid 14821] <... io_getevents resumed> []{0, 500000000}) = 0
[pid 14821] io_getevents(139723876315136, 1, 256, <unfinished ...>
[pid 14820] <... io_getevents resumed> []{0, 500000000}) = 0
And this kept repeating for hours, without any disk activity from flushing IO… so there was no use waiting for something that apparently would never finish.
STEP 2) So a kill -9 was used to stop the mysql process
What could go wrong? InnoDB is awesome and cannot get corrupted if the hardware is OK, right? Yeahhh right…
When the start command was executed, the mysql process started and the InnoDB engine recovery completed successfully; then, after 5 minutes without a listening socket and with heavy IO reading, came the segmentation fault crash – and YOU have no database….
So here is one of the crashes taken from the log:

2018-05-02 19:51:54 139990018914496 [ERROR] Plugin 'innodb' already installed
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Uses event mutexes
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Compressed tables use zlib 1.2.8
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Using Linux native AIO
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Number of pools: 1
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Using SSE2 crc32 instructions
2018-05-02 19:51:54 139990018914496 [Note] InnoDB: Initializing buffer pool, total size = 64G, instances = 8, chunk size = 128M
2018-05-02 19:51:56 139990018914496 [Note] InnoDB: Completed initialization of buffer pool
2018-05-02 19:51:56 139913709942528 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-05-02 19:51:56 139990018914496 [Note] InnoDB: Highest supported file format is Barracuda.
2018-05-02 19:51:56 139990018914496 [Note] InnoDB: Starting crash recovery from checkpoint LSN=311205983427362
2018-05-02 19:51:57 139990018914496 [Note] InnoDB: Last binlog file '/var/log/mysql-binlog/mysql-bin.227009', position 38590879
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: 128 out of 128 rollback segments are active.
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: Waiting for purge to start
2018-05-02 19:52:12 139990018914496 [Note] InnoDB: 5.7.21 started; log sequence number 311205983427371
2018-05-02 19:52:12 139915446597376 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2018-05-02 19:52:12 139990018914496 [Note] Plugin 'FEEDBACK' is disabled.
2018-05-02 19:52:12 139990018914496 [Note] Recovering after a crash using tc.log
2018-05-02 19:52:12 139990018914496 [Note] Starting crash recovery...
2018-05-02 19:52:12 139990018914496 [Note] Crash recovery finished.
2018-05-02 19:52:12 139990018914496 [Note] Server socket created on IP: '0.0.0.0'.
2018-05-02 19:52:12 139990018914496 [Warning] 'user' entry 'root@srvdns2' ignored in --skip-name-resolve mode.
2018-05-02 19:52:12 139990018914496 [Warning] 'proxies_priv' entry '@% root@srvdns1' ignored in --skip-name-resolve mode.
2018-05-02 19:53:52 139915446597376 [Note] InnoDB: Buffer pool(s) load completed at 180502 19:53:52
180502 19:57:12 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.
Server version: 10.2.13-MariaDB-10.2.13+maria~xenial
key_buffer_size=2147483648
read_buffer_size=262144
max_used_connections=0
max_threads=10002
thread_count=6
It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 6871600 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f40700009a8
Attempting backtrace. You can use the following information to find out where mysqld died.
If you see no messages after this, something went terribly wrong...
stack_bottom = 0x7f515c4fecf8 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f51f54eb7e5]
/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f51f558d15c]
/lib/x86_64-linux-gnu/libc.so.6(+0x117160)[0x7f51f558b160]
/lib/x86_64-linux-gnu/libc.so.6(+0x1190a7)[0x7f51f558d0a7]
/usr/sbin/mysqld(my_addr_resolve+0xde)[0x55a9e9f1832e]
/usr/sbin/mysqld(my_print_stacktrace+0x1e2)[0x55a9e9eff2e2]
/usr/sbin/mysqld(handle_fatal_signal+0x345)[0x55a9e9999b95]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f51f5eda390]
/lib/x86_64-linux-gnu/libc.so.6(+0x9f849)[0x7f51f5513849]
/usr/sbin/mysqld(insert_dynamic+0x2a)[0x55a9e9ed3a4a]
/usr/sbin/mysqld(_Z25rpl_load_gtid_slave_stateP3THD+0x424)[0x55a9e98c43a4]
/usr/sbin/mysqld(handle_slave_background+0xe7)[0x55a9e97834e7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f51f5ed06ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f51f557b41d]
======= Memory map: ========
55a9e933e000-55a9ea423000 r-xp 00000000 08:03 148565 /usr/sbin/mysqld
55a9ea622000-55a9ea6f4000 r--p 010e4000 08:03 148565 /usr/sbin/mysqld
55a9ea6f4000-55a9ea7aa000 rw-p 011b6000 08:03 148565 /usr/sbin/mysqld
55a9ea7aa000-55a9eb03d000 rw-p 00000000 00:00 0
55a9eba46000-55aa4e67b000 rw-p 00000000 00:00 0 [heap]
STEP 3) So the default way of repairing InnoDB is to use innodb_force_recovery in your my.cnf configuration file:
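A minimal my.cnf sketch of the setting (start at the lowest level, 1, and raise it only if the server still will not start):

```ini
[mysqld]
innodb_force_recovery = 1
```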
But it crashed again and again, even after enabling all the innodb_force_recovery options: 1, 2, 3, 4, 5 and finally 6. And when using “innodb_force_recovery=4”, 5 and 6 we got a strange additional error:
The innodb_force_recovery did not help, we still could not start our MySQL database, but two things stood out:
- No errors were reported by the InnoDB engine – “InnoDB: Buffer pool(s) load completed”, and crash recovery always finished without errors
- MySQL starts successfully, but without a listening socket, and then after 3-5 minutes of extensive IO reading from the disk the mysql process crashes
It was as if something big was being loaded right at start and was in fact needed at the very beginning. So what starts immediately after a successful start of the mysql process?
Remember, this is a slave!
STEP 4) So we tried to prevent the replication from starting when the mysql process starts, with skip-slave-start:
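The option meant here goes in my.cnf (it can also be given on the command line as --skip-slave-start):

```ini
[mysqld]
skip-slave-start
```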
STEP 5) strace and file descriptors to find the offending table file
The skip-slave-start option DID the trick – NO segmentation fault, and the mysql process began listening on its sockets immediately after the engines loaded successfully. So the big offender was probably something connected with the replication! A big table used by the replication? How to find it? OK: remove the “skip-slave-start” option, start the mysql process and wait until the heavy IO reads kick in, then strace the mysql process (find the ID of the process with ps – the first number of the following command's output is the ID of the mysql process):
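The exact commands were not preserved in this copy of the article; a sketch of the usual way to do it (the PID value 12345 is a placeholder):

```shell
# The first number in each line of the ps output is the process ID.
ps ax | grep '[m]ysqld'
# Attach strace to the running process, following all threads (-f);
# replace 12345 with the PID found above.
strace -f -p 12345
```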
You see multiple lines (thousands) with pread64 calls. pread64 reads from a file descriptor; you can then list all the file descriptors of the mysql process ID with:
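The listing command itself is not quoted in this copy; two common ways to see a process's open files (the PID 12345 is again a placeholder) are /proc and lsof:

```shell
# Each symlink under /proc/<pid>/fd maps a file descriptor number to a file.
ls -l /proc/12345/fd
# Alternatively, list all files opened by the process:
lsof -p 12345
```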
And the file descriptor in question points to the table file gtid_slave_pos.ibd. And when you check the size: 12G….. 12G for this table? Why? What is it used for?
STEP 6) And from the manual:
Hmm, “The table is updated with the new position as part of each transaction committed during replication” – and a 12G table. Could it be tracking millions of transactions? Probably not; in fact this table has only 2-4 rows at any given time. But the size is 12G. Well, as we said, this is:
“That’s specifics of InnoDB data storage. Did you try to run OPTIMIZE TABLE mysql.gtid_slave_pos? It should allow to reclaim the disk space.” (taken from: https://jira.mariadb.org/browse/MDEV-12318).
STEP 7) Remove the corrupted InnoDB file, which crashes the database (mysql process) and leaves you with no database at all, and then recreate the table
Restart your mysql server with “skip-slave-start” again to be able to start it (see above).
So a table with probably a couple of rows takes 12G, and MariaDB uses it for the replication at startup; but the replication position is kept in another place (“master.info”). Can we delete this offending file “gtid_slave_pos.ibd”? Yes we can: move the file out of its place (the mysql datadir) and then create a second table with:
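The statement itself is not quoted in this copy of the article; on a healthy MariaDB 10.2 server, show create table mysql.gtid_slave_pos gives roughly this schema (used here for the temporary gtid_slave_pos1):

```sql
CREATE TABLE gtid_slave_pos1 (
  domain_id int(10) unsigned NOT NULL,
  sub_id bigint(20) unsigned NOT NULL,
  server_id int(10) unsigned NOT NULL,
  seq_no bigint(20) unsigned NOT NULL,
  PRIMARY KEY (domain_id, sub_id)
) ENGINE=InnoDB COMMENT='Replication slave GTID position';
```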
Copy gtid_slave_pos1.ibd to gtid_slave_pos.ibd and restart the mysql process. It will report an error for the table mysql.gtid_slave_pos, but you will be able to drop the table and then create it with the same name, “gtid_slave_pos” (you can drop the temporary “gtid_slave_pos1” afterwards).
The CREATE TABLE statement was taken from another server; it is probably a good idea to get it yourself: log in on a healthy mysql server and issue “show create table mysql.gtid_slave_pos;”.
So now you have a healthy mysql.gtid_slave_pos to be used for your replication. Restart your mysql server after removing “skip-slave-start” from your configuration file, and there it is: the replication is OK and running:
A MySQL GTID replication error can happen unexpectedly for various reasons, and as often as with classic master-slave, circular or multi-master replication. In this short tutorial we will learn, step by step, how to skip a MySQL GTID transaction error in order to keep our MySQL slave server in sync. Before we begin, note that all the steps below were tested in a master-slave setup using MySQL 5.6 and MySQL 5.7; we think this transaction-skipping method can easily be applied to newer MySQL versions like 8+ as well. Please note that all operations below were performed on the slave server only; no queries or changes on the master were needed.
Table of Contents
MySQL GTID Replication Error
This tutorial about the MySQL GTID Replication Error contains the following sections:
Show MySQL Slave Status via CLI
Stop MySQL Slave
Fix MySQL GTID Replication Error
Skip MySQL GTID Replication Error
Start MySQL Slave
Check MySQL Slave Replication Status
Show MySQL Slave Status via CLI
Let’s start by checking our MySQL slave server status via the CLI by executing a show slave status\G query, as shown in the example below:
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.1.1.1
Master_User: dbrepuser
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: bin.000706
Read_Master_Log_Pos: 374830771
Relay_Log_File: dbm-relay-bin.000713
Relay_Log_Pos: 959223458
Relay_Master_Log_File: bin.000691
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1396
Last_Error: Error 'Operation ALTER USER failed for 'dummyuser'@'10.1.%.%'' on query. Default database: 'mysql'. Query: 'ALTER USER 'dummyuser'@'10.1.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*1D1219A2256FC58143DF98BA9457EFE385AD7B2C''
Skip_Counter: 0
Exec_Master_Log_Pos: 959223257
Relay_Log_Space: 16698942535
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1396
Last_SQL_Error: Error 'Operation ALTER USER failed for 'dummyuser'@'10.1.%.%'' on query. Default database: 'mysql'. Query: 'ALTER USER 'dummyuser'@'10.1.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*1D1219A2256FC58143DF98BA9457EFE385AD7B2C''
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID: bdf3bf63-0cc9-11e8-89e8-000d3a365fa6
Master_Info_File: /mnt/mysql-data/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 180628 06:29:51
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: bdf3bf63-0cc9-11e8-89e8-000d3a365fa6:58869595-59928201
Executed_Gtid_Set: b2c84f0c-5dcb-11e8-a550-005056847b4c:1-13, bdf3bf63-0cc9-11e8-89e8-000d3a365fa6:52044900-59506053
Auto_Position: 0
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
We can clearly see on the row called Slave_SQL_Running that our slave isn’t executing any new SQL queries received from the master, due to a SQL query error shown on the row Last_SQL_Error.
Stop MySQL Slave
Our next step is to stop the MySQL slave by executing a SQL query in order to fix the replication error. We won’t stop the mysql / mysqld service (daemon) for this; we just need to stop the process that handles the replication. Let’s run the SQL query that will stop the slave process for us:
mysql> STOP SLAVE;
By executing the SQL query show slave status\G once again, we should now see that our slave is stopped and no replication is happening between master and slave:
mysql> SHOW SLAVE STATUS\G
...
Slave_IO_Running: No
Slave_SQL_Running: No
...
Fix MySQL GTID Replication Error
Knowing now that our MySQL replication is stopped let’s take a closer look at the error message.
Last_SQL_Error: Error 'Operation ALTER USER failed for 'dummyuser'@'10.1.%.%'' on query. Default database: 'mysql'. Query: 'ALTER USER 'dummyuser'@'10.1.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*1D1219A2256FC58143DF98BA9457EFE385AD7B2C''
The error message says that the ALTER USER query failed for the user named dummyuser. Let’s check our mysql.user table to identify the issue with this query.
mysql> use mysql;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select user, host, authentication_string from user where user like '%dummyuser%';
+----------------+------+-------------------------------------------+
| user | host | authentication_string |
+----------------+------+-------------------------------------------+
| dummyuser | 10.% | *1B2319A2556FC59343FE08BC9027EFF385AD7A1D |
+----------------+------+-------------------------------------------+
1 row in set (0.00 sec)
We can now see the differences between the failing SQL query and the actual data in our table. The first issue is in the host column: the error shows the user’s host as 10.1.%.%, but in our table we have 10.%. The second issue is with the authentication_string column: the password has been updated and no longer matches. The conclusion is that the password for our dummyuser was updated on the master, but on the slave server we didn’t have a record matching the criteria – meaning that someone updated the record on the slave manually, replacing the host entry with 10.%.
To avoid such situations, make sure that you never perform any changes on the slave; all SQL changes must be carried out on the master server, as it is the data source for our slave.
Let’s fix this error by updating our record on the slave server.
mysql> UPDATE user SET host='10.1.%.%' WHERE host='10.%' AND user='dummyuser';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> ALTER USER 'dummyuser'@'10.1.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*1D1219A2256FC58143DF98BA9457EFE385AD7B2C';
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Using the queries above we’ve updated our host and authentication_string (hashed password) values to match the records from our master server. Now we have the same data on the slave as on the master; no data was lost.
Skip MySQL GTID Replication Error
The classic method for skipping master-slave errors won’t work in this particular case, as our replication uses GTIDs.
mysql> SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
ERROR 1858 (HY000): sql_slave_skip_counter can not be set when the server is running with @@GLOBAL.GTID_MODE = ON. Instead, for each transaction that you want to skip, generate an empty transaction with the same GTID as the transaction
The error above is expected as we’re running a MySQL GTID replication setup.
Let’s have a look at the last transaction executed on the slave, the one that failed:
Executed_Gtid_Set: b2c84f0c-5dcb-11e8-a550-005056847b4c:1-13, bdf3bf63-0cc9-11e8-89e8-000d3a365fa6:52044900-59506053
We now know that transaction number 59506053, coming from our master server with the UUID bdf3bf63-0cc9-11e8-89e8-000d3a365fa6, failed.
Having fixed the record manually on the slave, we can now insert an empty transaction in order to bypass the error. We increment the failed transaction number by one and set it as the next GTID via a SQL query, as shown below:
mysql> SET GTID_NEXT='bdf3bf63-0cc9-11e8-89e8-000d3a365fa6:59506054';
Now we’ll have to commit our change:
mysql> BEGIN; COMMIT; SET GTID_NEXT='AUTOMATIC';
Start MySQL Slave
With everything in place – the transaction number incremented and the empty transaction committed – we can try to start our GTID slave replication again with the next SQL query:
mysql> START SLAVE;
A successful show slave status\G will look like this:
mysql> SHOW SLAVE STATUS\G
...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
...
Check MySQL Slave Replication Status
We know that our slave is running now and receiving data from the master, and we can check how far the slave is behind the master like this:
mysql> SHOW SLAVE STATUS\G
...
Seconds_Behind_Master: 168356
...
Once the Seconds_Behind_Master value goes down to 0 seconds, we can say that our slave is fully synced.
mysql> SHOW SLAVE STATUS\G
...
Seconds_Behind_Master: 0
...
STEP 3) So the default way of repairing the InnoDB is to use
innodb_force_recovery
in your my.cnf configuration file:
[mysqld]
innodb_force_recovery=1
But the crashes continued, even after trying every innodb_force_recovery level from 1 through 6. Moreover, when using “innodb_force_recovery=4” (and 5 and 6) there was a strange additional error:
2018-05-02 21:43:34 139667439187712 [ERROR] InnoDB: Failed to find tablespace for table `mysql`.`gtid_slave_pos` in the cache. Attempting to load the tablespace with space id 63584
2018-05-02 21:43:34 139667439187712 [Warning] InnoDB: Allocated tablespace ID 63584 for mysql/gtid_slave_pos, old maximum was 0
So innodb_force_recovery did not help – we still could not start our MySQL database – but two things stood out:
- No errors were reported by the InnoDB engine – “InnoDB: Buffer pool(s) load completed” and crash recovery always finished without errors
- MySQL started successfully, but without a listening socket, and then after 3~5 minutes of extensive IO reading from the disk the mysql process crashed
It looked like something big was being loaded right at start-up and was needed at the very beginning. So what starts immediately after the mysql process loads successfully?
REPLICATION
Remember this is a slave!
STEP 4) So we tried to prevent the replication from starting along with the mysql process:
[mysqld]
skip-slave-start
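Before restarting, it can save a round of crashes to verify that the option actually made it into the config file (uncommented, in the right file). A minimal sketch – the config path in the example is an assumption, adjust it to your distro:

```shell
# Check whether an option is present (and not commented out) in a
# MySQL/MariaDB config file. Returns 0 if found, 1 otherwise.
has_option() {
    cfg="$1"; opt="$2"
    grep -Eq "^[[:space:]]*${opt}([[:space:]]*=.*)?$" "$cfg"
}

# Example (hypothetical path, adjust to your setup):
# has_option /etc/mysql/my.cnf skip-slave-start && echo "option is set"
```

Note that on Debian/Ubuntu the server also reads files under /etc/mysql/conf.d/ and similar include directories, so check the file that is actually loaded last.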
STEP 5) strace and file descriptors to find the offending table file.
The skip-slave-start option DID the trick – NO segmentation fault – and immediately after the engines loaded successfully the mysql process began to listen on its sockets. So the big offender was probably something connected with the replication! A big table used in the replication? How to find it? Remove the “skip-slave-start” option, start the mysql process and wait until the heavy IO reads kick in, then strace the mysql process (find the PID with ps – the first number in the following output is the PID of the mysql process):
srv@local ~ # ps axuf|grep mysql|grep -v grep mysql 6969 239 8.3 78561320 10983848 ? Ssl 22:53 108:59 /usr/sbin/mysqld srv@local ~ # strace -f -p 6969 strace -f -p 6969 strace: Process 6969 attached with 27 threads [pid 6996] rt_sigtimedwait([HUP QUIT ALRM TERM TSTP], NULL, NULL, 8 <unfinished ...> [pid 6994] futex(0x55941cb5d1ec, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 6995] futex(0x7f93c6de1d24, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 6993] restart_syscall(<... resuming interrupted nanosleep ...> <unfinished ...> [pid 6992] futex(0x55941cb5d0ec, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 6991] futex(0x55941cb5cd6c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 6990] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6997] mremap(0x7f9298b76000, 2210828288, 2210836480, MREMAP_MAYMOVE <unfinished ...> [pid 6989] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6988] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6986] futex(0x55947cfe941c, FUTEX_WAIT_PRIVATE, 33, NULL <unfinished ...> [pid 6985] futex(0x55947cfe941c, FUTEX_WAIT_PRIVATE, 31, NULL <unfinished ...> [pid 6984] futex(0x55947cfe941c, FUTEX_WAIT_PRIVATE, 32, NULL <unfinished ...> [pid 6997] <... mremap resumed> ) = 0x7f9298b76000 [pid 6984] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 6983] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6984] futex(0x55947cfe941c, FUTEX_WAIT_PRIVATE, 33, NULL <unfinished ...> [pid 6982] io_getevents(140347217702912, 1, 256, <unfinished ...> [pid 6981] io_getevents(140347217723392, 1, 256, <unfinished ...> [pid 6985] <... 
futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 6980] io_getevents(140347217743872, 1, 256, <unfinished ...> [pid 6979] io_getevents(140347217764352, 1, 256, <unfinished ...> [pid 6978] io_getevents(140347217825792, 1, 256, <unfinished ...> [pid 6977] io_getevents(140347217846272, 1, 256, <unfinished ...> [pid 6985] futex(0x55947cfe941c, FUTEX_WAIT_PRIVATE, 33, NULL <unfinished ...> [pid 6976] io_getevents(140347217866752, 1, 256, <unfinished ...> [pid 6975] io_getevents(140347219357696, 1, 256, <unfinished ...> [pid 6974] io_getevents(140347217784832, 1, 256, <unfinished ...> [pid 6973] io_getevents(140347217805312, 1, 256, <unfinished ...> [pid 6971] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6970] restart_syscall(<... resuming interrupted futex ...> <unfinished ...> [pid 6969] futex(0x55941a4be944, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...> [pid 6997] pread64(38, "201L32136553153531553156130307o303365]E277"..., 16384, 6228590592) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210836480, 2210844672, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "210271277343531565315353157130307o343362~E277"..., 16384, 6228639744) = 16384 [pid 6997] pread64(38, "30517303k5315753156531510130307o377224gE277"..., 16384, 6228656128) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210844672, 2210852864, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "217351V~531510531575315n130307p3431363E277"..., 16384, 6228672512) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210852864, 2210861056, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "3673213302325315n5315105315v130307p>316360E277"..., 16384, 6228705280) = 16384 [pid 6997] pread64(38, "374v2021775315v5315n5315r130307p\alE277"..., 16384, 6228721664) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210861056, 2210869248, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "s261A5315r5315v531516130307p2121AE277"..., 16384, 6228754432) = 16384 [pid 6997] 
mremap(0x7f9298b76000, 2210869248, 2210877440, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "214t72445315165315r531521130307p242H37E277"..., 16384, 6228770816) = 16384 [pid 6997] pread64(38, "<n272"531521531516531524130307p311i313E277"..., 16384, 6228819968) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210877440, 2210885632, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] mremap(0x7f9298b76000, 2210885632, 2210893824, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "n|In531524531521531525130307p360EqE277"..., 16384, 6228869120) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210893824, 2210902016, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "270206326207531525531524531527130307q1a251E277"..., 16384, 6228885504) = 16384 [pid 6997] pread64(38, "{l226S531527531525531531130307q&205!E277"..., 16384, 6228918272) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210902016, 2210910208, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "201265]&531531531527531532130307qK365212E277"..., 16384, 6228951040) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210910208, 2210918400, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "352?236327531532531531531533130307qob:E277"..., 16384, 6228967424) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210918400, 2210926592, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] mremap(0x7f9298b76000, 2210926592, 2210934784, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "3c33301531533531532531536130307q2236%E277"..., 16384, 6228983808) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210934784, 2210942976, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "Zi343177531536531533531537130307q32527LE277"..., 16384, 6229032960) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210942976, 2210951168, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "v5h%5315375315365315!130307rf325364E277"..., 16384, 6229049344) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210951168, 2210959360, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, 
"06Yn5315!5315375315"130307sCX@E277"..., 16384, 6229082112) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210959360, 2210967552, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "203243K2035315"5315!5315$130307t;234302E277"..., 16384, 6229098496) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210967552, 2210975744, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "9264332j5315$5315"5315%130307177370#371E277"..., 16384, 6229131264) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210975744, 2210983936, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, ":200+3375315%5315$5315&13030720020U-E277"..., 16384, 6229147648) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210983936, 2210992128, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "35-th5315&5315%5315(130307200+333CE277"..., 16384, 6229164032) = 16384 [pid 6997] pread64(38, "246Y3053515315(5315&5315)130307200[324305E277"..., 16384, 6229196800) = 16384 [pid 6997] mremap(0x7f9298b76000, 2210992128, 2211000320, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "340(23505315)5315(5315+130307200214L365E277"..., 16384, 6229213184) = 16384 [pid 6997] mremap(0x7f9298b76000, 2211000320, 2211008512, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "A37437235315+5315)5315,13030720025230373E277"..., 16384, 6229245952) = 16384 [pid 6997] mremap(0x7f9298b76000, 2211008512, 2211016704, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "iW2743655315,5315+5315.130307200332266256E277"..., 16384, 6229262336) = 16384 [pid 6997] mremap(0x7f9298b76000, 2211016704, 2211024896, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "343c3403475315.5315,5315/130307200356272YE277"..., 16384, 6229295104) = 16384 [pid 6997] mremap(0x7f9298b76000, 2211024896, 2211033088, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "}3532i5315/5315.531501303072015226256E277"..., 16384, 6229311488) = 16384 [pid 6997] pread64(38, "322237206377531505315/53151130307201257241272E277"..., 16384, 6229327872) = 16384 [pid 
6997] mremap(0x7f9298b76000, 2211033088, 2211041280, MREMAP_MAYMOVE) = 0x7f9298b76000 [pid 6997] pread64(38, "/25021!53151531505315213030720332372E277"..., 16384, 6229344256) = 16384 [pid 6997] mremap(0x7f9298b76000, 2211041280, 2211049472, MREMAP_MAYMOVE) = 0x7f9298b76000
You see multiple lines (thousands of them) like:
[pid 6997] pread64(38, "/25021!53151531505315213030720332372E277"..., 16384, 6229344256) = 16384
[pid 6997] mremap(0x7f9298b76000, 2211041280, 2211049472, MREMAP_MAYMOVE) = 0x7f9298b76000
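With thousands of such lines it is easier to let awk do the counting: save the strace output to a file and tally the pread64 calls per file descriptor – the descriptor being hammered pops out on top. A small sketch, assuming the strace output format shown above:

```shell
# Count pread64() calls per file descriptor in a saved strace log;
# the fd with the most reads sorts to the top.
count_reads() {
    awk '/pread64\(/ {
            split($0, a, "pread64\\(")   # a[2] = text after the syscall name
            split(a[2], b, ",")          # b[1] = first argument, the fd
            c[b[1]]++
         }
         END { for (fd in c) print c[fd], "reads from fd", fd }' "$1" |
    sort -rn
}

# Example: strace -f -p 6969 -o /tmp/mysqld.strace   (run it for ~30s), then:
# count_reads /tmp/mysqld.strace
```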
The pread64 calls read from the file descriptor with ID=38, and you can list all file descriptors of the mysql process with:
srv@local mysql # ls -altr /proc/6969/fd total 0 dr-xr-xr-x 9 mysql mysql 0 May 2 21:32 .. l-wx------ 1 root root 64 May 2 21:35 2 -> /var/log/mysql/error.log dr-x------ 2 root root 0 May 2 21:35 . lrwx------ 1 root root 64 May 2 21:35 9 -> /tmp/ibgyw4AI (deleted) lrwx------ 1 root root 64 May 2 21:35 8 -> /tmp/ibuaprWP (deleted) lrwx------ 1 root root 64 May 2 21:35 7 -> /tmp/ibfXwsBW (deleted) lrwx------ 1 root root 64 May 2 21:35 6 -> /var/lib/mysql/ibdata1 lrwx------ 1 root root 64 May 2 21:35 5 -> /var/lib/mysql/aria_log.00000001 lr-x------ 1 root root 64 May 2 21:35 4 -> /var/lib/mysql lrwx------ 1 root root 64 May 2 21:35 38 -> /var/lib/mysql/mysql/gtid_slave_pos.ibd lrwx------ 1 root root 64 May 2 21:35 37 -> /var/lib/mysql/mysql/event.MYD lrwx------ 1 root root 64 May 2 21:35 36 -> /var/lib/mysql/mysql/event.MYI lrwx------ 1 root root 64 May 2 21:35 35 -> /var/lib/mysql/mysql/procs_priv.MYD lrwx------ 1 root root 64 May 2 21:35 34 -> /var/lib/mysql/mysql/procs_priv.MYI lrwx------ 1 root root 64 May 2 21:35 33 -> /var/lib/mysql/mysql/columns_priv.MYD lrwx------ 1 root root 64 May 2 21:35 32 -> /var/lib/mysql/mysql/columns_priv.MYI lrwx------ 1 root root 64 May 2 21:35 31 -> /var/lib/mysql/mysql/tables_priv.MYD lrwx------ 1 root root 64 May 2 21:35 30 -> /var/lib/mysql/mysql/tables_priv.MYI lrwx------ 1 root root 64 May 2 21:35 3 -> /var/lib/mysql/aria_log_control lrwx------ 1 root root 64 May 2 21:35 29 -> /var/lib/mysql/mysql/roles_mapping.MYD lrwx------ 1 root root 64 May 2 21:35 28 -> /var/lib/mysql/mysql/roles_mapping.MYI lrwx------ 1 root root 64 May 2 21:35 27 -> /var/lib/mysql/mysql/proxies_priv.MYD lrwx------ 1 root root 64 May 2 21:35 26 -> /var/lib/mysql/mysql/proxies_priv.MYI lrwx------ 1 root root 64 May 2 21:35 25 -> /var/lib/mysql/mysql/host.MYD lrwx------ 1 root root 64 May 2 21:35 24 -> /var/lib/mysql/mysql/host.MYI lrwx------ 1 root root 64 May 2 21:35 23 -> /var/lib/mysql/mysql/db.MYD lrwx------ 1 root root 64 May 2 21:35 22 -> 
/var/lib/mysql/mysql/db.MYI lrwx------ 1 root root 64 May 2 21:35 21 -> /var/lib/mysql/mysql/user.MYD lrwx------ 1 root root 64 May 2 21:35 20 -> /var/lib/mysql/mysql/user.MYI lrwx------ 1 root root 64 May 2 21:35 19 -> socket:[49519] lrwx------ 1 root root 64 May 2 21:35 18 -> socket:[49518] lrwx------ 1 root root 64 May 2 21:35 17 -> /var/lib/mysql/mysql/servers.MYD lrwx------ 1 root root 64 May 2 21:35 16 -> /var/lib/mysql/mysql/servers.MYI lrwx------ 1 root root 64 May 2 21:35 15 -> /var/lib/mysql/tc.log lrwx------ 1 root root 64 May 2 21:35 13 -> /tmp/ib281FZH (deleted) lrwx------ 1 root root 64 May 2 21:35 12 -> /var/lib/mysql/ibtmp1 lrwx------ 1 root root 64 May 2 21:35 11 -> /var/lib/mysql/ib_logfile1 lrwx------ 1 root root 64 May 2 21:35 10 -> /var/lib/mysql/ib_logfile0 l-wx------ 1 root root 64 May 2 21:35 1 -> /var/log/mysql/error.log lrwx------ 1 root root 64 May 2 21:35 0 -> /dev/pts/3
And again file descriptor with ID=38 is:
/var/lib/mysql/mysql/gtid_slave_pos.ibd
And when you check the size – 12G… 12G for this table? Why? What is it used for?
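Instead of listing the whole fd directory, a single descriptor can be resolved (and sized) directly through /proc. A sketch, assuming Linux with GNU coreutils:

```shell
# Resolve a file descriptor of a running process to its path via /proc,
# then report the file's size in bytes.
fd_target() { readlink "/proc/$1/fd/$2"; }
fd_size()   { stat -c %s "$(fd_target "$1" "$2")"; }

# Example with the PID and fd from this article:
# fd_target 6969 38    # -> /var/lib/mysql/mysql/gtid_slave_pos.ibd
# fd_size   6969 38    # -> size in bytes (here ~12G)
```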
STEP 6) And from the manual:
The mysql.gtid_slave_pos table is used in replication by slave servers to keep track of their current position (the global transaction ID of the last transaction applied). Using the table allows the slave to maintain a consistent value for the gtid_slave_pos system variable across server restarts. See Global Transaction ID. You should never attempt to modify the table directly. If you do need to change the global gtid_slave_pos value, use SET GLOBAL gtid_slave_pos = ... instead. The table is updated with the new position as part of each transaction committed during replication. This makes it preferable that the table is using the same storage engine as the tables otherwise being modified in the transaction, since otherwise a multi-engine transaction is needed that can reduce performance.
Hmm, “The table is updated with the new position as part of each transaction committed during replication” – and a 12G table? Could it be tracking millions of transactions? Probably not; in fact this table holds only 2-4 rows at any given time!!! But the size is 12G – well, as explained in the MariaDB bug tracker:
“That’s specifics of InnoDB data storage. Did you try to run OPTIMIZE TABLE mysql.gtid_slave_pos? It should allow to reclaim the disk space.” (taken from: https://jira.mariadb.org/browse/MDEV-12318).
STEP 7) Remove the corrupted innodb file, which crashes the database (the mysql process) leaving you with no database at all, and then recreate the table
Restart your mysql server with “skip-slave-start” again to be able to start it (see above).
So a table with probably a couple of rows takes 12G, and MariaDB reads it for the replication on start-up, but the replication position is also kept in another place – “master.info”. Can we delete this offending file “gtid_slave_pos.ibd”? Yes we can: move the file out of its place (the mysql datadir) and then create a second table with:
MariaDB [mysql]> use mysql
MariaDB [mysql]> CREATE TABLE `gtid_slave_pos1` (
    ->   `domain_id` int(10) unsigned NOT NULL,
    ->   `sub_id` bigint(20) unsigned NOT NULL,
    ->   `server_id` int(10) unsigned NOT NULL,
    ->   `seq_no` bigint(20) unsigned NOT NULL,
    ->   PRIMARY KEY (`domain_id`,`sub_id`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Replication slave GTID position';
Query OK, 0 rows affected (0.01 sec)
Copy gtid_slave_pos1.ibd to gtid_slave_pos.ibd and restart the mysql process. It will report an error for the table mysql.gtid_slave_pos, but you will be able to drop the table and then create it again with the same name “gtid_slave_pos” (and drop the temporary one “gtid_slave_pos1”):
MariaDB [mysql]> use mysql
MariaDB [mysql]> DROP TABLE `gtid_slave_pos`;
Query OK, 0 rows affected (0.01 sec)
MariaDB [mysql]> CREATE TABLE `gtid_slave_pos` (
    ->   `domain_id` int(10) unsigned NOT NULL,
    ->   `sub_id` bigint(20) unsigned NOT NULL,
    ->   `server_id` int(10) unsigned NOT NULL,
    ->   `seq_no` bigint(20) unsigned NOT NULL,
    ->   PRIMARY KEY (`domain_id`,`sub_id`)
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Replication slave GTID position';
Query OK, 0 rows affected (0.01 sec)
MariaDB [mysql]> DROP TABLE `gtid_slave_pos1`;
Query OK, 0 rows affected (0.01 sec)
The create table statement above was taken from another server; it is probably a good idea to get it yourself – log in on a healthy mysql server and issue: “show create table mysql.gtid_slave_pos;”.
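The file-level part of this step can be written down before touching anything. The sketch below only prints the planned commands instead of executing them (a dry run; the backup directory is a hypothetical path, and as stressed above, act only after a verified backup of the whole datadir):

```shell
# Dry run of the gtid_slave_pos tablespace swap: every step is only
# printed, not executed. Paths follow this article's setup;
# /root/mysql-backup is a hypothetical backup location.
plan_swap() {
    datadir=/var/lib/mysql
    backup=/root/mysql-backup
    echo "mv $datadir/mysql/gtid_slave_pos.ibd $backup/"
    echo "cp -a $datadir/mysql/gtid_slave_pos1.ibd $datadir/mysql/gtid_slave_pos.ibd"
    echo "systemctl restart mysql"
}
plan_swap
```

Review the printed commands, then run them by hand one at a time rather than blindly removing the echo wrapper.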
So now you have a healthy mysql.gtid_slave_pos to be used for your replication. Restart your mysql server after removing “skip-slave-start” from your configuration file, and here it is – the replication is OK and running:
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.10.10.10
                  Master_User: replusr
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.071301
          Read_Master_Log_Pos: 64980976
               Relay_Log_File: mysqld-relay-bin.230012
                Relay_Log_Pos: 80704376
        Relay_Master_Log_File: mysql-bin.090129
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 80704077
              Relay_Log_Space: 113209377078
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 30944
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 101
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: conservative
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for room in worker thread event queue
1 row in set (0.00 sec)