after rollback on master
When starting a transaction with a statement containing changes
to both transactional tables and non-transactional tables, the
statement is considered as non-transactional and is therefore
written directly to the binary log. This behaviour was present
in 5.0, and has propagated to 5.1.
If a trigger containing a change of a non-transactional table is
added to a transactional table, any changes to the transactional
table is "tainted" as non-transactional.
This patch solves the problem by removing the existing "hack" that
allows non-transactional statements appearing first in a transaction
to be written directly to the binary log. Instead, anything inside
a transaction is treaded as part of the transaction and not written
to the binary log until the transaction is committed.
A transaction could result in having an extra event after a query that
errored e.g because of a dup key. Such a query is rolled back in
innodb, as specified, but has not been in binlog.
It appeares that the binlog engine did not always register for a query
(statement) because the previous query had not reset at its statement
commit time. Because of that fact there was no roll-back to the
trx_data->before_stmt_pos position and a the pending event of the
errorred query could become flushed to the binlog file.
Fixed with deploying the reset of trx_data->before_stmt_pos at the end
of the query processing.
The Blackhole engine did not support row-based replication
since the delete_row(), update_row(), and the index and range
searching functions were not implemented.
This patch adds row-based replication support for the
Blackhole engine by implementing the two functions mentioned
above, and making the engine pretend that it has found the
correct row to delete or update when executed from the slave
SQL thread by implementing index and range searching functions.
It is necessary to only pretend this for the SQL thread, since
a SELECT executed on the Blackhole engine will otherwise never
return EOF, causing a livelock.
The problem is that when statement-based replication was enabled,
statements such as INSERT INTO .. SELECT FROM .. and CREATE TABLE
.. SELECT FROM need to grab a read lock on the source table that
does not permit concurrent inserts, which would in turn be denied
if the source table is a log table because log tables can't be
locked exclusively.
The solution is to not take such a lock when the source table is
a log table as it is unsafe to replicate log tables under statement
based replication. Furthermore, the read lock that does not permits
concurrent inserts is now only taken if statement-based replication
is enabled and if the source table is not a log table.
SUPER is not required to change binlog format for session
A user without SUPER privileges can change the value of the
session variable BINLOG_FORMAT, causing problems for a DBA.
This changeset requires a user to have SUPER privileges to
change the value of the session variable BINLOG_FORMAT, and
not only the global variable BINLOG_FORMAT.
The size of the Innodb_buffer_pool_pages differs by one byte on row versus statement
log, so neuter the last position of the stringified decimal representation. Innobase
says the size isn't very important in any case.
Also, split out the "mixed" format to its own file, as mtr seems to dislike having only
stm and row but not mix.
tables open
When executing a DROP DATABASE statement in ROW mode and having temporary
tables open at the same time, the existance of temporary tables prevent
the server from switching back to row mode after temporarily switching to
statement mode to handle the logging of the statement.
Fixed the problem by removing the code to switch to statement mode and added
code to temporarily disable the binary log while dropping the objects in the
database.
Problem: binlog_stm_binlog runs INSERT DELAYED queries, and
then prints the contents of the binlog. Before checking the
contents of the binlog, the test waits until the rows have
appeared in the table. However, this is not enough, since
INSERT DELAYED does not write rows to the binlog at the same
time as it writes them to the table. So there is a race.
Fix: Add a FLUSH TABLES before SHOW BINLOG EVENTS. That
waits until the insert_delayed thread is done.
In order to handle CHAR() fields, 8 bits were reserved for
the size of the CHAR field. However, instead of denoting the
number of characters in the field, field_length was used which
denotes the number of bytes in the field.
Since UTF-8 fields can have three bytes per character (and
has been extended to have four bytes per character in 6.0),
an extra two bits have been encoded in the field metadata
work for fields of type Field_string (i.e., CHAR fields).
Since the metadata word is filled, the extra bits have been
encoded in the upper 4 bits of the real type (the most
significant byte of the metadata word) by computing the
bitwise xor of the extra two bits. Since the upper 4 bits
of the real type always is 1111 for Field_string, this
means that for fields of length <256, the encoding is
identical to the encoding used in pre-5.1.26 servers, but
for lengths of 256 or more, an unrecognized type is formed,
causing an old slave (that does not handle lengths of 256
or more) to stop.
If a binlog file is manually replaced with a namesake directory the internal purging did
not handle the error of deleting the file so that eventually
a post-execution guards fires an assert.
Fixed with reusing a snippet of code for bug@18199 to tolerate lack of the file but no other error
at an attempt to delete it.
The same applied to the index file deletion.
The cset carries pieces of manual merging.
The bug allow multiple executing transactions working with non-transactional
to interfere with each others by interleaving the events of different trans-
actions.
Bug is fixed by writing non-transactional events to the transaction cache and
flushing the cache to the binary log at statement commit. To mimic the behavior
of normal statement-based replication, we flush the transaction cache in row-
based mode when there is no committed statements in the transaction cache,
which means we are committing the first one. This means that it will be written
to the binary log as a "mini-transaction" with just the rows for the statement.
Note that the changes here does not take effect when building the server with
HAVE_TRANSACTIONS set to false, but it is not clear if this was possible before
this patch either.
For row-based logging, we also have that when AUTOCOMMIT=1, the code now always
generates a BEGIN/COMMIT pair for single statements, or BEGIN/ROLLBACK pair in the
case of non-transactional changes in a statement that was rolled back. Note that
for the case where changes to a non-transactional table causes a rollback due
to error, the statement will now be logged with a BEGIN/ROLLBACK pair, even
though some changes has been committed to the non-transactional table.
binlog_format=mixed
Statement-based replication of DELETE ... LIMIT, UPDATE ... LIMIT,
INSERT ... SELECT ... LIMIT is not safe as order of rows is not
defined.
With this fix, we issue a warning that this statement is not safe to
replicate in statement mode, or go to row-based mode in mixed mode.
Note that we may consider a statement as safe if ORDER BY primary_key
is present. However it may confuse users to see very similiar statements
replicated differently.
Note 2: regular UPDATE statement (w/o LIMIT) is unsafe as well, but
this patch doesn't address this issue. See comment from Kristian
posted 18 Mar 10:55.
using a trig in SP
For all 5.0 and up to 5.1.12 exclusive, when a stored routine or
trigger caused an INSERT into an AUTO_INCREMENT column, the
generated AUTO_INCREMENT value should not be written into the
binary log, which means if a statement does not generate
AUTO_INCREMENT value itself, there will be no Intvar event (SET
INSERT_ID) associated with it even if one of the stored routine
or trigger caused generation of such a value. And meanwhile, when
executing a stored routine or trigger, it would ignore the
INSERT_ID value even if there is a INSERT_ID value available set
by a SET INSERT_ID statement.
Starting from MySQL 5.1.12, the generated AUTO_INCREMENT value is
written into the binary log, and the value will be used if
available when executing the stored routine or trigger.
Prior fix of this bug in MySQL 5.0 and prior MySQL 5.1.12
(referenced as the buggy versions in the text below), when a
statement that generates AUTO_INCREMENT value by the top
statement was executed in the body of a SP, all statements in the
SP after this statement would be treated as if they had generated
AUTO_INCREMENT by the top statement. When a statement that did
not generate AUTO_INCREMENT value by the top statement but by a
function/trigger called by it, an erroneous Intvar event would be
associated with the statement, this erroneous INSERT_ID value
wouldn't cause problem when replicating between masters and
slaves of 5.0.x or prior 5.1.12, because the erroneous INSERT_ID
value was not used when executing functions/triggers. But when
replicating from buggy versions to 5.1.12 or newer, which will
use the INSERT_ID value in functions/triggers, the erroneous
value will be used, which would cause duplicate entry error and
cause the slave to stop.
The patch for 5.1 fixed it to ignore the SET INSERT_ID value when
executing functions/triggers if it is replicating from a master
of buggy versions, another patch for 5.0 fixed it not to generate
the erroneous Intvar event.
Problem: in mixed and statement mode, a query that refers to a
system variable will use the slave's value when replayed on
slave. So if the value of a system variable is inserted into a
table, the slave will differ from the master.
Fix: mark statements that refer to a system variable as "unsafe",
meaning they will be replicated by row in mixed mode and produce a warning
in statement mode. There are some exceptions: some variables are actually
replicated. Those should *not* be marked as unsafe.
BUG#34732: mysqlbinlog does not print default values for auto_increment variables
Problem: mysqlbinlog does not print default values for some variables,
including auto_increment_increment and others. So if a client executing
the output of mysqlbinlog has different default values, replication will
be wrong.
Fix: Always print default values for all variables that are replicated.
I need to fix the two bugs at the same time, because the test cases would
fail if I only fixed one of them.