mariadb/sql/rpl_utility.cc
Monty 232533978f MDEV-36290: ALTER TABLE with multi-master can cause data loss
One can have data loss in multi-master setups when 1) both masters
update the same table, 2) ALTER TABLE is run on one master which
re-arranges the column ordering, and 3) transactions are binlogged
in ROW binlog_format.

This is because the slave assumes that all columns are in the same
order on the master and slave and that all columns on the master also
exist on the slave. The slave makes this assumption even if
binlog_row_metadata=FULL is used. If the assumption does not hold,
the result is silent data loss.

A new option for the slave_type_conversions bit field,
ERROR_IF_MISSING_FIELD, has been added. This allows the user to define if
the slave should abort replication if it is missing some field that
existed on the master. This option is off by default to keep things
compatible with earlier versions.
If a field is missing on the slave and log_warnings >= 1, a warning
will be logged to the error log.

When binlog_row_metadata=FULL is used on the master, this patch fixes
the problem by mapping fields with identical names on the master and
slave. If the slave has fields that do not exist in the row event,
these will be set to their default value.

The main idea is that we added two conversion tables:
m_tabledef.master_to_slave_map[master_column_index] -> slave_column_index
and m_tabledef.master_to_slave_error[master_column_index], which contains
an error number if the master column does not exist on the slave or
if it is not possible to convert the master data to the slave column.
master_to_slave_error[#] contains 0 if the column exists and is compatible.
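
The two tables can be pictured with a minimal, self-contained sketch.
It is not the server code: ColumnMap, build_column_map() and the
MapError values are hypothetical names used only for illustration, and
column names are compared exactly, which may differ from the server's
own comparison rules.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical error codes, for illustration only (not the server's values).
enum MapError { MAP_OK = 0, MAP_MISSING_ON_SLAVE = 1 };

// Both vectors are indexed by master column index.
struct ColumnMap {
  std::vector<std::size_t> master_to_slave_map;  // valid only where error is MAP_OK
  std::vector<int> master_to_slave_error;        // 0 if the column exists on the slave
};

// Map each master column to the slave column with the same name.
ColumnMap build_column_map(const std::vector<std::string>& master_cols,
                           const std::vector<std::string>& slave_cols)
{
  // Index the slave columns by name for constant-time lookup.
  std::unordered_map<std::string, std::size_t> slave_index;
  for (std::size_t i = 0; i < slave_cols.size(); i++)
    slave_index.emplace(slave_cols[i], i);

  ColumnMap map;
  map.master_to_slave_map.assign(master_cols.size(), 0);
  map.master_to_slave_error.assign(master_cols.size(), MAP_OK);
  for (std::size_t m = 0; m < master_cols.size(); m++) {
    auto it = slave_index.find(master_cols[m]);
    if (it == slave_index.end())
      map.master_to_slave_error[m] = MAP_MISSING_ON_SLAVE;
    else
      map.master_to_slave_map[m] = it->second;
  }
  assert(map.master_to_slave_map.size() == map.master_to_slave_error.size());
  return map;
}
```

With master columns (a, b, c) and slave columns (c, a), the sketch maps
a to slave index 1 and c to slave index 0, and records an error for b.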

General code changes:
- Instead of looping over row fields in the order of the slave table,
  we are now looping over fields in the order of the binary log.
- We are using table->write_set to know which fields should be updated
  on the slave. This is reflected in unpack_row().
- We are calling TABLE::mark_columns_per_binlog_row_image() to ensure
  that rpl_write_set is properly set. This is needed if the slave also
  is doing binary logging.
- Before, replication aborted if the master and slave tables were too
  different.  Now replication is only aborted if the row actually uses
  columns that do not exist on the slave (and ALLOW_MISSING_FIELDS
  is not used) or uses columns that cannot be converted.
  - Instead of giving errors in compatible_with(), used when the table
    is first accessed by the row event, we are now giving errors
    when we examine a row event and notice that it is accessing
    a non-existing or incompatible field.
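
The changes above can be sketched as follows. This is a simplified
model under stated assumptions, not the unpack_row() implementation:
SlaveRecord and apply_row() are hypothetical names, values are plain
strings, and any non-zero error entry for a used column aborts the row
(the strict case; the sketch does not model the warning-only path).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the slave's record buffer.
struct SlaveRecord {
  std::vector<std::string> values;  // pre-filled with column defaults
  std::vector<bool> write_set;      // slave columns updated from the row event
};

// Returns 0 on success, or the error of the first unusable column that
// the row actually uses. The record is left untouched on error.
int apply_row(const std::vector<std::size_t>& master_to_slave_map,
              const std::vector<int>& master_to_slave_error,
              const std::vector<bool>& cols_in_row,
              const std::vector<std::string>& master_values,
              SlaveRecord& rec)
{
  const std::size_t n = master_to_slave_map.size();
  assert(master_to_slave_error.size() == n);
  assert(cols_in_row.size() == n && master_values.size() == n);

  // Pass 1: only columns present in this row event can raise an error.
  for (std::size_t m = 0; m < n; m++)
    if (cols_in_row[m] && master_to_slave_error[m] != 0)
      return master_to_slave_error[m];

  // Pass 2: walk the columns in binary log order and store each value
  // in the slave column it maps to.
  for (std::size_t m = 0; m < n; m++) {
    if (!cols_in_row[m])
      continue;                     // column not part of this row image
    const std::size_t s = master_to_slave_map[m];
    rec.values[s] = master_values[m];
    rec.write_set[s] = true;
  }
  return 0;
}
```

Slave columns whose write_set bit stays unset keep their default
value, which matches the behaviour described for fields that do not
exist in the row event.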

Other code changes:
- Removed conv_table argument from compatible_with() and store it
  directly in RPL_TABLE_LIST->m_conv_table
- table_def::compatible_with() now returns 1 on error (not 0).
- Remove m_width and skip arguments from prepare_record() as we are
  now using table->write_set to check which elements need a default
  value.
- Moved DBUG_ENTER() to its proper place (after variable
  declarations) in a few functions.
- Some changes in unpack_row():
  - Replaced null_mask and null_ptr with an indexed bit check for
    simplicity.
  - Removed check of rgi == null and table_found which never worked.
  - Updated comments to reflect current code.
  - Indentation changes as the code now uses 'continue' instead of
    'if-else' in the main loop.
  - The code to throw away 'extra master fields' is not needed as we
    are now looping over fields in binary log, not over fields in
    slave table.
- fill_extra_persistent_columns() is now using table->cond_set to know
  which columns were not updated from binlog.
- Simplified get_table_data(TABLE *table_arg) by returning found
  table_list.
- Errors for row events are now initialized in compatible_with(),
  checked in check_wrong_column_usage() and reported in
  give_compatibility_error().

Test cases and some code patches provided by Brandon Nesterenko
<brandon.nesterenko@mariadb.com>
2025-05-19 19:57:47 +03:00


/* Copyright (c) 2006, 2013, Oracle and/or its affiliates.
Copyright (c) 2011, 2013, Monty Program Ab
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */
#include "mariadb.h"
#include <my_bit.h>
#include "rpl_utility.h"
#include "log_event.h"
/*********************************************************************
* table_def member definitions *
*********************************************************************/
/*
This function returns the field size in raw bytes based on the type
and the encoded field data from the master's raw data.
*/
uint32 table_def::calc_field_size(uint col, uchar *master_data) const
{
uint32 length= 0;
switch (type(col)) {
case MYSQL_TYPE_NEWDECIMAL:
length= my_decimal_get_binary_size(m_field_metadata[col] >> 8,
m_field_metadata[col] & 0xff);
break;
case MYSQL_TYPE_DECIMAL:
case MYSQL_TYPE_FLOAT:
case MYSQL_TYPE_DOUBLE:
length= m_field_metadata[col];
break;
/*
The cases for SET and ENUM are included for completeness; however,
both are mapped to type MYSQL_TYPE_STRING and their real types
are encoded in the field metadata.
*/
case MYSQL_TYPE_SET:
case MYSQL_TYPE_ENUM:
case MYSQL_TYPE_STRING:
{
uchar type= m_field_metadata[col] >> 8U;
if ((type == MYSQL_TYPE_SET) || (type == MYSQL_TYPE_ENUM))
length= m_field_metadata[col] & 0x00ff;
else
{
/*
We are reading the actual size from the master_data record
because this field has the actual length stored in the first
byte.
*/
length= (uint) *master_data + 1;
DBUG_ASSERT(length != 0);
}
break;
}
case MYSQL_TYPE_YEAR:
case MYSQL_TYPE_TINY:
length= 1;
break;
case MYSQL_TYPE_SHORT:
length= 2;
break;
case MYSQL_TYPE_INT24:
length= 3;
break;
case MYSQL_TYPE_LONG:
length= 4;
break;
#ifdef HAVE_LONG_LONG
case MYSQL_TYPE_LONGLONG:
length= 8;
break;
#endif
case MYSQL_TYPE_NULL:
length= 0;
break;
case MYSQL_TYPE_NEWDATE:
length= 3;
break;
case MYSQL_TYPE_DATE:
case MYSQL_TYPE_TIME:
length= 3;
break;
case MYSQL_TYPE_TIME2:
length= my_time_binary_length(m_field_metadata[col]);
break;
case MYSQL_TYPE_TIMESTAMP:
length= 4;
break;
case MYSQL_TYPE_TIMESTAMP2:
length= my_timestamp_binary_length(m_field_metadata[col]);
break;
case MYSQL_TYPE_DATETIME:
length= 8;
break;
case MYSQL_TYPE_DATETIME2:
length= my_datetime_binary_length(m_field_metadata[col]);
break;
case MYSQL_TYPE_BIT:
{
/*
Decode the size of the bit field from the master.
from_len is the length in bytes from the master
from_bit_len is the number of extra bits stored in the master record
If from_bit_len is not 0, add 1 to the length to account for accurate
number of bytes needed.
*/
uint from_len= (m_field_metadata[col] >> 8U) & 0x00ff;
uint from_bit_len= m_field_metadata[col] & 0x00ff;
DBUG_ASSERT(from_bit_len <= 7);
length= from_len + ((from_bit_len > 0) ? 1 : 0);
break;
}
case MYSQL_TYPE_VARCHAR:
case MYSQL_TYPE_VARCHAR_COMPRESSED:
{
length= m_field_metadata[col] > 255 ? 2 : 1; // c&p of Field_varstring::data_length()
length+= length == 1 ? (uint32) *master_data : uint2korr(master_data);
break;
}
case MYSQL_TYPE_TINY_BLOB:
case MYSQL_TYPE_MEDIUM_BLOB:
case MYSQL_TYPE_LONG_BLOB:
case MYSQL_TYPE_BLOB:
case MYSQL_TYPE_BLOB_COMPRESSED:
case MYSQL_TYPE_GEOMETRY:
{
/*
Compute the length of the data. We cannot use get_length() here
since it is dependent on the specific table (and also checks the
packlength using the internal 'table' pointer) and replication
is using a fixed format for storing data in the binlog.
*/
switch (m_field_metadata[col]) {
case 1:
length= *master_data;
break;
case 2:
length= uint2korr(master_data);
break;
case 3:
length= uint3korr(master_data);
break;
case 4:
length= uint4korr(master_data);
break;
default:
DBUG_ASSERT(0); // Should not come here
break;
}
length+= m_field_metadata[col];
break;
}
default:
length= ~(uint32) 0;
}
return length;
}
PSI_memory_key key_memory_table_def_memory;
table_def::table_def(unsigned char *types, ulong size,
uchar *field_metadata, int metadata_size,
uchar *null_bitmap, uint16 flags,
const uchar *optional_metadata_str,
uint optional_metadata_len)
: m_size(size), m_type(0), m_field_metadata_size(metadata_size),
m_field_metadata(0), m_null_bits(0), m_flags(flags),
m_memory(NULL)
{
m_memory= (uchar *)
my_multi_malloc(key_memory_table_def_memory, MYF(MY_WME),
&m_type, size,
&m_field_metadata,
size * sizeof(uint16),
&m_null_bits, (size + 7) / 8,
&optional_metadata.str,
optional_metadata_len,
&master_to_slave_map,
m_size * sizeof(*master_to_slave_map),
&master_to_slave_error,
m_size * sizeof(*master_to_slave_error),
&master_column_name,
m_size * sizeof(uchar*),
NULL);
bzero(m_field_metadata, size * sizeof(uint16));
bzero(master_to_slave_error, m_size * sizeof(*master_to_slave_error));
bzero(master_column_name, m_size * sizeof(uchar*));
if (m_type)
memcpy(m_type, types, size);
else
m_size= 0;
if ((optional_metadata.length= optional_metadata_len))
memcpy((char*) optional_metadata.str, optional_metadata_str,
optional_metadata_len);
/*
Extract the data from the table map into the field metadata array
iff there is field metadata. The variable metadata_size will be
0 if we are replicating from an older version server since no field
metadata was written to the table map. This can also happen if
there were no fields in the master that needed extra metadata.
*/
if (m_size && metadata_size)
{
int index= 0;
for (unsigned int i= 0; i < m_size; i++)
{
switch (binlog_type(i)) {
case MYSQL_TYPE_TINY_BLOB:
case MYSQL_TYPE_BLOB:
case MYSQL_TYPE_BLOB_COMPRESSED:
case MYSQL_TYPE_MEDIUM_BLOB:
case MYSQL_TYPE_LONG_BLOB:
case MYSQL_TYPE_DOUBLE:
case MYSQL_TYPE_FLOAT:
case MYSQL_TYPE_GEOMETRY:
{
/*
These types store a single byte.
*/
m_field_metadata[i]= field_metadata[index];
index++;
break;
}
case MYSQL_TYPE_SET:
case MYSQL_TYPE_ENUM:
case MYSQL_TYPE_STRING:
{
uint16 x= field_metadata[index++] << 8U; // real_type
x+= field_metadata[index++]; // pack or field length
m_field_metadata[i]= x;
break;
}
case MYSQL_TYPE_BIT:
{
uint16 x= field_metadata[index++];
x = x + (field_metadata[index++] << 8U);
m_field_metadata[i]= x;
break;
}
case MYSQL_TYPE_VARCHAR:
case MYSQL_TYPE_VARCHAR_COMPRESSED:
{
/*
These types store two bytes.
*/
char *ptr= (char *)&field_metadata[index];
m_field_metadata[i]= uint2korr(ptr);
index= index + 2;
break;
}
case MYSQL_TYPE_NEWDECIMAL:
{
uint16 x= field_metadata[index++] << 8U; // precision
x+= field_metadata[index++]; // decimals
m_field_metadata[i]= x;
break;
}
case MYSQL_TYPE_TIME2:
case MYSQL_TYPE_DATETIME2:
case MYSQL_TYPE_TIMESTAMP2:
m_field_metadata[i]= field_metadata[index++];
break;
default:
m_field_metadata[i]= 0;
break;
}
}
}
if (m_size && null_bitmap)
memcpy(m_null_bits, null_bitmap, (m_size + 7) / 8);
}
table_def::~table_def()
{
my_free(m_memory);
#ifndef DBUG_OFF
m_type= 0;
m_size= 0;
#endif
}
/**
@param event_buf pointer to the buffer containing the serialized event
@param event_len length of the event accounting possible checksum alg
@param alg checksum algorithm used for the event
@return TRUE if test fails
FALSE as success
@notes
event_buf will have the same values on return. However, during the process
of calculating the checksum, it is temporarily changed. Because of this the
event_buf argument is not a pointer to const.
*/
bool event_checksum_test(uchar *event_buf, ulong event_len,
enum enum_binlog_checksum_alg alg)
{
bool res= FALSE;
uint16 flags= 0; // to store in FD's buffer flags orig value
if (alg != BINLOG_CHECKSUM_ALG_OFF && alg != BINLOG_CHECKSUM_ALG_UNDEF)
{
ha_checksum incoming;
ha_checksum computed;
if (event_buf[EVENT_TYPE_OFFSET] == FORMAT_DESCRIPTION_EVENT)
{
#ifdef DBUG_ASSERT_EXISTS
int8 fd_alg= event_buf[event_len - BINLOG_CHECKSUM_LEN -
BINLOG_CHECKSUM_ALG_DESC_LEN];
#endif
/*
FD event is checksummed and therefore verified w/o the binlog-in-use flag
*/
flags= uint2korr(event_buf + FLAGS_OFFSET);
if (flags & LOG_EVENT_BINLOG_IN_USE_F)
event_buf[FLAGS_OFFSET] &= ~LOG_EVENT_BINLOG_IN_USE_F;
/*
The only algorithm currently is CRC32. Zero indicates
the binlog file is checksum-free *except* the FD-event.
*/
DBUG_ASSERT(fd_alg == BINLOG_CHECKSUM_ALG_CRC32 || fd_alg == 0);
DBUG_ASSERT(alg == BINLOG_CHECKSUM_ALG_CRC32);
/*
Compile-time guard to watch over the max number of alg
*/
compile_time_assert(BINLOG_CHECKSUM_ALG_ENUM_END <= 0x80);
}
incoming= uint4korr(event_buf + event_len - BINLOG_CHECKSUM_LEN);
/* checksum the event content without the checksum part itself */
computed= my_checksum(0, event_buf, event_len - BINLOG_CHECKSUM_LEN);
if (flags != 0)
{
/* restoring the orig value of flags of FD */
DBUG_ASSERT(event_buf[EVENT_TYPE_OFFSET] == FORMAT_DESCRIPTION_EVENT);
event_buf[FLAGS_OFFSET]= (uchar) flags;
}
res= DBUG_EVALUATE_IF("simulate_checksum_test_failure", TRUE, computed != incoming);
}
return res;
}