The problem is related to the changes made in bug#13025132.
get_partition_set can do dynamic pruning which limits the partitions
to scan even further. This is not accounted for when setting
the correct start of the preallocated record buffer used in
the priority queue, thus leading to wrong buffer is used
(including wrong preset partitioning id, connected to that buffer).
Solution is to fast forward the buffer pointer to point to the correct
partition record buffer.
The problem is related to the changes made in bug#13025132.
get_partition_set can do dynamic pruning which limits the partitions
to scan even further. This is not accounted for when setting
the correct start of the preallocated record buffer used in
the priority queue, thus leading to wrong buffer is used
(including wrong preset partitioning id, connected to that buffer).
Solution is to fast forward the buffer pointer to point to the correct
partition record buffer.
Follow-up patch - Fix broken build:
error: format ‘%u’ expects argument of type ‘unsigned int’,
but argument 2 has type ‘key_part_map {aka long unsigned int}’
[-Werror=format]
The partitioning engine does not implement index_next for partitions
which return HA_ERR_KEY_NOT_FOUND in index_read_map.
If HA_ERR_KEY_NOT_FOUND was returned by a partition during
index_read_map, that partition would not be included in following
calls to index_next. If no partition returned a row in index_read_map,
then the subsequent call to index_next would try to use a non existing
handler (index out of bound).
Even after fixing the index out of bound if at least one partition
returned.
So it is really two connected bugs
1) crash due to index out of bound (-1 unsigned).
2) not including partitions that returned HA_ERR_KEY_NOT_FOUND.
Fixed by recording the partitions that returned HA_ERR_KEY_NOT_FOUND,
and include them too when doing handle_ordered_next the first time.
Additional patch to remove the part_id -> ref_buffer offset.
The partitioning id and the associate record buffer can
be found without having to calculate it.
By initializing it for each used partition, and then reuse
the key-buffer from the queue, it is not needed to have
such map.
The buffer for the current read row from each partition
(m_ordered_rec_buffer) used for sorted reads was
allocated on open and freed when the ha_partition handler
was closed or destroyed.
For tables with many partitions and big records this could
take up too much valuable memory.
Solution is to only allocate the memory when it is needed
and free it when nolonger needed. I.e. allocate it in
index_init and free it in index_end (and to handle failures
also free it on reset, close etc.)
Also only allocating needed memory, according to
partitioning pruning.
Manually tested that it does not use as much memory and
releases it after queries.
There can be cases when the optimizer calls ha_partition::records_in_range
when there are no matching partitions. So the DBUG_ASSERT of
!tot_used_partitions does assert.
Fixed by returning 0 instead when no matching partitions are found.
This will avoid the crash. records_in_range will then try to find the
biggest used partition, which will not find any partition and
records_in_range will then return 0, meaning non rows can be found.
Patch contributed by Davi Arnaut at twitter.
PARTITION STATISTICS
Problem was the fix for bug#11756867; It always used the first
partitions, and stopped after it checked 10 [sub]partitions.
(or until it found a partition which would contain a match).
This results in bad statistics for tables where the first 10 partitions
don't represent the majority of the data (like when the first 10
partitions only contained a few rows in total).
The solution was to take statisics from the partitions containing
the most rows instead:
Added an array of partition ids which is sorted by number of records
in descending order.
this array is used in records_in_range to cover as many records as
possible in as few calls as possible.
Also changed the limit of how many partitions to use for the statistics
from a static max of 10 partitions, into a dynamic model:
Maximum number of partitions is now log2(total number of partitions)
taken from the ordered array.
It will continue calling partitions records_in_range until it has
checked:
(total rows in matching partitions) * (maximum number of partitions)
/ (number of used partitions)
Also reverted the changes for ha_partition::scan_time() and
ha_partition::estimate_rows_upper_bound() to before
the fix of bug#11756867. Since they are not as slow as
records_in_range.
RESULT FROM PREVIOUS TRANSACTION
The current Query Cache API is not fully compatible with
the partitioning engine.
There is no good way to implement support for QC due to:
1) a static callback for ha_partition would need to have access
to all partition names and call the underlying callback for each
[sub]partition with the correct name.
2) pruning would be impossible, even if one used the ulonglong
engine_data due to if engine_data is changed, the table is
invalidated by the QC.
So the only viable solution to avoid incorrect data is to not allow
caching of queries using partitioned tables.
(There are some extra changes, due to removal of \r as line break)
ALTER TABLE AFTER DROP PARTITION
Bug#13608188 - 64038: CRASH IN HANDLER::HA_THD ON ALTER TABLE AFTER
REPAIR NON-EXISTING PARTITION
Backport of bug#13357766 from -trunk to -5.5.
The state of some partitions was not reset on failure, leading
to invalid states of partitions in consequent statements.
Fixed by reverting back to original state for all partitions
if not all partition names was resolved.
Also adding extra security by forcing tables to be reopened
in case of error in mysql_alter_table.
(There is also removal of \r at the end of some lines.)
(also 5.5+ solution for bug#11766879/bug#60106)
The valgrind warning was due to an unused 'new handler_add_index(...)'
which was never freed.
The error handling did not work (fails as in bug#11766879) and
the implementation was not as transparant as it could, therefore I
made it a bit simpler and more transparant to the underlying handlers.
This way it follows the api better and the error handling works and
is also now tested.
Also added a debug test to verify the error handling.
Improved according to Jon Olavs review:
Added class ha_partition_add_index.
Also added base class Sql_alloc to handler_add_index.
Update 3.
PARTITONING, ON INDEX CREATE
If the first partition succeeded in adding a index, but a successive partition failed,
then the first partition had still the new index.
The fix reverts the added indexes from previous partitions on failure.
In sql_class.cc, 'row_count', of type 'ha_rows', was used as last argument for
ER_TRUNCATED_WRONG_VALUE_FOR_FIELD which is
"Incorrect %-.32s value: '%-.128s' for column '%.192s' at row %ld".
So 'ha_rows' was used as 'long'.
On SPARC32 Solaris builds, 'long' is 4 bytes and 'ha_rows' is 'longlong' i.e. 8 bytes.
So the printf-like code was reading only the first 4 bytes.
Because the CPU is big-endian, 1LL is 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01
so the first four bytes yield 0. So the warning message had "row 0" instead of
"row 1" in test outfile_loaddata.test:
-Warning 1366 Incorrect string value: '\xE1\xE2\xF7' for column 'b' at row 1
+Warning 1366 Incorrect string value: '\xE1\xE2\xF7' for column 'b' at row 0
All error-messaging functions which internally invoke some printf-life function
are potential candidate for such mistakes.
One apparently easy way to catch such mistakes is to use
ATTRIBUTE_FORMAT (from my_attribute.h).
But this works only when call site has both:
a) the format as a string literal
b) the types of arguments.
So:
func(ER(ER_BLAH), 10);
will silently not be checked, because ER(ER_BLAH) is not known at
compile time (it is known at run-time, and depends on the chosen
language).
And
func("%s", a va_list argument);
has the same problem, as the *real* type of arguments is not
known at this site at compile time (it's known in some caller).
Moreover,
func(ER(ER_BLAH));
though possibly correct (if ER(ER_BLAH) has no '%' markers), will not
compile (gcc says "error: format not a string literal and no format
arguments").
Consequences:
1) ATTRIBUTE_FORMAT is here added only to functions which in practice
take "string literal" formats: "my_error_reporter" and "print_admin_msg".
2) it cannot be added to the other functions: my_error(),
push_warning_printf(), Table_check_intact::report_error(),
general_log_print().
To do a one-time check of functions listed in (2), the following
"static code analysis" has been done:
1) replace
my_error(ER_xxx, arguments for substitution in format)
with the equivalent
my_printf_error(ER_xxx,ER(ER_xxx), arguments for substitution in
format),
so that we have ER(ER_xxx) and the arguments *in the same call site*
2) add ATTRIBUTE_FORMAT to push_warning_printf(),
Table_check_intact::report_error(), general_log_print()
3) replace ER(xxx) with the hard-coded English text found in
errmsg.txt (like: ER(ER_UNKNOWN_ERROR) is replaced with
"Unknown error"), so that a call site has the format as string literal
4) this way, ATTRIBUTE_FORMAT can effectively do its job
5) compile, fix errors detected by ATTRIBUTE_FORMAT
6) revert steps 1-2-3.
The present patch has no compiler error when submitted again to the
static code analysis above.
It cannot catch all problems though: see Field::set_warning(), in
which a call to push_warning_printf() has a variable error
(thus, not replacable by a string literal); I checked set_warning() calls
by hand though.
See also WL 5883 for one proposal to avoid such bugs from appearing
again in the future.
The issues fixed in the patch are:
a) mismatch in types (like 'int' passed to '%ld')
b) more arguments passed than specified in the format.
This patch resolves mismatches by changing the type/number of arguments,
not by changing error messages of sql/share/errmsg.txt. The latter would be wrong,
per the following old rule: errmsg.txt must be as stable as possible; no insertions
or deletions of messages, no changes of type or number of printf-like format specifiers,
are allowed, as long as the change impacts a message already released in a GA version.
If this rule is not followed:
- Connectors, which use error message numbers, will be confused (by insertions/deletions
of messages)
- using errmsg.sys of MySQL 5.1.n with mysqld of MySQL 5.1.(n+1)
could produce wrong messages or crash; such usage can easily happen if
installing 5.1.(n+1) while /etc/my.cnf still has --language=/path/to/5.1.n/xxx;
or if copying mysqld from 5.1.(n+1) into a 5.1.n installation.
When fixing b), I have verified that the superfluous arguments were not used in the format
in the first 5.1 GA (5.1.30 'bteam@astra04-20081114162938-z8mctjp6st27uobm').
Had they been used, then passing them today, even if the message doesn't use them
anymore, would have been necessary, as explained above.
include/my_getopt.h:
this function pointer is used only with "string literal" formats, so we can add
ATTRIBUTE_FORMAT.
mysql-test/collections/default.experimental:
test should pass now
sql/derror.cc:
by having a format as string literal, ATTRIBUTE_FORMAT check becomes effective.
sql/events.cc:
Change justified by the following excerpt from sql/share/errmsg.txt:
ER_EVENT_SAME_NAME
eng "Same old and new event name"
ER_EVENT_SET_VAR_ERROR
eng "Error during starting/stopping of the scheduler. Error code %u"
sql/field.cc:
ER_TOO_BIG_SCALE 42000 S1009
eng "Too big scale %d specified for column '%-.192s'. Maximum is %lu."
ER_TOO_BIG_PRECISION 42000 S1009
eng "Too big precision %d specified for column '%-.192s'. Maximum is %lu."
ER_TOO_BIG_DISPLAYWIDTH 42000 S1009
eng "Display width out of range for column '%-.192s' (max = %lu)"
sql/ha_ndbcluster.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
(sizeof() returns size_t)
sql/ha_ndbcluster_binlog.cc:
Too many arguments for:
ER_GET_ERRMSG
eng "Got error %d '%-.100s' from %s"
Patch by Jonas Oreland.
sql/ha_partition.cc:
print_admin_msg() is used only with a literal as format, so ATTRIBUTE_FORMAT
works.
sql/handler.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
(sizeof() returns size_t)
sql/item_create.cc:
ER_TOO_BIG_SCALE 42000 S1009
eng "Too big scale %d specified for column '%-.192s'. Maximum is %lu."
ER_TOO_BIG_PRECISION 42000 S1009
eng "Too big precision %d specified for column '%-.192s'. Maximum is %lu."
'c_len' and 'c_dec' are char*, passed as %d !! We don't know their value
(as strtoul() failed), but they are likely big, so we use INT_MAX.
'len' is ulong.
sql/item_func.cc:
ER_WARN_DATA_OUT_OF_RANGE 22003
eng "Out of range value for column '%s' at row %ld"
ER_CANT_FIND_UDF
eng "Can't load function '%-.192s'"
sql/item_strfunc.cc:
ER_TOO_BIG_FOR_UNCOMPRESS
eng "Uncompressed data size too large; the maximum size is %d (probably, length of uncompressed data was corrupted)"
max_allowed_packet is ulong.
sql/mysql_priv.h:
sql_print_message_func is a function _pointer_.
sql/sp_head.cc:
ER_SP_RECURSION_LIMIT
eng "Recursive limit %d (as set by the max_sp_recursion_depth variable) was exceeded for routine %.192s"
max_sp_recursion_depth is ulong
sql/sql_acl.cc:
ER_PASSWORD_NO_MATCH 42000
eng "Can't find any matching row in the user table"
ER_CANT_CREATE_USER_WITH_GRANT 42000
eng "You are not allowed to create a user with GRANT"
sql/sql_base.cc:
ER_NOT_KEYFILE
eng "Incorrect key file for table '%-.200s'; try to repair it"
ER_TOO_MANY_TABLES
eng "Too many tables; MySQL can only use %d tables in a join"
MAX_TABLES is size_t.
sql/sql_binlog.cc:
ER_UNKNOWN_ERROR
eng "Unknown error"
sql/sql_class.cc:
ER_TRUNCATED_WRONG_VALUE_FOR_FIELD
eng "Incorrect %-.32s value: '%-.128s' for column '%.192s' at row %ld"
WARN_DATA_TRUNCATED 01000
eng "Data truncated for column '%s' at row %ld"
sql/sql_connect.cc:
ER_HANDSHAKE_ERROR 08S01
eng "Bad handshake"
ER_BAD_HOST_ERROR 08S01
eng "Can't get hostname for your address"
sql/sql_insert.cc:
ER_WRONG_VALUE_COUNT_ON_ROW 21S01
eng "Column count doesn't match value count at row %ld"
sql/sql_parse.cc:
ER_WARN_HOSTNAME_WONT_WORK
eng "MySQL is started in --skip-name-resolve mode; you must restart it without this switch for this grant to work"
ER_TOO_HIGH_LEVEL_OF_NESTING_FOR_SELECT
eng "Too high level of nesting for select"
ER_UNKNOWN_ERROR
eng "Unknown error"
sql/sql_partition.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
sql/sql_plugin.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
sql/sql_prepare.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
ER_UNKNOWN_STMT_HANDLER
eng "Unknown prepared statement handler (%.*s) given to %s"
length value (for '%.*s') must be 'int', per the doc of printf()
and the code of my_vsnprintf().
sql/sql_show.cc:
ER_OUTOFMEMORY HY001 S1001
eng "Out of memory; restart server and try again (needed %d bytes)"
sql/sql_table.cc:
ER_TOO_BIG_FIELDLENGTH 42000 S1009
eng "Column length too big for column '%-.192s' (max = %lu); use BLOB or TEXT instead"
sql/table.cc:
ER_NOT_FORM_FILE
eng "Incorrect information in file: '%-.200s'"
ER_COL_COUNT_DOESNT_MATCH_PLEASE_UPDATE
eng "Column count of mysql.%s is wrong. Expected %d, found %d. Created with MySQL %d, now running %d. Please use mysql_upgrade to fix this error."
table->s->mysql_version is ulong.
sql/unireg.cc:
ER_TOO_LONG_TABLE_COMMENT
eng "Comment for table '%-.64s' is too long (max = %lu)"
ER_TOO_LONG_FIELD_COMMENT
eng "Comment for field '%-.64s' is too long (max = %lu)"
ER_TOO_BIG_ROWSIZE 42000
eng "Row size too large. The maximum row size for the used table type, not counting BLOBs, is %ld. You have to change some columns to TEXT or BLOBs"
SECONDARY INDEX IN INNODB
The patches for Bug#11751388 and Bug#11784056 enabled concurrent
reads while creating secondary indexes in InnoDB. However, they
introduced a regression. This regression occured if ALTER TABLE
failed after the index had been added, for example during the
lock upgrade needed to update .FRM. If this happened, InnoDB
and the server got out of sync with regards to which indexes
actually existed. Therefore the patch for Bug#11815600 again
disabled concurrent reads.
This patch re-enables concurrent reads. The original regression
is fixed by splitting the ADD INDEX operation into two parts.
First the new index is created but not made active. This is
done while concurrent reads are allowed. The second part of
the operation makes the index active (or reverts the change).
This is done after lock upgrade, which prevents the original
regression.
In order to implement this change, the patch changes the storage
API for in-place index creation. handler::add_index() is split
into two functions, handler_add_index() and
handler::final_add_index(). The former for creating indexes without
making them visible and the latter for commiting (i.e. making
visible) new indexes or reverting the changes.
Large parts of this patch were written by Marko Mäkelä.
Test case added to innodb_mysql_lock.test.
The partitioning engine checked the auto_increment column even if it was not to be written,
triggering a DBUG_ASSERT.
Fixed by checking if table->write_set for that column was set.
Partitions can have different ref_length (position data length).
Removed DBUG_ASSERT which crashed debug builds when using
MAX_ROWS on some partitions.
Update for previous patch according to reviewers comments.
Updated the constructors for ha_partitions to use the common
init_handler_variables functions
Added use of defines for size and offset to get better readability for the code that reads
and writes the .par file. Also refactored the get_from_handler_file function.
When executing row-ordered-retrieval index merge,
the handler was cloned, but it used the wrong
memory root, so instead of allocating memory
on the thread/query's mem_root, it used the table's
mem_root, resulting in non released memory in the
table object, and was not freed until the table was
closed.
Solution was to ensure that memory used during cloning
of a handler was allocated from the correct memory root.
This was implemented by fixing handler::clone() to also
take a name argument, so it can be used with partitioning.
And in ha_partition only allocate the ha_partition's ref, and
call the original ha_partition partitions clone() and set at cloned
partitions.
Fix of .bzrignore on Windows with VS 2010
Regression from bug#11766232.
m_last_part could be set beyond the last partition.
Fixed by only setting it if within the limit.
Also added check in print_error.
Regression introduced in bug#52455. Problem was that the
fixed function did not set the last used partition variable, resulting
in wrong partition used when storing the position of the newly
retrieved row.
Fixed by setting the last used partition in ha_partition::index_read_idx_map.
that implement add_index
The problem was that ALTER TABLE blocked reads on an InnoDB table
while adding a secondary index, even if this was not needed. It is
only needed for the final step where the .frm file is updated.
The reason queries were blocked, was that ALTER TABLE upgraded the
metadata lock from MDL_SHARED_NO_WRITE (which blocks writes) to
MDL_EXCLUSIVE (which blocks all accesses) before index creation.
The way the server handles index creation, is that storage engines
publish their capabilities to the server and the server determines
which of the following three ways this can be handled: 1) build a
new version of the table; 2) change the existing table but with
exclusive metadata lock; 3) change the existing table but without
metadata lock upgrade.
For InnoDB and secondary index creation, option 3) should have been
selected. However this failed for two reasons. First, InnoDB did
not publish this capability properly.
Second, the ALTER TABLE code failed to made proper use of the
information supplied by the storage engine. A variable
need_lock_for_indexes was set accordingly, but was not later used.
This patch fixes this problem by only doing metadata lock upgrade
before index creation/deletion if this variable has been set.
This patch also changes some of the related terminology used
in the code. Specifically the use of "fast" and "online" with
respect to ALTER TABLE. "Fast" was used to indicate that an
ALTER TABLE operation could be done without involving a
temporary table. "Fast" has been renamed "in-place" to more
accurately describe the behavior.
"Online" meant that the operation could be done without taking
a table lock. However, in the current implementation writes
are always prohibited during ALTER TABLE and an exclusive
metadata lock is held while updating the .frm, so ALTER TABLE
is not completely online. This patch replaces "online" with
"in-place", with additional comments indicating if concurrent
reads are allowed during index creation/deletion or not.
An important part of this update of terminology is renaming
of the handler flags used by handlers to indicate if index
creation/deletion can be done in-place and if concurrent reads
are allowed. For example, the HA_ONLINE_ADD_INDEX_NO_WRITES
flag has been renamed to HA_INPLACE_ADD_INDEX_NO_READ_WRITE,
while HA_ONLINE_ADD_INDEX is now HA_INPLACE_ADD_INDEX_NO_WRITE.
Note that this is a rename to clarify current behavior, the
flag values have not changed and no flags have been removed or
added.
Test case added to innodb_mysql_sync.test.