An unfortunate change to the default behavior of the handling of
core dumps was implemented in
commit e9be5428a2
by making MTR_PRINT_CORE=small the default value, that is, to
only display the stack trace of one thread in crash reports.
Many if not most failures that occur in regression tests are
sporadic and involve race conditions or deadlocks. To be able to
analyze such failures, having the stack traces of all active threads
is a must, because CI environments typically do not save any core dumps.
While the environment variable MTR_PRINT_CORE could be set in CI
environments to compensate for the unfortunate change, it is better
to revert to the old default (dumping all threads) so that
no explicit action will be required from maintainers of independent
CI systems. In that case, if something fails once in a blue moon,
we can have some hope of diagnosing it based on the output.
We fix this regression by defaulting the unset environment variable
MTR_PRINT_CORE to "medium".
first SIGTERM and if the process didn't die in 10 seconds, SIGKILL it.
This allows various tools like `rr`, `gcov`, `gprof`, etc to flush
their data to disk properly
66832e3a introduced change that prints core dumps in very detailed
format. That's completely out of user-friendliness but serves as a
measure for debugging hard-reproducible bugs.
The proper way to implement this:
1. it must be controlled by command-line and environment variable;
2. detailed traces must be default for buildbots only, for user
invocations normal stack traces should be printed.
Options for control are: MTR_PRINT_CORE and --print-core that accept
the following values:
no Don't print core
short Print stack trace of failed thread
medium Print stack traces of all threads
detailed Print all stack traces with debug context
custom:<code> Use debugger commands <code> to print stack trace
Default setting is: short (see env_or_default() call in pre_setup())
For environment variable wrong values are silently ignored (falls back
to default setting, see env_or_default()).
Command-line option --print-core (or -C) overrides environment
variable. Its default value is 'short' if not specified explicitly
(same env_or_default() call in pre_setup()). Explicit values are
checked for validity.
--print-method option can specify by which debugger we print
cores. For Windows there is only one choice: cdb. For Unix the values
are: gdb, dbx, lldb, auto. Default value is: auto
In 'auto' we try to use all possible debuggers until success.
setup_boot_args(), setup_client_args(), setup_args() traversing
datastructures on each invocation. Even if performance is not
important to perl script (though it definitely saves some CO2), this
nonetheless provokes some code-reading questions. Reading and
debugging such code is not convenient.
The better way is to prepare all the data in advance in an easily
readable form as well as do the validation step before any further
processing.
Use mtr_report() instead of die() like the other code does.
TODO: do_args() does even more data processing magic. Prepare that
data according the above strategy in advance in pre_setup() if possible.
bt full - to include args and locals.
set print sevenbit on
- it is more useful to be able to see the exact bytes
(in case something is dumped as a string and not hexadecimal digits)
set print static-members off
- there are many interesting (non-const) static members
set frame-arguments all
- even non-printables are useful to see.
Let's make our bb logs give a little bit more detail on those
hard to reproduce bugs.
Tests on rhel7's gdb-7.6.1-120.el7
mtr detects a forced combination if the command line for a test already
includes all options from this combination. options are stored in a perl
hash as (key,value) pairs.
this breaks if the command line has two options with the same name,
like --plugin-load-add=foo --plugin-load-add=bar, and the combination
forces plugin foo.
In particular, this resulted in warnings when running
federated.federatedx_versioning test
There is a server startup option --gdb a.k.a. --debug-gdb that requests
signals to be set for more convenient debugging. Most notably, SIGINT
(ctrl-c) will not be ignored, and you will be able to interrupt the
execution of the server while GDB is attached to it.
When we are debugging, the signal handlers that would normally display
a terse stack trace are useless.
When we are debugging with rr, the signal handlers may interfere with
a SIGKILL that could be sent to the process by the environment, and ruin
the rr replay trace, due to a Linux kernel bug
https://lkml.org/lkml/2021/10/31/311
To be able to diagnose bugs in kill+restart tests, we may really need
both a trace before the SIGKILL and a trace of the failure after a
subsequent server startup. So, we had better avoid hitting the problem
by simply not installing those signal handlers.
Create minidump when server fails to shutdown. If process is being
debugged, cause a debug break.
Moves some code which is part of safe_kill into mysys, as both safe_kill,
and mysqltest produce minidumps on different timeouts.
Small cleanup in wait_until_dead() - replace inefficient loop with a single
wait.
if all options from a combination from the combinations file are already
present in the server's list of options, then don't try to run tests
in other combinations from this file.
old behavior was: if at least one option from a combination is
already present in the list...
part II
need to tell SafeProcess not to collect the exited mysqld
in sleep_until_file_created(), so that it would be found in the
later wait_any_timeout() in run_testcase()
Removed a strange condition in SafeProcess::wait_one()
that treated return value of -1 from waitpid() as "process exists"
instead of as "no such child process" (see `perldoc -f waitpid`)
expect file is always removed before starting a server.
So if it exists here, it means the server started successfully,
mysqltest continued executing the test, created a new expect file,
and shut down the server. All while we were waiting for the server
to start.
In other words, if the expect file exists, the server did actually start.
Even if it isn't running now.
This fixes occasional failures of innodb.log_corruption (in 10.6)