MariaDB server prints the stack information if a crash happens.
It traverses the stack frames in function `print_with_addr_resolve`.
For *EACH* frame, it tries to parse the file name and line number of the
frame using `addr2line`, or prints `backtrace_symbols_fd` if `addr2line`
fails.
1. Logic in `addr_resolve` function uses addr2line to get the file name
and line numbers. It has a timeout of 500ms to wait for the response
from addr2line. However, that's not enough on small instances
especially if the debug information is in a separate file or
compressed.
Increase the timeout to 5 seconds to support some edge cases, as
experiments showed addr2line may take 2-3 seconds on some frames.
2. While parsing a frame inside of a shared library using `addr2line`,
the file name and line numbers could be `??`, empty or `0` if the
debug info is not loaded.
It's easy to reproduce when glibc-debuginfo is not installed.
Instead of printing a meaningless frame like:
:0(__GI___poll)[0x1505e9197639]
...
??:0(__libc_start_main)[0x7ffff6c8913a]
We want to print the frame information using `backtrace_symbols_fd`,
with the shared library name and a hexadecimal offset.
Stacktrace example on a real instance with this commit:
/lib64/libc.so.6(__poll+0x49)[0x145cbf71a639]
...
/lib64/libc.so.6(__libc_start_main+0xea)[0x7f4d0034d13a]
`addr_resolve` has considered the case of meaningless combination of
file name and line number returned by `addr2line`. e.g. `??:?`
However, conditions like `:0` and `??:0` are not handled. So now the
function will rollback to `backtrace_symbols_fd` in above cases.
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.
When trying to output stacktrace, and addr2line is not installed, the
child process forked by start_addr2line_fork() will fail to do exec(),
and finish with exit(1).
There is a problem with exit() though - it runs exit handlers,
and for the forked copy of crashing process, it is a bad idea.
In 10.5+ code for example, exit handlers include
tpool::task_group static destructors, and it will hang infinitely
waiting for completion of the outstanding tasks.
The fix is to use _exit() instead, which skips the execution of exit
handlers
When a server is compiled with -fPIE, my_addr_resolve needs to
subtract the info.dli_fbase from symbol addresses in memory for
addr2line to recognize them. When a server is compiled without -fPIE,
my_addr_resolve should not do it. Unfortunately not all compilers
define __PIE__ when -fPIE was used (e.g. older gcc doesn't), so we
have to resort to run-time detection.
When a server is compiled with -fPIE, my_addr_resolve needs to subtract the info.dli_fbase from symbol addresses in memory for addr2line to recognize them.
When a server is compiled without -fPIE, my_addr_resolve should not do it.
Unfortunately not all compilers define __PIE__ when -fPIE was used
(e.g. older gcc doesn't), so we have to resort to run-time detection.
Encountered the linker failure on Debug build in 10.4:
[53/585] Linking CXX executable unittest/sql/mf_iocache-t
FAILED: unittest/sql/mf_iocache-t
: && /usr/bin/c++ -pie -fPIC -fstack-protector --param=ssp-buffer-size=4 -fPIC -g -DENABLED_DEBUG_SYNC -ggdb3 -DSAFE_MUTEX -DSAFEMALLOC -DTRASH_FREED_MEMORY -Wall -Wextra -Wno-format-truncation -Wno-init-self -Wno-nonnull-compare -Wno-unused-parameter -Woverloaded-virtual -Wnon-virtual-dtor -Wvla -Wwrite-strings -Werror -Wl,-z,relro,-z,now unittest/sql/CMakeFiles/mf_iocache-t.dir/mf_iocache-t.cc.o unittest/sql/CMakeFiles/mf_iocache-t.dir/__/__/sql/mf_iocache_encr.cc.o -o unittest/sql/mf_iocache-t -lpthread mysys/libmysys.a unittest/mytap/libmytap.a mysys_ssl/libmysys_ssl.a mysys/libmysys.a dbug/libdbug.a mysys/libmysys.a dbug/libdbug.a -lz -lm strings/libstrings.a -lpthread -lssl -lcrypto -ldl && :
/usr/bin/ld: mysys/libmysys.a(my_addr_resolve.c.o):/home/dan/repos/mariadb-server-10.4/mysys/my_addr_resolve.c:173: multiple definition of `info'; unittest/sql/CMakeFiles/mf_iocache-t.dir/mf_iocache-t.cc.o:/home/dan/repos/mariadb-server-10.4/unittest/sql/mf_iocache-t.cc:99: first defined here
We make Dl_info static as in MDEV-21646 moving it out of the function
was the main goal and having it scope limited by static doesn't affect
the function.
According to https://stackoverflow.com/questions/22827510/how-to-avoid-bad-fd-set-buffer-overflow-crash
it seems that using select instead of poll can cause additional memory
allocations. As we are in a crashed state, we must prevent allocating
any memory (if possible). Thus, switch select call to poll.
Also move some bigger datastructures to global space. The code is not
run in a multithreaded context so best we don't use up stack space
if it's not needed.
Resolving a stacktrace including functions in dynamic libraries requires
us to look inside the libraries for the symbols. Addr2line needs to be
started with the correct binary for each address on the stack. To do this,
figure out which library it is using dladdr, then if the addr2line
binary was started with a different binary, fork it again with the
correct one.
We only have one addr2line process running at any point during the
stacktrace resolving step. The maximum number of forks for addr2line should
generally be around 6.
One for server stacktrace code, one for plugin code, one when going back
into server code, one for pthread library, one for libc, one for the
_start function in the server. More can come up if plugin calls server
function which goes back to a plugin, etc.
In order to get all the input from addr2line we must read in a loop,
until the response is complete. Also, in case that the response is
malformed, we must not end up reading invalid memory.
Take into account that PIE binaries are loaded at some offset, so
addresses cannot be directly resolved with addr2line. Find this
offset and subtract it before resolving an address.
mysql-test/suite/innodb/t/group_commit_crash.test:
remove autoincrement to avoid rbr being used for insert ... select
mysql-test/suite/innodb/t/group_commit_crash_no_optimize_thread.test:
remove autoincrement to avoid rbr being used for insert ... select
mysys/my_addr_resolve.c:
a pointer to a buffer is returned to the caller -> the buffer cannot be on the stack
mysys/stacktrace.c:
my_vsnprintf() is ok here, in 5.5
include/mysql_com.h:
remove "shutdown levels" that aren't shutdown levels from mysql_enum_shutdown_level
mysys/my_addr_resolve.c:
my_snprintf in 5.5 (but not in 5.3) supports %p
sql/item_func.cc:
use a method (that exists only in 5.5) instead of directly accessing a member
sql/item_subselect.cc:
use a method (that exists only in 5.5) instead of directly accessing a member
sql/opt_subselect.cc:
use a method (that exists only in 5.5) instead of directly accessing a member
sql/sql_select.cc:
use a method (that exists only in 5.5) instead of directly accessing a member
fix safemalloc to compile w/o libbfd.
CMakeLists.txt:
NOT_FOR_DISTRIBUTION option
cmake/readline.cmake:
simplify libedit/readline detection.
never use bundled libedit.
use system readline v6 only if NOT_FOR_DISTRIBUTION=1
configure.cmake:
use libbfd only if NOT_FOR_DISTRIBUTION=1
include/my_stacktrace.h:
link with libbfd even w/o safemalloc.