MDEV-28430: Fix memory barrier missing of lf_alloc on Arm64

Testing MariaDB on Arm64 exposes a stall; see
https://jira.mariadb.org/browse/MDEV-28430.

The stall occurs because of an unexpected circular reference in the
LF_PINS->purgatory list which is traversed in lf_pinbox_real_free().

We found that on Arm64 the ABA problem in the LF_ALLOCATOR->top list was
not solved, which leads to various kinds of undefined behavior, including
a circular reference in the LF_PINS->purgatory list.

The following code is meant to solve the ABA problem. It is copied from
the link below.
cb4c271355/mysys/lf_alloc-pin.c (L501-)#L505

     do
     {
503     node= allocator->top;
504     lf_pin(pins, 0, node);
505  } while (node != allocator->top && LF_BACKOFF());

1. ABA problem on Arm64
The steps below show how the ABA problem occurs on Arm64. The code in
each step is simplified, and the line numbers refer to MariaDB v10.4.
------------------------------------------------------------------------
Abnormal case.
Initial state: pin = 0, top = A, top list: A->B

T1                              T2
                                step1. write top=B //seq-cst, #L517
                                step2. write A->next= "any"
                                step3. read pin==0 //relaxed, #L295
step1. write pin=A  //seq-cst, #L504
step2. read old value of top==A  //relaxed, #L505
step3. next=A->next="any" //#L517
                                step4. write A->next=B,top=A //#L420-435
step4. CAS(top,A,next) //#L517
step5. write pin=0     //#L521
------------------------------------------------------------------------
The case above arises because T1.step2 reads a stale value of top, so
"T1.step3, T1.step4" and "T2.step4" can run concurrently; in other
words, they are not mutually exclusive.

If T2.step4 happens between T1.step3 and T1.step4, the CAS in T1.step4
succeeds and top is updated to "any", which may be an in-use or invalid
address.

2. Analyzing the issue with Dekker's algorithm
The problem above can be mapped to Dekker's algorithm; see
https://en.wikipedia.org/wiki/Dekker%27s_algorithm.
The following extracts the read and write operations on 'top' and 'pin'
and maps them to Dekker's algorithm to analyze the root cause.
------------------------------------------------------------------------
Initial state: top = A, pin = 0
T1                                    T2
store_seq_cst(pin, A) // write pin    store_seq_cst(top, B)  //write top
rt= load_relaxed(top) // read top     rp= load_relaxed(pin)  //read pin

if (rt == A && rp == 0) printf("oops\n"); // will "oops" be printed?
------------------------------------------------------------------------
How T1 and T2 enter their critical sections:
(1) T1 writes pin. If T1 then reads that top has not been updated, T1
enters its critical section (T1.step3 and T1.step4, try to obtain 'A',
#L517); otherwise it gives up (T1 has no priority).
(2) T2 writes top. If T2 then reads that pin has not been updated, T2
enters its critical section (T2.step4, try to add 'A' to the top list
again, #L420-435); otherwise it waits until pin!=A (T2 has priority).

In the previous code, 'top' and 'pin' are loaded with relaxed semantics,
so on arm and ppc there is no guarantee that the above critical sections
are mutually exclusive; in other words, "oops" may be printed.

This bug only happens on arm and ppc, not on x86. With current x86
implementations, a load is always seq-cst (relaxed and seq-cst loads
generate the same machine code), as shown in
https://godbolt.org/z/sEzMvnjd9

3. Fix method
Add sequentially consistent semantics to the read of 'top' at #L505
(T1.step2), and to the reads of "el->pin[i]" at #L295 and #L320.

4. Reproducing the issue
Add a delay after #L503 in lf_alloc-pin.c. Running unit.lf then quickly
hits a segmentation fault because "top" points to an invalid address.
For details, see the comments at the link below.
https://jira.mariadb.org/browse/MDEV-28430.

5. Further improvement
To make this code more robust and safe on all platforms, we recommend
replacing volatile with C11 atomics and fixing all data races. This will
also make the code easier to reason about.

Signed-off-by: Xiaotong Niu <xiaotong.niu@arm.com>
Xiaotong Niu 2023-10-27 12:44:57 +08:00 committed by Marko Mäkelä
parent 5707f1efda
commit 8a505980c5

@@ -292,7 +292,7 @@ static int harvest_pins(LF_PINS *el, struct st_harvester *hv)
   {
     for (i= 0; i < LF_PINBOX_PINS; i++)
     {
-      void *p= el->pin[i];
+      void *p= my_atomic_loadptr((void **)&el->pin[i]);
       if (p)
         *hv->granary++= p;
     }
@@ -317,7 +317,7 @@ static int match_pins(LF_PINS *el, void *addr)
   LF_PINS *el_end= el+LF_DYNARRAY_LEVEL_LENGTH;
   for (; el < el_end; el++)
     for (i= 0; i < LF_PINBOX_PINS; i++)
-      if (el->pin[i] == addr)
+      if (my_atomic_loadptr((void **)&el->pin[i]) == addr)
         return 1;
   return 0;
 }
@@ -502,7 +502,8 @@ void *lf_alloc_new(LF_PINS *pins)
     {
       node= allocator->top;
       lf_pin(pins, 0, node);
-    } while (node != allocator->top && LF_BACKOFF());
+    } while (node != my_atomic_loadptr((void **)(char *)&allocator->top)
+             && LF_BACKOFF());
     if (!node)
     {
       node= (void *)my_malloc(allocator->element_size, MYF(MY_WME));