This patch reduces the system call overhead that precedes each query
when the threadpool is used. Previously, 3 system calls were made:
1. WSARecv() to get notified of input data from the client (the
   asynchronous equivalent of select() in one-thread-per-connection)
2. recv(4 bytes) to read the packet header, which carries the length
3. recv(packet payload)
Now there is usually just WSARecv(), which pre-reads user data into
a buffer, saving 2 syscalls.
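
The read path described above can be sketched roughly as follows. This is a
portable illustration, not the server's VIO or threadpool code: preread_conn,
preread_fill, preread_read, read_packet and the in-memory mem_source are all
hypothetical names, and the raw_read callback stands in for recv(). The 4-byte
header is parsed as a 3-byte little-endian payload length followed by a
sequence byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for recv(): returns the number of bytes actually read. */
typedef size_t (*raw_read_fn)(void *ctx, uint8_t *dst, size_t len);

typedef struct {
  uint8_t     buf[256];   /* pre-read buffer (size taken from the text) */
  size_t      pos, end;   /* unread bytes are buf[pos..end) */
  raw_read_fn raw_read;   /* fallback when the buffer runs dry */
  void       *ctx;
  unsigned    raw_calls;  /* how often the fallback was needed */
} preread_conn;

/* Hypothetical in-memory "socket" so the sketch runs anywhere. */
typedef struct { const uint8_t *data; size_t len, pos; } mem_source;

static size_t mem_read(void *ctx, uint8_t *dst, size_t len)
{
  mem_source *s = (mem_source *)ctx;
  size_t left = s->len - s->pos;
  size_t n = len < left ? len : left;
  memcpy(dst, s->data + s->pos, n);
  s->pos += n;
  return n;
}

/* Store the bytes that arrived together with the async notification. */
static void preread_fill(preread_conn *c, const uint8_t *data, size_t len)
{
  assert(len <= sizeof c->buf);
  memcpy(c->buf, data, len);
  c->pos = 0;
  c->end = len;
}

/* Serve a read from the buffer first; call raw_read only for the rest. */
static size_t preread_read(preread_conn *c, uint8_t *dst, size_t len)
{
  size_t avail = c->end - c->pos;
  size_t n = len < avail ? len : avail;
  memcpy(dst, c->buf + c->pos, n);
  c->pos += n;
  if (n < len) {
    c->raw_calls++;
    n += c->raw_read(c->ctx, dst + n, len - n);
  }
  return n;
}

/* Read one packet: 4-byte header, then payload. Returns (size_t)-1 on error. */
static size_t read_packet(preread_conn *c, uint8_t *payload, size_t cap)
{
  uint8_t hdr[4];
  if (preread_read(c, hdr, 4) != 4)
    return (size_t)-1;
  size_t len = (size_t)hdr[0] | ((size_t)hdr[1] << 8) | ((size_t)hdr[2] << 16);
  if (len > cap || preread_read(c, payload, len) != len)
    return (size_t)-1;
  return len;
}
```

When a whole small packet fits into the pre-read buffer, read_packet() never
reaches raw_read, which corresponds to the two saved recv() calls. Larger
packets still fall back to the raw read for the remainder.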
The profiler shows that the most expensive call, WSARecv(), drops from
16% CPU to 4% CPU after the patch. Benchmark results (network-heavy ones
like point-select) improve by ~20%.
The buffer management was done rather carefully to keep the
buffers together, as Windows keeps the pages pinned
in memory for the duration of async calls.
At most 1MB of memory is used for the buffers, and the per-connection
overhead is only 256 bytes, which should cover most uses.
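
One way to keep such buffers together is a single contiguous arena carved into
fixed-size slots. The sketch below assumes that layout (1MB arena, 256-byte
slots, a LIFO free list, NULL when exhausted). The names slot_pool, pool_init,
pool_get, pool_put and pool_destroy are hypothetical, and the patch itself may
organize its buffers differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum {
  SLOT_SIZE  = 256,                    /* per-connection buffer size */
  POOL_BYTES = 1 << 20,                /* 1MB cap from the text */
  SLOT_COUNT = POOL_BYTES / SLOT_SIZE  /* 4096 slots */
};

typedef struct {
  unsigned char *arena;                  /* one contiguous allocation */
  unsigned char *free_slots[SLOT_COUNT]; /* stack of currently free slots */
  size_t         n_free;
} slot_pool;

/* Allocate the arena once and push every slot onto the free stack. */
static int pool_init(slot_pool *p)
{
  p->arena = (unsigned char *)malloc(POOL_BYTES);
  if (!p->arena)
    return -1;
  for (size_t i = 0; i < SLOT_COUNT; i++)
    p->free_slots[i] = p->arena + i * SLOT_SIZE;
  p->n_free = SLOT_COUNT;
  return 0;
}

/* Returns NULL when the pool is exhausted; the caller would then
   simply skip pre-reading for that connection. */
static unsigned char *pool_get(slot_pool *p)
{
  return p->n_free ? p->free_slots[--p->n_free] : NULL;
}

/* Return a slot obtained from pool_get() to the pool. */
static void pool_put(slot_pool *p, unsigned char *slot)
{
  assert(p->n_free < SLOT_COUNT);
  p->free_slots[p->n_free++] = slot;
}

static void pool_destroy(slot_pool *p)
{
  free(p->arena);
  p->arena = NULL;
  p->n_free = 0;
}
```

Because every slot lives inside the same arena, the pages that can be pinned
during async calls are bounded by the arena size, regardless of how many
connections exist.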
SSL does not yet use the optimization, as so far it does not properly use
VIO for reads and writes. One-thread-per-connection does not get any
benefit either, but that should be fine, as it is not even the default on
Windows.
Apparently, in stats_reset_table(), the innocuous
  memset(&group->counters, 0, sizeof(group->counters));
is converted by clang to SSE2 instructions that rely on the declared
alignment.
The problem is that "group" is not correctly aligned,
despite MY_ALIGNED(CPU_LEVEL1_DCACHE_LINESIZE) in the thread_group_t
declaration.
It is not aligned because, since commit fd9f1638 (MDEV-5205), it has
been allocated with my_malloc. Previously, all_groups was a
statically allocated array.
The fix is to remove MY_ALIGNED and pad the struct instead.
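
A minimal sketch of the padding approach, under stated assumptions: the fields
of group_data are invented stand-ins for thread_group_t, CACHE_LINE is assumed
to be 64 (the source uses CPU_LEVEL1_DCACHE_LINESIZE), and calloc() stands in
for my_malloc. The point is that the padded struct keeps its natural alignment
requirement, so the compiler has no reason to assume cache-line alignment when
it lowers memset().

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed value of CPU_LEVEL1_DCACHE_LINESIZE */

/* Hypothetical fields; the real thread_group_t has different members. */
struct group_data {
  long  counters[4];
  int   thread_count;
  void *queue_head;
};

/* No alignment attribute: only the size is rounded up with explicit padding. */
struct group_padded {
  struct group_data d;
  char pad[CACHE_LINE - sizeof(struct group_data) % CACHE_LINE];
};

static_assert(sizeof(struct group_padded) % CACHE_LINE == 0,
              "padded size must be a multiple of the cache line");

/* Heap allocation only guarantees the type's natural alignment, which is
   all that group_padded asks for. */
static struct group_padded *alloc_groups(size_t n)
{
  return (struct group_padded *)calloc(n, sizeof(struct group_padded));
}

/* Same shape as the memset() in stats_reset_table(). */
static void reset_counters(struct group_padded *g)
{
  memset(g->d.counters, 0, sizeof g->d.counters);
}
```

Padding keeps consecutive array elements a multiple of the cache line apart,
but unless the array base itself happens to be cache-line aligned, an element
can still straddle two lines.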