Followup: remove this line from get_column_range_cardinality()
set_if_bigger(res, col_stats->get_avg_frequency());
and make sure it is only used with the binary histograms.
For JSON histograms, it makes the estimates unnecessarily imprecise.
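To illustrate the effect of that line, here is a minimal sketch. The helper name and the is_binary_histogram flag are hypothetical and not part of the server code; the sketch only assumes that set_if_bigger(a, b) raises a to b when b is larger.

```cpp
#include <cassert>

// Hypothetical helper, not the MariaDB function: applies the avg_frequency
// lower bound only for binary histograms.
double apply_avg_frequency_floor(double hist_estimate, double avg_frequency,
                                 bool is_binary_histogram)
{
  assert(hist_estimate >= 0.0 && avg_frequency >= 0.0);
  double res= hist_estimate;
  // Same effect as set_if_bigger(res, avg_frequency): never go below the
  // average number of rows per distinct value.
  if (is_binary_histogram && res < avg_frequency)
    res= avg_frequency;
  return res;
}
```

With the floor applied, a histogram estimate of 2 rows and an avg_frequency of 10 yields 10; without it, the more precise histogram estimate of 2 is kept.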
The previous JSON parser used an API that made parsing
inefficient: the same JSON content was parsed again and again.
Switch to a lower-level parsing API that allows the content to be
parsed efficiently.
Factor out the code that updates count, count_distinct and
count_distinct_single_occurrence into class Basic_stats_collector.
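A rough sketch of what such a collector could look like. It assumes the collector is fed once per distinct value together with the number of rows holding that value; the class name, the next() method and the getters are illustrative, not the actual definition.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only; not the definition used in the server.
class Basic_stats_collector_sketch
{
  uint64_t count= 0;                             // all rows seen
  uint64_t count_distinct= 0;                    // distinct values seen
  uint64_t count_distinct_single_occurrence= 0;  // values seen exactly once
public:
  // Assumed to be called once per distinct value; elem_cnt is the number
  // of rows that hold this value.
  void next(uint64_t elem_cnt)
  {
    assert(elem_cnt > 0);
    count_distinct++;
    if (elem_cnt == 1)
      count_distinct_single_occurrence++;
    count+= elem_cnt;
  }
  uint64_t get_count() const { return count; }
  uint64_t get_count_distinct() const { return count_distinct; }
  uint64_t get_count_single_occurrence() const
  { return count_distinct_single_occurrence; }
};
```

Keeping the counters in one place lets every histogram builder share the same bookkeeping.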
Change from Histogram_builder and its descendant Histogram_builder_json
to Histogram_builder (the interface) and its two implementations,
Histogram_binary_builder and Histogram_json_builder.
In Histogram_json_builder, do not forget to collect the right bound
of the right-most bucket.
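The right-bound issue can be sketched as follows. This is a simplified model under stated assumptions: values arrive in sorted order, every bucket holds a fixed number of values, and all names (the _sketch classes, next(), finalize()) are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Interface: callers feed sorted values and then finalize.
class Histogram_builder_sketch
{
public:
  virtual ~Histogram_builder_sketch()= default;
  virtual void next(const std::string &value)= 0;
  virtual void finalize()= 0;
};

// One implementation: records where each bucket starts and, on finalize,
// the right bound of the right-most bucket.
class Histogram_json_builder_sketch : public Histogram_builder_sketch
{
  size_t bucket_capacity;
  size_t in_bucket= 0;
  bool have_value= false;
  std::string last_value;
public:
  std::vector<std::string> bucket_starts;
  std::string last_bucket_end;

  explicit Histogram_json_builder_sketch(size_t capacity)
    : bucket_capacity(capacity)
  {
    assert(capacity > 0);
  }
  void next(const std::string &value) override
  {
    if (in_bucket == 0)
      bucket_starts.push_back(value);   // first value opens a new bucket
    if (++in_bucket == bucket_capacity)
      in_bucket= 0;                     // bucket is full
    last_value= value;
    have_value= true;
  }
  void finalize() override
  {
    // Bucket starts alone do not say where the last bucket ends, so the
    // largest value seen has to be stored explicitly.
    if (have_value)
      last_bucket_end= last_value;
  }
};
```

Feeding "a".."e" with a capacity of 2 gives bucket starts a, c, e; without the finalize() step the histogram would have no upper bound for the bucket that starts at e.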
* It also adds an "explain select" statement to the test so that the fprintf calls
can print the computed intervals to mysqld.1.err.
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
This fixes the wrong calculation of avg_frequency in JSON histograms
by replacing the specific histogram objects with the generic Histogram_base class.
It also restores the get/set size functions, as they were useful in calculating fields
for binary histograms.
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
A demo of how to use an in-memory data structure for histograms.
The patch shows how to
* convert the string form of data to binary form
* compare two values in binary form
* compute a fraction for val in the [X, Y] range.
grep for GSOC-TODO for notes.
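The three steps can be sketched for unsigned integer values as below. This is not the code from the patch: the 8-byte big-endian key format, the Key_sketch type and all function names are assumptions chosen so that a plain byte comparison orders keys the same way as the numbers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

using Key_sketch= std::array<unsigned char, 8>;

// Step 1: string form -> binary form (big-endian, so memcmp order matches
// numeric order).
Key_sketch to_binary(const std::string &text)
{
  uint64_t v= std::stoull(text);
  Key_sketch key;
  for (int i= 7; i >= 0; i--)
  {
    key[i]= static_cast<unsigned char>(v & 0xFF);
    v>>= 8;
  }
  return key;
}

// Step 2: compare two values in binary form.
int compare_keys(const Key_sketch &a, const Key_sketch &b)
{
  return std::memcmp(a.data(), b.data(), a.size());
}

// Helper: decode a key back into a number for the arithmetic in step 3.
uint64_t to_number(const Key_sketch &key)
{
  uint64_t v= 0;
  for (unsigned char byte : key)
    v= (v << 8) | byte;
  return v;
}

// Step 3: position of val inside [x, y], clamped to [0, 1].
double fraction_in_range(const Key_sketch &val, const Key_sketch &x,
                         const Key_sketch &y)
{
  assert(compare_keys(x, y) <= 0);
  if (compare_keys(val, x) <= 0)
    return 0.0;
  if (compare_keys(val, y) >= 0)
    return 1.0;
  return double(to_number(val) - to_number(x)) /
         double(to_number(y) - to_number(x));
}
```

For example, 15 lies halfway through [10, 20], so the fraction is 0.5. Note that "2" sorts before "10" in binary form, although it would not as a string.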
Preparation for handling different kinds of histograms:
- In Column_statistics, change "Histogram histogram" into
"Histogram *histogram_". This allows for different kinds
of Histogram classes with virtual functions.
- [Almost] remove the usage of Histogram->set_values and
Histogram->set_size. The code outside the histogram should
not make any assumptions about what/how is stored in the Histogram.
- Introduce drafts of methods to read/save histograms to/from disk.
This fixes the memory allocation for the JSON histogram builder and adds more column types for testing.
Some challenges at the moment include:
* A garbage value at the end of the JSON array still persists.
* A garbage value also gets appended to bucket values if the column is a primary key.
* There's a memory leak resulting in a "Warning: Memory not freed" message at the end of tests.
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
The issue was that histogram statistics were being used even when
the level of optimizer_use_condition_selectivity does not allow
the use of statistics from histograms.
The histogram statistics are read for a table only when
optimizer_use_condition_selectivity > 3. But the TABLE structure can be
stored in the internal table cache and be reused for the next query.
So in this case the histogram statistics will be available for the next query.
The fix is to make sure the histogram statistics are used only when
optimizer_use_condition_selectivity > 3.
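In outline, the check looks like the sketch below. The function and parameter names are hypothetical; only the "> 3" threshold comes from the description above.

```cpp
#include <cassert>

// Hypothetical sketch: choose which selectivity estimate to use.
// 'level' stands for the session value of optimizer_use_condition_selectivity.
double pick_selectivity(unsigned level, bool histogram_cached,
                        double histogram_sel, double fallback_sel)
{
  assert(histogram_sel >= 0.0 && histogram_sel <= 1.0);
  assert(fallback_sel >= 0.0 && fallback_sel <= 1.0);
  // A histogram left in the cached TABLE by an earlier query is not enough:
  // the current level must also allow histogram usage.
  if (level > 3 && histogram_cached)
    return histogram_sel;
  return fallback_sel;
}
```

The point is that the decision depends on the current query's setting, not on whether histogram data happens to be present in the cached TABLE.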
An overflow was happening on Windows because sizeof(ulong) is 4 bytes there,
while it is 8 bytes on Linux.
Switched avg_frequency and avg_length for column statistics to ulonglong.
Switched avg_frequency for index statistics to ulonglong.
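The difference can be reproduced with fixed-width types. In this sketch uint32_t stands in for a 4-byte ulong and uint64_t for ulonglong; the function names and the multiplication by a scale factor are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for arithmetic in a 4-byte ulong: the result is truncated
// to the low 32 bits.
uint32_t scale_with_32_bits(uint32_t value, uint32_t scale)
{
  return static_cast<uint32_t>(static_cast<uint64_t>(value) * scale);
}

// Stand-in for arithmetic in ulonglong: the same product fits.
uint64_t scale_with_64_bits(uint64_t value, uint64_t scale)
{
  assert(scale == 0 || value <= UINT64_MAX / scale);  // no 64-bit overflow
  return value * scale;
}
```

For instance, 3000000000 * 2 is 6000000000 in 64 bits, but only 1705032704 (6000000000 modulo 2^32) survives in 32 bits.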
Previously, multiple threads were allowed to load histograms concurrently.
There were no known problems caused by this, but given the number of data
races in this code, one would surface sooner or later.
To avoid a scalability bottleneck, histogram loading is protected by a
per-TABLE_SHARE atomic variable.
Whenever histograms were already loaded by a preceding statement (hot path), a
scalable load-acquire check is performed.
Whenever histograms have to be loaded anew, mutual exclusion for loaders
is established by the atomic variable. If histograms are being loaded
concurrently, the statement waits until the load is completed.
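The loading protocol described above can be sketched with std::atomic. This is an illustration, not the server code: the state names, the class, and the choice to reset the state after a failed load and to wait by yielding are all assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum class Hist_state { EMPTY, LOADING, READY };

// Hypothetical gate; in the server the variable would live per TABLE_SHARE.
class Histogram_load_gate_sketch
{
  std::atomic<Hist_state> state{Hist_state::EMPTY};
public:
  // Returns true if the caller became the loader and must call end_load().
  // Returns false if histograms are already loaded.
  bool start_load()
  {
    for (;;)
    {
      // Hot path: a single load-acquire when histograms are already there.
      Hist_state seen= state.load(std::memory_order_acquire);
      if (seen == Hist_state::READY)
        return false;
      // Only one thread wins the EMPTY -> LOADING transition.
      if (seen == Hist_state::EMPTY &&
          state.compare_exchange_weak(seen, Hist_state::LOADING,
                                      std::memory_order_acquire,
                                      std::memory_order_relaxed))
        return true;
      // Another thread is loading (or the CAS failed): wait and retry.
      std::this_thread::yield();
    }
  }
  // Publishes the loaded data, or reopens the gate if loading failed.
  void end_load(bool success)
  {
    assert(state.load(std::memory_order_relaxed) == Hist_state::LOADING);
    state.store(success ? Hist_state::READY : Hist_state::EMPTY,
                std::memory_order_release);
  }
  bool is_ready() const
  {
    return state.load(std::memory_order_acquire) == Hist_state::READY;
  }
};
```

The release store in end_load() pairs with the acquire load on the hot path, so a thread that observes READY also observes the histogram data written by the loader.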
- Table_statistics::total_hist_size moved to TABLE_STATISTICS_CB: only
meaningful within TABLE_SHARE (not used for collected stats).
- TABLE_STATISTICS_CB::histograms_can_be_read and
TABLE_STATISTICS_CB::histograms_are_read are replaced with a tri-state
atomic variable.
- Simplified away alloc_histograms_for_table_share().
Note: there is still likely a data race if a thread attempts to access
histogram data after it failed to load it (because of a concurrent load).
It was there previously and is out of the scope of this effort. One way
of fixing it could be to revive TABLE::histograms_are_read and add
appropriate checks wherever they are needed.
Part of MDEV-19061 - table_share used for reading statistical tables is
not protected
read_statistics_for_tables_if_needed
Regression after 279a907: read_statistics_for_tables_if_needed() was
called even after open_normal_and_derived_tables() had failed.
Fixed by moving the read_statistics_for_tables() call to the branch of
get_schema_stat_record() where the result of open_normal_and_derived_tables()
is checked.
Removed THD::force_read_stats, added read_statistics_for_tables() instead.
Simplified away statistics_for_command_is_needed().