mariadb/mysql-test/std_data/ldml
Michal Schorm 5a1f349b82 MDEV-18359, MDEV-26905: Fix invalid XML in charsets Index.xml
Summary:
The charset definition files sql/share/charsets/Index.xml and
mysql-test/std_data/ldml/Index.xml contained duplicate "flag" attributes
on single <collation> elements, violating XML well-formedness rules.
Standard XML parsers (xmllint, libxml2, etc.) reject duplicate attributes,
making these files unparseable by any spec-compliant tool.

Root Cause:
When the nopad_bin collations were added, both of their flags were
written as attributes with the same name: flag="binary" flag="nopad".
The XML specification (Section 3.1, Well-formedness constraint: Unique
Att Spec) prohibits an attribute name from appearing more than once on
a single element. MariaDB's custom XML parser in strings/xml.c happened
to accept both because it handles attributes sequentially in a while
loop, but that behavior is non-standard and breaks interoperability
with standard XML tooling.
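
The rejection by a standard parser can be reproduced with a minimal
sketch. It uses Python's xml.etree.ElementTree (backed by expat) as a
stand-in for "any spec-compliant tool"; the helper name is_well_formed
is hypothetical and not part of the patch.

```python
import xml.etree.ElementTree as ET

# The element as it appeared before the fix: two attributes named "flag".
BEFORE = '<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'

# The element after the fix: flags expressed as child nodes.
AFTER = (
    '<collation name="latin2_nopad_bin" id="1101">'
    "<flag>binary</flag><flag>nopad</flag>"
    "</collation>"
)


def is_well_formed(xml_text):
    """Return True if a standard parser accepts the fragment (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # expat reports duplicate attributes as a parse error.
        return False


print("before:", is_well_formed(BEFORE))
print("after: ", is_well_formed(AFTER))
```

The BEFORE fragment should be reported as not well-formed and the AFTER
fragment as well-formed.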

What the patch does:
Converts all 24 collations that carried duplicate flag attributes
from self-closing elements into elements with child <flag> nodes.
This follows the existing pattern already used by many collations
in the same file (e.g., big5_chinese_ci, latin1_swedish_ci,
utf8mb3_general_ci).

Before (invalid XML):
  <collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>

After (valid XML):
  <collation name="latin2_nopad_bin" id="1101">
    <flag>binary</flag>
    <flag>nopad</flag>
  </collation>
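
The same transformation can be sketched as a small script. This is an
illustration only, not necessarily how the patch was produced; the
function name fix_duplicate_flags and both regular expressions are
hypothetical, and the sketch assumes each affected element is
self-closing and sits on a single line.

```python
import re

# A self-closing <collation .../> element, capturing its indentation and attributes.
SELF_CLOSING = re.compile(r"(?P<indent>[ \t]*)<collation(?P<attrs>[^>]*?)\s*/>")
# One flag="..." attribute together with its leading whitespace.
FLAG_ATTR = re.compile(r'\s+flag="([^"]*)"')


def _rewrite(match):
    indent, attrs = match.group("indent"), match.group("attrs")
    flags = FLAG_ATTR.findall(attrs)
    if len(flags) < 2:
        # Zero or one flag attribute is already valid XML; leave it alone.
        return match.group(0)
    kept = FLAG_ATTR.sub("", attrs)  # attributes other than flag
    children = "".join(f"{indent}  <flag>{f}</flag>\n" for f in flags)
    return f"{indent}<collation{kept}>\n{children}{indent}</collation>"


def fix_duplicate_flags(text):
    """Rewrite duplicate flag attributes as child <flag> nodes (hypothetical helper)."""
    return SELF_CLOSING.sub(_rewrite, text)


before = '  <collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'
print(fix_duplicate_flags(before))
```

Elements with a single flag attribute are left unchanged, so running the
sketch over an already-fixed file is a no-op.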

No C code changes are required. The _CS_FLAG handler in
strings/ctype.c (around line 621) already processes <flag> child
elements using bitwise OR (|=) to accumulate flags, so both "binary"
(MY_CS_BINSORT) and "nopad" (MY_CS_NOPAD) flags are correctly applied.
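
The accumulation described above can be modeled in a few lines. This is
a Python sketch of the idea, not the C code in strings/ctype.c: the bit
values below are placeholders chosen for illustration, and
collation_flags is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Placeholder bit values for illustration only; the real constants are
# defined in the MariaDB headers and are not reproduced here.
MY_CS_BINSORT = 1 << 0
MY_CS_NOPAD = 1 << 1

FLAG_BITS = {"binary": MY_CS_BINSORT, "nopad": MY_CS_NOPAD}


def collation_flags(xml_text):
    """OR together the bits of every <flag> child (hypothetical helper)."""
    elem = ET.fromstring(xml_text)
    state = 0
    for flag in elem.findall("flag"):
        # Each child contributes one bit; |= keeps bits set by earlier children.
        state |= FLAG_BITS[(flag.text or "").strip()]
    return state


AFTER = (
    '<collation name="latin2_nopad_bin" id="1101">'
    "<flag>binary</flag><flag>nopad</flag>"
    "</collation>"
)
flags = collation_flags(AFTER)
print(bool(flags & MY_CS_BINSORT), bool(flags & MY_CS_NOPAD))
```

Because each child only sets its own bit, the order of the <flag>
elements does not affect the result.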

Files modified:
- sql/share/charsets/Index.xml (23 collations fixed)
- mysql-test/std_data/ldml/Index.xml (1 collation fixed)

Complete list of 24 collations fixed:

sql/share/charsets/Index.xml:
 1. latin2_nopad_bin     (id=1101)
 2. dec8_nopad_bin       (id=1093)
 3. cp850_nopad_bin      (id=1104)
 4. hp8_nopad_bin        (id=1096)
 5. koi8r_nopad_bin      (id=1098)
 6. swe7_nopad_bin       (id=1106)
 7. ascii_nopad_bin      (id=1089)
 8. cp1251_nopad_bin     (id=1074)
 9. hebrew_nopad_bin     (id=1095)
10. latin7_nopad_bin     (id=1103)
11. koi8u_nopad_bin      (id=1099)
12. greek_nopad_bin      (id=1094)
13. cp1250_nopad_bin     (id=1090)
14. cp1257_nopad_bin     (id=1082)
15. latin5_nopad_bin     (id=1102)
16. armscii8_nopad_bin   (id=1088)
17. cp866_nopad_bin      (id=1092)
18. keybcs2_nopad_bin    (id=1097)
19. macce_nopad_bin      (id=1067)
20. macroman_nopad_bin   (id=1077)
21. cp852_nopad_bin      (id=1105)
22. cp1256_nopad_bin     (id=1091)
23. geostd8_nopad_bin    (id=1117)

mysql-test/std_data/ldml/Index.xml:
24. ascii2_nopad_bin     (id=325)

Validation:
- xmllint --noout passes cleanly on both files after the fix
- Zero duplicate flag attributes remain (verified with grep)
- The fix is consistent with the existing pattern used by other
  collations in the same files
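
The first two checks can also be scripted. The sketch below is an
assumption-laden equivalent of the xmllint and grep steps, not the
commands actually run for this commit; check_index_xml and the DUP_FLAG
pattern are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches a <collation ...> start tag that contains flag="..." twice.
DUP_FLAG = re.compile(r'<collation\b[^>]*\bflag="[^"]*"[^>]*\bflag="')


def check_index_xml(text):
    """Return (well_formed, duplicate_count) for the given XML text."""
    duplicates = len(DUP_FLAG.findall(text))
    try:
        ET.fromstring(text)
        well_formed = True
    except ET.ParseError:
        well_formed = False
    return well_formed, duplicates


broken = (
    '<charsets><charset name="latin2">'
    '<collation name="latin2_nopad_bin" id="1101" flag="binary" flag="nopad"/>'
    "</charset></charsets>"
)
fixed = (
    '<charsets><charset name="latin2">'
    '<collation name="latin2_nopad_bin" id="1101">'
    "<flag>binary</flag><flag>nopad</flag></collation>"
    "</charset></charsets>"
)
print(check_index_xml(broken))
print(check_index_xml(fixed))
```

To check the real files, read each Index.xml into a string and pass it
to check_index_xml; a fixed file should be well-formed with a duplicate
count of zero.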

Co-Authored-By: Claude AI <noreply@anthropic.com>
2026-03-31 16:01:05 +03:00
ascii2.xml MDEV-10743 LDML: a new syntax to reuse sort order from another 8bit simple collation 2016-09-06 12:37:11 +04:00
Index.xml MDEV-18359, MDEV-26905: Fix invalid XML in charsets Index.xml 2026-03-31 16:01:05 +03:00
latin1.xml Merge 10.1 into 10.2 2019-05-13 17:54:04 +03:00