-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Set the 4 name
fields to utf8mb4
#885
Comments
This seems worth doing, if only for consistencies' sake.
It really shouldn't, but more tests is always good. |
Thanks @lentinj: really helpful to know that it shouldn't make a difference (e.g. when matching on name)
I'm not sure how we would automatically test performance here. I don't think we are doing perf tests at the moment. |
We're not, and doing so in an automated fashion is hard work. But an API exerciser producing timings would be useful here and pre/post upgrade validation (for example). Using apachebench to do this rather than a python script would mean it could be used for load testing at some point as well. |
I tested utf8mb4 and oddly, searches stopped working (no results show up at all). Here's what I ran (successfully):
Notably, test_database_settings.py did pass with this change. When I changed those all to utf8 instead, my searches became fast for the first time. One of the tests started failing, though:
Also, in order to run these SQL queries successfully, I had to adjust server-side timeouts in /etc/mysql/my.cnf:
I'm not sure exactly which of these config changes were truly needed (ChatGPT suggested these ...), but I can confidently say that client-side settings in MySQL Workbench weren't enough. I can take a closer look later to try to see what might have caused utf8mb4 to fail. Apparently utf8mb3 is deprecated and utf8mb4 is the new default, so it seems worthwhile to make the change: |
Thanks for digging. Does your DB connection string in appconfig.ini specify the utf8mb4 encoding:
? |
Yes:
|
How odd: in order to try to debug this, I reverted to utf8mb4 ... but now searches continue to work properly. In all cases I've taken care to make sure the queries completed successfully. The only rational explanation (aside from a bug somewhere) is that the change to utf8 truncated some values outside the charset that were causing trouble such that when going back to utf8mb4 it was no longer a problem. I can try recreating my container to reproduce this. |
Ah, it could be the truncation thing, yes! |
Turns out the SQL query was failing but the exception was swallowed. It points directly to the issue:
There were a few bugs I had to solve in create_db_indexes.sql to fix this. First, it was only updating the collation of a column if the column wasn't already utf8mb4, so it would leave inconsistent collation in. Once I fixed the conditional, it was still failing, and that's because the ALTER TABLE ... CONVERT TO CHARACTER SET affects every column in the table and resets their collation (reference). Hence, multiple MakeFullUnicode calls were resetting the collation from previous calls. I think the intention here was actually ALTER TABLE ... DEFAULT CHARACTER SET. When I changed to that, it worked. Here's the full amended function:
With that change, searches work fine with utf8mb4. I do want to note that this other collation, "utf8mb4_0900_ai_ci", is newer than the one that is being explicitly set here, and is the new default for utf8mb4. Among other things, it makes string comparisons insensitive to accents on letters. Hence, the simplest solution might be to drop the explicit collation, leave the CONVERT TO CHARACTER SET query as-is, and take advantage of defaults to be consistent. I suppose there might be a benefit in being explicit, though, as defaults change in new versions of MySQL, and to fix any columns that might have been changed with earlier queries. If you'd like me to submit the most conservative fix, I'm happy to send a PR for the update shown above. |
I think explicit is good, but I'd be happy to go with the new mysql default, or whatever seems most sensible. FYI, it looks like we are running mysql |
I'm going to send out a pull request shortly. That's good to know you are running an earlier version of MySQL -- this new collation I mentioned was introduced in MySQL 8 (which is what the Docker image uses), so we will need to stick with the existing one, but we can at least apply it consistently. Regardless, I don't think we would need to change the DB connection string, which doesn't mention the collation (only charset). I am curious why the charset mismatch that I originally reported doesn't affect your prod instance (for me, it caused both slowness and a failed test). It shouldn't have anything to do with the MySQL version: it's a direct consequence of the "ALTER TABLE ... CONVERT TO CHARACTER SET" query which is documented to affect every column in the table, so that would have unintentionally set some columns to utf8mb4. Did you not run the create_db_indexes.sql script in prod, or perhaps you ran follow up queries to fix this? I would be curious to see what these queries return: SELECT character_set_name, column_type, collation_name FROM information_schema.`COLUMNS`
WHERE table_schema = SCHEMA() AND table_name = 'vernacular_by_name' AND column_name = 'name';
SELECT character_set_name, column_type, collation_name FROM information_schema.`COLUMNS`
WHERE table_schema = SCHEMA() AND table_name = 'ordered_leaves' AND column_name = 'name'; |
In
OZtree/OZprivate/ServerScripts/SQL/create_db_indexes.sql
Line 39 in f90d82c
vernacular_by_name.name
,vernacular_by_ott.name
,ordered_nodes.name
andordered_leaves.name
. Although these should only contain scientific ("latin") names, and so should not need full 4-byte unicode, it might be as well to add those columns to the list toMakeFullUnicode()
too.I assume that using
utf8mb4
won't make searching these fields noticeably slower, but perhaps it's worth testing?The text was updated successfully, but these errors were encountered: