Textpattern CMS support forum

#121 2024-03-01 15:18:51

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,445

Re: etc_search: when the default search is not enough

I read that WordPress uses remove_accents() on content before it does a search. But if this attempt at recreating it is any indication then it only works for European languages.

We have our dumbDown() function which is kind of equivalent for creating URLs, but again it’s not exactly comprehensive.

Seems that databases are geared up for accent-(in)sensitive queries but PHP isn’t.
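For anyone following along, here is a minimal sketch of the accent-stripping idea, written in Python only to keep it short. It is not WordPress's remove_accents() or Textpattern's dumbDown(); the function name strip_accents is made up for illustration. It decomposes each character (NFD) and drops the combining marks, which is roughly why this family of approaches copes with common European accents but not with letters that have no decomposition.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as accents
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Ünïcödé café"))  # Unicode cafe
# Letters without a canonical decomposition (Ł, ß, Æ, ø) pass through untouched
print(strip_accents("Łódź, straße, Ærø"))  # Łodz, straße, Ærø
```

The second call shows the gap: anything that is a letter in its own right rather than "base letter plus accent" still needs a hand-written lookup table.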


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp


#122 2024-03-01 15:57:38

etc
Developer
Registered: 2010-11-11
Posts: 5,186

Bloke wrote #336814:

Seems that databases are geared up for accent-(in)sensitive queries

Seemingly this depends on the collation. We could assume that it is utf8mb4_unicode_ci in most cases and tweak the PHP regex patterns accordingly, but even then I don’t know exactly how the equivalent character groups (u, ü, ù, etc.) are defined.

And here is an example that suggests that accented character groups (and thus LIKE query results) depend on collation:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+
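To make the point concrete outside the database, here is a rough sketch of what emulating that grouping in application code would involve. The FOLDS tables are hand-written assumptions that cover only the three characters from the example and simply mirror the output above; they are not real collation data, and group_by_collation is a hypothetical name.

```python
# Hypothetical per-collation fold tables, covering only a, ä and á.
# They are written by hand to mirror the MySQL output above.
FOLDS = {
    "utf8_icelandic_ci": {},                    # all three stay distinct
    "utf8_danish_ci": {"á": "a"},               # ä stays separate
    "utf8_general_ci": {"á": "a", "ä": "a"},    # everything folds to a
}


def group_by_collation(chars, collation):
    """Group characters that the given (assumed) fold table treats as equal."""
    fold = FOLDS[collation]
    groups = {}
    for ch in chars:
        groups.setdefault(fold.get(ch, ch), []).append(ch)
    return [",".join(members) for members in groups.values()]


for name in FOLDS:
    print(name, group_by_collation(["a", "ä", "á"], name))
```

Groups come out in insertion order rather than collation order, so only the membership is comparable with the query results. The takeaway is that each collation needs its own table.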


#123 2024-03-01 20:35:04

jakob
Admin
From: Germany
Registered: 2005-01-20
Posts: 4,726

Interesting … and also rather complicated.

The site in question I am working with uses utf8mb4_unicode_ci but presumably you’d like to have a more universal solution. FWIW, it seems Textpattern assumes that if you are using utf8mb4, then your collation is utf8mb4_unicode_ci. Ruud’s rvm_utf8_to_utf8mb4 makes the same assumption. Maybe that gives you a reasonable assumption over what could/should be the standard setup.

I guess you saw this, but in a subsequent post on the page you linked to, there’s a (broken) link to collation-charts.org – this page for utf8_unicode_ci (European chars) shows what I assume to be the equivalent character groups. That was for an earlier version (MySQL 6.0.4) and also the non-multibyte variant, but possibly still instructive. I’m willing to test where I can.

(Oh, and thanks again for the time you’re giving to this)



#124 2024-03-01 20:44:21

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,445

etc wrote #336815:

Seemingly this depends on the collation.

Well yeah. I meant that MySQL natively caters for the varying equivalences, depending on which collation the user chooses. But PHP has no notion of equivalence: its string matches are always exact.

So if we’re asking the database to find stuff and then asking PHP to interpret/display them, the results aren’t going to match. The only way I can see of getting them to match the collation rules is to pass the database results verbatim to the page.
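A small sketch of that mismatch, again in Python for brevity and under stated assumptions: the fold() rule here (strip combining marks, then casefold) is a stand-in, not the database's actual collation, and find_folded is a made-up helper. An exact substring test misses a row the database would happily return, whereas matching on a folded copy and mapping the offsets back to the original text at least lets the application highlight what the database matched.

```python
import unicodedata


def fold(ch: str) -> str:
    """Assumed equivalence rule: strip combining marks, then casefold."""
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch) if not unicodedata.combining(c)
    )
    return base.casefold()


def find_folded(haystack: str, needle: str):
    """Return (start, end) offsets in haystack of the first folded match, or None.

    Assumes haystack is in composed (NFC) form, one code point per letter.
    """
    folded, origin = [], []  # origin[i] = haystack index behind folded[i]
    for i, ch in enumerate(haystack):
        for f in fold(ch):  # one char may fold to several, e.g. ß -> ss
            folded.append(f)
            origin.append(i)
    folded_needle = "".join(fold(ch) for ch in needle)
    if not folded_needle:
        return None
    pos = "".join(folded).find(folded_needle)
    if pos < 0:
        return None
    return origin[pos], origin[pos + len(folded_needle) - 1] + 1


text = "Un café à Paris"
print("cafe" in text)             # False: exact matching misses it
span = find_folded(text, "CAFE")
print(span, text[span[0]:span[1]])  # (3, 7) café
```

This only lines up with the database when fold() happens to agree with the collation in use, which is exactly the problem described above.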



#125 2024-03-01 20:53:23

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,445

jakob wrote #336816:

collation-charts.org

Wooooww. That’s a biiiig table! Imagine how much space that’d take up in code.

If our dumbDown function could be extended to that somehow, it would be way more fully featured. Then we’d at least have a route available to us to collapse any accented European characters to their equivalents and perform matches on the dumbed down content.

Even then, it still wouldn’t necessarily match the rules set by the database collation, because there are a heap of those a site owner can choose from. Including the _bin ones, which compare raw code points and so are case- and accent-sensitive.

To get even partway close, we would have to do what you suggest: make an opinionated assumption. Convention over configuration. But that’s not going to suit everyone, and the extended Unicode tables beyond European languages muddy the water further.

