etc_search: when the default search is not enough

Bloke · 2024-03-01 15:18:51

I read that WordPress uses remove_accents() on content before it does a search. But if this attempt at recreating it is any indication then it only works for European languages.

We have our dumbDown() function which is kind of equivalent for creating URLs, but again it’s not exactly comprehensive.

Seems that databases are geared up for accent-(in)sensitive queries but PHP isn’t.
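To illustrate the gap: PHP's string functions compare bytes, so some folding step has to happen before matching. A minimal sketch, assuming a tiny hand-written map. The names fold_accents() and folded_contains() are made up for illustration, and the map is not the table that remove_accents() or dumbDown() actually use:

```php
<?php
// Hypothetical helper: fold a handful of accented Latin letters to ASCII.
// The map is deliberately tiny and is NOT the WordPress or Textpattern table.
function fold_accents(string $text): string
{
    static $map = [
        'á' => 'a', 'à' => 'a', 'â' => 'a', 'ä' => 'a',
        'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ú' => 'u', 'ù' => 'u', 'û' => 'u', 'ü' => 'u',
        'Á' => 'A', 'Ä' => 'A', 'É' => 'E', 'Ü' => 'U',
    ];
    // strtr() with an array replaces whole multi-byte sequences, so UTF-8 is safe here.
    return strtr($text, $map);
}

// Accent- and case-insensitive "contains", built on the folded strings.
function folded_contains(string $haystack, string $needle): bool
{
    return stripos(fold_accents($haystack), fold_accents($needle)) !== false;
}

var_dump(strpos('Zürich', 'zurich') !== false);  // plain PHP comparison: no match
var_dump(folded_contains('Zürich', 'zurich'));   // after folding: match
```

Anything outside the map passes through untouched, which is exactly the "only works for European languages" limitation described above.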

etc · 2024-03-01 15:57:38

Bloke wrote #336814:

Seems that databases are geared up for accent-(in)sensitive queries

Seemingly this depends on the collation. We could assume that it is utf8mb4_unicode_ci in most cases and tweak the PHP regex patterns accordingly, but even then I don’t know exactly how the equivalent character groups (u, ü, ù etc.) are defined.

And here is an example that suggests that accented character groups (and thus LIKE query results) depend on collation:

mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a               |
| á               |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á             |
| ä               |
+-----------------+

mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á           |
+-----------------+
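The regex tweak mentioned above could look roughly like this. It is only a sketch: the equivalence groups are a guess that imitates the utf8_general_ci grouping from the example, not the real collation tables, and accent_insensitive_pattern() is an invented name.

```php
<?php
// Hypothetical: expand each letter of a search term into a character class of
// letters assumed to be equivalent. The groups are illustrative guesses only.
function accent_insensitive_pattern(string $term): string
{
    $groups = ['aáàâä', 'eéèêë', 'uúùûü'];

    // Map every member of a group (accented or not) to the whole group.
    $lookup = [];
    foreach ($groups as $members) {
        foreach (preg_split('//u', $members, -1, PREG_SPLIT_NO_EMPTY) as $member) {
            $lookup[$member] = '[' . $members . ']';
        }
    }

    $pattern = '';
    // Split the term into UTF-8 characters without needing mbstring.
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $key = strtolower($char); // only lowercases ASCII letters
        $pattern .= $lookup[$key] ?? preg_quote($char, '/');
    }

    // i = case-insensitive, u = treat pattern and subject as UTF-8.
    return '/' . $pattern . '/iu';
}

var_dump(preg_match(accent_insensitive_pattern('munchen'), 'Grüße aus München')); // matches
```

Swapping the groups changes which letters are treated as equal, which is the PHP-side counterpart of choosing a different collation.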

jakob · 2024-03-01 20:35:04

Interesting … and also rather complicated.

The site in question I am working with uses utf8mb4_unicode_ci but presumably you’d like to have a more universal solution. FWIW, it seems Textpattern assumes that if you are using utf8mb4, then your collation is utf8mb4_unicode_ci. Ruud’s rvm_utf8_to_utf8mb4 makes the same assumption. Maybe that gives you a reasonable assumption over what could/should be the standard setup.

I guess you saw this, but a subsequent post on the page you linked to contains a (broken) link to collation-charts.org. That page has a chart for utf8_unicode_ci (European chars) showing what I assume to be the equivalent character groups. It covers an earlier version (MySQL 6.0.4) and the 3-byte utf8 rather than the utf8mb4 variant, but it is possibly still instructive. I’m willing to test where I can.

(Oh, and thanks again for the time you’re giving to this)

Bloke · 2024-03-01 20:44:21

etc wrote #336815:

Seemingly this depends on the collation.

Well yeah. I meant that MySQL can natively cater to the varying equivalences, depending on which collation the user chooses. PHP has no notion of equivalence, so its matches are always exact.

So if we’re asking the database to find stuff and then asking PHP to interpret/display them, the results aren’t going to match. The only way I can see of getting them to match the collation rules is to pass the database results verbatim to the page.

Bloke · 2024-03-01 20:53:23

jakob wrote #336816:

collation-charts.org

Wooooww. That’s a biiiig table! Imagine how much space that’d take up in code.

If our dumbDown function could be extended to that somehow, it would be way more fully featured. Then we’d at least have a route available to us to collapse any accented European characters to their equivalents and perform matches on the dumbed down content.

Even then, it still wouldn’t necessarily match the rules as set in the database collation because there are a heap of those a site owner can choose from, including the _bin ones that compare byte by byte and so are both case- and accent-sensitive.

To get even partway close, we would have to do what you suggest: make an opinionated assumption. Convention over configuration. But that’s not going to suit everyone, and the extended unicode tables beyond European languages cloud the water further.

jakob · 2025-03-25 14:20:08

jakob wrote #336762:

On recent PHP8+ I get a

passing null to parameter #1 ($string) of type string is deprecated on line 576...

with etc_search. It results when parse_url($action, PHP_URL_QUERY) returns NULL and then parse_str tries to process that into variables.

I encountered this again today. It occurs when you specify an action="/section-name/" attribute. Of course, I had forgotten about my earlier resolution, and this time resolved it by changing:

else parse_str(parse_url($action, PHP_URL_QUERY), $qs);

into

else parse_str(parse_url($action, PHP_URL_QUERY) ?? "", $qs);
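For anyone patching this by hand, here is the same guard written out in isolation. It is only a sketch; query_vars_from_action() is a hypothetical wrapper, not a function in the plugin:

```php
<?php
// Hypothetical wrapper around the parse_url()/parse_str() pair from the fix above.
function query_vars_from_action(string $action): array
{
    // parse_url() returns null when the URL has no query string
    // (e.g. action="/section-name/") and false when the URL is malformed.
    $query = parse_url($action, PHP_URL_QUERY);

    $qs = [];
    // Only hand a real string to parse_str(); PHP 8.1+ deprecates passing null.
    parse_str(is_string($query) ? $query : '', $qs);

    return $qs;
}

var_dump(query_vars_from_action('/section-name/'));        // empty array, no deprecation notice
var_dump(query_vars_from_action('/search/?q=staff&pg=2')); // q and pg, both as strings
```

The is_string() check does the same job as ?? "" for the null case and additionally covers the false that parse_url() returns for a malformed URL.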

This plugin is not on github.com/etc-plugins/ otherwise I would have submitted a PR.

I have to say, I love and hate (not really) this plugin: it’s so useful and powerful, but I struggle with finding the correct notation every time. This time I’m using it to create a search for an “internal area” where that section is excluded from the site-wide regular search, while also incorporating searching image captions which happen to hold the names of staff members and tying those back to the articles. I think I’m nearly there now…

etc · 2025-03-25 15:20:02

Hi Julian, and sorry if I forgot to upload the version I updated locally a long time ago. Done now.

jakob wrote #339359:

I have to say, I love and hate (not really) this plugin: it’s so useful and powerful, but I struggle with finding the correct notation every time.

Me too :-) The core search is rather limited and should be enhanced, but with a more natural syntax. Ideas welcome.

jakob · 2025-03-25 16:24:08

etc wrote #339360:

Hi Julian, and sorry if I forgot to upload the version I updated locally a long time ago. Done now.

Thank you, that resolved it and I’ve updated the plugin in my installation.

However, I spoke too soon: it turns out that etc_search honours sections that are marked as not to be included in the search results, so a search query like:

{Title, Body, Excerpt} AND Section = 'intern' AND Status = 4

doesn’t return any results because AND Section != 'intern' is also part of the query (in the tag trace), cancelling it out. I wondered if I could achieve this with a custom query but:

a) when I try to choose “custom” as a type, I get a fatal error. (see also this thread)
b) I’m not sure of the correct notation for how to construct a custom query, e.g. what to use for the search_term placeholder and whether I need to write out the entire query including Posted and Expired time relationships to current time and …AND ( (Title LIKE '%{q}%') OR (Body LIKE '%{q}%') OR (Excerpt LIKE '%{q}%') ) …

——

EDIT: I’m getting a little closer with the following custom query:

SELECT * FROM textpattern WHERE {Body, Title, Excerpt} AND Section = 'intern' AND Status = 4

It gives me a ton of “Article tags cannot be used outside an article context” notices, but I do get matches.

BTW: I added the ‘custom’ option to the database as mentioned in the thread linked above, and also made ‘custom’ a choosable option in the radio buttons. Then I could select custom as the query context.

(BTW II: might it not now be prudent to drop CHARACTER SET=utf8 in the plugin.enable routine? I imagine most new installations will be utf8 or utf8mb4 by now.)

——

Alternatively – and maybe more simply – could we have an option/flag to override/ignore the searchable sections filter (and maybe also the Status)? For example, if I comment out the four lines in the plugin that create the AND Section !={non-searchable-section} list, my original article-context query works as intended.

etc · 2025-03-25 21:14:51

I need to seriously refresh my brain. Meanwhile, I’ve (tried to) put it on github.com/etc-plugins/etc_search, please feel free to submit PRs.

jakob · 2025-03-26 10:37:08

etc wrote #339362:

I need to seriously refresh my brain. Meanwhile, I’ve (tried to) put it on github.com/etc-plugins/etc_search, please feel free to submit PRs.

Thank you! I’ll have a look at making a few minor updates. Should I make different branches for each aspect so that the PRs can be separately applied (when not interdependent)? I can imagine you might have a more optimal solution than mine.

Things I have encountered / seen:

- It seems that when MySQL is running in strict mode, an empty ENUM value (which you have used for the “custom” context up to now) is not permitted. It works again with an explicit “custom” enum entry. The plugin_enable/update event would also need to add that to any existing search forms with empty type values in the table.
- I’ve found that when using a custom search, I can’t use article-context txp:tags in the search results output, only placeholder replacements like {Title}, {url_title} and so on. That works, and you do mention it in the plugin help that came with the version of the plugin you uploaded yesterday. However, txp:etc_search_result_excerpt also doesn’t return anything, as it searches in the $thisarticle / $thisimage / $thisfile / $thislink global, which the custom query doesn’t seem to populate. The excerpt is useful information for searchers, though.

So I’m thinking about returning to an article-context query and finding a way of searching in the non-searchable sections.
- The idea of an “override searchable sections” switch mentioned above would involve a new UI option, a new table field, and updating that in existing installations. A general override also throws out all previously non-searchable sections, rather than just those a user might want to target. Maybe not so ideal. However this option would be backwards compatible.
- A more granular solution might be to extract any section name mentioned in the user’s search query (e.g. in all instances of Section = or Section IN), then skip those sections when generating the list of AND Section != 'section-name' strings for the non-searchable sections. That would mean the user gets the query they expect even if it conflicts with the setting in the Sections panel, but it would still respect any other non-searchable sections. There’s a slight possibility that this is not ‘backwards compatible’: in the rare case where a user has written a query that didn’t work previously because Section != and Section = cancelled each other out, the query would suddenly start working.
  The same could be done for the currently hard-coded Status >= 4: if the user has specified a particular Status in their query, then use that instead. You already do something similar for ORDER BY.
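The more granular idea can be sketched as follows. Everything here is an assumption for illustration: the function names are invented, the regexes only cover the simple Section = 'name' and Section IN ('a', 'b') forms, and addslashes() stands in for whatever escaping the plugin really uses (presumably Textpattern's doSlash()).

```php
<?php
// Hypothetical: collect section names the user explicitly asks for in the query.
function sections_named_in_query(string $query): array
{
    $found = [];

    // Section = 'name'   (does not match Section != 'name')
    if (preg_match_all("/\\bSection\\s*=\\s*'([^']*)'/i", $query, $m)) {
        $found = array_merge($found, $m[1]);
    }

    // Section IN ('a', 'b')   (does not match Section NOT IN (...))
    if (preg_match_all("/\\bSection\\s+IN\\s*\\(([^)]*)\\)/i", $query, $m)) {
        foreach ($m[1] as $list) {
            if (preg_match_all("/'([^']*)'/", $list, $names)) {
                $found = array_merge($found, $names[1]);
            }
        }
    }

    return array_values(array_unique($found));
}

// Hypothetical: build the exclusion list, skipping sections named in the query.
function excluded_sections_clause(array $nonSearchable, string $query): string
{
    $parts = [];
    foreach (array_diff($nonSearchable, sections_named_in_query($query)) as $name) {
        $parts[] = "AND Section != '" . addslashes($name) . "'";
    }
    return implode(' ', $parts);
}

$query = "{Title, Body, Excerpt} AND Section = 'intern' AND Status = 4";
echo excluded_sections_clause(['intern', 'drafts'], $query), "\n";
```

With 'intern' and 'drafts' marked as non-searchable, only the 'drafts' exclusion is emitted, so the explicit Section = 'intern' condition is no longer cancelled out. The comparison is case-sensitive, which may or may not match how the database compares section names.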

Other minor things

- Some help updates
- gTxt textpack strings and localisation (some are the result of lang strings having changed context in recent txp versions)
- html5 void tag endings (probably achievable by using more core functions for the tag output)
- some slight UI modifications
- the forced utf8 charset in the table creation

etc · 2025-03-26 14:06:31

jakob wrote #339364:

Should I make different branches for each aspect so that the PRs can be separately applied (when not interdependent)? I can imagine you might have a more optimal solution than mine.

Oh no, please PR the main branch, otherwise I would get lost, I’m afraid. I’ll gladly give you commit rights if you agree. It’s an old amateurish plugin with inconsistent enhancements written for some specific tasks, so there is room for improvement.

it seems when MySQL is running in strict mode, an empty ENUM value (which you have used for the “custom” context up to now) is not permitted. It works again with an explicit “custom” enum entry. The plugin_enable/update event would also need to add that to any existing search forms with empty type values in the table.

Thank you for the report. A first PR? :-)

I’ve found that when using a custom search, I can’t use article-context txp:tags in the search results output, only placeholder replacements like {Title}, {url_title} and so on.

Yes, that seems natural, since a custom query can fetch a mix of fields from different tables (article, image etc). Populating data without knowing the context might be arbitrary.

However txp:etc_search_result_excerpt also doesn’t return anything as it searches in the $thisarticle / $thisimage / $thisfile / $thislink global, which the custom query doesn’t seem to populate.

That’s annoying, probably some other entity should be populated for custom searches.

So I’m thinking about returning to an article-context query and finding a way of searching in the non-searchable sections.

An attribute, à la searchall, maybe? That’s flexible, but only relevant for articles, though.

jakob · 2025-03-26 16:51:06

Great, I’ll make some PRs shortly. I managed to get my suggestion working too:

- Permit search queries to include their own “Status” conditions (the default remains Status >= 4 if not defined)
- Permit search queries to search in sections that are normally not searchable if explicitly specified in the search query
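The Status part might be detected along these lines. This is a sketch under assumptions, not the code that went into the PR: default_status_clause() is an invented name, and the check is naive in that it would also react to the word Status inside a quoted search string.

```php
<?php
// Hypothetical: only add the default status filter when the query has none of its own.
function default_status_clause(string $query): string
{
    // "Status" followed by a comparison operator, IN, NOT IN or BETWEEN.
    $hasOwnStatus = preg_match(
        '/\bStatus\s*(?:<>|[<>!]?=|[<>]|(?:NOT\s+)?IN\b|BETWEEN\b)/i',
        $query
    ) === 1;

    return $hasOwnStatus ? '' : 'AND Status >= 4';
}

var_dump(default_status_clause("{Title, Body} AND Section = 'intern'"));              // default applies
var_dump(default_status_clause("{Title, Body} AND Section = 'intern' AND Status = 4")); // user's own condition wins
```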

I also discovered that etc_search_result_count does work if you explicitly specify a text attribute; the standard text string was reallocated to a new language string group in the core at some point and was no longer available to the plugin. I’ll make that a plugin textpack string and that should work again.

However, it has to appear after etc_search_results. You can switch the order by first outputting the results to a txp:variable, then showing the result count heading, and then outputting the variable. txp:items_count does, however, work above the etc_search_results tag.

Textpattern CMS

Textpattern CMS support forum
