Go to main content

Textpattern CMS support forum

You are not logged in. Register | Login | Help

#1 2005-10-28 18:31:35

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

[issue] 4.0.1: search and case of non-ascii letters

I’ve just noticed that in non-english languages, the search doesn’t handle the several cases of accented letters.

Let’s take an example with french: the é letter is low-case, the É letter is the same but upper case. If I have a word with the upper case version in an article (for example, Édition) and I am searching for the lower case word (here: édition) the search won’t find it.

So, the search should know (like Google do for example) that some letters are the same. It’s mostly:

  1. upper case (for example: É)
  2. lower case (for example: é)
  3. the ascii version of it (used in bad writings, or in URL, for example: e)

Maybe the fulltext search capability of MySQL would help here ?

Last edited by Jeremie (2005-10-28 18:32:54)

Offline

#2 2005-10-28 19:06:18

Sencer
Archived Developer
From: cgn, de
Registered: 2004-03-23
Posts: 1,803
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

> Maybe the fulltext search capability of MySQL would help here ?

Search is handled by MySql’s fulltext search.

Is this on mysql4.1 with utf8 tables? I think there it should work.

Offline

#3 2005-10-28 20:25:42

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

Yep, it’s on a Textdrive default setup. But it’s on a TXP website updated since RC1, so it was long before the true utf-8 link to the database.

Here’s the diag for this one :

Version de Textpattern: 4.0.1 (r1060)
last_update: 2005-10-13 12:25:28/2005-09-23 22:37:21
Document racine: /home/ludysnet/domains/shadowrun.fr/public_html (/users/home/ludysnet/domains/shadowrun.fr/public_html)
$path_to_site: /users/home/ludysnet/domains/shadowrun.fr/public_html
Chemin d’accès à Textpattern: /usr/home/ludysnet/domains/shadowrun.fr/public_html/textpattern
Format des URL: section_title
upload_tmp_dir: /tmp
Répertoire temporaire: /users/home/ludysnet/domains/shadowrun.fr/public_html/textpattern/tmp
URL du site: shadowrun.fr
Version de PHP: 5.0.4
server_time: 2005-10-28 20:24:04
MySQL: 4.1.13-log
Locale: fr_FR.UTF-8
Serveur: Apache

Vérifications:
————————————
/usr/home/ludysnet/domains/shadowrun.fr/public_html/textpattern/setup/ existe encore
————————————

contenu du fichier .htaccess:
————————————
#htaccess global de shadowrun.fr

#On laisse passer les Tiscali en referer
SecFilterSelective “HTTP_REFERER” “chez\.tiscali\.fr” “nolog,allow”

#DirectoryIndex index.php index.html
#Options +FollowSymLinks
#RewriteBase /relative/web/path/

<IfModule mod_rewrite.c> RewriteEngine On

#Redir www. vers .
RewriteCond %{HTTP_HOST} ^www\.shadowrun\.fr [NC]
RewriteRule ^/?(.*) http://shadowrun\.fr/$1 [R=301,L]

#TXP Rewrites RewriteCond %{REQUEST_FILENAME} -f [OR] RewriteCond %{REQUEST_FILENAME} -d RewriteRule ^(.+) – [PT,L]

RewriteRule ^(.*) index.php </IfModule>

————————————

Charset (default/config): latin1/
character_set_client: latin1
character_set_connection: latin1
character_set_database: utf8
character_set_results: latin1
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/local/share/mysql/charsets/
20 Tables: txp_textpattern is utf8, txp_txp_category is utf8, txp_txp_css is utf8, txp_txp_discuss is utf8, txp_txp_discuss_ipban is utf8, txp_txp_discuss_nonce is utf8, txp_txp_file is utf8, txp_txp_form is utf8, txp_txp_image is utf8, txp_txp_lang is utf8, txp_txp_link is utf8, txp_txp_log is utf8, txp_txp_log_mention is utf8, txp_txp_note is utf8, txp_txp_page is utf8, txp_txp_plugin is utf8, txp_txp_prefs is utf8, txp_txp_priv is utf8, txp_txp_section is utf8, txp_txp_users is utf8

Extensions PHP: standard/5.0.4, SPL, libxml, bcmath, bz2, ctype, curl, dbx, dio/0.1, SimpleXML, dom/20031129, exif/1.4 $Id: exif.c,v 1.162.2.8 2005/03/17 17:20:35 iliaa Exp $, fribidi/0.1, gd, gettext, gmp, iconv, imap, ssh2/0.1, mbstring, mcrypt, mhash, mysql, mysqli/0.1, ncurses, openssl, pcntl, pcre, pgsql, posix, readline, session, shmop, soap, sockets, SQLite, tokenizer/0.1, xml, xmlrpc/0.51, xsl/0.1, zlib/1.1, pdf/2.0.3, Zend Optimizer

But I’ve just tested on another website, recently installed – so it has true utf8, even inside the db – and it’s the same issue. Here’s the diag for that one :

Version de Textpattern: 4.0.1 (r1060)
last_update: 2005-10-10 19:30:17/2005-09-18 04:53:20
Document racine: /home/ludysnet/domains/lapagedublacky.net/public_html (/users/home/ludysnet/domains/lapagedublacky.net/public_html)
$path_to_site: /users/home/ludysnet/domains/lapagedublacky.net/public_html
Chemin d’accès à Textpattern: /home/ludysnet/domains/lapagedublacky.net/public_html/textpattern
Format des URL: section_title
upload_tmp_dir: /tmp
Répertoire temporaire: /users/home/ludysnet/domains/lapagedublacky.net/public_html/textpattern/tmp
URL du site: www.lapagedublacky.net
Version de PHP: 5.0.4
server_time: 2005-10-28 20:25:20
MySQL: 4.1.13-log
Locale: fr_FR.UTF-8
Serveur: Apache

Vérifications:
————————————
/home/ludysnet/domains/lapagedublacky.net/public_html/textpattern/setup/ existe encore
————————————

contenu du fichier .htaccess:
————————————
#On laisse passer les Tiscali en referer
SecFilterSelective “HTTP_REFERER” “chez\.tiscali\.fr” “nolog,allow”

#TXP Directives
#DirectoryIndex index.php index.html
#Options +FollowSymLinks
#RewriteBase /relative/web/path/

<IfModule mod_rewrite.c>

RewriteEngine On

#Redir . vers www
RewriteCond %{HTTP_HOST} ^lapagedublacky\.net [NC]
RewriteRule ^/?(.*) http://www.lapagedublacky\.net/$1 [R=301,L]

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) – [PT,L]

RewriteRule ^(.*) index.php
</IfModule>

————————————

Charset (default/config): latin1/utf8
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/local/share/mysql/charsets/
18 Tables: OK

Extensions PHP: standard/5.0.4, SPL, libxml, bcmath, bz2, ctype, curl, dbx, dio/0.1, SimpleXML, dom/20031129, exif/1.4 $Id: exif.c,v 1.162.2.8 2005/03/17 17:20:35 iliaa Exp $, fribidi/0.1, gd, gettext, gmp, iconv, imap, ssh2/0.1, mbstring, mcrypt, mhash, mysql, mysqli/0.1, ncurses, openssl, pcntl, pcre, pgsql, posix, readline, session, shmop, soap, sockets, SQLite, tokenizer/0.1, xml, xmlrpc/0.51, xsl/0.1, zlib/1.1, pdf/2.0.3, Zend Optimizer

Offline

#4 2005-10-28 22:10:22

zem
Developer Emeritus
From: Melbourne, Australia
Registered: 2004-04-08
Posts: 2,579

Re: [issue] 4.0.1: search and case of non-ascii letters

Run this mysql query:

show full columns from textpattern;

What is the Collation on the Body, Excerpt and Title fields?


Alex

Offline

#5 2005-10-28 22:23:43

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

I’m running it against the true utf-8 for now. All field are either no collation (because the type doesn’t allow it), or utf8_general_ci. I’ve double check Body, Title and Excerpt, they are too.

utf8_general_ci is also the collation for this database

requête SQL: show full columns from jbb_txp_textpattern;
Enregistrements: 36
Field Type Collation Null Key Default Extra Privileges Comment
ID int(11) NULL PRI NULL auto_increment select,insert,update,references
Posted datetime NULL MUL 0000-00-00 00:00:00 select,insert,update,references
AuthorID varchar(64) utf8_general_ci select,insert,update,references
LastMod datetime NULL 0000-00-00 00:00:00 select,insert,update,references
LastModID varchar(64) utf8_general_ci select,insert,update,references
Title varchar(255) utf8_general_ci MUL select,insert,update,references
Title_html varchar(255) utf8_general_ci select,insert,update,references
Body mediumtext utf8_general_ci select,insert,update,references
Body_html mediumtext utf8_general_ci select,insert,update,references
Excerpt text utf8_general_ci select,insert,update,references
Excerpt_html mediumtext utf8_general_ci select,insert,update,references
Image varchar(255) utf8_general_ci select,insert,update,references
Category1 varchar(128) utf8_general_ci MUL select,insert,update,references
Category2 varchar(128) utf8_general_ci select,insert,update,references
Annotate int(2) NULL 0 select,insert,update,references
AnnotateInvite varchar(255) utf8_general_ci select,insert,update,references
comments_count int(8) NULL 0 select,insert,update,references
Status int(2) NULL 4 select,insert,update,references
textile_body int(2) NULL 1 select,insert,update,references
textile_excerpt int(2) NULL 1 select,insert,update,references
Section varchar(64) utf8_general_ci select,insert,update,references
override_form varchar(255) utf8_general_ci select,insert,update,references
Keywords varchar(255) utf8_general_ci select,insert,update,references
url_title varchar(255) utf8_general_ci select,insert,update,references
custom_1 varchar(255) utf8_general_ci select,insert,update,references
custom_2 varchar(255) utf8_general_ci select,insert,update,references
custom_3 varchar(255) utf8_general_ci select,insert,update,references
custom_4 varchar(255) utf8_general_ci select,insert,update,references
custom_5 varchar(255) utf8_general_ci select,insert,update,references
custom_6 varchar(255) utf8_general_ci select,insert,update,references
custom_7 varchar(255) utf8_general_ci select,insert,update,references
custom_8 varchar(255) utf8_general_ci select,insert,update,references
custom_9 varchar(255) utf8_general_ci select,insert,update,references
custom_10 varchar(255) utf8_general_ci select,insert,update,references
uid varchar(32) utf8_general_ci select,insert,update,references
feed_time date NULL 0000-00-00 select,insert,update,references

I’ve done a PMA (utf-8 encoded) SQL query for it, and it worked: <code>SELECT `ID` , `Body` FROM `jbb_txp_textpattern` WHERE `Body` LIKEéditionLIMIT 0 , 30</code> return the article with Édition in its body.

The search field for the website is generated with a classic use of the TXP tag: <txp:search_input label="" wraptag="p" />

Last edited by Jeremie (2005-10-28 23:32:53)

Offline

#6 2005-10-28 22:52:56

zem
Developer Emeritus
From: Melbourne, Australia
Registered: 2004-04-08
Posts: 2,579

Re: [issue] 4.0.1: search and case of non-ascii letters

I’ve done a PMA (utf-8 encoded) SQL query for it, and it worked: SELECT `ID , Body FROM jbb_txp_textpattern WHERE Body LIKE ‘édition’ LIMIT 0 , 30@ return the article with Édition in its body.

Thanks. Sounds like it’s either the connection charset, or Textpattern is doing something to the é character before it makes it into the query. I don’t really understand the way the charset selection works, so I’ll leave that for Sencer.

If you edit lib/txplib_db.php and set define('TXP_DEBUG', 1);, you’ll see a dump of every database query. If it’s not a live site, you could try that, and see what the article query is when you use the search form (it’ll be the query with MATCH..AGAINST). Perhaps even copy-paste that same query into PMA, and see if it works there.


Alex

Offline

#7 2005-10-28 23:26:42

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

Ok. Two queries are relevant as far as I can tell :

<code>select count(*) from `jbb_txp_textpattern` where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘édition’)</code>
and
<code>select *, unix_timestamp(Posted) as uPosted, match (Title,Body) against (‘édition’) as score from `jbb_txp_textpattern` where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘édition’) order by score desc limit 0, 3</code>

Both return the same in PMA. The first one get 0 articles, the second one get nothing (which is consistant with what TXP is displaying in HTML).

If i edit any one of these queries with the correct character, it work:

<code>select count(*) from `jbb_txp_textpattern` where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘Édition’)</code>
return one count.

So it may have something to do with the rlike command. I’m sure I’m dumb, but I can’t find the manual page about rlike for 4.1. The only occurence I can found is this : <q>REGEXP and RLIKE use the current character set (cp1252 Latin1 by default) when deciding the type of a character. Warning: These operators are not multi-byte safe.</q>. I don’t know if this is relevant.

Edit: ok it seems rlike is just an alias for regexp. So it seems that it doesn’t like multibyte, which I believe include utf-8. That may be the issue here, because a query with the like command does work.

Last edited by Jeremie (2005-10-28 23:32:05)

Offline

#8 2005-10-28 23:59:09

zem
Developer Emeritus
From: Melbourne, Australia
Registered: 2004-04-08
Posts: 2,579

Re: [issue] 4.0.1: search and case of non-ascii letters

match (Title,Body) against ('édition')

Is that really what the query looks like (with an encoded html entity), or did punbb’s textile mess up your copy-paste? Does the full query (with match..against) work, if you fix up that character?

[edit] hmm, I’m a bit confused by “If i edit any one of these queries with the correct character, it work:” – edit what, exactly?

Last edited by zem (2005-10-29 00:02:49)


Alex

Offline

#9 2005-10-29 00:26:09

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

zem wrote:
Is that really what the query looks like (with an encoded html entity), or did punbb’s textile mess up your copy-paste? Does the full query (with match..against) work, if you fix up that character?

You can check the queries and do some test here if you want to see it fresh. But no, Textile has messed up the query, it has no encoding.

The queries without format:

select count(*) from jbb_txp_textpattern where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘édition’)

and

select *, unix_timestamp(Posted) as uPosted, match (Title,Body) against (‘édition’) as score from jbb_txp_textpattern where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘édition’) order by score desc limit 0, 3

[edit] hmm, I’m a bit confused by “If i edit any one of these queries with the correct character, it work:” – edit what, exactly?

I meant:

select count(*) from jbb_txp_textpattern where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘édition’)

return 0 count.

select count(*) from jbb_txp_textpattern where 1 and Status=‘4’ and Posted < now() and (Title rlike ‘édition’ or Body rlike ‘Édition’)

return 1 count (check the first character of the last word).

Sorry about the confusion.

Last edited by Jeremie (2005-10-29 00:28:24)

Offline

#10 2005-10-29 01:38:06

zem
Developer Emeritus
From: Melbourne, Australia
Registered: 2004-04-08
Posts: 2,579

Re: [issue] 4.0.1: search and case of non-ascii letters

Ok, looks to me like the query is fine; comes down to charset stuff, which is more Sencer’s department.


Alex

Offline

#11 2005-11-01 22:10:15

Jeremie
Member
From: Provence, France
Registered: 2004-08-11
Posts: 1,578
Website

Re: [issue] 4.0.1: search and case of non-ascii letters

Another related – I think – issue, and maybe a bigger one is the case of the apostrophe (‘). It’s automaticaly transformed by Textile to the right character, it’s heavily used in several languages (including french and italian) but how the search is handling it since no keyword or even special driver – to my knowledge – are able to output it.

Put it another way: Textile transform the single quote character <code>’</code> into a real unicode apostrophe ‘ , but no one using the search tool will use the real one they will all use the single quote character.

I’ve done some test.. well the results are strange… TXP does find some articles with words with apostrophe, but find another set of articles if the real apostrophe character is copied/pasted. Really strange.

Offline

#12 2005-11-01 23:42:03

zem
Developer Emeritus
From: Melbourne, Australia
Registered: 2004-04-08
Posts: 2,579

Re: [issue] 4.0.1: search and case of non-ascii letters

Textpattern searches the text before Textile is applied, exactly as it’s entered in the body textarea.

The difference might be caused by two different encodings of the same character (ASCII 0×27 vs. Unicode U+0027), or by two different characters that might appear similar on screen (U+0027 vs U+2019).

At any rate, MySQL’s fulltext indexing isn’t really designed to include puncutation in searches.


Alex

Offline

Board footer

Powered by FluxBB