Textpattern CMS support forum

#1 2008-02-13 16:15:38

shi
Member
Registered: 2005-12-09
Posts: 34

How to configure TXP search? (utf-8 problems)

Hello,

I’m working on a website that uses utf-8 characters, and I need to configure the Textpattern search engine, because it doesn’t give me the results I need.

1. For example, if a user types čačak in the search field, I’d like to output all results from the database that contain čačak AND cacak. In other words, I want to force TXP search to ignore the differences between some characters… (č = ć = c)

2. On the other hand, Textpattern doesn’t distinguish between uppercase and lowercase when showing search results, which is cool (something is equal to SOMETHING), but it does make a difference if I use non-standard characters like šđčćž (šomething is different from ŠOMETHING). Is there a way to make the TXP search engine ignore the difference between uppercase and lowercase for non-standard Latin characters?

I guess I need to hack textpattern/publish/search.php… or maybe I could use some plugin, if there’s any.

Any help appreciated.

Last edited by shi (2008-02-13 16:19:41)

#2 2008-02-13 17:10:09

Mary
Sock Enthusiast
Registered: 2004-06-27
Posts: 6,236

Re: How to configure TXP search? (utf-8 problems)

Textpattern doesn’t have its own brand of search mechanism; it uses MySQL.

  1. You could try and scrub the search term before it is handed to MySQL. Take a look at dumbDown() in textpattern/lib/txplib_misc.php
  2. That’s actually due to your database’s collation (the _ci stands for “case-insensitive”). What charset and collation are you using for your Txp database?
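
As a rough illustration of scrubbing the term before it reaches MySQL, here is a minimal sketch. The function name fold_diacritics() and its letter map are hypothetical and are not Textpattern’s dumbDown(); the map only covers the letters mentioned in this thread, and the transliteration of đ is an assumption.

```php
<?php
// Hypothetical helper (not Textpattern's dumbDown()): fold Serbian Latin
// diacritics in a search term down to plain ASCII letters.
function fold_diacritics($term)
{
    // Assumption: this short map covers only the letters discussed here.
    static $map = array(
        'š' => 's',  'Š' => 'S',
        'č' => 'c',  'Č' => 'C',
        'ć' => 'c',  'Ć' => 'C',
        'ž' => 'z',  'Ž' => 'Z',
        'đ' => 'dj', 'Đ' => 'Dj', // assumed transliteration; 'd' is another option
    );
    // strtr() with an array replaces whole byte sequences, so multi-byte
    // UTF-8 keys are handled safely. The file must be saved as UTF-8.
    return strtr($term, $map);
}

echo fold_diacritics('čačak'), "\n"; // cacak
```

Note that folding only the typed term makes the query match the unaccented spelling; the stored text would need the same treatment (or a suitable collation) for čačak to be found as well.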

#3 2008-02-15 17:49:23

shi
Member
Registered: 2005-12-09
Posts: 34

Re: How to configure TXP search? (utf-8 problems)

1. You could try and scrub the search term before it is handed to MySQL. Take a look at dumbDown() in textpattern/lib/txplib_misc.php

I have to say that I don’t know PHP that well, so maybe I’m doing something wrong here. I’ve tried to edit txplib_misc.php, but with no results. I tried several times with 'š'=>'s' , '&s;'=>'š' , 'š'=>'s' and 's'=>'š' . I’ve also tried $text = preg_replace("/s/","š",dumbDown($text)); but with no luck, and even if it worked, I think I still wouldn’t get the results I want, because in that case I couldn’t search for words with the letter s, and so on.

What I want, is to make Textpattern search:

  • search for both letters s and š if I type s in the search field (this is the problem);
  • search only for š if I type š in the search field (nothing to do here, because MySQL search already behaves like this).
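
The two rules above can be sketched as a pattern expansion for the rlike search: a plain letter becomes an alternation of itself and its accented variants, while an accented letter is left alone. This is only a sketch; expand_term() and its variant table are hypothetical, it handles lowercase input only, and the result would still need SQL escaping before going into a query. Alternation is used instead of a bracket class because MySQL’s REGEXP of this era compares byte by byte, so a multi-byte letter inside [...] would not behave as one character.

```php
<?php
// Hypothetical helper: build a regular expression for RLIKE in which each
// plain letter also matches its accented variants (both cases).
function expand_term($term)
{
    // Assumption: variants are limited to the letters from this thread.
    static $variants = array(
        's' => '(s|š|Š)',
        'c' => '(c|č|ć|Č|Ć)',
        'z' => '(z|ž|Ž)',
        'd' => '(d|đ|Đ)',
    );
    $out = '';
    // Split into UTF-8 characters; the /u flag keeps multi-byte letters whole.
    foreach (preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        if (isset($variants[$ch])) {
            $out .= $variants[$ch];
        } else {
            // Escape characters that are special in a regular expression.
            $out .= preg_quote($ch);
        }
    }
    return $out;
}

echo expand_term('sok'), "\n"; // (s|š|Š)ok
echo expand_term('šok'), "\n"; // šok
```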

That’s how Google works, and people are used to it. :))

2. That’s actually due to your database’s collation (the _ci stands for “case-insensitive”). What charset and collation are you using for your Txp database?

I don’t know. The phpMyAdmin version is 2.3.2, but I can’t see some options and info because I have limited access to the database. I tried running SHOW COLLATION; as a MySQL query, but it says that COLLATION is invalid syntax. What I need is utf_8_ci, I guess. I browsed the database and saw that some letters like č are replaced with weird signs like Ä�.

But I found a temporary solution in this post . The only downside is that I can’t search for words with fewer than 4 letters. That’s not a big deal for now, so I just need to find a solution for my first question.

Sorry about this long post. If someone could help me with my first question I would be grateful.

—-

Here’s my TXP diagnostics:

Textpattern version: 4.0.5 (r2466)
Last updated: 2007-12-15 16:26:13/2007-12-15 16:01:42
Document root: /home/virtual/site253/fst/var/www/html
$path_to_site: /home/virtual/site253/fst/var/www/html
Textpattern path: /home/virtual/site253/fst/var/www/html/textpattern
Permanent link mode: section_id_title
open_basedir: /home/virtual/site253/fst/
Temporary directory path: /home/virtual/site253/fst/var/www/html/textpattern/tmp
Site URL:
PHP version: 4.3.10
GD Image Library: bundled (2.0.28 compatible); supported formats: GIF, JPG, PNG.
Server local time: 2008-02-15 18:20:28
MySQL: 3.23.58
Locale: en_GB.UTF-8
Server: Apache
Apache version: Apache
PHP Server API: apache
RFC 2616 headers:
Server OS: Linux 2.4.20-28.7smp
Active plugins: ako_nav-1.0, glx_if-0.6.4, ob1_title-4.1, ied_hide_in_admin-0.1.6, upm_insert_tab-0.3, zem_contact_reborn-4.0.3.20, zem_contact_lang-4.0.3.6m, rss_auto_excerpt-0.5, zem_ir-0.5, rss_article_edit-0.1, rvm_if_this_article-0.1, hak_article_image-0.6.3, glz_custom_fields-1.0m, hpw_admincss-0.1

Pre-flight check:
————————————
Some Textpattern files are modified: /lib/admin_config.php, /lib/txplib_head.php, /publish.php
————————————

.htaccess contains:
————————————
#DirectoryIndex index.php index.html

#Options +FollowSymLinks
#Options -Indexes

<IfModule mod_rewrite.c>
RewriteEngine On
#RewriteBase /relative/web/path/

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]

RewriteRule ^(.*) index.php
</IfModule>

#php_value register_globals 0
————————————

Charset (default/config): latin1/latin1
character_set: latin1
character_sets: latin1 big5 cp1251 cp1257 croat czech danish dec8 dos estonia euc_kr gb2312 gbk german1 greek hebrew hp8 hungarian koi8_ru koi8_ukr latin2 latin5 swe7 usa7 win1250 win1251 win1251ukr ujis sjis tis620
17 Tables: -

PHP extensions: yp, xml, wddx, tokenizer/0.1, sysvshm, sysvsem, standard/4.3.10, sockets, shmop, session, pspell, posix, pgsql, pcre, overload, mbstring, iconv, gmp, gettext, gd, ftp, exif/1.4 $Id: exif.c,v 1.118.2.29 2004/11/10 01:44:58 iliaa Exp $, domxml/20020815, dio/0.1, dbx, dba, curl, ctype, calendar, bz2, bcmath, zlib/1.1, openssl, apache, mysql, ldap, snmp

/include/txp_category.php: r2243 (3706fea923cd77f7053f7803de169df4)
/include/txp_plugin.php: r1917 (c63f72f33986c08367672fc9fe7b42dd)
/include/txp_auth.php: r2356 (33255ec1ea1a825163c78272496d8783)
/include/txp_form.php: r1913 (ecea3fecf9d7d1f8088cda67f097eceb)
/include/txp_section.php: r1891 (1f0121b3e2969d94bc8a7fb98bfdfbd5)
/include/txp_tag.php: r2260 (1bd67bdb9dcfb72e34ea967e39406216)
/include/txp_list.php: r2450 (997a3b1bec7115bf49b76f62b28da146)
/include/txp_page.php: r2099 (56bde34b6c7bcb9123ac91e73065e894)
/include/txp_discuss.php: r2451 (91e0b29ef39a9471ae5c78d0b1bba086)
/include/txp_prefs.php: r2405 (a4b76476930b2376199f23fbfd5f1ac9)
/include/txp_log.php: r2439 (16730c34e2a437dd88b8f5cc7eff8218)
/include/txp_preview.php: r1238 (696728f35f3557b648c011bb4d6496c3)
/include/txp_image.php: r2439 (9fac6ed0d9d4c3d8196492051f38dc9a)
/include/txp_article.php: r2453 (bdac8fcac5df2f93f10afa7e50c3fb6f)
/include/txp_css.php: r2403 (4e8c52bb1cf5bfe2e2f0640892f9b92e)
/include/txp_admin.php: r2403 (f8700a3d453ece08e7f137b47c967eda)
/include/txp_link.php: r2463 (0a0171bf606296106332d3fdcb83a678)
/include/txp_diag.php: r2361 (dccf3269049dd25e59afdd7ad8d235cd)
/include/txp_file.php: r2403 (e62abd5fcadabe629322ed17135d89eb)
/include/txp_import.php: r1238 (70a6207c0f3604ecfc4b20369986c4d7)
/lib/admin_config.php: r1747 (c1a47ca0214aaea068ede0a4bb4d6990)
/lib/txplib_misc.php: r2464 (615afd44a10311f1c0b7852d9bc15d24)
/lib/taglib.php: r1535 (9b519f9dc88791e5ee8eacc029dd6975)
/lib/txplib_head.php: r2404 (a8a8b8a7768c86ccbc30c72e6b16ae89)
/lib/classTextile.php: r2462 (a031e2ea894e339711c601f230c5ee71)
/lib/txplib_html.php: r2403 (97e173da3058b438513df67fd7d1ceca)
/lib/txplib_db.php: r2406 (5ed67642f805639b54e381fb22efd208)
/lib/IXRClass.php: r765 (137b91497628f0058a2fca9eba5c3b7f)
/lib/txplib_forms.php: r2403 (438a734b52acef40b36d8a3ba23987e8)
/lib/class.thumb.php: r2329 (b2a2fda54371dbd6c40ba553941f090e)
/lib/constants.php: r2361 (ab6d51668fab1e3c98e7d520b1a59f0f)
/lib/txplib_update.php: r1239 (10f28a986d23187b436369dc29ab552f)
/lib/txplib_wrapper.php: r2286 (419125ec74a17a70bf1e86ebfcd45253)
/publish/taghandlers.php: r2444 (cc9de8f2018b01398a2ba542c5f5bdc6)
/publish/atom.php: r2402 (46c4402717f695fde0d49d806adfa4c4)
/publish/log.php: r1637 (5254d0f3942086bc55723923307a51db)
/publish/comment.php: r2460 (2d1ae1dec0784f044e7005fa5ed50930)
/publish/search.php: r1748 (8c86ebcb5be08e214d81ca15a32164ca)
/publish/rss.php: r2393 (09aac29bf22ffa71c1e118e851cff3c3)
/publish.php: r2436 (0fe5d06f9419501c38f54b67f343ae44)
/index.php: r2466 (30ecf35de5c1edc6ef68e780c8c79daa)
/css.php: r944 (8beba8f83a091068723435cdcdc02f2f)

Last edited by shi (2008-02-26 02:23:41)

#4 2008-02-15 17:57:45

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068

Re: How to configure TXP search? (utf-8 problems)

MySQL 3.23.* does not support the UTF-8 charset natively (the utf8 content that TXP manages is stored in a latin1 table in MySQL, which works, but is far from perfect for searching). You’d have to upgrade to MySQL 4.1 and use my latin1_to_utf8 plugin to convert the database. Then MySQL would be able to give more accurate search results.
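
To see why utf8 content in a latin1 table searches poorly, it helps to look at the bytes. The snippet below is only an illustration: the letters are written as byte escapes, and the strcasecmp() call is a PHP stand-in for a single-byte case-insensitive comparison, not what MySQL runs internally.

```php
<?php
// "č" and "Č" as UTF-8 byte sequences (written as escapes so the encoding
// of this file does not matter).
$lower = "\xC4\x8D"; // č, U+010D
$upper = "\xC4\x8C"; // Č, U+010C

echo strlen($lower), "\n";  // 2: one letter, but two bytes
echo bin2hex($lower), "\n"; // c48d
echo bin2hex($upper), "\n"; // c48c

// A latin1 table treats each byte as its own character. Byte 0xC4 is "Ä" in
// latin1, which is the first of the "weird signs" reported earlier.
// A single-byte case-insensitive comparison has no idea that 0x8D and 0x8C
// are the lower and upper case of the same letter:
var_dump(strcasecmp($lower, $upper) === 0); // bool(false)
```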

#5 2008-02-15 18:03:08

shi
Member
Registered: 2005-12-09
Posts: 34

Re: How to configure TXP search? (utf-8 problems)

Thanks a lot ruud!

I’ll contact the hosting company and be back with the results.

#6 2008-02-26 02:17:30

shi
Member
Registered: 2005-12-09
Posts: 34

Re: How to configure TXP search? (utf-8 problems)

They’ve upgraded the MySQL database to 5.0.22. I used the latin1_to_utf8 plugin and converted the tables to utf_8_ci, and I don’t have those weird characters in the database any more. Everything went fine… but…

I still have the same problems that I had before. :(

  1. I still can’t force TXP search to treat s and š as the same. Hacking txplib_misc.php didn’t help. At least, I tried to modify txplib_misc.php with my beginner’s knowledge of PHP, as I posted before. I’m not sure what exactly I need to do with the dumbDown function, and I’m not sure it will give me the results I need. What I need, as I posted before, is to make TXP search the database for both s and š if I type s in the search field on my site.
  2. Search is still case-sensitive for non-latin characters. Looks like I’ll have to do the same thing as before for this problem (modify /textpattern/publish.php).

Does anyone know what is going on, or what I’m doing wrong?

Thanks in advance,

————

P.S. Here’s my new diagnostics:

Textpattern version: 4.0.6 (r2805)
Last updated: 2008-02-19 13:21:16/2008-02-19 13:19:34
Document root: /var/www/html
$path_to_site: /var/www/html
Textpattern path: /var/www/html/textpattern
Permanent link mode: section_id_title
Temporary directory path: /home/lastawww/mainwebsite_html/textpattern/tmp
Site URL:
PHP version: 5.1.6
GD Image Library: bundled (2.0.28 compatible); supported formats: GIF, JPG, PNG.
Server local time: 2008-02-26 02:41:29
MySQL: 5.0.22
Locale: en_GB.UTF-8
Server: Apache/2.2.3 (CentOS)
PHP Server API: cgi-fcgi
RFC 2616 headers:
Server OS: Linux 2.6.18-53.1.6.el5
Active plugins: ako_nav-1.0, glx_if-0.6.4m, ob1_title-4.1, ied_hide_in_admin-0.1.6, upm_insert_tab-0.3, zem_contact_reborn-4.0.3.20, zem_contact_lang-4.0.3.6m, rss_auto_excerpt-0.5, zem_ir-0.5, rss_article_edit-0.1, rvm_if_this_article-0.1, hak_article_image-0.6.3, glz_custom_fields-1.1.2m, hpw_admincss-0.1

Pre-flight check:
————————————
Some Textpattern files are modified: /lib/admin_config.php
————————————

.htaccess contains:
————————————
#DirectoryIndex index.php index.html

#Options +FollowSymLinks
#Options -Indexes

<IfModule mod_rewrite.c>
RewriteEngine On
#RewriteBase /relative/web/path/

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]

RewriteRule ^(.*) index.php

RewriteCond %{HTTP:Authorization} !^$
RewriteRule .* - [E=REMOTE_USER:%{HTTP:Authorization}]
</IfModule>

#php_value register_globals 0

————————————

Charset (default/config): latin1/utf8
character_set_client: utf8
character_set_connection: utf8
character_set_database: latin1
character_set_filesystem: binary
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
17 Tables: OK

PHP extensions: libxml, xml, wddx, tokenizer/0.1, sysvshm, sysvsem, sysvmsg, standard/5.1.6, SimpleXML, sockets, SPL, shmop, session, Reflection, pspell, posix, pcntl, mime_magic/0.1, iconv, hash/1.0, gmp, gettext, ftp, exif/1.4 $Id: exif.c,v 1.173.2.5 2006/04/10 18:23:24 helly Exp $, date/5.1.6, curl, ctype, calendar, bz2, zlib/1.1, pcre, openssl, gd, imap, ldap, mbstring, mysql/1.0, mysqli/0.1, odbc/1.0, PDO, pdo_mysql/1.0.2, PDO_ODBC, pdo_sqlite/1.0.1

/../index.php: r2774 (66519e6f500fa0e59fa27567e97d3675)
/css.php: r2772 (4807cbc15661213f2b4d0fd26c7179ff)
/include/txp_admin.php: r2729 (0c2b3cf59ff433c943bcc293a526651a)
/include/txp_article.php: r2680 (49a7155d831f843bcf3e8de306dfe7f1)
/include/txp_auth.php: r2728 (c472bfbe49a71fd35e89000c8a18de08)
/include/txp_category.php: r2243 (0ed99b6f44b5d221bdf35674240141ab)
/include/txp_css.php: r2730 (7974aa87728b39d3afaba5a3b18cf6b5)
/include/txp_diag.php: r2791 (aeb96445180b68c31821e237b6150332)
/include/txp_discuss.php: r2774 (852a8a4d4307358e161e0501124b7247)
/include/txp_file.php: r2530 (9f34fdbf98b9b649d65e2ced4c9ca763)
/include/txp_form.php: r1913 (780340d28f384113c72924843194b43e)
/include/txp_image.php: r2668 (11269b464db6cfa3affff47674533a50)
/include/txp_import.php: r1238 (86f0e64d2c9362066e6c48b9cd486e37)
/include/txp_link.php: r2463 (2379d25f83b37ec6c8d5f3edb1122ce8)
/include/txp_list.php: r2725 (1ed6c6f729eaeb7f8a582b27cd5b9e78)
/include/txp_log.php: r2796 (f249e0962a996f05041b899fea91ccae)
/include/txp_page.php: r2717 (807ff04b4a649b54b3d710c1ab0a428f)
/include/txp_plugin.php: r2774 (e9fdc47a3ed9bdd13197d929161c6a13)
/include/txp_prefs.php: r2528 (50bd3be8c22e17d5ca2855ccea081bac)
/include/txp_preview.php: r1238 (c45992b3273ac8019477e2f959d63120)
/include/txp_section.php: r2759 (9208297e0bd7b3d41bd0e6f9fc9ab120)
/include/txp_tag.php: r2774 (f371b400e8d7318e2ac48e032fe6c274)
/index.php: r2805 (ee8ab2e3c4bc9abd77aa7384ecba5268)
/lib/IXRClass.php: r765 (0120eb4713c9b6446a0eebe8b1039d1c)
/lib/admin_config.php: r1747 (dd6705d4dd86103f1eb706adbcafa99a)
/lib/class.thumb.php: r2329 (c7f66a32531f32d6dfcbe5c7d26c7852)
/lib/classTextile.php: r2779 (b6d5b9cecbc5bc6475b5d1ee6a5231ea)
/lib/constants.php: r2361 (5338211ece1b2592804acdd204c9df33)
/lib/taglib.php: r2612 (727737ebd08127c632b9822bae87fee0)
/lib/txplib_admin.php: r2726 (c4f65bac2ddef62867f5bfee97ad7dfe)
/lib/txplib_db.php: r2748 (3feb369b1c34f251815cd6085a216d62)
/lib/txplib_forms.php: r2759 (a2d3de62110e582fab2a3a20224661f4)
/lib/txplib_head.php: r2783 (74ced647523a94da307af9853d7ed596)
/lib/txplib_html.php: r2696 (57985ebd2501bc303d2e97ae7538db1f)
/lib/txplib_misc.php: r2788 (7ecfaa5d4fabefbf411d01615dea9485)
/lib/txplib_update.php: r1239 (e3bd2d0c2b491d4028a656b8301a0086)
/lib/txplib_wrapper.php: r2800 (4ad38ee67f3ee8d9e7b51544a4f0f58b)
/publish.php: r2777 (0ce3da212329e7d34de07e53e109d182)
/publish/atom.php: r2774 (50aa384a2edf7cc07effee9020e0893b)
/publish/comment.php: r2776 (0e1ea64316087edcd75f394494b42100)
/publish/log.php: r1637 (f69237dc2ff39bd7a691c8ca1bc87808)
/publish/rss.php: r2793 (022caa22c756c64f2255aae6625686d8)
/publish/search.php: r1748 (ea84e04b2c688b0bb8b5a9ecf395749a)
/publish/taghandlers.php: r2774 (59dc36e6dabc619e23c43f722fe7b8f1)
/update/_to_1.0.0.php: r711 (0f49fca8fbd8e6fca0fc48b0f69f0461)
/update/_to_4.0.2.php: r711 (e77c0e0d972868f19eaee4565bd0b4c4)
/update/_to_4.0.3.php: r711 (f5506cfd0fbc3ad4bd9a9b2299468775)
/update/_to_4.0.4.php: r711 (4d867b42ee87a7f11d2bff3a8e91bed0)
/update/_to_4.0.5.php: r2464 (dbe80cd4a775d3a43a203c3c4a2d0e3f)
/update/_to_4.0.6.php: r2464 (7e5ae73eb64c24438918697089a1f321)
/update/_update.php: r2792 (6ff7b4dedb2c7735a01e76b13b3f1fb1)

Last edited by shi (2008-02-26 02:27:36)

#7 2008-02-26 09:54:22

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068

Re: How to configure TXP search? (utf-8 problems)

I’ve tried this myself and if I put šomething in an article, searching for ŠOMETHING does give a result.
What I’ve done to make this work is change this line in /textpattern/publish.php:

$search   = " and (Title rlike '$q' or Body rlike '$q') $s_filter";

into:

$search   = " and (match (Title,Body) against ('$q')) $s_filter";
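
One known limit of match … against is that MySQL’s full-text index skips words shorter than ft_min_word_len, which defaults to 4. A possible workaround, sketched below, is to fall back to LIKE for short words, since LIKE compares using the column’s collation. The helper search_clause() is hypothetical and not part of Textpattern; it assumes $q has already been SQL-escaped by the caller and ignores the fact that % and _ act as wildcards in LIKE.

```php
<?php
// Hypothetical helper, not part of Textpattern.
// Assumption: $q is already SQL-escaped by the caller.
function search_clause($q, $minLen = 4)
{
    // Measure every word in UTF-8 characters, not bytes; with the /u flag
    // "." matches one whole character.
    $lengths = array();
    foreach (preg_split('/\s+/u', trim($q), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $lengths[] = preg_match_all('/./us', $word, $unused);
    }
    // Use the full-text index only when every word is long enough for it.
    if ($lengths && min($lengths) >= $minLen) {
        return "match (Title,Body) against ('$q')";
    }
    // Otherwise fall back to LIKE, which follows the column collation.
    return "(Title like '%$q%' or Body like '%$q%')";
}

echo search_clause('čačak'), "\n"; // match (Title,Body) against ('čačak')
echo search_clause('čaj'), "\n";   // (Title like '%čaj%' or Body like '%čaj%')
```

The result would take the place of the match (Title,Body) against ('$q') part of the $search string; LIKE with a leading % cannot use an index, so the fallback is slower on large tables.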

#8 2008-02-26 13:50:12

shi
Member
Registered: 2005-12-09
Posts: 34

Re: How to configure TXP search? (utf-8 problems)

Yes, I know about that…

shi wrote:

But I found a temporary solution in this post . The only downside is that I can’t search for words with fewer than 4 letters. That’s not a big deal for now, so I just need to find a solution for my first question.

I just thought there would be some “cleaner” way of doing it, since the database is utf-8 now. But it’s OK as it is for now.

I’ll consider that problem solved. So, what I need to do now, is to find a solution for my first problem. Is it possible? I think it is, Google search works like that, but it seems very hard to configure search to behave like that… or I’m just a big noob :|

I just can’t believe that I cannot find any posts about this on the forum, because as far as I know it is a common problem with non-latin characters…

I’ll post my problem again:

What I want, is to make Textpattern search:

  • search both letters s and š, if I type s or š in a searchfield (in other words: make s equal to š);

Last edited by shi (2008-02-26 14:00:51)

#9 2008-02-26 14:48:26

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068

Re: How to configure TXP search? (utf-8 problems)

Unless MySQL has a collation that treats these characters as being the same, I don’t think that’s possible.
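
Whether a given collation treats two letters as the same can be checked by asking MySQL directly. The sketch below only builds such a probe query as text, to be pasted into phpMyAdmin or a mysql client; collation_probe_sql() is a hypothetical helper, it does no escaping (trusted literals only), and utf8_general_ci is just an example collation name. On MySQL 4.1 and later, SHOW COLLATION lists the collations that are available.

```php
<?php
// Hypothetical helper: build a query that asks MySQL whether a collation
// treats pairs of strings as equal. Each result column is 1 (equal) or
// 0 (different). No escaping is done, so pass trusted literals only.
function collation_probe_sql(array $pairs, $collation)
{
    $cols = array();
    foreach ($pairs as $i => $pair) {
        list($a, $b) = $pair;
        // The _utf8 introducer marks the literal as utf8; COLLATE forces the
        // comparison to use the collation being tested.
        $cols[] = "_utf8'$a' = _utf8'$b' COLLATE $collation AS pair_$i";
    }
    return 'SELECT ' . implode(', ', $cols);
}

echo collation_probe_sql(array(array('s', 'š'), array('c', 'č')), 'utf8_general_ci'), "\n";
```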
