Textpattern CMS support forum

#1 2008-01-08 22:17:03

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,826
Website GitHub

Charset, collation, plugin, arghhh...

I know this is a minefield and I’ve read some relevant resources on it but I’m still a trifle confused.

When installing a couple of Rob Sable’s plugins (rss_thumbpop and rss_admin_db_manager) the help text previews fully on the Install Plugin page but when viewing the help after installation, it cuts off at the first “unknown” (foreign?) character. In rss_thumbpop it’s after the word “default” in the ‘category’ section, and in the other plugin it’s during the surname of the 2nd contributor.

Just to test, I installed them in another database I had — call this #2; it has the same TXP v4.0.5 vanilla, no MLP or anything else out of the ordinary — and the help displayed fine, skipping over the foreign characters and either rendering them as ? or just ignoring them entirely.

I compared the two databases from the diagnostics tab:

Setting                        Database 1         Database 2
Charset (default/config)       latin1/utf8        latin1/latin1
character_set_client           utf8               latin1
character_set_connection       utf8               latin1
character_set_database         latin1             latin1
character_set_results          utf8               latin1
character_set_server           latin1             latin1
character_set_system           utf8               utf8
Collation (from phpMyAdmin)    utf8_general_ci    latin1_swedish_ci

Questions:

  1. Is the smashed-up hybrid charset in Database 1 the culprit? (I’ve no idea how it came to be like that, it’s a test database on the same server as Database 2, using the same phpMyAdmin interface, the same MySQL 4.1.22-standard under the same cpanel)
  2. What should they all be, and what can I do about it? (would rvm_latin1_to_utf8 help?)
  3. Does the result I’m seeing also depend on the collation/charset of the person who authored the plugin? If so, is there anything that can be done about this on plugin install or plugin compile? (I could build it into ied_plugin_composer to make the results more consistent)
  4. What should I look for on the server / in the phpMyAdmin settings before I install my next TXP database? i.e. are there any prerequisites or “best” settings when creating an empty database for a new install?

Thanks in advance for any pointers.


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp


#2 2008-01-09 08:05:51

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Charset, collation, plugin, arghhh...

What does diagnostics say about the individual tables? Are they all OK, or do they show up with a charset listed alongside them?


#3 2008-01-09 19:27:13

Bloke

Re: Charset, collation, plugin, arghhh...

ruud wrote:

What does diagnostics say about the individual tables?

Database 1
18 Tables: txp_js is latin1
(I have stm_javascript installed)

Database 2
18 Tables: OK



#4 2008-01-09 20:15:13

ruud

Re: Charset, collation, plugin, arghhh...

This is pure speculation: in latin1, you can’t really get illegal characters, because no matter what you insert in the database, it translates to a valid latin1 character. latin1 uses 1 byte for each character: 256 different characters = all possible combinations for 1 byte.
For utf8, which is a multi-byte charset, this isn’t true. Some combinations of bytes are not valid as UTF8 characters. I suspect that these plugins contain latin1 or something else that isn’t valid utf8. In database 1, MySQL chokes on it, cutting it off at the first invalid character so only the first part that is valid utf8 is stored, while on the second database… it’s a valid string of latin1 characters (what isn’t?) so the whole thing is stored.
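
To make the difference concrete, here is a minimal Python sketch (Python only for illustration, since TXP itself is PHP). The sample bytes and the helper name first_invalid_utf8_offset are made up for this example and are not taken from either plugin. Python's latin-1 codec is ISO-8859-1, which is close to but not identical to MySQL's latin1, so treat this as an analogy for the behaviour rather than a reproduction of what MySQL does internally.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical help text: "default " followed by a latin1-encoded e-acute (0xE9).
sample = b"default \xe9 category"

# Every one of the 256 byte values maps to a character in the latin-1 codec,
# so this decode cannot fail, whatever the input bytes are.
as_latin1 = sample.decode("latin-1")

# The same bytes are not valid UTF-8: 0xE9 announces a three-byte sequence,
# but the bytes that follow it are not continuation bytes.
offset = first_invalid_utf8_offset(sample)

# Everything before that offset is the part that is still valid UTF-8,
# which matches the "cut off at the first bad character" symptom.
valid_prefix = sample[:offset]

print(as_latin1)
print(offset, valid_prefix)
```

The truncated prefix ends right after the word "default", which is the kind of cut-off described in the first post.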

The short answer: ask the plugin author to remove the non-utf8 characters.


#5 2008-01-09 21:10:05

Bloke

Re: Charset, collation, plugin, arghhh...

Thanks for the explanation Ruud, makes sense.

So in the plugin_composer, do you think it’s wise for me to always run the help through utf8_decode() prior to saving it to the database? Or is that going to a) create more problems, b) come back and bite me on the ass when, or if, the world (PHP? MySQL?) goes true native Unicode and UTF-8?

Last edited by Bloke (2008-01-09 21:10:40)



#6 2008-01-09 21:14:17

ruud

Re: Charset, collation, plugin, arghhh...

Nah, fix the problem, not the symptom. Show an error when invalid utf8 is found.
utf8_decode will only make the problem worse, because TXP assumes everything is utf8… and utf8_decode outputs latin1, which likely is not valid utf8.
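
A rough sketch of the "show an error" approach, again in Python for illustration only. The function name require_valid_utf8 and the sample word are hypothetical, and this is not how ied_plugin_composer or TXP actually does it. The second half mimics what utf8_decode() does to valid input (convert UTF-8 to ISO-8859-1 bytes) to show why the result then fails a UTF-8 check.

```python
def require_valid_utf8(help_text: bytes) -> str:
    """Return the decoded text, or raise ValueError naming the bad offset."""
    try:
        return help_text.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"help text is not valid UTF-8 at byte offset {err.start}"
        ) from err


# Hypothetical sample: valid UTF-8, where the e-acute takes two bytes.
good = "caf\u00e9".encode("utf-8")

# Rough equivalent of running valid UTF-8 through utf8_decode():
# the e-acute becomes the single latin1 byte 0xE9.
downgraded = good.decode("utf-8").encode("latin-1")

print(require_valid_utf8(good))

try:
    require_valid_utf8(downgraded)
except ValueError as exc:
    # The "fixed" bytes are now rejected by the UTF-8 check.
    print(exc)
```

Rejecting the input up front tells the plugin author exactly where the bad byte is, instead of silently storing a truncated or re-encoded help text.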


#7 2008-01-09 21:25:28

Bloke

Re: Charset, collation, plugin, arghhh...

Suits me, thanks!

