Textpattern CMS support forum


#1 2011-04-06 10:30:32

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 8,775
Website

Odd character encoding woes

Not sure I’ve fully grasped this character encoding stuff, but the following image shows how certain “unknown” characters appear in phpMyAdmin (top) and the Body field on the Write tab (bottom). The little question mark character is also littered across the public site.

The textpattern table’s collation is set to utf8_general_ci but whether that’s the same as the character encoding I don’t know. It appears that some kind of weird oblique ‘smart quote’ backtick character has been used in place of apostrophes. Presumably the same thing will be happening to ‘smart’ double quotes, but I haven’t found an article with those in yet.

My guess is the content was cut ‘n’ pasted from something like Word which had a different character set in use (or something) that doesn’t translate well to the web font or the DB’s charset. There are a couple of hundred articles in the site exhibiting stuff like this to some degree or another.

Question is: what’s the best way to fix it? What do I need to look into, or query from the database, to find all these occurrences and either replace them with proper characters — and, I assume, programmatically resave each article so body_html is updated — or alter the character encoding somehow so it ‘understands’ the characters? (can you tell I’m a character set novice yet? :-) Further, is there anything that I should do to help prevent this kind of thing happening in articles submitted in future?
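One way to approach the "find all these occurrences" part is to scan the article text for anything outside plain ASCII and name each code point. A minimal Python sketch, assuming the article bodies have already been pulled out of the database as strings; `find_suspect_chars` is a hypothetical helper, not part of Textpattern:

```python
import unicodedata


def find_suspect_chars(text):
    """Return (index, character, Unicode name) for every non-ASCII character."""
    hits = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            # unicodedata.name() raises for unnamed code points, so pass a default
            hits.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return hits


sample = "It\u2019s a \u201ctest\u201d"
for index, char, name in find_suspect_chars(sample):
    print(index, f"U+{ord(char):04X}", name)
```

Seeing the actual code points (curly quotes versus a replacement character, say) should help decide whether the text itself needs replacing or only the encoding settings are off.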

One other possible telling tidbit: in phpMyAdmin, if I go and edit any record — say an entry in txp_prefs — it shows me the SQL that it used to perform that action. Usually it’s a simple UPDATE blah SET field='val' but in this case it reports the following as the command it used:

UPDATE `txp_prefs`
SET `val` = 'Your comment'
WHERE `txp_prefs`.`prefs_id` =1
AND CONVERT( `txp_prefs`.`name` USING utf8 ) = 'comments_default_invite'
AND CONVERT( `txp_prefs`.`val` USING utf8 ) = 'Comment' AND `txp_prefs`.`type` =0
AND CONVERT( `txp_prefs`.`event` USING utf8 ) = 'comments'
AND CONVERT( `txp_prefs`.`html` USING utf8 ) = 'text_input' AND `txp_prefs`.`position` =40
AND CONVERT( `txp_prefs`.`user_name` USING utf8 ) = '' LIMIT 1 ;

Does that mean the database (it’s been imported from an ancient TXP 4.0.3 to a 4.4.0 environment) had an old encoding somehow? Can I find out what it was by issuing an SQL query on the old database or something? Would ruud’s rvm_latin1_to_utf8 be worth a shot here on the shiny new DB? Is this extra stuff in the UPDATE command even relevant? So many questions, so little time, cried Alice.

Thanks in advance for any pointers.


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp


#2 2011-04-06 20:28:34

ruud
Developer emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Odd character encoding woes

Is character encoding set identically in config.php in both the 4.0.3 and 4.4 environments?


#3 2011-04-06 21:09:50

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 8,775
Website

Re: Odd character encoding woes

ruud wrote:

Is character encoding set identically in config.php in both the 4.0.3 and 4.4 environments?

It doesn’t mention anything about it in either file. Hmmmm. Mind you, I didn’t go through setup to make the new environment. Just copied everything over, manually altered the relevant txp_prefs and manually updated config.php with the new server / DB details.

Should config.php have an entry listing the character set in use?



#4 2011-04-06 21:14:46

Els
Admin
From: The Netherlands
Registered: 2004-06-06
Posts: 7,458

Re: Odd character encoding woes

Bloke wrote:

Should config.php have an entry listing the character set in use?

config-dist.php ;)


#5 2011-04-06 21:22:42

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 8,775
Website

Re: Odd character encoding woes

Els wrote:

config-dist.php

Aha, how did I forget about that? Thanks Els.

I set the dbcharset to utf-8 as a first attempt. Instead of those little question marks I now get a tiny little square box everywhere there’s a quote symbol, with what looks like FF on the top row and FD beneath it. Either way, it still can’t figure it out. So do I just go through all the available charsets like latin-1, iso-8859-1, etc until I find one where the symbols go away and quotes pop out?

And then what should I do? I really wish I understood all this stuff *sigh*
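The box showing FF over FD is most likely a font's fallback glyph for U+FFFD, the Unicode replacement character, which is what a decoder emits when it meets bytes it cannot interpret. A small Python sketch of one way that can come about, assuming (this is a guess, not something confirmed in this thread) that the stored bytes are Windows-1252 and are being read as UTF-8:

```python
# The curly apostrophe U+2019 is the single byte 0x92 in Windows-1252.
raw = "It\u2019s".encode("cp1252")

# Reading that byte as UTF-8 fails: 0x92 on its own is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (REPLACEMENT CHARACTER).
decoded = raw.decode("utf-8", errors="replace")

print(raw)
print([f"U+{ord(c):04X}" for c in decoded])
```

If that is what is happening, the symbol is a sign that the declared charset and the stored bytes disagree, so cycling through charset names at random is less useful than finding out how the data was originally stored.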

Last edited by Bloke (2011-04-06 21:23:28)



#6 2011-04-06 21:42:50

ruud
Developer emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Odd character encoding woes

It’s typically either latin1 (without the dash) or utf8 (also without dash).

How exactly did you copy the database from the old to the new install? Perhaps there was a mismatch between the export and import in the charset used.
Did the import of the database create the tables or did you first install 4.4 and then restore a 4.0.3 database without dropping the table structure?
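To illustrate what such a mismatch can do: if UTF-8 bytes are read as a single-byte charset somewhere along the way, each curly quote turns into a run of three odd characters. A Python sketch, using Windows-1252 as a stand-in for MySQL's latin1 (an assumption made for illustration; the two are close but not identical):

```python
original = "It\u2019s"

# Correct UTF-8 bytes for the text: the apostrophe becomes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Misreading those bytes as Windows-1252 yields three characters per quote.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no bytes were lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired == original)
```

The reverse round trip only works when the garbling happened exactly once and in this direction, which is why knowing the export and import settings matters.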


#7 2011-04-06 22:01:47

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 8,775
Website

Re: Odd character encoding woes

ruud wrote:

How exactly did you copy the database from the old to the new install?

phpMyAdmin->export to file then phpMyAdmin->import from that file, into an empty database shell (with no tables) that had been created for me by the hoster. Then I uploaded the 4.4.0 TXP filesystem via FTP to the new dir, made the relevant mods to the txp_prefs table to fix paths, etc, and then logged into the admin side a few times to make sure all the updates ran.

Looking at the original site/articles, they show the question mark characters too, which is probably because of the lack of dbcharset in config.php.

It’s typically either latin1 (without the dash) or utf8 (also without dash).

Ahaaa. I changed it to utf8 (thanks for the correction) and the symbols have gone away. Yay! Looking at the page source code, the apostrophes aren’t true apostrophes — they’re characters — and the double quotes are ‘curly’ but it’s close enough and a million times better than weird symbols dotted across the page.

So unless there’s something I can do to convert the smart quotes into real quotes en masse, I think that’ll have to do. From now on I suppose real apostrophes typed into new articles will be rendered correctly, and copy-n-pasted quote chars from Word docs will continue to be ‘curly’, but they will at least render a close approximation to their true character.
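For the en masse conversion, a translation table is probably enough. A minimal Python sketch, assuming the article text is available as a string outside the database; `straighten_quotes` and `SMART_TO_PLAIN` are hypothetical names used only for illustration:

```python
# Map the four common "smart" quote code points to their ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(SMART_TO_PLAIN)


print(straighten_quotes("cut \u2018n\u2019 paste \u201cquotes\u201d"))
```

Changing the Body text this way would still leave body_html untouched, so each article would need resaving afterwards, as mentioned in the first post.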

Thank you to the Netherlands Massive for setting me straight on this one and educating me in the process. Much appreciated.

Last edited by Bloke (2011-04-06 22:03:03)



#8 2011-04-07 00:04:50

maniqui
Moderator
From: Buenos Aires, Argentina
Registered: 2004-10-10
Posts: 3,070
Website

Re: Odd character encoding woes

Hi Bloke,

you may enjoy this reading by Joel Spolsky:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

I’ve read it a few times, and charsets and encodings… I still don’t get them :)


La música ideas portará y siempre continuará


#9 2011-04-07 00:50:10

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 8,775
Website

Re: Odd character encoding woes

maniqui wrote:

you may enjoy this reading by Joel Spolsky:

Excellent, thanks. It’s beginning to sink in…



#10 2013-02-20 21:59:46

ecklesroad
Plugin Author
From: Bemidji, MN
Registered: 2008-02-22
Posts: 119
Website

Re: Odd character encoding woes

Sorry to bring back an old thread, but I’m having the same issue with 4.5.4.

I can’t figure out what the character might be. It showed up when I updated from a previous version to the newest. Is this a language issue? I have $txpcfg['dbcharset'] = 'utf8'; in my config file.
