Textpattern CMS support forum
Odd character encoding woes
Not sure I’ve fully grasped this character encoding stuff, but the following image shows how certain “unknown” characters appear in phpMyAdmin (top) and the Body field on the Write tab (bottom). The little question mark character is also littered across the public site.
The textpattern table’s collation is set to utf8_general_ci, but whether that’s the same as the character encoding I don’t know. It appears that some kind of weird oblique ‘smart quote’ backtick character has been used in place of apostrophes. Presumably the same thing will be happening to ‘smart’ double quotes, but I haven’t found an article with those in yet.
My guess is the content was cut ‘n’ pasted from something like Word which had a different character set in use (or something) that doesn’t translate well to the web font or the DB’s charset. There are a couple of hundred articles in the site exhibiting stuff like this to some degree or another.
Question is: what’s the best way to fix it? What do I need to look into, or query from the database, to find all these occurrences and either replace them with proper characters — and, I assume, programmatically resave each article so body_html is updated — or alter the character encoding somehow so it ‘understands’ the characters? (can you tell I’m a character set novice yet? :-) Further, is there anything that I should do to help prevent this kind of thing happening in articles submitted in future?
One other possibly telling tidbit: in phpMyAdmin, if I go and edit any record (say an entry in txp_prefs) it shows me the SQL that it used to perform that action. Usually it’s a simple UPDATE blah SET field='val' but in this case it reports the following as the command it used:
UPDATE `txp_prefs`
SET `val` = 'Your comment'
WHERE `txp_prefs`.`prefs_id` = 1
AND CONVERT( `txp_prefs`.`name` USING utf8 ) = 'comments_default_invite'
AND CONVERT( `txp_prefs`.`val` USING utf8 ) = 'Comment'
AND `txp_prefs`.`type` = 0
AND CONVERT( `txp_prefs`.`event` USING utf8 ) = 'comments'
AND CONVERT( `txp_prefs`.`html` USING utf8 ) = 'text_input'
AND `txp_prefs`.`position` = 40
AND CONVERT( `txp_prefs`.`user_name` USING utf8 ) = ''
LIMIT 1;
Does that mean the database (it’s been imported from an ancient TXP 4.0.3 to a 4.4.0 environment) had an old encoding somehow? Can I find out what it was by issuing an SQL query on the old database or something? Would ruud’s rvm_latin1_to_utf8 be worth a shot here on the shiny new DB? Is this extra stuff in the UPDATE command even relevant? So many questions, so little time, cried Alice.
Thanks in advance for any pointers.
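As a rough way to find which articles are affected, the sketch below counts every non-ASCII character in a piece of text and names it. It assumes the article bodies have already been pulled out of the database into Python strings; the function name non_ascii_report and the sample text are made up for illustration and are not part of Textpattern.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Count each non-ASCII character in text, keyed by 'U+XXXX NAME'."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}": n
        for ch, n in counts.items()
    }


# Hypothetical article body containing Word-style smart quotes.
sample = "It\u2019s a \u201csmart\u201d quote, isn\u2019t it?"

for label, n in sorted(non_ascii_report(sample).items()):
    print(n, label)
```

Running something like this over each Body value would show whether the odd characters are mostly smart quotes or something else entirely.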
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
Re: Odd character encoding woes
Is character encoding set identically in config.php in both the 4.0.3 and 4.4 environments?
Re: Odd character encoding woes
ruud wrote:
Is character encoding set identically in config.php in both the 4.0.3 and 4.4 environments?
It doesn’t mention anything about it in either file. Hmmmm. Mind you, I didn’t go through setup to make the new environment. Just copied everything over, manually altered the relevant txp_prefs and manually updated config.php with the new server / DB details.
Should config.php have an entry listing the character set in use?
#4 2011-04-06 21:14:46
- els
- Moderator
- From: The Netherlands
- Registered: 2004-06-06
- Posts: 7,458
Re: Odd character encoding woes
Re: Odd character encoding woes
Els wrote:
config-dist.php
Aha, how did I forget about that? Thanks Els.
I set the dbcharset to utf-8 as a first attempt. Instead of those little question marks I now get a tiny little square box everywhere there’s a quote symbol, with what looks like FF on the top row and FD beneath it. Either way, it still can’t figure it out. So do I just go through all the available charsets like latin-1, iso-8859-1, etc. until I find one where the symbols go away and quotes pop out?
And then what should I do? I really wish I understood all this stuff *sigh*
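One plausible reading of the FF/FD box, offered as a sketch rather than a diagnosis: it is how some fonts draw U+FFFD, the Unicode replacement character, which appears when bytes are not valid in the encoding the page declares. The example assumes the smart apostrophe reached the browser as a single Windows-1252 byte while the page was declared as UTF-8; whether that is what happened on this site is a guess.

```python
# The smart apostrophe that Word typically produces.
SMART_APOSTROPHE = "\u2019"

# The same character as bytes in two different encodings.
cp1252_bytes = SMART_APOSTROPHE.encode("cp1252")  # b'\x92'
utf8_bytes = SMART_APOSTROPHE.encode("utf-8")     # b'\xe2\x80\x99'

# A lone 0x92 byte is not valid UTF-8, so a UTF-8 reader substitutes
# U+FFFD (the replacement character) for it.
shown = cp1252_bytes.decode("utf-8", errors="replace")

# The opposite mismatch: UTF-8 bytes read as Windows-1252 turn one
# character into three unrelated ones.
mojibake = utf8_bytes.decode("cp1252")

print(cp1252_bytes, utf8_bytes)
print(f"U+{ord(shown):04X}")
print(mojibake)
```

Cycling through charsets until the symbols vanish would work by luck at best; comparing the stored bytes with what the page declares is the more direct check.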
Last edited by Bloke (2011-04-06 21:23:28)
Re: Odd character encoding woes
It’s typically either latin1 (without the dash) or utf8 (also without dash).
How exactly did you copy the database from the old to the new install? Perhaps there was a mismatch between the export and import in the charset used.
Did the import of the database create the tables or did you first install 4.4 and then restore a 4.0.3 database without dropping the table structure?
Re: Odd character encoding woes
ruud wrote:
How exactly did you copy the database from the old to the new install?
phpMyAdmin->export to file then phpMyAdmin->import from that file, into an empty database shell (with no tables) that had been created for me by the hoster. Then I uploaded the 4.4.0 TXP filesystem via FTP to the new dir, made the relevant mods to the txp_prefs table to fix paths, etc, and then logged into the admin side a few times to make sure all the updates ran.
Looking at the original site/articles, they show the question mark characters too, which is probably because of the lack of dbcharset in config.php.
It’s typically either latin1 (without the dash) or utf8 (also without dash).
Ahaaa. I changed it to utf8 (thanks for the correction) and the symbols have gone away. Yay! Looking at the page source code, the apostrophes aren’t true apostrophes (they’re ’ characters) and the double quotes are ‘curly’ but it’s close enough and a million times better than weird symbols dotted across the page.
So unless there’s something I can do to convert the smart quotes into real quotes en masse I think that’ll have to do. From now on I suppose real apostrophes typed into new articles will be rendered correctly, and copy-n-pasted quote chars from Word docs will continue to be ‘curly’, but they will at least render a close approximation to their true character.
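For the en masse conversion, a minimal sketch of the replacement step follows. It assumes the article text is available as Python strings and that only the four common curly quote characters matter; the name straighten_quotes is invented here. Writing the result back to the database, and resaving so body_html is regenerated, is left out.

```python
# Map the four common "smart" quote characters to their ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(SMART_TO_PLAIN)


print(straighten_quotes("It\u2019s \u201cfine\u201d now"))
```

Bear in mind that Textile may turn straight quotes back into typographic ones when it renders the body, so the visible output might not change much.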
Thank you to the Netherlands Massive for setting me straight on this one and educating me in the process. Much appreciated.
Last edited by Bloke (2011-04-06 22:03:03)
Re: Odd character encoding woes
Hi Bloke,
you may enjoy this reading by Joel Spolsky:
I’ve read it a few times, and charsets and encodings… I still don’t get them :)
Re: Odd character encoding woes
maniqui wrote:
you may enjoy this reading by Joel Spolsky:
Excellent, thanks. It’s beginning to sink in…
Re: Odd character encoding woes
Sorry to bring back an old thread, but I’m having the same issue with 4.5.4.
Can’t figure out what the character might be. This showed up when I updated from a previous version to the newest. Is this a language issue? I have $txpcfg['dbcharset'] = 'utf8'; in my config file.
Re: Odd character encoding woes
I am running into a very similar issue: these ? characters are appearing where quotes and apostrophes should be. It started about a month ago when the host did something, like an upgrade to services; I’m just not sure what.
Running Textpattern 4.6.2. The encoding is set to $txpcfg['dbcharset'] = 'utf8mb4'; in config.php. I’ve tried adding AddDefaultCharset UTF-8 to .htaccess. I’ve got <meta charset="utf-8" /> in the header of all pages. I’ve tried exporting, deleting, and importing the database.
I also took a backup of the site and installed it locally: no issues.
Anything else I can be looking at?
Last edited by shayne (2019-03-28 21:15:54)
Re: Odd character encoding woes
Hi shayne,
Did you consider updating the site to the latest txp version?
Yiannis
——————————
NeMe | hblack.art | EMAP | A Sea change | Toolkit of Care
I do my best editing after I click on the submit button.