Textpattern CMS support forum
You are not logged in. Register | Login | Help
- Topics: Active | Unanswered
Bad character encoding in articles imported into database
This is a 10+ year old Textpattern website still running great. At some point in the history the whole thing was hacked and restored from a (possibly poorly done) database backup. All the new articles are fine, but there are many old articles (which because of new archives and site search now are quite easy for the innocent to come across) that display wrong. Here’s an example. Basically non-ascii(?) characters, including dashes and smart quotes, display as a diamond question mark character:
This is what this looks like in the database:
‘Reduced’,’‘,’*Sept. 10 – Nov. 6, 2005*\r\n*Opening Reception, Friday, Sept. 9, 7 - 10 pm*\r\n\r\nMinimalism’s impact on subsequent generations forms the exhibition Reduced.
When I go to edit this article the backend chokes and shows me blank fields:
This is probably a good time to mention that I’m running $txpcfg['dbcharset'] = 'latin1';
(is there a way to migrate this?). When I look at newer articles in the DB they have special characters encoded e.g. an appostrophe is ’
.
Assuming there are no better tools or ideas, the solution would be to grab the affected articles from the database dump into a separate text file, carefully search ‘n’ replace all the special characters with their respective codes, paste it back and restore the DB.
Any suggestions or pointers or warnings?
Yes, I have tried turning it off and on.
Offline
Re: Bad character encoding in articles imported into database
Ah, good ol’ character set issues. You might like to make sure you have a backup, then spin up a copy of rvm_latin1_to_utf8 which does exactly what it says on the tin.
If that doesn’t help, there’s a plugin by net-carver (sed_
plugin prefix) floating around that effectively re-textiles every article. I’ve had (some) success with that in the past. Failing that, you can fall back on SQL commands to alter collations and character sets in columns/tables, but that’s messy and I’m not the best source for those types of thing.
I think Ruud’s plugin will do what you want, so give that a whirl first then we’ll see how much further there is to go.
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
Offline
Re: Bad character encoding in articles imported into database
The rvm_latin1_to_utf8 plugin doesn’t solve the problem where the database contains a mix of utf8/latin1 in articles.
You’d have to loop through each of the articles, try to detect if it contains latin1 and then convert that to utf8 and save the article.
Offline
Re: Bad character encoding in articles imported into database
ruud wrote #303522:
You’d have to loop through each of the articles, try to detect if it contains latin1 and then convert that to utf8 and save the article.
What I was afraid of. (I also tried to find the net-carver plugin both in the forums and on his github but no luck.)
Let’s say I dump the database to a text file and open it. In the old entries that are showing up wrong on the website, the characters show up exactly as I’d want them to. An apostrophe is an apostrophe. In the new articles that show up correctly on the website, an apostrophe is encoded as ’
. Sooooo theoretically if I do a search/replace in the entire file for each of the characters (really it’s just smart quotes and dashes) and replace with the appropriate strings as cribbed from the new correctly encoded files I should be fine. Or it will fail spectacularly but I will have a backup.
This will be a fun project for when I have some spare time.
Yes, I have tried turning it off and on.
Offline
Re: Bad character encoding in articles imported into database
alesh wrote #303546:
if I do a search/replace in the entire file for each of the characters (really it’s just smart quotes and dashes) and replace with the appropriate strings as cribbed from the new correctly encoded files I should be fine.
Yep, that oughta do it.
The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.
Txp Builders – finely-crafted code, design and Txp
Offline
Re: Bad character encoding in articles imported into database
alesh wrote #303546:
This will be a fun project for when I have some spare time.
I’ve had to do this exact task with WordPress and Textpattern when moving from shoddy hosting.
Rather than export -> change -> import, you could backup -> live dangerously with SQL commands -> test instead.
Take a look here — that syntax should cover you with search/replace operations of this nature, and covers body and excerpt.
Backup. Backup. Backup. Then make the changes.
Offline
Re: Bad character encoding in articles imported into database
gaekwad wrote #303552:
Rather than export -> change -> import, you could backup -> live dangerously with SQL commands -> test instead.
Ah, thanks. I really should do this, if for not other reason then I need to work on my SQL chops. I’m concerned that this is going to be an ugly process where things go wrong a few times before they go right and it’ll be easier for me to troubleshoot with a text file in front of my face. Also adding an extra layer of complexity makes it that much more of a terrifying undertaking and so makes me more wont to push it back?
Seems I need a psychologist, not just tech advice.
Yes, I have tried turning it off and on.
Offline
#8 2017-01-12 17:32:42
- uli
- Moderator
- From: Cologne
- Registered: 2006-08-15
- Posts: 4,315
Re: Bad character encoding in articles imported into database
alesh wrote #303556:
Seems I need a psychologist, not just tech advice.
The doctor is —> IN
… and says: You’ll love the SQL way once you’ve experienced the ease of changing a whole database with just one click (and a spoonful of bravery). But you’ll also love yourself. And that’s the best of it. And what can happen with a backup up your sleeve?!
Edited: Lucy van Pelt’s psychiatric booth placard wording
In bad weather I never leave home without wet_plugout, smd_where_used and adi_form_links
Offline
Re: Bad character encoding in articles imported into database
uli wrote #303558:
The doctor is —> IN
… and says: You’ll love the SQL way once you’ve experienced the ease of changing a whole database with just one click (and a spoonful of bravery). But you’ll also love yourself. And that’s the best of it. And what can happen with a backup up your sleeve?!
Watch this space. I’ll report back. If you don’t hear from me by the end of winter it means I typed something that deleted the ground beneath my feet and have tumbled into the Earth’s core.
Yes, I have tried turning it off and on.
Offline