Go to main content

Textpattern CMS support forum

You are not logged in. Register | Login | Help

#1 2016-12-29 13:18:44

alesh
Member
From: Miami, FL
Registered: 2005-04-13
Posts: 228
Website

Bad character encoding in articles imported into database

This is a 10+ year old Textpattern website still running great. At some point in the history the whole thing was hacked and restored from a (possibly poorly done) database backup. All the new articles are fine, but there are many old articles (which because of new archives and site search now are quite easy for the innocent to come across) that display wrong. Here’s an example. Basically non-ascii(?) characters, including dashes and smart quotes, display as a diamond question mark character:

This is what this looks like in the database:

‘Reduced’,’‘,’*Sept. 10 – Nov. 6, 2005*\r\n*Opening Reception, Friday, Sept. 9, 7 - 10 pm*\r\n\r\nMinimalism’s impact on subsequent generations forms the exhibition Reduced.

When I go to edit this article the backend chokes and shows me blank fields:

This is probably a good time to mention that I’m running $txpcfg['dbcharset'] = 'latin1'; (is there a way to migrate this?). When I look at newer articles in the DB they have special characters encoded e.g. an appostrophe is ’.

Assuming there are no better tools or ideas, the solution would be to grab the affected articles from the database dump into a separate text file, carefully search ‘n’ replace all the special characters with their respective codes, paste it back and restore the DB.

Any suggestions or pointers or warnings?


Yes, I have tried turning it off and on.

Offline

#2 2016-12-29 13:29:38

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,269
Website GitHub

Re: Bad character encoding in articles imported into database

Ah, good ol’ character set issues. You might like to make sure you have a backup, then spin up a copy of rvm_latin1_to_utf8 which does exactly what it says on the tin.

If that doesn’t help, there’s a plugin by net-carver (sed_ plugin prefix) floating around that effectively re-textiles every article. I’ve had (some) success with that in the past. Failing that, you can fall back on SQL commands to alter collations and character sets in columns/tables, but that’s messy and I’m not the best source for those types of thing.

I think Ruud’s plugin will do what you want, so give that a whirl first then we’ll see how much further there is to go.


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#3 2017-01-06 18:35:31

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Bad character encoding in articles imported into database

The rvm_latin1_to_utf8 plugin doesn’t solve the problem where the database contains a mix of utf8/latin1 in articles.

You’d have to loop through each of the articles, try to detect if it contains latin1 and then convert that to utf8 and save the article.

Offline

#4 2017-01-10 14:04:51

alesh
Member
From: Miami, FL
Registered: 2005-04-13
Posts: 228
Website

Re: Bad character encoding in articles imported into database

ruud wrote #303522:

You’d have to loop through each of the articles, try to detect if it contains latin1 and then convert that to utf8 and save the article.

What I was afraid of. (I also tried to find the net-carver plugin both in the forums and on his github but no luck.)

Let’s say I dump the database to a text file and open it. In the old entries that are showing up wrong on the website, the characters show up exactly as I’d want them to. An apostrophe is an apostrophe. In the new articles that show up correctly on the website, an apostrophe is encoded as ’. Sooooo theoretically if I do a search/replace in the entire file for each of the characters (really it’s just smart quotes and dashes) and replace with the appropriate strings as cribbed from the new correctly encoded files I should be fine. Or it will fail spectacularly but I will have a backup.

This will be a fun project for when I have some spare time.


Yes, I have tried turning it off and on.

Offline

#5 2017-01-10 14:29:58

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,269
Website GitHub

Re: Bad character encoding in articles imported into database

alesh wrote #303546:

if I do a search/replace in the entire file for each of the characters (really it’s just smart quotes and dashes) and replace with the appropriate strings as cribbed from the new correctly encoded files I should be fine.

Yep, that oughta do it.


The smd plugin menagerie — for when you need one more gribble of power from Textpattern. Bleeding-edge code available on GitHub.

Txp Builders – finely-crafted code, design and Txp

Offline

#6 2017-01-11 20:47:34

gaekwad
Server grease monkey
From: People's Republic of Cornwall
Registered: 2005-11-19
Posts: 4,137
GitHub

Re: Bad character encoding in articles imported into database

alesh wrote #303546:

This will be a fun project for when I have some spare time.

I’ve had to do this exact task with WordPress and Textpattern when moving from shoddy hosting.

Rather than export -> change -> import, you could backup -> live dangerously with SQL commands -> test instead.

Take a look here — that syntax should cover you with search/replace operations of this nature, and covers body and excerpt.

Backup. Backup. Backup. Then make the changes.

Offline

#7 2017-01-12 13:47:13

alesh
Member
From: Miami, FL
Registered: 2005-04-13
Posts: 228
Website

Re: Bad character encoding in articles imported into database

gaekwad wrote #303552:

Rather than export -> change -> import, you could backup -> live dangerously with SQL commands -> test instead.

Ah, thanks. I really should do this, if for not other reason then I need to work on my SQL chops. I’m concerned that this is going to be an ugly process where things go wrong a few times before they go right and it’ll be easier for me to troubleshoot with a text file in front of my face. Also adding an extra layer of complexity makes it that much more of a terrifying undertaking and so makes me more wont to push it back?

Seems I need a psychologist, not just tech advice.


Yes, I have tried turning it off and on.

Offline

#8 2017-01-12 17:32:42

uli
Moderator
From: Cologne
Registered: 2006-08-15
Posts: 4,304

Re: Bad character encoding in articles imported into database

alesh wrote #303556:

Seems I need a psychologist, not just tech advice.

The doctor is —> IN
… and says: You’ll love the SQL way once you’ve experienced the ease of changing a whole database with just one click (and a spoonful of bravery). But you’ll also love yourself. And that’s the best of it. And what can happen with a backup up your sleeve?!

Edited: Lucy van Pelt’s psychiatric booth placard wording


In bad weather I never leave home without wet_plugout, smd_where_used and adi_form_links

Offline

#9 2017-01-13 14:43:12

alesh
Member
From: Miami, FL
Registered: 2005-04-13
Posts: 228
Website

Re: Bad character encoding in articles imported into database

uli wrote #303558:

The doctor is —> IN
… and says: You’ll love the SQL way once you’ve experienced the ease of changing a whole database with just one click (and a spoonful of bravery). But you’ll also love yourself. And that’s the best of it. And what can happen with a backup up your sleeve?!

Watch this space. I’ll report back. If you don’t hear from me by the end of winter it means I typed something that deleted the ground beneath my feet and have tumbled into the Earth’s core.


Yes, I have tried turning it off and on.

Offline

Board footer

Powered by FluxBB