Textpattern CMS support forum

#1 2009-12-29 17:27:43

matthijs
Member
Registered: 2008-12-15
Posts: 32

Character set problems UTF-8 or latin

Backing up a database from a textpattern site I noticed weird characters.

Website in browser with headers UTF-8 displays
ö and ô

phpMyAdmin and sql dump files (using the Txp plugin admin_dbmanager) displays
Ã¶ and Ã´

This has to do with the character encoding. I’ve read tons about it, but still fail to understand what’s going on here. Is the first line UTF-8? Is the second line (the one from the sql file and within phpmyadmin) something else, like latin-1?

Can I do something about this? I want to have good backups of the database, and having all these weird characters in them makes me uncomfortable.

Thanks!

#2 2009-12-29 20:47:58

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Character set problems UTF-8 or latin

Try adding $txpcfg['dbcharset'] = 'latin1'; to your textpattern/config.php file on the new install.

If that solves the problem, you can then use my plugin rvm_latin1_to_utf8 to convert the database to proper UTF8, assuming you have a non-ancient MySQL version installed. (but be sure to have good backups before doing so).

UTF-8 is a multibyte charset, which means that it can require multiple bytes to store a single character, which is what happens for all characters that are not part of US-ASCII, such as: ö and ô

If you look at that in a dump file, which treats it as latin1 characters (latin1 always uses 1 byte per character), you see: Ã¶ and Ã´
The accented o’s take up 2 bytes per character. Each of those bytes corresponds to 1 latin1 character, so each accented o is displayed as 2 latin1 characters.
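
To make the byte counting concrete, here is a small Python sketch (not part of TXP; the function name read_as_latin1 is made up for illustration). It encodes a character as UTF-8 and then reads those same bytes back as latin1, which is roughly what a latin1 view of the dump does:

```python
def read_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as latin1."""
    utf8_bytes = text.encode("utf-8")   # ö becomes 2 bytes: c3 b6
    return utf8_bytes.decode("latin-1") # each byte becomes 1 latin1 character


if __name__ == "__main__":
    for ch in ("ö", "ô"):
        raw = ch.encode("utf-8")
        # Show the character, its UTF-8 bytes in hex, and the latin1 misreading.
        print(ch, raw.hex(), len(raw), read_as_latin1(ch))
```

Plain US-ASCII text comes out unchanged, because those characters are a single byte with the same value in both charsets.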

Last edited by ruud (2009-12-29 20:52:14)

#3 2009-12-30 10:51:41

trenc
Plugin Author
From: ⛵️, currently Göteborg, SE
Registered: 2008-02-27
Posts: 574
Website GitHub

Re: Character set problems UTF-8 or latin

Seems to me that your db is configured as UTF-8, but your editor reading the sql dump and your phpMyAdmin are displaying the UTF-8 content as ISO-8859-1 (latin1). So the weird chars are displayed as Ruud described.

Are you sure your editor and phpMyAdmin are configured as UTF-8?


Digital nomad, sailing the world on a sailboat: 32fthome.com

#4 2009-12-30 15:17:10

matthijs
Member
Registered: 2008-12-15
Posts: 32

Re: Character set problems UTF-8 or latin

Ruud and trenc, thanks for the answers.

Looking at the config file I use now on the site, it is
$txpcfg['dbcharset'] = 'latin1';
That’s weird though, since it seems the characters are in fact utf8.

And just to be clear: the character ö does display as ö on the website

My text editor is TextMate, file encoding is UTF-8.

Looking at the response headers from phpMyAdmin:

Date: Wed, 30 Dec 2009 15:09:15 GMT
Server: Apache
X-Powered-By: PHP/5.2.5
Set-Cookie: pmaCookieVer=4; expires=Fri, 29-Jan-2010 15:09:15 GMT; path=/phpMyAdmin/; httponly
phpMyAdmin=YfX6edgtefE4wduhehdE46SrbETtV2; path=/phpMyAdmin/; HttpOnly
pma_fontsize=100%25; expires=Fri, 29-Jan-2010 15:09:15 GMT; path=/phpMyAdmin/; httponly
pma_lang=en-utf-8; expires=Fri, 29-Jan-2010 15:09:16 GMT; path=/phpMyAdmin/; httponly
pma_charset=iso-8859-1; expires=Fri, 29-Jan-2010 15:09:16 GMT; path=/phpMyAdmin/; httponly
pma_collation_connection=utf8_unicode_ci; expires=Fri, 29-Jan-2010 15:09:16 GMT; path=/phpMyAdmin/; httponly
pma_theme=original; expires=Fri, 29-Jan-2010 15:09:16 GMT; path=/phpMyAdmin/; httponly
Expires: Wed, 30 Dec 2009 15:09:16 GMT
Cache-Control: no-store, no-cache, must-revalidate, pre-check=0, post-check=0, max-age=0
Last-Modified: Wed, 30 Dec 2009 15:09:16 GMT
X-ob_mode: 1
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Keep-Alive: timeout=15, max=91
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=utf-8

200 OK

That’s the frustrating thing with character sets. I can look at any character but never know what it is that I’m seeing …

What I could also try is to write a PHP script and do a direct select from the db, making sure the connection is set to utf-8.

#5 2009-12-30 16:20:44

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: Character set problems UTF-8 or latin

$txpcfg['dbcharset'] = 'latin1' means that the database thinks it’s storing latin1, when in fact TXP uses it to store UTF-8. If you then create a backup of such a table and store it as UTF-8, you’re seeing the latin1 representation of UTF-8 encoded in UTF-8. That is expected with this setup; the stored data itself is not damaged.
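
As a rough illustration of that "latin1 representation of UTF-8 encoded in UTF-8", the following Python sketch walks through the three steps by hand. It only simulates the conversions and does not talk to MySQL; the function name simulate_dump is made up for this example:

```python
def simulate_dump(text: str) -> bytes:
    """Simulate UTF-8 data in a latin1-labelled table, dumped as UTF-8."""
    stored = text.encode("utf-8")         # bytes TXP writes into the table
    as_latin1 = stored.decode("latin-1")  # how the database interprets them
    return as_latin1.encode("utf-8")      # bytes written to a UTF-8 dump file


if __name__ == "__main__":
    dumped = simulate_dump("ö")
    # One character has grown from 2 bytes to 4 bytes in the dump.
    print(dumped.hex(), len(dumped), dumped.decode("utf-8"))
```

Each of the two original bytes is itself turned into a 2-byte UTF-8 sequence, which is why an editor that reads the dump correctly as UTF-8 still shows two odd characters per accented letter.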

I found that annoying, which is why I wrote a plugin to convert the database to utf-8 instead of latin1.

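I’d recommend against doing a direct select using a UTF-8 connection to the database while the dbcharset is set to latin1: MySQL would convert the supposedly latin1 data to UTF-8 on the way out, and you would get the same doubled characters you see in the dump.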

#6 2009-12-30 17:04:53

matthijs
Member
Registered: 2008-12-15
Posts: 32

Re: Character set problems UTF-8 or latin

So in the meantime I wrote a short test script in PHP, and that confirmed that the characters are utf-8. When setting the headers of the script to utf-8 and doing a select, the chars display fine. Without the header(utf8) they display as garbage.

So it’s the fault of phpMyAdmin, which incorrectly retrieves the data. Now I have to look for a db backup script that does it correctly.
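
For checking text taken from an existing dump, something like this Python sketch might help. It is only an assumption about how such a check could look, not an existing tool, and the function name undo_double_encoding is made up. It tries to reverse the latin1 misreading and leaves the text alone if that fails:

```python
def undo_double_encoding(text: str) -> str:
    """Reverse a latin1 misreading of UTF-8; return text unchanged on failure."""
    try:
        # Turn each character back into the single byte it came from,
        # then decode those bytes as the UTF-8 they originally were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in latin1, or not valid UTF-8 afterwards:
        # the text was probably not double-encoded, so keep it as is.
        return text


if __name__ == "__main__":
    print(undo_double_encoding("Ã¶ and Ã´"))
    print(undo_double_encoding("ö and ô"))
```

One caveat: MySQL's latin1 is really closer to Windows-1252, so characters whose UTF-8 bytes fall in the 0x80 to 0x9F range may not round-trip with Python's strict latin-1 codec. Test on a copy of the dump first.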
