Textpattern CMS support forum
[Solved] UTF-8 URLs
Moderator’s Annotation:
Please note: In the meantime Andreas created a plugin which is now available: here. – Uli
——————————————
Original post:
Is it possible to hack Textpattern so that it outputs UTF-8 encoded URLs instead of ASCII-encoded ones?
Last edited by uli (2012-09-07 13:02:21)
Re: [Solved] UTF-8 URLs
OK, thanks. So one solution is to define the permlink by hand.
But what about category and section names in the URL scheme? Is there any solution that is not too difficult?
Re: [Solved] UTF-8 URLs
I found a solution that satisfies me. Use at your own risk :-D.
Surely there is a better solution …
In lib/txplib_misc.php look for
function sanitizeForUrl($text)
{
    // any overrides?
    $out = callback_event('sanitize_for_url', '', 0, $text);
    if ($out !== '') return $out;
    // Remove names entities and tags
    $text = preg_replace("/(^|&\S+;)|(<[^>]*>)/U","",dumbDown($text));
    // Dashify high-order chars leftover from dumbDown()
    $text = preg_replace("/[\x80-\xff]/","-",$text);
    // Collapse spaces, minuses, (back-)slashes and non-words
    $text = preg_replace('/[\s\-\/\\\\]+/', '-', trim(preg_replace('/[^\w\s\-\/\\\\]/', '', $text)));
    // Remove all non-whitelisted characters
    $text = preg_replace("/[^A-Za-z0-9\-_]/","",$text);
    return $text;
}
and replace it with
function sanitizeForUrl($text)
{
    // any overrides?
    $out = callback_event('sanitize_for_url', '', 0, $text);
    if ($out !== '') return $out;
    // remove all signs but letters, numbers, dashes and connectors
    $text = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u","",$text);
    // Collapse spaces, minuses, (back-)slashes and non-words
    $text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
    // Remove all non-whitelisted characters
    $text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u","",$text);
    return $text;
}
I tested it only with Textpattern 4.2.0.
Settings under Textpattern -> Advanced Options: “Attach titles to permalinks?” => yes and “Permalink title-like-this (instead of TitleLikeThis)?” => yes
It works with umlauts, ひらがな, 漢字, кириллица and other characters …
The only issue is that Internet Explorer 8 doesn’t render these URLs. But it is the same on Wikipedia …
Firefox, Opera, Safari, Chrome, Konqueror, etc. do …
If you implement it, keep the “Maximum URL length (in characters)” setting under Textpattern -> Advanced Options in mind.
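For anyone who wants to try the three replacement steps outside of Textpattern, here is a minimal standalone sketch. It leaves out the callback_event() check, and the function name utf8_slug() is made up for this illustration only. It assumes the script is saved as UTF-8 and that no non-default locale has been set.

```php
<?php
// Standalone sketch of the three preg_replace() steps from the modified
// sanitizeForUrl(). utf8_slug() is a hypothetical name, not part of Textpattern.
function utf8_slug($text)
{
    // Strip punctuation (except dashes and connectors), symbols, marks,
    // control characters and non-decimal numbers.
    $text = preg_replace('/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u', '', $text);
    // Collapse runs of whitespace, dashes and (back-)slashes into one dash.
    $text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
    // Keep only letters, decimal digits, dash and underscore.
    $text = preg_replace('/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u', '', $text);
    return $text;
}

$samples = array('Привет, мир!', '漢字 と かな', 'Hello / World');
foreach ($samples as $title) {
    echo $title, ' => ', utf8_slug($title), "\n";
}
```

Note that, like the function it is taken from, this sketch does not trim leading or trailing dashes and does not change letter case.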
Last edited by whocarez (2010-08-09 22:56:05)
Re: [Solved] UTF-8 URLs
whocarez wrote:
Surely there is a better solution …
Try the latest SVN instead? Since r3344 TXP has the ability to percent-encode URLs which is probably as close as we’ll get. YMMV.
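For readers unsure what percent-encoding does to a UTF-8 title, here is a small sketch using PHP's standard rawurlencode(). It only illustrates the encoding itself; it is not a reconstruction of what r3344 does internally, which is not shown in this thread.

```php
<?php
// Each byte of the UTF-8 string becomes a %XX triplet.
$title   = 'мир';
$encoded = rawurlencode($title);

echo $encoded, "\n";                  // %D0%BC%D0%B8%D1%80
echo strlen($encoded), "\n";          // 18 characters for 3 letters
echo rawurldecode($encoded), "\n";    // мир
```

The length matters if a maximum URL length is configured, since every Cyrillic letter takes six characters once encoded.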
Re: [Solved] UTF-8 URLs
Bloke wrote:
Try the latest SVN instead? Since r3344 TXP has the ability to percent-encode URLs which is probably as close as we’ll get. YMMV.
No, thanks, Bloke. I really need these Unicode URLs, not only as a fallback.
I suggest adding an option to Textpattern to explicitly use Unicode URLs. Some say it is SEO-relevant :-D …
Re: [Solved] UTF-8 URLs
Can’t you use the sanitize_for_url callback_event instead of editing TXP code?
Re: [Solved] UTF-8 URLs
Hm, as far as I understand the code, the fallback only applies if sanitizing leaves a blank URL or a string consisting only of dashes; only in that case is the original string raw-URL-encoded.
In other cases, e.g. Cyrillic, the other conditions and dumbDown($text) apply, so the URL string is fully Latinized, and I want to avoid that.
I want to take the article title, remove unwanted symbols and save the Unicode string as the URL, as Wikipedia does, for example; see ru.wikipedia.org.
rawurlencode takes the string, e.g. the article_title, as it is and does not remove unneeded symbols like “?!€,”.
If I misunderstand the code or your question, please tell me …
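A small sketch of the difference described above, using only standard PHP functions. The sample title is made up, and the filter reuses the first character-class pattern from the modified sanitizeForUrl() earlier in this thread.

```php
<?php
$title = 'Да?!';

// rawurlencode() keeps the punctuation; it merely percent-encodes it.
$encoded = rawurlencode($title);

// Filtering with the Unicode property classes removes the punctuation
// and leaves the letters untouched.
$filtered = preg_replace('/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u', '', $title);

echo $encoded, "\n";   // %D0%94%D0%B0%3F%21
echo $filtered, "\n";  // Да
```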
Last edited by whocarez (2010-08-10 20:40:52)
Re: [Solved] UTF-8 URLs
No, Ruud is just pointing out that you don’t need to modify the core. Instead, you can build a plugin and hook into the callback. If content is returned by the callback function, the normal permlinks will be overridden. Neat, huh? No need to modify anything. Something like this with your modified code:
register_callback('sanitize_for_url','xxx_utf8_permlinks');
function xxx_utf8_permlinks($event,$step,$text) {
    $text = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u","",$text);
    // Collapse spaces, minuses, (back-)slashes and non-words
    $text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
    // Remove all non-whitelisted characters
    $text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u","",$text);
    return $text;
}
Might have missed the callback argument order and in which order they are assigned to the function, didn’t check it.
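To make the override idea concrete without a Textpattern install, here is a toy sketch. Every demo_* function is a hypothetical stand-in written for this example; none of them is Textpattern's real register_callback() or callback_event(), and the real argument handling may differ. It only shows the pattern: if a registered handler returns a non-empty string, the default sanitizing is skipped.

```php
<?php
// Toy hook registry (hypothetical, NOT Textpattern code).
// Called with a handler it registers it; it always returns the handlers for $event.
function demo_hooks($event, $func = null)
{
    static $hooks = array();
    if ($func !== null) {
        $hooks[$event][] = $func;
    }
    return isset($hooks[$event]) ? $hooks[$event] : array();
}

// Run all handlers for an event and concatenate what they return.
function demo_callback_event($event, $step, $text)
{
    $out = '';
    foreach (demo_hooks($event) as $func) {
        $out .= call_user_func($func, $event, $step, $text);
    }
    return $out;
}

// Default sanitizer: ASCII whitelist, unless a handler supplied a result.
function demo_sanitize($text)
{
    $out = demo_callback_event('sanitize_for_url', '', $text);
    if ($out !== '') {
        return $out;
    }
    return preg_replace('/[^A-Za-z0-9\-_]/', '', $text);
}

// Plugin-style handler that keeps Unicode letters and decimal digits.
function demo_utf8_handler($event, $step, $text)
{
    return preg_replace('/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u', '', $text);
}

echo demo_sanitize('мир-2010'), "\n";   // no handler yet: -2010
demo_hooks('sanitize_for_url', 'demo_utf8_handler');
echo demo_sanitize('мир-2010'), "\n";   // handler overrides: мир-2010
```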
Last edited by Gocom (2010-08-10 22:50:53)
Re: [Solved] UTF-8 URLs
Ah, ok I got it (in the end) …
Last edited by whocarez (2010-08-11 11:03:23)
Re: [Solved] UTF-8 URLs
Version 0.1.4
Get it here.
#12 2012-08-03 11:44:56, uli (Moderator)
Re: [Solved] UTF-8 URLs
Andreas, you might like to publish this in the Plugin Author Support forum. I’ve reported you so that you can be given Plugin Author forum privileges.