
Textpattern CMS support forum


  1. Index
  2. » How do I…?
  3. » [Solved] UTF-8 URLs

#1 2010-07-19 06:31:27

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

[Solved] UTF-8 URLs

Moderator’s Annotation:

Please note: In the meantime Andreas created a plugin which is now available: here. – Uli

——————————————
Original post:

Is it possible to hack Textpattern so that it outputs UTF-8 encoded URLs instead of ASCII-encoded ones?

Last edited by uli (2012-09-07 13:02:21)


#2 2010-07-19 06:59:27

Dragondz
Moderator
From: Algérie
Registered: 2005-06-12
Posts: 1,529

Re: [Solved] UTF-8 URLs

Hi

Here are some discussions about that: this and this

Cheers


#3 2010-07-19 08:11:07

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

OK, thanks. So one solution is to define the permalink by hand.
But what about category and section names in the URL scheme? Is there a not-too-difficult solution?


#4 2010-08-09 22:08:36

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

I found a solution that satisfies me. Use at your own risk :-D.
Surely there is a better solution …

In lib/txplib_misc.php look for

	function sanitizeForUrl($text)
	{
		// any overrides?
		$out = callback_event('sanitize_for_url', '', 0, $text);
		if ($out !== '') return $out;

		// Remove names entities and tags
		$text = preg_replace("/(^|&\S+;)|(<[^>]*>)/U","",dumbDown($text));
		// Dashify high-order chars leftover from dumbDown()
		$text = preg_replace("/[\x80-\xff]/","-",$text);
		// Collapse spaces, minuses, (back-)slashes and non-words
		$text = preg_replace('/[\s\-\/\\\\]+/', '-', trim(preg_replace('/[^\w\s\-\/\\\\]/', '', $text)));
		// Remove all non-whitelisted characters
		$text = preg_replace("/[^A-Za-z0-9\-_]/","",$text);
		return $text;
	}

and replace it with

	function sanitizeForUrl($text)
	{
		// any overrides?
		$out = callback_event('sanitize_for_url', '', 0, $text);
		if ($out !== '') return $out;

		// Remove everything except letters, numbers, dashes and connectors
		$text = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u","",$text);

		// Collapse spaces, minuses, (back-)slashes and non-words
		$text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
		// Remove all non-whitelisted characters
		$text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u","",$text);
		return $text;
	}

I tested it only with Textpattern 4.2.0.
Settings: Textpattern -> Advanced Options -> “Attach titles to permalinks?” = yes and “Permalink title-like-this (instead of TitleLikeThis)?” = yes

It works with umlauts, ひらがな, 漢字, кириллица and other characters …
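
To see what the three patterns do without touching a Textpattern install, here is a minimal standalone sketch. The function name `utf8_slug_sketch` is made up for this illustration; the patterns are copied from the replacement function above, and the callback override is left out. It assumes PHP's PCRE was built with Unicode property support, which `\p{…}` with the `u` modifier needs.

```php
<?php
// Hypothetical helper name; the three patterns are copied from the post above.
function utf8_slug_sketch($text)
{
    // Drop brackets, punctuation, quotes, non-decimal numbers, marks,
    // control characters and symbols.
    $text = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u", "", $text);
    // Collapse runs of whitespace, minuses and (back-)slashes into one dash.
    $text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
    // Keep only letters, decimal digits, dashes and underscores.
    $text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u", "", $text);
    return $text;
}

echo utf8_slug_sketch('Привет, мир!'), "\n"; // Привет-мир
echo utf8_slug_sketch('漢字 100%'), "\n";     // 漢字-100
echo utf8_slug_sketch('50 € off'), "\n";     // 50-off
```

Note that letter case is left untouched by these patterns, and that combining marks (`\p{M}`) are stripped, so a title typed with decomposed accents would lose them.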
The only issue is that Internet Explorer 8 doesn't render these URLs. But it is the same on Wikipedia …
Firefox, Opera, Safari, Chrome, Konqueror etc. do …
If you implement it, keep the “Maximum URL length (in characters)” setting under Textpattern -> Advanced Options in mind.
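
Why the length setting deserves a look: a non-ASCII slug grows considerably once it is percent-encoded for the request. The sketch below compares character count, byte count and percent-encoded length for a made-up sample slug. Whether Textpattern compares its limit against the percent-encoded request URL is an assumption here, not something confirmed in this thread.

```php
<?php
// Made-up sample slug: 9 Cyrillic letters plus one dash.
$slug = 'Привет-мир';

// Count code points with a UTF-8 aware pattern (avoids depending on mbstring).
$chars = preg_match_all('/./us', $slug);

// Raw UTF-8 byte length: each Cyrillic letter takes 2 bytes.
$bytes = strlen($slug);

// Length after percent-encoding: each non-ASCII byte becomes 3 characters.
$encodedLength = strlen(rawurlencode($slug));

echo "$chars characters, $bytes bytes, $encodedLength encoded\n";
// 10 characters, 19 bytes, 55 encoded
```

So a title that looks short in the admin interface can still take up a large share of the configured maximum once it travels over the wire.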

Last edited by whocarez (2010-08-09 22:56:05)


#5 2010-08-10 08:05:06

Bloke
Developer
From: Leeds, UK
Registered: 2006-01-29
Posts: 11,271

Re: [Solved] UTF-8 URLs

whocarez wrote:

Surely there is a better solution …

Try the latest SVN instead? Since r3344 TXP has the ability to percent-encode URLs which is probably as close as we’ll get. YMMV.




#6 2010-08-10 10:15:56

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

Bloke wrote:

Try the latest SVN instead? Since r3344 TXP has the ability to percent-encode URLs which is probably as close as we’ll get. YMMV.

No, thanks, Bloke. I really need these Unicode URLs, and not only as a fallback.
I suggest integrating an option into Textpattern to explicitly use Unicode URLs. Some say it is SEO-relevant :-D …


#7 2010-08-10 19:43:55

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068

Re: [Solved] UTF-8 URLs

Can’t you use the sanitize_for_url callback_event instead of editing TXP code?


#8 2010-08-10 20:40:23

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

Hm, as far as I understand the code, the fallback only kicks in if sanitizing leaves a blank URL or a URL string consisting only of minuses; only in that case is the original URL raw-encoded.
In other cases, e.g. Cyrillic, the other conditions and dumbDown($text) apply, so the URL string is fully latinized, and I want to avoid that.
I want to take the article title, remove unwanted symbols and save the Unicode string as the URL, as Wikipedia does, for example; see ru.wikipedia.org.

rawurlencode takes the string, e.g. the article_title, as it is and does not remove unneeded symbols like “?!€,”.
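
The point about rawurlencode can be checked in isolation. This is a small sketch with a made-up sample title; it only uses PHP's built-in `rawurlencode()` and `rawurldecode()` and says nothing about how Textpattern itself calls them.

```php
<?php
// Made-up sample title with trailing punctuation.
$title = 'Привет?!';

// rawurlencode() percent-encodes every byte outside letters, digits, "-", "_", "." and "~".
// Nothing is removed: "?" and "!" survive as %3F and %21.
$encoded = rawurlencode($title);
echo $encoded, "\n"; // %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82%3F%21

// Decoding restores the original title, punctuation included.
echo rawurldecode($encoded), "\n"; // Привет?!
```

In other words, encoding alone keeps the title reversible but does no cleanup, which is why a separate sanitizing step is still needed for tidy URLs.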

If I misunderstand the code or your question, please tell me …

Last edited by whocarez (2010-08-10 20:40:52)


#9 2010-08-10 22:48:18

Gocom
Developer Emeritus
From: Helsinki, Finland
Registered: 2006-07-14
Posts: 4,533

Re: [Solved] UTF-8 URLs

No, Ruud is just pointing out that you don’t need to modify the core. Instead, you can build a plugin and hook into the callback. If the callback function returns content, the normal permlinks will be overridden. Neat, huh? No need to modify anything. Something like this, with your modified code:

// register_callback() takes the callback function name first, then the event name
register_callback('xxx_utf8_permlinks', 'sanitize_for_url');
function xxx_utf8_permlinks($event, $step, $text) {
	$text = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u","",$text);
	// Collapse spaces, minuses, (back-)slashes and non-words
	$text = preg_replace('/[\s\-\/\\\\]+/', '-', $text);
	// Remove all non-whitelisted characters
	$text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u","",$text);
	return $text;
}

I might have got the callback argument order wrong, or the order in which the arguments are passed to the function; I didn’t check it.
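
A slightly more defensive variant of the plugin idea, as a sketch only. The function name `xxx_utf8_permlinks_safe` is hypothetical ("xxx" stands in for a real plugin prefix). It relies on two things: `register_callback()` taking the function name first and the event second, and the core check `if ($out !== '') return $out;` quoted earlier in the thread, so that returning an empty string hands control back to the default sanitizer. The `function_exists()` guard is only there so the file can also be loaded outside Textpattern.

```php
<?php
// Hypothetical plugin function; the patterns are the ones from the thread.
function xxx_utf8_permlinks_safe($event, $step, $text = '')
{
    // With the "u" modifier, preg_replace() returns null for invalid UTF-8.
    $clean = preg_replace("/[\p{Ps}\p{Po}\p{Pi}\p{Pf}\p{Pe}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u", "", (string) $text);
    if ($clean === null) {
        // Empty string: let the core's default sanitizing run instead.
        return '';
    }

    // Collapse spaces, minuses and (back-)slashes into a single dash.
    $clean = preg_replace('/[\s\-\/\\\\]+/', '-', $clean);
    // Remove all non-whitelisted characters.
    $clean = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u", "", $clean);

    return ($clean === null) ? '' : $clean;
}

// register_callback() only exists inside Textpattern.
if (function_exists('register_callback')) {
    register_callback('xxx_utf8_permlinks_safe', 'sanitize_for_url');
}
```

Called directly, `xxx_utf8_permlinks_safe('sanitize_for_url', '', 'Привет, мир!')` yields `Привет-мир`, while a byte string that is not valid UTF-8 yields an empty string.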

Last edited by Gocom (2010-08-10 22:50:53)


#10 2010-08-11 08:33:24

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

Ah, ok I got it (in the end) …

Last edited by whocarez (2010-08-11 11:03:23)


#11 2012-08-03 10:25:42

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 305

Re: [Solved] UTF-8 URLs

Version 0.1.4
Get it here.


#12 2012-08-03 11:44:56

uli
Moderator
From: Cologne
Registered: 2006-08-15
Posts: 4,304

Re: [Solved] UTF-8 URLs

Andreas, you might like to publish this in the Plugin Author Support forum. I have reported your post so that you can be given Plugin Author forum privileges.



