Textpattern CMS support forum

#1 2011-06-30 11:56:14

uli
Moderator
From: Cologne
Registered: 2006-08-15
Posts: 4,304

[textile] Links with umlauts don't work

Yannis/colak recently spotted a bug here (his post scriptum): umlauts (äöüß) in domain names prevent Textile from creating links.


In bad weather I never leave home without wet_plugout, smd_where_used and adi_form_links

#2 2011-07-01 00:57:22

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 3,081
Website

Re: [textile] Links with umlauts don't work

It is not limited to umlauts in domain names. I think every single IDN is br0ken.

For example, with Japanese characters in the URL: a link to the Japanese Wikipedia page for Fukushima is incorrectly linkified on this forum (the URL is http://ja.wikipedia.org/wiki/福島). When used in a Textpattern article, it fails completely.
Given "Fukushima":http://ja.wikipedia.org/wiki/福島 Textile spits out the literal string "Fukushima":http://ja.wikipedia.org/wiki/福島.


Where is that emoji for a solar powered submarine when you need it ?
Sand space – admin theme for Textpattern

#3 2011-07-01 06:48:21

wet
Developer Emeritus
From: Schoerfling, Austria
Registered: 2005-06-06
Posts: 3,323
Website Mastodon

Re: [textile] Links with umlauts don't work

From RFC 2396:

A URI is a sequence of characters from a very limited set, i.e. the letters of the basic Latin alphabet, digits, and a few special characters.

How shall we deal with this challenge in Textile?

#4 2011-07-01 09:41:56

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: [textile] Links with umlauts don't work

RFC 2396 is still valid, I think, but you’d have to encode the domain name in the actual URL using Punycode: RFC 3492.
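As a rough illustration of what that encoding does to a host name, here is a minimal Python sketch. It relies on Python's built-in "idna" codec rather than anything in Textile or Textpattern, and the helper name host_to_ascii is made up for this example.

```python
def host_to_ascii(host: str) -> str:
    """Return the ASCII (xn--) form of an internationalized host name.

    Hypothetical helper for illustration; it is not part of Textile.
    """
    # The built-in "idna" codec splits the name on dots and
    # Punycode-encodes every label that contains non-ASCII characters.
    return host.encode("idna").decode("ascii")


print(host_to_ascii("müller.de"))    # xn--mller-kva.de
print(host_to_ascii("example.com"))  # example.com (plain ASCII is left alone)
```

Only the host name goes through this step; the path and query string need percent-encoding instead.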

#5 2011-07-01 09:44:06

wet
Developer Emeritus
From: Schoerfling, Austria
Registered: 2005-06-06
Posts: 3,323
Website Mastodon

Re: [textile] Links with umlauts don't work

Should we take into account that many TLDs allow only a restricted set of Unicode characters for registering Internationalized Domain Names, or blindly Punycode all characters and hope for the best?

Offline

#6 2011-07-01 10:06:51

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: [textile] Links with umlauts don't work

LOL… having a library for encoding URLs that’s bigger than the rest of TXP combined may be a bit too much. I suspect some characters (inside US-ASCII) are still not allowed in a URL, so those can be used to detect where a URL ends. And it’s up to the user to specify a valid URL.

I don’t really see the difference between http://5u43750qhnlfdhg9-uthgjtrewtngfdksbfdsgfd.com (non-existing domain) and http://contains_unicode_outside_restricted_set_for_some_tlds. In both cases the URL won’t lead you to an existing website.
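The idea of using disallowed ASCII characters as the end-of-URL marker could look roughly like the Python sketch below. The pattern and the name find_urls are assumptions for illustration, not Textile's actual link regex; it stops at whitespace and at the < > " delimiters, which are among the characters RFC 2396 excludes from URIs, and lets everything else (including non-ASCII) through.

```python
import re

# Assumed pattern for illustration only: a URL runs until whitespace
# or one of the delimiters < > " is reached. Non-ASCII is accepted.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def find_urls(text: str) -> list[str]:
    """Return every http(s) URL found in text, non-ASCII included."""
    return URL_RE.findall(text)


sample = '"Fukushima":http://ja.wikipedia.org/wiki/福島 is a Wikipedia page'
print(find_urls(sample))  # ['http://ja.wikipedia.org/wiki/福島']
```

A real implementation would also have to deal with trailing punctuation such as a full stop right after the URL, which this sketch swallows into the match.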

#7 2011-08-18 08:04:41

Vienuolis
Member
From: Vilnius, Lithuania
Registered: 2009-06-14
Posts: 307
Website GitHub GitLab Twitter

Re: [textile] Links with umlauts don't work

The problem is URL encoding in general, not only IDN. Wikipedia.org allows non-ASCII characters in URLs, and MediaWiki encodes them successfully (ė → %C4%97). Textpattern can convert them too (Admin: Meta: URL), but Textile does not.
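For reference, the conversion mentioned here is ordinary percent-encoding of the UTF-8 bytes. A minimal Python sketch using the standard library's urllib.parse.quote shows the mapping; the helper name encode_path is made up for this example, and this is not the code MediaWiki or Textpattern actually runs.

```python
from urllib.parse import quote


def encode_path(path: str) -> str:
    """Percent-encode non-ASCII characters in a URL path as UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    # "/" stays as the path separator; everything outside the
    # unreserved ASCII set becomes %XX escapes.
    return quote(path, safe="/")


print(encode_path("ė"))           # %C4%97
print(encode_path("/wiki/福島"))  # /wiki/%E7%A6%8F%E5%B3%B6
```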

#8 2011-08-18 09:30:06

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: [textile] Links with umlauts don't work

Right, URL encoding for the part after the domain name, IDN for the domain name itself: http://idn-encoded-domain-name.tld/urlencoded-part-of-the-url.
And in the URL-encoded part, the difficulty is also in distinguishing the ? & characters as used to separate query string parameters from the same characters that occur inside the data you’re trying to send. You can’t distinguish those, so the user has to provide a good URL. Wikipedia is able to do this automatically, because when you create a new page, Wikipedia knows what the URL should look like (as does TXP when you specify that part of the URL in ‘meta’). If a user specifies a URL in Textile, you don’t have the required extra information to decide whether to have & in the query string or %26.
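A minimal Python sketch of that split, assuming the user's ? & = characters are always meant as separators (which is exactly the information a parser cannot verify). The function name to_ascii_url is hypothetical, and the sketch drops any user:password part of the URL for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """IDN-encode the host and percent-encode the rest of a URL.

    Hypothetical helper: assumes & and = in the query are separators
    and that existing % signs start valid escapes.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Host name: Punycode via Python's built-in "idna" codec.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Path, query and fragment: percent-encode, leaving separators as typed.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("http://müller.de/straße?q=ä&x=1"))
# http://xn--mller-kva.de/stra%C3%9Fe?q=%C3%A4&x=1
print(to_ascii_url("http://ja.wikipedia.org/wiki/福島"))
# http://ja.wikipedia.org/wiki/%E7%A6%8F%E5%B3%B6
```

A literal & that belongs to the data would have to arrive already written as %26, because nothing in the input tells the two uses apart.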

Last edited by ruud (2011-08-18 09:31:57)

#9 2011-08-19 11:01:59

Vienuolis
Member
From: Vilnius, Lithuania
Registered: 2009-06-14
Posts: 307
Website GitHub GitLab Twitter

Re: [textile] Links with umlauts don't work

If a user specifies a URL in Textile, you don’t have the required extra information to decide whether to have & in the query string or %26.

Could we achieve a reasonable share, say 90%, of the desired results for URLs enclosed in []? I guess & is not a big problem; it is much harder for an author to encode every non-ASCII letter in non-English texts. I would rather write <a href=""> by hand for the remaining 10% than code 100% of the URLs in HTML.

#10 2011-08-19 11:10:02

ruud
Developer Emeritus
From: a galaxy far far away
Registered: 2006-06-04
Posts: 5,068
Website

Re: [textile] Links with umlauts don't work

90% should be easier to achieve, provided the domain name itself contains only ASCII characters and only the part that follows contains non-ASCII.

#11 2011-08-19 11:45:07

Vienuolis
Member
From: Vilnius, Lithuania
Registered: 2009-06-14
Posts: 307
Website GitHub GitLab Twitter

Re: [textile] Links with umlauts don't work

Most non-ASCII URLs come from Wikipedia, and their number keeps growing. IDN domains are not popular; they occur only on rare occasions.

#12 2011-08-20 06:40:05

phiw13
Plugin Author
From: Japan
Registered: 2004-02-27
Posts: 3,081
Website

Re: [textile] Links with umlauts don't work

Vienuolis wrote:

Most non-ASCII URLs come from Wikipedia, and their number keeps growing. IDN domains are not popular; they occur only on rare occasions.

That depends on where in the world you look :-). I am starting to see them more and more (Japanese and Chinese; yesterday I saw one in a commercial on Japanese TV).

