Textpattern Forum

#1 2012-08-03 20:44:56

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 168

wcz_utf8_url - native unicode urls

Summary

wcz_utf8_url – uses UTF-8 permalinks instead of transliterated ones, for SEO

Features

  • automatically handles non-ASCII characters
  • integrated tag for updating all of your existing titles: <txp:update_urls /> (back up first and use at your own risk)
  • works with German, Russian, Ukrainian (on live sites)
  • tested with Japanese

Version history:

0.1.4 - minor fix: preserve already existing dashes/minus signs
0.1.3 - added removal of small words
0.1.2 - minor fixes with double dashes and trimming the url string
0.1.1 - minor fixes
0.1.0 - initial release

Requirements

Tested with:

  • Textpattern 4.4.1, MySQL 5.1.49, PHP 5.3.3 (Debian Squeeze)

Download & Installation

  1. Download wcz_utf8_url and install it in the usual way.
  2. Adjust Textpattern->Advanced Options->“Maximum URL length (in characters)” to your needs.

Source

Git repository: https://github.com/whocarez-textpattern/Unicode-url-for-Textpattern

Bugs & Limitations

Let me know if you find any problems.

To do

  • make list of “small words” a parameter
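
A rough idea of how that could look, as a sketch only: the helper name wcz_remove_small_words and its parameters are hypothetical and not part of the plugin, and it uses a plain loop instead of the plugin's regular expression. Words of up to $maxLen characters are dropped unless they are in the $keep list.

```php
<?php
// Hypothetical helper, not part of the plugin: drops dash-separated words of
// at most $maxLen characters unless they appear in the $keep list.
function wcz_remove_small_words($slug, array $keep, $maxLen = 3)
{
    $words = array();
    foreach (explode('-', $slug) as $word) {
        if ($word === '') {
            continue; // skip empty segments left by double dashes
        }
        $isSmall = mb_strlen($word, 'UTF-8') <= $maxLen;
        $isKept  = in_array(mb_strtolower($word, 'UTF-8'), $keep, true);
        if (!$isSmall || $isKept) {
            $words[] = $word;
        }
    }
    return implode('-', $words);
}

echo wcz_remove_small_words('der-zug-nach-kiew', array('zug')), "\n"; // zug-nach-kiew
```

The keep list could then come from a plugin preference or a tag attribute; where exactly it would be stored is left open here.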

Feedback

Comments are welcome.

Credits

Big thanks to Gocom, see this thread

Last edited by whocarez (2012-08-03 20:47:04)

#2 2015-02-08 16:53:48

raminrahimi
Member
From: India
Registered: 2013-03-19
Posts: 105

Re: wcz_utf8_url - native unicode urls

Hi, I tried this with Asian languages (Arabic, Persian, etc.), but it's not working! The URL turns into something like ???????????????. Could you please help me? Thanks.

#3 2015-02-09 17:33:22

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 168

Re: wcz_utf8_url - native unicode urls

Hello,
hm, which version of Textpattern are you using?

I tested it on a fresh Textpattern 4.5.7 installation
with a title like

fhdfgdডতগডগডগвапвпавпважєїхіäöüß所利じょろ

I get

domain/fhdfgd%E0%A6%A1%E0%A6%A4%E0%A6%97%E0%A6%A1%E0%A6%97%E0%A6%A1%E0%A6%97%D0%B2%D0%B0%D0%BF%D0%B2%D0%BF%D0%B0%D0%B2%D0%BF%D0%B2%D0%B0%D0%B6%D1%94%D1%97%D1%85%D1%96%C3%A4%C3%B6%C3%BC%C3%9F%E6%89%80%E5%88%A9%E3%81%98%E3%82%87%E3%82%8D
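
That long string is just the UTF-8 bytes of the title, percent-encoded; browsers usually show it decoded in the address bar. A minimal sketch with PHP's standard rawurlencode() and rawurldecode(), using only the first three Cyrillic letters of the title above as a sample:

```php
<?php
// Sample input: the first three Cyrillic letters of the test title.
// Assumes this file is saved as UTF-8.
$title = 'вап';

// Each Cyrillic letter is two bytes in UTF-8, so it becomes two %XX groups.
$encoded = rawurlencode($title);
echo $encoded, "\n"; // %D0%B2%D0%B0%D0%BF

// Decoding gives back the original title.
echo rawurldecode($encoded), "\n"; // вап
```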

But on another installation I sometimes have problems too. There it seems that the callback is not always registered, so I suppose the standard function is used instead. It looks like you ran into the same problem.

Currently I use this code:

register_callback('wcz_utf8_url', 'sanitize_for_url');

function wcz_utf8_url($event, $step, $text)
{
    // Replace some symbols with words, and slash and backslash with spaces, before
    // deleting unneeded signs. You may want to add more replacements, e.g. € with Euro or евро.
    $text = str_replace(array("1+1","$","€","%","/","\\"),array("1plus1"," Dollar"," Euro"," Prozent"," "," "),$text);
    // Replace all unneeded symbols (punctuation, other/letter numbers, marks,
    // control characters, symbols) with a dash
    $text = preg_replace("/[\p{P}\p{No}\p{Nl}\p{M}\p{C}\p{S}]/u","-",$text);
    // Collapse runs of whitespace, minuses and (back-)slashes into one dash
    // (u modifier added so the pattern works on code points, not on single bytes)
    $text = preg_replace('/[\s\-\/\\\\]+/u', '-', $text);
    // Trim url string
    $text = trim($text,"-");
    // Remove small words (older variant for words of up to two letters, disabled)
    // $text = preg_replace("/(^|-)[\p{Ll}\p{Lu}\p{Lt}\p{Lo}]{1,2}(?=-|$)/u","", $text);
    // Remove words of up to three letters unless they end in one of the listed exceptions
    $text = trim(preg_replace("/(^|-)(([\p{Ll}\p{Lu}\p{Lt}\p{Lo}]{1,3})(?<!new|wer|wen|wie|was|wo|wem|how|who|zug|uni|job|gps|bus|tod|tot|eko|öko|eu|dai|gai|hiv|df|ing|ua|upa|oun|omv|otp|ss|umc|twi|tvi|usa|uno|bio|see|kuh|fuß|not|kot|tür|sex|uhu|rat|dvd|cd|tau|rot|tor|tat|bit|sau|ehe|gut|mfg|ard|zdf|rtl|mdr|tee|uhr|zoo|zeh|rss|xml|pdf|axt|fan|nuß|neu|fkk|aal|bug|ost|alt|rom|ddr|fdj|sed|kgb|fbi|cia|sbu|ohr|age|ece|bip|mts|gus|ntn|cme|ntn|iwf|wto|scm|man|uah|eon|nbu|obi|tv|isd|ilo|akw|who|ooo|stb|gas|em))(?=-|$)/ui","", $text),"-");
    // Remove all non-whitelisted characters (keep letters, decimal digits, dash, underscore)
    $text = preg_replace("/[^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\-_]/u","",$text);
    // Lowercase and trim once more
    $text = trim(mb_strtolower($text,'UTF-8'),'-');
    return $text;
}
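
A side note on the u modifier used in these patterns: without it, PCRE works on single bytes rather than on UTF-8 code points, so a byte-level match can cut a multi-byte character in half. A small sketch with a sample character of my own choosing shows the difference:

```php
<?php
// The Arabic letter seen (U+0633) is two bytes in UTF-8: 0xD8 0xB3.
$char = "\xD8\xB3";

// Without the u modifier, "." matches one byte at a time.
$bytes = preg_match_all('/./', $char, $m);

// With the u modifier, "." matches one whole code point.
$codePoints = preg_match_all('/./u', $char, $m);

echo $bytes, ' ', $codePoints, "\n"; // 2 1
```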

Maybe someone else can have a look at it?

#4 2015-02-09 17:44:42

whocarez
Plugin Author
From: Germany/Ukraine
Registered: 2007-10-08
Posts: 168

Re: wcz_utf8_url - native unicode urls

Try changing

$text = trim(mb_strtolower($text,'UTF-8'),'-');

to

$text = trim(mb_strtolower($text,mb_detect_encoding($text)),'-');

Or try this one: wcz_utf8_url-0.1.6.txt
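
One caveat: mb_detect_encoding() only guesses, and it returns false when nothing fits. It may be more telling to check whether the title that reaches the callback is valid UTF-8 at all, because the patterns with the u modifier need that (on invalid input preg_replace() returns NULL). A minimal sketch; the helper name wcz_is_valid_utf8 is hypothetical and not part of the plugin:

```php
<?php
// Hypothetical helper, not part of the plugin.
function wcz_is_valid_utf8($text)
{
    return mb_check_encoding($text, 'UTF-8');
}

$good = 'سلام';   // valid UTF-8, assuming this file is saved as UTF-8
$bad  = "\xD8";   // a lone lead byte, i.e. a truncated character

var_dump(wcz_is_valid_utf8($good));           // bool(true)
var_dump(wcz_is_valid_utf8($bad));            // bool(false)

// With the u modifier, invalid UTF-8 makes preg_replace() fail and return NULL.
var_dump(preg_replace('/\s+/u', '-', $bad));  // NULL
```

If the check fails inside the callback, the problem is probably in the database or connection charset rather than in the plugin itself.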

Last edited by whocarez (2015-02-09 17:51:48)
