Convert string for use in URLs in multilanguage environment

  • Markus Ernst

    Convert string for use in URLs in multilanguage environment

    Sorry for the multipost - I forgot to crosspost, and alt.php gets less
    attention than comp.lang.php. I hope this post comes through intact in UTF-8.

    To make strings suitable for URLs on a UTF-8 encoded website, I use
    2 functions. The first removes accents from some Latin-1, Latin-2,
    and Turkish characters (suggestions for changes or additions welcome!). The
    second lowercases the result, replaces each run of non-word characters with
    a hyphen, and then urlencode()s the string:

    function remove_accents($string, $german = false) {
        // Single characters: each entry in $single_fr maps to the entry at the
        // same position in $single_to. Plain ASCII entries map to themselves.
        $single_fr = explode(" ",
            "À Á Â Ã Ä Å A A Ç C C D D Ð È É Ê Ë E E G Ì Í " .
            "Î Ï I L L L Ñ N N Ò Ó Ô Õ Ö Ø O R R S S S T T Ù Ú Û Ü U U Ý Z Z Z " .
            "à á â ã ä å a a ç c c d d è é ê ë e e g ì í î ï i l l l ñ n n " .
            "ð ò ó ô õ ö ø o r r s s s t t ù ú û ü u u ý ÿ z z z");
        $single_to = explode(" ",
            "A A A A A A A A C C C D D D E E E E E E G I I " .
            "I I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z " .
            "a a a a a a a a c c c d d e e e e e e g i i i i i l l l n n n " .
            "o o o o o o o o r r s s s t t u u u u u u y y z z z");
        $single = array();
        for ($i = 0; $i < count($single_fr); $i++) {
            $single[$single_fr[$i]] = $single_to[$i];
        }
        // Ligatures
        $ligatures = array("Æ" => "Ae", "æ" => "ae", "Œ" => "Oe", "œ" => "oe",
            "ß" => "ss");
        // German umlauts
        $umlauts = array("Ä" => "Ae", "ä" => "ae", "Ö" => "Oe", "ö" => "oe",
            "Ü" => "Ue", "ü" => "ue");
        // Replace: with array_merge(), later arrays win on duplicate keys,
        // so the umlaut mappings override the single-character ones.
        $replacements = array_merge($single, $ligatures);
        if ($german) {
            $replacements = array_merge($replacements, $umlauts);
        }
        return strtr($string, $replacements);
    }

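    function make_url_string($string) {
        // Strip accents (with German umlaut handling), then lowercase
        $string = strtolower(remove_accents($string, true));
        // Replace each run of non-word characters with a single hyphen
        $string = preg_replace("/([\W]+)/", "-", $string);
        // Drop leading and trailing hyphens before URL-encoding
        return urlencode(trim($string, "-"));
    }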

    I have 2 questions on this:

    1. preg_replace("/([\W]+)/", "-", $string); replaces not only punctuation
    but also every remaining non-ASCII character. Is there a way to replace only
    punctuation and the like, but keep letters from any character set?

    2. Is there a better way to encode strings for URLs? Or is it maybe
    unavoidable to collect the real name and the name for the URL separately to
    get an ASCII-only entry?

    Thanks for suggestions!

    --
    Markus


  • John Dunlop

    #2
    Re: Convert string for use in URLs in multilanguage environment

    Markus Ernst wrote:
    > 2. Is there a better way to encode strings for URLs?

    Arguably. Another way is to take your character, encode it as
    UTF-8 octets, and percent-encode each octet. This is the mapping
    of IRIs to URIs described in RFC 3987. Take <é> (U+00E9). Encoded as
    UTF-8 and then percent-encoded, it becomes %C3%A9 (not %E9, which is
    its percent-encoded ISO-8859-1 form).
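
    A minimal PHP sketch of that mapping for a single path segment, assuming
    the input string is already valid UTF-8. The helper name
    iri_segment_to_uri() is made up for illustration; rawurlencode() is the
    standard PHP function and works on bytes, so each octet of a multi-byte
    sequence gets its own %XX triplet.

```php
<?php
// Hypothetical helper (not from the thread): percent-encode one IRI path
// segment. Assumes $segment is already valid UTF-8.
function iri_segment_to_uri($segment) {
    // rawurlencode() encodes byte by byte, so a two-octet UTF-8 character
    // turns into two %XX triplets.
    return rawurlencode($segment);
}

$e_acute_utf8   = "\xC3\xA9"; // U+00E9 encoded as UTF-8 (two octets)
$e_acute_latin1 = "\xE9";     // the same character in ISO-8859-1 (one octet)

echo iri_segment_to_uri($e_acute_utf8), "\n";   // %C3%A9
echo iri_segment_to_uri($e_acute_latin1), "\n"; // %E9
```

    The byte escapes are used so the result does not depend on the encoding
    the script file itself is saved in.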



    --
    Jock
