How to remove accents (A-Umlaut to A)

  • cody

    How to remove accents (A-Umlaut to A)

    Is there a method to replace special characters like Ä (A-Umlaut) with
    A, Ö (O-Umlaut) with O, and so on?
    Sure, I could look for each character separately and replace it with its
    ASCII counterpart, but French, Swedish, and many other languages have
    such special characters too, and I want to catch those as well. Is
    there a generic way to do it?
  • Morten Wennevik [C# MVP]

    #2
    Re: How to remove accents (A-Umlaut to A)

    On Tue, 07 Aug 2007 14:05:46 +0200, cody <deutronium@gmx.de> wrote:
    > Is there a method to replace special characters like Ä (A-Umlaut) with
    > A, Ö (O-Umlaut) with O, and so on?
    > Sure, I could look for each character separately and replace it with its
    > ascii-counterpart, but there are also such special characters in French
    > and Swedish and many other languages which I also want to catch. Is
    > there a generic way to do it?

    Hi Cody,

    There is no generic way to do this. There is a hack that works in most cases: encode the string in one Encoding and read it back in a different one, but this is by no means guaranteed to work for you. Your best bet is to create a lookup table and manually translate each character. If you anticipate a wide variety of characters, maybe Unicode or UTF-8 support is best.

    --
    Happy coding!
    Morten Wennevik [C# MVP]


    • Jon Skeet [C# MVP]

      #3
      Re: How to remove accents (A-Umlaut to A)

      Morten Wennevik [C# MVP] <MortenWennevik@hotmail.com> wrote:
      > There is no generic way to do this. There is a hack that works in
      > most cases involving switching Encoding the string and reading it in
      > a different encoding, but this is by no means ensured to work for
      > you. Your best bet is to create a lookup table and manually translate
      > each character. If you anticipate a wide variety of characters, maybe
      > Unicode or UTF-8 support is best.

      Actually, as of .NET 2.0 there *is* a way of doing this using
      System.Text.NormalizationForm.

      Look at

      wse_frm/thread/78a09bd184351bc5/99f090af662c126c?rnum=11
      (the last response, from Chris Mullins).

      Here's the code posted, which does some upper-casing which isn't needed
      in this case - but it should be okay aside from that.

      Original code:

          // Needs System.Text; MessageBox comes from System.Windows.Forms.
          string s = "áäåãòä:usdBDlG XHHA";
          string normalized = s.Normalize(NormalizationForm.FormKD);

          // ASCII encoding that silently drops anything it cannot encode.
          Encoding ascii = Encoding.GetEncoding(
              "us-ascii",
              new EncoderReplacementFallback(string.Empty),
              new DecoderReplacementFallback(string.Empty));

          byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
          int numberOfEncodedBytes = ascii.GetBytes(normalized, 0,
              normalized.Length,
              encodedBytes, 0);

          string newString = ascii.GetString(encodedBytes).ToUpper();
          MessageBox.Show(newString);

      End of original code.


      Here's a slightly simpler (IMO) version:

          static string RemoveAccents(string input)
          {
              // FormKD splits an accented letter into base letter + combining mark.
              string normalized = input.Normalize(NormalizationForm.FormKD);
              // Empty fallbacks make the encoder drop whatever is not ASCII.
              Encoding removal = Encoding.GetEncoding(
                  Encoding.ASCII.CodePage,
                  new EncoderReplacementFallback(""),
                  new DecoderReplacementFallback(""));

              byte[] bytes = removal.GetBytes(normalized);
              return Encoding.ASCII.GetString(bytes);
          }

      Or an alternative:

          static string RemoveAccents(string input)
          {
              // Needs System.Text and System.Globalization.
              string normalized = input.Normalize(NormalizationForm.FormKD);
              StringBuilder builder = new StringBuilder();
              foreach (char c in normalized)
              {
                  // Keep everything except the combining marks (the accents).
                  if (char.GetUnicodeCategory(c) !=
                      UnicodeCategory.NonSpacingMark)
                  {
                      builder.Append(c);
                  }
              }
              return builder.ToString();
          }


      --
      Jon Skeet - <skeet@pobox.com>
      http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
      If replying to the group, please do not mail me too


      • Morten Wennevik [C# MVP]

        #4
        Re: How to remove accents (A-Umlaut to A)

        On Tue, 07 Aug 2007 19:29:00 +0200, Jon Skeet [C# MVP] <skeet@pobox.com> wrote:
        <snip>

        Interesting.

        Well, it would remove what Unicode defines as accents, which is what the OP asked for, but it does not convert other characters to ASCII. Take the Norwegian æøå: only å is defined as having an accent, though æ and ø could be translated to a and o. The first method would eat æ and ø and return only "a", while the second would return "æøa".
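
        To make the æøå behaviour concrete, here is a minimal sketch built on the second RemoveAccents variant from earlier in the thread. The class name `AccentDemo` and the `Extra` table are assumptions made up for illustration, not part of the framework, and the æ to a and ø to o mappings simply follow the suggestion above; they may well be wrong for a given language.

        ```csharp
        using System;
        using System.Collections.Generic;
        using System.Globalization;
        using System.Text;

        static class AccentDemo
        {
            // Hypothetical supplement table for letters that Unicode does not
            // decompose. The mappings are a judgment call, not a standard.
            static readonly Dictionary<char, string> Extra = new Dictionary<char, string>
            {
                { '\u00E6', "a" }, // æ
                { '\u00C6', "A" }, // Æ
                { '\u00F8', "o" }, // ø
                { '\u00D8', "O" }  // Ø
            };

            public static string RemoveAccents(string input)
            {
                string normalized = input.Normalize(NormalizationForm.FormKD);
                StringBuilder builder = new StringBuilder();
                foreach (char c in normalized)
                {
                    // Drop the combining marks produced by the decomposition.
                    if (char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                    {
                        continue;
                    }

                    // Letters without a decomposition go through the table.
                    string replacement;
                    if (Extra.TryGetValue(c, out replacement))
                    {
                        builder.Append(replacement);
                    }
                    else
                    {
                        builder.Append(c);
                    }
                }
                return builder.ToString();
            }
        }
        ```

        With the table in place, "æøå" comes out as "aoa"; with an empty table the same loop returns "æøa", which is the behaviour described above.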


        • Jon Skeet [C# MVP]

          #5
          Re: How to remove accents (A-Umlaut to A)

          Morten Wennevik [C# MVP] <MortenWennevik@hotmail.com> wrote:

          <snip>
          > Interesting.
          >
          > Well, it would remove what is defined as unicode accents, which is
          > what the OP asked, but it does not normalize other characters into
          > ascii, like the Norwegian æøå, in which case only å is defined as
          > having an accent, though æ and ø could be translated to a and o. The
          > first method would eat æø and return only a and the second would
          > return æøa

          Right. It's a shame there's not better support in the framework for
          this, but as it's improved from 1.1 to 2.0 there's a chance it'll get
          better in the future :)


          • UL-Tomten

            #6
            Re: How to remove accents (A-Umlaut to A)


            On Aug 7, 7:59 pm, "Morten Wennevik [C# MVP]"
            <MortenWenne...@hotmail.com> wrote:
            > æ and ø could be translated to a and o.

            I don't think that makes sense for all languages. As far as I
            understand Unicode normalization, æ is already normalized as far as
            Unicode is concerned, according to the Latin normalization chart.
            Further decomposition risks emulating the dreaded "silent ASCII
            treatment" that .NET gives strings unless you're careful, and should
            likely take culture into account. In some regards, I think Unicode
            normalization may even defeat the purpose of the ASCII-fication we're
            discussing here, since the more information you have about a
            character, the better you can ASCII-fy it. In German, ä is a fancy a,
            but not in Swedish, and "normalization" would have to acknowledge
            this. But we digress...
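
            One way to read "take culture into account" is a per-language override table that is consulted before any generic accent stripping. The sketch below is only that, a sketch: the class name `CultureAwareSketch`, the `Overrides` table, and the German ae/oe/ue/ss spellings are all assumptions chosen for illustration, and any language without an entry simply falls back to dropping the marks.

            ```csharp
            using System;
            using System.Collections.Generic;
            using System.Text;

            static class CultureAwareSketch
            {
                // Hypothetical overrides keyed by two-letter language code.
                // Only German is filled in here, as an example.
                static readonly Dictionary<string, Dictionary<char, string>> Overrides =
                    new Dictionary<string, Dictionary<char, string>>
                    {
                        { "de", new Dictionary<char, string>
                            {
                                { '\u00E4', "ae" }, // ä
                                { '\u00F6', "oe" }, // ö
                                { '\u00FC', "ue" }, // ü
                                { '\u00DF', "ss" }  // ß
                            }
                        }
                    };

                public static string ToAscii(string input, string languageCode)
                {
                    Dictionary<char, string> table;
                    Overrides.TryGetValue(languageCode, out table); // null if absent

                    StringBuilder builder = new StringBuilder();
                    // Compose first so precomposed letters match the table keys.
                    foreach (char c in input.Normalize(NormalizationForm.FormC))
                    {
                        // Surrogate halves cannot be normalized on their own.
                        if (char.IsSurrogate(c))
                        {
                            continue;
                        }

                        string replacement;
                        if (table != null && table.TryGetValue(c, out replacement))
                        {
                            builder.Append(replacement);
                            continue;
                        }

                        // Generic fallback: decompose and keep only ASCII.
                        foreach (char d in c.ToString().Normalize(NormalizationForm.FormKD))
                        {
                            if (d < 128)
                            {
                                builder.Append(d);
                            }
                        }
                    }
                    return builder.ToString();
                }
            }
            ```

            Passing a plain language code keeps the sketch independent of which cultures are installed; in real code the code could come from CultureInfo.TwoLetterISOLanguageName.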


            • UL-Tomten

              #7
              Re: How to remove accents (A-Umlaut to A)


              On Aug 7, 2:05 pm, cody <deutron...@gmx.de> wrote:
              > Is there a method to replace special characters like Ä [...]

              Maybe knowing the reason why you're doing this can help us find you a
              better solution?

              A common example: turning strings into filenames on non-Unicode file
              systems. In this case, using Encoding.ASCII with "" fallback (to avoid
              question marks) is in my opinion not problematic, since the whole idea
              is to truncate the input strings, and the resemblance between filename
              and string is just a bonus. If you don't need that resemblance,
              hashing strings makes things easier. If the purpose is something else,
              maybe you need a different solution.

              Either way, you should be prepared for the contingency that the string
              has _only_ characters without ASCII counterparts, for example.
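
              A minimal sketch of the filename case, including the contingency just mentioned, might look like the following. The names `FileNameSketch` and `ToAsciiName` are made up for illustration, and falling back to a SHA-256 hex digest is just one possible choice. The sketch does not remove characters that are invalid in filenames (such as ':'), so that would still need handling.

              ```csharp
              using System;
              using System.Security.Cryptography;
              using System.Text;

              static class FileNameSketch
              {
                  public static string ToAsciiName(string input)
                  {
                      // ASCII encoding with "" fallback, so no question marks appear.
                      Encoding stripping = Encoding.GetEncoding(
                          Encoding.ASCII.CodePage,
                          new EncoderReplacementFallback(""),
                          new DecoderReplacementFallback(""));

                      string normalized = input.Normalize(NormalizationForm.FormKD);
                      string ascii = Encoding.ASCII.GetString(stripping.GetBytes(normalized));

                      // Something readable survived: use it.
                      if (ascii.Trim().Length > 0)
                      {
                          return ascii;
                      }

                      // Nothing survived (no ASCII counterparts at all):
                      // fall back to a hash of the original string.
                      using (SHA256 sha = SHA256.Create())
                      {
                          byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(input));
                          return BitConverter.ToString(hash).Replace("-", "");
                      }
                  }
              }
              ```

              The hash keeps distinct inputs apart even when every character is dropped, at the cost of losing any resemblance to the original string.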


              • cody

                #8
                Re: How to remove accents (A-Umlaut to A)

                Jon Skeet [C# MVP] wrote:
                <snip>

                Thank you very much, this will do it!

                      • cody

                        #12
                        Re: How to remove accents (A-Umlaut to A)

                        Jon Skeet [C# MVP] wrote:
                        Morten Wennevik [C# MVP] <MortenWennevik @hotmail.comwro te:
                        >On Tue, 07 Aug 2007 14:05:46 +0200, cody <deutronium@gmx .dewrote:
                        >>
                        >>Is there a method to replace special characters like Ä (A-Umlaut) with
                        >>A, Ö (O-Umlaut) with O, and so on?
                        >>Sure, I could look for each character separately and replace it with its
                        >>ascii-counterpart, but there are also such special characters in French
                        >>and Swedish and many other languages which I also want to catch. Is
                        >>there a generic way to do it?
                        >There is no generic way to do this. There is a hack that works in
                        >most cases involving switching Encoding the string and reading it in
                        >a different encoding, but this is by no means ensured to work for
                        >you. Your best bet is to create a lookup table and manually translate
                        >each character. If you anticipate a wide variety of characters, maybe
                        >Unicode or UTF-8 support is best.
                        >
                        Actually, as of .NET 2.0 there *is* a way of doing this using
                        System.Text.NormalizationForm.
                        >
                        Look at

                        wse_frm/thread/78a09bd184351bc 5/99f090af662c126 c?rnum=11
                        (the last response, from Chris Mullins).
                        >
                        Here's the code posted, which does some upper-casing which isn't needed
                        in this case - but it should be okay aside from that.
                        >
                        Original code:
                        >
                        string s = "áäåãòä:usdBDlG XHHA";
                        string normalized = s.Normalize(NormalizationForm.FormKD);
                        >
                        >
                        Encoding ascii = Encoding.GetEncoding(
                            "us-ascii",
                            new EncoderReplacementFallback(string.Empty),
                            new DecoderReplacementFallback(string.Empty));
                        >
                        >
                        byte[] encodedBytes = new byte[ascii.GetByteCount(normalized)];
                        int numberOfEncodedBytes = ascii.GetBytes(normalized, 0,
                            normalized.Length,
                            encodedBytes, 0);
                        >
                        >
                        string newString = ascii.GetString(encodedBytes).ToUpper();
                        MessageBox.Show(newString);
                        >
                        End of original code.
                        >
                        >
                        Here's a slightly simpler (IMO) version:
                        >
                        static string RemoveAccents(string input)
                        {
                            string normalized = input.Normalize(NormalizationForm.FormKD);
                            Encoding removal = Encoding.GetEncoding(
                                Encoding.ASCII.CodePage,
                                new EncoderReplacementFallback(""),
                                new DecoderReplacementFallback(""));
                        >
                            byte[] bytes = removal.GetBytes(normalized);
                            return Encoding.ASCII.GetString(bytes);
                        }
                        >
                        Or an alternative:
                        >
                        static string RemoveAccents(string input)
                        {
                            string normalized = input.Normalize(NormalizationForm.FormKD);
                            StringBuilder builder = new StringBuilder();
                            foreach (char c in normalized)
                            {
                                if (char.GetUnicodeCategory(c) !=
                                    UnicodeCategory.NonSpacingMark)
                                {
                                    builder.Append(c);
                                }
                            }
                            return builder.ToString();
                        }
                        >
                        >
                        Thank you very much, this will do it!
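
For anyone landing on this thread later, here is a self-contained sketch of the second variant quoted above, with the using directives it needs. The class name AccentStripper and the null check are assumptions added for illustration; they are not part of the code posted in the thread.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical wrapper class; the thread only shows the bare method.
public static class AccentStripper
{
    public static string RemoveAccents(string input)
    {
        // Assumption: reject null instead of failing inside Normalize.
        if (input == null) throw new ArgumentNullException(nameof(input));

        // FormKD splits an accented letter into its base letter followed
        // by one or more combining marks (e.g. Ä becomes A + U+0308).
        string normalized = input.Normalize(NormalizationForm.FormKD);

        StringBuilder builder = new StringBuilder(normalized.Length);
        foreach (char c in normalized)
        {
            // Drop the combining marks, keep everything else.
            if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }
}
```

With this, AccentStripper.RemoveAccents("Ärger") gives "Arger". One thing to keep in mind: letters such as ø or ß have no Unicode decomposition, so this variant passes them through unchanged, whereas the ASCII-encoding variant silently drops them. Which behaviour is preferable depends on what the result is used for.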
