Regex to remove \t \r \n from string

  • morleyc@gmail.com

    Regex to remove \t \r \n from string

    Hi, I would like to remove a number of characters from my string (\t,
    \r and \n, which appear throughout the string). I know regex can do this
    but I have no idea how. Any pointers much appreciated.

    Chris

  • Nicholas Paldino [.NET/C# MVP]

    #2
    Re: Regex to remove \t \r \n from string

    Chris,

    Why not just use three calls to the Replace method on the String class?

    string myString = input.Replace("\t", "").Replace("\r", "").Replace("\n", "");

    The character overload, Replace(char, char), works too, but only if you
    want to swap each character for another one (a space, say) rather than
    remove it.
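
As a minimal sketch of the suggestion above (the class and method names are made up for illustration), the three chained calls could be wrapped like this. The second method shows what the character overload can do: it substitutes one character for another, so it can blank these characters out but cannot delete them.

```csharp
using System;

public static class ReplaceDemo
{
    // Removes tab, carriage return and line feed with three chained calls
    // to String.Replace(string, string). Each call returns a new string.
    public static string StripControlChars(string input)
    {
        return input.Replace("\t", "").Replace("\r", "").Replace("\n", "");
    }

    // String.Replace(char, char) swaps one character for another, so it
    // can turn these characters into spaces but cannot remove them.
    public static string ControlCharsToSpaces(string input)
    {
        return input.Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
    }
}
```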

    --
    - Nicholas Paldino [.NET/C# MVP]
    - mvp@spam.guard.caspershouse.com


    • morleyc@gmail.com

      #3
      Re: Regex to remove \t \r \n from string

      > Why not just use three calls to the Replace method on the String class?

      I am currently using the 3 Replace calls :), however I have always
      avoided regular expressions before, and this seemed the ideal excuse to
      learn them! I would also be interested in turning \r\n in a string into
      just \n. I'm sure it must be possible?





      • Nicholas Paldino [.NET/C# MVP]

        #4
        Re: Regex to remove \t \r \n from string

        Absolutely, just wondering why you wouldn't take the simpler, more
        maintainable (depending on who is looking at it, at least from my point of
        view) approach. =)

        In this case, I believe you can have a regular expression of "[\t\r\n]"
        and then call the Replace method, passing your input string and an empty
        string (or whatever you want to replace any of the characters in that set
        with) and it should work.
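
A minimal sketch of that approach, assuming the pattern is held in a static field (the class and member names are illustrative, not from the thread). The verbatim string passes the backslashes through so the regex engine itself interprets \t, \r and \n:

```csharp
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // The character class [\t\r\n] matches any single tab, carriage
    // return or line feed.
    private static readonly Regex ControlChars = new Regex(@"[\t\r\n]");

    // Replaces every match with the empty string, i.e. removes it.
    public static string Strip(string input)
    {
        return ControlChars.Replace(input, "");
    }
}
```

Passing a different replacement string (a single space, for instance) to Replace would substitute rather than remove.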



        • tomisarobot@gmail.com

          #5
          Re: Regex to remove \t \r \n from string

          It certainly is possible. You should create a little test project and
          play with it. The thing to remember about regex is to start small and
          build up. It's not hard really, but it's horribly easy to assume that
          things will behave differently from how they really do.

          It's been a while since I've done captures with PCRE, but for the simple
          replace you are probably looking at something like this: [\r\n\t]
          (no | separators are needed; inside a character class a | is matched
          literally).
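
One detail worth checking when writing such a class: inside square brackets, | is an ordinary character, not alternation. A small sketch (class and method names are made up for illustration) shows the effect of including it:

```csharp
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The | characters are members of the class here, so literal pipes
    // in the input are removed along with the control characters.
    public static string WithPipes(string input)
    {
        return Regex.Replace(input, @"[\r|\n|\t]", "");
    }

    // Listing the characters with no separators removes only CR, LF and tab.
    public static string WithoutPipes(string input)
    {
        return Regex.Replace(input, @"[\r\n\t]", "");
    }
}
```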

          .NET also has some context variable to make sure you have your
          line endings localized correctly, if that's all you are trying to do.
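
For the \r\n to \n question asked earlier, a plain string replace is probably enough. The sketch below assumes that the "context variable" mentioned above is Environment.NewLine, which is a guess; the class and method names are illustrative.

```csharp
using System;

public static class LineEndingDemo
{
    // Collapse Windows line endings (\r\n) to a bare \n.
    public static string ToUnixNewlines(string input)
    {
        return input.Replace("\r\n", "\n");
    }

    // Normalize to \n first, then expand to the current platform's
    // line ending as reported by Environment.NewLine.
    public static string ToPlatformNewlines(string input)
    {
        return ToUnixNewlines(input).Replace("\n", Environment.NewLine);
    }
}
```

Normalizing before expanding avoids turning an existing \r\n into \r\r\n on platforms where Environment.NewLine is already \r\n.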




          • Ben Voigt

            #6
            Re: Regex to remove \t \r \n from string

            "Nicholas Paldino [.NET/C# MVP]" <mvp@spam.guard.caspershouse.com> wrote in
            message news:49CF19FA-B96B-4CEF-A994-FCE96B26129A@microsoft.com...
            > Absolutely, just wondering why you wouldn't take the simpler, more
            > maintainable (depending on who is looking at it, at least from my point of
            > view) approach. =)

            Because your simpler method involves three complete string copies instead of
            one!


            RegEx.Replace ought to do it.
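
For comparison, the "one copy" idea can also be had without a regex. This is a minimal sketch of a single-pass filter using a StringBuilder; it is not something proposed in the thread, and the class and method names are illustrative.

```csharp
using System.Text;

public static class SinglePassDemo
{
    // Walks the input once and copies every character except tab,
    // carriage return and line feed into a single output buffer.
    public static string Strip(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c != '\t' && c != '\r' && c != '\n')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```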



            • Jon Skeet [C# MVP]

              #7
              Re: Regex to remove \t \r \n from string

              Ben Voigt <rbv@nospam.nospam> wrote:
              >> "Nicholas Paldino [.NET/C# MVP]" <mvp@spam.guard.caspershouse.com> wrote in
              >> message news:49CF19FA-B96B-4CEF-A994-FCE96B26129A@microsoft.com...
              >>> Absolutely, just wondering why you wouldn't take the simpler, more
              >>> maintainable (depending on who is looking at it, at least from my point of
              >>> view) approach. =)
              >
              > Because your simpler method involves three complete string copies instead of
              > one!

              Do we have any evidence that performance is an issue here? Further, do
              we have evidence that regular expressions will actually make this
              faster on the sample data?

              Until both of those have been determined, I'd take a default course of
              the simplest code which does the job.

              > RegEx.Replace ought to do it.

              At what cost to readability though?

              --
              Jon Skeet - <skeet@pobox.com>
              http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
              If replying to the group, please do not mail me too


              • Ben Voigt

                #8
                Re: Regex to remove \t \r \n from string

                "Jon Skeet [C# MVP]" <skeet@pobox.com> wrote in message
                news:MPG.20b43a5f6978d6c8103@msnews.microsoft.com...
                > Ben Voigt <rbv@nospam.nospam> wrote:
                >> "Nicholas Paldino [.NET/C# MVP]" <mvp@spam.guard.caspershouse.com> wrote in
                >> message news:49CF19FA-B96B-4CEF-A994-FCE96B26129A@microsoft.com...
                >>> Absolutely, just wondering why you wouldn't take the simpler, more
                >>> maintainable (depending on who is looking at it, at least from my point of
                >>> view) approach. =)
                >>
                >> Because your simpler method involves three complete string copies instead of
                >> one!
                >
                > Do we have any evidence that performance is an issue here? Further, do
                > we have evidence that regular expressions will actually make this
                > faster on the sample data?
                >
                > Until both of those have been determined, I'd take a default course of
                > the simplest code which does the job.

                Well, ok, but you asked why anyone would ever choose not to do it that way,
                and I gave an example.

                >> RegEx.Replace ought to do it.
                >
                > At what cost to readability though?

                Admittedly, a String.Replace(RegEx, String) method would be far more
                readable, but it would set up a dependency from String on RegEx.
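
The String.Replace(RegEx, String) shape wished for here can be approximated without touching String itself, assuming a compiler that supports extension methods (C# 3.0 or later). The helper below is hypothetical, not part of the framework:

```csharp
using System.Text.RegularExpressions;

public static class StringRegexExtensions
{
    // Hypothetical helper: lets callers write input.Replace(regex, replacement).
    // String has no instance Replace overload that accepts a Regex, so the
    // compiler falls through to this extension method for such calls.
    public static string Replace(this string input, Regex pattern, string replacement)
    {
        return pattern.Replace(input, replacement);
    }
}
```

With that in scope, a call such as input.Replace(new Regex(@"[\t\r\n]"), "") reads much like the built-in overloads, and the dependency on Regex lives in the helper class rather than in String.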

                • Jon Skeet [C# MVP]

                  #9
                  Re: Regex to remove \t \r \n from string

                  >> Until both of those have been determined, I'd take a default course of
                  >> the simplest code which does the job.
                  >
                  > Well, ok, but you asked why anyone would ever choose not to do it that way,
                  > and I gave an example.

                  That's fair enough.

                  >>> RegEx.Replace ought to do it.
                  >> At what cost to readability though?
                  >
                  > Admittedly, a String.Replace(RegEx, String) method would be far more
                  > readable, but set up a dependency from string on RegEx.

                  More importantly, it sets up a dependency on the reader understanding
                  regular expressions, which I've seen causing issues time and time again
                  in these newsgroups.

                  I'm all for regular expressions when their power is really needed, but
                  that tends to be pretty rare IME.


                  • Arne Vajhøj

                    #10
                    Re: Regex to remove \t \r \n from string

                    Jon Skeet [C# MVP] wrote:
                    > Ben Voigt <rbv@nospam.nospam> wrote:
                    >> "Nicholas Paldino [.NET/C# MVP]" <mvp@spam.guard.caspershouse.com> wrote in
                    >> message news:49CF19FA-B96B-4CEF-A994-FCE96B26129A@microsoft.com...
                    >>> Absolutely, just wondering why you wouldn't take the simpler, more
                    >>> maintainable (depending on who is looking at it, at least from my point of
                    >>> view) approach. =)
                    >> Because your simpler method involves three complete string copies instead of
                    >> one!
                    >
                    > Do we have any evidence that performance is an issue here? Further, do
                    > we have evidence that regular expressions will actually make this
                    > faster on the sample data?

                    A simple test seems to indicate that regex is slower.

                    Test                      Length (in -> out)   Iterations   Seconds
                    String Replace            15 -> 12             6666666      6.6875
                    StringBuilder Replace     15 -> 12             6666666      6.546875
                    Regex Replace             15 -> 12             6666666      27.1875
                    Regex Replace Optimized   15 -> 12             6666666      15.828125
                    String Replace            960 -> 768           104166       3.3125
                    StringBuilder Replace     960 -> 768           104166       2.03125
                    Regex Replace             960 -> 768           104166       17.421875
                    Regex Replace Optimized   960 -> 768           104166       13.4375
                    String Replace            1000 -> 1000         100000       1.15625
                    StringBuilder Replace     1000 -> 1000         100000       2.4375
                    Regex Replace             1000 -> 1000         100000       3.78125
                    Regex Replace Optimized   1000 -> 1000         100000       2.703125

                    (see code below)

                    >> RegEx.Replace ought to do it.
                    >
                    > At what cost to readability though?

                    Actually I think the regex code is more readable.

                    Arne

                    ============================================================

                    using System;
                    using System.Text;
                    using System.Text.RegularExpressions;

                    namespace E
                    {
                        public class MainClass
                        {
                            // Total characters processed per test; iterations = N / input length.
                            private const int N = 100000000;
                            // Columns: name : input length -> output length x iterations : seconds.
                            private const string FMT = "{0,-25} : {1} -> {2} x {3} : {4}";

                            // Three chained String.Replace calls.
                            private static void TestStringReplace(string s)
                            {
                                int n = N / s.Length;
                                string s2 = null;
                                DateTime dt1 = DateTime.Now;
                                for (int i = 0; i < n; i++)
                                {
                                    s2 = s.Replace("\r", "").Replace("\n", "").Replace("\t", "");
                                }
                                DateTime dt2 = DateTime.Now;
                                Console.WriteLine(String.Format(FMT, "String Replace", s.Length,
                                    s2.Length, n, (dt2 - dt1).TotalSeconds));
                            }

                            // Three chained StringBuilder.Replace calls.
                            // Note: sb is created once, so after the first iteration there is
                            // nothing left in it to replace.
                            private static void TestStringBuilderReplace(string s)
                            {
                                int n = N / s.Length;
                                StringBuilder sb = new StringBuilder(s);
                                string s2 = null;
                                DateTime dt1 = DateTime.Now;
                                for (int i = 0; i < n; i++)
                                {
                                    s2 = sb.Replace("\r", "").Replace("\n", "").Replace("\t",
                                        "").ToString();
                                }
                                DateTime dt2 = DateTime.Now;
                                Console.WriteLine(String.Format(FMT, "StringBuilder Replace",
                                    s.Length, s2.Length, n, (dt2 - dt1).TotalSeconds));
                            }

                            // Static Regex.Replace with the pattern passed on every call.
                            private static void TestRegexReplace(string s)
                            {
                                int n = N / s.Length;
                                string s2 = null;
                                DateTime dt1 = DateTime.Now;
                                for (int i = 0; i < n; i++)
                                {
                                    s2 = Regex.Replace(s, "[\r\n\t]", "");
                                }
                                DateTime dt2 = DateTime.Now;
                                Console.WriteLine(String.Format(FMT, "Regex Replace", s.Length,
                                    s2.Length, n, (dt2 - dt1).TotalSeconds));
                            }

                            // One Regex instance built up front with RegexOptions.Compiled.
                            private static void TestRegexReplaceOptimized(string s)
                            {
                                int n = N / s.Length;
                                Regex re = new Regex("[\r\n\t]", RegexOptions.Compiled);
                                string s2 = null;
                                DateTime dt1 = DateTime.Now;
                                for (int i = 0; i < n; i++)
                                {
                                    s2 = re.Replace(s, "");
                                }
                                DateTime dt2 = DateTime.Now;
                                Console.WriteLine(String.Format(FMT, "Regex Replace Optimized",
                                    s.Length, s2.Length, n, (dt2 - dt1).TotalSeconds));
                            }

                            private static void Test(string s)
                            {
                                TestStringReplace(s);
                                TestStringBuilderReplace(s);
                                TestRegexReplace(s);
                                TestRegexReplaceOptimized(s);
                            }

                            public static void Main(string[] args)
                            {
                                // 15 characters, 3 of which get removed.
                                string shortstr = "aaa\rbbb\nccc\tddd";
                                Test(shortstr);
                                // Doubled six times: 960 characters.
                                string longstr = shortstr;
                                longstr += longstr;
                                longstr += longstr;
                                longstr += longstr;
                                longstr += longstr;
                                longstr += longstr;
                                longstr += longstr;
                                Test(longstr);
                                // 1000 characters with nothing to remove.
                                string nonestr = String.Empty.PadRight(1000, 'A');
                                Test(nonestr);
                                Console.ReadLine();
                            }
                        }
                    }
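
The benchmark above times each loop with DateTime.Now. As a hedged alternative (not what produced the posted numbers), System.Diagnostics.Stopwatch is intended for measuring elapsed time. The helper name TimeSeconds is made up for this sketch, and it assumes a framework version that provides the Action delegate.

```csharp
using System;
using System.Diagnostics;

public static class TimingDemo
{
    // Runs the action the given number of times and returns the
    // elapsed wall-clock time in seconds.
    public static double TimeSeconds(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }
}
```

Each test body could then be passed in as a delegate, keeping the timing logic in one place instead of repeating it per test.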


                    • Jon Skeet [C# MVP]

                      #11
                      Re: Regex to remove \t \r \n from string

                      Arne Vajhøj <arne@vajhoej.dk> wrote:
                      >> At what cost to readability though?
                      > Actually I think the regex code is more readable.

                      Well, it's interesting that your regex is "[\r\n\t]". I'm actually
                      slightly surprised this even works, as the \r, \n and \t are being
                      taken literally by the regex engine rather than having been escaped in
                      the normal way. I'd have expected "[\\r\\n\\t]" or @"[\r\n\t]" to make
                      it clear to the regex engine that you really meant the carriage return
                      etc to be part of the regex, and not incidental or for the sake of
                      readability (splitting the regex over several lines, as shown in
                      Jesse's example in another thread).

                      That extra level of escaping which is required in *some* cases (but
                      clearly not all) as well as having to understand the basic language of
                      regex in the first place is what makes it less readable in my opinion.
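
The two levels of escaping discussed here can be made concrete with a small sketch (the class and member names are illustrative). In the regular literal the C# compiler produces the control characters before the regex engine ever sees the pattern; in the verbatim literal the backslashes reach the regex engine, which interprets them itself. Both end up matching the same three characters.

```csharp
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Regular literal: the compiler turns \r \n \t into the actual
    // control characters, so the pattern is 5 characters long.
    public const string CompilerEscaped = "[\r\n\t]";

    // Verbatim literal: the backslashes survive, so the pattern is
    // 8 characters long and the regex engine decodes the escapes.
    public const string RegexEscaped = @"[\r\n\t]";

    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }
}
```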


                      • Arne Vajhøj

                        #12
                        Re: Regex to remove \t \r \n from string

                        Jon Skeet [C# MVP] wrote:
                        > Arne Vajhøj <arne@vajhoej.dk> wrote:
                        >>> At what cost to readability though?
                        >> Actually I think the regex code is more readable.
                        >
                        > Well, it's interesting that your regex is "[\r\n\t]". I'm actually
                        > slightly surprised this even works, as the \r, \n and \t are being
                        > taken literally by the regex engine rather than having been escaped in
                        > the normal way. I'd have expected "[\\r\\n\\t]" or @"[\r\n\t]" to make
                        > it clear to the regex engine that you really meant the carriage return
                        > etc to be part of the regex, and not incidental or for the sake of
                        > readability (splitting the regex over several lines, as shown in
                        > Jesse's example in another thread).
                        >
                        > That extra level of escaping which is required in *some* cases (but
                        > clearly not all) as well as having to understand the basic language of
                        > regex in the first place is what makes it less readable in my opinion.

                        I just used the regex provided by Nicholas.

                        And yes there are different rules inside and outside character
                        classes.

                        And I can not see the readability problem. The intent of the
                        code is obvious.

                        You are not sure that it works correctly. But that can be
                        verified.

                        The Substring/IndexOf combo could be less obvious to read
                        and would still need to be verified that it works.

                        Arne


                        • Jon Skeet [C# MVP]

                          #13
                          Re: Regex to remove \t \r \n from string

                          Arne Vajhøj <arne@vajhoej.dk> wrote:
                          >> That extra level of escaping which is required in *some* cases (but
                          >> clearly not all) as well as having to understand the basic language of
                          >> regex in the first place is what makes it less readable in my opinion.
                          >
                          > I just used the regex provided by Nicholas.
                          >
                          > And yes there are different rules inside and outside character
                          > classes.
                          >
                          > And I can not see the readability problem. The intent of the
                          > code is obvious.

                          To you, possibly. To me, even - I've done just enough regex to work out
                          what it means, although I wouldn't necessarily say it's obvious. To
                          every maintenance engineer? Not necessarily.

                          > You are not sure that it works correctly. But that can be
                          > verified.

                          There are lots of things that can be verified, but which are still less
                          obvious than writing things in a simpler way.

                          > The Substring/IndexOf combo could be less obvious to read
                          > and would still need to be verified that it works.

                          There's no Substring/IndexOf to be done - just three calls to Replace.
                          It's blindingly obvious what *they* do.


                          • Arne Vajhøj

                            #14
                            Re: Regex to remove \t \r \n from string

                            Jon Skeet [C# MVP] wrote:
                            > Arne Vajhøj <arne@vajhoej.dk> wrote:
                            >> And I can not see the readability problem. The intent of the
                            >> code is obvious.
                            >
                            > To you, possibly. To me, even - I've done just enough regex to work out
                            > what it means, although I wouldn't necessarily say it's obvious. To
                            > every maintenance engineer? Not necessarily.

                            It is a feature in .NET - it is a feature in most programming
                            environments today.

                            If they don't know, then they should learn.

                            >> The Substring/IndexOf combo could be less obvious to read
                            >> and would still need to be verified that it works.
                            >
                            > There's no Substring/IndexOf to be done - just three calls to Replace.
                            > It's blindingly obvious what *they* do.

                            No Substring/IndexOf in this case. But often regex is replaced
                            with some string manipulation code in the worst tradition of
                            C str functions.

                            Arne


                            • Jon Skeet [C# MVP]

                              #15
                              Re: Regex to remove \t \r \n from string

                              Arne Vajhøj <arne@vajhoej.dk> wrote:
                              >> To you, possibly. To me, even - I've done just enough regex to work out
                              >> what it means, although I wouldn't necessarily say it's obvious. To
                              >> every maintenance engineer? Not necessarily.
                              >
                              > It is a feature in .NET - it is a feature in most programming
                              > environments today.
                              >
                              > If they don't know, then they should learn.

                              I'd rather not have to check the ins and outs of regular expressions
                              when there's a *very* simple alternative. It's so easy to go wrong with
                              regular expressions - I only use them when they provide a clear
                              benefit, which I don't believe they do in this case.

                              Just because you *can* do something with a regex doesn't mean you
                              *should*. I'm happy to go back and be really careful with regular
                              expressions when there's a good reason to use them, like validating
                              something which is genuinely a *pattern*, but I've seen enough people
                              get confused by them to be wary of them myself.

                              >>> The Substring/IndexOf combo could be less obvious to read
                              >>> and would still need to be verified that it works.
                              >> There's no Substring/IndexOf to be done - just three calls to Replace.
                              >> It's blindingly obvious what *they* do.
                              >
                              > No Substring/IndexOf in this case. But often regex is replaced
                              > with some string manipulation code in the worst tradition of
                              > C str functions.

                              And likewise simple string manipulation code is replaced with a regex
                              for no reason whatsoever, sometimes introducing bugs at the same time.
