Easy (?!) regular expression -- find line breaks

  • Peter Duniho

    Easy (?!) regular expression -- find line breaks

    So, I'm trying to learn how the Regex class works, and I've been trying to
    use it to do what I think ought to be simple things. Except I can't
    figure out how to do everything I want. :(

    If I want to take a string and break it into individual lines based on a
    specific pattern ("\r\n" in this case, but I don't think it matters), I
    can easily write a loop that does this by scanning through the string
    accumulating characters and spitting out a new string each time it hits
    the "\r\n". But I figured Regex ought to be able to do the scanning for
    me, so that all I have to loop through are the matches.

    I've tried a wide variety of expression strings, but the ones that seem to
    come closest to what I want are:

    "(.+)\r\n" -- works great, except that if the string doesn't terminate
    in a "\r\n", the last line isn't matched

    "(.+)(\r\n)*" -- the idea being to allow the last line to match if no
    "\r\n" is found. Works great, except that the "\r" winds up getting
    captured as well (presumably because the second capture group is just
    ignored and everything up to the "\n" gets captured by the first capture
    group, because the default is to match greedily)

    "(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
    happily matches just a single character at a time

    (Note: I'm using a replacement string specifying the first capture group
    so that I can toss out the "\r\n", but if there's a way to match the
    "\r\n" without it winding up in the match itself while at the same time
    preventing it from being included in the subsequent match attempt, that
    would be wonderful).

    I also tried using single-line mode, trying to work around the problem in
    the second example, but when I do that, the expression happily and
    greedily captures _everything_ up to the very last "\r\n".
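
The three attempts above can be reproduced in a small sketch. It uses Python's `re` module rather than the .NET Regex class, on the assumption that greedy and lazy quantifiers and `.` (which matches "\r" but not "\n") behave the same way for this input. The sample text and the helper `first_groups` are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical input: three lines, no trailing "\r\n".
text = "one\r\ntwo\r\nthree"


def first_groups(pattern, s):
    """Return capture group 1 of every match, in order."""
    return [m.group(1) for m in re.finditer(pattern, s)]


# Attempt 1: each match must end in "\r\n", so the last line is missed.
attempt1 = first_groups(r"(.+)\r\n", text)

# Attempt 2: greedy ".+" also takes the "\r" (only "\n" stops it),
# after which "(\r\n)*" is satisfied by matching zero times.
attempt2 = first_groups(r"(.+)(\r\n)*", text)

# Attempt 3: lazy ".+?" stops after one character, because the
# optional "(\r\n)*" lets the match end immediately.
attempt3 = first_groups(r"(.+?)(\r\n)*", text)

print(attempt1)
print(attempt2)
print(attempt3)
```

In each case the quantifier does exactly what it was asked to do; what is missing is something after the capture group that must match and that only a line break or the end of the text can satisfy.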

    What I'm looking for is the expression that represents "capture all text
    up to the first \r\n pair, allowing for the possibility of one last match
    without the \r\n pair at the end of the string".

    Is this actually impossible using Regex, or is there some combination of
    options that will allow me to match the first \r\n pair without requiring
    a \r\n pair at the end of the last match?

    Thanks,
    Pete
  • Peter Duniho

    #2
    Re: Easy (?!) regular expression -- find line breaks

    On Wed, 13 Jun 2007 20:55:10 -0700, Peter Duniho
    <NpOeStPeAdM@nnowslpianmk.com> wrote:
    > [...]
    > If I want to take a string and break it into individual lines based on a
    > specific pattern ("\r\n" in this case, but I don't think it matters), I
    > can easily write a loop that does this by scanning through the string
    > accumulating characters and spitting out a new string each time it hits
    > the "\r\n". But I figured Regex ought to be able to do the scanning for
    > me, so that all I have to loop through are the matches.

    And just to clarify...

    Yes, I understand that I can just use String.Split() to do this. I'm
    talking about the more general question of the matching, and my little
    self-assigned homework exercise to try to learn how Regex works.

    • Göran Andersson

      #3
      Re: Easy (?!) regular expression -- find line breaks

      Peter Duniho wrote:
      > So, I'm trying to learn how the Regex class works, and I've been trying
      > to use it to do what I think ought to be simple things. Except I can't
      > figure out how to do everything I want. :(
      >
      > If I want to take a string and break it into individual lines based on a
      > specific pattern ("\r\n" in this case, but I don't think it matters), I
      > can easily write a loop that does this by scanning through the string
      > accumulating characters and spitting out a new string each time it hits
      > the "\r\n". But I figured Regex ought to be able to do the scanning for
      > me, so that all I have to loop through are the matches.
      >
      > I've tried a wide variety of expression strings, but the ones that seem
      > to come closest to what I want are:
      >
      > "(.+)\r\n" -- works great, except that if the string doesn't
      > terminate in a "\r\n", the last line isn't matched
      >
      > "(.+)(\r\n)*" -- the idea being to allow the last line to match if
      > no "\r\n" is found. Works great, except that the "\r" winds up getting
      > captured as well (presumably because the second capture group is just
      > ignored and everything up to the "\n" gets captured by the first capture
      > group, because the default is to match greedily)
      >
      > "(.+?)(\r\n)*" -- works great, except that it's _too_ lazy, and
      > happily matches just a single character at a time
      >
      > (Note: I'm using a replacement string specifying the first capture group
      > so that I can toss out the "\r\n", but if there's a way to match the
      > "\r\n" without it winding up in the match itself while at the same time
      > preventing it from being included in the subsequent match attempt, that
      > would be wonderful).

      Use a non-capturing group: (?:\r\n)

      > I also tried using single-line mode, trying to work around the problem
      > in the second example, but when I do that, the expression happily and
      > greedily captures _everything_ up to the very last "\r\n".
      >
      > What I'm looking for is the expression that represents "capture all text
      > up to the first \r\n pair, allowing for the possibility of one last
      > match without the \r\n pair at the end of the string".

      Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)

      > Is this actually impossible using Regex, or is there some combination of
      > options that will allow me to match the first \r\n pair without
      > requiring a \r\n pair at the end of the last match?
      >
      > Thanks,
      > Pete
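
A minimal sketch of the suggested pattern, written with Python's `re` module instead of the .NET Regex class; the assumption is that lazy quantifiers, non-capturing groups and `$` behave alike in both engines for this kind of input. The names `LINE` and `split_lines` are hypothetical.

```python
import re

# Lazy capture of the line text, followed by either "\r\n" or the
# end of the text. The terminator sits in a non-capturing group, so
# group 1 holds only the line content.
LINE = re.compile(r"(.+?)(?:\r\n|$)")


def split_lines(s):
    """Return the text of every non-empty line in s."""
    return [m.group(1) for m in LINE.finditer(s)]


print(split_lines("one\r\ntwo\r\nthree"))
print(split_lines("one\r\ntwo\r\nthree\r\n"))
```

Because `.+?` needs at least one character, blank lines are skipped rather than returned as empty strings; changing it to `.*?` would need extra care to avoid a spurious empty match at the end of the text.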

      --
      Göran Andersson
      _____
      Göran Andersson's private home page.

      • Peter Duniho

        #4
        Re: Easy (?!) regular expression -- find line breaks

        On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <guffa@guffa.com>
        wrote:
        > Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)

        Ah. So simple. Thanks!

        • Jesse Houwing

          #5
          Re: Easy (?!) regular expression -- find line breaks

          * Peter Duniho wrote, On 14-6-2007 19:25:
          > On Thu, 14 Jun 2007 01:12:51 -0700, Göran Andersson <guffa@guffa.com>
          > wrote:
          >
          >> Match either \r\n or $ (end of text): (.+?)(?:\r\n|$)
          >
          > Ah. So simple. Thanks!

          Even easier would be to set the RegexOptions.Multiline option and look for
          the following: "^.*$" This should match at the beginning of every line
          (^), fetch the content (.*) and stop at the end of each line ($).

          It's probably faster as well.

          Jesse

          • Peter Duniho

            #6
            Re: Easy (?!) regular expression -- find line breaks

            On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
            <jesse.houwing@nospam-sogeti.nl> wrote:
            > Even easier would be to set the RegexOptions.Multiline option and look for
            > the following: "^.*$" This should match at the beginning of every line
            > (^), fetch the content (.*) and stop at the end of each line ($).

            Except that as near as I can tell, Regex only uses Unix-style linebreaks.
            That is, \n by itself. Which means that if I use the Multiline option
            (which seems to be the default, actually), I wind up with the \r as part
            of my matched strings, which I don't want.
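
The effect described here can be illustrated with Python's `re` module, which also treats only "\n" as the line terminator for `^` and `$` in multiline mode. That this mirrors the .NET 2.0 behaviour is an assumption based on the description above; the sample text is hypothetical.

```python
import re

# Hypothetical input with Windows-style line breaks.
text = "one\r\ntwo\r\nthree"

# In multiline mode "$" matches just before "\n", and "." matches
# "\r", so the "\r" ends up inside every match except the last.
matches = re.findall(r"^.*$", text, flags=re.MULTILINE)

print(matches)
```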

            Pete

            • Jesse Houwing

              #7
              Re: Easy (?!) regular expression -- find line breaks

              * Peter Duniho wrote, On 14-6-2007 20:39:
              > On Thu, 14 Jun 2007 11:13:41 -0700, Jesse Houwing
              > <jesse.houwing@nospam-sogeti.nl> wrote:
              >
              >> Even easier would be to set the RegexOptions.Multiline option and look for
              >> the following: "^.*$" This should match at the beginning of every line
              >> (^), fetch the content (.*) and stop at the end of each line ($).
              >
              > Except that as near as I can tell, Regex only uses Unix-style
              > linebreaks. That is, \n by itself. Which means that if I use the
              > Multiline option (which seems to be the default, actually), I wind up
              > with the \r as part of my matched strings, which I don't want.

              This shouldn't be so, but does seem to be the case in .NET 2.0. I've
              filed a bug against it and it should be fixed in framework Orcas. It
              wasn't this way in .NET 1.0 and 1.1 as far as I can remember.

              ^.*?\r?^ should fix it in the meantime, but is probably slower.
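
One way to keep the "\r" out of the result, in the spirit of the workaround above, is to capture the line content lazily and let an optional "\r" sit outside the capture group just before `$`. This is a sketch in Python's `re` module, not the exact pattern from the post, and the assumption that the .NET Regex class accepts the same pattern with RegexOptions.Multiline is untested here. `LINE` and `lines_without_cr` are hypothetical names.

```python
import re

# "^" and "$" anchor on "\n" only, so an optional "\r" is consumed
# outside group 1, leaving just the line content in the capture.
LINE = re.compile(r"^(.*?)\r?$", re.MULTILINE)


def lines_without_cr(s):
    """Return each line of s without its "\r\n" or "\n" terminator."""
    return [m.group(1) for m in LINE.finditer(s)]


print(lines_without_cr("one\r\ntwo\r\nthree"))
print(lines_without_cr("a\r\n\r\nb"))
```

Unlike a pattern built on `.+?`, this one reports blank lines as empty strings, and it copes with input that mixes "\r\n" and bare "\n" line breaks.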

              Please file a bug against this to get it fixed in the next service pack
              of .NET 2.0 if you want to see this fixed there. I tried, but they keep
              closing the bug with the message that they cannot reproduce it in Orcas,
              which is still far away for quite a few of our customers.

              Jesse
