deleting texts between patterns

  • micklee74@hotmail.com

    deleting texts between patterns

    hi
    say i have a text file

    line1
    line2
    line3
    line4
    line5
    line6
    abc
    line8 <---to be delete
    line9 <---to be delete
    line10 <---to be delete
    line11 <---to be delete
    line12 <---to be delete
    line13 <---to be delete
    xyz
    line15
    line16
    line17
    line18

    I wish to delete the lines that are in between 'abc' and 'xyz' and print
    the rest of the lines. Which is the best way to do it? Should I get
    everything into a list, get the index of abc and xyz, then pop the
    elements out? Or are there any better methods?
    Thanks

  • Ravi Teja

    #2
    Re: deleting texts between patterns


    mickle...@hotmail.com wrote:
    > hi
    > say i have a text file
    >
    > line1
    > line2
    > line3
    > line4
    > line5
    > line6
    > abc
    > line8 <---to be delete
    > line9 <---to be delete
    > line10 <---to be delete
    > line11 <---to be delete
    > line12 <---to be delete
    > line13 <---to be delete
    > xyz
    > line15
    > line16
    > line17
    > line18
    >
    > I wish to delete lines that are in between 'abc' and 'xyz' and print
    > the rest of the lines. Which is the best way to do it? Should i get
    > everything into a list, get the index of abc and xyz, then pop the
    > elements out? or any other better methods?
    > thanks

    In other words ...
    lines = open('test.txt').readlines()
    for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
        lines.remove(line)
    for line in lines:
        print line,

    Regular expressions are better in this case
    import re
    pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
    print re.sub(pat, '', open('test.txt').read())


    • Duncan Booth

      #3
      Re: deleting texts between patterns

      wrote:
      > hi
      > say i have a text file
      >
      > line1
      > line2
      > line3
      > line4
      > line5
      > line6
      > abc
      > line8 <---to be delete
      > line9 <---to be delete
      > line10 <---to be delete
      > line11 <---to be delete
      > line12 <---to be delete
      > line13 <---to be delete
      > xyz
      > line15
      > line16
      > line17
      > line18
      >
      > I wish to delete lines that are in between 'abc' and 'xyz' and print
      > the rest of the lines. Which is the best way to do it? Should i get
      > everything into a list, get the index of abc and xyz, then pop the
      > elements out? or any other better methods?
      > thanks

      Something like this (untested code):

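      def filtered(f, stop, restart):
          f = iter(f)
          # pass everything through, up to and including the stop marker
          for line in f:
              yield line
              if line==stop:
                  break
          # swallow lines until the restart marker, which is emitted
          for line in f:
              if line==restart:
                  yield line
                  break
          # pass the remainder through unchanged
          for line in f:
              yield line

      for line in filtered(open('thefile'), "abc\n", "xyz\n"):
          print line,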


      • Fredrik Lundh

        #4
        Re: deleting texts between patterns


        <micklee74@hotmail.com> wrote in message news:1147420279.664699.181200@i40g2000cwc.googlegroups.com...
        > hi
        > say i have a text file
        >
        > line1
        > line2
        > line3
        > line4
        > line5
        > line6
        > abc
        > line8 <---to be delete
        > line9 <---to be delete
        > line10 <---to be delete
        > line11 <---to be delete
        > line12 <---to be delete
        > line13 <---to be delete
        > xyz
        > line15
        > line16
        > line17
        > line18
        >
        > I wish to delete lines that are in between 'abc' and 'xyz' and print
        > the rest of the lines. Which is the best way to do it? Should i get
        > everything into a list, get the index of abc and xyz, then pop the
        > elements out? or any other better methods?

        what's wrong with a simple

        emit = True
        for line in open("q.txt"):
            if line == "xyz\n":
                emit = True
            if emit:
                print line,
            if line == "abc\n":
                emit = False

        loop ? (this is also easy to tweak for cases where you don't want to include
        the patterns in the output).

        to print to a file instead of stdout, just replace the print line with a f.write call.

        </F>
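
The tweak mentioned above (leaving the marker lines out of the output, and writing to a file rather than stdout) might look roughly like the following. This is only a sketch in Python 3; the function name `emit_outside` and the `keep_markers` parameter are made up for illustration and are not from the post, and `io.StringIO` objects stand in for real files.

```python
import io

def emit_outside(lines, start="abc\n", end="xyz\n", keep_markers=True):
    """Yield the lines that lie outside the start..end block."""
    emit = True
    for line in lines:
        if line == end:
            emit = True            # the block is over; resume output
        is_marker = line in (start, end)
        if emit and (keep_markers or not is_marker):
            yield line
        if line == start:
            emit = False           # suppress everything after the start marker

# Usage: any iterable of lines works, including an open file object.
source = io.StringIO("line6\nabc\nline8\nline9\nxyz\nline15\n")
target = io.StringIO()             # stands in for a file opened for writing
target.writelines(emit_outside(source, keep_markers=False))
print(target.getvalue(), end="")
```

With `keep_markers=False` a stray end marker outside any block is dropped as well, which may or may not be what is wanted.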




        • John Machin

          #5
          Re: deleting texts between patterns

          On 12/05/2006 6:11 PM, Ravi Teja wrote:
          > mickle...@hotmail.com wrote:
          >> hi
          >> say i have a text file
          >>
          >> line1
          [snip]
          >> line6
          >> abc
          >> line8 <---to be delete
          [snip]
          >> line13 <---to be delete
          >> xyz
          >> line15
          [snip]
          >> line18
          >>
          >> I wish to delete lines that are in between 'abc' and 'xyz' and print
          >> the rest of the lines. Which is the best way to do it? Should i get
          >> everything into a list, get the index of abc and xyz, then pop the
          >> elements out? or any other better methods?
          >> thanks
          >
          > In other words ...
          > lines = open('test.txt').readlines()
          > for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
          >     lines.remove(line)

          I don't think that's what you really meant.
          >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
          >>> for line in lines[lines.index('abc\n') + 1:lines.index('xyz\n')]:
          ...     lines.remove(line)
          ...
          >>> lines
          ['abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']

          Uh-oh.

          Try this:
          >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
          >>> del lines[lines.index('abc\n') + 1:lines.index('xyz\n')]
          >>> lines
          ['blah', 'fubar', 'abc\n', 'xyz\n', 'xyzzy']

          Of course wrapping it in try/except would be a good idea, not for the
          slicing, which behaves itself and does nothing if the 'abc\n' appears
          AFTER the 'xyz\n', but for the index() in case the sought markers aren't
          there. Perhaps it might be a good idea even to do it carefully one piece
          at a time: is the abc there? is the xyz there? is the xyz after the abc
          -- then del lines[index1+1:index2].

          I wonder what the OP wants to happen in a case like this:

          guff1 xyz guff2 abc guff2 xyz guff3
          or this:
          guff1 abc guff2 abc guff2 xyz guff3
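
The "one piece at a time" check described above could be sketched like this in Python 3. The name `delete_between` is hypothetical, and returning the list unchanged when a marker is missing or the markers are out of order is an assumption, since the thread leaves the desired behaviour open.

```python
def delete_between(lines, start, end):
    """Return a copy of lines with everything strictly between the first
    start marker and the first end marker removed; the markers are kept."""
    try:
        i = lines.index(start)   # is the start marker there?
        j = lines.index(end)     # is the end marker there?
    except ValueError:
        return list(lines)       # assumption: leave the input unchanged
    if j <= i:
        return list(lines)       # end marker does not come after start marker
    return lines[:i + 1] + lines[j:]

case1 = "guff1 xyz guff2 abc guff2 xyz guff3".split()
case2 = "guff1 abc guff2 abc guff2 xyz guff3".split()
print(delete_between(case1, "abc", "xyz"))
print(delete_between(case2, "abc", "xyz"))
```

On the first input the first 'xyz' precedes the first 'abc', so nothing is removed; on the second, everything between the first 'abc' and the 'xyz' goes, including the second 'abc'. Whether either result is what the OP wants is exactly the open question.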
          > for line in lines:
          >     print line,
          >
          > Regular expressions are better in this case

          Famous last words.
          > import re
          > pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
          > print re.sub(pat, '', open('test.txt').read())
          >

          I don't think you really meant that either.
          >>> lines = ['blah', 'fubar', 'abc\n', 'blah', 'fubar', 'xyz\n', 'xyzzy']
          >>> linestr = "".join(lines)
          >>> linestr
          'blahfubarabc\nblahfubarxyz\nxyzzy'
          >>> import re
          >>> pat = re.compile('abc\n.*?xyz\n', re.DOTALL)
          >>> print re.sub(pat, '', linestr)
          blahfubarxyzzy

          Uh-oh.

          Try this:
          >>> pat = re.compile('(?<=abc\n).*?(?=xyz\n)', re.DOTALL)
          >>> re.sub(pat, '', linestr)
          'blahfubarabc\nxyz\nxyzzy'

          ... and I can't imagine why you're using the confusing [IMHO]
          undocumented [AFAICT] feature that the first arg of the module-level
          functions like sub and friends can be a compiled regular expression
          object. Why not use this:
          >>> pat.sub('', linestr)
          'blahfubarabc\nxyz\nxyzzy'

          One-liner fanboys might prefer this:
          >>> re.sub('(?s)(?<=abc\n).*?(?=xyz\n)', '', linestr)
          'blahfubarabc\nxyz\nxyzzy'

          HTH,
          John


          • bruno at modulix

            #6
            Re: deleting texts between patterns

            micklee74@hotmail.com wrote:
            > hi
            > say i have a text file
            >
            > line1
            > line2
            > line3
            > line4
            > line5
            > line6
            > abc
            > line8 <---to be delete
            > line9 <---to be delete
            > line10 <---to be delete
            > line11 <---to be delete
            > line12 <---to be delete
            > line13 <---to be delete
            > xyz
            > line15
            > line16
            > line17
            > line18
            >
            > I wish to delete lines that are in between 'abc' and 'xyz' and print
            > the rest of the lines. Which is the best way to do it? Should i get
            > everything into a list, get the index of abc and xyz, then pop the
            > elements out?

            Would be somewhat inefficient IMHO - at least for big files, since it
            implies reading the whole file in memory.
            > or any other better methods?

            Don't know if it's better for your actual use case, but this avoids
            reading up the whole file:

            def skip(iterable, skipfrom, skipuntil):
                """ example usage :
                >>> f = open("/path/to/my/file.txt")
                >>> for line in skip(f, 'abc', 'xyz'):
                ...     print line
                >>> f.close()
                """
                skip = False
                for line in iterable:
                    if skip:
                        # inside the block: drop lines until the end marker
                        if line == skipuntil:
                            skip = False
                        continue
                    else:
                        # outside the block: start skipping at the start marker
                        if line == skipfrom:
                            skip = True
                            continue
                    yield line

            def main():
                lines = """
            line1
            line2
            line3
            line4
            line5
            line6
            abc
            line8 <---to be delete
            line9 <---to be delete
            line10 <---to be delete
            line11 <---to be delete
            line12 <---to be delete
            line13 <---to be delete
            xyz
            line15
            line16
            line17
            line18
            """.strip().splitlines()
                for line in skip(lines, 'abc', 'xyz'):
                    print line

            if __name__ == '__main__':
                main()


            HTH

            --
            bruno desthuilliers
            python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
            p in 'onurb@xiludom.gro'.split('@')])"


            • bruno at modulix

              #7
              Re: deleting texts between patterns

              Fredrik Lundh wrote:
              (snip)
              > to print to a file instead of stdout, just replace the print line with a f.write call.
              >

              Or redirect stdout to a file when calling the program !-)

              --
              bruno desthuilliers
              python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
              p in 'onurb@xiludom.gro'.split('@')])"


              • bruno at modulix

                #8
                Re: deleting texts between patterns

                bruno at modulix wrote:
                > micklee74@hotmail.com wrote:
                >
                (snip)
                >
                > Don't know if it's better for your actual use case, but this avoids
                > reading up the whole file:

                > def skip(iterable, skipfrom, skipuntil):
                >     """ example usage :
                >     >>> f = open("/path/to/my/file.txt")
                >     >>> for line in skip(f, 'abc', 'xyz'):
                >     ...     print line
                >     >>> f.close()
                >     """
                (snip code)

                Forgot to say this will also skip markers. If you want to keep them, see
                the effbot answer...

                --
                bruno desthuilliers
                python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
                p in 'onurb@xiludom.gro'.split('@')])"


                • Tim Chase

                  #9
                  Re: deleting texts between patterns

                  > I wish to delete lines that are in between 'abc' and
                  > 'xyz' and print the rest of the lines. Which is the best
                  > way to do it?

                  While this *is* the python list, you don't specify whether
                  this is the end goal, or whether it's part of a larger
                  program. If it *is* the end goal (namely, you just want the
                  filtered output someplace), and you're not averse to using
                  other tools, you can do something like

                  sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt

                  which is pretty straight-forward. It translates to

                  -n        don't print each line by default
                  -e        execute the following item
                  1,/abc/   from line 1 through the line where you match "abc"
                  p         print each line
                  and also
                  -e        execute the following item
                  /xyz/,$   from the line matching "xyz" through the last line
                  p         print each line


                  It assumes that
                  1) there's only one /abc/ & /xyz/ in the file (otherwise, it
                  defaults to the first one it finds in each case)
                  2) that they're in that order (otherwise, you'll get 2x each
                  line, rather than 0x each line)

                  However, it's a oneliner here, and seems to be a bit more
                  complex in python, so if you don't need to integrate the
                  results into further down-stream python processing, this
                  might be a nice way to go. If you need the python, others
                  on the list have offered a panoply of good answers already.

                  -tkc







                  • Dan Sommers

                    #10
                    [OT] Unix Tools (was: deleting texts between patterns)

                    On Fri, 12 May 2006 07:29:54 -0500,
                    Tim Chase <python.list@tim.thechases.com> wrote:

                    >> I wish to delete lines that are in between 'abc' and
                    >> 'xyz' and print the rest of the lines. Which is the best
                    >> way to do it?

                    > While this *is* the python list, you don't specify whether
                    > this is the end goal, or whether it's part of a larger
                    > program. If it *is* the end goal (namely, you just want the
                    > filtered output someplace), and you're not adverse to using
                    > other tools, you can do something like

                    > sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt

                    Or even

                    awk '/abc/,/xyz/' file.txt

                    Excluding the abc and xyz lines is left as an exercise to the
                    interested reader.

                    Regards,
                    Dan

                    --
                    Dan Sommers
                    <http://www.tombstonezero.net/dan/>
                    "I wish people would die in alphabetical order." -- My wife, the genealogist


                    • Edward Elliott

                      #11
                      Re: [OT] Unix Tools (was: deleting texts between patterns)

                      Dan Sommers wrote:
                      > Or even
                      >
                      > awk '/abc/,/xyz/' file.txt
                      >
                      > Excluding the abc and xyz lines is left as an exercise to the
                      > interested reader.

                      Once again, us completely disinterested readers get the short end of the
                      stick. :)

                      --
                      Edward Elliott
                      UC Berkeley School of Law (Boalt Hall)
                      complangpython at eddeye dot net


                      • Ravi Teja

                        #12
                        Re: deleting texts between patterns

                        >> I don't think that's what you really meant ^ 2

                        Right! That was very buggy. That's what I get for posting past 1 AM :-(.


                        • John Savage

                          #13
                          Re: deleting texts between patterns

                          Tim Chase <python.list@tim.thechases.com> writes:
                          >> I wish to delete lines that are in between 'abc' and
                          >> 'xyz' and print the rest of the lines. Which is the best
                          >> way to do it?
                          >
                          > sed -n -e'1,/abc/p' -e'/xyz/,$p' file.txt
                          >
                          > which is pretty straight-forward.

                          While it looks neat, it will not work when /abc/ matches line 1.
                          Non-standard versions of sed, e.g., GNU, allow you to use 0,/abc/
                          to neatly step around this nuisance; but for standard sed you'll
                          need a more complicated sed script.
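
The line-1 corner case, and the ordering caveat Tim Chase listed, can be reproduced with a small Python 3 emulation of the two sed ranges. This is a hedged sketch, not sed itself: the function name `sed_two_ranges` is invented here, and it relies on the standard range rule that the end pattern of an address range is only checked on lines after the one where the range started.

```python
import re

def sed_two_ranges(lines, first_end="abc", second_start="xyz"):
    """Emulate sed -n -e '1,/abc/p' -e '/xyz/,$p' on a list of lines."""
    out = []
    in_first = True      # the range 1,/abc/ starts on line 1
    in_second = False    # the range /xyz/,$ starts at the first xyz match
    for number, line in enumerate(lines, start=1):
        if in_first:
            out.append(line)
            # the range end is only looked for after the starting line
            if number > 1 and re.search(first_end, line):
                in_first = False
        if not in_second and re.search(second_start, line):
            in_second = True
        if in_second:
            out.append(line)   # a line in both ranges is printed twice
    return out

print(sed_two_ranges(["line1", "abc", "gone", "xyz", "line5"]))
print(sed_two_ranges(["abc", "kept after all", "xyz", "tail"]))
```

In the second call 'abc' sits on line 1, so the first range never closes and nothing is deleted; the lines from 'xyz' onward even appear twice.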
                          --
                          John Savage (my news address is not valid for email)


                          • Baoqiu Cui

                            #14
                            Re: deleting texts between patterns

                            John Machin <sjmachin@lexicon.net> writes:

                            > Uh-oh.
                            >
                            > Try this:
                            >
                            >>>> pat = re.compile('(?<=abc\n).*?(?=xyz\n)', re.DOTALL)
                            >>>> re.sub(pat, '', linestr)
                            > 'blahfubarabc\nxyz\nxyzzy'

                            This regexp still has a problem. It may remove the lines between two
                            lines like 'aaabc' and 'xxxyz' (and also removes the first two 'x's in
                            'xxxyz').

                            The following regexp works better:

                            pattern = re.compile('(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)

                            >>> lines = '''line1
                            ... abc
                            ... line2
                            ... xyz
                            ... line3
                            ... aaabc
                            ... line4
                            ... xxxyz
                            ... line5'''
                            >>> pattern = re.compile('(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)
                            >>> print pattern.sub('', lines)
                            line1
                            abc
                            xyz
                            line3
                            aaabc
                            line4
                            xxxyz
                            line5

                            - Baoqiu

                            --
                            Baoqiu Cui <cbaoqiu at yahoo.com>


                            • John Machin

                              #15
                              Re: deleting texts between patterns

                              On 5/06/2006 2:51 AM, Baoqiu Cui wrote:
                              > John Machin <sjmachin@lexicon.net> writes:
                              >
                              >> Uh-oh.
                              >>
                              >> Try this:
                              >>
                              >>>>> pat = re.compile('(?<=abc\n).*?(?=xyz\n)', re.DOTALL)
                              >>>>> re.sub(pat, '', linestr)
                              >> 'blahfubarabc\nxyz\nxyzzy'
                              >
                              > This regexp still has a problem. It may remove the lines between two
                              > lines like 'aaabc' and 'xxxyz' (and also removes the first two 'x's in
                              > 'xxxyz').
                              >
                              > The following regexp works better:
                              >
                              > pattern = re.compile('(?<=^abc\n).*?(?=^xyz\n)', re.DOTALL | re.MULTILINE)
                              >

                              You are quite correct. Your reply, and the rejoinder below, only add to
                              the proposition that regexes are not necessarily the best choice for
                              every text-processing job :-)

                              Just in case the last line is 'xyz' but is not terminated by '\n':

                              pattern = re.compile('(?<=^abc\n).*?(?=^xyz$)', re.DOTALL | re.MULTILINE)
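
For anyone wanting to try that final pattern today, a Python 3 sketch follows. The helper name `strip_between` is illustrative only; the sample text is the one from Baoqiu Cui's post, plus a short input whose last line is an unterminated 'xyz'.

```python
import re

# Lookbehind requires a whole line "abc"; lookahead requires a whole line "xyz",
# whether or not that line ends with a newline.
PATTERN = re.compile(r'(?<=^abc\n).*?(?=^xyz$)', re.DOTALL | re.MULTILINE)

def strip_between(text):
    """Remove the text between an 'abc' line and the next 'xyz' line."""
    return PATTERN.sub('', text)

sample = "line1\nabc\nline2\nxyz\nline3\naaabc\nline4\nxxxyz\nline5"
print(strip_between(sample))
print(strip_between("abc\nfoo\nxyz"))   # last line has no trailing newline
```

The lines between 'aaabc' and 'xxxyz' survive because neither is a complete marker line.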

                              Cheers,
                              John
