Search for 2 words in a file and print out all the lines between these 2 words

  • dann
    New Member
    • Dec 2010
    • 4


    I need to read in a file and split it into lines. Then I need to search for a start word and an end word and print out all the lines in between, including the lines with the start and end words.

    For example: (file)
    a
    b
    foo
    c
    d
    bar
    e

    output:
    foo
    c
    d
    bar

    I searched the web but without any luck.
    I'm just a beginner, so any help is appreciated. Thanks.

    Code:
    import re
    
    f = open('input', 'r')
    lines = f.readlines()
    for line in lines:
        #if line.startswith("main"):
        # and endswith("end"):
        print re.split(r"\s|," , line)
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    This is how I would approach it:

    Create an open file object for reading
    Iterate on the file object: for line in fileObj:
    Strip the whitespace characters from line
    If line starts with the start word, initialize a list to contain the results
    Start a try/except block to catch StopIteration, because fileObj.next() will raise it upon reaching the end of the file
    Start a while loop: while True:
    Use fileObj.next().strip() to read the next line
    If line starts with the end word, append the word to the list and break out of both loops
    If line does not start with the end word, append the word to the list and continue

    Then you can print the words like this: print "\n".join(resultsList)
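
The steps above can be sketched roughly as follows. This is only a sketch, and it is written for Python 3, whereas the thread uses Python 2: it calls the built-in next(fileObj) instead of fileObj.next(), and print is a function. The name lines_between, its parameters, and the in-memory sample file are assumptions made for illustration. In Python 3, running past the end of the file inside the while loop raises StopIteration.

```python
import io


def lines_between(fileObj, startWord, endWord):
    """Return the stripped lines from startWord through endWord, inclusive."""
    for line in fileObj:
        line = line.strip()
        if line.startswith(startWord):
            resultsList = [line]
            try:
                while True:
                    # Read the next line directly from the same file iterator.
                    line = next(fileObj).strip()
                    resultsList.append(line)
                    if line.startswith(endWord):
                        # Returning leaves both loops at once.
                        return resultsList
            except StopIteration:
                # End of file reached before the end word was seen.
                return resultsList
    return []  # start word never found


# Hypothetical sample data matching the example in the question.
sample = io.StringIO("a\nb\nfoo\nc\nd\nbar\ne\n")
result = lines_between(sample, "foo", "bar")
print("\n".join(result))
```

Putting the loops in a function means a single return replaces the two breaks.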


    • dann
      New Member
      • Dec 2010
      • 4

      #3
      Thank you bvdet for the quick reply.
      I tried to follow your approach, but I got stuck on the second loop (the while loop). I get an error when I try to leave both loops, and I am not sure where to put the breaks to get out of them. Thanks again for your help.
      Code:
      import re
      
      fileObj = open(inputFile, 'r')
      #lines = f.readlines()
      aList = []
      for line in fileObj:
          sLine = line.strip()
          #print lStrip
          if sLine.startswith("start"):
              aList = sLine
              #print aList
              try:
                  while True:
                      try:
                          fileObj.next().strip()
                          if sLine.startswith("end"):
                              aList.append(sLine)
                              break
              break
                          else:
                              aList.append(sLine)
                      except EOFError:
                          print "something went wrong"


      • Sean Pedersen
        New Member
        • Dec 2010
        • 30

        #4
        Code:
        def search(fname, start, end):
            
            file = (line.strip() for line in open(fname))
            
            line = file.next()
            while line != start: line = file.next()
            yield line
            
            line = file.next()
            while line != end: yield line; line = file.next()
            yield line
        
        for item in search("file.txt", "foo", "bar"):
            print item
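
For anyone reading this on Python 3, here is a hedged port of the generator above. file.next() becomes iteration over the generator, print becomes a function, and since Python 3.7 a StopIteration that escapes a generator body is turned into a RuntimeError, so this sketch uses plain for loops instead of bare next() calls. Taking an iterable of lines rather than a file name is an assumption made so the example runs without a file on disk.

```python
def search(lines, start, end):
    """Yield stripped lines from the first `start` line through the next `end` line."""
    stripped = (line.strip() for line in lines)

    # Skip ahead to the start word and yield it.
    for line in stripped:
        if line == start:
            yield line
            break

    # Continue on the same generator, yielding until the end word.
    for line in stripped:
        yield line
        if line == end:
            break


# Hypothetical sample data; an open file object would work the same way.
sample = ["a\n", "b\n", "foo\n", "c\n", "d\n", "bar\n", "e\n"]
for item in search(sample, "foo", "bar"):
    print(item)
```

If the start word never appears, this version simply yields nothing.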


        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          dann,

          You are not too far off. Encapsulate the loops in a function and use the return statement. Also, you must make an assignment to sLine in the inner loop.
          Code:
          def main():
              fileObj = open(inputfile, 'r')
              #lines = f.readlines()
              for line in fileObj:
                  sLine = line.strip()
                  if sLine.startswith("foo"):
                      aList = [sLine,]
                      try:
                          while True:
                              sLine = fileObj.next().strip()
                              if sLine.startswith("bar"):
                                  aList.append(sLine)
                                  return aList
                              else:
                                  aList.append(sLine)
                      except StopIteration:
                          # fileObj.next() raises StopIteration at end of file
                          print "something went wrong"
          
          print main()
          Here is another way of doing it. The following returns the words in between start and end and was intended to be used on a sentence. It could easily be converted for your application.
          Code:
          from string import punctuation as stuff
          
          def words_between(s, first, second):
              # return the words between first and second words
              words = [word.lower() for word in s.split()]
              # catch error if word not in sentence
              try:
                  # strip punctuation for matching
                  idx1 = [word.strip(stuff) for word in words].index(first.lower())
                  # start search for second word after idx1
                  idx2 = [word.strip(stuff) \
                          for word in words].index(second.lower(),idx1+1)
                  return words[idx1+1:idx2]
              except ValueError, e:
                  return "ValueError: %s" % e
          
          sentence = "The small country was ruled by a truculent dictator."
          print words_between(sentence, "was", "dictator")


          • dann
            New Member
            • Dec 2010
            • 4

            #6
            bvdet,

            Thanks again and to the others who replied.
            That's what I need it to do, but after it finds the start word it strips all the lines and I want to keep the file structure as it was. I tried to remove ".strip()" from "sLine = fileObj.next().strip()", but that didn't work. With my previous code I used readlines().

            Another question: once aList holds the lines between the start and end words, I need to loop through aList for new start and end words. Is it better to use the current function and add a new inner loop to it or define a new function and loop through the alist?


            • Sean Pedersen
              New Member
              • Dec 2010
              • 30

              #7
              The little generator I posted doesn't modify the file. You tried removing the .strip() call from your file iterator, but it didn't do what? It's my understanding that readlines() reads the entire file into memory, which is not good for big data.

              If you keep the file open, simply seek to offset 0. And if not you'll need to reopen it.
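
A minimal sketch of the seek approach, written for Python 3. The temporary file and the variable names are assumptions made only so the example is self-contained.

```python
import os
import tempfile

# Write a small throwaway file so the example runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nfoo\nc\nbar\ne\n")
    path = tmp.name

with open(path) as fileObj:
    first_pass = [line for line in fileObj]   # consumes the whole file
    exhausted = fileObj.readline()            # empty string: we are at end of file
    fileObj.seek(0)                           # rewind to offset 0
    second_pass = [line for line in fileObj]  # the same lines again

os.remove(path)
print(first_pass == second_pass)
```

Without the seek(0) call, the second pass would see no lines at all.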


              • dwblas
                Recognized Expert Contributor
                • May 2008
                • 626

                #8
                "Is it better to use the current function and add a new inner loop to it or define a new function and loop through the alist?"
                I definitely prefer passing the list to a separate function for readability, but it is personal preference.
                "but after it finds the start word it strips all the lines and I want to keep the file structure as it was."
                You can strip() and not store the results:
                Code:
                ## snipped from code posted above
                                while True:
                                    sLine = fileObj.next()         ## eliminate strip()
                                    ## added here but does not alter sLine
                                    if sLine.strip().startswith("bar"):
                                        aList.append(sLine)
                                        return aList
                                    else:
                                        aList.append(sLine)
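
Both points can be combined in one small helper. This is a sketch under assumptions, written for Python 3: the function name block_between and the sample lines are hypothetical. It works on any list of lines, so the result of one call can be passed back in with new start and end words, and it tests a stripped copy while appending the original line unchanged.

```python
def block_between(lines, startWord, endWord):
    """Return the original, unstripped lines from startWord through endWord."""
    block = []
    inside = False
    for line in lines:
        key = line.strip()  # stripped copy, used only for the comparison
        if not inside and key.startswith(startWord):
            inside = True
        if inside:
            block.append(line)  # keep the line exactly as it was read
            if key.startswith(endWord):
                break
    return block


# Hypothetical sample lines with leading whitespace to preserve.
lines = ["a\n", "  foo\n", "\tc\n", "  bar\n", "e\n"]
outer = block_between(lines, "foo", "bar")
# Search the stored list again with a new pair of words.
inner = block_between(outer, "c", "bar")
print("".join(outer))
```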


                • dann
                  New Member
                  • Dec 2010
                  • 4

                  #9
                  Thanks, everybody. With your help my code is doing what it should do.
                  I am catching all the code between a start and an end point and storing it in a list.
                  Code:
                  import re
                  
                  parsing = False
                  aList = []
                  mylist = []
                  bBlock = []
                  
                  fileObj = open('inputfile', 'r')
                  for line in fileObj:
                      if line.find(".ent") != -1:
                          print "Now parsing **************************************"
                          line
                          parsing = True
                      if line.find(".end") != -1:
                          print "Stopped parsing **********************************"
                          parsing = False
                  
                      if parsing:
                          instruct = re.split(r"\s|," , line)
                          aList.append(instruct)
                      else:
                          mylist.append(line)
                  The next thing I am trying to do is to iterate through the list (aList) I stored earlier and find a word that ends with a colon (:). I tried the following code but I didn't succeed. The error I get is this:
                  File "\Python26\lib\re.py", line 142, in search
                  return _compile(pattern, flags).search(string)
                  TypeError: expected string or buffer
                  Any help is appreciated.
                  Code:
                  for line in aList: 
                      if re.search(":", line):
                          bBlock.append(line)


                  • bvdet
                    Recognized Expert Specialist
                    • Oct 2006
                    • 2851

                    #10
                    It could be a problem with this if block:
                    Code:
                    if parsing:
                            instruct = re.split(r"\s|," , line)
                            aList.append(instruct)
                    re.split() will return a list to the identifier instruct. re.search() expects a string, not a list.
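
One possible fix, sketched for Python 3: since each entry of aList is itself a list of tokens, test the individual tokens rather than passing the whole list to re.search(). The sample input lines are hypothetical, and str.endswith(":") is used here in place of a regular expression because the goal stated above is a word that ends with a colon.

```python
import re

# Build aList the same way as in the post: one list of tokens per line.
aList = []
for line in ["loop: add r1, r2\n", "sub r3,r4\n", "done:\n"]:  # hypothetical input
    aList.append(re.split(r"\s|,", line))

bBlock = []
for instruct in aList:
    # instruct is a list of strings, so check each token separately.
    if any(token.endswith(":") for token in instruct):
        bBlock.append(instruct)

print(len(bBlock))
```

If a regular expression is still preferred, re.search(":$", token) on each token would be the equivalent per-token test.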
