Text file manipulation.

  • Joe1986
    New Member
    • Jan 2008
    • 5

    Hi there, I'm trying to create a Python program that reads a text file line by line, searches for specified words/text/strings and removes them, then saves the modified text to an output file. The only problem is that the text file contains codes for a BACKSPACE typed in the text, e.g. "<BACKSPACE>". Removing this sounds quite simple, but often there are numbers involved, e.g. "<BACKSPACE: 4>", which represents 4 backspaces. In that case the code needs to be removed and 4 backspaces applied to the text before it, e.g. "repatre<BACKSPACE: 4>resent" would become "represent". I know a search-and-delete script that can search for a word at the beginning of a line and then delete the whole line, but I'm not sure about this. Any ideas would be great.
    Cheers,
    Joe
  • kudos
    Recognized Expert New Member
    • Jul 2006
    • 127

    #2
    hi,
    Start by searching for <BACKSPACE (for instance by using the find method). Note its index, then check the character that follows BACKSPACE (you handle it differently depending on whether it is > or :). Slice the string and glue it back together without the backspace code, going n characters back according to the value. I would recommend string-replacing every <BACKSPACE> with <BACKSPACE: 1> first, so you only need to handle one case.

    Did this help, or was my post confusing?
    -kudos
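
The steps described in this reply can be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `apply_backspaces` is hypothetical, and it assumes every marker is well formed (`<BACKSPACE>` or `<BACKSPACE: n>` with a numeric n).

```python
def apply_backspaces(text):
    """Apply <BACKSPACE> and <BACKSPACE: n> markers found in text."""
    # Normalise the bare form so only the numbered case needs handling.
    text = text.replace("<BACKSPACE>", "<BACKSPACE: 1>")
    marker = "<BACKSPACE:"
    while True:
        start = text.find(marker)
        if start == -1:
            return text  # no markers left
        end = text.find(">", start)
        if end == -1:
            return text  # unterminated marker: leave the rest untouched
        # int() ignores the space after the colon; a non-numeric
        # count would raise ValueError (assumed not to occur here).
        count = int(text[start + len(marker):end])
        # Clamp at 0 so a large count cannot produce a negative index.
        keep_until = max(0, start - count)
        text = text[:keep_until] + text[end + 1:]


print(apply_backspaces("repatre<BACKSPACE: 4>resent"))  # represent
```

Because the search restarts after each removal, several markers in one string are all processed, not just the first.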


    Comment

    • Joe1986
      New Member
      • Jan 2008
      • 5

      #3
      Hi there,

      I'm a little confused. I'm not sure I understand the string replace method you mentioned at the end of your post.
      cheers,
      Joe
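
For reference, the "string replace" mentioned in the earlier reply is presumably Python's built-in `str.replace` method. A minimal sketch, using a made-up sample line, of how it turns every bare `<BACKSPACE>` into the numbered form:

```python
line = "re<BACKSPACE>present and repatre<BACKSPACE: 4>resent"

# str.replace returns a new string; the original is left unchanged.
normalised = line.replace("<BACKSPACE>", "<BACKSPACE: 1>")

print(normalised)
# re<BACKSPACE: 1>present and repatre<BACKSPACE: 4>resent
```

After this step every marker has the same shape, so the rest of the program only has to deal with `<BACKSPACE: n>`.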

      Comment

      • Joe1986
        New Member
        • Jan 2008
        • 5

        #4
        Hi,
        Here is a solution. I can write the results to a text file, which is great. All I need to do now is read the input from a text file, which I'm guessing will just replace the 'tests' list in the code, and iterate this code so that it runs through a whole paragraph containing multiple <BACKSPACE> sections and removes all of them (rather than getting to the first one and terminating). Would you have any ideas on how to do that? Any help would be great. Cheers. Joe

        Code:
        #! /usr/bin/python

        import re

        # Global variable
        bs = re.compile('<BACKSPACE(:[ ]*[0-9]+)?>')

        #theInFile = open("test2.txt", "r")
        theOutFile = open("backspace_out.txt", "w")

        tests = ['re <BACKSPACE>present', 'This is bound to repatreaaaabbbbcccc <BACKSPACE: 17>resent']

        def bs_remove(s):
            global bs
            for m in bs.finditer(s):
                if m.groups()[0] is None:
                    return s[:m.start() - 1] + s[m.end():]
                else:
                    return s[:m.start() - int(m.groups()[0][1:])] + s[m.end():]

        for s in tests:
            theOutFile.write(bs_remove(s) + " ")

        Comment

        • bvdet
          Recognized Expert Specialist
          • Oct 2006
          • 2851

          #5
          To account for single or multiple occurrences of backspaces in each line:

          Code:
          # <BACKSPACE>     1 backspace
          # <BACKSPACE: 4>  4 backspaces

          import re

          def remove_bs(s):
              # Group 1 captures the optional backspace count.
              patt = re.compile(r'<BACKSPACE:? ?(\d+)?>')
              while True:
                  m = patt.search(s)
                  if m:
                      if m.group(1):
                          start = m.start() - int(m.group(1))
                      else:
                          start = m.start() - 1
                      # Clamp at 0 so a count longer than the preceding
                      # text does not become a negative slice index.
                      start = max(0, start)
                      s = ''.join([s[:start], s[m.end():]])
                  else:
                      break
              return s

          def parse_file(fn):
              with open(fn) as f:
                  return ''.join([remove_bs(line) for line in f.readlines()])

          fn = 'input.txt'
          fnOut = 'output.txt'
          f = open(fnOut, 'w')
          f.write(parse_file(fn))
          f.close()
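
As a possible variation on the loop above, the markers can also be handled in a single left-to-right pass, so the string is not searched again from the start after every removal. This is a sketch only: the names `MARKER` and `remove_bs_single_pass` are hypothetical, and the pattern assumes markers look like `<BACKSPACE>` or `<BACKSPACE: n>`.

```python
import re

# Group 1 captures the optional count after the colon.
MARKER = re.compile(r"<BACKSPACE(?::\s*(\d+))?>")


def remove_bs_single_pass(s):
    out = []  # characters kept so far
    pos = 0   # end of the previous marker
    for m in MARKER.finditer(s):
        # Keep the text between the previous marker and this one.
        out.extend(s[pos:m.start()])
        count = int(m.group(1)) if m.group(1) else 1
        if count:
            # Drop the last `count` kept characters; the slice is
            # clamped automatically if fewer are available.
            del out[-count:]
        pos = m.end()
    out.extend(s[pos:])
    return "".join(out)


print(remove_bs_single_pass("re <BACKSPACE>present"))  # represent
```

The output buffer plays the role of the text typed so far, so each backspace only ever removes characters that survived the earlier markers.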

          Comment

          • thisissuma
            New Member
            • Feb 2008
            • 1

            #6
            Hi there, I'm trying to create a Python program that reads a text file line by line, searches for specified words/text/strings and places them in another text file, where the line length should be 65 characters and the number of lines per page is 70.
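
The layout part of this request (65 characters per line, 70 lines per page) could be approached with the standard `textwrap` module. The sketch below covers only that part, not the word search, and the function name `paginate` and the sample text are made up for illustration.

```python
import textwrap


def paginate(text, width=65, lines_per_page=70):
    """Wrap text to `width` columns and group the lines into pages."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty line, so keep blank
        # lines explicitly.
        lines.extend(textwrap.wrap(paragraph, width=width) or [""])
    # Slice the wrapped lines into chunks of lines_per_page.
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


pages = paginate("word " * 2000)
print(len(pages))  # 3
```

Each page is a list of lines, so writing the result to a file is a matter of joining the lines with newlines and choosing a page separator.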

            Comment
