way to remove all non-ascii characters from a file?

  • omission9

    way to remove all non-ascii characters from a file?

    I have a text file which contains the occasional non-ascii character.
    What is the best way to remove all of these in python?
  • Larry Bates

    #2
    Re: way to remove all non-ascii characters from a file?

    Something simple like following will work for files
    that fit in memory:

    def onlyascii(char):
        # Keep only characters in the ASCII range (0-127).
        if ord(char) > 127: return ''
        else: return char

    f = open('filename.ext', 'r')
    data = f.read()
    f.close()
    filtered_data = ''.join(filter(onlyascii, data))

    For larger files you will need to loop and read
    the data in chunks.
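The chunked loop mentioned above might look something like this in Python 3. It is only a sketch: the function names and the 64 KiB default chunk size are assumptions, not part of the original post. It works on raw bytes, so a multi-byte UTF-8 character that straddles a chunk boundary is still removed cleanly, since every byte of such a character is 128 or higher.

```python
def strip_non_ascii_bytes(chunk):
    """Return a copy of a bytes object with every byte >= 128 removed."""
    return bytes(b for b in chunk if b < 128)


def strip_non_ascii_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, dropping non-ASCII bytes chunk by chunk."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            dst.write(strip_non_ascii_bytes(chunk))
```

A call such as `strip_non_ascii_file("in.txt", "out.txt")` (hypothetical file names) keeps memory use bounded by `chunk_size` regardless of file size.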

    -Larry Bates
    ----------------------------
    "omission9" <rus20376@salemstate.edu> wrote in message
    news:defa238f.0402131112.436997c1@posting.google.com...
    > I have a text file which contains the occasional non-ascii character.
    > What is the best way to remove all of these in python?



    • Ivan Voras

      #3
      Re: way to remove all non-ascii characters from a file?

      omission9 wrote:
      > I have a text file which contains the occasional non-ascii character.
      > What is the best way to remove all of these in python?

      import string

      open("file2", "w").write("".join(
          [ch for ch in open("file1", "r").read()
           if ch in string.ascii_letters]))

      but this will also strip line breaks and whatnot :)

      (n.b. I didn't actually test the above code, and wrote it for its
      amusement value :) )
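If the whitelist idea above is appealing but line breaks should survive, one possible variant is to widen the set of allowed characters. The sketch below is an assumption-laden illustration in Python 3: the names `ALLOWED` and `keep_allowed` are made up here, and the choice of what to allow (letters, digits, punctuation, whitespace) is just one reasonable option. Note that it also drops ASCII control characters other than whitespace.

```python
import string

# Characters to keep: ASCII letters, digits, punctuation, and whitespace
# (string.whitespace includes space, "\t", "\n", and "\r").
ALLOWED = set(
    string.ascii_letters + string.digits + string.punctuation + string.whitespace
)


def keep_allowed(text):
    """Return text with every character outside ALLOWED removed."""
    return "".join(ch for ch in text if ch in ALLOWED)
```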


      • Peter Otten

        #4
        Re: way to remove all non-ascii characters from a file?

        omission9 wrote:
        > I have a text file which contains the occasional non-ascii character.
        > What is the best way to remove all of these in python?

        Read it in chunks, then remove the non-ascii characters like so:

        >>> t = "".join(map(chr, range(256)))
        >>> d = "".join(map(chr, range(128, 256)))
        >>> "Törichte Logik böser Kobold".translate(t, d)
        'Trichte Logik bser Kobold'

        and finally write the maimed chunks to a file. However, it's not clear to
        me how removing characters could be a good idea in the first place.
        Replacing them at least gives some minimal hint that something is missing:

        >>> t = "".join(map(chr, range(128))) + "?" * 128
        >>> "Törichte Logik böser Kobold".translate(t)
        'T?richte Logik b?ser Kobold'
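The two-argument `translate` call and the 256-character tables above rely on Python 2 byte strings. For Python 3 text strings, a rough equivalent of the remove-versus-replace choice can be sketched with the `errors` argument of `str.encode`. The function names here are hypothetical, and the sample assumes the umlauts are single precomposed characters.

```python
def drop_non_ascii(text):
    """Silently discard every character that has no ASCII encoding."""
    return text.encode("ascii", "ignore").decode("ascii")


def mark_non_ascii(text):
    """Replace every non-ASCII character with '?' so the gap stays visible."""
    return text.encode("ascii", "replace").decode("ascii")


sample = "Törichte Logik böser Kobold"
print(drop_non_ascii(sample))  # Trichte Logik bser Kobold
print(mark_non_ascii(sample))  # T?richte Logik b?ser Kobold
```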

        Peter


        • Gerhard Häring

          #5
          Re: way to remove all non-ascii characters from a file?

          omission9 wrote:
          > I have a text file which contains the occasional non-ascii character.
          > What is the best way to remove all of these in python?

          Here's a simple example that does what you want:

          >>> orig = "Häring"
          >>> "".join([x for x in orig if ord(x) < 128])
          'Hring'

          -- Gerhard


          • Peter Hansen

            #6
            Re: way to remove all non-ascii characters from a file?

            Gerhard Häring wrote:
            >
            > omission9 wrote:
            > > I have a text file which contains the occasional non-ascii character.
            > > What is the best way to remove all of these in python?
            >
            > Here's a simple example that does what you want:
            >
            > >>> orig = "Häring"
            > >>> "".join([x for x in orig if ord(x) < 128])
            > 'Hring'


            Or, if performance is critical, it's possible something like this would
            be faster. (A regex might be even better, avoiding the redundant identity
            transformation step.):

            >>> from string import maketrans, translate
            >>> table = maketrans('', '')
            >>> translate(orig, table, table[128:])
            'Hring'
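`string.maketrans` and `string.translate` as used above are Python 2 functions. As a rough Python 3 sketch of both ideas mentioned in this post, the block below shows a regex for text strings and `bytes.translate` with a delete set for byte strings. The names `NON_ASCII_TEXT`, `HIGH_BYTES`, `strip_text`, and `strip_bytes` are assumptions for illustration, and no claim is made here about which variant is faster.

```python
import re

# One or more characters outside the ASCII range U+0000..U+007F.
NON_ASCII_TEXT = re.compile(r"[^\x00-\x7f]+")

# Every byte value from 128 to 255, used as the delete set below.
HIGH_BYTES = bytes(range(128, 256))


def strip_text(text):
    """Remove non-ASCII characters from a str using a regex."""
    return NON_ASCII_TEXT.sub("", text)


def strip_bytes(data):
    """Remove bytes >= 128 from a bytes object; None means no remapping."""
    return data.translate(None, HIGH_BYTES)
```

Passing `None` as the table skips the identity mapping entirely, so only the deletion step is performed.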


            -Peter
