way to remove all non-ascii characters from a file?

**Larry Bates** · Jul 18 '05, 08:25 AM

Something simple like the following will work for files
that fit in memory:

def onlyascii(char):
    # Drop anything outside 7-bit ASCII (code points above 127).
    if ord(char) > 127: return ''
    else: return char

f = open('filename.ext', 'r')
data = f.read()
f.close()
# Python 2: filter() on a str returns a str.
filtered_data = filter(onlyascii, data)

For larger files you will need to loop and read
the data in chunks.
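
A minimal sketch of that chunked loop, written for Python 3 and assuming the file is in an ASCII-compatible encoding such as Latin-1 or UTF-8 (where every non-ASCII character is made up only of bytes above 127). The function name, the paths, and the chunk size are placeholders of mine, not anything from the original post:

```python
# Every byte value from 128 to 255, i.e. everything outside 7-bit ASCII.
NON_ASCII = bytes(range(128, 256))

def strip_non_ascii(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, dropping every byte above 127."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # translate(None, delete) removes the listed bytes and
            # leaves all other bytes unchanged.
            dst.write(chunk.translate(None, NON_ASCII))
```

Working on raw bytes means it does not matter if a chunk boundary falls in the middle of a multi-byte UTF-8 character, since each of its bytes is removed on its own. This would not be safe for encodings such as UTF-16.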

-Larry Bates
----------------------------
"omission9" <rus20376@salemstate.edu> wrote in message
news:defa238f.0402131112.436997c1@posting.google.com...
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?

**Ivan Voras** · Jul 18 '05, 08:25 AM

omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?

import string

open("file2", "w").write("".join(
    [ch for ch in open("file1", "r").read()
     if ch in string.ascii_letters]))

but this will also strip line breaks, digits, punctuation, and whatnot :)

(n.b. I didn't actually test the above code, and wrote it for its
amusement value :) )
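
If the letters-only filter is too aggressive, one possible variation is to widen the allow-list so that whitespace, digits, and punctuation survive. The sketch below (Python 3) uses `string.printable` for that; the helper name is made up for illustration, and note that this also drops ASCII control characters other than whitespace:

```python
import string

# string.printable holds the ASCII digits, letters, punctuation,
# and whitespace (including "\n" and "\t").
KEEP = frozenset(string.printable)

def keep_printable_ascii(text):
    """Return text with every character outside string.printable removed."""
    return "".join(ch for ch in text if ch in KEEP)
```

With this, `keep_printable_ascii("Häring, 42\n")` keeps the comma, the digits, and the trailing newline and only loses the "ä".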

**Peter Otten** · Jul 18 '05, 08:25 AM

omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?

Read it in chunks, then remove the non-ascii characters like so:

>>> t = "".join(map(chr, range(256)))
>>> d = "".join(map(chr, range(128, 256)))
>>> "Törichte Logik böser Kobold".translate(t, d)
'Trichte Logik bser Kobold'

and finally write the maimed chunks to a file. However, it's not clear to
me how removing characters could be a good idea in the first place.
Replacing them at least gives a minimal hint that something is missing:

>>> t = "".join(map(chr, range(128))) + "?" * 128
>>> "Törichte Logik böser Kobold".translate(t)
'T?richte Logik b?ser Kobold'
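
The two sessions above rely on Python 2 byte strings. On Python 3, where text is already decoded to `str`, roughly the same remove-versus-replace choice can be sketched with the standard codec error handlers. The two function names are mine, chosen for illustration:

```python
def drop_non_ascii(text):
    """Remove every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")

def mark_non_ascii(text):
    """Replace every character that cannot be encoded as ASCII with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")
```

`mark_non_ascii` keeps the string length unchanged for precomposed characters, which preserves the hint that something was there.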

Peter

**Gerhard Häring** · Jul 18 '05, 08:30 AM

omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?

Here's a simple example that does what you want:

>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'

-- Gerhard

**Peter Hansen** · Jul 18 '05, 08:30 AM

Gerhard Häring wrote:
>
> omission9 wrote:
> > I have a text file which contains the occasional non-ascii character.
> > What is the best way to remove all of these in python?
>
> Here's a simple example that does what you want:
>
> >>> orig = "Häring"
> >>> "".join([x for x in orig if ord(x) < 128])
> 'Hring'

Or, if performance is critical, something like this might be faster
(a regex might be even better, since it avoids the redundant identity
transformation step):

>>> from string import maketrans, translate
>>> table = maketrans('', '')
>>> translate(orig, table, table[128:])
'Hring'
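
For comparison, the regex idea mentioned above could look something like the following Python 3 sketch. The pattern and the function name are my own guesses at what was meant, and whether it actually beats `translate` is untested here; timing both on real data would be the way to find out:

```python
import re

# One or more characters outside the 7-bit ASCII range.
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]+")

def strip_with_regex(text):
    """Return text with all runs of non-ASCII characters removed."""
    return NON_ASCII_RE.sub("", text)
```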

-Peter
