For larger files you will need to loop and read
the data in chunks.
-Larry Bates
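
A minimal sketch of that chunked loop, written for Python 3 and working on raw bytes so that a multi-byte character split across two chunks causes no trouble. The function name, file paths and chunk size are illustrative assumptions, not anything from the original post:

```python
def strip_non_ascii_file(src_path, dst_path, chunk_size=64 * 1024):
    """Copy src_path to dst_path, dropping every byte >= 0x80."""
    # Bytes to delete: everything outside the 7-bit ASCII range.
    non_ascii = bytes(range(128, 256))
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            # table=None keeps all bytes as-is except those listed for deletion
            dst.write(chunk.translate(None, non_ascii))
```

Called as strip_non_ascii_file("file1", "file2"), this should keep memory use bounded by chunk_size regardless of how large the input is.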
----------------------------
"omission9" <rus20376@salem state.edu> wrote in message
news:defa238f.0402131112.436997c1@posting.google.com...
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?
Re: way to remove all non-ascii characters from a file?
omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?
import string
open("file2", "w").write("".join(
    [ch for ch in open("file1", "r").read()
     if ch in string.ascii_letters]))
but this will also strip line breaks and whatnot :)
(n.b. I didn't actually test the above code, and wrote it for its
amusement value :) )
Re: way to remove all non-ascii characters from a file?
omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?
Read it in chunks, then remove the non-ascii characters like so:
>>> t = "".join(map(chr, range(256)))
>>> d = "".join(map(chr, range(128, 256)))
>>> "Törichte Logik böser Kobold".translate(t, d)
'Trichte Logik bser Kobold'
and finally write the maimed chunks to a file. However, it's not clear to
me how removing characters could be a good idea in the first place.
Replacing them at least gives a minimal hint that something is missing:
>>> t = "".join(map(chr, range(128))) + "?" * 128
>>> "Törichte Logik böser Kobold".translate(t)
'T?richte Logik b?ser Kobold'
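
The two-argument translate() call above is Python 2 byte-string behaviour. As a rough Python 3 counterpart of the same remove-versus-replace choice, the standard "ignore" and "replace" error handlers of str.encode can be used. This is a sketch that assumes the text has already been decoded to a str; the variable names are illustrative:

```python
text = "T\u00f6richte Logik b\u00f6ser Kobold"

# errors="ignore" silently drops anything that cannot be encoded as ASCII.
removed = text.encode("ascii", errors="ignore").decode("ascii")

# errors="replace" substitutes "?" for each unencodable character.
replaced = text.encode("ascii", errors="replace").decode("ascii")

print(removed)   # Trichte Logik bser Kobold
print(replaced)  # T?richte Logik b?ser Kobold
```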
Re: way to remove all non-ascii characters from a file?
omission9 wrote:
> I have a text file which contains the occasional non-ascii character.
> What is the best way to remove all of these in python?
Here's a simple example that does what you want:
>>> orig = "Häring"
>>> "".join([x for x in orig if ord(x) < 128])
'Hring'
Re: way to remove all non-ascii characters from a file?
Gerhard Häring wrote:
>
> omission9 wrote:
> > I have a text file which contains the occasional non-ascii character.
> > What is the best way to remove all of these in python?
>
> Here's a simple example that does what you want:
>
> >>> orig = "Häring"
> >>> "".join([x for x in orig if ord(x) < 128])
> 'Hring'
Or, if performance is critical, something like this might be faster
(a regex might be even better, since it would avoid the redundant identity
transformation step):
>>> from string import maketrans, translate
>>> table = maketrans('', '')
>>> translate(orig, table, table[128:])
'Hring'
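
For what it's worth, the regex variant mentioned above might look roughly like this. It is a sketch only; the names NON_ASCII and strip_non_ascii are made up for illustration, and no timing has been done to confirm it is actually faster:

```python
import re

# Matches any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]+")

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub("", text)

print(strip_non_ascii("H\u00e4ring"))  # Hring
```

Passing "?" instead of "" to sub() would replace each non-ASCII run with a single marker rather than deleting it.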