Hi,
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.
I'm trying to figure out the fastest way to do it; here's what I'm doing now
(below).
I'm still learning Python, love it, and I'm pretty sure that what I'm doing
is naive.
Thanks for taking the time to look at this,
--Tim
----------------------------------------------------------------------------
(1) Create one humongous regexp, compile it and cPickle it. The regexp is
like this:

import re, cPickle

misspelled = (
    '\\bjudgement\\b|' +
    '\\bjudgemental\\b|' +
    <snip><snip><snip>
    '\\bYorksire\\b|' +
    '\\bYoyages\\b' )
p = re.compile(misspelled, re.I)
f = open('misspell.pat', 'w')
cPickle.dump(p, f)
f.close()
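
The same kind of pattern could be generated from a word list instead of being typed out by hand. This is only a minimal sketch, assuming a current Python 3 rather than 2.2; the helper name build_pattern and the four-word sample list are made up for illustration. As far as I know, pickling a compiled pattern in CPython stores only the pattern string and flags and recompiles on load, so the pickle step may not save much time.

```python
import re

def build_pattern(words):
    """Return one compiled, case-insensitive regex matching any whole word in `words`."""
    # re.escape guards against regex metacharacters inside a word.
    alternation = "|".join(re.escape(w) for w in words)
    # One pair of \b anchors wraps the whole non-capturing group.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

# Hypothetical sample; the real list would hold all 1000 entries.
words = ["judgement", "judgemental", "Yorksire", "Yoyages"]
pattern = build_pattern(words)
found = pattern.findall("A judgemental Yorksire man; no judgements here.")
print(found)
```

Because the group is non-capturing, findall returns the full matched words, and "judgements" is skipped since no word boundary follows "judgement" there.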
----------------------------------------------------------------------------
(2) Check the file(s), report the misspelling, the line number and the
actual line of text.
- only warns on multiple identical misspellings
- using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
didn't give correct results.
- running on HP Unix, Python 2.2

import cPickle

f = open('misspell.pat', 'r')
p = cPickle.load(f)
f.close()
a = open('myfile.txt').readlines()
s = 'EtaOinShrdlu'.join(a)
mistake = {}
for mMatch in p.findall(s):
    # 'in' rather than get(): a mistake first seen on line 0 stores 0, which is false
    if mMatch in mistake:
        print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
    else:
        # line index = number of markers before the first occurrence
        mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))
for k, v in mistake.items():
    print 'Misspelling: "%s" on line number %d' % (k, v+1)
    print '%s \n' % a[v]
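
The check above could also be run one line at a time, so the line number comes straight from the loop and no marker string is needed. This is a rough sketch, assuming a current Python 3 (not the 2.2 mentioned above); find_misspellings, the two-word pattern and the sample lines are hypothetical stand-ins for the real pattern and file.

```python
import re
from collections import defaultdict

def find_misspellings(lines, pattern):
    """Map each misspelling (lower-cased) to the 1-based line numbers where it occurs."""
    hits = defaultdict(list)
    for lineno, line in enumerate(lines, 1):
        # finditer yields every whole-word match on this line
        for match in pattern.finditer(line):
            hits[match.group(0).lower()].append(lineno)
    return dict(hits)

# Hypothetical stand-ins for the pickled pattern and open('myfile.txt').readlines()
pattern = re.compile(r"\b(?:judgement|Yorksire)\b", re.IGNORECASE)
lines = ["A fine Judgement.\n", "Nothing here.\n", "Yorksire judgement\n"]

hits = find_misspellings(lines, pattern)
for word, linenos in sorted(hits.items()):
    print('Misspelling: "%s" on line number %d' % (word, linenos[0]))
    print(lines[linenos[0] - 1].rstrip())
    if len(linenos) > 1:
        print('Warning: multiple occurrences of mistake "%s"' % word)
```

Lower-casing the key makes "Judgement" and "judgement" count as the same mistake, which matters once re.I is in play.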