fastest way for humongous regexp search?

Tim Arnold
#1

fastest way for humongous regexp search?

Jul 18 '05, 05:11 PM

Hi,
I've got a list of 1000 common misspellings, and I'd like to check a set of
text files for those misspellings.
I'm trying to figure out the fastest way to do it; here's what I'm doing now
(below).

I'm still learning Python, love it, and I'm pretty sure that what I'm doing
is naive.

Thanks for taking the time to look at this,
--Tim
--------------------------------------------------------------------------------------
(1) Create one humongous regexp, compile it and cPickle it. The regexp is
like this:

misspelled = (
    '\\bjudgement\\b|' +
    '\\bjudgemental\\b|' +

    <snip><snip><snip>

    '\\bYorksire\\b|' +
    '\\bYoyages\\b' )

p = re.compile(misspelled, re.I)
f = open('misspell.pat', 'w')
cPickle.dump(p, f)
f.close()
--------------------------------------------------------------------------------------
(2) Check the file(s), report the misspelling, the line number and the
actual line of text.
- only warns on multiple identical misspellings
- using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
didn't give correct results.
- running on HP Unix, Python 2.2

f = open('misspell.pat', 'r')
p = cPickle.load(f)

a = open('myfile.txt').readlines()
s = 'EtaOinShrdlu'.join(a)

mistake = {}
for mMatch in p.findall(s):
    if mistake.get(mMatch,0):
        print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
    else:
        mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))

for k, v in mistake.items():
    print 'Misspelling: "%s" on line number %d' % (k, mistake[k]+1)
    print '%s \n' % a[mistake[k]]
Istvan Albert
#2

Jul 18 '05, 05:11 PM

Re: fastest way for humongous regexp search?

Tim Arnold wrote:
> I've got a list of 1000 common misspellings, and I'd like to check a set of
> text files for those misspellings.

A much simpler way would be to store these misspellings in a dictionary
(or set), read each line and split it into words, then check whether each
of the words is in the set.

Istvan
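
A minimal sketch of the set-based approach described above, written for a current Python 3 rather than the Python 2.2 used in the original post. The function name find_misspellings, the WORD_RE pattern, and the sample lines are assumptions made for illustration; the four entries in the set are the ones shown in the question, lowercased to mimic the re.I flag.

```python
import re

# Hypothetical sample; the real set would hold all ~1000 misspellings,
# stored in lowercase so the lookup is case-insensitive.
MISSPELLINGS = {"judgement", "judgemental", "yorksire", "yoyages"}

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

def find_misspellings(lines, misspellings=MISSPELLINGS):
    """Return (line_number, word, line_text) for each misspelled word found."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        for word in WORD_RE.findall(line):
            # Set membership is a hash lookup, so the cost per word does not
            # grow with the number of misspellings.
            if word.lower() in misspellings:
                hits.append((lineno, word, line.rstrip("\n")))
    return hits

sample = [
    "A fair judgement was made.\n",
    "Nothing wrong here.\n",
    "Yoyages to Yorksire.\n",
]
for lineno, word, text in find_misspellings(sample):
    print('Misspelling: "%s" on line number %d' % (word, lineno))
    print(text)
```

Because the file is scanned line by line, the line number comes straight from the loop and no marker string such as 'EtaOinShrdlu' is needed. Note that this only catches single-word misspellings; an entry containing a space would still need a regexp or a phrase-level check.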