fastest way for humongous regexp search?

  • Tim Arnold

    fastest way for humongous regexp search?

    Hi,
    I've got a list of 1000 common misspellings, and I'd like to check a set of
    text files for those misspellings.
    I'm trying to figure out the fastest way to do it; here's what I'm doing now
    (below).

    I'm still learning Python, love it, and I'm pretty sure that what I'm doing
    is naive.

    Thanks for taking the time to look at this,
    --Tim
    ----------------------------------------------------------------------------
    (1) Create one humongous regexp, compile it and cPickle it. The regexp is
    like this:

    misspelled = (
    '\\bjudgement\\b|' +
    '\\bjudgemental\\b|' +

    <snip><snip><snip>

    '\\bYorksire\\b|' +
    '\\bYoyages\\b' )

    p = re.compile(misspelled, re.I)
    f = open('misspell.pat', 'w')
    cPickle.dump(p, f)
    f.close()
    ----------------------------------------------------------------------------
    (2) Check the file(s), report the misspelling, the line number and the
    actual line of text.
    - only warns on multiple identical misspellings
    - using 'EtaOinShrdlu' as a nonsense line-marker; tried \n but that
    didn't give correct results.
    - running on HP Unix, Python 2.2

    f = open('misspell.pat', 'r')
    p = cPickle.load(f)

    a = open('myfile.txt').readlines()
    s = 'EtaOinShrdlu'.join(a)

    mistake = {}
    for mMatch in p.findall(s):
        if mistake.get(mMatch, 0):
            print 'Warning: multiple occurrences of mistake "%s" ' % mMatch
        else:
            mistake[mMatch] = s.count('EtaOinShrdlu', 0, s.index(mMatch))

    for k, v in mistake.items():
        print 'Misspelling: "%s" on line number %d' % (k, mistake[k]+1)
        print '%s \n' % a[mistake[k]]
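
    The line marker can be avoided by scanning one line at a time, so the
    line number comes straight from the loop. A minimal sketch, assuming
    Python 3 rather than the 2.2 used above; the short word list and the
    helper names build_pattern and find_misspellings are hypothetical,
    made up for illustration:

```python
import re

def build_pattern(words):
    """Compile one case-insensitive alternation matching any whole word in `words`."""
    # re.escape guards against regex metacharacters in the word list.
    alternatives = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def find_misspellings(lines, pattern):
    """Return (line_number, matched_text, line) for every match; line numbers are 1-based."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for m in pattern.finditer(line):
            hits.append((lineno, m.group(0), line.rstrip("\n")))
    return hits

# Hypothetical sample data, not the real 1000-word list.
words = ["judgement", "judgemental", "Yorksire"]
text = ["A fair judgement.\n", "Nothing wrong here.\n", "Born in yorksire, judgemental.\n"]

for lineno, word, line in find_misspellings(text, build_pattern(words)):
    print('Misspelling: "%s" on line number %d' % (word, lineno))
    print(line)
```

    Building the pattern from a list keeps the word boundaries in one
    place, and every occurrence is reported with its own line number
    rather than only the first.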



  • Istvan Albert

    #2
    Re: fastest way for humongous regexp search?

    Tim Arnold wrote:
    > I've got a list of 1000 common misspellings, and I'd like to check a set of
    > text files for those misspellings.

    A much simpler way would be to just store these misspellings as a dictionary
    (or set), read and split each line into words, then check whether each
    of the words is in the set.

    Istvan
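
    The set-based idea above could look roughly like this. It is only a
    sketch under some assumptions: Python 3 syntax, a tiny made-up word
    list stored in lowercase, and a simple regular expression (WORD_RE,
    a hypothetical name) used to pull out words so that trailing
    punctuation does not get in the way of a plain split:

```python
import re

# Hypothetical sample data; a real list would be loaded from a file.
MISSPELLINGS = {"judgement", "judgemental", "yorksire"}  # stored lowercase

# Assumption: a "word" is a run of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")

def check_lines(lines, misspellings=MISSPELLINGS):
    """Yield (line_number, word) for each word that is in the set; 1-based lines."""
    for lineno, line in enumerate(lines, start=1):
        for word in WORD_RE.findall(line):
            # Lowercase before the lookup so the check is case-insensitive.
            if word.lower() in misspellings:
                yield lineno, word

sample = ["A fair judgement.", "Nothing wrong here.", "Born in Yorksire."]
for lineno, word in check_lines(sample):
    print('Misspelling: "%s" on line number %d' % (word, lineno))
```

    Each membership test is a hash lookup, so the work grows with the
    amount of text rather than with the number of misspellings in the
    list.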
