Faster search algorithm for lists

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • rayisalive
    New Member
    • Jun 2018
    • 2

    Faster search algorithm for lists

    I have a file containing 10,000 words which I import as strings and store in a list. Now, I have another list of strings and have to delete the common words in both lists. When I try to run the program, it takes a lot of time. Please let me know if there's any other way to do it.

    For now, I am using:

    For item in list_1:
    If item in list_2:
    {Do the deletion}
  • husoski
    New Member
    • Feb 2017
    • 4

    #2
    It's much faster to search a set than a list. Since that list doesn't seem to change during the program run, simply use
    Code:
        common_words = set(list_2) # after building list_2 from the file
        ...
        for word in list_1:
             if word in common_words:
                 ...do the deletion
    If you are just making a reduced list, then you can do that without any explicit looping:

    Code:
    uncommon = [w for w in list_2 if w not in common_words]
    That's a "list comprehension". It builds a list of every element from the list_2 sequence that is not an element of the common_words set.

    Comment

    • rayisalive
      New Member
      • Jun 2018
      • 2

      #3
      Thanks a lot. I'll definitely try it out.

      Comment

      • kiskiller0
        New Member
        • Jun 2018
        • 2

        #4
        I would use this big list to build sub_lists , one sub_list for each letter , thus , reducing the search time by almost 26 times (in average) , I don't know if python's built-in search functions do that or just loop through every element.

        Comment

        • husoski
          New Member
          • Feb 2017
          • 4

          #5
          Searching a set is much faster than searching lists--even with the sublists you describe. It's a hash table search, and even if the key-hashing algorithm is a horrible match for the actual key distribution (nearly all "words" happen to hash to the same number, modulo the table size) then that worst-case performance is a sequential list search. Heads anytime you win; tails 30 times in a row you lose a little.

          Comment

          Working...