Merging ordered lists

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • etal

    Merging ordered lists

    Here's an algorithm question: How should I efficiently merge a
    collection of mostly similar lists, with different lengths and
    arbitrary contents, while eliminating duplicates and preserving order
    as much as possible?

    My code:

    def merge_to_unique (sources):
    """Merge the unique elements from each list in sources into new
    list.

    Using the longest input list as a reference, merges in the
    elements from
    each of the smaller or equal-length lists, and removes duplicates.

    @return: Combined list of elements.
    """
    sources.sort(No ne, len, True) # Descending length
    ref = sources[0]
    for src in sources[1:]:
    for i, s in enumerate(src):
    if s and (ref[i] != s) and s not in ref:
    ref.insert(ref. index(src[i-1])+1, s)
    # Remove duplicates
    return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


    This comes up with using the CSV module's DictWriter class to merge a
    set (list, here) of not-quite-perfect CSV sources. The DictWriter
    constructor needs a list of field names so that it can convert
    dictionaries into rows of the CSV file it writes. Some of the input
    CSV files are missing columns, some might have extras -- all of this
    should be accepted, and the order of the columns in the merged file
    should match the order of the input files as much as possible (not
    alphabetical). All of the list elements are strings, in this case, but
    it would be nice if the function didn't require it.

    Speed actually isn't a problem yet; it might matter some day, but for
    now it's just an issue of conceptual aesthetics. Any suggestions?
  • Raymond Hettinger

    #2
    Re: Merging ordered lists

    On May 31, 10:00 pm, etal <eric.talev...@ gmail.comwrote:
    Here's an algorithm question: How should I efficiently merge a
    collection of mostly similar lists, with different lengths and
    arbitrary contents, while eliminating duplicates and preserving order
    as much as possible?
    I would do it two steps. There's a number of ways to merge depending
    on whether everything is pulled into memory or not:



    After merging, the groupby itertool is good for removing duplicates:

    result = [k for k, g in groupby(imerge( *sources))]


    Raymond

    Comment

    • Peter Otten

      #3
      Re: Merging ordered lists

      etal wrote:
      Here's an algorithm question: How should I efficiently merge a
      collection of mostly similar lists, with different lengths and
      arbitrary contents, while eliminating duplicates and preserving order
      as much as possible?
      >
      My code:
      >
      def merge_to_unique (sources):
      """Merge the unique elements from each list in sources into new
      list.
      >
      Using the longest input list as a reference, merges in the
      elements from
      each of the smaller or equal-length lists, and removes duplicates.
      >
      @return: Combined list of elements.
      """
      sources.sort(No ne, len, True) # Descending length
      ref = sources[0]
      for src in sources[1:]:
      for i, s in enumerate(src):
      if s and (ref[i] != s) and s not in ref:
      ref.insert(ref. index(src[i-1])+1, s)
      # Remove duplicates
      return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]
      >
      >
      This comes up with using the CSV module's DictWriter class to merge a
      set (list, here) of not-quite-perfect CSV sources. The DictWriter
      constructor needs a list of field names so that it can convert
      dictionaries into rows of the CSV file it writes. Some of the input
      CSV files are missing columns, some might have extras -- all of this
      should be accepted, and the order of the columns in the merged file
      should match the order of the input files as much as possible (not
      alphabetical). All of the list elements are strings, in this case, but
      it would be nice if the function didn't require it.
      >
      Speed actually isn't a problem yet; it might matter some day, but for
      now it's just an issue of conceptual aesthetics. Any suggestions?
      #untested
      import difflib

      def _merge(a, b):
      sm = difflib.Sequenc eMatcher(None, a, b)
      for op, a1, a2, b1, b2 in sm.get_opcodes( ):
      if op == "insert":
      yield b[b1:b2]
      else:
      yield a[a1:a2]

      def merge(a, b):
      return sum(_merge(a, b), [])

      def merge_to_unique (sources):
      return reduce(merge, sorted(sources, key=len, reverse=True))

      Peter

      Comment

      • Méta-MCI \(MVP\)

        #4
        Re: Merging ordered lists

        Hi!

        Use set (union).
        Example:

        la=[2,1,3,5,4,6]
        lb=[2,8,6,4,12]

        #compact:
        print list(set(la).un ion(set(lb)))

        #detail:
        s1 = set(la)
        s2 = set(lb)
        s3 = s1.union(s2)
        print list(s3)


        @-salutations

        Michel Claveau

        Comment

        • Peter Otten

          #5
          Re: Merging ordered lists

          Peter Otten wrote:
          #untested
          Already found two major blunders :(

          # still untested
          import difflib

          def _merge(a, b):
          sm = difflib.Sequenc eMatcher(None, a, b)
          for op, a1, a2, b1, b2 in sm.get_opcodes( ):
          if op == "insert":
          yield b[b1:b2]
          elif op == "replace":
          yield a[a1:a2]
          yield b[b1:b2]
          else: # delete, equal
          yield a[a1:a2]

          def merge(a, b):
          return sum(_merge(a, b), [])

          def merge_to_unique (sources):
          return unique(reduce(m erge, sorted(sources, key=len, reverse=True)))

          def unique(items):
          u = set(items)
          if len(u) == len(items):
          return items
          result = []
          for item in items:
          if item in u:
          result.append(i tem)
          u.remove(item)
          return result


          Comment

          • MClaveau

            #6
            Re: Merging ordered lists

            Hi!

            Use set (union).
            Example:

            la=[2,1,3,5,4,6]
            lb=[2,8,6,4,12]

            #compact:
            print list(set(la).un ion(set(lb)))

            #detail:
            s1 = set(la)
            s2 = set(lb)
            s3 = s1.union(s2)
            print list(s3)


            @-salutations

            Michel Claveau


            Comment

            • Taekyon

              #7
              Re: Merging ordered lists

              etal wrote:
              Speed actually isn't a problem yet; it might matter some day, but for
              now it's just an issue of conceptual aesthetics. Any suggestions?
              Looks as if set does it for you.

              --
              Taekyon

              Comment

              • Méta-MCI \(MVP\)

                #8
                Re: Merging ordered lists

                Hi!

                Use set (union).
                Example:

                la=[2,1,3,5,4,6]
                lb=[2,8,6,4,12]

                #compact:
                print list(set(la).un ion(set(lb)))

                #detail:
                s1 = set(la)
                s2 = set(lb)
                s3 = s1.union(s2)
                print list(s3)


                @-salutations

                Michel Claveau

                Comment

                • Peter Otten

                  #9
                  Re: Merging ordered lists

                  Peter Otten wrote:
                  #untested
                  Already found two major blunders :(

                  # still untested
                  import difflib

                  def _merge(a, b):
                  sm = difflib.Sequenc eMatcher(None, a, b)
                  for op, a1, a2, b1, b2 in sm.get_opcodes( ):
                  if op == "insert":
                  yield b[b1:b2]
                  elif op == "replace":
                  yield a[a1:a2]
                  yield b[b1:b2]
                  else: # delete, equal
                  yield a[a1:a2]

                  def merge(a, b):
                  return sum(_merge(a, b), [])

                  def merge_to_unique (sources):
                  return unique(reduce(m erge, sorted(sources, key=len, reverse=True)))

                  def unique(items):
                  u = set(items)
                  if len(u) == len(items):
                  return items
                  result = []
                  for item in items:
                  if item in u:
                  result.append(i tem)
                  u.remove(item)
                  return result


                  Comment

                  • MClaveau

                    #10
                    Re: Merging ordered lists

                    Hi!

                    Use set (union).
                    Example:

                    la=[2,1,3,5,4,6]
                    lb=[2,8,6,4,12]

                    #compact:
                    print list(set(la).un ion(set(lb)))

                    #detail:
                    s1 = set(la)
                    s2 = set(lb)
                    s3 = s1.union(s2)
                    print list(s3)


                    @-salutations

                    Michel Claveau


                    Comment

                    • Taekyon

                      #11
                      Re: Merging ordered lists

                      etal wrote:
                      Speed actually isn't a problem yet; it might matter some day, but for
                      now it's just an issue of conceptual aesthetics. Any suggestions?
                      Looks as if set does it for you.

                      --
                      Taekyon

                      Comment

                      • etal

                        #12
                        Re: Merging ordered lists

                        On Jun 1, 1:49 am, Peter Otten <__pete...@web. dewrote:
                        Peter Otten wrote:
                        #untested
                        >
                        Already found two major blunders :(
                        >
                        # still untested
                        import difflib
                        >
                        def _merge(a, b):
                            sm = difflib.Sequenc eMatcher(None, a, b)
                            for op, a1, a2, b1, b2 in sm.get_opcodes( ):
                                if op == "insert":
                                    yield b[b1:b2]
                                elif op == "replace":
                                    yield a[a1:a2]
                                    yield b[b1:b2]
                                else: # delete, equal
                                    yield a[a1:a2]
                        >
                        def merge(a, b):
                            return sum(_merge(a, b), [])
                        >
                        def merge_to_unique (sources):
                            return unique(reduce(m erge, sorted(sources, key=len, reverse=True)))
                        >
                        difflib.Sequenc eMatcher looks promising; I'll try it. Thanks!

                        def unique(items):
                            u = set(items)
                            if len(u) == len(items):
                                return items
                            result = []
                            for item in items:
                                if item in u:
                                    result.append(i tem)
                                    u.remove(item)
                            return result
                        You did right by preserving the original (non-alphabetical) ordering,
                        but I'm less enthusiastic about the shape of this function. My
                        original function used 7 lines of code, and only 1 for the unique()
                        step. This uses up to three container objects. Is it really an
                        improvement?

                        (Secret: the reference list (or, any of the sources) is unlikely to be
                        more than a few dozen elements long. The data set that puts
                        merge_to_unique through a workout will be a giant list of
                        comparatively short lists, so the unique() part just needs to be short
                        and conceptually clean, while merge() should attempt sane behavior for
                        large len(sources).)

                        Comment

                        • etal

                          #13
                          Re: Merging ordered lists

                          On Jun 1, 12:34 am, Raymond Hettinger <pyt...@rcn.com wrote:
                          >
                          I would do it two steps.  There's a number of ways to merge depending
                          on whether everything is pulled into memory or not:http://aspn.activestate..com/ASPN/Co.../Recipe/305269
                          >
                          After merging, the groupby itertool is good for removing duplicates:
                          >
                             result = [k for k, g in groupby(imerge( *sources))]
                          >
                          Raymond
                          Thanks for the tip; itertools never ceases to amaze. One issue:
                          groupby doesn't seem to remove all duplicates, just consecutive ones
                          (for lists of strings and integers, at least):
                          >>[k for k, g in itertools.group by(list("asdfdf ffdf"))]
                          ['a', 's', 'd', 'f', 'd', 'f', 'd', 'f']


                          Another issue: dropping everything into a heap and pulling it back out
                          looks like it loses the original ordering, which isn't necessarily
                          alphabetical, but "however the user wants to organize the
                          spreadsheet". That's why I originally avoided using
                          sorted(set(iter tools.chain(*so urces))). Do you see another way around
                          this?

                          Comment

                          • Peter Otten

                            #14
                            Re: Merging ordered lists

                            etal wrote:
                            def unique(items):
                                u = set(items)
                                if len(u) == len(items):
                                    return items
                                result = []
                                for item in items:
                                    if item in u:
                                        result.append(i tem)
                                        u.remove(item)
                                return result
                            >
                            You did right by preserving the original (non-alphabetical) ordering,
                            but I'm less enthusiastic about the shape of this function. My
                            original function used 7 lines of code, and only 1 for the unique()
                            step. This uses up to three container objects. Is it really an
                            improvement?
                            Yes :)

                            Seriously, you are using O(n) containers and O(n) lookup where mine uses
                            O(1). For short lists it doesn't matter, but as the list length grows the
                            difference gets huge:

                            $ cat unique.py
                            def unique(items):
                            u = set(items)
                            if len(u) == len(items):
                            return items
                            result = []
                            for item in items:
                            if item in u:
                            result.append(i tem)
                            u.remove(item)
                            return result


                            def unique_lc(ref):
                            return [r for i, r in enumerate(ref) if r and r not in ref[i+1:]]


                            sample_a = range(1000)
                            sample_b = range(1000)
                            import random
                            for i in random.sample(s ample_b, 10):
                            sample_b[i] = 0

                            $ python -m timeit -s"from unique import unique as u, sample_a as s" "u(s)"
                            10000 loops, best of 3: 52.8 usec per loop
                            $ python -m timeit -s"from unique import unique_lc as u, sample_a as
                            s" "u(s)"
                            10 loops, best of 3: 44.2 msec per loop

                            3 orders of magnitude for the shortcut.

                            $ python -m timeit -s"from unique import unique as u, sample_b as s" "u(s)"
                            1000 loops, best of 3: 646 usec per loop
                            $ python -m timeit -s"from unique import unique_lc as u, sample_b as
                            s" "u(s)"
                            10 loops, best of 3: 39 msec per loop

                            Still 2 orders of magnitude if my unique() does some actual work.

                            Peter

                            Comment

                            • etal

                              #15
                              Re: Merging ordered lists

                              On Jun 3, 1:22 am, Peter Otten <__pete...@web. dewrote:
                              >
                              Yes :)
                              >
                              Seriously, you are using O(n) containers and O(n) lookup where mine uses
                              O(1). For short lists it doesn't matter, but as the list length grows the
                              difference gets huge:
                              >
                              $ cat unique.py
                              def unique(items):
                                  u = set(items)
                                  if len(u) == len(items):
                                      return items
                                  result = []
                                  for item in items:
                                      if item in u:
                                          result.append(i tem)
                                          u.remove(item)
                                  return result
                              >
                              Yikes... and the list comprehension looked so innocent. I had resigned
                              myself to O(n) lookup, but you and Raymond appear to agree on the
                              basic concept for unique() -- check set membership, then generate the
                              final sequence and update the set based on that.

                              Comment

                              Working...