? about priority_queue how make objects unique

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Rich S.

    ? about priority_queue how make objects unique

    Hi

    In an earlier posting I was asking how to read thru millions of data records
    keeping the top 2000 (where the top values are based upon a field in the
    record) and niklasb suggested using a priority_queue like so


    ------------------- start old message ---------------------
    A more efficient way than what you describe would be
    to use priority_queue. Define the predicate in such
    a way that the top() always returns the lowest scoring
    object.

    Let q be the priority_queue and suppose you want to
    find the top N scores (where N is 2000 in your example),
    the pseudo-code would be something like this:

    q.reserve(N);

    // add the first N elements to the priority queue
    for (i = 0; i < N; ++i)
    {
    if (!ReadRecord(&o bj))
    break;

    q.push(obj);
    }

    // invariant: q contains highest N elements
    // q.top() is the lowest of the highest N elements
    while (ReadRecord(&ob j))
    {
    if (obj > q.top())
    {
    q.pop();
    q.push(obj);
    }
    }

    ------------------- end of old message ---------------------

    This is working well for me but I'm getting cases where I have duplicate
    records in the top 2000 and I'm not quite sure how to fix that problem.

    I was thinking of running a loop around the pop until the obj is NOT greater
    than the top and then pushing obj but I'm convinced thats the best way.

    Thanks for any help.

    Regards.


  • mlimber

    #2
    Re: ? about priority_queue how make objects unique

    Rich S. wrote:[color=blue]
    > Hi
    >
    > In an earlier posting I was asking how to read thru millions of data records
    > keeping the top 2000 (where the top values are based upon a field in the
    > record) and niklasb suggested using a priority_queue like so
    >
    >
    > ------------------- start old message ---------------------
    > A more efficient way than what you describe would be
    > to use priority_queue. Define the predicate in such
    > a way that the top() always returns the lowest scoring
    > object.
    >
    > Let q be the priority_queue and suppose you want to
    > find the top N scores (where N is 2000 in your example),
    > the pseudo-code would be something like this:
    >
    > q.reserve(N);
    >
    > // add the first N elements to the priority queue
    > for (i = 0; i < N; ++i)
    > {
    > if (!ReadRecord(&o bj))
    > break;
    >
    > q.push(obj);
    > }
    >
    > // invariant: q contains highest N elements
    > // q.top() is the lowest of the highest N elements
    > while (ReadRecord(&ob j))
    > {
    > if (obj > q.top())
    > {
    > q.pop();
    > q.push(obj);
    > }
    > }
    >
    > ------------------- end of old message ---------------------
    >
    > This is working well for me but I'm getting cases where I have duplicate
    > records in the top 2000 and I'm not quite sure how to fix that problem.
    >
    > I was thinking of running a loop around the pop until the obj is NOT greater
    > than the top and then pushing obj but I'm convinced thats the best way.
    >
    > Thanks for any help.
    >
    > Regards.[/color]

    Are you getting duplicates because there are duplicates in your file(s)
    or because there is an error in the program? (If the former, you might
    consider using a std::unique, perhaps in conjunction with a
    std::vector, to find the first N unique records. See the docs here:
    http://www.sgi.com/tech/stl/unique.html .)

    Cheers! --M

    Comment

    • Kai-Uwe Bux

      #3
      Re: ? about priority_queue how make objects unique

      Rich S. wrote:
      [color=blue]
      > Hi
      >
      > In an earlier posting I was asking how to read thru millions of data
      > records keeping the top 2000 (where the top values are based upon a field
      > in the record) and niklasb suggested using a priority_queue like so
      >
      >
      > ------------------- start old message ---------------------
      > A more efficient way than what you describe would be
      > to use priority_queue. Define the predicate in such
      > a way that the top() always returns the lowest scoring
      > object.
      >
      > Let q be the priority_queue and suppose you want to
      > find the top N scores (where N is 2000 in your example),
      > the pseudo-code would be something like this:
      >
      > q.reserve(N);
      >
      > // add the first N elements to the priority queue
      > for (i = 0; i < N; ++i)
      > {
      > if (!ReadRecord(&o bj))
      > break;
      >
      > q.push(obj);
      > }
      >
      > // invariant: q contains highest N elements
      > // q.top() is the lowest of the highest N elements
      > while (ReadRecord(&ob j))
      > {
      > if (obj > q.top())
      > {
      > q.pop();
      > q.push(obj);
      > }
      > }
      >
      > ------------------- end of old message ---------------------
      >
      > This is working well for me but I'm getting cases where I have duplicate
      > records in the top 2000 and I'm not quite sure how to fix that problem.
      >
      > I was thinking of running a loop around the pop until the obj is NOT
      > greater than the top and then pushing obj but I'm convinced thats the best
      > way.[/color]

      If you care about uniqueness, then maybe std::set is the way to go. Consider
      someting like this:
      #include <set>

      template < typename T, unsigned long N >
      struct top_N {

      std::set<T> elements;

      void insert ( T const & t ) {
      elements.insert ( t );
      if ( elements.size() > N ) {
      elements.erase( elements.begin( ) );
      }
      }

      }; // struct top_N


      Best

      Kai-Uwe Bux

      Comment

      • Kristo

        #4
        Re: ? about priority_queue how make objects unique

        Rich S. wrote:[color=blue]
        > Hi
        >
        > In an earlier posting I was asking how to read thru millions of data records
        > keeping the top 2000 (where the top values are based upon a field in the
        > record) and niklasb suggested using a priority_queue like so[/color]

        [snip old message]
        [color=blue]
        > This is working well for me but I'm getting cases where I have duplicate
        > records in the top 2000 and I'm not quite sure how to fix that problem.[/color]

        What if you used a set as the underlying container for your
        priority_queue? That would automagically prevent duplicates from being
        inserted.
        [color=blue]
        > I was thinking of running a loop around the pop until the obj is NOT greater
        > than the top and then pushing obj but I'm convinced thats the best way.[/color]

        That's too much work IMHO, and probably inefficient. Just let the set
        do that for you.

        Kristo

        Comment

        • Mark P

          #5
          Re: ? about priority_queue how make objects unique

          Rich S. wrote:[color=blue]
          > Hi
          >
          > In an earlier posting I was asking how to read thru millions of data records
          > keeping the top 2000 (where the top values are based upon a field in the
          > record) and niklasb suggested using a priority_queue like so
          >
          >
          > ------------------- start old message ---------------------
          > A more efficient way than what you describe would be
          > to use priority_queue. Define the predicate in such
          > a way that the top() always returns the lowest scoring
          > object.
          >
          > Let q be the priority_queue and suppose you want to
          > find the top N scores (where N is 2000 in your example),
          > the pseudo-code would be something like this:
          >
          > q.reserve(N);
          >
          > // add the first N elements to the priority queue
          > for (i = 0; i < N; ++i)
          > {
          > if (!ReadRecord(&o bj))
          > break;
          >
          > q.push(obj);
          > }
          >
          > // invariant: q contains highest N elements
          > // q.top() is the lowest of the highest N elements
          > while (ReadRecord(&ob j))
          > {
          > if (obj > q.top())
          > {
          > q.pop();
          > q.push(obj);
          > }
          > }
          >
          > ------------------- end of old message ---------------------
          >
          > This is working well for me but I'm getting cases where I have duplicate
          > records in the top 2000 and I'm not quite sure how to fix that problem.
          >
          > I was thinking of running a loop around the pop until the obj is NOT greater
          > than the top and then pushing obj but I'm convinced thats the best way.
          >
          > Thanks for any help.
          >
          > Regards.
          >
          >[/color]

          I agree with previous replies that a std::set is a good approach (just
          make sure to check the return value of insert to see if an item was
          actually inserted, if you want to ensure always 2000 records). Another
          option is a hashtable to record which elts. are currently in the queue.

          Comment

          • niklasb@microsoft.com

            #6
            Re: ? about priority_queue how make objects unique

            Kristo wrote:[color=blue]
            > Rich S. wrote:[color=green]
            > > Hi
            > >
            > > In an earlier posting I was asking how to read thru millions of data records
            > > keeping the top 2000 (where the top values are based upon a field in the
            > > record) and niklasb suggested using a priority_queue like so[/color]
            >
            > [snip old message]
            >[color=green]
            > > This is working well for me but I'm getting cases where I have duplicate
            > > records in the top 2000 and I'm not quite sure how to fix that problem.[/color]
            >
            > What if you used a set as the underlying container for your
            > priority_queue? That would automagically prevent duplicates from being
            > inserted.[/color]

            I believe the underlying container for priority_queue must be a
            sequence container rather than an associative container. However,
            you could use set _instead_ of priority_queue. The algorithm
            doesn't change very much.

            As before, this is just off the top of my head. I haven't compiled
            it, much lest tested it, so use at your own risk and all that.

            std::set<MyType > highest;
            MyType obj;
            while (ReadRecord(&ob j))
            {
            // Is obj one of the highest so far?
            if (highest.count( ) < count || obj > *highest.begin( ))
            {
            // Insert it into the set; this has no effect if the set
            // already contains an identical value.
            highest.insert( obj);

            // If too many elements discard the lowest one.
            if (highest.count( ) > count)
            {
            highest.erase(h ighest.begin()) ;
            }
            }
            }

            Comment

            • Rich S.

              #7
              Re: ? about priority_queue how make objects unique


              "mlimber" <mlimber@gmail. com> wrote in message
              news:1128535187 .205106.111940@ g47g2000cwa.goo glegroups.com.. .[color=blue]
              > Rich S. wrote:[color=green]
              >> Hi
              >>
              >> In an earlier posting I was asking how to read thru millions of data
              >> records
              >> keeping the top 2000 (where the top values are based upon a field in the
              >> record) and niklasb suggested using a priority_queue like so
              >>
              >>
              >> ------------------- start old message ---------------------
              >> A more efficient way than what you describe would be
              >> to use priority_queue. Define the predicate in such
              >> a way that the top() always returns the lowest scoring
              >> object.
              >>
              >> Let q be the priority_queue and suppose you want to
              >> find the top N scores (where N is 2000 in your example),
              >> the pseudo-code would be something like this:
              >>
              >> q.reserve(N);
              >>
              >> // add the first N elements to the priority queue
              >> for (i = 0; i < N; ++i)
              >> {
              >> if (!ReadRecord(&o bj))
              >> break;
              >>
              >> q.push(obj);
              >> }
              >>
              >> // invariant: q contains highest N elements
              >> // q.top() is the lowest of the highest N elements
              >> while (ReadRecord(&ob j))
              >> {
              >> if (obj > q.top())
              >> {
              >> q.pop();
              >> q.push(obj);
              >> }
              >> }
              >>
              >> ------------------- end of old message ---------------------
              >>
              >> This is working well for me but I'm getting cases where I have duplicate
              >> records in the top 2000 and I'm not quite sure how to fix that problem.
              >>
              >> I was thinking of running a loop around the pop until the obj is NOT
              >> greater
              >> than the top and then pushing obj but I'm convinced thats the best way.
              >>
              >> Thanks for any help.
              >>
              >> Regards.[/color]
              >
              > Are you getting duplicates because there are duplicates in your file(s)
              > or because there is an error in the program? (If the former, you might
              > consider using a std::unique, perhaps in conjunction with a
              > std::vector, to find the first N unique records. See the docs here:
              > http://www.sgi.com/tech/stl/unique.html .)
              >
              > Cheers! --M
              >[/color]
              yes duplicates in my data which I expected. I switched to a set based
              solution and it works good so far.

              Thanks for the tip about unique that'll be handy one day.


              Comment

              • Rich S.

                #8
                Re: ? about priority_queue how make objects unique


                <niklasb@micros oft.com> wrote in message
                news:1128559668 .869414.10710@f 14g2000cwb.goog legroups.com...[color=blue]
                > Kristo wrote:[color=green]
                >> Rich S. wrote:[color=darkred]
                >> > Hi
                >> >
                >> > In an earlier posting I was asking how to read thru millions of data
                >> > records
                >> > keeping the top 2000 (where the top values are based upon a field in
                >> > the
                >> > record) and niklasb suggested using a priority_queue like so[/color]
                >>
                >> [snip old message]
                >>[color=darkred]
                >> > This is working well for me but I'm getting cases where I have
                >> > duplicate
                >> > records in the top 2000 and I'm not quite sure how to fix that problem.[/color]
                >>
                >> What if you used a set as the underlying container for your
                >> priority_queue? That would automagically prevent duplicates from being
                >> inserted.[/color]
                >
                > I believe the underlying container for priority_queue must be a
                > sequence container rather than an associative container. However,
                > you could use set _instead_ of priority_queue. The algorithm
                > doesn't change very much.
                >
                > As before, this is just off the top of my head. I haven't compiled
                > it, much lest tested it, so use at your own risk and all that.
                >
                > std::set<MyType > highest;
                > MyType obj;
                > while (ReadRecord(&ob j))
                > {
                > // Is obj one of the highest so far?
                > if (highest.count( ) < count || obj > *highest.begin( ))
                > {
                > // Insert it into the set; this has no effect if the set
                > // already contains an identical value.
                > highest.insert( obj);
                >
                > // If too many elements discard the lowest one.
                > if (highest.count( ) > count)
                > {
                > highest.erase(h ighest.begin()) ;
                > }
                > }
                > }
                >[/color]

                I switched to a set based solution and it works pretty good.

                Thanks


                Comment

                • Rich S.

                  #9
                  Re: ? about priority_queue how make objects unique


                  "Mark P" <usenet@fall200 5REMOVE.fastmai lCAPS.fm> wrote in message
                  news:_l_0f.9161 $oO2.7945@newss vr27.news.prodi gy.net...[color=blue]
                  > Rich S. wrote:[color=green]
                  >> Hi
                  >>
                  >> In an earlier posting I was asking how to read thru millions of data
                  >> records keeping the top 2000 (where the top values are based upon a field
                  >> in the record) and niklasb suggested using a priority_queue like so
                  >>
                  >>
                  >> ------------------- start old message ---------------------
                  >> A more efficient way than what you describe would be
                  >> to use priority_queue. Define the predicate in such
                  >> a way that the top() always returns the lowest scoring
                  >> object.
                  >>
                  >> Let q be the priority_queue and suppose you want to
                  >> find the top N scores (where N is 2000 in your example),
                  >> the pseudo-code would be something like this:
                  >>
                  >> q.reserve(N);
                  >>
                  >> // add the first N elements to the priority queue
                  >> for (i = 0; i < N; ++i)
                  >> {
                  >> if (!ReadRecord(&o bj))
                  >> break;
                  >>
                  >> q.push(obj);
                  >> }
                  >>
                  >> // invariant: q contains highest N elements
                  >> // q.top() is the lowest of the highest N elements
                  >> while (ReadRecord(&ob j))
                  >> {
                  >> if (obj > q.top())
                  >> {
                  >> q.pop();
                  >> q.push(obj);
                  >> }
                  >> }
                  >>
                  >> ------------------- end of old message ---------------------
                  >>
                  >> This is working well for me but I'm getting cases where I have duplicate
                  >> records in the top 2000 and I'm not quite sure how to fix that problem.
                  >>
                  >> I was thinking of running a loop around the pop until the obj is NOT
                  >> greater than the top and then pushing obj but I'm convinced thats the
                  >> best way.
                  >>
                  >> Thanks for any help.
                  >>
                  >> Regards.
                  >>
                  >>[/color]
                  >
                  > I agree with previous replies that a std::set is a good approach (just
                  > make sure to check the return value of insert to see if an item was
                  > actually inserted, if you want to ensure always 2000 records). Another
                  > option is a hashtable to record which elts. are currently in the queue.[/color]

                  I switched to a set but was thinking about a map solution for speed.


                  Comment

                  • Rich S.

                    #10
                    Re: ? about priority_queue how make objects unique


                    "Kai-Uwe Bux" <jkherciueh@gmx .net> wrote in message
                    news:di1ce7$hcq $1@murdoch.acc. Virginia.EDU...[color=blue]
                    > Rich S. wrote:
                    >[color=green]
                    >> Hi
                    >>
                    >> In an earlier posting I was asking how to read thru millions of data
                    >> records keeping the top 2000 (where the top values are based upon a field
                    >> in the record) and niklasb suggested using a priority_queue like so
                    >>
                    >>
                    >> ------------------- start old message ---------------------
                    >> A more efficient way than what you describe would be
                    >> to use priority_queue. Define the predicate in such
                    >> a way that the top() always returns the lowest scoring
                    >> object.
                    >>
                    >> Let q be the priority_queue and suppose you want to
                    >> find the top N scores (where N is 2000 in your example),
                    >> the pseudo-code would be something like this:
                    >>
                    >> q.reserve(N);
                    >>
                    >> // add the first N elements to the priority queue
                    >> for (i = 0; i < N; ++i)
                    >> {
                    >> if (!ReadRecord(&o bj))
                    >> break;
                    >>
                    >> q.push(obj);
                    >> }
                    >>
                    >> // invariant: q contains highest N elements
                    >> // q.top() is the lowest of the highest N elements
                    >> while (ReadRecord(&ob j))
                    >> {
                    >> if (obj > q.top())
                    >> {
                    >> q.pop();
                    >> q.push(obj);
                    >> }
                    >> }
                    >>
                    >> ------------------- end of old message ---------------------
                    >>
                    >> This is working well for me but I'm getting cases where I have duplicate
                    >> records in the top 2000 and I'm not quite sure how to fix that problem.
                    >>
                    >> I was thinking of running a loop around the pop until the obj is NOT
                    >> greater than the top and then pushing obj but I'm convinced thats the
                    >> best
                    >> way.[/color]
                    >
                    > If you care about uniqueness, then maybe std::set is the way to go.
                    > Consider
                    > someting like this:
                    > #include <set>
                    >
                    > template < typename T, unsigned long N >
                    > struct top_N {
                    >
                    > std::set<T> elements;
                    >
                    > void insert ( T const & t ) {
                    > elements.insert ( t );
                    > if ( elements.size() > N ) {
                    > elements.erase( elements.begin( ) );
                    > }
                    > }
                    >
                    > }; // struct top_N
                    >
                    >
                    > Best
                    >
                    > Kai-Uwe Bux[/color]

                    Thanks, I did switch to a set solution and it works well.


                    Comment

                    Working...