How to select record based on probability?

  • TheSmileyCoder
    Recognized Expert Moderator Top Contributor
    • Dec 2009
    • 2322

    How to select record based on probability?

    I have a table of names with 2 fields in it, dblProbability and txName, where dblProbability is the relative probability that the name in that row should be selected. How can I get 1 random name from my table, taking into account the probability that it should occur?
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    #2
    Make a call to Rnd() seeded with the date and/or time, depending on how often you want to run the query. If the returned value is less than or equal to your probability field, then return that row. Something like this:
    Code:
    SELECT *
    FROM someTable
    WHERE probabilityField >= Rnd(Second(Now()))
    Now, normally you also seed it with a primary key so that each row gets a different random number. However, it is unnecessary in this instance because you're not comparing between rows but rather within the row.

    PS: If you're going to validate, seed it with a negative. Otherwise each call to Rnd will return the next number in the sequence and the results will be out of sync with what you see on the screen.
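    A minimal Python sketch of the same idea (the table data here is invented for illustration): one random draw per query run, compared against each row's probability field.

```python
import random

def pick_rows(rows, draw=None):
    """Keep every row whose probability beats a single random draw,
    mirroring WHERE probabilityField >= Rnd(...) above.
    rows is a list of (probability, name) pairs -- illustrative data."""
    if draw is None:
        draw = random.random()  # one draw per run, like a seeded Rnd()
    return [name for prob, name in rows if prob >= draw]
```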
    Last edited by Rabbit; Jun 5 '12, 05:25 PM.

    Comment

    • zmbd
      Recognized Expert Moderator Expert
      • Mar 2012
      • 5501

      #3
      @Rabbit:
      Are we sure that TheSmileyCoder isn't comparing between rows?
      I took this question to mean: given "name_alpha/5" vs. "name_beta/2" vs. "name_charlie/1", name_alpha would be 5 times more likely than name_charlie and 2.5 times more likely than name_beta to be picked, and similarly name_beta would be twice as likely as name_charlie to be picked.
      @TheSmileyCoder:
      Are you comparing the data against each other? How are you assigning the probabilities: as in the example above, which is somewhat arbitrary, or do the probabilities have to add up to 100% (i.e. so the above becomes close to "name_alpha/63" vs. "name_beta/25" vs. "name_charlie/12"… which might be easier to code for)?
      If there is no requirement to sum to 100%, then are you allowing equal probabilities to be assigned? That will toss a monkey-wrench into the logic!

      In anticipation of your answer:
      So, let’s go with the no monkey-wrench (sets the foundation for the monkey-wrench):
      In the second method... use rnd(1,100): if rnd(1,100) >= 63 then name_alpha, if rnd(1,100) is between 12 and 63 then name_beta, and if rnd(1,100) <= 12 then name_charlie. If we trust that the rnd function is truly random then the weighting should work.
      This should work even if you don't require the sum to be 100: convert the values to weights as for a weighted average... sum all of the probabilities: 5+2+1 = 8, then convert to percentages by division: 5/8, 2/8, 1/8... gives approximately 63, 25, 12.
      From here, the query should return all records where the [probability] is greater than or equal to rnd(1,100) (back-convert to the arbitrary value if needed). Now logically we know that these records should be chosen more often than the remaining records. So do we return the record with the max([probability]) or the min([probability])? I opt for the min([probability]): if rnd(1,100) were small then the max([probability]) would always be returned. You have your single name.
      Monkey-wrench:
      Once again… pull the records as before, look for the min([probability]); however, this time look for count(min([probability]))>1. If you do have more than one record with that probability, then we need to pull all of the records wherein they have min([probability]) then do an un-weighted random pick between records as they would have had an equal chance to have been selected to begin with...
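      The conversion above (5+2+1 = 8, then 5/8, 2/8, 1/8) and a single weighted draw can be sketched in Python; the names and weights are the arbitrary ones from the example, not real data.

```python
import random

def normalize(weights):
    """Convert arbitrary weights to percentages, as in the 5+2+1
    example: 5/8, 2/8, 1/8 -> roughly 63, 25, 12 (percent)."""
    total = sum(weights.values())
    return {name: 100.0 * w / total for name, w in weights.items()}

def weighted_pick(weights, rng=random.uniform):
    """One weighted draw: walk the names in descending share order,
    subtracting each share from a uniform draw in (0, 100]."""
    shares = normalize(weights)
    r = rng(0, 100)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        r -= share
        if r <= 0:
            return name
    return name  # guard against floating-point leftovers
```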
      -z
      Last edited by zmbd; Jun 5 '12, 06:02 PM. Reason: fixed logic error

      Comment

      • Rabbit
        Recognized Expert MVP
        • Jan 2007
        • 12517

        #4
        I assumed they were comparing just within the row. If they are trying to pick one out of a cumulative probability, then it's pretty much the same thing, except you will need a start range and an end range and use an additional comparison.
        Code:
        SELECT * 
        FROM someTable 
        WHERE Rnd(Second(Now()) * -1) >= startProb
           AND Rnd(Second(Now()) * -1) < endProb
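        In Python terms (the ranges here are invented for illustration), the start/end comparison is an interval lookup: a single draw lands in exactly one [start, end) range.

```python
import random

def pick_from_ranges(ranges, draw=None):
    """Return the name whose [start, end) interval contains the draw.
    ranges is a list of (start, end, name) tuples built from
    cumulative probabilities -- illustrative data."""
    if draw is None:
        draw = random.random()
    for start, end, name in ranges:
        if start <= draw < end:
            return name
    return None
```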

        Comment

        • TheSmileyCoder
          Recognized Expert Moderator Top Contributor
          • Dec 2009
          • 2322

          #5
          I currently have a list of 64000 last names, along with how many persons living in Denmark use each one. Example:
          Code:
          Amount  Name
          460470  Jensen
          446646  Nielsen
          382215  Hansen
          271012  Pedersen
          ...
          ...
          ...
          6       Kausky
          5       Haj
          4       Shorch
          4       Van Sas
          4       Osinka
          4       Devitt
          As you can see, some of these have a way higher weight than others, and some will have the same weight.

          I haven't converted the Amount to a probability yet, though that's just a formality. I do not plan to continually update the underlying data, so if any changes need to be made now, in order to create a more efficient function, that is fine.

          I need to generate X random results from the list, where X can be anywhere from 1 to 100,000 or more. As such it's important that the function is relatively efficient.


          EDIT: Hmm, cumulative probability... that sounds like a good way to go. I will look into that some more and report my findings.

          Comment

          • zmbd
            Recognized Expert Moderator Expert
            • Mar 2012
            • 5501

            #6
            Take a look at this:

            -z

            Comment

            • zmbd
              Recognized Expert Moderator Expert
              • Mar 2012
              • 5501

              #7
              Should have mentioned this in my last post but I was called away....

              cumulative probability is a very specific statistical model. I don't feel that that particular methodology is what you are after, and research along that line gets into some very heavy math very quickly...

              What I tried to explain in post #3 is a form of partial sums and normalization (mathematical normalization, not the RDB kind), skipping the normalization step. This method should work; however, for a very large data-set it might not be the best in terms of speed. Therefore I posted the link to the MySQL thread.

              What I think you should do is also do a Google for: "Weighted Random Sampling (2005; Efraimidis, Spirakis)"

              very good paper... the one I have is copyrighted and the agreement has a re-post/reuse restriction (sucks) or I would excerpt it here; however, you should be able to find something online using that as the search (wink-wink-nudge-nudge-dontchknow).

              there are also a few good papers out of Berkeley and others on this topic... I'll try to attach an older PDF on simple random sampling... what I like about this paper is that the authors attempt to do random sampling without having to build the entire record-set first! (Note: I found this paper some years ago when I was building a random sample dataset; I hope it is in the public domain, and it is provided here solely for academic purposes. I renamed the file for my own document management... it's not a simple paper!) This paper sent me down the partial-sum-tree method.

              Put your thinking cap on and brew a strong cup of coffee... you're in for a heavy read.
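              For what it's worth, the core idea of that paper fits in a few lines; this is a from-memory Python sketch of the Efraimidis-Spirakis "A-Res" scheme, not an excerpt: give each record the key u^(1/w) for a uniform u and weight w, and keep the k largest keys. That yields a weighted sample without replacement.

```python
import heapq
import random

def weighted_sample(items, k, rng=random.random):
    """Efraimidis-Spirakis A-Res sketch: key = u ** (1 / weight);
    the k largest keys form a weighted sample without replacement.
    items is a list of (weight, name) pairs with weight > 0."""
    keyed = ((rng() ** (1.0 / w), name) for w, name in items)
    return [name for _key, name in heapq.nlargest(k, keyed)]
```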
              -wc
              Attached Files

              Comment

              • Rabbit
                Recognized Expert MVP
                • Jan 2007
                • 12517

                #8
                If your goal is to generate x number of names given a known distribution, meaning you're allowing for "duplicate" names, you can do something like this:

                1) Create a table just with the numbers 1 to some arbitrarily high x.

                2) In your probability table, have a start and end range for each name.

                3) Use a query to generate a random number on the number table seeding with the number and time values.

                4) Join that query to the probability table using a query similar to the previous post.
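                The four steps can be sketched in Python (the tables and ranges are invented for illustration); the numbers table becomes range(x), and giving each number its own draw is what allows duplicate names:

```python
import random

def generate_names(ranges, x, rng=random.random):
    """Generate x names with duplicates allowed.
    ranges is a list of (start, end, name) tuples -- the start/end
    columns of the probability table (step 2)."""
    result = []
    for _n in range(x):                  # step 1: numbers 1..x
        draw = rng()                     # step 3: one draw per number
        for start, end, name in ranges:  # step 4: join on the range
            if start <= draw < end:
                result.append(name)
                break
    return result
```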

                Comment

                • NeoPa
                  Recognized Expert Moderator MVP
                  • Oct 2006
                  • 32656

                  #9
                  I think Rabbit's initial suggestion can be modified to work simply by comparing the random number generated (0 <= X < 1) against the result of:
                  Code:
                  [dblProbability] / Sum([dblProbability])
                  Clearly, the random number would need to be reset for each record (i.e. rerun, rather than necessarily reseeded).
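                  A rough Python rendering of this, assuming a fresh draw per record; note that a single pass can return zero or several names, so for exactly one name the pass would be rerun until a single name survives.

```python
import random

def pick_by_share(rows, rng=random.random):
    """Compare a fresh draw per record against that record's share
    of the total weight: [dblProbability] / Sum([dblProbability]).
    rows is a list of (weight, name) pairs; may return 0..n names."""
    total = sum(w for w, _ in rows)
    return [name for w, name in rows if rng() < w / total]
```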

                  Comment

                  • zmbd
                    Recognized Expert Moderator Expert
                    • Mar 2012
                    • 5501

                    #10
                    cumulative probability is a red-herring

                    Originally posted by Rabbit
                    Rabbit:...1) Create a table just with the numbers 1 to some arbitrarily high x...
                    I understand what you're getting at; however, the probability is already part of the table as [dblProbability]... so why not use the variation on the partial-sums-tree method I suggested in #3, other than that it may not be the fastest…
                    Originally posted by zmbd
                    ... sum all of the probabilities: 5+2+1= 8 then convert to percentage by division 5/8, 2/8, 1/8... gives approximately 63, 25, 12....
                    Now the 63 is actually 63% and the others 25% and 12%... cardinal sin not using the units... my bad.
                    Notice, this is basically the same as NeoPa's reply in #9.

                    Now, TheSmileyCoder's original question was for (as I understood it) a specific, single name to be picked, hence my reply and subsequent query... the query can be optimized to pull just the one record.
                    However, in post #5 TheSmileyCoder tells us that we need to pull not just one record but create a record-set that may consist of between one and several thousand elements; thus the two suggested papers and the link to the MySQL forum, to provide a path to a method that can do this efficiently, and to provide information bearing on whether non-duplication or replacement is needed, etc... I am working under the notion that the subset must also be in random order and that duplication is allowed.

                    Originally posted by hina
                    hina:...a cumulative probability, then it's pretty much the same thing ...
                    Hina, you've simply re-stated Rabbit's post in #4. As I said in #7, cumulative probability is a very specific statistical theory that for this question may very well be a red-herring. For the sake of argument, please take a look at this very simple explanation (one should also work through the associated lessons): http://stattrek.com/probability-dist...tribution.aspx . Cumulative probability simply doesn't appear, to me, to be the correct path to the solution.

                    -z

                    Comment

                    • dsatino
                      Contributor
                      • May 2010
                      • 393

                      #11
                      I've attached a sample DB that will do what you want based on your description.

                      The basis for the selection code is in Section 4.2.1 of the PDF that zmbd posted. There's actually a ton of other good stuff in there as well, but you don't need that for your purpose.

                      The only thing related to the sampling that's really worth pointing out is that this method uses replacement, so if you want to use this for other purposes you must consider that.

                      As for efficiency, I created a sample of 10,000 in about 30 seconds, but you can improve on that. It's writing to disk every time it makes a selection, so your best bet is probably to capture the selections and write them en masse.

                      Hope this helps
                      Attached Files

                      Comment

                      • Rabbit
                        Recognized Expert MVP
                        • Jan 2007
                        • 12517

                        #12
                        @zmbd, I think you misunderstood my intention with that second table. I wasn't using it as a way to generate the probabilities. It's there to allow for "duplication".

                        I think, and this is just my interpretation, that the goal is to generate a random population. So let's say I'm modeling the inhabitants of a city. And I want to generate 1000 people to live in this city. Some of them are going to have the same name. Whichever way they decide to calculate the probability and pick one, they will need a way to allow the same name to pop up more than once. That's all that second table is for.

                        Comment

                        • Rabbit
                          Recognized Expert MVP
                          • Jan 2007
                          • 12517

                          #13
                          Here's what I mean.
                          tblSampleSize
                          Code:
                          n
                          1
                          2
                          3
                          4
                          5
                          tblProb
                          Code:
                          pName aProb bProb
                          Bob   0     .5
                          Andy  .5    .8
                          Jill  .8    .95
                          Kim   .95   1
                          Query
                          Code:
                          SELECT pName
                          FROM tblSampleSize, tblProb
                          WHERE n <= 3
                             AND Rnd(-1 * n * Second(Now()) * Minute(Now())) >= aProb
                             AND Rnd(-1 * n * Second(Now()) * Minute(Now())) < bProb
                          The query will generate 3 people, with the option of generating up to 5 people, using any of the names in the table and allowing names to be repeated.

                          I've attached a sample database. It will generate 15000 names from 20 available names in less than 3 seconds.
                          Attached Files
                          Last edited by Rabbit; Jun 6 '12, 05:38 PM.

                          Comment

                          • Mihail
                            Contributor
                            • Apr 2011
                            • 759

                            #14
                            Sometimes someone not very skilled can have a good idea (agreed: very rarely). So:

                            Create a temporary table with two more fields:
                            Code:
                             Amount      Name            Min                 Max
                            460470      Jensen        Max_Nielsen + 1    Max_Nielsen + 460470
                            446646      Nielsen       .............      ..................
                            382215      Hansen
                            271012      Pedersen
                                ...
                                ...
                                ...
                            6           Kausky            22                  27
                            5           Haj               17                  21
                            4           Shorch            13                  16
                            4           Van Sas            9                  12
                            4           Osinka             5                   8
                            4           Devitt             1                   4
                            Then generate a random number between MIN = 1 and MAX = Max_Nielsen + 460470

                            Use a query to select the record you need (The random number is between Min and Max)
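                            Since the Min/Max columns are just running totals, the lookup can be a binary search instead of a table scan; a Python sketch using a few of the small counts from post #5:

```python
import bisect
import random

def build_ranges(counts):
    """Build cumulative upper bounds from (amount, name) rows, so
    each name owns an integer interval, like the Min/Max columns."""
    bounds, names, running = [], [], 0
    for amount, name in counts:
        running += amount
        bounds.append(running)  # the Max column: running total
        names.append(name)
    return bounds, names

def pick(bounds, names, rng=random.randint):
    """Draw an integer in [1, grand total] and binary-search its owner."""
    r = rng(1, bounds[-1])
    return names[bisect.bisect_left(bounds, r)]
```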

                            Comment

                            • NeoPa
                              Recognized Expert Moderator MVP
                              • Oct 2006
                              • 32656

                              #15
                              That scores high on the logic front Mihail, but unfortunately not very high on the practicability front.

                              We have some other, obviously well educated, contributors in here who have suggested solutions which don't require pre-manipulation of the data.

                              This was always going to be a tougher than average question though. If Smiley needs to ask it then it's not likely to be too simple or straightforward ;-)

                              Keep up the good work anyway :-)

                              Comment
