How do I select random subjects from different groups in one table?

**lilp32** · Jan 18 '12, 10:23 PM

Works perfectly! Thanks so much!!!

**Rabbit** · Jan 18 '12, 10:34 PM

Not a problem, good luck.

**Rabbit** · Jan 18 '12, 11:24 PM

The results of 100000 runs each on 50 unique IDs. There are a few outliers but for the most part, it approaches random.

Code:

 1 :: 0.5097
 2 :: 0.4913
 3 :: 0.4957
 4 :: 0.4857
 5 :: 0.5064
 6 :: 0.4436
 7 :: 0.3428
 8 :: 0.5
 9 :: 0.52
10 :: 0.5037
11 :: 0.4366
12 :: 0.5158
13 :: 0.469
14 :: 0.4779
15 :: 0.49
16 :: 0.5013
17 :: 0.5315
18 :: 0.4691
19 :: 0.5175
20 :: 0.541
21 :: 0.4312
22 :: 0.421
23 :: 0.4748
24 :: 0.4604
25 :: 0.5207
26 :: 0.5059
27 :: 0.4522
28 :: 0.4169
29 :: 0.6054
30 :: 0.5955
31 :: 0.5092
32 :: 0.5134
33 :: 0.5379
34 :: 0.5026
35 :: 0.4889
36 :: 0.5094
37 :: 0.4758
38 :: 0.452
39 :: 0.5282
40 :: 0.4668
41 :: 0.4673
42 :: 0.4998
43 :: 0.4969
44 :: 0.4396
45 :: 0.5248
46 :: 0.5087
47 :: 0.5215
48 :: 0.4943
49 :: 0.4541
50 :: 0.4971

**NeoPa** · Jan 18 '12, 11:44 PM

Originally posted by Rabbit

Rabbit (Post #18):
As for post #11, I prefer a join approach over a subquery in terms of speed.

I'm sure you do. I would too if there were a version that gave valid results. I don't believe that's the case with what you posted in post #18 though, for the reasons I mentioned earlier (post #11). It doesn't seem sensible to compromise the results for reasons of performance (Hence I went for the less efficient approach that, as far as I can see, gives reliable results logically).

Ref post #30.
Certainly it is better to use the [UNIQUEID] field in the seeding, but this doesn't. It multiplies by a [UNIQUEID] instead. This has the effect that the results may well look random, but are not. It is my belief (until I see something else new to be fair) that this approach cannot yield the results you request. Assume for the sake of understanding that [UNIQUEID] is allocated in order of records created. As the product of the result of Rnd() * [UNIQUENO] will always be a value which follows the pattern 0 <= X < [UNIQUENO], it follows that more recent records will tend towards those to be selected. Nothing is easily testable as the randomness is always involved too, but unless [UNIQUENO] is added as part of the seed this will always be flawed. Of course if that were the case, then this approach wouldn't work anyway.

I'm happy to be proven wrong here, but I'll require more than the fact that the results seem random (as I'd expect that anyway). That doesn't counter the knowledge that largere numbered [UNIQUEID] values (more recent records) would be weighted over those with lower values (entered earlier).

**NeoPa** · Jan 18 '12, 11:55 PM

Originally posted by Lilp32

Lilp32:
@NeoPa: I realized my mistake, thanks! I still haven't gotten the VBA code to work.

Let me know what you're stuck on and I'll be happy to help.

PS. As per my response in post #34, Rabbit is absolutely right in that my suggested SQL code in post #11 would not run as fast as the SQL using JOINs (I believe it would actually work reliably, but I'm not above wanting confirmation on that score from Rabbit who is, as you'll see, one of the best SQL exponents you could hope to come across). As such, I'm not sure whether or not it would run faster or slower than the code solution from post #12 (which references the routine from post #10).

**Rabbit** · Jan 19 '12, 12:12 AM

The only time that a join would return incorrect results is when Rnd() returns the same result for different records in the same group. At which point you would have to balance veracity against speed. And that would be highly dependent on how many records they have. I think at most you would get 19 records rather then your 20 and even that would be pretty a rare occurrence. If this is not the shortcoming you're referring to, can you explain?

I don't think I suggested that the records be picked by Rnd() * [UniqueID] as that would definitely result in incorrect results. Rather, I meant that Rnd(-1 * [seconds or milliseconds] * [UniqueID]) should be used to get the random number. Using that as the seed should result in near randomness for any [UniqueID]. I believe the code in 30 and the results in 33 show that, at least for the numbers 1-50, any one ID approaches randomness.

I suppose the next step would be to show randomness between any two sets of IDs. I can make test code for that as well. Because while randomness is, by definition, difficult to test, it is not impossible to test all possible seed values that may be used to seed the generator.

In the end, I believe the key to approaching randomness would be to multiply more and more time values into the seed.

**NeoPa** · Jan 19 '12, 12:56 AM

Originally posted by Rabbit

Rabbit:
I don't think I suggested that the records be picked by Rnd() * [UniqueID] as that would definitely result in incorrect results.

Indeed. You are correct on that point. I checked over and over again and still managed to get confused on that point, but now I look yet again, I see it was, indeed, within the seed of the Rnd() call (Not only in the most recent example but also in posts #6 and #18).

My other objection was based on the fact that the [RandNo] values for the same records in [T1] and [T2] were identical (assuming Now() is a value that is only set once for the whole run and simply reused as that static value for each reference). I'm ashamed to say that on that score I was also mistaken (Either that or I'm just too tired to think clearly now). This is actually an essential part of the logic.

If I was impressed before, I find I now have the urge to bow before a SQL deity. I see nothing wrong with it at all (If I'm being really picky I might say to update line #14 to GROUP BY T1.UNIQEUEID, but that's just a detail you didn't catch up with after that field was provided late).

@Lilp32.
Go ahead and mark post #30 as Best Answer. I'll be happy to take you through my code still, if you want, but Rabbit's solution is, frankly, in a class of its own. I'm just checking again in case my enthusiasm is a result of tiredness and I'm missing something, but I really don't think so. It also seems more straightforward now somehow (which illustrates its code-elegance). No. I'm happy it's the doggy's doo-dahs.

Originally posted by Rabbit

Rabbit:
In the end, I believe the key to approaching randomness would be to multiply more and more time values into the seed.

No. I led you up the garden path there I'm afraid (inadvertently of course). The code is perfectly random I see now. Any perceived lack of randomness was down to my misunderstandin g of where the parentheses left your multiplication of the value(s) from within the record.

**Rabbit** · Jan 19 '12, 01:37 AM

I wouldn't say that you steered me wrong per se. Only that you sent me down an inevitable and necessary path to check whether or not it is truly random between any two numbers.

I plan on running the following simulation when I get back in the office tomorrow.

Code:

For seconds = 1 To 60
   For minutes = seconds + 1 To 60
      win = 0

      For id1 = 1 To 10001 Step 100
         For id2 = id1 + 100 To 10001 Step 100
            If rnd(minutes * seconds * id1) > rnd(minutes * seconds * id2) Then
               win = win + 1
            End If
         Next id2
      Next id1

      Debug.Print win / 3600
   Next minutes
Next seconds

The win rate for any two number combination should approach 50%.

**Rabbit** · Jan 19 '12, 05:34 PM

I only went from 1 to 1001 with a step of 100. But here are the results:

Code:

1,101,0.497777777777778
1,201,0.506388888888889
1,301,0.508888888888889
1,401,0.487222222222222
1,501,0.493333333333333
1,601,0.495277777777778
1,701,0.495
1,801,0.489444444444444
1,901,0.498055555555556
1,1001,0.494444444444444
101,201,0.501944444444444
101,301,0.505
101,401,0.509444444444444
101,501,0.5125
101,601,0.487222222222222
101,701,0.498055555555556
101,801,0.514722222222222
101,901,0.496111111111111
101,1001,0.498888888888889
201,301,0.510833333333333
201,401,0.501111111111111
201,501,0.500277777777778
201,601,0.500555555555556
201,701,0.496111111111111
201,801,0.500555555555556
201,901,0.491388888888889
201,1001,0.506388888888889
301,401,0.507777777777778
301,501,0.493055555555556
301,601,0.4975
301,701,0.4925
301,801,0.490555555555556
301,901,0.487777777777778
301,1001,0.499166666666667
401,501,0.506944444444444
401,601,0.488888888888889
401,701,0.494166666666667
401,801,0.492777777777778
401,901,0.495
401,1001,0.503055555555556
501,601,0.499166666666667
501,701,0.496666666666667
501,801,0.489166666666667
501,901,0.493333333333333
501,1001,0.495277777777778
601,701,0.495555555555556
601,801,0.498333333333333
601,901,0.496666666666667
601,1001,0.501666666666667
701,801,0.491666666666667
701,901,0.506111111111111
701,1001,0.504166666666667
801,901,0.512222222222222
801,1001,0.504166666666667
901,1001,0.503055555555556

For the most part, it seems to approach random. The largest difference seems to be a 2.4% bias towards one number over another.

P.S. As expected, multiplying more time factors into the seed reduces variance. Minutes and seconds is good but minutes, seconds, hours is better and so on.

**NeoPa** · Jan 19 '12, 11:04 PM

Originally posted by Rabbit

Rabbit:
P.S. As expected, multiplying more time factors into the seed reduces variance. Minutes and seconds is good but minutes, seconds, hours is better and so on.

Indeed. I left out the Hours when I looked at my version as that would have risked an overflow condition. Two fields from the record were required at that time though. Now [UNIQUEID] is available the Hours can be handled without risk of overflow.

How do I select random subjects from different groups in one table?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment