Most pythonic way of weighted random selection

**bearophileHUGS@lycos.com** · Aug 30 '08, 05:05 PM

Re: Most pythonic way of weighted random selection

Manuel Ebert, this may be related/useful:

http://code.activestate.com/recipes/498229/

Note that numpy has a bisection method/function that are probably
quite faster.

Bye,
bearophile

**duncan smith** · Aug 31 '08, 12:15 AM

Re: Most pythonic way of weighted random selection

Manuel Ebert wrote:

Dear list,
>
who's got aesthetic advice for the following problem? I've got some
joint probabilities of two distinct events Pr(X=x, Y=y), stored in a
list of lists of floats, where every row represents a possible outcome
of X and every float in a row a possible outcome of Y (which I will now
call my matrix, altough I can't use numpy or numeric for this), so e.g.
m = [[0.2, 0.4, 0.05], [0.1, 0.05, 0.2]]. All lists in the list are
equally long and the values of the flattened list add up to 1.0, i.e.
sum([sum(row) for row in m]) == 1. In practice, this 'matrix' is about
20x40, i.e. a list with 20 lists á 40 floats each.
>
Now the task is to select one outcome for X and Y based on the joint
probabilites, and afterwards check that the outcomes fullfill certain
criteria. If it doesn't fulfill some criteria a new pair of outcomes has
to be selected, for other criteria it will still be selected with a
certain probability. My approach was to choose a random number, and then
walk through the list adding the values together until the accumulated
sum is greater than my random threshold:
>

[snip]

For a list of cumulative probabilities you could use the bisect module
to find the insertion point for a 0,1 uniform variate (which you could
then map back to a cell index).

Alternatively you could use an alias table,
http://amath.colorado.edu/courses/74.../Web/SS-10.ppt.

You could produce a probability table that takes into account your
criteria, so that you don't need to check them. i.e. Each cell that
does not satisfy criterion a has probability 0 of being selected. Each
other cell has a probability of being selected proportional to its
original value if it satisfies criterion b, or its original value * p
otherwise. Normalise the resulting values to sum to 1 and you're done
with the need to check against criteria.

So, preprocess your table, create a corresponding list of cumulative
probabilities and use the bisect model (is one possibility).

Duncan

**Steven D'Aprano** · Aug 31 '08, 02:05 AM

Re: Most pythonic way of weighted random selection

On Sat, 30 Aug 2008 17:41:27 +0200, Manuel Ebert wrote:

Dear list,
>
who's got aesthetic advice for the following problem?

....

[ugly code removed]

Now that looks plain ugly, and I wonder whether you might find a
slightly more elegant way of doing it without using numpy and the like.

Never be afraid to factor out pieces of code into small functions.
Writing a single huge while loop that does everything is not only hard to
read, hard to write and hard to debug, but it can also run slower. (It
depends on a number of factors.)

Anyway, here's my attempt to solve the problem, as best as I can
understand it:

import random

def eq(x, y, tol=1e-10):
# floating point equality within some tolerance
return abs(x-y) <= tol

M = [[0.2, 0.4, 0.05], [0.1, 0.05, 0.2]]
# the sums of each row must sum to 1.0
assert eq(1.0, sum([sum(row) for row in M]))

# build a cumulative probability matrix
CM = []
for row in M:
for p in row:
CM.append(p) # initialize with the raw probabilities

for i in range(1, len(CM)):
CM[i] += CM[i-1] # and turn into cumulative probabilities

assert CM[0] >= 0.0
assert eq(CM[-1], 1.0)

def index(data, p):
"""Return the index of the item in data
which is no smaller than float p.
"""
# Note: this uses a linear search. If it is too slow,
# you can re-write it using the bisect module.
for i, x in enumerate(data) :
if x >= p:
return i
return len(data-1)

def index_to_rowcol umn(i, num_columns):
"""Convert a linear index number i into a (row, column) tuple."""
# When converting [ [a, b, c, ...], [...] ] into a single
# array [a, b, c, ... z] we have the identity:
# index number = row number * number of columns + column number
return divmod(i, num_columns)

# Now with these two helper functions, we can find the row and column
# number of the first entry in M where the cumulative probability
# exceeds some given value.

# You will need to define your own fulfills_criter ion_a and
# fulfills_criter ion_b, but here's a couple of mock functions
# for testing with:

def fulfills_criter ion_a(row, column):
return random.random() < 0.5

fulfills_criter ion_b = fulfills_criter ion_a

def find_match(p=0. 2):
while True:
r = random.random()
i = index(CM, r)
row, column = index_to_rowcol umn(i, len(M[0]))
if fulfills_criter ion_a(row, column) or \
fulfills_criter ion_b(row, column):
return row, column
else:
if random.random() <= p:
return row, column

And here's my test:

>>find_match( )

(1, 2)

Hope this helps.

--
Steven

Most pythonic way of weighted random selection

Most pythonic way of weighted random selection

Comment

Comment

Comment