Selecting a random row

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Kari Lavikka

    Selecting a random row


    Hi!

    I have to select a random row from a table where primary key isn't
    continuous (some rows have been deleted). Postgres just seems to do
    something strange with my method.

    --
    -- Use the order by desc limit 1 -trick to get maximum value
    --
    CREATE OR REPLACE FUNCTION max_uid() RETURNS int4 AS
    'SELECT uid FROM users WHERE status = ''a'' ORDER BY uid DESC LIMIT 1'
    LANGUAGE 'sql';


    --
    -- Choose a random point between 0 and max_uid and select the first
    -- value from the bigger part
    --
    CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS
    'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >=
    cast(cast(max_u id() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid
    ASC LIMIT 1'
    LANGUAGE 'sql';

    --
    -- testing and looks good
    --
    galleria=> SELECT max_uid();
    max_uid
    ---------
    126263

    --
    -- testing...
    --
    galleria=> SELECT random_uid(), random_uid(), random_uid(), random_uid(), random_uid();
    random_uid | random_uid | random_uid | random_uid | random_uid
    ------------+------------+------------+------------+------------
    322 | 601 | 266 | 427 | 369

    .... but what is this? Values seem to vary from 0 to ~1000.
    Not from 0 to 126263!!

    How about doing some manual work...

    --
    -- Testing split point selection
    --
    galleria=> SELECT cast(cast(max_u id() - 1 AS FLOAT) * random() AS INTEGER);
    int4
    -------
    43279

    --
    -- And inserting split point manually
    --
    galleria=> SELECT uid FROM users u WHERE u.status = 'a' AND uid >= 43279
    ORDER BY uid ASC LIMIT 1;
    uid
    -------
    43284

    Works just fine!


    Is there any explanation for this strange behavior or are there better
    ways to select a random row?

    I'm using PG 8.0 b2. Plan for the query is:

    Limit (cost=0.00..5.1 9 rows=1 width=4)
    -> Index Scan using users_pkey on users u (cost=0.00..691 45.26 rows=13329 width=4)
    Filter: ((status = 'a'::bpchar) AND (uid >= ((((max_uid() - 1))::double precision * random()))::int eger))

    |\__/|
    ( oo ) Kari Lavikka - tuner@bdb.fi
    __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
    ""

    ---------------------------(end of broadcast)---------------------------
    TIP 4: Don't 'kill -9' the postmaster

  • Holger Klawitter

    #2
    Re: Selecting a random row

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    On Thursday 04 November 2004 12:36, Kari Lavikka wrote:[color=blue]
    > Is there any explanation for this strange behavior or are there better
    > ways to select a random row?[/color]

    How about

    SELECT ...whatever... ORDER BY random() LIMIT 1;

    Mit freundlichem Gruß / With kind regards
    Holger Klawitter
    - --
    lists <at> klawitter <dot> de
    -----BEGIN PGP SIGNATURE-----
    Version: GnuPG v1.2.2 (GNU/Linux)

    iD8DBQFBiibF1Xd t0HKSwgYRAlJXAJ 4nUpDfKBKCigPVM t8WpKG4gZmt4wCc D/ZC
    KHBlBl1+5FZ4pgq kZlyzWQA=
    =MrrE
    -----END PGP SIGNATURE-----

    ---------------------------(end of broadcast)---------------------------
    TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org

    Comment

    • Kari Lavikka

      #3
      Re: Selecting a random row


      Works but is too slooow. Shuffling whole table and selecting the first
      row is not the way to go in this case.

      Limit (cost=5340.74.. 5340.74 rows=1 width=4)
      -> Sort (cost=5340.74.. 5440.70 rows=39986 width=4)
      Sort Key: random()
      -> Seq Scan on users (cost=0.00..228 4.37 rows=39986 width=4)
      Filter: (status = 'a'::bpchar)

      |\__/|
      ( oo ) Kari Lavikka - tuner@bdb.fi
      __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
      ""

      On Thu, 4 Nov 2004, Holger Klawitter wrote:
      [color=blue]
      > -----BEGIN PGP SIGNED MESSAGE-----
      > Hash: SHA1
      >
      > On Thursday 04 November 2004 12:36, Kari Lavikka wrote:[color=green]
      > > Is there any explanation for this strange behavior or are there better
      > > ways to select a random row?[/color]
      >
      > How about
      >
      > SELECT ...whatever... ORDER BY random() LIMIT 1;
      >
      > Mit freundlichem Gruß / With kind regards
      > Holger Klawitter
      > - --
      > lists <at> klawitter <dot> de
      > -----BEGIN PGP SIGNATURE-----
      > Version: GnuPG v1.2.2 (GNU/Linux)
      >
      > iD8DBQFBiibF1Xd t0HKSwgYRAlJXAJ 4nUpDfKBKCigPVM t8WpKG4gZmt4wCc D/ZC
      > KHBlBl1+5FZ4pgq kZlyzWQA=
      > =MrrE
      > -----END PGP SIGNATURE-----
      >
      > ---------------------------(end of broadcast)---------------------------
      > TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org
      >[/color]

      ---------------------------(end of broadcast)---------------------------
      TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org

      Comment

      • Csaba Nagy

        #4
        Re: Selecting a random row

        IIRC, this was discussed a few times on this list, searching the
        archives might get you some results. AFAIR, the only way to do it
        efficiently is to have a column specially assigned for this purpose, and
        populate it with random numbers in a big range. The column should be
        indexed to assure fast access based on it. Then you can select the first
        row with that column's value bigger (or smaller if you like) than a
        random number in the same range.

        HTH,
        Csaba.

        On Thu, 2004-11-04 at 14:34, Kari Lavikka wrote:[color=blue]
        > Works but is too slooow. Shuffling whole table and selecting the first
        > row is not the way to go in this case.
        >
        > Limit (cost=5340.74.. 5340.74 rows=1 width=4)
        > -> Sort (cost=5340.74.. 5440.70 rows=39986 width=4)
        > Sort Key: random()
        > -> Seq Scan on users (cost=0.00..228 4.37 rows=39986 width=4)
        > Filter: (status = 'a'::bpchar)
        >
        > |\__/|
        > ( oo ) Kari Lavikka - tuner@bdb.fi
        > __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
        > ""
        >
        > On Thu, 4 Nov 2004, Holger Klawitter wrote:
        >[color=green]
        > > -----BEGIN PGP SIGNED MESSAGE-----
        > > Hash: SHA1
        > >
        > > On Thursday 04 November 2004 12:36, Kari Lavikka wrote:[color=darkred]
        > > > Is there any explanation for this strange behavior or are there better
        > > > ways to select a random row?[/color]
        > >
        > > How about
        > >
        > > SELECT ...whatever... ORDER BY random() LIMIT 1;
        > >
        > > Mit freundlichem Gruß / With kind regards
        > > Holger Klawitter
        > > - --
        > > lists <at> klawitter <dot> de
        > > -----BEGIN PGP SIGNATURE-----
        > > Version: GnuPG v1.2.2 (GNU/Linux)
        > >
        > > iD8DBQFBiibF1Xd t0HKSwgYRAlJXAJ 4nUpDfKBKCigPVM t8WpKG4gZmt4wCc D/ZC
        > > KHBlBl1+5FZ4pgq kZlyzWQA=
        > > =MrrE
        > > -----END PGP SIGNATURE-----
        > >
        > > ---------------------------(end of broadcast)---------------------------
        > > TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org
        > >[/color]
        >
        > ---------------------------(end of broadcast)---------------------------
        > TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org[/color]


        ---------------------------(end of broadcast)---------------------------
        TIP 8: explain analyze is your friend

        Comment

        • Kari Lavikka

          #5
          Re: Selecting a random row


          Replying to myself..

          Actually I found an answer. If a I wrap the split point selection to
          subquery then the range of results is from 0 to maximum value (~120k in
          this case)

          galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
          (select cast(cast((SELE CT uid FROM users WHERE status = 'a' ORDER BY uid
          DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT 1;
          uid
          -------
          91937
          (1 row)

          Limit (cost=1.73..3.5 3 rows=1 width=4)
          InitPlan
          -> Result (cost=1.71..1.7 3 rows=1 width=0)
          InitPlan
          -> Limit (cost=0.00..1.7 1 rows=1 width=4)
          -> Index Scan Backward using users_pkey on users (cost=0.00..684 23.70 rows=39986 width=4)
          Filter: (status = 'a'::bpchar)
          -> Index Scan using users_pkey on users u (cost=0.00..239 83.04 rows=13329 width=4)
          Index Cond: (uid >= $1)
          Filter: (status = 'a'::bpchar)


          However, without the additional nothing doing subquery the range of
          results is something like 0 to ~1000 which is of course wrong.


          galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
          cast(cast((SELE CT uid FROM users WHERE status = 'a' ORDER BY uid DESC
          LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1;
          uid
          -----
          587
          (1 row)


          Examining the query plan reveals that without subquery the random
          comparison is made for each row. That causes a kind of "premature
          selection".


          galleria=> explain SELECT u.uid FROM users u WHERE u.status = 'a' AND uid[color=blue]
          >= cast(cast((SELE CT uid FROM users WHERE status = 'a' ORDER BY uid DESC[/color]
          LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1;

          ------------------------------------------------------------------------------------------------------------
          Limit (cost=1.71..6.8 9 rows=1 width=4)
          InitPlan
          -> Limit (cost=0.00..1.7 1 rows=1 width=4)
          -> Index Scan Backward using users_pkey on users (cost=0.00..684 23.70 rows=39986 width=4)
          Filter: (status = 'a'::bpchar)
          -> Index Scan using users_pkey on users u (cost=0.00..690 42.18 rows=13329 width=4)
          Filter: ((status = 'a'::bpchar) AND (uid >= (((($0 - 1))::double precision * random()))::int eger))
          (7 rows)


          Well, it works now. Thanks anyway ;)


          |\__/|
          ( oo ) Kari Lavikka - tuner@bdb.fi - (050) 380 3808
          __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
          ""


          ---------------------------(end of broadcast)---------------------------
          TIP 1: subscribe and unsubscribe commands go to majordomo@postg resql.org

          Comment

          • Tom Lane

            #6
            Re: Selecting a random row

            Kari Lavikka <tuner@bdb.fi > writes:[color=blue]
            > --
            > -- Choose a random point between 0 and max_uid and select the first
            > -- value from the bigger part
            > --
            > CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS
            > 'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >=
            > cast(cast(max_u id() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid
            > ASC LIMIT 1'
            > LANGUAGE 'sql';[/color]

            This isn't going to do what you think because the random() function is
            re-evaluated at every row of the table. (For that matter, so is
            max_uid(), which means performance would suck even if it worked ...)

            I'd suggest rewriting in plpgsql so you can assign the (max_uid-1)*random()
            expression to a variable and then just use the variable in the SELECT.

            regards, tom lane

            ---------------------------(end of broadcast)---------------------------
            TIP 7: don't forget to increase your free space map settings

            Comment

            • gnari

              #7
              Re: Selecting a random row

              From: "Kari Lavikka" <tuner@bdb.fi >[color=blue]
              >
              > Actually I found an answer. If a I wrap the split point selection to
              > subquery then the range of results is from 0 to maximum value (~120k in
              > this case)
              >
              > galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
              > (select cast(cast((SELE CT uid FROM users WHERE status = 'a' ORDER BY uid
              > DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT[/color]
              1;[color=blue]
              > uid
              > -------
              > 91937
              > (1 row)[/color]

              Tthe problem with this is that this is not very random.
              If the uids 30000 to 39999 have been missing, but
              the uids are more or less contiguous apart from that,
              the uid 40000 would be 10000 times more likely to be selected
              than average.

              Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1
              could be practical.

              gnari



              ---------------------------(end of broadcast)---------------------------
              TIP 3: if posting/reading through Usenet, please send an appropriate
              subscribe-nomail command to majordomo@postg resql.org so that your
              message can get through to the mailing list cleanly

              Comment

              • Tom Lane

                #8
                Re: Selecting a random row

                "gnari" <gnari@simnet.i s> writes:[color=blue]
                > Tthe problem with this is that this is not very random.
                > If the uids 30000 to 39999 have been missing, but
                > the uids are more or less contiguous apart from that,
                > the uid 40000 would be 10000 times more likely to be selected
                > than average.[/color]

                There is some discussion of selecting random rows in the list archives.
                My recollection is that we decided the only way to be fast *and*
                truly random was to dedicate an extra column to be a random key.

                regards, tom lane

                ---------------------------(end of broadcast)---------------------------
                TIP 5: Have you checked our extensive FAQ?



                Comment

                • Kari Lavikka

                  #9
                  Re: Selecting a random row

                  [color=blue]
                  > Tthe problem with this is that this is not very random.
                  > If the uids 30000 to 39999 have been missing, but
                  > the uids are more or less contiguous apart from that,
                  > the uid 40000 would be 10000 times more likely to be selected
                  > than average.[/color]

                  There are some gaps but distribution of them is quite uniform. And results
                  seem to be random enuff for this particular purpose.
                  [color=blue]
                  > Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1
                  > could be practical.[/color]

                  Something like OFFSET (random() * 10) could be used for additional
                  randomness of course.

                  |\__/|
                  ( oo ) Kari Lavikka - tuner@bdb.fi
                  __ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
                  ""


                  ---------------------------(end of broadcast)---------------------------
                  TIP 7: don't forget to increase your free space map settings

                  Comment

                  Working...