Trying to get postgres to use an index

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Mike Wertheim

    Trying to get postgres to use an index


    Hi,

    I'm using PostgreSQL 8.

    I have two tables that I am doing a join on, and the join executes very
    slowly.

    The table called Notification has a text field called NotificationID,
    which is its primary key. The Notification table also has an int4 field
    called ItemID, and it has an index on the ItemID field. The table
    called Item has an int4 field called ItemID, which is its primary key.


    If I do a simple select on Notification using just the ItemID field, the
    index is used...

    explain select notificationID from NOTIFICATION n where n.itemID = 12;
    QUERY PLAN

    ------------------------------------------------------------------------
    ---------------------
    Index Scan using notification_4_ idx on notification n
    (cost=0.00..129 .22 rows=57 width=44)
    Index Cond: (itemid = 12)

    This query runs in far less than one second.



    But if I do a join, the index isn't used...

    explain select notificationID from NOTIFICATION n, ITEM i where
    n.itemID = i.itemID;
    QUERY PLAN

    ------------------------------------------------------------------------
    ------
    Hash Join (cost=47162.85. .76291.32 rows=223672 width=44)
    Hash Cond: ("outer".ite mid = "inner".ite mid)
    -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671
    width=48)
    -> Hash (cost=42415.28. .42415.28 rows=741028 width=4)
    -> Seq Scan on item i (cost=0.00..424 15.28 rows=741028
    width=4)

    This query takes about 20 seconds to run.


    I have run "vacuum analyze", and it didn't make any difference.

    I've seen people say that sometimes the query optimizer will decide to
    not use an index if it thinks that doing a sequential scan would be
    faster. I don't know if that's what's happening here, but it seems to
    me that using the index should be much faster than the performance I'm
    getting here.

    Does anyone have any suggestions on how to make this query run faster?



    ---------------------------(end of broadcast)---------------------------
    TIP 7: don't forget to increase your free space map settings

  • Troels Arvin

    #2
    Re: Trying to get postgres to use an index

    On Sat, 06 Nov 2004 12:00:02 -0800, Mike Wertheim wrote:
    [color=blue]
    > Does anyone have any suggestions on how to make this query run faster?[/color]

    Does it help if you decrease the value of random_page_cos t? - That value
    can be changed run-time, from within psql. If you find that a certain,
    lower value helps, you can make it permanent in postgresql.conf .

    --
    Greetings from Troels Arvin, Copenhagen, Denmark



    ---------------------------(end of broadcast)---------------------------
    TIP 9: the planner will ignore your desire to choose an index scan if your
    joining column's datatypes do not match

    Comment

    • Pierre-Frédéric Caillaud

      #3
      Re: Trying to get postgres to use an index

      > explain select notificationID from NOTIFICATION n, ITEM i where[color=blue]
      > n.itemID = i.itemID;
      > QUERY PLAN
      >
      > ------------------------------------------------------------------------
      > ------
      > Hash Join (cost=47162.85. .76291.32 rows=223672 width=44)
      > Hash Cond: ("outer".ite mid = "inner".ite mid)
      > -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671
      > width=48)
      > -> Hash (cost=42415.28. .42415.28 rows=741028 width=4)
      > -> Seq Scan on item i (cost=0.00..424 15.28 rows=741028
      > width=4)
      >
      > This query takes about 20 seconds to run.[/color]

      Well, you're joining the entire two tables, so yes, the seq scan might be
      faster.
      Try your query with enable_seqscan= 0 so it'll use an index scan and
      compare the times.
      You may be surprised to find that the planner has indeed made the right
      choice.
      This query selects 223672 rows, are you surprised it's slow ?

      What are you trying to do with this query ? Is it executed often ?
      If you want to select only a subset of this, use an additional where
      condition and the planner will use the index.

      ---------------------------(end of broadcast)---------------------------
      TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@postg resql.org so that your
      message can get through to the mailing list cleanly

      Comment

      • Joel Stevenson

        #4
        Re: Trying to get postgres to use an index

        At 10:11 PM +0100 11/6/04, Pierre-Frédéric Caillaud wrote:[color=blue][color=green]
        >>explain select notificationID from NOTIFICATION n, ITEM i where
        >>n.itemID = i.itemID;
        >> QUERY PLAN
        >>
        >>------------------------------------------------------------------------
        >>------
        >> Hash Join (cost=47162.85. .76291.32 rows=223672 width=44)
        >> Hash Cond: ("outer".ite mid = "inner".ite mid)
        >> -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671
        >>width=48)
        >> -> Hash (cost=42415.28. .42415.28 rows=741028 width=4)
        >> -> Seq Scan on item i (cost=0.00..424 15.28 rows=741028
        >>width=4)
        >>
        >>This query takes about 20 seconds to run.[/color]
        >
        > Well, you're joining the entire two
        >tables, so yes, the seq scan might be faster.
        > Try your query with enable_seqscan= 0 so
        >it'll use an index scan and compare the times.
        > You may be surprised to find that the
        >planner has indeed made the right choice.
        > This query selects 223672 rows, are you surprised it's slow ?[/color]

        I'm not a SQL guru by any stretch but would a
        constrained sub-select be appropriate here?

        e.g. a simple test setup where each record in
        table test1 has a FK referenced to an entry in
        test:

        joels=# \d test
        Table "public.tes t"
        Column | Type | Modifiers
        --------+--------------+-----------
        id | integer | not null
        foo | character(3) |
        Indexes:
        "test_pkey" primary key, btree (id)

        joels=# \d test1
        Table "public.tes t1"
        Column | Type | Modifiers
        ---------+---------+-----------
        id | integer | not null
        test_id | integer |
        Indexes:
        "test1_pkey " primary key, btree (id)
        "test1_test_id_ idx" btree (test_id)
        Foreign-key constraints:
        "$1" FOREIGN KEY (test_id) REFERENCES test(id) ON DELETE CASCADE

        joels=# select count(*) from test;
        count
        -------
        10001
        (1 row)

        joels=# select count(*) from test1;
        count
        -------
        10001
        (1 row)

        joels=# explain select test_id from test1 t1, test t where t1.test_id =t.id;
        QUERY PLAN
        ------------------------------------------------------------------------
        Hash Join (cost=170.01..4 95.05 rows=10002 width=4)
        Hash Cond: ("outer".test_i d = "inner".id)
        -> Seq Scan on test1 t1 (cost=0.00..150 .01 rows=10001 width=4)
        -> Hash (cost=145.01..1 45.01 rows=10001 width=4)
        -> Seq Scan on test t (cost=0.00..145 .01 rows=10001 width=4)
        (5 rows)

        joels=# explain select test_id from test1 t1
        where test_id in (select id from test where id =
        t1.test_id);
        QUERY PLAN
        ------------------------------------------------------------------------------
        Seq Scan on test1 t1 (cost=0.00..152 69.02 rows=5001 width=4)
        Filter: (subplan)
        SubPlan
        -> Index Scan using test_pkey on test (cost=0.00..3.0 1 rows=2 width=4)
        Index Cond: (id = $0)
        (5 rows)


        So with the subselect the query planner would use
        the primary key index on test when finding
        referencing records in the test1 table.

        Pierre, I seen the advice to use an additional
        where condition in certain cases to induce an
        index scan; how is this done?

        my 1.2 pennies,
        -Joel

        ---------------------------(end of broadcast)---------------------------
        TIP 7: don't forget to increase your free space map settings

        Comment

        • mike.wertheim@linkify.com

          #5
          Re: Trying to get postgres to use an index

          [color=blue]
          > I'm not a SQL guru by any stretch but would a
          > constrained sub-select be appropriate here?[/color]
          [color=blue]
          > Well, you're joining the entire two tables, so yes, the seq scan might
          > be faster.[/color]

          My mistake! When composing the email to state the problem, I accidentally
          gave a wrong version of the join query.

          Here is the corrected version, which still has the sequential scan...

          explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
          = i.itemID and i.projectID = 12;
          QUERY PLAN
          ---------------------------------------------------------------------------
          -----------------------
          Hash Join (cost=2237.54.. 15382.32 rows=271 width=44)
          Hash Cond: ("outer".ite mid = "inner".ite mid)
          -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671
          width=48)
          -> Hash (cost=2235.31.. 2235.31 rows=895 width=4)
          -> Index Scan using item_ix_item_4_ idx on item i
          (cost=0.00..223 5.31 rows=895width=4 )
          Index Cond: (projectid = 12)





          ---------------------------(end of broadcast)---------------------------
          TIP 6: Have you searched our list archives?



          Comment

          • Tom Lane

            #6
            Re: Trying to get postgres to use an index

            <mike.wertheim@ linkify.com> writes:[color=blue]
            > Here is the corrected version, which still has the sequential scan...[/color]
            [color=blue]
            > explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
            > = i.itemID and i.projectID = 12;
            > QUERY PLAN
            > ---------------------------------------------------------------------------
            > -----------------------
            > Hash Join (cost=2237.54.. 15382.32 rows=271 width=44)
            > Hash Cond: ("outer".ite mid = "inner".ite mid)
            > -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671
            > width=48)
            > -> Hash (cost=2235.31.. 2235.31 rows=895 width=4)
            > -> Index Scan using item_ix_item_4_ idx on item i
            > (cost=0.00..223 5.31 rows=895width=4 )
            > Index Cond: (projectid = 12)[/color]

            This seems like a perfectly fine plan to me. If it were turned around
            into a nested indexscan as you suggest, there would need to be 895
            indexscans of NOTIFICATION (one for each row retrieved from ITEM)
            and from your original mail we can see the planner thinks that an
            indexscan on NOTIFICATION will take about 129 cost units, for a total
            cost of 129 * 895 = 115455 units (and that's not counting the indexscan
            on ITEM nor any join overhead). So at least according to these
            estimates, using the index would take 10x more time than this plan.

            If you want to see whether this costing is accurate, you could do
            EXPLAIN ANALYZE for this way and the other (I expect that you'd get the
            other if you did "set enable_seqscan = off"). But with a 10x
            discrepancy I suspect the planner probably did the right thing.

            regards, tom lane

            ---------------------------(end of broadcast)---------------------------
            TIP 6: Have you searched our list archives?



            Comment

            • mike@linkify.com

              #7
              Re: Trying to get postgres to use an index

              [color=blue]
              > Well, you're joining the entire two tables, so yes, the seq scan might
              > be faster.[/color]

              My mistake. When composing the email to state the problem, I accidentally
              gave a wrong versionof the join query.

              Here is the corrected version, which still has the sequential scan...

              explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
              = i.itemID andi.projectID = 12;
              QUERY PLAN
              --------------------------------------------------------------------------------------------------
              Hash Join (cost=2237.54.. 15382.32 rows=271 width=44)
              Hash Cond: ("outer".ite mid = "inner".ite mid)
              -> Seq Scan on notification n (cost=0.00..120 23.71 rows=223671 width=48)
              -> Hash (cost=2235.31.. 2235.31 rows=895 width=4)
              -> Index Scan using item_ix_item_4_ idx on item i
              (cost=0.00..223 5.31 rows=895width=4 )
              Index Cond: (projectid = 12)






              ---------------------------(end of broadcast)---------------------------
              TIP 7: don't forget to increase your free space map settings

              Comment

              • Mike Wertheim

                #8
                Re: Trying to get postgres to use an index

                I have some more info on my indexing situation, and a new question.

                In my previous email, I told about 2 tables: Notification and Item,
                which join on a field called ItemID. The joining query didn't execute
                as quickly as I thought it should. I now notice that I have another
                table, Folder, which joins with Item in a similar way, and the
                performance of that join is excellent.

                So my new questions is... What makes the Folder join faster than the
                Notification join?


                Here is some info on the tables, queries, and "explain analyze"
                output...

                Item's primary key is ItemID (int4).
                Folder's primary key is ItemID (int4). Folder also contains 4 varchar
                columns, 2 text columns, 6 bool columns, 7 datetime columns and 1 int4
                column.
                Notification has an index on its ItemID (int4) field. Notification also
                contains 7 text columns (1 of them being the primary key), 3 timestamp
                columns and 4 int4 columns.

                Folder and Notification have a similar number of rows. "select count(*)
                from folder" returns 193043. "select count(*) from notification"
                returns 223689.


                The first query is: "select count(*) from FOLDER f, ITEM i where
                f.itemID = i.itemID and i.projectid=772 0". This query returns the
                result "5" and executes in less than 1 second.

                The second query is: "select count(*) from NOTIFICATION n, ITEM i where
                n.itemID = i.itemID and i.projectid=772 0". This query returns the
                result "2" and executes in about 40 seconds.


                Here's the "explain analyze" output...

                The Folder query uses the indexes:
                explain analyze select count(*) from FOLDER f, ITEM i where f.itemID =
                i.itemID and i.projectid=772 0; Aggregate (cost=6371.88.. 6371.88
                rows=1 width=0) (actual time=83.557..83 .558 rows=1 loops=1)
                -> Nested Loop (cost=0.00..637 1.31 rows=227 width=0) (actual
                time=17.929..83 .502 rows=5 loops=1)
                -> Index Scan using item_ix_item_4_ idx on item i
                (cost=0.00..210 5.51 rows=869 width=4) (actual time=0.098..19. 409 rows=51
                loops=1)
                Index Cond: (projectid = 7720)
                -> Index Scan using folder_pkey on folder f (cost=0.00..4.9 0
                rows=1 width=4) (actual time=1.255..1.2 55 rows=0 loops=51)
                Index Cond: (f.itemid = "outer".ite mid)
                Total runtime: 92.185 ms


                The Notification query does a sequential scan on Notification:
                explain analyze select count(*) from NOTIFICATION n, ITEM i where
                n.itemID = i.itemID and i.projectid=772 0;
                Aggregate (cost=38732.31. .38732.31 rows=1 width=0) (actual
                time=40380.497. .40380.498 rows=1 loops=1)
                -> Hash Join (cost=2107.69.. 38731.65 rows=263 width=0) (actual
                time=36341.174. .40380.447 rows=2 loops=1)
                Hash Cond: ("outer".ite mid = "inner".ite mid)
                -> Seq Scan on notification n (cost=0.00..355 02.89
                rows=223689 width=4) (actual time=8289.236.. 40255.341 rows=223689
                loops=1)
                -> Hash (cost=2105.51.. 2105.51 rows=869 width=4) (actual
                time=0.177..0.1 77 rows=0 loops=1)
                -> Index Scan using item_ix_item_4_ idx on item i
                (cost=0.00..210 5.51 rows=869 width=4) (actual time=0.025..0.1 27 rows=51
                loops=1)
                Index Cond: (projectid = 7720)
                Total runtime: 40380.657 ms


                So my question is... What difference do you see between the Folder and
                Notification tables that would account for such a big difference in
                query performance? And how can I make the Notification query run about
                as fast as the Folder query?



                ---------------------------(end of broadcast)---------------------------
                TIP 8: explain analyze is your friend

                Comment

                • Martijn van Oosterhout

                  #9
                  Re: Trying to get postgres to use an index

                  Firstly, check that all your columns are actually of the same type and
                  the indexes are where you say they are. Using \d will show this.
                  Secondly, if you do the EXPLAIN ANALYZE with "set enable_seqscan= off",
                  what is the output?

                  Hope this helps,

                  On Tue, Nov 09, 2004 at 05:02:01PM -0800, Mike Wertheim wrote:[color=blue]
                  > I have some more info on my indexing situation, and a new question.
                  >
                  > In my previous email, I told about 2 tables: Notification and Item,
                  > which join on a field called ItemID. The joining query didn't execute
                  > as quickly as I thought it should. I now notice that I have another
                  > table, Folder, which joins with Item in a similar way, and the
                  > performance of that join is excellent.
                  >
                  > So my new questions is... What makes the Folder join faster than the
                  > Notification join?
                  >
                  >
                  > Here is some info on the tables, queries, and "explain analyze"
                  > output...
                  >
                  > Item's primary key is ItemID (int4).
                  > Folder's primary key is ItemID (int4). Folder also contains 4 varchar
                  > columns, 2 text columns, 6 bool columns, 7 datetime columns and 1 int4
                  > column.
                  > Notification has an index on its ItemID (int4) field. Notification also
                  > contains 7 text columns (1 of them being the primary key), 3 timestamp
                  > columns and 4 int4 columns.
                  >
                  > Folder and Notification have a similar number of rows. "select count(*)
                  > from folder" returns 193043. "select count(*) from notification"
                  > returns 223689.
                  >
                  >
                  > The first query is: "select count(*) from FOLDER f, ITEM i where
                  > f.itemID = i.itemID and i.projectid=772 0". This query returns the
                  > result "5" and executes in less than 1 second.
                  >
                  > The second query is: "select count(*) from NOTIFICATION n, ITEM i where
                  > n.itemID = i.itemID and i.projectid=772 0". This query returns the
                  > result "2" and executes in about 40 seconds.
                  >
                  >
                  > Here's the "explain analyze" output...
                  >
                  > The Folder query uses the indexes:
                  > explain analyze select count(*) from FOLDER f, ITEM i where f.itemID =
                  > i.itemID and i.projectid=772 0; Aggregate (cost=6371.88.. 6371.88
                  > rows=1 width=0) (actual time=83.557..83 .558 rows=1 loops=1)
                  > -> Nested Loop (cost=0.00..637 1.31 rows=227 width=0) (actual
                  > time=17.929..83 .502 rows=5 loops=1)
                  > -> Index Scan using item_ix_item_4_ idx on item i
                  > (cost=0.00..210 5.51 rows=869 width=4) (actual time=0.098..19. 409 rows=51
                  > loops=1)
                  > Index Cond: (projectid = 7720)
                  > -> Index Scan using folder_pkey on folder f (cost=0.00..4.9 0
                  > rows=1 width=4) (actual time=1.255..1.2 55 rows=0 loops=51)
                  > Index Cond: (f.itemid = "outer".ite mid)
                  > Total runtime: 92.185 ms
                  >
                  >
                  > The Notification query does a sequential scan on Notification:
                  > explain analyze select count(*) from NOTIFICATION n, ITEM i where
                  > n.itemID = i.itemID and i.projectid=772 0;
                  > Aggregate (cost=38732.31. .38732.31 rows=1 width=0) (actual
                  > time=40380.497. .40380.498 rows=1 loops=1)
                  > -> Hash Join (cost=2107.69.. 38731.65 rows=263 width=0) (actual
                  > time=36341.174. .40380.447 rows=2 loops=1)
                  > Hash Cond: ("outer".ite mid = "inner".ite mid)
                  > -> Seq Scan on notification n (cost=0.00..355 02.89
                  > rows=223689 width=4) (actual time=8289.236.. 40255.341 rows=223689
                  > loops=1)
                  > -> Hash (cost=2105.51.. 2105.51 rows=869 width=4) (actual
                  > time=0.177..0.1 77 rows=0 loops=1)
                  > -> Index Scan using item_ix_item_4_ idx on item i
                  > (cost=0.00..210 5.51 rows=869 width=4) (actual time=0.025..0.1 27 rows=51
                  > loops=1)
                  > Index Cond: (projectid = 7720)
                  > Total runtime: 40380.657 ms
                  >
                  >
                  > So my question is... What difference do you see between the Folder and
                  > Notification tables that would account for such a big difference in
                  > query performance? And how can I make the Notification query run about
                  > as fast as the Folder query?
                  >
                  >
                  >
                  > ---------------------------(end of broadcast)---------------------------
                  > TIP 8: explain analyze is your friend[/color]

                  --
                  Martijn van Oosterhout <kleptog@svana. org> http://svana.org/kleptog/[color=blue]
                  > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
                  > tool for doing 5% of the work and then sitting around waiting for someone
                  > else to do the other 95% so you can sue them.[/color]

                  -----BEGIN PGP SIGNATURE-----
                  Version: GnuPG v1.0.6 (GNU/Linux)
                  Comment: For info see http://www.gnupg.org

                  iD8DBQFBkfmNY5T wig3Ge+YRAmqwAK CrZCLZAOf1ls309 w8ITMNhdwDhJgCe M5BF
                  NbpZi90BDGeVFCQ umU5Wd/o=
                  =NzQs
                  -----END PGP SIGNATURE-----

                  Comment

                  Working...