Huge Data

  • Sezai YILMAZ

    Huge Data

    Hi,

    I use PostgreSQL 7.4 to store a huge amount of data, for example 7
    million rows. But when I run the query "select count(*) from table;", it
    returns after about 120 seconds. Is this normal for such a huge table?
    Are there any methods to speed up the query time? The huge table has an
    integer primary key and some other indexes on other columns.

    The hardware is: PIII 800 MHz processor, 512 MB RAM, and IDE hard disk
    drive.

    -sezai

    ---------------------------(end of broadcast)---------------------------
    TIP 8: explain analyze is your friend

  • Richard Huxton

    #2
    Re: Huge Data

    On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
    > I use PostgreSQL 7.4 to store a huge amount of data, for example 7
    > million rows. But when I run the query "select count(*) from table;", it
    > returns after about 120 seconds. Is this normal for such a huge table?
    > Are there any methods to speed up the query time? The huge table has an
    > integer primary key and some other indexes on other columns.

    PG uses MVCC to manage concurrency. A downside of this is that to verify the
    exact number of rows in a table you have to visit them all.

    There's plenty on this in the archives, and probably the FAQ too.

    What are you using the count() for?

    --
    Richard Huxton
    Archonet Ltd





    • Sezai YILMAZ

      #3
      Re: Huge Data

      Richard Huxton wrote:
      >On Wednesday 14 January 2004 11:11, Sezai YILMAZ wrote:
      >>I use PostgreSQL 7.4 to store a huge amount of data, for example 7
      >>million rows. But when I run the query "select count(*) from table;", it
      >>returns after about 120 seconds. Is this normal for such a huge table?
      >
      >PG uses MVCC to manage concurrency. A downside of this is that to verify the
      >exact number of rows in a table you have to visit them all.
      >
      >There's plenty on this in the archives, and probably the FAQ too.
      >
      >What are you using the count() for?
      I use count() for some statistics, just to show how many records have
      been collected so far.

      -sezai



      • Sezai YILMAZ

        #4
        Re: Huge Data

        Richard Huxton wrote:
        >PG uses MVCC to manage concurrency. A downside of this is that to verify the
        >exact number of rows in a table you have to visit them all.
        >
        >There's plenty on this in the archives, and probably the FAQ too.
        >
        >What are you using the count() for?

        select logid, agentid, logbody from log where logid=3000000;

        This query also returns after about 120 seconds. The log table has about
        7 million records, and logid is its primary key. What about that? Why is
        it so slow?

        -sezai




        • Shridhar Daithankar

          #5
          Re: Huge Data

          On Wednesday 14 January 2004 17:57, Sezai YILMAZ wrote:
          > Richard Huxton wrote:
          > >What are you using the count() for?
          >
          > I use count() for some statistics, just to show how many records have
          > been collected so far.

          Rather than doing count(*), you should either cache the count in
          application memory, or analyze often and use the following:

          select reltuples from pg_class where relname = 'foo';

          This will give you an approximate count. I believe it should suffice
          for your needs.

          HTH

          Shridhar




          • Matthew Lunnon

            #6
            Re: Huge Data

            Have you run 'vacuum analyze log;'? Also I believe that in Oracle count(1) used to be quicker than count(*).
            Matthew
            ----- Original Message -----
            From: Sezai YILMAZ
            To: Richard Huxton
            Cc: pgsql-general@postgresql.org
            Sent: Wednesday, January 14, 2004 12:39 PM
            Subject: Re: [GENERAL] Huge Data


            Richard Huxton wrote:
            >PG uses MVCC to manage concurrency. A downside of this is that to verify the
            >exact number of rows in a table you have to visit them all.
            >
            >There's plenty on this in the archives, and probably the FAQ too.
            >
            >What are you using the count() for?

            select logid, agentid, logbody from log where logid=3000000;

            This query also returns after about 120 seconds. The log table has
            about 7 million records, and logid is its primary key. What about
            that? Why is it so slow?

            -sezai





            • Shridhar Daithankar

              #7
              Re: Huge Data

              On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
              > select logid, agentid, logbody from log where logid=3000000;
              >
              > This query also returns after about 120 seconds. The log table has
              > about 7 million records, and logid is its primary key. Why is it
              > so slow?

              How about

              select logid, agentid, logbody from log where logid='3000000';

              or

              select logid, agentid, logbody from log where logid=3000000::int4;

              Basically you need to typecast the constant. Then it will use the
              index.

              I am not sure about the first form, though. I recommend you use the
              latter form.

              Shridhar




              • Richard Huxton

                #8
                Re: Huge Data

                On Wednesday 14 January 2004 12:39, Sezai YILMAZ wrote:
                >
                > select logid, agentid, logbody from log where logid=3000000;

                At a guess, because logid is bigint, whereas 3000000 is taken to
                be integer. Try ... where logid = 3000000::bigint;

                This is in the FAQ too I think, and is certainly in the archives.

                Other things you might come across:
                SELECT max() involves a sequential scan just like count(), you can rewrite it
                as SELECT target_column FROM my_table ORDER BY target_column DESC LIMIT 1

                The config values are very conservative. You will definitely want to tune them
                for performance. See the articles here for a good introduction:


                The VACUUM command is used to reclaim unused space, and the ANALYZE command to
                regenerate statistics. It's worth reading up on both.

                You can use EXPLAIN ANALYSE <query here> to see the plan that PG uses. I think
                there's a discussion of it at http://techdocs.postgresql.org/
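                For example, comparing the two plans shows the effect of the cast
                (a sketch only; the actual plan output depends on your setup):

                ```sql
                -- Without the cast, the integer constant does not match the
                -- bigint index on 7.4, so the planner falls back to a
                -- sequential scan over all 7 million rows:
                EXPLAIN ANALYZE SELECT logid, agentid, logbody
                    FROM log WHERE logid = 3000000;

                -- With the constant cast to the column's type, the primary key
                -- index can be used:
                EXPLAIN ANALYZE SELECT logid, agentid, logbody
                    FROM log WHERE logid = 3000000::bigint;
                ```

                The first should report a Seq Scan in the plan and the second an
                Index Scan, with runtimes to match.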

                --
                Richard Huxton
                Archonet Ltd



                • Richard Huxton

                  #9
                  Re: Huge Data

                  On Wednesday 14 January 2004 12:27, Sezai YILMAZ wrote:
                  > Richard Huxton wrote:
                  > >There's plenty on this in the archives, and probably the FAQ
                  > >too.
                  > >
                  > >What are you using the count() for?
                  >
                  > I use count() for some statistics, just to show how many
                  > records have been collected so far.

                  If you want an accurate number without scanning the table, you'll need to use
                  a trigger to keep a count up to date.
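                  A minimal sketch of that approach (the side-table and function
                  names here are made up, using the quoted-body syntax that 7.4
                  expects):

                  ```sql
                  -- One-row side table holding the running count:
                  CREATE TABLE log_rowcount (n bigint NOT NULL);
                  INSERT INTO log_rowcount SELECT count(*) FROM log;

                  -- Trigger function: bump the counter on insert/delete.
                  CREATE FUNCTION log_count_trig() RETURNS trigger AS '
                  BEGIN
                      IF TG_OP = ''INSERT'' THEN
                          UPDATE log_rowcount SET n = n + 1;
                      ELSE
                          UPDATE log_rowcount SET n = n - 1;
                      END IF;
                      RETURN NULL;
                  END;
                  ' LANGUAGE plpgsql;

                  CREATE TRIGGER log_count AFTER INSERT OR DELETE ON log
                      FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

                  -- "SELECT n FROM log_rowcount" is then instant, and exact.
                  ```

                  Note the single counter row serializes concurrent writers on
                  the log table; that trade-off is the price of an exact count.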

                  --
                  Richard Huxton
                  Archonet Ltd



                  • Sezai YILMAZ

                    #10
                    Re: Huge Data

                    Shridhar Daithankar wrote:
                    >Rather than doing count(*), you should either cache the count
                    >in application memory, or analyze often and use the following:
                    >
                    >select reltuples from pg_class where relname = 'foo';
                    Thank you very much, Shridhar. This one responds immediately.
                    I think I will use this method to get the row count. But my
                    complaint is that it breaks the SQL standard; the code will
                    become unportable.

                    -sezai





                    • Franco Bruno Borghesi

                      #11
                      Re: Huge Data

                      If the mentioned solution fits your needs, you could create
                      a stored procedure that wraps it. The PostgreSQL
                      implementation could select from pg_class, while the same
                      function in another database could execute the select
                      count() on the table.
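                      A sketch of the PostgreSQL side of such a wrapper (the
                      function name is hypothetical):

                      ```sql
                      -- Portable wrapper: on PostgreSQL, read the planner's
                      -- estimate from pg_class; on another database, a function
                      -- of the same name would run the real SELECT count(*).
                      CREATE FUNCTION approx_count(text) RETURNS bigint AS '
                          SELECT reltuples::bigint FROM pg_class
                              WHERE relname = $1;
                      ' LANGUAGE sql;

                      -- Application code then stays identical everywhere:
                      -- SELECT approx_count('log');
                      ```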

                      On Wed, 2004-01-14 at 10:25, Sezai YILMAZ wrote:
                      > Shridhar Daithankar wrote:
                      > >Rather than doing count(*), you should either cache the
                      > >count in application memory, or analyze often and use the
                      > >following:
                      > >
                      > >select reltuples from pg_class where relname = 'foo';
                      >
                      > Thank you very much, Shridhar. This one responds
                      > immediately. I think I will use this method to get the row
                      > count. But my complaint is that it breaks the SQL
                      > standard; the code will become unportable.
                      >
                      > -sezai



                      • Sezai YILMAZ

                        #12
                        Re: Huge Data

                        Shridhar Daithankar wrote:
                        >On Wednesday 14 January 2004 18:22, Matthew Lunnon wrote:
                        >
                        >> select logid, agentid, logbody from log where logid=3000000;
                        >>
                        >> This query also returns after about 120 seconds. The
                        >> log table has about 7 million records, and logid is its
                        >> primary key. Why is it so slow?
                        >
                        >How about
                        >
                        >select logid, agentid, logbody from log where logid='3000000';
                        Oh my god. It is unbelievable. The result is great. Thanks
                        to everyone who helped me.

                        -sezai



                        • Shridhar Daithankar

                          #13
                          Re: Huge Data

                          On Wednesday 14 January 2004 18:55, Sezai YILMAZ wrote:
                          > Shridhar Daithankar wrote:
                          > >Rather than doing count(*), you should either cache
                          > >the count in application memory, or analyze often and
                          > >use the following:
                          > >
                          > >select reltuples from pg_class where relname = 'foo';
                          >
                          > Thank you very much, Shridhar. This one responds
                          > immediately. I think I will use this method to get the
                          > row count. But my complaint is that it breaks the SQL
                          > standard; the code will become unportable.

                          Well, you could document it somewhere for your
                          reference. It is not that hard... :-)

                          And remember, the value you get is just an estimate. You
                          need to analyze the table often, in line with its
                          update/insert/delete activity, to keep the estimate
                          reasonably accurate. Vacuuming will also update the
                          estimate.

                          Shridhar



