how many record versions

  • David Garamond

    how many record versions

    begin;
    update t set val=val+1; -- 1000 times
    commit;

    How many record versions does it create? 1 or 1000? I'm implementing a
    banner counter which is incremented at least 2-3 million times a day. I
    thought I'd cheat by only committing every few minutes. Would that
    work, or would I still create as many record versions?

    --
    dave

  • Bruno Wolff III

    #2
    Re: how many record versions

    On Sun, May 23, 2004 at 02:44:31 +0700,
    David Garamond <lists@zara.6.isreserved.com> wrote:
    > begin;
    > update t set val=val+1; -- 1000 times
    > commit;
    >
    > How many record versions does it create? 1 or 1000? I'm implementing a
    > banner counter which is incremented at least 2-3 million times a day. I
    > thought I'd cheat by only committing every few minutes. Would that
    > work, or would I still create as many record versions?

    You might be better off keeping the counter in its own table and vacuuming
    that table very often. It is unlikely that holding transactions open
    for several minutes is a good idea. Also, if you do multiple updates
    in a single transaction, you will still get multiple row versions.
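
    A minimal sketch of the counter-in-its-own-table approach Bruno suggests
    (the table name and single-row layout are assumptions for illustration):

    -- one narrow, single-row table per counter, vacuumed aggressively so
    -- the dead versions left by each UPDATE get reclaimed
    CREATE TABLE counter (val bigint NOT NULL);
    INSERT INTO counter VALUES (0);

    -- per impression:
    UPDATE counter SET val = val + 1;

    -- frequently, from cron:
    VACUUM counter;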

    • Tom Lane

      #3
      Re: how many record versions

      David Garamond <lists@zara.6.isreserved.com> writes:
      > begin;
      > update t set val=val+1; -- 1000 times
      > commit;

      > How many record versions does it create? 1 or 1000?

      1000.

      > I'm implementing a
      > banner counter which is incremented at least 2-3 million times a day. I
      > thought I'd cheat by only committing every few minutes. Would that
      > work, or would I still create as many record versions?

      Won't make a difference. You should think seriously about using a
      sequence rather than an ordinary table for this.

      regards, tom lane
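
      A minimal sketch of the sequence approach Tom suggests (the sequence
      name and the exact statements are assumptions for illustration):

      -- a sequence is not MVCC-versioned: nextval() updates it in place
      -- and leaves no dead row versions behind, unlike UPDATE on a table
      CREATE SEQUENCE banner_hits;

      -- per impression:
      SELECT nextval('banner_hits');

      -- read the running total:
      SELECT last_value FROM banner_hits;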

      • David Garamond

        #4
        Re: how many record versions

        Manfred Koizar wrote:
        >> begin;
        >> update t set val=val+1; -- 1000 times
        >> commit;
        >>
        >> How many record versions does it create? 1 or 1000?
        >
        > 1000
        >
        >> I'm implementing a
        >> banner counter which is incremented at least 2-3 million times a day.
        >
        > How many rows? I would VACUUM that table after every few hundred
        > updates or whenever 10% to 20% of the rows have been updated, whichever
        > is greater.

        Actually, each record will be incremented probably only thousands of
        times a day. But there are many banners. Each record has (bannerid,
        campaignid, websiteid, date, countrycode) "dimensions" and (impression,
        click) "measures". The table currently has roughly 1.5-2 million records
        (it's in MyISAM MySQL), so I'm not sure I can use as many sequences as
        Tom suggested. Every impression (banner view) and click results in one
        SQL statement (= one "transaction" in MyISAM, since MyISAM doesn't
        support BEGIN + COMMIT).

        I'm contemplating moving to Postgres, but am worried about the MVCC
        thing. I previously tried InnoDB in MySQL briefly but had to revert
        to MyISAM because the load increased significantly.

        --
        dave

        • Manfred Koizar

          #5
          Re: how many record versions

          On Sun, 23 May 2004 23:32:48 +0700, David Garamond
          <lists@zara.6.isreserved.com> wrote:
          > Actually, each record will be incremented probably only thousands of
          > times a day. But there are many banners. Each record has (bannerid,
          > campaignid, websiteid, date, countrycode) "dimensions" and (impression,
          > click) "measures".

          If you need all of bannerid, campaignid, websiteid, date, countrycode to
          identify a row, it may be worth the effort to split this up into two
          tables:

          CREATE TABLE dimensions (
              dimensionid int PRIMARY KEY,
              bannerid ...,
              campaignid ...,
              websiteid ...,
              date ...,
              countrycode ...,
              UNIQUE (bannerid, ..., countrycode)
          );

          CREATE TABLE measures (
              dimensionid int PRIMARY KEY REFERENCES dimensions,
              impression ...,
              click ...
          );

          Thus you'd only update measures thousands of times a day, and the
          index would be much more compact, because the PK is only a four-byte
          integer.

          > The table currently has roughly 1.5-2 million records (it's in
          > MyISAM MySQL), so I'm not sure I can use as many sequences as Tom
          > suggested. Every impression (banner view) and click results in one
          > SQL statement

          Schedule a
          VACUUM ANALYSE measures;
          for every 100,000 updates or so.

          > I'm contemplating moving to Postgres, but am worried about the MVCC
          > thing. I previously tried InnoDB in MySQL briefly but had to revert
          > to MyISAM because the load increased significantly.

          You mean InnoDB cannot handle the load?

          Servus
          Manfred
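
          A sketch of the hot-path write against Manfred's split schema (the
          literal values and the lookup flow are assumptions for illustration):

          -- look up the dimension key once and cache it in the application
          SELECT dimensionid FROM dimensions
          WHERE bannerid = 1 AND campaignid = 2 AND websiteid = 3
            AND date = current_date AND countrycode = 'ID';

          -- per impression, only the narrow measures row is updated
          UPDATE measures SET impression = impression + 1 WHERE dimensionid = 42;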

          • Tom Lane

            #6
            Re: how many record versions

            Manfred Koizar <mkoi-pg@aon.at> writes:
            > On Sun, 23 May 2004 23:32:48 +0700, David Garamond
            > <lists@zara.6.isreserved.com> wrote:
            >> I'm contemplating moving to Postgres, but am worried about the MVCC
            >> thing. I previously tried InnoDB in MySQL briefly but had to revert
            >> to MyISAM because the load increased significantly.

            > You mean InnoDB cannot handle the load?

            I suspect what he meant is that InnoDB had exactly the same performance
            issues with lots-of-dead-rows that Postgres will have. Around here,
            however, we are accustomed to that behavior and know how to deal with
            it, whereas I'll bet the MySQL community hasn't got that down yet ...

            regards, tom lane

            • Greg Stark

              #7
              Re: how many record versions


              David Garamond <lists@zara.6.isreserved.com> writes:

              > Actually, each record will be incremented probably only thousands of times a
              > day. But there are many banners. Each record has (bannerid, campaignid,
              > websiteid, date, countrycode) "dimensions" and (impression, click) "measures".

              In the past, when I had a very similar situation, we kept the raw impression
              and click event data. Ie, one record per impression in the impression table
              and one record per click in the click table.

              That makes the tables insert-only, which is efficient and not prone to locking
              contention. They would never have to be vacuumed except after purging old data.

              Then to accelerate queries we had denormalized aggregate tables with a cron
              job that did the equivalent of

              insert into agg_clicks (
                  select count(*), bannerid
                  from clicks
                  where date between ? and ?
                  group by bannerid
              )

              Where the ?s were actually hourly periods. Ie, at 12:15 it ran this query for
              the 11-12 period.

              This meant we didn't have immediate up-to-date stats on banners but it meant
              we did have stats on every single impression and click including time and
              information about the users.

              This worked out very well for reporting needs. If your system is using the
              data to handle serving the ads, though, it's a different kettle of fish. For
              that I think you'll want something that avoids having to do a database query
              for every single impression.


              --
              greg
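
              A sketch of the insert-only event log and hourly rollup Greg describes
              (table layouts, types, and the concrete hour bounds are assumptions):

              -- raw events: insert-only, never updated, so no dead tuples to vacuum
              CREATE TABLE clicks (
                  bannerid int NOT NULL,
                  clicked_at timestamp NOT NULL DEFAULT now()
              );

              -- denormalized aggregate, column order matching the SELECT below
              CREATE TABLE agg_clicks (
                  clicks bigint NOT NULL,
                  bannerid int NOT NULL
              );

              -- run from cron at 12:15 for the 11-12 period
              INSERT INTO agg_clicks (
                  SELECT count(*), bannerid
                  FROM clicks
                  WHERE clicked_at BETWEEN '2004-05-24 11:00' AND '2004-05-24 12:00'
                  GROUP BY bannerid
              );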

              • David Garamond

                #8
                Re: how many record versions

                Manfred Koizar wrote:
                > You mean InnoDB cannot handle the load?

                Perhaps it's more appropriate to say that the disk becomes the bottleneck.

                --
                dave

                • David Garamond

                  #9
                  Re: how many record versions

                  Greg Stark wrote:
                  >> Actually, each record will be incremented probably only thousands of times a
                  >> day. But there are many banners. Each record has (bannerid, campaignid,
                  >> websiteid, date, countrycode) "dimensions" and (impression, click) "measures".
                  >
                  > In the past, when I had a very similar situation, we kept the raw impression
                  > and click event data. Ie, one record per impression in the impression table
                  > and one record per click in the click table.
                  >
                  > That makes the tables insert-only, which is efficient and not prone to locking
                  > contention. They would never have to be vacuumed except after purging old data.

                  Assuming there are 10 million impressions per day, the impression
                  table will grow at least 200-400MB per day, is that correct? What do
                  you do, and how often do you purge old data? Do you do a mass DELETE
                  on the impression table itself, or do you switch to another table?
                  I've found that deleting tens or hundreds of thousands of rows, at
                  least in InnoDB, takes a long, long time (plus it sucks CPU and slows
                  other queries).
                  > Then to accelerate queries we had denormalized aggregate tables with a cron
                  > job that did the equivalent of
                  >
                  > insert into agg_clicks (
                  >     select count(*), bannerid
                  >     from clicks
                  >     where date between ? and ?
                  >     group by bannerid
                  > )

                  --
                  dave

                  • Marc Slemko

                    #10
                    Re: how many record versions

                    On Mon, 24 May 2004, David Garamond wrote:

                    > Manfred Koizar wrote:
                    > > You mean InnoDB cannot handle the load?
                    >
                    > Perhaps it's more appropriate to say that the disk becomes the bottleneck.

                    Was this attempting to do each update in a separate transaction?

                    If so, that is certainly expected with whatever disk-based
                    transactional database you use. With InnoDB, you could hack around it
                    by configuring InnoDB not to flush its log to disk at every
                    transaction, obviously at the risk of losing data if something
                    crashes.

                    From what I have seen, I would expect InnoDB's multiversioning to
                    work better for the use you describe than PostgreSQL's, due to how it
                    implements undo logs for updates. However, there could well be other
                    differences that could make PostgreSQL work better for your
                    application, depending on exactly what issues you are seeing.
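
                    The knob Marc is presumably alluding to (shown as an assumption; a
                    my.cnf line for InnoDB):

                    # 1 (the default) flushes the log at every commit; 0 or 2 defer the
                    # flush to roughly once per second, trading crash safety for speed
                    innodb_flush_log_at_trx_commit = 2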

                    • Greg Stark

                      #11
                      Re: how many record versions

                      David Garamond <lists@zara.6.isreserved.com> writes:

                      > Greg Stark wrote:
                      > >> Actually, each record will be incremented probably only thousands of times a
                      > >> day. But there are many banners. Each record has (bannerid, campaignid,
                      > >> websiteid, date, countrycode) "dimensions" and (impression, click) "measures".
                      > >
                      > > In the past, when I had a very similar situation, we kept the raw impression
                      > > and click event data. Ie, one record per impression in the impression table
                      > > and one record per click in the click table.
                      > >
                      > > That makes the tables insert-only, which is efficient and not prone to locking
                      > > contention. They would never have to be vacuumed except after purging old data.
                      >
                      > Assuming there are 10 million impressions per day, the impression table
                      > will grow at least 200-400MB per day, is that correct? What do you do, and
                      > how often do you purge old data? Do you do a mass DELETE on the impression
                      > table itself, or do you switch to another table? I've found that deleting
                      > tens or hundreds of thousands of rows, at least in InnoDB, takes a long,
                      > long time (plus it sucks CPU and slows other queries).


                      Well, this was actually under Oracle, but I can extrapolate to Postgres given
                      my experience.

                      The ideal tool for the job is a feature that Postgres has discussed but hasn't
                      implemented yet, "partitioned tables". Under Oracle with partitioned tables we
                      were able to drop entire partitions virtually instantaneously. It also made
                      copying the data out to near-line backups much more efficient than index
                      scanning.

                      Before we implemented partitioned tables we used both techniques you
                      described. At first we had an ad-hoc procedure of creating a new table and
                      swapping it out. But that involved a short downtime and was a manual process.
                      Eventually we set up an automated batch job which used deletes.

                      Deletes under postgres should be fairly efficient. The I/O use would be
                      unavoidable, so doing it during off-peak hours would still be good. But it
                      shouldn't otherwise interfere with other queries. There should be no locking
                      contention, no additional work for other queries (like checking rollback
                      segments or logs) or any of the other problems other databases suffer from
                      with large updates.

                      I find the 10 million impressions per day pretty scary though. That's over
                      100/s across the entire 24-hour period, and probably twice that at peak
                      hours. That would have to be one pretty beefy server just to handle the
                      transaction processing itself. (And updates under Postgres are essentially
                      inserts where vacuum cleans up the old tuple later, so they would be no
                      less taxing.) A server capable of handling that ought to be able to make
                      quick work of deleting a few hundred megabytes of records.

                      Another option is simply logging this data to a text file. Or multiple text
                      files one per server. Then you can load the text files with batch loads
                      offline. This avoids slowing down your servers handling the transactions in
                      the critical path. But it's yet more complex with more points for failure.

                      Something else you might be interested in is using a tool like this:



                      I could see it being useful for caching the counts you were looking to keep so
                      that the ad server doesn't need to consult the database to calculate which ad
                      to show. A separate job could periodically sync the counts to the database or
                      from the database.

                      --
                      greg
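
                      A sketch of the off-peak batch delete Greg describes (table, column,
                      and retention window are assumptions carried over from the earlier
                      sketch):

                      -- nightly cron job: purge raw events older than 30 days, then let
                      -- vacuum reclaim the dead tuples
                      DELETE FROM clicks WHERE clicked_at < now() - interval '30 days';
                      VACUUM ANALYZE clicks;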

                      • David Garamond

                        #12
                        Re: how many record versions

                        Greg Stark wrote:
                        > Another option is simply logging this data to a text file. Or multiple text

                        Yes, this is what we've been doing recently. We write to a set of text
                        files, and there's a process that commits to MySQL every 2-3 minutes
                        (and if the commit fails, we write to another text file to avoid
                        losing the data). It works, but I keep thinking how ugly the whole
                        thing is :-)
                        > files one per server. Then you can load the text files with batch loads
                        > offline. This avoids slowing down your servers handling the transactions in
                        > the critical path. But it's yet more complex with more points for failure.

                        --
                        dave
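
                        A sketch of the offline batch load Greg suggests (the file path and
                        table are hypothetical; COPY here reads a server-side file):

                        -- load one logged batch atomically; if the COPY fails, nothing is
                        -- applied and the file can simply be retried later
                        BEGIN;
                        COPY clicks FROM '/var/log/banners/clicks.2004-05-24.log';
                        COMMIT;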

                          • Joe Conway

                            #14
                            Re: how many record versions

                            Greg Stark wrote:
                            > Well, this was actually under Oracle, but I can extrapolate to Postgres
                            > given my experience.
                            >
                            > The ideal tool for the job is a feature that Postgres has discussed but
                            > hasn't implemented yet, "partitioned tables". Under Oracle with
                            > partitioned tables we were able to drop entire partitions virtually
                            > instantaneously. It also made copying the data out to near-line backups
                            > much more efficient than index scanning.

                            I think you can get a similar effect by using inherited tables. Create
                            one "master" table, and then inherit individual "partition" tables from
                            it. Then you can easily create or drop a "partition", while still
                            being able to query the "master" and see all the rows.

                            HTH,

                            Joe
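
                            A sketch of the inheritance trick Joe describes (table and column
                            names are assumptions):

                            -- the "master" table; children inherit its columns
                            CREATE TABLE impressions (
                                bannerid int NOT NULL,
                                created_at timestamp NOT NULL
                            );

                            -- one "partition" per month
                            CREATE TABLE impressions_2004_05 () INHERITS (impressions);

                            -- querying the master sees rows in every partition
                            SELECT count(*) FROM impressions;

                            -- purging a month is just dropping its table, nearly instantaneous
                            DROP TABLE impressions_2004_05;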

                            • Steve Atkins

                              #15
                              Re: how many record versions

                              On Mon, May 24, 2004 at 11:15:07AM -0700, Joe Conway wrote:
                              > Greg Stark wrote:
                              > > Well, this was actually under Oracle, but I can extrapolate to Postgres
                              > > given my experience.
                              > >
                              > > The ideal tool for the job is a feature that Postgres has discussed but
                              > > hasn't implemented yet, "partitioned tables". Under Oracle with
                              > > partitioned tables we were able to drop entire partitions virtually
                              > > instantaneously. It also made copying the data out to near-line backups
                              > > much more efficient than index scanning.
                              >
                              > I think you can get a similar effect by using inherited tables. Create
                              > one "master" table, and then inherit individual "partition" tables from
                              > it. Then you can easily create or drop a "partition", while still
                              > being able to query the "master" and see all the rows.

                              I've done this in production, and it works fairly well. It's not as
                              clean as true partitioned tables (a lot of things don't inherit),
                              but you can localise the nastiness in a pretty small bit of
                              application code.

                              Any query ends up looking like a long union of selects, which'll slow
                              things down somewhat, but I found that most of my queries had a
                              date-range selection on them, so I could take advantage of that in the
                              application code to query only a subset of the inherited tables for
                              most of the application-generated queries, while I could still do
                              ad-hoc work from the psql command line using the parent table.

                              Cheers,
                              Steve
