design issues with large amounts of data

  • nflacco

    design issues with large amounts of data

    I'm tinkering around with a data collection system, and have come up
    with a very hackish way to store my data. For reference, I'm
    anticipating collecting at least 100 million different DataId
    whatevers per year, possibly much more.

    --- 366 data tables (one for each day of the year), each row being
    assigned a unique DataId (unique across all 366 tables too)
    --- 100 data_map tables, table 0 having all DataIds ending in 00, table
    99 having all DataIds ending in 99, and so on.

    This is mostly because a friend of mine who works with MySQL said it
    is very slow to index large tables, even if you work with mostly
    integers.
    However, I've read MySQL can handle millions of rows no problem, so it
    seems my basic design is overly complicated and will lead to tons of
    slowdowns thanks to all the joins.
    Another friend of mine suggested using file partitioning (though he
    uses MSSQL), so is that another option?

    Any advice?
  • Tom van Stiphout

    #2
    Re: design issues with large amounts of data

    On Thu, 3 Jul 2008 18:55:08 -0700 (PDT), nflacco
    <mail.flacco@gmail.com> wrote:

    A table per day is a REALLY BAD IDEA.
    Remember that indexed lookups are very efficient. 1 record out of 4
    billion can be found using 32 comparisons.
    Sure, indexing that table (if there was no index before) might take a
    while, but it doesn't have to be done more than once.
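
    Tom's figure follows from binary search over a sorted index: finding one
    key among 4 billion takes at most ceil(log2(4,000,000,000)) = 32
    comparisons. As a quick check (a sketch; the two-argument LOG(x, base)
    requires SQL Server 2012 or later):

    ```sql
    -- Maximum comparisons for a binary search over 4 billion sorted keys.
    SELECT CEILING(LOG(4000000000.0, 2)) AS max_comparisons;  -- 32
    ```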

    Why don't you run some performance tests on realistic hardware with a
    realistic data set?

    -Tom.



    • Dan Guzman

      #3
      Re: design issues with large amounts of data

      > Another friend of mine suggested using file partitioning (though he
      > uses MSSQL), so is that another option?
      Partitioning is good for managing very large tables because you can rebuild
      individual partition indexes without touching the entire table. This
      reduces rebuild time and intermediate space requirements. Be aware that the
      partitioning feature is available only in Enterprise and Developer editions.

      With a good indexing strategy, response time should ideally be proportional
      to the amount of data retrieved (barring cached data) regardless of whether
      or not partitioning is used. Partitioning by date can facilitate certain
      processes, like incremental data loads and purge/archival as well as certain
      types of queries. However, with or without partitioning, indexing is the
      key from a performance perspective.
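
      A date-partitioned table along the lines described here might be
      sketched as follows (table, column, and partition names are
      hypothetical, not from the thread; as noted, partitioning requires
      Enterprise or Developer edition):

      ```sql
      -- Sketch: one partition per month, so the newest partition absorbs
      -- new rows and individual partitions can be rebuilt independently.
      CREATE PARTITION FUNCTION pfByMonth (datetime)
      AS RANGE RIGHT FOR VALUES ('20080201', '20080301', '20080401');

      CREATE PARTITION SCHEME psByMonth
      AS PARTITION pfByMonth ALL TO ([PRIMARY]);

      CREATE TABLE dbo.SensorData (
          DataId      bigint   NOT NULL,
          ReadingDate datetime NOT NULL,
          Value       int      NOT NULL,
          CONSTRAINT PK_SensorData
              PRIMARY KEY CLUSTERED (ReadingDate, DataId)
      ) ON psByMonth (ReadingDate);
      ```

      Queries that filter on ReadingDate then touch only the relevant
      partitions, and an index rebuild can target a single partition rather
      than the whole table.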

      --
      Hope this helps.

      Dan Guzman
      SQL Server MVP


      "nflacco" <mail.flacco@gmail.com> wrote in message
      news:9ee68f4f-04e9-4039-8f8f-b09af9c4a8b9@d19g2000prm.googlegroups.com...
      [ SNIP ]


      • nflacco

        #4
        Re: design issues with large amounts of data

        On Jul 3, 8:14 pm, "Dan Guzman" <guzma...@nospam-online.sbcglobal.net>
        wrote:
        [ SNIP ]
        The re-indexing is what worries me. I'll be constantly adding new data
        to the main table (formerly the 366 day tables) if we follow the
        don't-use-too-many-tables scheme, as well as to the processed-data
        tables. :(


        • Dan Guzman

          #5
          Re: design issues with large amounts of data

          > The re-indexing is what worries me. I'll be constantly adding new data
          > to the main table (formerly the 366 day tables) if we follow the
          > don't-use-too-many-tables scheme, as well as to the processed-data
          > tables. :(
          I think partitioning by date will address the indexing issue since you can
          rebuild individual partitions (assuming Enterprise Edition is available for
          your application). In the case of separate tables, I would go with Erland's
          suggestion of monthly tables and a partitioned view. One table per day
          seems overkill to me.
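
          The monthly-tables-plus-partitioned-view approach might be sketched
          like this (table and view names are hypothetical; the CHECK
          constraints are what let the optimizer skip tables outside the
          queried date range):

          ```sql
          -- Sketch: one table per month, each with a CHECK constraint on
          -- its date range, unioned behind a view.
          CREATE TABLE dbo.Data200807 (
              DataId      bigint   NOT NULL,
              ReadingDate datetime NOT NULL
                  CHECK (ReadingDate >= '20080701' AND ReadingDate < '20080801'),
              Value       int      NOT NULL,
              CONSTRAINT PK_Data200807 PRIMARY KEY (DataId, ReadingDate)
          );

          CREATE TABLE dbo.Data200808 (
              DataId      bigint   NOT NULL,
              ReadingDate datetime NOT NULL
                  CHECK (ReadingDate >= '20080801' AND ReadingDate < '20080901'),
              Value       int      NOT NULL,
              CONSTRAINT PK_Data200808 PRIMARY KEY (DataId, ReadingDate)
          );
          GO

          CREATE VIEW dbo.AllData AS
              SELECT DataId, ReadingDate, Value FROM dbo.Data200807
              UNION ALL
              SELECT DataId, ReadingDate, Value FROM dbo.Data200808;
          ```

          A query against dbo.AllData that filters on ReadingDate can then be
          resolved against a single monthly table.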

          --
          Hope this helps.

          Dan Guzman
          SQL Server MVP


          "nflacco" <mail.flacco@gmail.com> wrote in message
          news:be2ae17e-44d6-4497-839f-aed5ce46b4b1@v1g2000pra.googlegroups.com...
          [ SNIP ]


          • Arved Sandstrom

            #6
            Re: design issues with large amounts of data

            "Erland Sommarskog" <esquel@sommarskog.se> wrote in message
            news:Xns9AD16642DF5B0Yazorman@127.0.0.1...
            [ SNIP ]
            > You seem to be concerned with indexing, and it cannot be denied that
            > there might be a problem. Then again, if you build the index once,
            > and new data is added in such a way that there is little fragmentation,
            > it's not an issue at all. We would need to know more about how the
            > table looks, and how it's loaded, to give more specific advice.
            I've been reading this thread with interest. I have never had to deal with
            monster databases, for some suitable definition of "monster" (you'll know
            one when you see one, I guess). I started doing some research, which is
            ongoing; a couple of articles were very useful for an intermediate-level
            guy like myself. Several questions I have related to indexing are:

            1) I saw that MySQL has a delayed insert feature, that is, inserts wait
            until the table is not in use by any other thread, plus inserts get written
            as a block. I can see this being useful if there are lots of selects, and an
            insert can take a fair bit of time. Does SQL Server have anything like this?

            2) If the new data is being delivered in batches (large batches) it seems to
            me that scheduling a drop index - bulk insert - create index would be
            better, at least in many cases, since the index would get rebuilt from
            scratch. Is this correct?
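
            The drop / bulk load / rebuild pattern in question 2 might be
            sketched as follows (table, index, and file names are
            hypothetical; whether it beats loading with the index in place
            depends on how large the batch is relative to the table):

            ```sql
            -- Sketch: drop the nonclustered index, bulk load, rebuild it.
            DROP INDEX IX_SensorData_Date ON dbo.SensorData;

            BULK INSERT dbo.SensorData
            FROM 'C:\loads\sensordata_20080704.dat'
            WITH (TABLOCK, BATCHSIZE = 100000);

            CREATE INDEX IX_SensorData_Date ON dbo.SensorData (ReadingDate);
            ```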

            AHS



            • Erland Sommarskog

              #7
              Re: design issues with large amounts of data

              Arved Sandstrom (asandstrom@accesswave.ca) writes:
              > 1) I saw that MySQL has a delayed insert feature, that is, inserts wait
              > until the table is not in use by any other thread, plus inserts get
              > written as a block. I can see this being useful if there are lots of
              > selects, and an insert can take a fair bit of time. Does SQL Server
              > have anything like this?
              One wonders what happens with the connection that performs the INSERT?
              Does it block while the INSERT is delayed? Or does this mean that if a
              process inserts, commits and then selects its inserted data, it may
              not be there? That appears to be a violation of the ACID principle to me.

              No, there is nothing like that in SQL Server. (There is a deferred delete
              model, where rows are not always physically deleted immediately, but
              are freed up by a cleanup process; the rows are logically deleted,
              though, so this is just an implementation detail.)
              > 2) If the new data is being delivered in batches (large batches) it
              > seems to me that scheduling a drop index - bulk insert - create index
              > would be better, at least in many cases, since the index would get
              > rebuilt from scratch. Is this correct?
              As always when it comes to performance issues: you will have to benchmark.


              --
              Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se



              • Arved Sandstrom

                #8
                Re: design issues with large amounts of data

                "Erland Sommarskog" <esquel@sommarskog.se> wrote in message
                news:Xns9AD1ED7E17952Yazorman@127.0.0.1...
                Arved Sandstrom (asandstrom@accesswave.ca) writes:
                >1) I saw that MySQL has a delayed insert feature, that is, inserts wait
                >until the table is not in use by any other thread, plus inserts get
                >written as a block. I can see this being useful if there are lots of
                >selects, and an insert can take a fair bit of time. Does SQL Server have
                >anything like this?
                >
                > One wonders what happens with the connection that performs the INSERT?
                > Does it block while the INSERT is delayed? Or does this mean that if a
                > process inserts, commits and then selects its inserted data, it may
                > not be there? That appears to be a violation of the ACID principle to me.
                The INSERT DELAYED in MySQL returns to the client immediately... there is
                no blocking. The insert data are queued in memory until the insert. This
                means that delayed rows are not visible to SELECT statements until the
                actual insert. The MySQL documentation
                (http://dev.mysql.com/doc/refman/6.0/...t-delayed.html) has a lot of
                caveats and explanations, which upon further perusal sort of answers my
                question: no, one should not normally use the feature.
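
                For reference, the statement under discussion is just a DELAYED
                modifier on an ordinary insert (table and column names below are
                hypothetical):

                ```sql
                -- MySQL: the server returns to the client immediately; the row
                -- sits in a memory queue and is invisible to SELECT until the
                -- insert is actually performed.
                INSERT DELAYED INTO readings (data_id, reading_date, value)
                VALUES (42, '2008-07-04', 17);
                ```

                If the server goes down before the queue is flushed, the queued
                rows are lost, which is the crux of the objection above.
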
                > No, there is nothing like that in SQL Server. (There is a deferred
                > delete model, where rows are not always physically deleted
                > immediately, but are freed up by a cleanup process; the rows are
                > logically deleted, though, so this is just an implementation detail.)
                >
                >2) If the new data is being delivered in batches (large batches) it
                >seems to me that scheduling a drop index - bulk insert - create index
                >would be better, at least in many cases, since the index would get
                >rebuilt from scratch. Is this correct?
                >
                > As always when it comes to performance issues: you will have to benchmark.
                True enough.
                --
                Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
                >
                Books Online for SQL Server 2005 at

                Books Online for SQL Server 2000 at
                http://www.microsoft.com/sql/prodinf...ons/books.mspx
                AHS



                • Erland Sommarskog

                  #9
                  Re: design issues with large amounts of data

                  Arved Sandstrom (asandstrom@accesswave.ca) writes:
                  > The INSERT DELAYED in MySQL returns to the client immediately... there
                  > is no blocking. The insert data are queued in memory until the insert.
                  > This means that delayed rows are not visible to SELECT statements until
                  > the actual insert.
                  And if the server goes down before that, the rows are never persisted?
                  > sort of answers my question: no, one should not normally use the
                  > feature.
                  Certainly! Possibly they should rename it to INSERT MAYBE? :-)

                  Then again, with SQL Server (and surely other products as well), some
                  client APIs offer asynchronous calls, so that you can send away an SQL
                  statement and pick up the result later. The difference here is that the
                  caller is actually compelled to pick up the acknowledgement, and cannot
                  submit a new command on that connection until this has been done. And
                  most of all, the RDBMS as such does not violate ACID.

                  --
                  Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se



                  • Arved Sandstrom

                    #10
                    Re: design issues with large amounts of data

                    "Erland Sommarskog" <esquel@sommarskog.se> wrote in message
                    news:Xns9AD2730239061Yazorman@127.0.0.1...
                    Arved Sandstrom (asandstrom@accesswave.ca) writes:
                    >The INSERT DELAYED in MySQL returns to the client immediately... there
                    >is no blocking. The insert data are queued in memory until the insert.
                    >This means that delayed rows are not visible to SELECT statements until
                    >the actual insert.
                    >
                    And if the server goes down before that, the rows are never persisted?
                    That is correct.
                    >sort of answers my question: no, one should not normally use the
                    >feature.
                    >
                    Certainly! Possibly they should rename it to INSERT MAYBE? :-)
                    One should treat it that way. MySQL also has INSERT LOW_PRIORITY, which
                    causes the client to wait until the insert is done. It appears not to have
                    an asynchronous call (such as you mention below for SQL Server) that
                    _compels_ the client not to do anything (command-wise) on that connection
                    until an ack; however, I'd assume that in those use cases where one feels
                    that INSERT DELAYED is called for, one would do a SHOW STATUS and get the
                    value of the 'Not_flushed_delayed_rows' variable before doing a SELECT on
                    the affected table.
                    > Then again, with SQL Server (and surely other products as well), some
                    > client APIs offer asynchronous calls, so that you can send away an SQL
                    > statement and pick up the result later.
                    Yes, I'd seen that in ADO.NET 2.0.
                    > The difference here is that the caller is actually compelled to pick
                    > up the acknowledgement, and cannot submit a new command on that
                    > connection until this has been done. And most of all, the RDBMS as
                    > such does not violate ACID.
                    PostgreSQL has this too. PQsendQuery (in the C libpq library, for example),
                    along with variants, sends a command without waiting for the result.
                    Invoking PQgetResult (again, the C function) returns a null pointer once
                    the command is complete. You cannot call PQsendQuery again on the same
                    connection until you get that null pointer from PQgetResult. There is also
                    asynchronous notification.

                    AHS



                    • Sybaseguru

                      #11
                      Re: design issues with large amounts of data

                      nflacco wrote:
                      > The re-indexing is what worries me. I'll be constantly adding new data
                      > to the main table (formerly the 366 day tables) if we follow the
                      > don't-use-too-many-tables scheme, as well as to the processed-data
                      > tables. :(
                      If you have a unique clustered index on (Day, DataId) where both are
                      incrementing, then there's no reason to reindex. As I understand your
                      proposal the data will always be appended to the table rather than causing
                      page splits.
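
                      The append-only clustered index described here might look like this
                      (table and column names are hypothetical):

                      ```sql
                      -- Sketch: because (Day, DataId) only ever increases, each new row
                      -- is appended to the last page of the clustered index, so inserts
                      -- do not cause page splits and the index does not fragment.
                      CREATE TABLE dbo.DayData (
                          Day    int    NOT NULL,   -- e.g. 20080704
                          DataId bigint NOT NULL,
                          Value  int    NOT NULL,
                          CONSTRAINT PK_DayData PRIMARY KEY CLUSTERED (Day, DataId)
                      );
                      ```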
