Proposal for a cascaded master-slave replication system

  • Jan Wieck

    Proposal for a cascaded master-slave replication system

    Dear community,

    for some reason the post I sent yesterday night still did not show up
    on the mailing lists. I have set up some links on the developers' site
    under http://developer.postgresql.org/~wieck/slony1.html


    The concept will be the base for some of my work as a Software Engineer
    here at Afilias USA INC. in the near future. Afilias is like many of you
    in need of reliable and performant replication solutions for backup and
    failover purposes. We started this work a couple of weeks ago by
    defining the goals and required features for our usage of PostgreSQL.

    Slony-I will be the first of two distinct replication systems designed
    with the 24/7 datacenter in mind.

    We want to build this system as a community project. The plan was from
    the beginning to release the product under the BSD license. And we think
    it is best to start it as such and to ask for suggestions during the
    design phase already.

    I would like to start developing the replication engine itself as soon
    as possible. And as a PostgreSQL CORE developer I will surely put some
    of my spare time into this as well. On the other hand, there is
    absolutely no design other than "they mostly call some stored
    procedures" done for the frontend tools yet, and I think that we need
    some really good admin tools in the end.

    I look forward to your comments.


    Jan

    --
    #==========================================================================#
    # It's easier to get forgiveness for being wrong than for being right.    #
    # Let's break this rule - forgive me.                                      #
    #=================================================== JanWieck@Yahoo.com    #






    ---------------------------(end of broadcast)---------------------------
    TIP 6: Have you searched our list archives?



  • Joe Conway

    #2
    Re: Proposal for a cascaded master-slave replication system

    Jan Wieck wrote:
    > http://developer.postgresql.org/~wieck/slony1.html

    Very interesting read. Nice work!
    > We want to build this system as a community project. The plan was from
    > the beginning to release the product under the BSD license. And we think
    > it is best to start it as such and to ask for suggestions during the
    > design phase already.

    I couldn't quite tell from the design doc -- do you intend to support
    conditional replication at a row level?

    I'm also curious, with cascaded replication, how do you handle the case
    where a second level slave has a transaction failure for some reason, i.e.:

                 M
                / \
               /   \
             Sa     Sb
            /  \   /  \
          Sc    Sd Se    Sf

    What happens if data is successfully replicated to Sa, Sb, Sc, and Sd,
    and then an exception/rollback occurs on Se?

    Joe



    • Jan Wieck

      #3
      Re: Proposal for a cascaded master-slave replication system

      Joe Conway wrote:
      > Jan Wieck wrote:
      >> http://developer.postgresql.org/~wieck/slony1.html
      >
      > Very interesting read. Nice work!
      >
      >> We want to build this system as a community project. The plan was from
      >> the beginning to release the product under the BSD license. And we think
      >> it is best to start it as such and to ask for suggestions during the
      >> design phase already.
      >
      > I couldn't quite tell from the design doc -- do you intend to support
      > conditional replication at a row level?

      If you mean to configure the system to replicate rows to different
      destinations (slaves) based on arbitrary qualifications, no. I had
      thought about it, but it does not really fit into the "datacenter and
      failover" picture, so it is not required to meet the goals and adds
      unnecessary complexity.

      This sort of feature is much more important for a replication system
      designed for hundreds or thousands of sporadic, asynchronous
      multi-master systems, the typical "salesman on the street" kind of
      replication.
      > I'm also curious, with cascaded replication, how do you handle the case
      > where a second level slave has a transaction failure for some reason, i.e.:
      >
      >              M
      >             / \
      >            /   \
      >          Sa     Sb
      >         /  \   /  \
      >       Sc    Sd Se    Sf
      >
      > What happens if data is successfully replicated to Sa, Sb, Sc, and Sd,
      > and then an exception/rollback occurs on Se?

      First, it does not replicate single transactions. It replicates batches
      of them together. Since the transactions are already committed (and
      possibly some others depending on them too), there is no way back - you
      lose Se.

      If this is only a temporary failure, like a power failure, and the
      database recovers fine on restart including the last confirmed SYNC
      event, it will catch up. (SYNC events get confirmed after they commit
      locally, but that is before the next checkpoint, so there is actually a
      gap where the slave could lose a committed transaction - and then it is
      lost for sure.)
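      The catch-up rule described here can be modeled in a few lines. This is
      a toy Python sketch of my own, not Slony-I code; the names `sync_log`,
      `apply_sync`, and `last_confirmed` are illustrative only:

```python
# Toy model (not Slony-I code) of batch-wise SYNC replication:
# a slave applies whole SYNC batches, confirms each one only after it
# committed locally, and on restart resumes from the last confirmation.
sync_log = {1: ["tx1", "tx2"], 2: ["tx3"], 3: ["tx4", "tx5"]}  # master's batches

class Slave:
    def __init__(self):
        self.applied = []
        self.last_confirmed = 0  # survives a crash only if committed locally

    def apply_sync(self, sync_id):
        # a batch and its confirmation commit together, or not at all
        self.applied.extend(sync_log[sync_id])
        self.last_confirmed = sync_id

    def catch_up(self):
        # replay every SYNC after the last one we know we confirmed
        for sync_id in sorted(sync_log):
            if sync_id > self.last_confirmed:
                self.apply_sync(sync_id)

s = Slave()
s.apply_sync(1)
# ... imagine a crash and restart here: as long as last_confirmed
# survived recovery, catch_up() replays exactly the missing batches ...
s.catch_up()
print(s.applied)  # every transaction applied exactly once, in order
```

      If the locally committed SYNC itself were lost in the
      commit-to-checkpoint gap after its confirmation already reached the
      master, no catch-up would be possible - that is the gap warned about
      above.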


      Jan


      • Christopher Browne

        #4
        Re: Proposal for a cascaded master-slave replication system

        In the last exciting episode, JanWieck@Yahoo.com (Jan Wieck) wrote:
        > I look forward to your comments.

        It is not evident from the paper what approach is taken to dealing
        with the duplicate key conflicts.

        The example:

        UPDATE table SET col1 = 'temp' where col = 'A';
        UPDATE table SET col1 = 'A' where col = 'B';
        UPDATE table SET col1 = 'B' where col = 'temp';

        I can think of several approaches to this:

        1. The present eRserv code reads what is in the table at the time of
        the 'snapshot', and so tries to pass on:

        update table set col1 = 'B' where otherkey = 123;
        update table set col1 = 'A' where otherkey = 456;

        which breaks because at some point, col1 is not unique, irrespective
        of what order we apply the changes in.

        2. If the contents as of the time of the COMMIT are stored in the log
        table, then we would do all three updates in the destination DB, in
        order, as shown above.

        Either we have to:
        a) Store the updated fields in the replication tables somewhere, or
        b) Make the third UPDATE wait for the updates to be stored in a
        file somewhere.

        3. The replication code requires that any given key only be updated
        once in a 'snapshot', so that the updates may be unambiguously
        partitioned:

        UPDATE table SET col1 = 'temp' where col = 'A' ; -- and otherkey = 123
        UPDATE table SET col1 = 'A' where col = 'B'; -- and otherkey = 456
        -- Must partition here before hitting #123 again --
        UPDATE table SET col1 = 'B' where col = 'temp'; -- and otherkey = 123

        The third UPDATE may have to be held up until the "partition" is set
        up, right?

        4. I seem to recall a recent discussion about the possibility of
        deferring the UNIQUE constraint 'til the END of a commit, with the
        result that we could simplify to

        update table set col1 = 'B' where otherkey = 123;
        update table set col1 = 'A' where otherkey = 456;

        and discover that the UNIQUE constraint was relaxed just long enough
        for us to make the TWO changes that in the end combined to being
        unique.

        None of these look like they turn out totally happily, or am I missing
        an approach?
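        The difference between approach 1 and approach 2 is easy to
        reproduce. The following sqlite3 sketch is mine, not eRserv code,
        and it uses col1 in the WHERE clauses where the example above
        writes 'col':

```python
import sqlite3

# Sketch: the snapshot-combined updates of approach 1 trip the UNIQUE
# index in either order, while replaying the original three statements
# in commit order (approach 2) succeeds.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (otherkey INTEGER PRIMARY KEY, col1 TEXT UNIQUE)")
db.execute("INSERT INTO t VALUES (123, 'A'), (456, 'B')")
db.commit()

# Approach 1: apply only the combined final values.
try:
    db.execute("UPDATE t SET col1 = 'B' WHERE otherkey = 123")
except sqlite3.IntegrityError as e:
    print("combined update fails:", e)  # col1 = 'B' is still taken by row 456
db.rollback()

# Approach 2: replay the full statement sequence; UNIQUE is never violated.
for stmt in (
    "UPDATE t SET col1 = 'temp' WHERE col1 = 'A'",
    "UPDATE t SET col1 = 'A' WHERE col1 = 'B'",
    "UPDATE t SET col1 = 'B' WHERE col1 = 'temp'",
):
    db.execute(stmt)
db.commit()
print(sorted(db.execute("SELECT otherkey, col1 FROM t")))
```

        Approach 4 (a UNIQUE constraint deferred to commit) would let the
        combined form through as well, but sqlite3 serves here only to show
        why the undeferred case breaks.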
        --
        wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','ntlug.org').

        "Java and C++ make you think that the new ideas are like the old ones.
        Java is the most distressing thing to hit computing since MS-DOS."
        -- Alan Kay

        Comment

        • Joe Conway

          #5
          Re: Proposal for a cascaded master-slave replication system

          Jan Wieck wrote:[color=blue]
          > If you mean to configure the system to replicate rows to different
          > destinations (slaves) based on arbitrary qualifications, no. I had
          > thought about it, but it does not really fit into the "datacenter and
          > failover" picture, so it is not required to meet the goals and adds
          > unnecessary complexity.
          >
          > This sort of feature is much more important for a replication system
          > designed for hundreds or thousands of sporadic, asynchronous
          > multi-master systems, the typical "salesman on the street" kind of
          > replication.[/color]

          OK, thanks. This actually fits any kind of distributed application. We
          have one that lives in our datacenters, but needs to replicate across
          both fast LAN/MAN and slow WAN. It is multimaster in the sense that
          individual data rows can be originated anywhere, but they are read-only
          in nodes other than where they were originated. Anyway, I'm using a
          hacked copy of dbmirror at the moment.
          [color=blue]
          > First, it does not replicate single transactions. It replicates batches
          > of them together. Since the transactions are already committed (and
          > possibly some other depending on them too), there is no way - you loose Se.[/color]

          OK, got it. Thanks.

          Joe




          • Jan Wieck

            #6
            Re: [HACKERS] Proposal for a cascaded master-slave replication system

            Hans-Jürgen Schönig wrote:
            > Jan,
            >
            > First of all, we really appreciate that this is going to be an Open
            > Source project.
            > There is something I wanted to add from a marketing point of view: I
            > have done many public talks in the past two years or so. There is one
            > question people keep asking me: "How about the pgreplication
            > project?" In every training course, at any conference, people keep
            > asking for synchronous replication. We have offered these people some
            > async solutions which are already out there, but nobody seems to be
            > interested in having them (my personal impression). People keep
            > asking for a sync approach via email, but nobody seems to care about
            > an async approach. This does not mean that async is bad, but we can
            > see a strong demand for synchronous replication.
            >
            > Meanwhile we seem to be in a situation where PostgreSQL is rather
            > competing against Oracle than against MySQL. In our case there are
            > more people asking for Oracle -> Pg migration than for MySQL -> Pg.
            > MySQL does not seem to be the great enemy, because most people know
            > that it is an inferior product anyway. What I want to point out is
            > that some people want an alternative to Oracle's Real Application
            > Cluster. They want load balancing and hot failover. Even data centers
            > asking for replication did not want an async approach in the past.

            Hans-Jürgen,

            we are well aware of the high demand for multi-master replication
            addressing load balancing and clustering. We have that need ourselves
            as well, and I plan to work on a follow-up project as soon as Slony-I
            is released. But as of now, we see a higher priority for a reliable
            master-slave system that includes the cascading and backup features
            described in my concept. There are a couple of similar products out
            there, I know. But show me one of them where you can fail over without
            the new master becoming a single point of failure. We have just
            recently seen ... or better, "were not able to see anything any more"
            ... how failures tend to ripple through systems - half of the US East
            Coast was dark. So where is the replication system where a slave
            becomes the "master", and not a standalone server? Show me one that
            has a clear concept of failback, one that has hot-join as a primary
            design goal. These are the features that I expect if something is
            labeled "Enterprise Level".

            As far as my ideas for multi-master go, it will be a synchronous
            solution using group communication. My idea is "group commit" instead
            of 2-phase ... and an early-stage test hack replicated some updates
            three weeks ago. The big challenge will be to integrate the two
            systems so that a node can start as an asynchronous Slony-I slave,
            catch up ... and switch over to synchronous multi-master without
            stopping the cluster. I have no clue yet how to do that, but I refuse
            to think smaller.


            Jan

            --
            #============== =============== =============== =============== ===========#
            # It's easier to get forgiveness for being wrong than for being right. #
            # Let's break this rule - forgive me. #
            #============== =============== =============== ====== JanWieck@Yahoo. com #


            ---------------------------(end of broadcast)---------------------------
            TIP 3: if posting/reading through Usenet, please send an appropriate
            subscribe-nomail command to majordomo@postg resql.org so that your
            message can get through to the mailing list cleanly

            Comment

            • Jan Wieck

              #7
              Re: [HACKERS] Proposal for a cascaded master-slave replication system

              Jordan Henderson wrote:
              > Jan,
              >
              > I am wondering if you are familiar with the work covered in
              > 'Recovery in Parallel Database Systems' by Svein-Olaf Hvasshovd
              > (Vieweg)? The book is an excellent, detailed description covering
              > high-availability DB implementations.

              No, but it sounds like something I always wanted to have.

              > I think you're right on by not thinking smaller!!

              Thanks.

              Jan
              > Jordan Henderson
              > On Wednesday 12 November 2003 10:45, Jan Wieck wrote:



              • Jan Wieck

                #8
                Re: Proposal for a cascaded master-slave replication system

                Christopher Browne wrote:
                > In the last exciting episode, JanWieck@Yahoo.com (Jan Wieck) wrote:
                >> I look forward to your comments.
                >
                > It is not evident from the paper what approach is taken to dealing
                > with the duplicate key conflicts.
                >
                > The example:
                >
                > UPDATE table SET col1 = 'temp' where col = 'A';
                > UPDATE table SET col1 = 'A' where col = 'B';
                > UPDATE table SET col1 = 'B' where col = 'temp';
                >
                > I can think of several approaches to this:

                One fundamental flaw in eRServer is that it tries to "combine" multiple
                updates into one update at snapshot-time in the first place. The
                application can do these three steps in one single transaction, how do
                you split that?

                You could develop an automatic recovery for that. At the time
                you get a dupkey error, you roll back but remember the _rserv_ts
                and table_id that caused it. In the next sync attempt, you fetch
                the row with that _rserv_ts, delete all rows from the slave
                table with that primary key, and fake INSERT log rows on the
                master for the same. Then you prepare and apply and cross your
                fingers that nobody touched the same row again between your last
                attempt and now ... which was how many hours ago? And since you
                can only find one dupkey per round, you might do this a few
                times with larger and larger lists of (_rserv_ts, table_id).

                The idea of not accumulating log forever, but just holding this
                status table (the name "log" is misleading in eRServer; it holds
                flags telling "the row with _rserv_ts=nnnn got INS|UPD|DEL'd"),
                has one big advantage: however long your slave does not sync,
                your master will not run out of space.

                But I don't think there is value in the attempt to let a slave
                catch up the last 4 days at once anyway. Drop it and use COPY.
                When your slave does not come back up before you have modified
                half your database, it will be faster this way anyway.


                Jan



                • Andrew Sullivan

                  #9
                  Re: [HACKERS] Proposal for a cascaded master-slave replication system

                  On Wed, Nov 12, 2003 at 02:08:23PM +0100, Hans-Jürgen Schönig wrote:
                  > an inferior product anyway. What I want to point out is that some people
                  > want an alternative to Oracle's Real Application Cluster. They want load
                  > balancing and hot failover. Even data centers asking for replication did
                  > not want to have an async approach in the past.

                  I think Jan has already outlined his more-distant-future idea, but
                  I'd also like to know whether the people who are asking for a
                  replacement for RAC are willing to invest in it? You could buy some
                  _awfully_ good development time for even a year's worth of licensing
                  for RAC. I get the impression from the Postgres-R list that their
                  biggest obstacle is development resources.

                  <rant> People often like to say they need hot-fail-capable, five
                  nines, 24/7/365 systems. For most applications, I just do not
                  believe that, and the truth is that the cost of getting from three
                  nines to four (never mind five) is so great that people cheat: one
                  paragraph has the "five nines" clause, and the next paragraph talks
                  about scheduled downtime. In a real "five nines" system (the phone
                  company, say, or the air traffic control system), the time for
                  scheduled downtime is just the cumulative possible outage at any node
                  when it is being switched with its replacement. Five minutes a year
                  is a pretty high bar to jump, and most people long ago concluded that
                  you don't actually need it for most applications. </rant>

                  A


                  --
                  ----
                  Andrew Sullivan                 204-4141 Yonge Street
                  Afilias Canada                  Toronto, Ontario Canada
                  <andrew@libertyrms.info>        M2P 2A8
                  +1 416 646 3304 x110



                  • Andrew Sullivan

                    #10
                    Re: Proposal for a cascaded master-slave replication system

                    On Tue, Nov 11, 2003 at 03:38:53PM -0500, Christopher Browne wrote:
                    > In the last exciting episode, JanWieck@Yahoo.com (Jan Wieck) wrote:
                    >> I look forward to your comments.
                    >
                    > It is not evident from the paper what approach is taken to dealing
                    > with the duplicate key conflicts.
                    >
                    > The example:
                    >
                    > UPDATE table SET col1 = 'temp' where col = 'A';
                    > UPDATE table SET col1 = 'A' where col = 'B';
                    > UPDATE table SET col1 = 'B' where col = 'temp';

                    It's not a problem, because as the proposal states, the actual SQL is
                    to be sent in order to the slave. That is, only consistent sets are
                    sent: you can't have a condition on the slave that never could have
                    obtained on the master. This means greater overhead for cases where
                    the same row is altered repeatedly, but it's safe.

                    A

