Subtle difference between C++0x MM and other MMs

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Dmitriy V'jukov

    Subtle difference between C++0x MM and other MMs

    Consider following Peterson's algorithm implementation from Wikipedia:


    flag[0] = 0
    flag[1] = 0
    turn = 0

    P0: flag[0] = 1
    turn = 1
    memory_barrier( )
    while( flag[1] && turn == 1 );
    // do nothing
    // critical section
    ...
    // end of critical section
    flag[0] = 0

    P1: flag[1] = 1
    turn = 0
    memory_barrier( )
    while( flag[0] && turn == 0 );
    // do nothing
    // critical section
    ...
    // end of critical section
    flag[1] = 0

    We can implement this in Java using volatile variables, and needed
    memory_barrier( ) will be emitted automatically by compiler.
    We can implement this in C# using volatile variables, and
    Thread.MemoryBa rrier() as memory_barrier( ).
    We can implement this in x86 MM using plain loads and stores, and
    mfence instruction as memory_barrier( ).
    We can implement this in C++0x using std::atomic<and issuing loads
    with memory_order_ac quire, stores with memory_order_re lease, and
    atomic_thread_f ence(memory_ord er_seq_cst) as memory_barrier( ). This is
    the most straightforward translation of Java/C#/x86 implementations .

    The only problem is that C++0x implementation will not work.
    Personally for me, it's quite counter-intuitive. And following
    question arise. What is the most simple way to translate some existing
    Java/C#/x86 algorithm implementation to C++0x? It seems that it's not
    so easy...

    Dmitriy V'jukov
  • Peter Dimov

    #2
    Re: Subtle difference between C++0x MM and other MMs

    On Aug 24, 7:46 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
    Consider following Peterson's algorithm implementation from Wikipedia:http://en.wikipedia.org/wiki/Peterson%27s_algorithm
    >
     flag[0]   = 0
     flag[1]   = 0
     turn      = 0
    >
     P0: flag[0] = 1
        turn = 1
        memory_barrier( )
        while( flag[1] && turn == 1 );
                // do nothing
        // critical section
        ...
        // end of critical section
        flag[0] = 0
    >
    P1: flag[1] = 1
        turn = 0
        memory_barrier( )
        while( flag[0] && turn == 0 );
                // do nothing
        // critical section
        ...
        // end of critical section
        flag[1] = 0
    >
    We can implement this in Java using volatile variables, and needed
    memory_barrier( ) will be emitted automatically by compiler.
    And the C++ equivalent is to use seq_cst load and stores, which are
    equivalent to Java volatiles.
    We can implement this in C# using volatile variables, and
    Thread.MemoryBa rrier() as memory_barrier( ).
    We can implement this in x86 MM using plain loads and stores, and
    mfence instruction as memory_barrier( ).
    We can implement this in C++0x using std::atomic<and issuing loads
    with memory_order_ac quire, stores with memory_order_re lease, and
    atomic_thread_f ence(memory_ord er_seq_cst) as memory_barrier( ). This is
    the most straightforward translation of Java/C#/x86 implementations .
    >
    The only problem is that C++0x implementation will not work.
    Why will it not work?

    Comment

    • Dmitriy V'jukov

      #3
      Re: Subtle difference between C++0x MM and other MMs

      On 24 Á×Ç, 21:52, Peter Dimov <pdi...@gmail.c omwrote:
      On Aug 24, 7:46 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
      >
      >
      >
      Consider following Peterson's algorithm implementation from Wikipedia:http://en.wikipedia.org/wiki/Peterson%27s_algorithm
      >
      flag[0] = 0
      flag[1] = 0
      turn = 0
      >
      P0: flag[0] = 1
      turn = 1
      memory_barrier( )
      while( flag[1] && turn == 1 );
      // do nothing
      // critical section
      ...
      // end of critical section
      flag[0] = 0
      >
      P1: flag[1] = 1
      turn = 0
      memory_barrier( )
      while( flag[0] && turn == 0 );
      // do nothing
      // critical section
      ...
      // end of critical section
      flag[1] = 0
      >
      We can implement this in Java using volatile variables, and needed
      memory_barrier( ) will be emitted automatically by compiler.
      >
      And the C++ equivalent is to use seq_cst load and stores, which are
      equivalent to Java volatiles.

      Yes, it's possible to implement any algorithm that relying on
      sequentially consistent memory model in C++0x using seq_cst atomic
      operations. But! Seq_cst atomic operations, especially stores, can be
      quite expensive. So, one has general desire to use weaker operations,
      like store-release and load-acquire. And in Java/C#/x86 it's possible
      to implement Peterson's algorithm using weak operations + 1 strong
      fence. In C++0x - NOT.

      We can implement this in C# using volatile variables, and
      Thread.MemoryBa rrier() as memory_barrier( ).
      We can implement this in x86 MM using plain loads and stores, and
      mfence instruction as memory_barrier( ).
      We can implement this in C++0x using std::atomic<and issuing loads
      with memory_order_ac quire, stores with memory_order_re lease, and
      atomic_thread_f ence(memory_ord er_seq_cst) as memory_barrier( ). This is
      the most straightforward translation of Java/C#/x86 implementations .
      >
      The only problem is that C++0x implementation will not work.
      >
      Why will it not work?

      I mean not every C++0x implementation of Peterson's algorithm, but
      particular implementation which uses store-release, load-acquire + 1
      seq_cst fence.


      Dmitriy V'jukov

      Comment

      • Peter Dimov

        #4
        Re: Subtle difference between C++0x MM and other MMs

        On Aug 24, 9:44 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
        And in Java/C#/x86 it's possible
        to implement Peterson's algorithm using weak operations + 1 strong
        fence. In C++0x - NOT.
        How would you implement Peterson's algorithm in Java using weak
        operations and a fence? Java doesn't have weak operations or fences.
        Its volatile loads and stores are equivalent to C++MM's seq_cst loads
        and stores. Both promise sequential consistency (no more and no less).
        I mean not every C++0x implementation of Peterson's algorithm, but
        particular implementation which uses store-release, load-acquire + 1
        seq_cst fence.
        Why do you think that this implementation doesn't work?

        Comment

        • Peter Dimov

          #5
          Re: Subtle difference between C++0x MM and other MMs

          On Aug 24, 11:53 pm, Peter Dimov <pdi...@gmail.c omwrote:
          On Aug 24, 9:44 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
          I mean not every C++0x implementation of Peterson's algorithm, but
          particular implementation which uses store-release, load-acquire + 1
          seq_cst fence.
          >
          Why do you think that this implementation doesn't work?
          I think I see your point. Getting back to

          P0: flag[0] = 1
          turn = 1
          memory_barrier( )
          while( flag[1] && turn == 1 );
          // do nothing
          // critical section
          ...
          // end of critical section
          flag[0] = 0

          P1: flag[1] = 1
          turn = 0
          memory_barrier( )
          while( flag[0] && turn == 0 );
          // do nothing
          // critical section
          ...
          // end of critical section
          flag[1] = 0

          It's easy to show that P0 and P1 can't block each other forever;
          eventually they will agree on the value of 'turn' and one of them will
          proceed.

          The case where P0 sees flag[1] == 0 and P1 sees flag[0] == 0 is a
          classic SC violation example and every reasonable definition of
          memory_barrier rules it out.

          The interesting case you must have had in mind is the sequence

          P1:flag[1] = 1
          P1:turn = 0
          P0:flag[0] = 1
          P0:turn = 1
          P0:memory_barri er

          Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
          the critical section.)

          I wonder whether the formal CLR memory model (or even the current
          formal x86 memory model) disallows this. (XCHG for turn instead of a
          fence should work.)

          I think that the C++MM does, if the condition is while( turn == 1 &&
          flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
          started by P1:turn=0 because turn=1 is executed by the same thread
          (first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
          'turn' in P0 and ensures that P1:flag[1]=1 is seen.

          Comment

          • Dmitriy V'jukov

            #6
            Re: Subtle difference between C++0x MM and other MMs

            On 25 Á×Ç, 02:47, Peter Dimov <pdi...@gmail.c omwrote:
            On Aug 24, 11:53 pm, Peter Dimov <pdi...@gmail.c omwrote:
            >
            On Aug 24, 9:44 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
            I mean not every C++0x implementation of Peterson's algorithm, but
            particular implementation which uses store-release, load-acquire + 1
            seq_cst fence.
            >
            Why do you think that this implementation doesn't work?
            >
            I think I see your point. Getting back to
            >
            P0: flag[0] = 1
            turn = 1
            memory_barrier( )
            while( flag[1] && turn == 1 );
            // do nothing
            // critical section
            ...
            // end of critical section
            flag[0] = 0
            >
            P1: flag[1] = 1
            turn = 0
            memory_barrier( )
            while( flag[0] && turn == 0 );
            // do nothing
            // critical section
            ...
            // end of critical section
            flag[1] = 0
            >
            It's easy to show that P0 and P1 can't block each other forever;
            eventually they will agree on the value of 'turn' and one of them will
            proceed.
            >
            The case where P0 sees flag[1] == 0 and P1 sees flag[0] == 0 is a
            classic SC violation example and every reasonable definition of
            memory_barrier rules it out.
            >
            The interesting case you must have had in mind is the sequence
            >
            P1:flag[1] = 1
            P1:turn = 0
            P0:flag[0] = 1
            P0:turn = 1
            P0:memory_barri er
            >
            Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
            the critical section.)
            Exactly! And this behavior is very counter-intuitive for me!
            I wonder whether the formal CLR memory model (or even the current
            formal x86 memory model) disallows this. (XCHG for turn instead of a
            fence should work.)
            As for x86 MM, I think that Yes, such behavior is disallowed. x86 MM
            defines possible reorderings in one thread. And then intended reading
            is that you must try to construct interleaving of threads (taking into
            account reorderings in threads). If it's possible to construct
            interleaving, then behavior is allowed. If it's impossible - then
            disallowed.

            For Peterson's algorithm it's impossible to construct interleaving,
            which will break algorithm.

            For for CLR, it's very informal. But I think intended reading is the
            same as for x86... just because Microsoft targets mainly to x86 :)

            But for C++0x the fact that it's impossible to construct interleaving
            doesn't matter...

            I think that the C++MM does, if the condition is while( turn == 1 &&
            flag[1] ). P0 seeing its own turn=1 doesn't break the release sequence
            started by P1:turn=0 because turn=1 is executed by the same thread
            (first bullet in 1.10/6). So P1:turn=0 synchronizes-with the read from
            'turn' in P0 and ensures that P1:flag[1]=1 is seen.

            "by the same thread" which execute release, not "by the same thread"
            which execute acquire.
            So this won't work too.


            Dmitriy V'jukov

            Comment

            • Dmitriy V'jukov

              #7
              Re: Subtle difference between C++0x MM and other MMs

              On 25 Á×Ç, 00:53, Peter Dimov <pdi...@gmail.c omwrote:
              How would you implement Peterson's algorithm in Java using weak
              operations and a fence? Java doesn't have weak operations or fences.
              Its volatile loads and stores are equivalent to C++MM's seq_cst loads
              and stores. Both promise sequential consistency (no more and no less).

              Java volatiles promise more. volatile store is release, and volatile
              load is acquire. for x86 this means that plain stores and loads will
              be used. Well, yes, sometimes heavy membar will be emitted.
              C++0x's seq_cst atomic store on x86 will be locked instruction.
              So, in my opinion, translating Java volatiles to C++0x's seq_cst
              atomics is not fair.
              IMVHO more fair way is to translate volatile store to store-release,
              volatile load to load-acquire, and manually emit seq_cst fence. At
              least this is what initially comes to mind.

              Dmitriy V'jukov

              Comment

              • Peter Dimov

                #8
                Re: Subtle difference between C++0x MM and other MMs

                On Aug 25, 9:27 am, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
                "by the same thread" which execute release, not "by the same thread"
                which execute acquire.
                So this won't work too.
                Yes, you are right.

                Comment

                • Peter Dimov

                  #9
                  Re: Subtle difference between C++0x MM and other MMs

                  On Aug 25, 9:37 am, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
                  On 25 Á×Ç, 00:53, Peter Dimov <pdi...@gmail.c omwrote:
                  >
                  How would you implement Peterson's algorithm in Java using weak
                  operations and a fence? Java doesn't have weak operations or fences.
                  Its volatile loads and stores are equivalent to C++MM's seq_cst loads
                  and stores. Both promise sequential consistency (no more and no less).
                  >
                  Java volatiles promise more. volatile store is release, and volatile
                  load is acquire.
                  Exactly the same as seq_cst.
                  for x86 this means that plain stores and loads will
                  be used. Well, yes, sometimes heavy membar will be emitted.
                  C++0x's seq_cst atomic store on x86 will be locked instruction.
                  Java VMs will also (be changed to) emit XCHG for volatile stores,
                  because plain stores do not guarantee SC, no matter how many MFENCEs
                  one uses. x86 is now officially not TSO.

                  T0: x = 1
                  T1: y = 1
                  T2: r1 = x; r2 = y; // 1 0 allowed
                  T3: r3 = y; r4 = x; // 1 0 allowed

                  Comment

                  • Peter Dimov

                    #10
                    Re: Subtle difference between C++0x MM and other MMs

                    On Aug 25, 1:47 am, Peter Dimov <pdi...@gmail.c omwrote:
                    The interesting case you must have had in mind is the sequence
                    >
                    P1:flag[1] = 1
                    P1:turn = 0
                    P0:flag[0] = 1
                    P0:turn = 1
                    P0:memory_barri er
                    >
                    Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
                    the critical section.)
                    You may be interested to know that this is (likely to be) disallowed
                    on PowerPC as well. It'd definitely be nice to somehow disallow it in
                    the C++MM as well, if we figure out how to do that.

                    Otherwise, we'll have to live with something like

                    P0: flag[0].store( 1, relaxed );
                    turn.exchange( 1, acq_rel );
                    while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
                    // do nothing
                    // critical section
                    ...
                    // end of critical section
                    flag[0].store( 0, release );

                    (I hope I got this right). Of course using a RMW somewhat defeats the
                    point of the algorithm, but we're using it as an illustration.

                    Comment

                    • Dmitriy V'jukov

                      #11
                      Re: Subtle difference between C++0x MM and other MMs

                      On Aug 25, 3:37 pm, Peter Dimov <pdi...@gmail.c omwrote:
                      I agree with you that the translation from x86 to the C++MM doesn't
                      work, and that this is a potential problem. I don't see at the moment
                      how the definition of a seq_cst fence can be modified to match the x86
                      behavior though.

                      This is how I currently made this in Relacy Race Detector in Java/CLR
                      mode:



                      Dmitriy V'jukov

                      Comment

                      • Dmitriy V'jukov

                        #12
                        Re: Subtle difference between C++0x MM and other MMs

                        On Aug 26, 12:25 am, Peter Dimov <pdi...@gmail.c omwrote:
                        On Aug 25, 1:47 am, Peter Dimov <pdi...@gmail.c omwrote:
                        >
                        The interesting case you must have had in mind is the sequence
                        >
                        P1:flag[1] = 1
                        P1:turn = 0
                        P0:flag[0] = 1
                        P0:turn = 1
                        P0:memory_barri er
                        >
                        Can P0 now see flag[1] == 0? (P1 will later see turn == 1 and enter
                        the critical section.)
                        >
                        You may be interested to know that this is (likely to be) disallowed
                        on PowerPC as well. It'd definitely be nice to somehow disallow it in
                        the C++MM as well, if we figure out how to do that.

                        I think that this is disallowed on Sparc too.


                        Dmitriy V'jukov

                        Comment

                        • Peter Dimov

                          #13
                          Re: Subtle difference between C++0x MM and other MMs

                          On Aug 26, 2:20 pm, "Dmitriy V'jukov" <dvyu...@gmail. comwrote:
                          On Aug 26, 12:25 am, Peter Dimov <pdi...@gmail.c omwrote:
                          >
                          You may be interested to know that this is (likely to be) disallowed
                          on PowerPC as well. It'd definitely be nice to somehow disallow it in
                          the C++MM as well, if we figure out how to do that.
                          >
                          I think that this is disallowed on Sparc too.
                          Yes, I believe it is. FWIW, the following:

                          P0: flag[0].store( 1, relaxed );
                          fence( seq_cst ); // #1
                          turn.store( 1, relaxed );
                          fence( seq_cst );
                          while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
                          // do nothing
                          // critical section
                          ...
                          // end of critical section
                          flag[0].store( 0, release );

                          seems to work under the C++MM. The model in its current form doesn't
                          let us get away with just a fence(release) at #1.

                          Comment

                          • Dmitriy V'jukov

                            #14
                            Re: Subtle difference between C++0x MM and other MMs

                            On Aug 27, 4:37 am, Peter Dimov <pdi...@gmail.c omwrote:
                            You may be interested to know that this is (likely to be) disallowed
                            on PowerPC as well. It'd definitely be nice to somehow disallow it in
                            the C++MM as well, if we figure out how to do that.
                            >
                            I think that this is disallowed on Sparc too.
                            >
                            Yes, I believe it is. FWIW, the following:
                            >
                            P0: flag[0].store( 1, relaxed );
                                fence( seq_cst ); // #1
                                turn.store( 1, relaxed );
                                fence( seq_cst );
                                while( flag[1].load( acquire ) && turn.load( relaxed ) == 1 );
                                        // do nothing
                                // critical section
                                ...
                                // end of critical section
                                flag[0].store( 0, release );
                            >
                            seems to work under the C++MM. The model in its current form doesn't
                            let us get away with just a fence(release) at #1.

                            It looks like it works.
                            But it's easy to construct heavyweight correct algorithm.
                            The question that bothers me - how much lightweight (and at the same
                            time formally correct) algorithm I can construct with C++0x?


                            Dmitriy V'jukov

                            Comment

                            Working...