Re: atomically thread-safe Meyers singleton impl (fixed)...

  • Anthony Williams

    #1
    Re: atomically thread-safe Meyers singleton impl (fixed)...

    You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
    things don't get rearranged across your atomic access
    functions. There's no need to drop to assembler either: you're not
    doing anything more complicated than a simple MOV.

    Anyway, if I were writing this (and I wouldn't be, because I really
    dislike singletons), I'd just use boost::call_once. It doesn't use a
    lock unless it has to and is portable across pthreads and win32
    threads.

    Oh, and one other thing: you don't need inline assembler for atomic
    ops with gcc from version 4.2 onwards, as the compiler has built-in
    functions for atomic operations.

    Anthony
    --
    Anthony Williams | Just Software Solutions Ltd
    Custom Software Development | http://www.justsoftwaresolutions.co.uk
    Registered in England, Company Number 5478976.
    Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL
  • Dmitriy V'jukov

    #2
    Re: atomically thread-safe Meyers singleton impl (fixed)...

    On 30 Jul, 11:20, Anthony Williams <anthony....@gmail.com> wrote:
    >You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
    >things don't get rearranged across your atomic access
    >functions. There's no need to drop to assembler either: you're not
    >doing anything more complicated than a simple MOV.

    In MSVC one can use just volatile variables; accesses to volatile
    variables are guaranteed to be 'load-acquire' and 'store-release'. I.e. on
    Itanium/PPC MSVC will emit hardware memory fences (along with compiler
    fences).



    Dmitriy V'jukov


    • Anthony Williams

      #3
      Re: atomically thread-safe Meyers singleton impl (fixed)...

      "Dmitriy V'jukov" <dvyukov@gmail.com> writes:
      On 30 Jul, 11:20, Anthony Williams <anthony....@gmail.com> wrote:
      >You need compiler barriers (_ReadWriteBarrier() in MSVC) to ensure
      >things don't get rearranged across your atomic access
      >functions. There's no need to drop to assembler either: you're not
      >doing anything more complicated than a simple MOV.
      >
      >
      In MSVC one can use just volatile variables; accesses to volatile
      variables are guaranteed to be 'load-acquire' and 'store-release'. I.e. on
      Itanium/PPC MSVC will emit hardware memory fences (along with compiler
      fences).
      >
      http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
      I am aware of this. However, the description only says it orders
      accesses to "global and static data", and that it refers to objects
      *declared* as volatile. I haven't tested it enough to be confident
      that it is entirely equivalent to _ReadWriteBarrier(), and that it
      works on variables *cast* to volatile.

      Anthony
      --
      Anthony Williams | Just Software Solutions Ltd
      Custom Software Development | http://www.justsoftwaresolutions.co.uk
      Registered in England, Company Number 5478976.
      Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL


      • Chris M. Thomasson

        #4
        Re: atomically thread-safe Meyers singleton impl (fixed)...


        "Anthony Williams" <anthony.ajw@gmail.com> wrote in message
        news:u63qn63yk.fsf@gmail.com...
        "Chris M. Thomasson" <no@spam.invalid> writes:
        [...]
        The algorithm used by boost::call_once on pthreads platforms is
        described here:
        >

        >
        >>It doesn't use a
        >>lock unless it has to and is portable across pthreads and win32
        >>threads.
        >>
        >The code I posted does not use a lock unless it absolutely has to
        >because it attempts to efficiently take advantage of the double
        >checked locking pattern.
        >
        Oh yes, I realise that: the code for call_once is similar. However, it
        attempts to avoid contention on the mutex by using thread-local
        storage. If you have atomic ops, you can go even further in
        eliminating the mutex, e.g. using compare_exchange and fetch_add.
        [...]

        Before I reply to your entire post I should point out that:



        the Boost mechanism is not 100% portable, but is elegant in practice. It
        uses a similar technique that a certain distributed reference counting
        algorithm I created claims:





        Not 100% portable, but _highly_ portable indeed!


        • Anthony Williams

          #5
          Re: atomically thread-safe Meyers singleton impl (fixed)...

          "Chris M. Thomasson" <no@spam.invalid> writes:
          "Anthony Williams" <anthony.ajw@gmail.com> wrote in message
          news:u63qn63yk.fsf@gmail.com...
          >"Chris M. Thomasson" <no@spam.invalid> writes:
          [...]
          >The algorithm used by boost::call_once on pthreads platforms is
          >described here:
          >>
          >http://www.open-std.org/jtc1/sc22/wg...007/n2444.html
          >>
          >>>It doesn't use a
          >>>lock unless it has to and is portable across pthreads and win32
          >>>threads.
          >>>
          >>The code I posted does not use a lock unless it absolutely has to
          >>because it attempts to efficiently take advantage of the double
          >>checked locking pattern.
          >>
          >Oh yes, I realise that: the code for call_once is similar. However, it
          >attempts to avoid contention on the mutex by using thread-local
          >storage. If you have atomic ops, you can go even further in
          >eliminating the mutex, e.g. using compare_exchange and fetch_add.
          [...]
          >
          Before I reply to your entire post I should point out that:
          >

          >
          the Boost mechanism is not 100% portable, but is elegant in
          practice.
          Yes. If you look at the whole thread, you'll see a comment by me there
          where I admit as much.
          It uses a similar technique that a certain distributed
          reference counting algorithm I created claims:
          I wasn't aware that you were using something similar in vZOOM.

          Anthony
          --
          Anthony Williams | Just Software Solutions Ltd
          Custom Software Development | http://www.justsoftwaresolutions.co.uk
          Registered in England, Company Number 5478976.
          Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL


          • Chris M. Thomasson

            #6
            Re: atomically thread-safe Meyers singleton impl (fixed)...

            "Anthony Williams" <anthony.ajw@gmail.com> wrote in message
            news:uhca74h7e.fsf@gmail.com...
            "Chris M. Thomasson" <no@spam.invalid> writes:
            >
            >"Anthony Williams" <anthony.ajw@gmail.com> wrote in message
            >news:u63qn63yk.fsf@gmail.com...
            >>"Chris M. Thomasson" <no@spam.invalid> writes:
            >[...]
            >>The algorithm used by boost::call_once on pthreads platforms is
            >>described here:
            >>>
            >>http://www.open-std.org/jtc1/sc22/wg...007/n2444.html
            >>>
            >>>>It doesn't use a
            >>>>lock unless it has to and is portable across pthreads and win32
            >>>>threads.
            >>>>
            >>>The code I posted does not use a lock unless it absolutely has to
            >>>because it attempts to efficiently take advantage of the double
            >>>checked locking pattern.
            >>>
            >>Oh yes, I realise that: the code for call_once is similar. However, it
            >>attempts to avoid contention on the mutex by using thread-local
            >>storage. If you have atomic ops, you can go even further in
            >>eliminating the mutex, e.g. using compare_exchange and fetch_add.
            >[...]
            >>
            >Before I reply to your entire post I should point out that:
            >>
            >http://groups.google.com/group/comp....9c7aff738f9102
            >>
            >the Boost mechanism is not 100% portable, but is elegant in
            >practice.
            >
            Yes. If you look at the whole thread, you'll see a comment by me there
            where I admit as much.
            Does the following line:

            __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;

            explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so, is it
            guaranteed?



            >It uses a similar technique that a certain distributed
            >reference counting algorithm I created claims:
            >
            I wasn't aware that you were using something similar in vZOOM.
            Humm, now that I think about it, it seems like I am totally mistaken. The
            "most portable" version of vZOOM relies on an assumption that pointer
            load/stores are atomic and the unlocking of a mutex executes at least a
            release-barrier, and the loading of a shared variable executes at least a
            data-dependent load-barrier; very similar to RCU without the explicit
            #LoadStore | #StoreStore before storing into a shared pointer location...
            Something like:




            ________________________________________________________________
            struct foo {
                int a;
            };


            static foo* shared_f = NULL;


            // single producer thread {
            foo* local_f = new foo;
            pthread_mutex_t* lock = get_per_thread_mutex();
            pthread_mutex_lock(lock);
            local_f->a = 666;
            pthread_mutex_unlock(lock);
            shared_f = local_f;
            }


            // single consumer thread {
            foo* local_f;
            while (! (local_f = shared_f)) {
                sched_yield();
            }
            assert(local_f->a == 666);
            delete local_f;
            }
            ________________________________________________________________





            If the `pthread_mutex_unlock()' function does not execute at least a
            release-barrier in the producer, and if the load of the shared variable does
            not execute at least a data-dependent load-barrier in the consumer, the
            "most portable" version of vZOOM will NOT work on that platform in any way,
            shape or form; it will need a platform-dependent version. However, the only
            platform I can think of where the intra-node memory visibility requirements
            do not hold is the Alpha... For multi-node super-computers, inter-node
            communication is adapted to use MPI.


            • Anthony Williams

              #7
              Re: atomically thread-safe Meyers singleton impl (fixed)...

              "Chris M. Thomasson" <no@spam.invalid> writes:
              "Anthony Williams" <anthony.ajw@gmail.com> wrote in message
              news:uhca74h7e.fsf@gmail.com...
              >"Chris M. Thomasson" <no@spam.invalid> writes:
              >>
              >>"Anthony Williams" <anthony.ajw@gmail.com> wrote in message
              >>news:u63qn63yk.fsf@gmail.com...
              >>>"Chris M. Thomasson" <no@spam.invalid> writes:
              >>[...]
              >>>The algorithm used by boost::call_once on pthreads platforms is
              >>>described here:
              >>>>
              >>>http://www.open-std.org/jtc1/sc22/wg...007/n2444.html
              >>>>
              >>>>>It doesn't use a
              >>>>>lock unless it has to and is portable across pthreads and win32
              >>>>>threads.
              >>>>>
              >>>>The code I posted does not use a lock unless it absolutely has to
              >>>>because it attempts to efficiently take advantage of the double
              >>>>checked locking pattern.
              >>>>
              >>>Oh yes, I realise that: the code for call_once is similar. However, it
              >>>attempts to avoid contention on the mutex by using thread-local
              >>>storage. If you have atomic ops, you can go even further in
              >>>eliminating the mutex, e.g. using compare_exchange and fetch_add.
              >>[...]
              >>>
              >>Before I reply to your entire post I should point out that:
              >>>
              >>http://groups.google.com/group/comp....9c7aff738f9102
              >>>
              >>the Boost mechanism is not 100% portable, but is elegant in
              >>practice.
              >>
              >Yes. If you look at the whole thread, you'll see a comment by me there
              >where I admit as much.
              >
              Does the following line:
              >
              __thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
              >
              explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so,
              is it guaranteed?
              The algorithm assumes it does, but it depends which compiler you
              use. In the Boost implementation, the value is explicitly
              initialized (to ~0 --- I found it worked better with exception
              handling to count backwards).
              >
              >>It uses a similar technique that a certain distributed
              >>reference counting algorithm I created claims:
              >>
              >I wasn't aware that you were using something similar in vZOOM.
              >
              Humm, now that I think about it, it seems like I am totally
              mistaken. The "most portable" version of vZOOM relies on an assumption
              that pointer load/stores are atomic and the unlocking of a mutex
              executes at least a release-barrier, and the loading of a shared
              variable executes at least a data-dependant load-barrier; very similar
              to RCU without the explicit #LoadStore | #StoreStore before storing
              into a shared pointer location... Something like:
              >
              // single producer thread {
              foo* local_f = new foo;
              pthread_mutex_t* lock = get_per_thread_mutex();
              pthread_mutex_lock(lock);
              local_f->a = 666;
              pthread_mutex_unlock(lock);
              shared_f = local_f;
              So you're using the lock just for the barrier properties. Interesting
              idea.

              Anthony
              --
              Anthony Williams | Just Software Solutions Ltd
              Custom Software Development | http://www.justsoftwaresolutions.co.uk
              Registered in England, Company Number 5478976.
              Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL


              • Chris M. Thomasson

                #8
                Re: atomically thread-safe Meyers singleton impl (fixed)...


                "Anthony Williams" <anthony.ajw@gmail.com> wrote in message
                news:ud4kv4fbp.fsf@gmail.com...
                "Chris M. Thomasson" <no@spam.invalid> writes:
                [...]
                >>>the Boost mechanism is not 100% portable, but is elegant in
                >>>practice.
                >>>
                >>Yes. If you look at the whole thread, you'll see a comment by me there
                >>where I admit as much.
                >>
                >Does the following line:
                >>
                >__thread fast_pthread_once_t _fast_pthread_once_per_thread_epoch;
                >>
                >explicitly set `_fast_pthread_once_per_thread_epoch' to zero? If so,
                >is it guaranteed?
                >
                The algorithm assumes it does, but it depends which compiler you
                use. In the Boost implementation, the value is explicitly
                initialized (to ~0 --- I found it worked better with exception
                handling to count backwards).
                >
                >>
                >>>It uses a similar technique that a certain distributed
                >>>reference counting algorithm I created claims:
                >>>
                >>I wasn't aware that you were using something similar in vZOOM.
                >>
                >Humm, now that I think about it, it seems like I am totally
                >mistaken. The "most portable" version of vZOOM relies on an assumption
                >that pointer load/stores are atomic and the unlocking of a mutex
                >executes at least a release-barrier, and the loading of a shared
                >variable executes at least a data-dependant load-barrier; very similar
                >to RCU without the explicit #LoadStore | #StoreStore before storing
                >into a shared pointer location... Something like:
                >>
                >// single producer thread {
                > foo* local_f = new foo;
                > pthread_mutex_t* lock = get_per_thread_mutex();
                > pthread_mutex_lock(lock);
                > local_f->a = 666;
                > pthread_mutex_unlock(lock);
                > shared_f = local_f;
                >
                So you're using the lock just for the barrier properties. Interesting
                idea.
                Yes. Actually, I did not show the whole algorithm. The code above is busted
                because I forgot to show it all; STUPID ME!!! It's busted because the store
                to shared_f can legally be hoisted up above the unlock. Here is the whole
                picture... Each thread has a special dedicated mutex which is locked from
                its birth... Here is exactly how production of an object can occur:


                static foo* volatile shared_f = NULL;

                // single producer thread {
                00: foo* local_f;
                01: pthread_mutex_t* const mem_mutex = get_per_thread_mem_mutex();
                02: local_f = new foo;
                03: local_f->a = 666;
                04: pthread_mutex_unlock(mem_mutex);
                05: pthread_mutex_lock(mem_mutex);
                06: shared_f = local_f;
                }


                Here are the production rules wrt POSIX:

                1. Steps 02-03 CANNOT sink below step 04
                2. Step 06 CANNOT rise above step 05
                3. vZOOM assumes that step 04 has a release barrier



                Those __two guarantees__ and __single assumption__ ensure that the ordering
                and visibility of the operations are correct. After that, the consumer can do:


                // single consumer thread {
                00: foo* local_f;
                01: while (! (local_f = shared_f)) {
                02: sched_yield();
                }
                03: assert(local_f->a == 666);
                04: delete local_f;
                }


                Consumption rules:

                01: vZOOM assumes that the load from `shared_f' will have an implied
                data-dependent load-barrier.



                BTW, here is a brief outline of how the "most portable" version of vZOOM
                distributed reference counting works with the above idea:




                (an __excellent__ question from Dmitriy...)




                What do you think Anthony?


                • Chris M. Thomasson

                  #9
                  Re: atomically thread-safe Meyers singleton impl (fixed)...

                  [...]
                  >
                  BTW, here is a brief outline of how the "most portable" version of vZOOM
                  distributed reference counting works with the above idea:
                  >
                  http://groups.google.ru/group/comp.p...e9b6e427b4a144
                  Take note of the per-thread memory lock. It's vital to vZOOM.


                  >

                  (an __excellent__ question from Dmitriy...)
                  >
                  >
                  >
                  >
                  What do you think Anthony?
