Implementing my own memcpy

  • Nils Weller

    #61
    [OT] Re: Implementing my own memcpy

    In article <42C5F9B9.CB6466B2@yahoo.com>, CBFalconer wrote:
    > Nils Weller wrote:
    >> Nils Weller wrote:
    >>
    >>> rc = read(buf, sizeof buf - 1, fd);
    >>
    >> Of course I had to goof this one!
    >>
    >> rc = read(fd, buf, sizeof buf - 1);
    >
    > I have no idea what it goes with, because your previous article was
    > much too long to read. :-) However, you have still goofed, because
    > there is no such standard function as 'read'. Look up fread, which
    > IS portable.

    And nobody claimed that read() is a standard C function. I explicitly
    commented the code as being Unix-specific in the previous, too long
    post. Moreover, the macro that triggered this sub-thread has also been
    pointed out to be Unix-specific, and there has been some talk about Unix
    kernel implementation and compatibility system software.

    Perhaps an OT tag was missing, but I think it is clear that we aren't
    talking about standard C anymore.

    --
    Nils R. Weller, Bremen / Germany
    My real email address is ``nils<at>gnulinux<dot>nl''
    ... but I'm not speaking for the Software Libre Foundation!


    • Dave Thompson

      #62
      Re: Implementing my own memcpy

      On Sat, 25 Jun 2005 12:04:24 -0400, Clark S. Cox III
      <clarkcox3@gmail.com> wrote:

      > On 2005-06-25 11:45:13 -0400, Netocrat <netocrat@dodo.com.au> said:
      <snip>
      > > I believe that there is no portable, generic way to copy a structure
      >
      > Of course there is; in fact, there are several:
      >
      > Assuming a and b are of the same complete type, any of the following
      > will copy the contents of a into b:
      >
      > #include <stdlib.h>

      Not actually needed for anything in this code. (size_t is in string.h)

      > #include <string.h>
      >
      > /*1*/ b = a;

      For complete _nonarray_ types.

      > /*2*/ memcpy(&b, &a, sizeof b);
      > /*3*/ memmove(&b, &a, sizeof b);
      > /*4*/ const unsigned char *src = (const unsigned char*)&a;
      >       unsigned char *dst = (unsigned char*)&b;
      >       for(size_t i=0; i<sizeof b; ++i)
      >       {
      >           dst[i] = src[i];
      >       }

      Rest for all complete types. And if you can determine the (a?) size by
      some other means than sizeof, even objects declared-not-defined with
      incomplete types.

      - David.Thompson1 at worldnet.att.net


      • Dave Thompson

        #63
        Re: Implementing my own memcpy

        On Sat, 25 Jun 2005 17:05:08 GMT, CBFalconer <cbfalconer@yahoo.com>
        wrote:
        <snip>
        > The void * type can point at arbitrary things, and a size_t can
        > specify a size on any machine. But to use void* you have to
        > convert to other types, thus:
        >
        > void *dupmem(void *src, size_t sz)
        > {
        >     unsigned char *sp = src;
        >     unsigned char *dst;
        >
        >     if (dst = malloc(sz))              /* memory is available */
        >         while (sz--) *dst++ = *sp++;   /* copy away */
        >     return dst;                        /* will be NULL for failure */

        return dst - sz, unless all your callers will (and must) adjust down
        the pointer before using it to access the memory, and free() it.

        > } /* dupmem, untested */
        >
        > Note how src is typed into sp, without any casts. Similarly the
        > reverse typing for the return value of dupmem. The usage will be,
        > for p some type of pointer:
        >
        Although it would be more informative, and convenient for some
        call(er)s, to declare src and sp as pointer to const void/uchar.

        > if (p = dupmem(whatever, howbig)) {
        >     /* success, carry on */
        > }
        > else {
        >     /* abject failure, panic */
        > }

        - David.Thompson1 at worldnet.att.net


        • Dave Thompson

          #64
          Re: Implementing my own memcpy

          On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.net> wrote:

          > >On Sat, 25 Jun 2005 18:31:30 +0000, Chris Torek wrote:
          > >> (you can also write the loop as "while (n--) *dst++ = *src++" but I find
          > >> the above easier to read and think about).
          >
          > In article <pan.2005.06.25.19.33.15.629285@dodo.com.au>
          > Netocrat <netocrat@dodo.com.au> wrote:
          > >I prefer the conciseness of the second, but I prefer even more testing
          > >against a maximum pointer.
          >
          > My thinking is perhaps colored by too many years of assembly coding
          > and instruction sets that include "decrement and branch if nonzero":
          >
          >         test    r3
          >         bz      Laround
          > Lloop:
          >         mov     (r1)+,(r2)+
          >         sobgtr  r3,Lloop        # cheating (but this is OK)
          > Laround:
          <snip>
          > and so on. (The first loop is VAX assembly, and "cheating" is OK
          > because r1 and/or r2 should never cross from P0/P1 space to S space,
          > nor vice versa, so the maximum block size never exceeds 2 GB; <snip>

          Not movb? Isn't the default word=long? Or is this some overambitious
          assembler that you (have to) tell about value types?

          Most (I think all but first two or so) models of PDP-11 also had
          sub-1-brback-ne (only) which they managed to publish as SOB before
          marketing caught them. PDP-6/10 already had a whole series of SOB*,
          but only SOBN or SOBG would do what you wanted here not SOB.
          (All 16 dyadic booleans are implemented, but SKIP doesn't; JUMP
          doesn't; the fastest jump varies but is never JUMP*; etc., etc.)

          ISTR 68k, which you also mentioned (snipped), also had a mildly
          offcolor opcode, somewhere else.

          - David.Thompson1 at worldnet.att.net


          • Dave Thompson

            #65
            Re: Implementing my own memcpy

            On Tue, 28 Jun 2005 03:05:48 +1000, Netocrat <netocrat@dodo.com.au>
            wrote:

            > Also C90 and C89 seem to be interchangeable terms - correct?
            >
            Effectively. C89 was the document developed "by" (under) ANSI, then
            submitted to "ISO" (already JTC1?) and adopted with technically
            identical contents but different numbering scheme and (I believe) some
            of the boilerplate about copyright, authority, and such. Thus if you
            want to refer to a clause number, as we fairly often do, you need to
            specify which; and if you had a lawsuit turning on compliance to one
            or the other standard you might have to produce that exact document to
            support your case. But as far as what a C implementation is required
            or permitted to do, and thus what a program(mer) can rely on or
            expect, they are interchangeable.

            In contrast C99 was voted first by "ISO" (as I understand it really
            SC22), and adopted as-is by ANSI (really NCITS? INCITS?).

            > Finally I understand that C90/C89 had some modifications made prior to C99
            > - where are those detailed?

            See FAQ 11.1 and .2 -- at least in the text version posted and online
            at usual places; the webized http://www.eskimo.com/~scs/C-faq/top.html
            has been out-of-date the last few times I checked and this is one of
            the points that has changed. But:
            - the statement about the Rationale was for only the original ANSI
            version C89, which is no longer (realistically) available;
            - it says Normative Addendum which I'm pretty sure should be
            Amendment; C90 plus that amendment is sometimes called C95
            - (several!) drafts of an updated Rationale for C99, as well as drafts
            of C99 itself (through n869) and C0X (n1124) can be gotten from the WG
            site which is now (renamed?) www.open-std.org/JTC1/SC22/WG14.
            (As well as other stuff you might be interested in, for that matter.)

            And for your further delectation and enjoyment, you could get the
            ~1600-page e-book by Derek M Jones discussed in another thread, which
            AFAICT-so-far exegizes the standard process, the resulting document,
            and the language specified in it, and more.

            If you actually want C90 instead of or in addition to C99, ANSI
            apparently no longer sells it, but webstore.ansi.org (still) lists DIN
            and AS adoptions of 9899:1990, and I'm guessing the latter might be
            available to you more conveniently.

            - David.Thompson1 at worldnet.att.net


            • Chris Torek

              #66
              Re: Implementing my own memcpy

              (Off-topic drift warning :-) )
              [color=blue]
              >On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.n et> wrote:[color=green]
              >> mov (r1)+,(r2)+[/color][/color]

              In article <c6ghc1dufoi95b ipm9qopb69llvq9 cs3qr@4ax.com>
              Dave Thompson <david.thompson 1@worldnet.att. net> wrote:[color=blue]
              >Not movb? Isn't the default word=long? Or is this some overambitious
              >assembler that you (have to) tell about value types?[/color]

              No, just a goof; it should have been "movb".

              >ISTR 68k, which you also mentioned (snipped), also had a mildly
              >offcolor opcode, somewhere else.

              I do not recall any from the 680x0 series, but the 1802 had several.

              Each register was 16 bits (I am almost certain, despite the 8-bit
              claim on the page referenced below), but the 8-bit opcodes could
              address only the high or low half of each register, so there was
              a "put low" and "put high" to write to each half, and the corresponding
              pair of "get"s. This meant the 1802 had GHI, the "get high"
              instruction.

              The 1802 also had two special registers named P (program counter)
              and X (index). However, neither P nor X were actual registers;
              instead, they were register *numbers*, pointing to one of the 16
              general-purpose registers. You had to use a "set p" or "set x"
              instruction to point the P and X indirection at the appropriate
              register. These had three-letter assembler mnemonics; the first
              was SEP, and the second was the now-obvious.

              (See also <http://shop-pdp.kent.edu/ashtml/as1802.htm>.)
              --
              In-Real-Life: Chris Torek, Wind River Systems
              Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
              email: forget about it http://web.torek.net/torek/index.html
              Reading email is like searching for food in the garbage, thanks to spammers.


              • CBFalconer

                #67
                Re: Implementing my own memcpy

                Dave Thompson wrote:
                > CBFalconer <cbfalconer@yahoo.com> wrote:
                > <snip>
                >> The void * type can point at arbitrary things, and a size_t can
                >> specify a size on any machine. But to use void* you have to
                >> convert to other types, thus:
                >>
                >> void *dupmem(void *src, size_t sz)
                >> {
                >>     unsigned char *sp = src;
                >>     unsigned char *dst;
                >>
                >>     if (dst = malloc(sz))              /* memory is available */
                >>         while (sz--) *dst++ = *sp++;   /* copy away */
                >>     return dst;                        /* will be NULL for failure */
                >
                > return dst - sz, unless all your callers will (and must) adjust down
                > the pointer before using it to access the memory, and free() it.

                That still doesn't fix my goof above. sz ends at 0. Try this:

                void *dupmem(void *src, size_t sz)
                {
                    unsigned char *sp = src;
                    unsigned char *dst, *p;

                    if (p = dst = malloc(sz))         /* memory is available */
                        while (sz--) *p++ = *sp++;    /* copy away */
                    return dst;                       /* will be NULL for failure */
                }
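
                A quick sanity check of that corrected version. The main()
                below is my own illustrative harness, not part of the
                original post; the test string is arbitrary.

                ```c
                #include <assert.h>
                #include <stdlib.h>
                #include <string.h>

                void *dupmem(void *src, size_t sz)
                {
                    unsigned char *sp = src;
                    unsigned char *dst, *p;

                    if (p = dst = malloc(sz))         /* memory is available */
                        while (sz--) *p++ = *sp++;    /* copy away */
                    return dst;                       /* will be NULL for failure */
                }

                int main(void)
                {
                    char msg[] = "copy me";
                    char *copy = dupmem(msg, sizeof msg);

                    assert(copy != NULL);
                    assert(strcmp(copy, msg) == 0);
                    assert(copy != (char *)msg);      /* a distinct allocation */
                    free(copy);
                    return 0;
                }
                ```

                Note that dst is returned untouched while p does the
                walking, which is exactly the fix for the original goof.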
                --
                "If you want to post a followup via groups.google.com, don't use
                the broken "Reply" link at the bottom of the article. Click on
                "show options" at the top of the article, then click on the
                "Reply" at the bottom of the article headers." - Keith Thompson



                • BGreene

                  #68
                  Re: Implementing my own memcpy

                  I apologize to the group, but I haven't heard "decrement and branch
                  if not zero" in many a year.

                  "Dave Thompson" <david.thompson1@worldnet.att.net> wrote in message
                  news:c6ghc1dufoi95bipm9qopb69llvq9cs3qr@4ax.com...
                  > On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.net> wrote:
                  <snip>



                  • Netocrat

                    #69
                    Re: Implementing my own memcpy

                    On Sat, 25 Jun 2005 19:58:19 +0000, Chris Torek wrote:

                    [a memcpy function in response to my buggy version]
                    > void *like_memcpy(void *restrict dst0, const void *restrict src0,
                    >                   size_t n) {
                    >     unsigned char *restrict dst = dst0;
                    >     const unsigned char *restrict src = src0;
                    >
                    >     if (n)
                    >         do
                    >             *dst++ = *src++;
                    >         while (--n != 0);
                    >     return dst0;
                    > }
                    >
                    > >On Sat, 25 Jun 2005 18:31:30 +0000, Chris Torek wrote:
                    > >> (you can also write the loop as "while (n--) *dst++ = *src++" but I
                    > >> find the above easier to read and think about).
                    >
                    > In article <pan.2005.06.25.19.33.15.629285@dodo.com.au> Netocrat
                    > <netocrat@dodo.com.au> wrote:
                    > >I prefer the conciseness of the second, but I prefer even more testing
                    > >against a maximum pointer.
                    >
                    > My thinking is perhaps colored by too many years of assembly coding and
                    > instruction sets that include "decrement and branch if nonzero":

                    <snip discussion to which I responded in a later post>

                    I was spurred to actually benchmark the different approaches on my
                    machine. It's a little over the top, but my belief is that it's not
                    really possible to predict which approach will be faster - even knowing
                    the machine's architecture you can't know what the compiler will do. So
                    to me these sorts of things are really a matter of personal preference.
                    So here is my attempt to back up that intuition at least on my machine.

                    I used the function quoted above, as well as the quoted proposed
                    alternative, and my function as fixed by Kevin Bagust:

                    > void *mem_cpy( void *dest, const void *src, size_t bytes ) {
                    >     unsigned char *destPtr = dest;
                    >     unsigned char const *srcPtr = src;
                    >     unsigned char const *srcEnd = srcPtr + bytes;
                    >
                    >     while ( srcPtr < srcEnd ) {
                    >         *destPtr++ = *srcPtr++;
                    >     }
                    >     return dest;
                    > }

                    I compiled at four of the levels of optimisation available on gcc (none,
                    -O1, -O2, -O3), and at each level performed two tests - with and without
                    -march=pentium4 (my machine architecture). I ran many iterations at each
                    of the sizes 0, 1, 2, 8, 25 and 80 bytes and timed the duration using
                    clock().

                    And the results?

                    At the unoptimised level, both of Chris's alternatives were equal.

                    In every other case the first of Chris's alternatives far outperformed the
                    second (by a minimum of 14% and maximum of 21%).

                    So I modified the 'alternative' expression from
                        while (n--) *dst++ = *src++;
                    to
                        while (n) {
                            *dst++ = *src++;
                            n--;
                        }

                    This brought the alternative function back close to the performance of the
                    original. I don't know why the degradation was occurring; presumably
                    something to do with one or more of the variables being decremented or
                    incremented one more time than necessary.

                    In the unoptimised case, my function outperformed Chris's functions by
                    about 15%. In all of the optimised cases, they were roughly equal -
                    varying from his performing 3% better than mine to mine performing 2%
                    better than his.

                    So even though it's platform-specific I think that this test shows that
                    choosing between these loop constructions should be based on personal
                    preference as to readability - a performance benefit can't be assumed for
                    any particular style - unless you are developing for a particular system
                    for which you know one style is more performant than the others.
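
                    The shape of the benchmark was roughly as follows. This is
                    a sketch only - the iteration count is illustrative and
                    just the while (n--) variant is shown, where the real test
                    compared all three loops:

                    ```c
                    #include <stdio.h>
                    #include <string.h>
                    #include <time.h>

                    #define ITERS 100000UL   /* illustrative iteration count */

                    /* the "while (n--)" style under test */
                    void *copy_postdec(void *dst0, const void *src0, size_t n)
                    {
                        unsigned char *dst = dst0;
                        const unsigned char *src = src0;

                        while (n--)
                            *dst++ = *src++;
                        return dst0;
                    }

                    int main(void)
                    {
                        static unsigned char src[80], dst[80];
                        static const size_t sizes[] = { 0, 1, 2, 8, 25, 80 };
                        size_t i;
                        unsigned long k;

                        for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
                            clock_t t0 = clock();

                            for (k = 0; k < ITERS; k++)
                                copy_postdec(dst, src, sizes[i]);
                            printf("%3lu bytes: %.3f s\n",
                                   (unsigned long)sizes[i],
                                   (double)(clock() - t0) / CLOCKS_PER_SEC);
                        }
                        return 0;
                    }
                    ```

                    Timing each size separately with clock() is crude but
                    enough to compare loop styles on one machine; absolute
                    numbers will vary with compiler, flags and hardware.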


                    • Chris Croughton

                      #70
                      Re: Implementing my own memcpy

                      On Sun, 10 Jul 2005 22:27:54 +1000, Netocrat
                      <netocrat@dodo.com.au> wrote:

                      > In every other case the first of Chris's alternatives far outperformed the
                      > second (by a minimum of 14% and maximum of 21%).
                      >
                      > So I modified the 'alternative' expression from
                      >     while (n--) *dst++ = *src++;
                      > to
                      >     while (n) {
                      >         *dst++ = *src++;
                      >         n--;
                      >     }
                      >
                      > This brought the alternative function back close to the performance of the
                      > original. I don't know why the degradation was occurring; presumably
                      > something to do with one or more of the variables being decremented or
                      > incremented one more time than necessary.

                      Some odd optimisation?

                      Incidentally, if you still have the test code around, could you also try

                      while (n) {
                          *dst = *src;
                          ++src;
                          ++dst;
                          --n;
                      }

                      (And is there a difference between n--; and --n; on your system?)

                      Just to get the results from the same system as used for your original
                      results. (Incidentally, how did they compare with the system-supplied
                      memcpy? I believe gcc inlines that to assembler at some optimisation
                      levels...)

                      > So even though it's platform-specific I think that this test shows that
                      > choosing between these loop constructions should be based on personal
                      > preference as to readability - a performance benefit can't be assumed for
                      > any particular style - unless you are developing for a particular system
                      > for which you know one style is more performant than the others.

                      Indeed. And bear in mind that it may change completely with the next
                      version of the compiler, or switching to another compiler on the same
                      platform. I've found that trusting the compiler and library writers to
                      have picked the best optimisations is right most of the time...

                      Chris C


                      • Netocrat

                        #71
                        Re: Implementing my own memcpy

                        On Sun, 10 Jul 2005 14:34:09 +0100, Chris Croughton wrote:
                        > On Sun, 10 Jul 2005 22:27:54 +1000, Netocrat
                        > <netocrat@dodo.com.au> wrote:
                        >
                        >> In every other case the first of Chris's alternatives far outperformed
                        >> the second (by a minimum of 14% and maximum of 21%).
                        >>
                        >> So I modified the 'alternative' expression from
                        >>     while (n--) *dst++ = *src++;
                        >> to
                        >>     while (n) {
                        >>         *dst++ = *src++;
                        >>         n--;
                        >>     }
                        >> This brought the alternative function back close to the performance of
                        >> the original. I don't know why the degradation was occurring;
                        >> presumably something to do with one or more of the variables being
                        >> decremented or incremented one more time than necessary.
                        >
                        > Some odd optimisation?

                        Anything's possible.

                        > Incidentally, if you still have the test code around, could you also try
                        >
                        >     while (n) {
                        >         *dst = *src;
                        >         ++src;
                        >         ++dst;
                        >         --n;
                        >     }

                        I retested and included the modification you suggested. Your
                        modification is always faster than the original while(n--) loop and is
                        roughly the same across all of the optimisation levels as the modification
                        that I made (worst performance is 17% slower than my mod at -O1 - an
                        aberration, since for all other cases their separation is a few percent -
                        and best performance is 5% faster at -O3 -march=pentium4).

                        > (And is there a difference between n--; and --n; on your system?)

                        I'm not sure about the general case - but I tested your modification above
                        with n-- and --n. There is a small variation that differs between the
                        optimisation levels - neither is consistently faster. The biggest
                        separation I found was post-decrement being about 3% faster at -O3
                        -march=pentium4. I repeated this test a few times to check that it wasn't
                        a one-off error due to system loading and the result was consistently
                        between 0.05% and 3%. The initial 3% result is probably not
                        accurate but there's no doubt that in this case the compiler generates
                        slightly faster code for post-decrement.

                        > Just to get the results from the same system as used for your original
                        > results. (Incidentally, how did they compare with the system-supplied
                        > memcpy? I believe gcc inlines that to assembler at some optimisation
                        > levels...)

                        Its execution time doesn't vary between the sizes I originally tested as
                        much as the other functions' times do. Nor is its performance affected by
                        optimisation level. With or without optimisations, it is always the
                        slowest function for sizes of 0..8 bytes. Without optimisations, from
                        about 16 bytes it starts consistently performing far better - eg at 40
                        bytes it is 150% faster than any other function. With optimisations it's
                        "in the mix" - not much better or worse than the others up to roughly 40
                        bytes and from then on it consistently beats them.

                        I tested for larger sizes at all optimisation levels:

                        At 80 bytes the library function was a minimum of 34% faster than any
                        other function (340% faster when optimisation switches were not used).

                        At 1024 bytes it was at least 270% faster (1400% faster without
                        optimisations).

                        At 10 kilobytes it was at least 400% faster.

                        At 100 kilobytes it was at least 65% faster. Also optimisations changed
                        its performance - it was fastest without optimisations and at -O1 it was
                        twice as slow as without optimisations.

                        At 1 megabyte things had evened out and it was roughly the same as the
                        others and in some cases slightly slower. It performed the same at all
                        optimisation levels.

                        At 10 and 100 megabytes I only tested for -O3 -march=pentium4 and again it
                        was roughly the same as the other functions.

                        >> So even though it's platform-specific I think that this test shows that
                        >> choosing between these loop constructions should be based on personal
                        >> preference as to readability - a performance benefit can't be assumed
                        >> for any particular style - unless you are developing for a particular
                        >> system for which you know one style is more performant than the others.
                        >
                        > Indeed. And bear in mind that it may change completely with the next
                        > version of the compiler, or switching to another compiler on the same
                        > platform. I've found that trusting the compiler and library writers to
                        > have picked the best optimisations is right most of the time...

                        Agreed - and if you _really_ need specific hard-core optimisations, don't
                        rely on the compiler except perhaps to use its output as a base - go with
                        assembly. That way the results aren't dependent on things beyond your
                        control like compiler code-generation.

