Implementing my own memcpy

  • Nils Weller

    #61
    [OT] Re: Implementing my own memcpy

    In article <42C5F9B9.CB6466B2@yahoo.com>, CBFalconer wrote:
    > Nils Weller wrote:
    >> Nils Weller wrote:
    >>
    >>> rc = read(buf, sizeof buf - 1, fd);
    >>
    >> Of course I had to goof this one!
    >>
    >> rc = read(fd, buf, sizeof buf - 1);
    >
    > I have no idea what it goes with, because your previous article was
    > much too long to read. :-) However, you have still goofed, because
    > there is no such standard function as 'read'. Look up fread, which
    > IS portable.

    And nobody claimed that read() is a standard C function. I explicitly
    commented the code as being Unix-specific in the previous, too long
    post. Moreover, the macro that triggered this sub-thread has also been
    pointed out to be Unix-specific, and there has been some talk about Unix
    kernel implementation and compatibility system software.

    Perhaps an OT tag was missing, but I think it is clear that we aren't
    talking about standard C anymore.

    --
    Nils R. Weller, Bremen / Germany
    My real email address is ``nils<at>gnulinux<dot>nl''
    ... but I'm not speaking for the Software Libre Foundation!


    • Dave Thompson

      #62
      Re: Implementing my own memcpy

      On Sat, 25 Jun 2005 12:04:24 -0400, Clark S. Cox III
      <clarkcox3@gmail.com> wrote:

      > On 2005-06-25 11:45:13 -0400, Netocrat <netocrat@dodo.com.au> said:
      <snip>
      > > I believe that there is no portable, generic way to copy a structure
      >
      > Of course there is; in fact, there are several:
      >
      > Assuming a and b are of the same complete type, any of the following
      > will copy the contents of a into b:
      >
      > #include <stdlib.h>

      Not actually needed for anything in this code. (size_t is in string.h)

      > #include <string.h>
      >
      > /*1*/ b = a;

      For complete _nonarray_ types.

      > /*2*/ memcpy(&b, &a, sizeof b);
      > /*3*/ memmove(&b, &a, sizeof b);
      > /*4*/ const unsigned char *src = (const unsigned char*)&a;
      >       unsigned char *dst = (unsigned char*)&b;
      >       for(size_t i=0; i<sizeof b; ++i)
      >       {
      >           dst[i] = src[i];
      >       }

      Rest for all complete types. And if you can determine the (a?) size by
      some other means than sizeof, even objects declared-not-defined with
      incomplete types.

      - David.Thompson1 at worldnet.att.net


      • Dave Thompson

        #63
        Re: Implementing my own memcpy

        On Sat, 25 Jun 2005 17:05:08 GMT, CBFalconer <cbfalconer@yahoo.com>
        wrote:
        <snip>
        > The void * type can point at arbitrary things, and a size_t can
        > specify a size on any machine. But to use void* you have to
        > convert to other types, thus:
        >
        > void *dupmem(void *src, size_t sz)
        > {
        >     unsigned char *sp = src;
        >     unsigned char *dst;
        >
        >     if (dst = malloc(sz))              /* memory is available */
        >         while (sz--) *dst++ = *sp++;   /* copy away */
        >     return dst;                        /* will be NULL for failure */

        return dst - sz, unless all your callers will (and must) adjust down
        the pointer before using it to access the memory, and free() it.

        > } /* dupmem, untested */
        >
        > Note how src is typed into sp, without any casts. Similarly the
        > reverse typing for the return value of dupmem. The usage will be,
        > for p some type of pointer:
        >
        Although it would be more informative, and convenient for some
        call(er)s, to declare src and sp as pointer to const void/uchar.

        > if (p = dupmem(whatever, howbig)) {
        >     /* success, carry on */
        > }
        > else {
        >     /* abject failure, panic */
        > }

        - David.Thompson1 at worldnet.att.net


        • Dave Thompson

          #64
          Re: Implementing my own memcpy

          On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.net> wrote:

          > >On Sat, 25 Jun 2005 18:31:30 +0000, Chris Torek wrote:
          > >> (you can also write the loop as "while (n--) *dst++ = *src++" but I find
          > >> the above easier to read and think about).
          >
          > In article <pan.2005.06.25.19.33.15.629285@dodo.com.au>
          > Netocrat <netocrat@dodo.com.au> wrote:
          > >I prefer the conciseness of the second, but I prefer even more testing
          > >against a maximum pointer.
          >
          > My thinking is perhaps colored by too many years of assembly coding
          > and instruction sets that include "decrement and branch if nonzero":
          >
          >         test    r3
          >         bz      Laround
          > Lloop:
          >         mov     (r1)+,(r2)+
          >         sobgtr  r3,Lloop        # cheating (but this is OK)
          > Laround:
          <snip>
          > and so on. (The first loop is VAX assembly, and "cheating" is OK
          > because r1 and/or r2 should never cross from P0/P1 space to S space,
          > nor vice versa, so the maximum block size never exceeds 2 GB; <snip>

          Not movb? Isn't the default word=long? Or is this some overambitious
          assembler that you (have to) tell about value types?

          Most (I think all but first two or so) models of PDP-11 also had
          sub-1-brback-ne (only) which they managed to publish as SOB before
          marketing caught them. PDP-6/10 already had a whole series of SOB*,
          but only SOBN or SOBG would do what you wanted here not SOB.
          (All 16 dyadic booleans are implemented, but SKIP doesn't; JUMP
          doesn't; the fastest jump varies but is never JUMP*; etc., etc.)

          ISTR 68k, which you also mentioned (snipped), also had a mildly
          offcolor opcode, somewhere else.

          - David.Thompson1 at worldnet.att.net


          • Dave Thompson

            #65
            Re: Implementing my own memcpy

            On Tue, 28 Jun 2005 03:05:48 +1000, Netocrat <netocrat@dodo.com.au>
            wrote:

            > Also C90 and C89 seem to be interchangeable terms - correct?
            >
            Effectively. C89 was the document developed "by" (under) ANSI, then
            submitted to "ISO" (already JTC1?) and adopted with technically
            identical contents but different numbering scheme and (I believe) some
            of the boilerplate about copyright, authority, and such. Thus if you
            want to refer to a clause number, as we fairly often do, you need to
            specify which; and if you had a lawsuit turning on compliance to one
            or the other standard you might have to produce that exact document to
            support your case. But as far as what a C implementation is required
            or permitted to do, and thus what a program(mer) can rely on or
            expect, they are interchangeable.

            In contrast C99 was voted first by "ISO" (as I understand it really
            SC22), and adopted as-is by ANSI (really NCITS? INCITS?).

            > Finally I understand that C90/C89 had some modifications made prior to C99
            > - where are those detailed?

            See FAQ 11.1 and .2 -- at least in the text version posted and online
            at usual places; the webized http://www.eskimo.com/~scs/C-faq/top.html
            has been out-of-date the last few times I checked and this is one of
            the points that has changed. But:
            - the statement about the Rationale was for only the original ANSI
            version C89, which is no longer (realistically) available;
            - it says Normative Addendum which I'm pretty sure should be
            Amendment; C90 plus that amendment is sometimes called C95
            - (several!) drafts of an updated Rationale for C99, as well as drafts
            of C99 itself (through n869) and C0X (n1124) can be gotten from the WG
            site which is now (renamed?) www.open-std.org/JTC1/SC22/WG14.
            (As well as other stuff you might be interested in, for that matter.)

            And for your further delectation and enjoyment, you could get the
            ~1600-page e-book by Derek M Jones discussed in another thread, which
            AFAICT-so-far exegizes the standard process, the resulting document,
            and the language specified in it, and more.

            If you actually want C90 instead of or in addition to C99, ANSI
            apparently no longer sells it, but webstore.ansi.org (still) lists DIN
            and AS adoptions of 9899:1990, and I'm guessing the latter might be
            available to you more conveniently.

            - David.Thompson1 at worldnet.att.net


            • Chris Torek

              #66
              Re: Implementing my own memcpy

              (Off-topic drift warning :-) )
              [color=blue]
              >On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.n et> wrote:[color=green]
              >> mov (r1)+,(r2)+[/color][/color]

              In article <c6ghc1dufoi95b ipm9qopb69llvq9 cs3qr@4ax.com>
              Dave Thompson <david.thompson 1@worldnet.att. net> wrote:[color=blue]
              >Not movb? Isn't the default word=long? Or is this some overambitious
              >assembler that you (have to) tell about value types?[/color]

              No, just a goof; it should have been "movb".

              >ISTR 68k, which you also mentioned (snipped), also had a mildly
              >offcolor opcode, somewhere else.

              I do not recall any from the 680x0 series, but the 1802 had several.

              Each register was 16 bits (I am almost certain, despite the 8-bit
              claim on the page referenced below), but the 8-bit opcodes could
              address only the high or low half of each register, so there was
              a "put low" and "put high" to write to each half, and the corresponding
              pair of "get"s. This meant the 1802 had GHI, the "get high"
              instruction.

              The 1802 also had two special registers named P (program counter)
              and X (index). However, neither P nor X were actual registers;
              instead, they were register *numbers*, pointing to one of the 16
              general-purpose registers. You had to use a "set p" or "set x"
              instruction to point the P and X indirection at the appropriate
              register. These had three-letter assembler mnemonics; the first
              was SEP, and the second was the now-obvious.

              (See also <http://shop-pdp.kent.edu/ashtml/as1802.htm>.)
              --
              In-Real-Life: Chris Torek, Wind River Systems
              Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
              email: forget about it http://web.torek.net/torek/index.html
              Reading email is like searching for food in the garbage, thanks to spammers.


              • CBFalconer

                #67
                Re: Implementing my own memcpy

                Dave Thompson wrote:
                > CBFalconer <cbfalconer@yahoo.com> wrote:
                > <snip>
                >> The void * type can point at arbitrary things, and a size_t can
                >> specify a size on any machine. But to use void* you have to
                >> convert to other types, thus:
                >>
                >> void *dupmem(void *src, size_t sz)
                >> {
                >>     unsigned char *sp = src;
                >>     unsigned char *dst;
                >>
                >>     if (dst = malloc(sz))              /* memory is available */
                >>         while (sz--) *dst++ = *sp++;   /* copy away */
                >>     return dst;                        /* will be NULL for failure */
                >
                > return dst - sz, unless all your callers will (and must) adjust down
                > the pointer before using it to access the memory, and free() it.

                That still doesn't fix my goof above. sz ends at 0. Try this:

                void *dupmem(void *src, size_t sz)
                {
                    unsigned char *sp = src;
                    unsigned char *dst, *p;

                    if (p = dst = malloc(sz))         /* memory is available */
                        while (sz--) *p++ = *sp++;    /* copy away */
                    return dst;                       /* will be NULL for failure */
                }
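
                A quick sanity check of that corrected version. The main()
                below is my own illustrative harness, not part of the
                original post; the test string is arbitrary.

                ```c
                #include <assert.h>
                #include <stdlib.h>
                #include <string.h>

                void *dupmem(void *src, size_t sz)
                {
                    unsigned char *sp = src;
                    unsigned char *dst, *p;

                    if (p = dst = malloc(sz))         /* memory is available */
                        while (sz--) *p++ = *sp++;    /* copy away */
                    return dst;                       /* will be NULL for failure */
                }

                int main(void)
                {
                    char msg[] = "copy me";
                    char *copy = dupmem(msg, sizeof msg);

                    assert(copy != NULL);
                    assert(strcmp(copy, msg) == 0);
                    assert(copy != (char *)msg);      /* a distinct allocation */
                    free(copy);
                    return 0;
                }
                ```

                Note that dst is returned untouched while p does the
                walking, which is exactly the fix for the original goof.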
                --
                "If you want to post a followup via groups.google.com, don't use
                the broken "Reply" link at the bottom of the article. Click on
                "show options" at the top of the article, then click on the
                "Reply" at the bottom of the article headers." - Keith Thompson



                • BGreene

                  #68
                  Re: Implementing my own memcpy

                  I apologize to the group, but I haven't heard "decrement and branch
                  if not zero" in many a year.

                  "Dave Thompson" <david.thompson1@worldnet.att.net> wrote in message
                  news:c6ghc1dufoi95bipm9qopb69llvq9cs3qr@4ax.com...
                  > On 25 Jun 2005 19:58:19 GMT, Chris Torek <nospam@torek.net> wrote:
                  <snip>



                  • Netocrat

                    #69
                    Re: Implementing my own memcpy

                    On Sat, 25 Jun 2005 19:58:19 +0000, Chris Torek wrote:

                    [a memcpy function in response to my buggy version]
                    > void *like_memcpy(void *restrict dst0, const void *restrict src0,
                    >                   size_t n) {
                    >     unsigned char *restrict dst = dst0;
                    >     const unsigned char *restrict src = src0;
                    >
                    >     if (n)
                    >         do
                    >             *dst++ = *src++;
                    >         while (--n != 0);
                    >     return dst0;
                    > }
                    >
                    > >On Sat, 25 Jun 2005 18:31:30 +0000, Chris Torek wrote:
                    > >> (you can also write the loop as "while (n--) *dst++ = *src++" but I
                    > >> find the above easier to read and think about).
                    >
                    > In article <pan.2005.06.25.19.33.15.629285@dodo.com.au> Netocrat
                    > <netocrat@dodo.com.au> wrote:
                    > >I prefer the conciseness of the second, but I prefer even more testing
                    > >against a maximum pointer.
                    >
                    > My thinking is perhaps colored by too many years of assembly coding and
                    > instruction sets that include "decrement and branch if nonzero":

                    <snip discussion to which I responded in a later post>

                    I was spurred to actually benchmark the different approaches on my
                    machine. It's a little over the top, but my belief is that it's not
                    really possible to predict which approach will be faster - even knowing
                    the machine's architecture you can't know what the compiler will do. So
                    to me these sorts of things are really a matter of personal preference.
                    So here is my attempt to back up that intuition at least on my machine.

                    I used the function quoted above, as well as the quoted proposed
                    alternative, and my function as fixed by Kevin Bagust:

                    > void *mem_cpy( void *dest, const void *src, size_t bytes ) {
                    >     unsigned char *destPtr = dest;
                    >     unsigned char const *srcPtr = src;
                    >     unsigned char const *srcEnd = srcPtr + bytes;
                    >
                    >     while ( srcPtr < srcEnd ) {
                    >         *destPtr++ = *srcPtr++;
                    >     }
                    >     return dest;
                    > }

                    I compiled at four of the levels of optimisation available on gcc (none,
                    -O1, -O2, -O3), and at each level performed two tests - with and without
                    -march=pentium4 (my machine architecture). I ran many iterations at each
                    of the sizes 0, 1, 2, 8, 25 and 80 bytes and timed the duration using
                    clock().

                    And the results?

                    At the unoptimised level, both of Chris's alternatives were equal.

                    In every other case the first of Chris's alternatives far outperformed the
                    second (by a minimum of 14% and maximum of 21%).

                    So I modified the 'alternative' expression from
                        while (n--) *dst++ = *src++;
                    to
                        while (n) {
                            *dst++ = *src++;
                            n--;
                        }

                    This brought the alternative function back close to the performance of the
                    original. I don't know why the degradation was occurring; presumably
                    something to do with one or more of the variables being decremented or
                    incremented one more time than necessary.

                    In the unoptimised case, my function outperformed Chris's functions by
                    about 15%. In all of the optimised cases, they were roughly equal -
                    varying from his performing 3% better than mine to mine performing 2%
                    better than his.

                    So even though it's platform-specific I think that this test shows that
                    choosing between these loop constructions should be based on personal
                    preference as to readability - a performance benefit can't be assumed for
                    any particular style - unless you are developing for a particular system
                    for which you know one style is more performant than the others.
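
                    The shape of the benchmark was roughly as follows. This is
                    a sketch only - the iteration count is illustrative and
                    just the while (n--) variant is shown, where the real test
                    compared all three loops:

                    ```c
                    #include <stdio.h>
                    #include <string.h>
                    #include <time.h>

                    #define ITERS 100000UL   /* illustrative iteration count */

                    /* the "while (n--)" style under test */
                    void *copy_postdec(void *dst0, const void *src0, size_t n)
                    {
                        unsigned char *dst = dst0;
                        const unsigned char *src = src0;

                        while (n--)
                            *dst++ = *src++;
                        return dst0;
                    }

                    int main(void)
                    {
                        static unsigned char src[80], dst[80];
                        static const size_t sizes[] = { 0, 1, 2, 8, 25, 80 };
                        size_t i;
                        unsigned long k;

                        for (i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
                            clock_t t0 = clock();

                            for (k = 0; k < ITERS; k++)
                                copy_postdec(dst, src, sizes[i]);
                            printf("%3lu bytes: %.3f s\n",
                                   (unsigned long)sizes[i],
                                   (double)(clock() - t0) / CLOCKS_PER_SEC);
                        }
                        return 0;
                    }
                    ```

                    Timing each size separately with clock() is crude but
                    enough to compare loop styles on one machine; absolute
                    numbers will vary with compiler, flags and hardware.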


                    • Chris Croughton

                      #70
                      Re: Implementing my own memcpy

                      On Sun, 10 Jul 2005 22:27:54 +1000, Netocrat
                      <netocrat@dodo.com.au> wrote:

                      > In every other case the first of Chris's alternatives far outperformed the
                      > second (by a minimum of 14% and maximum of 21%).
                      >
                      > So I modified the 'alternative' expression from
                      >     while (n--) *dst++ = *src++;
                      > to
                      >     while (n) {
                      >         *dst++ = *src++;
                      >         n--;
                      >     }
                      >
                      > This brought the alternative function back close to the performance of the
                      > original. I don't know why the degradation was occurring; presumably
                      > something to do with one or more of the variables being decremented or
                      > incremented one more time than necessary.

                      Some odd optimisation?

                      Incidentally, if you still have the test code around, could you also try

                      while (n) {
                          *dst = *src;
                          ++src;
                          ++dst;
                          --n;
                      }

                      (And is there a difference between n--; and --n; on your system?)

                      Just to get the results from the same system as used for your original
                      results. (Incidentally, how did they compare with the system-supplied
                      memcpy? I believe gcc inlines that to assembler at some optimisation
                      levels...)

                      > So even though it's platform-specific I think that this test shows that
                      > choosing between these loop constructions should be based on personal
                      > preference as to readability - a performance benefit can't be assumed for
                      > any particular style - unless you are developing for a particular system
                      > for which you know one style is more performant than the others.

                      Indeed. And bear in mind that it may change completely with the next
                      version of the compiler, or switching to another compiler on the same
                      platform. I've found that trusting the compiler and library writers to
                      have picked the best optimisations is right most of the time...

                      Chris C


                      • Netocrat

                        #71
                        Re: Implementing my own memcpy

                        On Sun, 10 Jul 2005 14:34:09 +0100, Chris Croughton wrote:
                        > On Sun, 10 Jul 2005 22:27:54 +1000, Netocrat
                        > <netocrat@dodo.com.au> wrote:
                        >
                        >> In every other case the first of Chris's alternatives far outperformed
                        >> the second (by a minimum of 14% and maximum of 21%).
                        >>
                        >> So I modified the 'alternative' expression from
                        >>     while (n--) *dst++ = *src++;
                        >> to
                        >>     while (n) {
                        >>         *dst++ = *src++;
                        >>         n--;
                        >>     }
                        >> This brought the alternative function back close to the performance of
                        >> the original. I don't know why the degradation was occurring;
                        >> presumably something to do with one or more of the variables being
                        >> decremented or incremented one more time than necessary.
                        >
                        > Some odd optimisation?

                        Anything's possible.

                        > Incidentally, if you still have the test code around, could you also try
                        >
                        >     while (n) {
                        >         *dst = *src;
                        >         ++src;
                        >         ++dst;
                        >         --n;
                        >     }

                        I retested and included the modification you suggested. Your
                        modification is always faster than the original while(n--) loop and is
                        roughly the same across all of the optimisation levels as the modification
                        that I made (worst performance is 17% slower than my mod at -O1 - an
                        aberration, since for all other cases their separation is a few percent -
                        and best performance is 5% faster at -O3 -march=pentium4).

                        > (And is there a difference between n--; and --n; on your system?)

                        I'm not sure about the general case - but I tested your modification above
                        with n-- and --n. There is a small variation that differs between the
                        optimisation levels - neither is consistently faster. The biggest
                        separation I found was post-decrement being about 3% faster at -O3
                        -march=pentium4. I repeated this test a few times to check that it wasn't
                        a one-off error due to system loading and the result was consistently
                        between 0.05% and 3%. The initial 3% result is probably not
                        accurate but there's no doubt that in this case the compiler generates
                        slightly faster code for post-decrement.

                        > Just to get the results from the same system as used for your original
                        > results. (Incidentally, how did they compare with the system-supplied
                        > memcpy? I believe gcc inlines that to assembler at some optimisation
                        > levels...)

                        Its execution time doesn't vary between the sizes I originally tested as
                        much as the other functions' times do. Nor is its performance affected by
                        optimisation level. With or without optimisations, it is always the
                        slowest function for sizes of 0..8 bytes. Without optimisations, from
                        about 16 bytes it starts consistently performing far better - eg at 40
                        bytes it is 150% faster than any other function. With optimisations it's
                        "in the mix" - not much better or worse than the others up to roughly 40
                        bytes and from then on it consistently beats them.

                        I tested for larger sizes at all optimisation levels:

                        At 80 bytes the library function was a minimum of 34% faster than any
                        other function (340% faster when optimisation switches were not used).

                        At 1024 bytes it was at least 270% faster (1400% faster without
                        optimisations).

                        At 10 kilobytes it was at least 400% faster.

                        At 100 kilobytes it was at least 65% faster. Also optimisations changed
                        its performance - it was fastest without optimisations and at -O1 it was
                        twice as slow as without optimisations.

                        At 1 megabyte things had evened out and it was roughly the same as the
                        others and in some cases slightly slower. It performed the same at all
                        optimisation levels.

                        At 10 and 100 megabytes I only tested for -O3 -march=pentium4 and again it
                        was roughly the same as the other functions.

                        >> So even though it's platform-specific I think that this test shows that
                        >> choosing between these loop constructions should be based on personal
                        >> preference as to readability - a performance benefit can't be assumed
                        >> for any particular style - unless you are developing for a particular
                        >> system for which you know one style is more performant than the others.
                        >
                        > Indeed. And bear in mind that it may change completely with the next
                        > version of the compiler, or switching to another compiler on the same
                        > platform. I've found that trusting the compiler and library writers to
                        > have picked the best optimisations is right most of the time...

                        Agreed - and if you _really_ need specific hard-core optimisations, don't
                        rely on the compiler except perhaps to use its output as a base - go with
                        assembly. That way the results aren't dependent on things beyond your
                        control like compiler code-generation.

