Re: getc and "large" bytes
My case for distinguishability was in the part you snipped,
labeled "1a)". It derives from the Standard's requirement that
bytes read back from a binary stream must compare equal to those
written to it (on the same implementation, not counting trailing
zeroes, et cetera). If there are fewer `int' values than there
are `unsigned char' values, then by the pigeonhole principle there
must be at least one collision where two distinct `unsigned char'
values V1 and V2 convert to the same `int' value. Then this
code fragment

    putc(V1, stream);
    putc(V2, stream);
    rewind(stream);
    assert(getc(stream) == V1);
    assert(getc(stream) == V2);

... cannot succeed. (Yes, I know, it's very bad to generate
side-effects in an assert(), but this is just for illustration.)
"Upon further review," as they say in American football, I
guess an implementation could choose to report an I/O error if
it ever encountered V2, say, on input. (If "helpful," it would
also report an error for any attempt to write V2.) That would
give an extremely low QoI, but the Standard does not forbid I/O
operations from failing "predictably." (Indeed, on many systems
fopen("/", "w") will fail predictably.) So perhaps a sufficiently
bad implementation could in fact claim conformance even if unable
to read and write all `unsigned char' values, and this would allow
signed magnitude and ones' complement (and two's complement with
one trap representation).
And, of course, no argument based on the behavior of getc()
has any force for freestanding implementations.
--
Eric.Sosman@sun.com
Keith Thompson wrote:
Eric Sosman <Eric.Sosman@sun.com> writes:
[...]
> It seems to me that the behavior required of getc() places
> far-reaching requirements on implementations where `int' and
> `char' have the same width. Here are a few:
>
> 1) Since `unsigned char' can represent 2**N distinct values
> and all of these must be distinguishable when converted to `int',
> it follows that `int' must also have 2**N distinct values. Thus,
> signed-magnitude and ones' complement representations are ruled
> out, and INT_MIN must have its most negative possible value
> (that is, INT_MIN == -INT_MAX - 1, all-bits-set cannot be a trap
> representation).
>
How do you conclude that all 2**N distinct values of type unsigned
char must be distinguishable when converted to int? The result of the
conversion is implementation-defined. If, for example, int has the
range -32768 .. +32767, and unsigned char has the range 0 .. 65535, I
see nothing in the standard that forbids converting all unsigned char
values greater than 32767 to 32767 (saturation). It would break
stdio, but I'm not convinced that that would make it non-conforming
(particularly for a freestanding implementation that needn't provide
stdio).