`char` should be an alias for `u8` #128
Comments
This is a whole can of worms you're opening :)
I know this is an old debate, but this issue is very dear to me. For
Yes, beyond the ASCII range, the byte values may have different meanings, depending on the character set, encoding, platform, etc. But Note the expression above that: don't we all intuitively assume positive values 128 and above?
Actually, this argument supports the opposite choice,
Of course, and UTF-8 is the way to go with 8-bit strings. What type do you advocate for Unicode code points? C2 is about making the language simpler and less error prone.
This is a bit of a rabbit hole, but I think your choice to make `char` always signed is a problem:

Making `char` signed by default in some historic versions of C, at a time when character sets were only 7 bits wide, is a congenital defect that is inconsistent with the behavior of many of the standard library functions:

- `memcmp`, `strcmp`, `strncmp`... are explicitly specified as treating the elements of the character arrays as `unsigned char`.
- `getc`, `fgetc`, `getchar`... may return all values of type `unsigned char` in addition to the negative `EOF` value.
- `ungetc` fails when given a `char` value of `'\xFF'` (with a signed `char`, that value is typically `-1`, which is the usual value of `EOF`).
- The `isxxx` and `toxxx` macros and functions defined in `<ctype.h>` have undefined behavior for negative `char` values except for the `EOF` value, which cannot be distinguished from `'\xFF'`.

Even more compelling: dealing with signed `char` seems counter-intuitive for many programmers, who will mistakenly index arrays with `char` values, thereby invoking undefined behavior on negative values, which may be crafted in input streams to exploit these bugs.

For example, in file c2_tokenizer.c2, you scan for the end of the identifier with an innocent looking loop:
This would be a bug in C. I am not sure about the precise semantics of this loop in C2, but it is translated almost unchanged in bootstrap.c:
There might be many other occurrences of this problem, which would not happen were `char` defined as an alias for `u8`.

This change would be consistent with the C2 philosophy: it should help to avoid common mistakes.
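To make the class of bug concrete, here is a minimal C sketch. It is not the code from c2_tokenizer.c2 or bootstrap.c: the table `is_ident_char` and the functions `init_table`, `index_from_signed`, `index_from_unsigned`, `ff_collides_with_eof` and `ident_length` are hypothetical names invented for illustration. It assumes a two's-complement target, the default "C" locale and `EOF == -1`, which holds on common platforms but is not guaranteed by the standard.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical character-class table with one entry per byte value.
 * Illustrative only; not taken from the C2 sources. */
static unsigned char is_ident_char[UCHAR_MAX + 1];

/* Fill the table. The <ctype.h> call is safe here because c is always
 * in the range 0..UCHAR_MAX. */
void init_table(void) {
    for (int c = 0; c <= UCHAR_MAX; c++) {
        is_ident_char[c] = (unsigned char)(isalnum(c) || c == '_');
    }
}

/* Index a signed char would produce: negative for bytes 0x80..0xFF,
 * so using it as an array subscript would read out of bounds. */
int index_from_signed(signed char c) {
    return c;
}

/* Index an unsigned byte (a u8-like char) produces: always 0..255. */
int index_from_unsigned(unsigned char c) {
    return c;
}

/* Byte 0xFF stored in a signed char becomes -1 on two's-complement
 * targets, which compares equal to EOF when EOF is -1 (assumption). */
int ff_collides_with_eof(void) {
    signed char c = (signed char)0xFF;
    return c == EOF;
}

/* Safe identifier scan: every byte is converted to unsigned char before
 * it is used as a subscript. The loop stops at the terminating NUL
 * because is_ident_char[0] is 0. */
size_t ident_length(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (is_ident_char[(unsigned char)s[n]]) {
        n++;
    }
    return n;
}
```

Call `init_table()` once before `ident_length()`. With a `char` that is an alias for `u8`, the cast in `ident_length` would be unnecessary and the hazards shown by `index_from_signed` and `ff_collides_with_eof` would not arise.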