
char should be an alias for u8 #128

Open
chqrlie opened this issue Dec 13, 2024 · 2 comments

@chqrlie
Contributor

chqrlie commented Dec 13, 2024

This is a bit of a rabbit hole, but I think your choice to make char always signed is a problem:

Making char signed by default, as some historic versions of C did at a time when character sets were only 7 bits wide, is a congenital defect that is inconsistent with the behavior of many of the standard library functions:

  • memcmp, strcmp, strncmp... are explicitly specified as treating the elements of the character arrays as unsigned char.
  • getc, fgetc, getchar... may return all values of type unsigned char in addition to the negative EOF value.
  • ungetc fails when given a char value of '\xFF', because that value equals the negative EOF value when char is signed.
  • the isxxx and toxxx macros and functions defined in <ctype.h> have undefined behavior for negative char values except for the EOF value, which cannot be distinguished from '\xFF'.

Even more compelling: dealing with signed char is counter-intuitive for many programmers, who will mistakenly index arrays with char values, thereby invoking undefined behavior on negative values, which attackers can deliberately craft into input streams to exploit these bugs.

For example, in file c2_tokenizer.c2, you scan for the end of the identifier with an innocent looking loop:

    const char* end = t.cur + 1;
    while (Identifier_char[*end]) end++;

This would be a bug in C, but I am not sure about the precise semantics of this loop in C2, yet it is translated almost unchanged in bootstrap.c:

   const char* end = (t->cur + 1);
   while (c2_tokenizer_Identifier_char[*end]) end++;

There might be many other occurrences of this problem, which would not happen were char defined as an alias for u8.

This change would be consistent with the C2 philosophy: it should help avoid common mistakes.

@bvdberg
Member

bvdberg commented Dec 18, 2024

This is a whole can of worms you're opening :)
I've had this discussion with a lot of people. In my opinion:

  • char is used for strings. Technically you could use i8/u8 also, but using a char is more familiar to C programmers and makes it a bit clearer.
  • Only the first 128 ASCII values are standard, above that it becomes a mess. That's why I chose to make char signed. You can then use -1 to indicate some error and otherwise have a valid ASCII value. Unicode is a whole different story of course

@chqrlie
Contributor Author

chqrlie commented Dec 18, 2024

This is a whole can of worms you're opening :) I've had this discussion with a lot of people. In my opinion:

I know this is an old debate, but this issue is very dear to me. For char to be signed by default on many platforms is a side effect of an original design oversight and a source of bugs (such as the one shown above).

char is used for strings. Technically you could use i8/u8 also, but using a char is more familiar to C programmers and makes it a bit clearer.

char is absolutely the right type for character string elements; as a matter of fact, it should only be used for that. Prototypes from the C library would still use the char type, as these functions actually use unsigned semantics or have neutral behavior. Regarding C code generation, character strings should definitely still generate char-based objects; the simplest approach to get unsigned char semantics in generated code is to pass the option -funsigned-char to the compiler. All modern compilers have such an option.

Only the first 128 ASCII values are standard, above that it becomes a mess.

Yes, beyond the ASCII range, the byte values may have different meanings, depending on the character set, encoding, platform etc. But getchar() returns them as positive values, and strcmp and memcmp compare them based on their unsigned char value. char c = -1; ungetc(c, stdin) fails because -1 is also the value of EOF.

Note the expression "above that": don't we all intuitively assume positive values of 128 and above?

That's why I chose to make char signed. You can then use -1 to indicate some error and otherwise have a valid ASCII value.

Actually, this argument supports the opposite choice, unsigned char, so -1 can be distinguished from all char values. With signed char, -1 is a valid char value ('\377' and '\xFF' both evaluate to -1 when char is signed).

Unicode is a whole different story of course.

Of course, and UTF-8 is the way to go with 8-bit strings. What type do you advocate for Unicode code points? wchar_t is an architecture dependent mess. i32 or u32 are fine but a specific alias such as rune or char32 seems more explicit.

C2 is about making the language simpler and less error prone. char signedness is a tricky issue. I have stopped counting the Stack Overflow questions I answer where macros from <ctype.h> are misused with potential undefined behavior, or where char values are used to index arrays with similar problems. Making char an alias for u8 fixes these problems. Testing for characters outside the ASCII range can be done more explicitly with (c & 0x80) != 0 instead of c < 0.
