
char should be an alias for u8 #128

Open
chqrlie opened this issue Dec 13, 2024 · 2 comments

@chqrlie
Contributor

chqrlie commented Dec 13, 2024

This is a bit of a rabbit hole, but I think your choice to make char always signed is a problem:

Making char signed by default, as some historic versions of C did at a time when character sets were only 7 bits wide, is a congenital defect that is inconsistent with the behavior of many of the standard library functions:

  • memcmp, strcmp, strncmp... are explicitly specified as treating the elements of the character arrays as unsigned char.
  • getc, fgetc, getchar... may return all values of type unsigned char in addition to the negative EOF value.
  • ungetc fails when given a char value of '\xFF', because that value equals the negative EOF value when char is signed.
  • the isxxx and toxxx macros and functions defined in <ctype.h> have undefined behavior for negative char values except for the EOF value, which cannot be distinguished from '\xFF'.

Even more compelling: dealing with signed char is counter-intuitive for many programmers, who will mistakenly index arrays with char values, thereby invoking undefined behavior on negative values, which attackers can deliberately craft into input streams to exploit these bugs.

For example, in file c2_tokenizer.c2, you scan for the end of the identifier with an innocent looking loop:

    const char* end = t.cur + 1;
    while (Identifier_char[*end]) end++;

This would be a bug in C, but I am not sure about the precise semantics of this loop in C2, yet it is translated almost unchanged in bootstrap.c:

   const char* end = (t->cur + 1);
   while (c2_tokenizer_Identifier_char[*end]) end++;

There might be many other occurrences of this problem, which would not happen were char defined as an alias for u8.

This change would be consistent with the C2 philosophy: it should help avoid common mistakes.

@bvdberg
Member

bvdberg commented Dec 18, 2024

This is a whole can of worms you're opening :)
I've had this discussion with a lot of people. In my opinion:

  • char is used for strings. Technically you could use i8/u8 also, but using a char is more familiar to C programmers and makes it a bit clearer.
  • Only the first 128 ASCII values are standard, above that it becomes a mess. That's why I chose to make char signed. You can then use -1 to indicate some error and otherwise have a valid ASCII value. Unicode is a whole different story of course

@chqrlie
Contributor Author

chqrlie commented Dec 18, 2024

This is a whole can of worms you're opening :) I've had this discussion with a lot of people. In my opinion:

I know this is an old debate, but this issue is very dear to me. For char to be signed by default on many platforms is a side effect of an original design oversight and a source of bugs (such as the one shown above).

char is used for strings. Technically you could use i8/u8 also, but using a char is more familiar to C programmers and makes it a bit clearer.

char is absolutely the right type for character string elements; as a matter of fact, it should only be used for that. Prototypes from the C library would still use the char type, as these functions actually use unsigned semantics or have neutral behavior. Regarding C code generation, character strings should definitely still generate char-based objects; the simplest approach to get unsigned char semantics in generated code is to pass the option -funsigned-char to the compiler. All modern compilers have such an option.

Only the first 128 ASCII values are standard, above that it becomes a mess.

Yes, beyond the ASCII range, the byte values may have different meanings, depending on the character set, encoding, platform etc. But getchar() returns them as positive values, and strcmp and memcmp compare them based on their unsigned char value. char c = -1; ungetc(c, stdin) fails because -1 is also the value of EOF.

Note the expression "above that": don't we all intuitively assume positive values of 128 and above?

That's why I chose to make char signed. You can then use -1 to indicate some error and otherwise have a valid ASCII value.

Actually, this argument supports the opposite choice, unsigned char, so -1 can be distinguished from all char values. With signed char, -1 is a valid char value ('\377' and '\xFF' both evaluate to -1 when char is signed).

Unicode is a whole different story of course.

Of course, and UTF-8 is the way to go with 8-bit strings. What type do you advocate for Unicode code points? wchar_t is an architecture dependent mess. i32 or u32 are fine but a specific alias such as rune or char32 seems more explicit.

C2 is about making the language simpler and less error prone. char signedness is a tricky issue. I have stopped counting the Stack Overflow questions I answer where macros from <ctype.h> are misused with potential undefined behavior, or where char values are used to index arrays with similar problems. Making char an alias for u8 fixes these problems. Testing for characters outside the ASCII range can be done more explicitly with (c & 0x80) != 0 instead of c < 0.
