ZScript: String.CharAt does not like Unicode characters

Post by **Player701** » Fri Apr 12, 2019 12:18 pm

Thanks, but I'd rather wait until this becomes part of the official API of ZScript. According to what Graf has said, it seems that the engine is already capable of doing this, but the corresponding methods should be exported so that they can be used on the script side.

Regarding the use of strings to read binary data: if I ever needed to do anything like it, I'd never think of using strings for that! Strings are a tool to work with text, an abstraction in other words, and their underlying binary contents constitute implementation details of this abstraction. It is generally a bad idea to rely on the underlying implementation to stay unchanged. There should be a separate API for working with binary data, and if it doesn't exist at the moment, I will either request that it is implemented, or try to code it myself and make a pull request - provided that I ever have a need to work with binary data in a mod, of course - right now I don't see a potential use case for that, but it of course doesn't mean that none exist.

Post by **gramps** » Fri Apr 12, 2019 12:44 pm

Again, it depends on the language. Lua and Ruby both read binary files as strings with their standard file-reading functions. A string is just a sequence of bytes in those languages; it's the most appropriate data structure for the job.

In Python, a file opened in text mode is read into a string, and a file opened in binary mode is read into a "bytes object."
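As a small illustration of the Python behaviour described above, here is a sketch that writes a few bytes to a temporary file and reads them back in both modes. The file contents and variable names are made up for the example.

```python
import os
import tempfile

# Bytes that happen to be valid UTF-8 for "héllo" (é is U+00E9, two bytes).
payload = "h\u00e9llo".encode("utf-8")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # Binary mode: the result is a bytes object, one element per byte.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: the bytes are decoded and the result is a str of code points.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

print(type(raw).__name__, len(raw))    # bytes 6
print(type(text).__name__, len(text))  # str 5
```

The same file yields two different lengths depending on whether it is treated as bytes or as text, which is the distinction the rest of the thread keeps running into.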

I don't know what the most appropriate data structure in zscript is, but array support is limited. Trying to impose a single philosophy across all languages about what strings are / should be doesn't make a lot of sense; strings are handled differently in different languages, not all languages have byte arrays, etc.

Post by **Player701** » Fri Apr 12, 2019 12:58 pm

The problem is that it all used to fit together nicely in ZScript until UTF-8 came along. Maybe using a string as a byte array in the past was indeed justifiable. But right now it looks like one awful gross hack, at least to me. I guess that other people may see it differently, though. I also understand that UTF-8 support may not have been considered from the start, but it probably should have been. Anyway, since 4.0.0 is already out, it is quite pointless to discuss this now, when these design decisions have already been made. Just please give us an official API to work with UTF-8 characters, that's all.

Post by **Graf Zahl** » Fri Apr 12, 2019 1:21 pm

Player701 wrote:I also understand that UTF-8 support may not have been considered from the start, but it probably should have been.

How? This was never on the table until two months ago. And generally speaking, even with proper code point extraction you still haven't won. What if Arabic support gets added, for example, or decomposed characters, control sequences or other things UTF-8 allows? All of these share the same property that one code point is not necessarily one character. Some code points map to nothing, other sequences need multiple code points to form one character. Some existing code points may require multiple font glyphs to be rendered, and so on, and so on. If you try to read Unicode with the same naive approach that is inherited from ASCII, you may be in for a nasty surprise eventually. On native code all this can be fixed once the feature is needed - on the script side you'd be stuck with a bad algorithm if the feature set gets extended and what may work now will most likely break then.
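To make the "one code point is not necessarily one character" point concrete, here is a minimal Python sketch (not ZScript, and not engine code) that uses the standard `unicodedata` module to compare the precomposed and decomposed forms of the same visible character.

```python
import unicodedata

composed = "\u00e9"                                  # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # e followed by U+0301 (combining acute)

print(len(composed))           # 1
print(len(decomposed))         # 2
print(composed == decomposed)  # False, although both render as the same character
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Code that counts or indexes code points would treat these two strings differently even though a reader sees the same letter.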

The general advice with Unicode is normally not to process strings by character at any level higher than actually printing the text to the screen. What are you trying to do anyway?

Post by **gramps** » Fri Apr 12, 2019 1:27 pm

It would have most likely been an iterator-style API anyway, even if utf-8 support had been planned from the start.

The problem with utf-8 is what Graf mentioned earlier; not all characters are the same number of bytes. That means an API similar to the one we have now just doesn't work for utf-8 without shoehorning it in.

Getting the number of characters in a string with .length, getting the i-th through j-th characters with .mid, and so on all become much less efficient, and should be avoided. It's a reasonable and useful abstraction for working with bytes (and for ascii strings, which you can still use of course) -- but it's the wrong abstraction for utf-8 characters.
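The difference between byte length and character count, and the iterator style mentioned above, can be sketched in Python. The helpers `next_code_point` and `code_points` are hypothetical names for this example only. They are not the ZScript API, and they assume well-formed UTF-8 input.

```python
def next_code_point(data: bytes, pos: int):
    """Decode one UTF-8 sequence starting at byte offset pos.

    Returns (code_point, next_pos). Hypothetical helper: it assumes the
    input is well-formed UTF-8 and does not validate continuation bytes.
    """
    first = data[pos]
    if first < 0x80:              # 0xxxxxxx: 1 byte (ASCII)
        return first, pos + 1
    if first >> 5 == 0b110:       # 110xxxxx: 2 bytes
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:    # 1110xxxx: 3 bytes
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:   # 11110xxx: 4 bytes
        length, cp = 4, first & 0x07
    else:
        raise ValueError(f"invalid UTF-8 lead byte at offset {pos}")
    for b in data[pos + 1:pos + length]:
        cp = (cp << 6) | (b & 0x3F)   # each continuation byte adds 6 bits
    return cp, pos + length


def code_points(data: bytes):
    """Yield every code point; the loop ends when pos reaches len(data)."""
    pos = 0
    while pos < len(data):
        cp, pos = next_code_point(data, pos)
        yield cp


data = "Привет!".encode("utf-8")
print(len(data))                     # 13 (bytes)
print(len(list(code_points(data))))  # 7 (code points)
```

Finding the i-th code point this way means walking the string from the start, which is why index-based access stops being cheap once characters have variable width.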

Post by **Player701** » Fri Apr 12, 2019 1:32 pm

Graf Zahl wrote:How? This was never on the table until two months ago.

Well, I haven't seen any software without Unicode support for a while. Almost everything I use appears to be able to work with it. It was only natural that GZDoom would eventually get this support as well, wasn't it?

gramps wrote:Getting the number of characters in a string with .length, getting the i-th through j-th characters with .mid, and so on all become much less efficient, and should be avoided. It's a reasonable and useful abstraction for working with bytes, (and for ascii strings, which you can still use of course) -- but it's the wrong abstraction for utf-8 characters.

True, but there is still a need for an official API. Using only ASCII characters is out of the question when strings come from a language lump.

BTW, it would also be interesting to know why UTF-8, and not UTF-16 was chosen. From what I've gathered, it seems that the latter uses a fixed 16 bits per character, so there wouldn't be a need for any iterators... Although I guess it was probably due to the same kind of backwards compatibility issues...

Post by **Chris** » Fri Apr 12, 2019 2:29 pm

Player701 wrote:Well, I haven't seen any software without Unicode support for a while. Almost everything I use appears to be able to work with it. It was only natural that GZDoom would eventually get this support as well, wasn't it?

Eventually, but don't forget the code is derived from a game made in the early 90s. You can't just start being unicode-aware in a codebase when you're not sure what unicode support will look like from an (internal and public) API point of view.

gramps wrote:Getting the number of characters in a string with .length, getting the i-th through j-th characters with .mid, and so on all become much less efficient, and should be avoided. It's a reasonable and useful abstraction for working with bytes, (and for ascii strings, which you can still use of course) -- but it's the wrong abstraction for utf-8 characters.

Player701 wrote:True, but there is still a need for an official API. Using only ASCII characters is out of the question when strings come from a language lump.

There's still the question of what you need it for, and what you're expecting to do. As Graf alluded to, the way unicode text is composed doesn't lend itself well to poking at individual characters or code points for random reasons (as these accessor functions would be for).

Player701 wrote:BTW, it would also be interesting to know why UTF-8, and not UTF-16 was chosen. From what I've gathered, it seems that the latter uses a fixed 16 bits per character, so there wouldn't be a need for any iterators... Although I guess it was probably due to the same kind of backwards compatibility issues...

You need more than 16 bits. CJK (Asian) text will quickly break if you assume 16 bits is enough for everyone, for instance. So you'll still have the complexity of code points potentially needing multiple wide-char elements, and be wasting a byte for every character that fits under 0x7f (which is the most common). With UTF-8, you can use just as many bytes as you need, for the most part.
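A quick Python sketch of the size comparison, using Python's built-in codecs. The function name `encoded_sizes` and the sample code points are chosen for illustration only.

```python
def encoded_sizes(cp: int):
    """Return (UTF-8 bytes, UTF-16 bytes, UTF-16 code units) for one code point."""
    ch = chr(cp)
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_bytes = len(ch.encode("utf-16-le"))   # little-endian, no byte order mark
    return utf8_bytes, utf16_bytes, utf16_bytes // 2


# ASCII letter, Cyrillic letter, common CJK ideograph, CJK Extension B ideograph
for cp in (0x41, 0x42F, 0x4E2D, 0x20000):
    print(f"U+{cp:04X}", encoded_sizes(cp))
# U+0041 (1, 2, 1)
# U+042F (2, 2, 1)
# U+4E2D (3, 2, 1)
# U+20000 (4, 4, 2)
```

The last line is the case in question: a code point above U+FFFF takes two 16-bit units (a surrogate pair) in UTF-16, so UTF-16 is a variable-width encoding too.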

Post by **Player701** » Fri Apr 12, 2019 2:36 pm

Chris wrote:There's still the question of what you need it for, and what you're expecting to do. As Graf alluded to, the way unicode text is composed doesn't lend itself well to poking at individual characters or code points for random reasons (as these accessor functions would be for).

There is a link to an example method in one of my previous posts. This method does not work correctly in 4.0.0 if the string that is passed to it contains multibyte characters. Here it is again so that you don't have to search for it.

I can think of other use cases, of course. For example, suppose I need to capitalize the first letter of a word, or pad a string with whitespace characters to force it to a certain length. It is currently not possible to do such things with only the built-in API methods, and since the engine already supports this functionality but just doesn't export it (again, according to Graf's words), it needs to be part of the API. It doesn't matter what it looks like, as long as it is possible to achieve the desired results with it.

Post by **Graf Zahl** » Fri Apr 12, 2019 3:14 pm

Player701 wrote: BTW, it would also be interesting to know why UTF-8, and not UTF-16 was chosen. From what I've gathered, it seems that the latter uses a fixed 16 bits per character, so there wouldn't be a need for any iterators... Although I guess it was probably due to the same kind of backwards compatibility issues...

Simple reason: UTF-8 doesn't require a rewrite of the entire text processing chain from top to bottom. It's compatible enough with ASCII so that the vast majority of the existing code could be left unchanged. Trying to swap out the string class would have been way too much work.
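The ASCII compatibility can be sketched in Python as follows. The key/value string is made-up sample data, not anything from the engine. The point is that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so byte-oriented code that only looks for ASCII delimiters keeps working unchanged.

```python
# Pure ASCII text encodes to exactly the same bytes in ASCII and UTF-8.
ascii_text = "DrawString"
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))  # True

# Bytes of multi-byte sequences never fall in the ASCII range.
print(all(b >= 0x80 for b in "имя".encode("utf-8")))  # True

# So splitting on ASCII delimiters at the byte level cannot cut a character in half.
raw = "имя=Player701;язык=ru".encode("utf-8")
fields = [pair.split(b"=") for pair in raw.split(b";")]
decoded = {key.decode("utf-8"): value.decode("utf-8") for key, value in fields}
print(decoded)  # {'имя': 'Player701', 'язык': 'ru'}
```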

Post by **Chris** » Fri Apr 12, 2019 3:24 pm

Player701 wrote:There is a link to an example method in one of my previous posts. This method does not work correctly in 4.0.0 if the string that is passed to it contains multibyte characters. Here it is again so that you don't have to search for it.

The method is flawed: it won't work in scripts that need more than correct spacing to be monospace. See Figure 6.

"In Arabic and other scripts, text inside fixed margins is justified by elongating the horizontal parts of certain glyphs, rather than by expanding the spaces between words. Ideally this is implemented by changing the shape of the glyph depending on the desired width. On some systems, this stretching is approximated by inserting extra connecting, dash-shaped glyphs called kashidas, as shown in Figure 6. In such a case, a single character may conceivably correspond to a whole sequence of kashidas + glyphs + kashidas."

Player701 wrote:I can think of other use cases, of course. For example, suppose I need to capitalize the first letter of a word, or pad a string with whitespace characters to force it to a certain length.

Capitalization rules vary wildly across languages. In some cases, capitalization affects how multiple letters are written, not just the first. This is actually one of the bigger arguments against case-insensitive file systems: properly handling case (to change or ignore it) is a lot more complicated than a simple per-character substitution.
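One well-known case, sketched in Python: the German ß uppercases to two letters, so a per-character substitution cannot produce the right result. `naive_upper` is a hypothetical helper standing in for the ASCII-era approach.

```python
def naive_upper(text: str) -> str:
    """ASCII-style uppercasing: shift a-z by 32, leave everything else alone."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


word = "straße"
print(naive_upper(word))   # STRAßE
print(word.upper())        # STRASSE
print(len(word), len(word.upper()))  # 6 7
```

Python's `str.upper()` applies Unicode case mappings, and the result is one character longer than the input, so even the string length is not preserved.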

Forcing text to a specific length sounds more like a hack to work around a separate issue. To me, it would make more sense to be able to draw text right where it needs to be, rather than force padding with invisible characters in the hopes it lines up correctly.

Player701 wrote:It is currently not possible to do such things with only the built-in API methods, and since the engine already supports this functionality but just doesn't export it (again, according to Graf's words), it needs to be part of the API.

The engine supports it as far as it currently needs to. The difference is, the engine can be changed to add or fix support for features as it needs them, and it doesn't need to worry about anything but itself. But once it's exported to ZScript, it needs to be maintained as-is indefinitely, or else mods will break. Having to maintain flawed/broken code to prevent other things from breaking is not a nice situation to be in.

Post by **gramps** » Fri Apr 12, 2019 9:00 pm

I'm still not really seeing the advantage of doing it engine-side over letting users create a library to support our own needs, letting that mature, and then maybe integrating something similar into the engine once utf-8 support has a little time to settle.

To the extent that the examples like monospacing and uppercasing can be handled by a general purpose utf-8 class engine-side, what would be the benefit of doing it that way over letting the community do it as a separate library? Wouldn't a simple iterator like the example I posted earlier handle a majority of the cases that can actually be handled?

Post by **Graf Zahl** » Fri Apr 12, 2019 11:43 pm

The problem with letting this be handled by the community is that it will produce lots of broken code and no means to fix released mods.
Unicode handling is a minefield that seriously cannot be left to amateurs. Regarding monospacing, that really (*REALLY*) needs to be made part of the DrawString function. Even engine-internally, it's currently hacked in in the places that have it, most importantly the status bar's text drawer.

Post by **Player701** » Sat Apr 13, 2019 12:33 am

Graf Zahl wrote:The problem with letting this be handled by the community is that it will produce lots of broken code and no means to fix released mods.
Unicode handling is a minefield that seriously cannot be left to amateurs. Regarding monospacing, that really (*REALLY*) needs to be made part of the DrawString function. Even engine-internally, it's currently hacked in in the places that have it, most importantly the status bar's text drawer.

Yes, this is exactly why I want the API to be part of the engine. Same about monospacing. Yes, my method is largely a workaround and it indeed might not work correctly with more exotic characters. But right now it doesn't even work with, say, Cyrillic characters. Same problem with capitalization.

Regarding padding: the main problem is that I need to know the lengths of my strings to position them correctly. Whether this is done by forcing them all to the same length with invisible characters or by specifying the correct positions from the start does not really matter. It is also possible that I might need padding with non-invisible characters as well (say, "*" or "-") in certain cases.
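The padding problem with Cyrillic text can be sketched in Python. Both helpers are hypothetical and only illustrate the difference between padding by byte length and padding by code point count. As noted earlier in the thread, code point count is still not the same as on-screen width for every script.

```python
def pad_by_bytes(text: str, width: int, fill: str = " ") -> str:
    """Naive padding that treats the UTF-8 byte length as the text length."""
    missing = width - len(text.encode("utf-8"))
    return text + fill * max(missing, 0)


def pad_by_code_points(text: str, width: int, fill: str = " ") -> str:
    """Padding that counts code points instead of bytes."""
    missing = width - len(text)
    return text + fill * max(missing, 0)


for label in ("Health", "Здоровье"):
    print(pad_by_bytes(label, 12, "-"), "|", pad_by_code_points(label, 12, "-"))
# Health------ | Health------
# Здоровье | Здоровье----
```

The ASCII label is padded correctly either way, while the Cyrillic one already occupies 16 bytes and so receives no padding at all from the byte-based version.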

Post by **Graf Zahl** » Sat Apr 13, 2019 1:50 am

Proper handlers have been added for extracting code points from strings.

Post by **Player701** » Sat Apr 13, 2019 1:58 am

Thank you very much.

Question: Is there a way to know when GetNextCodePoint has reached the end of the string? Or do I have to use CodePointCount for that?

Edit: Also, how do I convert the resulting integers to strings to pass them to Screen.DrawText / BaseStatusBar.DrawString etc. ?
