character sets

Alain_Bonnefoy2 · October 14, 2003, 11:59am

Hi,

I don’t understand how to set up my configuration so that I can use
accented characters in filenames.
Initially I wanted to view a mounted FAT32 filesystem through Samba. As I
wasn’t able to see filenames like ‘Démarrage’, I ran some tests on QNX
and encountered some strange behaviour.
-In bash, it’s impossible to input such characters.
-In sh, I can ‘touch démarrage’ (displayed correctly) but the ‘é’ isn’t
displayed in pfm.
-In Photon, I can create a ‘démarrage’ filename (displayed correctly) but
the ‘é’ isn’t displayed correctly in sh.
-Through Samba (2.2.4 - client character set = 850 as for our NT
stations) neither of the two files is displayed correctly.

Regards,
Alain.

Wojtek_Lerch1 · October 15, 2003, 5:29pm

Alain Bonnefoy <alain.bonnefoy@icbt.com> wrote:

I don’t understand how to set up my configuration so that I can use
accented characters in filenames.

Well, the problem is that a filename does not really consist of
“characters”. It’s just a string of bytes. What characters those bytes
represent – or even whether they represent any characters at all –
depends on how you have entered them and how you try to display them.

The QNX6 text-mode console and pterm use the ISO 8859-1 character set.
When you type ‘é’, a read() from the terminal returns the byte 0xE9.
When you run “touch démarrage” in a pterm, the argv[1] that the shell
gives to the touch program is equivalent to the C string “d\xE9marrage”.

The old QNX4 terminal emulation (“pterm -Q”) uses the IBM PC character
set (a.k.a. code page 437). The ‘é’ has the value 0x82 in this
character set, and typing “touch démarrage” gives the string
“d\x82marrage” to the touch utility.

Photon uses UTF-8. If you enter “démarrage” into a text field, the
string in the widget will be “d\xC3\xA9marrage”.
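
To make the three cases above concrete, here is a minimal Python sketch that encodes the same name in each character set and prints the bytes that would end up in the filename. Python is used only for illustration and is not part of the original discussion. The codec names (`latin-1`, `cp437`, `utf-8`) are Python's, and the helper `filename_bytes` is made up for this example.

```python
# Encode the same name the way each environment would hand it to touch.
NAME = "d\u00e9marrage"  # "démarrage"

# Label -> Python codec name (the labels follow the explanation above).
ENCODINGS = {
    "pterm / QNX6 console (ISO 8859-1)": "latin-1",
    "pterm -Q (code page 437)": "cp437",
    "Photon (UTF-8)": "utf-8",
}


def filename_bytes(name: str, encoding: str) -> bytes:
    """Return the byte string that would be stored as the filename."""
    return name.encode(encoding)


if __name__ == "__main__":
    for label, enc in ENCODINGS.items():
        print(f"{label}: {filename_bytes(NAME, enc)!r}")
```

The same nine characters give three different byte strings, so a tool that lists the directory has to know (or guess) which encoding was used to create each name.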

Initially I wanted to view a mounted FAT32 filesystem through Samba. As I
wasn’t able to see filenames like ‘Démarrage’, I ran some tests on QNX
and encountered some strange behaviour.
-In bash, it’s impossible to input such characters.

Yeah, bash doesn’t seem to like them…

-In sh, I can ‘touch démarrage’ (displayed correctly) but the ‘é’ isn’t
displayed in pfm.

That’s because “d\xE9marrage” is not valid UTF-8. I don’t know exactly
what our pfm or the file selector widget does when a filename turns out
not to be valid UTF-8; apparently, whatever it does isn’t working…
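
The claim that “d\xE9marrage” is not valid UTF-8 can be checked with a short Python sketch. The function name `is_valid_utf8` is made up for this example; it only tests whether a byte string decodes cleanly, and says nothing about what pfm actually does.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Name created in a pterm (ISO 8859-1) versus in Photon (UTF-8).
    for raw in (b"d\xe9marrage", b"d\xc3\xa9marrage"):
        print(raw, is_valid_utf8(raw))
```

The byte 0xE9 announces a three-byte UTF-8 sequence, but the next byte (‘m’) is not a continuation byte, so the decoder rejects the name.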

-In Photon, I can create a ‘démarrage’ filename (displayed correctly) but
the ‘é’ isn’t displayed correctly in sh.
-Through Samba (2.2.4 - client character set = 850 as for our NT
stations) neither of the two files is displayed correctly.

I know very little about Samba, but I think code page 850 has the ‘é’ in
the same spot as code page 437. Perhaps your filenames will look the
same way they do under Windows when you display (or enter) them in a
pterm running the QNX4 emulation (“pterm -Q”).
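
A small Python sketch can check the guess about code pages 850 and 437, using Python's own `cp850` and `cp437` codecs. The helpers `same_position` and `as_seen_in` are made up for this example, and nothing here models what Samba itself does with the names.

```python
def same_position(char: str, enc_a: str, enc_b: str) -> bool:
    """Return True if both encodings map the character to the same bytes."""
    return char.encode(enc_a) == char.encode(enc_b)


def as_seen_in(raw: bytes, encoding: str) -> str:
    """Decode a raw filename the way a viewer using `encoding` would show it."""
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(same_position("\u00e9", "cp437", "cp850"))
    # How a code page 850 client would read names created three different ways.
    for raw in (b"d\x82marrage", b"d\xe9marrage", b"d\xc3\xa9marrage"):
        print(raw, "->", as_seen_in(raw, "cp850"))
```

Only the name created under “pterm -Q” reads as “démarrage” when interpreted as code page 850, which fits the observation that neither the pterm file nor the Photon file looked right through Samba.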

Alain_Bonnefoy2 · October 16, 2003, 6:20am

Hi Wojtek,
It seems to be something like that. It’s surprising that pfm doesn’t take
care of character set conversion.
Regards,
Alain.

Wojtek_Lerch1 · October 17, 2003, 3:04pm

Alain Bonnefoy <alain.bonnefoy@icbt.com> wrote:

It seems to be something like that. It’s surprising that pfm doesn’t take
care of character set conversion.

Well, doing it right certainly wouldn’t be trivial. Pfm would have to
keep both the real filename and the guessed displayable UTF-8 filename
around, and use one or the other depending on the context. And it would
have to deal with the possibility that two different filenames could map
to the same UTF-8 string. And certain things, like the Rename dialog,
would be funny – if you enter “démarrage.txt” for the new name, should
pfm always use UTF-8 for the ‘é’ or should it try to make a guess based
on what character set pfm thinks the old name was using?

Of course, guessing what character set a filename was meant to be
displayed in would be the toughest part. Just because a filename looks
like valid UTF-8 doesn’t prove that it was originally created using UTF-8.
If it doesn’t look like valid UTF-8, that still doesn’t tell us much about
the character set it uses, especially if you take into account the
possibility of using non-standard character sets in pterm.
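
To show why guessing is hard, here is a minimal Python sketch that lists every candidate encoding under which a raw filename decodes without error. The function `plausible_encodings` and the candidate list are assumptions made for this example; a real tool would need a much larger list and some heuristics on top.

```python
CANDIDATES = ["utf-8", "latin-1", "cp437", "cp850"]


def plausible_encodings(raw: bytes, candidates=CANDIDATES) -> list:
    """Return the candidate encodings that decode `raw` without error."""
    matches = []
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue
        matches.append(enc)
    return matches


if __name__ == "__main__":
    for raw in (b"d\xc3\xa9marrage", b"d\xe9marrage"):
        print(raw, plausible_encodings(raw))
```

The single-byte character sets accept any byte sequence, so a name that is valid UTF-8 is also a legal (if odd-looking) ISO 8859-1 name, and a name that fails UTF-8 decoding still matches every one of the single-byte candidates.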

Still, I think it would be good if pfm at least tried to detect
filenames that aren’t valid UTF-8 and to do something reasonable about
them. I agree that assuming that the Photon library will do the right
thing is not the best way of dealing with it. I’ll add this to our bug
tracking database…
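
As a rough sketch of the “keep both names” idea, the Python below maps each raw filename to a displayable string (UTF-8 if valid, otherwise an assumed ISO 8859-1 fallback) while remembering the original bytes. It is not how pfm works; `display_name`, `build_listing` and the choice of fallback are assumptions made for this example.

```python
from collections import defaultdict


def display_name(raw: bytes, fallback: str = "latin-1") -> str:
    """Guess a displayable string: UTF-8 if valid, else the fallback charset."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


def build_listing(raw_names) -> dict:
    """Map each display string to the raw filenames that produce it."""
    listing = defaultdict(list)
    for raw in raw_names:
        listing[display_name(raw)].append(raw)
    return dict(listing)


if __name__ == "__main__":
    names = [b"d\xe9marrage", b"d\xc3\xa9marrage", b"readme"]
    for shown, raws in build_listing(names).items():
        print(shown, raws)
```

With the two files from this thread in one directory, both raw names map to the same display string, which is exactly the collision mentioned above: the file manager has to use the raw bytes for every filesystem call and cannot rely on the displayed text to identify a file.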

Alain_Bonnefoy2 · October 20, 2003, 6:33am

Well, maybe it would be easier if we assumed that the filesystem always
records filenames in ISO 8859-1.
In that case, we would just have to take the actual character set into
account to convert from the Photon character set to ISO 8859-1 and vice versa.

Alain.

Wojtek_Lerch1 · October 20, 2003, 2:18pm

Alain Bonnefoy <alain.bonnefoy@icbt.com> wrote:

Well, maybe it would be easier if we assumed that the filesystem always
records filenames in ISO 8859-1.

You mean, you’d like pfm to never even consider a filename to be UTF-8?

I imagine it would upset a lot of people if we told them that their
filenames can only contain characters that 8859-1 supports.

In that case, we would just have to take the actual character set into
account to convert from the Photon character set to ISO 8859-1 and vice versa.

Sure: assuming that the filenames are always encoded in 8859-1 would be
almost as simple as assuming that they’re always encoded in UTF-8. But
it would create at least as many problems as it would solve.
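
The limitation mentioned above is easy to demonstrate with a short Python sketch: a name typed in Photon can only be stored as ISO 8859-1 if every character exists in that set. The helper `fits_in_latin1` and the sample names are made up for this example.

```python
def fits_in_latin1(name: str) -> bool:
    """Return True if every character of the name exists in ISO 8859-1."""
    try:
        name.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "d\u00e9marrage",            # démarrage: fine in ISO 8859-1
        "\u0141\u00f3d\u017a.txt",   # Łódź.txt: Ł and ź are missing
        "\u20ac.txt",                # euro sign: not in ISO 8859-1
    ]
    for name in samples:
        print(name, fits_in_latin1(name))
```

Any name that fails this check would have to be rejected or mangled under an “always ISO 8859-1” rule, whereas an “always UTF-8” rule fails the other way round, on names created in a pterm.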