fs-dos long filename creation failure?

I guess this is already in bugreport, but

I was really pissed to find that, on a FAT filesystem,
% tar cvf tmp.tar windows
% rm -fr windows
% tar xvpf tmp.tar
doesn’t give back the same result.

Looks like UTF-8’ed long filenames are written somewhat incorrectly
on the filesystem. You won’t notice it in directories that contain
only English filenames; only the localized filenames
(Japanese kana and kanji in my case) were screwed up, and Windows
was unable to rename, delete, or scandisk them.

In case someone tries to back up Windows from QNX – it won’t work for
localised filenames!

kabe

kabe@sra-tohoku.co.jp wrote:
: I guess this is already in bugreport, but
: Looks like UTF-8’ed long filenames are written somewhat incorrect
: on the filesystem. You won’t notice it on English-only-filename
: filled directories; only localized filenames

Well, as I don’t have a Windows system, let alone a localised-non-US
one, I always suspected it wouldn’t work, but you’re probably the first
person to actually notice this. I can see all sorts of issues with
processing of the (non-ASCII) multibyte names and 8.3 munging, but it
would help to know exactly what you see. Are files missing in the
archive? Are they missing when restored? Are they incorrectly named
in the archive or when restored? Is there a pattern to the mis-naming
(short, long, starting with certain characters, etc)? Thanks …

I was tinkering with the LFN and FAT entries all day, and it turned out
this is a Win95-specific issue.

To make a long story short: “don’t put UTF-8 in 8.3”.

in <9u4bkk$1ah$1@nntp.qnx.com> jgarvey@qnx.com wrote:

: kabe@sra-tohoku.co.jp wrote:
: : I guess this is already in bugreport, but
: : Looks like UTF-8’ed long filenames are written somewhat incorrect
: : on the filesystem. You won’t notice it on English-only-filename
: : filled directories; only localized filenames
:
: person to actually notice this. I can see all sorts of issues with
: processing of the (non-ASCII) multibyte names and 8.3 munging, but it
: would help to know exactly what you see. Are files missing in the
: archive? Are they missing when restored? Are they incorrectly named
: in the archive or when restored? Is there a pattern to the mis-naming
: (short, long, starting with certain characters, etc)? Thanks …

  • The tar archive is created correctly, with UTF-8 filenames.
    I confirmed this with a binary dump and by untarring it on Solaris.

  • It looks correct after restore in QNX File Manager.

  • By hexdumping the FAT entries, I confirmed that at least the LFN is
    restored correctly, including checksums.

The localised “Program” filename,
Unicode FF8C FF9F FF9B FF78 FF9E FF97 FF91,
was stored in the FAT directory entries as:

00004A40 41 8C FF 9F FF 9B FF 78:FF 9E FF 0F 00 5A 97 FF|A......x.....Z..|
00004A50 91 FF 00 00 FF FF FF FF:FF FF 00 00 FF FF FF FF|................|
00004A60 EF BE 8C EF BE 9F 7E 31:20 20 20 20 00 00 00 00|......~1    ....|
00004A70 00 00 00 00 00 00 88 2E:87 2B 03 00 0A 00 00 00|.........+......|

(The 8.3 name is (EF BE 8C)(EF BE 9F)~1, i.e. the raw UTF-8 bytes of the
first two characters, FF8C and FF9F, followed by the ~1 tail.)
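
For anyone who wants to check the dump without counting bytes by hand, here
is a minimal Python sketch. It assumes the usual VFAT layout (13 UCS-2 units
per LFN entry at offsets 1-10, 14-25 and 28-31, checksum byte at offset 13)
and the usual rotate-right-and-add checksum over the 11-byte short name.
The names LFN_ENTRY, SFN_ENTRY, lfn_units and lfn_checksum are mine and have
nothing to do with the fs-dos source.

```python
import struct

# The two 32-byte directory entries, copied from the hexdump above.
LFN_ENTRY = bytes.fromhex(
    "41 8C FF 9F FF 9B FF 78 FF 9E FF 0F 00 5A 97 FF"
    "91 FF 00 00 FF FF FF FF FF FF 00 00 FF FF FF FF"
)
SFN_ENTRY = bytes.fromhex(
    "EF BE 8C EF BE 9F 7E 31 20 20 20 20 00 00 00 00"
    "00 00 00 00 00 00 88 2E 87 2B 03 00 0A 00 00 00"
)


def lfn_units(entry):
    """Return the UCS-2 code units stored in one LFN entry (assumed layout)."""
    raw = entry[1:11] + entry[14:26] + entry[28:32]  # 5 + 6 + 2 units
    units = []
    for unit in struct.unpack("<13H", raw):
        if unit == 0x0000:  # terminator; the rest is 0xFFFF padding
            break
        units.append(unit)
    return units


def lfn_checksum(short11):
    """Rotate-right-by-one-and-add checksum over the 11-byte 8.3 field."""
    total = 0
    for byte in short11:
        total = (((total & 1) << 7) + (total >> 1) + byte) & 0xFF
    return total


if __name__ == "__main__":
    print(" ".join("%04X" % u for u in lfn_units(LFN_ENTRY)))
    print("stored   checksum: %02X" % LFN_ENTRY[13])
    print("computed checksum: %02X" % lfn_checksum(SFN_ENTRY[:11]))
```

Run against the dump above, this recovers the seven code units listed
earlier, and the checksum computed from the 8.3 field equals the 5A stored
in the LFN entry. So the LFN entry and the 8.3 entry do belong together;
only the content of the 8.3 field is unusual.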

  • There are no missing files after tar restore; file counts were intact
    from both QNX and Win95.

  • WinNT and Linux VFAT have no problem and correctly identify all files.

However, the 8.3 filename is munged directly from the UTF-8 bytes, and
Win95 doesn’t seem to like this.
After much trial and error, it looks like Win95 checks that

  • the 8 of the 8.3 ends with “~1” or a similar munge indicator
  • the 8.3 looks sane in the system codepage
    (“chev us” (use US codepage) doesn’t cure it)

so in Win95 only those entries whose 8.3 happens to look “sane”
get their LFN attached.

  • The problem seems to lie in the 8.3 generator.

As a result, Win95 shows most files under the QNX-mangled 8.3 name, which
is raw UTF-8. Some of these names contain illegal characters, so Win95
can’t stat/delete them (bad).

In the above example, [EF BE 8C EF BE 9F 7E 31:20 20 20] doesn’t look like
a valid string in codepage 932 (Japanese), so Win95 decides that
the entry doesn’t have a valid LFN. (Don’t ask me why.)

I don’t expect QNX to generate the 8.3 exactly as Windows does,
as this would require a codepage option in fs-dos, but it would at least
be better to have the LFN picked up.

So here’s the short-term fix suggestion:

  • Have the 8.3 generator emit only ASCII characters.
    You don’t have to decode Unicode to a DOS codepage;
    just masking into 7 bits (and excluding illegal chars) is enough.
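
To make the suggestion concrete, here is a rough Python sketch of such a
generator. It is only an illustration under my own assumptions: the names
ALLOWED, mask7 and ascii_short_name are hypothetical, the set of legal 8.3
characters is my reading of the usual FAT rules, and the sketch always
appends a ~N tail even when the name would fit as-is.

```python
# Characters assumed legal in an 8.3 name (upper-case letters, digits,
# and a conservative set of punctuation). This list is an assumption.
ALLOWED = set(b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!#$%&'()-@^_{}~")


def mask7(text):
    """Mask every UTF-8 byte to 7 bits, upper-case it, drop illegal bytes."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        c = byte & 0x7F
        if 0x61 <= c <= 0x7A:  # a-z -> A-Z
            c -= 0x20
        if c in ALLOWED:
            out.append(c)
    return bytes(out)


def ascii_short_name(long_name, index=1):
    """Return an 11-byte, space-padded, ASCII-only 8.3 field."""
    stem, dot, ext = long_name.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = long_name, ""
    tail = b"~%d" % index
    base = (mask7(stem) or b"_")[: 8 - len(tail)]
    return (base + tail).ljust(8) + mask7(ext)[:3].ljust(3)


if __name__ == "__main__":
    program = "\uff8c\uff9f\uff9b\uff78\uff9e\uff97\uff91"
    print(ascii_short_name(program))       # the localised "Program" name
    print(ascii_short_name("readme.txt"))
```

The resulting 8.3 names are ugly and collide easily (hence the index
argument for the ~N tail), but every byte is below 0x80, so they should
look the same in any codepage, which is the whole point of the workaround.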

I know the (internationalized, multilingual, whatever) iso-8859-* folks
won’t like 7-bit, so you may need some 8.3 generation option
for fs-dos anyway.

In the long term you may need a “codepage=” option for fs-dos.


kabe

kabe@sra-tohoku.co.jp wrote:

: I was tinkering with the LFN and FAT all this day and it turned out

Excellent, thanks very much for your investigations. I managed to find
a Win2k system and came to much the same conclusions, namely that the
way fs-dos munges 8.3 names (by just taking the first 1-6 of the multibyte
long name) is not going to work very well in general non-ASCII locales.
I’ll look into some kind of fix or workaround …
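
For illustration only, the truncation behaviour described above can be
mimicked in a few lines of Python. This is not the fs-dos source; the
function name naive_short_name is made up, and it assumes the cut happens
on the raw UTF-8 bytes, which is what the hexdump earlier in the thread
suggests.

```python
def naive_short_name(long_name, index=1):
    """Mimic the observed munging: first bytes of the UTF-8 name plus ~N."""
    raw = long_name.encode("utf-8")
    tail = b"~%d" % index
    # Keep as many leading bytes as fit in the 8-byte base next to the tail,
    # then pad the whole 8.3 field (no extension here) to 11 bytes.
    return (raw[: 8 - len(tail)] + tail).ljust(11, b" ")


if __name__ == "__main__":
    program = "\uff8c\uff9f\uff9b\uff78\uff9e\uff97\uff91"
    print(naive_short_name(program).hex(" ").upper())
```

For the localised “Program” name this gives the same 11 bytes as the 8.3
field in the dump. Because the cut is made on bytes, it can also land in
the middle of a multibyte character, which leaves a fragment that is not
even valid UTF-8, let alone valid in a DOS codepage.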