Fixed the issue of file name becoming messy code #562

Open
wants to merge 4 commits into
base: main

willson-chen
Member

The issue appears when a file whose name is not UTF-8 encoded is zipped by a third-party tool and then unzipped with ZipArchive.

@willson-chen
Member Author

Basically, I have reverted #443, but I kept the CP437 decoding and changed the function flow.

Comment on lines -972 to -999
// Respect Language encoding flag only reading filename as UTF-8 when this is set
// when file entry created on dos system.
//
// https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
// Bit 11: Language encoding flag (EFS). If this bit is set,
// the filename and comment fields for this file
// MUST be encoded using UTF-8. (see APPENDIX D)
uint16_t made_by = version_made_by >> 8;
BOOL made_on_dos = made_by == 0;
BOOL languageEncoding = (flag & (1 << 11)) != 0;
if (!languageEncoding && made_on_dos) {
    // APPNOTE.TXT D.1:
    //   D.2 If general purpose bit 11 is unset, the file name and comment should conform
    //   to the original ZIP character encoding. If general purpose bit 11 is set, the
    //   filename and comment must support The Unicode Standard, Version 4.1.0 or
    //   greater using the character encoding form defined by the UTF-8 storage
    //   specification. The Unicode Standard is published by the The Unicode
    //   Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
    //   is expected to not include a byte order mark (BOM).

    // Code Page 437 corresponds to kCFStringEncodingDOSLatinUS
    NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
    NSString* strPath = [NSString stringWithCString:filename encoding:encoding];
    if (strPath) {
        return strPath;
    }
}
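
For readers unfamiliar with the two header fields tested in the removed block, here is a minimal Python sketch of the same bit arithmetic. It is not ZipArchive code: `describe_entry` and the sample header values are made up for illustration. The upper byte of "version made by" identifies the host system (0 is MS-DOS/FAT, 3 is Unix in APPNOTE.TXT), and general purpose bit 11 is the UTF-8 flag.

```python
UTF8_FLAG = 1 << 11  # general purpose bit 11 (language encoding flag, EFS)


def describe_entry(version_made_by: int, flag: int) -> dict:
    """Mirror the two checks made by the removed Objective-C block."""
    made_by = version_made_by >> 8  # upper byte: host system, 0 = MS-DOS/FAT
    return {
        "made_on_dos": made_by == 0,
        "utf8_flag": (flag & UTF8_FLAG) != 0,
    }


# Hypothetical header values: a DOS-made entry without the flag,
# and a Unix-made entry with the flag set.
dos_entry = describe_entry(version_made_by=0x0014, flag=0x0000)
unix_entry = describe_entry(version_made_by=0x031E, flag=0x0800)
print(dos_entry)
print(unix_entry)
```

Only entries like `dos_entry` (made on DOS, flag unset) took the CP437 branch in the removed code.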

Member Author

The purpose of this code block is to guarantee that files whose names are not UTF-8 encoded, and that were zipped on Windows, are unzipped correctly.

Unfortunately, most zip tools on Windows never set the UTF-8 flag, even when the file name is UTF-8 encoded. That is why file names become garbled after being processed by this code block.
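
A minimal Python sketch of this failure mode, assuming a made-up file name and using Python's `cp437` codec as a stand-in for `kCFStringEncodingDOSLatinUS`. It is an illustration, not the library's code path. In Python's codec every byte value maps to some CP437 character, so decoding never fails and the result is silently wrong.

```python
# Hypothetical file name; any non-ASCII name behaves the same way.
original = "文件.txt"

# What a tool writes when it stores UTF-8 bytes but leaves bit 11 unset.
raw = original.encode("utf-8")

# What the removed branch effectively did: interpret the bytes as CP437.
# This succeeds for any input, so the UTF-8 attempt is never reached.
as_cp437 = raw.decode("cp437")

# Decoding the same bytes as UTF-8 recovers the intended name.
as_utf8 = raw.decode("utf-8")

print(repr(as_cp437))
print(repr(as_utf8))
```

The first line printed is the garbled name, one character per input byte; the second is the original name.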

        return strPath;
    }
}

// attempting unicode encoding
NSString * strPath = @(filename);
Member Author

File names encoded as UTF-8 are handled here.

Comment on lines +985 to +990
BOOL isloss = NO;
NSStringEncoding encGB_18030_2000 = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
NSStringEncoding encShiftJIS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingShiftJIS);
NSStringEncoding encDOSLatinUS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
NSArray * encList = @[@(encGB_18030_2000), @(encShiftJIS), @(encDOSLatinUS)];
[NSString stringEncodingForData:data encodingOptions:@{NSStringEncodingDetectionSuggestedEncodingsKey:encList} convertedString:&strPath usedLossyConversion:&isloss];
Member Author

File names that are not UTF-8 are handled here. The set of suggested encodings includes GB_18030_2000, ShiftJIS, and DOSLatinUS, and more encodings can be appended. The UTF-8 flag and the platform (made_by) can be ignored.
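
The resulting flow (strict UTF-8 first, then a list of fallback encodings) can be sketched in Python as follows. This is a simplified stand-in, not the patch itself: `decode_entry_name` is a hypothetical helper, the Python codec names only roughly correspond to the Core Foundation constants, and it tries the fallbacks in a fixed order, whereas `stringEncodingForData:encodingOptions:convertedString:usedLossyConversion:` applies its own detection heuristics.

```python
from typing import Optional, Sequence

# Python codec names roughly matching the encodings suggested in the patch.
FALLBACKS = ("gb18030", "shift_jis", "cp437")


def decode_entry_name(raw: bytes,
                      fallbacks: Sequence[str] = FALLBACKS) -> Optional[str]:
    """Decode a raw entry name: strict UTF-8 first, then each fallback in order."""
    for encoding in ("utf-8", *fallbacks):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


utf8_name = "文件.txt".encode("utf-8")
gb_name = "文件.txt".encode("gb18030")
sjis_name = "テスト.txt".encode("shift_jis")

print(decode_entry_name(utf8_name))                  # decoded as UTF-8
print(decode_entry_name(gb_name))                    # recovered via gb18030
print(decode_entry_name(sjis_name, ("shift_jis",)))  # recovered when Shift JIS is tried first
print(decode_entry_name(sjis_name))                  # gb18030 also accepts these bytes, so the name comes out wrong
```

The last call shows the limitation of any fixed fallback list: the same bytes can be valid in more than one legacy encoding, so the result depends on which encoding is tried or guessed first.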

Member

It's tricky to define our own set of suggested encodings (although I know it's done in the present code for macOS 10.9 or older).

I would be in favor of changing our API to take the fallback encoding options as a parameter to unzipping (or as a callback). Normally, well-formed archives should be in Unicode, and old archives should be dealt with according to client preferences instead of attempting imperfect encoding detection.

A. If done as an encoding option: that parameter will tell whether we fall back from non-Unicode to a particular encoding, to risky encoding detection, or to hex strings.
B. If done as a callback option: the callback will be called with the NSData each time a name is not Unicode, and it will return the desired NSString.
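
To make options A and B concrete, here is a rough Python sketch of what such an API shape could look like. Everything in it is hypothetical (`NameFallback`, `hex_name`, `fixed_encoding`, `entry_name`); it is not a proposal for the actual ZipArchive signatures, only an illustration of letting the client decide what happens to names that are not well-formed UTF-8.

```python
from typing import Callable

# Hypothetical callback type (option B): raw name bytes in, final name out.
NameFallback = Callable[[bytes], str]


def hex_name(raw: bytes) -> str:
    """Fallback that never guesses: render the raw bytes as a hex string."""
    return raw.hex()


def fixed_encoding(encoding: str) -> NameFallback:
    """Option A expressed as a callback: always use one client-chosen encoding."""
    def decode(raw: bytes) -> str:
        # Undecodable bytes become U+FFFD instead of raising.
        return raw.decode(encoding, errors="replace")
    return decode


def entry_name(raw: bytes, fallback: NameFallback = hex_name) -> str:
    """Use UTF-8 when the bytes are well-formed; otherwise defer to the client."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return fallback(raw)


gb_name = "文件.txt".encode("gb18030")
print(entry_name("文件.txt".encode("utf-8")))        # UTF-8 path, fallback unused
print(entry_name(gb_name, fixed_encoding("gb18030")))  # client knows the archive's origin
print(entry_name(gb_name))                             # default: hex string, no guessing
```

With this shape, encoding detection becomes just one more fallback the client can opt into, rather than the library's default behavior.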

Note that neither of those solutions (the present code and the suggestions above) deals with duplicate filenames: if an archive contains two identical filenames (Unicode or not), the behavior is possibly undefined (error? overwrite? skip? rename?), and that would also ideally require a callback to let the client decide what to do.
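
A duplicate-name callback could follow the same pattern. The Python sketch below is purely illustrative: `DuplicatePolicy`, `rename_with_suffix`, `skip_duplicate`, and `plan_extraction` are invented names, and the suffix scheme is deliberately simplistic (it appends after the extension).

```python
from typing import Callable, Iterable, List, Optional, Set

# Hypothetical policy callback: given a name that was already extracted,
# return the name to extract under, or None to skip the entry.
DuplicatePolicy = Callable[[str, Set[str]], Optional[str]]


def rename_with_suffix(name: str, seen: Set[str]) -> Optional[str]:
    """Pick the first 'name (n)' that has not been used yet."""
    n = 1
    while f"{name} ({n})" in seen:
        n += 1
    return f"{name} ({n})"


def skip_duplicate(name: str, seen: Set[str]) -> Optional[str]:
    """Keep the first entry and drop later ones with the same name."""
    return None


def plan_extraction(names: Iterable[str],
                    on_duplicate: DuplicatePolicy) -> List[str]:
    """Return the target names in order, consulting the policy on collisions."""
    seen: Set[str] = set()
    planned: List[str] = []
    for name in names:
        target = name if name not in seen else on_duplicate(name, seen)
        if target is None:
            continue  # policy chose to skip this entry
        seen.add(target)
        planned.append(target)
    return planned


entries = ["a.txt", "b.txt", "a.txt", "a.txt"]
print(plan_extraction(entries, rename_with_suffix))
print(plan_extraction(entries, skip_duplicate))
```

An "overwrite" or "raise an error" policy would fit the same callback type, which is the point of leaving the decision to the client.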

@willson-chen
Member Author

I have tested the zip files attached to some related issues, and the patch works.
