Fixed the issue of file name becoming messy code #562
The issue appears when a file whose name is not UTF-8 encoded is zipped by a third-party tool and then unzipped with ZipArchive.
Basically, I have reverted #443, but I kept the CP437 decoding and changed the function flow.
// Respect Language encoding flag only reading filename as UTF-8 when this is set
// when file entry created on dos system.
//
// https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
//   Bit 11: Language encoding flag (EFS). If this bit is set,
//           the filename and comment fields for this file
//           MUST be encoded using UTF-8. (see APPENDIX D)
uint16_t made_by = version_made_by >> 8;
BOOL made_on_dos = made_by == 0;
BOOL languageEncoding = (flag & (1 << 11)) != 0;
if (!languageEncoding && made_on_dos) {
    // APPNOTE.TXT D.1:
    //   D.2 If general purpose bit 11 is unset, the file name and comment should conform
    //   to the original ZIP character encoding. If general purpose bit 11 is set, the
    //   filename and comment must support The Unicode Standard, Version 4.1.0 or
    //   greater using the character encoding form defined by the UTF-8 storage
    //   specification. The Unicode Standard is published by the The Unicode
    //   Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
    //   is expected to not include a byte order mark (BOM).

    // Code Page 437 corresponds to kCFStringEncodingDOSLatinUS
    NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
    NSString* strPath = [NSString stringWithCString:filename encoding:encoding];
    if (strPath) {
        return strPath;
    }
}

The purpose of this code block is to guarantee that files whose names use a non-UTF-8 encoding and that were zipped on Windows can be unzipped correctly.
Sadly, most zip tools on Windows never set the UTF-8 flag, even when the file name is encoded as UTF-8. That is why file names become garbled after being processed by this code block.
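To make the failure mode concrete, here is a minimal sketch in plain C (which also compiles as Objective-C) of the condition the block tests. The helper names are hypothetical and not part of ZipArchive; only the bit position and the host byte come from the diff and APPNOTE.TXT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 11 of the general purpose bit flag: Language encoding flag (EFS). */
#define ZIP_FLAG_UTF8 (1u << 11)

/* The upper byte of "version made by" names the host system; 0 is MS-DOS. */
static bool made_on_dos(uint16_t version_made_by) {
    return (version_made_by >> 8) == 0;
}

static bool has_utf8_flag(uint16_t flag) {
    return (flag & ZIP_FLAG_UTF8) != 0;
}

/* Same condition as the reviewed block: no UTF-8 flag and a DOS host
 * means the name bytes are decoded as Code Page 437. */
static bool takes_cp437_path(uint16_t version_made_by, uint16_t flag) {
    return !has_utf8_flag(flag) && made_on_dos(version_made_by);
}
```

An archiver that writes UTF-8 names but leaves bit 11 clear and reports host 0 lands on the CP437 path, so a name containing "é" (UTF-8 bytes C3 A9) is decoded as two unrelated CP437 glyphs instead of one character.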
        return strPath;
    }
}

// attempting unicode encoding
NSString * strPath = @(filename);
File names encoded as UTF-8 are handled here.
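Trying UTF-8 first only works if the attempt fails cleanly on bytes that are not UTF-8. As a rough illustration of what that check involves, the sketch below validates a byte buffer in plain C. The function name is hypothetical; this is not how Foundation implements the check, only one way to express the well-formedness rules.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when buf[0..len) is well-formed UTF-8: no stray continuation
 * bytes, no truncated sequences, no overlong forms, no surrogates, and no
 * code point above U+10FFFF. */
static bool is_valid_utf8(const uint8_t *buf, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t need;      /* continuation bytes expected after the lead byte */
        uint32_t cp, min; /* decoded code point and smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false; /* stray continuation byte or invalid lead byte */

        if (len - i <= need) return false; /* sequence runs past the buffer */
        for (size_t k = 1; k <= need; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += need + 1;
    }
    return true;
}
```

A name stored in a legacy multibyte encoding usually fails this check (for example, the Shift_JIS bytes 93 FA start with a continuation byte), which is what lets the code move on to the non-UTF-8 path. Pure ASCII names pass and are identical in every encoding discussed here.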
BOOL isloss = NO;
NSStringEncoding encGB_18030_2000 = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
NSStringEncoding encShiftJIS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingShiftJIS);
NSStringEncoding encDOSLatinUS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
NSArray * encList = @[@(encGB_18030_2000), @(encShiftJIS), @(encDOSLatinUS)];
[NSString stringEncodingForData:data encodingOptions:@{NSStringEncodingDetectionSuggestedEncodingsKey:encList} convertedString:&strPath usedLossyConversion:&isloss];
File names that are not UTF-8 are handled here. The suggested encoding set includes GB_18030_2000, ShiftJIS, and DOSLatinUS, and more encodings can be appended. With this flow, the UTF-8 flag and the platform (made_by) can be ignored.
It's tricky to define our own set of suggested encodings (although I know it's done in the present code for macOS 10.9 or older).
I would be in favor of changing our API to take the fallback encoding options as a parameter to unzipping (or as a callback). Normally, well-formed archives should be in Unicode, and old archives should be dealt with according to client preferences instead of attempting imperfect encoding detection.
A. If done as an encoding option: that parameter would specify whether non-Unicode names fall back to a particular encoding, to risky encoding detection, or to hex strings.
B. If done as a callback option: the callback would be called with the NSData each time a name is not Unicode, and it would return the desired NSString.
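As a rough sketch of option B, here is what a client-supplied fallback could look like in plain C. Every name below is hypothetical and nothing here is an existing ZipArchive API. The sample callback implements the hex-string fallback mentioned under option A, and the caller is assumed to have already decided whether the bytes are valid UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives raw name bytes that are not Unicode
 * and writes a NUL-terminated name into out. Returns 0 on success. */
typedef int (*name_fallback_fn)(const uint8_t *raw, size_t len,
                                char *out, size_t out_cap, void *ctx);

/* One possible client choice: render the raw bytes as uppercase hex. */
static int hex_name_fallback(const uint8_t *raw, size_t len,
                             char *out, size_t out_cap, void *ctx) {
    (void)ctx;
    if (out_cap < len * 2 + 1) return -1;
    for (size_t i = 0; i < len; i++) {
        snprintf(out + i * 2, 3, "%02X", (unsigned)raw[i]);
    }
    out[len * 2] = '\0';
    return 0;
}

/* Copies UTF-8 names through unchanged; otherwise defers to the client. */
static int resolve_entry_name(const uint8_t *raw, size_t len, bool is_utf8,
                              name_fallback_fn fallback, void *ctx,
                              char *out, size_t out_cap) {
    if (is_utf8) {
        if (out_cap < len + 1) return -1;
        memcpy(out, raw, len);
        out[len] = '\0';
        return 0;
    }
    if (fallback == NULL) return -1; /* no client preference supplied */
    return fallback(raw, len, out, out_cap, ctx);
}
```

With this shape the library never guesses: a client that knows its archives come from a Japanese Windows tool could supply a Shift_JIS decoder, while another could keep hex names or opt in to detection.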
Note that neither of those solutions (the present code and the above suggestions) deals with duplicate filenames: if an archive has two identical filenames in it (Unicode or not), the behavior is possibly undefined (error? overwrite? skip? rename?), and that would ideally also require a callback to let the client decide what to do.
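The duplicate-name concern could be handled with the same callback pattern. The sketch below, in plain C with entirely hypothetical names, asks a client policy what to do whenever a decoded name repeats an earlier one. It only plans the extraction and does not touch the file system.

```c
#include <stddef.h>
#include <string.h>

/* The four outcomes listed in the comment above. */
typedef enum { DUP_ERROR, DUP_OVERWRITE, DUP_SKIP, DUP_RENAME } dup_action;

/* Hypothetical client policy: decides what happens to a repeated name. */
typedef dup_action (*dup_policy_fn)(const char *name, void *ctx);

/* Returns the index of an earlier entry with the same name, or -1. */
static long find_earlier_duplicate(const char *const *names, size_t index) {
    for (size_t i = 0; i < index; i++) {
        if (strcmp(names[i], names[index]) == 0) return (long)i;
    }
    return -1;
}

/* Example policy: keep the first entry and skip later ones. */
static dup_action always_skip(const char *name, void *ctx) {
    (void)name;
    (void)ctx;
    return DUP_SKIP;
}

/* Counts the write operations the policy allows, or returns -1 when the
 * policy treats a duplicate as an error. Overwrites and renames both
 * count as one write each. */
static long plan_extraction(const char *const *names, size_t n,
                            dup_policy_fn policy, void *ctx) {
    long written = 0;
    for (size_t i = 0; i < n; i++) {
        if (find_earlier_duplicate(names, i) >= 0) {
            dup_action action = policy(names[i], ctx);
            if (action == DUP_ERROR) return -1;
            if (action == DUP_SKIP) continue;
        }
        written++;
    }
    return written;
}
```

The quadratic scan is fine for a sketch; a real implementation would likely track seen names in a set. Duplicates can also appear only after decoding, when two different byte sequences map to the same string, which is one more reason to run this check on decoded names.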
I have tested the patch with the zip files attached to some related issues, and it works.
…to decompression_encoding_fix_449_384 add testcase