Fixed the issue of file name becoming messy code #562

willson-chen · 2020-03-29T08:04:35Z

The issue appears when a file whose name is not UTF-8 encoded is zipped by a third-party tool and then unzipped with ZipArchive.

willson-chen · 2020-03-29T08:11:51Z

Basically, I have reverted #443, but I kept the CP 437 decoding by changing the function flow.

willson-chen · 2020-03-29T08:32:30Z

SSZipArchive/SSZipArchive.m

- // Respect Language encoding flag only reading filename as UTF-8 when this is set
- // when file entry created on dos system.
- //
- // https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
- // Bit 11: Language encoding flag (EFS). If this bit is set,
- // the filename and comment fields for this file
- // MUST be encoded using UTF-8. (see APPENDIX D)
- uint16_t made_by = version_made_by >> 8;
- BOOL made_on_dos = made_by == 0;
- BOOL languageEncoding = (flag & (1 << 11)) != 0;
- if (!languageEncoding && made_on_dos) {
- // APPNOTE.TXT D.1:
- // D.2 If general purpose bit 11 is unset, the file name and comment should conform
- // to the original ZIP character encoding. If general purpose bit 11 is set, the
- // filename and comment must support The Unicode Standard, Version 4.1.0 or
- // greater using the character encoding form defined by the UTF-8 storage
- // specification. The Unicode Standard is published by the The Unicode
- // Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files
- // is expected to not include a byte order mark (BOM).
-
- // Code Page 437 corresponds to kCFStringEncodingDOSLatinUS
- NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
- NSString* strPath = [NSString stringWithCString:filename encoding:encoding];
- if (strPath) {
- return strPath;
- }
- }
-


The purpose of this code block is to guarantee that files whose names are not UTF-8 encoded, and that were zipped on Windows, can be unzipped correctly.

Sadly, most zip tools on Windows never set the UTF-8 flag, even when the file name is encoded as UTF-8. That is why file names become garbled after being processed by this code block.
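
To make the failure mode concrete, the removed condition boils down to two header fields. Here is a minimal sketch in plain C (which also compiles inside a `.m` file). `would_decode_as_cp437` is a hypothetical name used for illustration, not a function in SSZipArchive, and the header values in the note below are made-up examples.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the removed condition:
 * CP437 was tried first when bit 11 (the UTF-8 flag) was clear and the
 * upper byte of "version made by" was 0 (MS-DOS). */
static bool would_decode_as_cp437(uint16_t version_made_by, uint16_t flag)
{
    uint16_t made_by = version_made_by >> 8;            /* host system code */
    bool made_on_dos = (made_by == 0);
    bool language_encoding = (flag & (1u << 11)) != 0;  /* bit 11: EFS */
    return !language_encoding && made_on_dos;
}
```

With a header such as `version_made_by = 0x0014` (upper byte 0, i.e. MS-DOS) and `flag = 0x0000`, the function returns true, so a name that is actually UTF-8 would have been decoded as CP437 before the UTF-8 attempt was ever reached.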

willson-chen · 2020-03-29T08:33:16Z

SSZipArchive/SSZipArchive.m

- return strPath;
- }
- }
-
 // attempting unicode encoding
 NSString * strPath = @(filename);


UTF-8 file names are processed here.
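
As far as I understand, `@(filename)` evaluates to nil when the bytes are not valid UTF-8, which is what lets non-UTF-8 names continue to the fallback path. To illustrate what "valid UTF-8" means at the byte level, here is a rough sketch in plain C. `is_valid_utf8` is a hypothetical helper, not part of the patch, and Foundation's own validation may differ in edge cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the len bytes at s form well-formed UTF-8
 * (no overlong forms, no surrogates, nothing above U+10FFFF). */
static bool is_valid_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra;        /* number of continuation bytes expected */
        unsigned long cp;    /* code point being assembled */

        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; }
        else return false;   /* stray continuation or invalid lead byte */

        if (len - i <= extra) return false;   /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

For example, the GBK bytes `D6 D0 CE C4` are rejected because `D6` announces a two-byte sequence but `D0` is not a continuation byte, so such a name would move on to the non-UTF-8 handling.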

willson-chen · 2020-03-29T08:37:38Z

SSZipArchive/SSZipArchive.m

+ BOOL isloss = NO;
+ NSStringEncoding encGB_18030_2000 = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
+ NSStringEncoding encShiftJIS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingShiftJIS);
+ NSStringEncoding encDOSLatinUS = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingDOSLatinUS);
+ NSArray * encList = @[@(encGB_18030_2000), @(encShiftJIS), @(encDOSLatinUS)];
+ [NSString stringEncodingForData:data encodingOptions:@{NSStringEncodingDetectionSuggestedEncodingsKey:encList} convertedString:&strPath usedLossyConversion:&isloss];


Non-UTF-8 file names are processed here. The suggested encoding set includes GB_18030_2000, ShiftJIS, and DOSLatinUS; more encodings can be appended. The UTF-8 flag and the platform (made_by) can be ignored.

Coeur · 2023-07-22

It's tricky to define our own set of suggested encodings (although I know it's done in the present code for macOS 10.9 or older).

I would be in favor of changing our API to take the fallback encoding options as a parameter to unzipping (or as a callback). Normally, well-formed archives should be in Unicode, and old archives should be dealt with according to client preferences instead of attempting imperfect encoding detection.

A. If done as an encoding option: that parameter would tell whether non-Unicode names fall back to a particular encoding, to risky encoding detection, or to hex strings.
B. If done as a callback option: the callback would be called with the NSData each time a name is not Unicode, and it would return the desired NSString.
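
To give a feel for the shape of such an API, here is a rough sketch in plain C. Every name in it (`name_fallback_fn`, `hex_fallback`, `resolve_non_unicode_name`) is hypothetical, and the real API would of course work with NSData and NSString. It only illustrates option B, with the hex-string choice from option A as an assumed default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type (option B): receives the raw name bytes and
 * writes a NUL-terminated name into out; returns 0 on success. */
typedef int (*name_fallback_fn)(const unsigned char *raw, size_t len,
                                char *out, size_t out_size, void *ctx);

/* One possible fallback (the hex-string choice of option A):
 * render every byte as two uppercase hex digits. */
static int hex_fallback(const unsigned char *raw, size_t len,
                        char *out, size_t out_size, void *ctx)
{
    (void)ctx;
    assert(out != NULL && (raw != NULL || len == 0));
    if (out_size < 2 * len + 1) return -1;   /* not enough room */
    for (size_t i = 0; i < len; i++)
        snprintf(out + 2 * i, 3, "%02X", (unsigned)raw[i]);
    out[2 * len] = '\0';
    return 0;
}

/* Would be called by the unzip loop for names that are not valid UTF-8.
 * Using hex_fallback when no callback is supplied is an assumption. */
static int resolve_non_unicode_name(const unsigned char *raw, size_t len,
                                    name_fallback_fn fallback, void *ctx,
                                    char *out, size_t out_size)
{
    if (fallback == NULL) fallback = hex_fallback;
    return fallback(raw, len, out, out_size, ctx);
}
```

A client that knows its archives come from, say, Shift-JIS systems would pass its own callback, while a client with no preference gets a lossless but unreadable name instead of a wrong guess.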

Note that neither of those solutions (the present code and the suggestions above) deals with duplicate filenames: if an archive has two identical filenames in it (Unicode or not), the behavior is possibly undefined (error? overwrite? skip? rename?), and that would ideally also require a callback to let the client decide what to do.
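
The four outcomes listed above could be modelled as an explicit policy. The following is only a sketch in plain C with hypothetical names (`dup_policy`, `dup_action`, `resolve_duplicate`). The rename scheme, which appends " (N)" to the whole name, is a deliberately simplistic assumption and not something the library does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { DUP_ERROR, DUP_OVERWRITE, DUP_SKIP, DUP_RENAME } dup_policy;
typedef enum { ACT_WRITE, ACT_SKIP, ACT_FAIL } dup_action;

/* Decides what to do with an entry whose name already appeared
 * seen_before times in the archive, and writes the name to use into out. */
static dup_action resolve_duplicate(dup_policy policy, const char *name,
                                    unsigned seen_before,
                                    char *out, size_t out_size)
{
    assert(name != NULL && out != NULL && out_size > 0);
    int n;
    if (seen_before > 0 && policy == DUP_RENAME)
        n = snprintf(out, out_size, "%s (%u)", name, seen_before + 1);
    else
        n = snprintf(out, out_size, "%s", name);
    if (n < 0 || (size_t)n >= out_size) return ACT_FAIL;  /* name truncated */

    if (seen_before == 0) return ACT_WRITE;   /* first occurrence */
    switch (policy) {
    case DUP_OVERWRITE:
    case DUP_RENAME:
        return ACT_WRITE;
    case DUP_SKIP:
        return ACT_SKIP;
    default:
        return ACT_FAIL;                      /* DUP_ERROR */
    }
}
```

A callback-based design would let the client return one of these actions per entry instead of fixing a single policy for the whole archive.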

willson-chen · 2020-03-29T08:41:00Z

I have tested the zip files attached to some of the related issues, and the patch works.

Fixed the issue of file name becoming messy code

e724785

This was linked to issues Apr 1, 2020

Problem unzipping files with word "accent" - "dissolução" becomes "dissoluc╠ºa╠âo" #545

Closed

unzip shift-ijs file is failed #384

Open

File names with chinese are messed up at 2.1.2 #449

Open

Special characters in file Names #461

Closed

willson-chen and others added 3 commits April 3, 2020 17:20

upload zip file first

a1f33e1

add testcase of PR ZipArchive#562

dfc6704

Merge remote-tracking branch 'remotes/origin/TestSpecialCharacter' in…

e173faa

…to decompression_encoding_fix_449_384 add testcase

willson-chen mentioned this pull request Apr 28, 2020

replace kCFStringEncodingDOSLatinUS with NSUTF8StringEncoding #476

Closed

willson-chen mentioned this pull request Jun 27, 2020

Problem unzipping files with word "accent" - "dissolução" becomes "dissoluc╠ºa╠âo" #545

Closed

Coeur changed the base branch from master to main July 22, 2023 16:58
