ZipArchive: Unicode Support: Using Non-English Characters in Filenames, Comments and Passwords

Applies To: All

Introduction
Using Unicode in Filenames and File Comments (Full Version Only)
Setting Unicode Password and Archive Comment (Windows Only)
Setting Locale in STL Applications
Additional Considerations (Windows Only)
- Unicode Normalization
- Safe Windows API
Custom Unicode Handling (Windows Only)
See Also API Links

Introduction

You should use the Unicode features of the ZipArchive Library when the filenames, comments or passwords in the archives you use contain non-ASCII characters.
Without Unicode support, strings in archives are stored under Windows using the following code pages:
- filenames - current system OEM code page (CP_OEMCP),
- comments, passwords - current system ANSI code page (CP_ACP).
Under other platforms, all strings are stored using the current system's code page.
To use the Unicode functionality under Windows, you should compile the library and your application for Unicode. Under systems that use Unicode UTF-8 as the default code page (like Linux and OS X), there are no special considerations needed. On other systems, the Unicode support is not available.
When calling the CZipFileHeader::SetFileName() method, the current Unicode mode will be applied to the file being renamed. This will also affect the Unicode mode used for the file's comment.

Using Unicode in Filenames and File Comments (Full Version Only)

Introduction

This feature is compatible with WinZip Unicode support and allows creating cross-platform Unicode archives that are extractable by utilities provided with the system under Linux and OS X.
To use this functionality, make sure _ZIP_UNICODE is defined in the _features.h file. Rebuild the ZipArchive Library and your application, if you modify this definition.

Usage

Call the CZipArchive::SetUnicodeMode() method and pass CZipArchive::umExtra or CZipArchive::umString as the parameter. You can also use a combination of these two parameters.

CZipArchive::umExtra stores Unicode information in extra headers. The extra headers are used for a filename or comment only when the string contains non-ASCII characters. This value is used by default under Windows.
CZipArchive::umString stores the filename and comment directly in Unicode and sets a special flag in the file header inside the archive. Some utilities under Windows may display invalid strings in this case. This value is used by default under Linux/OS X.
To determine what Unicode mode is used by a file, use the CZipFileHeader::GetState() method.
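
The rule described for CZipArchive::umExtra (extra headers only for strings with non-ASCII characters) can be illustrated with a small check. This is a sketch, not the library's code: needsUnicodeExtra() and hasUtf8NameFlag() are hypothetical helpers, the input is assumed to be UTF-8, and the value 0x0800 assumes that the "special flag" set by CZipArchive::umString is general purpose bit 11 of the ZIP format, which marks UTF-8 encoded names and comments.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: returns true when the UTF-8 string contains at least
// one byte outside the 7-bit ASCII range, i.e. when a mode like umExtra
// would have to write an extra header for it.
inline bool needsUnicodeExtra(const std::string& utf8Text)
{
    for (unsigned char c : utf8Text)
    {
        if (c > 0x7F)
            return true;
    }
    return false;
}

// Assumption: the "special flag" is general purpose bit 11 (0x0800) of the
// ZIP file header, which marks UTF-8 encoded filenames and comments.
constexpr std::uint16_t kUtf8NameFlag = 0x0800;

// Hypothetical helper: tests the assumed flag in a header's flag word.
inline bool hasUtf8NameFlag(std::uint16_t generalPurposeFlags)
{
    return (generalPurposeFlags & kUtf8NameFlag) != 0;
}
```

For example, needsUnicodeExtra("readme.txt") is false, while a name containing Greek letters encoded in UTF-8 makes it return true.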

Preserving the Compatibility

The ZipArchive Library correctly decompresses archives created under different systems without additional settings.

If you need an archive created under Windows to be extracted correctly by Linux utilities, set the archive compatibility to ZipCompatibility::zcUnix with the CZipArchive::SetSystemCompatibility() method. To make the archive readable by Windows utilities as well, additionally set one of the Unicode modes. Not all Windows utilities support the Unicode modes.
If you need an archive created under Windows to be extracted correctly by Mac OS X utilities, set the Unicode mode to CZipArchive::umString or proceed the same way as for the Linux platform.
If you need an archive created under Linux/OS X to be extracted correctly by WinZip under Windows, there is no need to change anything, because the CZipArchive::umString mode is set by default. You may, however, need to set CZipArchive::umExtra for other Windows utilities that do not support the CZipArchive::umString mode.

Setting Unicode Password and Archive Comment (Windows Only)

You can set the code page to be used when setting a password with the CZipArchive::SetPassword() method.
You can set the code page to be used when setting an archive global comment with the CZipArchive::SetGlobalComment() method.
If your password or a comment contains non-ASCII characters and you intend to compress files under Windows and extract them under Linux/OS X or vice versa, set the appropriate code page to CP_UTF8.
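
To see why the code page matters here, consider that the same character produces different bytes in different encodings: "ä" (U+00E4) is the single byte 0xE4 in Windows-1252 but the two bytes 0xC3 0xA4 in UTF-8. A password encoded one way on the compressing system and another way on the extracting system is therefore a different byte sequence. The sketch below is only an illustration of the UTF-8 byte layout; encodeUtf8() is a hypothetical helper and not part of the ZipArchive Library.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Intended for U+0000..U+10FFFF; surrogates are not rejected in this sketch.
inline std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80)
    {
        // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Only ASCII characters keep the same single-byte value in UTF-8, which is why passwords and comments restricted to ASCII are not affected by the choice of code page.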

Setting Locale in STL Applications

If your locale is different from English and you wish to use non-English characters in archives, you need to set your locale globally; the setlocale() function is not sufficient in this case.

To set the global locale to be the same as your system locale use the function:
std::locale::global(std::locale(""));
To set the global locale to a particular value, call the function like this, for example:
std::locale::global(std::locale("German"));
When you use Unicode, do not use the _T() macro in the above calls.
Remember to put #include <locale> in your code.

Remember to restore the global locale to its previous value (returned by std::locale::global()) after processing, because the change may affect other parts of your application.
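
One way to make sure the previous locale is restored is a small scope guard. The class below is a minimal sketch using only the C++ standard library; ScopedGlobalLocale is a hypothetical name and not part of the ZipArchive Library.

```cpp
#include <locale>

// Hypothetical RAII guard: installs a global locale on construction and
// restores the previously active one on destruction.
class ScopedGlobalLocale
{
public:
    explicit ScopedGlobalLocale(const std::locale& newLocale)
        : m_previous(std::locale::global(newLocale)) // global() returns the old locale
    {
    }

    ~ScopedGlobalLocale()
    {
        std::locale::global(m_previous);
    }

    // The guard owns the "restore" action, so copying it makes no sense.
    ScopedGlobalLocale(const ScopedGlobalLocale&) = delete;
    ScopedGlobalLocale& operator=(const ScopedGlobalLocale&) = delete;

private:
    std::locale m_previous;
};
```

A typical use would be to create the guard, for example ScopedGlobalLocale guard(std::locale(""));, at the top of the block that processes the archive. The previous locale is then restored when the block is left, including when it is left through an exception.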

Additional Considerations (Windows Only)

Unicode Normalization

When you decompress archives that store filenames in a Unicode normalization form other than form C (used by Windows), define _ZIP_UNICODE_NORMALIZE in the _features.h file, because some software under Windows may be unable to open files whose names use a different form. With this definition, any other normalization form is converted to form C. This is the case, for example, when extracting archives created under OS X (which uses form D).
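
The difference between the two forms is visible at the byte level. The sketch below only compares the UTF-8 bytes of the letter "é" in both forms and performs no normalization itself; the function names are hypothetical.

```cpp
#include <string>

// "é" in normalization form C: the single code point U+00E9.
inline std::string eAcuteFormC()
{
    return "\xC3\xA9";
}

// "é" in normalization form D: U+0065 ("e") followed by the combining
// acute accent U+0301.
inline std::string eAcuteFormD()
{
    return "e\xCC\x81";
}

// Filenames are compared byte by byte by most software, so the two forms
// of the visually identical letter do not match.
inline bool sameFilenameBytes(const std::string& a, const std::string& b)
{
    return a == b;
}
```

Both strings render as the same letter, but sameFilenameBytes(eAcuteFormC(), eAcuteFormD()) is false, which is the kind of mismatch that converting to form C avoids.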

Under Windows Vista and later, you need to use the Windows SDK appropriate for your system and make sure that you compile for that platform (WINVER should be defined as at least 0x600).
Under Windows XP and Windows Server 2003, you need to download Microsoft Internationalized Domain Name (IDN) Mitigation APIs to use this functionality.
Under Windows 95/98/Me this functionality is unsupported.

Safe Windows API

The Unicode version of the library uses the Windows API functions WideCharToMultiByte and MultiByteToWideChar to perform conversions from the ANSI code page to the OEM code page and vice versa. One conversion takes four function calls. The alternative is to use the CharToOemBuffA and OemToCharBuffA functions, which take only one function call per conversion. However, these functions are considered unsafe and are banned by Microsoft. If you prefer the faster solution with the unsafe functions, comment out the _ZIP_SAFE_WINDOWS_API definition in the ZipPlatform_win.cpp file.

Custom Unicode Handling (Windows Only)

This functionality is specific to the ZipArchive Library and external software will not be able to benefit from it.
To use this functionality, make sure _ZIP_UNICODE_CUSTOM is defined in the _features.h file. Rebuild the ZipArchive Library and your application, if you modify this definition. You also need to set the Unicode mode with the CZipArchive::SetUnicodeMode() method to the CZipArchive::umCustom value.
The ZipArchive Library will save the code pages used during compression and automatically use them during extraction. The code pages are saved in zip extra fields. See below for more information.
Setting string store settings with one of the API calls does not affect existing files and comments.
If you open an existing archive with intent to add new files to it and you want the new files to use the same string store settings as the existing files, then:
- Retrieve the settings from one of the existing files with the
  CZipFileHeader::GetStringStoreSettings() method.
- Set the retrieved settings to be active for the new files with the
  CZipArchive::SetStringStoreSettings(const CZipStringStoreSettings&) method.
Otherwise the library will use the default settings for the current system (ZipPlatform::GetSystemID()).
If you want to open an archive created with a previous version of the ZipArchive Library, or with any program that uses filename or comment code pages different from the standard ones, set the code pages before opening the archive. The library will use them when decoding filenames and comments. The settings will be ignored if the archive contains extra fields with code pages created by the ZipArchive Library; in this case, the code pages from the extra fields will be used instead. Note that these settings will also be used during compression in the same archive (unless changed).
When you close an archive, the string store settings are reset to their default values for the current system, just as with a CZipStringStoreSettings::Reset() method call. This way, if you open the next archive using the same CZipArchive object, its string store settings are not affected by the settings of the previous archive.

Storing Unicode Filenames in a Zip Archive

You may control the way the ZipArchive Library stores filenames in archives by adjusting the first parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method.

If you plan to extract the archive under Linux/OS X, set this parameter to the identifier of the code page used by the system under which you want to extract the archive. You may try setting it to CP_ACP; the current system ANSI code page will then be used. This works correctly if the target platform uses the same code page as your system.
If you use, for example, Japanese or Korean characters, you may set this parameter to CP_UTF8 so that Unicode UTF-8 is used.
You can set the code page directly using its identifier. Be sure it is installed on your system and on the system you plan to extract the archive on.
To restore the OEM encoding under Windows, set this parameter back to CP_OEMCP.

Sample Code

CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// by default the current OEM code page is used, change it to UTF-8
zip.SetStringStoreSettings(CP_UTF8);
// use some non-English characters
zip.AddNewFile(emptyFile, _T("\u0391\u03A9"));
// set the code page using its identifier
zip.SetStringStoreSettings(1250);
zip.AddNewFile(emptyFile, _T("\u010D\u011B"));
// restore the OEM code page
zip.SetStringStoreSettings(CP_OEMCP);
zip.AddNewFile(emptyFile, _T("English characters only"));    
zip.Close();
// extract one file now
zip.Open(zipFileName);
zip.ExtractFile(1, _T("C:\\Temp"));
zip.Close();

Preserving Compatibility with the Standard Zip Format

It is assumed that, under Windows, filenames are stored using the current system OEM code page (CP_OEMCP). Hence, external software will not be able to properly decode filenames that are stored using a different code page. For this reason, the ZipArchive Library allows storing filenames encoded with a custom code page in extra fields. The filenames in the standard location (the central directory and local headers) are encoded using the OEM code page. This way, external software will see conventionally encoded filenames and the ZipArchive Library will know the original filenames during extraction.

Note that this method requires additional space for storing the filename in an extra field.

To store filenames in extra fields, set the second parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method to true.

Sample Code

CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// set the code page and request storing the filename
// encoded using this code page in the extra field
zip.SetStringStoreSettings(1250, true);
// use some non-English characters
zip.AddNewFile(emptyFile, _T("\u0104\u0118"));
zip.Close();

Choosing a Code Page for Comments in a Zip Archive

You can specify a different code page for file comments, e.g. by modifying the object returned by the CZipArchive::GetStringStoreSettings() method call.

Archive Global Comment Encoding and Decoding

The comment code page setting does not affect the global comment. Use CZipArchive::SetGlobalComment() to use a different code page in this case.

Sample Code

CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
zip.AddNewFile(emptyFile, _T("empty file"));
// set a specific code page for comments
zip.GetStringStoreSettings().m_uCommentCodePage = CP_UTF8;
// use some non-English characters
LPCTSTR comment = _T("\u0104\u0118");
zip[0]->SetComment(comment);
// the comment code page setting affects the global comment encoding as well
zip.SetGlobalComment(comment);
zip.Close();    
// extract the comments
zip.Open(zipFileName);
CZipFileHeader* info = zip.GetFileInfo(0);
// the file comment, the comment code page is read from the stored settings
CZipString result = info->GetComment();
// adjust the settings to properly decode the global comment
zip.SetStringStoreSettings(info->GetStringStoreSettings());    
result = zip.GetGlobalComment();    

ZipArchive Library Extra Field Format

The ZipArchive Library stores code page information and, if requested, the encoded filename in extra fields in the central directory. The global format of the ZipArchive extra field is as follows:

Sub-field	Size in bytes	Value
Header ID	2	0x5A4C
Data Size	2
Data	as specified by Data Size

The format of the Data field is as follows (not all sub-fields may be present):

Sub-field	Size in bytes	Values
Version	1	0x01
Flag	1	1, 3, 4
Filename Code Page	4
Encoded Filename	variable
Comment Code Page	4

The Flag field values have the following meaning:

Bits Set	Value	Meaning
0	1	the Filename Code Page field is present
0 and 1	3	the Encoded Filename field is present (and the Filename Code Page field must be present too)
2	4	the Comment Code Page field is present
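
The Flag values in the table above can be decoded with simple bit tests. The following is a minimal sketch and not the library's own parser: ExtraFieldContents, decodeExtraFieldFlag() and isConsistentFlag() are hypothetical names, and combined values such as 5 (bits 0 and 2) are assumed to be valid because the flag is described in terms of bits.

```cpp
#include <cstdint>

// Header ID of the ZipArchive Library extra field, as listed in the table.
constexpr std::uint16_t kZipArchiveExtraHeaderId = 0x5A4C;

// Hypothetical structure: which optional sub-fields follow the Flag byte.
struct ExtraFieldContents
{
    bool hasFilenameCodePage; // bit 0
    bool hasEncodedFilename;  // bit 1 (requires bit 0)
    bool hasCommentCodePage;  // bit 2
};

// Hypothetical helper: translates the Flag byte into the sub-fields present.
inline ExtraFieldContents decodeExtraFieldFlag(std::uint8_t flag)
{
    ExtraFieldContents contents;
    contents.hasFilenameCodePage = (flag & 0x01) != 0;
    contents.hasEncodedFilename = (flag & 0x02) != 0;
    contents.hasCommentCodePage = (flag & 0x04) != 0;
    return contents;
}

// The Encoded Filename field requires the Filename Code Page field, so
// bit 1 without bit 0 is treated as inconsistent in this sketch.
inline bool isConsistentFlag(std::uint8_t flag)
{
    const ExtraFieldContents contents = decodeExtraFieldFlag(flag);
    return !contents.hasEncodedFilename || contents.hasFilenameCodePage;
}
```

With this sketch, a Flag value of 3 reports both the Filename Code Page and the Encoded Filename fields, while a value of 2 is rejected by isConsistentFlag().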
