Unicode Support: Using Non-English Characters in Filenames and Comments
Applies To: Windows Only.

Introduction

  • This functionality is specific to the ZipArchive Library and external software will usually not be able to benefit from it.
  • The ZipArchive Library will save the code pages used during compression and automatically use them during extraction. The code pages are saved in zip extra fields. See below for more information.
  • Setting string store settings with one of the API calls does not affect existing files and comments.
  • If you open an existing archive to add new files to it and you want the new files to use the same string store settings as the existing files, retrieve the settings from one of the existing files and apply them to the archive after opening it, e.g. zip.SetStringStoreSettings(zip.GetFileInfo(0)->GetStringStoreSettings()). Otherwise the library will use the default settings for the current system (ZipPlatform::GetSystemID()).
  • If you want to open an archive created with a previous version of the ZipArchive Library, or with any program that uses filename or comment code pages other than the standard ones, set the code pages before opening the archive. The library will use them when decoding filenames and comments. These settings are ignored if the archive contains extra fields with code pages created by the ZipArchive Library; in that case, the code pages from the extra fields are used instead. Note that the settings also apply to files subsequently compressed into the same archive, unless you change them.
  • When you close an archive, the string store settings are reset to their default values for the current system, just like with the CZipStringStoreSettings::Reset() method call. This way, if you open the next archive using the same CZipArchive object, its string store settings are not affected by the previous archive's settings.

Storing Unicode Filenames in a Zip Archive

Zip compression programs running under Windows use the current system OEM code page by default to encode filenames in archives. This may not be desirable in all cases. You can control the way the ZipArchive Library stores filenames in archives by adjusting the first parameter of the
CZipArchive::SetStringStoreSettings(UINT, bool) method.
  • If you plan to extract the archive under Linux, set this parameter to the identifier of the code page used by the target Linux system. You may try setting it to CP_ACP, in which case the current system ANSI code page will be used. This works if the target Linux platform uses the same code page as your system.
  • If you use e.g. Japanese or Korean characters, you may set this parameter to CP_UTF8, so that Unicode UTF-8 is used. You need to compile the library and your application with Unicode support.
  • You can set the code page directly using its identifier. Make sure it is installed both on your system and on the system on which you plan to extract the archive.
  • To restore the OEM encoding, set this parameter back to CP_OEMCP.
Sample Code
    CZipMemFile emptyFile;
    CZipArchive zip;
    LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
    zip.Open(zipFileName, CZipArchive::zipCreate);
    // by default the current OEM code page is used, change it to UTF-8
    zip.SetStringStoreSettings(CP_UTF8);
    // use some non-English characters
    zip.AddNewFile(emptyFile, _T("\u0391\u03A9"));
    // set the code page using its identifier
    zip.SetStringStoreSettings(1250);
    zip.AddNewFile(emptyFile, _T("\u010D\u011B"));
    // restore the OEM code page
    zip.SetStringStoreSettings(CP_OEMCP);
    zip.AddNewFile(emptyFile, _T("English characters only"));    
    zip.Close();
    // extract one file now
    zip.Open(zipFileName);
    zip.ExtractFile(1, _T("C:\\Temp"));
    zip.Close();
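
To make concrete what bytes end up in the archive when CP_UTF8 is selected, the following standalone sketch encodes the first filename from the sample above (U+0391 U+03A9) as UTF-8. It does not call the ZipArchive Library; it only assumes the library stores standard UTF-8 for CP_UTF8. The function names encodeCodePoint and toUtf8 are hypothetical helpers introduced for illustration.

```cpp
#include <string>

// Encode a single Unicode code point (up to U+10FFFF) as UTF-8 bytes.
// Hypothetical helper; not part of the ZipArchive Library.
std::string encodeCodePoint(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte (ASCII)
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode a whole UTF-32 string as UTF-8.
std::string toUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        out += encodeCodePoint(cp);
    }
    return out;
}
```

For example, toUtf8(U"\u0391\u03A9") yields the four bytes CE 91 CE A9, while a filename with English characters only is unchanged because ASCII maps to single bytes.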

Preserving Compatibility with the Standard Zip Format

It is assumed that under Windows filenames are stored using the current system OEM code page (CP_OEMCP). Hence external software will not be able to decode filenames properly if they are stored using a different code page. For this reason, the ZipArchive Library allows storing filenames encoded with a custom code page in extra fields. The filenames in the standard location (the central directory and local headers) are then encoded using the OEM code page. This way, external software will see conventionally encoded filenames and the ZipArchive Library will know the original filenames during extraction.

Note that this method requires additional space, because the filename is stored a second time in an extra field.

To store filenames in extra fields, set the second parameter of the
CZipArchive::SetStringStoreSettings(UINT, bool) method to true.

Sample Code
    CZipMemFile emptyFile;
    CZipArchive zip;
    LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
    zip.Open(zipFileName, CZipArchive::zipCreate);
    // set the code page and request storing the filename
    // encoded with this code page in an extra field
    zip.SetStringStoreSettings(1250, true);
    // use some non-English characters
    zip.AddNewFile(emptyFile, _T("\u0104\u0118"));
    zip.Close();

Choosing a Code Page for Comments in a Zip Archive

The comments in a zip archive are usually stored using the current system ANSI code page (CP_ACP). You can specify a different code page, e.g. by modifying the object returned by the
CZipArchive::GetStringStoreSettings() method call.

Archive Global Comment Encoding and Decoding

The comment code page settings affect the global comment encoding as well, but the archive does not record which code page was used to encode the global comment. You can retrieve this information from, for example, the first file in the archive; see the sample code below.
Sample Code
    CZipMemFile emptyFile;
    CZipArchive zip;
    LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
    zip.Open(zipFileName, CZipArchive::zipCreate);
    zip.AddNewFile(emptyFile, _T("empty file"));
    // set a specific code page for comments
    zip.GetStringStoreSettings().m_uCommentCodePage = CP_UTF8;
    // use some non-English characters
    LPCTSTR comment = _T("\u0104\u0118");
    zip.SetFileComment(0, comment);
    // the comment code page setting affects the global comment encoding as well
    zip.SetGlobalComment(comment);
    zip.Close();    
    // extract the comments
    zip.Open(zipFileName);
    CZipFileHeader* info = zip.GetFileInfo(0);
    // the file comment, the comment code page is read from the stored settings
    CZipString result = info->GetComment();
    // adjust the settings to properly decode the global comment
    zip.SetStringStoreSettings(info->GetStringStoreSettings());    
    result = zip.GetGlobalComment();
    zip.Close();

ZipArchive Library Extra Field Format

The ZipArchive Library stores code page information and, if requested, the encoded filename in extra fields in the central directory. The global format of the ZipArchive extra field is as follows:

Sub-field    Size in bytes                Value
Header ID    2                            0x5A4C
Data Size    2
Data         as specified by Data Size

The format of the Data field is as follows (not all sub-fields may be present):

Sub-field             Size in bytes    Values
Version               1                0x01
Flag                  1                1, 3, 4
Filename Code Page    4
Encoded Filename      variable
Comment Code Page     4

The Flag field values have the following meaning:

Bits Set    Value    Meaning
0           1        the Filename Code Page field is present
0 and 1     3        the Encoded Filename field is present (and the Filename Code Page field must be present too)
2           4        the Comment Code Page field is present
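
The layout above can be illustrated with a small standalone parser. This is a sketch under stated assumptions rather than the library's own code: it assumes all multi-byte values are little-endian (as is usual for zip extra fields), that the flag bits may be combined, and that the Encoded Filename occupies every byte between the Filename Code Page and the optional trailing Comment Code Page. The struct and function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory view of one ZipArchive extra field.
struct ZipArchiveExtraField {
    bool valid = false;               // false if the buffer could not be parsed
    std::uint8_t version = 0;
    std::uint8_t flag = 0;
    bool hasFileNameCodePage = false;
    std::uint32_t fileNameCodePage = 0;
    bool hasEncodedFileName = false;
    std::string encodedFileName;      // raw bytes, still in fileNameCodePage
    bool hasCommentCodePage = false;
    std::uint32_t commentCodePage = 0;
};

// Read an n-byte little-endian unsigned integer (ASSUMPTION: little-endian).
static std::uint32_t readLE(const std::vector<std::uint8_t>& b,
                            std::size_t pos, std::size_t n) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v |= static_cast<std::uint32_t>(b[pos + i]) << (8 * i);
    return v;
}

ZipArchiveExtraField parseExtraField(const std::vector<std::uint8_t>& b) {
    ZipArchiveExtraField f;
    // Header ID (2 bytes) and Data Size (2 bytes).
    if (b.size() < 4 || readLE(b, 0, 2) != 0x5A4C) return f;
    const std::size_t dataSize = readLE(b, 2, 2);
    if (dataSize < 2 || b.size() < 4 + dataSize) return f;

    std::size_t pos = 4;
    const std::size_t end = 4 + dataSize;
    f.version = b[pos++];
    f.flag = b[pos++];
    f.hasFileNameCodePage = (f.flag & 1) != 0;   // bit 0
    f.hasEncodedFileName = (f.flag & 3) == 3;    // bits 0 and 1
    f.hasCommentCodePage = (f.flag & 4) != 0;    // bit 2
    const std::size_t tail = f.hasCommentCodePage ? 4 : 0;

    if (f.hasFileNameCodePage) {
        if (end - pos < 4 + tail) return f;
        f.fileNameCodePage = readLE(b, pos, 4);
        pos += 4;
    }
    if (f.hasEncodedFileName) {
        // ASSUMPTION: the filename fills the space before the comment code page.
        const std::size_t len = end - pos - tail;
        f.encodedFileName.assign(b.begin() + pos, b.begin() + pos + len);
        pos += len;
    }
    if (f.hasCommentCodePage) {
        if (end - pos < 4) return f;
        f.commentCodePage = readLE(b, pos, 4);
    }
    f.valid = true;
    return f;
}
```

A buffer with any other Header ID, or one shorter than its Data Size claims, is reported with valid set to false instead of being partially decoded.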

Setting Locale in STL Applications

If your locale is different from English and you wish to use non-English characters in archives, you need to set your locale globally; calling the setlocale() function alone is not sufficient in this case.
  • To set the global locale to be the same as your system locale use the function:
    std::locale::global(std::locale(""));
  • To set the global locale to a particular value, call the function, for example, this way:
    std::locale::global(std::locale("German"));
  • When you use Unicode, do not use the _T() macro in the above calls.
  • Remember to put #include <locale> in your code.
Remember to restore the global locale to the previous value (returned by std::locale::global) after processing, because the change may affect other parts of your application.
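
One way to make sure the previous global locale is always restored is a small scope guard. The sketch below uses only the standard library; the names ScopedGlobalLocale and systemLocaleOrClassic are hypothetical and not part of the ZipArchive Library. The fallback to the classic "C" locale is an added safety measure for systems where the user-preferred locale cannot be constructed.

```cpp
#include <locale>
#include <stdexcept>

// Sets the global locale on construction and restores the previous
// one on destruction (hypothetical helper).
class ScopedGlobalLocale {
public:
    explicit ScopedGlobalLocale(const std::locale& loc)
        : previous_(std::locale::global(loc)) {}   // global() returns the old locale
    ~ScopedGlobalLocale() { std::locale::global(previous_); }

    ScopedGlobalLocale(const ScopedGlobalLocale&) = delete;
    ScopedGlobalLocale& operator=(const ScopedGlobalLocale&) = delete;

private:
    std::locale previous_;
};

// Returns the user-preferred system locale, or the classic "C" locale
// if the system locale is not available.
std::locale systemLocaleOrClassic() {
    try {
        return std::locale("");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Creating a ScopedGlobalLocale guard(systemLocaleOrClassic()); object at the start of the block that performs the archive operations restores the earlier locale when the block is left, including when an exception is thrown.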

Additional Considerations

  • The ZipArchive Library has no Unicode version under Linux. However, to make your archive filenames readable under Linux, you can try using the UTF-8 code page when creating an archive under Windows.
  • In the Windows non-Unicode version, the library uses the Windows API functions WideCharToMultiByte and MultiByteToWideChar to convert between the ANSI and OEM code pages. One conversion takes four function calls. The alternative is to use the CharToOemBuffA and OemToCharBuffA functions, which need only one function call per conversion. However, these functions are considered unsafe and are banned by Microsoft. If you prefer the faster solution with the unsafe functions, comment out the ZIP_USES_SAFE_WINDOWS_API definition in the ZipPlatform_win.cpp file.


Article ID: 0610051525