Unicode Support: Using Non-English Characters in Filenames and Comments
Introduction
- This functionality is specific to the ZipArchive Library and external software will
usually not be able to benefit from it.
- The ZipArchive Library will save the code pages used during compression and automatically
use them during extraction. The code pages are saved in zip extra fields. See below for more information.
- Changing the string store settings with one of the API
calls does not affect files and comments that already exist in the archive.
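- If you open an existing archive with intent to add new files to it and you want
the new files to use the same string store settings as the existing files, then
retrieve the settings from one of the existing files (CZipFileHeader::GetStringStoreSettings())
and pass them to CZipArchive::SetStringStoreSettings(), as the comments sample later in this article does.
Otherwise the library will use the default settings for the current system (ZipPlatform::GetSystemID()).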
- If you want to open an archive created with a previous version of the ZipArchive
Library, or with any program that uses filename or comment code pages different
from the standard ones, set the code pages before opening the archive. The
library will use them when decoding filenames and comments. The settings are
ignored if the archive contains extra fields with code pages created by the ZipArchive
Library. In this case, the code pages from the extra fields are used instead. Note
that these settings will also be used during compression into the same archive, unless you change them.
- When you close an archive, the string store settings are reset to their default values
for the current system, just like with the CZipStringStoreSettings::Reset()
method call. This way, if you open the next archive using the same CZipArchive
object, its string store settings are not affected by the settings of the previous archive.
Storing Unicode Filenames in a Zip Archive
Zip compression programs working under Windows use the current system OEM code page
by default to encode filenames in archives. This may not be desirable in all cases.
You can control the way the ZipArchive Library stores filenames in archives by adjusting
the first parameter of the CZipArchive::SetStringStoreSettings(UINT, bool) method.
- If you plan to extract the archive under Linux, set this parameter to
the identifier of the code page used by the Linux system under which you want to
extract the archive. You may try setting it to CP_ACP, in which case the current
system ANSI code page will be used. This works correctly if the target Linux
platform uses the same code page as your system.
- If you use e.g. Japanese or Korean characters, you may set this parameter to
CP_UTF8, so that Unicode UTF-8 is used. You need to compile the library and
your application with Unicode support.
- You can set the code page directly using its identifier. Be sure it is installed
on your system and on the system on which you plan to extract the archive.
- To restore the OEM encoding, set this parameter back to CP_OEMCP.
Sample Code
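CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
// store the next filename using UTF-8 (requires a Unicode build)
zip.SetStringStoreSettings(CP_UTF8);
zip.AddNewFile(emptyFile, _T("\u0391\u03A9"));
// store the next filename using code page 1250
zip.SetStringStoreSettings(1250);
zip.AddNewFile(emptyFile, _T("\u010D\u011B"));
// restore the default OEM encoding
zip.SetStringStoreSettings(CP_OEMCP);
zip.AddNewFile(emptyFile, _T("English characters only"));
zip.Close();
// the code pages saved in the archive are used automatically during extraction
zip.Open(zipFileName);
// extract the file at index 1 (the second file)
zip.ExtractFile(1, _T("C:\\Temp"));
zip.Close();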
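To make the effect of CP_UTF8 more concrete, the following standalone sketch shows which bytes a filename turns into when it is encoded as UTF-8. It does not use the ZipArchive Library, and the EncodeUtf8 function is a hypothetical helper introduced only for this illustration; how the library performs the conversion internally is not covered here.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the ZipArchive Library:
// encodes a sequence of Unicode code points as UTF-8 bytes.
std::string EncodeUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
    {
        // Surrogates and values above U+10FFFF are not valid code points.
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80)
        {
            // 1 byte: plain ASCII
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For the first filename in the sample, EncodeUtf8(U"\u0391\u03A9") yields the four bytes CE 91 CE A9. A program that decodes these bytes with an OEM code page instead of UTF-8 will display different characters.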
Preserving Compatibility with the Standard Zip Format
It is assumed that under Windows filenames are stored using the current system OEM
code page (CP_OEMCP). Hence external software will not be able to decode filenames
properly if they are stored using a different code page. For this reason,
the ZipArchive Library allows storing filenames encoded with a custom code page
in extra fields. The filenames in the standard location (the central directory and
local headers) are then encoded using the OEM code page. This way, external software
sees conventionally encoded filenames, and the ZipArchive Library still knows the original
filenames during extraction.
Note that this method takes additional space, because each filename is also stored
in an extra field.
To store filenames in extra fields, set the second parameter of the
CZipArchive::SetStringStoreSettings(UINT, bool) method to true.
Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
zip.SetStringStoreSettings(1250, true);
zip.AddNewFile(emptyFile, _T("\u0104\u0118"));
zip.Close();
The comments in a zip archive are usually stored using the current system ANSI code
page (CP_ACP). You can specify a different code page, e.g. by modifying
the object returned by the CZipArchive::GetStringStoreSettings() method call.
The comment code page settings affect the global comment encoding as well, but the
archive does not record which code page was used to encode the global
comment. You can retrieve this information from a file in the archive instead, e.g. from
the first one. Please refer to the sample code below.
Sample Code
CZipMemFile emptyFile;
CZipArchive zip;
LPCTSTR zipFileName = _T("C:\\Temp\\test.zip");
zip.Open(zipFileName, CZipArchive::zipCreate);
zip.AddNewFile(emptyFile, _T("empty file"));
// encode comments using UTF-8
zip.GetStringStoreSettings().m_uCommentCodePage = CP_UTF8;
LPCTSTR comment = _T("\u0104\u0118");
zip.SetFileComment(0, comment);
zip.SetGlobalComment(comment);
zip.Close();
zip.Open(zipFileName);
// the file comment is decoded using the code page saved with the file
CZipFileHeader* info = zip.GetFileInfo(0);
CZipString result = info->GetComment();
// the global comment code page is not saved: reuse the settings of the first file
zip.SetStringStoreSettings(info->GetStringStoreSettings());
result = zip.GetGlobalComment();
zip.Close();
The ZipArchive Library stores code page information and, if requested, the encoded filename
in extra fields in the central directory. The overall format of the ZipArchive extra
field is as follows:
| Field | Size (bytes) | Value |
|---|---|---|
| Header ID | 2 | 0x5A4C |
| Data Size | 2 | |
| Data | as specified by Data Size | |
The format of the Data field is as follows (not all sub-fields may be present):
| Field | Size (bytes) | Value |
|---|---|---|
| Version | 1 | 0x01 |
| Flag | 1 | 1, 3, 4 |
| Filename Code Page | 4 | |
| Encoded Filename | variable | |
| Comment Code Page | 4 | |
The Flag field values have the following meaning:
| Bits set | Flag value | Meaning |
|---|---|---|
| 0 | 1 | the Filename Code Page field is present |
| 0 and 1 | 3 | the Encoded Filename field is present (and the Filename Code Page field must be present too) |
| 2 | 4 | the Comment Code Page field is present |
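The layout described above can be sketched as a small standalone parser. This is an illustration, not the library's own code: the type and function names are hypothetical, multi-byte values are assumed to be little-endian (as elsewhere in the zip format), and the Encoded Filename is assumed to occupy all bytes between the Filename Code Page and the optional Comment Code Page, because the tables do not state how its length is determined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical type for illustration; not part of the ZipArchive Library.
struct ZipArchiveExtraData
{
    bool hasFilenameCodePage = false;
    bool hasEncodedFilename = false;
    bool hasCommentCodePage = false;
    std::uint32_t filenameCodePage = 0;
    std::uint32_t commentCodePage = 0;
    std::string encodedFilename;
};

// Reads an n-byte value (n <= 4), assuming little-endian byte order.
static std::uint32_t ReadLE(const std::vector<std::uint8_t>& b, std::size_t pos, std::size_t n)
{
    assert(n <= 4 && pos + n <= b.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < n; ++i)
        value |= static_cast<std::uint32_t>(b[pos + i]) << (8 * i);
    return value;
}

// Parses one extra field (Header ID, Data Size, Data).
// Returns false if the buffer does not match the layout assumed here;
// the contents of 'out' are then unspecified.
bool ParseZipArchiveExtraField(const std::vector<std::uint8_t>& field, ZipArchiveExtraData& out)
{
    if (field.size() < 4 || ReadLE(field, 0, 2) != 0x5A4C)
        return false;                               // wrong Header ID
    const std::size_t dataSize = ReadLE(field, 2, 2);
    if (dataSize < 2 || field.size() < 4 + dataSize)
        return false;                               // truncated field
    const std::size_t end = 4 + dataSize;
    std::size_t pos = 4;
    if (field[pos++] != 0x01)
        return false;                               // unknown Version
    const std::uint8_t flag = field[pos++];

    out = ZipArchiveExtraData();
    out.hasFilenameCodePage = (flag & 1) != 0;      // bit 0
    out.hasEncodedFilename = (flag & 2) != 0;       // bit 1
    out.hasCommentCodePage = (flag & 4) != 0;       // bit 2
    if (out.hasEncodedFilename && !out.hasFilenameCodePage)
        return false;                               // bit 1 requires bit 0

    if (out.hasFilenameCodePage)
    {
        if (end - pos < 4)
            return false;
        out.filenameCodePage = ReadLE(field, pos, 4);
        pos += 4;
    }
    const std::size_t tail = out.hasCommentCodePage ? 4 : 0;
    if (end - pos < tail)
        return false;
    if (out.hasEncodedFilename)
    {
        // Assumption: the filename fills the space up to the comment code page.
        out.encodedFilename.assign(field.begin() + pos, field.begin() + (end - tail));
        pos = end - tail;
    }
    if (out.hasCommentCodePage)
    {
        out.commentCodePage = ReadLE(field, pos, 4);
        pos += 4;
    }
    return pos == end;                              // no unexplained bytes left
}
```

The flag is treated as a bit mask, so a value of 7 means that all three optional sub-fields are present.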
Setting Locale in STL Applications
If your locale is different from English and you wish to use non-English characters
in archives, you need to set your locale globally; calling the setlocale() function
is not sufficient in this case.
- To set the global locale to be the same as your system locale, call:
std::locale::global(std::locale(""));
- To set the global locale to a particular value, call e.g.:
std::locale::global(std::locale("German"));
- When you use Unicode, do not use the _T() macro in the
above calls.
- Remember to put #include <locale> in your code.
Remember to restore the global locale to its previous value (returned by
std::locale::global) after processing, because the changed locale may affect
other parts of your application.
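One way to make sure the previous locale is always restored is a small scope guard. The sketch below uses only the standard library; GlobalLocaleGuard and WithGlobalLocale are hypothetical names introduced for this illustration and are not part of the ZipArchive Library.

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper: sets the global locale on construction and
// restores the previous one when the guard goes out of scope.
class GlobalLocaleGuard
{
public:
    explicit GlobalLocaleGuard(const std::locale& newLocale)
        : m_previous(std::locale::global(newLocale)) // global() returns the old locale
    {
    }
    ~GlobalLocaleGuard()
    {
        std::locale::global(m_previous);
    }
    GlobalLocaleGuard(const GlobalLocaleGuard&) = delete;
    GlobalLocaleGuard& operator=(const GlobalLocaleGuard&) = delete;

private:
    std::locale m_previous;
};

// Hypothetical convenience wrapper: runs 'work' with 'loc' as the global
// locale and restores the previous locale afterwards, even if 'work' throws.
template <typename Work>
void WithGlobalLocale(const std::locale& loc, Work work)
{
    GlobalLocaleGuard guard(loc);
    // A default-constructed std::locale is a copy of the current global locale.
    assert(std::locale() == loc);
    work();
}
```

A typical use would be to declare GlobalLocaleGuard guard(std::locale("")); at the top of the scope that works with archives. Note that constructing std::locale("") may throw std::runtime_error if the environment names a locale that is not available on the system.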
Additional Considerations
- The ZipArchive Library has no Unicode version under Linux. However, to make your
archive filenames readable under Linux, you can try using the UTF-8 code page when
creating an archive under Windows.
- In the Windows non-Unicode version, the library uses the Windows API
WideCharToMultiByte and MultiByteToWideChar functions to convert between the ANSI
code page and the OEM code page. It takes four function calls to perform
one conversion. The alternative is to use the CharToOemBuffA and
OemToCharBuffA functions, which take only one function call per conversion.
However, these functions are considered unsafe and are banned by Microsoft.
If you prefer the faster solution with the unsafe functions, comment out the
ZIP_USES_SAFE_WINDOWS_API definition in the ZipPlatform_win.cpp file.