- NAME
- encoding — Manipulate encodings
- SYNOPSIS
- INTRODUCTION
- DESCRIPTION
- encoding convertfrom ?encoding? data
- encoding convertfrom ?-profile profile? ?-failindex var? encoding data
- encoding convertto ?encoding? data
- encoding convertto ?-profile profile? ?-failindex var? encoding data
- encoding dirs ?directoryList?
- encoding names
- encoding profiles
- encoding system ?encoding?
- PROFILES
- strict
- tcl8
- replace
- EXAMPLES
- SEE ALSO
- KEYWORDS
encoding — Manipulate encodings
encoding option ?
arg arg ...?
Strings in Tcl are logically a sequence of Unicode characters.
These strings are represented in memory as a sequence of bytes that
may be in one of several encodings: modified UTF-8 (which uses 1 to 4
bytes per character), or a custom encoding start as 8 bit binary data.
Different operating system interfaces or applications may generate
strings in other encodings such as Shift-JIS. The encoding
command helps to bridge the gap between Unicode and these other
formats.
Performs one of several encoding related operations, depending on
option. The legal
options are:
- encoding convertfrom ?encoding? data
-
- encoding convertfrom ?-profile profile? ?-failindex var? encoding data
-
Converts data, which should be in binary string encoded as per
encoding, to a Tcl string. If encoding is not specified, the current
system encoding is used.
The -profile option determines the command behavior in the presence
of conversion errors. See the PROFILES section below for details. Any premature
termination of processing due to errors is reported through an exception if
the -failindex option is not specified.
If the -failindex is specified, instead of an exception being raised
on premature termination, the result of the conversion up to the point of the
error is returned as the result of the command. In addition, the index
of the source byte triggering the error is stored in var. If no
errors are encountered, the entire result of the conversion is returned and
the value -1 is stored in var.
- encoding convertto ?encoding? data
-
- encoding convertto ?-profile profile? ?-failindex var? encoding data
-
Convert string to the specified encoding. The result is a Tcl binary
string that contains the sequence of bytes representing the converted string in
the specified encoding. If encoding is not specified, the current system
encoding is used.
The -profile and -failindex options have the same effect as
described for the encoding convertfrom command.
- encoding dirs ?directoryList?
-
Tcl can load encoding data files from the file system that describe
additional encodings for it to work with. This command sets the search
path for *.enc encoding data files to the list of directories
directoryList. If directoryList is omitted then the
command returns the current list of directories that make up the
search path. It is an error for directoryList to not be a valid
list. If, when a search for an encoding data file is happening, an
element in directoryList does not refer to a readable,
searchable directory, that element is ignored.
- encoding names
-
Returns a list containing the names of all of the encodings that are
currently available.
The encodings
“utf-8”
and
“iso8859-1”
are guaranteed to be present in the list.
- encoding profiles
-
Returns a list of the names of encoding profiles. See PROFILES below.
- encoding system ?encoding?
-
Set the system encoding to encoding. If encoding is
omitted then the command returns the current system encoding. The
system encoding is used whenever Tcl passes strings to system calls.
Operations involving encoding transforms may encounter several types of
errors such as invalid sequences in the source data, characters that
cannot be encoded in the target encoding and so on.
A
profile prescribes the strategy for dealing with such errors
in one of two ways:
-
Terminating further processing of the source data. The profile does not
determine how this premature termination is conveyed to the caller. By default,
this is signalled by raising an exception. If the -failindex option
is specified, errors are reported through that mechanism.
-
Continue further processing of the source data using a fallback strategy such
as replacing or discarding the offending bytes in a profile-defined manner.
The following profiles are currently implemented with strict being
the default if the -profile is not specified.
- strict
-
The strict profile always stops processing when an conversion error is
encountered. The error is signalled via an exception or the -failindex
option mechanism. The strict profile implements a Unicode standard
conformant behavior.
- tcl8
-
The tcl8 profile always follows the first strategy above and corresponds
to the behavior of encoding transforms in Tcl 8.6. When converting from an
external encoding other than utf-8 to Tcl strings with the encoding
convertfrom command, invalid bytes are mapped to their numerically equivalent
code points. For example, the byte 0x80 which is invalid in ASCII would be
mapped to code point U+0080. When converting from utf-8, invalid bytes
that are defined in CP1252 are mapped to their Unicode equivalents while those
that are not fall back to the numerical equivalents. For example, byte 0x80 is
defined by CP1252 and is therefore mapped to its Unicode equivalent U+20AC while
byte 0x81 which is not defined by CP1252 is mapped to U+0081. As an additional
special case, the sequence 0xC0 0x80 is mapped to U+0000.
When converting from Tcl strings to an external encoding format using
encoding convertto, characters that cannot be represented in the
target encoding are replaced by an encoding-dependent character, usually
the question mark ?.
- replace
-
Like the tcl8 profile, the replace profile always continues
processing on conversion errors but follows a Unicode standard conformant
method for substitution of invalid source data.
When converting an encoded byte sequence to a Tcl string using
encoding convertfrom, invalid bytes
are replaced by the U+FFFD REPLACEMENT CHARACTER code point.
When encoding a Tcl string with encoding convertto,
code points that cannot be represented in the
target encoding are transformed to an encoding-specific fallback character,
U+FFFD REPLACEMENT CHARACTER for UTF targets and generally `?` for other
encodings.
These examples use the utility proc below that prints the Unicode code points
comprising a Tcl string.
proc codepoints s {join [lmap c [split $s {}] {
string cat U+ [format %.6X [scan $c %c]]}]
}
Example 1: convert a byte sequence in Japanese euc-jp encoding to a TCL string:
% codepoints [encoding convertfrom euc-jp "\xA4\xCF"]
U+00306F
The result is the unicode codepoint
“\u306F”,
which is the Hiragana letter HA.
Example 2: Error handling based on profiles:
The letter A is Unicode character U+0041 and the byte "\x80" is invalid
in ASCII encoding.
% codepoints [encoding convertfrom -profile tcl8 ascii A\x80]
U+000041 U+000080
% codepoints [encoding convertfrom -profile replace ascii A\x80]
U+000041 U+00FFFD
% codepoints [encoding convertfrom -profile strict ascii A\x80]
unexpected byte sequence starting at index 1: '\x80'
Example 3: Get partial data and the error location:
% codepoints [encoding convertfrom -profile strict -failindex idx ascii AB\x80]
U+000041 U+000042
% set idx
2
Example 4: Encode a character that is not representable in ISO8859-1:
% encoding convertto iso8859-1 A\u0141
A?
% encoding convertto -profile strict iso8859-1 A\u0141
unexpected character at index 1: 'U+000141'
% encoding convertto -profile strict -failindex idx iso8859-1 A\u0141
A
% set idx
1
Tcl_GetEncoding,
fconfigure
encoding,
unicode
Copyright © 1998 Scriptics Corporation.