ISO

  • 19 May

    Text Character Encoding Conversion

    Dear Users,

    In v2019.05.19.0 we’ve added an option to convert the character encoding of text files.

    To achieve this we’ve added the following Pascal Script: psChangeTxtEncodingExt

    Function psChangeTxtEncodingExt(Source, Destination: String; SrcEncoding, DstEncoding: Integer; WriteBOM: Boolean): Boolean;

    • Source: Source File
    • Destination: Destination File
    • SrcEncoding: Encoding ID Source (please check list below)
    • DstEncoding: Encoding ID Destination (please check list below)
    • WriteBOM: Write BOM Destination File (Normally True)

    The function is also able to read the encoding tag of xml source files and automatically set the source encoding ID.

     

    In the following example we’re going to convert EBCDIC Codepage 273 (German) text files to UTF8.

    > Be sure to add and set a file filter so it will only handle text files (txt, xml, …)

    >Add a Pascal Script as Destination

    Const
      ctOutputPath = 'C:\Test\In_Patrick\Out\';
    Begin
      psExitCode:= 0;
      // Source File: EBCDIC in EBCDIC Codepage 273 (German) or Codepage 1141 (German incl. EUR-Sign)
      // Destination File: UTF8
      // https://limagito.com/text-character-encoding-conversion/
     
      If psChangeTxtEncodingExt(psFilePath + psFileName, ctOutputpath + psFileName, 20273, 65001, True) Then
      Begin
        psLogWrite(1, '', 'Conversion to ' + ctOutputpath + psFileName + ' Successful');
        psExitCode:= 1;
      End;
     
    End.

    > RunTime Log Result

    The function is able to convert quite some encodings:

    We have our own Encoding ID’s:

    EncodingID Additional information 
    0          UTF8
    1          UTF7
    2          Unicode
    3          Default
    4          Big Endian Unicode
    5          ASCII
    6          ANSI
    7          Reserved
    8          Reserved
    9          XML Auto Encoding detection (Source Only)

    But also all Windows Encoding ID’s can be used:

    Code Page Identifiers

    EncodingID  Additional information
    37          IBM037, IBM EBCDIC US-Canada
    437         IBM437, OEM United States
    500         IBM500, IBM EBCDIC International
    708         ASMO-708, Arabic (ASMO 708)
    709         ASMO-709, Arabic ASMO-449+, BCON V4
    710         ASMO-710, Arabic Transparent Arabic
    720         DOS-720, Arabic (Transparent ASMO); Arabic (DOS)
    737         IBM737, OEM Greek (formerly 437G); Greek (DOS)
    775         IBM775, OEM Baltic; Baltic (DOS)
    850         IBM850, OEM Multilingual Latin 1; Western European (DOS)
    852         IBM852, OEM Latin 2; Central European (DOS)
    855         IBM855, OEM Cyrillic (primarily Russian)
    857         IBM857, OEM Turkish; Turkish (DOS)
    858         IBM858, OEM Multilingual Latin 1 + Euro symbol
    860         IBM860, OEM Portuguese; Portuguese (DOS)
    861         IBM861, OEM Icelandic; Icelandic (DOS)
    862         DOS-862, OEM Hebrew; Hebrew (DOS)
    863         IBM863, OEM French Canadian; French Canadian (DOS)
    864         IBM864, OEM Arabic; Arabic (864)
    865         IBM865, OEM Nordic; Nordic (DOS)
    866         cp866, OEM Russian; Cyrillic (DOS)
    869         IBM869, OEM Modern Greek; Greek, Modern (DOS)
    870         IBM870, IBM EBCDIC Multilingual/ROECE (Latin 2); IBM EBCDIC Multilingual Latin 2
    874         windows-874, ANSI/OEM Thai (ISO 8859-11); Thai (Windows)
    875         cp875, IBM EBCDIC Greek Modern
    932         shift_jis, ANSI/OEM Japanese; Japanese (Shift-JIS)
    936         gb2312, ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)
    949         ks_c_5601-1987, ANSI/OEM Korean (Unified Hangul Code)
    950         big5, ANSI/OEM Traditional Chinese (Taiwan; Hong Kong SAR, PRC); Chinese Traditional (Big5)
    1026        IBM1026, IBM EBCDIC Turkish (Latin 5)
    1047        IBM01047, IBM EBCDIC Latin 1/Open System
    1140        IBM01140, IBM EBCDIC US-Canada (037 + Euro symbol); IBM EBCDIC (US-Canada-Euro)
    1141        IBM01141, IBM EBCDIC Germany (20273 + Euro symbol); IBM EBCDIC (Germany-Euro)
    1142        IBM01142, IBM EBCDIC Denmark-Norway (20277 + Euro symbol); IBM EBCDIC (Denmark-Norway-Euro)
    1143        IBM01143, IBM EBCDIC Finland-Sweden (20278 + Euro symbol); IBM EBCDIC (Finland-Sweden-Euro)
    1144        IBM01144, IBM EBCDIC Italy (20280 + Euro symbol); IBM EBCDIC (Italy-Euro)
    1145        IBM01145, IBM EBCDIC Latin America-Spain (20284 + Euro symbol); IBM EBCDIC (Spain-Euro)
    1146        IBM01146, IBM EBCDIC United Kingdom (20285 + Euro symbol); IBM EBCDIC (UK-Euro)
    1147        IBM01147, IBM EBCDIC France (20297 + Euro symbol); IBM EBCDIC (France-Euro)
    1148        IBM01148, IBM EBCDIC International (500 + Euro symbol); IBM EBCDIC (International-Euro)
    1149        IBM01149, IBM EBCDIC Icelandic (20871 + Euro symbol); IBM EBCDIC (Icelandic-Euro)
    1200        utf-16, Unicode UTF-16, little endian byte order (BMP of ISO 10646); available only to managed applications
    1201        unicodeFFFE, Unicode UTF-16, big endian byte order; available only to managed applications
    1250        windows-1250, ANSI Central European; Central European (Windows)
    1251        windows-1251, ANSI Cyrillic; Cyrillic (Windows)
    1252        windows-1252, ANSI Latin 1; Western European (Windows)
    1253        windows-1253, ANSI Greek; Greek (Windows)
    1254        windows-1254, ANSI Turkish; Turkish (Windows)
    1255        windows-1255, ANSI Hebrew; Hebrew (Windows)
    1256        windows-1256, ANSI Arabic; Arabic (Windows)
    1257        windows-1257, ANSI Baltic; Baltic (Windows)
    1258        windows-1258, ANSI/OEM Vietnamese; Vietnamese (Windows)
    1361        Johab, Korean (Johab)
    10000       macintosh, MAC Roman; Western European (Mac)
    10001       x-mac-japanese, Japanese (Mac)
    10002       x-mac-chinesetrad, MAC Traditional Chinese (Big5); Chinese Traditional (Mac)
    10003       x-mac-korean, Korean (Mac)
    10004       x-mac-arabic, Arabic (Mac)
    10005       x-mac-hebrew, Hebrew (Mac)
    10006       x-mac-greek, Greek (Mac)
    10007       x-mac-cyrillic, Cyrillic (Mac)
    10008       x-mac-chinesesimp, MAC Simplified Chinese (GB 2312); Chinese Simplified (Mac)
    10010       x-mac-romanian, Romanian (Mac)
    10017       x-mac-ukrainian, Ukrainian (Mac)
    10021       x-mac-thai, Thai (Mac)
    10029       x-mac-ce, MAC Latin 2; Central European (Mac)
    10079       x-mac-icelandic, Icelandic (Mac)
    10081       x-mac-turkish, Turkish (Mac)
    10082       x-mac-croatian, Croatian (Mac)
    12000       utf-32, Unicode UTF-32, little endian byte order; available only to managed applications
    12001       utf-32BE, Unicode UTF-32, big endian byte order; available only to managed applications
    20000       x-Chinese_CNS, CNS Taiwan; Chinese Traditional (CNS)
    20001       x-cp20001, TCA Taiwan
    20002       x_Chinese-Eten, Eten Taiwan; Chinese Traditional (Eten)
    20003       x-cp20003, IBM5550 Taiwan
    20004       x-cp20004, TeleText Taiwan
    20005       x-cp20005, Wang Taiwan
    20105       x-IA5, IA5 (IRV International Alphabet No. 5, 7-bit); Western European (IA5)
    20106       x-IA5-German, IA5 German (7-bit)
    20107       x-IA5-Swedish, IA5 Swedish (7-bit)
    20108       x-IA5-Norwegian, IA5 Norwegian (7-bit)
    20127       us-ascii, US-ASCII (7-bit)
    20261       x-cp20261, T.61
    20269       x-cp20269, ISO 6937 Non-Spacing Accent
    20273       IBM273, IBM EBCDIC Germany
    20277       IBM277, IBM EBCDIC Denmark-Norway
    20278       IBM278, IBM EBCDIC Finland-Sweden
    20280       IBM280, IBM EBCDIC Italy
    20284       IBM284, IBM EBCDIC Latin America-Spain
    20285       IBM285, IBM EBCDIC United Kingdom
    20290       IBM290, IBM EBCDIC Japanese Katakana Extended
    20297       IBM297, IBM EBCDIC France
    20420       IBM420, IBM EBCDIC Arabic
    20423       IBM423, IBM EBCDIC Greek
    20424       IBM424, IBM EBCDIC Hebrew
    20833       x-EBCDIC-KoreanExtended, IBM EBCDIC Korean Extended
    20838       IBM-Thai, IBM EBCDIC Thai
    20866       koi8-r, Russian (KOI8-R); Cyrillic (KOI8-R)
    20871       IBM871, IBM EBCDIC Icelandic
    20880       IBM880, IBM EBCDIC Cyrillic Russian
    20905       IBM905, IBM EBCDIC Turkish
    20924       IBM00924, IBM EBCDIC Latin 1/Open System (1047 + Euro symbol)
    20932       EUC-JP, Japanese (JIS 0208-1990 and 0212-1990)
    20936       x-cp20936, Simplified Chinese (GB2312); Chinese Simplified (GB2312-80)
    20949       x-cp20949, Korean Wansung
    21025       cp1025, IBM EBCDIC Cyrillic Serbian-Bulgarian
    21027       (deprecated)
    21866       koi8-u, Ukrainian (KOI8-U); Cyrillic (KOI8-U)
    28591       iso-8859-1, ISO 8859-1 Latin 1; Western European (ISO)
    28592       iso-8859-2, ISO 8859-2 Central European; Central European (ISO)
    28593       iso-8859-3, ISO 8859-3 Latin 3
    28594       iso-8859-4, ISO 8859-4 Baltic
    28595       iso-8859-5, ISO 8859-5 Cyrillic
    28596       iso-8859-6, ISO 8859-6 Arabic
    28597       iso-8859-7, ISO 8859-7 Greek
    28598       iso-8859-8, ISO 8859-8 Hebrew; Hebrew (ISO-Visual)
    28599       iso-8859-9, ISO 8859-9 Turkish
    28603       iso-8859-13, ISO 8859-13 Estonian
    28605       iso-8859-15, ISO 8859-15 Latin 9
    29001       x-Europa, Europa 3
    38598       iso-8859-8-i, ISO 8859-8 Hebrew; Hebrew (ISO-Logical)
    50220       iso-2022-jp, ISO 2022 Japanese with no halfwidth Katakana; Japanese (JIS)
    50221       csISO2022JP, ISO 2022 Japanese with halfwidth Katakana; Japanese (JIS-Allow 1 byte Kana)
    50222       iso-2022-jp, ISO 2022 Japanese JIS X 0201-1989; Japanese (JIS-Allow 1 byte Kana - SO/SI)
    50225       iso-2022-kr, ISO 2022 Korean
    50227       x-cp50227, ISO 2022 Simplified Chinese; Chinese Simplified (ISO 2022)
    50229       ISO 2022 Traditional Chinese
    50930       EBCDIC Japanese (Katakana) Extended
    50931       EBCDIC US-Canada and Japanese
    50933       EBCDIC Korean Extended and Korean
    50935       EBCDIC Simplified Chinese Extended and Simplified Chinese
    50936       EBCDIC Simplified Chinese
    50937       EBCDIC US-Canada and Traditional Chinese
    50939       EBCDIC Japanese (Latin) Extended and Japanese
    51932       euc-jp, EUC Japanese
    51936       EUC-CN, EUC Simplified Chinese; Chinese Simplified (EUC)
    51949       euc-kr, EUC Korean
    51950       EUC, Traditional Chinese
    52936       hz-gb-2312, HZ-GB2312 Simplified Chinese; Chinese Simplified (HZ)
    54936       GB18030, Windows XP and later: GB18030 Simplified Chinese (4 byte); Chinese Simplified (GB18030)
    57002       x-iscii-de, ISCII Devanagari
    57003       x-iscii-be, ISCII Bangla
    57004       x-iscii-ta, ISCII Tamil
    57005       x-iscii-te, ISCII Telugu
    57006       x-iscii-as, ISCII Assamese
    57007       x-iscii-or, ISCII Odia
    57008       x-iscii-ka, ISCII Kannada
    57009       x-iscii-ma, ISCII Malayalam
    57010       x-iscii-gu, ISCII Gujarati
    57011       x-iscii-pa, ISCII Punjabi
    65000       utf-7, Unicode (UTF-7)
    65001       utf-8, Unicode (UTF-8)

     

    Please let us know if you need any help.

    Regards,

    Limagito Team

SEARCH