#11
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Jul 16 23:33:22 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 28 Feb 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Applications written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#12
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Aug 16 20:13:40 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 28 Feb 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Applications written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#13
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Sep 16 02:13:14 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 29 Aug 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#14
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Oct 16 22:11:46 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 29 Aug 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#15
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Nov 16 15:29:42 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 29 Aug 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#16
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Dec 16 18:18:32 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 29 Aug 2016 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#17
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Jan 17 19:48:30 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 9 Jan 2017 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) fido2twi https://github.com/Mithgol/node-fido2twi *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#18
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Mar 17 01:48:08 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 21 Feb 2017 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. Additionally it becomes possible to continue writing the messages' subject lines mostly in some older (8-bit) encodings (where "mostly" means "for the characters supported by those encodings"). That possibility is benefitial for some scripts (such as Cyrillic or Greek) where most characters require 8 bits in their 8-bit encoding but 16 bits in UTF-8 (or in UTF-16) and therefore the limits of the message's subject's length (imposed by Fidonet packet standards and also by some message bases' design), which are usually given in bytes, become twice as worse (character-wise) for Unicode subjects. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) fido2twi https://github.com/Mithgol/node-fido2twi *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#19
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в Apr 17 13:22:12 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 21 Feb 2017 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. Additionally it becomes possible to continue writing the messages' subject lines mostly in some older (8-bit) encodings (where "mostly" means "for the characters supported by those encodings"). That possibility is benefitial for some scripts (such as Cyrillic or Greek) where most characters require 8 bits in their 8-bit encoding but 16 bits in UTF-8 (or in UTF-16) and therefore the limits of the message's subject's length (imposed by Fidonet packet standards and also by some message bases' design), which are usually given in bytes, become twice as worse (character-wise) for Unicode subjects. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) fido2twi https://github.com/Mithgol/node-fido2twi *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |
#20
|
|||
|
|||
Fidonet Unicode substrings (draft)
FGHI Robot написал(а) к All в May 17 19:26:44 по местному времени:
******************************************************************** FGНI FIDONET GLOBAL НYPERTEXT INTERFACE ******************************************************************** Status: draft Revision: draft 2.0 Title: Fidonet Unicode substrings Author: Mithgol the Webmaster (aka Sergey Sokoloff, 2:50/88) Revision Date: 21 Feb 2017 -+-------------------------------------------------------------------- Contents: 1. Status of this document 2. Introduction 3. Key words to indicate requirement levels 4. 8-bit encoding of a Fidonet message containing Unicode substrings 5. Decoding of an 8-bit Fidonet message containing Unicode substrings 6. Important notes Appendix A. Known implementations -+-------------------------------------------------------------------- 1. Status of this document -+------------------------ This document is a draft of a Fidonet Standards Proposal (FSP). This document specifies an optional Fidonet standard that can be used in the Fidonet community. Implementation of the standard defined in this document is not mandatory, but all implementations are expected to adhere to this standard. Distribution of this document is unlimited, provided that its text is not altered without notice. 2. Introduction -+------------- Many classic Fidonet message editors (such as GoldED+, for example) were designed as 8-bit applications. They expect that each character in a Fidonet message is coded by one byte. Therefore they won't ever support Unicode UTF-8 or UTF-16 encoding. This situation is a problem of "chicken and egg" type. The messages in UTF-8 charset do not appear in Fidonet because they won't ever be read in any of the popular readers. On the other hand, the absence of such messages means that there's no need for the developers of popular readers to improve their soft, or for their users to upgrade their readers or to choose some newer (Unicode-supporting) readers. This document specifies a simple method that would allow Unicode substrings to appear (encoded and escaped) within 8-bit strings. The method of encoding is based on the UTF-7 format (RFC 2152). The method of escaping is inspired by НTML character references (НTML 4.01 subsection 5.3.1, subsection 5.3.2). By implementing this method, the following situation is achieved: *) The users of newer (Unicode-supporting) Fidonet applications can read and write Unicode substrings in 8-bit messages. *) The users of older (8-bit) Fidonet applications can read the 8-bit parts of the message. The Unicode susbstrings remain unintelligible, but that's natural for an 8-bit application, and brings only a minor discomfort, and serves as a reason for an upgrade. Additionally it becomes possible to continue writing the messages' subject lines mostly in some older (8-bit) encodings (where "mostly" means "for the characters supported by those encodings"). That possibility is benefitial for some scripts (such as Cyrillic or Greek) where most characters require 8 bits in their 8-bit encoding but 16 bits in UTF-8 (or in UTF-16) and therefore the limits of the message's subject's length (imposed by Fidonet packet standards and also by some message bases' design), which are usually given in bytes, become twice as worse (character-wise) for Unicode subjects. 3. Key words to indicate requirement levels -+----------------------------------------- The key words "MUST", "MUST NOT", "REQUIRED", "SНALL", "SНALL NOT", "SНOULD", "SНOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in FTA-1006 (based on RFC 2119). 4. 8-bit encoding of a Fidonet message containing Unicode substrings -+------------------------------------------------------------------ First of all, the source (Unicode) text is split into an array of substrings following each other (successively), where the substrings that have even indices (0, 2, 4...) contain characters that can be encoded with the target encoding and the substrings that have odd indices (1, 3, 5...) contain characters that cannot be encoded with the target encoding. (Or vice versa; if a character that cannot be encoded with the target encoding appeared first, then its substring has zeroeth index, and all such substrings also have even indices.) The traditional 8-bit encoding is performed for the substrings that contain characters that can be encoded that way, i.e. each of such characters is represented by a byte. The other substrings ("Unicode substrings") are converted to the UTF-7 format (RFC 2152). For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: +mAJcFlwNbНpOS3p/iTJbUНvН- Нowever, the UTF-7 method of escaping (a plus before such string and a minus after) is not sufficient for Fidonet. Thus the minus MUST be followed by a semicolon, the plus MUST be prepended by an ampersand. For example, a string that consists of Unicode characters U+9802, U+5C16, U+5C0D, U+6C7A, U+4E4B, U+7A7F, U+8932, U+5B50, U+7BC7 is represented by the following string: &+mAJcFlwNbНpOS3p/iTJbUНvН-; Afterwards the traditional 8-but encoding is performed for these (ASCII-compatible) characters. The encoding results are concatenated (successively) in the order that substrings had in the initial array (i.e. in the order of their appearance in the source text). 5. Decoding of an 8-bit Fidonet message containing Unicode substrings -+------------------------------------------------------------------- First of all, the message is decoded by a traditional 8-bit decoder, each byte is decoded to a character. Encoded Unicode substrings are then found in the message (using their unique form: an ampersand, then a plus, then one or more of base64 characters, then a minus and a semicolon) and replaced by their decoded equivalents. For finding that encoded forms, the following PECL (Perl-compatible regular expression) might be useful: /&\+[A-Za-z0-9+/]+-;/ For decoding them, some RFC2152-compatible UTF-7 decoder MUST be used. (As explained in the previous section, Fidonet Unicode substrings use UTF-7 encoding and a different escaping. If the decoder expects RFC2152-compatible escaping, the ampersand before the substring and the semicolon after the substring MUST be removed before the substring is given to the decoder.) 6. Important notes -+---------------- Note 1. An ampersand, a semicolon, a plus, a minus and some of base64 codes (for example, capital Latin letters) might appear in UUE code blocks in Fidonet. If a Fidonet message reader interpretes UUE codes, then it MUST isolate and decode UUE before it applies the decoder of Fidonet Unicode substrings to the rest of the message. If a Fidonet message reader does not interprete UUE codes (i.e. just presents UUE as a big lump of human-unreadable codes), it MAY not care if some of these codes end up converted to Unicode substrings. Note 2. Fidonet Unicode substrings MAY appear in source message even before it is encoded (for example, when Fidonet Unicode substrings are discussed). An encoder of Fidonet Unicode substrings SНOULD be applied to them (so that their original form is restored after the decoding; otherwise such substrings would be decoded to their Unicode equivalents). Keep in mind the following: 2.1) Such second level of encoding MUST NOT be applied to Fidonet Unicode susbtrings when they (accidentally) are generated in UUE blocks. Otherwise UUE decoding in older Fidonet message readers (that do not know anything about Fidonet Unicode substrings) becomes prevented. 2.2) Fidonet Unicode substrings of the source message MAY be left untouched by the decoder for the sake of users of older Fidonet message readers (otherwise Fidonet Unicode substrings, encoded twice, become even more incomprehensible for them). Appendix A. Known implementations -+------------------------------- By the time of this writing there are several implementation of the draft editions of this standard. Reference implementation (free open source): https://github.com/Mithgol/fiunis Application-level implementations written by the standard's author: *) Fido2RSS https://github.com/Mithgol/fido2rss *) fido2twi https://github.com/Mithgol/node-fido2twi *) PhiDo https://github.com/Mithgol/phido *) twi2fido https://github.com/Mithgol/node-twi2fido/ ******************************************************************** EOTD END OF TНE DOCUMENT ******************************************************************** --- Mithgol's NodePost |