cyrilchacko
4th September 2012, 11:46
Hi,
I am currently trying to read a UTF-16 file generated by another application. I am using the standard seq.gets call to do this. Each line is then converted to the TSS format using the uni.import function with the valid flag, read into variables and stored in the table.
Now, while reading a Unicode character (上), seq.gets splits the 2-byte code unit (0x4E0A) and treats its 0x0A byte as the newline character. It therefore stops reading at that point, and the next call reads the rest of the character as the start of a new line, which gives an error.
Has someone encountered this and is there a solution or workaround?
Thanks in advance.
Regards,
Cyril Parathazham
bdittmar
4th September 2012, 12:42
Hello,
from LN progguide :
uni.import()
Syntax:
function long uni.import (ref string target$, const string source$ [, long sb_flag])
Description
This function converts a string from Unicode (encoded according to a UTF-16 encoding scheme) to the TSS Encoding character set. The default encoding scheme is UTF-16BE, i.e. each UTF-16 code unit is assumed to be serialized with the most significant byte first.
Arguments
ref string target$ The target string. This string will receive the TSS encoded characters. This can be a maximum of 4096 bytes.
const string source$ The source string. It is assumed to be in byte serialized UTF-16 encoding.
[long sb_flag] This optional argument specifies the byte order of the source string. Default is UNI_MSB_ORDER.
UNI_DEF_ORDER Use the default byte order of the underlying Operating System.
UNI_MSB_ORDER De-serialize the UTF-16 code units with the most significant byte first, i.e. use UTF-16BE.
UNI_LSB_ORDER De-serialize the UTF-16 code units with the least significant byte first, i.e. use UTF-16LE.
Return values
≥ 0 the number of converted bytes
-1 the target string was too small to contain the converted string
-2 an incomplete character was found at the end of the string
-3 character could not be converted.
Context
This function can be used in all script types.
Related topics
Inverse operation: uni.export()
TSS Encoding
Unicode
UTF-16 Encoding
Multibyte strings overview and synopsis
Regards
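The effect of the byte-order flag described above can be illustrated with a short sketch. It is written in Python only because the behaviour is easy to check there; it is not Baan/LN code, and the helper decode_utf16 is a hypothetical stand-in for the choice between UNI_LSB_ORDER and UNI_MSB_ORDER, not a model of how uni.import works internally.

```python
# Illustration only: Python stands in for the byte-order choice that
# uni.import() exposes through its sb_flag argument.
text = "上"  # U+4E0A, the character from the thread

le_bytes = text.encode("utf-16-le")  # least significant byte first: 0A 4E
be_bytes = text.encode("utf-16-be")  # most significant byte first:  4E 0A


def decode_utf16(data: bytes, little_endian: bool) -> str:
    """Hypothetical helper: decode byte-serialized UTF-16 with an explicit byte order."""
    return data.decode("utf-16-le" if little_endian else "utf-16-be")


# Decoding with the matching byte order gives the character back.
right = decode_utf16(le_bytes, little_endian=True)

# Decoding little-endian bytes as big-endian swaps the two bytes
# and produces a different code point (U+0A4E).
wrong = decode_utf16(le_bytes, little_endian=False)

print(le_bytes.hex(), be_bytes.hex())
print(right == text, hex(ord(wrong)))
```

If the source file is UTF-16LE, as in this thread, the documented default of UNI_MSB_ORDER would presumably not match it, so the flag is worth checking in addition to the line-reading problem.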
cyrilchacko
5th September 2012, 06:37
Hi,
I am currently using uni.import to do the UTF-16 conversion, and the file is encoded as UTF-16LE with CRLF line endings. I am quoting a sample line below.
8T190XXX||3|XXX|XXXX00218003|XXX10|||XXXX21||||XXXX.XX|D||||上海网络维护
The only part that is in Chinese is the last field. The character "上" has the hex value 0x4E0A. Somehow seq.gets reads this as a line break. When the file was in LSB mode, the value retrieved was "\n" (0x0A), and the remainder came back in the next call to seq.gets. In MSB mode, the character is read as 0x4E followed by "\n" (0x0A).
It is not the uni.import that is failing in my case, but the seq.gets. Is there a way to read a line of import that reads the character as 0x0A4E as it is in the UTF-16LE mode?
I do have seq.read, but it does not recognize end of line at all. So, I am a bit stuck with this.
Regards,
Cyril Parathazham
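The byte layout described in this post can be reproduced outside Baan/LN. The following is a minimal Python sketch, used purely as an illustration. It assumes that a byte-oriented line reader behaves like a split on the single byte 0x0A, which is an assumption about seq.gets based on the symptoms reported here, not documented behaviour.

```python
# Tail of the sample line from the post, with the CRLF ending added back.
line = "D||||上海网络维护\r\n"
data = line.encode("utf-16-le")

# A byte-oriented reader that treats every 0x0A byte as end-of-line.
# In UTF-16LE, "上" (U+4E0A) is stored as the bytes 0A 4E, so its low
# byte is mistaken for a newline.
pieces = data.split(b"\n")

for piece in pieces:
    print(piece.hex(" "))
```

The split yields three fragments instead of one line: the ASCII part before 上, a fragment that starts with the orphaned 0x4E byte, and a lone 0x00 left over from the real line feed (0A 00). None of the later fragments is valid UTF-16 on its own, which matches the conversion error described above.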
mark_h
5th September 2012, 15:31
Can you use seq.gets to read one character or byte at a time? Or maybe seq.read? I was thinking you could build your input string until you reach the real EOL. I am not sure your input will be consistent enough to tell where that is, or whether this will work at all.
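The idea of building the input until the real EOL is reached can be sketched as follows. This is Python, not Baan/LN code, and read_utf16le_lines is a hypothetical name. It assumes the file is UTF-16LE with CRLF line endings, as stated earlier in the thread, and that raw bytes can be read in blocks (the role seq.read would play). The key point is that a line ending only counts when the 4-byte sequence 0D 00 0A 00 starts on a 2-byte boundary.

```python
import io

CRLF_LE = "\r\n".encode("utf-16-le")  # the 4 bytes 0D 00 0A 00


def read_utf16le_lines(stream, chunk_size=4096):
    """Yield decoded lines from a binary stream of UTF-16LE text with CRLF endings."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            pos = buf.find(CRLF_LE, start)
            if pos == -1:
                break  # no complete line ending yet, read more bytes
            if pos % 2:
                # Match straddles two code units, so it is not a real EOL.
                start = pos + 1
                continue
            yield buf[:pos].decode("utf-16-le")
            # Cutting at an even offset keeps the buffer aligned to code units.
            buf = buf[pos + len(CRLF_LE):]
            start = 0
    if buf:
        # Trailing text without a final CRLF.
        yield buf.decode("utf-16-le")


sample = "8T190XXX|D||||上海网络维护\r\nnext|line\r\n".encode("utf-16-le")
for text_line in read_utf16le_lines(io.BytesIO(sample), chunk_size=7):
    print(text_line)
```

The sketch does not strip a byte order mark, and a file that ends in the middle of a code unit would raise a decoding error on the last fragment. In Baan/LN the same approach would mean reading raw blocks, scanning for the aligned line ending, and passing each complete line to uni.import with the matching byte-order flag.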