
Garbage characters read from input file

Posted: 7 Apr 2014 10:50
by Ferenc Nagy
Hi,
I have opened an input file starting with the valid command line
{Test with five places}
My program got a garbage line as the first line, before the above command line, and it could not be interpreted.
The

Code: Select all

Line=InputStream:readLine()
line resulted in garbage characters which were not present in my data file.

The simplified code of parsing my input file is below.

Code: Select all

predicates
    handleEof:() determ.
    readInputFile:() determ.

clauses
    % Read input file.
    readInputFile() :-
        std::repeat(),
        EndofStream=parseLines(),
        EndofStream=true,
        !,
        handleEof().
    % Snip clauses

    % Parse input lines.
    % Snip
    parseLines()=Eof :-
        Line=InputStream:readLine(),
    parseLines()=true.
I don't understand where the garbage comes from and why I do not see it in an editor before the first line,
{Test with five places}

The strange phenomenon appears only when reading one particular file. When parsing my other input files, the program gets the first line exactly as I saved it.

Posted: 7 Apr 2014 12:59
by Paul Cerkez
Frank,
Look at the file with a hex viewer. There may be a couple of 'hidden' characters identifying the file type (similar to the signatures that identify BMPs, JPGs, etc. as image types) at the very start of the file. Many text readers ignore these, but as you are explicitly reading the data line from the file, they come in.
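If no hex viewer is at hand, the first bytes of the file can also be dumped with a few lines of code. This is a minimal Python sketch (Python is used only for illustration; it is not Visual Prolog code). The helper name `first_bytes_hex` and the sample file contents are assumptions made up for the example.

```python
import os
import tempfile


def first_bytes_hex(path, count=8):
    """Return the first `count` bytes of a file as space-separated hex pairs."""
    # Binary mode: nothing is decoded, so no leading byte can stay hidden.
    with open(path, "rb") as handle:
        data = handle.read(count)
    return " ".join(f"{byte:02x}" for byte in data)


# Hypothetical sample: a file that starts with the three UTF-8 signature bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as sample:
    sample.write(b"\xef\xbb\xbf{Test with five places}\r\n")
    sample_path = sample.name

dump = first_bytes_hex(sample_path)
os.remove(sample_path)
print(dump)
```

A file without such a signature would start directly with `7b`, the code of the opening brace.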

just a suggestion. ;-)

also, look at this link for ideas.

http://stackoverflow.com/questions/9878 ... characters

P.

Have I created a UTF-8 file?

Posted: 8 Apr 2014 9:41
by Ferenc Nagy
Thank you, Paul, for the link.
It looks like I created the file as UTF-8 and the type-identifying garbage characters remained in it.
I could clean the file by adding an extra empty line before the comment line. (Maybe when I saved the modified file it became ANSI or Unicode.)
If the problem repeats then I have to add an extra check before interpreting the first data line.
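Such an extra check on the first line could look roughly like the following Python sketch (again only an illustration, not Visual Prolog). It assumes the signature shows up either as the single character U+FEFF (file decoded as UTF-8) or as the three characters that the same bytes produce when decoded as Windows-1252; the function name `strip_signature` is made up.

```python
# The UTF-8 signature as it appears after decoding the bytes as UTF-8.
SIGNATURE_AS_UTF8_TEXT = "\ufeff"
# The same three bytes (EF BB BF) as they appear when decoded as Windows-1252.
SIGNATURE_AS_ANSI_TEXT = "\u00ef\u00bb\u00bf"


def strip_signature(first_line):
    """Remove a leading UTF-8 signature from the first line, if one is present."""
    for marker in (SIGNATURE_AS_UTF8_TEXT, SIGNATURE_AS_ANSI_TEXT):
        if first_line.startswith(marker):
            return first_line[len(marker):]
    return first_line


raw = b"\xef\xbb\xbf{Test with five places}"
print(strip_signature(raw.decode("utf-8")))
print(strip_signature(raw.decode("cp1252")))
```

Only the first line needs the check, because the signature can occur only at the very start of the file.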

Posted: 8 Apr 2014 15:48
by Thomas Linder Puls
In my mind utf-8 is a fine character set. So you could consider working with files in utf-8:

Code: Select all

class inputStream_file : inputStream
    open core
    ...
constructors
    openFileUtf8 : (string FileName).
    % @short Opens file for input in utf-8 mode.
    % @detail Opens file with specified name #FileName for input, setting stream position to the start of the file.
    % File contents is treated as utf-8 text.
    % @exception The fileSystem_exception::cannotCreate exception is raised if the specified file cannot be opened
    % @end
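For comparison, here is how the same idea (open the file explicitly as utf-8 instead of checking the first line by hand) looks in Python. This is only an analogy and says nothing about how `openFileUtf8` treats the signature. Python's `utf-8-sig` codec drops a leading signature when there is one and otherwise behaves like plain utf-8; the helper names and sample contents are made up.

```python
import os
import tempfile


def read_first_line(path):
    """Read the first line of a utf-8 file, with or without a leading signature."""
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return handle.readline().rstrip("\r\n")


def write_sample(data):
    """Write the given bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as sample:
        sample.write(data)
        return sample.name


paths = [
    write_sample(b"\xef\xbb\xbf{Test with five places}\r\n"),  # with signature
    write_sample(b"{Test with five places}\r\n"),              # without signature
]
first_lines = [read_first_line(path) for path in paths]
for path in paths:
    os.remove(path)
print(first_lines)
```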

Is not Unicode the default file mode of VIP?

Posted: 9 Apr 2014 10:59
by Ferenc Nagy
Is not Unicode the default file mode of VIP?
My input files contain comment lines and data lines split into fields of equal width.
Earlier I edited them with a DOS editor, Personal Editor II which had smart column and rectangular block handling commands. Now I handle them with Notepad++.
When I open them with the Scintilla source editor embedded in the VIP user interface, the text appears in a proportional font, so the columns are no longer aligned.
I do not know how one data file among several changed to UTF-8.

Posted: 9 Apr 2014 12:56
by Thomas Linder Puls
Default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.

So whenever the IDE saves them it will save in that format (regardless of the format they used to be in).

Preferred file mode changed from Unicode to UTF-8

Posted: 10 Apr 2014 8:53
by Ferenc Nagy
Thomas Linder Puls wrote:Default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.
What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.
What is the byte-order-mark?
What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?

:-( You used to prefer Unicode files in earlier versions, didn't you?

Re: Preferred file mode changed from Unicode to UTF-8

Posted: 11 Apr 2014 10:36
by Thomas Linder Puls
Ferenc Nagy wrote:Default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.
What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.

That is correct.
Ferenc Nagy wrote:What is the byte-order-mark?
(Strictly speaking it is not a byte-order-mark in utf-8.) Effectively it is a mark that helps to establish that the file is actually in utf-8 format.

It is probably easier for you to read about it in Wikipedia: Byte order mark
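As a small illustration of how such a mark helps to establish the format, the following Python sketch compares the start of a byte string with the known signatures. The constants come from Python's standard `codecs` module; the function name `guess_encoding` is made up, and UTF-32 is deliberately left out to keep the sketch short.

```python
import codecs

# Known signatures at the start of a file and the format each one suggests.
SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),                       # bytes EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16 (little endian)"),  # bytes FF FE
    (codecs.BOM_UTF16_BE, "utf-16 (big endian)"),     # bytes FE FF
]


def guess_encoding(data):
    """Return the format suggested by a leading signature, or None if there is none."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


print(guess_encoding(b"\xef\xbb\xbf{Test with five places}"))
print(guess_encoding(b"{Test with five places}"))
```

In utf-16 the order of the two bytes really does tell the byte order; in utf-8 the three bytes are always the same, which is why the mark is only a signature there.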
Ferenc Nagy wrote:What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?
Yes, it will.
Ferenc Nagy wrote::-( You used to prefer Unicode files in earlier versions, didn't you?
utf-8 is also Unicode, but you are correct anyway.

Previous versions of Visual Prolog by default kept files in the original encoding, but created new files in utf-16 (with byte-order-mark).

We have, however, had problems with non-Unicode files (neither utf-8 nor utf-16) being edited in both Russia and Denmark. And since a utf-16 file that is (by and large) English is twice as large as the corresponding utf-8 file, we have chosen to switch everything.
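The size argument is easy to check. This Python sketch (illustration only) encodes a sample string both ways, without a signature, and counts the bytes; the sample strings and the function name `encoded_sizes` are made up.

```python
def encoded_sizes(text):
    """Return (utf-8 size, utf-16 size) in bytes, both without a signature."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Plain English / source code text: one byte per character in utf-8, two in utf-16.
print(encoded_sizes("readInputFile() :- std::repeat(), !."))

# Cyrillic letters take two bytes each in both encodings.
print(encoded_sizes("Привет"))
```

So the factor of two holds for text that is essentially ASCII, which covers most program source.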

I am sorry for the inconvenience this causes you.

How are strings coded?

Posted: 11 Apr 2014 11:01
by Ferenc Nagy

Posted: 11 Apr 2014 13:13
by Thomas Linder Puls
Well I think you misunderstand some things.

First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.

Secondly, you should notice that Unicode is the standardization of a "conceptual" character set, which can be represented in several different ways.

"utf" refers to Unicode Transformation Format and/or UCS Transformation Format, where UCS refers to Universal Character Set. I believe it is unimportant to figure out how and why both Unicode and UCS are used.

utf-8 can (just like utf-16) represent full Unicode, including all Hungarian, Danish and Russian letters and Chinese symbols, etc.

Full Unicode has more than 2^16 codepoints (=characters), so even in utf-16 some Unicode codepoints require two 16-bit words, a so-called surrogate pair (such codepoints are rare in most text, however).

utf-8 uses from 1 to 4 bytes to represent a Unicode codepoint (the original design allowed up to 6 bytes, but the current standard limits it to 4).
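The number of bytes and 16-bit words per codepoint can be checked with a short Python sketch (illustration only). The four sample characters and the function name `units` are chosen for the example.

```python
def units(char):
    """Return (utf-8 bytes, utf-16 16-bit words) needed for one codepoint."""
    utf8_bytes = len(char.encode("utf-8"))
    utf16_words = len(char.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_words


samples = {
    "U+0041 latin A": "\u0041",
    "U+00E9 e with acute": "\u00e9",
    "U+20AC euro sign": "\u20ac",
    "U+1F600 emoji": "\U0001F600",  # above 2^16, so utf-16 needs a surrogate pair
}
for label, char in samples.items():
    print(label, units(char))
```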

Here is a list of language names from Wikipedia:



As you see, it contains a lot of different kinds of letters. You can copy it into a Visual Prolog editor, save the contents, and when you reopen the file you will notice that everything is the same. utf-8 represents it all consistently.

Notice: you may see a square instead of certain characters. This is not because the character is bad, but because the font does not have that character.

Let me close this thread

Posted: 12 Apr 2014 9:19
by Ferenc Nagy
Thank you for the tuition about code pages and character coding modes.

Thomas Linder Puls wrote:First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.
:P It was amusing to learn that "little endian" and "big endian" come from Gulliver in Lilliput: from the party cracking hard-boiled eggs at the pointed end (the "little end-ian party") and the party cracking them at the unpointed end (the "big end-ian party"), respectively.
Ok. The lesson of your answers is that I should only inspect my data files within the IDE. If I edit and save them I risk the insertion of the four leading characters and a change of the file mode from Unicode to UTF-17.

Thomas Linder Puls wrote:Notice: you may see a square instead of certain characters, this is not because the character is bad, but because the font does not have that character.
Yes, I have noticed it.

Posted: 12 Apr 2014 10:36
by Thomas Linder Puls
UTF-17, that is a new one ;-).

What you call Unicode is actually utf-16. And the IDE will change the file from whatever it was to utf-8. This will work correctly in these four cases:
  • The file is already utf-8 (= no change)
  • The file is in utf-16
  • The file is an 8-bit file in the same code page that your account uses.
  • The file is an 8-bit file in some codepage, but it does not actually use any special characters
Meaning that it gives problems in this case:
  • The file is in a different codepage (say Cyrillic) than the one your account uses
  • and contains characters that are special to that codepage
The special characters from the Cyrillic codepage will be interpreted in your local codepage and these (misinterpreted) characters will end up in the saved file. And from that point on they will not change again (until someone changes them manually) because then the file is in an encoding that can represent everything.
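The problem case can be reproduced in a few lines. This Python sketch (illustration only, not what the IDE does internally) assumes a file written in the Cyrillic codepage Windows-1251 and an account that uses Windows-1252; the sample word and the function name `save_as_utf8_via_wrong_codepage` are made up.

```python
def save_as_utf8_via_wrong_codepage(raw_bytes, assumed_codepage="cp1252"):
    """Interpret 8-bit bytes in the assumed codepage and re-encode them as utf-8."""
    return raw_bytes.decode(assumed_codepage).encode("utf-8")


original = "Привет"                         # text as the author typed it
file_bytes = original.encode("cp1251")      # stored as an 8-bit Cyrillic file

saved = save_as_utf8_via_wrong_codepage(file_bytes)
misread = saved.decode("utf-8")
print(misread)                              # accented Latin letters, not Cyrillic

# The utf-8 file now represents the misread characters faithfully, so reopening
# it never brings the Cyrillic text back. A manual repair has to undo the wrong
# interpretation explicitly:
repaired = misread.encode("cp1252").decode("cp1251")
print(repaired == original)
```

The repair works here only because every byte of the sample happens to be defined in both codepages; that is not guaranteed for arbitrary text.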