Garbage characters read from input file

**Ferenc Nagy** » 7 Apr 2014 10:50

Hi,
I have opened an input file starting with the valid command line

{Test with five places}

My program got the garbage line as the first line before the above command line.

which could not be interpreted.
The

Line=InputStream:readLine()

line resulted in

which was not present in my data file.

The simplified code of parsing my input file is below.

predicates
handleEof:() determ.
readInputFile:() determ.
 
clauses
% Read input file.
readInputFile() :-
    std::repeat(),
    EndofStream=parseLines(),
    EndofStream=true,
    !,
    handleEof().
% Snip
clauses
 
% Parse input lines.
% Snip
 
parseLines()=Eof :-
    Line=InputStream:readLine(),
 
parseLines()=true.

I don't understand where the

comes from and why I do not see it in an editor before the

{Test with five places}

first line.

The strange phenomenon appears only when reading this one file. When my other input files are parsed, the first line comes back exactly as I saved it.

**Paul Cerkez** » 7 Apr 2014 12:59

Frank,
Look at the file with a hex viewer. There may be a couple of 'hidden' characters identifying the file type (similar to the signatures that identify BMPs, JPGs, etc. as image types) at the very start of the file. Many text editors ignore them, but as you are explicitly reading the data line by line from the file, they are coming in.
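
For readers without a hex viewer at hand, the same check can be sketched in a few lines of Python (not Visual Prolog). The helper names `sniff_signature` and `demo` and the sample file contents are assumptions made for illustration; the byte signatures themselves are the standard Unicode ones.

```python
import os
import tempfile

# Byte signatures that may precede the text of a file. Longer signatures
# come first so that UTF-32 LE is not mistaken for UTF-16 LE.
SIGNATURES = [
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16 LE"),
    (b"\xfe\xff", "UTF-16 BE"),
]


def sniff_signature(path, count=8):
    """Return (hex dump of the first `count` bytes, signature name or None)."""
    with open(path, "rb") as f:
        head = f.read(count)
    dump = " ".join(f"{byte:02x}" for byte in head)
    name = next((n for sig, n in SIGNATURES if head.startswith(sig)), None)
    return dump, name


def demo():
    """Write a small file that starts with a UTF-8 signature and inspect it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\xef\xbb\xbf{Test with five places}\r\n")
        return sniff_signature(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

If the dump starts with `ef bb bf`, the file carries a UTF-8 signature in front of the first visible character.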

just a suggestion.

also, look at this link for ideas.

http://stackoverflow.com/questions/9878 ... characters

P.

**Ferenc Nagy** » 8 Apr 2014 9:41

Thank you, Paul, for the link.
It looks like I created the file as UTF-8 and the type-identifying garbage characters remained in it.
I could clean the file by adding an extra empty line before the comment line. (Maybe when I saved the modified file it became ANSI or Unicode.)
If the problem repeats, I will have to add an extra check before interpreting the first data line.
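
Such an extra check could look roughly like the following Python sketch (not Visual Prolog). It assumes the stream is decoded as UTF-8, in which case the signature arrives as the single character U+FEFF at the start of the first line; `clean_first_line` and `read_lines` are hypothetical names.

```python
BOM = "\ufeff"  # how the UTF-8 signature appears once the bytes are decoded


def clean_first_line(line):
    """Drop a leading signature character, if any, and return the real text."""
    return line[1:] if line.startswith(BOM) else line


def read_lines(path):
    """Yield the lines of a UTF-8 file, with the signature removed from line 1."""
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f):
            line = line.rstrip("\n")
            yield clean_first_line(line) if number == 0 else line


if __name__ == "__main__":
    print(clean_first_line("\ufeff{Test with five places}"))
```

In Python specifically, opening the file with `encoding="utf-8-sig"` performs the same removal automatically; the explicit check is shown because it carries over to other languages.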

**Ferenc Nagy** » 9 Apr 2014 10:59

Is not Unicode the default file mode of VIP?
My input files contain comment lines and data lines split into fields of equal width.
Earlier I edited them with a DOS editor, Personal Editor II which had smart column and rectangular block handling commands. Now I handle them with Notepad++.
When I open them with the Scintilla source editor embedded in the VIP user interface, the text appears in a proportional font, so the columns no longer line up.
I do not know how one data file among several got changed to UTF-8.

**Thomas Linder Puls** » 9 Apr 2014 12:56

The default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.

So whenever the IDE saves a file, it will save it in that format (regardless of the format it used to be in).

**Ferenc Nagy** » 10 Apr 2014 8:53

Thomas Linder Puls wrote:The default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.

What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.
What is the byte-order-mark?
What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?

You used to prefer Unicode files in earlier versions, didn't you?

**Thomas Linder Puls** » 11 Apr 2014 10:36

Ferenc Nagy wrote:The default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.

What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.

That is correct.

Ferenc Nagy wrote:What is the byte-order-mark?

(Strictly speaking, it is not a byte-order-mark in utf-8.) Effectively, it is a mark that helps to establish that the file is actually in utf-8 format.

It is probably easier for you to read about it on Wikipedia: Byte order mark
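
A short Python sketch (not Visual Prolog) may make the point concrete. The mark is the single code point U+FEFF; in utf-16 its two bytes reveal the byte order, while in utf-8 there is only one possible byte sequence, so there it serves purely as a signature. `encoded_marks` is a hypothetical helper name.

```python
MARK = "\ufeff"  # the code point used as the byte-order-mark


def encoded_marks():
    """Encode the mark in three Unicode encodings and return the raw bytes."""
    return {
        "utf-8": MARK.encode("utf-8"),
        "utf-16-le": MARK.encode("utf-16-le"),
        "utf-16-be": MARK.encode("utf-16-be"),
    }


if __name__ == "__main__":
    for name, data in encoded_marks().items():
        print(name, " ".join(f"{byte:02x}" for byte in data))
```

The two utf-16 results are byte-swapped versions of each other, which is what lets a reader tell little endian from big endian; the utf-8 result is always the same three bytes.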

Ferenc Nagy wrote:What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?

Yes, it will.

Ferenc Nagy wrote: You used to prefer Unicode files in earlier versions, didn't you?

utf-8 is also Unicode, but you are correct anyway.

Previous versions of Visual Prolog by default kept files in the original encoding, but created new files in utf-16 (with byte-order-mark).

We have, however, had problems with non-Unicode files (neither utf-8 nor utf-16) being edited in different regions. And since a utf-16 file in (by and large) English is twice as large as the corresponding file in utf-8, we have chosen to shift everything.
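
The size argument can be checked with a small Python sketch (not Visual Prolog). `encoded_sizes` is a hypothetical helper, the Hungarian sample sentence is an assumption added for contrast, and signatures are left out of the count.

```python
def encoded_sizes(text):
    """Return (utf-8 size, utf-16 size) in bytes, signatures not counted."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


if __name__ == "__main__":
    for sample in ("{Test with five places}", "árvíztűrő tükörfúrógép"):
        utf8_size, utf16_size = encoded_sizes(sample)
        print(f"{len(sample)} characters: utf-8 {utf8_size} bytes, utf-16 {utf16_size} bytes")
```

Pure ASCII text takes one byte per character in utf-8 and two in utf-16. Accented letters take two bytes in utf-8, so text with many of them saves less, but for these samples utf-8 is still the smaller encoding.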

I am sorry for the inconvenience this causes you.

**Ferenc Nagy** » 11 Apr 2014 11:01

**Thomas Linder Puls** » 11 Apr 2014 13:13

Well I think you misunderstand some things.

First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.

Secondly, you should notice that Unicode is the standardization of a "conceptual" character set, which can be represented in several different ways.

"utf" refers to Unicode Transformation Format and/or UCS Transformation Format, where UCS refers to Universal Character Set. I believe it is unimportant to figure out how and why both Unicode and UCS are used.

utf-8 can (just like utf-16) represent full Unicode, including all Hungarian, Danish and Russian letters and Chinese symbols, etc.

Full Unicode has more than 2^16 codepoints (=characters), so even in utf-16 some Unicode codepoints require two 16-bit words, a so-called surrogate pair (these codepoints are rare in ordinary text, but they are in use, for example for emoji and less common Chinese symbols).
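
The surrogate-pair arithmetic can be sketched in Python (not Visual Prolog) and compared with what the built-in utf-16 encoder produces. The helper names and the sample code point U+1F600 (an emoji outside the first 2^16 codepoints) are assumptions chosen for illustration.

```python
import struct


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its high and low surrogate."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_units(ch):
    """Return the 16-bit words of `ch` in utf-16, as hexadecimal strings."""
    data = ch.encode("utf-16-be")
    words = struct.unpack(f">{len(data) // 2}H", data)
    return [f"{word:04X}" for word in words]


if __name__ == "__main__":
    print(utf16_units("\u0151"))      # a Hungarian letter: one 16-bit word
    print(utf16_units("\U0001F600"))  # beyond 2^16: two 16-bit words
    print([f"{word:04X}" for word in surrogate_pair(0x1F600)])
```

The last two lines print the same pair, showing that the encoder applies exactly this offset-and-split rule.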

utf-8 uses from 1 to 4 bytes to represent a Unicode codepoint (the original design allowed up to 6 bytes, but 4 bytes are sufficient for all of Unicode).
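
As a quick illustration in Python (not Visual Prolog), the sketch below measures the utf-8 length of four sample characters, one for each length class. The samples and the helper name `utf8_length` are assumptions chosen for the example.

```python
# One sample per length class: Latin A, Hungarian o with double acute,
# the euro sign, and an emoji from beyond the first 2^16 codepoints.
SAMPLES = ["A", "\u0151", "\u20ac", "\U0001F600"]


def utf8_length(ch):
    """Return the number of bytes utf-8 needs for a single character."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in utf-8")
```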

Here is a list of language names from Wikipedia:

As you see, it contains a lot of different kinds of letters. You can copy it into a Visual Prolog editor, save the contents, and when you reopen the file you will notice that everything is the same. utf-8 represents it all consistently.
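
The same save-and-reopen experiment can be sketched outside the IDE in Python (not Visual Prolog). The sample text below is a stand-in chosen for this example, not the list from the post, and `round_trip` is a hypothetical helper; Python's `utf-8-sig` codec writes and strips the signature.

```python
import os
import tempfile

# Stand-in sample mixing several scripts (an assumption, not the original list).
SAMPLE = "Magyar, Dansk, Русский, 中文, Ελληνικά"


def round_trip(text):
    """Save text as utf-8 with a signature, read it back, return the result."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8-sig") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8-sig") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(round_trip(SAMPLE) == SAMPLE)
```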

Notice: you may see a square instead of certain characters. This is not because the character is bad, but because the font does not have that character.

**Ferenc Nagy** » 12 Apr 2014 9:19

Thank you for the tuition about code pages and character coding modes.

Thomas Linder Puls wrote:First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.

It was amusing that "little endian" and "big endian" come from Gulliver in Lilliput: the party cracking hard-boiled eggs at the pointed end is the "little end-ian party", and the party cracking them at the blunt end is the "big end-ian party".
OK. The lesson of your answers is that I should only inspect my data files within the IDE. If I edit and save them, I risk the insertion of the four leading characters and a change of the file mode from Unicode to UTF-17.

Thomas Linder Puls wrote:Notice: you may see a square instead of certain characters, this is not because the character is bad, but because the font does not have that character.

Yes, I have noticed it.

**Thomas Linder Puls** » 12 Apr 2014 10:36

UTF-17, that is a new one.

What you call Unicode is actually utf-16. And the IDE will change the file from whatever encoding it was in to utf-8. This will work correctly in these four cases:

The file is already utf-8 (= no change)
The file is in utf-16
The file is an 8-bit file in the same code page that your account uses.
The file is an 8-bit file in some codepage, but it does not actually use any special characters

Meaning that it gives problems in this case:

The file is in a different codepage (say Cyrillic) than the one your account uses and contains characters that are special to that codepage

The special characters from the Cyrillic codepage will be interpreted in your local codepage, and these (misinterpreted) characters will end up in the saved file. From that point on they will not change again (until someone changes them manually), because the file is then in an encoding that can represent everything.
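
The failure case can be reproduced in a few lines of Python (not Visual Prolog). The sketch assumes a file stored in the Cyrillic codepage cp1251 and an account that uses the Western codepage cp1252; both codepages, the sample word and the helper name `resave_as_utf8` are assumptions made for illustration.

```python
def resave_as_utf8(raw, assumed_codepage):
    """Decode 8-bit bytes with the assumed codepage and re-encode as utf-8."""
    return raw.decode(assumed_codepage).encode("utf-8")


ORIGINAL = "Привет"
RAW = ORIGINAL.encode("cp1251")  # the file as stored in the Cyrillic codepage

if __name__ == "__main__":
    # Correct guess: the text survives the conversion.
    print(resave_as_utf8(RAW, "cp1251").decode("utf-8"))
    # Wrong guess: Western European letters end up in the utf-8 file.
    print(resave_as_utf8(RAW, "cp1252").decode("utf-8"))
    # Text without special characters converts the same way under either guess.
    plain = b"{Test with five places}"
    print(resave_as_utf8(plain, "cp1251") == resave_as_utf8(plain, "cp1252"))
```

After the wrong conversion the file is valid utf-8, so nothing flags the damage later; the misread letters are simply what the file now contains.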
