Garbage characters read from input file

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26


Unread post by Ferenc Nagy » 7 Apr 2014 10:50

Hi,
I have opened an input file starting with the valid command line
{Test with five places}
My program read a garbage line as the first line, before the command line above, and could not interpret it.
The call

Code: Select all

Line=InputStream:readLine()
returned leading characters that were not present in my data file.

The simplified code of parsing my input file is below.

Code: Select all

predicates
    handleEof : () determ.
    readInputFile : () determ.

clauses
    % Read input file.
    readInputFile() :-
        std::repeat(),
        EndofStream = parseLines(),
        EndofStream = true,
        !,
        handleEof().
    % Snip clauses

    % Parse input lines.
    % Snip
    parseLines() = Eof :-
        Line = InputStream:readLine(),
    parseLines() = true.
I don't understand where the garbage comes from, or why I do not see it in an editor before the first line,
{Test with five places}

This strange phenomenon appears with only one file. When my other input files are parsed, the first line read is exactly what I saved.
TIA, Regards,
Frank Nagy

Paul Cerkez
VIP Member
Posts: 202
Joined: 6 Mar 2000 0:01

Unread post by Paul Cerkez » 7 Apr 2014 12:59

Frank,
Look at the file with a hex viewer. There may be a couple of 'hidden' characters identifying the file type (similar to BMPs, JPGs, etc. identifying image types) at the very start of the file. Many 'text readers' will ignore these, but as you are explicitly reading the data line in from the file, they are coming in.
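
If no hex viewer is at hand, a few lines of script can show the same thing. The following is a minimal sketch in Python (not Visual Prolog); the helper name `leading_bytes_hex` and the throwaway demo file are assumptions made purely for illustration. It reads the file in binary mode, so nothing is decoded or hidden.

```python
import os
import tempfile

def leading_bytes_hex(path, count=8):
    """Return the first `count` bytes of a file as space-separated hex."""
    with open(path, "rb") as f:  # binary mode: no decoding is applied
        data = f.read(count)
    return " ".join(f"{b:02x}" for b in data)

# Build a throwaway file that begins with the UTF-8 signature EF BB BF.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf{Test with five places}\r\n")

demo_hex = leading_bytes_hex(demo_path)
os.remove(demo_path)
print(demo_hex)  # ef bb bf 7b 54 65 73 74
```

A file that starts directly with `7b` (the `{` character) has no such signature.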

just a suggestion. ;-)

also, look at this link for ideas.

http://stackoverflow.com/questions/9878 ... characters

P.
AI Rules!
P.

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Have I created an UTF-8 file?

Unread post by Ferenc Nagy » 8 Apr 2014 9:41

Thank you, Paul, for the link.
It looks like I created the file as UTF-8 and the type-identifying garbage characters remained in it.
I could clean the file by adding an extra empty line before the comment line. (Maybe when I saved the modified file it became ANSI or Unicode.)
If the problem recurs, I will have to add an extra check before interpreting the first data line.
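
Such an extra check could look roughly like the sketch below. It is written in Python rather than Visual Prolog, and the function name `decode_first_line` is hypothetical. It assumes the line is available as raw bytes and that the file is UTF-8 apart from the optional three-byte signature.

```python
import codecs

def decode_first_line(raw):
    """Decode a UTF-8 line, dropping a leading signature if there is one."""
    if raw.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")

with_mark = b"\xef\xbb\xbf{Test with five places}"
without_mark = b"{Test with five places}"

first = decode_first_line(with_mark)
second = decode_first_line(without_mark)

# Python's built-in "utf-8-sig" codec performs the same check on its own.
via_codec = with_mark.decode("utf-8-sig")
print(first, second, via_codec, sep="\n")
```

The check is harmless for files without the signature, so it can be applied to every input file, not only the one that misbehaves.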
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 8 Apr 2014 15:48

In my opinion utf-8 is a fine character set. So you could consider working with files in utf-8:

Code: Select all

class inputStream_file : inputStream
    open core
...
constructors
    openFileUtf8 : (string FileName).
    % @short Opens file for input in utf-8 mode.
    % @detail Opens file with specified name #FileName for input, setting stream position to the start of the file.
    % File contents is treated as utf-8 text.
    % @exception The fileSystem_exception::cannotCreate exception is raised if the specified file cannot be opened
    % @end
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Is not Unicode the default file mode of VIP?

Unread post by Ferenc Nagy » 9 Apr 2014 10:59

Is not Unicode the default file mode of VIP?
My input files contain comment lines and data lines split into fields of equal width.
Earlier I edited them with a DOS editor, Personal Editor II, which had smart column and rectangular block handling commands. Now I handle them with Notepad++.
When I open them with the Scintilla source editor embedded in the VIP user interface, the text appears in a proportional font, so the columns no longer line up.
I do not know how one data file among several changed to UTF-8.
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 9 Apr 2014 12:56

The default mode for Visual Prolog files has switched to utf-8 with a byte-order-mark.

So whenever the IDE saves them, it will save in that format (regardless of the format they used to be in).
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Preferred file mode changed from Unicode to UTF-8

Unread post by Ferenc Nagy » 10 Apr 2014 8:53

Thomas Linder Puls wrote:Default mode for Visual Prolog files have switched to utf-8 with a byte-order-mark.
What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.
What is the byte-order-mark?
What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?

:-( You used to prefer Unicode files in earlier versions, didn't you?
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Re: Preferred file mode changed from Unicode to UTF-8

Unread post by Thomas Linder Puls » 11 Apr 2014 10:36

Ferenc Nagy wrote:Default mode for Visual Prolog files have switched to utf-8 with a byte-order-mark.
What does it mean exactly?
I understand this statement to mean that if the system automatically generates a source file, or I save a file within the IDE, then it is stored in utf-8 mode.

That is correct.
Ferenc Nagy wrote:What is the byte-order-mark?
(Strictly speaking, it is not a byte-order-mark in utf-8.) Effectively, it is a mark that helps to establish that the file is actually in utf-8 format.

It is probably easier for you to read about it in Wikipedia: Byte order mark
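
As a small illustration of the mark, here is a hedged Python sketch (the `SIGNATURES` table and the `sniff_encoding` helper are hypothetical names). The mark is the single code point U+FEFF; in utf-16 its two bytes reveal the byte order, while in utf-8 it always encodes to the same three bytes and so only serves as a signature. The sketch deliberately ignores utf-32, whose little-endian mark begins with the same two bytes as the utf-16 one.

```python
import codecs

# U+FEFF as it appears on disk in each encoding.
SIGNATURES = [
    ("utf-8", codecs.BOM_UTF8),         # EF BB BF
    ("utf-16-le", codecs.BOM_UTF16_LE), # FF FE
    ("utf-16-be", codecs.BOM_UTF16_BE), # FE FF
]

def sniff_encoding(data):
    """Guess the encoding from a leading mark; None if no mark is present."""
    for name, mark in SIGNATURES:
        if data.startswith(mark):
            return name
    return None

utf8_guess = sniff_encoding(b"\xef\xbb\xbf{Test with five places}")
utf16_guess = sniff_encoding("\ufeffabc".encode("utf-16-le"))
plain_guess = sniff_encoding(b"{Test with five places}")
print(utf8_guess, utf16_guess, plain_guess)
```

A file without any mark returns `None` here, which is why an 8-bit file still has to be interpreted in some assumed code page.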
Ferenc Nagy wrote:What happens with a data file not registered in the project if I edit and save it within the IDE?
Will its mode be changed to utf-8 with a byte-order-mark?
Yes, it will.
Ferenc Nagy wrote::-( You used to prefer Unicode files in earlier versions, didn't you?
utf-8 is also Unicode, but you are correct anyway.

Previous versions of Visual Prolog by default kept files in the original encoding, but created new files in utf-16 (with byte-order-mark).

We have, however, had problems with files in non-Unicode encodings (neither utf-8 nor utf-16) being edited in both Russia and Denmark. And since a utf-16 file containing (by and large) English text is twice as large as the corresponding utf-8 file, we have chosen to shift everything.
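
The size argument can be checked with a short Python sketch; the sample strings are arbitrary choices for illustration. Pure ASCII text takes one byte per character in utf-8 and two in utf-16, whereas Cyrillic letters take two bytes in both.

```python
english = "Garbage characters read from input file\n"
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет", six Cyrillic letters

english_utf8 = len(english.encode("utf-8"))
english_utf16 = len(english.encode("utf-16-le"))  # "-le": no mark is added
russian_utf8 = len(russian.encode("utf-8"))
russian_utf16 = len(russian.encode("utf-16-le"))

print(english_utf8, english_utf16)
print(russian_utf8, russian_utf16)
```

So the saving applies to source files that are mostly ASCII, which is the case described above.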

I am sorry for the inconvenience it causes you.
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

How are strings coded?

Unread post by Ferenc Nagy » 11 Apr 2014 11:01

TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 11 Apr 2014 13:13

Well I think you misunderstand some things.

First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.

Secondly, you should notice that Unicode is the standardization of a "conceptual" character set, which can be represented in several different ways.

"utf" refers to Unicode Transformation Format and/or UCS Transformation Format, where UCS refers to Universal Character Set. I believe it is unimportant to figure out how and why both Unicode and UCS are used.

utf-8 can (just like utf-16) represent full Unicode, including all Hungarian, Danish and Russian letters and Chinese symbols, etc.

Full Unicode has more than 2^16 codepoints (=characters), so even in utf-16 some Unicode codepoints require two 16-bit words, a so-called surrogate pair (such codepoints are, however, rare in everyday text).

utf-8 was designed to use from 1 to 6 bytes to represent a Unicode codepoint (4 bytes are, however, sufficient for all of current Unicode).
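
A short Python sketch makes these lengths concrete. The sample characters are arbitrary picks for illustration, and the helper `units` is a hypothetical name; it reports the number of utf-8 bytes and the number of 16-bit utf-16 words for one codepoint.

```python
def units(ch):
    """Return (utf-8 bytes, utf-16 16-bit words) needed for one codepoint."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_words = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_words

samples = {
    "LATIN A": "A",                    # U+0041
    "E ACUTE": "\u00e9",               # U+00E9
    "CYRILLIC ZHE": "\u0416",          # U+0416
    "CJK MIDDLE": "\u4e2d",            # U+4E2D
    "GRINNING FACE": "\U0001F600",     # U+1F600, outside the first 2^16
}

for name, ch in samples.items():
    print(name, units(ch))
```

Only the last sample lies beyond the first 2^16 codepoints, so only it needs a surrogate pair in utf-16 and four bytes in utf-8.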

Here is a list of language names from Wikipedia:



As you see, it contains a lot of different kinds of letters. You can copy it into a Visual Prolog editor, save the contents, and when you reopen the file you will notice that everything is the same. utf-8 represents it all consistently.

Notice: you may see a square instead of certain characters. This is not because the character is bad, but because the font does not have that character.
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Let me close this thread

Unread post by Ferenc Nagy » 12 Apr 2014 9:19

Thank you for the tuition about code pages and character coding modes.

Thomas Linder Puls wrote:First of all you should notice that PFC can still read and write files in utf-16, utf-8 and any other Microsoft codepage you may prefer. But the IDE itself will store files in utf-8 with a BOM.
:P It was amusing to learn that "little endian" and "big endian" come from Gulliver in Lilliput: from the party that cracks its hard-boiled eggs at the pointed end, named the "little end-ian party", and the party that cracks them at the blunt end, named the "big end-ian party".
OK. The lesson of your answers is that I should only inspect my data files within the IDE. If I edit and save them there, I risk the insertion of the four leading characters and a change of the file mode from Unicode to UTF-17.

Thomas Linder Puls wrote:Notice: you may see a square instead of certain characters, this is not because the character is bad, but because the font does not have that character.
Yes, I have noticed it.
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 12 Apr 2014 10:36

UTF-17 that is a new one ;-).

What you call Unicode is actually utf-16. And the IDE will change the file from whatever encoding it was in to utf-8. This will work correctly in these four cases:
  • The file is already utf-8 (= no change)
  • The file is in utf-16
  • The file is an 8-bit file in the same code page that your account uses.
  • The file is an 8-bit file in some codepage, but it does not actually use any special characters
This means that it gives problems in this case:
  • The file is in a different codepage (say Cyrillic) than the one your account uses
  • and contains characters that are special to that codepage
The special characters from the Cyrillic codepage will be interpreted in your local codepage, and these (misinterpreted) characters will end up in the saved file. From that point on they will not change again (until someone changes them manually), because the file is then in an encoding that can represent everything.
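
The problem case can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the file is taken to be in the Cyrillic code page cp1251, the account's local code page is taken to be cp1252, and the variable names are hypothetical.

```python
original = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"

on_disk = original.encode("cp1251")   # 8-bit file written in the Cyrillic code page
misread = on_disk.decode("cp1252")    # opened under a Western European code page
saved = misread.encode("utf-8")       # saved as utf-8: the damage is now permanent

reopened = saved.decode("utf-8")      # still the misread text, not the original
repaired = reopened.encode("cp1252").decode("cp1251")  # the manual fix

# Plain ASCII content survives the same round trip unchanged.
ascii_line = "{Test with five places}"
ascii_roundtrip = ascii_line.encode("cp1251").decode("cp1252")

print(reopened)
print(repaired)
```

The manual repair works here only because every byte happened to map to some character in cp1252; in general the misreading may not be reversible.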
Regards Thomas Linder Puls
PDC
