UTF8 conversion

Discussions related to Visual Prolog

Unread post by Tonton Luc » 31 May 2010 14:47

Hi,
How can I convert UTF-8 text?
I've tried the following code without success (setMode doesn't seem to have any effect):


    MyString = "LÃ©gende des pictos",
    Str = inputStream_string::new(MyString),
    Str:setMode(stream::ansi(core::codepage(codepageId::utf8))),
    note(Str:readLine()),
I would like to convert "LÃ©gende des pictos" to "Légende des pictos".


Unread post by Tonton Luc » 31 May 2010 15:02

OK, I've found the solution:


    MyString = string8::mapFromString("LÃ©gende des pictos"),
    Str = string8::mapToString(MyString, core::codepage(codepageId::utf8)),
    note(toString(Str)),
;-)
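
For readers who want to see why this round trip works, here is a minimal sketch of the same idea in Python rather than Visual Prolog. It assumes the garbled string arose because UTF-8 bytes were decoded with the Windows-1252 ANSI code page (the thread does not say which code page was in effect), and `repair_mojibake` is a hypothetical helper name, not part of any library.

```python
def repair_mojibake(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo a wrong decode: recover the raw bytes, then decode them as UTF-8."""
    # Step 1: turn the characters back into the bytes they came from
    # (the role string8::mapFromString plays in the post above).
    raw = text.encode(wrong_codec)
    # Step 2: decode those bytes with the encoding they were really written in
    # (the role of string8::mapToString with the UTF-8 code page).
    return raw.decode("utf-8")


# "L\u00c3\u00a9gende" is how the UTF-8 bytes of "Légende" look when they are
# misread as Windows-1252: the two bytes of "é" show up as two characters.
garbled = "L\u00c3\u00a9gende des pictos"
repaired = repair_mojibake(garbled)
print(ascii(repaired))
```

The repair only works if the wrong decode was lossless, so that every original byte can be recovered from the garbled characters.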


Unread post by Thomas Linder Puls » 31 May 2010 19:20

I am somewhat mystified: why do you have a UTF-8 string in UTF-16 format at all?

Normally, it is better to do the conversion at the source (i.e. where the string comes into the program).
Regards Thomas Linder Puls
PDC


Unread post by Tonton Luc » 1 Jun 2010 6:59

The string comes from Gildas's vp_web package, as I mentioned in the last message of the following thread: http://discuss.visual-prolog.com/viewto ... light=utf8


Unread post by Thomas Linder Puls » 1 Jun 2010 9:50

getURLContentAsText seems to access the source, so it should use the relevant code page:


predicates
    getURLContentAsText: (string URL) -> string procedure (i).
clauses
    getURLContentAsText(URL) = Res :-
        Bin2 = getURLContentAsBin(URL),
        Text = uncheckedConvert(string8, Bin2),
        % Decode the raw bytes as UTF-8 where they enter the program.
        Res = string8::mapToString(Text, core::codepage(codepageId::utf8)).
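
The "convert at the source" advice can be sketched in Python as follows. This is only an illustration of the principle, not vp_web's code: `text_from_bytes` is a hypothetical helper, and the UTF-8 default is an assumption that the later posts in this thread show is not always valid.

```python
def text_from_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode downloaded bytes exactly once, using the source's encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Fail loudly instead of silently producing garbled text.
        raise ValueError(f"body is not valid {encoding}: {err}") from err


# Stand-in for the bytes a download routine would return.
body = "L\u00e9gende des pictos".encode("utf-8")
page = text_from_bytes(body)
print(ascii(page))
```

Decoding strictly at the point of entry means a wrong encoding assumption is reported immediately rather than discovered later as garbled text.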


Unread post by Tonton Luc » 1 Jun 2010 10:00

But are we sure that these HTML pages are ALWAYS in UTF-8?


Unread post by Thomas Linder Puls » 1 Jun 2010 10:15

No, unfortunately not.

Normally, the encoding is written in a meta tag near the top of the page. This forum, for example, sends:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
In fact, the text can even be in UTF-16 (i.e. 16-bit Unicode), in which case you should not call string8::mapToString at all.

So doing things correctly in the general case is quite complex.
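
As a rough illustration of the two cases mentioned above (a charset declared in a meta tag, and UTF-16 text), here is a small Python sketch. The helper name `sniff_encoding`, the 1024-byte search window and the UTF-8 default are all assumptions made for this example; real browsers follow a considerably more elaborate procedure.

```python
import codecs
import re

# Finds charset=... inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an HTML page from its first bytes."""
    # A byte order mark identifies UTF-16 (or UTF-8) before any tag is read.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Otherwise look for a declared charset near the top of the page.
    match = META_CHARSET.search(raw[:1024])
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name  # normalised codec name
        except LookupError:
            pass  # unknown charset label: fall through to the default
    return default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head>'
        b"<body>L\xe9gende</body></html>")
text = page.decode(sniff_encoding(page))
```

The meta tag can be searched as bytes because the common web encodings other than UTF-16 represent the ASCII characters of the tag identically.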


Unread post by Gildas Menier » 1 Jun 2010 19:54

Hi Tonton Luc & Thomas,

1. Finding the code page of a web page is not an easy task. Most of the time there is no tag for it, or the tag is not reliable. There are code page sniffers that try to infer the encoding from n-gram statistics (a web search will find them), and it is possible to use them from Visual Prolog. Detecting the language is sometimes a usable hint for finding an acceptable code page (problems arise for pages that mix several languages, of course).

2. Well, you should not use vp_web, or at least not take it too seriously. I made some libraries available to fellow forum members thinking that they might be inspiring or give some ideas for development. Some extensions are 'first try' implementations, and this is (especially) the case for vp_web. It started for me as a challenge ('is it possible to download a web page with Visual Prolog? Let's try'). Actually, I suggest you use Jan's API 'vipcurl', which is a great piece of software (again, thank you Jan!). I am using it to build web crawlers in Visual Prolog, and it is far more customizable and reliable than vp_web (which was just an experiment). This doesn't solve the code page detection problem, anyway.

3. I am currently rewriting many of the tools I made available, for Visual Prolog 7.3. I am recrafting the extensions using the generic system introduced in 7.3 so that the solutions are much more elegant and easier to reuse (or so I think and hope). Playing with VP7 is easy (or so it seems), but mastering it is another thing: some horsepower is hidden in the template and generic type mechanism. Many extensions made for VP < 7.2 will be deleted from arsaniit.com and replaced by new 7.3 versions. Some of these extensions are now deprecated because of the support provided by the new PFC. For instance, the winsock2 support eases the development of sockets, so I can discard the .lib included in my simple-client-server. Some of the APIs won't work with a 64-bit OS (I doubt, for instance, that a service compiled for 32 bits can be used on a 64-bit OS), so I really hope the next big move for Visual Prolog will be a 64-bit version...

Regards
Gildas


Unread post by Jan de Lint » 1 Jun 2010 20:18

Hi,
Yes, the meta tag is one way to specify the code page on the server side. Another way is a header field of the HTTP protocol. VPcURL can give you the received header fields.
I have no idea which of the two takes precedence if both are present, perhaps that is not allowed. However, that could be found out.
Curl also allows for socket programming but I have not gotten around to implementing that in VPcURL.
Jan
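
To make the header route concrete, here is a hedged Python sketch that reads the charset parameter from a Content-Type header value using the standard library's `email.message.Message`. The helper names `charset_from_content_type` and `pick_encoding` are hypothetical. On precedence: as far as I recall, the HTML 4.01 specification (section 5.2.2) gives the HTTP charset parameter priority over a meta declaration, and `pick_encoding` follows that order; treat this as an assumption to check against the specification.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # None when no charset parameter is present


def pick_encoding(http_charset: Optional[str],
                  meta_charset: Optional[str],
                  default: str = "utf-8") -> str:
    """Prefer the HTTP header, then the meta tag, then a default (assumed order)."""
    return http_charset or meta_charset or default


header = "text/html; charset=ISO-8859-1"
chosen = pick_encoding(charset_from_content_type(header), "utf-8")
```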


Unread post by Thomas Linder Puls » 1 Jun 2010 20:54

I believe that all (major) web servers read the meta tag in the HTML file and send a corresponding HTTP header, at least when they serve static HTML files.

CGI and ISAPI extensions (including scripting engines), however, are themselves responsible for sending the correct HTTP headers (as I recall).


Unread post by Gildas Menier » 1 Jun 2010 21:54

You would be surprised how many web pages lack a meta tag with the code page...
IE tries to guess the code page and uses MLang for that (see MLang.dll):

http://msdn.microsoft.com/en-us/library ... 85%29.aspx

see http://msdn.microsoft.com/en-us/library ... 85%29.aspx

and especially DetectInputCodePage.

Gildas
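
When no declaration is available, a very crude fallback can be sketched in Python as below. This is not what MLang's DetectInputCodePage does, and it is far weaker than a statistical sniffer: it only relies on the fact that text in a legacy 8-bit code page containing accented characters is rarely valid UTF-8. The helper name `guess_decode` and the Windows-1252 fallback are assumptions for illustration.

```python
from typing import Tuple


def guess_decode(raw: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding used): strict UTF-8 first, then a fallback code page."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a legacy 8-bit code page instead.
        # Bytes the fallback cannot map are replaced rather than raising.
        return raw.decode(fallback, errors="replace"), fallback


utf8_result = guess_decode("L\u00e9gende".encode("utf-8"))
legacy_result = guess_decode(b"L\xe9gende")
```

Pure ASCII input decodes identically under both encodings, so the guess only matters for pages that actually contain non-ASCII bytes.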


Unread post by Thomas Linder Puls » 2 Jun 2010 7:37

In principle, pages that don't have a meta tag should be encoded in 7-bit ASCII. Such pages can still contain special characters, because they can use the numeric character reference syntax, e.g. &#33345; (the character with Unicode code point 33345).

But I do not doubt that many pages are simply erroneous and that web browsers try to guess a lot of things to overcome such bugs gracefully.
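
The numeric character reference mechanism is easy to demonstrate in Python, used here purely for illustration: the `xmlcharrefreplace` error handler writes every non-ASCII character as `&#N;`, and `html.unescape` turns such references back into characters.

```python
import html

original = "L\u00e9gende des pictos"

# Encode to pure 7-bit ASCII; "é" (code point 233) becomes "&#233;".
ascii_only = original.encode("ascii", errors="xmlcharrefreplace")

# A reader of the page resolves the references back to characters.
restored = html.unescape(ascii_only.decode("ascii"))
```

A page written this way survives being read under any ASCII-compatible encoding, which is why it needs no charset declaration.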
