UTF8 conversion

Posted: 31 May 2010 14:47
by Tonton Luc
Hi,
How can I convert UTF-8 text?
I've tried the following code without success (because setMode doesn't seem to work):

Code:

    MyString = "LÃ©gende des pictos",
    Str = inputStream_string::new(MyString),
    Str:setMode(stream::ansi(core::codepage(codepageId::utf8))),
    note(Str:readLine()),

I would like to convert "LÃ©gende des pictos" to "Légende des pictos".

Posted: 31 May 2010 15:02
by Tonton Luc
OK, I've found the solution:

Code:

    MyString = string8::mapFromString("LÃ©gende des pictos"),
    Str = string8::mapToString(MyString, core::codepage(codepageId::utf8)),
    note(toString(Str)),
:wink:
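
For readers who do not use Visual Prolog, the round trip performed by the two string8 calls can be sketched at the byte level. This is a minimal Python illustration, not the Visual Prolog API. It assumes the text was wrongly decoded with Windows code page 1252, and the helper name `repair_mojibake` is invented for the example.

```python
def repair_mojibake(text: str, wrong_codec: str = "cp1252") -> str:
    """Undo a wrong decode of UTF-8 data.

    Step 1 maps every character back to the single byte it came from
    (assumption: the wrong decode used cp1252).
    Step 2 decodes the recovered byte sequence as UTF-8.
    """
    raw = text.encode(wrong_codec)
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(repair_mojibake("LÃ©gende des pictos"))
```

If the text was never mis-decoded in the first place, the second step will usually raise a UnicodeDecodeError, which is one more reason to decode correctly at the source instead.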

Posted: 31 May 2010 19:20
by Thomas Linder Puls
I am somewhat mystified: why do you have a UTF-8 string in UTF-16 format at all?

Normally, it is better to do the conversion at the source (i.e. where the string comes into the program).

Posted: 1 Jun 2010 6:59
by Tonton Luc
The string comes from Gildas's vp_web package, as I mentioned in the last message of the following thread: http://discuss.visual-prolog.com/viewto ... light=utf8

Posted: 1 Jun 2010 9:50
by Thomas Linder Puls
getURLContentAsText seems to access the source, so it should use the relevant code page:

Code:

predicates
    getURLContentAsText : (string URL) -> string procedure (i).
clauses
    getURLContentAsText(URL) = Res :-
        % Fetch the raw bytes, then decode them as UTF-8 at the source.
        Bin2 = getURLContentAsBin(URL),
        Text = uncheckedConvert(string8, Bin2),
        Res = string8::mapToString(Text, core::codepage(codepageId::utf8)).

Posted: 1 Jun 2010 10:00
by Tonton Luc
But are we sure that these HTML pages are ALWAYS in UTF-8?

Posted: 1 Jun 2010 10:15
by Thomas Linder Puls
No, unfortunately not.

Normally, the encoding is written in a meta tag near the top of the page:
This forum wrote:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

And in fact, the text can even be in UTF-16 (i.e. 16-bit Unicode), in which case you should not call string8::mapToString at all.

So doing things correctly in the general case is quite complex.
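
The two cases mentioned in this post (a charset declared in a meta tag, and UTF-16 text that must not go through an 8-bit conversion) can be sketched as follows. This is a rough Python illustration under stated assumptions, not a complete detector: recognising UTF-16 by its byte-order mark is an assumption added here, the names `sniff_encoding` and `decode_page` are hypothetical, and the ISO-8859-1 fallback is just a guess for pages that declare nothing.

```python
import codecs
import re
from typing import Optional

# Byte-order marks are checked first (assumption: UTF-16 pages carry a BOM).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches both <meta http-equiv=... content="...; charset=X"> and <meta charset="X">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_encoding(page: bytes) -> Optional[str]:
    """Return the declared encoding of an HTML page, or None if nothing is declared."""
    for bom, name in _BOMS:
        if page.startswith(bom):
            return name
    match = _META_CHARSET.search(page[:2048])  # the tag sits near the top of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def decode_page(page: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode with the sniffed encoding; fall back if it is missing or unknown."""
    encoding = sniff_encoding(page) or fallback
    try:
        return page.decode(encoding, errors="replace")
    except LookupError:  # the page declared a charset name Python does not know
        return page.decode(fallback, errors="replace")
```

A page that declares the wrong charset will still be decoded wrongly, so this only covers pages that describe themselves honestly.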

Posted: 1 Jun 2010 19:54
by Gildas Menier
Hi Tonton Luc & Thomas,

1. Finding the code page of a web page is not an easy task. Most of the time there is no tag for it (or the tag is not reliable). There are code page sniffers that try to guess the encoding from n-gram statistics (ask Google), and it is possible to use them from Visual Prolog. Detecting the language is sometimes a usable hint for finding an acceptable code page (problems arise for pages that mix several languages, of course).

2. Well, you should not use vp_web, or at least not take it too seriously. I made some libraries available to fellow forum members, thinking they might be inspiring or give some ideas for development. Some extensions are 'first try' implementations, and this is (especially) the case for vp_web. It started for me as a challenge ('is it possible to download a web page with Visual Prolog? Let's try'). Actually, I suggest you use Jan's API 'vipcurl', which is a great piece of software (again, thank you Jan!). I am using it to build web crawlers in Visual Prolog, and it is far more customizable and reliable than vp_web (which was just an experiment). This doesn't solve the detection of the code page, anyway.

3. I am currently rewriting, for Visual Prolog 7.3, many of the tools I made available. I am recrafting the extensions using the generic system introduced in 7.3 so that the solutions are much more elegant and easier to reuse (or so I think and hope). Playing with VP7 is easy (or so it seems), but mastering it is another thing: some horsepower is hidden in the template and generic type mechanism. Many extensions made for VP < 7.2 will be deleted from arsaniit.com and replaced by new 7.3 versions. Some of these extensions are now deprecated (because of the support provided by the new PFC). For instance, the winsock2 support eases the development of sockets, so I can discard the .lib included in my simple-client-server. Some of the APIs won't work on a 64-bit OS (I doubt, for instance, that a service compiled for 32 bits can be used on a 64-bit OS), so I really hope the next big move for Visual Prolog will be a 64-bit version...

Regards
Gildas

Posted: 1 Jun 2010 20:18
by Jan de Lint
Hi,
Yes, the meta tag is one way to specify the code page on the server side. Another way is in a header of the HTTP protocol (Content-Type). VPcURL can give you the received headers.
I have no idea which of the two takes precedence if both are present; perhaps that is not even allowed. However, that could be found out.
Curl also allows for socket programming but I have not gotten around to implementing that in VPcURL.
Jan
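
On the precedence question: as far as the HTML 4.01 specification goes (section 5.2.2), a charset parameter in the HTTP Content-Type header has priority over a META declaration in the page. A minimal Python sketch of that rule follows. The function names are hypothetical, and how the header is obtained (for example from VPcURL) is left out.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased name, or None if absent


def pick_charset(http_header: Optional[str], meta_charset: Optional[str]) -> Optional[str]:
    """Prefer the HTTP header; use the meta tag only when the header names no charset."""
    if http_header:
        from_header = charset_from_content_type(http_header)
        if from_header:
            return from_header
    return meta_charset.lower() if meta_charset else None
```

When neither source names a charset the function returns None, and the caller is back to guessing.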

Posted: 1 Jun 2010 20:54
by Thomas Linder Puls
I believe that all (major) web servers read the meta tag in the HTML file and send a corresponding HTTP header, at least when they serve static HTML files.

CGI and ISAPI extensions (including scripting engines) are, however, themselves responsible for sending the correct HTTP headers (as I recall).

Posted: 1 Jun 2010 21:54
by Gildas Menier
You would be surprised how many web pages lack the meta tag with the code page...
IE tries to guess the code page and uses MLang for that (see MLang.dll):

http://msdn.microsoft.com/en-us/library ... 85%29.aspx

see http://msdn.microsoft.com/en-us/library ... 85%29.aspx

and especially DetectInputCodePage.

Gildas
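
When nothing declares the encoding, a much cruder heuristic than MLang or n-gram statistics is trial decoding: text that decodes cleanly as UTF-8 is very likely UTF-8, because other 8-bit encodings rarely form valid UTF-8 sequences by accident. The Python sketch below shows only that heuristic. It is not what DetectInputCodePage does, the name `guess_decode` is hypothetical, and the cp1252 fallback is an assumption suited to Western European pages.

```python
from typing import Tuple


def guess_decode(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict UTF-8 before an 8-bit fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 input is Western European ANSI text.
        return data.decode("cp1252", errors="replace"), "cp1252"
```

Pure ASCII input is reported as UTF-8, which is harmless because ASCII decodes to the same text either way.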

Posted: 2 Jun 2010 7:37
by Thomas Linder Puls
In principle, pages that don't have a meta tag should be encoded in 7-bit ASCII. Such pages can still contain special characters, because they can use numeric character references such as &#33345; (= the character with decimal code 33345, i.e. U+8241).
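
The numeric reference syntax can be tried out with a short Python sketch. The helper names are hypothetical. Note that the encoder below only replaces non-ASCII characters and does not escape markup characters such as & or <.

```python
import html


def to_ascii_html(text: str) -> bytes:
    """Encode text as 7-bit ASCII, writing non-ASCII characters as &#NNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def from_ascii_html(data: bytes) -> str:
    """Decode 7-bit ASCII and expand character references back into characters."""
    return html.unescape(data.decode("ascii"))


if __name__ == "__main__":
    print(to_ascii_html("Légende des pictos"))
    print(from_ascii_html(b"L&#233;gende des pictos"))
```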

But I do not doubt that many pages are simply erroneous, and that web browsers try to guess a lot of stuff to overcome such bugs in a gentle way.