Reading hexadecimal from utf8 files

**Loffy** » 23 Oct 2019 9:09

I am testing concepts for a project, and to date I have settled on using utf8 files for a number of reasons.

My current testing requires using streams to read and write utf8 files. The streaming works fine until I get to the current test regarding reading a utf8 file that contains hexadecimal bytes and then doing some processing on the hex bytes.

If a hexadecimal domain existed I would have no problems.

I have tried the various instream file read predicates.

If I use the readBytes() predicate and my input is 7EAE, then the input is stored in binary format as $[37,00,45,00,41,00,45,00] due to VPI using Unicode format. There are two options here:
1. Use utf16 files instead of utf8
2. Process the binary to remove the 00 "half bytes".

I have tried using the read() predicate and using the hasDomain predicate to convert the bytes being read into a different domain, however I cannot see a way to define a hexadecimal domain (even some code "trickery" would do). I have tried some uncheckedConvert statements without success.

Does anyone have some guidance as to the best way forward? I will revert to utf16 if necessary, but I really don't want to. Sometimes I just get stubborn.

Regards,

Loffy
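
The byte pattern described above can be reproduced outside Visual Prolog. The sketch below uses Python purely as an illustration (it is not code from this thread, and the helper name `as_hex` is made up for the example). It encodes the text 7EAE as utf8 and as utf16 little-endian, then shows that dropping every second byte of the utf16 form gives the utf8 form. That shortcut only holds while every character is plain ASCII, which is the case for hexadecimal digits.

```python
# Illustration only: how the text "7EAE" looks as utf8 and as utf16 bytes.
text = "7EAE"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no BOM


def as_hex(data):
    """Render each byte as two uppercase hex digits."""
    return [f"{b:02X}" for b in data]


print(as_hex(utf8_bytes))   # ['37', '45', '41', '45']
print(as_hex(utf16_bytes))  # ['37', '00', '45', '00', '41', '00', '45', '00']

# "Option 2" from the post: keep only the low byte of each utf16 code unit.
# This is safe here only because hex digits are all ASCII characters.
stripped = utf16_bytes[0::2]
print(stripped == utf8_bytes)  # True
```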

**Harrison Pratt** » 23 Oct 2019 11:25

Do you need to actually process the initial bytes in your application or just know what is the file format?

Have you looked at inputStream_file::openFileBom/2 (or openFileBom/1)?

openFileBom : (
    string Filename,
    fileSystem_api::accessPermit Access).

Opens a file and sets its mode based on the presence of a BOM in the file. If the file contains a utf16 BOM it is read in unicode mode; if it contains a utf8 BOM it is read in utf8 mode; if it does not contain a BOM it is read in thread-ansi mode.

BTW, Hexadecimal isn't a domain, it's a data presentation format for binaries.
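
The BOM rule quoted above can be sketched as a small check on the first bytes of a file. This is a Python illustration of the described behaviour only, not the Visual Prolog implementation; the function name `detect_mode` and the returned labels are assumptions made for the example.

```python
import codecs


def detect_mode(head):
    """Guess a text mode from the first bytes of a file, following the
    rule described above: utf8 BOM, utf16 BOM, otherwise an ANSI code page.
    (A utf32 little-endian BOM starts with the same two bytes as a utf16
    little-endian BOM; that case is outside the scope of this sketch.)"""
    if head.startswith(codecs.BOM_UTF8):
        return "utf8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf16"
    return "ansi"


print(detect_mode(b"7EAE5DCF"))                # ansi (no BOM present)
print(detect_mode(codecs.BOM_UTF8 + b"7EAE"))  # utf8
print(detect_mode("7EAE".encode("utf-16")))    # utf16 (Python adds a BOM here)
```

A file holding only ASCII hex digits and no BOM reads identically as utf8 or as an ANSI code page, so either mode yields the same characters for such content.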

**Loffy** » 23 Oct 2019 11:34

Harrison,

The file does not contain a BOM (at least as far as I can see with my hex editor), so I will look further at the details of thread-ansi.

Regards,

Loffy

**B.Hooijenga** » 23 Oct 2019 13:21

Loffy

For classical bytes you need unsigned8 as domain.

To pick up the unsigned8 values from the binary you could use the predicate:
binary::getIndexed_unsigned8/2->

Hexadecimal notation for unsigned8 values is possible, for instance:
X = 0xFF.

Kind regards

Ben

**Thomas Linder Puls** » 23 Oct 2019 21:22

The bytes you have read are clearly utf16 characters, so if those are the bytes you have read from the file, then the file is in utf16 format.

But in any case the problem has nothing to do with the file encoding.

inputStream::read can read numbers in these formats:

Decimal: -123
Octal: -0o123 (= -83)
Hexadecimal: -0x123 (= -291)

inputStream::read will read these formats from utf8, utf16, or a file in any 7- or 8-bit character set (that has a code page in Windows).

But it will not read hexadecimal unless the 0x is there.

It has nothing to do with "converting the bytes it reads": whenever read needs to read a number, it will accept any of the supported formats.

So this call:

L = hasDomain(list{integer}, Stream:read())

can read this "file":

[-123, -0o123, -0x123]

And interpret it as the list

[-123, -83, -291]

All in all, this does not, however, help with your problem when the numbers in the file do not start with 0x.

Knowing more about the format may help in advising how to solve the problem.
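
The prefix rule described above has a close analogue in Python, where `int(text, 0)` also picks the base from a `0o` or `0x` prefix. The sketch below is only that analogue, not inputStream::read itself, and the helper name `read_int_list` is hypothetical. It parses the sample "file" and shows that a bare 7EAE is rejected because nothing marks it as hexadecimal.

```python
def read_int_list(text):
    """Parse a bracketed, comma-separated list of integers. Each number may
    be decimal, octal (0o prefix) or hexadecimal (0x prefix), optionally
    signed."""
    inner = text.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("expected a bracketed list")
    body = inner[1:-1].strip()
    if not body:
        return []
    # Base 0 tells int() to infer the base from the literal's prefix.
    return [int(token.strip(), 0) for token in body.split(",")]


print(read_int_list("[-123, -0o123, -0x123]"))  # [-123, -83, -291]

# Without the 0x prefix the digits are not recognised as a number.
try:
    read_int_list("[7EAE]")
except ValueError:
    print("7EAE is rejected without a 0x prefix")
```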

**Loffy** » 24 Oct 2019 6:23

Thomas,

Please find attached file.

First, some points:

- the file I was testing is large (2GB).

- I have saved the first 32 bytes of the file (using WinHex) into a new test file.

- the 2GB file was created in VIP using utf8 file qualifiers where required.

- Windows Notepad has confirmed the new file is in utf8 format, but the original file is too big for Notepad to read.

- My project idea ideally requires the file to keep its current construct.

- I think I have found a way to do what I want to do with the current construct, but it will take a day or so to prove that.

I am interested in why you think the file is utf16 when Notepad is confirming to me that it is utf8. I have tried to find another tool to confirm the utf8 structure. So far, no luck. WinHex seems silent on the issue.

Regards,

Loffy

**Thomas Linder Puls** » 24 Oct 2019 12:28

Last thing first. Sorry, the file is most likely in utf8 as you say; readBytes will read the bytes after conversion to utf16. I would normally not use readBytes on a file that is opened in text mode.

Anyways, the contents of the file is like this:

7EAE5DCFF348C44CD8FB18E48B1A989E

And that obviously raises the question: where does one number end and the next one begin?

Another obvious question: if you have decided the format of the file yourself, with the purpose of reading it in again, then why not choose a format that is easy to read?
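
The boundary question can be made concrete with a short Python sketch (an illustration only; the helper name `split_fixed` and the two chosen widths are assumptions, since the thread never states where the numbers begin or end). The same 32 hex digits give entirely different numbers depending on how many digits are taken per number.

```python
hex_text = "7EAE5DCFF348C44CD8FB18E48B1A989E"


def split_fixed(digits, width):
    """Cut a string of hex digits into numbers of `width` digits each."""
    if len(digits) % width != 0:
        raise ValueError("digit count is not a multiple of the width")
    return [int(digits[i:i + width], 16) for i in range(0, len(digits), width)]


# Assumption A: every 2 digits form one byte.
as_bytes = split_fixed(hex_text, 2)
# Assumption B: every 4 digits form one 16-bit number.
as_words = split_fixed(hex_text, 4)

print([f"{n:02X}" for n in as_bytes[:4]])  # ['7E', 'AE', '5D', 'CF']
print([f"{n:04X}" for n in as_words[:2]])  # ['7EAE', '5DCF']
print(len(as_bytes), len(as_words))        # 16 8
```

Nothing in the digits themselves says which split is right, so a reader of the file has to know the width (or some other delimiting rule) in advance.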

**Loffy** » 24 Oct 2019 13:50

Thomas,

Thanks again for your prompt response.

Your question is quite pertinent.

I am attempting to emulate/simulate the actions of a quantum computer, and that goes a very long way further than this issue.

The simple pretext of my idea for this part of my system is as follows (and I will not elaborate further in a public forum):

1. The whole idea is to not know when one number starts/ends or the other number ends/starts.

2. The format is not designed to be easy to read. It is designed to be difficult to read.

At that point I stop, other than saying I have reached a point where I can get the input from my file to where I was (rightly or wrongly) expecting it would be:

BinaryBytes : binary = $[37,45,41,45,35,44,43,46]

I have not got further than that point today, and there will be little (if any) progress tomorrow.

There is no guarantee that I can achieve what I am attempting to do, though I will give it a good run.

Regards,

Loffy

**Thomas Linder Puls** » 24 Oct 2019 15:08

"The format is not designed to be easy to read. It is designed to be difficult to read."

Well, you seem to have achieved this (sorry, I couldn't resist).

Anyways, if you open the file in binary mode then readBytes will read a binary like that.
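
As a rough analogue of the binary-mode read mentioned above, the Python sketch below writes the first eight characters of the sample to a temporary file, reads them back as raw bytes, and prints them in hex. Python's `open(path, "rb")` stands in for a binary-mode stream here; it is not the Visual Prolog API, and the file name is made up. The final step, pairing the ASCII digits into byte values, is an assumption about how the digits are meant to be grouped.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")  # hypothetical file name
    with open(path, "wb") as out_file:
        out_file.write(b"7EAE5DCF")
    # Binary mode: no character decoding, the bytes come back as stored.
    with open(path, "rb") as in_file:
        raw = in_file.read(8)

# The raw bytes are the ASCII codes of the hex digit characters.
hex_view = [f"{b:02X}" for b in raw]
print(hex_view)  # ['37', '45', '41', '45', '35', '44', '43', '46']

# Assumption: each pair of digit characters encodes one byte value.
values = bytes.fromhex(raw.decode("ascii"))
print([f"{b:02X}" for b in values])  # ['7E', 'AE', '5D', 'CF']
```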

**Loffy** » 25 Oct 2019 3:03

Thomas,

Thanks again.

Yes, I had noticed the mode parameter in my investigations, though I have not yet got around to using it. I will now.

And I don't mind your comment at all. It made me laugh.

Regards,

Loffy
