Which is a better structure for a database?

Discussions related to Visual Prolog
hyphz
Posts: 16
Joined: 15 Jan 2013 1:13

Which is a better structure for a database?

Unread post by hyphz » 13 Sep 2014 20:49

For storing data, is it better to use the internal fact database, or an array or some other collection of objects?

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

It depends on how the data is filled and retrieved

Unread post by Ferenc Nagy » 14 Sep 2014 11:52

Fact databases
are the simplest way, and the oldest elegant native feature of Prolog, dating back to the birth of VIP's ancestor, Turbo Prolog (or even earlier, cavemen's Prologs).

The terms can be looked up using partly bound compound terms:

Code: Select all

person(Family,FirstName,NickName,Gender,Age,1,"pirate")
will return all one-legged pirates, not only Long John Silver.

Beginners can learn the assert, asserta, assertz, retract and retractall predicates easily.
You can easily use facts in LIFO, FIFO and random-access fashion.

I do not know the size limit at which the above predicates slow down because of walking along all the stored facts.

The chained databases
are very elegant but not designed for beginners. You can walk along a chain, insert new terms at its beginning and end, even before and after the current element, or replace a term. It supports B+-tree indexing (pointing to) the elements of a chain, using a string as the key.

They are the best of all, but they are not included in the Personal Edition.

They can be held in memory or on the hard disk very easily. The opening predicate takes a parameter to decide this; after that you need not take care of the storage mode.

The terms can be looked up using partly bound compound terms, too:

Code: Select all

person(Family,FirstName,NickName,Gender,Age,1,"pirate")
will return all one-legged pirates, not only Long John Silver.

Their limit was about one million terms. I could program the logging of changes using chains. This feature used to be very important to me, because I had used TurboImage on HP with logging and recovery to a given moment, which is unavailable in MS Access.

Binary arrays
provide access to one- or more-dimensional arrays of equally sized atoms by their index.
E.g. you have a table of the mutual distances of 10 cities. You can store them in a 10*10 real array or in a 10*(10-1) div 2 array. Correcting a datum at a given index is very simple.
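The packed triangular layout can be sketched outside Visual Prolog; here is a small Python illustration (the index formula and names are my own, not VIP library code):

```python
# Store the mutual distances of n cities in a packed triangular array of
# n*(n-1)//2 cells instead of a full n*n matrix (the table is symmetric).
n = 10

def tri_index(i, j, n):
    """Map a pair of city numbers i, j (0-based, i != j) to its slot
    in the packed array."""
    if i > j:
        i, j = j, i  # distance is symmetric, normalize the pair
    return i * n - i * (i + 1) // 2 + (j - i - 1)

packed = [0.0] * (n * (n - 1) // 2)

# Correcting one distance touches a single cell, just as with the 10*10 array:
packed[tri_index(2, 7, n)] = 123.5
```

The packed form halves the storage at the cost of one small index computation per access.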

The other collections,
which appeared in the recent versions, are taken from the C language.
Searching on more than one field, as above, is not possible. The records must have unique identifiers.

If you have migrated from C to Prolog then you may like them the best.
Last edited by Ferenc Nagy on 15 Sep 2014 3:40, edited 1 time in total.
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 14 Sep 2014 14:52

I will have to say that I disagree rather strongly on these matters.

In short:
  • Collections: Yes! these are the ones to use. See Collection library
  • Nondeterm fact databases can be used where it makes sense
  • ChainDB and B+-trees: Don't use them.
Before going into details, let us first get the history of collections right.

I first learned about sets and maps in mathematics in first class (age 6-7). That was more or less when the C programming language was "invented".

Sets and maps come from mathematics, and first emerged in functional programming languages (which are heavily inspired by mathematics). Because they are such a good idea, they have spread in various forms to practically all programming languages, including C++ and Visual Prolog (it is likely that collection libraries also exist for plain C).

Enough about that and back to Visual Prolog.

Nondeterm fact databases

Pros
  • Easy to learn.
  • Easy to use.
  • Flexible search mechanisms.
  • Can be serialized to files.
  • Fully embedded in the language.
Cons
  • Bad performance for large databases (search/look up is O(N)).
  • Serialized forms are hard to deal with over time (i.e. when changing the code).
Collections

Pros
  • Learning is medium difficult.
  • Use is easy.
  • Integrates well in the language (due to polymorphism and generics).
  • Code is very clear.
  • Good performance (search/look up O(log(N)), and even O(1) for some variants).
  • Any datatype can be used as keys.
"Easy use" and "clear code" are of course only after having learned it.

Cons
  • No serialization
ChainDB and B+-trees

Pros
  • Good performance (search/look up O(log(N))).
  • Can be on disc (serialized), provided the data is serializable (i.e. the data doesn't contain objects or predicates).
Cons
  • Learning is rather hard.
  • Integrates poorly in the language.
  • Code is obscure (even to people that master it).
  • Serialized forms are fragile (one byte wrong can mean that everything is lost, with no means for recovery).
  • Keys must be strings of fixed length.
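The complexity claims above can be illustrated outside Visual Prolog; this Python sketch (with hypothetical pirate data echoing the earlier example in the thread) contrasts the linear scan behind a nondeterm fact search with a keyed map lookup:

```python
# A nondeterm fact database search is effectively a linear scan: O(N).
facts = [("Silver", "John", "pirate", 1),
         ("Sparrow", "Jack", "pirate", 2),
         ("Hook", "James", "pirate", 2)]

# Flexible partial-match search, but it must visit every fact:
one_legged = [f for f in facts if f[2] == "pirate" and f[3] == 1]

# A map finds a record without scanning: O(log N) for tree-based maps,
# O(1) for hash-based ones -- but only via the key it is indexed on.
by_family = {f[0]: f for f in facts}
james = by_family["Hook"]
```

The trade-off is exactly the one listed above: facts give flexible multi-field search at O(N) cost, maps give fast lookup on one chosen key.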
Regards Thomas Linder Puls
PDC

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 14 Sep 2014 15:27

A little extra advice. Let this be your natural consideration order of collections:
  1. redBlack tree based (e.g. mapM_redBlack)
  2. hash table based (e.g. mapM_hash)
  3. arrays (e.g. arrayM)
Normally use _redBlack collections, and only consider alternatives for a good reason.

Especially for arrays notice this:

Pros
  • Look up in an array is very efficient (= O(1)). But hash tables are also O(1).
Cons
  • The keys must be numbers from 0 to "Size".
  • The array will need storage for all elements from 0 to "Size" (there cannot be holes).
These "Cons" are of course not "Cons" if they match your needs. But often you will have to map your real (more natural) keys to "unnatural" number keys in order to fit the array concept. Such key mapping can obscure the algorithms.

So unless your problem really fits the array concept, I suggest that you first make sure that the additional performance is really needed, and that a hash data structure is not a better solution.
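The key-mapping point can be sketched in Python (the city names and population figures are hypothetical, not a VIP API):

```python
# With an array, natural string keys must first be mapped to 0..Size-1:
cities = ["Oslo", "Paris", "Rome"]
index_of = {name: i for i, name in enumerate(cities)}  # the extra mapping layer
population = [0] * len(cities)       # dense storage, no holes allowed
population[index_of["Paris"]] = 2_100_000

# A hash map takes the natural key directly and may be sparse:
population_map = {"Paris": 2_100_000}
```

The extra `index_of` table is precisely the "unnatural" key mapping that can obscure an algorithm.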
Last edited by Thomas Linder Puls on 21 Sep 2014 16:33, edited 1 time in total.
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Connection of databases with the outer world

Unread post by Ferenc Nagy » 15 Sep 2014 4:15

Connection of databases with the outer world
Nondeterm fact databases
The save predicate writes the whole contents to a disk file in one program statement.
The consult predicate reads the whole contents from a disk file in one program statement.
The disk files look almost like comma-separated data files, that is, they are easily handled by Excel or Access.

Code: Select all

clauses
compose(p("Left Scale","Grid","Lines","Style"),s(1,"Dotted","Pen")).
compose(p("Bottom Scale","Grid","Lines","Style"),s(1,"Dotted","Pen")).
compose(p("Picture","Frame","Lines","Style"),s(5,"Solid","Pen")).
compose(p("Right Scale","Grid","Lines","Style"),s(1,"Dash Dot Dot","Pen")).
compose(p("Left Scale","Axes","Lines","Style"),s(3,"Solid","Pen")).
compose(p("Top Scale","Axes","Lines","Style"),s(2,"Solid","Pen")).
compose(p("Right Scale","Axes","Lines","Style"),s(2,"Solid","Pen")).
compose(p("Bottom Scale","Axes","Lines","Style"),s(3,"Solid","Pen")).
compose(p("Top Scale","Grid","Lines","Style"),s(1,"Dash Dot Dot","Pen")).
compose(p("Plot","Frame","Lines","Style"),s(5,"Solid","Pen")).
compose(p("Foreground","Edges","Lines","Style"),s(1,"Solid","(~Screen) & Pen)")).
compose(p("Foreground","Cut Edges","Lines","Style"),s(1,"Dashed","(~Screen) & Pen)")).
compose(p("Background","Cut Edges","Lines","Style"),s(1,"Dotted","Pen")).
compose(p("Background","Edges","Lines","Style"),s(1,"Solid","Pen")).
You can import these files using comma and opening parenthesis as delimiters, with text surrounded by double quotes.
Step #1: open the saved facts in Excel.
Step #2, omitting some columns, is not shown.
Step #3: delete the extra closing parentheses and periods.
You can write a macro to do these steps.
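The Excel steps can also be scripted; here is a rough Python sketch (the parsing rule, quoted strings plus bare integers, is my own assumption about the saved-fact format shown above, and `fact_fields` is a hypothetical helper):

```python
import re

def fact_fields(line):
    """Extract the quoted strings and bare integers from one saved fact
    clause, in left-to-right order."""
    return [s if s else int(n)
            for s, n in re.findall(r'"([^"]*)"|(\d+)', line)]

line = 'compose(p("Left Scale","Grid","Lines","Style"),s(1,"Dotted","Pen")).'
fields = fact_fields(line)
# fields is now ready to be written out as one CSV row.
```

This does in one pass what the three manual Excel steps do: it drops the functors, parentheses and trailing period, keeping only the data columns.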

The chained databases
can be moved from memory to disk and back, but their structure is so obscure that they cannot be used with Excel or Access.

The collections have no predicates to save to and read from disk files as a whole.
You have to write out their elements one by one and refill them from disk files from scratch.

The binary arrays can be written to disk, but the result is not intended for human inspection.
Attachments
Conversion of facts by Excel.png
Step #1 of using saved facts by Excel.
Step #2 Omit some columns is not shown.
Delete closing parentheses and dots.png
Step #3 Deletion of extra closing parentheses and periods.
TIA, Regards,
Frank Nagy

drspro2
VIP Member
Posts: 78
Joined: 28 Apr 2006 12:03

Unread post by drspro2 » 20 Sep 2014 13:46

I use the ChainDB, and I duplicate every record insert with an append-to-file function into a text file, for data security and data duplication/mirroring.

I have noticed that, for example, MySQL databases can crash as well.

So should I change my ChainDB code and start using, for example, MySQL and ODBC to get better performance?

Or should I use the red-black tree? Is the red-black tree managed on disk or in memory? How can I save the red-black tree to a file, and how can it be used with multiple users?

Thank you

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Everything is good for something, but nothing is good for everything

Unread post by Ferenc Nagy » 21 Sep 2014 4:04

My father often said:
Everything is good for something, but nothing is good for everything.
I explain here how his saying can be applied to the question:
Which is the best database-handling tool of VP?
I twist the mathematical problem of the capricious sultan and his dungeon.

Original version, rephrased for the discussion forum:
The sultan has a dungeon with 100 remotely controlled cells. The prisoners do not notice at once when a door is unlocked. His majesty got out of bed in a good mood. He ordered the Prolog programmer sitting in the control tower to unlock every cell. Then he changed his mind and ordered: "Turn over the lock of every second cell." He reconsidered again: "Turn over the lock of every third cell." Soon, whimsically: "Turn over the lock of every fourth cell." ... And so on, in a cycle from 1 to 100.
Which cells remain open in the final state?


I think for this problem the best storage mode is the binary array.
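A quick Python simulation over a boolean array (assuming "turn over" means toggling the lock, for every step from 1 to 100) confirms that exactly the perfect-square cells stay open:

```python
n = 100
unlocked = [False] * (n + 1)        # cell 0 unused; True = door unlocked

for step in range(1, n + 1):        # "every step-th cell", step = 1..100
    for cell in range(step, n + 1, step):
        unlocked[cell] = not unlocked[cell]

open_cells = [c for c in range(1, n + 1) if unlocked[c]]
print(open_cells)   # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

Cell c is toggled once per divisor of c, so it ends up open exactly when c has an odd number of divisors, i.e. when c is a perfect square. The random indexed access is why an array suits this problem so well.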


Second version: the programmer has a detailed personal inventory of the captives.
The sultan orders: "Release the Shiite captives who were jailed in 1981."


The chained database and simple fact handling are the best ways to fulfil this command.


Third kind of majestic wish: "Decrease the temperature of each cell by 5 degrees centigrade for the Sunni captives named Ali."
If the programmer in the control tower tries to solve the problem with simple retract+assert statements, then in his hurry he writes a program that falls into an endless loop, and the poor prisoners find themselves at absolute zero.
The use of chained databases or collections is recommended.
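The endless-loop pitfall is not Prolog-specific: mutating a collection while iterating over it is dangerous in any language. A minimal Python analogue of the safe pattern (snapshot first, then update; the data is hypothetical) might look like this:

```python
# Hypothetical cell-temperature table, keyed by prisoner id.
temperature = {"ali-1": 20, "ali-2": 18, "omar-1": 21}

# Unsafe would be: changing the table while scanning it, so that updated
# entries can be revisited (the retract+assert trap).
# Safe: take a snapshot of the keys first, then update.
for prisoner in list(temperature):
    if prisoner.startswith("ali"):
        temperature[prisoner] -= 5
```

The snapshot (`list(temperature)`) plays the same role as collecting all matching facts before retracting and re-asserting them.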

If the sultan gives the exact unique identifiers of the prisoners to be released, then the use of chained databases or collections is recommended.
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 21 Sep 2014 16:57

Notice that we only maintain the chain db for backwards compatibility. We consider it deprecated, and we will eventually discontinue it.

The only recommendation we have about chain db is: migrate away from it.

Our "positive" recommendations only involve fact databases, collections and SQL databases.
Regards Thomas Linder Puls
PDC

Ferenc Nagy
VIP Member
Posts: 289
Joined: 24 Apr 2007 12:26

Why is the chain db deprecated?

Unread post by Ferenc Nagy » 22 Sep 2014 8:40

Notice that we only maintain the chain db for backwards compatibility. We consider it deprecated. And will eventually discontinue it.
I know that every new development pushes out some old long standing stuff.
This is a general rule of software development.


Please explain the special case of the chain db:
- Where did it come to Turbo Prolog from?
- Who invented it?
- Was it present in other languages?
- What were its best applications in commercial Prolog programs?
- Why has the VIP development team deprecated it?
TIA, Regards,
Frank Nagy

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Unread post by Thomas Linder Puls » 22 Sep 2014 12:12

That was a lot of questions.

I am not really aware of the history of the chainDB in Visual Prolog.

chainDB consists of two things: chains and B+-trees. Chains are a rather trivial data structure, quite similar to fact databases; you can read the Wikipedia article about B+-trees.

B+-trees, and data structures similar to chains, are most likely used in many other places and languages. But our chainDB API and disk representation are homecooked (i.e. a proprietary format).

"What were its best applications in commercial Prolog programs?" I cannot answer that question. We have used the chainDB as a tool in our programs, just like we use lists, windows, files, fact databases, etc. To me the question is similar to asking: what is the best application of lists in ...

The reasons for deprecating it are the most interesting ones.
  • It is too fragile to be used as a real (persistent) database (more on that below).
  • It is far too complicated to learn and use.
  • It has many strange restrictions, especially the fixed length of keys in B+-trees.
  • It has many type flaws (many things use "uncheckedConvert" rather than being compile-time type safe).
  • The collection library is much better for in-memory data (flexible, type safe, efficient, easy to learn and use).
We don't believe that the chainDB can be used as a real database, i.e. for long-term (persistent) data storage. A single wrong bit in a chain db can mean that the entire database is completely lost. An SQL database will have backup strategies and redundancy (perhaps on separate disks, etc.) ensuring that data loss can be kept to a minimum.

Concurrent access from several programs is another problem. The ChainDB has a very simple synchronisation mechanism for this, which will almost certainly mean that the database becomes a (show-stopping) bottleneck as soon as you scale to just a few concurrent users.

It is not for nothing that huge companies spend lots of effort on SQL databases.
Regards Thomas Linder Puls
PDC

Martin Meyer
VIP Member
Posts: 289
Joined: 14 Nov 2002 0:01

Unread post by Martin Meyer » 22 Sep 2014 16:31

On the other hand, SQL databases incur a large overhead to accomplish all their nice features. And worse than the load they put on computers: installation/patching/administration of, for example, an Oracle database system is a complicated and time-consuming task for humans. Furthermore, learning SQL, especially Oracle SQL and PL/SQL, is a lot more complicated than learning to use the ChainDB; the Oracle documentation is huge, check it at http://docs.oracle.com/database/121/index.htm.

In many cases, when concurrent access to the data is not needed, an SQL database is an oversized solution. It is then more beneficial to have a small storage layer that resides completely inside the program, rather than having to run a database system outside of it.

The ChainDB has one useful functionality that is missing in SQL databases: it supports chains (= doubly linked lists), which allow data to be inserted after/before any node in O(1). SQL databases, in contrast, only support ordering data according to sort keys (for which processing can be sped up by creating indexes).
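The O(1) insert-after-a-node property of chains can be sketched with a minimal doubly linked list; this Python illustration is generic, not the ChainDB API:

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def insert_after(node, value):
    """Splice a new node in after `node` in O(1): no scanning, no re-sorting."""
    new = Node(value)
    new.prev, new.next = node, node.next
    if node.next is not None:
        node.next.prev = new
    node.next = new
    return new

head = Node("a")
insert_after(head, "c")
insert_after(head, "b")    # lands between "a" and "c", in constant time
```

A sort-key-ordered table would instead have to renumber or re-index neighbours to achieve the same arbitrary ordering.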

I plead for not letting the ChainDB simply die, but for replacing it in some later VIP version with a modern solution. The solution does not need to support concurrent data access, but it should provide type safety, variable key lengths, non-string keys, cooperation with collections, improved efficiency, chains, and maybe more.

Ideally, a modernized ChainDB would also support storing objects. That means that once automatic object serialization is developed in some future VIP version, it could be accompanied by a modernized ChainDB; together they would deliver an easy-to-use, low-overhead way of maintaining objects in non-volatile storage.

Regards
Martin

Thomas Linder Puls
VIP Member
Posts: 1622
Joined: 28 Feb 2000 0:01

Re:

Unread post by Thomas Linder Puls » 23 Sep 2014 10:16

Well, as said, we consider the chainDB deprecated, and we will not develop or replace it. We can only use databases that are highly reliable, have backup facilities, can scale to thousands of concurrent users, can be used remotely, etc., etc. And therefore we will not put effort into such software.

That does not rule out the possibility of persisting objects, that issue is completely orthogonal to the actual storage medium.

For persisting smaller amounts of data, I suggest saving and consulting fact databases. After the consult, the data can be transferred to collections for better performance.

For large amounts of data I strongly suggest using a proper database. But I agree that there is a learning overhead for that.

Large, heavily used databases can put a huge load on a machine. But small databases with little access put a completely insignificant load on it.

I will not recommend starting with Oracle. Oracle should only be used when you have very huge data, load, etc. Oracle is expensive and complex and has a large footprint on your computer.

MS SQL Express is free, is quite simple to use and has a small footprint.

MySQL is also free (open source; now under Oracle's wing). I have found it a little more complicated to use, but it has been a while since I last used it, so things may have improved. There used to be a problem with parameterized queries in its ODBC driver, but that may also be solved by now.

(By the way, this discussion forum is in an MS SQL database and the wiki is in a MySQL database, but neither of them uses Visual Prolog for anything).
Regards Thomas Linder Puls
PDC

drspro2
VIP Member
Posts: 78
Joined: 28 Apr 2006 12:03

Unread post by drspro2 » 23 Sep 2014 15:28

And how about this solution: use the QDBM tool for the indexes and the ChainDB only for storing records in the chains, without using the B-tree of the ChainDB tool, and don't use large text fields in the records.


This is the link to the QDBM tool: 3rd:QDBM

Thank you
