how to remove duplicates of facts ?

Jason_6 · Unread post by **Jason_6** » 19 Jan 2015 15:39

1>
How to remove duplicates of [ facts ] instead of list?

list::removeDuplicates/1-> 
removeDuplicates : (Elem* List)
    -> Elem* Rest.

Code: Select all

list::sort/2-> 
sort : (
    Elem* List,
    sortOrder Order)
    -> Elem* Sorted.

which are both not exactly what I want.

2>
Moreover,after removing duplicates of facts,
could you sort the facts ?

Could you recommend some proper function to settle these little problems?

Thank you all for pacient answering ,especially Thomas!

Code: Select all

Martin Meyer · Unread post by **Martin Meyer** » 20 Jan 2015 9:35

Hi Jason,

the best way is probably to retract each fact from the fact database, inserting it in a list, sort resp. deduplicate the list and finally assert each element of the list back into the fact database. So, for example:

Code: Select all

class facts
    exampleFact : (string).
 
class predicates
    dedupExampleFactDb : ().
clauses
    dedupExampleFactDb() :-
        List = [Strg || retract(exampleFact(Strg))],
        UniqueList = list::removeDuplicates(List),
        foreach Strg in UniqueList do
            assert(exampleFact(Strg))
        end foreach.

Deduplicating the facts without using a list is possible (but inefficient):

Code: Select all

class facts
    exampleFact : (string).
 
class predicates
    dedupExampleFactDb : ().
clauses
    dedupExampleFactDb() :-
        retract(exampleFact(Strg)),
        !,
        dedupExampleFactDb(),
        if not(exampleFact(Strg)) then
            assert((exampleFact(Strg)))
        end if.
    dedupExampleFactDb().

Regards
Martin

Vitaly Markov · Unread post by **Vitaly Markov** » 20 Jan 2015 18:50

For list:

Code: Select all

class facts - aaa
fff:(string).
 
clauses
fff("aa").  fff("aaa").  fff("b").  fff("aa").  fff("c").
 
run():- FactList = [fff(A) || retract(fff(A))],
          UniqueList = list::removeDuplicates(FactList),
          list::forAll(UniqueList, {(F):-assert(F),write(F),nl}).

Vitaly Markov · Unread post by **Vitaly Markov** » 20 Jan 2015 19:09

DB only:

Code: Select all

class facts
fff:(string).
clauses
fff("aa").  fff("aaa").  fff("b").  fff("aa").  fff("c").
 
run():- fff(S), retractAll(fff(S)),asserta(fff(S)), fail;
          fff(S),write(S),nl,fail;                  % <- check of DB
          _ = readchar().

Martin Meyer · Unread post by **Martin Meyer** » 21 Jan 2015 0:34

Yes, the deduplication can also be performed using retractAll:

Code: Select all

class facts
    exampleFact : (string).
 
class predicates
    dedupExampleFactDb : ().
clauses
    dedupExampleFactDb() :-
        foreach exampleFact(S) do
            retractAll(exampleFact(S)),
            assertA(exampleFact(S))
        end foreach.

The advantage over the no-list-solution, I proposed before, is, that it does not eat up stack in recursion. Disadvantage is however, that it is in the worst case even more inefficient than my no-list-solution:

Let us count each try to match a fact with a value as 1 step (matches with unbound variables shall not count). The worst case is, that the database consists of n facts having n distinct values (i.e. that there are no duplicates).

In my before proposed no-list-solution the line if not(exampleFact(Strg)) then is executed n times, whereby first time doing 0 steps, second time doing 1 step, etc. until n'th time doing n-1 steps. In total that are 0 + 1 + ... + n-1 = 0.5*n^2 - 0.5*n steps.

In the retractAll-solution (in above code box) line retractAll(exampleFact(S)) is executed n times and does each time n steps (Thomas may correct me, if retractAll is implemented differently). So, in total that are n^2 steps.

In contrast the list::removeDuplicates predicate is implemented using a red-black tree (per package setM_redBlack) to keep track of the already seen values. Let's count now comparison of two values as 1 step. Lookup and insert in a red-black tree having m nodes is running in O(log m) steps. Hence the removeDuplicates predicate runs on a list with n elements in O(log 1 + log 2 + ... + log n) = O(n log n) steps. The transfer of the values from database to list and back runs in O(n) and thus does not increase the cost in terms of big O. So the fact-deduplication using list::removeDuplicates runs in only O(n log n) steps, while the both solutions without lists take O(n^2) steps.

Regards
Martin

Jason_6 · Unread post by **Jason_6** » 21 Jan 2015 1:15

Millions of thanks!!
Thank Martin and Markov again for your wonderful solutions in difrrent views, which help me a lot!

Of course,other ideas are always welcome!

Vitaly Markov · Unread post by **Vitaly Markov** » 21 Jan 2015 19:45

For line if not(exampleFact(Strg)) then O(n) = (n^2-n)/2.

For line retractAll(exampleFact(S)) O(m,r) = n^2/r - n(m+1)/2.
It is better for r>1.
Here: n - total quantity of facts, m - quantity of uniq facts, r - quantity of occurrences of each fact. Therefore n = mr.
Short explanation for retractAll:
n + (n-r) + (n-2r) + ... - m times.
Hence:
nm - r(m^2+m)/2 = n^2/r - n(m+1)/2.

However use removeDublicates certainly is the best.

Martin Meyer · Unread post by **Martin Meyer** » 22 Jan 2015 0:01

I agree to your conclusion: The higher the quota of duplicates on all facts is, the better performs the retractAll-solution.

When counting the steps for the retractAll-solution in the case you mention (n - total quantity of facts, m - quantity of unique facts, r - quantity of occurrences of each fact) I get a little different result:

Code: Select all

    n  +  n-r  +  n-2*r  +  ...  +  n-(m-1)*r
=  n  +  n-r  +  n-2*r  +  ...  +  n-m*r+r
=  n  +  n-r  +  n-2*r  +  ...  +  n-n+r
=  n  +  n-r  +  n-2*r  +  ...  +  r
=  (n + r) * m/2
=  (n*m + r*m) / 2
=  (n*m + n) / 2
=  0.5*n*m  +  0.5*n

In the extreme case, that all n facts are equal (i.e. m=1, r=n), the retractAll-solution needs only n steps. But the retract-solution runs also in n steps in that case. The retractAll-solution should however be the faster one by a factor in praxis in that case, because it avoids recursive calls.

I suppose, in praxis there is a threshold for the quota of duplicates on all facts, so that, iff the threshold is exceeded, the retractAll-solution outperformes the retract-solution.

Many regards
Martin

Randy · Unread post by **Randy** » 26 Jan 2015 11:12

Gentlemen - I am awed by this international Math collaboration. Thanks for sharing.

Best Regards,
Randy

Ferenc Nagy · Unread post by **Ferenc Nagy** » 27 Jan 2015 6:16

Nice solutions.
But you'd rather check in advance in order to prevent duplicate assertion.

Code: Select all

facts
   yourfact(sometype).
predicates
   assertOnlyIfNew(sometype F) procedure.
clauses
   assertOnlyIfNew(F) :-
      yourfact(F),
       !. 
    assertOnlyIfNew(F) :-
        assert(yourfact(F)).

Unread post by **Thomas Linder Puls** » 27 Jan 2015 9:59

Interesting conversation.

However, I would definitely use a suitable collection instead.

A set automatically only contains one of each value; a red-black set will also put the elements in order (which was a side question).

Code: Select all

facts
    data : setM{sometype) := serM_redBlack::new().
 
clauses
    insert(Info) :- data:insert(Info).

The running time is O(N*log(R)) when inserting N values of which R are unique.

Ferenc Nagy · Unread post by **Ferenc Nagy** » 29 Jan 2015 9:10

The old fashioned database handling has a nice mode to read and write a block of facts: the consult and save statements.
I like the files generated by the save statement.
1) I could handle them from Excel as CSV files. [I wrote macros to delete the unnecessary parentheses and final periods and tabulating them.]
2) The new VIP versions are not so sensitive to whitespaces and final cr+lf+eof characters as the original Turbo Prolog. The new versions allow the addition comment lines, too.

What is the corresponding simplest way to fill in and write out a whole collection?

Unread post by **Thomas Linder Puls** » 30 Jan 2015 10:28

In many cases the simplest solution is to keep facts for serialization, i.e.

Load: consult a fact database move the data to the collections.
Save: move data from collections to a fact database and save it.

As long as out "think in facts" such a strategy is simple, but once you begin to "think in collections" and make more complex collections you may find it more tricky to "invent" a fact structure that corresponds to a certain collection structure.

Facts can contain structured data like lists and functor structures, but in a certain sense they are "flat" like a relational database: The facts them selves are always on top level.

On the other hand collections can be nested deeply, and deeply structured data may be more difficult to "flatten" into a fact database.

It is the same with objects: they can represent very complex data structures, but it may be difficult "untangle" such data structures into something that can be serialized.

Ferenc Nagy · Unread post by **Ferenc Nagy** » 31 Jan 2015 9:06

Hi,
Do I summarize properly your advices with the sentence in the subject line?

Keep i/o data in facts sections but store internal data in collections.

Additionally:
I/O facts should "flat" i.e. they should not contain lists having unequal size [because of easy conversion from/to Excel tables and/or relational databases.)

The coexistence of facts and collections means no problem in case of small amount of data records.

Unread post by **Thomas Linder Puls** » 31 Jan 2015 23:01

Well, I don't really think I was giving such general advices.

Fact databases have many good properties; save and consult are definitely two of them.

But the collections also have some very good properties. Especially, they significantly offer better performance (for many vital operations). Therefore a general advice is to learn to use collections.

I don't see any problem with having lists an even more complex data in facts, unless you need easy conversion to Excel or the like.

Millions of thanks!!

Math at it's Best

Check before first assert

What does correspont to consult and savein case of collections?

Keep i/o data in facts sections but store internal data in collections