question about strings as function arguments/domains

drspro2 · Unread post by **drspro2** » 13 May 2024 9:35

my main philosophy is: only use the string domain inside prolog code if it is very strictly necessary.

Because for prolog a string of for example 5 characters is the same as a string of 300 or more characters
i have herited some code where all the arguments are declared as strings which also results in high memory usage.

and a second question , if i choose to use a string in a predicate which is asserted into memory, would it be benificial to take the first 3 characters of the string, calculate to sum- char_int val of those 3 , and assert the string together with that digit to able to later retrieve the values more efficient,
or does the VP compiler already has a built in method to handle small and big strings for indexes

so instead of :
assert( cars_names( "mercedes" )),
assert( cars_names( "opel" )),

use:

assert( cars_names( 234, "mercedes" )),
assert( cars_names( 524, "opel" )),

* assuming here that 234 stands for: mer and 524 stands for ope

Unread post by **Thomas Linder Puls** » 13 May 2024 11:28

This statement is wrong:

Because for prolog a string of for example 5 characters is the same as a string of 300 or more characters ...

A string with 5 characters occupies 12 bytes and an string of 300 characters occupies 602 bytes. The allocated memory may be a little larger because allocations are always aligned to certain address boundaries.

However a string is a pointer to the characters so a predicate that takes a string as argument actually takes a pointer as argument. And in that sense you are right: it doesn't matter whether the actual string is long or short; the argument is a pointer (32 bit in 32 bit programs; 64 bit in 64 bit programs).

All in all, I think you prejudice about strings is it wrong.

I am not sure I understand the second question.

if i choose to use a string in a predicate which is asserted into memory, would it be benificial to take the first 3 characters of the string, calculate to sum- char_int val of those 3 , and assert the string together with that digit to able to later retrieve the values more efficient,

I think this is a rather strange approach. Basically it seems that you are making a kind of hash value of the strings and store them as an additional argument. Your hash function is rather bad however "mer...", "mre...", "emr..." all gives same hash value.

However if you only have a little number of car names then the original fact database is not likely to cause a performance problem. But if there is a large number of car names then you may have a performance problem, because fact databases are searched from one end to the other to find a match. So if there are 10000 car names then searching for one will in average consider 5000 car names (5000 for names that are in the database; 10000 for names that are not in the database).

I.e. searching in a fact database is linear in the number of elements in the fact database (O(N)).

(And then it really doesn't matter to speed up the comparison slightly. The problem is the large amount of comparisons.)

If you instead insert the names in a setM_redBlack then lookup is logarithmic (O(log(N)). For a database with 10000 car names you will consider something like 20 car names to determine whether one is present in the set.

And that is a true performance speed-up: 20 comparisons instead of 5000.

To learn more about this you can read Colletion Library.

You can also use setM_hash (constructor new_string) which will use our default hash function and have an constant look up (O(1)). I.e. when you look up a car name in such a set, the hash value of the car name is calculated and converted to an index into an array, and then that entry is examined; there are a little more to deal with multiple strings "landing" in the same index, etc. But basically only one comparison.

drspro2 · Unread post by **drspro2** » 14 May 2024 6:48

thankyou for the extended information

in this example of Car-names:

lets say that before your put all the data in memory with asserts, you assign to every
carname, a unique number ( for example with a constant ) ,

( you can do this because you know the list of carnames is finite, and you know all the carnames in
advance. )

after that, you assert all the data clauses into memory with asserts with the numbers, but then, the matching
will be a lot faster because internally it can match on numbers instead of a string,

and my question is If this is allways faster than matching and retrieval with the string-implementation,
because with the number-declaration the running program always knows in advance the size of the number-domain, this compared to the string implementation where the program has to first follow the pointer and the string which is behind it.

and for the very lazy programmers who declare Everything as string:, are they rewarded because
visual-prolog does some performance enhancements for them when it has to compare strings internally?,

for example when it has to compare 2 strings wether they are equal or not, in this case 1 string is 5 characters and the other string is 300 characters, the program would not have to retrieve the whole string of 300 characters because it can know by the length of both strings in advance that they can not be equal

ok i know i could create a program with a timer to see which or what is faster and to test it.

in the code i inherited they even use strings of 1 character, and they perform comparisons with that.

Also they make comparisons like this:

Not equal the Empty -string ::,

not( carname( "" ) ).

which is in my opinion very dangerous, i would never match on empty strings,
because the code would be hard to maintain and understand

Unread post by **Thomas Linder Puls** » 15 May 2024 9:08

I think you completely miss the point. You may or may not be able to speed up your program by using such code. But the speed-up will be completely insignificant probably less than 1% (and you may even have difficulties measuring it do to other computer load).

Nobody will notice if a program run 10% faster (unless they use as watch). To feel a performance improvement it has to be 50% faster or something like that, and that you will never get by such constructions, furthermore your code will become very awkward and difficult to maintain.

To get a real speed up you should use the mentioned setM stuff.

To illustrate the point I made this little test program (try it on your own computer):

Code: Select all

implement main
    open core, stdio
 
class facts
    car_names : (string Carname).
    car_names_rb : setM{string} := setM_redBlack::new().
 
clauses
    run() :-
        foreach I = std::cIterate(10000) do
            CN = string::format("%_%", math::random(upperBound(unsigned)), I),
            assert(car_names(CN)),
            car_names_rb:insert(CN)
        end foreach,
        Count = 50000,
        SearchCar = "non-match",
        profileTime::start_pr("car_names"),
        foreach _ = std::cIterate(Count) do
            car_names(SearchCar)
            orelse
            succeed
        end foreach,
        profileTime::stop_pr("car_names"),
        profileTime::start_pr("car_names_rb"),
        foreach _ = std::cIterate(Count) do
            SearchCar in car_names_rb
            orelse
            succeed
        end foreach,
        profileTime::stop_pr("car_names_rb"),
        profileTime::printAndReset().
 
end implement main
 
goal
    console::runUtf8(main::run).

It invents 10000 "car names" and place them in a fact database and in a setM_redBlack, and then it searches for a car name (which is not among the invented names).

On my computer car_names takes 1.6923113 seconds and car_name_rb takes 0.0171002 seconds nearly 100 times faster.

Unread post by **Thomas Linder Puls** » 15 May 2024 9:24

I should add that the mentioned speed up is when there are many car names in the database. I there are only 100 cars and I instead look up 5_000_000 times then the numbers are:

car_names takes 1.8511829 seconds and car_name_rb takes 1.0976348 seconds

I.e., only around 40% faster in this case (which is not nearly as good of course, but far better than whatever you will achieve with your attempts).

Unread post by **Thomas Linder Puls** » 15 May 2024 9:45

OK, one last thing. I added also a search for numbers (corresponding yo your idea):

Code: Select all

clauses
    run() :-
        Cars = 10_000,
        foreach I = std::cIterate(Cars) do
            N = math::random(upperBound(unsigned) - Cars) + Cars + I,
            CN = string::format("%", N),
            assert(car_numbers(N)),
            assert(car_names(CN)),
            car_names_rb:insert(CN)
        end foreach,
        Count = 50_000,
        SearchCar = "non-match",
        profileTime::start_pr("car_names"),
        foreach _ = std::cIterate(Count) do
            car_names(SearchCar)
            orelse
            succeed
        end foreach,
        profileTime::stop_pr("car_names"),
        profileTime::start_pr("car_names_rb"),
        foreach _ = std::cIterate(Count) do
            SearchCar in car_names_rb
            orelse
            succeed
        end foreach,
        profileTime::stop_pr("car_names_rb"),
        SearchCarNumber = 0,
        profileTime::start_pr("car_numbers"),
        foreach _ = std::cIterate(Count) do
            car_numbers(SearchCarNumber)
            orelse
            succeed
        end foreach,
        profileTime::stop_pr("car_numbers"),
        profileTime::printAndReset().

And searching for numbers (in this way) is around 5% more efficient than strings:

car_names 1.6987098
car_numbers 1.6164746

But you will also have to calculate the number to search for, and other stuff, so I doubt you can achieve the 5%. But you will never reach the 100 times faster or even the 40% faster.

drspro2 · Unread post by **drspro2** » 15 May 2024 13:54

Thomas thankyou for the extended information , I will use your code in new prolog code which i have to generate for future projects

Martin Meyer · Unread post by **Martin Meyer** » 15 May 2024 14:22

A remark on your point:

lets say that before your put all the data in memory with asserts, you assign to every
carname, a unique number ( for example with a constant ) ,
...
after that, you assert all the data clauses into memory with asserts with the numbers, but then, the matching
will be a lot faster because internally it can match on numbers instead of a string

Your idea of assigning unique numbers (resp. pointers) to strings is built-in into VIP already. It is the domain symbol:

The symbol domain is a subtype of the string domain, so a symbol can be used everywhere instead of a string. A symbol is stored in a global symbol table, and that storage guarenties that the pointer value of symbols will be the same for the same symbol. The equality operation is very efficient for symbols since their pointer value will be the same if the symbol is the same.

For example symbols are used in the PFC in the class syntax for terminals:

Code: Select all

domains
    terminal = symbol [presenter(presenter_terminal)].

It is however not always better to use domain symbol in place of domain string. I guess in practice exist more cases where string is the better choice.

drspro2 · Unread post by **drspro2** » 16 May 2024 8:21

the Car-Names example was very bad chosen , the Car-marks are a better example because there are about 30-40 car-marks ( volkswagen, volvo, bmw mercedes etcetera ) they could easily be replaced by a number.

Unread post by **Thomas Linder Puls** » 16 May 2024 9:50

Be careful with symbols: symbols are never released. Once a a string it converted to a symbol, it will stay in memory until the program terminates. So you don't want to handle large text files as symbols.

You should also notice that only equality test is faster, greater than and smaller than are exactly like for strings. So there is not much gain when used as red-black-tree keys, because red-black-trees uses full comparison.

..., I will use your code in new prolog code which i have to generate for future projects

If one of your facts contains many entries, then you could consider replacing that fact with relevant sets and maps. I didn't mention maps, but normally you use a lot more maps than sets, because maps can associate related data to a key.

I guess a car fact would actually look something like this:

Code: Select all

facts
    car_fact : (string CarName, string CarType, string Manufacturer).

Let us put the information into a normal domain instead, and replace the fact by a map:

Code: Select all

domains
    carInfo = carInfo(string CarName, string CarType, string Manufacturer).
 
facts
    carInfoMap : mapM{string CarName, carInfo CarInfo} := mapM_redBlack::new().
 
clauses
    insertCar(CarName, CarType, Manufacturer) :-
        carInfoMap:set(CarName, carInfo(CarName, CarType, Manufacturer)).
 
clauses
   tryGetCarInfo(CarName) = carInfoMap:tryGet(CarName).

The map will greatly speedup retrieving information given the car name, but it doesn't help on anything else. To find all cars of a certain type you will have to run through the entire map for example like this:

Code: Select all

clauses
    carsOfType(CarType) = [ Car || carInfo(:CarType = CarType | Car)= carInfoMap:getValue_nd() ].

This will go through all car info in the map.

If the operation is important you may add a separate mechanism to speed up this search.

Code: Select all

domains
    carInfo = carInfo(string CarName, string CarType, string Manufacturer).
 
facts
    carInfoMap : mapM{string CarName, carInfo CarInfo} := mapM_redBlack::new().
    carTypeMap : mapM{string CarType, setM{string CarName}} := mapM_redBlack::new().
 
clauses
    insertCar(CarName, CarType, Manufacturer) :-
        carInfoMap:set(CarName, carInfo(CarName, CarType, Manufacturer)),
        CarTypeSet = carTypeMap:get_defaultLazy(CarType, {  = setM_redBlack::new() }),
        CarTypeSet:insert(CarName).
 
clauses
    tryGetCarInfo(CarName) = carInfoMap:tryGet(CarName).
 
clauses
    carsOfType(CarType) = [ carInfoMap:get(CarName) || CarName in carTypeMap:tryGet(CarType) ].

I have used all kind of "fancy" code here (with no explanation), so if you need clarification please don't hesitate to ask.

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains

Re: question about strings as function arguments/domains