question about strings as function arguments/domains
my main philosophy is: only use the string domain inside prolog code if it is very strictly necessary.
Because for prolog a string of for example 5 characters is the same as a string of 300 or more characters
i have herited some code where all the arguments are declared as strings which also results in high memory usage.
and a second question , if i choose to use a string in a predicate which is asserted into memory, would it be benificial to take the first 3 characters of the string, calculate to sum- char_int val of those 3 , and assert the string together with that digit to able to later retrieve the values more efficient,
or does the VP compiler already has a built in method to handle small and big strings for indexes
so instead of :
assert( cars_names( "mercedes" )),
assert( cars_names( "opel" )),
use:
assert( cars_names( 234, "mercedes" )),
assert( cars_names( 524, "opel" )),
* assuming here that 234 stands for: mer and 524 stands for ope
Because for prolog a string of for example 5 characters is the same as a string of 300 or more characters
i have herited some code where all the arguments are declared as strings which also results in high memory usage.
and a second question , if i choose to use a string in a predicate which is asserted into memory, would it be benificial to take the first 3 characters of the string, calculate to sum- char_int val of those 3 , and assert the string together with that digit to able to later retrieve the values more efficient,
or does the VP compiler already has a built in method to handle small and big strings for indexes
so instead of :
assert( cars_names( "mercedes" )),
assert( cars_names( "opel" )),
use:
assert( cars_names( 234, "mercedes" )),
assert( cars_names( 524, "opel" )),
* assuming here that 234 stands for: mer and 524 stands for ope
- Thomas Linder Puls
- VIP Member
- Posts: 1431
- Joined: 28 Feb 2000 0:01
Re: question about strings as function arguments/domains
This statement is wrong:
However a string is a pointer to the characters so a predicate that takes a string as argument actually takes a pointer as argument. And in that sense you are right: it doesn't matter whether the actual string is long or short; the argument is a pointer (32 bit in 32 bit programs; 64 bit in 64 bit programs).
All in all, I think you prejudice about strings is it wrong.
I am not sure I understand the second question.
However if you only have a little number of car names then the original fact database is not likely to cause a performance problem. But if there is a large number of car names then you may have a performance problem, because fact databases are searched from one end to the other to find a match. So if there are 10000 car names then searching for one will in average consider 5000 car names (5000 for names that are in the database; 10000 for names that are not in the database).
I.e. searching in a fact database is linear in the number of elements in the fact database (O(N)).
(And then it really doesn't matter to speed up the comparison slightly. The problem is the large amount of comparisons.)
If you instead insert the names in a setM_redBlack then lookup is logarithmic (O(log(N)). For a database with 10000 car names you will consider something like 20 car names to determine whether one is present in the set.
And that is a true performance speed-up: 20 comparisons instead of 5000.
To learn more about this you can read Colletion Library.
You can also use setM_hash (constructor new_string) which will use our default hash function and have an constant look up (O(1)). I.e. when you look up a car name in such a set, the hash value of the car name is calculated and converted to an index into an array, and then that entry is examined; there are a little more to deal with multiple strings "landing" in the same index, etc. But basically only one comparison.
A string with 5 characters occupies 12 bytes and an string of 300 characters occupies 602 bytes. The allocated memory may be a little larger because allocations are always aligned to certain address boundaries.Because for prolog a string of for example 5 characters is the same as a string of 300 or more characters ...
However a string is a pointer to the characters so a predicate that takes a string as argument actually takes a pointer as argument. And in that sense you are right: it doesn't matter whether the actual string is long or short; the argument is a pointer (32 bit in 32 bit programs; 64 bit in 64 bit programs).
All in all, I think you prejudice about strings is it wrong.
I am not sure I understand the second question.
I think this is a rather strange approach. Basically it seems that you are making a kind of hash value of the strings and store them as an additional argument. Your hash function is rather bad however "mer...", "mre...", "emr..." all gives same hash value.if i choose to use a string in a predicate which is asserted into memory, would it be benificial to take the first 3 characters of the string, calculate to sum- char_int val of those 3 , and assert the string together with that digit to able to later retrieve the values more efficient,
However if you only have a little number of car names then the original fact database is not likely to cause a performance problem. But if there is a large number of car names then you may have a performance problem, because fact databases are searched from one end to the other to find a match. So if there are 10000 car names then searching for one will in average consider 5000 car names (5000 for names that are in the database; 10000 for names that are not in the database).
I.e. searching in a fact database is linear in the number of elements in the fact database (O(N)).
(And then it really doesn't matter to speed up the comparison slightly. The problem is the large amount of comparisons.)
If you instead insert the names in a setM_redBlack then lookup is logarithmic (O(log(N)). For a database with 10000 car names you will consider something like 20 car names to determine whether one is present in the set.
And that is a true performance speed-up: 20 comparisons instead of 5000.
To learn more about this you can read Colletion Library.
You can also use setM_hash (constructor new_string) which will use our default hash function and have an constant look up (O(1)). I.e. when you look up a car name in such a set, the hash value of the car name is calculated and converted to an index into an array, and then that entry is examined; there are a little more to deal with multiple strings "landing" in the same index, etc. But basically only one comparison.
Regards Thomas Linder Puls
PDC
PDC
Re: question about strings as function arguments/domains
thankyou for the extended information
in this example of Car-names:
lets say that before your put all the data in memory with asserts, you assign to every
carname, a unique number ( for example with a constant ) ,
( you can do this because you know the list of carnames is finite, and you know all the carnames in
advance. )
after that, you assert all the data clauses into memory with asserts with the numbers, but then, the matching
will be a lot faster because internally it can match on numbers instead of a string,
and my question is If this is allways faster than matching and retrieval with the string-implementation,
because with the number-declaration the running program always knows in advance the size of the number-domain, this compared to the string implementation where the program has to first follow the pointer and the string which is behind it.
and for the very lazy programmers who declare Everything as string:, are they rewarded because
visual-prolog does some performance enhancements for them when it has to compare strings internally?,
for example when it has to compare 2 strings wether they are equal or not, in this case 1 string is 5 characters and the other string is 300 characters, the program would not have to retrieve the whole string of 300 characters because it can know by the length of both strings in advance that they can not be equal
ok i know i could create a program with a timer to see which or what is faster and to test it.
in the code i inherited they even use strings of 1 character, and they perform comparisons with that.
Also they make comparisons like this:
Not equal the Empty -string ::,
not( carname( "" ) ).
which is in my opinion very dangerous, i would never match on empty strings,
because the code would be hard to maintain and understand
in this example of Car-names:
lets say that before your put all the data in memory with asserts, you assign to every
carname, a unique number ( for example with a constant ) ,
( you can do this because you know the list of carnames is finite, and you know all the carnames in
advance. )
after that, you assert all the data clauses into memory with asserts with the numbers, but then, the matching
will be a lot faster because internally it can match on numbers instead of a string,
and my question is If this is allways faster than matching and retrieval with the string-implementation,
because with the number-declaration the running program always knows in advance the size of the number-domain, this compared to the string implementation where the program has to first follow the pointer and the string which is behind it.
and for the very lazy programmers who declare Everything as string:, are they rewarded because
visual-prolog does some performance enhancements for them when it has to compare strings internally?,
for example when it has to compare 2 strings wether they are equal or not, in this case 1 string is 5 characters and the other string is 300 characters, the program would not have to retrieve the whole string of 300 characters because it can know by the length of both strings in advance that they can not be equal
ok i know i could create a program with a timer to see which or what is faster and to test it.
in the code i inherited they even use strings of 1 character, and they perform comparisons with that.
Also they make comparisons like this:
Not equal the Empty -string ::,
not( carname( "" ) ).
which is in my opinion very dangerous, i would never match on empty strings,
because the code would be hard to maintain and understand
- Thomas Linder Puls
- VIP Member
- Posts: 1431
- Joined: 28 Feb 2000 0:01
Re: question about strings as function arguments/domains
I think you completely miss the point. You may or may not be able to speed up your program by using such code. But the speed-up will be completely insignificant probably less than 1% (and you may even have difficulties measuring it do to other computer load).
Nobody will notice if a program run 10% faster (unless they use as watch). To feel a performance improvement it has to be 50% faster or something like that, and that you will never get by such constructions, furthermore your code will become very awkward and difficult to maintain.
To get a real speed up you should use the mentioned setM stuff.
To illustrate the point I made this little test program (try it on your own computer):
It invents 10000 "car names" and place them in a fact database and in a setM_redBlack, and then it searches for a car name (which is not among the invented names).
On my computer car_names takes 1.6923113 seconds and car_name_rb takes 0.0171002 seconds nearly 100 times faster.
Nobody will notice if a program run 10% faster (unless they use as watch). To feel a performance improvement it has to be 50% faster or something like that, and that you will never get by such constructions, furthermore your code will become very awkward and difficult to maintain.
To get a real speed up you should use the mentioned setM stuff.
To illustrate the point I made this little test program (try it on your own computer):
Code: Select all
implement main
open core, stdio
class facts
car_names : (string Carname).
car_names_rb : setM{string} := setM_redBlack::new().
clauses
run() :-
foreach I = std::cIterate(10000) do
CN = string::format("%_%", math::random(upperBound(unsigned)), I),
assert(car_names(CN)),
car_names_rb:insert(CN)
end foreach,
Count = 50000,
SearchCar = "non-match",
profileTime::start_pr("car_names"),
foreach _ = std::cIterate(Count) do
car_names(SearchCar)
orelse
succeed
end foreach,
profileTime::stop_pr("car_names"),
profileTime::start_pr("car_names_rb"),
foreach _ = std::cIterate(Count) do
SearchCar in car_names_rb
orelse
succeed
end foreach,
profileTime::stop_pr("car_names_rb"),
profileTime::printAndReset().
end implement main
goal
console::runUtf8(main::run).
On my computer car_names takes 1.6923113 seconds and car_name_rb takes 0.0171002 seconds nearly 100 times faster.
Regards Thomas Linder Puls
PDC
PDC
- Thomas Linder Puls
- VIP Member
- Posts: 1431
- Joined: 28 Feb 2000 0:01
Re: question about strings as function arguments/domains
I should add that the mentioned speed up is when there are many car names in the database. I there are only 100 cars and I instead look up 5_000_000 times then the numbers are:
car_names takes 1.8511829 seconds and car_name_rb takes 1.0976348 seconds
I.e., only around 40% faster in this case (which is not nearly as good of course, but far better than whatever you will achieve with your attempts).
car_names takes 1.8511829 seconds and car_name_rb takes 1.0976348 seconds
I.e., only around 40% faster in this case (which is not nearly as good of course, but far better than whatever you will achieve with your attempts).
Regards Thomas Linder Puls
PDC
PDC
- Thomas Linder Puls
- VIP Member
- Posts: 1431
- Joined: 28 Feb 2000 0:01
Re: question about strings as function arguments/domains
OK, one last thing. I added also a search for numbers (corresponding yo your idea):
And searching for numbers (in this way) is around 5% more efficient than strings:
car_names 1.6987098
car_numbers 1.6164746
But you will also have to calculate the number to search for, and other stuff, so I doubt you can achieve the 5%. But you will never reach the 100 times faster or even the 40% faster.
Code: Select all
clauses
run() :-
Cars = 10_000,
foreach I = std::cIterate(Cars) do
N = math::random(upperBound(unsigned) - Cars) + Cars + I,
CN = string::format("%", N),
assert(car_numbers(N)),
assert(car_names(CN)),
car_names_rb:insert(CN)
end foreach,
Count = 50_000,
SearchCar = "non-match",
profileTime::start_pr("car_names"),
foreach _ = std::cIterate(Count) do
car_names(SearchCar)
orelse
succeed
end foreach,
profileTime::stop_pr("car_names"),
profileTime::start_pr("car_names_rb"),
foreach _ = std::cIterate(Count) do
SearchCar in car_names_rb
orelse
succeed
end foreach,
profileTime::stop_pr("car_names_rb"),
SearchCarNumber = 0,
profileTime::start_pr("car_numbers"),
foreach _ = std::cIterate(Count) do
car_numbers(SearchCarNumber)
orelse
succeed
end foreach,
profileTime::stop_pr("car_numbers"),
profileTime::printAndReset().
car_names 1.6987098
car_numbers 1.6164746
But you will also have to calculate the number to search for, and other stuff, so I doubt you can achieve the 5%. But you will never reach the 100 times faster or even the 40% faster.
Regards Thomas Linder Puls
PDC
PDC
Re: question about strings as function arguments/domains
Thomas thankyou for the extended information , I will use your code in new prolog code which i have to generate for future projects
-
- VIP Member
- Posts: 331
- Joined: 14 Nov 2002 0:01
Re: question about strings as function arguments/domains
A remark on your point:
The symbol domain is a subtype of the string domain, so a symbol can be used everywhere instead of a string. A symbol is stored in a global symbol table, and that storage guarenties that the pointer value of symbols will be the same for the same symbol. The equality operation is very efficient for symbols since their pointer value will be the same if the symbol is the same.
For example symbols are used in the PFC in the class syntax for terminals:
It is however not always better to use domain symbol in place of domain string. I guess in practice exist more cases where string is the better choice.
Your idea of assigning unique numbers (resp. pointers) to strings is built-in into VIP already. It is the domain symbol:lets say that before your put all the data in memory with asserts, you assign to every
carname, a unique number ( for example with a constant ) ,
...
after that, you assert all the data clauses into memory with asserts with the numbers, but then, the matching
will be a lot faster because internally it can match on numbers instead of a string
The symbol domain is a subtype of the string domain, so a symbol can be used everywhere instead of a string. A symbol is stored in a global symbol table, and that storage guarenties that the pointer value of symbols will be the same for the same symbol. The equality operation is very efficient for symbols since their pointer value will be the same if the symbol is the same.
For example symbols are used in the PFC in the class syntax for terminals:
Code: Select all
domains
terminal = symbol [presenter(presenter_terminal)].
Regards Martin
Re: question about strings as function arguments/domains
the Car-Names example was very bad chosen , the Car-marks are a better example because there are about 30-40 car-marks ( volkswagen, volvo, bmw mercedes etcetera ) they could easily be replaced by a number.
- Thomas Linder Puls
- VIP Member
- Posts: 1431
- Joined: 28 Feb 2000 0:01
Re: question about strings as function arguments/domains
Be careful with symbols: symbols are never released. Once a a string it converted to a symbol, it will stay in memory until the program terminates. So you don't want to handle large text files as symbols.
You should also notice that only equality test is faster, greater than and smaller than are exactly like for strings. So there is not much gain when used as red-black-tree keys, because red-black-trees uses full comparison.
I guess a car fact would actually look something like this:
Let us put the information into a normal domain instead, and replace the fact by a map:
The map will greatly speedup retrieving information given the car name, but it doesn't help on anything else. To find all cars of a certain type you will have to run through the entire map for example like this:
This will go through all car info in the map.
If the operation is important you may add a separate mechanism to speed up this search.
I have used all kind of "fancy" code here (with no explanation), so if you need clarification please don't hesitate to ask.
You should also notice that only equality test is faster, greater than and smaller than are exactly like for strings. So there is not much gain when used as red-black-tree keys, because red-black-trees uses full comparison.
If one of your facts contains many entries, then you could consider replacing that fact with relevant sets and maps. I didn't mention maps, but normally you use a lot more maps than sets, because maps can associate related data to a key...., I will use your code in new prolog code which i have to generate for future projects
I guess a car fact would actually look something like this:
Code: Select all
facts
car_fact : (string CarName, string CarType, string Manufacturer).
Code: Select all
domains
carInfo = carInfo(string CarName, string CarType, string Manufacturer).
facts
carInfoMap : mapM{string CarName, carInfo CarInfo} := mapM_redBlack::new().
clauses
insertCar(CarName, CarType, Manufacturer) :-
carInfoMap:set(CarName, carInfo(CarName, CarType, Manufacturer)).
clauses
tryGetCarInfo(CarName) = carInfoMap:tryGet(CarName).
Code: Select all
clauses
carsOfType(CarType) = [ Car || carInfo(:CarType = CarType | Car)= carInfoMap:getValue_nd() ].
If the operation is important you may add a separate mechanism to speed up this search.
Code: Select all
domains
carInfo = carInfo(string CarName, string CarType, string Manufacturer).
facts
carInfoMap : mapM{string CarName, carInfo CarInfo} := mapM_redBlack::new().
carTypeMap : mapM{string CarType, setM{string CarName}} := mapM_redBlack::new().
clauses
insertCar(CarName, CarType, Manufacturer) :-
carInfoMap:set(CarName, carInfo(CarName, CarType, Manufacturer)),
CarTypeSet = carTypeMap:get_defaultLazy(CarType, { = setM_redBlack::new() }),
CarTypeSet:insert(CarName).
clauses
tryGetCarInfo(CarName) = carInfoMap:tryGet(CarName).
clauses
carsOfType(CarType) = [ carInfoMap:get(CarName) || CarName in carTypeMap:tryGet(CarType) ].
Regards Thomas Linder Puls
PDC
PDC