Another tokenise
Posted: 12 May 2011 14:27
Maybe this is useful to someone. Jump to the bottom to see examples.
You need to set the fact zz_splitby_items to the list of separators. Here it is [" ",";"], so the string is split only at a space or a semicolon.
Code: Select all
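class predicates
    mytokenise : (string) -> string*.
clauses
    mytokenise(STR) = L :-
        % behave as if a separator was just read, so leading separators are skipped
        % and state left over from an earlier call cannot leak in
        zz_previous_char_was_a_split := true,
        Rev = mytokprivate(STR, [""]),
        % the tokens are built in reverse (saves using append), so just reverse
        L = list::reverse(Rev).
class facts
    zz_splitby_items : string* := [].
    zz_previous_char_was_a_split : boolean := false.
class predicates
    mytokprivate : (string, string*) -> string*.
clauses
    mytokprivate("", L) = L :- !.
    % the first char is a separator
    mytokprivate(STR, List) = List1 :-
        string::front(STR, 1, First, Last),
        list::isMember(First, zz_splitby_items), !, % then split
        if zz_previous_char_was_a_split = true then % prevent empty tokens being created
            List1 = mytokprivate(Last, List)
        else
            zz_previous_char_was_a_split := true,
            List1 = mytokprivate(Last, ["" | List])
        end if.
    % continue building the current token, char by char
    mytokprivate(STR, [S | List]) = List1 :-
        zz_previous_char_was_a_split := false,
        string::front(STR, 1, First, Last), !,
        List1 = mytokprivate(Last, [string::concat(S, First) | List]).
    % all failures return the list
    mytokprivate(_, L) = L :- !.
class predicates
    test : ().
clauses
    test() :-
        zz_splitby_items := [" ", ";"],
        TOKS = mytokenise("he was on-call; he wasn't oncall; ;;; three222"),
        stdio::write(TOKS).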
results in
Code: Select all
"he was on-call; he wasn't oncall; ;;; three222"
["he","was","on-call","he","wasn't","oncall","three222"]
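The separators are not fixed to space and semicolon. As a rough sketch of a different set, assuming the clause is added to the same class as the code above (testCommas is just a name made up for this example):

```prolog
class predicates
    testCommas : (). % hypothetical test predicate, not part of the attached class
clauses
    testCommas() :-
        % split on commas and spaces instead of spaces and semicolons
        zz_splitby_items := [",", " "],
        TOKS = mytokenise("red, green,blue"),
        % expected: ["red","green","blue"]
        stdio::write(TOKS).
```

One thing to watch: a separator at the very end of the string still starts a new, empty token, so "red, green,blue," comes back as ["red","green","blue",""].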
A more complex version:
This version keeps text inside a pair of quotes or brackets together, e.g. <hello there> is not split.
Code: Select all
class predicates
    mytokenise : (string) -> string*.
clauses
    mytokenise(STR) = L :-
        % behave as if a separator was just read, so leading separators are skipped
        % and state left over from an earlier call cannot leak in
        zz_previous_char_was_a_split := true,
        Rev = mytokprivate(STR, not_in_pair, [""]),
        % it's all backwards (saves using append), so just reverse
        L = list::reverse(Rev).
domains
    pair =
        pair(string, string);
        pair_same_char(string). % eg within single quotes
    within_pair = within_pair; not_in_pair.
class facts
    zz_splitby_items : string* := [].
    zz_previous_char_was_a_split : boolean := false.
    ndb_dont_split : (pair).
class predicates
    mytokprivate : (string, within_pair, string*) -> string*.
clauses
    mytokprivate("", _, L) = L :- !.
    % the source string is always split into its first char and the rest, so do that as a first step
    mytokprivate(STR, Within_pair, List) = List1 :-
        string::front(STR, 1, First, Last),
        List1 = mytokprivate2(First, Last, Within_pair, List), !.
class predicates
    mytokprivate2 : (string First, string Last, within_pair, string*) -> string*.
clauses
    % current token is not in a pair of quotes or brackets
    % separator found: create a new token
    mytokprivate2(First, Last, not_in_pair, List) = List1 :-
        list::isMember(First, zz_splitby_items), !, % then split
        if zz_previous_char_was_a_split = true then % prevent empty tokens being created
            List1 = mytokprivate(Last, not_in_pair, List)
        else
            zz_previous_char_was_a_split := true,
            List1 = mytokprivate(Last, not_in_pair, ["" | List])
        end if.
    % start of a pair of (eg) brackets (nothing will be split until a close bracket is reached)
    % the current token is kept (it is "" straight after a separator), so text before the bracket is not lost
    mytokprivate2(First, Last, not_in_pair, List) = List1 :-
        zz_previous_char_was_a_split := false,
        ndb_dont_split(pair(First, _Close)), !,
        List1 = mytokprivate(Last, within_pair, List).
    mytokprivate2(First, Last, not_in_pair, List) = List1 :-
        zz_previous_char_was_a_split := false,
        ndb_dont_split(pair_same_char(First)), !,
        List1 = mytokprivate(Last, within_pair, List).
    % continue building the token, char by char
    mytokprivate2(First, Last, not_in_pair, [S | List]) = List1 :-
        zz_previous_char_was_a_split := false, !,
        List1 = mytokprivate(Last, not_in_pair, [string::concat(S, First) | List]).
    % currently within a pair of brackets, and the close bracket is found
    mytokprivate2(First, Last, within_pair, List) = List1 :-
        ndb_dont_split(pair(_, First)), !,
        List1 = mytokprivate(Last, not_in_pair, List).
    mytokprivate2(First, Last, within_pair, List) = List1 :-
        ndb_dont_split(pair_same_char(First)), !,
        List1 = mytokprivate(Last, not_in_pair, List).
    % continue building the long token within brackets, char by char
    mytokprivate2(First, Last, within_pair, [S | List]) = List1 :- !,
        List1 = mytokprivate(Last, within_pair, [string::concat(S, First) | List]).
    % all failures return the list
    mytokprivate2(_, _, _, L) = L :- !.
class predicates
    test : ().
clauses
    test() :-
        zz_splitby_items := [" ", ";"],
        assert(ndb_dont_split(pair("<", ">"))),
        assert(ndb_dont_split(pair_same_char("\""))),
        S = "he was ;;;on-call;<twice keeping> the \"underscores together\"",
        TOKS = mytokenise(S),
        stdio::write(TOKS).
With
Code: Select all
zz_splitby_items:=[" ",";"],
assert(ndb_dont_split(pair("<",">"))),
assert(ndb_dont_split(pair_same_char("\""))),
EXAMPLES
1) "he was on-call;twice" is split into
["he","was","on-call","twice"]
2) "he was ;;;on-call;twice keeping the under_score_s" is split into
["he","was","on-call","twice","keeping","the","under_score_s"]
3) "he was ;;;on-call;<twice keeping> the \"underscores together\"" is split into
["he","was","on-call","twice keeping","the","underscores together"]
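Other pairs can be registered in the same way. A small sketch, again assuming it sits in the same class as the complex version (testMorePairs is a made-up name, and the parentheses and single quotes are only example choices):

```prolog
class predicates
    testMorePairs : (). % hypothetical test predicate, not part of the attached class
clauses
    testMorePairs() :-
        zz_splitby_items := [" ", ";"],
        % register two more ways of keeping text together
        assert(ndb_dont_split(pair("(", ")"))),
        assert(ndb_dont_split(pair_same_char("'"))),
        TOKS = mytokenise("say (hello there) and 'good bye'"),
        % expected: ["say","hello there","and","good bye"]
        stdio::write(TOKS).
```

Keep in mind that any registered closing character ends the protected text, not only the one that matches the opener, so mixing several kinds of brackets in one string may split in unexpected places.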
But I'm sure the gurus at PDC could write it more nicely/powerfully/flexibly!
I attach the class,
Steve