Another tokenise

Post by Steve Lympany » 12 May 2011 14:27

Maybe this is useful to someone. Jump to the bottom to see examples.

Code:

    class predicates
        mytokenise:(string)->string*.
    clauses
        mytokenise(STR)=L:-
            Rev=mytokprivate(STR,[""]),
            L=list::reverse(Rev).

    class facts
        zz_splitby_items:string*:=[].
        zz_previous_char_was_a_split:boolean:=false.
    class predicates
        mytokprivate:(string,string*)->string*.
    clauses
        mytokprivate("",L)=L:-!.
        mytokprivate(STR,List)=List1:-
            string::front(STR,1,First,Last),
            list::isMember(First,zz_splitby_items),!, %then split
            if zz_previous_char_was_a_split=true then
                List1=mytokprivate(Last,List)
            else
                zz_previous_char_was_a_split:=true,
                List1=mytokprivate(Last,[""|List])
            end if.
        mytokprivate(STR,[S|List])=List1:-
            zz_previous_char_was_a_split:=false,
            string::front(STR,1,First,Last),!,
            List1=mytokprivate(Last,[string::concat(S,First)|List]).
        mytokprivate(_,L)=L:-!.
    class predicates
        test:().
    clauses
        test():-
            zz_splitby_items:=[" ",";"],
            TOKS=mytokenise("he was on-call; he wasn't oncall; ;;; three222"),
            stdio::write(TOKS).

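You need to set the fact zz_splitby_items before calling mytokenise. Here it is set to [" ",";"], so the string is split only at a space or a semicolon. A run of consecutive separators counts as a single split, so no empty tokens are created between them.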

 "he was on-call; he wasn't oncall; ;;; three222"
results in

["he","was","on-call","he","wasn't","oncall","three222"]
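
Because the separators and the "previous character was a split" flag are class facts, they persist between calls. If one input ends in a separator, the flag is still true when the next call starts, and a leading separator in that next input is then treated differently. A minimal sketch of a wrapper that sets both facts before tokenising; the name tokenise_with is hypothetical and is not part of the attached class:

```prolog
class predicates
    % hypothetical helper, not in the original class
    tokenise_with : (string Text, string* Separators) -> string* Tokens.
clauses
    tokenise_with(Text, Separators) = Tokens :-
        zz_splitby_items := Separators,
        % reset the state left behind by any earlier call
        zz_previous_char_was_a_split := false,
        Tokens = mytokenise(Text).
```

Tracing the clauses by hand, tokenise_with("a,b,,c", [","]) should give ["a","b","c"].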

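A more complex version:

This version also handles text inside pairs of quotes or brackets, e.g. <hello there> is kept as one token and not split at the space. The quotes or brackets themselves are not included in the token. An opening quote or bracket is expected to follow a separator: any partly built token immediately before it is discarded.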

Code:

    class predicates
        mytokenise:(string)->string*.
    clauses
        mytokenise(STR)=L:-
            Rev=mytokprivate(STR,not_in_pair,[""]), %it's all backwards (saves using append), so just reverse
            L=list::reverse(Rev).

    domains
        pair=pair(string,string);
            pair_same_char(string). %eg within single quotes
        within_pair=within_pair;not_in_pair.
    class facts
        zz_splitby_items:string*:=[].
        zz_previous_char_was_a_split:boolean:=false.
        ndb_dont_split:(pair).
    class predicates
        mytokprivate:(string,within_pair,string*)->string*.
    clauses
        mytokprivate("",_,L)=L:-!.
        %the source string is always split, so do that as a first step.
        mytokprivate(STR,Within_pair,List)=List1:-
            string::front(STR,1,First,Last),
            List1=mytokprivate2(First,Last,Within_pair,List),!.

    class predicates
        mytokprivate2:(string First,string Last,within_pair,string*)->string*.
    clauses
        %current token is not in a pair of quotes or brackets
        %create new token
        mytokprivate2(First,Last,not_in_pair,List)=List1:-
            list::isMember(First,zz_splitby_items),!, %then split
            if zz_previous_char_was_a_split=true then %prevent empty tokens being created
                List1=mytokprivate(Last,not_in_pair,List)
            else
                zz_previous_char_was_a_split:=true,
                List1=mytokprivate(Last,not_in_pair,[""|List])
            end if.

        %start of a pair of (eg) brackets. (Nothing will be split until a close bracket is reached)
        mytokprivate2(First,Last,not_in_pair,[_S|List])=List1:-
            zz_previous_char_was_a_split:=false,
            ndb_dont_split(pair(First,_Close)),!,
            List1=mytokprivate(Last,within_pair,[""|List]).
        mytokprivate2(First,Last,not_in_pair,[_S|List])=List1:-
            zz_previous_char_was_a_split:=false,
            ndb_dont_split(pair_same_char(First)),!,
            List1=mytokprivate(Last,within_pair,[""|List]).

        %continue building the token, char by char
        mytokprivate2(First,Last,not_in_pair,[S|List])=List1:-
            zz_previous_char_was_a_split:=false,!,
            List1=mytokprivate(Last,not_in_pair,[string::concat(S,First)|List]).

        %currently within a pair of brackets, and find the close bracket
        mytokprivate2(First,Last,within_pair,List)=List1:-
            ndb_dont_split(pair(_,First)),!,
            List1=mytokprivate(Last,not_in_pair,List).
        mytokprivate2(First,Last,within_pair,List)=List1:-
            ndb_dont_split(pair_same_char(First)),!,
            List1=mytokprivate(Last,not_in_pair,List).

        %continue building the long token within brackets, char by char
        mytokprivate2(First,Last,within_pair,[S|List])=List1:-!,
            List1=mytokprivate(Last,within_pair,[string::concat(S,First)|List]).

        %all failures return the list
        mytokprivate2(_,_,_,L)=L:-!.
    class predicates
        test:().
    clauses
        test():-
            zz_splitby_items:=[" ",";"],
            assert(ndb_dont_split(pair("<",">"))),
            assert(ndb_dont_split(pair_same_char("\""))),
            S="he was ;;;on-call;<twice keeping> the \"underscores together\"",
            TOKS=mytokenise(S),
            stdio::write(TOKS).

With these settings:

    zz_splitby_items:=[" ",";"],
    assert(ndb_dont_split(pair("<",">"))),
    assert(ndb_dont_split(pair_same_char("\""))),

EXAMPLES

1) "he was on-call;twice" is split to
["he","was","on-call","twice"]

2) "he was ;;;on-call;twice keeping the under_score_s" is split to:
["he","was","on-call","twice","keeping","the","under_score_s"]

3) "he was ;;;on-call;<twice keeping> the \"underscores together\"" is split to:
["he","was","on-call","twice keeping","the","underscores together"]
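
The same mechanism should work for other delimiters. A sketch, assuming the facts and predicates of the more complex version above, that registers parentheses as a pair that must not be split; test_parens is a hypothetical name used only for illustration:

```prolog
class predicates
    test_parens : ().   % hypothetical, for illustration only
clauses
    test_parens() :-
        zz_splitby_items := [" "],
        % start from a known state, whatever an earlier call left behind
        zz_previous_char_was_a_split := false,
        assert(ndb_dont_split(pair("(", ")"))),
        TOKS = mytokenise("call (a b) now"),
        stdio::write(TOKS).
```

Tracing the clauses by hand, this should write ["call","a b","now"]: the parentheses are consumed and the space between them does not cause a split.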

But I'm sure the gurus at PDC could write it more nicely/powerfully/flexibly!

I attach the class,

Steve
Attachments
mytokenise.zip