Version 29 of cmdSplit

Updated 2014-09-18 17:33:19 by pooryorick

scriptSplit (formerly known as cmdSplit), by dgp, parses a script into its constituent commands while properly handling semicolon-delimited commands and the "semicolon in a comment" problem. It was written to support parsing of class bodies in an itcl-like, pure Tcl, OO framework into Tcl commands.


AMG: PYK, did you get dgp's permission to rename his script? I see a few days ago you commented that scriptSplit is likely the better name. While I agree with you, I saw no follow-up commentary nor discussion indicating buy-in from dgp, but rather only that you deleted your own comment and simply renamed the script, and that you did so without renaming the page (moving its contents) or updating every instance of its name in the discussion (thereby leaving people confused about what cmdSplit is).

PYK 2014-09-18: It looks to me like I did do the renaming where needed. All the references to cmdSplit below refer to a different thing. My philosophy on editing the wiki is that in order not to waste people's time, for a lot of things, it's better to just make changes and see how they fly. I actually looked for dgp in the Tcl Chatroom last night to mention this change, and was planning to mention it today when he appeared. If he objects to it, or if it otherwise becomes clear that the change should be reverted, I'll revert it, of course. I planned to rename the page at some future point if there was no pushback about renaming the proc.

aspect: I have to admit I've been a bit uneasy about the rename -- I can see sense in the new name, but cmdsplit has made it into my standard prelude and is already used in my scripts, so I'm stuck with the old name. What I didn't realise until now is how many pages reference it by name or directly xbody it. Those links are just from wiki search, which won't give me more than a page of results. I'd consider this a risky change - especially with cmdSplit not simply disappearing, but taking a new meaning!

PYK 2014-09-18: dgp threw out the same concern in a one-liner, but I didn't get the impression he was really stumping to change it back. I'm still in favour of this change, since order of the future is a priority for me, and the chaos of the past is not. Similarly, I advocate for Tcl itself to follow the (very successful) example of the R project, and relentlessly deprecate anything but the current version. Their rate of progress is startling. If someone reverts it, I'll just leave it as is, and will go correct all the other pages that I've now changed to indicate scriptSplit. One feature I'd like for the wiki is "snapshot mode", which would let one browse through the wiki as it looked at some point in time.

See Also

cmdStream
Config file using slave interp, by AMG
more-or-less the same thing, implemented using a slave interpreter

Description

scriptSplit returns a list of the commands in a script. The original post is "How to split a string into elements exactly as eval would do", comp.lang.tcl, 1998-09-07.
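
To see why a dedicated splitter is needed at all, here is a small sketch of what goes wrong with the obvious approach of cutting a script at every newline and semicolon. The sample script and the variable names are made up for this illustration; only split and info complete are used, both Tcl built-ins.

```tcl
# Hypothetical sample script, made up for this illustration.
set script {# setup; not a command
set a 1; set b 2
proc greet {name} {
    puts "hello; $name"
}}

# Naive approach: cut at every newline and every semicolon.
set naive [split $script "\n;"]
puts [llength $naive]                    ;# 8 fragments for 4 logical items

# The comment has been cut in two, and its tail now looks like a command.
puts [list [lindex $naive 1]]

# The proc definition has been cut into pieces that are not complete
# commands on their own, which is what info complete detects.
puts [info complete [lindex $naive 4]]   ;# 0: open brace never closed
puts [info complete [lindex $naive 5]]   ;# 0: open quote never closed
puts [info complete $script]             ;# 1: the script as a whole is fine
```

Joining fragments back together until info complete is satisfied repairs the proc, but info complete alone cannot tell that the tail of the comment is not a command, which is presumably why the comment case needs separate handling.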

PYK 2013-04-14: I've modified scriptSplit to not filter out comments, and provided a simple helper script that does that if desired:

nocomments [scriptSplit $script]

code:

proc scriptSplit {script} {
    set commands {}
    set chunk {} 
    foreach line [split $script \n] {
        append chunk $line
        if {[info complete $chunk\n]} {
            # $chunk ends in a complete Tcl command, and none of the
            # newlines within it end a complete Tcl command.  If there
            # are multiple Tcl commands in $chunk, they must be
            # separated by semi-colons.
            set cmd {} 
            foreach part [split $chunk \;] {
                append cmd $part
                if {[info complete $cmd\n]} {
                    set cmd [string trimleft $cmd]
                    #drop empty commands
                    if {$cmd eq {}} {
                        continue
                    }
                    if {[string match \#* $cmd]} {
                        #the semi-colon was part of a comment.  Add it back
                        append cmd \;
                    } else {
                        lappend commands $cmd
                        set cmd {}
                    }
                } else {
                    # No complete command yet.
                    # Replace semicolon and continue
                    append cmd \;
                }
            }
            #if there was an "inline" comment, it will be in cmd, with an
            #additional semicolon at the end
            if {$cmd ne {}} {
                lappend commands [string replace $cmd[set cmd {}] end end]
            }
            set chunk {} 
        } else {
            # No end of command yet.  Put the newline back and continue
            append chunk \n
        }
    }
    if {![string match {} [string trimright $chunk]]} {
        return -code error "Can't parse script into a\
                sequence of commands.\n\tIncomplete\
                command:\n-----\n$chunk\n-----"
    }
    return $commands
}

proc nocomments {commands} {
    set res [list]
    foreach command $commands {
        if {![string match \#* $command]} {
            lappend res $command
        }
    }
    return $res
}

wordSplit

Sarnold: cmdSplit takes a command and returns its arguments as a list.

proc cmdSplit command {
    if {![info complete $command]} {error "non complete command"}
    set res ""; # the list of words
    set chunk ""
    foreach word [split $command " \t"] {
        # testing each word until the word being tested makes the
        # command up to it complete
        # example:
        # set "a b"
        # set -> complete, 1 word
        # set "a -> not complete
        # set "a b" -> complete, 2 words
        append chunk $word
        if {[info complete "$res $chunk\n"]} {
            lappend res $chunk
            set chunk ""
        } else {
            append chunk " "
        }
    }
    lsearch -inline -all -not $res {}   ;# empty words denote consecutive whitespace
}

aspect: forgive my foolishness, but what is cmdSplit for? From the description it sounds like cmdSplit $command means lrange $command 1 end but it seems to do something different. If you want the elements of $command as a list, just use $command!

AMG: cmdSplit splits an arbitrary string by whitespace, then attempts to join the pieces according to the result of info complete. This results in a list in which each element embeds its original quote characters. Since an odd number of trailing backslashes doesn't cause info complete to return false, cmdSplit doesn't correctly recognize backslashes used to quote spaces.

I agree that cmdSplit doesn't appear to serve a useful purpose. Its input should already be a valid, directly usable list.

aspect: it also does strange things if there are consecutive spaces in the input. "each element embeds its original quote characters" seems to be the important characteristic, but I can't think of a use-case where this would be desirable. I'm hoping that Sarnold can elaborate on his original intention so the example can be focussed (and corrected?).

PYK 2014-09-11: Due to command substitution, several words in the "raw" list that composes a command might contribute to one word in the "logical" list that is that command. cmdSplit parses the command into its logical words. Keeping the braces and quotation marks allows the consumer to know how Tcl would have dealt with each word.

To eliminate false positives, I added a \n to the info complete test in cmdSplit. Also, here is another variant that differs only in style:

proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set cmdwords {}
    set realword {}
    foreach word [split $cmd " \t"] {
        set realword [concat $realword[set realword {}] $word]
        if {[info complete $realword\n]} {
            lappend cmdwords $realword
            set realword {}
        }
    }
    return $cmdwords
}

example:

% cmdSplit2 {set "var one" [lindex {one "two three" four} 1]} 
#-> set {"var one"} {[lindex {one "two three" four} 1]}
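
For contrast, here is a sketch of what happens when the same command is simply treated as a Tcl list, using only llength, lindex and catch. The second command, cmd2, is a hypothetical addition, included to show that a valid command is not necessarily a valid list.

```tcl
set cmd {set "var one" [lindex {one "two three" four} 1]}

# Read as a plain list, the command substitution is broken apart and
# the quotes around the second word are consumed by list parsing.
puts [llength $cmd]          ;# 5 list elements, although the command has 3 words
puts [lindex $cmd 1]         ;# var one
puts [lindex $cmd 2]         ;# only the first piece of the bracketed word

# Hypothetical second command: valid as a command, but not a valid list,
# because the braced word is immediately followed by a close bracket.
set cmd2 {puts [format %s {a}]}
puts [catch {llength $cmd2}] ;# 1: list parsing raises an error
```

This suggests that "just use $command" works only for commands without command substitution or other syntax that list parsing treats differently.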

aspect: that example made it clear! I've wanted something like this before. It could have potential for some interesting combinations with Scripted list and Annotating words for better specifying procs in Tcl9.

I added an lsearch at the end of cmdSplit to get rid of the "empty word" artifacts caused by consecutive whitespace. PYK's implementation needs a bit of tweaking to handle this better:

% cmdSplit {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo  bar}} {"$baz   quz 23"} {lel\ lal} lka {${foo b  bar}}
% cmdSplit2 {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo bar}} {} {"$baz quz 23"} {} {} {lel\ lal} lka {${foo b bar}}

PYK 2014-09-12: Yes, it does need some tweaking. In addition to the issue noted, both cmdSplit and the previous cmdSplit2 improperly converted tab characters within a word into spaces. To fix that, it's necessary to use regexp instead of split to get a handle on the actual delimiter. Here is a new cmdSplit2 that I think works correctly:

proc cmdSplit2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set words {}
    set logical {}
    set cmd [string trimleft $cmd[set cmd {}]]
    while {[regexp {([^\s]*)(\s+)(.*)} $cmd full first delim last]} {
        append logical $first
        if {[info complete $logical\n]} {
            lappend words $logical
            set logical {}
        } else {
            append logical $delim
        }
        set cmd $last[set last {}]
    }
    if {$cmd ne {}} {
        append logical $cmd
    }
    if {$logical ne {}} {
        lappend words $logical 
    }
    return $words
}
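
The reason for moving from split to regexp can be seen in isolation with a short sketch. The input string is hypothetical; the regular expression is the one used in the proc above.

```tcl
# Hypothetical input: a, TAB, b, two spaces, c
set text "a\tb  c"

# split discards the delimiters, so the tab cannot be restored and the
# double space shows up as an empty element.
set pieces [split $text " \t"]
puts [llength $pieces]                    ;# 4: a, b, {}, c

# regexp hands back the leading non-space run, the exact whitespace that
# followed it, and the remainder to process on the next iteration.
regexp {([^\s]*)(\s+)(.*)} $text full first delim last
puts [list $first $delim $last]
puts [string equal $delim "\t"]           ;# 1: the tab survived
```

Because the delimiter is captured, it can be appended back verbatim whenever the word collected so far is not yet complete.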