cmdSplit

'''`commands`''' (formerly known as '''`scriptSplit`''', and before that as '''`cmdSplit`'''), by [dgp], parses a [script] into its constituent commands while properly handling semicolon-delimited commands and the "semicolon in a comment" problem.  It was written to support parsing of class bodies in an itcl-like, pure Tcl, OO framework into Tcl commands.

[PYK] 2014-10-08, 2016-09-14:   '''`scriptSplit`''' was previously named
'''`cmdSplit`''', but was renamed so that '''`wordparts`''', the procedure that
splits commands, could be renamed to '''`cmdSplit`'''.  Finally, `cmdSplit`
was renamed to just '''`words`'''.  Check the history of this page for some
discussion.



** See Also **

   [cmdStream]:   Process commands in a script read from a [chan%|%channel].  Both a [callback] variant and a [coroutine]-based producer are available.
   [Config file using slave interp], by [AMG]:   More-or-less the same thing, implemented using a slave [interp%|%interpreter].
   [tclparser]:   An [extension] which exposes Tcl's own parser (of scripts, commands, lists and [expr%|%expressions]) to scripts.

   [parsetcl]:   A pure-Tcl package for parsing Tcl.



** Download **


`commands`, `words`, `wordparts`, `exprlex`, and `funclex` are available in
`[ycl%|%ycl parse]`, along with commands that provide information about offsets
and line numbers of each component of a parsed script.  These commands are
used, for example, in `[ycl] vim` to determine which command and
word the cursor is currently located in.



** Description **


`commands` returns a list of the commands in a script.  The original post is
''[http://groups.google.com/group/comp.lang.tcl/msg/cfe2d00fc7b291be%|%How to
split a string into elements exactly as eval would do, comp.lang.tcl, 1998-09-07]''.

[PYK] 2013-04-14:  I've modified `commands` to not filter out comments, and
provided a simple helper script that does that if desired:

======
nocomments [commands $script]
======

code:

======
proc commands script {
    set commands {}
    set chunk {} 
    foreach line [split $script \n] {
        append chunk $line
        if {[info complete $chunk\n]} {
            # $chunk ends in a complete Tcl command, and none of the
            # newlines within it end a complete Tcl command.  If there
            # are multiple Tcl commands in $chunk, they must be
            # separated by semi-colons.
            set cmd {}
            foreach part [split $chunk \;] {
                append cmd $part
                if {[info complete $cmd\n]} {
                    set cmd [string trimleft $cmd[set cmd {}] "\f\n\r\t\v "]

                    if {[string match #* $cmd]} {
                        #the semi-colon was part of a comment.  Add it back
                        append cmd \;
                        continue
                    }
                    #drop empty commands
                    if {$cmd eq {}} {
                        continue
                    }
                    lappend commands $cmd
                    set cmd {}
                } else {
                    # No complete command yet.
                    # Replace semicolon and continue
                    append cmd \;
                }
            }
            # Handle comments, removing synthetic semicolon at the end
            if {$cmd ne {}} {
                lappend commands [string replace $cmd[set cmd {}] end end]
            }
            set chunk {} 
        } else {
            # No end of command yet.  Put the newline back and continue
            append chunk \n
        }
    }
    if {![string match {} [string trimright $chunk]]} {
        return -code error "Can't parse script into a\
                sequence of commands.\n\tIncomplete\
                command:\n-----\n$chunk\n-----"
    }
    return $commands
}
======

[PYK] 2016-09-25:   Added `[string trimright]` and juggled some lines to ensure
proper removal of trailing whitespace from commands separated by
semicolons.

[PYK] 2017-08-16: Removed `[string trimright]`, which introduced a bug in the case where the last word of a command ends in backslash-whitespace.  Added an explicit list of whitespace characters to `[string trimleft]`, because [https://www.tcl.tk/cgi-bin/tct/tip/413.html%|%TIP #413] added NULL to its list of default characters to remove, making the default unsuitable for trimming whitespace from the beginning of a command.
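
The `nocomments` helper used above is not reproduced on this page.  A minimal
sketch, assuming it only needs to drop the elements that begin with `#` from
the list returned by `commands`.  The name comes from the example above, but
the body is a guess, not the original, and it needs Tcl 8.6 for `[lmap]`:

======
# Hypothetical stand-in for the nocomments helper, not the original.
# Accepts the list returned by [commands], in which leading whitespace has
# already been trimmed from each command, so a comment always starts with #.
proc nocomments commands {
    lmap cmd $commands {
        if {[string match #* $cmd]} continue
        set cmd
    }
}
======

Given the two-line script `set a 1; # note; more` followed by `puts "x;y"`,
the `commands` procedure above should return the three elements `set a 1`,
`# note; more`, and `puts "x;y"`; `nocomments` then drops the middle one.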


----

'''[PYK] 2017-08-16'''

Here is another implementation of `commands` which is structured a little
differently and also makes available some information about the
commands in the script.  It's available as `[ycl] parse tcl commands`:

======
proc commands script {
    namespace upvar [namespace current] commands_info report
    set report {}
    set commands {}
    set command {}
    set comment 0
    set lineidx 0
    set offset 0
    foreach line [split $script \n] {
        set parts [split $line \;]
        set numparts [llength $parts]
        set partidx 0
        while 1 {
            set parts [lassign $parts[set parts {}] part]
            if {[string length $command]} {
                if {$partidx} {
                    append command \;$part
                } else {
                    append command \n$part
                }
            } else {
                set partlength [string length $part]
                set command [string trimleft $part[set part {}] "\f\n\r\t\v "]
                incr offset [expr {$partlength - [string length $command]}]
                if {[string match #* $command]} {
                    set comment 1
                }
            }

            if {$command eq {}} {
                incr offset
            } elseif {(!$comment || (
                    $comment && (!$numparts || ![llength $parts])))
                && [info complete $command\n]} {

                lappend commands $command
                set info [dict create character $offset line $lineidx]
                set offset [expr {$offset + [string length $command] + 1}]
                lappend report $info
                set command {}
                set comment 0
                set info {}
            }

            incr partidx
            if {![llength $parts]} break
        }
        incr lineidx
    }
    if {$command ne {}} {
        error [list {incomplete command} $command]
    }
    return $commands
}
======



** `words` **

[Sarnold]: `words` takes a command and returns its arguments as a list.

======
proc words command {
    if {![info complete $command]} {error "non complete command"}
    set res ""; # the list of words
    set chunk ""
    foreach word [split $command "\f\n\r\t\v "] {
        # testing each word until the word being tested makes the
        # command up to it complete
        # example:
        # set "a b"
        # set -> complete, 1 word
        # set "a -> not complete
        # set "a b" -> complete, 2 words
        append chunk $word
        if {[info complete "$res $chunk\n"]} {
            lappend res $chunk
            set chunk ""
        } else {
            append chunk " "
        }
    }
    lsearch -inline -all -not $res {}   ;# empty words denote consecutive whitespace
}
======

----

[aspect]: forgive my foolishness, but what is `words` for?  From the
description it sounds like `words $command` means `lrange $command 1
end` but it seems to do something different.  If you want the elements of
`$command` as a list, just use `$command`!

[AMG]: `words` splits an arbitrary string by whitespace, then attempts
to join the pieces according to the result of `[info complete]`.  This results
in a list in which each element embeds its original quote characters.  Since an
odd number of trailing backslashes doesn't cause `[info complete]` to return
false, `words` doesn't correctly recognize backslashes used to quote
spaces.

I agree that `words` doesn't appear to serve a useful purpose.  Its
input should already be a valid, directly usable list.

[aspect]: it also does strange things if there are consecutive spaces in the
input.  "each element embeds its original quote characters" seems to be the
important characteristic, but I can't think of a use-case where this would be
desirable.  Hopefully [Sarnold] can elaborate on his original intention so
the example can be focussed (and corrected?).

[PYK] 2014-09-11:  Due to [dodekalogue%|%command substitution], several words in the "raw" list that composes a command might contribute to one word in the "logical" list that is that command. `words` parses the command into its logical words.
Keeping the braces and quotation marks allows the consumer to know how Tcl
would have dealt with each word.
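
A small sketch of the difference, using only built-in commands (the variable
name `cmd` is just for illustration).  Treating the command as a list splits
it at every unquoted space, including the spaces inside the command
substitution, and discards the quotation marks:

======
set cmd {set "var one" [lindex {one "two three" four} 1]}

# As a list, the command substitution is broken into three elements,
# and the quotation marks around "var one" are gone.
puts [llength $cmd]    ;# 5
puts [lindex $cmd 1]   ;# var one
puts [lindex $cmd 2]   ;# [lindex
======

Tcl itself sees only three words in that command: the command name, the
quoted word, and the whole command substitution.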

To eliminate false positives, I added a `\n` to the `[info complete]` test in `words`.  Here is another variant that differs only in style:

======
proc words2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set cmdwords {}
    set realword {}
    foreach word [split $cmd "\f\n\r\t\v "] {
        set realword [concat $realword[set realword {}] $word]
        if {[info complete $realword\n]} {
            lappend cmdwords $realword
            set realword {}
        }
    }
    return $cmdwords
}
======

example:

======
% words2 {set "var one" [lindex {one "two three" four} 1]} 
#-> set {"var one"} {[lindex {one "two three" four} 1]}
======

[aspect]:  that example made it clear!  I've wanted something like this before.  It could have potential for some interesting combinations with [Scripted list] and [Annotating words for better specifying procs in Tcl9].

I added an [lsearch] at the end of words to get rid of the "empty word" artifacts caused by consecutive whitespace.  [PYK]'s implementation needs a bit of tweaking to handle this better:

======
% words {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo  bar}} {"$baz   quz 23"} {lel\ lal} lka {${foo b  bar}}
% words2 {{foo  bar}  "$baz   quz 23"   lel\ lal lka ${foo b  bar}}
{{foo bar}} {} {"$baz quz 23"} {} {} {lel\ lal} lka {${foo b bar}}
======

[PYK] 2014-09-12:  Yes, it does need some tweaking. In addition to the issue
noted, both `words` and the previous `words2` improperly converted tab
characters within a word into spaces.  To fix that, it's necessary to use
`[regexp]` instead of `[split]` to get a handle on the actual delimiter.  Here is a new `words2` that I think works correctly:

======
proc words2 cmd {
    if {![info complete $cmd]} {
        error [list {not a complete command} $cmd]
    }
    set words {}
    set logical {}
    set cmd [string trimleft $cmd[set cmd {}] "\f\n\r\t\v " ]
    while {[regexp {([^\f\n\r\t\v ]*)([\f\n\r\t\v ]+)(.*)} $cmd full first delim last]} {
        append logical $first
        if {[info complete $logical\n]} {
            lappend words $logical
            set logical {}
        } else {
            append logical $delim
        }
        set cmd $last[set last {}]
    }
    if {$cmd ne {}} {
        append logical $cmd
    }
    if {$logical ne {}} {
        lappend words $logical 
    }
    return $words
}
======
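
The `\n` appended in each `[info complete]` test above is what keeps a
backslash-escaped space inside its word.  A small sketch of the underlying
behaviour, using nothing but `[info complete]`:

======
# An unclosed brace makes a script incomplete.
puts [info complete "set a \{b"]    ;# 0
puts [info complete "set a {b}"]    ;# 1

# A piece that ends in a backslash looks complete on its own...
puts [info complete "lel\\"]        ;# 1
# ...but not when a newline follows, because backslash-newline continues
# the command onto the next line.
puts [info complete "lel\\\n"]      ;# 0
======

So when `lel\ lal` is split at the space, the piece `lel\` followed by `\n`
is reported as incomplete, and the next piece is joined to it.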


[PYK] 2016-09-14:   I've edited the procedures in this section to include `\f`, `\v`, and `\r` as whitespace.


** `wordparts` **

[PYK] 2014-10-07: `wordparts` accepts a single word and splits it into its
components.  This implementation attempts to split a word exactly as Tcl would,
minding details such as just what exactly Tcl considers whitespace, and only
interpreting `\<newline>whitespace` in a braced word specially if there is an
odd number of backslashes preceding the newline character.
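
The backslash rule for braced words can be seen directly in Tcl.  In this
sketch the variable names `odd` and `even` are only for illustration, and the
continuation lines are indented by four spaces.  One backslash before the
newline triggers the substitution, while two do not:

======
# One backslash: backslash-newline and the whitespace after it become
# a single space, even inside braces.
set odd {a\
    b}
puts [string equal $odd "a b"]              ;# 1

# Two backslashes: they pair up with each other, so the newline and the
# indentation that follow stay in the value unchanged.
set even {a\\
    b}
puts [string equal $even "a\\\\\n    b"]    ;# 1
======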

`wordparts` is a little more complicated than the other scripts on this page
because it doesn't get as much mileage out of `[info complete]`.  It's also
available as a part of [ycl%|%ycl parse wordparts].  If you think you need
`wordparts`, you may actually be looking for [scripted list], which gives you a
list of words with the substitutions performed.

[aspect] has also recently [http://paste.tclers.tk/3304%|%produced an
implementation], but it fails a good number of the tests developed for the
implementation below.

In this script, `sl` is [scripted list].

======
#sl is "scripted list", http://wiki.tcl.tk/39972
proc wordparts word {
    set parts {}
    set first [string index $word 0]
    if {$first in {\" \{}} { ;#P syche! "
        set last [string index $word end]
        set wantlast [dict get {\" \" \{ \}} $first] ;#P syche! "
        if {$last ne $wantlast} {
            error [list [list missing trailing [
                dict get {\" quote \{ brace} $first]]] ;#P syche! "
        }
        set word [string range $word[set word {}] 1 end-1]
    }
    if {$first eq "\{"} {
        set obracecount 0
        set cbracecount 0
        set part {}
        while {$word ne {}} {
            switch -regexp -matchvar rematch $word [sl {
                #these seem to be the only characters Tcl accepts as whitespace
                #in this context
                {^([{}])(.*)} {
                    if {[string index $word 0] eq "\{"} {
                        incr obracecount
                    } else {
                        incr cbracecount
                    }
                    lassign $rematch -> 1 word 
                    append part $1
                }
                {^(\\[{}])(.*)} {
                    lassign $rematch -> 1 word 
                    append part $1
                }
                {^(\\+\n[\x0a\x0b\x0d\x20]*)(.*)}  {
                    lassign $rematch -> 1 word
                    if {[regexp -all {\\} $1] % 2} {
                        if {$part ne {}} {
                            lappend parts $part
                            set part {}
                        }
                        lappend parts $1
                    } else {
                        append part $1
                    }
                }
                {^(.+?(?=(\\?[{}])|(\\+\n)|$))(.*$)} {
                    lassign $rematch -> 1 word
                    append part $1
                } 
                default {
                    error [list {no match} $word]
                }
            }]
        }
        if {$cbracecount != $obracecount} {
            error [list {unbalanced braces in braced word}]
        }
        if {$part ne {}} {
            lappend parts $part
        }
        return $parts
    } else {
        set expression [sl {
            #order matters in some cases below

            {^(\$(?:::|[A-Za-z0-9_])*\()(.*)} - 
            {^(\[)(.*)} {
                if {[string index $word 0] eq {$}} {
                    set re {^([^)]*\))(.*)}
                    set errmsg {incomplete variable name}
                } else {
                    set re {^([^]]*])(.*)}
                    set errmsg {incomplete command substitution}
                }
                lassign $rematch -> 1 word
                while {$word ne {}} {
                    set part {}
                    regexp $re $word -> part word
                    append 1 $part
                    if {[info complete $1]}  {
                        lappend parts $1
                        break
                    } elseif {$word eq {}} {
                        error [list $errmsg $1] 
                    }
                }
            }

            #these seem to be the only characters Tcl accepts as whitespace
            #in this context
            {^(\\\n[\x0a\x0b\x0d\x20]*)(.*)} -
            {^(\$(?:::|[A-Za-z0-9_])+)(.*)} -
            {^(\$\{[^\}]*\})(.*)} -
            #detect a single remaining backslash or dollar character here
            #to avoid a more complicated re below
            {^(\\|\$)($)} -
            {^(\\[0-7]{1,3})(.*)} -
            {^(\\U[0-9A-Fa-f]{1,8})(.*)} -
            {^(\\u[0-9A-Fa-f]{1,4})(.*)} -
            {^(\\x[0-9A-Fa-f]{1,2})(.*)} -
            {^(\\.)(.*)} -
            #lookahead ensures that .+ matches non-special occurrences of
            #"$" character
            #non greedy match here, so make sure .*$ stretches the match to
            #the end, so that something ends up in $2
            {(?x)
                #non-greedy so that the following lookahead stops it at the
                #first chance 
                ^(.+?
                    #stop at backslashes
                    (?=(\\
                        #but only if they aren't at the end of the word 
                        (?!$))
                    #also stop at brackets
                    |(\[)
                    #and stop at variables
                    |(\$(?:[\{A-Za-z0-9_]|::))
                    #or at the end of the word
                    |$)
                )
                #the rest of the word
                (.*$)} {

                lassign $rematch -> 1 word 
                lappend parts $1
            } 
            default {
                error [list {no match} $word]
            }
        }]
        while {$word ne {}} {
            set part {}
            switch -regexp -matchvar rematch $word $expression
        }
    }
    return $parts
}
======

[PYK] 2015-11-15:  Found and fixed four bugs in `wordparts`; one relating to an
array whose name is the empty string, one relating to an array whose index is
the empty string, and one relating to varying numbers of digits in Unicode
sequences. 

[PYK] 2015-12-02:  Fixed bug in `wordparts` variable detection regular expressions.



** `varparts` **

[PYK] 2015-11-15:  `varparts` splits a variable like `$myvar(some [[crazy hay]]
$here)` into its ''name'' and ''index'' parts.  If `wordparts` was used to get
the variable, the caller knows whether the variable name was enclosed in
braces, and thus whether to continue to parse the ''index'' value, which can
actually be done with `wordparts`.  This works because `wordparts` assumes that
whitespace in the value passed to it for parsing is part of the literal value of
the word.

======
proc varparts varspec {
    # varspec is already stripped of $ and braces
    set res {}
    if {[regexp {([^\)]*)(?:(\()(.*)(\)))?$} $varspec -> name ( index )]} {
        lappend res $name
    }
    if {${(} eq {(}} {
        lappend res $index
    }
    return $res
}
======

`varparts` is used in `[procstep%|%ycl proc step]` to fully intercept variable
substitution, including rewriting the command substitutions in the ''index''
part of variables.

And that's it, folks!  This page now contains all the parts needed for a
complete Tcl script lexer.


** `[ycl%|%exprlex]` and `[ycl%|%funclex]` **

`exprlex` splits an `[expr]` expression into its parts, and `funclex` does the
analogous thing for functions.  These are lexers, not parsers.  They don't
verify that the expressions are grammatically correct or that barewords are
acceptable values, but they are sufficient to locate values, variables,
functions, and scripts.  Both commands are available in [ycl%|%ycl parse tcl].

======none
# Captures a Tcl variable, or in the case of $name(index) syntax, the first part
# of a Tcl variable, up to and including the opening parenthesis, and also
# captures the remainder of the data into the second backreference.
variable exprvarre {^(\$(?:(?:::|[A-Za-z0-9_])+\(?)|{[^\}]*})(.*$)}
variable operre {[^[:alnum:]_[:space:]\$\[\]\{\}\"]+}
variable funcre {^([A-Za-z](?:[[:alnum:]_]+)\()(.*$)}

proc exprlex {expression args} {
    variable exprvarre
    variable funcre
    variable operre
    set args [dict merge [dict create flevel 0] $args[set args {}]]
    dict update args flevel flevel {}
    set res {}
    set idx 0
    while {$expression ne {}} {
        set part {}

        regexp {^(\s*)(.*$)} $expression -> white expression

        # Match Tcl variable
        if {[regexp $exprvarre $expression -> part expression]} {
            while {![info complete $part] && $expression ne {} && 
                [regexp (.)(.*$) $expression -> part1 expression]
            } {
                append part $part1
            }
            lappend res $part
            continue
        }

        # match Tcl brace, quotation, and command substitution
        if {[regexp {^(["\{[])(.*$)} $expression -> part expression]} {
            set delim [dict get [dict create \[ \] \" \" \{ \}] $part]
            while {![info complete $part] && $expression ne {}
                && [regexp "^(.*?$delim)(.*$)" $expression -> part1 expression]
            } {
                append part $part1
            }
            lappend res $part
            continue
        }

        # match function
        if {[regexp $funcre $expression -> part expression]} {
            set part1 [[
                lindex [info level 0] 0] $expression flevel [incr flevel]]
            lassign $part1[set part1 {}] part2 expression
            lappend res $part$part2
            incr flevel -1
            continue
        }

        #match end of function
        if {$flevel > 0 && [
            regexp {^(\))(.*$)} $expression -> part expression]} {
            lappend res $part
            return [list [join $res {}] $expression]
        }

        # match operator
        if {[regexp ^(${operre})(.*$) $expression -> part expression]} {
            lappend res $part
            continue
        }

        # Match bare value , which at this point is anything up to the next
        # operand .
        if {[regexp ^(?:(.*?)(?=($operre|\[\[:space:\]\]|$)))(.*$) \
            $expression -> part expression]} {

            lappend res $part
            continue
        }

        return -code error [list {bad expr} at $expression]
    }
    return $res
}

proc exprval part {
    variable exprvarre
    variable funcre
    set first [string index $part 0]
    if {$first eq  "\""} {
        return quoted
    }

    if {$first eq "\{"} {
        return braced
    }

    if {$first eq {[}} {
        return script
    }
    if {[regexp $exprvarre $part]} {
        return variable
    }
    if {[regexp $funcre $part]} {
        return function
    }
    return bare
}

proc funclex call {
    variable funcre
    if {![regexp $funcre $call -> name call]} {
        return {}
    }

    # Get rid of the matching open parenthesis .
    set name [string range $name[set name {}] 0 end-1]
    set res [exprlex [string range $call 0 end-1]]
    set res [lmap {arg comma} $res[set res {}] {
        lindex $arg
    }]
    return [list $name {*}$res]
}
======

[PYK] 2015-12-03:  Fixed bug in handling of variables with namespace
qualifiers.  Thanks to [aspect] for the report.  A little later fixed bug in
handling function calls within function calls.  Also reported by [aspect].


<<categories>> Parsing | Object Orientation | ycl component