Split On Whitespace

Created by [CecilWesterhof].

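Often I want to split a string on repeating white-space. The normal split function does not do what I want, because it treats every single white-space character as a separator and returns an empty element between each pair of adjacent separators. For example: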

======
split "   To   show    the   problem.   "
======

gives:

======
{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}
======

What I want is:

======
To show the problem.
======


That is why I created the following proc:

======
# A split that works on repeating white-space
# With:
#     splitOnWhiteSpace "   To   show    the   problem.   "
# You get:
#     "To show the problem."
# instead of:
#     "{} {} {} To {} {} show {} {} {} the {} {} problem. {} {} {}"
# With min/max you can verify the number of elements
# I prefer the regexp version, but
# the other version could take about 55% of the time
# That is why you can use fast to go for the fast version
proc splitOnWhiteSpace {value {min -1} {max -1} {fast False}} {
    if {!([string is integer -strict ${min}] && [string is integer -strict ${max}])} {
        error "min and max should both be integers (${min}, ${max})"
    }
    if {(${min} < -1) || (${max} < -1)} {
        error "min and max should both be >= -1 (${min}, ${max})"
    }
    if {(${max} != -1) && (${max} < ${min})} {
        error "min should be <= max (${min}, ${max})"
    }
    if {${fast}} {
        set splitLst [list {*}[string map {
            \{ \\\{
            \" \\\"
            \\ \\\\
        } ${value}]]
    } else {
        set splitLst [regexp -all -inline {\S+} ${value}]
    }
    if {${min} != -1} {
        if {${max} == -1} {
            set max ${min}
        }
        set length [llength ${splitLst}]
        if {(${length} < ${min}) || (${length} > ${max})} {
            if {${min} == ${max}} {
                set msgEnd "${min} values"
            } else {
                set msgEnd "between ${min} and ${max} values"
            }
            error "'${value}' contains ${length} instead of ${msgEnd}"
        }
    }
    return ${splitLst}
}
======

With this I get:

======
To show the problem.
======

Besides splitting on repeating white-space, it can also check the number of elements. If only min is given, the result must contain exactly that many elements. For example:

======
splitOnWhiteSpace "Just a test." 4
======

gives:

======
'Just a test.' contains 3 instead of 4 values
======

and:

======
splitOnWhiteSpace "Just a test." 4 5
======

gives:

======
'Just a test.' contains 3 instead of between 4 and 5 values
======

----

As always: comments, tips and questions are appreciated.

----
StephanKuhagen:

About four times faster compared to the regexp-line: 
======
list {*}[string map {\{ \\\{} $value]
======

The string map is needed to avoid unmatched open braces in lists. If you know that there will never be an opening brace in your inputs, you can drop the string map and get it even faster.
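
A minimal sketch of why the escaping matters, using a made-up input with a lone opening brace: plain list parsing raises an error, while the mapped string parses into three words. A double quote at the start of a word and a backslash are also special to the list parser, which is presumably why the fast branch of the proc above maps those as well.

======
# Hypothetical input containing an unmatched opening brace.
set value "a \{ b"

# Plain list parsing fails on the lone brace; catch returns 1 on error.
set failed [catch {list {*}$value} msg]
puts "plain parse failed: $failed ($msg)"

# Escaping the brace first makes the string a well-formed list.
set words [list {*}[string map {\{ \\\{} $value]]
puts "words: [llength $words], second word: [lindex $words 1]"
======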

[CecilWesterhof]

Thanks, I implemented it. For the curious, originally I used:

======
set splitLst [regexp -all -inline {\S+} ${value}]
======
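
The speed figures on this page differ (about 55% of the time versus about four times faster), and they will depend on the Tcl version and on the input. A rough sketch for measuring both variants yourself with the built-in `time` command follows; the proc names `splitRegexp` and `splitList` and the iteration count are only illustrative.

======
# Illustrative wrappers around the two approaches discussed above.
proc splitRegexp {value} {
    return [regexp -all -inline {\S+} $value]
}
proc splitList {value} {
    return [list {*}[string map {\{ \\\{ \" \\\" \\ \\\\} $value]]
}

set value "   To   show    the   problem.   "

# time reports the average number of microseconds per iteration.
foreach cmd {splitRegexp splitList} {
    puts [format "%-12s %s" $cmd [time {$cmd $value} 100000]]
}
======

Wrapping each variant in a proc lets Tcl byte-compile it, which tends to give more representative numbers than timing a script at global level.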

----

[PYK] 2018-06-07:  `[ycl%|%ycl string delimit]` is a more general routine for performing this type of task.

----
'''[gerhardr] - 2018-06-11 14:41:59'''

Just for completeness, the http://core.tcl.tk/tcllib/doc/tcllib-1-19/embedded/www/tcllib/files/modules/textutil/textutil_split.html%|%Tcllib solution%|%.

It is considerably slower (about a factor of 3 in my tests), but it is a more general solution because it can split on a regular expression. Note that it keeps the empty elements caused by leading and trailing white-space, so the string has to be trimmed first.
Maybe it's also a motivation to improve the Tcllib method.
 % package require textutil
 0.8
 % set str        "   To   show    the   problem.   "
    To   show    the   problem.   
 % textutil::splitx $str
 {} To show the problem. {}
 % textutil::splitx [string trim $str]
 To show the problem.
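
As a small sketch of the "regexps as split object" point, assuming Tcllib is installed: `textutil::splitx` takes an optional second argument with the regular expression to split on. The comma-separated input below is made up for illustration.

======
package require textutil

# Hypothetical input with uneven spacing around the commas.
set csv "To , show,the ,  problem."

# Split on a comma together with any surrounding white-space.
set words [textutil::splitx $csv {\s*,\s*}]
puts $words
======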

<<categories>>Example | String | Utilities