Manual Reference Pages - tokenize (3fortran)

NAME

TOKENIZE(3) - [CHARACTER:PARSE] Parse a string into tokens

SYNOPSIS

TOKEN form (returns array of strings)

        subroutine tokenize(string, set, tokens [, separator])

         character(len=*),intent(in) :: string
         character(len=*),intent(in) :: set
         character(len=:),allocatable,intent(out) :: tokens(:)
         character(len=1),allocatable,intent(out),optional :: separator(:)

ARRAY BOUNDS form (returns arrays defining token positions)

        subroutine tokenize (string, set, first, last)

         character(len=*),intent(in) :: string
         character(len=*),intent(in) :: set
         integer,allocatable,intent(out) :: first(:)
         integer,allocatable,intent(out) :: last(:)

CHARACTERISTICS

o STRING - a scalar of type character. It is an INTENT(IN) argument.

o SET - a scalar of type character with the same kind type parameter as STRING. It is an INTENT(IN) argument.

o SEPARATOR - (optional) shall be of type character with the same kind type parameter as STRING. It is an INTENT(OUT) argument. It shall not be a coarray or a coindexed object.

o TOKENS - of type character with the same kind type parameter as STRING. It is an INTENT(OUT) argument. It shall not be a coarray or a coindexed object.

o FIRST,LAST - each is an allocatable array of type integer and rank one. Each is an INTENT(OUT) argument and shall not be a coarray or a coindexed object.

To reiterate, STRING, SET, TOKENS and SEPARATOR must all be of the same CHARACTER kind type parameter.

DESCRIPTION

TOKENIZE(3) parses a string into tokens. There are two forms of the subroutine TOKENIZE(3).

o The token form returns an array with one token per element, all of the same length as the longest token.

o The array bounds form returns two integer arrays. One contains the beginning position of the tokens and the other the end positions.

Because the token form pads every token to the length of the longest one, the original number of trailing spaces in each token, except for the longest, is lost.
The array bounds form retains the exact length of each token, even when the token ends in spaces.
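
The padding effect can be illustrated without calling TOKENIZE(3) itself. The following is a minimal sketch, assuming default-kind character data and bounds that are already known; the module name TOKENIZE_PADDING_SKETCH and the routine TOKENS_FROM_BOUNDS are hypothetical and are not part of the intrinsic.

```fortran
module tokenize_padding_sketch
   implicit none
   private
   public :: tokens_from_bounds
contains
   ! Hypothetical helper (not the intrinsic): builds a TOKENS-style array
   ! from known token bounds. Assumes at least one token is described.
   subroutine tokens_from_bounds(string, first, last, tokens)
      character(len=*), intent(in)               :: string
      integer, intent(in)                        :: first(:), last(:)
      character(len=:), allocatable, intent(out) :: tokens(:)
      integer :: i

      ! every element gets the length of the longest token
      allocate (character(len=maxval(last - first + 1)) :: tokens(size(first)))

      ! intrinsic assignment pads the shorter tokens with blanks
      do i = 1, size(first)
         tokens(i) = string(first(i):last(i))
      end do
   end subroutine tokens_from_bounds
end module tokenize_padding_sketch
```

With STRING 'a ,bcd', FIRST [1,4] and LAST [2,6], both elements of TOKENS have length 3, so the token 'a ' is stored as 'a' followed by two blanks and its one original trailing space cannot be told apart from padding. The expression LAST-FIRST+1 still yields the exact lengths 2 and 3.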

OPTIONS

o STRING : The string to parse into tokens.

o SET : Each character in SET is a token delimiter. A sequence of zero or more characters in STRING delimited by any token delimiter, or by the beginning or end of STRING, forms a token. Thus, two consecutive token delimiters in STRING, or a token delimiter in the first or last character of STRING, indicate a token with zero length.

o TOKENS : It shall be an allocatable array of rank one with deferred length. It is allocated with the lower bound equal to one and the upper bound equal to the number of tokens in STRING, and with character length equal to the length of the longest token.
The tokens in STRING are assigned in the order found, as if by intrinsic assignment, to the elements of TOKENS, in array element order.

o SEPARATOR : (optional) It shall be an allocatable array of rank one. It is allocated with the lower bound equal to one and the upper bound equal to one less than the number of tokens in STRING. Each element is assigned, in array element order, the delimiter that follows the corresponding token in STRING.

o FIRST : It shall be an allocatable array of type integer and rank one. It is an INTENT(OUT) argument. It shall not be a coarray or a coindexed object.
It is allocated with the lower bound equal to one and the upper bound equal to the number of tokens in STRING. Each element is assigned, in array element order, the starting position of each token in STRING, in the order found.
If a token has zero length, the starting position is equal to one if the token is at the beginning of STRING, and one greater than the position of the preceding delimiter otherwise.

o LAST : It is allocated with the lower bound equal to one and the upper bound equal to the number of tokens in STRING. Each element is assigned, in array element order, the ending position of each token in STRING, in the order found.
If a token has zero length, the ending position is one less than the starting position.
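
The rules for FIRST and LAST can be emulated with SCAN(3) on compilers that do not yet provide TOKENIZE(3). The following is a minimal sketch for default-kind character arguments; the module name TOKENIZE_BOUNDS_SKETCH and the routine BOUNDS_SKETCH are hypothetical, and the intrinsic may be implemented quite differently.

```fortran
module tokenize_bounds_sketch
   implicit none
   private
   public :: bounds_sketch
contains
   ! Hypothetical helper (not the intrinsic): computes FIRST and LAST
   ! following the rules described for the array bounds form.
   subroutine bounds_sketch(string, set, first, last)
      character(len=*), intent(in)      :: string
      character(len=*), intent(in)      :: set
      integer, allocatable, intent(out) :: first(:), last(:)
      integer :: i, n, start

      ! each delimiter ends one token; the end of STRING ends the last one
      n = 1
      do i = 1, len(string)
         if (scan(string(i:i), set) > 0) n = n + 1
      end do
      allocate (first(n), last(n))

      n = 1
      start = 1
      do i = 1, len(string)
         if (scan(string(i:i), set) > 0) then
            first(n) = start
            last(n) = i - 1   ! equals FIRST-1 for a zero-length token
            start = i + 1     ! next token begins after this delimiter
            n = n + 1
         end if
      end do
      first(n) = start
      last(n) = len(string)
   end subroutine bounds_sketch
end module tokenize_bounds_sketch
```

Calling BOUNDS_SKETCH with STRING 'first,second,,fourth' and SET ' ,' gives FIRST = 1,7,14,15 and LAST = 5,12,13,20. The third token is empty, so its ending position is one less than its starting position.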

EXAMPLES

Sample program:

        program demo_tokenize
        !use M_strings, only : tokenize=>split2020
        implicit none
        ! some useful formats
        character(len=*),parameter :: brackets='(*("[",g0,"]":,","))'
        character(len=*),parameter :: a_commas='(a,*(g0:,","))'
        character(len=*),parameter :: space='(*(g0:,1x))'
        character(len=*),parameter :: gen='(*(g0))'

        ! Execution of TOKEN form (return array of tokens)

        block
           character (len=:), allocatable :: string
           character (len=:), allocatable :: tokens(:)
           character (len=:), allocatable :: kludge(:)
           integer                        :: i
           string = '  first,second ,third       '
           call tokenize(string, set=';,', tokens=tokens )
           write(*,brackets)tokens

           string = '  first , second ,third       '
           call tokenize(string, set=' ,', tokens=tokens )
           write(*,brackets)(trim(tokens(i)),i=1,size(tokens))
           ! remove blank tokens
           ! <<<
           !tokens=pack(tokens, tokens /= '' )
           ! gfortran 13.1.0 bug -- concatenate //'' and use scratch
           ! variable KLUDGE. JSU: 2024-08-18
           kludge=pack(tokens//'', tokens /= '' )
           ! >>>
           write(*,brackets)kludge

        endblock

        ! Execution of BOUNDS form (return position of tokens)

        block
           character (len=:), allocatable :: string
           character (len=*),parameter :: set = " ,"
           integer, allocatable        :: first(:), last(:)
           write(*,gen)repeat('1234567890',6)
           string = 'first,second,,fourth'
           write(*,gen)string
           call tokenize (string, set, first, last)
           write(*,a_commas)'FIRST=',first
           write(*,a_commas)'LAST=',last
           write(*,a_commas)'HAS LENGTH=',last.ge.first
        endblock

        end program demo_tokenize

Results:

     > [  first     ],[second      ],[third       ]
     > [],[first],[],[],[second],[],[third],[],[],[],[],[]
     > [first ],[second],[third ]
     > 123456789012345678901234567890123456789012345678901234567890
     > first,second,,fourth
     > FIRST=1,7,14,15
     > LAST=5,12,13,20
     > HAS LENGTH=T,T,F,T

STANDARD

Fortran 2023

SEE ALSO

o SPLIT(3) - Return tokens from a string, one at a time

o INDEX(3) - Position of a substring within a string

o SCAN(3) - Scan a string for the presence of a set of characters

o VERIFY(3) - Position of a character in a string of characters that does not appear in a given set of characters