Manual Reference Pages - utf8_to_codepoints (3m_unicode)

NAME

UTF8_TO_CODEPOINTS(3f) - [M_unicode:CONVERSION] Convert UTF-8-encoded data to Unicode codepoints (LICENSE:MIT)

Synopsis
Characteristics
Description
Options
Examples
See Also
Author
License

SYNOPSIS

pure subroutine utf8_to_codepoints(utf8,codepoints,nerr)

    character(len=1),intent(in)     :: utf8(:)
    !  or
    character(len=*),intent(in)     :: utf8
    !
    integer,allocatable,intent(out) :: codepoints(:)
    integer,intent(out)             :: nerr

CHARACTERISTICS

o UTF8 is a scalar CHARACTER variable or array of single-byte CHARACTER values

o the returned values in CODEPOINTS are of default INTEGER kind

o the error flag NERR is of default INTEGER kind

DESCRIPTION

UTF8_TO_CODEPOINTS(3f) takes either a scalar CHARACTER variable or an array of CHARACTER(LEN=1) values, treats it as a stream of bytes containing UTF-8-encoded data, and returns an INTEGER array holding the Unicode codepoint value of each glyph in the stream.
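To make the byte-stream-to-codepoint step concrete, here is a minimal sketch of a UTF-8 decoder. It is not the M_unicode implementation: the module name UTF8_DECODE_SKETCH and the routine DECODE_UTF8_SKETCH are hypothetical, the argument list simply mirrors the synopsis, and counting malformed sequences in NERR is an assumption made for illustration. The sketch does not reject overlong encodings or surrogate values.

```fortran
module utf8_decode_sketch
! Illustrative only: a hypothetical decoder, NOT the M_unicode source.
implicit none
private
public :: decode_utf8_sketch
contains
pure subroutine decode_utf8_sketch(utf8, codepoints, nerr)
character(len=*), intent(in)      :: utf8
integer, allocatable, intent(out) :: codepoints(:)
integer, intent(out)              :: nerr   ! assumption: count of malformed sequences
integer :: buf(len(utf8))                   ! at most one codepoint per input byte
integer :: i, k, n, b, nfollow, cp
logical :: ok
   nerr = 0
   n = 0
   i = 1
   do while (i <= len(utf8))
      b = iand(ichar(utf8(i:i)), 255)       ! byte value in the range 0..255
      if (b < 128) then                     ! 0xxxxxxx: single-byte (ASCII)
         cp = b
         nfollow = 0
      else if (b >= 192 .and. b < 224) then ! 110xxxxx: one continuation byte follows
         cp = iand(b, 31)
         nfollow = 1
      else if (b >= 224 .and. b < 240) then ! 1110xxxx: two continuation bytes follow
         cp = iand(b, 15)
         nfollow = 2
      else if (b >= 240 .and. b < 248) then ! 11110xxx: three continuation bytes follow
         cp = iand(b, 7)
         nfollow = 3
      else                                  ! stray continuation or invalid lead byte
         nerr = nerr + 1
         i = i + 1
         cycle
      end if
      if (i + nfollow > len(utf8)) then     ! sequence runs past the end of the data
         nerr = nerr + 1
         exit
      end if
      ok = .true.
      do k = 1, nfollow
         b = iand(ichar(utf8(i+k:i+k)), 255)
         if (iand(b, 192) /= 128) then      ! continuation bytes must be 10xxxxxx
            ok = .false.
            exit
         end if
         cp = ior(ishft(cp, 6), iand(b, 63)) ! append the six payload bits
      end do
      if (ok) then
         n = n + 1
         buf(n) = cp
         i = i + nfollow + 1
      else                                  ! skip the bad lead byte and resynchronize
         nerr = nerr + 1
         i = i + 1
      end if
   end do
   codepoints = buf(1:n)
end subroutine decode_utf8_sketch
end module utf8_decode_sketch
```

With this sketch, the three bytes with decimal values 226, 128 and 153 (hexadecimal E2 80 99) decode to the single codepoint 8217 (hexadecimal 2019), which is why a string can be longer in bytes than in glyphs.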

OPTIONS

o UTF8 : Scalar CHARACTER string or array of CHARACTER(LEN=1) values, treated as a stream of bytes containing UTF-8-encoded text.

o CODEPOINTS : An INTEGER array of Unicode codepoint values representing the glyphs found in UTF8

o NERR : Zero if no error occurred. If nonzero, the stream of bytes could not be completely converted to Unicode codepoints.
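A caller will usually want to test NERR before trusting CODEPOINTS. The following is a minimal sketch that uses only the interface shown in the synopsis; the program name and the choice of test data are illustrative. CHAR(128) is a lone continuation byte, which is not valid UTF-8 on its own, but the exact value returned in NERR and the contents of CODEPOINTS after a failed conversion are not specified here, so the sketch only tests for a nonzero NERR.

```fortran
program check_nerr_sketch
use m_unicode, only : utf8_to_codepoints
implicit none
integer, allocatable :: codepoints(:)
integer              :: nerr
   ! a lone continuation byte in the middle of otherwise plain ASCII text
   call utf8_to_codepoints('abc'//char(128)//'def', codepoints, nerr)
   if (nerr /= 0) then
      ! the conversion was incomplete; do not rely on CODEPOINTS
      write(*,'(a,i0)') 'conversion incomplete, nerr=', nerr
   else
      write(*,'(*(g0,1x))') 'codepoints:', codepoints
   end if
end program check_nerr_sketch
```

Building this program requires the M_unicode module.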

EXAMPLES

Sample program

   program demo_utf8_to_codepoints
   use m_unicode, only : utf8_to_codepoints
   implicit none
   character(len=*),parameter   :: string='Noho me ka hau’oli' ! (Be happy)
   character(len=1),allocatable :: bytes(:)
   character(len=*),parameter   :: solid='(*(g0))'
   character(len=*),parameter   :: space='(*(g0,1x))'
   character(len=*),parameter   :: z='(a,*(z0,1x))'
   integer,allocatable          :: codepoints(:)
   integer                      :: nerr
   integer                      :: i
   ! BASIC USAGE: SCALAR CHARACTER VARIABLE
     write(*,solid)'STRING:',string
     call utf8_to_codepoints(string,codepoints,nerr)
     write(*,space)'CODEPOINTS:', codepoints
     write(*,z)'HEXADECIMAL CODEPOINTS:', codepoints
   !
     write(*,space)'How long is this string in glyphs? '
     write(*,space)size(codepoints)
     write(*,space)'How long is this string in bytes? '
     write(*,space)len(string)
   !
   ! BASIC USAGE: ARRAY OF BYTES
     bytes=[(string(i:i),i=1,len(string))]
     write(*,solid)'STRING:',bytes
     call utf8_to_codepoints(bytes,codepoints,nerr)
     write(*,space)'CODEPOINTS:', codepoints
     write(*,z)'HEXADECIMAL CODEPOINTS:', codepoints
   !
     write(*,space)'How long is this string in glyphs? '
     write(*,space)size(codepoints)
     write(*,space)'How long is this string in bytes? '
     write(*,space)size(bytes)
   !
   end program demo_utf8_to_codepoints

Results:

    > STRING:Noho me ka hau’oli
    > CODEPOINTS: 78 111 104 111 32 109 101 32 107 97 32 104 97 117 ...
    > 8217 111 108 105
    > HEXADECIMAL CODEPOINTS:4E 6F 68 6F 20 6D 65 20 6B 61 20 68 61 75 2019 6F 6C 69
    > How long is this string in glyphs?
    > 18
    > How long is this string in bytes?
    > 20
    > STRING:Noho me ka hau’oli
    > CODEPOINTS: 78 111 104 111 32 109 101 32 107 97 32 104 97 117 ...
    > 8217 111 108 105
    > HEXADECIMAL CODEPOINTS:4E 6F 68 6F 20 6D 65 20 6B 61 20 68 61 75 2019 6F 6C 69
    > How long is this string in glyphs?
    > 18
    > How long is this string in bytes?
    > 20

SEE ALSO

o elemental: adjustl(3), adjustr(3), index(3), scan(3), verify(3)

o non-elemental: len_trim(3), repeat(3), trim(3), codepoints_to_utf8(3)

AUTHOR

o John S. Urban

o Francois Jacq - enhancements and optional Latin support, 2025-08

LICENSE

MIT