NLP: Non-Linear Procedures

Overview

This is the API documentation for all nonlinear procedures.


Hierarchical Clustering

fsml_hclust

Description

The procedure is an implementation of the agglomerative hierarchical clustering method that groups data points into clusters by iteratively merging the most similar clusters. The procedure uses centroid linkage and the Mahalanobis distance as a measure of similarity.

The input matrix (x) holds observations in rows (nd) and variables in columns (nv). The target number of clusters (nc) must be at least 1 and not greater than the number of data points.

The variables are standardised before computing the covariance matrix on the transformed data. The matrix is used for calculating the Mahalanobis distance.

Clusters are merged iteratively until the target number of clusters is reached. The global mean (gm), cluster centroids (cm), membership assignments (cl), and cluster sizes (cc), the covariance matrix (cov) and standard deviations (sigma) used in the distance calculations are returned.

Note: This procedure uses the pure procedure for calculating the Mahalanobis distance f_lin_mahalanobis_core, which useschol from the stdlib_linalg module.

Note

The current implementation of the procedure is exact, but very slow for large nd.

Syntax

call fsml_hclust(x, nd, nv, nc, gm, cm, cl, cc, cov, sigma)

Parameters

x: A rank-2 array of type real with dimensions nd, nv.

nd: A scalar of type integer.

nv: A scalar of type integer.

nc: A scalar of type integer.

Invalid argument values will result in the return of a sentinel value.

Returns

gm: A rank-1 array of type real with dimension nv.

cm: A rank-2 array of type real with dimensions nv, nv.

cl: A rank-1 array of type integer with dimension nd.

cc: A rank-1 array of type integer with dimension nc.

cov: A rank-2 array of type real with dimensions nv, nv.

sigma: A rank-1 array of type real with dimension nv.


K-Means Clustering

fsml_kmeans

Description

The procedure implements the k-means clustering algorithm using the Mahalanobis distance as the similarity measure. It accepts initial centroids (cm_in), refines them iteratively, and returns the final centroids (cm).

The input matrix (x) holds observations in rows (nd) and variables in columns (nv). The number of clusters (nc) must be at least 1 and not greater than the number of data points. The procedure assigns each observation to the nearest centroid using the Mahalanobis distance, recomputes centroids from cluster memberships, and iterates until convergence or the iteration limit is reached. Final centroids are sorted by the first variable, and assignments are updated accordingly.

If the covariance matrix (cov_in) is passed, it will be used to calculate the Mahalanobis distance. If it is not passed, the variables are standardised before computing the covariance matrix on the transformed data.

The global mean (gm), cluster centroids (cm), membership assignments (cl), and cluster sizes (cc), the covariance matrix (cov - either cov_in if passed, or internally calculated if cov_in is not passed) and standard deviations (sigma) used in the distance calculations are returned.

Note: This procedure uses the pure procedure for calculating the Mahalanobis distance f_lin_mahalanobis_core, which useschol from the stdlib_linalg module.

Syntax

call fsml_kmeans(x, nd, nv, nc, cm_in, gm, cm, cl, cc, cov, sigma, [cov_in])

Parameters

x: A rank-2 array of type real with dimensions nd, nv.

nd: A scalar of type integer.

nv: A scalar of type integer.

nc: A scalar of type integer.

cm_in: A rank-2 array of type real with dimensions nv, nc.

cov_in (optional): A rank-2 array of type real with dimensions nv, nv.

Invalid argument values will result in the return of a sentinel value.

Returns

gm: A rank-1 array of type real with dimension nv.

cm: A rank-2 array of type real with dimensions nv, nc.

cl: A rank-1 array of type integer with dimension nd.

cc: A rank-1 array of type integer with dimension nc.

cov: A rank-2 array of type real with dimensions nv, nv.

sigma: A rank-1 array of type real with dimension nv.


Hybrid Hierarchical/K-Means Clustering

fsml_hkmeans

Description

The procedure implements a hybrid clustering approach combining agglomerative hierarchical clustering and k-means clustering, both using the Mahalanobis distance as the similarity measure. The hierarchical step first partitions the data into nc clusters by iteratively merging the most similar clusters. The resulting centroids from are then used as initial centroids (cm_in) for the k-means procedure, which refines them iteratively.

The input matrix (x) holds observations in rows (nd) and variables in columns (nv). The number of clusters (nc) must be at least 1 and not greater than the number of data points. In the hierarchical clustering step, variables are standardised before computing the covariance matrix on the transformed data. The covariance matrix is passed to the k-means clustering procedure along with the initial cluster centroids. The k-means clustering step then assigns each observation to the nearest centroid, recomputes centroids from cluster memberships, and iterates until convergence or the iteration limit is reached. Final centroids are sorted by the first variable, and assignments are updated accordingly.

The global mean (gm), final cluster centroids (cm), membership assignments (cl), and cluster sizes (cc), the covariance matrix (cov) and standard deviations (sigma) used in the distance calculations are returned.

Note: This procedure uses the pure procedure for calculating the Mahalanobis distance f_lin_mahalanobis_core, which useschol from the stdlib_linalg module.

Syntax

call fsml_hkmeans(x, nd, nv, nc, gm, cm, cl, cc, cov, sigma)

Parameters

x: A rank-2 array of type real with dimensions nd, nv.

nd: A scalar of type integer. It must be at least 1.

nv: A scalar of type integer. It must be at least 1.

nc: A scalar of type integer. It must be at least 1 and not greater than nd.

Invalid argument values will result in the return of a sentinel value.

Returns

gm: A rank-1 array of type real with dimension nv.

cm: A rank-2 array of type real with dimensions nv, nc.

cl: A rank-1 array of type integer with dimension nd.

cc: A rank-1 array of type integer with dimension nc.

cov: A rank-2 array of type real with dimensions nv, nv.

sigma: A rank-1 array of type real with dimension nv.


Examples

program example_nlp

! |--------------------------------------------------------------------|
! | fsml - fortran statistics and machine learning library             |
! |                                                                    |
! | about                                                              |
! | -----                                                              |
! | Examples for nonlinear procedures (nlp module).                    |
! |                                                                    |
! | license : MIT                                                      |
! | author  : Sebastian G. Mutz (sebastian@sebastianmutz.com)          |
! |--------------------------------------------------------------------|

  use :: fsml
  use :: fsml_ini ! import wp; alternatively: iso_fortran_env, wp => real64

  implicit none

! ==== Description
!! The programme demonstrates the use of nonlinear procedures.

! ==== Declarations
  integer(i4), parameter :: nd = 100 ! number of data points
  integer(i4), parameter :: nv = 5   ! number of variables ("features")
  integer(i4), parameter :: nc = 4   ! number of target clusters for hcluster
  real(wp) :: x(nd, nv)              ! data
  real(wp) :: gm(nv)                 ! global mean per variable
  real(wp) :: cm_h(nv, nc)           ! cluster centroids (hcluster)
  real(wp) :: cm_k(nv, nc)           ! cluster centroids (k-means)
  real(wp) :: cov(nv, nv)            ! covariance matrix (hcluster)
  real(wp) :: cov_k(nv, nv)
  real(wp) :: sigma(nv)              ! standard dev per variable
  integer(i4) :: cl(nd)              ! cluster assignment per data point
  integer(i4) :: cc(nc)              ! cluster sizes
  integer(i4) :: i, j                ! for loops

! ==== Instructions

  ! create data
  do i = 1, nd
     do j = 1, nv
        x(i,j) = sin(real(i*j, wp))  ! example data, replace with real input
     enddo
  enddo

! ---- hierarchical clustering

  ! conduct hierarchical clustering
  call fsml_hclust(x, nd, nv, nc, gm, cm_h, cl, cc, cov, sigma)
  write(*,'(A)') "> hierarchical cluster analysis"
  print*
  write(*,'(A)')  "  cluster centroids:  "
  do j = 1, nc
    write(*,'(A,I2,A,I2,A,5F12.6)'), "  cluster", j, " (size: ", cc(j), ")", cm_h(1:nv, j)
  enddo
  print*
  write(*,'(A)'), '  global means (gm):'
  write(*,'(5F12.6)'), gm(1:nv)
  print*
  write(*,'(A)'), '  covariance matrix:'
  do i = 1, nv
     write(*,'(5F12.6)'), cov(i, 1:nv)
  enddo
  print*
  write(*,'(A)'), '  standard deviations (sigma):'
  write(*,'(5F12.6)'), sigma(1:nv)
  print*
  ! cluster centroids:
  ! cluster 1 (size: 10)    1.114982    1.280285    0.426739   -0.620822   -0.979638
  ! cluster 2 (size: 74)    0.046810   -0.284541   -0.044039   -0.064682    0.347077
  ! cluster 3 (size:  9)   -0.540211    0.961225   -1.147117    1.125295   -0.846290
  ! cluster 4 (size:  7)   -1.393120   -0.056833    1.330794    0.123859   -1.181532
  !
  ! global means (gm):
  !  -0.001272   -0.002720   -0.004636   -0.007745   -0.014948
  !
  ! covariance matrix:
  !   1.000000    0.000134    0.000297    0.000800    0.005803
  !   0.000134    1.000000    0.000932    0.006108   -0.003596
  !   0.000297    0.000932    1.000000   -0.003435   -0.002111
  !   0.000800    0.006108   -0.003435    1.000000   -0.001955
  !   0.005803   -0.003596   -0.002111   -0.001955    1.000000
  !
  ! standard deviations (sigma):
  !   0.712573    0.712679    0.714744    0.711827    0.711722


! ---- k-means clustering

  ! conduct k-means clustering with clusters and covariance matrix from hcluster
  call fsml_kmeans(x, nd, nv, nc, cm_h, gm, cm_k, cl, cc, cov_k, sigma, cov_in=cov)
  write(*,'(A)') "> k-means cluster analysis"
  print*
  write(*,'(A)')  "  cluster centroids:  "
  do j = 1, nc
    write(*,'(A,I2,A,I2,A,5F12.6)'), "  cluster", j, " (size: ", cc(j), ")", cm_k(1:nv, j)
  enddo
  print*
  write(*,'(A)'), '  global means (gm):'
  write(*,'(5F12.6)'), gm(1:nv)
  print*
  write(*,'(A)'), '  covariance matrix:'
  do i = 1, nv
     write(*,'(5F12.6)'), cov_k(i, 1:nv)
  enddo
  print*
  write(*,'(A)'), '  standard deviations (sigma):'
  write(*,'(5F12.6)'), sigma(1:nv)
  print*
  ! cluster centroids:
  ! cluster 1 (size: 20)    0.897497    1.076016    0.566456    0.025259   -0.064641
  ! cluster 2 (size: 50)    0.268886   -0.806361   -0.226511   -0.142963    0.169572
  ! cluster 3 (size: 16)   -0.772046    1.132262   -0.927534    0.439275    0.033617
  ! cluster 4 (size: 14)   -1.360107    0.048683    1.059784   -0.027532   -0.551692
  !
  ! global means (gm):
  ! -0.001272   -0.002720   -0.004636   -0.007745   -0.014948
  !
  ! covariance matrix:
  !   1.000000    0.000134    0.000297    0.000800    0.005803
  !   0.000134    1.000000    0.000932    0.006108   -0.003596
  !   0.000297    0.000932    1.000000   -0.003435   -0.002111
  !   0.000800    0.006108   -0.003435    1.000000   -0.001955
  !   0.005803   -0.003596   -0.002111   -0.001955    1.000000
  !
  ! standard deviations (sigma):
  !   0.712573    0.712679    0.714744    0.711827    0.711722


! ---- hybrid hierarchical - k-means clustering

  ! conduct clustering (combines 2 steps above)
  call fsml_hkmeans(x, nd, nv, nc, gm, cm_k, cl, cc, cov_k, sigma)
  write(*,'(A)') "> hierarchical clustering with k-means refinements"
  print*
  write(*,'(A)')  "  cluster centroids:  "
  do j = 1, nc
    write(*,'(A,I2,A,I2,A,5F12.6)'), "  cluster", j, " (size: ", cc(j), ")", cm_k(1:nv, j)
  enddo
  print*
  write(*,'(A)'), '  global means (gm):'
  write(*,'(5F12.6)'), gm(1:nv)
  print*
  write(*,'(A)'), '  covariance matrix:'
  do i = 1, nv
     write(*,'(5F12.6)'), cov_k(i, 1:nv)
  enddo
  print*
  write(*,'(A)'), '  standard deviations (sigma):'
  write(*,'(5F12.6)'), sigma(1:nv)
  print*
  ! cluster centroids:
  ! cluster 1 (size: 20)    0.897497    1.076016    0.566456    0.025259   -0.064641
  ! cluster 2 (size: 50)    0.268886   -0.806361   -0.226511   -0.142963    0.169572
  ! cluster 3 (size: 16)   -0.772046    1.132262   -0.927534    0.439275    0.033617
  ! cluster 4 (size: 14)   -1.360107    0.048683    1.059784   -0.027532   -0.551692
  !
  ! global means (gm):
  ! -0.001272   -0.002720   -0.004636   -0.007745   -0.014948
  !
  ! covariance matrix:
  !   1.000000    0.000134    0.000297    0.000800    0.005803
  !   0.000134    1.000000    0.000932    0.006108   -0.003596
  !   0.000297    0.000932    1.000000   -0.003435   -0.002111
  !   0.000800    0.006108   -0.003435    1.000000   -0.001955
  !   0.005803   -0.003596   -0.002111   -0.001955    1.000000
  !
  ! standard deviations (sigma):
  !   0.712573    0.712679    0.714744    0.711827    0.711722

end program example_nlp