This is the API documentation for all linear procedures, i.e., procedures that rely heavily on linear algebra.
fsml_pca
Principal Component Analysis (PCA) is a procedure to reduce the dimensionality of multivariate data by identifying a set of orthogonal vectors (eigenvectors) that represent directions of maximum variance in the dataset.
For a classic PCA, the input matrix (x) holds data or observations in rows (nd) and different variables in columns (nv).
The procedure fsml_pca is a wrapper for fsml_eof and offers a simpler, more familiar interface for non-geoscientists. The EOF interface allows additional options to be passed that are irrelevant to standard applications of PCA.
The PCA procedure calls the EOF procedures with weights (wt) set to 1.0 and the matrix option set to opt = 0 to force the use of the covariance matrix, making it comparable to other common implementations of a PCA (e.g., Python's scikit-learn).
The covariance matrix is computed as:

$$\mathbf{C} = \frac{1}{n - 1} \mathbf{X}^\top \mathbf{X}$$

where $\mathbf{X}$ is the preprocessed (centred and optionally standardised) data matrix, and $n$ is the number of observations (rows in x).
A symmetric eigen-decomposition is then performed:

$$\mathbf{C} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top$$

where $\mathbf{V}$ contains the EOFs (ev), and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues (ew).
The principal components or scores (PCs, pc) are given by:

$$\mathbf{P} = \mathbf{X} \mathbf{V}$$
The number of valid PC modes is determined by the number of non-zero eigenvalues.
Arrays are initialised to zero and populated only where eigenvalues are strictly positive.
The explained variance (r2) for each component is computed as a fraction:

$$r^2_i = \frac{\lambda_i}{\sum_{j} \lambda_j}$$

where $i$ is the PC index, and $j$ spans all retained eigenvalues, representing all principal components that explain variability in the data.
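For instance, with the eigenvalues from the example programme at the bottom of this page, the fraction for the first component works out as:

$$r^2_1 = \frac{0.21204}{0.21204 + 0.06368 + 0.04128} = \frac{0.21204}{0.31700} \approx 0.66888$$

which matches the explained variance reported in the example output.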
Note: This procedure uses eigh from the stdlib_linalg module to compute eigenvalues and eigenvectors of the symmetric covariance matrix.
Note
The procedure has no pure equivalent, because it relies on an impure procedure for eigen-decomposition (eigh).
call fsml_pca(x, nd, nv, pc, ev, ew [, r2])
x: A rank-2 array of type real with dimensions nd, nv.
nd: A scalar of type integer.
nv: A scalar of type integer.
Invalid argument values will result in the return of a sentinel value.
pc: A rank-2 array of type real (same type as x) with dimensions nd, nv.
ev: A rank-2 array of type real (same type as x) with dimensions nv, nv.
ew: A rank-1 array of type real. It stores the eigenvalues.
r2: An optional return and rank-1 array of type real.
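A minimal usage sketch (the data values are illustrative; wp is the working-precision kind imported from fsml_ini, as in the full example programme at the bottom of this page):

program pca_sketch
  use :: fsml
  use :: fsml_ini ! import wp; alternatively: iso_fortran_env, wp => real64
  implicit none
  ! two nearly collinear variables; most variance falls on the first mode
  real(wp) :: x(4,2), pc(4,2), ev(2,2), ew(2), r2(2)
  x = reshape([1.0_wp, 2.0_wp, 3.0_wp, 4.0_wp, &
             & 2.0_wp, 4.1_wp, 5.9_wp, 8.2_wp], shape=[4,2])
  call fsml_pca(x, 4, 2, pc, ev, ew, r2=r2)
  write(*,'(A,2F10.5)') "explained variance: ", r2 ! first mode dominates
end program pca_sketch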
fsml_eof
Empirical Orthogonal Function (EOF) analysis is a procedure to reduce the dimensionality of multivariate data by identifying a set of orthogonal vectors (EOFs or eigenvectors) that represent directions of maximum variance in the dataset.
The term EOF analysis is often used interchangeably with the geographically weighted principal component analysis (PCA). The procedures are mathematically equivalent, but procedures for EOF analysis offer some additional options that are mostly relevant for geoscience. The procedure fsml_pca is a wrapper for fsml_eof that offers a simpler, more familiar interface for non-geoscientists.
For a classic EOF analysis, the input matrix x holds data or observations that have been discretised in time and space. Rows (nd) and columns (nv) can therefore be interpreted as time and space dimensions, respectively. EOF analysis allows for geographical weighting, which translates to column-wise weighting prior to analysis in the procedure.
Weights can be set by passing the rank-1 array wt of dimension nv. If this optional argument is not passed, the procedure will default to equal weights of value 1/n. This default is numerically more stable than 1.0, which is the default for many implementations of a PCA.
After the weighting is applied, the covariance or correlation matrix is computed:

$$\mathbf{C} = \frac{1}{n - 1} \mathbf{X}^\top \mathbf{X}$$

where $\mathbf{X}$ is the preprocessed (centred and optionally standardised) data matrix, and $n$ is the number of observations (rows nd in x).
The value of the optional argument opt determines if the covariance matrix (opt = 0) or correlation matrix (opt = 1) is constructed. If the argument is not passed, the procedure will default to the use of the covariance matrix, as is the standard for a regular PCA.
A symmetric eigen-decomposition is then performed:

$$\mathbf{C} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^\top$$

where $\mathbf{V}$ contains the EOFs (eof), and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues (ew).
The principal components or scores (PCs, pc) are given by:

$$\mathbf{P} = \mathbf{X} \mathbf{V}$$
The number of valid EOF/PC modes is determined by the number of non-zero eigenvalues.
Arrays are initialised to zero and populated only where eigenvalues are strictly positive.
The explained variance (r2) for each component is computed as a fraction:

$$r^2_i = \frac{\lambda_i}{\sum_{j} \lambda_j}$$

where $i$ is the PC index, and $j$ spans all retained eigenvalues, representing all principal components that explain variability in the data.
EOFs may optionally be scaled (eof_scaled) for more convenient plotting:

$$\tilde{v}_{ij} = v_{ij} \sqrt{\lambda_j}$$

i.e., each element of an EOF is multiplied by the square root of the associated eigenvalue.
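As a check against the example output at the bottom of this page: the first element of the first EOF is 0.28392 and the first eigenvalue is 0.21204, so the scaled value is $0.28392 \times \sqrt{0.21204} \approx 0.13074$, which matches the first element of the scaled EOFs.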
Note: This procedure uses eigh from the stdlib_linalg module to compute eigenvalues and eigenvectors of the symmetric covariance or correlation matrix.
Note
The procedure has no pure equivalent, because it relies on an impure procedure for eigen-decomposition (eigh).
call fsml_eof(x, nd, nv, pc, eof, ew [, opt, wt, r2, eof_scaled])
x: A rank-2 array of type real with dimensions nd, nv.
nd: A scalar of type integer.
nv: A scalar of type integer.
opt: An optional argument and scalar of type integer. It must be 0 or 1. If not passed, it will default to 0.
wt: An optional argument and rank-1 array of type real. If not passed, it will default to equal weights of value 1/n.
Invalid argument values will result in the return of a sentinel value.
pc: A rank-2 array of type real (same type as x) with dimensions nd, nv.
eof: A rank-2 array of type real (same type as x) with dimensions nv, nv.
ew: A rank-1 array of type real. It stores the eigenvalues.
r2: An optional return and rank-1 array of type real.
eof_scaled: An optional return and rank-2 array of type real (same type as x) with dimensions nv, nv.
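The returned r2 can be used to decide how many modes to retain. A minimal sketch, assuming modes are ordered by decreasing eigenvalue (as in the example output at the bottom of this page) and using an arbitrary illustrative threshold of 90% cumulative explained variance:

integer  :: k, nkeep
real(wp) :: cumvar
cumvar = 0.0_wp
nkeep  = 0
do k = 1, nv
  cumvar = cumvar + r2(k)
  nkeep  = k
  if (cumvar >= 0.9_wp) exit
end do
! the first nkeep columns of eof and pc then form the truncated model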
fsml_lda_2class
The 2-class multivariate Linear Discriminant Analysis (LDA) is a statistical procedure for classification and the investigation and explanation of differences between two groups (or classes) with regard to their attribute variables. It quantifies the discriminability of the groups and the contribution of each of the attribute variables to this discriminability.
The procedure finds a discriminant function that best separates the two groups. The function can be expressed as a linear combination of the attribute variables:

$$y = a_0 + a_1 x_1 + a_2 x_2 + \dots + a_m x_m$$

where $y$ is the discriminant function, $x_1, x_2, \dots, x_m$ are the attribute variables used in evaluating the differences between the groups, $a_1, a_2, \dots, a_m$ are the discriminant coefficients associated with each variable, $m$ (nv) is the number of variables, and $a_0$ is the y-intercept.
(Note: Mathematically, it is analogous to a multivariate linear regression function.)
Each attribute variable contains $n$ elements (x), where $n$ (nd) is the number of elements in each group. Each element $i$ is associated with a discriminant value described by:

$$y_i = a_0 + a_1 x_{1,i} + a_2 x_{2,i} + \dots + a_m x_{m,i}$$

Geometrically, this can be visualised as elements being projected on the discriminant axis $y$. The optimal discriminant function is then determined by finding an axis on which the projected elements for the two groups are best separated.
The best separation is given by maximising the discriminant criterion (g), a signal-to-noise ratio, so that:

$$g = \frac{\left(\bar{y}_A - \bar{y}_B\right)^2}{\sum_{i=1}^{n_A}\left(y_{A,i} - \bar{y}_A\right)^2 + \sum_{i=1}^{n_B}\left(y_{B,i} - \bar{y}_B\right)^2}$$

where $n_A$ and $n_B$ are the number of elements in groups $A$ and $B$, respectively. The procedure assumes that these are the same (nd) and only accepts 2 groups (nc = 2).
The discriminant coefficients are then standardised (sa) using the standard deviations of the respective variables. The discriminant function represents a model that best separates the groups and can be used as a classification model. The skill of that model is determined by forgetting the association of each element with the groups and using the model to reclassify the elements. The score (score) is the fraction of correct classifications and can be interpreted as a measure of how well the function works as a classification model.
The procedure optionally returns the Mahalanobis distance (mh) as a measure of distance between the groups.
Note: This procedure uses eigh from the stdlib_linalg module.
Note
The procedure has no pure equivalent, because it relies on an impure procedure for eigen-decomposition (eigh).
call fsml_lda_2class(x, nd, nv, nc, sa, g, score [, mh])
x: A rank-3 array of type real with dimensions nd, nv, nc.
nd: A scalar of type integer.
nv: A scalar of type integer.
nc: A scalar of type integer. It must be 2.
Invalid argument values will result in the return of a sentinel value.
sa: A rank-1 array of type real and dimension nv.
g: A scalar of type real.
score: A scalar of type real.
mh: An optional return and scalar of type real.
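A minimal call sketch, assuming the declarations from the example programme at the bottom of this page; a score of 1.0 means every element was reclassified correctly:

call fsml_lda_2class(x2, nd, nv, nc, sa, g, score, mh)
! score: fraction of correctly reclassified elements (skill of the model)
! g    : discriminant criterion (signal-to-noise ratio)
! mh   : Mahalanobis distance between the two groups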
fsml_ols
The multiple linear Ordinary Least Squares (OLS) regression models the relationship or linear dependence between a dependent (predictand) variable and one or more independent (predictor) variables. The procedure estimates the linear regression coefficients by minimising the sum of squared residuals.
The estimated regression model is of the form:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$

where $\hat{y}$ is the predictand variable, $x_1, x_2, \dots, x_m$ are the predictor variables (x) with nd observations, $b_0$ is the y-intercept (b0), $b_1, b_2, \dots, b_m$ (b) are the regression coefficients, and $m$ (nv) is the number of predictors (excluding the intercept).
The subroutine constructs a full matrix internally by prepending a column of ones to account for the intercept. The regression coefficients are estimated as:

$$\mathbf{b} = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$

where $\mathbf{X}$ is the extended design matrix including the intercept term.
The coefficient of determination (r2) represents the proportion of the total variance of (y) explained by the predictors. The predicted values (y_hat), standard errors (se) of the coefficients, and the covariance matrix of the predictors (covb) can optionally be returned by the procedure, too.
Note: This procedure uses eigh from the stdlib_linalg module.
Note: The intercept and predictor coefficients are computed separately and returned explicitly.
Note
The procedure has no pure equivalent, because it relies on an impure procedure for eigen-decomposition (eigh).
call fsml_ols(x, y, nd, nv, b0, b, r2 [, y_hat, se, covb])
x: A rank-2 array of type real with dimensions nd, nv.
y: A rank-1 array of type real with dimension nd.
nd: A scalar of type integer.
nv: A scalar of type integer.
Invalid argument values will result in the return of a sentinel value.
b0: A scalar of type real.
b: A rank-1 array of type real with dimension nv.
r2: A scalar of type real.
y_hat: An optional return and rank-1 array of type real with dimension nd.
se: An optional return and rank-1 array of type real with dimension nv.
covb: An optional return and rank-2 array of type real with dimensions nv, nv.
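The optional y_hat return makes residual diagnostics straightforward. A minimal sketch, assuming the declarations from the example programme at the bottom of this page (res and rmse are additional local variables):

real(wp) :: res(nd), rmse
call fsml_ols(x3, y, nd, nv, b0, b, rsq, y_hat=yhat)
res  = y - yhat                         ! residuals
rmse = sqrt(sum(res**2) / real(nd, wp)) ! root mean square error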
fsml_ridge
The multiple linear Ridge regression models the relationship or linear dependence between a dependent (predictand) variable and one or more independent (predictor) variables, incorporating a penalty term on the size of the regression coefficients to reduce multicollinearity and overfitting.
The procedure estimates the linear regression coefficients by minimising the sum of squared residuals plus a penalty proportional to the square of the magnitude of coefficients:

$$\hat{\mathbf{b}} = \underset{\mathbf{b}}{\arg\min} \left( \left\lVert \mathbf{y} - \mathbf{X}\mathbf{b} \right\rVert^2 + \lambda\, \mathbf{b}^\top \mathbf{I}_0\, \mathbf{b} \right)$$

where $\lambda$ (lambda) is the ridge penalty parameter, and $\mathbf{I}_0$ is the identity matrix with the first diagonal element corresponding to the intercept set to zero (no penalty on intercept).
The estimated regression model is of the form:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m$$

where $\hat{y}$ is the predictand variable, $x_1, x_2, \dots, x_m$ are the predictor variables (x) with nd observations, $b_0$ is the y-intercept (b0), $b_1, b_2, \dots, b_m$ (b) are the ridge regression coefficients, and $m$ (nv) is the number of predictors (excluding the intercept).
The subroutine constructs a full matrix internally by prepending a column of ones to account for the intercept. The coefficient of determination (r2) is always returned; the predicted values (y_hat), ridge-adjusted standard errors (se) of the coefficients, and the ridge-adjusted covariance matrix of the predictors (covb) can optionally be returned. The covariance matrix and standard errors are adjusted for the ridge penalty as:

$$\mathrm{cov}(\mathbf{b}) = \hat{\sigma}^2 \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_0\right)^{-1} \mathbf{X}^\top \mathbf{X} \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_0\right)^{-1}$$

where $\hat{\sigma}^2$ is the residual variance estimate.
Note: This procedure uses eigh from the stdlib_linalg module.
Note
The procedure has no pure equivalent, because it relies on an impure procedure for eigen-decomposition (eigh).
call fsml_ridge(x, y, nd, nv, lambda, b0, b, r2 [, y_hat, se, covb])
x: A rank-2 array of type real with dimensions nd, nv.
y: A rank-1 array of type real with dimension nd.
nd: A scalar of type integer.
nv: A scalar of type integer.
lambda: A scalar of type real. It must be strictly positive.
Invalid argument values will result in the return of a sentinel value.
b0: A scalar of type real.
b: A rank-1 array of type real with dimension nv.
r2: A scalar of type real.
y_hat: An optional return and rank-1 array of type real with dimension nd.
se: An optional return and rank-1 array of type real with dimension nv.
covb: An optional return and rank-2 array of type real with dimensions nv, nv.
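Since lambda controls the amount of shrinkage, it can be instructive to inspect the coefficients over a range of penalty values. A minimal sketch, assuming the declarations from the example programme at the bottom of this page (the lambda values are arbitrary illustrations):

integer :: k
do k = 1, 4
  call fsml_ridge(x3, y, nd, nv, real(k, wp)*0.1_wp, b0, b, rsq)
  write(*,'(A,F6.2,A,3F10.5)') " lambda =", real(k, wp)*0.1_wp, " b =", b
end do
! coefficients shrink towards zero as lambda grows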
fsml_mahalanobis
Computes the Mahalanobis distance between two input feature vectors x and y. If a covariance matrix cov is provided, it is used directly in the calculation. Otherwise, the procedure estimates the covariance matrix from the two-sample dataset formed by x and y. A Cholesky-based solver is used to perform the distance calculation.
The Mahalanobis distance is defined as:

$$d\left(\mathbf{x}, \mathbf{y}\right) = \sqrt{\left(\mathbf{x} - \mathbf{y}\right)^\top \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \mathbf{y}\right)}$$

where $\boldsymbol{\Sigma}$ is the covariance matrix. The inverse is applied via the Cholesky decomposition for numerical stability.
Note: If passed, the covariance matrix (cov) must be positive definite for the factorisation to succeed.
Note: This procedure uses chol from the stdlib_linalg module.
dist = fsml_mahalanobis(x, y [, cov])
x: A rank-1 array of type real with dimension m.
y: A rank-1 array of type real with dimension m.
cov: An optional rank-2 array of type real with dimensions m, m.
dist: A scalar of type real.
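If the identity matrix is passed as cov, the distance reduces to the Euclidean distance, which makes for a convenient sanity check. A minimal sketch (vector values are illustrative):

real(wp) :: u(3), v(3), sigma(3,3), dist
integer  :: i
u = [1.0_wp, 2.0_wp, 3.0_wp]
v = [2.0_wp, 2.0_wp, 5.0_wp]
sigma = 0.0_wp
do i = 1, 3
  sigma(i,i) = 1.0_wp ! identity; must be positive definite
end do
dist = fsml_mahalanobis(u, v, cov=sigma)
! with sigma = I, dist = sqrt(1 + 0 + 4) = sqrt(5)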
program example_lin
! |--------------------------------------------------------------------|
! | fsml - fortran statistics and machine learning library |
! | |
! | about |
! | ----- |
! | Examples for linear (algebra) statistical procedures (lin module). |
! | |
! | license : MIT |
! | author : Sebastian G. Mutz (sebastian@sebastianmutz.com) |
! |--------------------------------------------------------------------|
use :: fsml
use :: fsml_ini ! import wp; alternatively: iso_fortran_env, wp => real64
implicit none
! ==== Description
!! The programme demonstrates the use of procedures relying heavily on linear
!! algebra.
! ==== Declarations
! ---- general
integer(i4) :: i, j
! ---- data
! dims
integer(i4), parameter :: nd = 5, nv = 3, nc = 2
! PCA/EOF
real(wp) , parameter :: x1(nd,nv) = reshape([ &
& 2.1_wp, 2.5_wp, 1.9_wp, 2.3_wp, 2.0_wp, &
& 2.4_wp, 2.8_wp, 2.6_wp, 3.2_wp, 2.9_wp, &
& 2.0_wp, 2.8_wp, 2.1_wp, 2.3_wp, 2.9_wp &
& ], shape=[nd,nv])
! LDA
real(wp) , parameter :: x2(nd,nv,nc) = reshape([ &
& 2.1_wp, 2.5_wp, 1.9_wp, 2.3_wp, 2.0_wp, & ! class 1, var 1
& 2.4_wp, 2.8_wp, 2.6_wp, 3.2_wp, 2.9_wp, & ! class 1, var 2
& 2.0_wp, 2.8_wp, 2.1_wp, 2.3_wp, 2.9_wp, & ! class 1, var 3
& 1.1_wp, 1.3_wp, 1.5_wp, 1.4_wp, 1.6_wp, & ! class 2, var 1
& 1.2_wp, 1.5_wp, 1.6_wp, 1.7_wp, 1.8_wp, & ! class 2, var 2
& 1.0_wp, 1.2_wp, 1.3_wp, 1.1_wp, 1.4_wp & ! class 2, var 3
& ], shape=[nd,nv,nc])
! OLS/Ridge
real(wp), parameter :: x3(nd,nv) = reshape([ &
& 1.0_wp, 2.0_wp, 3.0_wp, 4.0_wp, 5.0_wp, & ! var 1
& 2.0_wp, 1.0_wp, 4.0_wp, 3.0_wp, 2.0_wp, & ! var 2
& 3.0_wp, 3.5_wp, 2.0_wp, 1.0_wp, 2.5_wp & ! var 3
& ], shape=[nd,nv])
real(wp), parameter :: y(nd) = [10.0_wp, 12.0_wp, 13.0_wp, 14.0_wp, 15.0_wp] ! target values
! Mahalanobis distance
real(wp) :: dist ! distance
! ---- results
! PCA
real(wp) :: pc(nd,nv), eof(nv,nv), ew(nv), eof_scaled(nv,nv), r2(nv), wt(nv)
! LDA
real(wp) :: score, g, mh, sa(nv)
! OLS and Ridge
real(wp) :: b(nv), b0, yhat(nd), se(nv), covb(nv,nv), rsq
! ==== Instructions
! ---- Principal Component Analysis
call fsml_pca(x1, nd, nv, pc=pc, ev=eof, ew=ew, r2=r2)
write(*,'(A)') "> principal component analysis"
print*
write(*,'(A)') " principal components: "
write(*,'(3F10.5)') (pc(i,:), i=1,nd)
print*
write(*,'(A)') " eigenvectors: "
write(*,'(3F10.5)') (eof(i,:), i=1,nv)
print*
write(*,'(A)') " eigenvalues: "
write(*,'(3F10.5)') ew
print*
write(*,'(A)') " explained variance: "
write(*,'(3F10.5)') r2
print*
! principal components:
! -0.54693 -0.07722 0.13896
! 0.42315 -0.07015 0.27645
! -0.42586 -0.05419 -0.13456
! 0.13778 0.43289 -0.06344
! 0.41185 -0.23133 -0.21741
!
! eigenvectors:
! 0.28392 0.36310 0.88744
! 0.47190 0.75276 -0.45897
! 0.83469 -0.54909 -0.04238
!
! eigenvalues:
! 0.21204 0.06368 0.04128
!
! explained variance:
! 0.66888 0.20089 0.13023
! ---- Empirical Orthogonal Functions
! apply weights of 1 (default = 1/n)
wt = 1.0_wp
! call EOF procedure with weights of 1 and opt=0 (covariance matrix) to make
! it consistent with fsml_pca and other common PCA implementations
call fsml_eof(x1, nd, nv, pc=pc, eof=eof, ew=ew, opt=0, &
wt=wt, r2=r2, eof_scaled=eof_scaled)
write(*,'(A)') "> empirical orthogonal function analysis"
print*
write(*,'(A)') " principal components: "
write(*,'(3F10.5)') (pc(i,:), i=1,nd)
print*
write(*,'(A)') " EOFs (eigenvectors): "
write(*,'(3F10.5)') (eof(i,:), i=1,nv)
print*
write(*,'(A)') " eigenvalues: "
write(*,'(3F10.5)') ew
print*
write(*,'(A)') " explained variance: "
write(*,'(3F10.5)') r2
print*
write(*,'(A)') " scaled EOFs: "
write(*,'(3F10.5)') (eof_scaled(i,:), i=1,nv)
print*
! principal components:
! -0.54693 -0.07722 0.13896
! 0.42315 -0.07015 0.27645
! -0.42586 -0.05419 -0.13456
! 0.13778 0.43289 -0.06344
! 0.41185 -0.23133 -0.21741
!
! eigenvectors:
! 0.28392 0.36310 0.88744
! 0.47190 0.75276 -0.45897
! 0.83469 -0.54909 -0.04238
!
! eigenvalues:
! 0.21204 0.06368 0.04128
!
! explained variance:
! 0.66888 0.20089 0.13023
!
! scaled EOFs:
! 0.13074 0.09163 0.18031
! 0.21730 0.18996 -0.09325
! 0.38435 -0.13856 -0.00861
! ---- 2-Class Multivariate Linear Discriminant Analysis
call fsml_lda_2class(x2, nd, nv, nc, sa, g, score, mh)
write(*,'(A)') "> linear discriminant analysis (2-class)"
print*
write(*,'(A,F10.5)') " classification score: ", score
write(*,'(A,F10.5)') " Mahalanobis distance: ", mh
write(*,'(A,F10.5)') " discriminant value g: ", g
print*
write(*,'(A)') " standardised coefficients:"
write(*,'(3F10.5)') sa
print*
! classification score: 1.00000
! Mahalanobis distance: 5.53247
! discriminant value g: 53.34966
!
! standardised coefficients:
! 2.07805 9.14221 5.42794
! ---- Multiple Linear Regression (Ordinary Least Squares)
call fsml_ols(x3, y, nd, nv, b0, b, rsq, yhat, se, covb)
write(*,'(A)') "> multiple linear regression (OLS)"
print*
write(*,'(A,F10.5)') " R²: ", rsq
write(*,'(A,F10.5)') " Intercept: ", b0
print*
write(*,'(A)') " Coefficients:"
write(*,'(3F10.5)') b
print*
write(*,'(A)') " Predicted values:"
write(*,'(5F10.5)') yhat
print*
write(*,'(A)') " Standard errors:"
write(*,'(3F10.5)') se
print*
! R²: 0.97361
! Intercept: 8.78516
!
! Coefficients:
! 1.22266 0.05078 0.09375
!
! Predicted values:
! 10.39062 11.60938 12.84375 13.92187 15.23437
!
! Standard errors:
! 0.25240 0.43454 0.60515
! ---- Multiple Linear Regression (Ridge)
! call with lambda set to 0.2
call fsml_ridge(x3, y, nd, nv, 0.2_wp, b0, b, rsq, yhat, se, covb)
write(*,'(A)') "> multiple linear regression (Ridge)"
print*
write(*,'(A,F10.5)') " R²: ", rsq
write(*,'(A,F10.5)') " Intercept: ", b0
print*
write(*,'(A)') " Coefficients:"
write(*,'(3F10.5)') b
print*
write(*,'(A)') " Predicted values:"
write(*,'(5F10.5)') yhat
print*
write(*,'(A)') " Standard errors:"
write(*,'(3F10.5)') se
print*
! R²: 0.97293
! Intercept: 9.11522
!
! Coefficients:
! 1.18224 0.02590 0.03162
!
! Predicted values:
! 10.44413 11.61628 12.82879 13.95351 15.15729
!
! Standard errors:
! 0.23481 0.36909 0.49025
! ---- Mahalanobis distance
! call without passing covariance matrix
dist = fsml_mahalanobis(x3(:,1), x3(:,2))
write(*,'(A)') "> Mahalanobis distance"
print*
write(*,'(A,F10.5)') " Distance: ", dist
print*
! Distance: 1.41421
end program example_lin