| Title: | Machine Learning for Runoff Prediction |
|---|---|
| Description: | Machine learning in k-fold cross-validation. |
| Authors: | Dongdong Kong [aut, cre] (ORCID: <https://orcid.org/0000-0003-1836-8172>) |
| Maintainer: | Dongdong Kong <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-06-06 15:14:45 UTC |
| Source: | https://github.com/rpkgs/kfold |
add_previous
add_previous(d, nlead = 12)
d |
with the variable of |
nlead |
the number of leads to add |
Splits observation indices into kfold groups, ensuring each group
receives a representative range of the target variable y.
chunk_stratified(y, kfold = 5, seed = 1)
y |
Numeric vector of target values used for stratified splitting. |
kfold |
Number of folds. |
seed |
Random seed (currently unused; seed is fixed internally). |
A named list of length kfold, each element an integer vector
of row indices belonging to that fold.
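One common way to obtain folds that each span the range of y is to sort the observations by y and deal them out in turn. The base-R sketch below illustrates that idea only; `stratified_folds` is a hypothetical helper, and nothing here is taken from the actual implementation of chunk_stratified().

```r
# Hypothetical sketch of stratified fold assignment; NOT the package's
# chunk_stratified() implementation.
stratified_folds <- function(y, kfold = 5) {
  stopifnot(length(y) >= kfold)
  ord <- order(y)                                # row indices sorted by target value
  fold_id <- rep_len(seq_len(kfold), length(y))  # 1, 2, ..., kfold, 1, 2, ...
  folds <- split(ord, fold_id)                   # deal sorted indices round-robin
  names(folds) <- paste0("fold", seq_len(kfold))
  lapply(folds, sort)                            # integer row indices per fold
}

set.seed(1)
y <- rexp(103)                                   # skewed target, as runoff often is
folds <- stratified_folds(y, kfold = 5)

lengths(folds)                                   # fold sizes differ by at most one
sapply(folds, function(i) range(y[i]))           # each fold covers low and high y
```

Because neighbouring values in the sorted order land in different folds, every fold receives both small and large values of y.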
Build lagged feature matrices for multiple lead times
feature_leads(data_full, leads = 1:12)
data_full |
A |
leads |
Integer vector of lead times (in time steps) to build features for. |
A named list (one element per lead) of lists with X (feature matrix)
and Y (response matrix).
Compute GOF across multiple lead-time kfold objects
GOT_list(list_kfold, list_test, ..., idcol = "lead")
... |
Ignored. |
idcol |
Column name for the lead-time id column. |
objects |
Named list of |
ds_test |
Named list of test datasets (one per lead time), each a list
with |
A data.table of GOF metrics with columns lead and mode.
Calibrate a model on a single train/validation split.
kfold_calib(X, Y, FUN = xgboost, index = NULL, ..., ratio_valid = 0.3)
X |
Feature matrix (rows = observations). |
Y |
Response matrix (rows = observations). |
FUN |
Model fitting function with signature |
index |
Integer vector of validation row indices. If |
... |
Additional arguments forwarded to |
ratio_valid |
Fraction of rows used as validation when |
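The idea of a single train/validation split can be sketched in base R as follows. This is a minimal illustration, not the internals of kfold_calib(): `split_calib` is a hypothetical helper, stats::lm() stands in for the model function, and drawing the validation rows at random when index is NULL is an assumption.

```r
# Hypothetical sketch of a single train/validation split; NOT the package's
# kfold_calib() internals. stats::lm() stands in for the model function.
split_calib <- function(X, Y, index = NULL, ratio_valid = 0.3) {
  df <- data.frame(y = Y[, 1], as.data.frame(X))  # columns: y, V1, V2, ...
  n <- nrow(df)
  if (is.null(index)) {
    # Assumption: validation rows are drawn at random when none are supplied.
    index <- sort(sample.int(n, size = round(n * ratio_valid)))
  }
  fit <- lm(y ~ ., data = df[-index, ])           # fit on the training rows
  ysim <- predict(fit, newdata = df[index, ])     # predict the held-out rows
  yobs <- df$y[index]
  list(index = index, rmse = sqrt(mean((yobs - ysim)^2)))
}

set.seed(1)
n <- 100; p <- 2
X <- matrix(rnorm(n * p), n, p)
y <- as.matrix(X[, 1] + rnorm(n) / 4)

r_split <- split_calib(X, y, ratio_valid = 0.3)
length(r_split$index)   # number of validation rows
r_split$rmse            # validation RMSE
```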
kfold machine learning
kfold_ml(
  X,
  Y,
  kfold = 5,
  FUN,
  ...,
  fn_chunk = chunk_stratified,
  .progress = TRUE
)

kfold_rf(X, Y, kfold = 5, FUN = ranger, ntree = 500, importance = "none", ...)

kfold_xgboost(X, Y, kfold = 5, FUN = xgboost, nrounds = 500, ...)

kfold_lm(X, Y, kfold = 5, ...)
X |
Feature matrix (rows = observations). |
Y |
Response matrix (rows = observations). |
kfold |
Number of folds. |
FUN |
Model fitting function with signature |
... |
Additional arguments forwarded to |
fn_chunk |
Fold-splitting function; defaults to |
.progress |
Show a progress bar during fold iteration. |
ntree |
Number of trees for |
importance |
Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see |
nrounds |
Number of boosting iterations / rounds. Note that the number of default boosting rounds here is not automatically tuned, and different problems will have vastly different optimal numbers of boosting rounds. |
ranger::ranger(), xgboost::xgboost()
set.seed(1)
n <- 100; p <- 2
X <- matrix(rnorm(n * p), n, p) # no intercept!
y <- as.matrix(rnorm(n))

## kfold
r_lm <- kfold_lm(X, y)
r_xgb <- kfold_xgboost(X, y)
# r_rf <- kfold_rf(X, y)

## 70%-30% split
r = kfold_calib(X, y, ratio_valid = 0.7, nrounds=500, verbose=FALSE)
r$gof
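For readers unfamiliar with the procedure, the k-fold loop itself can be written in a few lines of base R. The sketch below is only an illustration under stated assumptions: `kfold_cv_lm` is a hypothetical helper, folds are assigned at random rather than with chunk_stratified(), and stats::lm() is used as the model, so it does not reproduce the structure of the kfold object returned by kfold_ml().

```r
# Hypothetical sketch of k-fold cross-validation; NOT the package's kfold_ml().
kfold_cv_lm <- function(X, Y, kfold = 5) {
  df <- data.frame(y = Y[, 1], as.data.frame(X))  # columns: y, V1, V2, ...
  n <- nrow(df)
  # Assumption: plain random fold assignment instead of stratified chunks.
  fold_id <- sample(rep_len(seq_len(kfold), n))
  ysim <- numeric(n)
  for (k in seq_len(kfold)) {
    test <- which(fold_id == k)
    fit <- lm(y ~ ., data = df[-test, ])          # train on the other folds
    ysim[test] <- predict(fit, newdata = df[test, ])  # predict the held-out fold
  }
  data.frame(fold = fold_id, yobs = df$y, ysim = ysim)
}

set.seed(1)
n <- 100; p <- 2
X <- matrix(rnorm(n * p), n, p)
y <- as.matrix(X[, 1] - X[, 2] + rnorm(n) / 4)

cv <- kfold_cv_lm(X, y, kfold = 5)
head(cv)
sqrt(mean((cv$yobs - cv$ysim)^2))                 # out-of-fold RMSE
```

Every observation is predicted exactly once, by a model that never saw it during fitting, which is what makes the pooled error an out-of-sample estimate.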
Goodness of fit
NSE(yobs, ysim, w, ...)

GOF(yobs, ...)

## Default S3 method:
GOF(
  yobs,
  ysim,
  w = NULL,
  include.cv = FALSE,
  include.r = TRUE,
  ...,
  idcol = "kfold",
  mode = "test"
)

## S3 method for class 'kfold'
GOF(yobs, test = NULL, ...)
yobs |
Numeric vector, observations |
ysim |
Numeric vector, corresponding simulated values |
w |
Numeric vector of weights, one per point. If w is supplied, it is taken into account when calculating the mean, Bias, MAE, RMSE and NSE. |
... |
Ignored. |
include.cv |
If true, cv will be included. |
include.r |
If true, r and R2 will be included. |
idcol |
Column name for the id column when binding multi-column results. |
mode |
Label inserted into the |
test |
A list with |
RMSE root mean square error
NSE Nash-Sutcliffe efficiency coefficient
MAE mean absolute error
AI Agreement index (only good points, w == 1, are used in the
calculation). See Zhang et al. (2015) for details.
Bias bias
Bias_perc bias percentage
n_sim number of valid observations
cv coefficient of variation
R2 coefficient of determination
R Pearson correlation coefficient
pvalue p-value of R
https://en.wikipedia.org/wiki/Coefficient_of_determination
https://en.wikipedia.org/wiki/Explained_sum_of_squares
https://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient
Zhang Xiaoyang (2015), http://dx.doi.org/10.1016/j.rse.2014.10.012
yobs <- rnorm(100)
ysim <- yobs + rnorm(100) / 4
GOF(yobs, ysim)
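Several of the metrics listed above follow directly from their textbook definitions. The sketch below computes unweighted versions in base R; `gof_sketch` is a hypothetical helper, the sign convention for Bias (simulated minus observed) is an assumption, and the weights w, AI, Bias_perc and cv are left out, so its output should not be expected to match GOF() column for column.

```r
# Hypothetical, unweighted sketch of a few goodness-of-fit metrics;
# NOT the package's GOF() implementation.
gof_sketch <- function(yobs, ysim) {
  ok <- is.finite(yobs) & is.finite(ysim)         # keep valid pairs only
  yobs <- yobs[ok]
  ysim <- ysim[ok]
  err <- ysim - yobs                              # assumption: sim minus obs
  c(
    RMSE  = sqrt(mean(err^2)),
    NSE   = 1 - sum(err^2) / sum((yobs - mean(yobs))^2),
    MAE   = mean(abs(err)),
    Bias  = mean(err),
    R     = cor(yobs, ysim),
    n_sim = length(yobs)
  )
}

set.seed(1)
yobs <- rnorm(100)
ysim <- yobs + rnorm(100) / 4
g <- gof_sketch(yobs, ysim)
g

g_shift <- gof_sketch(yobs, yobs + 1)             # constant offset of one unit
g_shift
```

A constant offset leaves the correlation at 1 while RMSE, MAE and Bias all equal the offset, which shows why R alone is a poor measure of fit.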
predict for kfold object
## S3 method for class 'kfold'
predict(object, newdata = NULL, ..., mode = "test")
object |
A |
newdata |
New feature matrix for prediction. Required when |
... |
Additional arguments forwarded to the underlying model's |
mode |
Prediction mode: |
Build a lagged matrix: column t is the original series, columns t-1,
t-2, … are progressively shifted (lagged) copies.
previous_tn(x, n = 7, prefix = "", ...)

## Default S3 method:
previous_tn(x, n = 7, prefix = "", ...)

## S3 method for class 'data.frame'
previous_tn(x, n = 7, ...)
x |
Numeric vector (default method) or |
n |
Number of lags to create. |
prefix |
Character prefix prepended to each column name. |
... |
Ignored. |
set.seed(1)
x <- rnorm(10)
previous_tn(x, 7, "R1_")

# data.frame
d = data.frame(x)
previous_tn(d)
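The construction of a lagged matrix can be sketched in base R as below. This is an illustration only: `lag_matrix` is a hypothetical helper, and both the NA padding at the start of each lagged column and the column names ("t", "t-1", ...) are assumptions that may differ from what previous_tn() returns.

```r
# Hypothetical sketch of a lagged matrix; NOT the package's previous_tn().
lag_matrix <- function(x, n = 7, prefix = "") {
  stopifnot(n >= 1, n < length(x))
  len <- length(x)
  # Column k holds x shifted down by k steps, padded with NA at the top
  # (assumption: missing history is represented as NA).
  lags <- lapply(0:n, function(k) c(rep(NA_real_, k), x[seq_len(len - k)]))
  out <- do.call(cbind, lags)
  # Assumption: columns are named t, t-1, ..., t-n with the prefix prepended.
  colnames(out) <- paste0(prefix, c("t", paste0("t-", seq_len(n))))
  out
}

m <- lag_matrix(c(1, 2, 3, 4, 5), n = 2, prefix = "R1_")
m
```

Row i of the result holds the value at time i together with the values at the n preceding time steps, which is the layout a regression model needs to use past observations as predictors.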