How to perform a kfino outlier detection

library(kfino)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)

Objectives

This vignette describes how to use the kfino algorithm on time courses in order to detect impulse noised outliers and predict the parameter of interest.

Kalman filter with impulse noised outliers (kfino) is a robust sequential algorithm allowing to filter data with a large number of outliers. This algorithm is based on simple latent linear Gaussian processes as in the Kalman Filter method and is devoted to detect impulse-noised outliers. These are data points that differ significantly from other observations. ML (Maximization Likelihood) and EM (Expectation-Maximization algorithm) algorithms were implemented in kfino.

The method is described in full details in the following arxiv preprint: https://arxiv.org/abs/2208.00961.

To test the kfino algorithm, we enclosed real data sets into the kfino package. Those data sets were created for the publication describing the mobile and automated walk-over-weighing system:

https://doi.org/10.1016/j.compag.2018.08.022

To test the feasibility of using an automated weighing prototype suitable for a range of contrasting sheep farming systems, the authors automatically recorded the weight of 15 sheep grazing outdoor in spring.

The kfino package has 4 data sets available of automated weighing:

  • spring1: contains the weighing data of one animal grazing outdoor in spring (203 data points recorded)
  • merinos1: contains the weighing data of one merinos lamb (397 data points recorded)
  • merinos2: contains the weighing data of one merinos lamb difficult to model without an appropriate method like kfino (345 data points recorded)
  • lambs: contains the weighing data of four merinos lambs

Description of the spring1 dataset

We start by using the spring1 data set:

data(spring1)

# Dimension of this dataset
dim(spring1)
#> [1] 203   5

head(spring1)
#> # A tibble: 6 × 5
#> # Groups:   IDE [1]
#>   Poids Date                IDE             Day                 dateNum
#>   <dbl> <dttm>              <chr>           <dttm>                <dbl>
#> 1  28.6 2017-05-24 00:00:00 250016286863027 2017-05-24 16:34:00   0.469
#> 2  45   2017-05-24 00:00:00 250016286863027 2017-05-24 19:24:00   0.587
#> 3  25   2017-05-25 00:00:00 250016286863027 2017-05-25 05:25:00   1.00 
#> 4  43   2017-05-25 00:00:00 250016286863027 2017-05-25 05:45:00   1.02 
#> 5  23.4 2017-05-25 00:00:00 250016286863027 2017-05-25 05:58:00   1.03 
#> 6   0   2017-05-25 00:00:00 250016286863027 2017-05-25 09:30:00   1.17

The range weight of this animal is between 30 and 75 kg and must be given in param, a list of initial parameters to include in the kfino_fit() function call.

The user can either perform an outlier detection (and prediction) given initial parameters or on optimized initial parameters (on m0, mm and pp). param list is composed of:

  • m0 = (optional) the initial weight, NULL if the user wants to optimize it,
  • mm = (optional) the target weight, NULL if the user wants to optimize it,
  • pp = (optional) the probability to be correctly weighed, NULL if the user wants to optimize it,
  • aa = the rate of weight change, default 0.001,
  • expertMin = the minimal weight expected by the user,
  • expertMax = the maximal weight expected by the user,
  • sigma2_m0 = the variance of m0, default 1,
  • sigma2_mm = the variance of mm, related to the unit of Tvar, default 0.05,
  • sigma2_pp = the variance of pp, related to the unit of Yvar, default 5,
  • K = a constant value in the outlier function (trapezium), by default K=5,
  • seqp = a vector, sequence of pp probability to be correctly weighted. default seq(0.5,0.7,0.1)

Kfino algorithm on the spring1 dataset

Parameters (m0, mm and pp) not optimized

If the user chooses to not optimize the initial parameters, all the list must be completed according to expert knowledge of the data set. Here, the user supposes that the initial weight is around 41 and the target one around 45.

# --- Without Optimisation on parameters
param2<-list(m0=41,
             mm=45,
             pp=0.5,
             aa=0.001,
             expertMin=30,
             expertMax=75,
             sigma2_m0=1,
             sigma2_mm=0.05,
             sigma2_pp=5,
             K=2,
             seqp=seq(0.5,0.7,0.1))

resu2<-kfino_fit(datain=spring1,
              Tvar="dateNum",Yvar="Poids",
              param=param2,
              doOptim=FALSE,
              verbose=TRUE)     
#> [1] "-------:"
#> [1] "No optimization of initial parameters:"
#> [1] "Used parameters: "
#> [1] 41.0  0.5 45.0

resu2 is a list of 3 elements:

  • detectOutlier: The whole input data set with the detected outliers flagged and the prediction of the analyzed variable. the following columns are joined to the columns present in the input data set:

    • prediction: the parameter of interest - Yvar - predicted
    • label_pred: the probability of the value being well predicted
    • lwr: lower bound of the confidence interval of the predicted value
    • upr: upper bound of the confidence interval of the predicted value
    • flag: flag of the value (OK value, KO value (outlier), OOR value (out of range values defined by the user in kfino_fit)
  • PredictionOK: A subset of detectOutlier data set with the predictions of the analyzed variable on possible values (OK and KO values)

  • kfino.results: kfino results (a list of vectors, prediction, probability to be an outlier , likelihood, confidence interval of the prediction and the flag of the data) on input parameters that were optimized if the user chooses this option

# structure of detectOutlier data set
str(resu2$detectOutlier)
#> 'data.frame':    203 obs. of  11 variables:
#>  $ Poids     : num  28.6 45 25 43 23.4 0 42.2 43 85.4 40.1 ...
#>  $ Date      : POSIXct, format: "2017-05-24" "2017-05-24" ...
#>  $ IDE       : chr  "250016286863027" "250016286863027" "250016286863027" "250016286863027" ...
#>  $ Day       : POSIXct, format: "2017-05-24 16:34:00" "2017-05-24 19:24:00" ...
#>  $ dateNum   : num  0.469 0.587 1.004 1.018 1.027 ...
#>  $ rowNum    : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ prediction: num  NA 41.5 NA 41.7 NA ...
#>  $ label_pred: num  NA 0.68 NA 0.88 NA NA 0.9 0.88 NA 0.87 ...
#>  $ lwr       : num  NA 39.5 NA 39.9 NA ...
#>  $ upr       : num  NA 43.4 NA 43.5 NA ...
#>  $ flag      : chr  "OOR" "OK" "OOR" "OK" ...

# head of PredictionOK data set
head(resu2$PredictionOK)
#>   rowNum prediction label_pred      lwr      upr flag
#> 1      2   41.45659       0.68 39.49659 43.41659   OK
#> 2      4   41.68643       0.88 39.88262 43.49024   OK
#> 3      7   41.75829       0.90 40.07535 43.44123   OK
#> 4      8   41.90155       0.88 40.32265 43.48044   OK
#> 5     10   41.71243       0.87 40.14772 43.27714   OK
#> 6     11   41.81293       0.89 40.33186 43.29400   OK

# structure of kfino.results list
str(resu2$kfino.results)
#> List of 6
#>  $ prediction: num [1:121] 41.5 41.7 41.8 41.9 41.7 ...
#>  $ label     : num [1:121] 0.685 0.875 0.895 0.884 0.874 ...
#>  $ likelihood: num [1, 1] 1.25e-150
#>  $ lwr       : num [1:121] 39.5 39.9 40.1 40.3 40.1 ...
#>  $ upr       : num [1:121] 43.4 43.5 43.4 43.5 43.3 ...
#>  $ flag      : chr [1:121] "OK" "OK" "OK" "OK" ...

Using the kfino_plot()function allows the user to visualize the results:

# flags are qualitative
kfino_plot(resuin=resu2,typeG="quali",
            Tvar="Day",Yvar="Poids",Ident="IDE")

            
# flags are quantitative
kfino_plot(resuin=resu2,typeG="quanti",
            Tvar="Day",Yvar="Poids",Ident="IDE")

Parameters (m0, mm and pp) optimized

The user can use either (Maximization Likelihood) ML or (Expectation-Maximization algorithm) EM method.

If the user chooses to optimize the initial parameters, m0, mm and pp must be set to NULL.

# --- With Optimisation on parameters
param1<-list(m0=NULL,
             mm=NULL,
             pp=NULL,
             aa=0.001,
             expertMin=30,
             expertMax=75,
             sigma2_m0=1,
             sigma2_mm=0.05,
             sigma2_pp=5,
             K=2,
             seqp=seq(0.5,0.7,0.1))

Maximized Likelihood (ML) method


resu1<-kfino_fit(datain=spring1,
              Tvar="dateNum",Yvar="Poids",
              param=param1,
              doOptim=TRUE,
              method="ML",
              verbose=TRUE)  
#> [1] "-------:"
#> [1] "Optimization of initial parameters with ML method - result:"
#> [1] "no sub-sampling performed:"
#> range m0:  40.1 43 
#> initial m0opt:  41.3 
#> initial mmopt:  46.3 
#> [1] "Optimized parameters: "
#> Optimized m0:  42.1 
#> Optimized mm:  61.1 
#> Optimized pp:  0.7 
#> [1] "-------:"

# flags are qualitative
kfino_plot(resuin=resu1,typeG="quali",
            Tvar="Day",Yvar="Poids",Ident="IDE")

            
# flags are quantitative
kfino_plot(resuin=resu1,typeG="quanti",
            Tvar="Day",Yvar="Poids",Ident="IDE")

Prediction of the weight on the cleaned dataset:

kfino_plot(resuin=resu1,typeG="prediction",
            Tvar="Day",Yvar="Poids",Ident="IDE")

Expectation-Maximization (EM) method

resu1b<-kfino_fit(datain=spring1,
              Tvar="dateNum",Yvar="Poids",
              param=param1,
              doOptim=TRUE,
              method="EM",
              verbose=TRUE)  
#> [1] "-------:"
#> [1] "Optimization of initial parameters with EM method - result:"
#> [1] "no sub-sampling performed:"
#> range m0:  40.88 45.82 
#> [1] "Optimized parameters with EM method: "
#> Optimized m0:  41.3 
#> Optimized mm:  46.3 
#> Optimized pp:  0.5 
#> [1] "-------:"

# flags are qualitative
kfino_plot(resuin=resu1b,typeG="quali",
            Tvar="Day",Yvar="Poids",Ident="IDE")

            
# flags are quantitative
kfino_plot(resuin=resu1b,typeG="quanti",
            Tvar="Day",Yvar="Poids",Ident="IDE")


kfino_plot(resuin=resu1b,typeG="prediction",
            Tvar="Day",Yvar="Poids",Ident="IDE")

Description of the merinos1 dataset

The user can test the kfino method using another data set given in the package. Here, we test with the merinos1 data set on a ewe lamb. For this animal,the range weight is between 10 and 45 kg and must be given in the initial parameters of the kfino_fit()function.

data(merinos1)

# Dimension of this dataset
dim(merinos1)
#> [1] 397   5

head(merinos1)
#> # A tibble: 6 × 5
#> # Groups:   IDE [1]
#>   Poids Date                IDE             Day                 dateNum
#>   <dbl> <dttm>              <chr>           <dttm>                <dbl>
#> 1   0   2021-01-27 00:00:00 250017033503004 2021-01-27 08:31:00   0.353
#> 2   0   2021-01-27 00:00:00 250017033503004 2021-01-27 08:31:00   0.353
#> 3   0   2021-01-27 00:00:00 250017033503004 2021-01-27 08:31:00   0.353
#> 4   0   2021-01-27 00:00:00 250017033503004 2021-01-27 08:31:00   0.353
#> 5   0   2021-01-27 00:00:00 250017033503004 2021-01-27 08:31:00   0.353
#> 6  40.2 2021-01-27 00:00:00 250017033503004 2021-01-27 12:37:00   0.524

Kfino algorithm on the merinos1 dataset

Parameters (m0, mm and pp) optimized

# --- With Optimisation on parameters
param3<-list(m0=NULL,
             mm=NULL,
             pp=NULL,
             aa=0.001,
             expertMin=10,
             expertMax=45,
             sigma2_m0=1,
             sigma2_mm=0.05,
             sigma2_pp=5,
             K=2,
             seqp=seq(0.5,0.7,0.1))

resu3<-kfino_fit(datain=merinos1,
              Tvar="dateNum",Yvar="Poids",
              doOptim=TRUE,param=param3,
              verbose=TRUE)      
#> [1] "-------:"
#> [1] "Optimization of initial parameters with ML method - result:"
#> [1] "no sub-sampling performed:"
#> range m0:  26.4 40.4 
#> initial m0opt:  28.2 
#> initial mmopt:  33.6 
#> [1] "Optimized parameters: "
#> Optimized m0:  26.4 
#> Optimized mm:  45.4 
#> Optimized pp:  0.7 
#> [1] "-------:"

# flags are qualitative
kfino_plot(resuin=resu3,typeG="quali",
            Tvar="Day",Yvar="Poids",Ident="IDE")

            
# flags are quantitative
kfino_plot(resuin=resu3,typeG="quanti",
            Tvar="Day",Yvar="Poids",Ident="IDE")

Prediction of the weight on the cleaned dataset:

kfino_plot(resuin=resu3,typeG="prediction",
            Tvar="Day",Yvar="Poids",Ident="IDE")

References

  1. E.González-García et. al. (2018) A mobile and automated walk-over-weighing system for a close and remote monitoring of liveweight in sheep. vol 153: 226-238. https://doi.org/10.1016/j.compag.2018.08.022
  2. González García, Eliel, 2021, Individual liveweight of Mérinos d’Arles ewelambs, measured with a Walk-over-Weighing (WoW) system under Mediterranean grazing conditions, https://doi.org/10.15454/IXSHF7, Recherche Data Gouv, V5, UNF:6:q4HEDt0n8nzxYRxc+9KK8g==[fileUNF]

Session informations

#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1 dplyr_1.1.4   kfino_1.0.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.6      jsonlite_1.8.9    compiler_4.4.2    tidyselect_1.2.1 
#>  [5] jquerylib_0.1.4   scales_1.3.0      yaml_2.3.10       fastmap_1.2.0    
#>  [9] R6_2.5.1          labeling_0.4.3    generics_0.1.3    knitr_1.49       
#> [13] tibble_3.2.1      maketools_1.3.1   munsell_0.5.1     bslib_0.8.0      
#> [17] pillar_1.9.0      rlang_1.1.4       utf8_1.2.4        cachem_1.1.0     
#> [21] xfun_0.49         sass_0.4.9        sys_3.4.3         cli_3.6.3        
#> [25] withr_3.0.2       magrittr_2.0.3    digest_0.6.37     grid_4.4.2       
#> [29] lifecycle_1.0.4   vctrs_0.6.5       evaluate_1.0.1    glue_1.8.0       
#> [33] farver_2.1.2      buildtools_1.0.0  fansi_1.0.6       colorspace_2.1-1 
#> [37] rmarkdown_2.29    tools_4.4.2       pkgconfig_2.0.3   htmltools_0.5.8.1