Title: | Kalman Filter for Impulse Noised Outliers |
---|---|
Description: | A method for detecting impulse-noised outliers with a Kalman filter and for prediction on the cleaned data. 'kfino' is a robust sequential algorithm that filters data containing a large number of outliers. The algorithm is based on simple latent linear Gaussian processes, as in the Kalman filter method, and is dedicated to detecting impulse-noised outliers, that is, data points that differ significantly from other observations. 'ML' (Maximum Likelihood) and 'EM' (Expectation-Maximization) algorithms are implemented in 'kfino'. The method is described in full detail in the following arXiv e-print: <arXiv:2208.00961>. |
Authors: | Bertrand Cloez [aut], Isabelle Sanchez [aut, cre], Benedicte Fontez [ctr] |
Maintainer: | Isabelle Sanchez <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-11-13 03:59:14 UTC |
Source: | https://github.com/cran/kfino |
doutlier defines an outlier distribution (surface of a trapezium) and uses the input parameters given in the main function kfino_fit()
doutlier(y, K, expertMin, expertMax)
y |
numeric, point |
K |
numeric, constant value |
expertMin |
numeric, the minimal weight expected by the user |
expertMax |
numeric, the maximal weight expected by the user |
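This function is used to calculate an outlier distribution following a trapezium shape. $y \mapsto \mathrm{doutlier}(y, K, \mathrm{expertMin}, \mathrm{expertMax})$ is the probability density function on $[\mathrm{expertMin}, \mathrm{expertMax}]$ which is linear and whose values at the two endpoints differ by a factor $K$.
In particular, when $K=1$ this corresponds to the uniform distribution.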
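As a rough illustration of the trapezium shape described above, the sketch below defines a linear density on an interval whose endpoint values differ by a factor K. The function name `trapezium_density`, its arguments `ymin` and `ymax`, and the choice that the density at the upper bound is K times the density at the lower bound are assumptions made for illustration only; this is not the package's `doutlier()` implementation.

```r
# Hypothetical helper (not part of kfino): a linear density on [ymin, ymax]
# with f(ymax) = K * f(ymin). The direction of the ratio is an assumption.
trapezium_density <- function(y, K, ymin, ymax) {
  width <- ymax - ymin
  f_min <- 2 / ((K + 1) * width)      # density at the lower bound
  slope <- (K - 1) * f_min / width    # constant slope of the density
  f_min + slope * (y - ymin)
}

# The area under the curve over [ymin, ymax] should be 1 for any K > 0.
area <- integrate(trapezium_density, lower = 10, upper = 45,
                  K = 5, ymin = 10, ymax = 45)$value
print(area)

# With K = 1 the slope vanishes and the density is uniform: 1 / (ymax - ymin).
print(trapezium_density(c(10, 20, 45), K = 1, ymin = 10, ymax = 45))
```

Checking the integral numerically is a quick way to confirm that a candidate shape is a proper density before using it as an outlier distribution.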
a numeric value
doutlier(2,5,10,45)
A method for detecting impulse-noised outliers with a Kalman filter and for prediction on the cleaned data. 'kfino' is a robust sequential algorithm that filters data containing a large number of outliers. The algorithm is based on simple latent linear Gaussian processes, as in the Kalman filter method, and is dedicated to detecting impulse-noised outliers, that is, data points that differ significantly from other observations. 'ML' (Maximum Likelihood) and 'EM' (Expectation-Maximization) algorithms are implemented in 'kfino'. The method is described in full detail in the following arXiv e-print: arXiv:2208.00961.
Maintainer: Isabelle Sanchez [email protected]
Authors:
Bertrand Cloez [email protected]
Other contributors:
Benedicte Fontez [email protected] [contractor]
Useful links:
Report bugs at https://forgemia.inra.fr/isabelle.sanchez/kfino/-/issues
kfino_fit a function to detect outliers with a Kalman filtering approach
kfino_fit( datain, Tvar, Yvar, param = NULL, doOptim = TRUE, method = "ML", threshold = 0.5, kappa = 10, kappaOpt = 7, verbose = FALSE )
datain |
an input data.frame of one time course to study (unique IDE) |
Tvar |
char, name of the time column in the data.frame datain (a numeric vector). Tvar should be expressed as a proportion of day in seconds |
Yvar |
char, name of the variable to predict in the data.frame datain |
param |
list, a list of initialization parameters |
doOptim |
logical, if TRUE optimization of the initial parameters, default TRUE |
method |
character, the method used to optimize the initial parameters: Expectation-Maximization algorithm '"EM"' (faster) or Maximum Likelihood '"ML"' (more robust), default '"ML"' |
threshold |
numeric, threshold to qualify an observation as outlier according to the label_pred, default 0.5 |
kappa |
numeric, truncation setting for likelihood optimization over initial parameters, default 10 |
kappaOpt |
numeric, truncation setting for the filtering and outlier detection step with optimized parameters, default 7 |
verbose |
write details if TRUE (optional), default FALSE. |
The initialization parameter list 'param' contains:
mm | (optional) numeric, target weight, NULL if the user wants to optimize it |
pp | (optional) numeric, probability to be correctly weighed, NULL if the user wants to optimize it |
m0 | (optional) numeric, initial weight, NULL if the user wants to optimize it |
aa | numeric, rate of weight change, default 0.001 |
expertMin | numeric, the minimal weight expected by the user |
expertMax | numeric, the maximal weight expected by the user |
sigma2_m0 | numeric, variance of m0, default 1 |
sigma2_mm | numeric, variance of mm, related to the unit of Tvar, default 0.05 |
sigma2_pp | numeric, variance of pp, related to the unit of Yvar, default 5 |
K | numeric, a constant value in the outlier function (trapezium), default 5 |
seqp | numeric vector, sequence of values for pp, the probability to be correctly weighed, default seq(0.5,0.7,0.1) |
The parameter list should be given by the user based on their knowledge of the animal or the data set. All parameters are compulsory except m0, mm and pp, which can be optimized by the algorithm. In the optimization step, these three parameters are initialized from the input data (restricted to the expert range): m0 and mm use quantiles of the Y distribution (varying between 0.2 and 0.8 for m0, and 0.5 for mm), and pp is a sequence varying between 0.5 and 0.7. If the number of possible observations studied is greater than 500, a sub-sampling is performed to speed up the algorithm. Optimization is performed using the '"EM"' or '"ML"' method.
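The initialization described above can be pictured as a small grid of candidate starting values. The sketch below is only an illustration in base R: the simulated data, the 0.1 step between quantiles, and the use of `expand.grid()` are assumptions, and the sub-sampling step is not shown. It does not reproduce the internal code of `kfino_fit()`.

```r
# Illustrative sketch only: candidate starting values built from quantiles of Y.
set.seed(1234)
Y <- rnorm(n = 200, mean = 50, sd = 4)   # simulated weights (hypothetical data)
expertMin <- 30
expertMax <- 75

# Keep only the observations inside the expert range.
Y_in <- Y[Y >= expertMin & Y <= expertMax]

# Assumed grid: quantiles 0.2 to 0.8 for m0 (the 0.1 step is a guess),
# the 0.5 quantile for mm, and the seqp sequence for pp.
m0_candidates <- quantile(Y_in, probs = seq(0.2, 0.8, by = 0.1), names = FALSE)
mm_candidate  <- quantile(Y_in, probs = 0.5, names = FALSE)
seqp          <- seq(0.5, 0.7, by = 0.1)

# One row per combination of starting values to be evaluated.
grid <- expand.grid(m0 = m0_candidates, mm = mm_candidate, pp = seqp)
print(nrow(grid))
print(head(grid))
```

Each row of such a grid would then be scored by the chosen method ('"EM"' or '"ML"'), keeping the combination with the best likelihood.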
an S3 list with two data frames and a list of vectors of kfino results
detectOutlier: The whole input data set with the detected outliers flagged and the prediction of the analyzed variable. The following columns are joined to the columns present in the input data set:
the parameter of interest - Yvar - predicted
the probability of the value being well predicted
lower bound of the confidence interval of the predicted value
upper bound of the confidence interval of the predicted value
flag of the value: OK value, KO value (outlier), or OOR value (out of range values defined by the user in 'kfino_fit' with the 'expertMin' and 'expertMax' input parameters). If flag == OOR the 4 previous columns are set to NA.
PredictionOK: A subset of 'detectOutlier' data set with the predictions of the analyzed variable on possible values (OK and KO values)
kfino.results: kfino results (a list of vectors containing the prediction of the analyzed variable, the probability to be an outlier, the likelihood, the confidence interval of the prediction and the flag of the data) on input parameters that were optimized if the user chose this option
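To make the flagging rule concrete, the sketch below applies it to a toy data frame in base R. The data values are invented, the column name `flag` is assumed, and the rule "probability strictly above `threshold` means OK" is an assumption drawn from the descriptions above; the actual output of `kfino_fit()` may differ in detail.

```r
# Hypothetical toy data, not kfino output. label_pred is NA for the
# out-of-range rows, mirroring the description of the OOR flag.
toy <- data.frame(
  Poids      = c(41.2, 12.0, 43.5, 88.0, 44.1),
  label_pred = c(0.93, NA,   0.12, NA,   0.71)
)
expertMin <- 30
expertMax <- 75
threshold <- 0.5

# Out-of-range values are flagged first, then the threshold is applied
# to the probability of being well predicted (assumed rule).
out_of_range <- toy$Poids < expertMin | toy$Poids > expertMax
toy$flag <- ifelse(out_of_range, "OOR",
                   ifelse(toy$label_pred > threshold, "OK", "KO"))
print(toy)

# Analogue of the PredictionOK subset: drop the out-of-range rows.
possible_values <- toy[toy$flag != "OOR", ]
print(possible_values)
```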
library(kfino)
library(dplyr)
data(spring1)

# --- With Optimization on initial parameters - ML method
t0 <- Sys.time()
param1 <- list(m0=NULL, mm=NULL, pp=NULL, aa=0.001,
               expertMin=30, expertMax=75,
               sigma2_m0=1, sigma2_mm=0.05, sigma2_pp=5,
               K=2, seqp=seq(0.5,0.7,0.1))
resu1 <- kfino_fit(datain=spring1, Tvar="dateNum", Yvar="Poids",
                   doOptim=TRUE, method="ML", param=param1, verbose=TRUE)
Sys.time() - t0

# --- Without Optimization on initial parameters
t0 <- Sys.time()
param2 <- list(m0=41, mm=45, pp=0.5, aa=0.001,
               expertMin=30, expertMax=75,
               sigma2_m0=1, sigma2_mm=0.05, sigma2_pp=5,
               K=2, seqp=seq(0.5,0.7,0.1))
resu2 <- kfino_fit(datain=spring1, Tvar="dateNum", Yvar="Poids",
                   param=param2, doOptim=FALSE, verbose=FALSE)
Sys.time() - t0
kfino_plot a graphical function for the result of a kfino run
kfino_plot( resuin, typeG, Tvar, Yvar, Ident, title = NULL, labelX = NULL, labelY = NULL )
resuin |
a list returned by the kfino algorithm |
typeG |
char, type of graphic, either detection of outliers (with qualitative or quantitative display) or prediction. must be "quanti" or "quali" or "prediction" |
Tvar |
char, time variable in the data.frame datain |
Yvar |
char, variable which was analysed in the data.frame datain |
Ident |
char, column name of the individual id to be analyzed |
title |
char, a graph title |
labelX |
char, a label for x-axis |
labelY |
char, a label for y-axis |
The produced graphic can be, according to typeG:
"quali" | This plot shows the detection of outliers with a qualitative rule: OK values (black), KO values (outliers, purple) and OOR values (out of range values defined by the user in 'kfino_fit', red) |
"quanti" | This plot shows the detection of outliers with a quantitative display using the probability calculated by the kfino algorithm |
"prediction" | This plot shows the prediction of the analyzed variable plus the OK values. Prediction corresponds to E[X_t | Y_1...t] for each time point t. Between two time points, a simple linear interpolation is used. |
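The linear interpolation between time points mentioned for the prediction plot can be illustrated with base R's `approx()`. The time points and predicted values below are invented for the example, and this is a sketch of the general technique rather than the plotting code used by `kfino_plot()`.

```r
# Hypothetical filtered predictions at four observation times.
time_obs   <- c(0, 1, 3, 4)
prediction <- c(41.0, 41.5, 43.5, 44.0)

# approx() interpolates linearly between the known (time, prediction) points.
time_new <- c(0.5, 2, 3.5)
interp <- approx(x = time_obs, y = prediction, xout = time_new)$y
print(interp)
```

When a ggplot2 line is drawn through the predicted points, the segments between observations correspond to the same linear interpolation.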
a ggplot2 graphic
library(kfino)
library(dplyr)
data(spring1)

print(colnames(spring1))

# --- Without Optimisation on initial parameters
param2 <- list(m0=41, mm=45, pp=0.5, aa=0.001,
               expertMin=30, expertMax=75,
               sigma2_m0=1, sigma2_mm=0.05, sigma2_pp=5,
               K=2, seqp=seq(0.5,0.7,0.1))
resu2 <- kfino_fit(datain=spring1, Tvar="dateNum", Yvar="Poids",
                   param=param2, doOptim=FALSE)

# flags are qualitative
kfino_plot(resuin=resu2, typeG="quali", Tvar="Day", Yvar="Poids", Ident="IDE",
           title="kfino spring1", labelX="Time (day)", labelY="Weight (kg)")

# flags are quantitative
kfino_plot(resuin=resu2, typeG="quanti", Tvar="Day", Yvar="Poids", Ident="IDE")

# predictions on OK values
kfino_plot(resuin=resu2, typeG="prediction", Tvar="Day", Yvar="Poids", Ident="IDE")
A dataset for kfino algorithm
lambs
a data.frame
weight (in kg)
Date of weighing yyyy-mm-dd
id of the animal
Date of weighing with day and time yyyy-mm-dd hh:mm:ss
a rescaled date - fraction of the whole observational time
for one individual.
A dataset for kfino algorithm
merinos1
a data.frame
weight (in kg)
Date of weighing yyyy-mm-dd
id of the animal
Date of weighing with day and time yyyy-mm-dd hh:mm:ss
a rescaled date - fraction of the whole observational time
for one individual.
A dataset for kfino algorithm
merinos2
a data.frame
weight (in kg)
Date of weighing yyyy-mm-dd
id of the animal
Date of weighing with day and time yyyy-mm-dd hh:mm:ss
a rescaled date - fraction of the whole observational time
for one individual.
A dataset for kfino algorithm
spring1
a data.frame
weight (in kg)
Date of weighing yyyy-mm-dd
id of the animal
Date of weighing with day and time yyyy-mm-dd hh:mm:ss
a rescaled date - fraction of the whole observational time
for one individual.
utils_EM a function to estimate the parameters 'm0', 'mm', 'pp' through an Expectation-Maximization (EM) method
utils_EM(param, kappaOpt, Y, Tps, N, scalingC)
param |
list, see the initial parameter list in kfino_fit |
kappaOpt |
numeric, truncation setting for initial parameters' optimization, default 7 |
Y |
numeric vector, the values of the variable to predict (the Yvar column of the data.frame datain) |
Tps |
numeric vector, the time values (the Tvar column of the data.frame datain). Time can be expressed as a proportion of day in seconds |
N |
numeric, length of the numeric vector of Y values |
scalingC |
numeric, scaling constant. To be changed if the function is not able to calculate the likelihood because the number of data is large |
utils_EM is a tool function used in the main kfino_fit function. It uses the same input parameter list as the main function.
a list:
numeric, optimized m0
numeric, optimized mm
numeric, optimized pp
numeric, the calculated likelihood
library(kfino)

set.seed(1234)
Y <- rnorm(n=10, mean=50, 4)
Tps <- seq(1,10)
N <- 10
param2 <- list(m0=41, mm=45, pp=0.5, aa=0.001,
               expertMin=30, expertMax=75,
               sigma2_m0=1, sigma2_mm=0.05, sigma2_pp=5,
               K=2, seqp=seq(0.5,0.7,0.1))
print(Y)
utils_EM(param=param2, kappaOpt=7, Y=Y, Tps=Tps, N=N, scalingC=6)
utils_fit a function running the kfino algorithm to filter data and detect outliers when all parameters are known
utils_fit(param, threshold, kappa = 10, Y, Tps, N)
param |
list, see the initial parameter list in kfino_fit |
threshold |
numeric, threshold for confidence interval, default 0.5 |
kappa |
numeric, truncation setting for likelihood optimization, default 10 |
Y |
numeric vector, the values of the variable to predict (the Yvar column of the data.frame datain) |
Tps |
numeric vector, the time values (the Tvar column of the data.frame datain). Time can be expressed as a proportion of day in seconds |
N |
numeric, length of the numeric vector of Y values |
utils_fit is a tool function used in the main kfino_fit function. It uses the same input parameter list as the main function.
a list
vector, the prediction of weights
vector, probability to be an outlier
numeric, the calculated likelihood
vector of lower bound confidence interval of the prediction
vector of upper bound confidence interval of the prediction
char, is an outlier or not
library(kfino)

set.seed(1234)
Y <- rnorm(n=10, mean=50, 4)
Tps <- seq(1,10)
N <- 10
param2 <- list(m0=41, mm=45, pp=0.5, aa=0.001,
               expertMin=30, expertMax=75,
               sigma2_m0=1, sigma2_mm=0.05, sigma2_pp=5,
               K=2, seqp=seq(0.5,0.7,0.1))
print(Y)
utils_fit(param=param2, threshold=0.5, kappa=10, Y=Y, Tps=Tps, N=N)