Discrete Laplace mixture inference using the EM algorithm

disclapmix makes inference in a mixture of Discrete Laplace distributions using the EM algorithm. After the EM algorithm has converged, the centers are moved if the marginal likelihood increases by doing so. And then the EM algorithm is run again. This continues until the centers are not moved.

disclapmix(
  x,
  clusters,
  init_y = NULL,
  iterations = 100L,
  eps = 0.001,
  verbose = 0L,
  glm_method = "internal_coef",
  glm_control_maxit = 50L,
  glm_control_eps = 1e-06,
  init_y_method = "pam",
  init_v = NULL,
  ret_x = FALSE,
  ...
)

Arguments

x: Dataset.
clusters: The number of clusters/components to fit the model for.
init_y: Initial central haplotypes, if NULL, these will be estimated as described under the init_y_method argument.
iterations: Maximum number of iterations in the EM-algorithm.
eps: Convergence stop criteria in the EM algorithm which is compared to \(\frac{\max \{ v_{new} - v_{old} \}}{\max \{ v_{old} \}}\), where v is a matrix of each observation's probability of belonging to a certain center.
verbose: from 0 to 2 (both including): 0 for silent, 2 for extra verbose.
glm_method: internal_coef, internal_dev or glm.fit. Please see details.
glm_control_maxit: Integer giving the maximal number of IWLS iterations.
glm_control_eps: Positive convergence tolerance epsilon; the iterations converge when |x - x_{old}|/(|x| + 0.1) < epsilon, where x = beta_correction for internal_coef and x = deviance otherwise.
init_y_method: Which cluster method to use for finding initial central haplotypes, y: pam (recommended) or clara. Ignored if init_y is supplied.
init_v: Matrix with `nrow(x)` rows and `clusters` columns specifying initial posterior probabilities to get EM started, if none specified, then `matrix(1/clusters, nrow = nrow(x), ncol = clusters)` is used.
ret_x: Return data `x`
...: Used to detect obsolete usage (when using parameters centers, use.parallel, calculate.logLs or plots.prefix).

Value

A disclapmixfit object:

list("glm_method"): The supplied GLM method.
list("init_y"): The supplied initial central haplotypes, init_y.
list("init_y_method"): The supplied method for choosing initial central haplotypes (only used if init_y is NULL).
list("converged"): Whether the estimation converged or not.
list("x"): Dataset used to fit the model if `ret_x` is `TRUE`, else `NULL`.
list("y"): The central haplotypes, y.
list("tau"): The prior probabilities of belonging to a cluster, tau.
list("v_matrix"): The matrix v of each observation's probability of belonging to a certain cluster. The rows are in the same order as the observations in x used to generate this fit.
list("disclap_parameters"): A matrix with the estimated dicrete Laplace parameters.
list("glm_coef"): The coefficients from the last GLM fit (used to calculate disclap_parameters).
list("model_observations"): Number of observations.
list("model_parameters"): Number of parameters in the model.
list("iterations"): Number of iterations performed in total (including moving centers and re-estimating using the EM algorithm).
list("logL_full"): Full log likelihood of the final model.
list("logL_marginal"): Marginal log likelihood of the final model.
list("BIC_full"): BIC based on the full log likelihood of the final model.
list("BIC_marginal"): BIC based on the marginal log likelihood of the final model.
list("v_gain_iterations"): The gain \(\frac{\max \{ v_{new} - v_{old} \}}{\max \{ v_{old} \}}\), where v is vic_matrix mentioned above, during the iterations.
list("tau_iterations"): The prior probability of belonging to the centers during the iterations.
list("logL_full_iterations"): Full log likelihood of the models during the iterations (only calculated when verbose = 2L).
list("logL_marginal_iterations"): Marginal log likelihood of the models during the iterations (only calculated when verbose = 2L).
list("BIC_full_iterations"): BIC based on full log likelihood of the models during the iterations (only calculated when verbose = 2L).
list("BIC_marginal_iterations"): BIC based on marginal log likelihood of the models during the iterations (only calculated when verbose = 2L).

Details

glm_method: internal_coef is the fastest as it uses the relative changes in the coefficients as a stopping criterium, hence it does not need to compute the deviance until the very end. In normal situations, it would not be a problem to use this method. internal_dev is the reasonably fast method that uses the deviance as a stopping criterium (like glm.fit). glm.fit to use the traditional glm.fit IWLS implementation and is slow compared to the other two methods.

init_y_method: For init_y_method = 'clara', the sampling parameters are: samples = 100, sampsize = min(ceiling(nrow(x)/2), 100 + 2*clusters) and the random number generator in R is used.

Examples


# Generate sample database
db <- matrix(disclap::rdisclap(1000, 0.3), nrow = 250, ncol = 4)

# Add location parameters
db <- sapply(1:ncol(db), function(i) as.integer(db[, i]+13+i))

head(db)
#>      [,1] [,2] [,3] [,4]
#> [1,]   13   15   17   18
#> [2,]   14   15   15   17
#> [3,]   15   18   15   16
#> [4,]   14   14   16   17
#> [5,]   14   15   14   18
#> [6,]   14   13   16   17

fit1 <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "glm.fit")
#> 2023-01-24 11:22:31: Starting estimation for 1 clusters.
#> 2023-01-24 11:22:31: Estimating initial central haplotypes, y, using pam.
#> 2023-01-24 11:22:31: Initial central haplotypes, y, estimated.
#> 2023-01-24 11:22:31: Starting to generate model matrix.
#> 2023-01-24 11:22:31: Model matrix generated.
#> 2023-01-24 11:22:31: Starting to generate initial response vector.
#> 2023-01-24 11:22:31: Initial response vector done.
#> 2023-01-24 11:22:31: Model matrix and initial vectors has been generated.
#> 2023-01-24 11:22:31: Starting the EM algorithm using glm.fit IRLS method.
#> 2023-01-24 11:22:31: Iteration 1, max|vic - vic_old| / max(vic_old) = 0 / 0 = NaN (eps = 0.001)
#> 2023-01-24 11:22:31: Checking if the central haplotypes, y, are optimal.
#> 2023-01-24 11:22:31: Central haplotypes, y, were optimal, no need to more EM iterations.
#> 2023-01-24 11:22:31: Done.
fit1$disclap_parameters
#>            locus1    locus2    locus3    locus4
#> cluster1 0.301767 0.3225469 0.2832832 0.3239989
fit1$y
#>      locus1 locus2 locus3 locus4
#> [1,]     14     15     16     17

fit1b <- disclapmix(db, clusters = 1L, verbose = 1L, glm_method = "internal_coef")
#> 2023-01-24 11:22:31: Starting estimation for 1 clusters.
#> 2023-01-24 11:22:31: Estimating initial central haplotypes, y, using pam.
#> 2023-01-24 11:22:31: Initial central haplotypes, y, estimated.
#> 2023-01-24 11:22:31: No need to generate model matrix when using internal_coef.
#> 2023-01-24 11:22:31: Starting to generate initial response vector.
#> 2023-01-24 11:22:31: Initial response vector done.
#> 2023-01-24 11:22:31: Model matrix and initial vectors has been generated.
#> 2023-01-24 11:22:31: Starting the EM algorithm using internal_coef IRLS method.
#> 2023-01-24 11:22:31: Iteration 1, max|vic - vic_old| / max(vic_old) = 0 / 0 = NaN (eps = 0.001)
#> 2023-01-24 11:22:31: Checking if the central haplotypes, y, are optimal.
#> 2023-01-24 11:22:31: Central haplotypes, y, were optimal, no need to more EM iterations.
#> 2023-01-24 11:22:31: Done.
fit1b$disclap_parameters
#>            locus1    locus2    locus3    locus4
#> cluster1 0.301767 0.3225469 0.2832832 0.3239989
fit1b$y
#>      locus1 locus2 locus3 locus4
#> [1,]     14     15     16     17

max(abs(fit1$disclap_parameters - fit1b$disclap_parameters))
#> [1] 1.418349e-11

# Generate another type of database
db2 <- matrix(disclap::rdisclap(2000, 0.1), nrow = 500, ncol = 4)
db2 <- sapply(1:ncol(db2), function(i) as.integer(db2[, i]+14+i))
fit2 <- disclapmix(rbind(db, db2), clusters = 2L, verbose = 1L)
#> 2023-01-24 11:22:31: Starting estimation for 2 clusters.
#> 2023-01-24 11:22:31: Estimating initial central haplotypes, y, using pam.
#> 2023-01-24 11:22:31: Initial central haplotypes, y, estimated.
#> 2023-01-24 11:22:31: No need to generate model matrix when using internal_coef.
#> 2023-01-24 11:22:31: Starting to generate initial response vector.
#> 2023-01-24 11:22:31: Initial response vector done.
#> 2023-01-24 11:22:31: Model matrix and initial vectors has been generated.
#> 2023-01-24 11:22:31: Starting the EM algorithm using internal_coef IRLS method.
#> 2023-01-24 11:22:31: Iteration 1, max|vic - vic_old| / max(vic_old) = 0.4992969 / 0 = Inf (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 2, max|vic - vic_old| / max(vic_old) = 0.3665442 / 0.9992969 = 0.3668022 (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 3, max|vic - vic_old| / max(vic_old) = 0.09043604 / 0.9999979 = 0.09043623 (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 4, max|vic - vic_old| / max(vic_old) = 0.02416507 / 0.9999993 = 0.02416509 (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 5, max|vic - vic_old| / max(vic_old) = 0.007433358 / 0.9999994 = 0.007433362 (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 6, max|vic - vic_old| / max(vic_old) = 0.002234629 / 0.9999995 = 0.00223463 (eps = 0.001)
#> 2023-01-24 11:22:31: Iteration 7, max|vic - vic_old| / max(vic_old) = 0.0006675027 / 0.9999995 = 0.0006675031 (eps = 0.001)
#> 
#> 2023-01-24 11:22:31: Stopping after 7 iterations due to convergence, 0.001 > 0.0006675031
#> 
#> 2023-01-24 11:22:31: Checking if the central haplotypes, y, are optimal.
#> 2023-01-24 11:22:31: Central haplotypes, y, were optimal, no need to more EM iterations.
#> 2023-01-24 11:22:31: Done.
fit2$disclap_parameters
#>             locus1    locus2     locus3    locus4
#> cluster1 0.3058177 0.3171258 0.26271136 0.3414355
#> cluster2 0.1014672 0.1052192 0.08716499 0.1132849
fit2$y
#>      locus1 locus2 locus3 locus4
#> [1,]     14     15     16     17
#> [2,]     15     16     17     18
fit2$tau
#> [1] 0.3316232 0.6683768

Discrete Laplace mixture inference using the EM algorithm

Arguments

Value

Details

See also

Examples