The function creates shingles (by default, strings of 2 characters), applies approximate nearest neighbour (ANN) algorithms via the rnndescent, RcppHNSW, RcppAnnoy and mlpack packages, and creates blocks using graphs via igraph.
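To make the notion of shingles concrete, here is a minimal base-R sketch of splitting a string into overlapping 2-character pieces. The helper `make_shingles()` is hypothetical and not part of the package, which tokenizes text through text2vec; this only illustrates the idea.

```r
# Hypothetical helper (not part of the package): split one string into
# overlapping character n-grams ("shingles").
make_shingles <- function(txt, n = 2L) {
  stopifnot(is.character(txt), length(txt) == 1L, n >= 1L)
  len <- nchar(txt)
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1L)
  substring(txt, starts, starts + n - 1L)
}

make_shingles("monty")
#> [1] "mo" "on" "nt" "ty"

# Number of distinct 2-character shingles across the strings from the Examples
txt <- c("jankowalski", "kowalskijan", "kowalskimjan", "kowaljan",
         "montypython", "pythonmonty", "cyrkmontypython", "monty")
n_unique <- length(unique(unlist(lapply(txt, make_shingles))))
n_unique
#> [1] 28
```

The count of 28 distinct shingles agrees with the "Number of columns used for blocking: 28" line printed in the Examples, which suggests that each distinct shingle becomes one column of the matrix passed to the ANN search.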
Usage
blocking(
  x,
  y = NULL,
  deduplication = TRUE,
  on = NULL,
  on_blocking = NULL,
  ann = c("nnd", "hnsw", "annoy", "lsh", "kd"),
  distance = c("cosine", "euclidean", "l2", "ip", "manhatan", "hamming", "angular"),
  ann_write = NULL,
  ann_colnames = NULL,
  true_blocks = NULL,
  verbose = c(0, 1, 2),
  graph = FALSE,
  seed = 2023,
  n_threads = 1,
  control_txt = controls_txt(),
  control_ann = controls_ann()
)
Arguments
- x
reference data (a character vector or a matrix),
- y
query data (a character vector or a matrix); defaults to NULL, in which case deduplication is performed,
- deduplication
whether deduplication should be applied (default TRUE as y is set to NULL),
- on
variables for ANN search (currently not supported),
- on_blocking
variables for blocking records before ANN search (currently not supported),
- ann
algorithm to be used for the ANN search (possible values: c("nnd", "hnsw", "annoy", "lsh", "kd"); default "nnd", which corresponds to the nearest neighbour descent method),
- distance
distance metric (default cosine; more options are possible, see details),
- ann_write
writing an index to file. Two files will be created: 1) an index, and 2) a text file with column names,
- ann_colnames
file with column names if x or y are indices saved on the disk (currently not supported),
- true_blocks
matrix with true blocks to calculate evaluation metrics (standard metrics based on the confusion matrix as well as all metrics from igraph::compare() are returned),
- verbose
whether a log should be provided (0 = none, 1 = main, 2 = ANN algorithm verbose used),
- graph
whether a graph should be returned (default FALSE),
- seed
seed for the algorithms (for reproducibility),
- n_threads
number of threads used for the ANN algorithms and adding data for index and query,
- control_txt
list of controls for text data (passed only to text2vec::itoken_parallel or text2vec::itoken),
- control_ann
list of controls for the ANN algorithms.
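As an illustration of the default cosine setting for distance, the sketch below computes a cosine distance between the 2-character shingle counts of two strings in base R. It assumes the usual definition (one minus cosine similarity); the ANN back ends may scale or define the metric differently, and both helpers are hypothetical.

```r
# Hypothetical helpers (not part of the package).
shingles <- function(txt, n = 2L) {
  starts <- seq_len(max(nchar(txt) - n + 1L, 0L))
  substring(txt, starts, starts + n - 1L)
}

# Cosine distance between shingle-count vectors, assuming the usual
# definition 1 - (a . b) / (|a| * |b|).
cosine_distance <- function(a, b) {
  sa <- shingles(a)
  sb <- shingles(b)
  vocab <- union(sa, sb)
  va <- as.numeric(table(factor(sa, levels = vocab)))
  vb <- as.numeric(table(factor(sb, levels = vocab)))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# Strings sharing many shingles are close; strings sharing none are at distance 1
cosine_distance("montypython", "pythonmonty")
cosine_distance("montypython", "jankowalski")
```

The second pair shares no 2-character shingle, so its distance is exactly 1, while the first pair differs only in the shingle that spans the boundary between the two words.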
Value
Returns a list containing:
- result -- data.table with indices (rows) of x, y, block and distance between points,
- method -- name of the ANN algorithm used,
- deduplication -- information whether deduplication was applied,
- metrics -- metrics for quality assessment, if true_blocks is provided,
- colnames -- variable names (colnames) used for search,
- graph -- igraph class object.
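The block column can be read as a label shared by records that are linked through their nearest neighbours. The following sketch assumes that blocks correspond to connected components of the neighbour graph; this is an assumption, since the documentation only states that blocks are created using graphs via igraph. The neighbour pairs are made up for illustration.

```r
library(igraph)

# Made-up nearest-neighbour pairs: record index x and the index y of its neighbour
pairs <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                    y = c(2, 3, 4, 1, 6, 7, 8, 5))

# Undirected graph with one vertex per record and one edge per pair
g <- graph_from_data_frame(pairs, directed = FALSE)

# Assumption: a block is a connected component of this graph
membership <- components(g)$membership
pairs$block <- unname(membership[as.character(pairs$x)])
pairs
```

Here records 1 to 4 end up in one block and records 5 to 8 in another, because no edge connects the two groups.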
Examples
## an example using RcppHNSW
df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
                                 "kowaljan", "montypython", "pythonmonty",
                                 "cyrkmontypython", "monty"))
result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))
result
#> ========================================================
#> Blocking based on the hnsw method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2
## an example using mlpack::lsh
result_lsh <- blocking(x = df_example$txt,
                       ann = "lsh")
result_lsh
#> ========================================================
#> Blocking based on the lsh method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2
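The reduction ratio printed above can be reproduced by hand. The sketch below assumes the standard deduplication definition, one minus the share of record pairs that remain to be compared within blocks, and reads the printed distribution as two blocks of size 4; the helper `reduction_ratio()` is hypothetical and not a package function.

```r
# Hypothetical helper (not part of the package): reduction ratio for
# deduplication, assuming RR = 1 - (pairs within blocks) / (all pairs).
reduction_ratio <- function(block_sizes) {
  n <- sum(block_sizes)
  1 - sum(choose(block_sizes, 2)) / choose(n, 2)
}

# Two blocks of 4 records each: 2 * 6 = 12 of the 28 possible pairs remain
round(reduction_ratio(c(4, 4)), 4)
#> [1] 0.5714
```

Under this assumption the value matches the 0.5714 reported for both the hnsw and lsh runs.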