Creates shingles (character n-grams; 2 characters by default), searches for approximate nearest neighbours (ANN) using the rnndescent, RcppHNSW, RcppAnnoy and mlpack packages, and forms blocks from the resulting graph using igraph.
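To make the shingling step concrete, the sketch below splits strings into 2-character shingles and compares two strings by cosine similarity, using base R only. The helper names `char_shingles()` and `shingle_cosine()` are hypothetical and not part of the package. It assumes plain shingle counts, so the package's own tokenisation (done via text2vec) may differ in detail.

```r
# Split a string into overlapping character n-grams (shingles).
# Hypothetical helper for illustration only; not a package function.
char_shingles <- function(s, n = 2L) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1L)
  vapply(starts,
         function(i) paste(chars[i:(i + n - 1L)], collapse = ""),
         character(1))
}

# Cosine similarity between the shingle-count vectors of two strings.
shingle_cosine <- function(a, b, n = 2L) {
  sa <- char_shingles(a, n)
  sb <- char_shingles(b, n)
  vocab <- union(sa, sb)
  # Count how often each vocabulary shingle occurs in each string.
  va <- tabulate(match(sa, vocab), nbins = length(vocab))
  vb <- tabulate(match(sb, vocab), nbins = length(vocab))
  sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

char_shingles("monty")
# "mo" "on" "nt" "ty"

shingle_cosine("jankowalski", "kowalskijan")
# 0.9
```

The two names share 9 of their 10 shingles each, which is why strings with reordered parts still end up close to each other under the cosine distance.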

Usage

blocking(
  x,
  y = NULL,
  deduplication = TRUE,
  on = NULL,
  on_blocking = NULL,
  ann = c("nnd", "hnsw", "annoy", "lsh", "kd"),
  distance = c("cosine", "euclidean", "l2", "ip", "manhatan", "hamming", "angular"),
  ann_write = NULL,
  ann_colnames = NULL,
  true_blocks = NULL,
  verbose = c(0, 1, 2),
  graph = FALSE,
  seed = 2023,
  n_threads = 1,
  control_txt = controls_txt(),
  control_ann = controls_ann()
)

Arguments

x

reference data (a character vector or a matrix),

y

query data (a character vector or a matrix); NULL by default, in which case deduplication is performed,

deduplication

whether deduplication should be applied (default TRUE, because y is NULL by default),

on

variables for ANN search (currently not supported),

on_blocking

variables for blocking records before ANN search (currently not supported),

ann

ANN search algorithm (one of "nnd", "hnsw", "annoy", "lsh", "kd"; default "nnd", the nearest neighbour descent method),

distance

distance metric (default "cosine"; for other options see Details),

ann_write

writes the index to a file. Two files are created: 1) the index, and 2) a text file with column names,

ann_colnames

file with column names, used when x or y are indices saved on disk (currently not supported),

true_blocks

matrix with true blocks used to calculate evaluation metrics (standard metrics based on the confusion matrix, as well as all metrics from igraph::compare(), are returned),

verbose

whether a log should be printed (0 = none, 1 = main messages, 2 = verbose output of the ANN algorithm),

graph

whether a graph should be returned (default FALSE),

seed

seed for the algorithms (for reproducibility),

n_threads

number of threads used by the ANN algorithms and for adding data to the index and querying it,

control_txt

list of controls for text data (passed only to text2vec::itoken_parallel or text2vec::itoken),

control_ann

list of controls for the ANN algorithms.

Value

Returns a list containing:

  • result -- data.table with indices (rows) of x, y, block and distance between points,

  • method -- name of the ANN algorithm used,

  • deduplication -- information whether deduplication was applied,

  • metrics -- metrics for quality assessment, if true_blocks is provided,

  • colnames -- variable names (colnames) used for search,

  • graph -- igraph class object.
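As a rough illustration of how nearest-neighbour pairs can be turned into block labels with igraph, the sketch below treats blocks as connected components of an undirected graph. This is an assumption made for illustration, not a statement of the package's internals, and the `pairs` data frame is made up rather than real output of blocking().

```r
library(igraph)

# Hypothetical nearest-neighbour pairs (row indices); not real package output.
pairs <- data.frame(x = c(1, 2, 3, 5, 6, 7),
                    y = c(2, 3, 4, 6, 7, 8))

# Build an undirected graph whose edges link each record to its neighbour.
g <- graph_from_data_frame(pairs, directed = FALSE)

# Assumption: each connected component is treated as one block.
comp <- components(g)
block_of <- setNames(as.integer(comp$membership), V(g)$name)

# Attach the block label to every pair via the vertex name of x.
pairs$block <- unname(block_of[as.character(pairs$x)])

comp$no            # number of blocks found
table(pairs$block) # number of pairs per block
```

Records 1 to 4 and records 5 to 8 are only linked among themselves, so they form two separate blocks.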

Details

Imports

Author

Maciej Beręsewicz

Examples


## an example using RcppHNSW
df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))

result <- blocking(x = df_example$txt,
                   ann = "hnsw",
                   control_ann = controls_ann(hnsw = list(M = 5, ef_c = 10, ef_s = 10)))

result
#> ========================================================
#> Blocking based on the hnsw method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4 
#> 2 

## an example using mlpack::lsh

result_lsh <- blocking(x = df_example$txt,
                       ann = "lsh")

result_lsh
#> ========================================================
#> Blocking based on the lsh method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4 
#> 2