Function for integration with the reclin2 package. It is based on reclin2::pair_minsim() and reuses some of its source code.

Usage

pair_ann(
  x,
  y = NULL,
  on,
  on_blocking = NULL,
  deduplication = TRUE,
  keep_block = TRUE,
  add_xy = TRUE,
  ...
)

Arguments

x

reference data (a data.frame or a data.table),

y

query data (a data.frame or a data.table, default NULL),

on

a character vector with column names for the ANN search,

on_blocking

blocking variables (currently not supported),

deduplication

whether deduplication should be performed (default TRUE),

keep_block

whether to keep the block variable in the set,

add_xy

whether to add x and y as attributes to the returned pairs,

...

additional arguments passed to the blocking() function.

Value

Returns a data.table with two columns, .x and .y, which hold row numbers from x and y respectively. The returned data.table also has the class pairs, which allows it to be passed on to reclin2 functions such as reclin2::compare_pairs().

Author

Maciej Beręsewicz

Examples


# example using two datasets from reclin2

library(reclin2)
#> Loading required package: data.table

data("linkexample1", "linkexample2", package = "reclin2")

linkexample1$txt <- with(linkexample1, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample1$txt <- gsub("\\s+", "", linkexample1$txt)
linkexample2$txt <- with(linkexample2, tolower(paste0(firstname, lastname, address, sex, postcode)))
linkexample2$txt <- gsub("\\s+", "", linkexample2$txt)

# pairing records from linkexample2 to linkexample1 based on txt column

pair_ann(x = linkexample1, y = linkexample2, on = "txt", deduplication = FALSE) |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold")
#>   Total number of pairs: 5 pairs
#> 
#> Key: <.y>
#>       .y    .x  id.x lastname.x firstname.x  address.x  sex.x postcode.x
#>    <int> <int> <int>     <fctr>      <fctr>     <fctr> <fctr>     <fctr>
#> 1:     1     2     2      Smith      George 12 Mainstr      M    1234 AB
#> 2:     2     3     3    Johnson        Anna 61 Mainstr      F    1234 AB
#> 3:     3     4     4    Johnson     Charles 61 Mainstr      M    1234 AB
#> 4:     4     6     6   Schwartz         Ben  1 Eaststr      M    6789 XY
#> 5:     5     6     6   Schwartz         Ben  1 Eaststr      M    6789 XY
#>                             txt.x  id.y lastname.y firstname.y     address.y
#>                            <char> <int>     <fctr>      <fctr>        <fctr>
#> 1:    georgesmith12mainstrm1234ab     2      Smith      Gearge 12 Mainstreet
#> 2:    annajohnson61mainstrf1234ab     3     Jonson          A. 61 Mainstreet
#> 3: charlesjohnson61mainstrm1234ab     4    Johnson     Charles    61 Mainstr
#> 4:     benschwartz1eaststrm6789xy     6   Schwartz         Ben        1 Main
#> 5:     benschwartz1eaststrm6789xy     7   Schwartz        Anna     1 Eaststr
#>     sex.y postcode.y                           txt.y
#>    <fctr>     <fctr>                          <char>
#> 1:   <NA>    1234 AB geargesmith12mainstreetna1234ab
#> 2:      F    1234 AB     a.jonson61mainstreetf1234ab
#> 3:      F    1234 AB  charlesjohnson61mainstrf1234ab
#> 4:      M    6789 XY         benschwartz1mainm6789xy
#> 5:      F    6789 XY     annaschwartz1eaststrf6789xy
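
The example above builds the txt column twice with the same two steps (paste0() followed by gsub()). As a minimal sketch, those steps could be wrapped in a small helper. Note that make_txt() and the toy data below are hypothetical and are not part of blocking or reclin2; the helper assumes a plain data.frame rather than a data.table.

```r
# Hypothetical helper (not part of blocking or reclin2): builds a
# whitespace-free, lower-case key from the chosen columns of a data.frame.
make_txt <- function(df, cols) {
  # paste0() converts factors to their labels and NA to the string "NA"
  key <- do.call(paste0, unname(as.list(df[cols])))
  # lower-case the key and strip all whitespace, as in the example above
  gsub("\\s+", "", tolower(key))
}

# Toy data for illustration only
toy <- data.frame(
  firstname = c("George", "Anna"),
  lastname  = c("Smith", "Johnson"),
  address   = c("12 Mainstr", "61 Mainstr"),
  sex       = c("M", NA),
  stringsAsFactors = TRUE
)

toy$txt <- make_txt(toy, c("firstname", "lastname", "address", "sex"))
print(toy$txt)
```

Missing values end up in the key as the literal text "na", which is also visible in the txt.y value of the first linked pair above. Whether that is desirable depends on the data; replacing NA with an empty string before pasting is one alternative.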