Integration with existing packages
Maciej Beręsewicz
Source: vignettes/v4-integration.Rmd
Data
This example uses the same datasets as the "Blocking records for record linkage" vignette.
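library(blocking)    # pair_ann()
library(reclin2)     # compare_pairs(), score_simple(), select_threshold(), link()
library(data.table)  # setDT() and the := syntax
census <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/census.csv")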
cis <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/cis.csv")
setDT(census)
setDT(cis)
census[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(census, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.
census[is.na(dob_mon), dob_mon := ""]
census[is.na(dob_year), dob_year := ""]
#> Warning in `[.data.table`(census, is.na(dob_year), `:=`(dob_year, "")):
#> Coercing 'character' RHS to 'integer' to match the type of column 7 named
#> 'dob_year'.
cis[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(cis, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.
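The warnings above arise because the date-of-birth columns are read as integers, so assigning "" is coerced back to integer and the missing values stay NA (visible later as "NA" inside txt.y). The calls below also block on a column named txt that is not created in the code shown; judging by the txt.x values in the output, it appears to be a concatenation of the name, sex, date-of-birth and address fields, but that is an assumption. A minimal sketch of one way to avoid the coercion and build such a key, on a hypothetical toy table rather than the real census and cis data:

```r
library(data.table)

# Toy stand-in for the census/cis tables (hypothetical values and columns)
toy <- data.table(
  pername1 = c("HARRIET", "LEWIS"),
  pername2 = c("THOMSON", "GREEN"),
  dob_day  = c(12L, NA),
  dob_year = c(NA, 1973L)
)

# Convert the integer date parts to character first, so that "" is a valid value
date_cols <- c("dob_day", "dob_year")
toy[, (date_cols) := lapply(.SD, as.character), .SDcols = date_cols]

# Replace missing values with empty strings, column by column
for (col in date_cols) {
  set(toy, i = which(is.na(toy[[col]])), j = col, value = "")
}

# Concatenate the fields into a single blocking key (assumed construction of txt)
toy[, txt := paste0(pername1, pername2, dob_day, dob_year)]
print(toy$txt)
```

With this variant the key would contain no "NA" substrings, so results would differ from the output shown in this vignette, which was produced with the code as written above.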
Integration with the reclin2 package
The package provides the pair_ann function, which is designed to integrate with the reclin2 package. It works as follows:
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = "txt",
         deduplication = FALSE)
#> First data set: 1 000 records
#> Second data set: 1 000 records
#> Total number of pairs: 1 000 pairs
#> Blocking on: 'txt'
#>
#> .x .y block
#> <int> <int> <num>
#> 1: 204 1 1
#> 2: 204 176 1
#> 3: 204 375 1
#> 4: 204 391 1
#> 5: 204 405 1
#> ---
#> 996: 187 980 498
#> 997: 650 981 499
#> 998: 642 991 500
#> 999: 414 994 501
#> 1000: 733 1000 502
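To put the pair count in the summary above into perspective, it can be compared with the number of pairs a full comparison of the two subsets would generate. The sketch below works through that arithmetic using the figures printed above; the reduction ratio is a common blocking measure and is introduced here only for illustration.

```r
# Figures taken from the printed summary above
n_x     <- 1000   # records in the first data set
n_y     <- 1000   # records in the second data set
n_pairs <- 1000   # candidate pairs kept after blocking

# A full comparison would evaluate every record of x against every record of y
n_full <- n_x * n_y

# Share of comparisons avoided by blocking
reduction_ratio <- 1 - n_pairs / n_full
print(n_full)
print(reduction_ratio)
```

Here blocking keeps 1 000 of the 1 000 000 possible pairs, a reduction ratio of 0.999.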
The printed summary reports the total number of candidate pairs. The result can be passed directly into a reclin2 pipeline:
pair_ann(x = census[1:1000],
         y = cis[1:1000],
         on = "txt",
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold") |>
  head()
#> Total number of pairs: 6 pairs
#>
#> Key: <.y>
#> .y .x person_id.x pername1.x pername2.x sex.x dob_day.x dob_mon.x
#> <int> <int> <char> <char> <char> <char> <int> <int>
#> 1: 11 945 DE256NG039003 HARRIET THOMSON F 12 1
#> 2: 71 427 DE159QA062001 LEWIS GREEN M 23 3
#> 3: 83 720 DE237GG025002 IMOGEN DARIS F 6 4
#> 4: 99 136 DE125LU022001 DANIEC MICCER M 21 4
#> 5: 154 949 DE256NG040002 CHLOE WILSON F 5 7
#> 6: 156 549 DE159QY035002 AVA KING F 7 7
#> dob_year.x enumcap.x enumpc.x
#> <int> <char> <char>
#> 1: 1995 39 SPRINGFIELD ROAD DE256NG
#> 2: 1973 62 CHURCH ROAD DE159QA
#> 3: 1968 25 WOODLANDS ROAD DE237GG
#> 4: 1947 22 PARK LANE DE125LU
#> 5: 1978 40 SPRINGFIELD ROAD DE256NG
#> 6: 1969 35 CHURCH ROAD DE159QY
#> txt.x x person_id.y
#> <char> <int> <char>
#> 1: HARRIETTHOMSONF121199539 SPRINGFIELD ROADDE256NG 945 <NA>
#> 2: LEWISGREENM233197362 CHURCH ROADDE159QA 427 <NA>
#> 3: IMOGENDARISF64196825 WOODLANDS ROADDE237GG 720 <NA>
#> 4: DANIECMICCERM214194722 PARK LANEDE125LU 136 <NA>
#> 5: CHLOEWILSONF57197840 SPRINGFIELD ROADDE256NG 949 <NA>
#> 6: AVAKINGF77196935 CHURCH ROADDE159QY 549 DE159QY035002
#> pername1.y pername2.y sex.y dob_day.y dob_mon.y dob_year.y
#> <char> <char> <char> <int> <int> <int>
#> 1: HARRIET THOMSON F 12 1 NA
#> 2: LEWIS GREEN M 23 3 NA
#> 3: IMOGEW DAVIS F 6 4 NA
#> 4: DAMIEL HILLER M 21 4 NA
#> 5: CHLOE WILSOM F 5 7 NA
#> 6: AVA KING F 7 7 NA
#> enumcap.y enumpc.y txt.y
#> <char> <char> <char>
#> 1: 39 SPRINGFIELD ROAD DE256NG HARRIETTHOMSONF121NA39 SPRINGFIELD ROADDE256NG
#> 2: 62 CHURCH ROAD DE159QA LEWISGREENM233NA62 CHURCH ROADDE159QA
#> 3: 25 WOODLANDS ROAD DE237GG IMOGEWDAVISF64NA25 WOODLANDS ROADDE237GG
#> 4: 22 PARK LANE DE125LU DAMIELHILLERM214NA22 PARK LANEDE125LU
#> 5: 40 SPRINGFIELD ROAD DE256NG CHLOEWILSOMF57NA40 SPRINGFIELD ROADDE256NG
#> 6: 35 CHURCH ROAD DE159QY AVAKINGF77NA35 CHURCH ROADDE159QY
#> y
#> <int>
#> 1: 11
#> 2: 71
#> 3: 83
#> 4: 99
#> 5: 154
#> 6: 156
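To make the scoring and thresholding steps of the pipeline above more concrete, the sketch below mimics them in base R on a toy table. The similarity values are made up, and the logic reflects an assumed reading of score_simple (with one comparison variable and default weights, the score equals the similarity) and select_threshold (flag pairs whose score exceeds the threshold); it is not the reclin2 implementation.

```r
# Toy comparison table; .x and .y are row indices, txt holds a
# hypothetical similarity between the txt fields of each pair
pairs <- data.frame(
  .x  = c(945L, 427L, 300L),
  .y  = c(11L, 71L, 12L),
  txt = c(0.93, 0.97, 0.41)
)

# With a single comparison variable, the simple score is that similarity
pairs$score <- pairs$txt

# Flag pairs whose score exceeds the threshold used in the pipeline
pairs$threshold <- pairs$score > 0.75

# Keep only the flagged pairs; these are the ones a linking step would join
selected <- pairs[pairs$threshold, c(".x", ".y")]
print(selected)
```

In this toy table the first two pairs pass the threshold and the third is dropped.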