Integration with existing packages

Setup

library(blocking)
library(reclin2)
#> Loading required package: data.table

Data

This example uses the same dataset as the "Blocking records for record linkage" vignette.

census <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/census.csv")
cis <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/cis.csv")
setDT(census)
setDT(cis)
census[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(census, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.

census[is.na(dob_mon), dob_mon := ""]
census[is.na(dob_year), dob_year := ""]
#> Warning in `[.data.table`(census, is.na(dob_year), `:=`(dob_year, "")):
#> Coercing 'character' RHS to 'integer' to match the type of column 7 named
#> 'dob_year'.

cis[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(cis, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.

cis[is.na(dob_mon), dob_mon := ""]
cis[is.na(dob_year), dob_year := ""]
#> Warning in `[.data.table`(cis, is.na(dob_year), `:=`(dob_year, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 7 named 'dob_year'.

census[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
cis[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
census[, x:=1:.N]
cis[, y:=1:.N]
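The warnings above indicate that the empty string is coerced to integer when assigned into the integer date columns, so the missing values remain `NA` and `paste0()` then writes the literal text "NA" into `txt`. The outputs shown in this document were produced with the code as written. If blank strings are actually wanted, one possible approach is to convert the columns to character first. The following is a minimal sketch on toy data; the table `dt` and its values are hypothetical and only stand in for `census` and `cis`.

```r
library(data.table)

# Toy data standing in for the census/cis tables (hypothetical values)
dt <- data.table(pername1 = c("AVA", "LEWIS"),
                 dob_day  = c(7L, NA),
                 dob_year = c(1969L, NA))

# Convert the integer date columns to character first ...
date_cols <- c("dob_day", "dob_year")
dt[, (date_cols) := lapply(.SD, as.character), .SDcols = date_cols]

# ... then replace missing values with an empty string, column by column
for (col in date_cols) {
  set(dt, i = which(is.na(dt[[col]])), j = col, value = "")
}

# The concatenated key no longer contains the text "NA"
dt[, txt := paste0(pername1, dob_day, dob_year)]
dt$txt
```

Whether blanks or the literal "NA" work better as part of the blocking key depends on the data, so treat this as an option rather than a required fix.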

Integration with the `reclin2` package

The package provides the function `pair_ann`, which is designed to integrate with the `reclin2` package. It works as follows:

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = "txt", 
         deduplication = FALSE)
#>   First data set:  1 000 records
#>   Second data set: 1 000 records
#>   Total number of pairs: 1 000 pairs
#>   Blocking on: 'txt'
#> 
#>          .x    .y block
#>       <int> <int> <num>
#>    1:   204     1     1
#>    2:   204   176     1
#>    3:   204   375     1
#>    4:   204   391     1
#>    5:   204   405     1
#>   ---                  
#>  996:   187   980   498
#>  997:   650   981   499
#>  998:   642   991   500
#>  999:   414   994   501
#> 1000:   733  1000   502

The printed summary reports the total number of pairs and the blocking variable. The result can be passed directly into a `reclin2` pipeline:

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = "txt", 
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold") |> 
  head()
#>   Total number of pairs: 6 pairs
#> 
#> Key: <.y>
#>       .y    .x   person_id.x pername1.x pername2.x  sex.x dob_day.x dob_mon.x
#>    <int> <int>        <char>     <char>     <char> <char>     <int>     <int>
#> 1:    11   945 DE256NG039003    HARRIET    THOMSON      F        12         1
#> 2:    71   427 DE159QA062001      LEWIS      GREEN      M        23         3
#> 3:    83   720 DE237GG025002     IMOGEN      DARIS      F         6         4
#> 4:    99   136 DE125LU022001     DANIEC     MICCER      M        21         4
#> 5:   154   949 DE256NG040002      CHLOE     WILSON      F         5         7
#> 6:   156   549 DE159QY035002        AVA       KING      F         7         7
#>    dob_year.x           enumcap.x enumpc.x
#>         <int>              <char>   <char>
#> 1:       1995 39 SPRINGFIELD ROAD  DE256NG
#> 2:       1973      62 CHURCH ROAD  DE159QA
#> 3:       1968   25 WOODLANDS ROAD  DE237GG
#> 4:       1947        22 PARK LANE  DE125LU
#> 5:       1978 40 SPRINGFIELD ROAD  DE256NG
#> 6:       1969      35 CHURCH ROAD  DE159QY
#>                                               txt.x     x   person_id.y
#>                                              <char> <int>        <char>
#> 1: HARRIETTHOMSONF121199539 SPRINGFIELD ROADDE256NG   945          <NA>
#> 2:          LEWISGREENM233197362 CHURCH ROADDE159QA   427          <NA>
#> 3:       IMOGENDARISF64196825 WOODLANDS ROADDE237GG   720          <NA>
#> 4:          DANIECMICCERM214194722 PARK LANEDE125LU   136          <NA>
#> 5:     CHLOEWILSONF57197840 SPRINGFIELD ROADDE256NG   949          <NA>
#> 6:              AVAKINGF77196935 CHURCH ROADDE159QY   549 DE159QY035002
#>    pername1.y pername2.y  sex.y dob_day.y dob_mon.y dob_year.y
#>        <char>     <char> <char>     <int>     <int>      <int>
#> 1:    HARRIET    THOMSON      F        12         1         NA
#> 2:      LEWIS      GREEN      M        23         3         NA
#> 3:     IMOGEW      DAVIS      F         6         4         NA
#> 4:     DAMIEL     HILLER      M        21         4         NA
#> 5:      CHLOE     WILSOM      F         5         7         NA
#> 6:        AVA       KING      F         7         7         NA
#>              enumcap.y enumpc.y                                          txt.y
#>                 <char>   <char>                                         <char>
#> 1: 39 SPRINGFIELD ROAD  DE256NG HARRIETTHOMSONF121NA39 SPRINGFIELD ROADDE256NG
#> 2:      62 CHURCH ROAD  DE159QA          LEWISGREENM233NA62 CHURCH ROADDE159QA
#> 3:   25 WOODLANDS ROAD  DE237GG       IMOGEWDAVISF64NA25 WOODLANDS ROADDE237GG
#> 4:        22 PARK LANE  DE125LU          DAMIELHILLERM214NA22 PARK LANEDE125LU
#> 5: 40 SPRINGFIELD ROAD  DE256NG     CHLOEWILSOMF57NA40 SPRINGFIELD ROADDE256NG
#> 6:      35 CHURCH ROAD  DE159QY              AVAKINGF77NA35 CHURCH ROADDE159QY
#>        y
#>    <int>
#> 1:    11
#> 2:    71
#> 3:    83
#> 4:    99
#> 5:   154
#> 6:   156

Usage with `fastLink` package

Use the block column as the blocking variable in fastLink::blockData(). The result is a list of blocked records for further processing.
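As a rough illustration, the block identifiers first have to be attached to both datasets. The sketch below does this in base R for a small, hypothetical pairs table with the `.x`, `.y` and `block` columns shown earlier; the data frames `dfA` and `dfB` and their values are invented for the example. The final `fastLink::blockData()` call is left as a comment, and its `varnames` argument is an assumption about that function's interface that should be checked against the fastLink documentation.

```r
# Hypothetical pairs table with the same columns as the pair_ann() output
pairs <- data.frame(.x = c(2L, 2L, 1L),
                    .y = c(1L, 3L, 2L),
                    block = c(1, 1, 2))

# Hypothetical datasets; rows are identified by their position
dfA <- data.frame(name = c("AVA", "LEWIS"))
dfB <- data.frame(name = c("LEWIS", "AVA", "LEVIS"))

# Look up the block of each record via its row index in the pairs table
dfA$block <- pairs$block[match(seq_len(nrow(dfA)), pairs$.x)]
dfB$block <- pairs$block[match(seq_len(nrow(dfB)), pairs$.y)]

# Assumed usage, not run here:
# blocks <- fastLink::blockData(dfA, dfB, varnames = "block")
```

Records that do not appear in the pairs table receive `NA` as their block and may need separate handling.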

Usage with `RecordLinkage` package

Pass the block column through the blockfld argument of compare.dedup() or compare.linkage(). Note that for the RecordLinkage package the block column should be stored as a character vector, not as a numeric or integer vector.
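The type conversion mentioned above can be sketched as follows. The data frames and their values are hypothetical, and the `compare.linkage()` call is shown only as a comment. Passing the column name as a string to `blockfld` is an assumption to verify against the RecordLinkage documentation.

```r
# Hypothetical datasets that already carry a numeric block column
dfA <- data.frame(name = c("AVA", "LEWIS"), block = c(2, 1))
dfB <- data.frame(name = c("LEWIS", "AVA", "LEVIS"), block = c(1, 2, 1))

# Store the block identifiers as character, not numeric/integer
dfA$block <- as.character(dfA$block)
dfB$block <- as.character(dfB$block)

# Assumed usage, not run here:
# rp <- RecordLinkage::compare.linkage(dfA, dfB, blockfld = "block")
```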

Maciej Beręsewicz
