
Setup

library(blocking)
library(reclin2)
#> Loading required package: data.table

Data

In this example we use the same data sets as in the Blocking records for record linkage vignette.

census <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/census.csv")
cis <- read.csv("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/cis.csv")
setDT(census)
setDT(cis)
# Try to replace missing date parts with ""; the date columns are integers,
# so data.table coerces "" back to NA (hence the warnings below)
census[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(census, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.
census[is.na(dob_mon), dob_mon := ""]
census[is.na(dob_year), dob_year := ""]
#> Warning in `[.data.table`(census, is.na(dob_year), `:=`(dob_year, "")):
#> Coercing 'character' RHS to 'integer' to match the type of column 7 named
#> 'dob_year'.
cis[is.na(dob_day), dob_day := ""]
#> Warning in `[.data.table`(cis, is.na(dob_day), `:=`(dob_day, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 5 named 'dob_day'.
cis[is.na(dob_mon), dob_mon := ""]
cis[is.na(dob_year), dob_year := ""]
#> Warning in `[.data.table`(cis, is.na(dob_year), `:=`(dob_year, "")): Coercing
#> 'character' RHS to 'integer' to match the type of column 7 named 'dob_year'.
census[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
cis[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
census[, x:=1:.N]
cis[, y:=1:.N]

Integration with the reclin2 package

The package provides the pair_ann() function, which integrates blocking with the reclin2 package. It works as follows:

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = "txt", 
         deduplication = FALSE)
#>   First data set:  1 000 records
#>   Second data set: 1 000 records
#>   Total number of pairs: 1 000 pairs
#>   Blocking on: 'txt'
#> 
#>          .x    .y block
#>       <int> <int> <num>
#>    1:   204     1     1
#>    2:   204   176     1
#>    3:   204   375     1
#>    4:   204   391     1
#>    5:   204   405     1
#>   ---                  
#>  996:   187   980   498
#>  997:   650   981   499
#>  998:   642   991   500
#>  999:   414   994   501
#> 1000:   733  1000   502
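
To put the pair count in perspective, the figures from the printed summary above can be turned into a reduction ratio, that is, the share of the full cross product that blocking avoids comparing. The sketch below uses base R only; `pairs_toy` is a hypothetical stand-in that borrows the `.x`, `.y` and `block` column names from the output, not the object returned by pair_ann().

```r
# Figures taken from the printed summary above
n_x <- 1000      # records in the first data set
n_y <- 1000      # records in the second data set
n_pairs <- 1000  # candidate pairs kept after blocking

# Share of the full cross product that is never compared
reduction_ratio <- 1 - n_pairs / (n_x * n_y)
reduction_ratio
#> [1] 0.999

# Hypothetical stand-in for the pairs table (same column names as above)
pairs_toy <- data.frame(
  .x    = c(204L, 204L, 204L, 187L, 650L),
  .y    = c(1L, 176L, 375L, 980L, 981L),
  block = c(1, 1, 1, 498, 499)
)

# Number of candidate pairs in each block
pairs_per_block <- table(pairs_toy$block)
pairs_per_block
```

A few very large blocks in such a table would suggest that the blocking is less selective than the overall pair count implies.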

The printed summary reports the total number of candidate pairs. The result can be passed directly into a reclin2 pipeline.

pair_ann(x = census[1:1000], 
         y = cis[1:1000], 
         on = "txt", 
         deduplication = FALSE,
         ann = "hnsw") |>
  compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
  score_simple("score", on = "txt") |>
  select_threshold("threshold", score = "score", threshold = 0.75) |>
  link(selection = "threshold") |> 
  head()
#>   Total number of pairs: 6 pairs
#> 
#> Key: <.y>
#>       .y    .x   person_id.x pername1.x pername2.x  sex.x dob_day.x dob_mon.x
#>    <int> <int>        <char>     <char>     <char> <char>     <int>     <int>
#> 1:    11   945 DE256NG039003    HARRIET    THOMSON      F        12         1
#> 2:    71   427 DE159QA062001      LEWIS      GREEN      M        23         3
#> 3:    83   720 DE237GG025002     IMOGEN      DARIS      F         6         4
#> 4:    99   136 DE125LU022001     DANIEC     MICCER      M        21         4
#> 5:   154   949 DE256NG040002      CHLOE     WILSON      F         5         7
#> 6:   156   549 DE159QY035002        AVA       KING      F         7         7
#>    dob_year.x           enumcap.x enumpc.x
#>         <int>              <char>   <char>
#> 1:       1995 39 SPRINGFIELD ROAD  DE256NG
#> 2:       1973      62 CHURCH ROAD  DE159QA
#> 3:       1968   25 WOODLANDS ROAD  DE237GG
#> 4:       1947        22 PARK LANE  DE125LU
#> 5:       1978 40 SPRINGFIELD ROAD  DE256NG
#> 6:       1969      35 CHURCH ROAD  DE159QY
#>                                               txt.x     x   person_id.y
#>                                              <char> <int>        <char>
#> 1: HARRIETTHOMSONF121199539 SPRINGFIELD ROADDE256NG   945          <NA>
#> 2:          LEWISGREENM233197362 CHURCH ROADDE159QA   427          <NA>
#> 3:       IMOGENDARISF64196825 WOODLANDS ROADDE237GG   720          <NA>
#> 4:          DANIECMICCERM214194722 PARK LANEDE125LU   136          <NA>
#> 5:     CHLOEWILSONF57197840 SPRINGFIELD ROADDE256NG   949          <NA>
#> 6:              AVAKINGF77196935 CHURCH ROADDE159QY   549 DE159QY035002
#>    pername1.y pername2.y  sex.y dob_day.y dob_mon.y dob_year.y
#>        <char>     <char> <char>     <int>     <int>      <int>
#> 1:    HARRIET    THOMSON      F        12         1         NA
#> 2:      LEWIS      GREEN      M        23         3         NA
#> 3:     IMOGEW      DAVIS      F         6         4         NA
#> 4:     DAMIEL     HILLER      M        21         4         NA
#> 5:      CHLOE     WILSOM      F         5         7         NA
#> 6:        AVA       KING      F         7         7         NA
#>              enumcap.y enumpc.y                                          txt.y
#>                 <char>   <char>                                         <char>
#> 1: 39 SPRINGFIELD ROAD  DE256NG HARRIETTHOMSONF121NA39 SPRINGFIELD ROADDE256NG
#> 2:      62 CHURCH ROAD  DE159QA          LEWISGREENM233NA62 CHURCH ROADDE159QA
#> 3:   25 WOODLANDS ROAD  DE237GG       IMOGEWDAVISF64NA25 WOODLANDS ROADDE237GG
#> 4:        22 PARK LANE  DE125LU          DAMIELHILLERM214NA22 PARK LANEDE125LU
#> 5: 40 SPRINGFIELD ROAD  DE256NG     CHLOEWILSOMF57NA40 SPRINGFIELD ROADDE256NG
#> 6:      35 CHURCH ROAD  DE159QY              AVAKINGF77NA35 CHURCH ROADDE159QY
#>        y
#>    <int>
#> 1:    11
#> 2:    71
#> 3:    83
#> 4:    99
#> 5:   154
#> 6:   156

Usage with fastLink package

Pass the block column to the function fastLink::blockData(). The result is a list of blocked records for further processing.
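
Before the block column can be handed to another package, it has to be attached to the records themselves. The following is a minimal base R sketch with hypothetical toy data (`pairs_toy`, `df_a`, `df_b`), assuming a pairs table with `.x`, `.y` and `block` columns as printed earlier. The fastLink call is left as a comment and reflects an assumed usage rather than a tested one.

```r
# Hypothetical toy data; `pairs_toy` mimics the .x / .y / block columns
pairs_toy <- data.frame(.x = c(2L, 2L, 3L), .y = c(1L, 2L, 3L),
                        block = c(1, 1, 2))
df_a <- data.frame(x = 1:3, name = c("ANNA", "JOHN", "MARY"))
df_b <- data.frame(y = 1:3, name = c("JOHM", "JOHN", "MARI"))

# Look up each record's block through its row identifier;
# records that appear in no pair get NA
df_a$block <- pairs_toy$block[match(df_a$x, pairs_toy$.x)]
df_b$block <- pairs_toy$block[match(df_b$y, pairs_toy$.y)]

# Base R view of the blocked records: one data frame per block
# (split() drops records whose block is NA)
blocks_a <- split(df_a, df_a$block)
blocks_b <- split(df_b, df_b$block)

# Not run: assumed fastLink usage, blocking exactly on the new column
# blocked <- fastLink::blockData(df_a, df_b, varnames = "block")
```

Records without a block (here the first row of `df_a`) were not paired with anything, so they can be set aside before linkage.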

Usage with RecordLinkage package

Pass the block column to the blockfld argument of the compare.dedup() or compare.linkage() function. Note that, for the RecordLinkage package, the block column should be stored as a character vector, not a numeric or integer one.
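
The type conversion mentioned above might look as follows. This is a sketch on hypothetical toy data frames (`df_a`, `df_b`) that already carry a numeric block column. The compare.linkage() call is shown as a comment only, as an assumed usage in which both data sets share the same columns.

```r
# Hypothetical toy data with a numeric block column already attached
df_a <- data.frame(name = c("ANNA", "JOHN"), block = c(1, 2))
df_b <- data.frame(name = c("ANMA", "JOHN"), block = c(1, 2))

# Store the block identifier as character, as recommended above
df_a$block <- as.character(df_a$block)
df_b$block <- as.character(df_b$block)

str(df_a$block)

# Not run: assumed RecordLinkage usage, blocking on the `block` column
# rp <- RecordLinkage::compare.linkage(df_a, df_b, blockfld = "block")
```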