
Blocking

Dylan Hall edited this page Dec 21, 2022 · 5 revisions

⚠️ Blocking, as implemented in CODI's Data Owner Tools via anonlink, is not currently recommended for use in production environments. The script and notes below are left for informational purposes. ⚠️

Blocking is currently available as optional functionality, for evaluation purposes, to try to make matching more efficient. After running garble.py, you can run block.py to generate an additional blocking .zip file to send to the linkage agent.

Example run. Note that this uses the default settings: it looks for the CLKs from the garble.py run in output/ and uses the example-schema/blocking-schema/lambda.json LSH blocking configuration (read more about blocking schema here, and more about anonlink's LSH-based blocking approach here):

$ python block.py
Statistics for the generated blocks:
	Number of Blocks:   79
	Minimum Block Size: 1
	Maximum Block Size: 285
	Average Block Size: 31.10126582278481
	Median Block Size:  9
	Standard Deviation of Block Size:  59.10477331947379
Statistics for the generated blocks:
	Number of Blocks:   82
	Minimum Block Size: 1
	Maximum Block Size: 232
	Average Block Size: 29.963414634146343
	Median Block Size:  9
	Standard Deviation of Block Size:  45.5122952108199
Statistics for the generated blocks:
	Number of Blocks:   75
	Minimum Block Size: 1
	Maximum Block Size: 339
	Average Block Size: 32.76
	Median Block Size:  10
	Standard Deviation of Block Size:  61.43725430238738
Statistics for the generated blocks:
	Number of Blocks:   80
	Minimum Block Size: 1
	Maximum Block Size: 307
	Average Block Size: 30.7125
	Median Block Size:  9
	Standard Deviation of Block Size:  58.4333860157515
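
To make the statistics above easier to interpret, here is a minimal sketch of generic Hamming-style LSH blocking on toy bit vectors, followed by the same kind of per-block summary. It is not anonlink's or block.py's implementation: the function names `lsh_blocks` and `block_statistics`, the parameters `num_tables` and `bits_per_key`, and the toy data are all hypothetical, and the use of the population standard deviation is an assumption.

```python
import random
import statistics
from collections import defaultdict


def lsh_blocks(clks, num_tables=2, bits_per_key=4, seed=0):
    """Group records whose bit vectors agree on randomly sampled positions.

    clks: dict mapping a record ID to a list of 0/1 ints (all the same length).
    Returns a dict mapping a block key to the list of record IDs in that block.
    Hypothetical illustration only; not the anonlink API.
    """
    rng = random.Random(seed)
    length = len(next(iter(clks.values())))
    blocks = defaultdict(list)
    for table in range(num_tables):
        # Each table samples its own bit positions to build a block key.
        positions = rng.sample(range(length), bits_per_key)
        for record_id, bits in clks.items():
            key = (table, tuple(bits[p] for p in positions))
            blocks[key].append(record_id)
    return dict(blocks)


def block_statistics(blocks):
    """Summarize block sizes, mirroring the fields in the output above."""
    sizes = [len(ids) for ids in blocks.values()]
    return {
        "num_blocks": len(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "avg_size": statistics.mean(sizes),
        "median_size": statistics.median(sizes),
        # Assumption: population standard deviation; the script may differ.
        "std_size": statistics.pstdev(sizes),
    }


# Toy bit vectors standing in for CLKs (made-up data).
toy_clks = {
    "a": [1, 0, 1, 1, 0, 0, 1, 0],
    "b": [1, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 0, 1, 1, 0, 1],
}
toy_blocks = lsh_blocks(toy_clks)
toy_stats = block_statistics(toy_blocks)
```

In this sketch every record lands in exactly one block per table, so average size times number of blocks equals records times tables. As a sanity check on the real output above, average block size times number of blocks comes to 2457 in all four summaries.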