Learned Similarity Joins for Large Tabular Corpora with Error-Controlled Candidate Generation
Abstract
Similarity joins over large tabular corpora are a recurring primitive in entity resolution, record linkage, data integration, and approximate deduplication. In modern settings, similarity is often induced by learned representations that combine heterogeneous attributes, missingness patterns, and domain-specific semantics. While learned similarity can increase match quality, it complicates candidate generation: classical blocking and locality-sensitive hashing are typically tuned to fixed token-level similarity measures and may not provide transparent error control when representations and thresholds evolve. This paper studies learned similarity joins for large tabular corpora with a focus on error-controlled candidate generation. We develop a pipeline in which a learned embedding model maps rows to dense vectors, a differentiable candidate generator proposes a compact set of candidate pairs, and a verification stage computes the final similarity predicate. The central challenge is to bound missed matches while maintaining computational efficiency under distributed execution constraints. We present a probabilistic framework that couples calibrated score distributions with approximate indexing primitives, enabling explicit control of recall loss as a function of candidate budget. The approach integrates constraint-aware training objectives, sketch-based prefilters, and multi-objective optimization over latency, memory, and energy. We analyze complexity, characterize worst-case limitations, and derive practical error bounds for approximate candidate retrieval. The resulting design yields a join operator that can be embedded into query planners and executed at scale with predictable accuracy-efficiency trade-offs.
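To make the three-stage structure concrete, the following is a minimal sketch of the embed / candidate-generation / verification pipeline and of how recall can be estimated as a function of the per-row candidate budget. All function names are hypothetical, the "embedding model" is a toy normalization step, and exhaustive top-k search stands in for the approximate index described in the abstract; this illustrates the interface, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(rows):
    # Stand-in for a learned embedding model: rows are treated as raw
    # feature vectors and simply L2-normalized so that the dot product
    # below is cosine similarity.
    v = np.asarray(rows, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def candidate_pairs(emb_a, emb_b, budget_per_row):
    # Candidate generation under an explicit budget: keep the top-k
    # neighbours of each left row. A real system would replace this
    # exhaustive scan with an approximate index plus sketch prefilters.
    sims = emb_a @ emb_b.T
    topk = np.argsort(-sims, axis=1)[:, :budget_per_row]
    return [(i, int(j)) for i in range(len(emb_a)) for j in topk[i]]

def verify(emb_a, emb_b, pairs, threshold):
    # Verification stage: evaluate the exact similarity predicate only
    # on the (much smaller) candidate set.
    return [(i, j) for i, j in pairs if emb_a[i] @ emb_b[j] >= threshold]

def empirical_recall(found, truth_pairs):
    # Calibration hook: estimated recall for a given candidate budget,
    # measured against a labelled (or exhaustively computed) truth set.
    return len(set(found) & truth_pairs) / max(len(truth_pairs), 1)
```

Because the top-k candidate sets are nested as the budget grows, the estimated recall is non-decreasing in the budget, which is what makes a budget-versus-recall-loss curve meaningful to calibrate against.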
License
Copyright (c) 2021 authors

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.