Learned Similarity Joins for Large Tabular Corpora with Error-Controlled Candidate Generation

Authors

  • Paolo Reyes Luzon College of Technology, Department of Computer Science, Aurora Boulevard, Quezon City, Philippines
  • Jomar Villanueva Mindanao Institute of Computing, Department of Computer Science, J.P. Laurel Avenue, Davao City, Philippines

Abstract

Similarity joins over large tabular corpora are a recurring primitive in entity resolution, record linkage, data integration, and approximate deduplication. In modern settings, similarity is often induced by learned representations that combine heterogeneous attributes, missingness patterns, and domain-specific semantics. While learned similarity can increase match quality, it complicates candidate generation because classical blocking and locality-sensitive hashing are typically tuned to fixed token-level similarity measures and may not provide transparent error control when representations and thresholds evolve. This paper studies learned similarity joins for large tabular corpora with a focus on error-controlled candidate generation. We develop a pipeline in which a learned embedding model maps rows to dense vectors, a differentiable candidate generator proposes a compact set of candidate pairs, and a verification stage computes the final similarity predicate. The central challenge is to bound missed matches while maintaining computational efficiency under distributed execution constraints. We present a probabilistic framework that couples calibrated score distributions with approximate indexing primitives, enabling explicit control of recall loss as a function of candidate budget. The approach integrates constraint-aware training objectives, sketch-based prefilters, and multi-objective optimization over latency, memory, and energy. We analyze complexity, provide worst-case limitations, and derive practical error bounds for approximate candidate retrieval. The resulting design yields a join operator that can be embedded into query planners and executed at scale with predictable accuracy-efficiency trade-offs.
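To make the three-stage pipeline and the budget-versus-recall trade-off concrete, the following is a minimal sketch rather than the method developed in the paper. It assumes rows have already been mapped to dense vectors by some embedding model (the toy vectors below are made up), uses cosine similarity as the verification predicate, and replaces the approximate index with a brute-force top-k ranking. All function names (`exact_join`, `candidate_pairs`, `budgeted_join`, `recall`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all-zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def exact_join(left, right, threshold):
    """Reference join: verify every pair. Quadratic cost, no missed matches."""
    return {
        (i, j)
        for i, u in enumerate(left)
        for j, v in enumerate(right)
        if cosine(u, v) >= threshold
    }


def candidate_pairs(left, right, budget):
    """Keep the `budget` highest-scoring right rows for each left row.

    Assumption: a brute-force ranking stands in for an approximate index.
    """
    pairs = set()
    for i, u in enumerate(left):
        ranked = sorted(
            range(len(right)),
            key=lambda j: cosine(u, right[j]),
            reverse=True,
        )
        pairs.update((i, j) for j in ranked[:budget])
    return pairs


def budgeted_join(left, right, threshold, budget):
    """Candidate generation followed by verification of the final predicate."""
    return {
        (i, j)
        for (i, j) in candidate_pairs(left, right, budget)
        if cosine(left[i], right[j]) >= threshold
    }


def recall(found, truth):
    """Fraction of true matches that survived candidate generation."""
    return 1.0 if not truth else len(found & truth) / len(truth)


# Toy embeddings (hypothetical): two left rows, three right rows.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
truth = exact_join(left, right, threshold=0.9)

for budget in (1, 2):
    found = budgeted_join(left, right, threshold=0.9, budget=budget)
    print(budget, len(found), recall(found, truth))
```

In this toy setting a budget of one candidate per row misses one of the three true matches, while a budget of two recovers all of them. Because verification only filters candidates, every missed match in this sketch is attributable to candidate generation, which is the stage where the abstract places the error control.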

Published

2021-02-04