Learned Similarity Joins for Large Tabular Corpora with Error-Controlled Candidate Generation
Abstract
Similarity joins over large tabular corpora are a recurring primitive in entity resolution, record linkage, data integration, and approximate deduplication. In modern settings, similarity is often induced by learned representations that combine heterogeneous attributes, missingness patterns, and domain-specific semantics. While learned similarity can increase match quality, it complicates candidate generation: classical blocking and locality-sensitive hashing are typically tuned to fixed token-level similarity measures and may not provide transparent error control when representations and thresholds evolve. This paper studies learned similarity joins for large tabular corpora with a focus on error-controlled candidate generation. We develop a pipeline in which a learned embedding model maps rows to dense vectors, a differentiable candidate generator proposes a compact set of candidate pairs, and a verification stage computes the final similarity predicate. The central challenge is to bound missed matches while maintaining computational efficiency under distributed execution constraints. We present a probabilistic framework that couples calibrated score distributions with approximate indexing primitives, enabling explicit control of recall loss as a function of candidate budget. The approach integrates constraint-aware training objectives, sketch-based prefilters, and multi-objective optimization over latency, memory, and energy. We analyze complexity, characterize worst-case limitations, and derive practical error bounds for approximate candidate retrieval. The resulting design yields a join operator that can be embedded into query planners and executed at scale with predictable accuracy-efficiency trade-offs.
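To make the three-stage structure concrete, the following is a minimal sketch of the embed / candidate-generation / verification pipeline and of how recall can be estimated as a function of the per-row candidate budget. All function names are hypothetical, the "embedding model" is a toy normalization step, and exhaustive top-k search stands in for the approximate index described in the abstract; this illustrates the interface, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(rows):
    # Stand-in for a learned embedding model: rows are treated as raw
    # feature vectors and simply L2-normalized so that the dot product
    # below is cosine similarity.
    v = np.asarray(rows, dtype=float)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def candidate_pairs(emb_a, emb_b, budget_per_row):
    # Candidate generation under an explicit budget: keep the top-k
    # neighbours of each left row. A real system would replace this
    # exhaustive scan with an approximate index plus sketch prefilters.
    sims = emb_a @ emb_b.T
    topk = np.argsort(-sims, axis=1)[:, :budget_per_row]
    return [(i, int(j)) for i in range(len(emb_a)) for j in topk[i]]

def verify(emb_a, emb_b, pairs, threshold):
    # Verification stage: evaluate the exact similarity predicate only
    # on the (much smaller) candidate set.
    return [(i, j) for i, j in pairs if emb_a[i] @ emb_b[j] >= threshold]

def empirical_recall(found, truth_pairs):
    # Calibration hook: estimated recall for a given candidate budget,
    # measured against a labelled (or exhaustively computed) truth set.
    return len(set(found) & truth_pairs) / max(len(truth_pairs), 1)
```

Because the top-k candidate sets are nested as the budget grows, the estimated recall is non-decreasing in the budget, which is what makes a budget-versus-recall-loss curve meaningful to calibrate against.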
License
Copyright (c) 2021 authors

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.