A practical guide to building a multi-layer deduplication system using PostgreSQL's native text-matching capabilities, Elixir for orchestration and scoring, and an LLM for fuzzy name resolution.
Deduplication works best as layered prevention, not a single pass.
| Layer | When | Purpose |
|---|---|---|
| Search-select-add | User is typing | Autocomplete against existing records prevents duplicates before they exist |
| Real-time check | User clicks Save | Block creation if a likely match exists; let the user decide |
| Batch detection | Scheduled, triggered, or on-demand | Catch everything that slipped through: imports, API-created records, edge cases |
Each layer catches what the previous one missed. The real-time check and batch detection share the same blocking algorithms and AI fallback, so you build the logic once and call it from two contexts.
The core insight: you cannot compare every record to every other record. An org with 10,000 contacts produces 50 million pairs. Blocking narrows the search space by grouping records that share a coarse key, then only comparing within groups.
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- trigram similarity
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch; -- soundex, metaphoneEach strategy generates "blocks" — groups of record IDs that share a key. Use multiple strategies in parallel (UNION ALL) to catch different kinds of duplicates.
The cheapest and most reliable block. Group by a normalized field value.
-- Email blocking: group contacts sharing the same email
SELECT LOWER(email) AS block_key,
array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
AND email IS NOT NULL
AND archived_at IS NULL
GROUP BY LOWER(email)
HAVING count(*) > 1;Works for: email, phone number, tax ID, any field expected to be globally unique.
Soundex encodes a string by how it sounds, catching typos and spelling variations. Smith and Smyth both encode to S530.
-- Name blocking: exact last name + soundex first name
-- Catches: John/Jon, Robert/Robrt, Catherine/Katherine
SELECT 'ln_' || LOWER(last_name) || '_sdx_' || soundex(first_name) AS block_key,
array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
AND first_name IS NOT NULL
AND last_name IS NOT NULL
GROUP BY block_key
HAVING count(*) > 1;Run the inverse too (exact first + soundex last) to catch last-name typos:
SELECT 'fn_' || LOWER(first_name) || '_sdx_' || soundex(last_name) AS block_key,
array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
AND first_name IS NOT NULL
AND last_name IS NOT NULL
GROUP BY block_key
HAVING count(*) > 1;Soundex is fast and index-friendly but coarse. It works best combined with exact matches on a complementary field.
pg_trgm breaks strings into 3-character subsequences and compares overlap. More granular than soundex, but requires a pairwise comparison rather than a simple GROUP BY.
-- Trigram blocking: pairwise self-join within an org
-- c1.id < c2.id avoids duplicate pairs and self-matches
SELECT c1.id AS id_a, c2.id AS id_b,
similarity(
LOWER(c1.first_name || ' ' || c1.last_name),
LOWER(c2.first_name || ' ' || c2.last_name)
) AS sim_score
FROM contact c1
JOIN contact c2
ON c1.organization_id = c2.organization_id
AND c1.id < c2.id
WHERE c1.organization_id = $1
AND similarity(
LOWER(c1.first_name || ' ' || c1.last_name),
LOWER(c2.first_name || ' ' || c2.last_name)
) >= 0.6;Threshold guidance:
- 0.6 for person names (more variation, need wider net)
- 0.65 for company/entity names (shorter, more distinctive)
Trigram is the most expensive strategy — the self-join is inherently O(n^2) within each org. A GIN trigram index helps accelerate % and LIKE/ILIKE operators but does not eliminate the pairwise cost of similarity() in a self-join. Use trigram as the last blocking pass, and consider pre-filtering with cheaper blocks (email, soundex) first to reduce the candidate set.
Every blocking strategy needs an index. Without them, batch jobs on large orgs will time out.
-- Exact match indexes
CREATE INDEX idx_contact_email_lower ON contact (organization_id, LOWER(email));
CREATE INDEX idx_contact_phone_org ON contact (organization_id, phone_number);
-- Soundex indexes
CREATE INDEX idx_contact_soundex_first
ON contact (organization_id, LOWER(last_name), soundex(first_name));
CREATE INDEX idx_contact_soundex_last
ON contact (organization_id, LOWER(first_name), soundex(last_name));
-- Trigram GIN index (accelerates % and LIKE/ILIKE operators;
-- similarity() in a self-join still requires pairwise comparison
-- but the index helps when one side of the comparison is a constant)
CREATE INDEX idx_contact_fullname_trgm ON contact
USING GIN ((LOWER(first_name || ' ' || last_name)) gin_trgm_ops);For the save-time check, wrap blocking into a SQL function that returns full candidate records instead of ID arrays. Use a CTE with priority ordering to deduplicate candidates that appear in multiple blocks:
CREATE FUNCTION find_duplicate_candidates(
p_org_id UUID, p_email TEXT, p_phone TEXT,
p_first_name TEXT, p_last_name TEXT,
p_exclude_id UUID DEFAULT NULL
) RETURNS TABLE (id UUID, block_type TEXT, /* ... other fields */)
AS $$
WITH candidates AS (
-- Strategy 1: email exact (priority 1)
SELECT c.id, 'email' AS block_type, 1 AS priority
FROM contact c
WHERE c.organization_id = p_org_id
AND LOWER(c.email) = LOWER(p_email)
AND (p_exclude_id IS NULL OR c.id != p_exclude_id)
UNION ALL
-- Strategy 2: phone exact (priority 2)
SELECT c.id, 'phone', 2
FROM contact c
WHERE c.organization_id = p_org_id
AND c.phone_number = p_phone
AND (p_exclude_id IS NULL OR c.id != p_exclude_id)
UNION ALL
-- Strategy 3: name exact (priority 3)
SELECT c.id, 'name_exact', 3
FROM contact c
WHERE c.organization_id = p_org_id
AND LOWER(TRIM(c.first_name)) = LOWER(TRIM(p_first_name))
AND LOWER(TRIM(c.last_name)) = LOWER(TRIM(p_last_name))
AND (p_exclude_id IS NULL OR c.id != p_exclude_id)
UNION ALL
-- Strategy 4: name fuzzy — soundex + trigram (priority 4)
SELECT c.id, 'name_fuzzy', 4
FROM contact c
WHERE c.organization_id = p_org_id
AND (
(LOWER(c.first_name) = LOWER(p_first_name)
AND soundex(c.last_name) = soundex(p_last_name))
OR (soundex(c.first_name) = soundex(p_first_name)
AND LOWER(c.last_name) = LOWER(p_last_name))
OR similarity(
LOWER(c.first_name || ' ' || c.last_name),
LOWER(p_first_name || ' ' || p_last_name)
) >= 0.6
)
AND (p_exclude_id IS NULL OR c.id != p_exclude_id)
)
SELECT DISTINCT ON (cand.id) cand.id, cand.block_type
FROM candidates cand
ORDER BY cand.id, cand.priority ASC
LIMIT 50;
$$ LANGUAGE sql STABLE;DISTINCT ON (id) ORDER BY priority ASC ensures each candidate appears once, tagged with its highest-confidence block type.
Full org scans are expensive. After the initial full scan, subsequent runs can use incremental mode — only comparing records that changed since the last run (e.g., the last 24 hours for a daily schedule, or since the last completed run for event-triggered jobs). This is a two-part optimization — the DB narrows the blocks, and the application layer filters the pairs.
Part 1 — DB: prune stale blocks. Add a recency filter so only blocks containing at least one recently-modified member are returned:
HAVING count(*) > 1
AND bool_or(updated_at >= $2) -- at least one member was recently modifiedReturn the recently-modified member IDs alongside the block:
SELECT block_key,
array_agg(id) AS member_ids,
array_agg(id) FILTER (WHERE updated_at >= $2) AS recent_member_ids
FROM ...Part 2 — Application: filter pairs. The DB pruning alone only skips blocks where nothing changed — within a returned block, the application layer must still filter to pairs where at least one member is recent. This is what actually reduces the comparison space. Without this step, you are still expanding full blocks into O(n^2) pairs.
Together, these two steps reduce the effective comparison space from O(n^2) to O(k*n) where k is the number of recently-modified records. Note: for the trigram self-join strategy, you need to rewrite the SQL query itself to drive comparisons from the changed subset (e.g., join recently-changed records against all records, rather than all-against-all) to get the full benefit.
PostgreSQL handles candidate finding. Elixir handles everything else: expanding blocks into pairs, scoring matches, calling AI, clustering results, and persisting groups.
Blocks are arrays of IDs. Expand each block into all unique pairs, then deduplicate across blocks.
defmodule Blocking do
@block_type_priority [:name_exact, :email, :phone, :name_soundex, :trigram]
def expand_to_pairs(blocks, id_key \\ :contact_ids) do
blocks
|> Enum.flat_map(&block_to_pairs(&1, id_key))
|> dedupe_pairs()
end
defp block_to_pairs(block, id_key) do
ids = Map.fetch!(block, id_key)
block_type = block.block_type
for [a, b] <- unique_pairs(ids) do
{min(a, b), max(a, b), block_type}
end
|> filter_by_recency(block[:recent_member_ids])
end
defp dedupe_pairs(pairs) do
pairs
|> Enum.group_by(fn {a, b, _type} -> {a, b} end)
|> Enum.map(fn {{a, b}, group} ->
best_type = group
|> Enum.map(fn {_, _, type} -> type end)
|> Enum.min_by(&priority_index/1)
{a, b, best_type}
end)
end
defp filter_by_recency(pairs, nil), do: pairs
defp filter_by_recency(pairs, recent_ids) do
recent_set = MapSet.new(recent_ids)
Enum.filter(pairs, fn {a, b, _} ->
MapSet.member?(recent_set, a) or MapSet.member?(recent_set, b)
end)
end
defp priority_index(type) do
Enum.find_index(@block_type_priority, &(&1 == type)) || 999
end
endWhen the same pair appears in multiple blocks (e.g., two contacts share both email and name), keep only the highest-confidence block type. The priority list makes this explicit and tunable.
Score each candidate pair using field-level comparisons. Start with the strongest signals and fall through to weaker ones.
defmodule Matching do
def score_pair(record_a, record_b, block_type) do
cond do
block_type == :name_exact ->
{:match, "name_exact", 1.0}
names_match?(record_a, record_b) and emails_match?(record_a, record_b)
and phones_match?(record_a, record_b) ->
{:match, "first_last_email_phone", 1.0}
names_match?(record_a, record_b) and emails_match?(record_a, record_b) ->
{:match, "first_last_email", 0.98}
names_match?(record_a, record_b) and phones_match?(record_a, record_b) ->
{:match, "first_last_phone", 0.95}
emails_match?(record_a, record_b) ->
{:match, "email_exact", 0.90}
phones_match?(record_a, record_b) and last_names_match?(record_a, record_b) ->
{:match, "phone_lastname", 0.85}
true ->
:needs_ai
end
end
endDesign principles:
- Each match type has a fixed confidence. No fuzzy arithmetic.
- The
condchain runs from highest to lowest confidence. First match wins. - AI is the fallback of last resort, not the default path.
Users can mark a pair as "not duplicates" (Keep Separate). Store these as exclusion pairs and filter them before matching:
def filter_excluded(pairs, exclusion_set) do
Enum.reject(pairs, fn {a, b, _type} ->
normalized = {min(a, b), max(a, b)}
MapSet.member?(exclusion_set, normalized)
end)
endAlways normalize pair order (min/max) so {A, B} and {B, A} hash to the same key.
After matching, pairs need to be grouped transitively. If A matches B and B matches C, all three should be in one group — even if A and C were never directly compared.
Union-Find with path compression handles this efficiently:
defmodule Clustering do
def cluster_matches(matches, opts \\ []) do
id_key = Keyword.get(opts, :id_key, :contact_ids)
uf = init_union_find(matches)
uf = union_all_pairs(matches, uf)
uf
|> group_by_root()
|> Enum.reject(fn {_root, members} -> length(members) < 2 end)
|> build_clusters(matches, id_key)
end
defp init_union_find(matches) do
matches
|> Enum.flat_map(fn {a, b, _conf, _type} -> [a, b] end)
|> Enum.uniq()
|> Map.new(fn id -> {id, id} end)
end
defp find(uf, id) do
parent = Map.fetch!(uf, id)
if parent == id do
{id, uf}
else
{root, uf} = find(uf, parent)
{root, Map.put(uf, id, root)} # path compression
end
end
defp union(uf, a, b) do
{root_a, uf} = find(uf, a)
{root_b, uf} = find(uf, b)
if root_a == root_b, do: uf, else: Map.put(uf, root_a, root_b)
end
endSingletons (clusters of 1) are discarded — in practice they should not appear since every union merges two IDs into one component, but filtering them out guards against edge cases in input data.
When presenting duplicate groups to users, suggest the most complete record as the master:
def suggest_master(records, fields) do
Enum.max_by(records, fn record ->
base = Enum.count(fields, fn f ->
val = Map.get(record, f)
val != nil and val != ""
end)
bonus =
(if Map.get(record, :orders_count, 0) > 0, do: 2, else: 0) +
(if Map.get(record, :contacts_count, 0) > 0, do: 1, else: 0)
base + bonus
end)
endThe suggested master is a default, not a mandate. Users can always pick a different one.
Whether triggered by a cron schedule, a manual button, or an event (e.g., after a bulk import), don't process all orgs simultaneously. Use a GenServer worker with bounded concurrency:
@max_concurrent_jobs 8
@max_queue_depth 200
@detection_timeout :timer.minutes(30)Job deduplication: If a detection job for the same {type, org_id} is already running or queued, silently skip new requests. This prevents overlapping schedules or rapid re-triggers from double-queuing a slow org.
Task isolation: Run detection tasks under a Task.Supervisor with async_nolink so a crash in one org's detection doesn't take down the worker or other orgs' tasks.
Queue draining: When a task completes, recursively pop from the queue until all available slots are filled or the queue is empty. This maximizes throughput when multiple tasks finish near-simultaneously.
AI handles what PostgreSQL and deterministic rules cannot: deciding whether "Bob Smith" and "Robert J. Smith, SRA" are the same person, or whether "First National Bank" and "1st Natl Bank, LLC" are the same company.
AI is expensive and slow relative to SQL and Elixir logic. Use it only for candidates that:
- Were found by a fuzzy blocking strategy (soundex or trigram)
- Could not be resolved by deterministic field matching
- Have non-nil name fields to compare
Parse the LLM response into a typed schema. Never regex-parse free text.
schema = %{
query_name: :string,
matched_name: {:string, nullable: true},
confidence: :float, # 0.0 to 1.0
reasoning: :string,
match_index: :integer, # 0-based index into candidates, -1 if no match
entity_type: :string
}Structured output (JSON mode / tool use / schema-constrained generation) eliminates parsing failures and ensures a consistent response shape. Note that the confidence score reflects the model's self-assessment, not a calibrated probability — validate thresholds against labeled data from your domain before relying on them in production.
Person names and company names have different variation patterns. Use separate system prompts:
Person names — instruct the model to account for:
- Middle names and initials (John A. Smith vs John Smith)
- Nicknames (Bob = Robert, Bill = William, Jim = James)
- Professional designations (CPA, PhD, PE after the name)
- Generational suffixes (Jr, Sr, III)
- Hyphenated and maiden names
Company names — instruct the model to account for:
- Entity suffixes (LLC, Inc, Corp, Ltd — may be present or absent)
- Abbreviations (Intl = International, Natl = National, 1st = First)
- DBA names ("doing business as" variations)
- Punctuation and spacing (AT&T vs ATT vs A.T.&T.)
- Acronyms (FBI vs Federal Bureau of Investigation)
Set a hard threshold (0.85 works well) below which AI matches are rejected. This prevents low-confidence guesses from polluting duplicate groups. The threshold should be a single constant shared across all call sites.
@ai_confidence_threshold 0.85
def meets_threshold?(%{confidence: confidence}) do
confidence >= @ai_confidence_threshold
endEach AI call is independent. Parallelize with bounded concurrency and per-call timeouts:
candidates
|> Task.async_stream(
fn candidate -> call_llm(query_name, candidate.name) end,
max_concurrency: 10,
timeout: :timer.seconds(30),
on_timeout: :kill_task
)
|> Enum.flat_map(fn
{:ok, {:ok, match}} -> [match]
_ -> [] # timeout or error — silently drop
end)For the real-time check (user is waiting), cap AI candidates at 5 and use a short timeout (5–10 seconds is typical; 30 seconds is a reasonable hard upper bound but will feel slow in a synchronous UI). For batch detection, you can allow more concurrency and longer timeouts since no one is waiting.
Use a fast, cheap model. Name matching is a classification task, not a reasoning task. A small model at low temperature (0.1) performs well here — low temperature reduces randomness in a task where you want consistent yes/no decisions.
Errors during deduplication should never block the user or crash the system.
| Context | On error |
|---|---|
| Real-time check | Log the error, return "no duplicates found", let the save proceed |
| Batch detection | Log the error for the failing org, continue processing other orgs |
| AI call | Timeout or crash — drop that candidate, proceed with deterministic matches |
| Merge | Fail the transaction, leave the group as pending for retry |
User productivity always takes priority over perfect deduplication. The batch job is the safety net — anything missed by the real-time check will be caught on the next run.
When duplicates are confirmed and merged, every foreign key pointing to the archived records must be remapped to the master.
For each table that references the merged entity, update the foreign key:
-- Reassign all orders from duplicate contacts to the master
UPDATE order SET contact_id = $master_id
WHERE contact_id = ANY($duplicate_ids)
AND organization_id = $org_id;
-- Reassign notes
UPDATE note SET contact_id = $master_id
WHERE contact_id = ANY($duplicate_ids)
AND organization_id = $org_id;Watch for unique constraints: if a child table has a unique constraint on (contact_id, other_column), blindly reassigning can create conflicts. Check for collisions first and merge or remove the conflicting rows before reassigning.
After reassignment:
- Snapshot the duplicate records' data into a merge history table for audit trail
- Archive (soft-delete) the duplicate records
- Mark the duplicate group as merged
Wrap the entire merge in a single transaction (Ecto.Multi in Elixir). If any step fails, the whole merge rolls back. Pay attention to trigger ordering — if a database trigger fires on FK changes, the order of operations within the transaction matters.
[All Records in Org]
|
v
+-----------+
| BLOCKING | PostgreSQL: soundex, trigram, exact match
+-----------+ Produces blocks (groups of IDs sharing a key)
|
v
+-----------+
| EXPAND | Elixir: blocks -> unique pairs, dedupe across blocks
+-----------+ Filter by recency (incremental mode)
|
v
+-----------+
| EXCLUDE | Elixir: remove pairs marked "Keep Separate"
+-----------+
|
v
+-----------+
| MATCH | Elixir: deterministic field rules (email, phone, name)
+-----------+ AI fallback for fuzzy name candidates only
|
v
+-----------+
| CLUSTER | Elixir: Union-Find groups pairs transitively
+-----------+ Discard singletons
|
v
+-----------+
| PERSIST | Elixir: skip groups matching existing ones
+-----------+ Suggest master by data completeness
|
v
[User reviews and merges in UI]
| Technique | Cost | When to use |
|---|---|---|
| Exact field GROUP BY | O(n) with index | Always — cheapest blocking strategy |
| Soundex GROUP BY | O(n) with functional index | Names with typos/variations |
| Trigram self-join | O(n^2) within org, mitigated by GIN index | Catch-all for remaining name fuzziness |
| Incremental filtering | Reduces O(n^2) to O(k*n) | Every run after the first full scan |
| Deterministic scoring | O(1) per pair | Always — fast field comparisons |
| AI scoring | ~500ms per LLM call | Only fuzzy name candidates that fail deterministic rules |
| Union-Find clustering | Nearly O(n) with path compression | Always — groups validated pairs transitively |