Skip to content

Instantly share code, notes, and snippets.

@cpursley
Created March 24, 2026 18:11
Show Gist options
  • Select an option

  • Save cpursley/dfb69728caf0b1715d03ec99eda0c531 to your computer and use it in GitHub Desktop.

Select an option

Save cpursley/dfb69728caf0b1715d03ec99eda0c531 to your computer and use it in GitHub Desktop.
Hybrid Deduplication (Postgres + Elixir + AI)

Deduplication with PostgreSQL, Elixir, and AI

A practical guide to building a multi-layer deduplication system using PostgreSQL's native text-matching capabilities, Elixir for orchestration and scoring, and an LLM for fuzzy name resolution.


Strategy: Three Lines of Defense

Deduplication works best as layered prevention, not a single pass.

Layer When Purpose
Search-select-add User is typing Autocomplete against existing records prevents duplicates before they exist
Real-time check User clicks Save Block creation if a likely match exists; let the user decide
Batch detection Scheduled, triggered, or on-demand Catch everything that slipped through: imports, API-created records, edge cases

Each layer catches what the previous one missed. The real-time check and batch detection share the same blocking algorithms and AI fallback, so you build the logic once and call it from two contexts.


PostgreSQL: Blocking and Candidate Finding

The core insight: you cannot compare every record to every other record. An org with 10,000 contacts produces 50 million pairs. Blocking narrows the search space by grouping records that share a coarse key, then only comparing within groups.

Extensions You Need

CREATE EXTENSION IF NOT EXISTS pg_trgm;        -- trigram similarity
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;  -- soundex, metaphone

Blocking Strategies

Each strategy generates "blocks" — groups of record IDs that share a key. Use multiple strategies in parallel (UNION ALL) to catch different kinds of duplicates.

1. Exact Field Match

The cheapest and most reliable block. Group by a normalized field value.

-- Email blocking: group contacts sharing the same email
SELECT LOWER(email) AS block_key,
       array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
  AND email IS NOT NULL
  AND archived_at IS NULL
GROUP BY LOWER(email)
HAVING count(*) > 1;

Works for: email, phone number, tax ID, any field expected to be globally unique.

2. Soundex (Phonetic Matching)

Soundex encodes a string by how it sounds, catching typos and spelling variations. Smith and Smyth both encode to S530.

-- Name blocking: exact last name + soundex first name
-- Catches: John/Jon, Robert/Robrt, Catherine/Katherine
SELECT 'ln_' || LOWER(last_name) || '_sdx_' || soundex(first_name) AS block_key,
       array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
  AND first_name IS NOT NULL
  AND last_name IS NOT NULL
GROUP BY block_key
HAVING count(*) > 1;

Run the inverse too (exact first + soundex last) to catch last-name typos:

SELECT 'fn_' || LOWER(first_name) || '_sdx_' || soundex(last_name) AS block_key,
       array_agg(id) AS member_ids
FROM contact
WHERE organization_id = $1
  AND first_name IS NOT NULL
  AND last_name IS NOT NULL
GROUP BY block_key
HAVING count(*) > 1;

Soundex is fast and index-friendly but coarse. It works best combined with exact matches on a complementary field.

3. Trigram Similarity

pg_trgm breaks strings into 3-character subsequences and compares overlap. More granular than soundex, but requires a pairwise comparison rather than a simple GROUP BY.

-- Trigram blocking: pairwise self-join within an org
-- c1.id < c2.id avoids duplicate pairs and self-matches
SELECT c1.id AS id_a, c2.id AS id_b,
       similarity(
         LOWER(c1.first_name || ' ' || c1.last_name),
         LOWER(c2.first_name || ' ' || c2.last_name)
       ) AS sim_score
FROM contact c1
JOIN contact c2
  ON c1.organization_id = c2.organization_id
  AND c1.id < c2.id
WHERE c1.organization_id = $1
  AND similarity(
    LOWER(c1.first_name || ' ' || c1.last_name),
    LOWER(c2.first_name || ' ' || c2.last_name)
  ) >= 0.6;

Threshold guidance:

  • 0.6 for person names (more variation, need wider net)
  • 0.65 for company/entity names (shorter, more distinctive)

Trigram is the most expensive strategy — the self-join is inherently O(n^2) within each org. A GIN trigram index helps accelerate % and LIKE/ILIKE operators but does not eliminate the pairwise cost of similarity() in a self-join. Use trigram as the last blocking pass, and consider pre-filtering with cheaper blocks (email, soundex) first to reduce the candidate set.

Indexes

Every blocking strategy needs an index. Without them, batch jobs on large orgs will time out.

-- Exact match indexes
CREATE INDEX idx_contact_email_lower ON contact (organization_id, LOWER(email));
CREATE INDEX idx_contact_phone_org ON contact (organization_id, phone_number);

-- Soundex indexes
CREATE INDEX idx_contact_soundex_first
  ON contact (organization_id, LOWER(last_name), soundex(first_name));
CREATE INDEX idx_contact_soundex_last
  ON contact (organization_id, LOWER(first_name), soundex(last_name));

-- Trigram GIN index (accelerates % and LIKE/ILIKE operators;
-- similarity() in a self-join still requires pairwise comparison
-- but the index helps when one side of the comparison is a constant)
CREATE INDEX idx_contact_fullname_trgm ON contact
  USING GIN ((LOWER(first_name || ' ' || last_name)) gin_trgm_ops);

Real-Time Candidate Functions

For the save-time check, wrap blocking into a SQL function that returns full candidate records instead of ID arrays. Use a CTE with priority ordering to deduplicate candidates that appear in multiple blocks:

CREATE FUNCTION find_duplicate_candidates(
  p_org_id UUID, p_email TEXT, p_phone TEXT,
  p_first_name TEXT, p_last_name TEXT,
  p_exclude_id UUID DEFAULT NULL
) RETURNS TABLE (id UUID, block_type TEXT, /* ... other fields */)
AS $$
  WITH candidates AS (
    -- Strategy 1: email exact (priority 1)
    SELECT c.id, 'email' AS block_type, 1 AS priority
    FROM contact c
    WHERE c.organization_id = p_org_id
      AND LOWER(c.email) = LOWER(p_email)
      AND (p_exclude_id IS NULL OR c.id != p_exclude_id)

    UNION ALL

    -- Strategy 2: phone exact (priority 2)
    SELECT c.id, 'phone', 2
    FROM contact c
    WHERE c.organization_id = p_org_id
      AND c.phone_number = p_phone
      AND (p_exclude_id IS NULL OR c.id != p_exclude_id)

    UNION ALL

    -- Strategy 3: name exact (priority 3)
    SELECT c.id, 'name_exact', 3
    FROM contact c
    WHERE c.organization_id = p_org_id
      AND LOWER(TRIM(c.first_name)) = LOWER(TRIM(p_first_name))
      AND LOWER(TRIM(c.last_name)) = LOWER(TRIM(p_last_name))
      AND (p_exclude_id IS NULL OR c.id != p_exclude_id)

    UNION ALL

    -- Strategy 4: name fuzzy — soundex + trigram (priority 4)
    SELECT c.id, 'name_fuzzy', 4
    FROM contact c
    WHERE c.organization_id = p_org_id
      AND (
        (LOWER(c.first_name) = LOWER(p_first_name)
          AND soundex(c.last_name) = soundex(p_last_name))
        OR (soundex(c.first_name) = soundex(p_first_name)
          AND LOWER(c.last_name) = LOWER(p_last_name))
        OR similarity(
             LOWER(c.first_name || ' ' || c.last_name),
             LOWER(p_first_name || ' ' || p_last_name)
           ) >= 0.6
      )
      AND (p_exclude_id IS NULL OR c.id != p_exclude_id)
  )
  SELECT DISTINCT ON (cand.id) cand.id, cand.block_type
  FROM candidates cand
  ORDER BY cand.id, cand.priority ASC
  LIMIT 50;
$$ LANGUAGE sql STABLE;

DISTINCT ON (id) ORDER BY priority ASC ensures each candidate appears once, tagged with its highest-confidence block type.

Incremental Mode

Full org scans are expensive. After the initial full scan, subsequent runs can use incremental mode — only comparing records that changed since the last run (e.g., the last 24 hours for a daily schedule, or since the last completed run for event-triggered jobs). This is a two-part optimization — the DB narrows the blocks, and the application layer filters the pairs.

Part 1 — DB: prune stale blocks. Add a recency filter so only blocks containing at least one recently-modified member are returned:

HAVING count(*) > 1
  AND bool_or(updated_at >= $2)  -- at least one member was recently modified

Return the recently-modified member IDs alongside the block:

SELECT block_key,
       array_agg(id) AS member_ids,
       array_agg(id) FILTER (WHERE updated_at >= $2) AS recent_member_ids
FROM ...

Part 2 — Application: filter pairs. The DB pruning alone only skips blocks where nothing changed — within a returned block, the application layer must still filter to pairs where at least one member is recent. This is what actually reduces the comparison space. Without this step, you are still expanding full blocks into O(n^2) pairs.

Together, these two steps reduce the effective comparison space from O(n^2) to O(k*n) where k is the number of recently-modified records. Note: for the trigram self-join strategy, you need to rewrite the SQL query itself to drive comparisons from the changed subset (e.g., join recently-changed records against all records, rather than all-against-all) to get the full benefit.


Elixir: Orchestration, Scoring, and Clustering

PostgreSQL handles candidate finding. Elixir handles everything else: expanding blocks into pairs, scoring matches, calling AI, clustering results, and persisting groups.

Pair Expansion and Deduplication

Blocks are arrays of IDs. Expand each block into all unique pairs, then deduplicate across blocks.

defmodule Blocking do
  @block_type_priority [:name_exact, :email, :phone, :name_soundex, :trigram]

  def expand_to_pairs(blocks, id_key \\ :contact_ids) do
    blocks
    |> Enum.flat_map(&block_to_pairs(&1, id_key))
    |> dedupe_pairs()
  end

  defp block_to_pairs(block, id_key) do
    ids = Map.fetch!(block, id_key)
    block_type = block.block_type

    for [a, b] <- unique_pairs(ids) do
      {min(a, b), max(a, b), block_type}
    end
    |> filter_by_recency(block[:recent_member_ids])
  end

  defp dedupe_pairs(pairs) do
    pairs
    |> Enum.group_by(fn {a, b, _type} -> {a, b} end)
    |> Enum.map(fn {{a, b}, group} ->
      best_type = group
        |> Enum.map(fn {_, _, type} -> type end)
        |> Enum.min_by(&priority_index/1)
      {a, b, best_type}
    end)
  end

  defp filter_by_recency(pairs, nil), do: pairs
  defp filter_by_recency(pairs, recent_ids) do
    recent_set = MapSet.new(recent_ids)
    Enum.filter(pairs, fn {a, b, _} ->
      MapSet.member?(recent_set, a) or MapSet.member?(recent_set, b)
    end)
  end

  defp priority_index(type) do
    Enum.find_index(@block_type_priority, &(&1 == type)) || 999
  end
end

When the same pair appears in multiple blocks (e.g., two contacts share both email and name), keep only the highest-confidence block type. The priority list makes this explicit and tunable.

Deterministic Matching

Score each candidate pair using field-level comparisons. Start with the strongest signals and fall through to weaker ones.

defmodule Matching do
  def score_pair(record_a, record_b, block_type) do
    cond do
      block_type == :name_exact ->
        {:match, "name_exact", 1.0}

      names_match?(record_a, record_b) and emails_match?(record_a, record_b)
        and phones_match?(record_a, record_b) ->
        {:match, "first_last_email_phone", 1.0}

      names_match?(record_a, record_b) and emails_match?(record_a, record_b) ->
        {:match, "first_last_email", 0.98}

      names_match?(record_a, record_b) and phones_match?(record_a, record_b) ->
        {:match, "first_last_phone", 0.95}

      emails_match?(record_a, record_b) ->
        {:match, "email_exact", 0.90}

      phones_match?(record_a, record_b) and last_names_match?(record_a, record_b) ->
        {:match, "phone_lastname", 0.85}

      true ->
        :needs_ai
    end
  end
end

Design principles:

  • Each match type has a fixed confidence. No fuzzy arithmetic.
  • The cond chain runs from highest to lowest confidence. First match wins.
  • AI is the fallback of last resort, not the default path.

Exclusion Filtering

Users can mark a pair as "not duplicates" (Keep Separate). Store these as exclusion pairs and filter them before matching:

def filter_excluded(pairs, exclusion_set) do
  Enum.reject(pairs, fn {a, b, _type} ->
    normalized = {min(a, b), max(a, b)}
    MapSet.member?(exclusion_set, normalized)
  end)
end

Always normalize pair order (min/max) so {A, B} and {B, A} hash to the same key.

Clustering with Union-Find

After matching, pairs need to be grouped transitively. If A matches B and B matches C, all three should be in one group — even if A and C were never directly compared.

Union-Find with path compression handles this efficiently:

defmodule Clustering do
  def cluster_matches(matches, opts \\ []) do
    id_key = Keyword.get(opts, :id_key, :contact_ids)

    uf = init_union_find(matches)
    uf = union_all_pairs(matches, uf)

    uf
    |> group_by_root()
    |> Enum.reject(fn {_root, members} -> length(members) < 2 end)
    |> build_clusters(matches, id_key)
  end

  defp init_union_find(matches) do
    matches
    |> Enum.flat_map(fn {a, b, _conf, _type} -> [a, b] end)
    |> Enum.uniq()
    |> Map.new(fn id -> {id, id} end)
  end

  defp find(uf, id) do
    parent = Map.fetch!(uf, id)
    if parent == id do
      {id, uf}
    else
      {root, uf} = find(uf, parent)
      {root, Map.put(uf, id, root)}  # path compression
    end
  end

  defp union(uf, a, b) do
    {root_a, uf} = find(uf, a)
    {root_b, uf} = find(uf, b)
    if root_a == root_b, do: uf, else: Map.put(uf, root_a, root_b)
  end
end

Singletons (clusters of 1) are discarded — in practice they should not appear since every union merges two IDs into one component, but filtering them out guards against edge cases in input data.

Master Selection by Completeness

When presenting duplicate groups to users, suggest the most complete record as the master:

def suggest_master(records, fields) do
  Enum.max_by(records, fn record ->
    base = Enum.count(fields, fn f ->
      val = Map.get(record, f)
      val != nil and val != ""
    end)

    bonus =
      (if Map.get(record, :orders_count, 0) > 0, do: 2, else: 0) +
      (if Map.get(record, :contacts_count, 0) > 0, do: 1, else: 0)

    base + bonus
  end)
end

The suggested master is a default, not a mandate. Users can always pick a different one.

Concurrency with Back-Pressure

Whether triggered by a cron schedule, a manual button, or an event (e.g., after a bulk import), don't process all orgs simultaneously. Use a GenServer worker with bounded concurrency:

@max_concurrent_jobs 8
@max_queue_depth 200
@detection_timeout :timer.minutes(30)

Job deduplication: If a detection job for the same {type, org_id} is already running or queued, silently skip new requests. This prevents overlapping schedules or rapid re-triggers from double-queuing a slow org.

Task isolation: Run detection tasks under a Task.Supervisor with async_nolink so a crash in one org's detection doesn't take down the worker or other orgs' tasks.

Queue draining: When a task completes, recursively pop from the queue until all available slots are filled or the queue is empty. This maximizes throughput when multiple tasks finish near-simultaneously.


AI: The Fuzzy Name Fallback

AI handles what PostgreSQL and deterministic rules cannot: deciding whether "Bob Smith" and "Robert J. Smith, SRA" are the same person, or whether "First National Bank" and "1st Natl Bank, LLC" are the same company.

When to Call AI

AI is expensive and slow relative to SQL and Elixir logic. Use it only for candidates that:

  1. Were found by a fuzzy blocking strategy (soundex or trigram)
  2. Could not be resolved by deterministic field matching
  3. Have non-nil name fields to compare

Structured Output, Not Free Text

Parse the LLM response into a typed schema. Never regex-parse free text.

schema = %{
  query_name: :string,
  matched_name: {:string, nullable: true},
  confidence: :float,       # 0.0 to 1.0
  reasoning: :string,
  match_index: :integer,    # 0-based index into candidates, -1 if no match
  entity_type: :string
}

Structured output (JSON mode / tool use / schema-constrained generation) eliminates parsing failures and ensures a consistent response shape. Note that the confidence score reflects the model's self-assessment, not a calibrated probability — validate thresholds against labeled data from your domain before relying on them in production.

Entity-Specific System Prompts

Person names and company names have different variation patterns. Use separate system prompts:

Person names — instruct the model to account for:

  • Middle names and initials (John A. Smith vs John Smith)
  • Nicknames (Bob = Robert, Bill = William, Jim = James)
  • Professional designations (CPA, PhD, PE after the name)
  • Generational suffixes (Jr, Sr, III)
  • Hyphenated and maiden names

Company names — instruct the model to account for:

  • Entity suffixes (LLC, Inc, Corp, Ltd — may be present or absent)
  • Abbreviations (Intl = International, Natl = National, 1st = First)
  • DBA names ("doing business as" variations)
  • Punctuation and spacing (AT&T vs ATT vs A.T.&T.)
  • Acronyms (FBI vs Federal Bureau of Investigation)

Confidence Threshold

Set a hard threshold (0.85 works well) below which AI matches are rejected. This prevents low-confidence guesses from polluting duplicate groups. The threshold should be a single constant shared across all call sites.

@ai_confidence_threshold 0.85

def meets_threshold?(%{confidence: confidence}) do
  confidence >= @ai_confidence_threshold
end

Concurrency and Timeout

Each AI call is independent. Parallelize with bounded concurrency and per-call timeouts:

candidates
|> Task.async_stream(
  fn candidate -> call_llm(query_name, candidate.name) end,
  max_concurrency: 10,
  timeout: :timer.seconds(30),
  on_timeout: :kill_task
)
|> Enum.flat_map(fn
  {:ok, {:ok, match}} -> [match]
  _ -> []  # timeout or error — silently drop
end)

For the real-time check (user is waiting), cap AI candidates at 5 and use a short timeout (5–10 seconds is typical; 30 seconds is a reasonable hard upper bound but will feel slow in a synchronous UI). For batch detection, you can allow more concurrency and longer timeouts since no one is waiting.

Model Selection

Use a fast, cheap model. Name matching is a classification task, not a reasoning task. A small model at low temperature (0.1) performs well here — low temperature reduces randomness in a task where you want consistent yes/no decisions.


Error Handling Philosophy

Errors during deduplication should never block the user or crash the system.

Context On error
Real-time check Log the error, return "no duplicates found", let the save proceed
Batch detection Log the error for the failing org, continue processing other orgs
AI call Timeout or crash — drop that candidate, proceed with deterministic matches
Merge Fail the transaction, leave the group as pending for retry

User productivity always takes priority over perfect deduplication. The batch job is the safety net — anything missed by the real-time check will be caught on the next run.


Merge: Reassigning References

When duplicates are confirmed and merged, every foreign key pointing to the archived records must be remapped to the master.

Reassign FK References

For each table that references the merged entity, update the foreign key:

-- Reassign all orders from duplicate contacts to the master
UPDATE order SET contact_id = $master_id
WHERE contact_id = ANY($duplicate_ids)
  AND organization_id = $org_id;

-- Reassign notes
UPDATE note SET contact_id = $master_id
WHERE contact_id = ANY($duplicate_ids)
  AND organization_id = $org_id;

Watch for unique constraints: if a child table has a unique constraint on (contact_id, other_column), blindly reassigning can create conflicts. Check for collisions first and merge or remove the conflicting rows before reassigning.

Archive and Record

After reassignment:

  1. Snapshot the duplicate records' data into a merge history table for audit trail
  2. Archive (soft-delete) the duplicate records
  3. Mark the duplicate group as merged

Transaction Safety

Wrap the entire merge in a single transaction (Ecto.Multi in Elixir). If any step fails, the whole merge rolls back. Pay attention to trigger ordering — if a database trigger fires on FK changes, the order of operations within the transaction matters.


The Full Pipeline at a Glance

[All Records in Org]
        |
        v
  +-----------+
  | BLOCKING  |  PostgreSQL: soundex, trigram, exact match
  +-----------+  Produces blocks (groups of IDs sharing a key)
        |
        v
  +-----------+
  | EXPAND    |  Elixir: blocks -> unique pairs, dedupe across blocks
  +-----------+  Filter by recency (incremental mode)
        |
        v
  +-----------+
  | EXCLUDE   |  Elixir: remove pairs marked "Keep Separate"
  +-----------+
        |
        v
  +-----------+
  | MATCH     |  Elixir: deterministic field rules (email, phone, name)
  +-----------+  AI fallback for fuzzy name candidates only
        |
        v
  +-----------+
  | CLUSTER   |  Elixir: Union-Find groups pairs transitively
  +-----------+  Discard singletons
        |
        v
  +-----------+
  | PERSIST   |  Elixir: skip groups matching existing ones
  +-----------+  Suggest master by data completeness
        |
        v
  [User reviews and merges in UI]

Performance Summary

Technique Cost When to use
Exact field GROUP BY O(n) with index Always — cheapest blocking strategy
Soundex GROUP BY O(n) with functional index Names with typos/variations
Trigram self-join O(n^2) within org, mitigated by GIN index Catch-all for remaining name fuzziness
Incremental filtering Reduces O(n^2) to O(k*n) Every run after the first full scan
Deterministic scoring O(1) per pair Always — fast field comparisons
AI scoring ~500ms per LLM call Only fuzzy name candidates that fail deterministic rules
Union-Find clustering Nearly O(n) with path compression Always — groups validated pairs transitively
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment