Everything sales needs to know about candidate data ingestion, semantic search, costs, and timelines.
1. What is a DataBrick?
2. How it works
3. Live demo
4. Search deep dive
5. Ingestion detail
6. Timing
7. Costs
8. ATSs
9. What's next
A DataBrick is a searchable data source. Instead of connecting directly to an ATS, we give recruiters different "bricks" of candidate data they can search across semantically.
Candidates pulled directly from the customer's ATS (Bullhorn, Avionte, etc.)
Past HeyMilo interview candidates — already have rich structured data from our interviews.
External candidates from PeopleDataLabs — enriched professional profiles.
Workspace-level index of all candidates across sources for quick lookup.
Index of job postings for matching and routing candidates to open reqs.
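The five brick types above can be pictured as a simple enum (an illustrative sketch only; these names are assumptions, not our actual schema):

```python
from enum import Enum

class BrickType(Enum):
    # Candidate data pulled from the customer's ATS (Bullhorn, Avionte, etc.)
    ATS = "ats"
    # Past HeyMilo interview candidates with rich structured interview data
    INTERVIEW = "interview"
    # External profiles enriched via PeopleDataLabs
    EXTERNAL_PDL = "pdl"
    # Workspace-level index across all candidate sources
    WORKSPACE = "workspace"
    # Job postings for matching and routing candidates to open reqs
    JOBS = "jobs"
```

Every brick, whatever its source, ends up searchable through the same interface.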
Recruiters don't need to manually search their ATS. They type a natural language query — "nurse with 5 years of experience in Houston" — and we search across ALL their data sources at once using AI-powered semantic search.
Three steps — connect, ingest, search. That's it.
Most staffing firms have 100K+ candidates sitting in their ATS that they never search effectively. We make that existing data work for them — no new candidates needed, just smarter access to what they already have.
End-to-end walkthrough — connect, ingest, search.
Connect ATS → Ingest candidates → Search → Evaluate → Results
Customer connects their ATS via OAuth. DataBrick is provisioned automatically.
Candidates are pulled, transformed, and indexed. Progress is tracked in real-time.
Recruiter types a natural language query. Top candidates returned with AI scores.
Show the DataBricks management UI, real-time ingestion progress, search results with scored candidates, and the Sally sourcing flow for a complete picture.
What happens every time a recruiter types a query — results in seconds.
Each candidate has multiple vector "namespaces" — separate embeddings for different parts of their profile.
| Namespace | What It Captures |
|---|---|
| personal_info | Name, title, summary |
| highlights | Key skills, achievements |
| profile_text | Full profile narrative |
| contact | Location, availability |
Plus cross-reference collections for work history, education, skills, certs — each with their own vectors and back-references to the main candidate.
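To make the multi-namespace idea concrete, here is a minimal sketch of fanning one query embedding out across the namespaces and keeping each candidate's best hit. The helper names and index shape are invented for illustration; the real search runs against Weaviate, not an in-memory dict:

```python
# Hypothetical in-memory stand-in for the per-candidate namespaces.
NAMESPACES = ["personal_info", "highlights", "profile_text", "contact"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def search_all_namespaces(query_vec, index, top_k=10):
    """index: {namespace: [(candidate_id, vector), ...]}.
    Search every namespace; keep each candidate's best-scoring hit."""
    best = {}
    for ns in NAMESPACES:
        for cand_id, vec in index.get(ns, []):
            score = cosine(query_vec, vec)
            if score > best.get(cand_id, -1.0):
                best[cand_id] = score  # best namespace wins for this candidate
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_k]
```

The point of separate namespaces is exactly this: a candidate can surface because their work history matches even when their headline doesn't.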
How candidate data goes from ATS to searchable vector index.
| Component | Vectors | Notes |
|---|---|---|
| Main namespace vectors | 4 | personal_info, highlights, profile_text, contact |
| Work history | ~3 | 1 vector per entry |
| Education | ~2 | 1 vector per entry |
| Skills | ~5 | 1 vector per skill |
| Certifications | ~1 | 1 vector per cert |
| Tags | ~1 | 1 vector per tag group |
| Total | ~16 | Varies by data richness |
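The per-candidate total falls out mechanically from the sizing table: 4 fixed main-namespace vectors plus one vector per cross-referenced entry. A sketch (field names are assumptions, not our actual record schema):

```python
def vector_count(candidate: dict) -> int:
    """Estimate total vectors for one candidate: 4 main-namespace vectors,
    plus 1 per work-history entry, education entry, skill, certification,
    and tag group, per the sizing table."""
    count = 4  # main namespace vectors
    for field in ("work_history", "education", "skills",
                  "certifications", "tag_groups"):
        count += len(candidate.get(field, []))
    return count
```

A typical profile (3 jobs, 2 schools, 5 skills, 1 cert, 1 tag group) lands at the ~16-vector figure; sparse profiles can be as low as 4.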
| ATS | API Calls / Candidate |
|---|---|
| Bullhorn | 1 per 50 (bulk API) |
| Avionte | 6 (1 basic + 5 enrichment) |
Avionte needs 6 separate API calls per candidate to assemble the full profile. Bullhorn returns full profiles for 50 candidates in a single bulk call.
Re-running ingestion reuses existing data and only computes deltas. No duplicates, no wasted API calls.
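One common way to get that idempotency is content hashing: skip any candidate whose payload is byte-for-byte unchanged since the last run. A sketch under that assumption (not necessarily our actual implementation):

```python
import hashlib
import json

def delta_filter(candidates, seen_hashes):
    """Yield only new or changed candidates; update seen_hashes in place.
    seen_hashes: {candidate_id: payload_hash}, persisted between runs."""
    for cand in candidates:
        payload = json.dumps(cand, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        if seen_hashes.get(cand["id"]) != digest:
            seen_hashes[cand["id"]] = digest
            yield cand  # re-index this one; unchanged records are skipped
```

Re-running the same batch yields nothing, so a restarted ingestion burns no vectorization time or API quota on records it already has.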
How long it takes to go from zero to searchable, broken down by ATS.
Bullhorn: 1,000 candidates/min bulk API · Bottleneck = Weaviate
| Candidates | ATS Fetch | Weaviate | Total |
|---|---|---|---|
| 10K | 10 min | 33 min | ~35 min |
| 100K | 1.7 hrs | 5.6 hrs | ~6 hrs |
| 500K | 8.3 hrs | 27.8 hrs | ~30 hrs |
| 1M | 16.7 hrs | 55.6 hrs | ~2.5 days |
Avionte: 10 RPS · 6 API calls/candidate · Bottleneck = ATS rate limit
| Candidates | ATS Fetch | Weaviate | Total |
|---|---|---|---|
| 10K | 1.7 hrs | 0.6 hrs | ~1.8 hrs |
| 100K | 16.7 hrs | 5.6 hrs | ~18 hrs |
| 500K | 83 hrs | 27.8 hrs | ~3.7 days |
| 1M | 167 hrs | 55.6 hrs | ~7.4 days |
Bullhorn: total hours ≈ candidates ÷ 17,000 | Avionte: total hours ≈ (candidates × 6) ÷ 36,000
Example: 200K Bullhorn candidates → 200,000 ÷ 17,000 ≈ 12 hours. 200K Avionte → (200,000 × 6) ÷ 36,000 ≈ 33 hours.
Bullhorn's bulk API returns 50 candidates per call, so ATS fetch is fast and Weaviate vectorization is the bottleneck. Avionte requires 6 separate API calls per candidate at 10 requests/second, so the ATS rate limit dominates.
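The rules of thumb above can be wrapped in a tiny estimator for quick answers on a call:

```python
def ingest_hours(candidates: int, ats: str) -> float:
    """Back-of-envelope ingestion time. Bullhorn is Weaviate-bound
    (~17K candidates/hour end to end); Avionte is rate-limit-bound
    (10 RPS x 3600 s = 36K API calls/hour, 6 calls per candidate)."""
    if ats == "bullhorn":
        return candidates / 17_000
    if ats == "avionte":
        return candidates * 6 / 36_000
    raise ValueError(f"no rule of thumb for {ats!r}")
```

For the worked example in this doc: 200K Bullhorn candidates comes out to about 12 hours, and 200K Avionte candidates to about 33 hours.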
What it costs us to run the infrastructure per customer.
S = ~$0.50 per sourcing session
LLM cost for query decomposition + evaluation criteria generation.
s = sessions/month (1 session = 1 job posting search)
C = $0.0018 per candidate evaluated
GPT-4o-mini scores each candidate returned from vector search.
n = candidates scanned/month
I = $0.94/1K (≤1M) or $0.77/1K (>1M)
Weaviate node, transformer pod, GKE, MongoDB — shared across clients.
N = total candidates in thousands
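Putting the variables together, monthly cost is roughly S·s + C·n + I·N. A sketch of the model exactly as stated above (the infra rate steps down past 1M candidates):

```python
def monthly_cost(sessions: int, candidates_scanned: int,
                 total_candidates: int) -> float:
    """S*s + C*n + I*N from the cost model above."""
    S = 0.50      # per sourcing session (query decomposition + criteria gen)
    C = 0.0018    # per candidate evaluated (GPT-4o-mini scoring)
    N = total_candidates / 1_000
    I = 0.94 if total_candidates <= 1_000_000 else 0.77  # infra $/1K candidates
    return S * sessions + C * candidates_scanned + I * N
```

For example, 160 sessions/month scanning 8,000 candidates against a 100K-candidate brick comes out around $188/month.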
| Scenario | Candidates | Recruiters | Sessions/mo | S × s | C × n | I × N | Total/mo |
|---|---|---|---|---|---|---|---|
| Small agency | 100K | 1 | 160 | $80 | $14 | $94 | $188 |
| Mid staffing firm | 500K | 3 | 480 | $240 | $43 | $470 | $753 |
| Large enterprise | 1M | 6 | 960 | $480 | $346 | $940 | $1,766 |
Weaviate node $383 + transformer pod $49 + GKE $73 + MongoDB $75 + network $20. Shared across all clients — gets amortized as more clients onboard.
Infrastructure is shared. 3 clients at 800K total candidates = $752 infra total, way less than 3 separate deployments.
What's live today and what's in the pipeline.
| Bullhorn | Details |
|---|---|
| Rate limit | 1,000 candidates/min (bulk API) |
| API calls / candidate | 1 per 50 (bulk) |
| Batch size | 100 candidates per batch |
| Concurrency | 5 parallel workers |
| 100K ingest time | ~6 hours |
| Bottleneck | Weaviate vectorization (ATS is fast) |
| Data fields | Full profile in single bulk response |
| Avionte | Details |
|---|---|
| Rate limit | 10 requests/second |
| API calls / candidate | 6 (1 basic + 5 enrichment) |
| Batch size | 100 candidates per batch |
| Concurrency | 5 parallel workers |
| 100K ingest time | ~18 hours |
| Bottleneck | ATS rate limit (10 RPS × 6 calls) |
| Enrichment | Skills, education, work history, certs, tags |
2 RPS, 6 calls/candidate. ~3.5 days for 100K. Slowest due to rate limits.
Training sessions completed Jan 2026. Integration architecture defined.
Training completed Jan 2026. Transformer service ready to generate mappers.
Our transformer service uses LLMs to automatically generate data mappers for new ATSs. Once we have API access and a sample payload, a new ATS can be onboarded in days, not weeks.
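For context, a generated mapper is just a function from one ATS's payload shape to our normalized candidate record. Something like the sketch below, where the input field names are invented for illustration and are not any real ATS's schema:

```python
def map_candidate(raw: dict) -> dict:
    """Example of the kind of mapper the transformer service generates:
    raw ATS payload -> normalized candidate record. Input field names
    here are hypothetical."""
    return {
        "id": str(raw["candidateId"]),
        "name": f"{raw.get('firstName', '')} {raw.get('lastName', '')}".strip(),
        "title": raw.get("jobTitle"),
        "location": raw.get("city"),
        "skills": [s["name"] for s in raw.get("skills", [])],
    }
```

Because the output shape is fixed, everything downstream (vectorization, namespaces, search) works unchanged no matter which ATS the data came from.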
Where DataBricks is going — this is the foundation for a much bigger play.
Pull candidates from ATS, normalize, index into searchable vector store. SHIPPED
Multi-vector search with LLM-based candidate scoring against job criteria. SHIPPED
UI to view, manage, and reindex vector collections. SHIPPED
Transformer service auto-generates mappers for new ATSs. Training completed for Ashby + Greenhouse.
Run analytics across a DataBrick — talent market trends, skill gap analysis, compensation benchmarking. Clickhouse backend for structured queries.
Move candidates from one data source to another. ATS → ATS migration, or consolidating after M&A.
Use one DataBrick to enrich another. Example: match ATS candidates against PDL data to fill in missing emails, phone numbers, social profiles.
Same candidate exists in Bullhorn AND Avionte? Merge profiles, pick best data from each, create a unified candidate record.
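A merge could be as simple as field-wise "prefer the non-empty, more complete value." A sketch of that idea; a real implementation would also need source priority and last-updated timestamps:

```python
def merge_profiles(a: dict, b: dict) -> dict:
    """Field-wise merge of the same candidate from two sources:
    prefer non-empty values, keep the longer value on conflict,
    and union list-valued fields like skills."""
    merged = {}
    for key in set(a) | set(b):
        va, vb = a.get(key), b.get(key)
        if isinstance(va, list) or isinstance(vb, list):
            merged[key] = sorted(set((va or []) + (vb or [])))
        else:
            non_empty = [v for v in (va, vb) if v]
            merged[key] = max(non_empty, key=lambda v: len(str(v)), default=None)
    return merged
```

The result is one unified record that is at least as complete as either source on every field.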
Upload CSVs or resume files directly as a DataBrick. No ATS required — useful for clients migrating off legacy systems.
Hourly automated sourcing cycles — search internal data, score candidates, apply exclusion rules, queue for outreach. Fully hands-off for recruiters.
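Since this feature is still in the pipeline, here is only a rough sketch of what one cycle could look like, with the search, scoring, and exclusion pieces injected as placeholders:

```python
def sourcing_cycle(search, score, excluded_ids, queue, threshold=0.7):
    """One hypothetical automated cycle: search -> score -> filter -> queue.
    search: callable returning candidate dicts; score: callable returning
    a 0-1 fit score; excluded_ids: ids barred by exclusion rules."""
    for cand in search():
        if cand["id"] in excluded_ids:
            continue  # exclusion rules: already contacted, opted out, etc.
        cand["score"] = score(cand)
        if cand["score"] >= threshold:  # illustrative cutoff
            queue.append(cand)  # hand off to outreach
    return queue
```

Run hourly, this loop keeps a fresh, pre-scored outreach queue without a recruiter touching anything.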
DataBricks turns HeyMilo from an interview tool into a candidate data platform. Any data in, any data out, AI-powered search and enrichment across everything.
Numbers to have at your fingertips.
| ATS | 10K | 50K | 100K | 500K |
|---|---|---|---|---|
| Bullhorn | 35 min | 3 hrs | 6 hrs | 30 hrs |
| Avionte | 1.8 hrs | 9 hrs | 18 hrs | 3.7 days |
| Candidates | 1 Recruiter | 3 Recruiters | 6 Recruiters |
|---|---|---|---|
| 50K | $127 | $367 | $727 |
| 100K | $188 | $428 | $788 |
| 500K | $564 | $804 | $1,164 |
| 1M | $1,034 | $1,274 | $1,766 |
Assumes 160 sessions/recruiter/month, 50 candidates evaluated per session, at 65% infra utilization.
Depends on ATS and candidate count. Bullhorn: ~6 hrs for 100K. Avionte: ~18 hrs for 100K. Initial ingestion is one-time. After that, incremental updates keep data fresh.
Data stays within our GCP infrastructure (us-central1). Each workspace has isolated collections. No data is shared between customers. SOC 2 compliant.
Each ATS gets its own DataBrick. Recruiters can search across both simultaneously — we handle dedup and cross-source ranking.
ATS search is keyword-based. We use AI to understand what you mean, not just what you type. We also search across work history, skills, education, and certifications simultaneously.