Everything sales needs to know about candidate data ingestion, semantic search, costs, and timelines.
1. What is a DataBrick?
2. How it works
3. Live demo
4. Search deep dive
5. Ingestion detail
6. Timing
7. Costs
8. ATSs
9. What's next
A DataBrick is a searchable data source. Instead of connecting directly to an ATS, we give recruiters different "bricks" of candidate data they can search across semantically.
Candidates pulled directly from the customer's ATS (Bullhorn, Avionte, etc.)
Past HeyMilo interview candidates — already have rich structured data from our interviews.
External candidates from PeopleDataLabs — enriched professional profiles.
Workspace-level index of all candidates across sources for quick lookup.
Index of job postings for matching and routing candidates to open reqs.
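The five brick types above can be pictured as a simple enum (an illustrative sketch only; these names are assumptions, not our actual schema):

```python
from enum import Enum

class BrickType(Enum):
    # Candidate data pulled from the customer's ATS (Bullhorn, Avionte, etc.)
    ATS = "ats"
    # Past HeyMilo interview candidates with rich structured interview data
    INTERVIEW = "interview"
    # External profiles enriched via PeopleDataLabs
    EXTERNAL_PDL = "pdl"
    # Workspace-level index across all candidate sources
    WORKSPACE = "workspace"
    # Job postings for matching and routing candidates to open reqs
    JOBS = "jobs"
```

Every brick, whatever its source, ends up searchable through the same interface.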
Recruiters don't need to manually search their ATS. They type a natural language query — "nurse with 5 years of experience in Houston" — and we search across ALL their data sources at once using AI-powered semantic search.
Three steps — connect, ingest, search. That's it.
Most staffing firms have 100K+ candidates sitting in their ATS that they never search effectively. We make that existing data work for them — no new candidates needed, just smarter access to what they already have.
End-to-end walkthrough — connect, ingest, search.
Connect ATS → Ingest candidates → Search → Evaluate → Results
Customer connects their ATS via OAuth. DataBrick is provisioned automatically.
Candidates are pulled, transformed, and indexed. Progress is tracked in real-time.
Recruiter types a natural language query. Top candidates returned with AI scores.
Show the DataBricks management UI, real-time ingestion progress, search results with scored candidates, and the Sally sourcing flow for a complete picture.
What happens every time a recruiter types a query — results in seconds.
Each candidate has multiple vector "namespaces" — separate embeddings for different parts of their profile.
| Namespace | What It Captures |
|---|---|
| personal_info | Name, title, summary |
| highlights | Key skills, achievements |
| profile_text | Full profile narrative |
| contact | Location, availability |
Plus cross-reference collections for work history, education, skills, certs — each with their own vectors and back-references to the main candidate.
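To make the multi-namespace idea concrete, here is a minimal sketch of fanning one query embedding out across the namespaces and keeping each candidate's best hit. The helper names and index shape are invented for illustration; the real search runs against Weaviate, not an in-memory dict:

```python
# Hypothetical in-memory stand-in for the per-candidate namespaces.
NAMESPACES = ["personal_info", "highlights", "profile_text", "contact"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def search_all_namespaces(query_vec, index, top_k=10):
    """index: {namespace: [(candidate_id, vector), ...]}.
    Search every namespace; keep each candidate's best-scoring hit."""
    best = {}
    for ns in NAMESPACES:
        for cand_id, vec in index.get(ns, []):
            score = cosine(query_vec, vec)
            if score > best.get(cand_id, -1.0):
                best[cand_id] = score  # best namespace wins for this candidate
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_k]
```

The point of separate namespaces is exactly this: a candidate can surface because their work history matches even when their headline doesn't.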
How candidate data goes from ATS to searchable vector index.
| Component | Vectors | Notes |
|---|---|---|
| Main namespace vectors | 4 | personal_info, highlights, profile_text, contact |
| Work history | ~3 | 1 vector per entry |
| Education | ~2 | 1 vector per entry |
| Skills | ~5 | 1 vector per skill |
| Certifications | ~1 | 1 vector per cert |
| Tags | ~1 | 1 vector per tag group |
| Total | ~16 | Varies by data richness |
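The per-candidate total falls out mechanically from the sizing table: 4 fixed main-namespace vectors plus one vector per cross-referenced entry. A sketch (field names are assumptions, not our actual record schema):

```python
def vector_count(candidate: dict) -> int:
    """Estimate total vectors for one candidate: 4 main-namespace vectors,
    plus 1 per work-history entry, education entry, skill, certification,
    and tag group, per the sizing table."""
    count = 4  # main namespace vectors
    for field in ("work_history", "education", "skills",
                  "certifications", "tag_groups"):
        count += len(candidate.get(field, []))
    return count
```

A typical profile (3 jobs, 2 schools, 5 skills, 1 cert, 1 tag group) lands at the ~16-vector figure; sparse profiles can be as low as 4.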
| ATS | API Calls / Candidate |
|---|---|
| Bullhorn | 1 per 50 (bulk API) |
| Avionte | 6 (1 basic + 5 enrichment) |
Avionte needs 6 separate API calls per candidate to assemble the full profile. Bullhorn returns full profiles for 50 candidates in a single bulk call.
Re-running ingestion reuses existing data and only computes deltas. No duplicates, no wasted API calls.
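One common way to get that idempotency is content hashing: skip any candidate whose payload is byte-for-byte unchanged since the last run. A sketch under that assumption (not necessarily our actual implementation):

```python
import hashlib
import json

def delta_filter(candidates, seen_hashes):
    """Yield only new or changed candidates; update seen_hashes in place.
    seen_hashes: {candidate_id: payload_hash}, persisted between runs."""
    for cand in candidates:
        payload = json.dumps(cand, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        if seen_hashes.get(cand["id"]) != digest:
            seen_hashes[cand["id"]] = digest
            yield cand  # re-index this one; unchanged records are skipped
```

Re-running the same batch yields nothing, so a restarted ingestion burns no vectorization time or API quota on records it already has.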
How long it takes to go from zero to searchable, broken down by ATS.
Bullhorn: 1,000 candidates/min bulk API · Bottleneck = Weaviate
| Candidates | ATS Fetch | Weaviate | Total |
|---|---|---|---|
| 10K | 10 min | 33 min | ~35 min |
| 100K | 1.7 hrs | 5.6 hrs | ~6 hrs |
| 500K | 8.3 hrs | 27.8 hrs | ~30 hrs |
| 1M | 16.7 hrs | 55.6 hrs | ~2.5 days |
Avionte: 10 RPS · 6 API calls/candidate · Bottleneck = ATS rate limit
| Candidates | ATS Fetch | Weaviate | Total |
|---|---|---|---|
| 10K | 1.7 hrs | 0.6 hrs | ~1.8 hrs |
| 100K | 16.7 hrs | 5.6 hrs | ~18 hrs |
| 500K | 83 hrs | 27.8 hrs | ~3.7 days |
| 1M | 167 hrs | 55.6 hrs | ~7.4 days |
Bullhorn: total hours ≈ candidates ÷ 17,000 | Avionte: total hours ≈ (candidates × 6) ÷ 36,000
Example: 200K Bullhorn candidates → 200,000 ÷ 17,000 ≈ 12 hours. 200K Avionte → (200,000 × 6) ÷ 36,000 ≈ 33 hours.
Bullhorn's bulk API returns 50 candidates per call, so ATS fetch is fast and Weaviate vectorization is the bottleneck. Avionte requires 6 separate API calls per candidate at 10 requests/second, so the ATS rate limit dominates.
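The rules of thumb above can be wrapped in a tiny estimator for quick answers on a call:

```python
def ingest_hours(candidates: int, ats: str) -> float:
    """Back-of-envelope ingestion time. Bullhorn is Weaviate-bound
    (~17K candidates/hour end to end); Avionte is rate-limit-bound
    (10 RPS x 3600 s = 36K API calls/hour, 6 calls per candidate)."""
    if ats == "bullhorn":
        return candidates / 17_000
    if ats == "avionte":
        return candidates * 6 / 36_000
    raise ValueError(f"no rule of thumb for {ats!r}")
```

For the worked example in this doc: 200K Bullhorn candidates comes out to about 12 hours, and 200K Avionte candidates to about 33 hours.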
What it costs us to run the infrastructure per customer.
S = ~$0.50 per sourcing session
LLM cost for query decomposition + evaluation criteria generation.
s = sessions/month (1 session = 1 job posting search)
C = $0.0018 per candidate evaluated
GPT-4o-mini scores each candidate returned from vector search.
n = candidates scanned/month
I = $0.94/1K (≤1M) or $0.77/1K (>1M)
Weaviate node, transformer pod, GKE, MongoDB — shared across clients.
N = total candidates in thousands
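Putting the variables together, monthly cost is roughly S·s + C·n + I·N. A sketch of the model exactly as stated above (the infra rate steps down past 1M candidates):

```python
def monthly_cost(sessions: int, candidates_scanned: int,
                 total_candidates: int) -> float:
    """S*s + C*n + I*N from the cost model above."""
    S = 0.50      # per sourcing session (query decomposition + criteria gen)
    C = 0.0018    # per candidate evaluated (GPT-4o-mini scoring)
    N = total_candidates / 1_000
    I = 0.94 if total_candidates <= 1_000_000 else 0.77  # infra $/1K candidates
    return S * sessions + C * candidates_scanned + I * N
```

For example, 160 sessions/month scanning 8,000 candidates against a 100K-candidate brick comes out around $188/month.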
| Scenario | Candidates | Recruiters | Sessions/mo | S × s | C × n | I × N | Total/mo |
|---|---|---|---|---|---|---|---|
| Small agency | 100K | 1 | 160 | $80 | $14 | $94 | $188 |
| Mid staffing firm | 500K | 3 | 480 | $240 | $43 | $470 | $753 |
| Large enterprise | 1M | 6 | 960 | $480 | $346 | $940 | $1,766 |
Weaviate node $383 + transformer pod $49 + GKE $73 + MongoDB $75 + network $20. Shared across all clients — gets amortized as more clients onboard.
Infrastructure is shared. 3 clients at 800K total candidates = $752 infra total, way less than 3 separate deployments.
What's live today and what's in the pipeline.
| Bullhorn | Details |
|---|---|
| Rate limit | 1,000 candidates/min (bulk API) |
| API calls / candidate | 1 per 50 (bulk) |
| Batch size | 100 candidates per batch |
| Concurrency | 5 parallel workers |
| 100K ingest time | ~6 hours |
| Bottleneck | Weaviate vectorization (ATS is fast) |
| Data fields | Full profile in single bulk response |
| Avionte | Details |
|---|---|
| Rate limit | 10 requests/second |
| API calls / candidate | 6 (1 basic + 5 enrichment) |
| Batch size | 100 candidates per batch |
| Concurrency | 5 parallel workers |
| 100K ingest time | ~18 hours |
| Bottleneck | ATS rate limit (10 RPS × 6 calls) |
| Enrichment | Skills, education, work history, certs, tags |
2 RPS, 6 calls/candidate. ~3.5 days for 100K. Slowest due to rate limits.
Training sessions completed Jan 2026. Integration architecture defined.
Training completed Jan 2026. Transformer service ready to generate mappers.
Our transformer service uses LLMs to automatically generate data mappers for new ATSs. Once we have API access and a sample payload, a new ATS can be onboarded in days, not weeks.
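For context, a generated mapper is just a function from one ATS's payload shape to our normalized candidate record. Something like the sketch below, where the input field names are invented for illustration and are not any real ATS's schema:

```python
def map_candidate(raw: dict) -> dict:
    """Example of the kind of mapper the transformer service generates:
    raw ATS payload -> normalized candidate record. Input field names
    here are hypothetical."""
    return {
        "id": str(raw["candidateId"]),
        "name": f"{raw.get('firstName', '')} {raw.get('lastName', '')}".strip(),
        "title": raw.get("jobTitle"),
        "location": raw.get("city"),
        "skills": [s["name"] for s in raw.get("skills", [])],
    }
```

Because the output shape is fixed, everything downstream (vectorization, namespaces, search) works unchanged no matter which ATS the data came from.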
Where DataBricks is going — this is the foundation for a much bigger play.
Pull candidates from ATS, normalize, index into searchable vector store. SHIPPED
Multi-vector search with LLM-based candidate scoring against job criteria. SHIPPED
UI to view, manage, and reindex vector collections. SHIPPED
Transformer service auto-generates mappers for new ATSs. Training completed for Ashby + Greenhouse.
Run analytics across a DataBrick — talent market trends, skill gap analysis, compensation benchmarking. Clickhouse backend for structured queries.
Move candidates from one data source to another. ATS → ATS migration, or consolidating after M&A.
Use one DataBrick to enrich another. Example: match ATS candidates against PDL data to fill in missing emails, phone numbers, social profiles.
Same candidate exists in Bullhorn AND Avionte? Merge profiles, pick best data from each, create a unified candidate record.
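A merge could be as simple as field-wise "prefer the non-empty, more complete value." A sketch of that idea; a real implementation would also need source priority and last-updated timestamps:

```python
def merge_profiles(a: dict, b: dict) -> dict:
    """Field-wise merge of the same candidate from two sources:
    prefer non-empty values, keep the longer value on conflict,
    and union list-valued fields like skills."""
    merged = {}
    for key in set(a) | set(b):
        va, vb = a.get(key), b.get(key)
        if isinstance(va, list) or isinstance(vb, list):
            merged[key] = sorted(set((va or []) + (vb or [])))
        else:
            non_empty = [v for v in (va, vb) if v]
            merged[key] = max(non_empty, key=lambda v: len(str(v)), default=None)
    return merged
```

The result is one unified record that is at least as complete as either source on every field.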
Upload CSVs or resume files directly as a DataBrick. No ATS required — useful for clients migrating off legacy systems.
Hourly automated sourcing cycles — search internal data, score candidates, apply exclusion rules, queue for outreach. Fully hands-off for recruiters.
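Since this feature is still in the pipeline, here is only a rough sketch of what one cycle could look like, with the search, scoring, and exclusion pieces injected as placeholders:

```python
def sourcing_cycle(search, score, excluded_ids, queue, threshold=0.7):
    """One hypothetical automated cycle: search -> score -> filter -> queue.
    search: callable returning candidate dicts; score: callable returning
    a 0-1 fit score; excluded_ids: ids barred by exclusion rules."""
    for cand in search():
        if cand["id"] in excluded_ids:
            continue  # exclusion rules: already contacted, opted out, etc.
        cand["score"] = score(cand)
        if cand["score"] >= threshold:  # illustrative cutoff
            queue.append(cand)  # hand off to outreach
    return queue
```

Run hourly, this loop keeps a fresh, pre-scored outreach queue without a recruiter touching anything.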
DataBricks turns HeyMilo from an interview tool into a candidate data platform. Any data in, any data out, AI-powered search and enrichment across everything.
Numbers to have at your fingertips.
| ATS | 10K | 50K | 100K | 500K |
|---|---|---|---|---|
| Bullhorn | 35 min | 3 hrs | 6 hrs | 30 hrs |
| Avionte | 1.8 hrs | 9 hrs | 18 hrs | 3.7 days |
| Candidates | 1 Recruiter | 3 Recruiters | 6 Recruiters |
|---|---|---|---|
| 50K | $127 | $367 | $727 |
| 100K | $188 | $428 | $788 |
| 500K | $564 | $804 | $1,164 |
| 1M | $1,034 | $1,274 | $1,766 |
Assumes 160 sessions/recruiter/month, 50 candidates evaluated per session, at 65% infra utilization.
Depends on ATS and candidate count. Bullhorn: ~6 hrs for 100K. Avionte: ~18 hrs for 100K. Initial ingestion is one-time. After that, incremental updates keep data fresh.
Data stays within our GCP infrastructure (us-central1). Each workspace has isolated collections. No data is shared between customers. SOC 2 compliant.
Each ATS gets its own DataBrick. Recruiters can search across both simultaneously — we handle dedup and cross-source ranking.
ATS search is keyword-based. We use AI to understand what you mean, not just what you type. We also search across work history, skills, education, and certifications simultaneously.