ORBITER

Enrichment QA Pipeline — Full Report
Production vs QA Comparison · Data Flow · FalkorDB Integration
Prepared by Robert Boulos · March 30, 2026
Confidential

Contents

  1. Executive Summary
  2. How the Pipeline Works — End-to-End Architecture
  3. Data Model — What Gets Enriched
  4. FalkorDB Graph Integration
  5. Real Person Examples — What the Pipeline Produces
  6. Critical Bugs Found in Production Code
  7. Before vs After — Key Metrics
  8. Function-by-Function Comparison (9 Functions)
  9. Test Results — 20-Person Batch
  10. Implementation Phases
  11. Production Pipeline Health (Current State)
  12. Recommendations

1. Executive Summary

Orbiter’s enrichment pipeline ingests person and company data from 5+ external sources (People Data Labs, LinkedIn/Enrich Layer, Y Combinator, Fundable, Crunchbase), normalizes it into 20+ relational tables in Xano, then projects it into a FalkorDB knowledge graph with nodes, edges, and vector embeddings.

We cloned all 9 core enrichment functions into an isolated qa/ namespace, systematically audited each one, and discovered 6 bugs in the production code — including 3 functions with globally-scoped database queries that corrupt data across people. Our QA clones fix all of these and add error isolation, structured responses, and an AI self-healing loop powered by Claude Opus 4.6.

QA Pass Rate: 100%
Stages Passed: 180/180
Failures: 0
Bugs Fixed: 6

2. How the Pipeline Works — End-to-End Architecture

The enrichment pipeline has 4 layers: Ingest (raw data from external APIs), Process (fan out into relational tables), Resolve (link records to companies), and Graph (project into FalkorDB).

LAYER 1: INGEST (External APIs → JSON blobs)
  People Data Labs        → person_enrich_data.people_data_labs (emails, work, education, skills)
  LinkedIn/Enrich Layer   → person_enrich_data.enrich_layer_data (profile, social, experience)
  Fundable                → person_enrich_data.fundable (investor data, deals, orgs)
  Y Combinator            → company_enrich_data.yc_data (batch, founders, funding)
  Crunchbase              → enrich_history_person (business profiles)

LAYER 2: PROCESS (JSON blobs → relational tables)
  qa/process-enrich-layer (16 sections)
  ├─ Section 1: Name formatting → master_person (first/last/suffix)
  ├─ Section 2: Avatar → master_avatar (real image)
  ├─ Section 3: Biographies → about_person (headline + summary)
  ├─ Section 4: Location → primary_location (geo-coded)
  ├─ Section 5: Skills → skills_join → skills
  ├─ Section 6: Gender → master_person.sex
  ├─ Section 7: LinkedIn followers → master_link + follower count
  ├─ Section 8: Languages → language_join → languages
  ├─ Section 9: Education → education_experience (foreach record)
  ├─ Section 10: Work experience → work_experience (foreach position)
  ├─ Section 11: Certifications → certification (foreach cert)
  ├─ Section 12: Volunteering → volunteering (foreach entry)
  ├─ Section 13: Projects → project (foreach project)
  ├─ Section 14: Publications → publication (foreach pub)
  ├─ Section 15: Honors/Awards → honor (foreach award)
  └─ Section 16: Interests → interest_join → interest

LAYER 3: RESOLVE (relational records → company links)
  qa/resolve-edges-education        Match schools by LinkedIn URL → domain → name
  qa/resolve-edges-work             Match employers by domain → LinkedIn URL
  qa/resolve-edges-certifications   Match issuers by domain → LinkedIn URL
  qa/resolve-edges-honor            Match issuers by domain → LinkedIn URL
  qa/resolve-edges-volunteering     Match organizations by domain → LinkedIn URL
  qa/resolve-edges-projects-pubs    Match associated companies

LAYER 4: GRAPH (relational → FalkorDB via Cypher)
  create-education-edges    → (Person)-[:ATTENDED]->(School)
                            → (Person)-[:ATTENDED_TOGETHER]-(Person) [derived]
                            → (Person)-[:STUDIED_UNDER]->(Person) [derived]
  create-work-edges         → (Person)-[:WORKED_AT]->(Company)
  resolve-edges-honor       → (Person)-[:RECEIVED_HONOR]->(Honor)-[:ISSUED_BY]->(Org)
  complete-person-enrich    → update-person-node (set visibility=true, sync embedding)
  run-base-company-process  → update-company-node (funding, industries, staff)
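
The Layer 3 resolvers try their match keys in a fixed priority order. The sketch below illustrates that fallback chain for schools (LinkedIn URL, then domain, then name). It is not the XanoScript implementation: the Company class, the resolve_school function, the record keys, and the exact-match comparisons are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Company:
    # Hypothetical stand-in for a master_company row.
    id: int
    name: str
    domain: Optional[str] = None
    linkedin_url: Optional[str] = None


def resolve_school(record: dict, companies: list[Company]) -> Optional[Company]:
    """Try LinkedIn URL, then domain, then name; the first key that matches wins."""
    matchers = [
        ("linkedin_url", lambda c, v: c.linkedin_url == v),
        ("domain", lambda c, v: c.domain == v),
        # Name matching is assumed to be case-insensitive here.
        ("school_name", lambda c, v: c.name.lower() == v.lower()),
    ]
    for key, matches in matchers:
        value = record.get(key)
        if not value:
            continue  # key missing on this record: fall through to the next one
        for company in companies:
            if matches(company, value):
                return company
    return None  # unresolved: no company link is written
```

A record that carries only a school name still resolves through the last matcher, while a record with no usable key returns None and is left unlinked.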

3. Data Model — What Gets Enriched

The pipeline populates 20+ relational tables per person. Here are the core tables and what they store.

Person Data Tables

Table | Records Per Person | Key Fields | Source
master_person | 1 | name, avatar, linkedin_url, current_title, bio, bio_500, sex, node_uuid, visibility | All sources merged
work_experience | 3-15 | title, company_name, company_domain, start/end dates, edge_uuid, master_company_id | PDL, Enrich Layer, Fundable
education_experience | 2-8 | school_name, field_of_study, degree_name, activities, start/end year, edge_uuid | PDL, Enrich Layer
skills_join | 20-70 | skill_id → skills.skill (text) | PDL, Enrich Layer
certification | 0-10 | name, org_name, issue_date, edge_id, master_company_id | Enrich Layer
honor | 0-5 | title, issuer_name, issued_on_year, node_uuid, edge_id | Enrich Layer
volunteering | 0-5 | title, company_name, start/end year, master_company_id | Enrich Layer
master_link | 5-12 | service (linkedin, twitter, github...), link_url, profile (bool) | YC, PDL, Enrich Layer
master_email | 1-6 | email_address, email_type (work/personal), active_status | LinkedIn, PDL
master_avatar | 1-5 | url, main (bool), is_placeholder (bool) | Twitter, LinkedIn, Fundable
about_person | 2-4 | biography (text), data_source_id | PDL headline, Enrich Layer, LLM
interest_join | 5-15 | interested_in_id → interest.interest | Enrich Layer
language_join | 1-5 | languages_id → languages.language | Enrich Layer

Data Sources (95 registered, 5 primary)

ID | Source | Type | What It Provides
91 | People Data Labs (PDL) | API | Emails, phones, education arrays, work arrays, skills, gender
94 | Enrich Layer | Aggregated | LinkedIn profile, skills, experience, education, certifications, volunteering, honors, interests, languages
89 | Fundable | API | Investor profiles, funding deals, organizations, total raised
87 | Y Combinator | Database | Batch info, founders, company one-liners, social links
8 | Crunchbase | API | Business profiles, funding rounds

4. FalkorDB Graph Integration

FalkorDB (formerly RedisGraph) is a property graph database that uses the Cypher query language. Every enriched person and company becomes a node with vector embeddings, connected by typed, weighted relationships.

Graph Node Types

Label(s) | Created By | Key Properties
:Entity:Person | add-person-node | uuid, name, name_embedding (vecf32), visibility, avatar, bio, roles, skills, interests
:Entity:Person:Angel | update-person-node (if is_angel) | + investor_type, exits, investments, board_advisory_experience
:Entity:Company | update-company-node | uuid, name, domain, name_embedding, industries, specialties, employee_range, founded
:Entity:Company:School | update-company-node (.edu domain) | Identified by .edu domain or is_school=true flag
:Entity:Company:VC_Firm | update-company-node (is_vc) | + investment_range, stages, sweet_spot
:Honor | resolve-edges-honor | uuid, title, description, issued_on_year, issuer_name

Graph Relationship Types

Relationship | Direction | Weight | How It’s Created
ATTENDED | Person → School | 40 | create-education-edges: groups education records by school, collapses into one edge, generates LLM description via Gemini 2.5 Flash
ATTENDED_TOGETHER | Person ↔ Person | 35-70 | Derived: Automatically created when two people have overlapping attendance dates at the same school. Weight varies: 35 (same major), 50 (same field), 70 (just temporal overlap)
STUDIED_UNDER | Person → Instructor | 10 | Derived: Created when a student’s attendance overlaps with an instructor’s TAUGHT_AT at the same school
WORKED_AT / WORKS_AT | Person → Company | varies | create-work-edges: links work_experience to companies, sets current_title on master_person
RECEIVED_HONOR | Person → Honor | 20 | resolve-edges-honor: creates Honor node + edge in single Cypher transaction
ISSUED_BY | Honor → Organization | 25 | Same transaction as RECEIVED_HONOR, conditional on issuer existing in graph
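
The ATTENDED_TOGETHER rule can be sketched as a small function over two attendance records. The weights (35, 50, 70) come from the table above; everything else is an assumption for illustration: the function name, the dictionary keys (school_id, start_year, end_year, major, field), and the choice to treat year ranges as inclusive.

```python
from typing import Optional


def attended_together_weight(a: dict, b: dict) -> Optional[int]:
    """Return an edge weight if two attendance records overlap at one school, else None."""
    if a["school_id"] != b["school_id"]:
        return None
    # Inclusive year ranges overlap unless one starts after the other has ended.
    if a["start_year"] > b["end_year"] or b["start_year"] > a["end_year"]:
        return None
    if a.get("major") and a.get("major") == b.get("major"):
        return 35  # same major: strongest (lowest) weight
    if a.get("field") and a.get("field") == b.get("field"):
        return 50  # same broader field
    return 70  # temporal overlap only
```

Returning None rather than a weight keeps "no edge" distinct from "weak edge", so a caller only writes an edge when a weight comes back.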

Key Architectural Patterns

  1. Relational-first, graph-second: All data lands in Xano tables first. The graph is a projection. This makes the pipeline crash-safe — sections fail independently while the graph stays eventually consistent.
  2. UUID-based bidirectional linking: FalkorDB generates randomUUID() for nodes. This UUID is stored back on the Xano record (node_uuid on master_person, edge_uuid on join tables). Both systems can find each other.
  3. Vector embeddings on every node: Person nodes get a name_embedding computed from bio_500 via OpenAI embeddings; company nodes get theirs from about_500. This enables semantic similarity search in the graph.
  4. LLM-generated edge descriptions: Education edges include a natural-language description generated by Gemini 2.5 Flash (e.g., “Robert studied Computer Science at MIT from 2010-2014”).
  5. Derived relationships: The graph computes connections like ATTENDED_TOGETHER by analyzing temporal overlap between existing edges — surfacing connections not explicitly stated in source data.
  6. Weight-based ranking: Every edge has a weight property. Lower = stronger. STUDIED_UNDER (10) ranks higher than ATTENDED (40). Graph traversal prioritizes meaningful connections.
  7. Visibility gating: Nodes start with visibility: false and become visible only after bio_500 exists. This keeps incomplete data out of user-facing queries.
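
Patterns 6 and 7 combine naturally when listing someone's connections: hide edges that point at nodes not yet visible, then order the rest with the lowest weight first. The sketch below shows that in plain Python over in-memory dictionaries. The function name and the edge and node shapes are hypothetical, and a real query would presumably do this in Cypher against FalkorDB.

```python
def rank_connections(edges: list, nodes: dict) -> list:
    """Drop edges to non-visible nodes, then sort strongest (lowest weight) first."""
    visible = [e for e in edges if nodes.get(e["target"], {}).get("visibility")]
    return sorted(visible, key=lambda e: e["weight"])


# Hypothetical sample data using the weights listed in the relationship table.
nodes = {
    "school-1": {"visibility": True},
    "prof-1": {"visibility": True},
    "draft-1": {"visibility": False},  # no bio_500 yet, so still hidden
}
edges = [
    {"type": "ATTENDED", "target": "school-1", "weight": 40},
    {"type": "ATTENDED_TOGETHER", "target": "draft-1", "weight": 35},
    {"type": "STUDIED_UNDER", "target": "prof-1", "weight": 10},
]
ranked = rank_connections(edges, nodes)
```

Here the STUDIED_UNDER edge sorts ahead of ATTENDED, and the edge to the hidden node is dropped even though its weight would otherwise place it second.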

5. Real Person Examples — What the Pipeline Produces

John Traver

Co-Founder / Creative Technologist at Frame.io · ID: 2
node_uuid: 7bd19581-941c-4428-a6da-d0a566a01a31
Bio (AI-generated, bio_500)

“John Traver is Co-founder and Creative Technologist at Frame.io, blending cinematic workflow with cutting-edge code. From scripting MEL and AE at RIT to shipping the 5-star KataData app in Objective-C, he has mastered Python, Ruby, JS, and now Haskell to scale video collaboration for Oscar-winning teams.”

Work Experience (3 records)
Title | Company | Period | Source
Co-Founder & Creative Technologist | Frame.io | 2014 – 2025 | Fundable
Chief Scientist | Katabatic Digital | 2012 – 2014 | Enrich Layer
K/Lab Engineering | Katabatic Digital | 2010 – 2012 | Enrich Layer
Education (5 records)
School | Degree | End | Source
Rochester Institute of Technology | BS | 2014 | PDL
Rochester Institute of Technology | BS | 2010 | PDL
Rochester Institute of Technology |  | 2010 | PDL
Shenendehowa High School |  | 2006 | PDL
Rochester Institute of Technology |  |  | PDL
Skills (36 skills)
python, ruby, javascript, haskell, css, html, amazon web services, coffeescript, nosql, ruby on rails, visual effects, digital cinema, digital video, post production, compositing, nuke, photography, video, television, scripting, software development, web development, programming, user experience, ui programming, mel, iphone app dev, creativity, problem solving, open minded, experiential learning, foreign languages, russian, cooking, vegan, health
Social Profiles (9 links)
LinkedIn, Facebook, X/Twitter, GitHub, YouTube, Crunchbase, Wellfound, Gravatar, Klout
Emails (4 verified)
john@frame.io, john@johntraver.com, john@katabatic.tv, john.traver@rit.edu
Languages
English, Russian, Italian
Interests (12)
technology, programming, science, human rights, animal welfare, environment, skateboarding, snowboarding, tennis, health, learning, foreign languages
FalkorDB Graph Edges

Person node 7bd19581... • ATTENDED → Rochester Institute of Technology (School node) • ATTENDED → Shenendehowa HS • WORKED_AT → Frame.io • WORKED_AT → Katabatic Digital

Jimmy Wales

Founder & CEO at Wikitribune/WT.Social · ID: 14
node_uuid: 68facb3e-cf3f-4035-97f4-7e72e0c5aae0
Work Experience (5 records)
Title | Company | Period
Owner | Fandom | Current
Board Member | Wikimedia Foundation | 2003 – present
Executive Chairman | The People’s Operator | 2014 – 2015
Founder | Mighty Capital | 2001 – 2007
Founder & CEO | Wikitribune | 2017 – present
Education (3 records)
School | Degree | End
University of Alabama | M.S. Finance | 1991
Auburn University | B.S. Finance | 1989
Randolph High School |  | 1983
Skills (21)
internet entrepreneurship, wikipedia, open source, crowdsourcing, philanthropy, public speaking, journalism, finance, venture capital, media, technology, + 10 more
Enrichment Note

PDL returned not_found for Jimmy Wales because his profile is high-profile/protected. All data came from Enrich Layer (LinkedIn scrape) instead, and the pipeline handled the fallback gracefully.

Jon Dahl

Co-Founder, CEO at Mux · ID: 17
node_uuid: 14818071-5a47-4add-a5fe-8e992004a9c0
Work Experience
Title | Company | Period
Co-Founder, CEO | Mux | Current
VP Engineering | Brightcove | 2012 – 2015
Co-Founder, CEO | Zencoder | 2010 – 2012
Education
School | Degree
Trinity International University | BA
Wheaton College |
QA Pipeline Result

Jon Dahl was one of two people who failed in the original production batch test (306s timeout on run-base-company-process). After QA fixes — specifically the direct function.run bypass that eliminates the HTTP hop — he now passes 9/9 in 45s.

Charles Njenga

ID: 20 — Example with rich certifications
Certifications (7+ records)
Certification | Issuer
Advanced React and Redux | Udemy
Modern JavaScript | Udemy
Complete React Developer | Udemy
CSS - The Complete Guide | Udemy
Google Africa Developer Scholarship | Google / Andela
JavaScript Algorithms & Data Structures | Udemy
Modern React with Redux | Udemy

Each certification goes through resolve-edges-certifications which matches the issuer (Udemy, Google) to a master_company record. In production, this function had no input parameter and queried ALL certifications globally — our QA fix scopes it to this person only.

6. Critical Bugs Found in Production Code

Deep code review uncovered 6 bugs. Three are data-corruption-level severity (globally-scoped queries that modify records belonging to other people).

Critical — Data Corruption

Bug #1: Certifications — No Input Parameter

Function resolve-edges-certifications (12719) had NO input parameter at all. All 3 db.query certification calls queried the entire table. When enriching Charles Njenga (ID 20), it would also modify Mario Haarmann’s (ID 18) 3 certifications.

Fix: Added master_person_id input. Added WHERE clause to all 3 queries.
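
The effect of the missing person filter is easy to reproduce in miniature. The sketch below uses an in-memory SQLite table as a stand-in for the Xano certification table. The link_issuer function, the sample rows, and the company ID 501 are invented for illustration, and the real fix is in XanoScript rather than SQL.

```python
import sqlite3


def link_issuer(conn, master_person_id, org_name, master_company_id):
    """Link one person's certifications from a given issuer to a company record."""
    cur = conn.execute(
        "UPDATE certification SET master_company_id = ? "
        # The master_person_id condition is the scoping fix; without it the
        # update would touch every person's rows for this issuer.
        "WHERE org_name = ? AND master_person_id = ?",
        (master_company_id, org_name, master_person_id),
    )
    conn.commit()
    return cur.rowcount


# Invented sample rows: two people who both hold a certification from one issuer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE certification ("
    "id INTEGER PRIMARY KEY, master_person_id INTEGER, "
    "org_name TEXT, master_company_id INTEGER)"
)
conn.executemany(
    "INSERT INTO certification (master_person_id, org_name) VALUES (?, ?)",
    [(20, "Udemy"), (20, "Udemy"), (18, "Udemy")],
)

updated = link_issuer(conn, 20, "Udemy", 501)
untouched = conn.execute(
    "SELECT master_company_id FROM certification WHERE master_person_id = 18"
).fetchone()[0]
```

With the filter in place, only person 20's two rows are updated and person 18's row keeps its NULL company link. Dropping the master_person_id condition would update all three rows.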

Critical — Data Corruption

Bug #2: Honors — 2 of 4 Sections Unscoped

Function resolve-edges-honor (12715): domain resolution and LinkedIn resolution sections queried ALL 71 honor records globally instead of just the current person’s.

Fix: Added master_person_id WHERE clause to sections 2 and 3.

Critical — Data Corruption

Bug #3: Volunteering — 2 of 3 Sections Unscoped

Function resolve-edges-volunteering (12716): LinkedIn resolution and Cypher creation sections queried ALL 105 volunteering records globally.

Fix: Added master_person_id WHERE clause to sections 2 and 3.

High

Bug #4: Company Process — Empty Table Names

Function run-base-company-process (12720) had 3 database calls with "" as the table name in the staff_count section. Deploys fine but silently fails at runtime.

Fix: Identified correct table name. Replaced all 3 empty strings.

High — Root cause of 764 stuck records

Bug #5: No Error Isolation in Main Processor

process-enrich-layer (12712) has 16 independent sections. A crash in Section 2 (avatar) kills Sections 3-16 (biographies, location, skills, education, work, certs, etc.). Production has 764 enrich_history records stuck with processing=true.

Fix: Each of 16 sections wrapped in individual try_catch. Crashes logged to crash_log, continue to next section.
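
The isolation pattern can be sketched as a loop that wraps each section in its own exception handler. Python's try/except stands in for XanoScript's try_catch here. The run_sections function, the (name, callable) section list, and the list-based crash_log are assumptions for illustration, and the sketch omits the sections_skip counter from the QA response.

```python
def run_sections(sections, person, crash_log):
    """Run every section; a failure is logged and the loop moves on."""
    result = {"sections_run": 0, "sections_ok": 0, "errors": []}
    for name, section in sections:
        result["sections_run"] += 1
        try:
            section(person)
            result["sections_ok"] += 1
        except Exception as exc:  # isolate the crash to this one section
            crash_log.append({"section": name, "error": str(exc)})
            result["errors"].append(name)
    return result


def broken_avatar(person):
    # Hypothetical failing section used to demonstrate isolation.
    raise ValueError("avatar download failed")


crash_log = []
seen = []
outcome = run_sections(
    [
        ("name", lambda p: seen.append("name")),
        ("avatar", broken_avatar),
        ("skills", lambda p: seen.append("skills")),
    ],
    {"id": 2},
    crash_log,
)
```

The skills section still runs after the avatar section raises, and the failure is recorded in both crash_log and the structured result instead of aborting the whole run.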

High

Bug #6: Zombie Processing Records

complete-person-enrich (12713): No try_catch. Crash leaves processing=true forever and queue entry never cleaned up.

Fix: Wrapped in try_catch. Queue cleanup moved outside try block (always executes).
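
A minimal sketch of the cleanup ordering, assuming a dictionary of history records and a set as the queue; both are stand-ins, and the function signature is hypothetical. The point is only that the cleanup statements sit after the try/except, so they run whether or not the graph update raised.

```python
def complete_person_enrich(person_id, history, queue, update_person_node, crash_log):
    """Finalize enrichment; cleanup runs even when the node update fails."""
    ok = True
    try:
        update_person_node(person_id)
    except Exception as exc:
        ok = False
        crash_log.append({"person_id": person_id, "error": str(exc)})
    # Outside the try block: always clear the flag and drop the queue entry.
    history[person_id]["processing"] = False
    queue.discard(person_id)
    return ok


def failing_update(person_id):
    # Hypothetical graph update that crashes.
    raise RuntimeError("graph unavailable")


history = {17: {"processing": True}}
queue = {17}
crash_log = []
ok = complete_person_enrich(17, history, queue, failing_update, crash_log)
```

Even though the update fails, the record is no longer marked as processing and the queue entry is gone, so nothing is left in a zombie state.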

7. Before vs After — Key Metrics

Metric | Production (Before) | QA Pipeline (After)
Pass Rate | 99.2% (357/360) | 100% (180/180)
Failures | 3 (timeout on company process) | 0
Stuck Processing Records | 778 in production | 0 (guaranteed cleanup)
Global Query Bugs | 3 functions unscoped | All queries scoped to person
Error Isolation | None (1 crash kills all 16 sections) | Per-section try_catch
Response Format | String: "success" | Structured JSON: {processed, resolved, errors, skipped}
Duplicate History | Creates new record every run | Check-and-reuse existing records
AI Self-Fix | None | Claude Opus 4.6 via OpenRouter with 13 XanoScript rules
Company Process Timeout | HTTP hop timeout at 300s (Xano nginx) | Direct function.run (no HTTP hop)

8. Function-by-Function Comparison

All 9 enrichment functions cloned into qa/ namespace with fixes applied.

1. process-enrich-layer

ID: 12712 · 16 sections
Main data processor: fans out Enrich Layer JSON into 16 relational tables (name, avatar, bio, location, skills, gender, LinkedIn, languages, education, work, certs, volunteering, projects, publications, honors, interests).
Bugs: No error isolation: 1 crash kills all 16 sections. No duplicate history prevention. Root cause of 764 stuck records.
Fixes: Per-section try_catch (16 sections). Duplicate history check-and-reuse. Null guard on person name before name-format_v2.
New: Structured response: {sections_run, sections_ok, sections_skip, errors[]}. Crash logging per section.
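
The duplicate-history fix is a check-and-reuse (get-or-create) lookup. The sketch below shows the idea over a plain list of dictionaries. The function name and the choice of (master_person_id, data_source_id) as the lookup key are assumptions, since the report does not state which fields identify an existing history record.

```python
def get_or_create_history(history, master_person_id, data_source_id):
    """Reuse the existing history row for this person and source, else create one."""
    for row in history:
        if (row["master_person_id"] == master_person_id
                and row["data_source_id"] == data_source_id):
            return row  # reuse: no new record on a re-run
    row = {
        "master_person_id": master_person_id,
        "data_source_id": data_source_id,
        "processing": True,
    }
    history.append(row)
    return row
```

Calling it twice with the same arguments returns the same row and leaves the list at one entry, which is the behavior described as "check-and-reuse existing records".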

2. complete-person-enrich

ID: 12713
Finalization: marks processing complete, updates person node in FalkorDB (visibility=true, sync embedding), cleans up enrichment queue.
Bugs: No try_catch. Crash = zombie processing=true record + orphaned queue entry.
Fixes: Wrapped in try_catch. Queue cleanup moved outside try block (always runs).

3. resolve-edges-education

ID: 12714
3-phase resolution: match schools by LinkedIn URL → domain → name. Creates ATTENDED edges in FalkorDB with LLM-generated descriptions. Derives ATTENDED_TOGETHER edges from date overlap.
Bugs: None; already properly scoped.
New: Structured response: {processed, resolved, errors, skipped}.

4. resolve-edges-honor

ID: 12715
Links honors to issuing organizations. Creates Honor nodes + RECEIVED_HONOR and ISSUED_BY edges in FalkorDB using Twig-template Cypher.
Bugs: Sections 2 & 3 query ALL 71 honors globally (missing person filter).
Fixes: Added master_person_id WHERE clause to both sections. Null guard on node_uuid.

5. resolve-edges-volunteering

ID: 12716
Links volunteering records to organizations via domain/LinkedIn matching.
Bugs: Sections 2 & 3 query ALL 105 volunteering records globally.
Fixes: Added master_person_id WHERE clause to both sections.

6. resolve-edges-work

ID: 12717
Links work records to companies. Creates WORKED_AT/WORKS_AT edges. Deduplicates similar titles. Sets best current role on master_person via LLM.
Bugs: None; already properly scoped.
New: Structured response: {processed, resolved, errors, skipped}.

7. resolve-edges-projects-publications

ID: 12718
Links projects and publications to associated companies.
Bugs: None.
New: Structured response with counts.

8. resolve-edges-certifications

ID: 12719
Links certifications to issuing organizations (e.g., Udemy, Google, iSAQB).
Bugs: CRITICAL: NO input parameter. All 3 db.query calls operate on entire certifications table for ALL people.
Fixes: Added master_person_id input. Scoped all 3 WHERE clauses. Updated test-stage caller.

9. run-base-company-process

ID: 12720
Company enrichment: processes PDL + Enrich Layer + YC data, links Fundable organizations, extracts funding/deals, updates company node in FalkorDB.
Bugs: 3 db calls use empty table name (""). Timeout on large deal counts. HTTP hop timeout at 300s.
Fixes: Corrected table names. Direct function.run bypass. Deal count protection (>100 = skip).
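
The deal count protection amounts to a guard at the top of the deal loop. A sketch under stated assumptions: the threshold of 100 comes from the fix description above, while the process_deals function, the handle_deal callback (assumed to return True when a deal is resolved), and the reuse of the {processed, resolved, errors, skipped} response shape are illustrative.

```python
MAX_DEALS = 100  # from the fix description: more than 100 deals means skip


def process_deals(deals, handle_deal):
    """Skip oversized deal lists up front; otherwise process each deal in isolation."""
    result = {"processed": 0, "resolved": 0, "errors": [], "skipped": 0}
    if len(deals) > MAX_DEALS:
        result["skipped"] = len(deals)  # bail out before the expensive loop
        return result
    for deal in deals:
        try:
            if handle_deal(deal):
                result["resolved"] += 1
            result["processed"] += 1
        except Exception as exc:
            result["errors"].append(str(exc))
    return result
```

Reporting the skipped count in the response, rather than returning silently, makes it visible which companies were cut short by the guard.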

9. Test Results — 20-Person Batch

Final verification: 20 people, 9 stages each, 180 total stages. All passed with 0 failures.

People: 20
Stages / Person: 9
Total Passed: 180
Failures: 0
Total Time: 327s
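# | Name | Company | Passed | Failed | Time | Status
1 | Josh Diamond | Frame.io | 9 | 0 | 22s | PASS
2 | John Traver | Frame.io | 9 | 0 | 22s | PASS
3 | Emery Wells | Frame.io | 9 | 0 | 3s | PASS
4 | Molly Alter |  | 9 | 0 | 18s | PASS
5 | Jason Diamond | The Diamond Bros. | 9 | 0 | 17s | PASS
6 | Itai Tsiddon |  | 9 | 0 | 33s | PASS
7 | Amish Jani | FirstMark Capital | 9 | 0 | 42s | PASS
8 | Jared Leto |  | 9 | 0 | 2s | PASS
9 | Mark L. Pederson | Orbiter | 9 | 0 | 26s | PASS
10 | Kevin Spacey |  | 9 | 0 | 5s | PASS
11 | Thomas Hesse |  | 9 | 0 | 13s | PASS
12 | Walter Kortschak |  | 9 | 0 | 8s | PASS
13 | Jimmy Wales | Wikitribune | 9 | 0 | 8s | PASS
14 | Larry Sanger |  | 9 | 0 | 9s | PASS
15 | Clark Valberg |  | 9 | 0 | 9s | PASS
16 | Jon Dahl | Mux | 9 | 0 | 45s | PASS
17 | Mario Haarmann |  | 9 | 0 | 6s | PASS
18 | Vijay Nagappan |  | 9 | 0 | 6s | PASS
19 | Charles Njenga |  | 9 | 0 | 8s | PASS
20 | Dynamo Mbugua |  | 9 | 0 | 21s | PASS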

Average: 16.4s/person. Fastest: Jared Leto (2s). Slowest: Jon Dahl (45s).

Previously failing: Larry Sanger (was 438s timeout → now 9s) and Jon Dahl (was 306s timeout → now 45s).

10. Implementation Phases

Phase 1: Critical Bug Fixes (P0)

Phase 2: Resilience (P1)

Phase 3: Observability (P1)

Phase 4: AI Self-Fix Loop (P2)

11. Production Pipeline Health (Current State)

Current state of the production enrichment system, measured via the enrichment diagnostics MCP.

Total People: 2,143
Stuck Processing: 778
Invisible: 218 (10.2%)
Placeholder Avatars: 1,227
Issue | Count | Severity | Root Cause
Stuck enrich_history records (processing=true) | 778 | HIGH | Bug #5 & #6: no error isolation, no zombie cleanup
People with visibility=false | 218 (10.2%) | HIGH | Stuck processing prevents complete-person-enrich from running
Placeholder avatars marked as main | 1,227 | MEDIUM | Systemic bug in replace-avatar logic (separate from enrichment)
Company queue backlog | 10,159 | MEDIUM | New companies queued by edge resolvers faster than processed
People missing enrich_data record | 5 | LOW | Created before person_enrich_data table existed
Crash log entries | 29 | INFO | Various runtime errors captured by crash_log table

The QA pipeline’s fixes directly address the top two issues: per-section try_catch eliminates stuck records, and guaranteed cleanup in complete-person-enrich prevents zombies.

12. Recommendations

  1. Port QA fixes to production: The 6 bugs exist in live production functions. The QA clones prove the fixes work at 100% pass rate across 20 people.
  2. Clean up 778 stuck records: Run a one-time cleanup to set processing=false on all stuck enrich_history_person records, allowing re-enrichment.
  3. Run QA batch on full dataset: We tested 20 people (IDs 2-21). The database has 2,143. A broader batch will surface edge cases.
  4. Enable AI self-fix in production: The Opus 4.6 loop can diagnose and fix XanoScript errors automatically, reducing manual debugging.
  5. Address company queue backlog (10,159): Edge resolvers create new master_company records faster than they’re processed. May need batch company enrichment or prioritization.
  6. Fix placeholder avatar systemic bug: 1,227 people have placeholder avatars marked as main — this is a separate bug in the replace-avatar function.
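
The one-time cleanup in recommendation 2 reduces to a single scoped update. The sketch below runs it against an in-memory SQLite table purely to show the shape of the operation. The column layout and the release_stuck_records function are assumptions, and in practice the job would presumably run as a Xano function against the real enrich_history_person table.

```python
import sqlite3


def release_stuck_records(conn):
    """Clear the processing flag on every stuck record; return how many were released."""
    cur = conn.execute(
        "UPDATE enrich_history_person SET processing = 0 WHERE processing = 1"
    )
    conn.commit()
    return cur.rowcount


# Invented sample rows: two stuck records and one that already finished.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrich_history_person ("
    "id INTEGER PRIMARY KEY, master_person_id INTEGER, processing INTEGER)"
)
conn.executemany(
    "INSERT INTO enrich_history_person (master_person_id, processing) VALUES (?, ?)",
    [(2, 1), (14, 1), (17, 0)],
)
released = release_stuck_records(conn)
still_stuck = conn.execute(
    "SELECT COUNT(*) FROM enrich_history_person WHERE processing = 1"
).fetchone()[0]
```

Returning the affected row count gives a quick check against the expected number of stuck records before any re-enrichment is kicked off.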