Building the AI company graph: entity resolution at scale

Connecting companies to their repos, models, papers, and patents requires solving entity resolution across messy public data. Here is how we approach it.

The entity resolution problem

In a generic company database, an entity is a row: name, domain, headquarters, employee count. In an AI-specific intelligence system, an entity is a graph node connected to repositories, models, papers, patents, funding rounds, and team members. The challenge is not storing the data. The challenge is linking it correctly.

A single AI company might maintain 40 public repositories under three different GitHub organizations, publish models on Hugging Face under a handle that does not match the company name, and have researchers publishing papers under university affiliations. Resolving these connections at scale is a non-trivial engineering problem.

How we approach linking

Claradb uses a combination of deterministic matching, heuristic scoring, and manual review to build the company graph. The deterministic layer handles the easy cases: verified domain links, explicit GitHub organization connections, and Hugging Face organization pages that list company affiliation.

The heuristic layer handles the harder cases: matching paper authors to company employees, connecting unlabeled repositories to organizations based on contributor overlap, and resolving model ownership when the publishing account does not match the company entity.

// Simplified entity resolution pipeline
const pipeline = {
  deterministic: [
    "domain_match",
    "github_org_verified",
    "hf_org_verified",
  ],
  heuristic: [
    "contributor_overlap",
    "author_affiliation",
    "model_naming_pattern",
  ],
  review: [
    "confidence_below_threshold",
    "conflicting_signals",
    "new_entity_candidates",
  ],
};
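To make the layering concrete, here is a minimal sketch of how a single candidate link might move through the three stages. It is an illustration under stated assumptions, not Claradb's actual implementation: the `Candidate` shape, the `resolve` function, the use of Jaccard overlap for contributor matching, and the 0.6 and 0.2 thresholds are all hypothetical. Only the signal names come from the pipeline above.

```typescript
// Hypothetical sketch: types, thresholds, and scoring are illustrative assumptions.
type Decision = "linked" | "needs_review" | "rejected";

interface Candidate {
  companyDomain: string;
  repoHomepageDomain?: string;   // domain declared on the repository, if any
  companyContributors: string[]; // logins already tied to the company
  repoContributors: string[];    // logins that committed to the candidate repo
}

interface Resolution {
  decision: Decision;
  signal: string; // which pipeline signal produced the decision
  score: number;  // 1 for deterministic matches, overlap score otherwise
}

// Drop duplicate logins so each contributor is counted once.
function unique(xs: string[]): string[] {
  return xs.filter((x, i) => xs.indexOf(x) === i);
}

// Jaccard overlap between two contributor lists, in [0, 1].
function contributorOverlap(a: string[], b: string[]): number {
  const ua = unique(a);
  const ub = unique(b);
  const shared = ua.filter((login) => ub.indexOf(login) !== -1).length;
  const union = ua.length + ub.length - shared;
  return union === 0 ? 0 : shared / union;
}

function resolve(c: Candidate, linkAt = 0.6, reviewAt = 0.2): Resolution {
  // Deterministic layer: an exact domain match short-circuits scoring.
  if (c.repoHomepageDomain !== undefined && c.repoHomepageDomain === c.companyDomain) {
    return { decision: "linked", signal: "domain_match", score: 1 };
  }
  // Heuristic layer: score the candidate by contributor overlap.
  const score = contributorOverlap(c.companyContributors, c.repoContributors);
  if (score >= linkAt) {
    return { decision: "linked", signal: "contributor_overlap", score };
  }
  // Review layer: plausible but weak evidence goes to a human.
  if (score >= reviewAt) {
    return { decision: "needs_review", signal: "confidence_below_threshold", score };
  }
  return { decision: "rejected", signal: "contributor_overlap", score };
}

// Usage with made-up data: no declared domain, partial contributor overlap.
const example = resolve({
  companyDomain: "acme.example",
  companyContributors: ["ana", "bo", "cy"],
  repoContributors: ["ana", "di"],
});
console.log(example.decision, example.signal);
```

The ordering is the point of the sketch: cheap, unambiguous checks run first, and only candidates they cannot settle are scored or queued for review.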

Keeping the graph current

The company graph is not a static snapshot. Claradb mirrors public data sources on a regular cadence and re-runs entity resolution when new signals appear. This means a company that publishes a new model or creates a new repository will have that activity reflected in its momentum score within the next update cycle.

The mirroring system is designed to be eventually consistent rather than real-time. The tradeoff is intentional: the product is built for market research cadences, not trading signals. A score that updates daily is more useful for diligence workflows than a score that updates every minute but is noisy.
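One way to picture the eventually consistent design is a batch that collects signals as they are observed and applies them only when a cycle runs. The sketch below is an assumption-laden illustration: the `Signal` shape, the `UpdateCycle` class, and the simple per-company activity count (a stand-in, not the real momentum score) are all hypothetical.

```typescript
// Hypothetical sketch of a batched, eventually consistent update cycle.
interface Signal {
  companyId: string;
  kind: "new_model" | "new_repo" | "new_paper";
  observedAt: number; // epoch milliseconds
}

class UpdateCycle {
  private pending: Signal[] = [];
  // Stand-in for a real score: the number of signals applied per company.
  private applied: { [companyId: string]: number } = {};

  // Called when the mirror notices something new. Nothing is recomputed yet.
  record(signal: Signal): void {
    this.pending.push(signal);
  }

  // What a reader sees right now; may lag behind recorded signals.
  read(companyId: string): number {
    return this.applied[companyId] || 0;
  }

  // Run once per cycle (for example daily): apply every pending signal
  // and return the ids of companies whose data changed.
  runCycle(): string[] {
    const touched: { [companyId: string]: boolean } = {};
    for (let i = 0; i < this.pending.length; i++) {
      const id = this.pending[i].companyId;
      this.applied[id] = (this.applied[id] || 0) + 1;
      touched[id] = true;
    }
    this.pending = [];
    return Object.keys(touched);
  }
}

// Usage with made-up data: reads lag until the cycle runs.
const cycle = new UpdateCycle();
cycle.record({ companyId: "acme", kind: "new_model", observedAt: Date.now() });
console.log(cycle.read("acme")); // still the pre-cycle value
cycle.runCycle();
console.log(cycle.read("acme")); // now reflects the recorded signal
```

Readers between cycles see a slightly stale but stable value, which is the tradeoff described above.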

Where the boundaries are

Entity resolution is never perfect. Claradb is transparent about confidence levels and makes the linking methodology inspectable so users can evaluate the graph quality for themselves. When a connection is uncertain, the profile shows the evidence and lets the analyst decide rather than hiding the ambiguity behind a clean interface.
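As a rough illustration of what an inspectable link could look like, the sketch below keeps the evidence next to the confidence value and flags uncertain links instead of hiding them. The `Link` and `Evidence` shapes, the 0.8 threshold, and the company and repository names are hypothetical and not taken from Claradb's schema.

```typescript
// Hypothetical shape of an inspectable link; field names are illustrative.
interface Evidence {
  signal: string; // e.g. "contributor_overlap"
  detail: string; // human-readable explanation shown to the analyst
}

interface Link {
  company: string;
  target: string;     // repository, model, paper, and so on
  confidence: number; // 0..1
  evidence: Evidence[];
  analystVerdict?: "accepted" | "rejected";
}

// A link is uncertain until an analyst rules on it or confidence is high enough.
function isUncertain(link: Link, threshold = 0.8): boolean {
  return link.analystVerdict === undefined && link.confidence < threshold;
}

// Render the link as plain text lines, always listing the evidence.
function renderLink(link: Link): string[] {
  const lines = [
    link.company + " -> " + link.target + " (confidence " + link.confidence.toFixed(2) + ")",
  ];
  if (isUncertain(link)) {
    lines.push("uncertain: analyst decision required");
  }
  for (let i = 0; i < link.evidence.length; i++) {
    lines.push("  " + link.evidence[i].signal + ": " + link.evidence[i].detail);
  }
  return lines;
}

// Usage with made-up data.
const sampleLink: Link = {
  company: "Acme AI",
  target: "acme-labs/vision-kit",
  confidence: 0.5,
  evidence: [
    { signal: "contributor_overlap", detail: "2 of 5 contributors match known employees" },
    { signal: "model_naming_pattern", detail: "repository name matches a published model prefix" },
  ],
};
console.log(renderLink(sampleLink).join("\n"));
```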