Entity Resolution Methodology
BLUF
Entity resolution — also called identity resolution, record linkage, or deduplication — is the process of determining whether multiple data records or mentions across different sources refer to the same real-world entity (person, organization, vessel, device, or account). It is a foundational analytical task in OSINT: adversaries, criminals, and intelligence targets routinely use pseudonyms, transliterated name variants, shell companies, and multiple accounts to fragment their digital footprint. Entity resolution reassembles that footprint into a coherent identity picture.
Assessment: Entity resolution is the single most analytically intensive step in most OSINT investigations — it requires systematic methodology, tolerance for ambiguity, and rigorous evidence documentation to avoid misidentification. Misidentification is not merely an analytical failure; in intelligence and legal contexts it is an operational and ethical risk that can cause concrete harm to innocent parties.
Entity Types in OSINT Context
Five primary entity classes recur across OSINT investigations. Each carries distinct attribute structures, distinct sources, and distinct failure modes.
- Natural Persons. Individuals — the most common and most complex resolution target. Challenges: aliases, name variants, transliteration differences across scripts, pseudonyms, multiple nationalities, identity-document forgery, and the simple statistical problem of name collision (how many “Mohammed Hassan” or “John Smith” entries exist in a global corpus?).
- Organizations / Legal Entities. Companies, NGOs, government bodies. Challenges: beneficial ownership obscuration through nominee directors, name variations across jurisdictions, subsidiary and holding-company structures, shell companies registered at agent addresses, and intentional name collisions (“Acme Holdings Ltd” registered in five jurisdictions, only one of which is the actual operator).
- Online Accounts / Personas. Social media accounts, forum usernames, email addresses, gaming handles. Challenges: sock puppets, coordinated inauthentic behavior (CIB), anonymous account attribution, and platform-specific identity models.
- Devices / Infrastructure. IP addresses, domain names, hardware fingerprints, TLS certificates. Challenges: shared infrastructure across unrelated campaigns, VPN/proxy attribution, CDN obfuscation, dynamic IP reassignment.
- Vessels / Vehicles / Aircraft. Ships (IMO numbers, MMSI, flags of convenience), aircraft (ICAO 24-bit hex codes, tail numbers), vehicles (VIN, plate numbers). Challenges: AIS/ADS-B spoofing, flag-of-convenience reflagging, re-registration cycles, and stolen-identity overlays.
See Network Analysis Methodology for the link-level analysis that typically follows successful entity resolution.
Core Concepts
Record linkage — Determining whether two records from different databases refer to the same entity. A classical data-science problem (Fellegi-Sunter, 1969) increasingly applied in OSINT contexts where the “databases” are heterogeneous: corporate registries, social media APIs, leaked dumps, news archives.
Deduplication — Identifying and merging duplicate records within a single dataset. The narrowest case of entity resolution and the most tractable, because schema is consistent.
Disambiguation — Distinguishing between multiple distinct entities that share similar attributes: common names, similar usernames, shared infrastructure. The inverse of record linkage — the analytical task is to split what naive matching would merge.
Confidence scoring — Assigning a probability that two records refer to the same entity, based on the strength of matching attributes. Resolution is rarely binary. It is a confidence continuum, and the discipline of explicit confidence assignment is what separates serious entity resolution from intuitive pattern-matching.
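The confidence continuum can be made explicit with Fellegi-Sunter style match weights. The sketch below is illustrative only: the attribute names and the m/u probabilities in `MU_PROBS` are assumed values, not calibrated figures, and the prior passed to `match_probability` is likewise an assumption the analyst would have to justify.

```python
import math

# Hypothetical m/u probabilities per attribute (assumed, not calibrated):
# m = P(attribute agrees | records describe the same entity)
# u = P(attribute agrees | records describe different entities)
MU_PROBS = {
    "date_of_birth": (0.95, 0.003),
    "email": (0.90, 0.0001),
    "surname": (0.92, 0.02),
    "city": (0.80, 0.10),
}

def match_weight(agreements: dict[str, bool]) -> float:
    """Sum of log2 likelihood ratios over the compared attributes."""
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = MU_PROBS[attr]
        if agrees:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total

def match_probability(weight: float, prior: float) -> float:
    """Turn a total weight into a posterior probability via prior odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)

strong = match_weight({"date_of_birth": True, "email": True, "surname": True, "city": True})
weak = match_weight({"date_of_birth": False, "email": False, "surname": False, "city": True})
print(round(strong, 2), round(weak, 2))
```

Rare agreements (low u) contribute far more weight than common ones, which is why a shared city counts for little next to a shared email address.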
The false positive / false negative tradeoff. High-confidence matching (requiring many corroborating attributes) reduces false positives — incorrectly merging distinct entities — but increases false negatives, failing to link records that genuinely belong together. Assessment: In intelligence and legal contexts, false positives carry higher cost: misidentification of an innocent person can lead to surveillance, reputational damage, sanctions exposure, or physical harm. Analysts should tune methodology toward precision over recall unless the operational context explicitly inverts that ranking (e.g., screening-stage triage where downstream review will catch false positives).
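The tradeoff can be seen by sweeping a merge threshold over scored candidate pairs. The scores and ground-truth flags in `SCORED_PAIRS` are invented for illustration; in practice they would come from a labeled review sample.

```python
# Hypothetical scored candidate pairs: (match score, truly the same entity?).
SCORED_PAIRS = [
    (0.97, True), (0.93, True), (0.88, False), (0.85, True),
    (0.74, True), (0.69, False), (0.61, True), (0.55, False),
    (0.42, False), (0.30, False),
]

def precision_recall(pairs, threshold):
    """Precision and recall if every pair scoring >= threshold is merged."""
    tp = sum(1 for score, same in pairs if score >= threshold and same)
    fp = sum(1 for score, same in pairs if score >= threshold and not same)
    fn = sum(1 for score, same in pairs if score < threshold and same)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

for threshold in (0.5, 0.7, 0.9):
    p, r = precision_recall(SCORED_PAIRS, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy sample, raising the threshold from 0.5 to 0.9 removes every false merge while leaving most true links unmade, which is the precision-first posture described above.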
Person Entity Resolution — Step-by-Step Methodology
The highest-stakes and most complex entity class. A six-step process.
Step 1 — Seed Identification
Define what is known about the target: full name in all scripts/romanizations, known aliases, date of birth (or estimated age range), nationality, known locations (current and historical), known accounts, known associates, distinguishing biographical anchors (employer, alma mater, military service). Document the confidence level of each seed attribute — seed attributes drawn from authoritative sources (passport, court record) are weighted differently than those drawn from secondary reporting.
Step 2 — Name Variant Generation
Names are the most common entry point and the most error-prone. Before any cross-source query, generate the variant set:
- Transliteration variants:
  - Arabic: Mohammed / Muhammad / Muḥammad / Mohamed / Mohamad
  - Russian: Sergey / Sergei / Sergej / Serguei
  - Chinese: pinyin vs. Wade-Giles vs. Cantonese/Yue romanization (Mao Zedong / Mao Tse-tung; Hong Kong vs. mainland surname order)
  - Persian/Farsi: name segments may collapse or expand between Arabic-script source and Latin transcription
- Diminutives and given-name variants: Robert / Rob / Bobby / Bob / Robbie; Alexander / Sasha / Sanya; Mikhail / Misha
- Compound surname variations: hyphenated vs. non-hyphenated; maiden name vs. married name; patronymic forms (in Russian usage the patronymics Ivanovich / Ivanovna are distinct from the surname Ivanov / Ivanova and distinguish persons within a single family, a non-trivial source of misidentification for analysts who do not read Russian)
- Phonetic and string-similarity matching: Soundex, Metaphone, and Double Metaphone for phonetic keys; Jaro-Winkler for character-level similarity scoring
- Script variants: Cyrillic ↔ Latin, Arabic ↔ Latin, Han ↔ Pinyin, Hebrew ↔ Latin; always retain the native-script form alongside any romanization
Assessment: Most failed person resolutions trace to incomplete name-variant generation at this step. The standard error is querying only the romanization the analyst encountered first, missing the form used by the entity’s home-jurisdiction primary sources. See the multi-lingual OSINT standing rule on actor-language tiering.
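A minimal sketch of variant generation, assuming small hand-built lookup tables (`DIMINUTIVES` and `TRANSLITERATIONS` are hypothetical and far from complete) plus a plain American Soundex key. It only illustrates the mechanics; real coverage needs language-specific transliteration rules and the native-script form.

```python
import unicodedata

# Hypothetical lookup tables; a real investigation needs much larger ones.
DIMINUTIVES = {
    "robert": {"rob", "bob", "bobby", "robbie"},
    "alexander": {"sasha", "sanya"},
    "mikhail": {"misha"},
}
TRANSLITERATIONS = {
    "mohammed": {"muhammad", "mohamed", "mohamad"},
    "sergey": {"sergei", "sergej", "serguei"},
}

def strip_diacritics(name: str) -> str:
    """Fold 'Muḥammad' to 'Muhammad' by dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def given_name_variants(name: str) -> set[str]:
    """Folded name plus any known diminutives and romanizations."""
    key = strip_diacritics(name).lower()
    variants = {key}
    for table in (DIMINUTIVES, TRANSLITERATIONS):
        for canonical, alts in table.items():
            if key == canonical or key in alts:
                variants |= {canonical} | alts
    return variants

def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    groups = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
    codes = {c: d for d, letters in groups.items() for c in letters}
    letters = [c for c in strip_diacritics(name).lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (result + "000")[:4]

print(sorted(given_name_variants("Sergei")))
print(soundex("Mohammed"), soundex("Muḥammad"))
```

Phonetic keys such as Soundex are tuned to English spelling, so they should widen the candidate set for review rather than decide a match.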
Step 3 — Cross-Source Collection
Systematically query available sources using all name variants. Standard sources by domain:
- Corporate registries: Companies House (UK), SEC EDGAR (US), OpenCorporates (aggregator), ICIJ Offshore Leaks Database, national equivalents in every relevant jurisdiction
- Court records: PACER (US federal), CourtListener, RECAP, state court databases, foreign equivalents where available
- Social media: LinkedIn, Facebook, Twitter/X, VK, Telegram, Instagram, TikTok, Mastodon, Bluesky, regional platforms (Weibo, Naver)
- Academic publications: Google Scholar, ResearchGate, ORCID, Scopus, institutional repositories
- Domain / WHOIS registrations: historical WHOIS (DomainTools, ViewDNS), passive DNS
- Electoral / voter records where lawfully public (US states, UK electoral roll)
- Property records: county assessor records (US), HM Land Registry (UK), national cadastres
- News archives: Factiva, LexisNexis, Google News, ProQuest, regional press archives
- Data breach databases: Have I Been Pwned, IntelX (Intelligence X), DeHashed — for correlating handles, emails, password hashes to identities. Use only within the analyst’s lawful and ethical scope; see OSINT Ethics. For dark web source collection from .onion breach marketplaces and dark web forums, see Dark Web Methodology.
Step 4 — Attribute Matching and Confidence Assignment
Build a matching matrix for every candidate pair. Each attribute carries a weight reflecting its discriminative power:
| Attribute | Weight | Notes |
|---|---|---|
| Full name (exact) | High | Discriminative power inversely proportional to name frequency in the relevant population |
| Date of birth | Very High | Most discriminating single biographical attribute |
| Email address | Very High | Unique identifier if verified against the actual mailbox |
| Phone number (MSISDN) | Very High | Typically unique to a SIM/account pair at a given time; numbers are recycled, so confirm the date ranges overlap |
| National ID / Passport number | Definitive | If verifiable against issuing authority — otherwise treat as claim |
| Physical description (height, distinguishing features) | Medium | Approximate; subject to age and self-report error |
| Geographic location (current / historical) | Medium | Must overlap plausibly with timeline |
| Social network overlap (shared contacts) | Medium-High | Probabilistic; strong when overlap is large or unusual |
| Writing style / linguistic fingerprint (stylometry) | Low-Medium | Requires sufficient corpus; not standalone-conclusive |
| Profile photographs | Medium-High | Requires facial-recognition corroboration with controlled false-positive rate |
Composite confidence rubric (aligned with Intelligence Confidence Levels):
- High: 3+ Very High attributes match with no contradictions
- Medium: 2 Very High or 4+ High attributes match with no contradictions
- Low: 1 Very High or 2–3 High attributes match with no contradictions
- Insufficient: no Very High attribute and fewer than 2 High attributes, or any unresolved contradiction; do not assert identity
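The rubric can be encoded so that it is applied the same way to every candidate pair. This sketch makes two assumptions that the rubric leaves open: a Definitive attribute is counted as Very High, and any unresolved contradiction returns Insufficient pending Step 5. `MatchTally` and `composite_confidence` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class MatchTally:
    very_high: int       # matching Very High (or Definitive) attributes
    high: int            # matching High attributes
    contradictions: int  # unresolved contradictions between the two records

def composite_confidence(tally: MatchTally) -> str:
    """Apply the composite rubric to a tally of matching attributes."""
    if tally.contradictions > 0:
        # Assumption: an unresolved contradiction blocks any assertion of identity.
        return "Insufficient"
    if tally.very_high >= 3:
        return "High"
    if tally.very_high >= 2 or tally.high >= 4:
        return "Medium"
    if tally.very_high >= 1 or tally.high >= 2:
        return "Low"
    return "Insufficient"

print(composite_confidence(MatchTally(very_high=2, high=1, contradictions=0)))
```

Medium and Low weighted attributes are deliberately left out of the tally; they inform the written reasoning but do not move the rubric level in this sketch.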
Step 5 — Contradiction Analysis
After matching, actively search for any attribute that contradicts the proposed entity match:
- Two records assert the entity was in different cities on the same date (and no plausible travel pattern reconciles them)
- Age / DOB inconsistency exceeding measurement error
- An educational credential dated before the institution opened
- A claimed account creation date that precedes the platform’s launch
- Native-language fluency claimed in one record contradicted by writing samples in another
Any clean contradiction degrades confidence or rules out the match entirely. Assessment: Contradictions are more informative than confirmations in entity resolution. A single unambiguous contradiction eliminates a candidate; confirmations only accumulate probabilistically. Analysts should spend at least as much time searching for contradiction as for confirmation.
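Two of the contradiction checks listed above lend themselves to simple automation: date-order checks and implausible travel between two sightings. In this sketch the 900 km/h speed ceiling is an assumed default roughly matching airliner cruise speed, and the function names are illustrative.

```python
import math
from datetime import date, datetime

def predates(event: date, earliest_possible: date) -> bool:
    """True if a claimed event falls before it could have happened
    (e.g. a degree dated before the institution opened)."""
    return event < earliest_possible

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def impossible_travel(sighting_a, sighting_b, max_speed_kmh=900.0) -> bool:
    """Each sighting is (datetime, latitude, longitude).
    True if the distance cannot be covered in the elapsed time."""
    (t1, lat1, lon1), (t2, lat2, lon2) = sighting_a, sighting_b
    hours = abs((t2 - t1).total_seconds()) / 3600
    return haversine_km(lat1, lon1, lat2, lon2) > max_speed_kmh * hours

london = (datetime(2024, 3, 1, 9, 0), 51.5074, -0.1278)
new_york = (datetime(2024, 3, 1, 11, 0), 40.7128, -74.0060)
print(impossible_travel(london, new_york))
```

A flag from either check is a prompt for review, not an automatic rejection: timestamps may be in different time zones, and a location field may reflect where content was posted rather than where the person was.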
Step 6 — Synthesis and Documentation
For each resolved entity:
- Document every attribute match: source, URL, date accessed, confidence rating, screenshot or archive link
- State the final composite confidence and the reasoning
- Explicitly list what would upgrade confidence and what would downgrade or break the match
- Identify the Gap — what additional information, if obtained, would resolve remaining ambiguity
This documentation is the audit trail. In journalistic, legal, or intelligence-product contexts, it is what permits review, replication, and defensible publication. For the general OSINT source-evaluation standard applied across all collection steps, see Source Verification Framework.
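One possible shape for that audit trail is a small structured record that serializes to JSON. The field names below are assumptions chosen to mirror the checklist above, and the example values are placeholders rather than real findings.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AttributeEvidence:
    attribute: str
    value: str
    source: str
    url: str
    accessed: str            # ISO 8601 date the source was viewed
    confidence: str          # rating assigned to this single attribute
    archive_link: str = ""   # screenshot or web-archive reference

@dataclass
class ResolutionRecord:
    entity_label: str
    composite_confidence: str
    reasoning: str
    evidence: list[AttributeEvidence] = field(default_factory=list)
    would_upgrade: list[str] = field(default_factory=list)
    would_break: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record, nested evidence included."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)

record = ResolutionRecord(
    entity_label="Subject A (placeholder)",
    composite_confidence="Medium",
    reasoning="Two Very High attributes agree; no contradictions found.",
    evidence=[AttributeEvidence(
        attribute="date_of_birth", value="1980-01-01", source="court record",
        url="https://example.org/record/1", accessed="2024-03-01",
        confidence="Very High")],
    would_upgrade=["Verified national ID number"],
    would_break=["Documented presence elsewhere on a matched date"],
    gaps=["No native-script form of the name located"],
)
print(record.to_json())
```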
Online Persona / Account Resolution
Distinct methodology for resolving anonymous online accounts to individuals or to each other. The attribute surface is narrower but richer in machine-tractable signals.
Username correlation. Most users reuse usernames across platforms (habit, vanity, brand consistency). Tools that query 300+ platforms for a given handle: Sherlock, WhatsMyName, Maigret, Namechk. Unique or unusual usernames are high-confidence correlators; common usernames (john123, admin) are near-useless. Assessment: Username reuse is the single most productive technique in online persona resolution and should usually be the first step after seed identification.
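How distinctive a handle is can be approximated by how rarely its normalized form occurs in whatever handle corpus the analyst already holds. The sketch below uses an IDF-style weight; the normalization rules and the sample corpus are assumptions for illustration, not a standard.

```python
import math
import re
from collections import Counter

def normalize_handle(handle: str) -> str:
    """Lower-case, drop separators and trailing digits:
    'John_Smith123' becomes 'johnsmith'."""
    core = re.sub(r"[._\-]", "", handle.lower())
    return re.sub(r"\d+$", "", core) or core  # keep all-digit handles intact

def rarity_weights(observed_handles: list[str]) -> dict[str, float]:
    """IDF-style weight per normalized handle: log(N / count)."""
    normalized = [normalize_handle(h) for h in observed_handles]
    counts = Counter(normalized)
    total = len(normalized)
    return {handle: math.log(total / count) for handle, count in counts.items()}

# Hypothetical corpus of handles seen across platforms.
corpus = ["john123", "John_123", "john", "admin", "admin1", "qx_vantablack_77"]
weights = rarity_weights(corpus)
for handle, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{handle:15s} {weight:.2f}")
```

A high weight only says the handle is uncommon in the sample at hand; it still needs a second, independent attribute before two accounts are linked.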
Email correlation. Email addresses surface repeatedly in forum registrations, WHOIS data, breach dumps, support tickets, and disclosed correspondence. Query an email against IntelX, DeHashed, or HIBP to map its breach history — this often reveals the full set of platforms where the address was used, and the usernames registered against it. The chain email → username → platform account → posted content is the standard persona-reconstruction path.
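The email → username → platform chain is a graph expansion over whatever records the analyst lawfully holds. The sketch below runs that expansion over synthetic tuples; `RECORDS` is invented data, and no real breach service or API is queried.

```python
from collections import defaultdict

# Synthetic records for illustration: (email, username, platform).
RECORDS = [
    ("a.person@example.org", "nightheron", "forum-a"),
    ("a.person@example.org", "nightheron", "shop-b"),
    ("a.person@example.org", "nh_trader", "market-c"),
    ("other@example.org", "nh_trader", "forum-d"),
    ("unrelated@example.org", "bluefinch", "forum-a"),
]

def expand_from_email(seed_email, records):
    """Follow email -> username -> email links until nothing new appears."""
    by_email, by_user = defaultdict(set), defaultdict(set)
    for email, user, platform in records:
        by_email[email].add((user, platform))
        by_user[user].add((email, platform))

    emails, users, accounts = {seed_email}, set(), set()
    frontier = [("email", seed_email)]
    while frontier:
        kind, value = frontier.pop()
        if kind == "email":
            for user, platform in by_email[value]:
                accounts.add((user, platform))
                if user not in users:
                    users.add(user)
                    frontier.append(("user", user))
        else:
            for email, platform in by_user[value]:
                accounts.add((value, platform))
                if email not in emails:
                    emails.add(email)
                    frontier.append(("email", email))
    return emails, users, accounts

emails, users, accounts = expand_from_email("a.person@example.org", RECORDS)
print(sorted(emails), sorted(users), sorted(accounts))
```

Links reached through a shared username rather than the seed address (here `other@example.org`) are leads only: a common handle can belong to a different person, so each hop should carry its own confidence rating.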
Writing style / stylometry. Word frequency analysis, punctuation habits, function-word distribution, syntactic patterns. Tools: JGAAP (Java Graphical Authorship Attribution Program), LIWC (Linguistic Inquiry and Word Count), custom n-gram models. Primarily useful for narrowing a field of candidates rather than for definitive attribution. Gap: Stylometry is admissible in some academic and intelligence contexts but rarely meets criminal-court evidentiary standards as standalone evidence. Treat as corroborating, never as load-bearing.
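As a rough illustration of function-word stylometry, the sketch below builds a relative-frequency profile over a short, assumed English function-word list and compares two texts by cosine similarity. It is not how JGAAP or LIWC work internally, and the word list is far smaller than a serious analysis would use.

```python
import math
import re
from collections import Counter

# Assumed, deliberately short English function-word list.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "with", "as", "but", "on", "not"]

def profile(text: str) -> list[float]:
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

sample_a = "It is not the size of the corpus that matters, but the care taken with it."
sample_b = "The report is on the desk and a copy is in the archive for review."
print(round(cosine(profile(sample_a), profile(sample_b)), 3))
```

Samples this short are far too small to support any conclusion; the code only shows the mechanics that a larger corpus would feed.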
Digital fingerprinting.
- Browser fingerprint (if website analytics are available to the analyst): OS, screen resolution, browser version, timezone, font list, canvas hash
- Device fingerprints leaked through metadata — pre-2014 EXIF data in posted images frequently contained GPS coordinates and device serial numbers; many platforms now strip this, but historical content retains it. See Geolocation Methodology for image-based location reconstruction from visual and metadata cues.
- IP address geolocation, VPN/Tor-aware: cluster analysis on IP ranges, timing correlations across sessions
Behavioral pattern analysis. See Pattern of Life Analysis for the systematic behavioral-inference framework applied in targeting and investigative contexts.
- Posting time analysis: when does the account post? Histogram by hour-of-day correlates with timezone and work schedule; gaps correlate with sleep, employment, religious observance
- Topic coherence: what subjects does the account consistently engage with? Expertise depth, geographic knowledge, linguistic background
- Network analysis: who does the account interact with? Shared contacts with known accounts are strong correlators — see Network Analysis Methodology for the link-analysis layer
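Posting-time analysis can be sketched as an hour-of-day histogram plus a search for the quietest window. The offset estimate below rests on a strong assumption, namely that the quietest six hours are sleep beginning around local midnight, and it expects timezone-aware timestamps; function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

def hour_histogram(timestamps) -> list[int]:
    """24-bin histogram of posting hours, expressed in UTC.
    Timestamps must be timezone-aware."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def quietest_window(histogram, width=6) -> int:
    """Start hour (UTC) of the contiguous window with the fewest posts."""
    def window_total(start):
        return sum(histogram[(start + i) % 24] for i in range(width))
    return min(range(24), key=window_total)

def candidate_utc_offset(histogram, assumed_local_sleep_start=0, width=6) -> int:
    """Offset in hours that would place the quiet window at the assumed
    local sleep start. An assumption-laden lead, not a finding."""
    start = quietest_window(histogram, width)
    offset = (assumed_local_sleep_start - start) % 24
    return offset - 24 if offset > 12 else offset

# Synthetic account that posts once per hour from 03:00 to 20:00 UTC.
posts = [datetime(2024, 1, 1, hour, 0, tzinfo=timezone.utc) for hour in range(3, 21)]
histogram = hour_histogram(posts)
print(quietest_window(histogram), candidate_utc_offset(histogram))
```

Shift workers, scheduled posts, and shared accounts all break the sleep assumption, so the estimate should be read alongside topic and language cues.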
Organizational Entity Resolution
For companies, the analytic challenge is less about pseudonymous identity and more about penetrating opaque beneficial ownership.
GLEIF LEI (Legal Entity Identifier). 20-character unique entity code, mandatory for financial market participants, searchable at gleif.org. Provides definitive identification for registered legal entities within scope. Fact: LEI coverage is mandatory for derivatives counterparties under EMIR and Dodd-Frank but is not universal across non-financial corporates.
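Before searching for an LEI, its format can be checked locally: the last two characters are check digits computed with ISO 7064 MOD 97-10 over the first 18. The sketch below validates only the checksum, not whether the LEI has actually been issued; the prefix used in the example is made up.

```python
import re

def _to_digits(text: str) -> str:
    """Map 0-9 to themselves and A-Z to 10-35, concatenated as a digit string."""
    return "".join(str(int(ch, 36)) for ch in text)

def lei_check_digits(prefix18: str) -> str:
    """Compute the two check digits for an 18-character LEI prefix."""
    remainder = int(_to_digits(prefix18.upper()) + "00") % 97
    return f"{98 - remainder:02d}"

def is_valid_lei(lei: str) -> bool:
    """True if the string is 18 alphanumerics plus 2 digits and the
    whole value is congruent to 1 modulo 97."""
    lei = lei.strip().upper()
    if not re.fullmatch(r"[0-9A-Z]{18}[0-9]{2}", lei):
        return False
    return int(_to_digits(lei)) % 97 == 1

# Made-up prefix for illustration; not a real registered entity.
prefix = "529900EXAMPLE00000"
candidate = prefix + lei_check_digits(prefix)
print(candidate, is_valid_lei(candidate))
```

A checksum failure usually means a transcription error in the source document, which is worth catching before a registry search returns nothing.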
Corporate resolution workflow:
- Identify the entity’s legal name and jurisdiction of registration
- Search the primary registry (Companies House, SEC EDGAR, national equivalents), then aggregators and leak datasets (OpenCorporates, ICIJ Offshore Leaks)
- Trace all associated entities: subsidiaries, holding companies, the cluster of entities at the same registered address
- Map directors across entities — shared directorships are a strong indicator of related entities
- Map registered address: is it a residential address, a working office, or a registered-agent address servicing hundreds of shell companies (Mossack Fonseca, Trident Trust, etc.)?
OpenCorporates Match API automates cross-registry corporate entity matching at scale. For human-rights and sanctions investigations, ICIJ Offshore Leaks remains the highest-yield single dataset for shell-company resolution.
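The director-mapping and address-clustering steps of the workflow reduce to grouping operations once registry data has been collected. The sketch below runs on invented company records (`COMPANIES` is hypothetical) and uses a deliberately crude address normalization; it does not call any registry API.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical registry extracts for illustration.
COMPANIES = [
    {"name": "Alpha Holdings Ltd", "address": "Suite 4, 12 Harbour Road",
     "directors": {"J. Doe", "R. Roe"}},
    {"name": "Beta Trading Ltd", "address": "suite 4 12 Harbour Road",
     "directors": {"J. Doe"}},
    {"name": "Gamma Shipping SA", "address": "9 Market Square",
     "directors": {"A. Nother"}},
]

def normalize_address(address: str) -> str:
    """Lower-case and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", address.lower()).strip()

def address_clusters(companies, min_size=2):
    """Groups of companies registered at the same normalized address."""
    clusters = defaultdict(list)
    for company in companies:
        clusters[normalize_address(company["address"])].append(company["name"])
    return {addr: sorted(names) for addr, names in clusters.items()
            if len(names) >= min_size}

def shared_directors(companies):
    """Company pairs with at least one director name in common."""
    links = {}
    for a, b in combinations(companies, 2):
        common = a["directors"] & b["directors"]
        if common:
            links[(a["name"], b["name"])] = common
    return links

print(address_clusters(COMPANIES))
print(shared_directors(COMPANIES))
```

Director names are themselves subject to the person-resolution problems described earlier, so a shared name is a candidate link until date of birth or another identifier confirms it.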
See also Corporate OSINT and Due Diligence and Financial Intelligence. For cryptocurrency and blockchain-based entity resolution tracing on-chain ownership, see Crypto Tracing Tools Guide.
Vessel / Aircraft / Device Resolution
Vessels.
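- IMO number: unique and permanent, assigned to the hull (normally at build) under the IMO Ship Identification Number Scheme, which S&P Global (formerly IHS Maritime) administers on the IMO’s behalf. The gold-standard identifier; does not change with reflagging or sale.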
- MMSI: dynamic — can be reassigned, frequently spoofed. AIS spoofing has been extensively documented in the Black Sea, the Persian Gulf, and around sanctioned-cargo operations.
- Flag of convenience: vessel may change registration multiple times in a service life while IMO remains constant. Cross-reference reflagging history against ownership-change records.
- AIS track verification: MarineTraffic, VesselFinder, and Lloyd’s List archives provide historical tracks that corroborate or refute claimed voyages.
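An IMO number carries a check digit, so a mistyped or fabricated number can often be caught before any database lookup. The sketch below implements the standard checksum (first six digits weighted 7 down to 2, last digit of the sum equals the seventh digit); the function name is illustrative.

```python
import re

def is_valid_imo(value: str) -> bool:
    """Validate the check digit of a 7-digit IMO ship number.
    Accepts an optional 'IMO' prefix."""
    match = re.fullmatch(r"(?:IMO\s*)?(\d{7})", value.strip(), flags=re.IGNORECASE)
    if not match:
        return False
    digits = [int(d) for d in match.group(1)]
    # Weights 7, 6, 5, 4, 3, 2 applied to the first six digits.
    checksum = sum(d * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return checksum % 10 == digits[6]

print(is_valid_imo("IMO 9074729"), is_valid_imo("9074728"))
```

A passing checksum shows only that the number is well formed; whether it belongs to the vessel broadcasting it still has to be confirmed against registry and track data.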
Aircraft.
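- ICAO 24-bit hex address: allocated by the state of registry and programmed into the Mode-S transponder. It is the identifier ADS-B receivers actually observe, but it is normally reassigned when the aircraft is re-registered in another state, so record both hex address and tail number with dates rather than treating either as permanent.
- ADS-B Exchange: unfiltered ICAO hex tracking, including military and state aircraft whenever they broadcast. The reference source for adversary aviation OSINT.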
- FlightAware / Flightradar24: filtered datasets — military and some private traffic are excluded by request or by regulator demand.
Infrastructure (IP / domain).
- IP address attribution: ASN (Autonomous System Number) identifies the hosting provider; passive DNS resolves historical domain-to-IP mappings, revealing infrastructure pivots that real-time DNS would miss.
- BGP hijacking detection: tools like BGPmon, RIPE RIS, and Cloudflare Radar track unexpected route changes, which helps in attributing route-hijack incidents, including hijacks used to redirect DNS traffic.
- TLS certificate history: crt.sh aggregates Certificate Transparency log entries for a domain, exposing the certificates that were publicly logged for it and revealing infrastructure relationships hidden in current DNS. Certificates that were never submitted to a CT log, such as private-CA or older certificates, will not appear.
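The passive-DNS pivot described above amounts to finding domains that resolved to the same IP during overlapping time windows. The sketch below works on invented records (`PDNS` uses reserved documentation IP ranges and `.example` names) and assumes each record carries first-seen and last-seen dates; it does not query any passive-DNS provider.

```python
from collections import defaultdict
from datetime import date

# Hypothetical passive-DNS rows: (domain, ip, first_seen, last_seen).
PDNS = [
    ("login-portal.example", "203.0.113.10", date(2023, 1, 5), date(2023, 3, 1)),
    ("cdn-update.example", "203.0.113.10", date(2023, 2, 10), date(2023, 4, 2)),
    ("old-tenant.example", "203.0.113.10", date(2021, 6, 1), date(2021, 9, 30)),
    ("mail.example", "198.51.100.7", date(2023, 1, 1), date(2023, 12, 31)),
]

def cohosted_pairs(records):
    """Domain pairs that shared an IP at the same time, with the overlap window."""
    by_ip = defaultdict(list)
    for domain, ip, first, last in records:
        by_ip[ip].append((domain, first, last))

    pairs = []
    for ip, rows in by_ip.items():
        for i, (d1, f1, l1) in enumerate(rows):
            for d2, f2, l2 in rows[i + 1:]:
                # Two intervals overlap when each starts before the other ends.
                if d1 != d2 and f1 <= l2 and f2 <= l1:
                    pairs.append((ip, d1, d2, max(f1, f2), min(l1, l2)))
    return pairs

for ip, d1, d2, start, end in cohosted_pairs(PDNS):
    print(ip, d1, d2, start, end)
```

Requiring a time overlap filters out the earlier tenant of a reassigned IP (`old-tenant.example` here), which is the shared-infrastructure false positive flagged under Devices / Infrastructure.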
See Shodan-Censys Guide for the infrastructure-scanning layer that complements certificate and DNS pivoting.
Tools Reference
| Tool | Function | Primary Entity Type |
|---|---|---|
| Maltego | Graph-based entity resolution and link analysis | All types |
| SpiderFoot | Automated multi-source resolution | Online accounts, IP, domain |
| Sherlock / WhatsMyName / Maigret | Username cross-platform search | Online personas |
| IntelX (Intelligence X) | Email / username in breach data | Online personas |
| OpenCorporates | Cross-registry corporate matching | Organizations |
| GLEIF LEI search | Legal entity identifier resolution | Organizations (financial) |
| ICIJ Offshore Leaks | Shell-company and beneficial-ownership data | Organizations |
| MarineTraffic / VesselFinder | AIS vessel tracking | Vessels |
| ADS-B Exchange | ICAO hex aircraft tracking (unfiltered) | Aircraft |
| JGAAP / LIWC | Authorship attribution / stylometry | Online personas |
| crt.sh | TLS certificate transparency history | Infrastructure |
| RIPE RIS / BGPmon | BGP route monitoring | Infrastructure |
See Maltego Guide for the operational tradecraft of graph-based resolution and Social Media Intelligence for platform-specific persona resolution.
Ethical and Legal Constraints
- Misidentification is the primary risk. A false positive in person entity resolution can destroy an innocent person’s reputation, expose them to sanctions screening errors, or — in adversarial environments — result in physical harm. The cost asymmetry of false positives over false negatives must drive methodology.
- Minimum-necessary-information principle. Resolve only to the granularity the investigation requires. Over-collection creates retention and disclosure liability without analytic gain.
- Data retention and lawful basis. Resolved entity profiles containing personal data are subject to GDPR (EU), LGPD (Brazil), CCPA (California), and equivalent regimes. Establish lawful basis before collection, not after; document retention schedules.
- Publication threshold. Do not publish resolution results involving private individuals without additional public-interest justification, ideally cleared with editorial / legal review.
- Use of breach data. Many breach-data resources are legally usable in jurisdictions where the data is already public, but operational ethics and platform terms-of-service may impose additional constraints. Document the basis on which breach data is used.
See OSINT Ethics and OSINT Legal Framework. For human-rights contexts where evidentiary thresholds are paramount, see OSINT for Human Rights and the Berkeley Protocol.
Sources
- NIST Special Publication 800-188: De-identifying Government Datasets — High
- Elmagarmid, Ahmed et al. — “Duplicate Record Detection: A Survey” (IEEE Transactions on Knowledge and Data Engineering, 2007) — High
- Fellegi, Ivan P. & Sunter, Alan B. — “A Theory for Record Linkage” (Journal of the American Statistical Association, 1969) — High (foundational)
- Bellingcat methodology documentation (bellingcat.com/resources) — High
- IntelX (Intelligence X) documentation and API guides — Medium
- Bazzell, Michael — Open Source Intelligence Techniques (10th ed., 2023) — High
- Berkeley Protocol on Digital Open Source Investigations (UN OHCHR / UC Berkeley HRC, 2020) — High
- ICIJ Offshore Leaks Database documentation — High
- GLEIF (Global Legal Entity Identifier Foundation) — gleif.org — High
Related
OSINT · Network Analysis Methodology · Link Analysis · Attribution · Social Media Intelligence · Financial Intelligence · Maltego Guide · Shodan-Censys Guide · OSINT Ethics · OSINT Legal Framework · Corporate OSINT and Due Diligence · OSINT for Human Rights · Pattern of Life Analysis · Source Verification Framework · Intelligence Confidence Levels · Geolocation Methodology · Crypto Tracing Tools Guide · Dark Web Methodology