Entity Resolution Methodology

BLUF

Entity resolution — also called identity resolution, record linkage, or deduplication — is the process of determining whether multiple data records or mentions across different sources refer to the same real-world entity (person, organization, vessel, device, or account). It is a foundational analytical task in OSINT: adversaries, criminals, and intelligence targets routinely use pseudonyms, transliterated name variants, shell companies, and multiple accounts to fragment their digital footprint. Entity resolution reassembles that footprint into a coherent identity picture.

Assessment: Entity resolution is the single most analytically intensive step in most OSINT investigations — it requires systematic methodology, tolerance for ambiguity, and rigorous evidence documentation to avoid misidentification. Misidentification is not merely an analytical failure; in intelligence and legal contexts it is an operational and ethical risk that can cause concrete harm to innocent parties.

Entity Types in OSINT Context

Five primary entity classes recur across OSINT investigations. Each carries distinct attribute structures, distinct sources, and distinct failure modes.

Natural Persons. Individuals — the most common and most complex resolution target. Challenges: aliases, name variants, transliteration differences across scripts, pseudonyms, multiple nationalities, identity-document forgery, and the simple statistical problem of name collision (how many “Mohammed Hassan” or “John Smith” entries exist in a global corpus?).
Organizations / Legal Entities. Companies, NGOs, government bodies. Challenges: beneficial ownership obscuration through nominee directors, name variations across jurisdictions, subsidiary and holding-company structures, shell companies registered at agent addresses, and intentional name collisions (“Acme Holdings Ltd” registered in five jurisdictions, only one of which is the actual operator).
Online Accounts / Personas. Social media accounts, forum usernames, email addresses, gaming handles. Challenges: sock puppets, coordinated inauthentic behavior (CIB), anonymous account attribution, and platform-specific identity models.
Devices / Infrastructure. IP addresses, domain names, hardware fingerprints, TLS certificates. Challenges: shared infrastructure across unrelated campaigns, VPN/proxy attribution, CDN obfuscation, dynamic IP reassignment.
Vessels / Vehicles / Aircraft. Ships (IMO numbers, MMSI, flags of convenience), aircraft (ICAO 24-bit hex codes, tail numbers), vehicles (VIN, plate numbers). Challenges: AIS/ADS-B spoofing, flag-of-convenience reflagging, re-registration cycles, and stolen-identity overlays.

See Network Analysis Methodology for the link-level analysis that typically follows successful entity resolution.

Core Concepts

Record linkage — Determining whether two records from different databases refer to the same entity. A classical data-science problem (Fellegi-Sunter, 1969) increasingly applied in OSINT contexts where the “databases” are heterogeneous: corporate registries, social media APIs, leaked dumps, news archives.

Deduplication — Identifying and merging duplicate records within a single dataset. The narrowest case of entity resolution and the most tractable, because schema is consistent.

Disambiguation — Distinguishing between multiple distinct entities that share similar attributes: common names, similar usernames, shared infrastructure. The inverse of record linkage — the analytical task is to split what naive matching would merge.

Confidence scoring — Assigning a probability that two records refer to the same entity, based on the strength of matching attributes. Resolution is rarely binary. It is a confidence continuum, and the discipline of explicit confidence assignment is what separates serious entity resolution from intuitive pattern-matching.
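
One way to make the confidence continuum concrete is the Fellegi-Sunter match weight from the record-linkage literature cited above. The sketch below is illustrative only: the m and u probabilities (how often an attribute agrees among true matches and among non-matches) are invented placeholder values, not calibrated estimates, and the attribute names are hypothetical.

```python
import math

# Hypothetical m/u probabilities, for illustration only.
# m = P(attribute agrees | records are the same entity)
# u = P(attribute agrees | records are different entities)
MU_PROBS = {
    "date_of_birth": (0.95, 0.003),
    "surname": (0.90, 0.02),
    "city": (0.70, 0.10),
}


def match_weight(attribute: str, agrees: bool) -> float:
    """Log2 likelihood ratio contributed by one attribute comparison."""
    m, u = MU_PROBS[attribute]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def total_weight(comparison: dict) -> float:
    """Sum per-attribute weights; higher totals favour 'same entity'."""
    return sum(match_weight(attr, agrees) for attr, agrees in comparison.items())
```

Agreement on a rare attribute (low u) adds far more weight than agreement on a common one, which is the same intuition behind weighting date of birth above city of residence.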

The false positive / false negative tradeoff. High-confidence matching (requiring many corroborating attributes) reduces false positives — incorrectly merging distinct entities — but increases false negatives, failing to link records that genuinely belong together. Assessment: In intelligence and legal contexts, false positives carry higher cost: misidentification of an innocent person can lead to surveillance, reputational damage, sanctions exposure, or physical harm. Analysts should tune methodology toward precision over recall unless the operational context explicitly inverts that ranking (e.g., screening-stage triage where downstream review will catch false positives).
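
The tradeoff can be shown with a few lines of arithmetic. This is a minimal sketch assuming a set of candidate pairs that already carry a match score and a ground-truth label; the scores and labels in PAIRS are made up for illustration.

```python
def precision_recall(scored_pairs, threshold):
    """scored_pairs: iterable of (score, is_true_match).

    Pairs scoring at or above the threshold are treated as merged.
    """
    tp = fp = fn = 0
    for score, is_match in scored_pairs:
        predicted = score >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1  # distinct entities wrongly merged
        elif not predicted and is_match:
            fn += 1  # genuine link missed
    # Convention used here: with nothing predicted (or nothing to find), report 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical labelled candidate pairs: (match score, truly the same entity?)
PAIRS = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.30, False),
]
```

On this toy data a threshold of 0.85 merges only the two strongest pairs (no false merges, half the true links missed), while 0.5 recovers every true link at the cost of two false merges.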

Person Entity Resolution — Step-by-Step Methodology

The highest-stakes and most complex entity class. A six-step process.

Step 1 — Seed Identification

Define what is known about the target: full name in all scripts/romanizations, known aliases, date of birth (or estimated age range), nationality, known locations (current and historical), known accounts, known associates, distinguishing biographical anchors (employer, alma mater, military service). Document the confidence level of each seed attribute — seed attributes drawn from authoritative sources (passport, court record) are weighted differently than those drawn from secondary reporting.

Step 2 — Name Variant Generation

Names are the most common entry point and the most error-prone. Before any cross-source query, generate the variant set:

Transliteration variants:
- Arabic: Mohammed / Muhammad / Muḥammad / Mohamed / Mohamad
- Russian: Sergey / Sergei / Sergej / Serguei
- Chinese: pinyin vs. Wade-Giles vs. Cantonese/Yue romanization (Mao Zedong / Mao Tse-tung; Hong Kong vs. mainland surname order)
- Persian/Farsi: name segments may collapse or expand between Arabic-script source and Latin transcription
Diminutives and given-name variants: Robert / Rob / Bobby / Bob / Robbie; Alexander / Sasha / Sanya; Mikhail / Misha
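Compound surname variations: hyphenated vs. non-hyphenated; maiden name vs. married name; patronymic forms (Russian Ivanovich / Ivanovna are patronymics built on the father’s given name and are distinct from the surname Ivanov / Ivanova; the patronymic is often what separates relatives who share a surname, and confusing the two is a non-trivial source of misidentification for non-Russian-literate analysts)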
Phonetic and fuzzy-match equivalents: Soundex, Metaphone, Double Metaphone (phonetic encodings); Jaro-Winkler (string-similarity scoring rather than a phonetic code)
Script variants: Cyrillic ↔ Latin, Arabic ↔ Latin, Han ↔ Pinyin, Hebrew ↔ Latin; always retain the native-script form alongside any romanization

Assessment: Most failed person resolutions trace to incomplete name-variant generation at this step. The standard error is querying only the romanization the analyst encountered first, missing the form used by the entity’s home-jurisdiction primary sources. See the multi-lingual OSINT standing rule on actor-language tiering.
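
A minimal sketch of two of the variant classes above: table-driven transliteration expansion and a phonetic key (classic American Soundex) for grouping spellings. The variant table is a deliberately tiny, hypothetical example, and the helper names are illustrative; a working table would be curated per language and script.

```python
import unicodedata

# Hypothetical, deliberately tiny variant table keyed by a lowercase seed form.
GIVEN_NAME_VARIANTS = {
    "mohammed": ["Mohammed", "Muhammad", "Muḥammad", "Mohamed", "Mohamad"],
    "sergey": ["Sergey", "Sergei", "Sergej", "Serguei"],
}

_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_SOUNDEX_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def ascii_fold(name: str) -> str:
    """Strip diacritics so that a form like 'Muḥammad' folds to 'Muhammad'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def soundex(name: str) -> str:
    """Classic four-character American Soundex key."""
    letters = [c for c in ascii_fold(name).upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset the run, so a repeated code is kept
    return (key + "000")[:4]


def expand_variants(given: str, surname: str) -> list:
    """Full-name query strings for every known variant of the given name."""
    givens = GIVEN_NAME_VARIANTS.get(given.lower(), [given])
    return [f"{g} {surname}" for g in givens]
```

All five romanizations of the Arabic example share one Soundex key, while "Sergej" and "Sergey" do not, which illustrates why phonetic keys supplement an explicit variant table rather than replace it. The native-script form still has to be queried separately.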

Step 3 — Cross-Source Collection

Systematically query available sources using all name variants. Standard sources by domain:

Corporate registries: Companies House (UK), SEC EDGAR (US), OpenCorporates (aggregator), ICIJ Offshore Leaks Database, national equivalents in every relevant jurisdiction
Court records: PACER (US federal), CourtListener, RECAP, state court databases, foreign equivalents where available
Social media: LinkedIn, Facebook, Twitter/X, VK, Telegram, Instagram, TikTok, Mastodon, Bluesky, regional platforms (Weibo, Naver)
Academic publications: Google Scholar, ResearchGate, ORCID, Scopus, institutional repositories
Domain / WHOIS registrations: historical WHOIS (DomainTools, ViewDNS), passive DNS
Electoral / voter records where lawfully public (US states, UK electoral roll)
Property records: county assessor records (US), HM Land Registry (UK), national cadastres
News archives: Factiva, LexisNexis, Google News, ProQuest, regional press archives
Data breach databases: Have I Been Pwned, IntelX (Intelligence X), DeHashed — for correlating handles, emails, password hashes to identities. Use only within the analyst’s lawful and ethical scope; see OSINT Ethics. For dark web source collection from .onion breach marketplaces and dark web forums, see Dark Web Methodology.

Step 4 — Attribute Matching and Confidence Assignment

Build a matching matrix for every candidate pair. Each attribute carries a weight reflecting its discriminative power:

Attribute	Weight	Notes
Full name (exact)	High	Discriminative power inversely proportional to name frequency in the relevant population
Date of birth	Very High	Most discriminating single biographical attribute
Email address	Very High	Unique identifier if verified against the actual mailbox
Phone number (MSISDN)	Very High	Typically unique to a SIM/account pair
National ID / Passport number	Definitive	If verifiable against issuing authority — otherwise treat as claim
Physical description (height, distinguishing features)	Medium	Approximate; subject to age and self-report error
Geographic location (current / historical)	Medium	Must overlap plausibly with timeline
Social network overlap (shared contacts)	Medium-High	Probabilistic; strong when overlap is large or unusual
Writing style / linguistic fingerprint (stylometry)	Low-Medium	Requires sufficient corpus; not standalone-conclusive
Profile photographs	Medium-High	Requires facial-recognition corroboration with controlled false-positive rate

Composite confidence rubric (aligned with Intelligence Confidence Levels):

High: 3+ Very High attributes match with no contradictions
Medium: 2 Very High or 4+ High attributes match
Low: 1 Very High or 2–3 High attributes with no contradictions
Insufficient: no Very High match and fewer than 2 High attributes; do not assert identity
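
The rubric can be written down as a small function so that it is applied the same way to every candidate pair. This is a sketch under two stated assumptions that the rubric itself does not spell out: a Definitive attribute is counted as Very High, and any unresolved contradiction blocks an identity assertion pending Step 5 review. The function name and labels are illustrative.

```python
from collections import Counter


def composite_confidence(matched_weights, contradictions=0):
    """Map the weights of matching attributes to a composite confidence label.

    matched_weights: weight labels of the attributes that matched,
    e.g. ["Very High", "High", "Medium"].
    """
    if contradictions:
        # Assumption: an unresolved contradiction blocks any assertion.
        return "Insufficient"
    counts = Counter(matched_weights)
    # Assumption: "Definitive" attributes count toward the Very High tally.
    very_high = counts["Very High"] + counts["Definitive"]
    high = counts["High"]
    if very_high >= 3:
        return "High"
    if very_high == 2 or high >= 4:
        return "Medium"
    if very_high == 1 or high >= 2:
        return "Low"
    return "Insufficient"
```

Checking the tiers from strongest to weakest resolves any overlap between them; Medium and lower-weight attributes are recorded as context but do not move the label in this sketch.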

Step 5 — Contradiction Analysis

After matching, actively search for any attribute that contradicts the proposed entity match:

Two records assert the entity was in different cities on the same date (and no plausible travel pattern reconciles them)
Age / DOB inconsistency exceeding measurement error
An educational credential dated before the institution opened
An account creation date that precedes the claimed platform launch
Native-language fluency claimed in one record contradicted by writing samples in another

Any clean contradiction degrades confidence or rules out the match entirely. Assessment: Contradictions are more informative than confirmations in entity resolution. A single unambiguous contradiction eliminates a candidate; confirmations only accumulate probabilistically. Analysts should spend at least as much time searching for contradiction as for confirmation.
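
Some of these checks are mechanical enough to automate as a first pass. A minimal sketch, assuming records are plain dictionaries with hypothetical keys ("dob" as a date, "sightings" as a mapping of date to city); the output is a list of flags for analyst review, not a verdict.

```python
from datetime import date


def find_contradictions(rec_a, rec_b, dob_tolerance_days=0):
    """Return contradictions between two candidate records for analyst review."""
    issues = []
    dob_a, dob_b = rec_a.get("dob"), rec_b.get("dob")
    if dob_a and dob_b and abs((dob_a - dob_b).days) > dob_tolerance_days:
        issues.append(f"DOB mismatch: {dob_a} vs {dob_b}")
    sightings_b = rec_b.get("sightings", {})
    for day, city in sorted(rec_a.get("sightings", {}).items()):
        other = sightings_b.get(day)
        if other and other != city:
            # Flag only; whether travel reconciles the two is an analyst call.
            issues.append(f"{day}: {city} vs {other}")
    return issues


def predates_launch(account_created, platform_launched):
    """True if an account claims a creation date before the platform existed."""
    return account_created < platform_launched
```

The tolerance parameter is where "measurement error" gets made explicit: a record that carries only a birth year might justify a tolerance of a year, while two passport-derived dates justify none.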

Step 6 — Synthesis and Documentation

For each resolved entity:

Document every attribute match: source, URL, date accessed, confidence rating, screenshot or archive link
State the final composite confidence and the reasoning
Explicitly list what would upgrade confidence and what would downgrade or break the match
Identify the Gap — what additional information, if obtained, would resolve remaining ambiguity

This documentation is the audit trail. In journalistic, legal, or intelligence-product contexts, it is what permits review, replication, and defensible publication. For the general OSINT source-evaluation standard applied across all collection steps, see Source Verification Framework.
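
The documentation items listed above map naturally onto a structured record. The sketch below is one possible shape, not a prescribed schema: every class and field name is a hypothetical choice made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class AttributeEvidence:
    attribute: str          # e.g. "date_of_birth"
    value: str
    source: str             # e.g. registry or archive name
    url: str
    accessed: date
    confidence: str         # rating for this single attribute match
    archive_link: str = ""  # screenshot or archived copy


@dataclass
class ResolutionRecord:
    entity_label: str
    composite_confidence: str
    reasoning: str
    evidence: list = field(default_factory=list)
    would_upgrade: list = field(default_factory=list)
    would_break: list = field(default_factory=list)
    gaps: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the full audit trail; dates are written as ISO strings."""
        return json.dumps(asdict(self), default=str, indent=2)
```

Keeping would_upgrade, would_break, and gaps as first-class fields makes it harder to file a resolution without stating what would change it.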

Online Persona / Account Resolution

Distinct methodology for resolving anonymous online accounts to individuals or to each other. The attribute surface is narrower but richer in machine-tractable signals.

Username correlation. Most users reuse usernames across platforms (habit, vanity, brand consistency). Tools that query 300+ platforms for a given handle: Sherlock, WhatsMyName, Maigret, Namechk. Unique or unusual usernames are high-confidence correlators; common usernames (john123, admin) are near-useless. Assessment: Username reuse is the single most productive technique in online persona resolution and should usually be the first step after seed identification.

Email correlation. Email addresses surface repeatedly in forum registrations, WHOIS data, breach dumps, support tickets, and disclosed correspondence. Query an email against IntelX, DeHashed, or HIBP to map its breach history — this often reveals the full set of platforms where the address was used, and the usernames registered against it. The chain email → username → platform account → posted content is the standard persona-reconstruction path.
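
The email → username → platform account chain is a graph traversal. A minimal sketch over already-collected rows, assuming each row is an (email, username, platform) tuple; the rows shown are invented, and no breach service is queried here.

```python
from collections import defaultdict, deque

# Hypothetical registration rows: (email, username, platform)
ROWS = [
    ("a.k@example.com", "nightheron77", "forum-x"),
    ("a.k@example.com", "nightheron77", "photo-site"),
    ("other@example.net", "nightheron77", "marketplace-y"),
    ("unrelated@example.org", "admin", "forum-x"),
]


def build_index(rows):
    """Undirected graph linking emails, usernames, and platform accounts."""
    graph = defaultdict(set)
    for email, username, platform in rows:
        e = ("email", email)
        u = ("username", username)
        p = ("account", f"{username}@{platform}")
        graph[e].add(u)
        graph[u].add(e)
        graph[u].add(p)
        graph[p].add(u)
    return graph


def pivot(graph, seed, max_hops=3):
    """Breadth-first expansion from a seed; returns {node: hops from seed}."""
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen
```

Every hop is a lead, not a confirmed identity: the second email surfaces only because it shares a username with the seed, and a shared handle can belong to different people. The hop count is worth recording alongside each finding for that reason.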

Writing style / stylometry. Word frequency analysis, punctuation habits, function-word distribution, syntactic patterns. Tools: JGAAP (Java Graphical Authorship Attribution Program), LIWC (Linguistic Inquiry and Word Count), custom n-gram models. Primarily useful for narrowing a field of candidates rather than for definitive attribution. Gap: Stylometry is accepted as supporting evidence in some academic and intelligence contexts but rarely meets criminal-court evidentiary standards as standalone evidence. Treat as corroborating, never as load-bearing.
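
A minimal sketch of the function-word approach, assuming English text and a short, hand-picked word list (a placeholder, not a validated feature set). It compares two texts by the cosine similarity of their function-word frequency profiles.

```python
import math
import re
from collections import Counter

# Placeholder feature set; real studies use much longer lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "for", "on", "with", "as", "but")


def profile(text: str) -> list:
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Scores from short samples are unstable, which is one reason the technique is better suited to ranking a candidate field than to asserting authorship.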

Digital fingerprinting.

Browser fingerprint (if website analytics are available to the analyst): OS, screen resolution, browser version, timezone, font list, canvas hash
Device fingerprints leaked through metadata — pre-2014 EXIF data in posted images frequently contained GPS coordinates and device serial numbers; many platforms now strip this, but historical content retains it. See Geolocation Methodology for image-based location reconstruction from visual and metadata cues.
IP address geolocation, VPN/Tor-aware: cluster analysis on IP ranges, timing correlations across sessions

Behavioral pattern analysis. See Pattern of Life Analysis for the systematic behavioral-inference framework applied in targeting and investigative contexts.

Posting time analysis: when does the account post? Histogram by hour-of-day correlates with timezone and work schedule; gaps correlate with sleep, employment, religious observance
Topic coherence: what subjects does the account consistently engage with? Expertise depth, geographic knowledge, linguistic background
Network analysis: who does the account interact with? Shared contacts with known accounts are strong correlators — see Network Analysis Methodology for the link-analysis layer
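
Posting-time analysis reduces to a histogram and a search for the quiet window. The sketch below assumes timezone-aware timestamps and makes one loud simplifying assumption: that the quietest eight-hour window is sleep beginning at local midnight. Both the window length and the assumed sleep start are adjustable parameters, and the result is a whole-hour estimate at best.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_histogram(timestamps):
    """Posts per UTC hour (list index 0..23). Timestamps must be timezone-aware."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(h, 0) for h in range(24)]


def quietest_window(hist, length=8):
    """Start hour (UTC) of the contiguous window with the fewest posts.

    The window wraps around midnight; ties resolve to the earliest start hour.
    """
    def window_total(start):
        return sum(hist[(start + i) % 24] for i in range(length))
    return min(range(24), key=window_total)


def estimate_utc_offset(hist, assumed_sleep_start=0, length=8):
    """Whole-hour UTC offset implied by the quiet window.

    Assumption: the quiet window begins at `assumed_sleep_start` local time.
    """
    start = quietest_window(hist, length)
    return (assumed_sleep_start - start + 12) % 24 - 12
```

Shift work, scheduled posts, and shared accounts all break the sleep assumption, so the estimate is a hypothesis to test against other locational attributes rather than a finding.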

Organizational Entity Resolution

For companies, the analytic challenge is less about pseudonymous identity and more about penetrating beneficial-ownership opacity.

GLEIF LEI (Legal Entity Identifier). 20-character unique entity code, mandatory for financial market participants, searchable at gleif.org. Provides definitive identification for registered legal entities within scope. Fact: LEI coverage is mandatory for derivatives counterparties under EMIR and Dodd-Frank but is not universal across non-financial corporates.
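
Before looking an LEI up, its structure can be checked offline: the last two characters are check digits computed under ISO 7064 MOD 97-10, as specified in ISO 17442. A minimal sketch; the function names are illustrative, and a code that passes only proves it is well-formed, not that it is registered.

```python
def _to_digits(s: str) -> str:
    """MOD 97-10 expansion: digits stay as-is, letters map to A=10 ... Z=35."""
    return "".join(str(int(ch, 36)) for ch in s.upper())


def lei_check_digits(base18: str) -> str:
    """Check digits for the first 18 characters of an LEI."""
    remainder = int(_to_digits(base18) + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(lei: str) -> bool:
    """True if the string is 20 alphanumerics with correct check digits."""
    lei = lei.strip().upper()
    if len(lei) != 20 or not lei.isascii() or not lei.isalnum():
        return False
    if not lei[18:].isdigit():
        return False
    return int(_to_digits(lei)) % 97 == 1
```

A failed check is a cheap early signal of a transcription error or a fabricated identifier in a source document.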

Corporate resolution workflow:

Identify the entity’s legal name and jurisdiction of registration
Search the primary registry (Companies House, SEC EDGAR, national equivalents), then aggregators and leak datasets (OpenCorporates, ICIJ Offshore Leaks)
Trace all associated entities: subsidiaries, holding companies, the cluster of entities at the same registered address
Map directors across entities — shared directorships are a strong indicator of related entities
Map registered address: is it a residential address, a working office, or a registered-agent address servicing hundreds of shell companies (Mossack Fonseca, Trident Trust, etc.)?
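
The director-mapping and address-mapping steps of the workflow above amount to inverting an index. A minimal sketch over hypothetical, already-collected registry records; company names, directors, and addresses are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical registry extracts.
COMPANIES = [
    {"name": "Acme Holdings Ltd", "directors": {"J. Doe", "A. Roe"},
     "address": "1 Agent Street"},
    {"name": "Borealis Trading SA", "directors": {"J. Doe"},
     "address": "9 Harbour Road"},
    {"name": "Cobalt Nominees Inc", "directors": {"M. Poe"},
     "address": "1 Agent Street"},
]


def shared_attribute_links(companies, key):
    """Map each pair of companies to the values of `key` they have in common."""
    index = defaultdict(set)
    for company in companies:
        raw = company[key]
        values = raw if isinstance(raw, (set, list, tuple)) else [raw]
        for value in values:
            index[value].add(company["name"])
    links = defaultdict(set)
    for value, names in index.items():
        for a, b in combinations(sorted(names), 2):
            links[(a, b)].add(value)
    return dict(links)
```

A shared registered-agent address will link hundreds of unrelated shells, so address links need the residential / office / agent classification from the workflow before they are read as evidence; director names need the same name-variant handling as any person resolution.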

OpenCorporates Match API automates cross-registry corporate entity matching at scale. For human-rights and sanctions investigations, ICIJ Offshore Leaks remains the highest-yield single dataset for shell-company resolution.

See also Corporate OSINT and Due Diligence and Financial Intelligence. For cryptocurrency and blockchain-based entity resolution tracing on-chain ownership, see Crypto Tracing Tools Guide.

Vessel / Aircraft / Device Resolution

Vessels.

IMO number: unique, permanent, assigned at build on behalf of the IMO by S&P Global (formerly IHS Maritime). The gold-standard identifier; does not change with reflagging or sale.
MMSI: dynamic — can be reassigned, frequently spoofed. AIS spoofing has been extensively documented in the Black Sea, the Persian Gulf, and around sanctioned-cargo operations.
Flag of convenience: vessel may change registration multiple times in a service life while IMO remains constant. Cross-reference reflagging history against ownership-change records.
AIS track verification: MarineTraffic, VesselFinder, and Lloyd’s List archives provide historical tracks that corroborate or refute claimed voyages.
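
An IMO number carries its own check digit, which gives a quick sanity test for numbers lifted from AIS messages or documents: multiply the first six digits by 7, 6, 5, 4, 3, 2, sum the products, and compare the last digit of the sum with the seventh digit. A minimal sketch; the function name is illustrative.

```python
def is_valid_imo(imo) -> bool:
    """Check-digit test for a seven-digit IMO ship number.

    Accepts an int or a string such as "IMO 9074729".
    """
    digits = "".join(ch for ch in str(imo) if ch.isdigit())
    if len(digits) != 7:
        return False
    total = sum(int(d) * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return total % 10 == int(digits[6])
```

For 9074729 the weighted sum is 139, whose last digit matches the final 9. A number that fails the check was mistyped or invented; one that passes still has to be matched against ownership and flag history.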

Aircraft.

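ICAO 24-bit hex address: allocated by the state of registry and programmed into the Mode S transponder; often more stable than the tail number within a single registry, but it normally changes when the aircraft is re-registered in another country.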
ADS-B Exchange: unfiltered ICAO hex tracking, including most military and state-aircraft traffic. The reference source for adversary aviation OSINT.
FlightAware / Flightradar24: filtered datasets — military and some private traffic are excluded by request or by regulator demand.

Infrastructure (IP / domain).

IP address attribution: ASN (Autonomous System Number) identifies the network operator announcing the address space, which is often but not always the hosting provider; passive DNS resolves historical domain-to-IP mappings, revealing infrastructure pivots that real-time DNS would miss.
BGP hijacking detection: tools like BGPmon, RIPE RIS, and Cloudflare Radar track unexpected route changes — useful in attributing route-hijack incidents and DNS poisoning operations.
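TLS certificate history: crt.sh aggregates Certificate Transparency log entries for a domain, exposing the certificates that have been publicly logged for it (certificates issued before CT logging or never submitted to a log will not appear) and revealing infrastructure relationships hidden in current DNS.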
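
One way certificate history exposes relationships is through hostnames that appear together on the same certificate. The sketch below clusters hostnames by shared certificates using a small union-find. It works on already-collected records in a hypothetical shape (a "names" set per certificate), not on any particular service's response format.

```python
from collections import defaultdict

# Hypothetical certificate records: DNS names listed on each certificate.
CERTS = [
    {"serial": "01", "names": {"shop.example.com", "pay.example.net"}},
    {"serial": "02", "names": {"pay.example.net", "cdn.example.org"}},
    {"serial": "03", "names": {"unrelated.example.io"}},
]


def cluster_by_shared_certificate(certs):
    """Group hostnames that are connected through co-listed certificates."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cert in certs:
        names = sorted(cert["names"])
        if not names:
            continue
        root = find(names[0])  # also registers single-name certificates
        for other in names[1:]:
            parent[find(other)] = root
            root = find(root)

    clusters = defaultdict(set)
    for name in list(parent):
        clusters[find(name)].add(name)
    return sorted(sorted(c) for c in clusters.values())
```

Certificates issued by CDNs and shared-hosting providers routinely bundle unrelated customers, so a cluster is a pivot to investigate, subject to the same shared-infrastructure caveat that applies to IP addresses.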

See Shodan-Censys Guide for the infrastructure-scanning layer that complements certificate and DNS pivoting.

Tools Reference

Tool	Function	Primary Entity Type
Maltego	Graph-based entity resolution and link analysis	All types
SpiderFoot	Automated multi-source resolution	Online accounts, IP, domain
Sherlock / WhatsMyName / Maigret	Username cross-platform search	Online personas
IntelX (Intelligence X)	Email / username in breach data	Online personas
OpenCorporates	Cross-registry corporate matching	Organizations
GLEIF LEI search	Legal entity identifier resolution	Organizations (financial)
ICIJ Offshore Leaks	Shell-company and beneficial-ownership data	Organizations
MarineTraffic / VesselFinder	AIS vessel tracking	Vessels
ADS-B Exchange	ICAO hex aircraft tracking (unfiltered)	Aircraft
JGAAP / LIWC	Authorship attribution / stylometry	Online personas
crt.sh	TLS certificate transparency history	Infrastructure
RIPE RIS / BGPmon	BGP route monitoring	Infrastructure

See Maltego Guide for the operational tradecraft of graph-based resolution and Social Media Intelligence for platform-specific persona resolution.

Ethical and Legal Constraints

Misidentification is the primary risk. A false positive in person entity resolution can destroy an innocent person’s reputation, expose them to sanctions screening errors, or — in adversarial environments — result in physical harm. The cost asymmetry of false positives over false negatives must drive methodology.
Minimum-necessary-information principle. Resolve only to the granularity the investigation requires. Over-collection creates retention and disclosure liability without analytic gain.
Data retention and lawful basis. Resolved entity profiles containing personal data are subject to GDPR (EU), LGPD (Brazil), CCPA (California), and equivalent regimes. Establish lawful basis before collection, not after; document retention schedules.
Publication threshold. Do not publish resolution results involving private individuals without additional public-interest justification, ideally cleared with editorial / legal review.
Use of breach data. Many breach-data resources are legally usable in jurisdictions where the data is already public, but operational ethics and platform terms-of-service may impose additional constraints. Document the basis on which breach data is used.

See OSINT Ethics and OSINT Legal Framework. For human-rights contexts where evidentiary thresholds are paramount, see OSINT for Human Rights and the Berkeley Protocol.

Sources

NIST Special Publication 800-188: De-identifying Government Datasets — High
Elmagarmid, Ahmed et al. — “Duplicate Record Detection: A Survey” (IEEE Transactions on Knowledge and Data Engineering, 2007) — High
Fellegi, Ivan P. & Sunter, Alan B. — “A Theory for Record Linkage” (Journal of the American Statistical Association, 1969) — High (foundational)
Bellingcat methodology documentation (bellingcat.com/resources) — High
IntelX (Intelligence X) documentation and API guides — Medium
Bazzell, Michael — Open Source Intelligence Techniques (10th ed., 2023) — High
Berkeley Protocol on Digital Open Source Investigations (UN OHCHR / UC Berkeley HRC, 2020) — High
ICIJ Offshore Leaks Database documentation — High
GLEIF (Global Legal Entity Identifier Foundation) — gleif.org — High

OSINT · Network Analysis Methodology · Link Analysis · Attribution · Social Media Intelligence · Financial Intelligence · Maltego Guide · Shodan-Censys Guide · OSINT Ethics · OSINT Legal Framework · Corporate OSINT and Due Diligence · OSINT for Human Rights · Pattern of Life Analysis · Source Verification Framework · Intelligence Confidence Levels · Geolocation Methodology · Crypto Tracing Tools Guide · Dark Web Methodology
