Part 03 — Collection for the Open-Domain Practitioner

BLUF

The independent analyst’s collection architecture is not a scaled-down version of institutional all-source collection. It is a structurally different design constrained by a single condition: every input must be lawfully obtainable from the open domain. This chapter specifies what that constraint does to collection planning, source-class diversity, language coverage, confidence ceilings, and chain-of-custody discipline. It is procedurally dense by design — collection is the operational layer where independent practitioners most frequently underperform institutional standards, and most of that underperformance is correctable with explicit architecture rather than additional access.

The chapter does not re-derive OSINT doctrine. PAI/CAI distinctions, the “INT of first resort” framing, and DNI OSINT Strategy alignment are covered in the Open-Source Intelligence Manual and are assumed here. This chapter addresses what changes when OSINT is not one INT among several — when it is the entire collection capability.

1. The All-OSINT Collection Architecture

Institutional OSINT doctrine treats open sources as one collection discipline among several. The DNI IC OSINT Strategy 2024–2026 frames OSINT as the “INT of first resort” — the discipline that establishes baseline, context, and tipping/cueing for the closed disciplines. In that frame, OSINT typically supplies the 60–80 percent of a finished product that does not require classified collection, and the closed disciplines fill the residual that does. The independent analyst inherits the doctrine without inheriting the residual collection.

This is not a quantitative shift from “mostly open” to “all open.” It is a qualitative shift in the design of the collection plan.

Assessment. The “80 percent open source” framing is misleading when imported into independent practice. It implies that the independent analyst is doing the same work as an institutional OSINT cell with a 20 percent reduction in available collection. The reality is that the 20 percent the institutional analyst can fall back on — HUMINT for intent, SIGINT for confirmation, GEOINT tasking against denied areas — is precisely the collection that resolves the highest-confidence questions. Removing it does not remove 20 percent of the work; it removes the collection that anchors the high-confidence end of the assessment spectrum. The independent analyst must redesign around the absence rather than pretend it is a marginal gap.

Three structural consequences follow from the all-OSINT constraint.

1.1 Gaps That Cannot Be Substituted

Certain collection requirements have no open-source equivalent. The independent analyst must label these as gaps in the analytical product rather than attempt to paper over them with weaker substitutes.

Closed-collection requirement	Open-source substitute available?	Independent-analyst posture
Internal deliberations of an adversary leadership cell	None	Label as Gap. Inference from observable indicators is legitimate; reconstruction of internal debate is not.
Pre-decisional intent of a state actor	None reliable	Label as Gap. Public signaling can be analyzed as signaling; it cannot be treated as intent.
Source reliability confirmation via parallel-source corroboration in denied channels	None	Label as Gap. Solo source grading must absorb the uncertainty (see Part 04).
Identity confirmation of pseudonymous online actors against name-traced records	Partial via leaks, breach data, court records	Use only where lawful and ethically defensible; label residual uncertainty.
Real-time targeting-grade geolocation of mobile assets	None for live tracking; partial for after-action	Label as Gap for live; assess from historical traces with explicit time-lag caveat.

The discipline is not heroic substitution. It is honest labeling.

1.2 Gaps That Can Be Partially Substituted

Some closed-collection requirements have legitimate open-source approximations. The approximation is never equivalent, but it is not nothing.

SIGINT-adjacent substitution. Some signals-derived information is observable in open data: BGP routing announcements (RIPE, BGPlay, RouteViews), TLS certificate transparency logs (crt.sh, Censys), passive DNS records (DomainTools, SecurityTrails, Validin), and protocol-level scanning (Shodan, Censys, GreyNoise). This is not SIGINT; it is the publicly observable surface of internet infrastructure. It cannot replace intercepted communications. It can sometimes establish that a relationship exists between two infrastructure assets without revealing what passed between them.
GEOINT-adjacent substitution. Commercial and open satellite imagery (Planet Labs, Maxar Open Data Program, ESA Sentinel-2 as free and open data, Airbus SPOT) provides imagery that twenty years ago would have been national-technical-means only. Resolution is lower, revisit cadence is constrained, and tasking is generally not available to individuals, but for after-action analysis of conflict-relevant terrain it is operationally sufficient for many independent products. AIS (vessel tracking) and ADS-B (aircraft tracking) provide near-real-time movement intelligence for cooperative emitters; both have well-documented gaps (military emitters typically off, spoofing common in conflict zones).
MASINT-adjacent substitution. Seismic and acoustic event reporting (CTBTO data subset, USGS seismic, ShakeMap), volcanic and thermal monitoring (NASA FIRMS, MODIS), and atmospheric reporting (NOAA, Copernicus) provide MASINT-class environmental signatures at near-zero cost. These are usable for verification of explosive events, large-scale industrial activity, and environmental indicators of military operations.

Assessment. The substitution is partial. It is also defensible when explicitly framed. A statement of the form “BGP withdrawal of prefix X observed at timestamp Y; correlates with reported outage; SIGINT-class confirmation of root cause not available” is a legitimate analytical claim. A statement of the form “intercepted communications indicate” — when no intercept exists — is fabrication.
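
The framing above can be enforced in working notes rather than left to memory. The sketch below is one way to do that, assuming a hypothetical `HedgedClaim` structure (the class and field names are illustrative, not drawn from any standard). It keeps the observation, the correlation, and the gap as separate fields and refuses to render a claim whose gap statement is empty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HedgedClaim:
    """One analytical claim with observation, correlation, and gap kept separate."""
    observation: str  # what was directly observed in open data
    correlation: str  # what the observation lines up with
    gap: str          # which confirmation is not available

    def render(self) -> str:
        # Refuse to render a claim that hides its gap.
        if not self.gap.strip():
            raise ValueError("gap statement is required; label what is not known")
        return f"{self.observation}; {self.correlation}; {self.gap}"


# Example wording taken from the paragraph above.
claim = HedgedClaim(
    observation="BGP withdrawal of prefix X observed at timestamp Y",
    correlation="correlates with reported outage",
    gap="SIGINT-class confirmation of root cause not available",
)
print(claim.render())
```

The design choice is that the gap is a required field, so an unlabeled gap fails loudly at drafting time instead of surfacing in review.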

1.3 Confidence Ceilings Under the All-OSINT Constraint

The all-OSINT constraint shifts the realistic confidence ceiling depending on what is being assessed. The PHIA Probability Yardstick and the IC’s analytic confidence model (high/moderate/low) remain the appropriate vocabulary; what changes is the distribution of legitimate ceilings across assessment types.

Assessment type	Realistic confidence ceiling (independent OSINT)	Reasoning
Physical capability (units present, equipment fielded, infrastructure built)	High (plausible)	Physical indicators are observable; multi-source corroboration is achievable from imagery, social media, official acknowledgments.
Activity patterns (frequency, periodicity, escalation/de-escalation)	Moderate to high	Pattern claims survive solo verification well when the time-series is long and the source base is diverse.
Attribution of cyber operations	Moderate (rarely high)	Open-source attribution depends heavily on vendor reporting, which the independent analyst cannot independently verify; high-confidence attribution typically requires non-open inputs.
Intent (what a state, group, or actor plans to do)	Low to moderate, rarely higher	Intent assessments without HUMINT are inferential. Public signaling can be coded as signaling; treating it as decisional intent is overreach.
Internal deliberation	Low	Reconstruct only what is publicly stated; flag everything else as Gap.
Identity attribution of pseudonymous actors	Variable — low to high depending on evidence class	Document-level evidence (registration records, breach data, archived self-disclosure) can reach high; behavioral-pattern inference rarely does.

The operational discipline is to calibrate the confidence language to the assessment type, not to the strength of the writer’s belief. The independent analyst’s most common calibration failure is borrowing institutional confidence vocabulary on intent assessments where institutional analysts would have access to closed inputs the independent does not.
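
A minimal sketch of the calibration rule, assuming the table's ranges are collapsed to their upper bound and that the three-level high/moderate/low vocabulary is used. The dictionary keys and the function name `cap_confidence` are hypothetical. Identity attribution is left out because its ceiling depends on evidence class rather than on assessment type alone.

```python
LEVELS = ["low", "moderate", "high"]

# Ceilings transcribed from the table above; each range is reduced to its upper bound.
CEILINGS = {
    "physical_capability": "high",
    "activity_patterns": "high",
    "cyber_attribution": "moderate",
    "intent": "moderate",
    "internal_deliberation": "low",
}


def cap_confidence(assessment_type: str, stated: str) -> tuple[str, bool]:
    """Return (confidence to publish, True if the stated level was lowered)."""
    ceiling = CEILINGS[assessment_type]
    if LEVELS.index(stated) > LEVELS.index(ceiling):
        return ceiling, True
    return stated, False


print(cap_confidence("intent", "high"))
print(cap_confidence("physical_capability", "high"))
```

The boolean in the return value is there so that a lowered confidence level can be logged and reviewed, since a cap that fires often on intent assessments is itself a calibration signal.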

2. Collection Architecture by Domain

The architecture below assumes the practitioner has standing PIRs (see Part 02) and is constructing a stack that services them on Tier 1 / Tier 2 / Tier 3 cadences (see §7). The lists are illustrative of the source classes a competent practitioner should cover, not exhaustive. Tool currency changes constantly; the source-class logic does not.

2.1 Geopolitics and Conflict Analysis

The geopolitics and conflict stack must cover three layers: official state communications, professional military-oriented analysis, and on-the-ground event reporting. Each layer has a distinct reliability profile.

Official state communications (primary, state-aligned)

MFA press releases and statements: every state actor of interest. Bookmark the canonical-language version, not just the EN relay.
MoD releases and parliamentary records: national defense ministries, defense committee transcripts, Hansard equivalents.
Heads-of-state and spokesperson statements: official websites, government social media accounts where verified.

These are primary in the sense that they are the actor’s own words. They are not authoritative — state communications are designed instruments. Label [primary, state] or [state-aligned], never [primary, authoritative], per the standing multi-lingual OSINT doctrine.

Professional military-oriented open sources (secondary, generally reliable)

Institute for the Study of War (ISW) — daily updates on Ukraine, Russia, Iran-aligned activity. Methodology transparent, sourcing footnoted; treat as a strong secondary aggregator with known editorial perspective.
UK MoD Defence Intelligence updates — selective, but useful for the items the UK chooses to release publicly.
IISS (International Institute for Strategic Studies) — The Military Balance annual, Strategic Survey, and analytical blog.
War on the Rocks, Defense One, Breaking Defense — practitioner-oriented commentary; reliability of individual pieces varies with author.
RUSI, IISS Manama Dialogue / Shangri-La output, CNA, RAND, CSIS — institutional analytical products with traceable methodology.

Conflict event monitoring (event-level, multi-source)

ACLED (Armed Conflict Location & Event Data Project) — coded conflict events with location, fatality estimates, actor attribution. The most widely cited academic-grade conflict event database. Free for non-commercial use.
UCDP (Uppsala Conflict Data Program) — longer-baseline conflict data, useful for trend analysis.
Bellingcat Ukraine Monitor / Bellingcat Justice & Accountability — case-level geolocated incident reporting.
GeoConfirmed — volunteer geolocation network; useful as a clearing house, not as an authoritative source.
IDF / Russian MoD / Ukrainian General Staff briefings — primary state-aligned, treat as such.

Geospatial (collection-tier-dependent)

ESA Sentinel-2 (free, 10 m resolution, 5-day revisit) — adequate for medium-resolution monitoring.
NASA Landsat 8/9 (free, 30 m resolution) — long-baseline change detection.
Planet Labs (PlanetScope ~3 m, SkySat ~50 cm) — commercial, accessible to journalists and researchers via the Planet & Journalism program; otherwise paid.
Maxar Open Data Program — high-resolution imagery released for selected disaster and crisis events.
NASA FIRMS — active fire / thermal anomaly detection, near-real-time, useful for industrial-fire and large-explosive-event detection.
Sentinel Hub EO Browser — interactive front end for Sentinel and Landsat data.
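
The resolution figures above translate into a simple arithmetic check before choosing a sensor: the number of pixels an object spans is its length divided by the ground sample distance. The sketch below uses the resolutions quoted in the list; the 60 m object length is a hypothetical example, and whether a given pixel count is enough to identify an object depends on contrast, viewing geometry, and context that this arithmetic does not capture.

```python
# Ground sample distance in metres, as quoted in the list above.
GSD_METRES = {
    "Landsat 8/9": 30.0,
    "Sentinel-2": 10.0,
    "PlanetScope": 3.0,  # approximate
    "SkySat": 0.5,       # approximate
}


def pixels_across(object_length_m: float, sensor: str) -> float:
    """Number of pixels an object of the given length spans for a sensor."""
    return object_length_m / GSD_METRES[sensor]


# Hypothetical 60 m object (for example, a large building or vessel).
for sensor in GSD_METRES:
    print(f"{sensor}: {pixels_across(60.0, sensor):.1f} px")
```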

Financial and economic indicators (context layer)

FRED (Federal Reserve Economic Data) — macroeconomic time series.
World Bank Open Data, IMF Data Mapper — sovereign-level indicators.
UN Comtrade — international trade flows by commodity and partner.
Shipping data: MarineTraffic and VesselFinder (AIS-derived; recognize spoofing and dark-fleet gaps); Equasis for vessel registry, ownership, and safety records.
Flight data: ADS-B Exchange, Flightradar24, OpenSky Network (OpenSky is the research-friendliest, with historical data access).

2.2 Cyber Threat Intelligence

The independent CTI practitioner operates against a vendor-dominated ecosystem where most high-quality intelligence sits behind paid platforms. The free and low-cost stack below is operationally sufficient for technical and infrastructure-oriented work; strategic attribution remains harder.

Threat feeds (free or freemium)

AlienVault OTX — community pulses, free with registration; quality varies by submitter.
Abuse.ch — Feodo Tracker, URLhaus, MalwareBazaar, ThreatFox, SSLBL. The single most useful free CTI suite.
CIRCL MISP feeds — selected feeds from the Computer Incident Response Center Luxembourg.
Spamhaus DROP/EDROP lists — networks not to talk to.

Malware repositories and sandboxes

VirusTotal — free tier is rate-limited and read-only at depth; private API and Enterprise tier are paid. Note: anything uploaded becomes searchable by other VT users, which has OPSEC consequences for incident-response work.
MalwareBazaar (abuse.ch) — free, supports hash-based and YARA-rule retrieval.
Hybrid Analysis (CrowdStrike Falcon Sandbox public instance) — free with registration.
Any.Run — interactive sandbox with a free community tier.
Triage (Hatching) — paid, but with a tier accessible to small consultancies.

Infrastructure tracking

Shodan — free tier severely limited (about 50 results per query, no exports); the paid Membership/Small Business tier (roughly $49/month, verified June 2026) is the practical entry point for independents.
Censys — free Community tier with reasonable depth; paid for full historical access.
GreyNoise — free Community tier for individual IP context; paid Enterprise for bulk and historical.
FOFA, ZoomEye, Hunter, Quake — Chinese-origin equivalents to Shodan/Censys; useful for source diversity, with the standard cautions about state-aligned data provenance.
Validin, SecurityTrails — passive DNS and infrastructure history.
crt.sh, Censys Certificates — certificate transparency search.
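
As an illustration of certificate transparency search, the sketch below builds a crt.sh query URL and extracts unique hostnames from a JSON response. It assumes crt.sh's `output=json` parameter and the `name_value` field (newline-separated names per certificate) behave as commonly documented; verify both against a live response before relying on them. The sample payload is fabricated for the example and no network call is made.

```python
import json
from urllib.parse import urlencode


def crtsh_query_url(domain: str) -> str:
    """Build a crt.sh search URL; the %. prefix asks for subdomains."""
    return "https://crt.sh/?" + urlencode({"q": f"%.{domain}", "output": "json"})


def hostnames_from_crtsh(payload: str) -> list[str]:
    """Extract unique, lower-cased hostnames from a crt.sh JSON response."""
    names = set()
    for entry in json.loads(payload):
        for name in entry.get("name_value", "").splitlines():
            # Wildcard entries are folded into the parent name.
            names.add(name.strip().lower().removeprefix("*."))
    names.discard("")
    return sorted(names)


# Fabricated sample shaped like a crt.sh response.
SAMPLE = json.dumps([
    {"name_value": "mail.example.com\nwww.example.com"},
    {"name_value": "*.Example.com"},
])
print(crtsh_query_url("example.com"))
print(hostnames_from_crtsh(SAMPLE))
```

A hostname appearing in a certificate shows that a certificate was issued for it, not that the host is live or controlled by the same operator; that inference needs separate support.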

Threat reporting (vendor and government)

Vendor APT reports: Mandiant (now Google), CrowdStrike, ESET, Kaspersky, Trend Micro, Recorded Future, Microsoft Threat Intelligence, Cisco Talos.
Read Kaspersky with state-alignment caveat; treat all vendor reporting as having commercial incentive to attribute and name.
CISA advisories, Joint Cybersecurity Advisories (Five Eyes), ENISA threat landscape reports.
NCSC (UK), BSI (Germany), ANSSI (France), CCCS (Canada), ACSC (Australia) — national CSIRT advisories.

Vulnerability and CVE

NVD (NIST National Vulnerability Database) — canonical CVE source, with the well-known scoring-lag problem.
MITRE CVE List — upstream of NVD.
CISA Known Exploited Vulnerabilities (KEV) catalog — operationally prioritized.
VulnCheck, Vulners — commercial enrichment layers.
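
One way to apply KEV as the prioritization layer is to flag which CVEs from a working list appear in the catalog and sort those first. The sketch assumes the catalog's JSON has a top-level `vulnerabilities` array whose entries carry a `cveID` field; confirm the field names against the current feed. The CVE identifiers below are placeholders, not real entries.

```python
def prioritize(cves: list[str], kev_catalog: dict) -> list[tuple[str, bool]]:
    """Return (cve, in_kev) pairs with known-exploited entries first."""
    kev_ids = {v["cveID"] for v in kev_catalog.get("vulnerabilities", [])}
    flagged = [(cve, cve in kev_ids) for cve in cves]
    # sorted() is stable, so the input order is preserved within each group.
    return sorted(flagged, key=lambda pair: not pair[1])


# Placeholder catalog and CVE identifiers for illustration only.
kev_sample = {"vulnerabilities": [{"cveID": "CVE-0000-0002"}]}
worklist = ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"]
print(prioritize(worklist, kev_sample))
```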

Community channels

The X/Twitter CTI community remains the fastest open dissemination layer for emerging incidents, despite platform degradation. Curated lists (#CTI, threatintel, DFIR) — see archival caveats below.
Mastodon infosec instances (infosec.exchange) — significant post-2023 migration of CTI practitioners; better long-term archive properties than X.
VirusTotal-shared YARA hunts, MISP community circles, Slack/Discord domain-specific channels (FIRST, Curated Intelligence).

2.3 Financial Crime and Corporate Due Diligence

The financial-crime stack is the most jurisdictionally fragmented of the four domain stacks. Source quality varies dramatically by country, and the independent practitioner must learn each jurisdiction’s registry idiosyncrasies as a discrete skill.

Corporate registries

Companies House (UK) — free, comprehensive, downloadable. The gold standard for free corporate data.
SEC EDGAR (US) — mandatory disclosures for US public companies; XBRL structured data available.
OpenCorporates — aggregates 200+ jurisdictions; free for non-commercial use up to limits.
National registries: extremely variable. Germany’s Unternehmensregister, France’s Pappers/Infogreffe, Brazil’s Receita Federal CNPJ lookup, Singapore’s ACRA — each with its own access model.

Beneficial ownership

OpenOwnership — aggregated beneficial ownership data; coverage uneven.
ICIJ Offshore Leaks Database — Panama, Paradise, Pandora, Bahamas, Offshore Leaks combined; primary public access to leaked offshore registries.
UK Persons with Significant Control (PSC) register (via Companies House) — free, but entries require verification.
EU beneficial ownership registers — access conditions have varied dramatically since the CJEU Sovim ruling (2022); some jurisdictions have re-restricted public access.

Sanctions and PEP data

OpenSanctions — aggregated free database covering 180+ sources including OFAC SDN, EU Consolidated, UK OFSI, UN Security Council, regional and sectoral lists. The best free starting point.
OFAC SDN, OFAC Consolidated — US Treasury primary.
EU Consolidated Financial Sanctions List, UK OFSI Consolidated List, Swiss SECO list, Canadian SEMA — national/regional primary.
World-Check, Dow Jones Risk Center, LexisNexis Bridger — commercial PEP/sanctions screening; price points generally above independent-practitioner ranges, but accessible via consortium or short-term subscription.

Court records

PACER (US federal) — fee-based ($0.10/page, capped); the primary source for US federal civil and criminal records.
RECAP Archive (CourtListener / Free Law Project) — free public mirror of PACER documents already paid for by other users; first stop before paying PACER.
State court systems (US) — extremely variable; some free, some paywalled, some not online at all.
UK Courts and Tribunals Judiciary, BAILII — UK case law.
National court systems for non-Anglosphere jurisdictions — often the largest collection gap in financial-crime work.

Land, property, and asset registries

HM Land Registry (UK) — title-level searches available for nominal fees.
US county recorder offices — fragmented by county; some online, some not; many available via aggregators like NETROnline.
Cadastral and beneficial-property registries in EU jurisdictions — variable, with GDPR-influenced access restrictions.
Aircraft (FAA registry, EASA, ICAO), vessels (IMO, Equasis) — useful for high-value movable assets.

Procurement and contracting

USAspending.gov — US federal contracts and counterparties; powerful when used in conjunction with FPDS.
TED (Tenders Electronic Daily) — EU procurement.
World Bank Contracts, ADB, IDB, AfDB procurement portals — multilateral lender contracts.
National procurement portals: variable.

Investigative journalism aggregations

OCCRP Aleph — aggregated documents and leaks from investigative consortia; the single most useful free database for oligarch, kleptocracy, and offshore research. Free with registration.
ICIJ — Panama Papers, Pandora Papers, FinCEN Files, etc.
DataLeaks (various), Distributed Denial of Secrets (DDoSecrets) — leaked-data archives with ethics and legality conditions that vary by item; consult Part 10 before relying on leaked material.

2.4 Fact-Checking and Verification

The fact-checking stack overlaps heavily with the others but adds verification-specific tooling. The defining discipline is archive before collection — claims are deleted or edited faster than any other class of OSINT input.

Pre-collection archiving

archive.today (archive.ph, archive.is) — manual capture, handles JavaScript-heavy pages well.
Wayback Machine — automatic capture; submit URL via “Save Page Now” before any other action.
ArchiveBox — self-hosted archival; useful for long-term personal repositories.
yt-dlp — video and audio capture from most platforms.
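
The archive-first step can be scripted as URL construction so that the capture request is the first thing issued for any target. The sketch below builds a Wayback Machine "Save Page Now" URL and a Wayback availability lookup URL. It assumes the `web.archive.org/save/` path and the `archive.org/wayback/available` endpoint accept these forms; the target URL is a placeholder, and no request is actually sent.

```python
from urllib.parse import urlencode


def wayback_save_url(target: str) -> str:
    """URL that asks the Wayback Machine to capture the target page."""
    return "https://web.archive.org/save/" + target


def wayback_availability_url(target: str, timestamp: str = "") -> str:
    """URL that checks whether a capture exists, optionally near a timestamp."""
    params = {"url": target}
    if timestamp:
        params["timestamp"] = timestamp  # format: YYYYMMDD or YYYYMMDDhhmmss
    return "https://archive.org/wayback/available?" + urlencode(params)


target = "https://example.com/post/1"  # placeholder
print(wayback_save_url(target))
print(wayback_availability_url(target, "20240101"))
```

Checking availability after saving is worth the extra call: a save request that silently fails leaves the analyst believing a capture exists when it does not.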

Reverse image and video search

Google Reverse Image Search, Yandex Reverse Image — Yandex is consistently the strongest for faces and locations.
TinEye — exact-match focus; useful for finding earliest occurrence of an image.
RevEye browser extension — multi-engine reverse-image quick-search.
InVID / WeVerify browser extension — frame extraction, metadata, reverse-search across multiple engines; the standard verification toolkit.
PimEyes — face-search; capability versus ethics debate is unresolved (see Part 10).

Geolocation and chronolocation

SunCalc — solar position by time and place; shadow-angle verification.
Google Earth Pro (desktop) — historical imagery layer, free.
Mapillary, KartaView (formerly OpenStreetCam) — community street-level imagery.
Sentinel Hub EO Browser, Google Earth Engine — multi-spectral remote-sensing access.
Wikimapia, OpenStreetMap — collaborative mapping with depth in non-Western coverage.
Acoustic geolocation: when the audio is the evidence, mortar-fire timing and echo analysis are domain-specific specialties; see Geolocation Methodology.
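
Shadow-angle verification reduces to one trigonometric relation: for a vertical object on flat ground, the sun's elevation is the arctangent of object height over shadow length. The sketch below computes that angle from measurements taken off an image and compares it with an expected elevation read from a tool such as SunCalc for the claimed time and place. The 3-degree tolerance is a hypothetical default, and the flat-ground and vertical-object assumptions rarely hold exactly in real footage.

```python
import math


def sun_elevation_from_shadow(object_height_m: float, shadow_length_m: float) -> float:
    """Sun elevation in degrees implied by a vertical object's shadow on flat ground."""
    return math.degrees(math.atan2(object_height_m, shadow_length_m))


def is_consistent(measured_deg: float, expected_deg: float, tolerance_deg: float = 3.0) -> bool:
    """True if the measured elevation is within tolerance of the expected value."""
    return abs(measured_deg - expected_deg) <= tolerance_deg


# Hypothetical measurement: a 10 m pole casting a 10 m shadow.
measured = sun_elevation_from_shadow(10.0, 10.0)
print(round(measured, 1))
print(is_consistent(measured, expected_deg=47.0))
```

A consistent result only means the claimed time is not ruled out; the same elevation occurs twice a day and on many dates, so shadow direction and other indicators are still needed.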

Social media collection and archival

snscrape — community-maintained scraper for X, Reddit, Telegram, Facebook (results vary; largely broken against X since the 2023 API and anti-scraping changes).
Instaloader (Instagram), gallery-dl (multi-platform) — image and metadata extraction.
CrowdTangle replacement: Meta Content Library — access restricted to qualified researchers; significant operational regression versus prior CrowdTangle access.
4chan/8kun archives (for example, 4plebs at archive.4plebs.org for 4chan) — for extremism and meme-pipeline research.
Telegram archival: TGStat, Telemetr, Telegram Analytics for channel metadata; manual or scripted dumps for content (see §4).

Verification community and standards

First Draft Network legacy resources (the organization closed in 2022; archived resources remain). Its mission continued at the Information Futures Lab (Brown University School of Public Health), which absorbed First Draft’s staff and programs; Bellingcat (below) is the strongest live successor for hands-on verification training.
IFCN (International Fact-Checking Network) — code of principles, signatories list.
EFCSN (European Fact-Checking Standards Network) — EU-aligned standards.
AFP Fact Check, Snopes, PolitiFact — methodology archives are reference material even when the specific check is not relevant.
Bellingcat investigation methodology archive — case-level worked examples remain the strongest free training material for independent practitioners.

3. Multi-Lingual Collection as Structural Imperative

The most persistent quality failure in independent English-language geopolitical analysis is insufficient non-English source coverage. This is not stylistic. It is structural — analyses built exclusively on EN sources for non-English-speaking actors systematically miss what those actors actually say.

The standing multi-lingual sourcing doctrine (see OSINT) is: for any investigation or assessment involving a principal state actor whose primary communications language is not English, native-language primary sources are mandatory. The doctrine reflects three operational realities.

Translation degradation is systematic, not random. English-language relays of state communications from PRC, Russia, Iran, North Korea, and other non-English principals are routinely softened, contextualized, or shortened. The softening is sometimes editorial (the relayer assesses what the audience needs), sometimes diplomatic (the relayer assesses what the audience should hear), and sometimes structural (compression discards modifiers that carry the analytical signal). The effect is consistent: the EN version is a lower-fidelity copy.

Framing divergence is itself analytical signal. When the CN-language version of a Foreign Ministry briefing uses harsher framing toward the United States than the EN-language version, the divergence is information about the intended audience — the harsher framing is for domestic and aligned consumption, the softer version is for international audiences. This is observable Cognitive Warfare practice. An analyst working from the EN version cannot see it, and therefore cannot incorporate it.

Source identity is language-specific. Some sources do not have meaningful EN versions. Russian independent investigative outlets (iStories, Meduza, The Insider Russia), Iranian opposition outlets in FA, Chinese-language financial media (Caixin in CN versus EN versions where coverage diverges) — the EN side is a different and smaller publication than the canonical one.

3.1 Language Tiers (Standing Doctrine)

The mapping below is the operational summary; the full mapping with labeling rules is maintained in PIA standing doctrine.

Actor	Primary language	Native-language primary sources (minimum)	EN relays (acceptable as secondary, not as substitute)
PRC	ZH	fmprc.gov.cn, Xinhua (CN), People’s Daily (CN), Global Times (CN), MOFCOM, PLA Daily (CN)	Xinhua EN, Global Times EN, CGTN
Russia	RU	kremlin.ru, mid.ru, TASS RU, RIA Novosti RU; Meduza, iStories, The Insider for independent	TASS EN, RT (state-aligned), Reuters/AP coverage
Iran	FA + AR for proxy statements	IRNA FA, Mehr News FA, MFA Iran FA; AR-language proxy channels (Hezbollah, Houthi statements)	Press TV (state-aligned), Tehran Times
Ukraine	UK	president.gov.ua UK, SSU UK, Ukrinform UK, Ministry of Defence UA	Ukrinform EN, Kyiv Independent
Israel	HE (military), EN (diplomatic)	IDF Hebrew Telegram, MFA HE	IDF Spokesperson EN, MFA EN
Turkey	TR	Anadolu TR, MFA TR, Sabah	Anadolu EN, Daily Sabah EN, Hurriyet Daily News
France	FR	elysee.fr, diplomatie.gouv.fr, Le Monde, Le Figaro	France 24, RFI
Germany	DE	bundesregierung.de, auswaertiges-amt.de, FAZ, Süddeutsche	DW

The discipline is: for any actor in the table, treat the EN relay as a secondary source only. A finished product whose principal evidence is the EN relay of a primary-language statement should explicitly note “native-language original not consulted” — and that note should be a flag for the analyst that the product is methodologically incomplete.
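
The rule can be applied mechanically at drafting time. The sketch below, with a hypothetical function name `methodology_flags`, maps each actor to its primary language from the table and returns the required note when that language is absent from the set of languages actually consulted. Israel is omitted because the table splits it between HE and EN by function.

```python
# Primary languages transcribed from the table above (Israel omitted: split HE/EN).
PRIMARY_LANGUAGE = {
    "PRC": "ZH",
    "Russia": "RU",
    "Iran": "FA",
    "Ukraine": "UK",
    "Turkey": "TR",
    "France": "FR",
    "Germany": "DE",
}

NOTE = "native-language original not consulted"


def methodology_flags(actor: str, consulted_languages: set[str]) -> list[str]:
    """Return the notes a product must carry given the languages consulted."""
    primary = PRIMARY_LANGUAGE[actor]
    if primary in consulted_languages:
        return []
    return [NOTE]


print(methodology_flags("Russia", {"EN"}))
print(methodology_flags("Russia", {"EN", "RU"}))
```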

3.2 Labeling Rules

The labeling discipline matters as much as the collection. State communications from PRC, Russia, Iran, and other actors with state-directed media ecosystems must carry an explicit label.

[primary, state] — direct from the state actor (MFA statement, presidential press service).
[state-aligned] — outlet structurally aligned with the state (Xinhua, RT, Press TV, Sputnik, Global Times).
[independent, exile] or [independent, in-country] — for outlets like Meduza (exile) or iStories (mixed).
Never [primary, authoritative] for state communications from these actors. Authority is a separate dimension and state communications from non-democratic states are not authoritative on the substance of their own claims; they are authoritative only as evidence of what the state chose to say.
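
A minimal sketch of the labeling rules as a validator, assuming the four labels listed above are the complete allowed set for these actors. The function name `format_label` is hypothetical. It normalizes spacing and case, rejects the prohibited label explicitly, and rejects anything else outside the allowed set.

```python
ALLOWED_LABELS = {
    "primary, state",
    "state-aligned",
    "independent, exile",
    "independent, in-country",
}


def format_label(label: str) -> str:
    """Normalize a source label and return it in bracketed form, or raise."""
    normalized = ", ".join(part.strip().lower() for part in label.split(","))
    if normalized == "primary, authoritative":
        raise ValueError("state communications are never labeled authoritative")
    if normalized not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return f"[{normalized}]"


print(format_label("Primary,  State"))
print(format_label("state-aligned"))
```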

3.3 Practical Toolchain for Non-Proficient Analysts

The independent analyst typically does not read all of the tier-1 languages. The practical toolchain below is sufficient for gisting and triage, with explicit caveats about its limits.

DeepL — strongest free machine translation for European languages including RU; competent on written text, weaker on transcribed speech.
Google Translate — adequate for gisting in ZH, FA, AR, KO; weaker than DeepL on EU languages.
GPT-4-class LLMs (Claude, GPT-4, Gemini) for translation — strong on context-aware translation, including ZH and AR; weaker on idiom and on detecting deliberate ambiguity in diplomatic text.
Native-speaking collaborator review — the only fully reliable verification layer. For any finished product where the framing-divergence claim is load-bearing, a native-speaker check on the translation should be sought.

Caveat on labeling. If the analyst relies on machine translation for a finished product, the analyst should explicitly note “machine-translated; native-speaker verification not performed” in the methodology footnote. This is a defensible disclosure; presenting a machine-translation gist as a verified rendering is not.

4. Social Media Collection

Social media has functionally replaced broadcast monitoring as the primary real-time collection environment for conflict, political, and crisis analysis. The signal-to-noise problem is severe, the platforms are unstable, and most pre-2022 collection tooling has degraded under platform changes. The discipline below is what survives the volatility.

4.1 Platform Architecture for Collection

Different platforms dominate different operational environments. The independent analyst should know which platform is the canonical collection target for each PIR.

X / Twitter — primary real-time open environment for conflict monitoring, official statements, vendor CTI announcements, and analyst-community discussion. Heavily degraded since 2022 in archival reliability, search functionality, and account-level visibility, but still operationally dominant for live events.
Telegram — critical for Russian, Ukrainian, Middle Eastern, and Central Asian reporting. Channel-based architecture is fundamentally different from feed-based platforms and requires different collection tooling.
Facebook — primary for Sub-Saharan African conflict and local-community information in many regions; degraded access via Meta Content Library replacing CrowdTangle.
Instagram — visual primary source for some conflict environments; metadata stripping makes verification harder.
TikTok — emergent primary source for fast-moving narrative diffusion; collection tooling is immature.
YouTube — long-form video evidence, channel-level archival, historical depth.
Reddit — domain-specific intelligence (r/UkraineWarReports, r/CredibleDefense, regional and topic subs); useful as a clearing house for community-aggregated reporting.
VKontakte, Odnoklassniki — Russian-language equivalents; primary for some Russian state-aligned and military-blogger content.
WeChat — largely closed to non-account-holders; the most significant collection gap in PRC-focused work.

4.2 Telegram Monitoring Methodology

Telegram is structurally different from feed-based platforms and warrants its own methodology section because it is where most independent practitioners underperform.

Channel identification. The first task is finding the channels that matter. Tools:

TGStat (tgstat.com / tgstat.ru) — channel directory, statistics, subscriber growth, post engagement.
Telemetr (telemetr.io) — channel analytics, paid tiers offer historical data.
Telegram Analytics (tgstat.ru’s analytics interface) — same source family.
Community-curated lists: monitoring networks like Bellingcat, GeoConfirmed, and academic researchers maintain channel rosters for specific conflicts.

Channel clustering by affiliation. Once channels are identified, cluster them by likely affiliation (state-aligned, milblogger, independent, opposition, civilian). Affiliation drives source-grading; identical event reporting from a state-aligned channel and an independent channel is two source-classes, not two sources.

Both-sides monitoring. For conflict reporting, both-sides discipline is non-negotiable. Russian milbloggers and Ukrainian channels both report events; the framing divergence is itself analytical signal. The discipline parallels the EN/native-language divergence discussed in §3.

Origination versus amplification. The most common Telegram collection error is treating an amplifier (a high-reach channel that re-posts content from smaller channels) as the origin. The originating channel — typically a smaller, lower-reach account close to the event — is the source. The amplifier is dissemination, not collection. Discipline: when archiving a Telegram post, trace it backward to its earliest appearance before recording the source.
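
The backward trace can be sketched as a search for the earliest timestamp among all collected copies of the same content. The `Post` structure, the `content_key` field (an analyst-assigned identifier grouping copies of one item), and the channel names are all hypothetical. The result is the earliest copy in the analyst's own collection, which is a candidate origin and not proof of origin.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Post:
    channel: str
    posted_at: datetime
    content_key: str  # analyst-assigned key shared by copies of the same content


def earliest_appearance(posts: list[Post], content_key: str) -> Post:
    """Return the earliest collected post carrying the given content."""
    matches = [p for p in posts if p.content_key == content_key]
    if not matches:
        raise LookupError(f"no collected posts for {content_key!r}")
    return min(matches, key=lambda p: p.posted_at)


# Hypothetical collection: an amplifier re-posting a smaller channel's clip.
posts = [
    Post("big_amplifier", datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc), "clip-01"),
    Post("small_local", datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), "clip-01"),
    Post("unrelated", datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc), "clip-02"),
]
print(earliest_appearance(posts, "clip-01").channel)
```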

Archival. Telegram does not provide good native archival. Practical tooling:

Manual: screenshot + URL + timestamp + channel handle, hashed.
Scripted: telegram-export (community tool), tdlib-based custom scripts, Telethon library for Python.
Hybrid: TGStat post-level URLs are linkable and persistent for posts in indexed channels.
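
The manual record above (screenshot, URL, timestamp, channel handle, hashed) can be sketched with the standard library. The field names and the function names `custody_record` and `verify_record` are hypothetical. The screenshot bytes are hashed on their own, then the whole record is hashed over a canonical JSON serialization so that any later edit to a field is detectable.

```python
import hashlib
import json


def _digest(fields: dict) -> str:
    # Canonical form: sorted keys and fixed separators give a stable hash.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def custody_record(screenshot: bytes, url: str, captured_at_utc: str, channel: str) -> dict:
    """Build a capture record with a hash of the screenshot and of the record."""
    record = {
        "url": url,
        "captured_at_utc": captured_at_utc,
        "channel_handle": channel,
        "screenshot_sha256": hashlib.sha256(screenshot).hexdigest(),
    }
    record["record_sha256"] = _digest(record)
    return record


def verify_record(record: dict) -> bool:
    """Recompute the record hash and compare it with the stored value."""
    body = {k: v for k, v in record.items() if k != "record_sha256"}
    return _digest(body) == record.get("record_sha256")


# Placeholder values; real use would pass the screenshot file's bytes.
record = custody_record(b"example image bytes", "https://t.me/example_channel/123",
                        "2024-01-01T12:05:00Z", "@example_channel")
print(verify_record(record))
```

A self-computed hash shows the record has not changed since capture; it does not prove when the capture was made, which is why an independent third-party archive remains the stronger anchor.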

4.3 Account Credibility Assessment

Whether on X, Telegram, or any other social platform, account credibility is a recurring assessment problem. The dimensions that survive solo grading:

Dimension	What to check	What it tells you
Longevity	Account creation date, posting consistency over time	Recently created or recently re-activated accounts in crisis windows are inversion-attack risk indicators.
Posting consistency	Frequency, topical range, language, time-of-day patterns	Sudden change in posting pattern is a flag for compromise, account-hand-off, or coordinated behavior.
Geographic anchoring	Stated location, time zone of activity, language register	Mismatch between claimed location and operational signature is a sock-puppet indicator.
Network	Who follows whom; who the account engages with; who engages back	Real practitioners have real networks; fabricated personas have shallow or astroturfed networks.
Cross-reference	Does the account appear in established monitoring networks’ rosters; has it been previously cited; what is its track record	The longest-running signal of credibility.
Verification artifacts	Cross-platform identity confirmation, real-world appearance (interviews, conference presentations), institutional affiliation	Strong; not always available.

The grading is solo, which means it is fallible; the discipline is to grade transparently and revise on new evidence rather than to over-claim.
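One way to keep solo grading transparent and revisable is to store each dimension as a written observation with a timestamped history, rather than as a single score. The sketch below assumes that approach; the class and field names are hypothetical, and it deliberately assigns no numeric weights because the table above does not define any.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The six dimensions from the table above, as identifiers.
DIMENSIONS = (
    "longevity",
    "posting_consistency",
    "geographic_anchoring",
    "network",
    "cross_reference",
    "verification_artifacts",
)


@dataclass
class CredibilityRecord:
    account: str
    findings: dict[str, str] = field(default_factory=dict)  # dimension -> latest observation
    history: list[tuple[str, str, str]] = field(default_factory=list)  # (utc, dimension, observation)

    def record(self, dimension: str, observation: str) -> None:
        """Store an observation and keep the earlier ones in the history."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.findings[dimension] = observation
        self.history.append((stamp, dimension, observation))

    def unassessed(self) -> list[str]:
        """Dimensions that have no observation yet."""
        return [d for d in DIMENSIONS if d not in self.findings]
```

Revising a finding overwrites the current observation but leaves the earlier one in `history`, so the record shows what was believed and when.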

4.4 X API Access for Independent Analysts (2026 Status)

The collapse of free programmatic access to X/Twitter post-2023 reshaped the independent collection environment. The current state:

Free tier: write-only. No programmatic read access. Functionally useless for collection.
Basic (launched at $100/month in 2023 and raised to $200/month in late 2024; confirm current pricing before budgeting): read access at low rate limits; the practical entry point for individuals who must have programmatic access.
Pro ($5,000/month): professional tier; out of independent-practitioner range for most.
Academic Research track: essentially closed since 2023.

Practical alternatives:

snscrape — community-maintained Python scraper; active development paused in 2023 and it is now largely broken against X following the 2023 API/anti-scraping lockdown (some users sustain partial function by pinning older versions, but it is unreliable for production). Still usable against some other platforms it supports.
nitter instances — fragmented; most public instances have been deprecated or rate-limited.
Pushshift — Reddit historical archive (not an X archive); Reddit revoked Pushshift's API access in May 2023, and access is now restricted to approved Reddit moderators for moderation use cases only.
Manual collection — for low-volume work, manual capture remains viable and is sometimes the only option.
Aggregator subscriptions — services like Brandwatch, Sprinklr, Talkwalker remain accessible at enterprise prices; consortium access via journalism organizations is sometimes available.

The structural assessment is that X collection by independents will continue to degrade unless the API pricing model changes. Practitioners should not build workflows that assume restored free programmatic access. Plan for manual or low-volume scripted collection, and shift archival-heavy workflows to platforms with healthier APIs (Mastodon, Bluesky) where the analytical target permits.

Social media content is the most volatile class of OSINT input. The discipline: archive at the moment of collection, not retroactively.

Wayback Machine — submit URL via “Save Page Now”; works for X tweet URLs and most public posts.
archive.today — handles JavaScript-heavy platforms better than Wayback in many cases; manual but reliable.
HTTrack — full-site capture; useful for entire blog or small-site archival.
yt-dlp — video, including from Twitter, TikTok, YouTube, Telegram, Facebook.
Screenshot + content + metadata — at minimum, capture (1) full-page screenshot, (2) raw HTML or text, (3) URL, (4) UTC timestamp of capture, (5) SHA-256 hash of archived artifact.

§6 formalizes this discipline.

5. Gray Literature, Regulatory Filings, and Court Records

Some of the most analytically significant open-source material is not indexed by standard web search. The independent practitioner who works only from search-engine surfaces will systematically miss it.

5.1 Gray Literature

“Gray literature” — think-tank reports, policy briefs, working papers, dissertations, conference proceedings, technical reports — sits outside both commercial publication and standard academic indexing.

Google Scholar — broad academic search; coverage uneven; useful starting point.
BASE (Bielefeld Academic Search Engine) — broader than Google Scholar for European and gray sources; over 300 million records.
CORE — open-access aggregator; over 200 million records.
SSRN — working papers; particularly strong in finance, economics, law.
Academia.edu, ResearchGate — researcher-shared papers; coverage uneven and platform-incentivized; pre-prints often available here before formal publication.
Think-tank archives — RAND, CSIS, IISS, RUSI, Chatham House, Brookings, CFR, ECFR, CEPS, CARI, SWP. Most publish substantial work openly; methodology and bias profiles vary.
ProQuest Dissertations & Theses, OATD (Open Access Theses and Dissertations) — doctoral dissertations are an underused source for region-specialist and methodology-specialist work.
Government technical reports — DTIC (US Defense Technical Information Center), HMG publications, USGS, NIST.
Conference proceedings — IEEE Xplore, ACM Digital Library, USENIX Security, ACM CCS, IEEE S&P, Black Hat archives, DEF CON Media Server.

5.2 Regulatory Filings

Mandatory disclosure regimes generate large volumes of structured, primary-source data. Most are free.

SEC EDGAR (US) — public-company filings: 10-K, 10-Q, 8-K, 13F (institutional holdings), 13D/13G (significant ownership), DEF 14A (proxy statements). XBRL-structured data is available for programmatic ingestion.
FCA register (UK) — financial services firms and individuals.
AMF (France), BaFin (Germany), CONSOB (Italy), CNMV (Spain) — EU equivalents.
CVM (Brazil) — primary source for Brazilian public companies.
CSRC (China) — PRC equivalent; access and language constraints apply.
Energy and resources: EIA (US), national oil-and-gas regulators, mining-claim registries.
Trade and customs: ITC DataWeb, UN Comtrade, Panjiva and ImportGenius (commercial enrichment of bills of lading), shipping data.
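For the EDGAR entry above, programmatic ingestion typically starts from a company's CIK number. The sketch below only builds request URLs and headers and makes no network call. The `data.sec.gov` paths reflect the SEC's published JSON API layout at the time of writing and should be verified against current SEC documentation; the function names are hypothetical.

```python
EDGAR_DATA = "https://data.sec.gov"


def normalize_cik(cik) -> str:
    """Zero-pad a CIK to the 10-digit form used in EDGAR data URLs."""
    digits = str(cik).strip()
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


def submissions_url(cik) -> str:
    """URL of the filing-history JSON for one filer."""
    return f"{EDGAR_DATA}/submissions/CIK{normalize_cik(cik)}.json"


def company_facts_url(cik) -> str:
    """URL of the XBRL company-facts JSON for one filer."""
    return f"{EDGAR_DATA}/api/xbrl/companyfacts/CIK{normalize_cik(cik)}.json"


def request_headers(contact: str) -> dict[str, str]:
    # SEC fair-access guidance asks automated clients to identify themselves
    # in the User-Agent header; the contact string is supplied by the analyst.
    return {"User-Agent": contact, "Accept-Encoding": "gzip, deflate"}
```

For example, `submissions_url(320193)` yields `https://data.sec.gov/submissions/CIK0000320193.json`. Archive and hash any response before parsing it, as with any other source.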

5.3 Lobbying and Influence Disclosures

FARA database (US DOJ) — Foreign Agents Registration Act filings. Primary source for state-sponsored influence operations and lobbying on behalf of foreign principals. Public, free, searchable.
LDA LD-2 reports (US House and Senate) — domestic lobbying disclosure; quarterly filings with subject, client, lobbyist.
EU Transparency Register — voluntary register of EU lobbying activity; less complete than US equivalents but improving.
UK Lobbying Register — narrower scope than US FARA; covers consultant lobbyists only.
OpenSecrets (formerly the Center for Responsive Politics) — US campaign finance and lobbying aggregator; enrichment layer over primary filings.

5.4 Procurement and Government Contracting

USASpending.gov — primary source for US federal contracts, grants, and direct payments; includes counterparty identification and value.
FPDS (Federal Procurement Data System) — upstream of USASpending; structured contract-level data.
TED (Tenders Electronic Daily) — EU procurement, including high-value defense and infrastructure tenders.
World Bank Contracts, IDB Procurement, ADB Procurement, AfDB Procurement — multilateral-lender contract data, including contractor names and values.
National procurement portals — extremely variable; Brazil’s ComprasNet, India’s GeM, Ukraine’s ProZorro are useful examples of well-structured national portals.

5.5 Court Records as Intelligence

Civil litigation is one of the highest-yield open-source domains and is consistently underused by independent practitioners. Discovery filings, deposition transcripts, exhibit lists, and case dockets routinely reveal corporate structure, internal communications, financial relationships, and operational detail that no other open source provides.

PACER (US federal) — primary source for US federal civil and criminal cases; fee-based at $0.10/page (capped at $3 per document; fees are waived for any quarter in which accrued charges total $30 or less).
CourtListener / RECAP Archive (Free Law Project) — free mirror of PACER documents already paid for by other users. First stop. Many landmark cases are fully available here without PACER charges.
PacerMonitor, Bloomberg Law, Westlaw, Lexis — commercial enrichment layers.
State court systems (US) — fragmented; many states have online dockets but full document access varies; specialist trackers like Trellis exist for some jurisdictions.
UK Courts and Tribunals Judiciary, BAILII — UK case law (judgments, not full filings).
Court records in non-Anglosphere jurisdictions — variable; for some countries the largest single financial-crime collection gap.
International tribunals — ICC, ICTY (archive), ICTR (archive), ICJ, ITLOS, regional human-rights courts; primary source for transnational accountability work.
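The PACER fee figures quoted above lend themselves to a quick budget check before a document pull. The sketch below works in integer cents and assumes the quarterly figure operates as a waiver threshold (nothing is billed if accrued charges are $30 or less, the full amount is billed otherwise). It also assumes the per-document cap applies to every item, which is not true of all PACER content types, so treat the result as an estimate.

```python
PER_PAGE_CENTS = 10            # $0.10 per page
DOCUMENT_CAP_CENTS = 300       # $3.00 cap per document
QUARTERLY_WAIVER_CENTS = 3000  # $30.00 threshold (assumed to act as a waiver)


def document_fee_cents(pages: int) -> int:
    """Fee for one document, applying the per-document cap."""
    if pages < 0:
        raise ValueError("page count cannot be negative")
    return min(pages * PER_PAGE_CENTS, DOCUMENT_CAP_CENTS)


def quarterly_bill_cents(page_counts: list[int]) -> int:
    """Estimated amount billed for a quarter, given the page count of each document pulled."""
    accrued = sum(document_fee_cents(p) for p in page_counts)
    return 0 if accrued <= QUARTERLY_WAIVER_CENTS else accrued
```

A 12-page order and a 200-page exhibit accrue $1.20 and $3.00; together they fall under the threshold and bill nothing. Eleven capped documents accrue $33.00 and the whole amount is billed, which is the case for checking the RECAP Archive first.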

6. Archival Discipline and Chain of Custody

For independent analysts, archival is both a quality discipline (the source you cite must remain verifiable) and a legal protection (the chain of custody on evidence determines whether it survives a defamation defense or, in higher-stakes contexts, admissibility in accountability proceedings).

The Berkeley Protocol on Digital Open Source Investigations (OHCHR, 2020) is the doctrinal standard. The protocol is written for international criminal and human-rights investigations; the discipline scales down to ordinary independent analytical work without modification. Adopt the discipline at the outset, not retroactively.

6.1 Archive Before Analysis

Every source consulted should be archived at the moment of collection. Retroactive archival recovers the artifact but loses the chain — you cannot prove the artifact looked the same at the moment you formed your assessment as it does at the moment you went back to archive it.

Tool	Use case	Notes
Wayback Machine (“Save Page Now”)	First-line, automatic, broadly compatible	Submit URL via web form or API; free; long-term retention; some pages decline crawling.
archive.today (archive.ph, archive.is)	JavaScript-heavy pages; pages Wayback declines	Manual; reliable; jurisdiction of operation occasionally affects availability.
ArchiveBox	Self-hosted personal archive	Useful for sensitive or large collections; full-fidelity local copies.
HTTrack / wget mirroring	Full-site capture	Heavy-handed but complete; useful for ephemeral sites.
yt-dlp	Video and audio (YouTube, Twitter, TikTok, Telegram, etc.)	Maintained, fast-moving; format support changes; check for updates.
Zotero with attached snapshots	Document library + per-source snapshot	Best for cited-source workflows; supports BibTeX export.
Hunchly	Browser-integrated investigation capture	Paid; designed for investigative work; produces case files.

6.2 Hash Logging

At the moment of capture, compute a cryptographic hash of the archived artifact and log it. SHA-256 is the standard.

The hash establishes that the artifact has not been modified since capture.
The capture timestamp (UTC, with seconds) establishes when you had the evidence.
The combination defends against the most common challenge to OSINT-based assessments: the assertion that the source was different when you cited it.

A minimum collection register row:

timestamp_utc       | source_url                       | archive_url                                  | sha256                                                           | content_class | source_grade
2026-05-15T13:42:09Z| https://example.gov/statement-09 | https://web.archive.org/web/20260515134210/  | 9a3d4e5c8f7b2a1d0e6c5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0e9d8c7b6a5f4e | primary,state | B2

The register should be a single append-only file per case or per investigation, not a scattered set of notes. Plain text, CSV, or a small SQLite table — anything that supports time-stamped append and integrity checking.
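A minimal sketch of the register as an append-only CSV, using the column names from the example row above. The function names and the CSV choice are assumptions; the same pattern applies to a SQLite table.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp_utc", "source_url", "archive_url",
          "sha256", "content_class", "source_grade"]


def sha256_of_file(path: Path) -> str:
    """Hash the archived artifact in chunks so large captures do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def append_capture(register: Path, artifact: Path, source_url: str,
                   archive_url: str, content_class: str, source_grade: str) -> dict:
    """Append one row to the register; existing rows are never rewritten."""
    row = {
        "timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_url": source_url,
        "archive_url": archive_url,
        "sha256": sha256_of_file(artifact),
        "content_class": content_class,
        "source_grade": source_grade,
    }
    is_new = not register.exists()
    with register.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # header is written once, on first use
        writer.writerow(row)
    return row
```

Opening the file in append mode keeps the code from overwriting earlier rows, but it does not stop someone editing the file by hand. Hashing the register itself at intervals, or keeping it under version control, covers that gap.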

6.3 Chain of Custody for Higher-Stakes Work

If there is any possibility that the analysis may become evidence in litigation, court proceedings, accountability mechanisms (IHL/ICC, regional human-rights bodies), or formal investigations, chain-of-custody discipline must be in place from the first collection action — not introduced when the case escalates.

The Berkeley Protocol identifies the core requirements:

Identifiability of the collector — who collected this artifact, on what device, from what network.
Reproducibility of the collection action — what tool, what command, what version.
Integrity of the artifact — hash at collection, hash at handoff, hash at presentation.
Authenticity of the source — corroborated identification of the originating account, URL, or channel.
Continuity of custody — every transfer (between devices, between collaborators, into evidence) documented.

For independent analysts, the practical implication is to default to the discipline even on low-stakes work, because the cost of doing it well is small and the cost of needing it and not having it is large.
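The integrity requirement (hash at collection, at handoff, at presentation) can be sketched as a custody log that refuses any entry whose hash differs from the one recorded at collection. The `CustodyEvent` structure and stage labels are hypothetical; the Berkeley Protocol does not prescribe a data format.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CustodyEvent:
    timestamp_utc: str
    stage: str       # e.g. "collection", "handoff", "presentation"
    custodian: str   # who holds the artifact at this stage
    sha256: str


def record_event(log: list[CustodyEvent], artifact: bytes,
                 stage: str, custodian: str) -> CustodyEvent:
    """Hash the artifact, compare with the first logged hash, and append the event."""
    digest = hashlib.sha256(artifact).hexdigest()
    if log and log[0].sha256 != digest:
        raise ValueError("artifact hash differs from the hash logged at collection")
    event = CustodyEvent(
        timestamp_utc=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        stage=stage,
        custodian=custodian,
        sha256=digest,
    )
    log.append(event)
    return event
```

The first event in the log is the reference hash, and every later transfer is checked against it. Device, network, and tool version would be added as further fields in real use.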

6.4 Source Reliability Pre-Assessment at Capture

The collection register should include a pre-assessment of source grade — the Admiralty Code letter (source reliability) and digit (information credibility), or an equivalent grading scheme. This is a pre-assessment, made at the moment of capture before deep analysis. It is not the final grading; it is the initial flag.

Pre-assessment at capture is a forcing function: it stops the analyst from collecting indiscriminately and treating everything as equal. The detailed grading discipline — including how to grade when the institutional counter-intelligence apparatus is absent — is the subject of Part 04.
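A register can reject malformed grades at entry time. The sketch below validates a two-character Admiralty-style grade such as `B2`. The letter and digit ranges (A to F, 1 to 6) follow the common form of the scheme, but the label wording varies between publications and should be treated as illustrative.

```python
RELIABILITY = {
    "A": "Completely reliable",
    "B": "Usually reliable",
    "C": "Fairly reliable",
    "D": "Not usually reliable",
    "E": "Unreliable",
    "F": "Reliability cannot be judged",
}

CREDIBILITY = {
    "1": "Confirmed by other sources",
    "2": "Probably true",
    "3": "Possibly true",
    "4": "Doubtful",
    "5": "Improbable",
    "6": "Truth cannot be judged",
}


def parse_grade(grade: str) -> tuple[str, str]:
    """Return (source reliability, information credibility) labels for a grade like 'B2'."""
    g = grade.strip().upper()
    if len(g) != 2 or g[0] not in RELIABILITY or g[1] not in CREDIBILITY:
        raise ValueError(f"not a valid grade: {grade!r}")
    return RELIABILITY[g[0]], CREDIBILITY[g[1]]
```

Keeping the two axes separate in the return value reflects the scheme's intent: a usually reliable source can still carry a doubtful claim.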

7. Collection Management and Feed Hygiene

Even at solo scale, collection requires explicit management or it becomes noise. The discipline below is the minimum operational structure for sustained independent practice.

7.1 Source Tiering

Sources should be tiered by the cadence at which they are reviewed, not by their importance in the abstract. A tier-3 source is not less important — it is less time-sensitive.

Tier	Cadence	Source class	Examples
Tier 1	Daily (push/alert)	Real-time; bears directly on standing PIRs	ISW daily, MFA press feeds for principal actors, monitored Telegram channels, threat-feed alerts, breaking-news watchlists.
Tier 2	Weekly	Contextual, important, non-real-time	Think-tank weekly briefs, regulatory filings (weekly sweep), vendor APT reports, peer-analyst output, academic working papers.
Tier 3	Monthly or periodic	Background depth, slow-moving	Dissertations, books, long-form journalism archives, declassified document releases, longitudinal data updates.

The tiering is not static. A source that consistently produces above-tier signal should be promoted; a source that consistently produces below-tier noise should be demoted or removed. The hygiene check should be on the calendar (a quarterly tier review is the minimum cadence).
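Cadence-based tiering reduces to a date comparison. The sketch below lists which sources are due for review on a given day. The roster shape is hypothetical, and the 30-day interval for Tier 3 is an assumption standing in for "monthly or periodic".

```python
from datetime import date, timedelta

# Review interval per tier, in days. Tier 3's 30 days is an assumed value.
CADENCE_DAYS = {1: 1, 2: 7, 3: 30}


def is_due(tier: int, last_reviewed: date, today: date) -> bool:
    """True when the tier's review interval has elapsed since the last review."""
    return today - last_reviewed >= timedelta(days=CADENCE_DAYS[tier])


def due_sources(roster: dict[str, tuple[int, date]], today: date) -> list[str]:
    """Names of sources due for review, sorted alphabetically."""
    return sorted(
        name for name, (tier, last_reviewed) in roster.items()
        if is_due(tier, last_reviewed, today)
    )


roster = {
    "isw_daily": (1, date(2026, 5, 14)),
    "think_tank_brief": (2, date(2026, 5, 12)),
    "dissertations": (3, date(2026, 4, 1)),
}
```

On 15 May 2026 this roster returns `dissertations` and `isw_daily`; the weekly brief was reviewed three days earlier and is not yet due. Promotion and demotion then amount to changing the tier number during the quarterly review.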

7.2 Feed Architecture

RSS readers — the spine of the architecture. Tier 1 and Tier 2 sources should be in an RSS reader where available. Tools: Inoreader (commercial, with API access), Feedly, Miniflux (self-hosted), FreshRSS (self-hosted). Feed-generation bridges such as RSS-Bridge recover feeds from sites that have dropped native RSS.
Telegram clients — for channel-monitoring tier 1. Mute defaults are essential; otherwise the volume destroys other work.
Email digest subscriptions — vendor reports, government advisories, think-tank newsletters. Filter to a dedicated label/folder to prevent inbox erosion.
Push alerts — Google Alerts (crude but free), Talkwalker Alerts (more sophisticated free tier), mention.com, F5Bot for Reddit. Use sparingly; alert noise rapidly exceeds analytical capacity.
Browser dashboards — TweetDeck-class dashboards for X (TweetDeck itself is now X Pro and limited to paying subscribers; degraded but usable); custom HTML dashboards for the determined.
Calendar-driven review — Tier 3 sources should be on a recurring calendar item, not on push.

7.3 Deduplication

Most events are reported by multiple sources. Without deduplication discipline, the analyst will systematically over-count: three reports of the same event from three derivative sources can look like three-source corroboration when in fact they are one source amplified three times.

The discipline:

Establish which source is primary for the event (the originating reporter or the actor’s own statement).
Treat downstream sources as amplification, not corroboration, unless they add independent reporting.
Distinguish independent corroboration (a second source with its own reporting chain that confirms the same event) from concurrent reporting (a second outlet running the same wire copy or citing the same primary).

For most independent practitioners this discipline is the single highest-leverage improvement to source-grading hygiene. It is also the discipline most frequently lost under deadline pressure.
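The three steps above amount to counting reporting chains rather than reports. A minimal sketch, assuming the analyst records for each report which primary it relies on (the `reporting_chain` field and outlet names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    outlet: str
    event_id: str
    reporting_chain: str  # the primary this report relies on; the outlet's own name if original


def independent_chains(reports: list[Report], event_id: str) -> set[str]:
    """Distinct reporting chains behind an event; this is the real corroboration count."""
    return {r.reporting_chain for r in reports if r.event_id == event_id}


def amplifiers(reports: list[Report], event_id: str) -> list[str]:
    """Outlets that repeated someone else's reporting on the event."""
    return [r.outlet for r in reports
            if r.event_id == event_id and r.outlet != r.reporting_chain]


reports = [
    Report("wire_a", "evt-1", "wire_a"),
    Report("outlet_b", "evt-1", "wire_a"),   # runs the wire copy
    Report("outlet_c", "evt-1", "wire_a"),   # cites the same primary
    Report("local_d", "evt-1", "local_d"),   # own reporting chain
]
```

Four reports of `evt-1` collapse to two independent chains (`wire_a` and `local_d`), so the event is two-source corroborated, not four. The judgment about which chain a report belongs to stays with the analyst; the code only stops the count from inflating.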

7.4 Collection Plan as Living Document

The collection plan should be a written, versioned document, not an implicit list. The minimum structure:

Standing PIRs — the questions the collection is meant to answer (see Part 02).
Source roster per PIR — which sources, in which tier, in which language, address each PIR.
Gaps register — collection requirements with no current source coverage; explicit, dated, with status.
Source health log — sources that have degraded (paywall change, deletion, account loss), with replacement candidates.
Review cadence — when the plan itself is re-reviewed (quarterly minimum).

The collection plan is the document that converts the all-OSINT constraint from a limitation into an architecture. Without it, the practitioner is in reactive collection — chasing whatever the feed surfaces — and the analytical product inherits the feed’s bias instead of the analyst’s design.
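The minimum structure can also be held as data, which makes the gaps register something the plan computes rather than something the analyst remembers to update. The sketch below is one possible shape; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    name: str
    tier: int
    language: str
    healthy: bool = True  # set False when the source degrades (paywall, deletion, account loss)


@dataclass
class CollectionPlan:
    version: str
    next_review: str  # ISO date of the next plan review
    roster: dict[str, list[SourceEntry]] = field(default_factory=dict)  # PIR -> sources

    def gaps(self) -> list[str]:
        """PIRs with no healthy source coverage; these belong in the gaps register."""
        return [pir for pir, sources in self.roster.items()
                if not any(s.healthy for s in sources)]

    def degraded(self) -> list[str]:
        """Source names for the health log, in roster order."""
        return [s.name for sources in self.roster.values()
                for s in sources if not s.healthy]


plan = CollectionPlan(
    version="2026.2",
    next_review="2026-08-15",
    roster={
        "PIR-1": [SourceEntry("mfa_press_feed", 1, "ru")],
        "PIR-2": [SourceEntry("regional_outlet", 2, "uk", healthy=False)],
        "PIR-3": [],
    },
)
```

Here `plan.gaps()` returns `PIR-2` and `PIR-3`: one has lost its only source and the other never had one. Storing the plan as a versioned file alongside the collection register keeps the two auditable together.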

Closing Note

The all-OSINT constraint is not a deficiency to be apologized for. It is the defining operational constraint of independent practice, and competent practitioners do not pretend otherwise. The collection architecture above is built to operate inside the constraint, not to wish it away. The next chapter (Part 04) addresses the grading discipline that must be applied to every artifact this architecture surfaces — without the institutional counter-intelligence apparatus that would normally validate it. The chapter after that addresses the analysis that the graded inputs feed. The technology and tooling layer is consolidated in Part 11; this chapter has named tools where source-class examples required them, but the consolidated tool matrix is there.

The discipline scales. The architecture scales. What does not scale is the absence — the inability to ever close the loop with closed-source confirmation. That absence is what makes the next chapter necessary.

Independent Intelligence Analysis — Part 03: Collection for the Open-Domain Practitioner